Indexing the whole site to SOLR | Community
New Participant
September 27, 2021
Solved

Indexing the whole site to SOLR

  • September 27, 2021
  • 2 replies
  • 1108 views

We are currently trying to index our site (40K pages) into a SOLR server using a scheduler job.

I have tried web page scraping with HTMLUnit and Jsoup, but both approaches take 10+ seconds to build the model object that has to be sent to SOLR.

I was able to build the model object with ModelExporter (getting jcr:content as JSON) in under 1s. This works fine for a single page, but when I run it from the scheduler (which iterates over the pages), it takes 2-3s per page.

So indexing the full site takes 24 hours.

 

Does anyone have any idea how to do this optimally, or know of any AEM server-side option that can speed this up?



joerghoh
joerghohAccepted solution
Employee
September 27, 2021

I don't think that there is a faster way to extract this information in a structured way, but you can always run this process in a multi-threaded way. And instead of just thinking of the initial filling of the index, please consider the cases of updates during regular operation.
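As a rough illustration of the multi-threaded approach, here is a minimal sketch in plain Java. The class name `ParallelIndexer`, the method `buildDocuments`, and the `exporter` function are hypothetical placeholders: `exporter` stands in for whatever code turns a page path into the model object (for example the ModelExporter call), and sending the result to SOLR is not shown. In AEM, a `ResourceResolver` should not be shared between threads, so each task would need to open and close its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelIndexer {

    /**
     * Builds one document per page path using a fixed-size thread pool.
     * "exporter" is a hypothetical stand-in for the per-page export logic.
     * Results are returned in the same order as the input paths.
     */
    public static <T> List<T> buildDocuments(List<String> pagePaths,
                                             Function<String, T> exporter,
                                             int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; the pool runs up to "threads" at once.
            List<Future<T>> futures = new ArrayList<>();
            for (String path : pagePaths) {
                Callable<T> task = () -> exporter.apply(path);
                futures.add(pool.submit(task));
            }
            // Collect results; get() blocks until the given task has finished.
            List<T> documents = new ArrayList<>();
            for (Future<T> future : futures) {
                documents.add(future.get());
            }
            return documents;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the per-page work is mostly waiting on I/O, total time should drop roughly in proportion to the number of threads, up to whatever the AEM instance can handle. The same method could be reused for regular updates by passing only the paths of pages that changed, rather than all 40K.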

Kiran_Vedantam
New Participant
September 27, 2021
New Participant
September 27, 2021

Hi @kiran_vedantam, our old approach (before using the model exporter) was based on the links above. It took 8s to get the page data, so we moved to the model exporter.