Developer's Documentation


The SearchPhp plugin uses a Lucene index for website search. The index is built by a PHP crawler together with Zend_Search_Lucene. The website UI consists of a controller and a view script, and can be included in a search result page by calling the frontend search action.

Frontend index

The frontend index is created by the website crawler. The first crawl is forced by a pimcore maintenance run as soon as the frontend is set to enabled in the plugin settings and both start URLs and allowed URLs are configured. The crawler starts as many fetcher jobs as specified in maxThreads and begins crawling the site from the URLs given in start URLs. If pcntl is enabled, the fetcher jobs run simultaneously; otherwise the crawler processes them sequentially.
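The choice between concurrent and sequential execution can be sketched as follows. This is only an illustration in Python, not the plugin's PHP code: the plugin forks processes via pcntl, whereas this sketch uses a thread pool as a stand-in, and the names `run_fetchers`, `fetch_job` and `parallel_available` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fetchers(fetch_job, max_threads, parallel_available):
    """Run max_threads fetcher jobs, concurrently if possible, else one by one.

    fetch_job is a hypothetical callable that takes a job number.
    parallel_available stands in for the "is pcntl enabled?" check.
    """
    job_ids = list(range(max_threads))
    if parallel_available:
        # Stand-in for forked processes: all jobs run at the same time.
        with ThreadPoolExecutor(max_workers=max_threads) as pool:
            return list(pool.map(fetch_job, job_ids))
    # Fallback: the jobs run one after another in the calling process.
    return [fetch_job(job_id) for job_id in job_ids]
```

Both branches return the job results in job order, so the caller does not need to know which mode was used.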

A fetcher job retrieves the content of a URL, parses the page and extracts all links. A link is kept only if it matches at least one of the regexes provided in valid regexes and does not match any of the regexes provided in forbidden regexes. The retrieved links are written to a database table, which provides the start URLs for the next loop and/or other concurrent processes. The fetcher jobs stop when this table remains empty, which means that no more valid links could be found to serve as starting points.

The fetcher jobs also extract the text relevant for search. This can be the entire HTML page converted to text, or only one or more parts of the page. If only parts of a page should be considered for search results, they need to be surrounded by designated HTML comments that tell the crawler which parts to include and which to ignore. A common use of this option is to exclude navigation, header and footer from search results, because they contain the same content on every page of the site. The text relevant for search is written to a second database table, which the frontend controller uses to extract search result summaries.

Finally, a fetcher job prepares the Lucene document for the current page and writes it to a temporary database table. Zend_Search_Lucene_Document_Html is used to convert a page to a Lucene document. In addition, the fetcher attempts to extract the h1 headline from the HTML document and add it to the Lucene document. Fetcher jobs only prepare data for the indexer job; they do not write to the Lucene index themselves.
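The link filtering rule used by the fetcher jobs (match at least one valid regex, match no forbidden regex) can be sketched as follows. This is a minimal Python illustration, not the plugin's PHP implementation; the function name `is_crawlable` and the example patterns are assumptions, and the patterns use Python regex syntax rather than PHP's delimited form.

```python
import re


def is_crawlable(url, valid_patterns, forbidden_patterns):
    """Return True if url matches any valid pattern and no forbidden pattern."""
    matches_valid = any(re.search(p, url) for p in valid_patterns)
    matches_forbidden = any(re.search(p, url) for p in forbidden_patterns)
    return matches_valid and not matches_forbidden


# Hypothetical configuration values, for illustration only.
valid = [r"^https?://www\.example\.com/"]
forbidden = [r"\.pdf$", r"/admin/"]

links = [
    "http://www.example.com/products/chair",
    "http://www.example.com/files/manual.pdf",
    "http://other.example.org/page",
]
# Only links that pass the filter would be written to the link table.
accepted = [link for link in links if is_crawlable(link, valid, forbidden)]
```

Note that an empty valid list rejects every link, which mirrors the requirement that allowed URLs must be configured before the crawler starts.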

Besides the fetcher jobs, there is a single indexer job. It retrieves the data prepared by the fetcher jobs from a third, temporary database table: fetcher jobs write serialized Lucene documents to this table, and the indexer retrieves them, unserializes them and adds them to the Lucene index. If the indexer job finds no data in the table, it waits 3, then 5, and finally 10 seconds for more data to arrive. If none arrives, it terminates. After all (concurrent) jobs have finished, the parent job checks the temporary table once more, in case a fetcher wrote to it after the indexer had already finished. This ensures that nothing is left behind by the concurrent fetching and indexing processes.
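The indexer's wait-and-retry behaviour can be sketched as follows. This is a Python illustration under stated assumptions, not the plugin's PHP code: `fetch_pending` and `add_to_index` are hypothetical callables standing in for the temporary table read and the Lucene index write, and restarting the wait schedule after new data arrives is an assumption that the text above does not confirm.

```python
import time


def run_indexer(fetch_pending, add_to_index, waits=(3, 5, 10), sleep=time.sleep):
    """Index pending documents; on an empty table wait 3, 5, then 10 s, then stop.

    Returns the number of documents added to the index.
    """
    indexed = 0
    attempt = 0
    while True:
        batch = fetch_pending()
        if batch:
            for doc in batch:
                add_to_index(doc)
                indexed += 1
            attempt = 0  # assumption: the wait schedule restarts after data arrives
            continue
        if attempt == len(waits):
            return indexed  # all waits used up without new data: the job ends
        sleep(waits[attempt])
        attempt += 1
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays. The parent job's final check of the temporary table would correspond to calling `fetch_pending` one more time after all jobs have returned.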

Frontend Search Controller

TODO

Frontend Search View

TODO
