
Reply To: Fix search issue in PDFs

#35920
Ernest Marcinko
Keymaster

Q1: The index table engine cannot do exact matching, such as matching the words in order, because the text is extracted from the file contents into a table, where the occurrences are also counted. So the full text is not stored, only each keyword with additional field information. Exact matching on files is not possible, as the information needs to be extracted to the index table first. Searching the file contents directly would be extremely slow.
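
To illustrate why word order is lost, here is a minimal Python sketch of this kind of keyword table. It is not the plugin's actual code or schema; the function name `build_index_rows` and the row layout (document id, keyword, count) are assumptions made up for the example.

```python
from collections import Counter
import re


def build_index_rows(doc_id, text):
    """Reduce a document to (doc_id, keyword, count) rows; word order is discarded."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(doc_id, word, count) for word, count in sorted(Counter(words).items())]


# Two hypothetical documents with the same words in a different order.
rows_a = build_index_rows(1, "calendar of events")
rows_b = build_index_rows(2, "events of calendar")

# Both produce identical keyword/count pairs, so the table alone cannot
# tell which document contains the phrase in the searched order.
same_keywords = [row[1:] for row in rows_a] == [row[1:] for row in rows_b]
print(same_keywords)
```

Once the text is reduced to rows like these, an exact phrase can only be verified by reading the original file again, which is the slow path mentioned above.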

Q2: The search “calendar of events” – The “Events” result is a post type result. Currently, the plugin is configured to return Media files and Post types as results. For each result group a separate query needs to be executed.
Currently, the attachments are displayed first, followed by the post type results. So the first 10 results are the matching attachments, and the 40 post type results come after them (when there are matches): https://i.imgur.com/MHPWCGA.png
You can change the mixed results ordering here, if you want to see the post type results first, then the attachment results after.
“EVENTS” is #15 on the list because the 10 media file results precede it, followed by 4 post type results that rank as more relevant because of occurrences of the “of” keyword (in both titles and contents).
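
The position can be reproduced with a small Python sketch of how grouped results are concatenated. This only models the ordering described above; `merge_groups` and the placeholder titles are hypothetical, and the post type group is shortened to 5 entries for brevity.

```python
def merge_groups(groups, order):
    """Concatenate per-group result lists in the configured group order."""
    merged = []
    for name in order:
        merged.extend(groups[name])
    return merged


groups = {
    # 10 placeholder attachment results
    "attachments": [f"file-{i}" for i in range(1, 11)],
    # 4 placeholder posts ranked above "Events" because they match "of"
    "post_types": [f"post-{i}" for i in range(1, 5)] + ["Events"],
}

attachments_first = merge_groups(groups, ["attachments", "post_types"])
posts_first = merge_groups(groups, ["post_types", "attachments"])

print(attachments_first.index("Events") + 1)  # 15
print(posts_first.index("Events") + 1)        # 5
```

Switching the group order moves “Events” up by the size of the attachment group, but it does not change the ranking inside the post type group.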

Based on your queries, I made the following changes to get the best possible matches from each group. I enabled stop-words on the index table and increased the minimum word length to 3 characters: https://i.imgur.com/Nj9K83f.png
This should improve the matches greatly, as many unrelated common words get filtered out. Around 25 000 common words were removed from the index.
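
As a rough Python sketch of what these two settings do at indexing time (the stop-word list here is a tiny made-up sample, not the plugin's real list, and `index_keywords` is a hypothetical name):

```python
import re

STOP_WORDS = {"of", "the", "and", "a", "in", "to"}  # illustrative sample only
MIN_LENGTH = 3  # minimum character count for a word to be indexed


def index_keywords(text, stop_words=STOP_WORDS, min_length=MIN_LENGTH):
    """Return the words that would be kept for the index after filtering."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_length and w not in stop_words]


print(index_keywords("Calendar of events in the city"))
# ['calendar', 'events', 'city']
```

With “of” no longer indexed, a search for “calendar of events” can only score results on “calendar” and “events”.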

Then, for very strict matches, I recommend using either only the “AND” logic or the “AND with exact keyword matches” logic: https://i.imgur.com/IHoknQp.png
The secondary “OR” logic that is currently active fills up the results with fuzzy matches from any of the keyphrases. I don’t think you need that, as it produces most of the matches you may not want to see.
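
The difference between the two logics can be sketched in Python as follows. This is a simplified whole-word model with made-up documents, not the plugin's query code; the `match` function and its `logic` parameter are assumptions for the example.

```python
def match(query, documents, logic="and"):
    """Return titles whose text contains all ("and") or any ("or") query words."""
    terms = set(query.lower().split())
    combine = all if logic == "and" else any
    return [
        title
        for title, text in documents.items()
        if combine(term in set(text.lower().split()) for term in terms)
    ]


# Hypothetical sample content
documents = {
    "Events": "calendar of upcoming events",
    "Minutes": "minutes of the board meeting",
    "Holidays": "school calendar and holidays",
}

print(match("calendar events", documents, "and"))  # ['Events']
print(match("calendar events", documents, "or"))   # ['Events', 'Holidays']
```

With “AND”, only the result containing every keyword is returned; with “OR”, anything containing a single keyword gets in, which is where the extra results come from.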