Fix search issue in PDFs

This topic contains 12 replies, has 2 voices, and was last updated by Ernest Marcinko 9 months, 3 weeks ago.

Viewing 13 posts - 1 through 13 (of 13 total)
  • #35601
    zonita08
    Participant

    The plugin is not searching inside PDFs, even though I have enabled all the PDF options in the plugin.
    For example:

    Search for “packing sector”.
    On the https://www.staging9.canadaid.ca/traceability/newsletters/ page, there is a link to the “Abattoir Insights” PDF, which contains “packing sector”, but it is not showing up in the search results.

    Also, it doesn’t search in iframes despite that option being turned on. For example: https://www.staging9.canadaid.ca/who-we-are/

    #35622
    Ernest Marcinko
    Keymaster

    Hi,

    Thank you for the details, they helped a lot.

    The issue was that the attachment search was turned off. I turned it on under the General Options -> Attachments search panel: https://i.imgur.com/w4bfxq7.png
    I am seeing the PDF file in the results now.

    The iframe extraction is a highly experimental feature and, as stated, it may not work in all cases. On that page, there is an iframe with a custom flipbook script, embedded with a PDF reader of some sort. I’m afraid there is no way of indexing that, as it is also embedded in the iframe. The feature works well with HTML text content and can extract most of it, but complex data such as embeds cannot be fetched.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35689
    Ernest Marcinko
    Keymaster

    You are right, I think I found the problem.

    By default, the plugin stops the search process as soon as it finds a sufficient number of results, to conserve performance. There seems to be an issue with this when multiple sources are selected, specifically when the attachment and post type sources are used at the same time. I can replicate this on our test servers as well.

    The quickest and most effective solution for now was to increase the number of results a bit, to 40 at a time: https://i.imgur.com/tvEqmtE.png
    That should enlarge the pool of potential results and improve the matches a lot. It is an effective workaround for the issue for now.
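
    To illustrate why a larger limit helps, here is a minimal Python sketch of an early stop across two sources. It is not the plugin's code: the function name, the source order, and the sample data are all made up for the example.

```python
def search_with_early_stop(sources, keyword, limit):
    """Query each source in order and stop as soon as `limit` results are found.

    Hypothetical illustration only; the plugin's real logic is not shown here.
    """
    results = []
    for source_name, documents in sources:
        for title, text in documents:
            if keyword in text.lower():
                results.append((source_name, title))
                if len(results) >= limit:
                    # Early stop: any remaining source is never queried.
                    return results
    return results


# Made-up data: 12 matching posts and 1 matching PDF attachment.
sources = [
    ("post", [(f"Post {i}", "news from the packing sector") for i in range(12)]),
    ("attachment", [("Abattoir Insights.pdf", "the packing sector in review")]),
]

low_limit = search_with_early_stop(sources, "packing sector", limit=10)
high_limit = search_with_early_stop(sources, "packing sector", limit=40)

print(len(low_limit), len(high_limit))
```

    With the limit of 10, the first source alone fills the result list and the PDF is never reached. With the limit of 40, both sources are searched and the PDF is included.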

    I will make sure to resolve this completely in the upcoming release, which should be out within a week.

    Best,
    Ernest Marcinko



    #35850
    Ernest Marcinko
    Keymaster

    Can you please update to 4.21.6? We have addressed an issue related to this. After the update, the plugin should return the results individually from both the post types and the attachments when using the index table engine.

    Best,
    Ernest Marcinko



    #35901
    Ernest Marcinko
    Keymaster

    On my end it is the 3rd result, here: https://i.imgur.com/z5K31Ii.png
    The two preceding results contain the “sector” keyword twice, so they rank as more relevant.

    Best,
    Ernest Marcinko



    #35920
    Ernest Marcinko
    Keymaster

    Q1: The index table engine cannot do exact matching, such as matching the words in order, because the text is extracted from the file contents into a table, where the occurrences are also counted. The full text is therefore not stored, only each keyword with additional field information. Exact matching on files is not possible, as the information needs to be extracted to the index table first. Searching the file contents directly would be extremely slow.
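
    As a rough illustration of why word order is lost, here is a small Python sketch of a keyword index that stores only occurrence counts. It is an assumption-laden simplification and not the plugin's actual table layout: the function names, the scoring, and the sample documents are invented for the example.

```python
import re
from collections import Counter


def build_index(documents):
    """Map each keyword to {document id: occurrence count}.

    Word positions are discarded, so phrase order cannot be checked later.
    """
    index = {}
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for word, count in Counter(words).items():
            index.setdefault(word, {})[doc_id] = count
    return index


def search(index, query):
    """Score each document by the summed occurrence counts of the query words."""
    scores = Counter()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    return scores.most_common()


# Made-up documents: only a.pdf contains the exact phrase "packing sector".
documents = {
    "a.pdf": "The packing sector grew this year.",
    "b.pdf": "Each sector reviewed its packing rules; the sector agreed.",
}

index = build_index(documents)
ranked = search(index, "packing sector")
print(ranked)
```

    In this sketch b.pdf ranks above a.pdf because “sector” occurs twice in it, even though it never contains the two words next to each other. The index has no way to tell the difference.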

    Q2: The search “calendar of events”: the “Events” result is a post type result. Currently, the plugin is configured to return Media files and Post types as results. A separate query needs to be executed for each result group.
    Currently, the attachments are displayed first, followed by the post type results. So the first 10 results are the matching attachments, then come the 40 post type results (where they match): https://i.imgur.com/MHPWCGA.png
    You can change the mixed results ordering here, if you want to see the post type results first and the attachment results after them.
    The “EVENTS” result is #15 on the list because 10 media file results precede it, followed by 4 more relevant matches that contain the “of” keyword (both in titles and contents).
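
    The position arithmetic can be sketched in Python as follows. The group names, the helper function, and all titles except “Events” are hypothetical and only serve to reproduce the 10 + 4 + 1 count described above.

```python
def merge_groups(groups, order):
    """Concatenate result groups in the given order (one query per group)."""
    merged = []
    for group_name in order:
        merged.extend(groups.get(group_name, []))
    return merged


# Made-up results: 10 attachments, then post types already sorted by relevance.
groups = {
    "attachment": [f"Media file {i}" for i in range(1, 11)],
    "post": [
        "Calendar of Events",
        "Map of Offices",
        "Board of Directors",
        "Terms of Use",
        "Events",
    ],
}

attachments_first = merge_groups(groups, ["attachment", "post"])
posts_first = merge_groups(groups, ["post", "attachment"])

position_now = attachments_first.index("Events") + 1
position_swapped = posts_first.index("Events") + 1
print(position_now, position_swapped)
```

    With attachments first, “Events” lands at position 15. With post types first, it moves up to position 5, because only the 4 more relevant post results precede it.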

    Based on your queries, I made the following changes to get the best possible matches from each group. I turned on stop-words for the index table and increased the minimum character count for words to 3: https://i.imgur.com/Nj9K83f.png
    This should improve the matches greatly, as many unrelated common words get filtered out. Around 25,000 common words were removed from the index.
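
    A minimal Python sketch of this kind of filtering is shown below. The stop-word list is a tiny made-up sample and the function name is hypothetical; the plugin's own list and tokenizer will differ.

```python
import re

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"the", "and", "for", "with"}
MIN_LENGTH = 3  # minimum character count for a word to be indexed


def indexable_words(text):
    """Return the words that survive the stop-word and minimum-length filters."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MIN_LENGTH and w not in STOP_WORDS]


kept = indexable_words("Calendar of Events for the year")
print(kept)
```

    Here “of” is dropped by the length filter, and “for” and “the” are dropped as stop-words, so only the meaningful keywords reach the index.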

    Then, for very strict matches, I recommend using either only the “AND” logic or the “AND with exact keyword matches” logic: https://i.imgur.com/IHoknQp.png
    The secondary “OR” logic that is currently activated fills up the results with fuzzy matches for any of the keyphrases. I don’t think you need it, as it produces most of the matches you may not want to see.
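
    The difference between the two logics can be sketched in Python like this. It is a simplified, whole-word illustration with invented titles and a hypothetical function name, not the plugin's matching code.

```python
import re


def matches(text, query, logic):
    """Check a text against a query using whole-word "AND" or "OR" logic."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if logic == "AND":
        return all(term in words for term in terms)
    if logic == "OR":
        return any(term in words for term in terms)
    raise ValueError(f"unknown logic: {logic}")


# Made-up titles for the example.
titles = ["Calendar of Events", "Events", "Calendar 2022", "Contact"]

and_results = [t for t in titles if matches(t, "calendar events", "AND")]
or_results = [t for t in titles if matches(t, "calendar events", "OR")]
print(and_results)
print(or_results)
```

    In this sketch the “AND” logic returns only the title containing both keywords, while the “OR” logic also returns every title that contains just one of them.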

    Best,
    Ernest Marcinko


