Some PDFs not indexed or only partly indexed


This topic contains 8 replies, has 2 voices, and was last updated by lvbhb 1 year, 10 months ago.

  • #37553
    lvbhb
    Participant

    We purchased Ajax Search Pro to make an archive of our club magazine searchable. The archive currently consists of 261 text-searchable PDF documents of around 60 pages each, and it will grow by 4 documents each year.

    A test you performed (see ASP 3 Pre-purchase questions #0001580) was 100% successful in indexing 10 sample documents on our site. After installing Ajax Search Pro, I set it to index only PDFs. The indexing took about 40 minutes, after which it reported Items Indexed: 480 | Items not indexed: 0 | Total keywords: 884690, with no error messages. During the indexing the site performed normally, maybe just a bit slower than usual.

    Problem: a sample of 34 words across 8 documents indicates that most documents have been indexed 100%, but some were not indexed at all and some only up to page 8. See the zipped PDF I uploaded.

    My impression: when a search succeeds for one word at the beginning and one word at the end of a document, it succeeds for any word I try. How can I make the indexing reach 100%? I don’t mind if it needs to run in the background for several days the first time.
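The spot check described above can be sketched in a few lines. This is only an illustration of the heuristic, not part of Ajax Search Pro: `probe_word`, `classify_document`, and the `is_found` callback (which stands for "the search returns this document for this word") are hypothetical names, and the page texts are assumed to have been extracted from the PDFs beforehand.

```python
import re
from typing import Callable, List, Optional


def probe_word(page_text: str, min_len: int = 6) -> Optional[str]:
    """Return the longest word on a page (lowercased), or None if no word is long enough."""
    words = re.findall(r"[^\W\d_]+", page_text.lower())
    candidates = [w for w in words if len(w) >= min_len]
    return max(candidates, key=len) if candidates else None


def classify_document(pages: List[str], is_found: Callable[[str], bool]) -> str:
    """Classify one document from a first-page probe and a last-page probe.

    `is_found` is an assumption: it should report whether searching for the
    word returns this document.
    """
    if not pages:
        return "missing"
    first = probe_word(pages[0])
    last = probe_word(pages[-1])
    first_ok = first is not None and is_found(first)
    last_ok = last is not None and is_found(last)
    if first_ok and last_ok:
        return "probably complete"
    if first_ok:
        return "partial"
    return "missing"


# Made-up page texts and a made-up set of indexed keywords.
pages = ["Clubkampioenschap verslag", "Ledenvergadering notulen"]
indexed = {"clubkampioenschap"}
status = classify_document(pages, lambda word: word in indexed)
```

Two probes per document cannot prove that every page was indexed, but they are enough to flag documents that were cut off partway, such as the ones that stop at page 8.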

    Another question: how can I limit the search to a specific set of PDFs? We currently use the Filebird Pro plugin to create “folders” in our media library, and we only want to search the category ‘Bokkepoten’. If we need something else to categorize these PDFs, that’s fine.

    Test page: https://www.lvbhb.nl/test-searchpro/

    regards,
    Willem Bast

    #37565
    Ernest Marcinko
    Keymaster

    Hi Willem,

    I tried to log in with the details, but I was not able to. Either the back-end username or the password is incorrect. Can you please check that?

    Also, make sure that the Media Service feature is enabled. If it is not, enable it and then re-create the index table. In your pre-sales query you stated that you had already requested a Media Service key, but needed the product for it. We tested the PDFs you sent us through our local installation on our Media Service server.
    I suspect the Media Service feature is probably not enabled.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #37567
    lvbhb
    Participant

    Hi Ernest,
    I enabled the Media Service and sent you the login credentials by email directly.

    regards,
    Willem

    #37576
    Ernest Marcinko
    Keymaster

    Hi Willem,

    Thank you very much for all the details. It seems I managed to resolve the issue.

    There was indeed an indexing failure with some parts of the text. It was caused by an HTML tag stripping mechanism. In general it is not simple to properly strip HTML tags from a given string, and false positive matches can make things even worse. The missing texts were caused by mismatched HTML tags, but I found a way to deal with them by making a few direct modifications to the plugin code. It should be okay now.
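To illustrate how tag stripping can swallow text, here is a minimal Python sketch. It is not the plugin's code (the plugin is written in PHP and its actual fix is not shown in this thread); both functions and the sample string are made up, and a stray `<` is only one plausible form of a mismatched tag.

```python
import re


def strip_tags_naive(text: str) -> str:
    """Remove everything between a '<' and the next '>', whatever it contains."""
    return re.sub(r"<[^>]*>", "", text)


def strip_tags_cautious(text: str) -> str:
    """Remove only spans that look like tags: '<', optional '/', a letter, no nested '<'."""
    return re.sub(r"</?[A-Za-z][^<>]*>", "", text)


# Made-up extracted text with a stray '<' that is not part of any tag.
sample = "Page 8: scores < 10 were dropped. Page 9: the final text <b>matters</b>."

naive = strip_tags_naive(sample)        # loses everything between the stray '<' and '<b>'
cautious = strip_tags_cautious(sample)  # keeps the prose and drops only <b> and </b>
```

With the naive pattern, the stray `<` is treated as the start of a tag, so all text up to the next `>` disappears and never reaches the index. The cautious pattern is still a heuristic and can misjudge unusual input, which matches the point that stripping tags reliably is not simple.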

    If it works all right on your end, I will make sure to include this fix in the next release.

    Best,
    Ernest Marcinko

    #37577
    lvbhb
    Participant

    Thanks, Ernest. That sounds like good news. I’m currently on the road so a thorough test will have to wait till later this week.

    Do I have to re-index or even re-upload the PDFs?

    regards,
    Willem

    #37580
    Ernest Marcinko
    Keymaster

    Definitely, do a re-index once you have finished everything.

    Best,
    Ernest Marcinko

    #37623
    lvbhb
    Participant

    Hi Ernest,

    While rebuilding the index, I’d like to solve my second question at the same time: how can I limit the indexing and search to a specific set of PDFs?

    Do I have to add a custom field with a certain value to the PDFs one by one, or is there a better method?

    regards, Willem

    #37625
    Ernest Marcinko
    Keymaster

    Hi Willem,

    It is not possible to limit the indexing, but it is possible to limit the actual search. You can exclude files by their IDs, and taxonomy term and custom field based filtering also works with media files.
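The three filtering options mentioned above (ID exclusion, taxonomy term, custom field) can be pictured with a small Python sketch. It only models the logic and is not the plugin's API or settings format: `MediaItem`, `filter_results`, the `archive` custom field, and the sample library are hypothetical, with ‘Bokkepoten’ taken from the original question.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class MediaItem:
    """Hypothetical stand-in for one PDF attachment in the media library."""
    id: int
    title: str
    terms: set = field(default_factory=set)            # taxonomy terms assigned to the file
    custom_fields: dict = field(default_factory=dict)  # custom field name -> value


def filter_results(
    items: Iterable[MediaItem],
    exclude_ids: frozenset = frozenset(),
    required_term: Optional[str] = None,
    required_field: Optional[Tuple[str, str]] = None,
) -> List[MediaItem]:
    """Keep only the items that pass every filter that was supplied."""
    kept = []
    for item in items:
        if item.id in exclude_ids:
            continue  # excluded explicitly by ID
        if required_term is not None and required_term not in item.terms:
            continue  # not in the wanted taxonomy term
        if required_field is not None:
            key, value = required_field
            if item.custom_fields.get(key) != value:
                continue  # custom field missing or holding a different value
        kept.append(item)
    return kept


# Made-up library: two magazine issues and one unrelated PDF.
library = [
    MediaItem(1, "Issue 1", {"Bokkepoten"}, {"archive": "yes"}),
    MediaItem(2, "Price list", {"Other"}),
    MediaItem(3, "Issue 2", {"Bokkepoten"}),
]
magazine_only = filter_results(library, required_term="Bokkepoten")
```

A term-based filter has the practical advantage that new issues only need to be assigned to the term once, instead of maintaining a list of IDs or editing a custom field on each file.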

    Best,
    Ernest Marcinko

    #37976
    lvbhb
    Participant

    Thanks, this works fine now. But we still have some trouble with the search results page. Since that is a different subject, I created a new ticket for that.
