Reply To: Integrating Apache Tika's text extraction with ASP's Index

#34904
nickchomey18
Participant

Relevanssi’s solution to this problem is to have their own server that runs Tika and the plugin sends files there and receives the text content back. I suppose you could set something similar up, but some people (like me) would prefer not to share the files for privacy reasons. https://www.relevanssi.com/knowledge-base/indexing-searching-pdfs-wordpress/

SearchWP allows for Tika as well, but only if you have it set up via ssh (with limited instructions on how to do that). https://searchwp.com/documentation/knowledge-base/using-apache-tika-for-document-processing/

I assume there are other plugins that do something similar.

Perhaps you could give people two options within the plugin – 1) send to your server for processing or 2) show instructions (5 ssh commands) on how to install Java, Tesseract and Tika on their own server?

Or, instead of running your own Tika server, you could do as SearchWP does and offer two options: 1) use the current libraries that are included in ASP or 2) use Tika/Tesseract if installed on the server. You could create a documentation article about the differences between the two (supported file types, OCR, etc.), along with some sort of performance benchmark, so that users can better decide whether installing Tika is worthwhile.
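To make option 2 a bit more concrete, here is a minimal sketch of the "use Tika if it is installed, otherwise fall back" check. It is written in Python purely to illustrate the logic (the plugin itself is PHP), and it is not how ASP or SearchWP actually do it. The function names, the "tika"/"builtin" labels and the idea of passing in a jar path are all my own assumptions; the only real pieces are `java -jar` and the Tika app's `--text` flag.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def choose_backend(tika_jar: Optional[str], java_cmd: str = "java") -> str:
    """Return "tika" if both Java and the Tika jar are found, else "builtin".

    Both labels are hypothetical names used only for this illustration.
    """
    java_found = shutil.which(java_cmd) is not None
    jar_found = bool(tika_jar) and Path(tika_jar).is_file()
    return "tika" if (java_found and jar_found) else "builtin"


def extract_with_tika(file_path: str, tika_jar: str, java_cmd: str = "java") -> str:
    """Run the Tika app jar locally and return the plain text it prints."""
    result = subprocess.run(
        [java_cmd, "-jar", tika_jar, "--text", file_path],
        capture_output=True,
        text=True,
        check=True,  # raise if Tika exits with an error
    )
    return result.stdout
```

With something like this, the files never leave the server, and sites without Java simply keep using the bundled libraries.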

Another useful, arguably necessary, option would be to process existing attachments whose post_content field is empty; as it stands, files are only processed on initial upload. However, some people may have manually entered a short description in post_content for some files, so a custom meta field would probably be a better place to store the extracted text. This is what the other plugins do, but they delete that field when you delete the plugin. I think it would be nice to offer the option to keep the field and its data even after the plugin is deleted.

You could also offer the option to fully re-process all files, regardless of whether they have been processed before.
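The selection logic I have in mind for back-filling and re-processing could look roughly like this. Again, this is a Python sketch of the idea rather than real plugin code: the `_extracted_text` meta key, the dictionary layout and the function names are all hypothetical, and in WordPress this would be done with post meta instead.

```python
from typing import Dict, List

META_KEY = "_extracted_text"  # hypothetical meta key, not an actual ASP field


def needs_processing(attachment: Dict, force: bool = False) -> bool:
    """An attachment needs processing if forced or if no text is stored yet."""
    if force:
        return True
    return not attachment.get("meta", {}).get(META_KEY)


def select_attachments(attachments: List[Dict], force: bool = False) -> List[int]:
    """Return the ids of the attachments that should be sent for extraction."""
    return [a["id"] for a in attachments if needs_processing(a, force)]


def store_text(attachment: Dict, text: str) -> None:
    """Save extracted text in the meta field, leaving post_content untouched."""
    attachment.setdefault("meta", {})[META_KEY] = text
```

Because the check looks at the meta field rather than post_content, a file with a hand-written description still gets indexed, and the description is never overwritten.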

I hope this helps! I’m happy to discuss any of it.