Integrating Apache Tika's text extraction with ASP's Index

This topic has 5 replies, 2 voices, and was last updated 4 years, 8 months ago by nickchomey18.

Viewing 6 posts - 1 through 6 (of 6 total)

Author

Posts
September 27, 2021 at 7:16 pm #34883

nickchomey18
Participant

I have installed Apache Tika on my server and am using a function based off of this old plugin code to extract the contents of attachment files and store them in the WordPress database. In addition to supporting dozens of formats, I can only assume that it is also more efficient and effective than the libraries that are currently used by ASP.

It can also be integrated with Tesseract to provide OCR capabilities, which I have successfully implemented.

Anyway, I do seem to have it configured sufficiently such that ASP is returning results based on document content, but I hope that you can help clarify a few things so that I can ensure it is working in the best way.

What I have done is use wp_update_post to update the attachment’s wp_posts’ post_content field with Tika’s extracted text. I then included the desired mime types in ASP’s “Attachment mime types to index” textbox, and turned off all of the toggles for indexing file contents (PDF, text, rtf, office).

Does this seem appropriate such that ASP will include any document content without using ASP’s built-in extraction methods?

Also, would it be better to store the content in a custom postmeta field rather than the post_content field of the attachment’s wp_posts entry? I like being able to see the content in the description field of each media library entry, rather than needing to figure out how to add a metabox for a custom postmeta field. But I’m not sure if this has any undesirable implications for ASP, or WordPress generally.

Any other thoughts?

Also, if you have any interest in officially integrating any of this into ASP, I’d be happy to share more specific details. It is really quite easy – you just need to install Java (and Tesseract, if desired) via SSH, create a folder on the server and download Tika’s .jar file. The plugin/snippet mentioned above then does all of the work.

Thanks!

September 28, 2021 at 12:28 pm #34890

Ernest Marcinko
Keymaster

Hi,

This sounds just perfect to me. Because the content to be indexed is placed in the actual post_content field, any search plugin (or any plugin, for that matter) can use it; it is not limited to Ajax Search Pro.
If you prefer to leave post_content as it is, you could use a custom field as in the original code. I think both solutions are perfectly fine, and I don’t think either causes any issues.

I will look into this; it seems very interesting. Maybe we can figure out a way to offer a one-click download and setup for it via the plugin back-end.

September 28, 2021 at 1:19 pm #34895

nickchomey18
Participant

Glad to hear that what I’ve done makes sense and that you’re interested in it!

I’m very inexperienced with development, so I can’t imagine that I can help you much with making it a one-click config, but it would be great if you are able to figure it out such that the average user doesn’t need to use SSH or ask their host to set it up. Perhaps the relevant libraries can be included directly in the plugin somehow? I suspect not, however.

Or, at the very least, it would be useful if, from the plugin backend, you could set the path to the jar file as well as which file extensions and mime types are to be processed, rather than editing the plugin code directly.

You can also set up Tika as a server, but that was far beyond my capabilities and seemingly unnecessary compared to just running the single jar file upon request.
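For readers wondering what "running the single jar file upon request" amounts to, here is a minimal Python sketch. It is not the plugin's code (which is PHP). The jar location, the `TikaConfig` name, and the mime-type list are assumptions made up for illustration. The only Tika detail it relies on is the `--text` option of the tika-app command-line jar, which prints the extracted plain text to standard output.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TikaConfig:
    # Hypothetical settings; adjust to wherever the jar was downloaded.
    jar_path: str = "/opt/tika/tika-app.jar"
    mime_types: tuple = ("application/pdf", "application/msword")
    timeout_seconds: int = 120


def build_tika_command(config: TikaConfig, file_path: str) -> list:
    """Return the argument list that asks tika-app for plain text."""
    return ["java", "-jar", config.jar_path, "--text", file_path]


def extract_text(config: TikaConfig, file_path: str, mime_type: str):
    """Run Tika once for a single file; return None for skipped mime types."""
    if mime_type not in config.mime_types:
        return None
    if not Path(file_path).is_file():
        raise FileNotFoundError(file_path)
    result = subprocess.run(
        build_tika_command(config, file_path),
        capture_output=True,
        text=True,
        timeout=config.timeout_seconds,
        check=True,  # raise CalledProcessError if java exits non-zero
    )
    return result.stdout.strip()
```

Keeping the jar path and mime types in one settings object mirrors the idea of configuring them from the plugin back-end rather than editing code. Each call starts a new Java process, which is simple but slower per file than a long-running Tika server would likely be.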

Anyway, for anyone even slightly technically inclined, or who has a host who is amenable to installing packages on the server, the SSH work isn’t more than a few commands: installing Java, Tesseract and the Tesseract training data, creating a folder for Tika, and using wget to download Tika. The plugin/snippet then automatically processes any files that are uploaded (so long as the file type is included in its code). This includes files uploaded as part of the BuddyPress/BuddyBoss activity feed and other media collections.

One final thing to note: if using OpenLiteSpeed (and perhaps LiteSpeed Enterprise), running Tika from WordPress creates memory-related errors. After considerable debugging with my host/control panel (RunCloud) as well as some input from LiteSpeed, the solution was to just change a couple of parameters in the LiteSpeed web app config. https://forum.openlitespeed.org/threads/ols-is-preventing-wordpress-from-properly-communicating-with-apache-tika.5008/#post-11724

Please let me know if you have any questions about how I’ve configured it!

September 28, 2021 at 2:25 pm #34904

nickchomey18
Participant

Relevanssi’s solution to this problem is to have their own server that runs Tika and the plugin sends files there and receives the text content back. I suppose you could set something similar up, but some people (like me) would prefer not to share the files for privacy reasons. https://www.relevanssi.com/knowledge-base/indexing-searching-pdfs-wordpress/

SearchWP allows for Tika as well, but only if you have it set up via SSH (with limited instructions on how to do that). https://searchwp.com/documentation/knowledge-base/using-apache-tika-for-document-processing/

I assume there are other plugins that do something similar.

Perhaps you could give people two options within the plugin – 1) send to your server for processing or 2) show instructions (5 ssh commands) on how to install Java, Tesseract and Tika on their own server?

Or, instead of running your own Tika server, you could do as SearchWP does and offer two options: 1) use the libraries currently included in ASP, or 2) use Tika/Tesseract if installed on the server. You could write a documentation article about the differences between the two (file types, OCR, etc.) along with some sort of performance benchmark, so that users can better decide whether installing Tika is worthwhile.
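The "use Tika if installed, otherwise the built-in libraries" choice could be detected automatically. A rough Python sketch of that check follows. The function name and the `"tika"`/`"builtin"` labels are hypothetical, and a real implementation in the plugin would be PHP.

```python
import shutil
from pathlib import Path


def choose_extractor(jar_path: str) -> str:
    """Return "tika" when Java and the jar are both present, else "builtin"."""
    java_available = shutil.which("java") is not None  # is java on PATH?
    jar_available = Path(jar_path).is_file()
    return "tika" if java_available and jar_available else "builtin"
```

This only checks that the pieces exist, not that Tika actually runs, so a test extraction on a small sample file would still be a sensible extra step.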

Another useful/necessary option would be to be able to process any existing attachments that don’t have the post_content field populated – as it stands, this only processes files upon initial upload. But, some people may have manually put a small description in the post_content field for some files, so perhaps a custom metafield would be better. This is what the other plugins do, but they delete that field if you delete the plugin – I think it would be nice to offer the option to persist the field/data even after you delete the plugin.

You could also offer the option to fully re-process all files regardless of whether it has been done before.
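To make the two modes concrete (fill in only what is missing, or re-process everything), here is a small Python sketch of the selection logic. The `Attachment` class and function names are hypothetical stand-ins for rows from wp_posts; it assumes the extracted text lives in post_content, as in my current setup.

```python
from dataclasses import dataclass


@dataclass
class Attachment:
    post_id: int
    mime_type: str
    post_content: str = ""


def needs_processing(att, allowed_mime_types, force=False):
    """True if this attachment should be sent to Tika."""
    if att.mime_type not in allowed_mime_types:
        return False
    # Without force, only attachments with an empty post_content qualify.
    return force or not att.post_content.strip()


def select_for_processing(attachments, allowed_mime_types, force=False):
    """Return the IDs of attachments to (re-)process, in input order."""
    return [
        a.post_id
        for a in attachments
        if needs_processing(a, allowed_mime_types, force)
    ]
```

Note that in the default mode a file with a hand-written description is skipped, because its post_content is not empty. That is exactly the case where a separate custom meta field would be the safer place to store the extracted text.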

I hope this helps! I’m happy to discuss any of it.

September 28, 2021 at 2:51 pm #34905

Ernest Marcinko
Keymaster

Thank you for the information; both solutions are very interesting. We like to keep everything local: the plugin license is not yearly but includes updates forever, so the cost of an external server would probably not be worth it. Maybe we could add this as a yearly service somehow.

Anyways, this is a great addition, it will be added sooner or later. Maybe at first as local with installation instructions, then as a remote service.

September 28, 2021 at 2:54 pm #34906

nickchomey18
Participant

Great, I look forward to seeing what you come up with! Until then, I’ll use what I have set up. It works just fine for my needs. Feel free to close this ticket.