Some issues with Index table search of custom fields


This topic contains 14 replies, has 2 voices, and was last updated by nickchomey18 2 years, 5 months ago.

    #35235
    nickchomey18
    Participant

    My BuddyBoss theme just added some functionality to add a description field to attachment files. It gets stored in the wp_posts post_content field.

    However, I was using post_content to store the document content that was extracted with Apache Tika, as described in this topic: https://wp-dreams.com/forums/topic/integrating-apache-tikas-text-extraction-with-asps-index/

    So, I am now storing it in a postmeta field. After adding it to the Index custom fields section in the Index Table settings page, it matches search results, but only when I have an exact full-word match. Yet, I have the primary logic set to OR and “Show exact matches only” is off, so if I search for a partial word that is in the post_title or post_content, it matches.

    Is this how it is supposed to work – custom fields can’t have partial matches? If so, is it possible to change that?

    Also, more generally, when I turn on the “Index table engine” in a search instance, the custom fields selection box for that instance disappears. Is there another way to select which indexed custom fields are available in a given search instance? One of the asp_query_args perhaps?

    Finally, is it possible to have the live search results display and highlight the matched text in the custom field? Right now it is only showing the post_content.

    Thanks!

    #35257
    Ernest Marcinko
    Keymaster

    Hi,

    All keywords go into the same table, including those from custom fields. One thing to note about partial matches: with the index table engine, mid-word matches are not possible (indexed queries do not allow that in SQL). So if these items are in the index table, and the index table is indeed being used, then they will match – regardless of the source. This is more likely a configuration-related issue.
    May I ask roughly how many rows the index table has?
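
    To put the mid-word limitation in SQL terms, here is a simplified sketch (the table name is made up – this is not our actual query):

    // Simplified sketch with a made-up table name – not ASP's actual query.
    global $wpdb;

    // Prefix match: MySQL can use the index on the keyword column – fast.
    $ids = $wpdb->get_col( $wpdb->prepare(
        "SELECT doc_id FROM example_index WHERE keyword LIKE %s",
        $wpdb->esc_like( 'concep' ) . '%'
    ) );

    // Mid-word match: the leading wildcard prevents index use and forces
    // a full table scan, which is why the engine does not support it.
    $ids = $wpdb->get_col( $wpdb->prepare(
        "SELECT doc_id FROM example_index WHERE keyword LIKE %s",
        '%' . $wpdb->esc_like( 'oncep' ) . '%'
    ) );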

    When the index table is enabled, custom field selection per search instance is not possible – that is on purpose. The index table is a very simple concept; it has to be, to store the least amount of data possible.
    After indexing, there is no way to tell which custom field a keyword originated from, so this can only be configured on the index table panel.

    The highlighter should work on any field in the live results list. It does not run on the server side – it is executed via JavaScript after the search query is processed.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35261
    nickchomey18
    Participant

    Thank you. Yes, I’m aware it doesn’t do mid-word matches. This is not working even for strings that start a word. There are maybe 5,000 rows in the table right now – very small. (By the way, how many can it reasonably handle? Or is it simply limited by the server’s processing power?)

    Any suggestions on what configuration I could check? Again, it returns partial matches (start or end of a word) when the match is in the post title or content, but not in an associated custom field.

    Thanks for the explanation about custom field selection. I understand.

    Sorry, re: the highlighting – I’m wondering how to display the custom field content excerpt in the live search results. Is it a matter of adding code to the template files?
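
    For example, would something like this work? (An untested sketch on my part – I’m assuming the asp_results filter and its result object fields based on the documentation, and document_content is just my hypothetical meta key.)

    // Untested sketch – assumes the asp_results filter exists and that results
    // are objects exposing ->id and ->content; 'document_content' is my
    // hypothetical meta key holding the Tika-extracted text.
    add_filter( 'asp_results', function( $results ) {
        foreach ( $results as $r ) {
            $doc = get_post_meta( $r->id, 'document_content', true );
            if ( $doc !== '' ) {
                // Append a short excerpt so matched custom field text is visible.
                $r->content .= ' ' . wp_trim_words( $doc, 30 );
            }
        }
        return $results;
    } );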

    #35270
    Ernest Marcinko
    Keymaster

    Okay, so the size is definitely not the issue. Have you re-created the index table after selecting the custom fields? Although that would not explain why there are only full-word matches.

    If you create a completely new search bar for testing, select only the desired post types and switch to the index table engine, does the problem persist? If not, then there is a configuration issue somewhere.
    Maybe a generic filter is enabled for exact matches – make sure nothing is selected: https://i.imgur.com/JcjI0xh.png
    The filters take effect even when they are not visible.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35290
    nickchomey18
    Participant

    I have investigated a bit more, and it does seem to be doing partial matching for custom fields. But the problem is actually much more mysterious – only certain words fail to match partially, and it doesn’t matter whether the content is stored in wp_posts.post_content or wp_postmeta.meta_value. Nor does it matter (nor should it) which attachment it is associated with. I really can’t find a pattern.

    Are there certain characters or suffixes that are excluded or treated differently?

    #35302
    Ernest Marcinko
    Keymaster

    In that case, can you please check whether you have any keyword exclusions set here. Also check whether stop words are disabled on the index table settings (although I don’t think that is related).

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35320
    nickchomey18
    Participant

    Thanks, but there aren’t any keyword exclusions. I have stop words enabled, but I assume that only excludes them from the index table to begin with, and I’ve confirmed that the words I’m searching for are in the table (searching for the full word returns a result, after all). Also, the words I’m having trouble with are not stop words.

    I have a suspicion that the issue has something to do with the fact that there are similar, near-duplicate entries – one of the words in question is “concepts”. As seen in the attached screenshot, there are entries for both “concept” and “concepts” for a few different posts. I have no trouble with the posts that have the “concept” term; it is just the two (1179, 1180) that only have “concepts” that won’t return partial matches. But I’ve looked at various other situations and the results are inconsistent.

    Anyway, because of this issue, as well as some others I’ve noticed (e.g. a lot of junk terms in the table containing punctuation), I started looking into how you create tokens. I made some modifications that reduced the junk terms (e.g. $str = str_replace( array( "-\n", "-\r" ), "", $str ); to join a hyphenated word back together when a PDF breaks it at the end of a line).

    It should be noted that Relevanssi does a CONSIDERABLY better job of tokenizing content than ASP. Their free version is open source, so perhaps you could look into incorporating some of their code into ASP. I’m not sure how open-source licensing affects a commercial product…

    But I also realized there are plenty of excellent open-source tools dedicated to this specific task (and much more), so why try to reinvent (a very complex) wheel? So, I am currently looking into integrating a Python-based NLP package to generate tokens, stems or even lemmas, as well as to store other data that I might make use of in other ways. That would eliminate the “concept/s” issue, reduce the table size considerably, and surely improve overall performance.

    As with Tika support, I suspect it would be difficult to include something like this in the ASP plugin, as it requires server-level packages. But it is another thing for you to consider.

    So, given all of this, perhaps it is best to put this issue on pause until I’ve made these changes. I’ll report back then.

    #35342
    Ernest Marcinko
    Keymaster

    Having multiple instances of a keyword in the index table is perfectly normal, as the unique key is not the post ID, but a combination of multiple columns, including the ID and the primary term. If you don’t mind, I won’t go into the details of why or how exactly; the main reason is that this is how indexed queries work most efficiently. Relevanssi uses a very similar approach in their index table.
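
    Very roughly, the idea is something like this (an illustrative schema only – not our actual table):

    // Illustrative schema only – not ASP's actual table. Because the unique
    // key spans multiple columns, the same keyword legitimately appears once
    // per (keyword, post) combination – which is what you saw with "concepts".
    global $wpdb;
    $wpdb->query(
        "CREATE TABLE {$wpdb->prefix}example_index (
            keyword     VARCHAR(50)       NOT NULL,
            post_id     BIGINT UNSIGNED   NOT NULL,
            occurrences SMALLINT UNSIGNED NOT NULL DEFAULT 1,
            UNIQUE KEY keyword_post (keyword, post_id),
            KEY keyword (keyword)
        )"
    );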

    Local PHP PDF indexing tools are of course much less efficient and cannot extract as much information. Making modifications to the tokenize process on your end may fix problems for you, but could cause issues for dozens of other customers – especially regarding PDF contents. Trust me, we have looked at more than the Relevanssi code to make that function as good as possible. Lots of tests and lots of trial & error go into that method – and of course it may never be perfect, but we need to consider compatibility with all kinds of different configurations.
    Of course, if you have any specific suggestions, I am more than happy to take note and incorporate them. If they pass the tests, I will surely add them to the final release.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35359
    nickchomey18
    Participant

    Yes, I more or less understand how your index works, and it is extremely similar to Relevanssi’s. I have no problem with the methodology. It’s just a mystery that certain words are not working for partial matching. I’ll reassess when I’ve made the changes that I am looking to incorporate.

    My comments on tokenization were just an aside to explain what I’m doing, but here are a few modifications that I made before starting to explore an NLP package. They made a major difference in reducing junk entries – many thousands of entries full of punctuation were removed (ones which mostly do not exist in the Relevanssi index).

    // Replace end-of-line hyphenation with nothing, to create a full word. Needs to be before the line for removing line breaks.
    $str = str_replace( array( "-\n", "-\r" ), "", $str );

    // Replace hyphens, and em and en dashes, with a space.
    $str = str_replace( array( "—", "–", "-" ), " ", $str );

    These were added to the array of various symbols to replace:

    ": ",
    "__",
    // quote replacement
    "’",
    "‘",
    "”",
    "“",
    "„",
    "´",
    "″",

    I particularly like Relevanssi’s user-customizability options for which things to replace and with what (see attached screenshot). And their TAIKASANA “magic word” method for preserving decimals, ampersands and hyphens/dashes while removing all other punctuation with

    $a = preg_replace( '/[[:punct:]]+/u', apply_filters( 'relevanssi_default_punctuation_replacement', ' ' ), $a );

    is quite clever.

    Anyway, this is all beside the point of this ticket. If you test and incorporate these changes, I would be very surprised if they created any negative side effects for anyone, especially if combined with user-selectable options in the backend. I hope this has been helpful.

    I will update the ticket when I have made my changes with the NLP tool.

    #35363
    Ernest Marcinko
    Keymaster

    Thank you for your suggestions! All of these seem to be present in our tokenization method already, but I will double-check later.

    We avoided these settings on purpose; instead, the tokenize process looks for certain special characters (like hyphens, dashes, quotes etc.) and indexes keywords both with and without them. This was mostly implemented over the years based on customer reports. You have probably seen some of this as junk data in the database, but it is actually not junk.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35366
    nickchomey18
    Participant

    When I say junk, I mean thousands of strings that contain punctuation. It was something like 2,000 out of 10,000 rows that were junk – none of which were in the Relevanssi index table.

    Colons (“: ”) are not removed from the ends of words without the modifications I shared above, and the same goes for dashes and hyphens. There are also a lot of entries with [brackets]/(parentheses) that I haven’t yet dealt with – those will never be matched by the search, given that it only matches the start or end of a word.

    Again, Relevanssi’s TAIKASANA method handles all of this perfectly, so I strongly recommend you check out their code – it is much more effective. I’ve included it here:

    function relevanssi_remove_punct( $a ) {
    	if ( ! is_string( $a ) ) {
    		// In case something sends a non-string here.
    		return '';
    	}
    
    	$a = preg_replace( '/<(\d|\s)/', '\1', $a );
    	$a = html_entity_decode( $a, ENT_QUOTES );
    	$a = relevanssi_strip_all_tags( $a );
    
    	$punct_options = get_option( 'relevanssi_punctuation' );
    
    	$hyphen_replacement = ' ';
    	$endash_replacement = ' ';
    	$emdash_replacement = ' ';
    	if ( isset( $punct_options['hyphens'] ) && 'remove' === $punct_options['hyphens'] ) {
    		$hyphen_replacement = '';
    		$endash_replacement = '';
    		$emdash_replacement = '';
    	}
    	if ( isset( $punct_options['hyphens'] ) && 'keep' === $punct_options['hyphens'] ) {
    		$hyphen_replacement = 'HYPHENTAIKASANA';
    		$endash_replacement = 'ENDASHTAIKASANA';
    		$emdash_replacement = 'EMDASHTAIKASANA';
    	}
    
    	$quote_replacement = ' ';
    	if ( isset( $punct_options['quotes'] ) && 'remove' === $punct_options['quotes'] ) {
    		$quote_replacement = '';
    	}
    
    	$ampersand_replacement = ' ';
    	if ( isset( $punct_options['ampersands'] ) && 'remove' === $punct_options['ampersands'] ) {
    		$ampersand_replacement = '';
    	}
    	if ( isset( $punct_options['ampersands'] ) && 'keep' === $punct_options['ampersands'] ) {
    		$ampersand_replacement = 'AMPERSANDTAIKASANA';
    	}
    
    	$decimal_replacement = ' ';
    	if ( isset( $punct_options['decimals'] ) && 'remove' === $punct_options['decimals'] ) {
    		$decimal_replacement = '';
    	}
    	if ( isset( $punct_options['decimals'] ) && 'keep' === $punct_options['decimals'] ) {
    		$decimal_replacement = 'DESIMAALITAIKASANA';
    	}
    
    	// Default replacement for '@' – needed by the array below (the option
    	// block defining it is not shown in this snippet).
    	$at_replacement = ' ';
    
    	$replacement_array = array(
    		'ß'                     => 'ss',
    		'ı'                     => 'i',
    		'₂'                     => '2',
    		'·'                     => '',
    		'…'                     => '',
    		'€'                     => '',
    		'®'                     => '',
    		'©'                     => '',
    		'™'                     => '',
    		'&shy;'                 => '',
    		"\xC2\xAD"              => '',
    		'&nbsp;'                => ' ',
    		chr( 194 ) . chr( 160 ) => ' ',
    		'×'                     => ' ',
    		'&#8217;'               => $quote_replacement,
    		"'"                     => $quote_replacement,
    		'’'                     => $quote_replacement,
    		'‘'                     => $quote_replacement,
    		'”'                     => $quote_replacement,
    		'“'                     => $quote_replacement,
    		'„'                     => $quote_replacement,
    		'´'                     => $quote_replacement,
    		'″'                     => $quote_replacement,
    		//'-'                     => $hyphen_replacement,
    		'–'                     => $endash_replacement,
    		'—'                     => $emdash_replacement,
    		'&#038;'                => $ampersand_replacement,
    		'&amp;'                 => $ampersand_replacement,
    		'&'                     => $ampersand_replacement,
    		'@'                     => $at_replacement,
    	);
    
    	/**
    	 * Filters the punctuation replacement array.
    	 *
    	 * This filter can be used to alter the way some of the most common punctuation
    	 * is handled by Relevanssi.
    	 *
    	 * @param array $replacement_array The array of punctuation and the replacements.
    	 */
    	$replacement_array = apply_filters( 'relevanssi_punctuation_filter', $replacement_array );
    
    	$a = preg_replace( '/\.(\d)/', $decimal_replacement . '\1', $a );
    
    	// Replace end-of-line hyphenation with nothing, to create a full word
    	$a = str_replace( array( "-\n", "-\r"), "", $a );
    	
    	$a = str_replace( "\r", ' ', $a );
    	$a = str_replace( "\n", ' ', $a );
    	$a = str_replace( "\t", ' ', $a );
    
    	$a = stripslashes( $a );
    
    	$a = str_replace( array_keys( $replacement_array ), array_values( $replacement_array ), $a );
    	/**
    	 * Filters the default punctuation replacement value.
    	 *
    	 * By default Relevanssi replaces unspecified punctuation with spaces. This
    	 * filter can be used to change that behaviour.
    	 *
    	 * @param string $replacement The replacement value, default ' '.
    	 */
    	$a = preg_replace( '/[[:punct:]]+/u', apply_filters( 'relevanssi_default_punctuation_replacement', ' ' ), $a );
    	$a = preg_replace( '/[[:space:]]+/', ' ', $a );
    
    	$a = str_replace( 'AMPERSANDTAIKASANA', '&', $a );
    	$a = str_replace( 'HYPHENTAIKASANA', '-', $a );
    	$a = str_replace( 'ENDASHTAIKASANA', '–', $a );
    	$a = str_replace( 'EMDASHTAIKASANA', '—', $a );
    	$a = str_replace( 'DESIMAALITAIKASANA', '.', $a );
    
    	$a = trim( $a );
    
    	return $a;
    }

    But, just to clarify, I’m much more impressed with ASP than Relevanssi (which is why I’m here and using it). It’s just a few things like this that could use improvement. But, again, I will be incorporating an NLP package to augment the power considerably with lemmas, since lemmas > stems > tokens > nothing.

    #35383
    Ernest Marcinko
    Keymaster

    Thank you!

    I understand your reasoning, and thanks for sharing the code as well – I have seen this one many times before. I know their method works perfectly for your case, but there are many other cases where this is not desired at all. Just an example:
    A product description includes the string “AB:123-44_42;567”. When entering this string exactly into the input, the user optimally wants this result first – likewise when entering it without the punctuation (“AB1234442567”), and when entering it with spaces (“AB 123 44 42 567”).
    This seems straightforward, but it actually brings up many issues – by removing the punctuation and other special characters from both the input and the index, many undesired results may appear when entering “AB:123-44_42;567”. Basically, running this:

    $a = preg_replace( '/[[:punct:]]+/u', apply_filters( 'relevanssi_default_punctuation_replacement', ' ' ), $a );

    …will ruin the original information, so we need to process it first – and that creates the “junk”.
    Of course there is still some real “junk” we need to address, like punctuation and other special characters at the ends of words where possible – that part is not working the way it should.
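
    The direction we take instead looks roughly like this (a sketch of the idea only – not our actual code): keep the raw token and index normalized variants alongside it, so all three input forms above can still match.

    // Sketch of the idea only – not the actual ASP implementation.
    function example_token_variants( $token ) {
        $variants   = array( $token );                                // "AB:123-44_42;567"
        $variants[] = preg_replace( '/[[:punct:]]+/u', '', $token );  // "AB1234442567"
        $variants[] = preg_replace( '/[[:punct:]]+/u', ' ', $token ); // "AB 123 44 42 567" (split further by the tokenizer)
        return array_unique( array_filter( array_map( 'trim', $variants ) ) );
    }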

    This is just a very basic example; there are many other, more complex cases. These were all reported to us by many, many customers, so we had to make a working solution that is transparent to the users.
    Of course, indexing “junk” data is not ideal, but the size of the index table has almost never been reported as an issue. We constantly run test servers with millions of posts, where the index table reaches 10GB, and a basic 1-core, 512MB RAM server can handle a query in a fraction of a second. Besides, the overhead of the “junk” is extremely variable: in some cases there is almost none, in others there is a lot – it very much depends on the original data.

    All I wanted to say with this is that we decided to sacrifice some database space for much better search accuracy. And if users start reporting issues with the index table, we will slowly start making changes and gradually add options to turn these features off.

    I would love to keep the method the way it is now, but do you think that adding some options to control the punctuation (and other characters) would be helpful? Would that work for you?

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35385
    nickchomey18
    Participant

    I understand completely – it is very hard to create a tool that works for all use cases, so you’ve chosen certain ones over others. I do think options for choosing how to handle each type of punctuation would be useful, and Relevanssi’s code seems like a good starting point. Someone who wants to index “AB:123-44_42;567” could do so, and I could remove all of that. One of the many reasons I chose ASP is its enormous customizability, and this would be another example of it.

    However, the relevant question for you is “would many other ASP users sufficiently appreciate this to make it worth your time?” I don’t know.

    But, don’t take me into consideration for this decision. Regardless of how you proceed, I plan to replace ASP’s tokenization with a Python NLP package (spaCy). As a world-class NLP tool, it presumably does an excellent job with tokenization, but more importantly, it produces lemmas, which should improve my table size and search performance considerably. It also generates all sorts of other interesting and useful data – parts of speech, named entities, and more. Moreover, I’m also just curious to tinker with it all.
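
    My rough plan for wiring it in looks something like this (an untested sketch – the /lemmas endpoint is a hypothetical little HTTP wrapper I’d write around spaCy, since spaCy itself does not ship a web API):

    // Untested sketch – assumes a hypothetical local HTTP wrapper around spaCy.
    function example_get_lemmas( $text ) {
        $response = wp_remote_post( 'http://127.0.0.1:5000/lemmas', array(
            'headers' => array( 'Content-Type' => 'application/json' ),
            'body'    => wp_json_encode( array( 'text' => $text ) ),
            'timeout' => 30,
        ) );
        if ( is_wp_error( $response ) ) {
            return array(); // Fall back to the plugin's own tokenizer on failure.
        }
        $data = json_decode( wp_remote_retrieve_body( $response ), true );
        return isset( $data['lemmas'] ) ? $data['lemmas'] : array();
    }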

    Unlike with Apache Tika, I really doubt this is something that would ever be worth you working to integrate with ASP, but I figured I’d mention it.

    The extensive hooks in ASP make all of this relatively easy to do, and I’m sure you won’t mind adding some others should they be necessary.

    #35391
    Ernest Marcinko
    Keymaster

    Thank you very much for all the information, it was extremely helpful 🙂

    Currently it looks like we may open a small shop section for additional optional features – like an external Apache Tika processing server – for users needing that feature, for a reasonable monthly/yearly fee. The plugin has too many features already, and I feel like adding more and more might make things worse. Some sort of add-on package/system could probably resolve that too. I am not much in favor of this, but there is already so much overhead in the code.
    Bundling Apache Tika with the plugin and executing it locally is not possible, as I suspected: almost all hosts have execution restrictions on the web folders, and PHP has no (or restricted) access to the server bin directories. We will, however, add a feature similar to Relevanssi’s, letting users point the plugin to their own Tika server instead of paying for one.
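
    On the plugin side, that would look roughly like this (a simplified sketch, not the final implementation – Tika’s REST server accepts the raw file via PUT /tika and returns plain text when asked for it):

    // Simplified sketch – not the final implementation. Assumes a Tika
    // server running on its default port (9998).
    function example_tika_extract( $file_path ) {
        $response = wp_remote_request( 'http://127.0.0.1:9998/tika', array(
            'method'  => 'PUT',
            'headers' => array( 'Accept' => 'text/plain' ),
            'body'    => file_get_contents( $file_path ),
            'timeout' => 60,
        ) );
        if ( is_wp_error( $response ) ) {
            return '';
        }
        return wp_remote_retrieve_body( $response );
    }
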
    I am also currently reviewing the tokenization process, adding a few more tests with all sorts of gibberish to see how the process handles it. Just yesterday I added at least 50 more special characters to handle, as well as a process to detect words starting or ending with garbage characters.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35401
    nickchomey18
    Participant

    Great to hear! It definitely must be difficult to balance having a clean, easy-to-use interface with lots of powerful features.

    I’m happy to do my own custom work where it falls outside what is reasonable for you to include for the average user – especially for things that need packages installed on the server, like Tika and spaCy. But it’s good that you’ll be adding support for Tika, since it is a pretty “core” feature that various competitors support. Adding it through an add-on system seems appropriate, to reduce bloat in the core plugin for those who don’t need the features – it is what many other plugins do (SearchWP is a good example). The add-on system should also open the door to further integrations/extensions by other plugins/users.

    I’m glad that I could be helpful and I look forward to seeing what you come up with!

