Ignoring diacritical marks in non-Latin scripts during index table generation


This topic contains 5 replies, has 2 voices, and was last updated by Ernest Marcinko 6 months ago.

  • #35663
    aharonium
    Participant

    We maintain a multilingual website and use Ajax Search Pro to index texts on the site in many different languages and language scripts, mainly Latin and Hebrew, but also Greek, Arabic, Cyrillic, etc. Some of these languages employ diacritical marks for vocalization and cantillation. However, we don’t want these marks included in the index generation process. (Users will likely not be using the diacritical marks in their queries.)

    I did see your answer to an earlier question, directing a user to the index table settings as a means of ignoring punctuation. With this in mind, I added a comma-separated list of the diacritical marks that Ajax Search Pro should ignore to the stop-word list in the advanced index table options and generated a new index. So, for example, we would like the Hebrew word מַפְתֵּחַ (‘index’) to be indexed without any diacritical vocalization marks, as מפתח. Alternatively, we would like Ajax Search Pro to return results for queries while ignoring any indexed diacritical marks. (This alternative may be preferable but technically more difficult; we don’t know.)

    Unfortunately, we aren’t seeing any real change in how Ajax Search Pro indexes the site when we turn the stop-word setting on. Is there something we should add to our functions.php so that Ajax Search Pro indexes without diacritical marks or, for that matter, without any unneeded or unwanted Unicode characters? Additionally, it would be good to know how we can employ a replacement table for common ligatures, so that any ligatures found during indexing are replaced by the combination of letters they correspond to.

    Please let us know — and thank you.

    I’m attaching a UTF-8 formatted plaintext file with the diacritical marks we’d like to ignore in the index table generation. It also includes a replacement table for common ligatures.

    #35686
    Ernest Marcinko
    Keymaster

    Hi,

    Thank you very much for the details, and the list. I am sorry for the late response.

    Normally, the database engine is responsible for ignoring vocalization and accents when matching. This issue turned out to be much more interesting than I thought. Initially I wrote a script to handle the ligatures and such, but when the data was inserted into the database, only half of it was stored: either the words with the “punctuation” and accent marks, or the ones without them, whichever came first.
    At first I thought that the database simply did not differentiate between the original and the unvocalized versions. That turned out not to be true, yet it still inserted only one version and treated both as the same word.
    Interestingly, when searching, the database did not treat them as the same word.

    Long story short, I figured out why: it was related to specific indexes and how the database treats them. So the only possible solution is to remove all vocalization marks and store the keywords that way. Then, when the user does a search, do the same to the input keyword, and voilà, everything matches as it should (in most cases).
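To illustrate the idea described above, here is a minimal standalone sketch (not the plugin's own code) that applies the same stripping to an indexed keyword and to a search input. The function name `strip_marks_sketch` is hypothetical; the Hebrew word is the מַפְתֵּחַ example from the original question, written with code point escapes (PHP 7 or later) so the marks are unambiguous.

```php
<?php
// Hypothetical helper for illustration only; not part of Ajax Search Pro.
function strip_marks_sketch(string $text): string {
    // \p{Mn} matches non-spacing combining marks, which includes
    // Hebrew vowel points and cantillation marks.
    $stripped = preg_replace('/\p{Mn}/u', '', $text);
    // preg_replace() returns null on invalid UTF-8; keep the input then.
    return trim($stripped ?? $text);
}

// Vocalized form: mem, patah, pe, sheva, tav, dagesh, tsere, het, patah.
$vocalized = "\u{05DE}\u{05B7}\u{05E4}\u{05B0}\u{05EA}\u{05BC}\u{05B5}\u{05D7}\u{05B7}";
// Plain form as a user would likely type it: mem, pe, tav, het.
$plain = "\u{05DE}\u{05E4}\u{05EA}\u{05D7}";

// Both the stored keyword and the query go through the same function,
// so they end up byte-for-byte identical.
var_dump(strip_marks_sketch($vocalized) === strip_marks_sketch($plain)); // bool(true)
```

Because both sides are reduced to the same form, the comparison no longer depends on how the database collation treats the marks.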

    Try adding this code to the functions.php file in your theme/child theme directory – make sure to have a full server back-up first for safety. For more details you can check the safe coding guidelines.

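    // Runs on the keyword list collected for the index table.
    add_filter('asp_indexing_keywords', 'diacritic_asp_indexing_keywords', 10, 1);
    function diacritic_asp_indexing_keywords($keywords) {
    	$new_kw_arr = array();
    	foreach ( $keywords as $keyword => $arr ) {
    		// Reduce each keyword to its unvocalized form.
    		$new_kw = hebrew_unvocalize($keyword);
    		if ( $new_kw !== '' ) {
    			if ( !isset($new_kw_arr[$new_kw]) ) {
    				// First time this unvocalized form is seen.
    				$new_kw_arr[$new_kw] = array($new_kw, 1);
    			} else {
    				// Vocalized and unvocalized variants collapse into one entry.
    				$new_kw_arr[$new_kw][1]++;
    			}
    		}
    	}
    	return $new_kw_arr;
    }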
    
    add_filter('asp_keyword_after_postproc', 'hebrew_unvocalize', 10, 1);
    function hebrew_unvocalize( $str ) {
    	$hebrew_common_ligatures = array(
    		'ײַ' => 'ײ',
    		'ﬠ' => 'ע',
    		'ﬡ' => 'א',
    		'ﬢ' => 'ד',
    		'ﬣ' => 'ה',
    		'ﬤ' => 'כ',
    		'ﬥ' => 'ל',
    		'ﬦ' => 'ם',
    		'ﬧ' => 'ר',
    		'ﬨ' => 'ת',
    		'שׁ' => 'ש',
    		'שׂ' => 'ש',
    		'שּׁ' => 'ש',
    		'שּׂ' => 'ש',
    		'אַ' => 'א',
    		'אָ' => 'א',
    		'אּ' => 'א',
    		'בּ' => 'ב',
    		'גּ' => 'ג',
    		'דּ' => 'ד',
    		'הּ' => 'ה',
    		'וּ' => 'ו',
    		'זּ' => 'ז',
    		'טּ' => 'ט',
    		'יּ' => 'י',
    		'ךּ' => 'ך',
    		'כּ' => 'כ',
    		'לּ' => 'ל',
    		'מּ' => 'מ',
    		'נּ' => 'נ',
    		'סּ' => 'ס',
    		'ףּ' => 'ף',
    		'פּ' => 'פ',
    		'צּ' => 'צ',
    		'קּ' => 'ק',
    		'רּ' => 'ר',
    		'שּ' => 'ש',
    		'תּ' => 'ת',
    		'וֹ' => 'ו',
    		'בֿ' => 'ב',
    		'כֿ' => 'כ',
    		'פֿ' => 'פ',
    		'ﭏ' => 'אל'
    	);
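    	// Strip combining marks (Unicode category Mn): vowel points, cantillation, etc.
    	$stripped = preg_replace('/\p{Mn}/u', '', $str);
    	// preg_replace() returns null on invalid UTF-8; fall back to the input in that case.
    	$new_kw = trim( $stripped === null ? $str : $stripped );
    	// Map presentation forms and ligatures to their base letters.
    	foreach( $hebrew_common_ligatures as $word1 => $word2 ) {
    		$new_kw = str_replace( $word1, $word2, $new_kw );
    	}
    	return trim($new_kw);
    }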
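The table above is specific to Hebrew. For the other scripts mentioned in the question (Latin, Greek, Arabic, Cyrillic), a more general route might be Unicode compatibility decomposition (NFKD) followed by the same mark stripping. The sketch below assumes PHP 7 or later with the `intl` extension, which provides the `Normalizer` class; the function name `fold_marks_and_ligatures_sketch` is hypothetical, and this is not something the plugin is known to do itself.

```php
<?php
// Hypothetical helper for illustration only; not part of Ajax Search Pro.
function fold_marks_and_ligatures_sketch(string $text): string {
    // NFKD splits precomposed letters into base letter + combining marks
    // and expands compatibility characters such as the Latin "fi" ligature.
    // Normalizer comes from the intl extension, so check that it is loaded.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
        if ($decomposed !== false) {
            $text = $decomposed;
        }
    }
    // Remove the non-spacing combining marks left after decomposition.
    $stripped = preg_replace('/\p{Mn}/u', '', $text);
    // preg_replace() returns null on invalid UTF-8; keep the input then.
    return trim($stripped ?? $text);
}

// Usage: a Latin ligature (U+FB01) and an accented letter (U+00E9).
var_dump(fold_marks_and_ligatures_sketch("\u{FB01}anc\u{00E9}"));
```

Note that NFKD also folds other compatibility characters (full-width forms, superscript digits and so on), which may or may not be wanted in an index, so an explicit table like the one above gives tighter control.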

    After adding the code, please re-create the index table (click “Delete index” then “Create new index” buttons).

    If all goes well, this should do the trick. Please let me know if this helps even a little, as I would then include this code in the next live release.

    Best,
    Ernest Marcinko

    If you like my products, don't forget to rate them on codecanyon :)


    #35693
    aharonium
    Participant

    Thank you for taking a look at this and for this code. I’ve added it to functions.php and I generated a new index.

    To test this out, I did a search for a phrase that appears in one post with and without diacritics:
    בּוֹרֵא הַשָּׁמַֽיִם וְנוֹטֵיהֶם
    בורא השמים ונוטיהם

    (It’s on the second line of the post here: https://opensiddur.org/?p=40747.)

    The search menu is accessed via an off-canvas sidebar. There’s a link just next to the logo in the head of the page.

    The good news is that Ajax Search Pro did well for the phrase without diacritical marks.

    [Screenshot: search result for Hebrew text without diacritical marks]

    The bad news is that Ajax Search Pro failed on the phrase with diacritical marks, which means, I think, that the diacritical marks aren’t being removed from the search query.

    [Screenshot: search result for Hebrew text with diacritical marks]

    I mean, the good news is very good news as I think most users will be typing in their search query without diacritical marks. But for some other users, they will be copying and pasting their search query from text found elsewhere on the site (or from off-site resources). Stripping the diacritical marks from their query in order to better match the index table is a very good idea that I hadn’t thought of in my support question. So thank you very much for thinking of that!

    (Figuring out how to integrate a virtual Hebrew or Universal Unicode keyboard with Ajax will probably be my next question after this one.)

    • This reply was modified 6 months ago by aharonium.
    #35705
    Ernest Marcinko
    Keymaster

    You are right, sorry about that. I made a tiny mistake in the code, so it was not being applied to the search input. I corrected it directly on your server via the file editor, and it should now work as expected.

    Can you please check?

    Best,
    Ernest Marcinko



    #35714
    aharonium
    Participant

    It works. Thank you!

    #35715
    Ernest Marcinko
    Keymaster

    You cannot access this content.

    Best,
    Ernest Marcinko



