Hi,
Thank you very much for the details, and the list. I am sorry for the late response.
Normally, the database engine is responsible for ignoring vocalization and accent marks when matching. This issue turned out to be much more interesting than I thought. Initially I wrote a script to handle the ligatures and such, but upon inserting the data into the database, basically only half of the information was inserted: either the words with the “punctuation” and accent marks, or the ones without them, whichever came first.
At first I thought that the database simply did not differentiate between the original and the unvocalized versions. That turned out not to be true, yet it still inserted only one version and treated both as the same word.
Interestingly, when searching, the database did not treat them as a single word.
Long story short, I figured out why: it was related to specific indexes and how the database treats them. So the only possible solution is to remove all vocalization marks and store the keywords that way. Then, when the user does a search, do the same to the input keyword, and voilà, everything matches as it should (in most cases).
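To illustrate the idea in isolation (this is only a standalone sketch, not something to install), the snippet below normalizes a vocalized word and its plain spelling in the same way and shows that they then compare equal. The helper name `demo_strip_marks` is made up for this example and is not part of Ajax Search Pro; the sample word is built from Unicode escapes so the marks are explicit.

```php
<?php
// Hypothetical helper, for illustration only; not part of Ajax Search Pro.
function demo_strip_marks(string $text): string {
    // \p{Mn} matches Unicode non-spacing marks, which include the Hebrew
    // vowel points; the /u modifier makes PCRE treat the input as UTF-8.
    $stripped = preg_replace('/\p{Mn}/u', '', $text);
    // preg_replace() returns null on failure (for example invalid UTF-8),
    // in which case the input is returned unchanged apart from trimming.
    return trim($stripped ?? $text);
}

// "shalom" with vowel points: shin, qamats, shin dot, lamed, vav, holam, final mem.
$vocalized = "\u{05E9}\u{05B8}\u{05C1}\u{05DC}\u{05D5}\u{05B9}\u{05DD}";
// The same word without any marks, as a user would typically type it.
$plain = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}";

$indexed  = demo_strip_marks($vocalized); // form that would be stored at index time
$searched = demo_strip_marks($plain);     // search input normalized the same way

var_dump($vocalized === $plain);  // bool(false): the raw strings differ
var_dump($indexed === $searched); // bool(true): the normalized strings match
```

Because both sides go through the same normalization, the comparison no longer depends on how the database collation treats the marks.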
Try adding the following code to the functions.php file in your theme/child theme directory. Make sure to have a full server backup first for safety. For more details you can check the safe coding guidelines.
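// Re-key the keywords collected for the index by their unvocalized form.
add_filter('asp_indexing_keywords', 'diacritic_asp_indexing_keywords', 10, 1);
function diacritic_asp_indexing_keywords($keywords) {
    $new_kw_arr = array();
    foreach ( $keywords as $keyword => $arr ) {
        $new_kw = hebrew_unvocalize($keyword);
        // Skip keywords that are empty once the marks are removed.
        if ( $new_kw != '' ) {
            if ( !isset($new_kw_arr[$new_kw]) ) {
                $new_kw_arr[$new_kw] = array($new_kw, 1);
            } else {
                // Same word seen again (e.g. with and without marks): bump its count.
                $new_kw_arr[$new_kw][1]++;
            }
        }
    }
    return $new_kw_arr;
}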
add_filter('asp_keyword_after_postproc', 'hebrew_unvocalize', 10, 1);
function hebrew_unvocalize( $str ) {
$hebrew_common_ligatures = array(
'ײַ' => 'ײ',
'ﬠ' => 'ע',
'ﬡ' => 'א',
'ﬢ' => 'ד',
'ﬣ' => 'ה',
'ﬤ' => 'כ',
'ﬥ' => 'ל',
'ﬦ' => 'ם',
'ﬧ' => 'ר',
'ﬨ' => 'ת',
'שׁ' => 'ש',
'שׂ' => 'ש',
'שּׁ' => 'ש',
'שּׂ' => 'ש',
'אַ' => 'א',
'אָ' => 'א',
'אּ' => 'א',
'בּ' => 'ב',
'גּ' => 'ג',
'דּ' => 'ד',
'הּ' => 'ה',
'וּ' => 'ו',
'זּ' => 'ז',
'טּ' => 'ט',
'יּ' => 'י',
'ךּ' => 'ך',
'כּ' => 'כ',
'לּ' => 'ל',
'מּ' => 'מ',
'נּ' => 'נ',
'סּ' => 'ס',
'ףּ' => 'ף',
'פּ' => 'פ',
'צּ' => 'צ',
'קּ' => 'ק',
'רּ' => 'ר',
'שּ' => 'ש',
'תּ' => 'ת',
'וֹ' => 'ו',
'בֿ' => 'ב',
'כֿ' => 'כ',
'פֿ' => 'פ',
'ﭏ' => 'אל'
);
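    // First strip all Unicode non-spacing marks (vowel points, dagesh and similar).
    $new_kw = trim( preg_replace('/\p{Mn}/u', '', $str) );
    // Then fold any remaining precomposed letters and ligatures to their base letters.
    foreach( $hebrew_common_ligatures as $word1 => $word2 ) {
        $new_kw = trim(str_replace( $word1, $word2, $new_kw ));
    }
    return $new_kw;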
}
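As a side note on why a replacement table is needed in addition to the regular expression: precomposed presentation forms are single letters, not base letter plus mark, so `\p{Mn}` does not touch them. The standalone sketch below shows this with two code points, assuming the table above contains their precomposed forms (U+FB2A and U+FB4F). The variable names are made up for the example, and it uses `strtr()` instead of the `str_replace()` loop purely for brevity.

```php
<?php
// U+FB2A is the single code point "shin with shin dot";
// U+FB4F is the alef-lamed ligature.
$precomposed = "\u{FB2A}\u{FB4F}";

// Neither code point is a non-spacing mark, so the regex leaves the string unchanged.
$after_regex = preg_replace('/\p{Mn}/u', '', $precomposed);
var_dump($after_regex === $precomposed); // bool(true)

// Illustrative two-entry excerpt of a replacement table (hypothetical name).
$demo_map = array(
    "\u{FB2A}" => "\u{05E9}",         // shin with shin dot -> shin
    "\u{FB4F}" => "\u{05D0}\u{05DC}", // alef-lamed ligature -> alef + lamed
);
$folded = strtr($after_regex, $demo_map);
var_dump($folded === "\u{05E9}\u{05D0}\u{05DC}"); // bool(true)
```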
After adding the code, please re-create the index table (click “Delete index” then “Create new index” buttons).
If all goes well, this should do the trick. Please let me know if this helps at least a tiny bit, as I would then include this code in the next release.