SAP Metaphone in ABAP
This ABAP effort is based on: http://www.wbrogden.com/
by Bill
Brogden
http://www.wbrogden.com/phonetic/index.html
-- has Java code I translated.
I wrote a SAP ABAP Function module ZJNC_PARSE_ARTDESC which
tokenizes any description into max 3 metaphones and max 2 numbers.
It uses a shared memory class to weed out noise words like "the" "for" ....
It is a SAP NW2004s+ compatible to extent of use of Regular Expressions
feature.
My objective was to search for Retail articles covering Books, Music & Standard Retail stuff.
I used this to create a SAP database table ZMETAPHONE from SAP's material Description table.
It also puts a tag for 3 business areas - Standard Retail, Music, Books.
Then in the final Query screen I displayed a ranked search.
Searching for "Cocktail Bible" "Beatles hits" "Colgate 100gm" "Surf Excel" "Excel Surf" "Basmati Rice 5kg"
got me excellent results with good response times.
SAP has a standard facility for Advanced Text search that is very well integrated in ERP & BI.
TREX is enabled for many Business objects in Master data.
https://www.sdn.sap.com/irj/sdn/go/portal/prtroot/docs/library/uuid/a751a1ec-0a01-0010-f0ba-89e4c5cd0261
shows how easy it is to use. But I wanted something simple and easy to control.
For me the numeric parts are very important in product search.
I had a look at publicly available "Double Metaphone" which I found unsuitable for essentially English English India.
Also having primary & secondary codes is not suitable for creating a Database table.
------------------------------------------------------------------------------------------
FUNCTION zjnc_parse_artdesc.
*"----------------------------------------------------------------------
*"*"Local Interface:
*" IMPORTING
*" REFERENCE(INPTEXT) TYPE CHAR40
*" EXPORTING
*" REFERENCE(METAPHONE1) TYPE CHAR4
*" REFERENCE(METAPHONE2) TYPE CHAR4
*" REFERENCE(METAPHONE3) TYPE CHAR4
*" REFERENCE(NUMBER1) TYPE NUM8
*" REFERENCE(NUMBER2) TYPE NUM8
*"----------------------------------------------------------------------
* shared ABAP class for finding very common
words like THE FOR ...
DATA : myshm TYPE REF TO zshm_common_words.
DATA : hit(1)
TYPE c,
wword
TYPE char5,
len
TYPE i,
off
TYPE i.
TYPES: BEGIN OF t_token,
token(40)
TYPE c,
END OF t_token.
DATA: it_token TYPE STANDARD
TABLE OF t_token.
DATA: wa_token TYPE t_token.
FIELD-SYMBOLS: <fs_token> TYPE
t_token.
TYPES: BEGIN OF t_number,
numlen
TYPE i,
number(8)
TYPE n,
END OF t_number.
DATA: it_number TYPE STANDARD TABLE
OF t_number.
DATA: wa_number TYPE t_number.
FIELD-SYMBOLS: <fs_number> TYPE
t_number.
TYPES: BEGIN OF t_word,
wlen
TYPE i,
word(16)
TYPE c,
END OF t_word.
DATA: it_word TYPE STANDARD TABLE OF
t_word.
DATA: wa_word TYPE t_word.
FIELD-SYMBOLS: <fs_word> TYPE
t_word.
CLEAR: metaphone1, metaphone2,
metaphone3, number1, number2.
" Open default instance for read
TRY.
myshm = zshm_common_words=>attach_for_read(
).
CATCH cx_shm_no_active_version.
WAIT UP TO 1 seconds.
myshm = zshm_common_words=>attach_for_read(
).
ENDTRY.
SPLIT inptext AT space INTO TABLE
it_token.
LOOP AT it_token ASSIGNING <fs_token>.
len = STRLEN( <fs_token>-token
).
IF len = 0.
CONTINUE.
ENDIF.
TRANSLATE <fs_token>-token
TO UPPER CASE.
" Remove other than A-Z 0-9
REPLACE ALL OCCURRENCES OF REGEX '[^A-Z0-9]' IN <fs_token>-token
WITH ''.
CLEAR len.
FIND REGEX '[0-9]+' IN <fs_token>-token MATCH OFFSET
off MATCH LENGTH len.
IF len > 0 AND len
< 9.
MOVE <fs_token>-token+off(len) TO
wa_number-number.
MOVE len TO wa_number-numlen.
APPEND wa_number TO it_number.
ENDIF.
" Remove other than
A-Z
REPLACE ALL OCCURRENCES OF REGEX '[^A-Z]' IN <fs_token>-token
WITH ''.
len = STRLEN( <fs_token>-token
).
IF len > 2.
hit = 'N'.
IF len <
6.
MOVE <fs_token>-token TO
wword.
hit = myshm->root->wordsearch(
wword ).
ENDIF.
IF hit = 'N'.
MOVE <fs_token>-token TO
wa_word-word.
IF len > 16.
MOVE 16 TO len.
ENDIF.
MOVE len TO wa_word-wlen.
APPEND wa_word TO it_word.
ENDIF.
ENDIF.
ENDLOOP.
myshm->detach( ).
SORT it_number BY numlen DESCENDING.
SORT it_word BY wlen DESCENDING.
LOOP AT it_number ASSIGNING <fs_number>.
CASE sy-tabix.
WHEN 1.
MOVE <fs_number>-number TO
number1.
WHEN 2.
MOVE <fs_number>-number TO
number2.
WHEN OTHERS.
EXIT.
ENDCASE.
ENDLOOP.
LOOP AT it_word ASSIGNING <fs_word>.
CASE sy-tabix.
WHEN 1.
PERFORM f_metaphone USING <fs_word>-word
<fs_word>-wlen metaphone1.
WHEN 2.
PERFORM f_metaphone USING <fs_word>-word
<fs_word>-wlen metaphone2.
WHEN 3.
PERFORM f_metaphone USING <fs_word>-word
<fs_word>-wlen metaphone3.
WHEN OTHERS.
EXIT.
ENDCASE.
ENDLOOP.
ENDFUNCTION.
*&--------------------------------------------------------------------*
*& Form f_metaphone
*&--------------------------------------------------------------------*
FORM f_metaphone USING inpword TYPE c
inpwlen TYPE i
CHANGING metaphone TYPE c.
DATA: inpoff
TYPE i,
inofp1 TYPE
i,
inofp2 TYPE
i,
inofm1 TYPE
i,
wlocal(40) TYPE c,
wmetphn(8) TYPE c,
wrdsiz TYPE
i,
outoff TYPE
i,
hard(1) TYPE c,
inpwlm1 TYPE i,
inpchr(1) TYPE c.
CONSTANTS:
maxcodelen TYPE i VALUE 4.
CLEAR: inpoff, outoff, wmetphn.
outoff = 0.
hard = 'N'.
inpwlm1 = inpwlen - 1.
" handle initial 2 characters
exceptions
CASE inpword+0(1).
WHEN 'K' OR 'G' OR 'P'. " looking for KN, etc
IF inpword+1(1) = 'N'.
wlocal = inpword+1(inpwlm1).
ELSE.
wlocal = inpword.
ENDIF.
WHEN 'A'. " looking
for AE
IF inpword+1(1) = 'E'.
wlocal = inpword+1(inpwlm1).
ELSE.
wlocal = inpword.
ENDIF.
WHEN 'W'. " looking
for WR or WH
IF inpword+1(1) = 'R'. " WR -> R
wlocal = inpword+1(inpwlm1).
ELSE.
IF inpword+1(1) = 'H'.
wlocal =
inpword+1(inpwlm1).
wlocal+0(1) = 'W'.
" WH -> W
ELSE.
wlocal = inpword.
ENDIF.
ENDIF.
WHEN 'X'. " initial
X becomes S
wlocal = inpword.
wlocal+0(1) = 'S'.
WHEN OTHERS.
wlocal = inpword.
ENDCASE. " now wlocal has working string with initials fixed
wrdsiz = STRLEN( wlocal ).
inpoff = 0.
DO.
IF outoff >= maxcodelen. " max metaphone size of 4
works well
EXIT.
ENDIF.
IF inpoff >= wrdsiz.
EXIT.
ENDIF.
inofm1 = inpoff - 1.
inofp1 = inpoff + 1.
inofp2 = inpoff + 2.
inpchr =
wlocal+inpoff(1).
" remove duplicate
letters except C
IF ( inpchr <> 'C') AND ( inpoff > 0 ) AND (
wlocal+inofm1(1) = inpchr ).
inpoff = inpoff + 1.
CONTINUE.
ENDIF.
CASE inpchr.
WHEN 'A' OR
'E' OR 'I' OR 'O' OR 'U'.
IF inpoff = 0.
wmetphn+outoff(1) =
inpchr.
outoff = outoff + 1.
ENDIF. " only use vowel if
leading char
WHEN 'B'.
IF ( inpoff > 0 AND
wlocal+inofm1(1) = 'M'
AND (
inofp1 = wrdsiz
OR ( inofp2 = wrdsiz AND wlocal+inofp1(1) = 'E' ) ) ).
" MB or MBE at end of word
inpoff = inpoff + 1.
CONTINUE.
ELSE.
wmetphn+outoff(1) =
inpchr.
outoff = outoff + 1.
ENDIF.
WHEN 'C'.
" lots of C special cases
" discard if SCI, SCE or SCY
IF ( inpoff > 0 ) AND (
wlocal+inofm1(1) = 'S' ) AND ( inofp1 < wrdsiz ).
CASE wlocal+inofp1(1).
WHEN 'E' OR
'I' OR 'Y'.
inpoff = inpoff + 1.
CONTINUE.
ENDCASE.
ENDIF.
IF wlocal+inpoff(3) = 'CIA'. " CIA -> X
wmetphn+outoff(1) =
'X'.
outoff = outoff + 1.
inpoff = inpoff + 1.
CONTINUE.
ENDIF.
IF inofp1 < wrdsiz.
CASE wlocal+inofp1(1).
WHEN 'E' OR
'I' OR 'Y'.
wmetphn+outoff(1) = 'S'.
outoff = outoff + 1. " CI,CE,CY -> S
inpoff = inpoff + 1.
CONTINUE.
ENDCASE.
ENDIF.
IF ( inpoff > 0 ) AND ( wlocal+inofm1(3) = 'SCH' ). " SCH->sk
wmetphn+outoff(1) =
'K'.
outoff = outoff + 1.
inpoff = inpoff + 1.
CONTINUE.
ENDIF.
IF wlocal+inpoff(2) = 'CH'. " detect CH
IF ( inpoff = 0 ) AND (
wrdsiz >= 3 ). " CH consonant -> K consonant
CASE
wlocal+2(1).
WHEN 'A' OR 'E' OR 'I' OR 'O' OR 'U'.
wmetphn+outoff(1) = 'X'. " CHvowel -> X
WHEN OTHERS.
wmetphn+outoff(1) = 'K'.
ENDCASE.
ENDIF.
ELSE.
wmetphn+outoff(1) =
'K'.
ENDIF.
outoff = outoff + 1.
WHEN 'D'.
IF ( inofp2 < wrdsiz ) AND (
wlocal+inofp1(1) = 'G' ). " DGE DGI DGY -> J
CASE wlocal+inofp2(1).
WHEN 'E' OR
'I' OR 'Y'.
wmetphn+outoff(1) = 'J'.
inpoff = inpoff + 2.
ENDCASE.
ELSE.
wmetphn+outoff(1) =
'T'.
ENDIF.
outoff = outoff + 1.
WHEN 'G'.
" GH silent at end or before consonant
IF ( inofp2 = wrdsiz ) AND (
wlocal+inofp1(1) = 'H' ).
inpoff = inpoff + 1.
CONTINUE.
ENDIF.
IF ( inofp2 < wrdsiz ) AND ( wlocal+inofp1(1) = 'H' ).
CASE wlocal+inofp2(1).
WHEN 'A' OR
'E' OR 'I' OR 'O' OR 'U'.
inpoff = inpoff + 0. " do Nothing!
WHEN
OTHERS.
inpoff = inpoff + 1.
CONTINUE.
ENDCASE.
ENDIF.
IF ( inpoff > 0 ) AND ( wlocal+inpoff(2) = 'GN'
OR wlocal+inpoff(4) = 'GNED' ).
inpoff = inpoff + 1.
CONTINUE.
ENDIF. " silent G
IF ( inpoff > 0 ) AND ( wlocal+inofm1(1) = 'G' ).
hard = 'Y'.
ELSE.
hard = 'N'.
ENDIF.
IF ( inofp1 < wrdsiz ) AND ( hard = 'N' ).
CASE wlocal+inofp1(1).
WHEN 'E' OR
'I' OR 'Y'.
wmetphn+outoff(1) = 'J'.
ENDCASE.
ELSE.
wmetphn+outoff(1) =
'K'.
ENDIF.
outoff = outoff + 1.
WHEN 'H'.
IF inofp1 = wrdsiz.
inpoff = inpoff + 1.
CONTINUE.
ENDIF. " terminal H
IF ( inpoff > 0 ).
CASE wlocal+inofm1(1).
WHEN 'C' OR
'S' OR 'P' OR 'T' OR 'G'.
inpoff = inpoff + 1.
CONTINUE.
ENDCASE.
ENDIF.
CASE wlocal+inofp1(1).
WHEN 'A' OR 'E' OR 'I'
OR 'O' OR 'U'.
wmetphn+outoff(1) = 'H'.
outoff =
outoff + 1. " Hvowel
ENDCASE.
WHEN 'F' OR
'J' OR 'L' OR 'M' OR 'N' OR 'R'.
wmetphn+outoff(1) = inpchr.
outoff = outoff + 1.
WHEN 'K'.
IF inpoff > 0. " not
initial
IF wlocal+inofm1(1)
<> 'C'.
wmetphn+outoff(1) = inpchr.
outoff =
outoff + 1.
ENDIF.
ELSE.
wmetphn+outoff(1) =
inpchr. " initial K
outoff = outoff + 1.
ENDIF.
WHEN 'P'.
IF ( inofp1 < wrdsiz ) AND (
wlocal+inofp1(1) = 'H' ). " PH -> F
wmetphn+outoff(1) =
'F'.
ELSE.
wmetphn+outoff(1) =
inpchr.
ENDIF.
outoff = outoff + 1.
WHEN 'Q'.
wmetphn+outoff(1) = 'K'.
outoff = outoff + 1.
WHEN 'S'.
IF ( wlocal+inpoff(2) = 'SH' )
OR ( wlocal+inpoff(3) = 'SIO' )
OR ( wlocal+inpoff(3) = 'SIA' ).
wmetphn+outoff(1) =
'X'.
ELSE.
wmetphn+outoff(1) =
'S'.
ENDIF.
outoff = outoff + 1.
WHEN 'T'.
IF ( wlocal+inpoff(3) = 'TIA' )
OR ( wlocal+inpoff(3) = 'TIO' ).
wmetphn+outoff(1) =
'X'.
outoff = outoff + 1.
inpoff = inpoff + 1.
CONTINUE.
ENDIF.
IF wlocal+inpoff(3) = 'TCH'.
inpoff = inpoff + 1.
CONTINUE.
ENDIF.
" substitute numeral 0 for TH (resembles theta after all)
IF wlocal+inpoff(2) = 'TH'.
wmetphn+outoff(1) =
'0'.
ELSE.
wmetphn+outoff(1) =
'T'.
ENDIF.
outoff = outoff + 1.
WHEN 'V'.
wmetphn+outoff(1) = 'F'.
outoff = outoff + 1.
WHEN 'W' OR
'Y'. " silent if not followed by vowel
IF inofp1 < wrdsiz.
CASE wlocal+inofp1(1).
WHEN 'A' OR
'E' OR 'I' OR 'O' OR 'U'.
wmetphn+outoff(1) = inpchr.
outoff = outoff + 1.
ENDCASE.
ENDIF.
WHEN 'X'.
wmetphn+outoff(1) = 'K'.
outoff = outoff + 1.
wmetphn+outoff(1) = 'S'.
outoff = outoff + 1.
WHEN 'Z'.
wmetphn+outoff(1) = 'S'.
outoff = outoff + 1.
ENDCASE.
inpoff = inpoff + 1.
ENDDO.
metaphone = wmetphn+0(4).
ENDFORM. " end f_metaphone
* Ref: http://aspell.net/metaphone/
* Program Based on: http://www.wbrogden.com/
* http://www.wbrogden.com/phonetic/index.html
*
* Objective: get back a list of words that might have a similar pronunciation.
* This list might be useful if you are looking for alternate spellings.
* Bill Brogden originally made use of this algorithm in an application to help
people transcribing
* legal documents find consistent spellings for the names of people mentioned
in the documents.
*
* The original metaphone algorithm was published by Lawrence Philips
* in an article entitled "Hanging on the Metaphone" in the journal
* Computer Language v7 n12, December 1990, pp39-43.
* His algorithm - translated into Java, and with minor tweaks
* - is what we are using here.
* Naturally, a phonetic encoding system has to assume a particular language
and culture.
* Here we are using essentially American English.
*
* The Metaphone Rules
* Metaphone reduces the alphabet to 16 consonant sounds:
*
* B X S K J T F H L M N P R 0 W Y
*
* That isn't an O but a zero - representing the 'th' sound.
*
* Transformations
* Metaphone uses the following transformation rules:
* Doubled letters except "c" -> drop 2nd letter.
* Vowels are only kept when they are the first letter.
*
* B -> B unless at the end of a word after "m" as in
"dumb"
* C -> X (sh) if -cia- or -ch-
* S if -ci-, -ce- or -cy-
* K otherwise, including -sch-
* D -> J if in -dge-, -dgy- or -dgi-
* T otherwise
* F -> F
* G -> silent if in -gh- and not at end or before a
vowel
* in -gn- or -gned-
(also see dge etc. above)
* J if before i or e or y if not
double gg
* K otherwise
* H -> silent if after vowel and no vowel follows
* H otherwise
* J -> J
* K -> silent if after "c"
* K otherwise
* L -> L
* M -> M
* N -> N
* P -> F if before "h"
* P otherwise
* Q -> K
* R -> R
* S -> X (sh) if before "h" or in -sio- or -sia-
* S otherwise
* T -> X (sh) if -tia- or -tio-
* 0 (th) if before "h"
* silent if in -tch-
* T otherwise
* V -> F
* W -> silent if not followed by a vowel
* W if followed by a vowel
* X -> KS
* Y -> silent if not followed by a vowel
* Y if followed by a vowel
* Z -> S
*
* Initial Letter Exceptions
*
* Initial kn-, gn- pn, ac- or wr- ->
drop first letter
* Initial x-
-> change to "s"
* Initial wh-
-> change to "w"
*
* The code is truncated at 4 characters in this example, but more could be
used.
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Class ZCL_COMMON_WORDS Definition.
Public Section.
Methods:
Constructor,
WordSearch IMPORTING inpword
RETURNING found.
Private Section.
TYPES: BEGIN OF t_common_word,
word(40),
END
OF t_common_word.
DATA:
ht_common_words TYPE HASHED TABLE OF t_common_word
WITH UNIQUE KEY matnr
INITIAL SIZE 0.
DATA:
wa_common_word TYPE t_common_word.
FIELD-SYMBOLS: <fs_common_word> TYPE
t_common_word.
EndClass.
Class ZCL_COMMON_WORDS Implementation.
Method Constructor.
MOVE 'ALL' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'AND' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'ARE' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'BIG' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'BOOK' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'FOR' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'FROM' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'HOW' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'NEW' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'NOT' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'NOW' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'OUT' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'SET' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'THAT' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'THE' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'USE' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'WHEN' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'WHO' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'WHY' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'WILL' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'WITH' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
MOVE 'YOU' TO wa_common_word-word.
INSERT wa_common_word INTO TABLE ht_common_words.
EndMethod.
METHOD wordsearch.
READ TABLE ht_common_words WITH TABLE KEY word = inpword
TRANSPORTING NO FIELDS.
IF sy-subrc = 0.
found = 'Y'.
ELSE.
found = 'N'.
ENDIF.
ENDMETHOD.
METHOD if_shm_build_instance~build.
DATA: area TYPE REF TO
zshm_common_words,
root TYPE REF TO zcl_common_words.
****get a pointer to the Shared Area
TRY.
area = zshm_common_words=>attach_for_write(
).
CATCH cx_shm_no_active_version.
WAIT UP TO 1 seconds.
area = zshm_common_words=>attach_for_write(
).
ENDTRY.
****Create an instance of our root
CREATE OBJECT root AREA HANDLE area.
****Set the root back into the Area
area->set_root( root ).
****Commit and detatch
area->detach_commit( ).
ENDMETHOD.
EndClass.
ZSHM_COMMON_WORDS
------------------------------------------------------------------------------------------
REPORT zmetphnbld.
TABLES: zmetaphone.
DATA: wmatnr LIKE
makt-matnr,
wartnr LIKE wlk1-artnr,
wmaktx LIKE makt-maktx,
wmtart LIKE mara-mtart.
DELETE FROM zmetaphone.
COMMIT WORK.
SELECT makt~matnr
makt~maktx
mara~mtart
INTO (wmatnr, wmaktx, wmtart)
FROM makt INNER JOIN mara
ON makt~matnr = mara~matnr
WHERE spras = 'E'.
CALL FUNCTION 'ZJNC_PARSE_ARTDESC'
EXPORTING
inptext = wmaktx
IMPORTING
metaphone1 = zmetaphone-metaphone1
metaphone2 = zmetaphone-metaphone2
metaphone3 = zmetaphone-metaphone3
number1 = zmetaphone-number1
number2 = zmetaphone-number2.
MOVE wmatnr TO zmetaphone-matnr.
CASE wmtart.
WHEN 'GBNB'.
MOVE 'BB' TO zmetaphone-bustyp.
WHEN 'GADO'.
MOVE 'MW' TO zmetaphone-bustyp.
WHEN OTHERS.
MOVE 'SR' TO zmetaphone-bustyp.
ENDCASE.
SELECT SINGLE artnr
INTO wartnr
FROM wlk1
WHERE artnr = wmatnr.
IF sy-subrc = 0.
INSERT into zmetaphone values zmetaphone.
ENDIF.
ENDSELECT.
COMMIT WORK.
------------------------------------------------------------------------------------------
SAP Table ZMETAPHONE Metaphone Soundex
MANDT MANDT X *
C 000003 00
BUSTYP CHAR2 X *
MATNR MATNR X *
C 000018 00
METAPHONE1 CHAR4
C 000004 00
METAPHONE2 CHAR4
C 000004 00
METAPHONE3 CHAR4
C 000004 00
NUMBER1 NUM8
N 000008 00
NUMBER2 NUM8
N 000008 00
# additional 3 Indexes on (BUSTYP,
METAPHONEn)
------------------------------------------------------------------------------------------