Home » SQL & PL/SQL » SQL & PL/SQL » Term Match with Jaro-Winkler logic (11)
Term Match with Jaro-Winkler logic [message #666934] |
Fri, 01 December 2017 13:52 |
|
OracleUser2017
Messages: 7 Registered: December 2017
|
Junior Member |
|
|
I'm looking for a solution that works for the speed and does not need days or even months / years.
«SEARCH_Term_tbl» has about 300 thousand records and
«TEXT_Term_tbl» has about 10 million records
Would like to search every single word in table «SEARCH_Term_tbl» against every single word in table «TEXT_Term_tbl» and compare it with the Jaro-Winkler logic. The output should be, count every single word found in table «TEXT_Term_tbl» which matches with Jaro-Winkler logic > 96%.
Table Example:
Insert into SEARCH_Term_tbl
SELECT 1 as ST_ID 'Géraldine Jerome dos Santos' as Search_Term FROM DUAL UNION ALL
SELECT 2 as ST_ID 'Magitta Evolet Jerome van der Vart' as Search_Term FROM DUAL UNION ALL
SELECT 3 as ST_ID 'Adi Putera Messi' as Search_Term FROM DUAL UNION ALL
SELECT 4 as ST_ID 'Ronaldo Cristiano Perez Portugal' as Search_Term FROM DUAL
SELECT 5 etc etc
Insert into TEXT_Term_tbl
SELECT 1 as ST_ID ' Rafael Ferdinand van der Vaart ist ein niederländischer Fußballspieler. Der Mittelfeldspieler steht seit August 2016 beim FC Midtjylland in der dänischen Superliga unter Vertrag' as Text_Term FROM DUAL
SELECT 2 as ST_ID 'Cristiano Ronaldo dos Santos Aveiro ist ein portugiesischer Fußballspieler. Er steht nach dem seinerzeit teuersten Transfer der Fußballgeschichte seit Sommer 2009 bei Real Madrid unter Vertrag und ist seit 2015 Rekordtorschütze des Vereins' as Text_Term FROM DUAL
SELECT 3 etc etc
|
|
|
|
Re: Term Match with Jaro-Winkler logic [message #666937 is a reply to message #666934] |
Fri, 01 December 2017 23:09 |
|
Barbara Boehmer
Messages: 9097 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The following separates the words in each table, then compares and counts them. It counts occurrences of words, not distinct words, so for example it counts der in the search_term_tbl matching 3 occurrences of der in the text_term_tbl. If you want to count distinct words, then you could add the distinct keyword, but that would slow it down significantly. Anything that you use is going to be slow. You will have to test the following to see how fast or slow it is.
-- test data:
SCOTT@orcl_12.1.0.2.0> create table SEARCH_Term_tbl as
2 SELECT 1 as ST_ID, 'Géraldine Jerome dos Santos' as Search_Term FROM DUAL UNION ALL
3 SELECT 2 as ST_ID, 'Magitta Evolet Jerome van der Vart' as Search_Term FROM DUAL UNION ALL
4 SELECT 3 as ST_ID, 'Adi Putera Messi' as Search_Term FROM DUAL UNION ALL
5 SELECT 4 as ST_ID, 'Ronaldo Cristiano Perez Portugal' as Search_Term FROM DUAL
6 /
Table created.
SCOTT@orcl_12.1.0.2.0> create table TEXT_Term_tbl as
2 SELECT 1 as ST_ID, ' Rafael Ferdinand van der Vaart ist ein niederländischer Fußballspieler. Der Mittelfeldspieler steht seit August 2016 beim FC Midtjylland in der dänischen Superliga unter Vertrag' as Text_Term FROM DUAL UNION ALL
3 SELECT 2 as ST_ID, 'Cristiano Ronaldo dos Santos Aveiro ist ein portugiesischer Fußballspieler. Er steht nach dem seinerzeit teuersten Transfer der Fußballgeschichte seit Sommer 2009 bei Real Madrid unter Vertrag und ist seit 2015 Rekordtorschütze des Vereins' as Text_Term FROM DUAL
4 /
Table created.
-- query:
SCOTT@orcl_12.1.0.2.0> column term format a30
SCOTT@orcl_12.1.0.2.0> column text format a30
SCOTT@orcl_12.1.0.2.0> select terms.term, count(texts.text)
2 from (select regexp_substr (search_term, '[^ ]+', 1, column_value) term
3 from search_term_tbl,
4 table
5 (cast
6 (multiset
7 (select level
8 from dual
9 connect by level <= regexp_count (search_term, ' ') + 1)
10 as sys.odcinumberlist))) terms,
11 (select regexp_substr (text_term, '[^ ]+', 1, column_value) text
12 from text_term_tbl,
13 table
14 (cast
15 (multiset
16 (select level
17 from dual
18 connect by level <= regexp_count (text_term, ' ') + 1)
19 as sys.odcinumberlist))) texts
20 where utl_match.jaro_winkler_similarity(terms.term, texts.text) > 96
21 group by terms.term
22 /
TERM COUNT(TEXTS.TEXT)
------------------------------ -----------------
van 1
dos 1
Cristiano 1
Santos 1
Ronaldo 1
der 3
6 rows selected.
|
|
|
Re: Term Match with Jaro-Winkler logic [message #666942 is a reply to message #666937] |
Sat, 02 December 2017 07:11 |
|
OracleUser2017
Messages: 7 Registered: December 2017
|
Junior Member |
|
|
Hi, thanks for the feedback, the script works if you have small data to compare, but if you run 10 terms against 10 million text records (ca have 1:n terms), then it taks too long.
Think the "connect by level" taks to long.
No idea how I can run 300 thousand against 10 million?
Is there a way to run the script in parallel or maybe with a smal data dump?
Thanks
[Updated on: Sat, 02 December 2017 07:52] Report message to a moderator
|
|
|
Re: Term Match with Jaro-Winkler logic [message #666949 is a reply to message #666942] |
Sat, 02 December 2017 11:32 |
|
Barbara Boehmer
Messages: 9097 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The connect by level is part of separating each word from each string. There may be more efficient methods. You can find several on this forum. However, the main problem is the shear bulk. Comparing every word from one table to every word in another table requires a huge Cartesian product. I once responded to a similar problem using Levenshtein distance instead of jaro_winkler, with the same speed problem. Tom Kyte said he had a similar problem and speeded it up by first creating the Cartesian product in one table with a blank score, then doing the comparison in sections in parallel. You might be able to do something similar. You can view his comments in the response to my response at the following link, which includes a reference to another link:
https://asktom.oracle.com/pls/asktom/asktom.search?tag=soundex-searches#10577325169281
It might also help, if you would state here what your ultimate goal is, as there may be a simpler solution, perhaps involving Oracle Text features and Oracle Text indexes instead of jaro_winkler which is devised purely for names and is slow. The main cause of the slowness is not the process of splitting the strings into words, but the huge Cartesian product and the jaro_winkler that does not use an index.
|
|
|
Re: Term Match with Jaro-Winkler logic [message #666956 is a reply to message #666949] |
Sun, 03 December 2017 04:37 |
|
OracleUser2017
Messages: 7 Registered: December 2017
|
Junior Member |
|
|
Hi, thanks for the reply
the goal is to check if the first, last name or company names in the table appear in the second table and if so, how exact is the match
It still raises the question of whether the search of names in the second table can not be performed in parallel?
At the moment my script is running the first term over the second table and only when this is finish it goes on with
the next term etc. etc.
I'll work again on this topic tomorrow.....
[Updated on: Sun, 03 December 2017 04:49] Report message to a moderator
|
|
|
Re: Term Match with Jaro-Winkler logic [message #666957 is a reply to message #666956] |
Sun, 03 December 2017 11:33 |
|
Barbara Boehmer
Messages: 9097 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
It sounds like an Oracle Text index and Oracle Text queries using the CONTAINS operator with soundex and fuzzy matching might be a good solution for you. I have provided some examples below, using your sample data. This is just a small sampling of what can be done. The fuzzy matching can be limited to scores above anywhere from 1 to 80. The overall text score can be limited as well. The simple queries that I posted typically raise an error when comparing two large tables, which is why I posted the looping examples, which still use the index. It does fine with one large table, one search term or search phrase at a time. I think you will find that looping with a text index and contains and soundex and fuzzy will be far faster than looping with jaro_winkler. There is also Oracle Text ndata, which is specifically intended for names and classification, but those are much more complex and probably more than you need. I don't know whether it is sufficient to count matched names or if you really need to count matched individual parts of names, so I included both options. I used dbms_output for simple display of results, but you could just as easily insert the results into a table using insert into ... select ... instead of the second loop, which should be faster.
-- test tables and data:
SCOTT@orcl_12.1.0.2.0> create table SEARCH_Term_tbl as
2 SELECT 1 as ST_ID, 'Géraldine Jerome dos Santos' as Search_Term FROM DUAL UNION ALL
3 SELECT 2 as ST_ID, 'Magitta Evolet Jerome van der Vart' as Search_Term FROM DUAL UNION ALL
4 SELECT 3 as ST_ID, 'Adi Putera Messi' as Search_Term FROM DUAL UNION ALL
5 SELECT 4 as ST_ID, 'Ronaldo Cristiano Perez Portugal' as Search_Term FROM DUAL
6 /
Table created.
SCOTT@orcl_12.1.0.2.0> create table TEXT_Term_tbl as
2 SELECT 1 as ST_ID, ' Rafael Ferdinand van der Vaart ist ein niederländischer Fußballspieler. Der Mittelfeldspieler steht seit August 2016 beim FC Midtjylland in der dänischen Superliga unter Vertrag' as Text_Term FROM DUAL UNION ALL
3 SELECT 2 as ST_ID, 'Cristiano Ronaldo dos Santos Aveiro ist ein portugiesischer Fußballspieler. Er steht nach dem seinerzeit teuersten Transfer der Fußballgeschichte seit Sommer 2009 bei Real Madrid unter Vertrag und ist seit 2015 Rekordtorschütze des Vereins' as Text_Term FROM DUAL
4 /
Table created.
-- Oracle Text context index:
SCOTT@orcl_12.1.0.2.0> create index text_idx on text_term_tbl (text_term) indextype is ctxsys.context
2 /
Index created.
-- queries that use Oracle Text context index to search for soundex or fuzzy matches:
-- soundex match or close fuzzy match (score >= 80) on any word in name:
SCOTT@orcl_12.1.0.2.0> select s.search_term, count(*)
2 from search_term_tbl s, text_term_tbl t
3 where contains
4 (t.text_term,
5 '!' || s.search_term ||
6 ' or fuzzy(' || replace (s.search_term, ' ', ',80,5000,w) or fuzzy(') || ',80,5000,w)',
7 1) > 0
8 group by s.search_term
9 /
SEARCH_TERM COUNT(*)
--------------------------------------------- ----------
Magitta Evolet Jerome van der Vart 2
Géraldine Jerome dos Santos 1
Ronaldo Cristiano Perez Portugal 1
3 rows selected.
SCOTT@orcl_12.1.0.2.0> select s.search_term as search_term, count(*)
2 from (select regexp_substr (search_term, '[^ ]+', 1, column_value) search_term
3 from search_term_tbl,
4 table
5 (cast
6 (multiset
7 (select level
8 from dual
9 connect by level <= regexp_count (search_term, ' ') + 1)
10 as sys.odcinumberlist))) s,
11 text_term_tbl t
12 where contains (t.text_term, '!' || s.search_term || ' or fuzzy(' || s.search_term || ',80,5000,w)', 1) > 0
13 group by s.search_term
14 /
SEARCH_TERM COUNT(*)
--------------------------------------------- ----------
van 1
dos 1
Cristiano 1
Santos 1
Vart 1
Ronaldo 1
der 2
7 rows selected.
-- soundex match or fuzzy match (score >= 1) on any word in name:
SCOTT@orcl_12.1.0.2.0> select s.search_term, count(*)
2 from search_term_tbl s, text_term_tbl t
3 where contains
4 (t.text_term,
5 '!' || s.search_term ||
6 ' or fuzzy(' || replace (s.search_term, ' ', ',1,5000,w) or fuzzy(') || ',1,5000,w)',
7 1) > 0
8 group by s.search_term
9 /
SEARCH_TERM COUNT(*)
--------------------------------------------- ----------
Magitta Evolet Jerome van der Vart 2
Géraldine Jerome dos Santos 2
Ronaldo Cristiano Perez Portugal 1
3 rows selected.
SCOTT@orcl_12.1.0.2.0> select s.search_term as search_term, count(*)
2 from (select regexp_substr (search_term, '[^ ]+', 1, column_value) search_term
3 from search_term_tbl,
4 table
5 (cast
6 (multiset
7 (select level
8 from dual
9 connect by level <= regexp_count (search_term, ' ') + 1)
10 as sys.odcinumberlist))) s,
11 text_term_tbl t
12 where contains (t.text_term, '!' || s.search_term || ' or fuzzy('||s.search_term||',1,5000,w)', 1) > 0
13 group by s.search_term
14 /
SEARCH_TERM COUNT(*)
--------------------------------------------- ----------
van 1
dos 1
Cristiano 1
Santos 1
Vart 2
Ronaldo 1
der 2
Géraldine 1
8 rows selected.
-- examples of looping using Oracle text context index:
-- soundex match or close fuzzy match (score >= 80) on any word in name:
SCOTT@orcl_12.1.0.2.0> begin
2 for s in
3 (select search_term from search_term_tbl)
4 loop
5 for tt in
6 (select s.search_term, count(*) cnt
7 from text_term_tbl t
8 where contains
9 (t.text_term,
10 '!' || s.search_term ||
11 ' or fuzzy(' || replace (s.search_term, ' ', ',80,5000,w) or fuzzy(') || ',80,5000,w)',
12 1) > 0
13 group by s.search_term)
14 loop
15 dbms_output.put_line (rpad (tt.search_term, 60) || tt.cnt);
16 end loop;
17 end loop;
18 end;
19 /
Géraldine Jerome dos Santos 1
Magitta Evolet Jerome van der Vart 2
Ronaldo Cristiano Perez Portugal 1
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> begin
2 for s in
3 (select regexp_substr (search_term, '[^ ]+', 1, column_value) search_term
4 from search_term_tbl,
5 table
6 (cast
7 (multiset
8 (select level
9 from dual
10 connect by level <= regexp_count (search_term, ' ') + 1)
11 as sys.odcinumberlist)))
12 loop
13 for tt in
14 (select s.search_term, count(*) cnt
15 from text_term_tbl t
16 where contains (t.text_term, '!' || s.search_term || ' or fuzzy('||s.search_term||',80,5000,w)', 1) > 0
17 group by s.search_term)
18 loop
19 dbms_output.put_line (rpad (tt.search_term, 60) || tt.cnt);
20 end loop;
21 end loop;
22 end;
23 /
dos 1
Santos 1
van 1
der 2
Vart 1
Ronaldo 1
Cristiano 1
PL/SQL procedure successfully completed.
|
|
|
|
Re: Term Match with Jaro-Winkler logic [message #667000 is a reply to message #666985] |
Mon, 04 December 2017 13:05 |
|
Barbara Boehmer
Messages: 9097 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
OracleUser2017 wrote on Mon, 04 December 2017 03:41Hi, get not the smae results, maybe in cas of the index?
I I create --> indextype is ctxsys.context then it shows me the indextype domain ?
Are you saying that when you create the same table and same index, and run the same queries as I did that you do not get the same results that I posted? If so, then please post a copy and paste of a run from SQL*Plus of the complete process as I did. If this is not the problem, then please clarify, preferably with a copy and paste of an example.
Oracle Text context indexes are a type of domain index. If you are using autotrace or some such thing to show the execution plan that the optimizer has chosen, then the context index will be listed as a domain index.
|
|
|
|
|
|
|
|
Re: Term Match with Jaro-Winkler logic [message #667036 is a reply to message #667013] |
Tue, 05 December 2017 10:48 |
|
Barbara Boehmer
Messages: 9097 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
As the error message says you have a syntax error, you probably have a syntax error. If you will just post a copy and paste of the code you are actually running that produces the error, I might be able to tell what the problem is. You can certainly add a loop within your loop to log errors and log the values. I suspect that some of your names that you are searching for may have some special characters in them that have special meaning to Oracle Text and cause a syntax error. You can eliminate this by either removing those characters or enclosing the string within double quotes or escaping the characters with a backslash. If you use double quotes it only looks for an exact match and will not use soundex or fuzzy. I cannot help you if you will not post your code. I don't know why you keep ignoring that.
|
|
|
Goto Forum:
Current Time: Sat Sep 28 18:05:50 CDT 2024
|