lunes, 20 de febrero de 2017

UTL_MATCH : String Matching by Testing Levels of Similarity/Difference


The UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented (and therefore supported) in Oracle 11g Release 2. It contains a variety of functions that are helpful for testing the level of similarity/difference between strings.
Related articles.

Setup

The examples in this article require the following table definition.
DROP TABLE match_tab;

CREATE TABLE match_tab (
  id   NUMBER,
  col1 VARCHAR2(15),
  col2 VARCHAR2(15),
  CONSTRAINT match_tab_pk PRIMARY  KEY (id)
);

INSERT INTO match_tab VALUES (1, 'Peter Parker', 'Pete Parker');
INSERT INTO match_tab VALUES (2, 'Peter Parker', 'peter parker');
INSERT INTO match_tab VALUES (3, 'Clark Kent', 'Claire Kent');
INSERT INTO match_tab VALUES (4, 'Wonder Woman', 'Ponder Woman');
INSERT INTO match_tab VALUES (5, 'Superman', 'Superman');
INSERT INTO match_tab VALUES (6, 'The Hulk', 'Iron Man');
COMMIT;

EDIT_DISTANCE

The "Edit Distance", or "Levenshtein Distance", test measures the similarity between two strings by counting the number of character changes (inserts, updates, deletes) required to transform the first string into the second. The number of changes required is know as the distance.
SELECT col1,
       col2,
       UTL_MATCH.edit_distance(col1, col2) AS ed
FROM   match_tab
ORDER BY id;

COL1            COL2                    ED
--------------- --------------- ----------
Peter Parker    Pete Parker              1
Peter Parker    peter parker             2
Clark Kent      Claire Kent              2
Wonder Woman    Ponder Woman             1
Superman        Superman                 0
The Hulk        Iron Man                 8

6 rows selected.

SQL>

EDIT_DISTANCE_SIMILARITY

The EDIT_DISTANCE_SIMILARITY function uses the same method as the EDIT_DISTANCE function to determine the similarity of the strings, but it returns a normalized result ranging from 0 (no match) to 100 (complete match).
SELECT col1,
       col2,
       UTL_MATCH.edit_distance_similarity(col1, col2) AS eds
FROM   match_tab
ORDER BY id;

COL1            COL2                   EDS
--------------- --------------- ----------
Peter Parker    Pete Parker             92
Peter Parker    peter parker            84
Clark Kent      Claire Kent             82
Wonder Woman    Ponder Woman            92
Superman        Superman               100
The Hulk        Iron Man                 0

6 rows selected.

SQL>


SELECT id,
       col1,
       col2,
       UTL_MATCH.edit_distance_similarity(col1, col2) AS eds
FROM   match_tab
WHERE  UTL_MATCH.edit_distance_similarity(col1, col2) > 90
ORDER BY id;

        ID COL1            COL2                   EDS
---------- --------------- --------------- ----------
         1 Peter Parker    Pete Parker             92
         4 Wonder Woman    Ponder Woman            92
         5 Superman        Superman               100

SQL>

JARO_WINKLER

The "Jaro-Winkler Algorithm" provides a different method for finding the distance between two strings.
SELECT col1,
       col2,
       UTL_MATCH.jaro_winkler(col1, col2) AS jw
FROM   match_tab
ORDER BY id;

COL1            COL2                    JW
--------------- --------------- ----------
Peter Parker    Pete Parker     9.288E-001
Peter Parker    peter parker    8.889E-001
Clark Kent      Claire Kent     9.083E-001
Wonder Woman    Ponder Woman    9.444E-001
Superman        Superman          1.0E+000
The Hulk        Iron Man        4.167E-001

6 rows selected.

SQL>

JARO_WINKLER_SIMILARITY

The JARO_WINKLER_SIMILARITY function uses the same method as the JARO_WINKLER function to determine the similarity of the strings, but it returns a normalized result ranging from 0 (no match) to 100 (complete match).
SELECT col1,
       col2,
       UTL_MATCH.jaro_winkler_similarity(col1, col2) AS jws
FROM   match_tab
ORDER BY id;

COL1            COL2                   JWS
--------------- --------------- ----------
Peter Parker    Pete Parker             92
Peter Parker    peter parker            88
Clark Kent      Claire Kent             90
Wonder Woman    Ponder Woman            94
Superman        Superman               100
The Hulk        Iron Man                41

6 rows selected.

SQL>


SELECT col1,
       col2,
       UTL_MATCH.jaro_winkler_similarity(col1, col2) AS jws
FROM   match_tab
WHERE  UTL_MATCH.jaro_winkler_similarity(col1, col2) > 90
ORDER BY id;

COL1            COL2                   JWS
--------------- --------------- ----------
Peter Parker    Pete Parker             92
Wonder Woman    Ponder Woman            94
Superman        Superman               100

SQL>

0 comentarios:

Publicar un comentario