IE83577B1 - A data quality system - Google Patents
A data quality system
- Publication number
- IE83577B1 (application IE2002/0648A)
- Authority
- IE
- Ireland
- Prior art keywords
- record
- matching
- data
- field
- records
- Prior art date
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
          - G06F16/21—Design, administration or maintenance of databases
            - G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
  - Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
    - Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      - Y10S707/00—Data processing: database and file management or data structures
        - Y10S707/99931—Database or file accessing
          - Y10S707/99933—Query processing, i.e. searching
            - Y10S707/99935—Query augmenting and refining, e.g. inexact access
        - Y10S707/99941—Database schema or data structure
          - Y10S707/99943—Generating database or data structure, e.g. via user interface
Description
A data quality system
INTRODUCTION
Field of the Invention
The invention relates to a data quality system.
Prior Art Discussion
Data quality is important for companies maintaining large volumes of information in the
form of structured data. It is becoming an increasingly critical issue for companies with
very large numbers of customers (for example banks, utilities, and airlines). Many such
companies have already, or are about to, implement customer relationship management
(CRM) systems to improve their business development. Effective operation of CRM
systems involves drawing data from a range of operational systems and aggregating it on
a customer-by-customer basis. This involves a large degree of data matching based on
criteria such as customer identification details. Such matching and associated operations
are often ineffective because of bad quality data. The data quality problems which often
arise include:
- empty fields;
- lack of conformity, such as the letter “H” in a phone number field;
- lack of consistency across fields of a record, such as “customer status = live” and “last invoice date = 20/01/99”;
- lack of integrity of field values; and
- duplicates.
In more detail, data matching difficulties arise from (a) the multitude of different ways in
which two equivalent sets of data can differ, and (b) the very large volumes of data
generally involved. This means that carrying out the task manually is impossible or
hugely costly and defining a finite set of basic matching rules to automate the process is
extremely difficult. As organisations collect more data from more sources and attempt
to use this data efficiently and effectively, they encounter this problem more
frequently, and its negative impact is growing.
It is therefore an objective of the invention to provide a data quality system to improve
data quality.
SUMMARY OF THE INVENTION
According to the invention, there is provided a data quality system for matching input
data across data records, the system comprising:-
means for pre-processing the input data to remove noise or reformat the data,
means for matching record pairs based on measuring similarity of selected field
pairs within the record, and for generating a similarity indicator for each record
pair,
wherein the matching means comprises means for extracting a similarity vector
for each record pair by generating a similarity score for each of a plurality of pairs
of fields in the records, the set of scores for a record pair being a vector.
In another embodiment, the vector extraction means comprises means for executing
string matching routines on pre-selected field pairs of the records.
In a further embodiment, a matching routine comprises means for determining an edit
distance indicating the number of edits required to change from one value to the other
value.
In one embodiment, a matching routine comprises means for comparing numerical
values by applying numerical weights to digit positions.
In another embodiment, the vector extraction means comprises means for generating a
vector value between 0 and 1 for each field pair in a record pair.
In a further embodiment, the matching means comprises record scoring means for
converting the vector into a single similarity score representing overall similarity of the
fields in each record pair.
In one embodiment, the record scoring means comprises means for executing rule-based
routines using weights applied to fields according to the extent to which each field is
indicative of record matching.
In another embodiment, the record scoring means comprises means for computing
scores using an artificial intelligence technique to deduce from examples given by the
user an optimum routine for computing the score from the vector.
In a further embodiment, the artificial intelligence technique used is case-based
reasoning (CBR).
In one embodiment, the artificial intelligence technique used comprises neural network
processing.
In another embodiment, the pre-processing means comprises a standardisation module
comprising means for transforming each data field into one or more target data fields
each of which is a variation of the original.
In a further embodiment, the standardisation module comprises means for splitting a
data field into multiple field elements, converting the field elements to a different format,
removing noise characters, and replacing elements with equivalent elements selected
from an equivalence table.
In one embodiment, the pre-processing means comprises a grouping module comprising
means for grouping records according to features to ensure that all actual matches of a
record are within a group, and wherein the matching means comprises means for
comparing records within groups only.
In a further embodiment, the grouping module comprises means for applying labels to a
record in which a label is determined for a plurality of fields in a record and records are
grouped according to similarity of the labels.
In one embodiment, a label is a key letter for a field.
In another embodiment, the system further comprises a configuration manager
comprising means for applying configurable settings for the pre-processing means and
for the matching means.
In a further embodiment, the system further comprises a tuning manager comprising
means for refining, according to user inputs, operation of the record scoring means.
In one embodiment, the tuning manager comprises means for using a rule-based
approach for a first training run and an artificial intelligence approach for subsequent
training runs.
DETAILED DESCRIPTION OF THE INVENTION
Brief Description of the Drawings
The invention will be more clearly understood from the following description of some
embodiments thereof, given by way of example only with reference to Fig. 1, which is a
block diagram illustrating a data quality system of the invention.
Description of the Embodiments
Referring to Fig. 1, a data quality system 1 comprises a user interface 2 linked with a
configuration manager 3 and a tuning manager 4. A data input adapter 5 directs input
data to a pipeline 6 which performs data matching in a high-speed and accurate manner.
The pipeline 6 comprises:
- a pre-processor 7 having a standardisation module 8 and a grouping module 9, and
- a matching system 11 comprising a similarity vector extraction module 12 and a record scoring module 13.
The output of the pipeline 6 is fed to an output datafile 15.
The system 1 operates to match equivalent but non-identical information. This
matching enables records to be amended to improve data quality.
The system 1 (“engine”) processes one or multiple datasets to create an output data file
containing a list of all possible matching record pairs and a similarity score. Depending on
the needs of the user the engine can then automatically mark certain record pairs above a
specified score as definite matches and below a specified score as non-matches. Record
pairs with scores between these two thresholds may be sent to a user interface for manual
verification.
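The thresholding behaviour described above can be sketched as follows; the function name and the example threshold values are assumptions for illustration, not values from the specification.

```python
def classify_pair(similarity_score, upper=0.85, lower=0.40):
    """Route a scored record pair: definite match, definite non-match, or review."""
    if similarity_score >= upper:
        return "match"        # above the upper threshold: marked automatically
    if similarity_score <= lower:
        return "non-match"    # below the lower threshold: marked as non-match
    return "review"           # between the thresholds: manual verification

print(classify_pair(0.92), classify_pair(0.60), classify_pair(0.10))
# match review non-match
```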
There are a number of discrete activities within the matching process. These can be
grouped into two separate phases: pre-processing and matching.
Pre-processing
In the pre-processing phase all data records are read sequentially from the data input
adapters 5. Firstly each record is fed to the standardisation module 8 where a range of
different routines are applied to generate an output record which can be matched more
effectively with other records. Each record is then fed to a grouping module 9. In this
process labels are attached to each record to enable it to be easily and quickly grouped
with other similar records. This makes the downstream matching process more efficient
as it eliminates the need to compare records which are definite non-matches.
Following the grouping process the output record (transformed and labelled) is written to
the pre-processed datafile.
Matching
In the matching phase, each record is read in sequence from the pre-processed
dataset. It is then compared to each similar record in the dataset, i.e. records within the
same group. The comparison process involves:
1. Similarity Vector Extraction: This involves comparing individual fields within a
record pair using matching algorithms to generate a similarity score for each pair of
fields. Data element scoring is carried out on a number of field pairs within the
record pair to generate a set of similarity scores called a similarity vector.
2. Data Record Scoring: Once a similarity vector has been produced for a record pair by
a series of data element scoring processes, the data record scoring process converts
the vector into a single similarity score. This score represents the overall similarity of
the two records.
The pair of output records is then written to the output datafile along with the similarity
score. The matching phase then continues with the next pair of possible matching pairs.
To achieve high accuracy matching, the setup of the modules is highly specific to the
structure and format of the dataset(s) being processed. A key advantage of the engine is
built-in intelligence and flexibility which allow easy configuration of optimum setup for
each of the modules. Initial setup of the four processing modules is managed by the
configuration manager 3 and the tuning manager 4.
Standardisation (“Transformation”) Module 8
The aim of the transformation process is to remove many of the common sources of
matching difficulty while ensuring that good data is not destroyed in the process. This is
done by transforming the individual elements of a record into a range of different
formats which will aid the matching process. Each data field in a record is transformed
into a number of new data fields each of which is a variation of the original.
Each data record is read in turn from the adaptor 5. Each field within a record is
processed by applying a number of predefined transformation routines to the field. Each
transformation routine produces a new output data field. Thus, an output record is
produced containing a number of data fields for each field in the input record. Field
transformation routines include:
- Splitting a data field into multiple fields, for example splitting street address into
number, name and identifier.
- Converting field elements to another format using conversion routines, for example:
  - Converting to uppercase.
  - Converting to phonetic code (Soundex).
  - Converting to an abbreviated version.
  - Converting to a standardised format (e.g. international telephone codes).
  - Converting to a business-specific version.
- Removal of characters from within a data field, for example:
  - Removal of spaces between specified elements.
  - Removal of specified symbols from between specified elements (e.g. punctuation marks or hyphens).
- Replacement of an element with an equivalent element selected from an equivalence table, for example:
  - Replacement of a nickname or shortened name with the root name.
  - Replacement of an Irish or foreign-language place or person name with the English equivalent.
  - Replacement of standard abbreviations with the root term (st. to street, rd. to road etc.).
  - Replacement of a company name with a standardised version of the name.
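A few of the transformation routines listed above can be sketched in code. This is a minimal illustration only: the equivalence-table entries, the function name, and the choice of output variants are assumptions, not the patented configuration.

```python
import re

# Tiny illustrative equivalence table (abbreviation/nickname -> root term).
EQUIVALENCE = {"rd.": "road", "st.": "street", "jon": "john"}

def transform_field(value):
    """Produce several variants of one input field, as module 8 is described doing."""
    upper = value.upper()                                    # uppercase variant
    no_noise = re.sub(r"[^A-Za-z0-9 ]", "", value)           # strip punctuation/symbols
    tokens = [EQUIVALENCE.get(t.lower(), t) for t in value.split()]
    standardised = " ".join(tokens)                          # expand abbreviations
    return {"upper": upper, "no_noise": no_noise, "standardised": standardised}

print(transform_field("Oak Rd."))
# {'upper': 'OAK RD.', 'no_noise': 'Oak Rd', 'standardised': 'Oak road'}
```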
The transformation module 8 is capable of carrying out a user-defined number of
transforms such as those above to each input data field and generating a user-defined
number of output fields for each input field. The transforms required for each field type
may be configured by:
- Selecting from a menu of default transformation configurations (sets of routines)
  predefined for use with a particular field type of a particular
  structure/format/quality level.
- Developing new configurations for each data field or element from a menu of
  transformations such as those above.
- Developing new configurations for each data field or element using bespoke
  transformations input by the user, probably combined with some predefined
  transformations.
In batch matching projects the transformation process is carried out on the whole
database before any matching is done. A new data file of transformed elements is then
created for use in the matching process. This saves time by ensuring that the minimum
number of transformations, N, is carried out (where N is the number of records in the
database) rather than the potential maximum number of transformations, N×N.
However in realtime search and match operation the transformation process is carried
out directly before the matching process for each record.
The following is a transformation example.
Input Record:

Firstname  Surname  Address1   Address2  Address3  DOB      Telephone
John       O'Brien  3 Oak Rd.  Douglas   Co. Cork  20/4/66  021-234678

Output Record:

FN_stan  FN_Soundex  FN_Root   SN_stan  SN_Soundex  SN_root  A1_Num
John     Jon         Jonathon  OBrien   O-165       Brien    3

A1_text  A1_text_soundex  A1_st  A2_text  A2_str_soundex  A3_st   A3_text
Oak      O-200            Road   Douglas  Duglass         County  Cork

DOB_Eur  DOB_US  Telephone  Tel_local
                 353212346  234678
Grouping Module 9
The aim of the data record grouping process is to significantly speed up the matching
step by reducing the number of record pairs which go through the set of complex match
scoring routines. This is done by grouping records which have certain similar features;
only records within the same group are then compared in the matching phase. (This
greatly reduces the number of matching steps required from N×N to G×H×H, where G is
the number of groups and H is the number of elements per group.)
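The arithmetic behind this reduction can be checked with illustrative figures (all values below are assumed, not from the specification).

```python
N = 1_000_000   # records in the dataset (assumed figure)
G = 10_000      # number of groups (assumed figure)
H = N // G      # records per group, here 100

pairwise = N * N        # comparisons without grouping (N x N)
grouped = G * H * H     # comparisons with grouping (G x H x H)

print(pairwise // grouped)  # 10000: four orders of magnitude fewer comparisons
```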
The module 9 ensures that all actual matches of any record are contained within the
same group. The grouping process must be kept simple so that minimal processing time
is required to identify elements in the same group. In addition, to have a real impact on
efficiency the groups must be substantially smaller than the full dataset (at least 10 times smaller).
After the transformation process is performed on an individual data record a further set
of predefined routines is applied to certain fields of the record. These routines extract
features from the data fields. These features are included in a small number (2-4) of extra
data fields appended to the output record. These labels allow the record to be grouped
with other similar records.
The key attributes of the labels are:
- There must be a very high probability (99.999%) that all matching records have some or
  all of the same labels.
- Labels must be easily extracted from the data fields.
- Labels must be impervious to any range of data errors which have not been
  corrected by the transformation process, for example, spelling errors, typing
  errors, different naming conventions, and mixed fields.
The grouping process is a high-speed filtering process intended to significantly reduce the
number of matches required, rather than a substitute for the matching process. As such, in
order to keep the grouping process simple but ensure that no matches are missed, each
group is large and the vast majority of records within a group will not match.
An example of the type of routine used in the grouping process is a keyletter routine.
The keyletter is defined as the most important matching letter in the field, generally the
first letter of the main token: J for John, B for OBrien, O for Oak, D for Douglas, C for
Cork. For example, the label fields may then contain the first letters of firstname, surname,
address1 and address2.
The grouping criteria may then be set to X (e.g. 2 to 4) common labels. Matching
is only carried out on records whose label fields contain two or more of the same letters.
The keyletter may also be derived from the soundex fields.
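A keyletter routine of the kind described above might be sketched as follows. The main-token heuristic (ignoring non-alphabetic tokens such as house numbers, and skipping a leading "O" prefix as in OBrien), along with the sample field values, are illustrative assumptions.

```python
def keyletter(field):
    """First letter of the main token, e.g. B for OBrien, O for '3 Oak Rd.'."""
    tokens = [t for t in field.replace("'", "").split() if t.isalpha()]
    if not tokens:
        return ""
    token = tokens[0]
    if token[0].upper() == "O" and len(token) > 1 and token[1].isupper():
        return token[1].upper()   # OBrien -> B (assumed prefix heuristic)
    return token[0].upper()

def same_group(labels_a, labels_b, min_common=2):
    """Grouping criterion: records share at least `min_common` label positions."""
    common = sum(1 for a, b in zip(labels_a, labels_b) if a and a == b)
    return common >= min_common

rec1 = [keyletter(f) for f in ["John", "OBrien", "Oak", "Douglas", "Cork"]]
rec2 = [keyletter(f) for f in ["Jon", "Bryan", "Oakdale", "Duglass", "Cork"]]
print(rec1, same_group(rec1, rec2))  # ['J', 'B', 'O', 'D', 'C'] True
```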
In many cases the keyletter may not be the appropriate labelling routine. The grouping
module must have the flexibility to allow the user to define a number of bespoke
labelling routines appropriate to the dataset (for example, if a particular data element
within a dataset has a particularly high confidence level, grouping may be focused
largely on it). The user may do this by:
a. selecting a default grouping configuration predefined for this type of dataset,
b. firstly selecting the most appropriate fields, secondly selecting the appropriate
labelling routines from a menu, thirdly defining the grouping criteria for the
labels, or
c. as above but inputting customised labelling routines.
Example
Input Record:

Firstname  Surname  Address1   Address2  Address3  DOB      Telephone
John       O'Brien  3 Oak Rd.  Douglas   Co. Cork  20/4/66  021-234678

Output Record:

FN_stan  FN_Soundex  FN_Root   SN_stan  SN_Soundex  SN_root  A1_Num
John     Jon         Jonathon  OBrien   O-165       Brien    3

A1_text  A1_text_soundex  A1_st  A2_text  A2_text_nysiis  A3_st   A3_text
Oak      O-200            Road   Douglas  DAGL            County  Cork

DOB_Eur  DOB_US  Telephone  Tel_local
                 353212346  234678

Output Record Grouping Labels:

FN_keyletter  SN_keyletter  A1_keyletter  A2_keyletter  A3_keyletter
J             B             O             D             C
Similarity Vector Extraction Module 12
Each data field within a record is compared with one or more fields from the other
record of a pair being compared. All records in each group are compared with all of the
other records in the same group. The objective here is to ensure that equivalent data
elements are matched using an appropriate matching routine even if the elements are not
stored in equivalent fields.
Each pair of records is read into the vector extraction module from the preprocessed
datafile. This module firstly marks the data fields from each record which should be
compared to each other. It then carries out the comparison using one of a range of
different string matching routines. The string matching routines are configured to
accurately estimate the “similarity” of two data elements. Depending on the type/ format
of the data elements being compared, different matching routines are required. For
example, for a normal word an “edit distance” routine, which measures how many edits
are required to change one element into the other, is a suitable comparison routine. However,
for an integer it is more appropriate to use a routine which takes into account the
difference between each individual digit and the different importance levels of the various
digits (i.e. in the number 684 the 6 is more important than the 8, which is more important
than the 4). Examples of matching routines are edit distance, Hamming distance, Dice,
and longest common substring routines.
The output of the matching routine is a score between 0 and 1, where 1 indicates an
identical match and 0 indicates a definite non-match. The output of a data field scoring
routine is a set of similarity scores, one for each of the data field pairs compared. This set of
scores is called a similarity vector.
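Two of the matching routines named above, edit distance for words and a position-weighted comparison for numbers, can be sketched as follows. The normalisation of each result into a 0-to-1 score, and the power-of-two digit weights, are assumed design choices for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum edits to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def word_similarity(a, b):
    """Score in [0, 1]: 1 = identical, 0 = completely different."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def digit_similarity(a, b):
    """Weight high-order digits more heavily (in 684 the 6 matters most)."""
    a, b = str(a).zfill(len(str(b))), str(b).zfill(len(str(a)))
    weights = [2 ** (len(a) - 1 - i) for i in range(len(a))]
    agree = sum(w for w, x, y in zip(weights, a, b) if x == y)
    return agree / sum(weights)

print(round(word_similarity("Douglas", "Duglass"), 2))  # 0.71
```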
The module 12 allows the user to select the data fields within the dataset(s) to be used
in the matching process, to select which fields are to be matched with which, and to
define the matching routine used for each comparison. The user configures the process by:
- selecting from a menu of default configurations suitable for the dataset(s),
- manually selecting the data fields to be compared and selecting the appropriate
matching routine from a menu of predefined routines, and
- manually creating customised matching routines to suit particular data field
types.
Example
Input Record 1:

FN_stan  FN_Soundex  FN_Root   SN_stan  SN_Soundex  SN_root  A1_Num
John     J-500       Jonathon  OBrien   O-165       Brien    3

A1_text  A1_text_soundex  A1_st  A2_text  A2_str_soundex  A3_st   A3_text
Oak      O-200            Road   Douglas  D242            County  Cork

DOB_Eur   DOB_US    Telephone  Tel_local
20041966  04201966  353212346  234678

Input Record 2:

FN_stan  FN_Soundex  FN_Root   SN_stan  SN_Soundex  SN_root  A1_Num
Jon      J-500       Jonathon  Bryan    B-650       Brien

A1_text  A1_text_soundex  A1_st  A2_text  A2_text_sdx  A2_st  A3_text
Oakdale  O-234            Close  Oake     O-230        Road   Duglass

A4_st   A4_text  A4_text_sdx  DOB_Eur   DOB_US    Telephone  Tel_local
County  Cork     C-620        02041968  04021968

Output Similarity Vector:

FN_stan  FN_Root  SN_stan  SN_root  A1_Num  A1_text  A1_st
.7       1        .5       1        .5      .5       0

A2_text  A2_st  A3_text  A4_st  A4_text  A1A2_text  A2A1_text
0        0      0        0      0        .8         0

A2A3_text  A3A2_text  A3A4_txt  DOB_Eur  DOB_US  Telephone  Tel_local
.8         0          1         .8       .8      -
The output of the data field matching process is a vector of similarity scores indicating
the similarity level of the data fields within the two records. The data field matching
module is capable of performing a user-defined number and type of comparisons between two
data records and generating a score for each, i.e. the user will define which fields or
elements of one record will be compared to which elements in the other record. The
user will also define which matching algorithm is used for each comparison. In defining
these parameters the user can:
- Select a default matching configuration stored in the system 1 for a specified field
type.
- Select the required matching routine for a particular data field type from a menu
of predefined routines.
- Input a customised matching routine.
Data Record Scoring Module 13
The aim of the data record scoring is to generate a single similarity score for a record
pair which accurately reflects the true similarity of the record pair relative to other record
pairs in the dataset. This is done by using a variety of routines to compute a similarity
score from the similarity vector generated by the module 12.
There are two different types of routine used by the module 13 to generate a score.
- Rule-based routines: these routines use a set of rules and weights to compute an
  overall score from the vector. The weights are used to take into account that some
  fields are more indicative of overall record similarity than others. The rules are used
  to take into account that the relationship between individual field scores and the overall
  score may not be linear. The following is an example of a rule-based computation.
  FN = Largest of (FN_stan, FN_Root)
  SN = Largest of (SN_stan, SN_root)
  A1_text = Largest of (A1_text, A1A2_text)
  A2_text = Largest of (A2_text, A2A1_text, A2A3_text)
  A3_text = Largest of (A3_text, A3A2_text)
  DOB = Largest of (DOB_Eur, DOB_US)
  Score = FN + SN + A1_text + A2_text + A3_text + A4_text +
          (A1_st + A2_st + A3_st + A4_st) / 4
- AI-based routines: these routines automatically derive an optimum match score
  computation algorithm based on examples of correct and incorrect matches
  identified by the user. Depending on the situation, the type of AI technology used
  may be based on either neural networks or case-based reasoning.
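The rule-based computation shown above can be rendered directly in code. The sketch below is purely illustrative: missing vector entries default to 0, and because the printed Score line omits the DOB term the rules define, the sketch includes DOB in the sum as one plausible reading.

```python
def rule_based_score(v):
    """Compute an overall record score from a similarity vector (a dict)."""
    def largest(*keys):
        return max(v.get(k, 0.0) for k in keys)
    fn = largest("FN_stan", "FN_Root")
    sn = largest("SN_stan", "SN_root")
    a1 = largest("A1_text", "A1A2_text")
    a2 = largest("A2_text", "A2A1_text", "A2A3_text")
    a3 = largest("A3_text", "A3A2_text")
    a4 = v.get("A4_text", 0.0)
    dob = largest("DOB_Eur", "DOB_US")  # defined by the rules; assumed to join the sum
    streets = (v.get("A1_st", 0.0) + v.get("A2_st", 0.0)
               + v.get("A3_st", 0.0) + v.get("A4_st", 0.0)) / 4
    return fn + sn + a1 + a2 + a3 + a4 + dob + streets

# Vector values taken from the worked similarity-vector example above.
vector = {"FN_stan": 0.7, "FN_Root": 1.0, "SN_stan": 0.5, "SN_root": 1.0,
          "A1_Num": 0.5, "A1_text": 0.5, "A1A2_text": 0.8, "A2A3_text": 0.8,
          "DOB_Eur": 0.8, "DOB_US": 0.8}
print(round(rule_based_score(vector), 2))  # 4.4
```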
The optimum routine required to derive the most accurate similarity scores for all record
pairs is highly specific to the types and quality of data within a particular dataset. For
this reason default routines generally do not give the best match accuracy. In order to
achieve top levels of accuracy, a trial and error process is implemented by the tuning
manager 4 to “tune” the scoring routine. This requires the user to:
- run the whole matching process a number of times for a portion of the dataset.
- inspect the results after each run to check the proportion of correct and incorrect
matches.
- manually adjust the parameters of the score computation routine.
This process is difficult to do with a rule-based routine as there are a large number of
variables to tweak. However, the AI-based system is ideal for this process. It removes the
need to tweak different variables, as the AI technology derives a new score computation
routine automatically based on learning from the manual inspection of the match
results. Since the AI process requires training data, the system 1 uses a rule-based
routine on the first training run and uses an AI routine thereafter.
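A minimal case-based reasoning sketch of this arrangement: after the first rule-based run, later runs score a vector by reusing the outcome of the nearest user-verified example. The function names, the Euclidean distance measure, and the example cases are all illustrative assumptions, not the patented routine.

```python
def cbr_score(vector, verified_cases):
    """Score a similarity vector by analogy to user-verified record pairs."""
    def dist(a, b):
        # Euclidean distance between two sparse vectors (missing keys -> 0).
        keys = set(a) | set(b)
        return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5
    best = min(verified_cases, key=lambda case: dist(vector, case[0]))
    return best[1]  # reuse the verified outcome: 1.0 = match, 0.0 = non-match

cases = [({"FN": 0.9, "SN": 1.0}, 1.0),   # user verified: match
         ({"FN": 0.1, "SN": 0.2}, 0.0)]   # user verified: non-match
print(cbr_score({"FN": 0.8, "SN": 0.9}, cases))  # 1.0
```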
The record scoring module 13 is configured to allow user selection or setup of both the
rule-based and AI-based routines. The user configures the rule-based routine by:
- Selecting from a menu of rule-based routine configurations predefined for
  common dataset types.
- Selecting a predefined configuration but adjusting individual parameters (e.g.
  weighting of a certain field type).
- Defining a customised routine.
The user will set up the AI-based routine by:
- Selecting a recommended AI-based routine for the particular matching
  conditions (one-off batch matching, ongoing periodic matches etc.).
- Selecting from a menu of configurations of that AI-based routine predefined for
common dataset types.
- Selecting a predefined configuration but adjusting individual parameters.
It will be appreciated that the system achieves fast and easy set up and configuration of
new matching processes involving new datasets or match criteria, and easy set up of
ad hoc matching analyses. The system also achieves scheduling of ongoing periodic
matching processes using predefined configurations. The system is callable from
third-party applications or middleware, and it has the capability to read data from a range of
input data formats, and to provide a range of output data functions.
Important advantages of the system are:
- Accuracy. It is capable of delivering highly accurate automated matching
  through the use of complex layers of processing and matching routines to
  compensate for the full range of data matching problems. It minimises the
  number of true matches not identified and of non-matches labelled as matches.
- Configurability. It enables easy setup of customised routines often required due
  to the highly specific nature of individual datasets. It allows the user to select
  parameters based on knowledge of which fields are likely to be most indicative of
  a match, the likely quality of individual fields, and likely problems with
  fields/elements. The system 1 uses a “wizard” type process to help the user to
  configure bespoke routines to remove problem characters within fields, and to
  transform elements into standardised formats.
- Ease of setup. There is built-in intelligence to facilitate high-accuracy setup and
  tuning by a non-expert user. Setup is based on the user's knowledge of the data, and
  it guides the user on development of processing routines. Artificial intelligence is
  used to automatically tune the matching process based on examples of good and bad
  matches as verified by the user.
- Speed. It uses intelligent processing to quickly reduce a dataset to a subset of “all
  possible matches”. The high-speed pipeline 6 maximises processing speed.
- Open architecture. The architecture uses component-based design to facilitate
  easy integration with other systems or embedding of the core engine within other
  technologies.
The system of the invention is therefore of major benefit to businesses by, for example:
- improving the value of data so that it is business-ready;
- reducing project risks and time overruns in data migration projects; and
- reducing manual verification costs.
The system is also very versatile as it may interface on the input side with any of a wide
range of legacy systems and output cleaned data to a variety of systems such as CRM,
data-mining, data warehouse, and ERP systems. Furthermore, the structure of the
system allows different modes of operation, including interactive data cleaning for data
projects, batch mode for embedded processes, and real-time mode for end-user
applications.
The invention is not limited to the embodiments described but may be varied in
construction and detail.
Claims (1)
- Claims A data quality system for matching input data across data records, the system comprising:- means for pre-processing the input data to remove noise or reformat the data, means for matching record pairs based on measuring similarity of selected field pairs within the record, and for generating a similarity indicator for each record pair, wherein the matching means comprises means for extracting a similarity vector for each record pair by generating a similarity score for each of a plurality of pairs of fields in the records, the set of scores for a record pair being a vector. A system as claimed in claim 1, wherein the vector extraction means comprises means for executing string matching routines on pre-selected field pairs of the records. A system as claimed in claim 2, wherein a matching routine comprises means for determining an edit distance indicating the number of edits required to change from one value to the other value. A system as claimed in claims 2 or 3, wherein a matching routine comprises means for comparing numerical values by applying numerical weights to digit positions. A system as claimed in any preceding claim, wherein the vector extraction means comprises means for generating a Vector value between 0 and l for each field pair in a record pair. A system as claimed in any preceding claim, wherein the matching means comprises record scoring means for converting the vector into a single similarity score representing overall similarity of the fields in each record pair. A system as claimed in claim 6, wherein the record scoring means comprises means for executing rule-based routines using weights applied to fields according to the extent to which each field is indicative of record matching. A system as claimed in claims 6 or 7, wherein the record scoring means comprises means for computing scores using an artificial intelligence technique to deduce from examples given by the user an optimum routine for computing the score from the vector. 
9. A system as claimed in claim 8, wherein the artificial intelligence technique used is case-based reasoning (CBR).
10. A system as claimed in claim 8, wherein the artificial intelligence technique used comprises neural network processing.
11. A system as claimed in any preceding claim, wherein the pre-processing means comprises a standardisation module comprising means for transforming each data field into one or more target data fields, each of which is a variation of the original.
12. A system as claimed in claim 11, wherein the standardisation module comprises means for splitting a data field into multiple field elements, converting the field elements to a different format, removing noise characters, and replacing elements with equivalent elements selected from an equivalence table.
13. A system as claimed in any preceding claim, wherein the pre-processing means comprises a grouping module comprising means for grouping records according to features to ensure that all actual matches of a record are within a group, and wherein the matching means comprises means for comparing records within groups only.
14. A system as claimed in claim 13, wherein the grouping module comprises means for applying labels to a record, in which a label is determined for a plurality of fields in a record and records are grouped according to similarity of the labels.
15. A system as claimed in claim 14, in which a label is a key letter for a field.
16. A system as claimed in any preceding claim, wherein the system further comprises a configuration manager comprising means for applying configurable settings for the pre-processing means and for the matching means.
17. A system as claimed in any of claims 6 to 16, wherein the system further comprises a tuning manager comprising means for refining, according to user inputs, operation of the record scoring means.
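The grouping module of claims 13 to 15 restricts pairwise matching to records that share a label, so the matcher never compares every record against every other. A minimal sketch, assuming the "key letter" label is simply each field's first alphanumeric character (an illustrative choice; the specification may derive labels differently):

```python
from collections import defaultdict

def key_letter(value: str) -> str:
    """Label a field by its first alphanumeric character (claim 15)."""
    for ch in value:
        if ch.isalnum():
            return ch.upper()
    return "_"

def group_records(records: list, label_fields: list) -> dict:
    """Group records whose labels agree on every chosen field (claim 14);
    the matching means then compares record pairs within a group only."""
    groups = defaultdict(list)
    for rec in records:
        label = tuple(key_letter(rec[f]) for f in label_fields)
        groups[label].append(rec)
    return dict(groups)
```

Because candidate matches are only sought inside a group, the label scheme must be coarse enough that genuine duplicates ("Cork" vs "cork") still land in the same group.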
18. A system as claimed in claim 17, wherein the tuning manager comprises means for using a rule-based approach for a first training run and an artificial intelligence approach for subsequent training runs.
19. A data quality system substantially as described with reference to the drawings.
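The rule-based record scoring of claims 6 and 7 collapses the similarity vector into one score by weighting each field according to how indicative it is of a record match. A hedged sketch, with the weight values and the 0.85 match threshold chosen purely for illustration:

```python
def record_score(vector: list, weights: list) -> float:
    """Convert a similarity vector into a single similarity score (claim 6)
    as a weighted average, with weights reflecting how strongly each
    field indicates a record match (claim 7)."""
    return sum(v * w for v, w in zip(vector, weights)) / sum(weights)

def is_match(vector: list, weights: list, threshold: float = 0.85) -> bool:
    """Declare two records a match when the overall score clears a
    configurable threshold (an illustrative decision rule)."""
    return record_score(vector, weights) >= threshold
```

A tuning step such as the one in claims 17 and 18 would adjust the weights (first by rules, later from user-labelled examples) rather than hard-coding them as here.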
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IE2002/0648A IE83577B1 (en) | 2002-08-02 | A data quality system |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IE2001/0744 (Ireland) | 2001-08-03 | | |
IE20010744 | 2001-08-03 | ||
IE2002/0648A IE83577B1 (en) | 2002-08-02 | A data quality system |
Publications (2)
Publication Number | Publication Date |
---|---|
IE20020648A1 IE20020648A1 (en) | 2003-03-19 |
IE83577B1 true IE83577B1 (en) | 2004-09-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7281001B2 (en) | Data quality system | |
CN105373365B (en) | For managing the method and system of the archives about approximate string matching | |
CN112035599B (en) | Query method and device based on vertical search, computer equipment and storage medium | |
EP1192789A1 (en) | A method of developing an interactive system | |
CN101794307A (en) | Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea | |
CN116628173B (en) | Intelligent customer service information generation system and method based on keyword extraction | |
US7627567B2 (en) | Segmentation of strings into structured records | |
CN109902142B (en) | Character string fuzzy matching and query method based on edit distance | |
CN111782763A (en) | Information retrieval method based on voice semantics and related equipment thereof | |
CN110276080B (en) | Semantic processing method and system | |
EP1331574A1 (en) | Named entity interface for multiple client application programs | |
CN113010632A (en) | Intelligent question answering method and device, computer equipment and computer readable medium | |
CN115470338B (en) | Multi-scenario intelligent question answering method and system based on multi-path recall | |
CN117453717B (en) | Data query statement generation method, device, equipment and storage medium | |
CN110362596A (en) | A kind of control method and device of text Extracting Information structural data processing | |
CN114625748A (en) | SQL query statement generation method and device, electronic equipment and readable storage medium | |
CN117349420A (en) | Reply method and device based on local knowledge base and large language model | |
CN109446277A (en) | Relational data intelligent search method and system based on Chinese natural language | |
CN113886420B (en) | SQL sentence generation method and device, electronic equipment and storage medium | |
CN112214494B (en) | Retrieval method and device | |
IE83577B1 (en) | A data quality system | |
CN112632991B (en) | Method and device for extracting characteristic information of Chinese language | |
Hanafi et al. | Alliance Rules-based Algorithm on Detecting Duplicate Entry Email | |
CN115470787A (en) | Similar word processing method and device based on word vectors | |
CN117573828A (en) | Data extraction method and system based on annual report content retrieval |