CN112463774B - Text data duplication eliminating method, equipment and storage medium - Google Patents

Text data duplication eliminating method, equipment and storage medium

Info

Publication number
CN112463774B
Authority
CN
China
Prior art keywords
data
deduplicated
block
deduplication
pairs
Prior art date
Legal status
Active
Application number
CN202011150210.0A
Other languages
Chinese (zh)
Other versions
CN112463774A (en)
Inventor
于淼
刘炎
覃建策
陈邦忠
Current Assignee
Perfect World Holding Group Ltd
Original Assignee
Perfect World Holding Group Ltd
Priority date
Filing date
Publication date
Application filed by Perfect World Holding Group Ltd filed Critical Perfect World Holding Group Ltd
Priority to CN202011150210.0A priority Critical patent/CN112463774B/en
Publication of CN112463774A publication Critical patent/CN112463774A/en
Application granted granted Critical
Publication of CN112463774B publication Critical patent/CN112463774B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages

Abstract

Embodiments of the present application provide a text data deduplication method, device, and storage medium. In these embodiments, when target data to be deduplicated is processed, data pairs to be deduplicated are obtained from the target data and input into a pre-trained deduplication model. In the deduplication model, the field values of the key fields to be deduplicated in each input data pair are compared according to preset comparison rules for different field types, and the similarity of the data pair is calculated from the comparison results. In this way, fields of different types can be deduplicated effectively, and a better data deduplication effect is achieved.

Description

Text data duplication eliminating method, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, a device, and a storage medium for removing duplicate text data.
Background
In the internet era, massive amounts of data flood into our lives, and how to mine useful information and remove redundant data has become key to improving the efficiency of information acquisition.
Existing data deduplication methods perform poorly and cannot achieve a good deduplication result in large-scale data deduplication scenarios. A new solution is therefore needed.
Disclosure of Invention
Aspects of the present application provide a text data deduplication method, device, and storage medium, so as to deduplicate data effectively and improve the deduplication result.
An embodiment of the present application further provides a text data deduplication method, including: acquiring target data to be deduplicated; acquiring multiple groups of data pairs to be deduplicated from the target data, wherein each group of data pairs to be deduplicated comprises multiple data records; respectively inputting the multiple groups of data pairs to be deduplicated into a pre-trained deduplication model, and acquiring the respective deduplication results of the multiple groups of data pairs to be deduplicated output by the deduplication model; and determining a first deduplication result of the target data according to the respective deduplication results of the multiple groups of data pairs to be deduplicated; wherein the deduplication model is configured to: compare field values of key fields to be deduplicated in an input data pair to be deduplicated according to preset comparison rules corresponding to different field types, so as to determine the similarity of the data pair to be deduplicated and obtain the deduplication result of the data pair to be deduplicated.
Further optionally, the respectively inputting the multiple groups of data pairs into a pre-trained deduplication model, and obtaining deduplication results of the multiple groups of data pairs output by the deduplication model, includes: inputting the data pairs to be deduplicated into the deduplication model for any one of the multiple groups of data pairs to be deduplicated; extracting field values of the key fields to be deduplicated from the data pairs to be deduplicated in the deduplication model; comparing the field values of the key fields to be deduplicated according to the comparison rules corresponding to the different field types to obtain comparison results of the key fields to be deduplicated; and performing weighted calculation on the comparison result of the key fields to be deduplicated by using the weight parameters of different field types learned in advance by the deduplication model to obtain the similarity of the data pairs to be deduplicated.
Further optionally, acquiring multiple sets of data pairs to be deduplicated from the target data, where each set of data pairs to be deduplicated includes multiple data records, including: dividing the target data into a plurality of blocks, wherein the data records in each block have the same specific characteristics; determining the corresponding relation between the data record contained in the target data and the plurality of blocks; and selecting the data records of which the corresponding blocks meet set conditions from the target data as a group of data pairs to be deduplicated according to the corresponding relation between the data records contained in the target data and the blocks.
Further optionally, the block dividing the target data to obtain a plurality of blocks, where data records in each block have the same specific characteristics, includes: extracting respective predicate indexes of each data record contained in the target data by adopting a predicate function; and dividing the data records with at least one same predicate index into the same block to obtain the plurality of blocks contained in the target data.
Further optionally, determining the corresponding relationship between the data record included in the target data and the plurality of blocks includes: determining a block key of each of the plurality of blocks according to at least one predicate index corresponding to each of the plurality of blocks; respectively setting block IDs for the blocks to obtain a plurality of block IDs; and determining the block ID corresponding to the data record contained in the target data according to the corresponding relation between the predicate index of the data record contained in the target data and the block keyword of each block, and establishing the corresponding relation between the data ID and the block ID of each data record in the target data.
Further optionally, selecting, from the target data, a data record whose corresponding block meets a set condition as a group of data pairs to be deduplicated, including: and determining the corresponding data records with the same block ID from the target data as a group of data pairs to be deduplicated.
Further optionally, the method further comprises: sequencing the plurality of block IDs in an ascending order to obtain an ascending sequencing result; for any one of the plurality of block IDs, determining at least one block ID smaller than the block ID from the ascending sorting result as a small-valued block ID of the block ID.
Further optionally, selecting, from the target data, a data record whose corresponding block meets a set condition as a group of data pairs to be deduplicated, including: determining small-value block IDs of block IDs corresponding to a plurality of data records in the target data, and using the small-value block IDs as the small-value block IDs corresponding to the plurality of data records; and determining the data records with partial overlapping small-value block IDs from the target data as a group of data records to be deduplicated.
Further optionally, after obtaining the respective deduplication results of the multiple groups of data pairs output by the deduplication model, the method further includes: if the target data is incremental deduplication data, determining stock deduplicated data; extracting data records from the first deduplication result and the stock deduplicated data respectively to obtain multiple groups of data pairs to be matched; respectively inputting the multiple groups of data pairs to be matched into a pre-trained matching model, and acquiring the matching results of the multiple groups of data pairs to be matched output by the matching model; and determining a second deduplication result of the target data according to the respective matching results of the multiple groups of data pairs to be matched; wherein the matching model is configured to: extract a field value of a set second key field from an input data pair to be matched, and compare the extracted field values of the second key field according to preset comparison rules corresponding to different field types, so as to determine the similarity of the data pair to be matched.
Further optionally, extracting data records from the first deduplication result and the stock deduplicated data respectively to obtain multiple groups of data pairs to be matched includes: selecting, from the stock deduplicated data, multiple data blocks each having the same data size as the first deduplication result; respectively performing block division on the multiple data blocks to obtain the blocks contained in each data block, so as to determine the block IDs of the data records contained in each data block; and, for any one of the data blocks, selecting from the first deduplication result a data record having the same block ID as the i-th data record, and combining it with the i-th data record selected from the data block to form the i-th group of data pairs to be matched, where i = 1, 2, 3, …, n and n represents the total number of data records contained in the first deduplication result.
Further optionally, determining a second deduplication result of the target data according to the matching result of each of the multiple sets of data to be matched, including: dividing data records contained in the multiple groups of data pairs to be matched into multiple data groups according to the matching results of the multiple groups of data pairs to be matched; the data records contained in the data pairs to be matched, the matching results of which are not repeated, are respectively divided into different data groups, and the data records contained in the data pairs to be matched, the matching results of which are repeated, are divided into the same data groups; and carrying out repeatability judgment on the plurality of data sets so as to determine a repeated data set and a non-repeated data set from the plurality of data sets.
Further optionally, performing a repeatability judgment on the plurality of data sets to determine a repeated data set and a non-repeated data set from the plurality of data sets includes: for a first data group and a second data group in the plurality of data groups, if one data record in the first data group is duplicated with one data record in the second data group, determining that the first data group and the second data group are duplicated.
Further optionally, the method further comprises: if one second data group and the first data group exist in the plurality of data groups and are repeated data groups, distributing the data records in the first data group to the second data group; if a plurality of second data groups and the first data group exist in the plurality of data groups, the centroid distances between the data records in the first data group and the plurality of second data groups are respectively calculated, and the data records in the first data group are distributed to the second data group with the smallest centroid distance.
Further optionally, the comparison rules corresponding to the different field types include at least one of the following: a numerical comparison rule corresponding to fields of the numeric type; an affine gap penalty comparison rule corresponding to fields of the common short text type; a completely consistent comparison rule corresponding to fields of the short-text unique identifier type; and a cosine similarity comparison rule corresponding to fields of the long text type.
Further optionally, the method further comprises: determining the key fields to be deduplicated and the comparison rules corresponding to the different field types; acquiring multiple groups of data pairs as training data, wherein the training data comprises multiple groups of duplicate data pairs and multiple groups of non-duplicate data pairs; determining the respective calculated similarity values of the multiple groups of data pairs based on the weight parameters of the algorithm model and the comparison rules corresponding to the different field types; and, taking the respective true similarity values of the multiple groups of data pairs as supervision signals, optimizing the weight parameters of the algorithm model according to the respective calculated similarity values of the multiple groups of data pairs, so as to obtain the deduplication model.
Further optionally, the deduplication model is a logistic regression model.
An embodiment of the present application further provides an electronic device, including: a memory and a processor; the memory is to store one or more computer instructions; the processor is to execute the one or more computer instructions to: and executing the steps in the data deduplication method provided by the embodiment of the application.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program, and the computer program can implement the steps in the text data deduplication method provided in the embodiment of the present application when executed.
In the text data deduplication method provided by the embodiments of the present application, when target data to be deduplicated is processed, data pairs to be deduplicated are obtained from the target data and input into a pre-trained deduplication model. In the deduplication model, the field values of the key fields to be deduplicated in each input data pair are compared according to preset comparison rules for different field types, and the similarity of the data pair is calculated from the comparison results. In this way, fields of different types can be deduplicated effectively, and a better data deduplication effect is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart illustrating a deduplication model training method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a database table and a deduplication intermediate table provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a model training method provided in an exemplary embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a data deduplication method according to an exemplary embodiment of the present application;
FIG. 5 is a schematic flow chart diagram illustrating a data deduplication method according to another exemplary embodiment of the present application;
FIG. 6 is a schematic flow chart diagram illustrating a data deduplication method according to yet another exemplary embodiment of the present application;
fig. 7 is a structural diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Knowledge fusion refers to fusing descriptive information about the same entity or concept from multiple data sources, and integrating and disambiguating heterogeneous knowledge from different data sources under a unified specification. Data deduplication can be understood as knowledge fusion in a specific scenario, and is also referred to simply as deduplication in the following description.
Today, with the rapid development of the internet, massive amounts of data flood into our lives, and grasping useful information while removing duplicate and redundant data has become key to improving efficiency when handling business.
At present, the main ways of deduplicating large-scale data include: manual review and deduplication, comparing and deduplicating data with a program, computing an MD5 (Message-Digest Algorithm 5) value for the data, Hash-based deduplication, BitMap-based deduplication, and the like.
In the manual review mode, data are compared manually, and repeated data are marked and removed. However, the manual review requires high labor and time costs, and is not suitable for deduplication scenarios with large data volumes.
In the method of comparing and deduplicating data with a program, a database table needs to be constructed and code needs to be written; the data is compared by equality or by regular string matching to pick out duplicate data for removal, and the contents of the data table need to be maintained and updated regularly. However, program comparison is rigid: only identical contents can be matched by equality comparison, and it is difficult to deduplicate data whose contents are merely similar, such as texts with very small differences, so the effect is poor. Meanwhile, the constructed database table needs regular maintenance, deduplication has to be performed again after the table is updated, and the manual review or coded checking steps are executed repeatedly, which takes a long time; when there are many data fields, rules for the different fields are difficult to formulate and maintain.
MD5 is a message digest algorithm that generates a specific string from a string or a file according to fixed rules. The MD5 digest of a file is deterministic, and the MD5 value changes once the content of the file changes, so MD5 values are often used in applications to verify whether a piece of data has been tampered with.
In the method of deduplicating data by computing MD5 values, an MD5 value is computed over the field contents of each record according to the characteristics of the MD5 algorithm, and duplicate records are then identified from the computed MD5 values. After the data is stored in the database, duplicate data is located with a database query statement (e.g., an SQL statement) and then removed or flagged. Deduplication by MD5 value can compare data of different types; however, memory and CPU throughput are limited within a fixed time, and deduplicating a large amount of data in this way consumes a great deal of memory.
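As a minimal illustration of this prior-art MD5 approach (not part of the claimed method), the following Python sketch computes an MD5 digest over the field values of each record and groups records whose digests collide; the record structure and field names are hypothetical:

import hashlib
from collections import defaultdict

def md5_of_record(record, fields):
    # Concatenate the selected field values and return the MD5 hex digest.
    payload = "\x1f".join(str(record.get(f, "")) for f in fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def find_md5_duplicates(records, fields):
    # Group records whose selected fields produce identical MD5 digests.
    groups = defaultdict(list)
    for rec in records:
        groups[md5_of_record(rec, fields)].append(rec)
    return {digest: group for digest, group in groups.items() if len(group) > 1}

records = [
    {"company": "ACME", "title": "Java Engineer", "city_id": 11},
    {"company": "ACME", "title": "Java Engineer", "city_id": 11},
    {"company": "ACME", "title": "Java  Engineer", "city_id": 11},  # tiny text difference -> different digest
]
print(find_md5_duplicates(records, ["company", "title", "city_id"]))

The last record illustrates the limitation noted above: a single extra space changes the digest, so near-duplicate text is not detected.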
In Hash-based deduplication, a hash table method can be used: the data is divided into several groups, which are scanned to build a hash table. During each scan, the first byte, the last byte and any two middle bytes of the data are taken as the hash code and inserted into the hash table, and the address, information length and repetition count of the hash code are recorded. A common application of hash-based deduplication is the Bloom filter. Hash-based deduplication does not require much memory and several hash algorithms can run concurrently, but this approach is prone to errors and misjudgments.
In BitMap-based deduplication, the state of a data item is one of three values: absent, present once, or duplicated. During deduplication, 2 bits are used to encode these states; the data is traversed and the state bits are updated, thereby achieving deduplication. BitMap deduplication is good at handling large data volumes, especially numeric data, but it has difficulty dealing with text data.
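A minimal sketch of the 2-bit BitMap idea described above, assuming non-negative integer values with a known upper bound (state 0 = unseen, 1 = seen once, 2 = duplicate):

def bitmap_dedup(values, max_value):
    # 2 bits per possible value: 0 = unseen, 1 = seen once, 2 = duplicate.
    bits = bytearray((max_value * 2 + 7) // 8 + 1)

    def get_state(v):
        byte, off = (v * 2) // 8, (v * 2) % 8
        return (bits[byte] >> off) & 0b11

    def set_state(v, state):
        byte, off = (v * 2) // 8, (v * 2) % 8
        bits[byte] = (bits[byte] & ~(0b11 << off)) | (state << off)

    unique = []
    for v in values:
        state = get_state(v)
        if state == 0:
            set_state(v, 1)
            unique.append(v)
        elif state == 1:
            set_state(v, 2)   # mark as duplicate, keep only the first occurrence
    return unique

print(bitmap_dedup([3, 7, 3, 9, 7, 7], max_value=10))  # -> [3, 7, 9]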
In conclusion, existing large-scale data deduplication techniques suffer from high labor and time costs, difficulty in handling text fields, large memory consumption, deduplication results that are prone to misjudgment, the need to maintain data tables, and difficulty in formulating deduplication rules, and therefore cannot be applied well to large-scale data deduplication scenarios.
In view of the above technical problems, in some embodiments of the present application, a solution is provided, and the technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
The data deduplication method provided by the embodiments of the present application is implemented based on algorithm models obtained through supervised learning, namely a deduplication model and a matching model. The deduplication model deduplicates the data against itself; after the incremental data has been self-deduplicated, the matching model matches it against the already deduplicated data for a further round of deduplication, so as to optimize the final deduplication result.
A data deduplication method based on an algorithm model mainly relates to the steps of customization of key fields to be deduplicated, formulation of key field comparison rules, data sample labeling, training of deduplication models and matching models, data deduplication prediction, multithreading incremental data matching, manual data labeling, model iteration and the like. The following section will first describe the training method of the deduplication model.
Fig. 1 is a schematic flowchart of a deduplication model training method according to an exemplary embodiment of the present application, and as shown in fig. 1, the deduplication model training method includes:
step 101, determining key fields to be deduplicated and comparison rules corresponding to different field types.
Step 102, acquiring multiple groups of data pairs as training data, wherein the multiple groups of data pairs include duplicate data pairs and non-duplicate data pairs.
And 103, determining the respective calculated similarity values of the multiple groups of data pairs based on the weight parameters of the algorithm model and the comparison rules corresponding to the different field types.
And 104, taking the respective true similarity values of the multiple groups of data pairs as supervision signals, and optimizing the weight parameters of the algorithm model according to the respective calculated similarity values of the multiple groups of data pairs, so as to obtain the deduplication model.
In the present embodiment, the multiple groups of data pairs used as training data may be described in JSON (JavaScript Object Notation) format. Each data pair may contain multiple (e.g., two or more) data records. In some embodiments, duplicate data pairs and non-duplicate data pairs may be written into two lists, each list containing multiple groups of data pairs, and each data pair containing the key fields required for deduplication and the field values of those fields. A typical labeled training sample is as follows:
{"distinct":[[{data_json_11},{data_json_12}],[{data_json_21}, {data_json_22}], ...],
"match":[[{data_json_31},{data_json_32}],[{data_json_33}, {data_json_34}], ...]}
Under the distinct key is a list composed of multiple groups of non-duplicate data pairs; under the match key is a list composed of multiple groups of duplicate data pairs. data_json represents a data record, which contains the key fields and field values to be compared during deduplication. For convenience of description and distinction, the key fields to be compared by the deduplication model are described as the key fields to be deduplicated; the key fields to be deduplicated may include one key field or multiple key fields, which is not limited in this embodiment.
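A minimal sketch, assuming the labeled sample above is saved to a JSON file, of turning it into (record, record, label) tuples for training (label 1 for duplicate pairs under match, 0 for non-duplicate pairs under distinct); the function name is illustrative:

import json

def load_training_pairs(path):
    with open(path, encoding="utf-8") as f:
        labeled = json.load(f)
    pairs = []
    for key, label in (("match", 1), ("distinct", 0)):
        for pair in labeled.get(key, []):
            # each pair is a list of data records; two records per pair here
            pairs.append((pair[0], pair[1], label))
    return pairs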
In a large-scale data scenario, manual labeling is time-consuming and labor-intensive, so the deduplication model can be trained in a semi-supervised manner. During semi-supervised learning, a small amount of labeled data is combined with a large amount of unlabeled data for model training. Most of the unlabeled data can be the deduplication predictions produced by the deduplication model after it has been put into use, and the deduplication model can be iteratively retrained based on these prediction results. The following mainly describes an alternative embodiment of training the deduplication model based on labeled data.
The comparison rules corresponding to different field types include at least one of the following: a numerical comparison rule corresponding to fields of the numeric type; an affine gap penalty comparison rule corresponding to fields of the common short text type; a completely consistent comparison rule corresponding to fields of the short-text unique identifier type; and a cosine similarity comparison rule corresponding to fields of the long text type.
A field of the common short text type means that the corresponding field value is an ordinary character string, i.e., a string type in the database, which is text but limited in length. A field of the short-text unique identifier type means that the corresponding field value is also an ordinary string, but has special properties that allow it to uniquely identify the data. A field of the long text type means that the corresponding field value is of text type, i.e., the text type in the database, and its length is generally greater than a set threshold.
The comparison rules for the different types of fields will be further illustrated with reference to the data table shown in FIG. 2. Assume that, in a business scenario of deduplicating recruitment information, the key fields to be deduplicated are those shown in the job information table of FIG. 2, including: user ID, unique identification ID, company name, job title, job type ID, skill level ID, skill ID, job description, education information ID, resume number, job experience ID, province ID, city ID, area ID, job area, detailed address, lower monthly salary limit, upper monthly salary limit, data status, and so on.
The above-mentioned key fields to be deduplicated relate to two data types: integer and string types.
For integer data, the numerical content can be directly compared based on the numerical comparison rule. For example, directly comparing values corresponding to job type IDs in different data records, or directly comparing values corresponding to skill level IDs in different data records, etc.
For the character strings corresponding to the fields of the short text unique identification types, whether the contents of the character strings are completely consistent or not can be compared based on a completely consistent comparison rule. For example, it may be compared whether the strings corresponding to the company ID fields in different data records are identical.
For character strings corresponding to fields of the common short text type, the string contents can be compared based on the affine gap penalty comparison rule. For example, the affine gap penalty comparison rule can be used to compare fields such as job title, company name and work address.
Gap penalties are a means of scoring an alignment of two or more sequences. Introducing gaps into a text sequence during alignment allows the algorithm to match more terms than a gap-free alignment would. The gap penalty adjusts the alignment score according to the number and length of the gaps. The affine gap penalty is an important and widely used type of gap penalty. It combines a constant gap penalty and a linear gap penalty and takes the form A + B·L, where A is called the gap opening penalty, B the gap extension penalty, and L the gap length. Gap opening is the cost of opening a gap of any length, while gap extension is the cost of extending an existing gap by one position. In general, it is not obvious what values A and B should take, because they depend on the purpose. Typically, if closely related matches are sought (e.g., removing vector sequences during genome sequencing), a higher gap penalty should be used to discourage gap opening; conversely, when more distant matches are of interest, the gap penalty should be lowered. The relationship between A and B also affects gap size: if gap size matters, a smaller A and a larger B (making it more expensive to lengthen a gap) are used, and vice versa. Only the ratio A/B matters, since multiplying both by the same positive constant k scales every penalty by k: kA + kB·L = k(A + B·L), which does not change the relative penalties between different alignments.
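A small sketch of the penalty formula A + B·L described above; the values chosen for A and B are illustrative only:

def affine_gap_penalty(gap_length, open_penalty, extend_penalty):
    # Affine gap penalty A + B*L: a fixed cost to open a gap plus a
    # linear cost for every position the gap is extended.
    if gap_length <= 0:
        return 0.0
    return open_penalty + extend_penalty * gap_length

# One gap of length 3 vs. three gaps of length 1: opening is charged per gap,
# so the single longer gap is cheaper when the opening penalty dominates.
A, B = 2.0, 0.5
print(affine_gap_penalty(3, A, B))       # 3.5
print(3 * affine_gap_penalty(1, A, B))   # 7.5

# Scaling both A and B by k scales every penalty by k, so only A/B matters
# for ranking alignments, as noted above.
k = 4
assert affine_gap_penalty(3, k * A, k * B) == k * affine_gap_penalty(3, A, B)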
For the character strings corresponding to the fields of the long text types, comparison can be performed based on cosine similarity comparison rules. For example, when comparing job description fields in different data records, the character strings corresponding to the job description fields may be compared based on the cosine similarity rule.
The principle of cosine similarity is as follows: the closer the angle between two vectors is to 0, the closer the cosine value is to 1, and the more similar the two vectors are. In other words, a larger similarity corresponds to a smaller distance, and a smaller similarity to a larger distance. When comparing long texts, the texts need to go through word segmentation, merging, feature value calculation, vectorization, and computation of the cosine of the angle between the resulting vectors, which is not described in detail in this embodiment.
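A minimal bag-of-words sketch of the cosine similarity comparison; whitespace tokenization stands in here for the word segmentation step, which a production system for Chinese text would replace with a proper segmenter:

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Bag-of-words cosine similarity between two long-text field values.
    vec_a, vec_b = Counter(text_a.split()), Counter(text_b.split())
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("develop and maintain Java services",
                        "maintain and develop Java services"))  # 1.0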
The comparison rules for the different field types are set in the algorithm model. For each data pair input into the algorithm model, the similarity between the key fields can be calculated according to the comparison rules for the different field types and the field values of the key fields to be compared. Then, using the current weight parameters of the algorithm model, the similarities between the key fields are weighted and summed to obtain the calculated similarity value of the data pair. Since each data pair is pre-labeled as a duplicate pair or a non-duplicate pair, its true similarity value is known. The prediction loss of the algorithm model can be determined from the difference between the true similarity value of the data pair and the similarity value calculated by the algorithm model, and the weight parameters of the algorithm model can be adjusted accordingly based on this loss. With the large number of duplicate and non-duplicate data pairs in the training data, the algorithm model can be trained iteratively to continuously optimize its weight parameters until the loss between the calculated and true similarity values converges into a set range, and the resulting model, namely the deduplication model, is output.
In some alternative embodiments, the deduplication model may be implemented as a logistic regression model. The weight parameters in the deduplication model may be the weight parameters corresponding to each field type, for example, a weight parameter for fields of text type, a weight parameter for fields of numeric type, and so on.
Logistic regression is a classification algorithm that can handle binary and multiclass classification. It fits the relationship between an input feature matrix X and an output feature vector Y composed of discrete points. When training the deduplication model, the similarity distances between the data records contained in an input data pair can be calculated based on the comparison rules for the different field types, and these distances form the input feature matrix X. If the input data pair is a duplicate pair, its data label (i.e., true value) may be defined as 1; if it is a non-duplicate pair, its label may be defined as 0. When the relationship between X and Y is fitted, the logistic regression output Y of duplicate data pairs is 1 and that of non-duplicate data pairs is 0. Given the known X and the target output Y, the computation can be iterated to determine the weight parameters of the logistic regression model.
When more key fields to be deduplicated are used for training the deduplication model and the amount of training data is larger, a regularization term can be added to the loss function used for training. The regularization term acts as a penalty term that constrains model training and prevents overfitting. Meanwhile, the training process can be evaluated with model evaluation metrics that include recall, and training of the deduplication model can be stopped when the recall reaches 95%.
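A minimal sketch of the training step described above, assuming each labeled pair has already been turned into a row of per-field similarities; the numbers and the use of scikit-learn are illustrative, not the claimed implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical feature matrix: one row per labeled pair, one column per key
# field, each entry the similarity of that field computed with the rule for
# its type (numeric / completely consistent / affine gap / cosine).
X = np.array([
    [1.0, 0.92, 0.88],   # duplicate pair
    [1.0, 0.95, 0.90],   # duplicate pair
    [0.0, 0.31, 0.12],   # non-duplicate pair
    [0.0, 0.18, 0.25],   # non-duplicate pair
])
y = np.array([1, 1, 0, 0])

# The L2 penalty plays the role of the regularization term mentioned above.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
model.fit(X, y)

pred = model.predict(X)
print("per-field weights:", model.coef_)   # learned weight per field
print("recall:", recall_score(y, pred))    # training can stop once recall is high enough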
In some alternative embodiments, the matching model may also be implemented as a logistic regression (LR) model. The training process of the matching model is similar to that of the deduplication model, and the comparison rules it uses to compare the key fields can be consistent with the comparison rules for the different field types used by the deduplication model. The matching model differs from the deduplication model in that its input data is the output data of the deduplication model, and the format of that output differs somewhat from the format of the deduplication model's input. Therefore, when preparing training data for the matching model, the training data can be formatted according to the output format of the deduplication model. Apart from this difference in input format, the training of the matching model follows the training process of the deduplication model and is not repeated in this embodiment.
The model training flow chart is shown in fig. 3.
Since duplicate data almost always share some common characteristics, groups of data sharing certain specific content can be found in the target data to be deduplicated and regarded as having a higher probability of being duplicates. By finding groups of data that share specific content, duplicate data can be located accurately with a certain degree of confidence. When predicting similarity with the deduplication model, only groups of data sharing specified content need to be compared, which greatly reduces the number of data groups to compare and the number of comparisons to perform.
Based on the above analysis, in some exemplary embodiments of the present application, to speed up comparison and processing during model training and prediction, before the target data is deduplicated by the deduplication model it is divided into multiple blocks, where the data records in each block share the same specific characteristics, and multiple groups of data pairs to be deduplicated are then obtained from the target data based on the divided blocks. An alternative embodiment of learning, from the training data, how to divide data into blocks such that the data in each block shares specific content is described in detail below.
Optionally, after the fitting of the deduplication model and the matching model is completed, a rule for partitioning the input data according to the predicate index may be further learned. The blocks obtained by division based on the block division rule can be called predicate blocks, each predicate block comprises a cluster of data records, and the features shared by the data records in each cluster are calculated based on a predicate function.
The following is an exemplary description in connection with the various data tables shown in fig. 2.
First, a certain amount of data records in the "job information table" can be read from the database as the target data to be deduplicated for the current batch; for example, twenty thousand data records from the job information table.
Next, a predicate index is built according to the predicate function. The predicate function is used for extracting specified features from the data records according to specified calculation rules. There are many calculation rules for predicate functions, for example, in one calculation rule for predicate function, the first three characters of a specified field in a data record can be extracted. In some embodiments, a greedy algorithm may be employed to determine a computation rule for a predicate function, which may involve a combination of multiple fields. When the combination of fields found based on the greedy algorithm can cover each repeated data pair in the training data, the number of data pairs to be predicted by the deduplication model in the subsequent prediction process is favorably minimized.
When the calculation rule of the predicate function involves a combination of multiple fields, one or more predicate features can be extracted from each data record. And the predicate characteristic corresponding to each data record can be used as a predicate index corresponding to the data record.
In the subsequent block division process, data records having the same characteristics may be divided into one block. For example, when the predicate function extracts the first three characters of a specified field from the data records, the data records whose specified field starts with the same three characters can be divided into one block, as described in detail below.
After one or more predicate indexes corresponding to each data record are obtained in the foregoing steps, the predicate indexes and the data IDs (i.e., the primary key IDs of the data) of the data records in the job information table may be stored in a table referred to as a "block mapping table". In the block mapping table, the same predicate index may correspond to multiple different data IDs, and the same data ID may correspond to multiple different predicate indexes. In this step, a joint index may be established over the predicate index and the data ID.
Next, aggregation processing may be performed on the data to be predicted in the job information table based on the correspondence between data IDs and predicate indexes recorded in the block mapping table. The aggregation processing divides the data to be predicted into blocks. Specifically, data IDs having at least one identical predicate index may be grouped together, and the values of the group of data IDs may be concatenated to obtain an ID block. The data records corresponding to the data IDs in one ID block are divided into one block.
Based on the above steps, the data IDs recorded in the block mapping table may be divided into multiple ID blocks, and the data IDs in each ID block share at least one identical predicate index. Next, for any ID block, the smallest predicate index among the identical predicate indexes shared by the ID block may be chosen as the block key of the block corresponding to that ID block. When determining the block corresponding to each ID block, an auto-increment field may be set as the block ID of the block. In this step, the block key determined for each block and the block ID set for each block may be written into a data table referred to as a "multi-key table". In the multi-key table, the block ID may be used as the primary key and an index may be built on the block key.
Based on the above steps, the block mapping table and the multi-key table may be joined using the correspondence between block IDs and block keys recorded in the multi-key table and the correspondence between data IDs and predicate indexes recorded in the block mapping table. That is, for any block ID in the multi-key table, the data IDs matching its block key are found from the block mapping table according to the block key corresponding to that block ID. Since the block key is derived from a predicate index, the data IDs corresponding to the predicate index identical to the block key can be found from the block mapping table and taken as the data IDs corresponding to that block ID, which yields the correspondence between block IDs and data IDs. In this step, the block IDs and data IDs may be written into a data table to obtain a multi-block table containing the correspondence between block IDs and data IDs, and a unique joint index may be established on the block ID and data ID.
Based on the above steps, data records with the same block ID have at least one identical predicate index and can therefore be regarded as having a higher similarity. In the subsequent data repeatability prediction, data IDs with the same block ID can be queried from the multi-block table, the data records corresponding to the queried data IDs can be taken as a group of data pairs to be deduplicated, and the data pairs are input into the deduplication model. In this way, data records with a high probability of being duplicates can be screened out preliminarily through data blocking and deduplicated first, which reduces the number of deduplication iterations and improves deduplication efficiency.
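A minimal in-memory sketch of this predicate blocking idea; the database tables above are replaced by dictionaries, and the field name and first-three-characters predicate are illustrative:

from collections import defaultdict
from itertools import combinations

def first_three_chars(record, field):
    # One possible predicate function: the first three characters of a field.
    return str(record.get(field, ""))[:3]

def build_blocks(records, field="job_title"):
    # predicate index -> list of data IDs sharing that predicate
    index = defaultdict(list)
    for data_id, rec in enumerate(records):
        index[first_three_chars(rec, field)].append(data_id)
    # only record pairs inside the same block become deduplication candidates
    candidate_pairs = set()
    for ids in index.values():
        candidate_pairs.update(combinations(ids, 2))
    return candidate_pairs

records = [
    {"job_title": "Java Engineer"},
    {"job_title": "Java Developer"},
    {"job_title": "Product Manager"},
]
print(build_blocks(records))   # {(0, 1)} -- only the two "Jav..." records are compared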
Further optionally, in some exemplary embodiments, data records with a higher probability of being duplicates may be further screened out based on the ordering relationships among the block IDs, as exemplified below.
Optionally, based on the multi-block table obtained in the foregoing embodiment, the multiple block IDs in the multi-block table may be sorted in ascending order, i.e., from small to large, to obtain an ascending sorting result. Next, for any one of the block IDs, at least one block ID smaller than it may be determined from the ascending sorting result as the small-value block IDs of that block ID. Optionally, in this embodiment, the ascending sorting result may be concatenated, that is, the ascending-sorted block IDs are joined with commas to form a sort ID. The sort ID may be written into a data table to obtain a block association table, and a unique index may be established on the sort ID.
When determining the small-value block IDs of each block ID, a small-value index table may be established based on the multi-block table. Optionally, the multi-block table and the block association table may be joined on the block ID, and the sort ID may be split at each block ID into two values: for any block ID, splitting the sort ID yields the value before that block ID and the value after it. The value before the block ID is then selected, and after removing the commas used to join the block IDs, one or more block IDs smaller than the current block ID are obtained; these are described in this embodiment as the small-value block IDs (hereinafter abbreviated as blockids).
For example, if the current block ID is 1234 and the sort ID is "1111,1234,2118,7210", splitting the sort ID at "1234" produces the two values "1111" and "2118,7210"; removing the commas from the first value gives "1111" as the blockids. If the current block ID is 2118, the resulting blockids are 1111 and 1234.
Based on the above steps, the blockids of each block ID can be obtained, and each block ID, the data IDs corresponding to it, and its blockids can be written into a data table referred to as the small-value index table.
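A minimal sketch of the small-value block ID (blockids) computation, performed in memory instead of through the sort ID string manipulation described above:

def small_value_block_ids(block_ids):
    # For each block ID, collect every block ID that precedes it in the
    # ascending ordering, mirroring the 'split the sort ID' step above.
    ordered = sorted(block_ids)
    return {bid: ordered[:i] for i, bid in enumerate(ordered)}

print(small_value_block_ids([2118, 1234, 7210, 1111]))
# {1111: [], 1234: [1111], 2118: [1111, 1234], 7210: [1111, 1234, 2118]}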
In the present embodiment, data records whose blockids overlap can be regarded as having a higher similarity. In the subsequent data repeatability prediction, data IDs sharing at least one identical blockid can be queried from the small-value index table, the data records corresponding to the queried data IDs can be taken as a group of data pairs to be deduplicated, and the data pairs are input into the deduplication model. In this way, data records with a high probability of being duplicates can be screened out preliminarily through data blocking and deduplicated first, which reduces the number of deduplication iterations and improves deduplication efficiency.
It should be noted that data without the same block ID and without partially overlapping blockids can be defined as non-duplicate data and is not input into the deduplication model again, so as to reduce the number of comparisons performed by the deduplication model.
Through the above training and learning processes, the comparison rules for the key fields to be deduplicated, the deduplication model, the matching model and the predicate index blocking rule can be determined. The comparison rules for the key fields to be deduplicated, the deduplication model, the matching model and the blocking rule can be saved as model files for subsequent deduplication operations.
It should be noted that the solutions provided in the above and following embodiments of the present application can be implemented based on multiple threads, so as to increase the computation speed.
Fig. 4 is a schematic flowchart of a data deduplication method according to an exemplary embodiment of the present application. As shown in fig. 4, the data deduplication method includes:
step 401, target data to be deduplicated is obtained.
Step 402, acquiring a plurality of groups of data pairs to be deduplicated from the target data, wherein each group of data pairs to be deduplicated comprises a plurality of data records.
Step 403, respectively inputting the multiple groups of data pairs to be deduplicated into a pre-trained deduplication model, and obtaining respective deduplication results of the multiple groups of data pairs to be deduplicated output by the deduplication model; the de-duplication model is used for: comparing field values of key fields to be deduplicated in the input data pairs to be deduplicated according to preset comparison rules corresponding to different field types to determine the similarity of the data pairs to be deduplicated and obtain the deduplication result of the data pairs to be deduplicated.
Step 404, determining a first deduplication result of the target data according to the deduplication results of the multiple groups of data to be deduplicated.
In this embodiment, the target data to be deduplicated is composed of data records, and each of the multiple groups of data pairs to be deduplicated obtained from the target data may include multiple data records; for example, a group of data pairs may contain two, three or more data records. In this embodiment, a group of data pairs is the minimum comparison unit, and the similarity of the multiple data records in a data pair is compared by the deduplication model.
Wherein, the key field to be deduplicated refers to the field to be deduplicated and compared. The key field to be deduplicated can comprise a plurality of fields, and can be set by a user in a self-defining way according to a service scene.
The comparison rules corresponding to different field types include at least one of the following: a numerical comparison rule corresponding to fields of the numeric type; an affine gap penalty comparison rule corresponding to fields of the common short text type; a completely consistent comparison rule corresponding to fields of the short-text unique identifier type; and a cosine similarity comparison rule corresponding to fields of the long text type. Reference may be made to the description of the foregoing embodiments, which is not repeated here.
The deduplication model is obtained by pre-training, and the specific training method may refer to the description of the foregoing embodiment, which is not described herein again. The duplication elimination model is used for carrying out similarity calculation on each input group of data pairs and outputting duplication elimination results of the group of data pairs according to similarity calculation results. The deduplication logic of the deduplication model will be exemplified below by taking an arbitrary set of data pairs as an example.
For any group of data pairs to be deduplicated obtained from the target data, the data pair is input into the deduplication model, and the field values of the key fields to be deduplicated in the data pair are determined in the deduplication model according to the deduplication logic learned in advance. Then, the field values of the key fields to be deduplicated in the data pair are compared according to the preset comparison rules corresponding to the different field types to obtain the comparison result of each key field to be deduplicated. Finally, the comparison results of the key fields to be deduplicated are weighted, using the weight parameters for the different field types learned in advance by the deduplication model, to obtain the similarity of the data pair to be deduplicated.
For example, the data pair R to be deduplicated contains data record R1 and data record R2, and the key fields to be deduplicated are field L1, field L2, and field L3. The deduplication model may determine a field value LR11 of the field L1, a field value LR12 of the field L2, and a field value LR13 of the field L3 from the data record R1, and determine a field value LR21 of the field L1, a field value LR22 of the field L2, and a field value LR23 of the field L3 from the data record R2 when deduplication is performed on the data pair R.
Next, the deduplication model may calculate, based on the comparison rules for the different field types, the similarity between field values LR11 and LR21 to obtain the similarity S(L1) of field L1, the similarity between field values LR12 and LR22 to obtain S(L2), and the similarity between field values LR13 and LR23 to obtain S(L3). The similarities of the individual fields obtained above may then be weighted to obtain the similarity of data record R1 and data record R2, that is:
S(R1, R2) = w1·S(L1) + w2·S(L2) + w3·S(L3)
In this process, when the deduplication model compares the field values of the key fields to be deduplicated according to the comparison rules corresponding to the different field types: if the key fields to be deduplicated contain a field of the numeric type, the field values of that key field are compared according to the numerical comparison rule; if they contain a field of the common short text type, the field values are compared according to the affine gap penalty comparison rule; if they contain a field of the short-text unique identifier type, the field values are compared according to the completely consistent comparison rule; and if they contain a field of the long text type, the field values are compared according to the cosine similarity comparison rule. This is not described further.
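A minimal sketch of this per-field-type dispatch and the weighted combination S(R1, R2); the field names, field types and weights are illustrative, and difflib.SequenceMatcher merely stands in for the affine gap penalty and cosine rules described earlier:

from difflib import SequenceMatcher

def field_similarity(field_type, value_a, value_b):
    # Dispatch to a comparison rule by field type.
    if field_type == "numeric":
        return 1.0 if value_a == value_b else 0.0
    if field_type == "short_text_unique_id":
        return 1.0 if str(value_a) == str(value_b) else 0.0
    if field_type in ("short_text", "long_text"):
        # stand-in for the affine gap penalty / cosine similarity rules
        return SequenceMatcher(None, str(value_a), str(value_b)).ratio()
    raise ValueError(f"unknown field type: {field_type}")

def pair_similarity(record_a, record_b, key_fields, weights):
    # Weighted sum over the key fields: S(R1, R2) = w1*S(L1) + w2*S(L2) + ...
    return sum(
        weights[name] * field_similarity(ftype, record_a[name], record_b[name])
        for name, ftype in key_fields.items()
    )

key_fields = {"job_type_id": "numeric", "company_id": "short_text_unique_id",
              "job_title": "short_text"}
weights = {"job_type_id": 0.2, "company_id": 0.5, "job_title": 0.3}
r1 = {"job_type_id": 3, "company_id": "C001", "job_title": "Java Engineer"}
r2 = {"job_type_id": 3, "company_id": "C001", "job_title": "Java Developer"}
print(pair_similarity(r1, r2, key_fields, weights))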
In this embodiment, when the target data to be deduplicated is processed, data pairs to be deduplicated are obtained from the target data and input into the pre-trained deduplication model. In the deduplication model, the field values of the key fields to be deduplicated in each input data pair are compared according to the preset comparison rules for different field types, and the similarity of the data pair is calculated from the comparison results. In this way, fields of different types can be deduplicated effectively, and a better data deduplication effect is achieved.
When the deduplication method provided by the above embodiment is used to deduplicate large-scale data, a large number of data pairs to be deduplicated need to be constructed, and the deduplication model must perform a large amount of computation. Taking a data pair to be deduplicated that contains two data records as an example, during deduplication prediction the deduplication model needs to calculate the similarity between the key fields to be deduplicated according to the comparison rules corresponding to the different field types and the field values of the key fields of the two data records. After the similarity of the key fields to be deduplicated is calculated, the similarity of the two data records in the data pair needs to be further calculated. When data is deduplicated in batches, the deduplication process must compare the key fields to be deduplicated of the data pairs one by one within the target data of the current batch and predict the similarity of those data pairs within a reasonable time. Therefore, in some alternative embodiments, the data to be deduplicated may be divided into blocks based on the block division rules described in the foregoing embodiments, and indexes may be added to the blocks, so as to improve the prediction speed of the deduplication model.
In the above embodiments, an embodiment in which a rule for partitioning data to be predicted is learned in advance is described. In the foregoing embodiment, the learned data blocking rule is stored in the model file, and before the target data is input into the deduplication model, the target data to be deduplicated may be block-divided based on the data blocking rule stored in the model file, so as to improve deduplication efficiency.
Based on the data blocking rule, in some exemplary embodiments, when acquiring multiple sets of data pairs to be deduplicated from target data, block division may be performed on the target data according to the data blocking rule to obtain multiple blocks, where data records in each block have the same specific characteristics; next, block IDs may be set for the blocks, and respective corresponding block IDs of data records included in the target data may be determined.
Optionally, in the foregoing process, a predicate function may be adopted to extract a respective predicate index of each data record included in the target data; then, the data records with at least one same predicate index are divided into the same block, and a plurality of blocks contained in the target data are obtained. Then, setting block IDs for the blocks, and establishing a corresponding relation between the data ID of each data record in the target data and the block ID according to the respective data ID of the data record contained in the target data and the block ID corresponding to the data record contained in the target data. The above process can be implemented by referring to the specific implementation manner of creating the block mapping table, the multiple building tables, and the multiple block tables according to the job information table described in the foregoing embodiment, which is not described in detail in this embodiment.
Based on the corresponding relationship between the data ID and the block ID, when the data pair to be deduplicated is selected, the data records with the same block ID can be selected from the target data as a group of data pairs to be deduplicated.
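A hedged sketch of the predicate-index blocking and same-block pair selection described above follows; the predicate functions (first token of a title field, three-character prefix of a company field) and the record fields are illustrative assumptions rather than the specific predicates of this application.

```python
from collections import defaultdict
from itertools import combinations

PREDICATES = [
    lambda rec: "tok:" + rec["title"].split()[0].lower(),   # first word of a field
    lambda rec: "pre:" + rec["company"][:3].lower(),        # 3-character prefix of a field
]

def assign_block_ids(records):
    """Group records sharing at least one predicate index into the same block."""
    index = defaultdict(list)                 # predicate index -> data IDs
    for rec in records:
        for pred in PREDICATES:
            index[pred(rec)].append(rec["id"])
    block_ids = {}                            # predicate index (block key) -> block ID
    record_blocks = defaultdict(set)          # data ID -> set of block IDs
    for key, ids in index.items():
        bid = block_ids.setdefault(key, len(block_ids))
        for rid in ids:
            record_blocks[rid].add(bid)
    return record_blocks

def same_block_pairs(record_blocks):
    """Yield candidate data pairs whose block IDs coincide (same-block rule)."""
    by_block = defaultdict(list)
    for rid, bids in record_blocks.items():
        for bid in bids:
            by_block[bid].append(rid)
    seen = set()
    for ids in by_block.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) not in seen:
                seen.add((a, b))
                yield a, b
```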
Further optionally, the multiple block IDs obtained by partitioning may be sorted in ascending order to obtain an ascending sorting result. For any one of the block IDs, at least one block ID smaller than the block ID may be determined from the ascending order result as a small-valued block ID of the block ID. The above process can be implemented by referring to the specific implementation manner of creating the block association table and the small-value index table described in the foregoing embodiment, and details are not repeated in this embodiment.
After a small-value index table is created based on target data, when a data pair to be deduplicated is selected, small-value block IDs of block IDs corresponding to a plurality of data records in the target data can be determined and used as the small-value block IDs corresponding to the plurality of data records; and determining the data records with the small-value block IDs partially overlapped from the target data as a group of data records to be deduplicated.
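The small-value block ID index and the partial-overlap pair selection can be sketched as follows; the record_blocks input ({data ID: set of block IDs}) is assumed to come from a blocking step such as the sketch above, and the helper names are illustrative.

```python
def small_value_index(all_block_ids):
    """For each block ID, the set of smaller block IDs from the ascending order."""
    ordered = sorted(set(all_block_ids))
    return {bid: set(ordered[:i]) for i, bid in enumerate(ordered)}

def record_small_values(record_blocks, sv_index):
    # Union of the small-value block IDs over every block a record belongs to.
    return {rid: set().union(*(sv_index[b] for b in bids)) if bids else set()
            for rid, bids in record_blocks.items()}

def partial_overlap_pairs(small_values):
    """Yield record pairs whose small-value block IDs partially overlap."""
    rids = sorted(small_values)
    for i, a in enumerate(rids):
        for b in rids[i + 1:]:
            if small_values[a] & small_values[b]:   # small-value IDs partially overlap
                yield a, b
```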
In the above embodiment, during deduplication prediction, data records with the same block ID are considered to have higher similarity, so any two pieces of data with the same block ID may be used as a data pair to be deduplicated and input together into the deduplication model. Meanwhile, data records whose block IDs partially overlap are considered to have a certain similarity, so any two pieces of data with partially overlapping block IDs may also be used as a data pair to be deduplicated and input together into the deduplication model. In this way, blind iteration can be avoided and the number of iterations can be effectively reduced. Other data records without the same-block-ID feature or the partially-overlapping-block-ID feature may be regarded as non-repeated data and are not input into the deduplication model.
For the deduplication model, if the predicted degree of repetition of a group of data pairs reaches a specified threshold, the data records included in the group are considered repeated. The specified threshold may be set as required, for example to 0.9, 0.95, or 0.98; the larger the specified threshold, the stricter the verification of duplicated data. Details are not repeated here.
The above embodiments describe an implementation of performing internal deduplication on target data to be deduplicated based on the deduplication model. In some cases, when the amount of data to be deduplicated is large, deduplication may be performed in batches. For the first batch of target data to be deduplicated, internal deduplication can be performed based on the deduplication model described in the foregoing embodiments. For target data of subsequent batches, internal deduplication can be performed first based on the deduplication model, and then the internal deduplication result is matched, based on a matching model, against the already deduplicated data of the preceding batches, so that deduplication across batches is achieved. This is exemplified below.
Fig. 5 is a schematic flowchart of a data deduplication method according to another exemplary embodiment of the present application, and as shown in fig. 5, the data deduplication method includes:
and step 501, acquiring target data to be deduplicated.
Step 502, acquiring a plurality of groups of data pairs to be deduplicated from the target data, wherein each group of data pairs to be deduplicated comprises a plurality of data records.
Step 503, respectively inputting the multiple groups of data pairs to be deduplicated into a pre-trained deduplication model, and obtaining respective deduplication results of the multiple groups of data pairs to be deduplicated output by the deduplication model; the de-duplication model is used for: comparing field values of key fields to be deduplicated in the input data pairs to be deduplicated according to preset comparison rules corresponding to different field types to determine the similarity of the data pairs to be deduplicated.
Step 504, determining a first duplicate removal result of the target data according to the respective duplicate removal results of the multiple groups of data to be deduplicated.
Step 505, judging whether the target data is the first batch of deduplication data; if so, the deduplication operation is finished; if not, go to step 506.
Step 506, determining the first deduplication result as incremental deduplication data, and determining inventory deduplication data.
And 507, extracting data records from the first duplicate removal result and the stock duplicate removal data respectively to obtain a plurality of groups of data pairs to be matched.
Step 508, inputting the multiple groups of data pairs to be matched into pre-trained matching models respectively, and obtaining respective matching results of the multiple groups of data pairs to be matched output by the matching models; the matching model is used for: and extracting a set field value of a second key field from the input data pair to be matched, and comparing the extracted field value of the second key field according to preset comparison rules corresponding to different field types to determine the similarity of the data pair to be matched.
Step 509, determining a second deduplication result of the target data according to the respective matching results of the multiple groups of data pairs to be matched.
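A high-level sketch of the flow of steps 501-509 is given below; every callable is passed in as a placeholder for the corresponding component (pair construction, deduplication model, matching model), so none of these names are APIs defined by this application.

```python
def deduplicate_batch(target_data, stock_data, build_pairs, dedup_predict,
                      build_cross_pairs, match_predict, merge_results,
                      is_first_batch):
    pairs = build_pairs(target_data)              # step 502: candidate data pairs
    first_result = dedup_predict(pairs)           # steps 503-504: first dedup result
    if is_first_batch:                            # step 505: first batch stops here
        return first_result
    cross_pairs = build_cross_pairs(first_result, stock_data)   # steps 506-507
    matches = match_predict(cross_pairs)          # step 508: matching model
    return merge_results(first_result, matches)   # step 509: second dedup result
```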
For alternative implementation of steps 501 to 503, reference may be made to the description of the foregoing embodiments, which are not described herein again.
In step 504, the first deduplication result of the target data marks the duplicate data records and the non-duplicate data records (i.e., unique data records) contained in the target data. Optionally, to facilitate distinguishing the duplicate data records from the non-duplicate data records predicted by the deduplication model, the deduplication model may set a unique identification ID and a data status flag for each data record. The same unique identification ID can be set for a group of repeated data records, while each non-repeated data record is given its own unique identification ID. The data status flag is used to identify whether the data record is repeated data.
In step 505, it may be determined whether the target data is the first batch of deduplication data; if so, the deduplication of the target data is complete and the first deduplication result is not input into the matching model. If the target data is not the first batch of deduplication data, the next step 506 may be performed: the first deduplication result is determined as incremental deduplication data, and the stock deduplicated data is determined. The stock deduplicated data is the data obtained after deduplication of the preceding batches, and each data record in it also carries a data status flag.
In step 507, optionally, the data pair to be matched includes data records from the first deduplication result and also includes data records from stock deduplication data, and the purpose of the matching is to determine, from the target data, data that is duplicated with the stock deduplication data to further optimize the deduplication result.
Optionally, in step 507, extracting data records from the first deduplication result and the stock deduplication data to obtain multiple sets of data pairs to be matched, which may be implemented by segmenting the stock deduplication data using multiple threads. As will be exemplified below.
Optionally, each thread may select a plurality of data blocks having the same data size as the first deduplication result from the inventory deduplication data; then, the plurality of data blocks are respectively subjected to block division to obtain a plurality of blocks contained in each of the plurality of data blocks. For an alternative implementation of performing block division on a data block to obtain a plurality of blocks included in the data block, reference may be made to the descriptions in the foregoing embodiments, which are not repeated herein. After each data block is subjected to block division, block IDs can be respectively set for the blocks obtained by division to obtain a plurality of block IDs, and further, the block ID corresponding to the data record in each data block can be determined.
For any data block in the multiple data blocks, selecting the ith data record from the first duplicate removal result, and selecting one data record with the same block ID as the ith data record from the data block to form an ith group of data pairs to be matched; where i =1,2,3 … n, where n represents the total number of data records contained by the first deduplication result. When the first deduplication result contains n data records, the data block partitioned by each thread also contains n data records. The above operation may be implemented based on a plurality of threads, each of which divides a data block having the same data size as the first deduplication result from the inventory deduplication data, i.e., each of which may divide a data block. In the subsequent matching process, each thread can select a data record from the first duplicate removal result, and select a data record from the data blocks divided from the stock duplicate removal data to form a data pair to be matched.
This will be further explained below with reference to a specific example.
Assume that the target data to be deduplicated of the current batch is A; after deduplication by the deduplication model, the data records in A carry their respective data status flags. Assume that the stock deduplicated data is B, and the data records in B also carry their respective data status flags. Next, data matching is performed between A and B using multiple threads.

Specifically, the plurality of threads may select, from B and in sequence, data blocks of the same data size as A. For example, assuming that the data IDs of the records in B range from 1 to 200,000, each thread may select, in data-ID order, a data block equal in size to A, until all 200,000 records have been selected. For convenience of description, the data block selected from B for the jth time is denoted Bj below, where j is a positive integer, j = 1, 2, …, B/A.
Then, block division is performed on A and Bj; the specific block division manner can refer to the description of the foregoing embodiments and is not repeated here. After Bj is divided into blocks, block IDs may be set for the divided blocks to determine the block ID corresponding to each data record in Bj. Next, the ith data record is selected from A, and one data record with the same block ID as the ith data record is selected from Bj; the two records together form the ith group of data pairs to be matched, where i = 1, 2, 3 … n and n represents the total number of data records contained in A. Each such data pair to be matched can then be input into the pre-trained matching model, which predicts its similarity.
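Under assumed record fields and a placeholder blocking key, the following sketch shows how each thread could pair the records of A with same-block records from its data block Bj; the predict callable stands in for the matching model and is passed in as an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def block_id_of(record):
    # Placeholder blocking key; the real block division follows the
    # predicate-index rules of the foregoing embodiments.
    return record["title"][:1].lower()

def pairs_for_chunk(a_records, b_chunk):
    """Pair each record of A with one same-block record from the chunk Bj."""
    by_block = {}
    for rec in b_chunk:
        by_block.setdefault(block_id_of(rec), []).append(rec)
    pairs = []
    for a_rec in a_records:                      # i = 1 .. n
        candidates = by_block.get(block_id_of(a_rec), [])
        if candidates:
            pairs.append((a_rec, candidates[0]))
    return pairs

def match_pairs(a_records, b_records, predict):
    """Split B into A-sized chunks, one per worker, and score the data pairs."""
    n = max(len(a_records), 1)
    chunks = [b_records[k:k + n] for k in range(0, len(b_records), n)]
    with ThreadPoolExecutor() as pool:
        pair_lists = pool.map(lambda c: pairs_for_chunk(a_records, c), chunks)
    return [(a, b, predict(a, b)) for pairs in pair_lists for a, b in pairs]
```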
In some exemplary embodiments, when the second deduplication result of the target data is determined according to the matching result of each of the multiple sets of data pairs to be matched, the data records included in each of the multiple sets of data pairs to be matched may be divided into multiple data sets according to the matching result of each of the multiple sets of data pairs to be matched. The data records included in the data pairs to be matched, the matching results of which are not repeated, are respectively divided into different data groups, and the data records included in the data pairs to be matched, the matching results of which are repeated, are divided into the same data groups.
That is, after all the data pairs to be matched have been predicted, each non-repeated data record in the data pairs forms a group by itself, that is, a unique data record constitutes a data group on its own and has its own unique identification ID, while the repeated data records in a data pair are placed in the same group and share a unique identification ID. In this way, a plurality of data groups formed from the data records in A and the data records in B can be obtained.
Next, a repeatability determination can be made for the plurality of data sets to determine duplicate data sets and non-duplicate data sets from the plurality of data sets.
Alternatively, taking a first data group and a second data group of the plurality of data groups as an example, if there is a data record in the first data group that is duplicated with a data record in the second data group, the first data group and the second data group may be determined to be duplicated data groups. The first data group may be any one of the data groups. As will be exemplified below.
In connection with the above example, after prediction by the matching model, A and B contain data groups formed by multiple groups of duplicate data records and data groups formed by non-duplicate data records.
Suppose that A contains a certain data group a with x data records, where 1 ≤ x ≤ N and N is the total number of data records in A; and that B contains a certain data group b with y data records, where 1 ≤ y ≤ M and M is the total number of data records in B. If a certain piece of data in b is duplicated with a certain piece of data in a, then b can be considered duplicated with all the data in group a. The above judgment process can be broken down in detail into the following cases:
(1) If neither the data records in group a nor those in group b are repeated, groups a and b each contain only one data record. If these two records match as duplicates, groups a and b are considered duplicated.

(2) If the data record in group a is not repeated and group b contains multiple repeated data records, then if the data record in group a is duplicated with a data record in group b, all data records in group b can be considered duplicated with the data record in group a.

(3) If group a contains multiple repeated data records and the data record in group b is not repeated, then if a data record in group a matches the data record in group b as a duplicate, group b can be considered duplicated with the data records of group a.

(4) If group a contains multiple repeated data records and group b contains multiple repeated data records, then if a certain data record in group a and a certain data record in group b match as duplicates, all data in groups a and b can be considered duplicated.
Based on the above embodiments, a data set having a repetitive relationship can be found. Next, data sets having a repeating relationship may be merged. The description will be continued with reference to the first data group and the second data group.
For the first data group, if there is a second data group among the plurality of data groups that is duplicated with the first data group, the data records in the first data group may be allocated to the second data group, so that the first data group and the second data group are merged.

If a plurality of second data groups among the plurality of data groups are duplicated with the first data group, the centroid distances between the data records in the first data group and each of the second data groups can be calculated respectively, and the data records in the first data group are allocated to the second data group with the smallest centroid distance.
That is, when a data record matches as a duplicate with data records in multiple data groups, the data group whose centroid is closest to the data record may be determined from those data groups, and the data record is assigned to that group, i.e., clustered hierarchically by centroid linkage. In other words, the data record is placed into the duplicated data group whose centroid is closest to it. When grouping is completed, the unique identification ID of the data record can be modified to the unique identification ID shared by the data records in the data group to which it has been assigned, and a data status flag is set for the data record.
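A minimal sketch of the centroid-linkage assignment follows, assuming each record can be embedded as a numeric vector; vectorize() is a placeholder assumption rather than part of the described method.

```python
import math

def vectorize(record):
    # Placeholder embedding; in practice the per-field similarities or other
    # numeric features of the record would be used here.
    return [float(record.get("salary", 0)), float(len(record.get("title", "")))]

def centroid(group):
    vecs = [vectorize(r) for r in group]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def assign_to_closest_group(record, duplicate_groups):
    """Append the record to the duplicated group whose centroid is nearest."""
    v = vectorize(record)
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, c)))
    best = min(range(len(duplicate_groups)),
               key=lambda i: dist(centroid(duplicate_groups[i])))
    duplicate_groups[best].append(record)   # record adopts that group's shared ID
    return best
```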
In the matching process, if a data record that was originally non-repeated in B still has no duplicate found after being matched against A, the data record can be regarded as non-repeated data, and its data status flag is set to the unique-data flag, indicating that it belongs to non-repeated data and has its own unique identification ID field.
In this embodiment, large-scale data to be deduplicated may be divided into multiple deduplication batches; each batch is internally deduplicated based on the deduplication model, and subsequent batches are deduplicated incrementally and then matched against the already deduplicated data. This batch-wise, deduplicate-then-match approach can effectively complete the deduplication of large-scale data, makes the deduplication prediction more accurate, and greatly improves data deduplication efficiency.
In some exemplary embodiments, to facilitate distinguishing between a first deduplication result output by the deduplication model and a second deduplication result output by the matching model, temporary data status flags may be added in the first deduplication result for predicted duplicate data records and non-duplicate data records. The temporary data state flag is used to identify whether the data record is duplicate data after being predicted by the deduplication model. Wherein the temporary data status flag of the repeated data record can be described as a first temporary flag, and the temporary data status flag of the non-repeated data record can be described as a second temporary flag.
For example, for a data record, if the deduplication result output by the deduplication model indicates that the data record is the duplicate data, a temporary duplicate data flag may be added to the data record; if the deduplication result output by the deduplication model indicates that the data record is non-duplicate data, a temporary non-duplicate flag may be added to the data.
After the matching operation of the matching model, the first deduplication result is deduplicated again to obtain the second deduplication result. At this time, final data status flags may be added in the second deduplication result for the predicted duplicate data records and non-duplicate data records. The final data status flag identifies whether a data record is duplicate data after being re-predicted by the matching model. The final data status flag of a repeated data record may be described as a first flag, and the final data status flag of a non-repeated data record may be described as a second flag.
For example, in the first deduplication result, the data state flag of a certain data record is marked as a second temporary flag, and if it is determined that the data record is duplicated with a certain data record in the stock deduplication data after the matching prediction of the matching model, the data state flag of the data record may be modified to be the first flag.
After deduplication is completed, the unique identification ID and the final data status flag of each data record are obtained. In some embodiments, a unique data state table may be created as shown in FIG. 2, into which the unique identification ID and the final data status flag of each data record are written. When the data record corresponding to a unique identification ID is unique data, the status field of that record is set to 0; when the unique identification ID corresponds to multiple repeated data records, its status field may be set to 1, and the data IDs of the repeated data records may be stored as an array in the corresponding field.
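A hedged sketch of writing the unique data state table is shown below using sqlite3; the table and column names are assumptions for illustration and may differ from the schema of FIG. 2.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE unique_data_state (
    unique_id TEXT PRIMARY KEY,
    status    INTEGER,          -- 0: unique record, 1: duplicates exist
    same_ids  TEXT              -- JSON array of data IDs sharing this unique_id
)""")

def write_state(unique_id, data_ids):
    status = 1 if len(data_ids) > 1 else 0
    same_ids = json.dumps(data_ids) if status else None
    conn.execute("INSERT OR REPLACE INTO unique_data_state VALUES (?, ?, ?)",
                 (unique_id, status, same_ids))

write_state("u-001", ["r1"])               # unique record -> status 0
write_state("u-002", ["r2", "r7", "r9"])   # duplicated group -> status 1 with ID array
conn.commit()
```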
It should be noted that, in some alternative embodiments, because the model prediction result has a certain deviation, while deduplication is performed, manual intervention may be performed to label data, and the model is continuously iterated by using the manual labeling data.
A group of duplicate data may be labeled as: all identical, all different, partially identical, and so on, which can be reflected by a manually labeled status field. The status field has four possible values, 0, 1, 2, and 3: 0 indicates no duplicate data, with a unique record; 1 indicates that duplicate data exists and has not been manually reviewed; 2 indicates that duplicate data exists but failed manual review; 3 indicates that duplicate data exists and manual fusion succeeded. The initial status field of the data is 0 or 1, and it can be obtained from the "unique data state table" described in the foregoing embodiment.
During manual labeling, the status field in the "unique data state table" can be modified. For example, if, after manual review, all data in a group of duplicate data are considered completely duplicated, the status field of the manually reviewed group may be modified to 3, indicating that fusion is completed; the ID of one piece of data is selected as the "selected job ID" field, and the ID list of the group of duplicate data is stored in the "same job ID" field. If, after manual review, only part of the data in a group is considered duplicated, the status field of the group can be set to 2, indicating that fusion did not pass; the "selected job ID" field is left unchanged, and the "same job ID" field stores the data ID list of the duplicated part within the group.
Where the "selected job ID" is a field in the "unique data state table," it may be referred to as the "selected _ ID" in the database. And the unique data state table is used for facilitating the audit of the audit personnel on the repeated data judged by the model, and can be usually displayed in the background of the audit personnel. In the unique data state table, a selection button can be displayed in front of each data record, when a group of data judged to be repeated by the model is displayed in the background, if the auditor judges that the result of the deduplication result output by the model on the group of data is accurate, namely the group of data is all repeated, the auditor can click the selection button in front of one of the data, and submit the 'all-repeated' button to complete the selection operation aiming at the repeated all-data. Here, the ID of the clicked data is referred to as "job ID". Furthermore, audit judgment of auditors can be facilitated, and the marking speed is increased. The data after manual examination can be used as iteration data to iterate a training model.
Optionally, the non-duplicated data in state 0 and the manually reviewed data groups in states 2 and 3 may further be used to generate new training data for iterating the deduplication model and the matching model. A joint query is performed on the "unique data state table" and the "job information table" to determine the content of each group of duplicate data in the "job information table" together with the manually labeled grouping information in the "unique data state table". The data in each duplicated group are combined in pairs to form a data set of matched data pairs; duplicated data are combined with non-duplicated data, and non-duplicated data are combined with each other in pairs, to form data sets of non-matching data pairs. The data pairs and their labels are assembled into JSON format to expand the data set, train the models, and complete the model iteration.
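The expansion of training data from reviewed groups can be sketched as follows; the JSON structure and field names are assumptions, with duplicated groups yielding positive pairs and cross-group combinations yielding negative pairs.

```python
import json
from itertools import combinations, product

def build_training_pairs(duplicate_groups, unique_records):
    samples = []
    for group in duplicate_groups:                    # reviewed duplicate groups -> positives
        for r1, r2 in combinations(group, 2):
            samples.append({"record_1": r1, "record_2": r2, "label": 1})
    for group in duplicate_groups:                    # duplicate vs. unique -> negatives
        for r1, r2 in product(group, unique_records):
            samples.append({"record_1": r1, "record_2": r2, "label": 0})
    for r1, r2 in combinations(unique_records, 2):    # unique vs. unique -> negatives
        samples.append({"record_1": r1, "record_2": r2, "label": 0})
    return json.dumps(samples, ensure_ascii=False)
```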
The data deduplication method provided by the embodiment of the present application will be further illustrated in conjunction with fig. 6. As shown in fig. 6, the deduplication method may include the following steps:
s1, constructing a database table, determining key fields needing to be subjected to duplicate removal comparison as key fields to be subjected to duplicate removal comparison, and setting filed for the key fields to be subjected to duplicate removal comparison. Setting filed refers to specifying field names of the fields of the critical data to be deduplicated, defining comparison types of the fields, determining whether the fields can be deleted and the like.
And S2, setting comparison rules according to the service scenes. When the service scenes are different, different comparison rules can be set. For example, in a recruitment service scenario, when the job data is deduplicated, a string comparison rule and a numerical comparison rule may be set.
And S3, labeling the data in the database table to divide it into repeated data pairs and non-repeated data pairs, forming labeled training data.
And S4, performing de-duplication model training and matching model training based on the labeled training data.
And S5, establishing a database duplicate removal intermediate table so as to facilitate processing and comparison in the duplicate removal process.
S6, reading target data from the database. If the target data is the first batch of data to be deduplicated, data deduplication prediction is performed based only on the deduplication model; when the prediction result is output, the same identification mark is set for repeated data and their data status is marked as duplicated, while different identification marks are set for non-repeated data. If the target data is incremental data, internal deduplication is performed first, then the matching model is used to compare the result with the already deduplicated data, and the duplicate identification and data status are marked.
And S7, manually labeling the deduplicated data, processing to obtain repeated data pairs and non-repeated data pairs, and iterating the model.
And S8, if the data deduplication is not finished, repeating the steps S6 and S7.
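An orchestration sketch of steps S6-S8 follows; read_batch, dedup, match, manual_label and retrain are assumed callables standing for the components described above, not APIs defined by this application.

```python
def run_deduplication(read_batch, dedup, match, manual_label, retrain):
    stock = []                                 # already deduplicated data
    first_batch = True
    while True:
        batch = read_batch()                   # S6: read target data
        if batch is None:
            break
        result = dedup(batch)                  # internal deduplication
        if not first_batch:
            result = match(result, stock)      # compare with deduplicated data
        labeled_pairs = manual_label(result)   # S7: manual labeling
        retrain(labeled_pairs)                 # iterate the models
        stock.extend(result)
        first_batch = False                    # S8: repeat S6 and S7
    return stock
```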
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of step 201 to step 204 may be device a; for another example, the execution subject of steps 201 and 202 may be device a, and the execution subject of step 203 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 201, 202, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application, and as shown in fig. 7, the electronic device includes: memory 701, processor 702, and communications component 703.
The memory 701 is used for storing a computer program and may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 701 may be implemented by any type or combination of volatile and non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 702, coupled to the memory 701, for executing the computer program in the memory 701 for: acquiring target data to be deduplicated; acquiring a plurality of groups of data pairs to be deduplicated from the target data, wherein each group of data pairs to be deduplicated comprises a plurality of data records; respectively inputting the multiple groups of data pairs to be deduplicated into a pre-trained deduplication model, and acquiring respective deduplication results of the multiple groups of data pairs to be deduplicated output by the deduplication model; determining a first duplicate removal result of the target data according to respective duplicate removal results of the multiple groups of data to be subjected to duplicate removal; wherein the de-duplication model is to: comparing field values of key fields to be deduplicated in the input data pairs to be deduplicated according to preset comparison rules corresponding to different field types to determine the similarity of the data pairs to be deduplicated and obtain the deduplication result of the data pairs to be deduplicated.
Further optionally, when the multiple sets of data pairs are respectively input into a pre-trained deduplication model and respective deduplication results of the multiple sets of data pairs output by the deduplication model are obtained, the processor 702 is specifically configured to: inputting the data pairs to be deduplicated into the deduplication model for any one of the multiple groups of data pairs to be deduplicated; extracting field values of the key fields to be deduplicated from the data pairs to be deduplicated in the deduplication model; comparing the field values of the key fields to be deduplicated according to the comparison rules corresponding to the different field types to obtain comparison results of the key fields to be deduplicated; and performing weighted calculation on the comparison result of the key fields to be deduplicated by using the weight parameters of different field types learned in advance by the deduplication model to obtain the similarity of the data pairs to be deduplicated.
Further optionally, when the processor 702 obtains multiple sets of data pairs to be deduplicated from the target data, and each set of data pairs to be deduplicated includes multiple data records, the processor is specifically configured to: dividing the target data into a plurality of blocks, wherein the data records in each block have the same specific characteristics; determining the corresponding relation between the data record contained in the target data and the plurality of blocks; and selecting the data records of which the corresponding blocks meet set conditions from the target data as a group of data pairs to be deduplicated according to the corresponding relation between the data records contained in the target data and the blocks.
Further optionally, when the processor 702 performs block division on the target data to obtain a plurality of blocks, and data records in each block have the same specific characteristics, the processor is specifically configured to: extracting respective predicate indexes of each data record contained in the target data by adopting a predicate function; and dividing the data records with at least one same predicate index into the same block to obtain the plurality of blocks contained in the target data.
Further optionally, when determining the corresponding relationship between the data record included in the target data and the plurality of blocks, the processor 702 is specifically configured to: determining a block key of each of the plurality of blocks according to at least one predicate index corresponding to each of the plurality of blocks; respectively setting block IDs for the blocks to obtain a plurality of block IDs; and determining the block ID corresponding to the data record contained in the target data according to the corresponding relation between the predicate index of the data record contained in the target data and the block keyword of each block, and establishing the corresponding relation between the data ID and the block ID of each data record in the target data.
Further optionally, when the processor 702 selects a data record of which the corresponding block meets the set condition from the target data, as a group of data to be deduplicated, it is specifically configured to: and determining the corresponding data records with the same block ID from the target data as a group of data pairs to be deduplicated.
Further optionally, the processor 702 is further configured to: sequencing the plurality of block IDs in an ascending order to obtain an ascending sequencing result; for any one of the plurality of block IDs, determining at least one block ID smaller than the block ID from the ascending sorting result as a small-valued block ID of the block ID.
Further optionally, when the processor 702 selects a data record of which the corresponding block satisfies the set condition from the target data, as a group of data pairs to be deduplicated, the processor is specifically configured to: determining small-value block IDs of block IDs corresponding to a plurality of data records in the target data, and using the small-value block IDs as the small-value block IDs corresponding to the plurality of data records; and determining the data records with partial overlapping small-value block IDs from the target data as a group of data records to be deduplicated.
Further optionally, the processor 702, after obtaining the respective deduplication results of the multiple data pairs output by the deduplication model, is further configured to: if the target data pair is incremental deduplication data, determining stock deduplication data; extracting data records from the first duplicate removal result and the stock duplicate removal data respectively to obtain a plurality of groups of data pairs to be matched; respectively inputting the multiple groups of data pairs to be matched into a pre-trained matching model, and acquiring the matching results of the multiple groups of data pairs to be matched output by the matching model; determining a second duplicate removal result of the target data according to the matching result of the plurality of groups of data to be matched to each other; wherein the matching model is to: and extracting a set field value of a second key field from the input data pair to be matched, and comparing the extracted field value of the second key field according to preset comparison rules corresponding to different field types to determine the similarity of the data pair to be matched.
Further optionally, when the processor 702 extracts data records from the first deduplication result and the stock deduplication data to obtain multiple sets of data pairs to be matched, the processor is specifically configured to: selecting a plurality of data blocks with the same data size as the first duplication removal result from the inventory duplication removal data; respectively carrying out block division on the plurality of data blocks to obtain a plurality of blocks contained in each of the plurality of data blocks so as to determine block IDs of data records contained in each of the plurality of data blocks; selecting the ith data record from the first duplicate removal result aiming at any data block in the data blocks, and selecting one data record with the same block ID as the ith data record from the data blocks to form an ith group of data pairs to be matched; wherein i =1,2,3 … n, where n represents the total number of data records contained by the first deduplication result.
Further optionally, when determining the second deduplication result of the target data according to the matching result of each of the multiple sets of data to be matched, the processor 702 is specifically configured to: dividing data records contained in the multiple groups of data pairs to be matched into multiple data groups according to the matching results of the multiple groups of data pairs to be matched; the data records contained in the data pairs to be matched, the matching results of which are not repeated, are respectively divided into different data groups, and the data records contained in the data pairs to be matched, the matching results of which are repeated, are divided into the same data groups; and carrying out repeatability judgment on the plurality of data sets so as to determine a repeated data set and a non-repeated data set from the plurality of data sets.
Further optionally, when performing a repeatability judgment on the plurality of data sets to determine a repeated data set and a non-repeated data set from the plurality of data sets, the processor 702 is specifically configured to: for a first data group and a second data group in the plurality of data groups, if one data record in the first data group is duplicated with one data record in the second data group, determining that the first data group and the second data group are duplicated.
Further optionally, the processor 702 is further configured to: if one second data group and the first data group exist in the plurality of data groups and are repeated data groups, distributing the data records in the first data group to the second data group; if a plurality of second data groups and the first data group exist in the plurality of data groups, the centroid distances between the data records in the first data group and the plurality of second data groups are respectively calculated, and the data records in the first data group are distributed to the second data group with the smallest centroid distance.
Further optionally, the alignment rule corresponding to the different field types includes at least one of the following: comparing the numerical values corresponding to the fields of the data types; affine gap penalty comparison rules corresponding to fields of the common short text type; a completely consistent comparison rule corresponding to the field of the short text unique identification type; and (4) comparing cosine similarity corresponding to the field of the long text type.
Further optionally, the processor 702 is further configured to: determining the key fields to be deduplicated and comparison rules corresponding to the different field types; acquiring a plurality of groups of data pairs as training data, wherein the training data comprises a plurality of groups of repeated data pairs and a plurality of groups of non-repeated data pairs; determining respective similarity calculation values of the multiple groups of data on the basis of the weight parameters of the algorithm model and the comparison rules corresponding to the different field types; and taking the respective similarity true values of the multiple groups of data pairs as supervision signals, calculating values according to the respective similarity of the multiple groups of data pairs, and optimizing the weight parameters of the algorithm model to obtain the de-weighting model.
Further optionally, the deduplication model is a logistic regression model.
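Since the deduplication model is described above as a logistic regression over per-field comparison results weighted by learned parameters, the following is a minimal, hedged sketch of how such a model could be trained with scikit-learn; the feature rows and the threshold are illustrative values only, not data from this application.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds the per-field similarities [S(L1), S(L2), S(L3)] of one training data pair.
X = np.array([[0.95, 0.90, 1.00],    # duplicated pair (illustrative)
              [0.10, 0.20, 0.00],    # non-duplicated pair (illustrative)
              [0.88, 0.75, 1.00],
              [0.30, 0.15, 0.00]])
y = np.array([1, 0, 1, 0])           # similarity ground-truth labels (supervision signal)

model = LogisticRegression()
model.fit(X, y)                      # learned weight parameters live in model.coef_

# Deduplication prediction: a pair is treated as duplicated if its predicted
# probability reaches the specified threshold (e.g. 0.95).
probs = model.predict_proba(X)[:, 1]
duplicates = probs >= 0.95
```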
Further, as shown in fig. 7, the electronic device further includes: power supply components 704, and the like. Only some of the components are schematically shown in fig. 7, and the electronic device is not meant to include only the components shown in fig. 7.
The communication component 703 is configured to facilitate communication between the device in which the communication component is located and other devices in a wired or wireless manner. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply module 704 provides power to various components of the device in which the power supply module is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In the data deduplication method provided by the embodiment of the present application, when the target data to be deduplicated is deduplicated, a data pair to be deduplicated is obtained from the target data, and the data pair is input into a previously trained deduplication model. In the deduplication model, field values of key fields to be deduplicated in the input data pairs to be deduplicated can be compared based on preset comparison rules of different field types, so that the similarity of the data pairs to be deduplicated is calculated. Based on this implementation, different types of fields can be effectively deduplicated, and a better data deduplication effect is achieved.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program is capable of implementing the steps that can be executed by the electronic device in the foregoing method embodiments when executed.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (14)

1. A method for deduplication of textual data, comprising:
acquiring target data to be deduplicated;
acquiring a plurality of groups of data pairs to be deduplicated from the target data, wherein each group of data pairs to be deduplicated comprises a plurality of data records;
respectively inputting the multiple groups of data pairs to be deduplicated into a pre-trained deduplication model, and acquiring respective deduplication results of the multiple groups of data pairs to be deduplicated output by the deduplication model;
determining a first duplicate removal result of the target data according to respective duplicate removal results of the multiple groups of data to be subjected to duplicate removal;
wherein the de-duplication model is to: comparing field values of key fields to be deduplicated in the input data pairs to be deduplicated according to preset comparison rules corresponding to different field types to determine the similarity of the data pairs to be deduplicated and obtain deduplication results of the data pairs to be deduplicated;
if the target data pair is incremental deduplication data, determining stock deduplication data;
selecting a plurality of data blocks with the same data size as the first duplication removal result from the inventory duplication removal data;
respectively carrying out block division on the plurality of data blocks to obtain a plurality of blocks contained in each of the plurality of data blocks so as to determine block IDs of data records contained in each of the plurality of data blocks;
selecting the ith data record from the first duplicate removal result aiming at any data block in the data blocks, and selecting one data record with the same block ID as the ith data record from the data blocks to form an ith group of data pairs to be matched; wherein i =1,2,3 … n, where n represents the total number of data records contained by the first deduplication result;
respectively inputting multiple groups of data pairs to be matched corresponding to the multiple data blocks into a pre-trained matching model, and acquiring respective matching results of the multiple groups of data pairs to be matched output by the matching model;
and determining a second duplicate removal result of the target data according to the matching result of the plurality of groups of data to be matched to each other.
2. The method according to claim 1, wherein inputting the plurality of sets of data pairs into a pre-trained deduplication model respectively, and obtaining deduplication results of the plurality of sets of data pairs output by the deduplication model respectively comprises:
inputting the data pairs to be deduplicated into the deduplication model for any one of the multiple groups of data pairs to be deduplicated;
extracting field values of the key fields to be deduplicated from the data pairs to be deduplicated in the deduplication model;
comparing the field values of the key fields to be deduplicated according to the comparison rules corresponding to the different field types to obtain comparison results of the key fields to be deduplicated;
and performing weighted calculation on the comparison result of the key fields to be deduplicated by using the weight parameters of different field types learned in advance by the deduplication model to obtain the similarity of the data pairs to be deduplicated.
3. The method of claim 1, wherein obtaining a plurality of sets of data pairs to be deduplicated from the target data, each set of data pairs to be deduplicated comprising a plurality of data records, comprises:
dividing the target data into a plurality of blocks, wherein the data records in each block have the same specific characteristics;
determining the corresponding relation between the data record contained in the target data and the plurality of blocks;
and selecting the data records of which the corresponding blocks meet set conditions from the target data as a group of data pairs to be deduplicated according to the corresponding relation between the data records contained in the target data and the blocks.
4. The method of claim 3, wherein the block dividing the target data into a plurality of blocks, the data records in each block having the same specific characteristics comprises:
extracting respective predicate indexes of each data record contained in the target data by adopting a predicate function;
and dividing the data records with at least one same predicate index into the same block to obtain the plurality of blocks contained in the target data.
5. The method of claim 4, wherein determining the correspondence between the data records contained in the target data and the plurality of blocks comprises:
determining a block key of each of the plurality of blocks according to at least one predicate index corresponding to each of the plurality of blocks;
respectively setting block IDs for the blocks to obtain a plurality of block IDs;
and determining the block ID corresponding to the data record contained in the target data according to the corresponding relation between the predicate index of the data record contained in the target data and the block keyword of each block, and establishing the corresponding relation between the data ID and the block ID of each data record in the target data.
6. The method of claim 5, wherein selecting the data records with the corresponding blocks satisfying the set condition from the target data as a set of data pairs to be deduplicated comprises:
and determining the corresponding data records with the same block ID from the target data as a group of data pairs to be deduplicated.
7. The method of claim 5, further comprising:
sorting the plurality of block IDs in ascending order to obtain an ascending sorting result;
for any one of the plurality of block IDs, determining at least one block ID smaller than the block ID from the ascending sorting result as a small-value block ID of the block ID.
8. The method of claim 7, wherein selecting the data records with the corresponding blocks satisfying the set condition from the target data as a set of data pairs to be deduplicated comprises:
determining the small-value block IDs of the block IDs corresponding to a plurality of data records in the target data, and taking them as the small-value block IDs corresponding to the plurality of data records;
and determining data records whose small-value block IDs partially overlap from the target data as a group of data pairs to be deduplicated.
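Claims 7 and 8 admit several readings; the Python sketch below follows one of them: every block ID smaller than a given block ID in the ascending sort is treated as its small-value block ID, each record inherits the small-value block IDs of its blocks, and records whose small-value block ID sets partially overlap become candidate pairs. This interpretation is an assumption made purely for illustration.

# One possible reading of claims 7-8, sketched for illustration.
from itertools import combinations

def small_value_candidate_pairs(block_ids_of_record: dict):
    """block_ids_of_record maps record ID -> set of block IDs (previous sketch)."""
    all_block_ids = sorted({b for ids in block_ids_of_record.values() for b in ids})

    # Small-value block IDs of a block ID: every ID preceding it in the sort.
    smaller_than = {bid: set(all_block_ids[:i]) for i, bid in enumerate(all_block_ids)}

    # Small-value block IDs corresponding to each data record.
    small_ids_of_record = {
        rid: set().union(*(smaller_than[b] for b in ids))
        for rid, ids in block_ids_of_record.items()
    }

    # Records whose small-value block IDs partially overlap form candidate pairs.
    pairs = set()
    for a, b in combinations(sorted(small_ids_of_record), 2):
        if small_ids_of_record[a] & small_ids_of_record[b]:
            pairs.add((a, b))
    return pairs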
9. The method of claim 1, wherein the matching model is used to: extract a set field value of a second key field from the input data pair to be matched, and compare the extracted field value of the second key field according to preset comparison rules corresponding to different field types to determine the similarity of the data pair to be matched.
10. The method according to claim 9, wherein determining the second deduplication result of the target data according to the matching results of the plurality of groups of data pairs to be matched comprises:
dividing the data records contained in the multiple groups of data pairs to be matched into multiple data groups according to the matching results of the multiple groups of data pairs to be matched, wherein the data records contained in data pairs to be matched whose matching result is non-duplicate are divided into different data groups, and the data records contained in data pairs to be matched whose matching result is duplicate are divided into the same data group;
and carrying out repeatability judgment on the plurality of data groups so as to determine a duplicate data group and a non-duplicate data group from the plurality of data groups.
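One way to realise the grouping of claim 10 is a union-find pass over the matching results, so that records connected by duplicate results share a data group while all other records remain in separate groups; the sketch below is illustrative and its data structures are assumptions.

# Illustrative grouping of data records by matching result (claim 10).
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_records(pair_results):
    """pair_results: iterable of (record_id_a, record_id_b, is_duplicate)."""
    uf = UnionFind()
    for a, b, is_duplicate in pair_results:
        uf.find(a)   # ensure both records appear, even if never merged
        uf.find(b)
        if is_duplicate:
            uf.union(a, b)
    groups = {}
    for record_id in uf.parent:
        groups.setdefault(uf.find(record_id), []).append(record_id)
    return list(groups.values())

if __name__ == "__main__":
    results = [(1, 2, True), (2, 3, True), (4, 5, False)]
    print(group_records(results))  # e.g. [[1, 2, 3], [4], [5]]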
11. The method of claim 10, wherein carrying out repeatability judgment on the plurality of data groups to determine a duplicate data group and a non-duplicate data group from the plurality of data groups comprises:
for a first data group and a second data group in the plurality of data groups, if one data record in the first data group is a duplicate of one data record in the second data group, determining that the first data group and the second data group are duplicates of each other.
12. The method of claim 11, further comprising:
if there is one second data group in the plurality of data groups that is a duplicate of the first data group, distributing the data records in the first data group to that second data group;
if there are a plurality of second data groups in the plurality of data groups that are duplicates of the first data group, respectively calculating centroid distances between the data records in the first data group and the plurality of second data groups, and distributing the data records in the first data group to the second data group with the smallest centroid distance.
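The group-level merging of claims 11 and 12 could look like the following Python sketch; representing each data record as a numeric feature vector, so that a group centroid and a centroid distance can be computed, is an assumption made here for illustration and is not prescribed by the patent.

# Illustrative sketch of duplicate-group judgment and record distribution.
import math

def groups_are_duplicate(group_a, group_b, is_duplicate):
    # Claim 11: two data groups are duplicates if any record of one group
    # is a duplicate of any record of the other group.
    return any(is_duplicate(a, b) for a in group_a for b in group_b)

def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distribute(first_group, duplicate_second_groups):
    """first_group / second groups: lists of feature vectors of their records."""
    if len(duplicate_second_groups) == 1:
        # A single duplicate second group: move all records into it.
        duplicate_second_groups[0].extend(first_group)
        return
    # Several duplicate second groups: send each record of the first group to
    # the second group whose centroid is closest to it.
    centroids = [centroid(g) for g in duplicate_second_groups]
    for record in first_group:
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(record, centroids[i]))
        duplicate_second_groups[nearest].append(record)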
13. An electronic device, comprising: a memory and a processor;
the memory is configured to store one or more computer instructions;
the processor is configured to execute the one or more computer instructions to perform the steps of the method of any one of claims 1-12.
14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed, implements the steps of the method of any one of claims 1-12.
CN202011150210.0A 2020-10-23 2020-10-23 Text data duplication eliminating method, equipment and storage medium Active CN112463774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011150210.0A CN112463774B (en) 2020-10-23 2020-10-23 Text data duplication eliminating method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011150210.0A CN112463774B (en) 2020-10-23 2020-10-23 Text data duplication eliminating method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112463774A CN112463774A (en) 2021-03-09
CN112463774B true CN112463774B (en) 2021-10-12

Family

ID=74835181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011150210.0A Active CN112463774B (en) 2020-10-23 2020-10-23 Text data duplication eliminating method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112463774B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792180B (en) * 2021-08-30 2024-02-23 北京百度网讯科技有限公司 Method and device for removing duplicate in recommended scene, electronic equipment and storage medium
CN113705184B (en) * 2021-09-01 2023-09-22 同盾科技有限公司 Custom report generation method and device, storage medium and electronic equipment
CN115408379A (en) * 2022-10-25 2022-11-29 广州市玄武无线科技股份有限公司 Terminal repeating data determination method, device, equipment and computer storage medium
CN115631866B (en) * 2022-12-19 2023-03-14 成都瑞华康源科技有限公司 Rapid and accurate de-duplication method for medical big data acquisition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463578A (en) * 2016-06-06 2017-12-12 工业和信息化部电信研究院 Using download statistics De-weight method, device and terminal device
CN111259282A (en) * 2020-02-13 2020-06-09 深圳市腾讯计算机系统有限公司 URL duplicate removal method and device, electronic equipment and computer readable storage medium
CN111639487A (en) * 2020-04-30 2020-09-08 深圳壹账通智能科技有限公司 Classification model-based field extraction method and device, electronic equipment and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324552B (en) * 2013-06-06 2016-01-13 西安交通大学 Two benches list example duplicate removal data back up method
CN107229660A (en) * 2016-03-25 2017-10-03 阿里巴巴集团控股有限公司 A kind of method and apparatus of data deduplication
CN107145537B (en) * 2017-04-21 2021-06-18 深圳市天天来玩科技有限公司 Table data importing method and system
JP6884128B2 (en) * 2018-09-20 2021-06-09 株式会社日立製作所 Data deduplication device, data deduplication method, and data deduplication program
CN110196848B (en) * 2019-04-09 2022-04-12 广联达科技股份有限公司 Cleaning and duplicate removal method and system for public resource transaction data
CN110457305B (en) * 2019-08-13 2021-11-26 腾讯科技(深圳)有限公司 Data deduplication method, device, equipment and medium

Also Published As

Publication number Publication date
CN112463774A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN112463774B (en) Text data duplication eliminating method, equipment and storage medium
JP7169369B2 (en) Method, system for generating data for machine learning algorithms
US11361004B2 (en) Efficient data relationship mining using machine learning
US20160055205A1 (en) Automated creation of join graphs for unrelated data sets among relational databases
CN110968695A (en) Intelligent labeling method, device and platform based on active learning of weak supervision technology
WO2016029230A1 (en) Automated creation of join graphs for unrelated data sets among relational databases
US20210334292A1 (en) System and method for reconciliation of data in multiple systems using permutation matching
CN113254507B (en) Intelligent construction and inventory method for data asset directory
US20220101057A1 (en) Systems and methods for tagging datasets using models arranged in a series of nodes
US11620453B2 (en) System and method for artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
US11360953B2 (en) Techniques for database entries de-duplication
US20230081737A1 (en) Determining data categorizations based on an ontology and a machine-learning model
US10467276B2 (en) Systems and methods for merging electronic data collections
US20220229854A1 (en) Constructing ground truth when classifying data
Bogatu et al. Towards automatic data format transformations: data wrangling at scale
US11556527B2 (en) System and method for value based region searching and associated search operators
US11604923B2 (en) High volume message classification and distribution
US11048730B2 (en) Data clustering apparatus and method based on range query using CF tree
CN117009518A (en) Similar event judging method integrating basic attribute and text content and application thereof
CN114443783B (en) Supply chain data analysis and enhancement processing method and device
US20230138491A1 (en) Continuous learning for document processing and analysis
US20230134218A1 (en) Continuous learning for document processing and analysis
CN113537349A (en) Method, device, equipment and storage medium for identifying hardware fault of large host
CN112182218A (en) Text data classification method and device
CN113901223B (en) Method, device, computer equipment and storage medium for generating enterprise classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant