CN113901264A

CN113901264A - Method and system for matching periodic entities among movie and television attribute data sources

Info

Publication number: CN113901264A
Application number: CN202111339282.4A
Authority: CN
Inventors: 赵春光; 李凯东; 林桢杰; 陈珊珊; 李孟禹; 赵亦喆
Original assignee: Central Video Financial Media Development Co ltd
Current assignee: Central Video Financial Media Development Co ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-01-07

Abstract

The invention discloses a method and a system for periodic entity matching among movie and television attribute data sources, wherein the method comprises the following steps: adding a plurality of first records to a first data source; acquiring a first index and a second index constructed for a second data source; searching the title and the alias of each first record in the indexes to obtain a plurality of candidate record pairs; acquiring from the second data source the second records involved in the candidate record pairs; sequentially calculating the similarity of the first record and the second record in each candidate record pair in each dimension; inputting the similarity of each dimension into a similarity fusion model to obtain a comprehensive similarity; and if the comprehensive similarity is greater than a threshold value, determining that the first record and the second record in the candidate record pair are successfully matched, updating the entity matching state dictionary of the first record to matched, and storing the successfully matched candidate record pair. The invention can efficiently complete the entity matching task with limited training data resources, computing resources and storage resources, and supports the interpretability of the matching result.

Description

Method and system for matching periodic entities among movie and television attribute data sources

Technical Field

The invention relates to the technical field of entity matching, in particular to a method and a system for periodically matching entities among movie and television attribute data sources.

Background

Entity matching is a task at the intersection of knowledge graphs and natural language processing. It addresses the problem of knowledge fusion, i.e., the process of finding all record pairs that represent the same entity across homogeneous or heterogeneous data sources (or across knowledge graphs), or of identifying the instances that map to a given real-world entity. For example, in an e-commerce scenario, it must be determined whether commodities from two platforms correspond to the same commodity; in a movie and video recommendation scenario, whether two videos correspond to the same movie; and in a knowledge graph fusion scenario, the mappings between all matching entity pairs of the two graphs must be found. These requirements can be generalized as the need to merge external data sources in order to supplement and extend internal data sources. When handling such requirements, the primary task is to pair the records between the data sources. The pairing is achieved by calculating the similarity of attributes between records, so the similarity calculation model largely determines the matching quality in terms of precision, recall and the like. The task is called entity matching because each record is assumed to map one-to-one to an entity in the real world.

Entity matching for film and television attribute data can be illustrated by the following example: consider movie attribute data sets A and B from two different sources. Any record a_i in set A maps to some movie entity e_k; if a record b_j in set B also points to the entity e_k, the record pair (a_i, b_j) is said to be successfully matched. The results of the entity matching task can then be applied to a variety of downstream tasks, such as completing or extending the attributes of a_i using b_j.

In current solutions, entity matching on movie and television attribute data is commonly performed with deep learning. Although deep learning methods can solve problems in many general fields, deep learning models require a large amount of training data, and therefore are difficult to converge when training data resources are scarce. Furthermore, the lack of interpretability of deep models remains a problem. In addition, as the size of the data sets to be matched increases, reducing the required computing and storage resources is also an aspect that must be considered in algorithm design. Finally, since the whole process often requires manual intervention and the input of expert knowledge, reducing the manual workload is another direction for improving entity matching methods. Based on this, there is a need in the art for a new entity matching method for scarce training data, which efficiently completes the entity matching task under limited computing resources, storage resources and manual labeling effort, and supports the interpretability of the matching result.

Disclosure of Invention

The invention aims to provide a method and a system for periodically matching entities among movie and television attribute data sources, which can efficiently complete entity matching tasks under limited computing resources, storage resources and training data resources and support the interpretability of matching results.

In order to achieve the purpose, the invention provides the following scheme:

a method for matching periodic entities among movie and television attribute data sources, comprising the following steps:

acquiring a first data source;

adding a plurality of first records to the first data source, and initializing an entity matching state dictionary of each first record to be unmatched; each first record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;

acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for alias attributes of a second data source; the second data source comprises a plurality of second records; each second record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;

sequentially taking one first record, searching the title of the first record in the first index, and searching the alias of the first record in the second index to obtain a search result; the search result comprises one or more identification codes of the second record;

obtaining one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results;

acquiring a corresponding second record in the second data source according to the identification code of the second record in the candidate record pair to obtain a second record corresponding to the first record in the candidate record pair;

sequentially calculating the similarity of the first record and the second record corresponding to the first record in each candidate record pair in each dimension to obtain the similarity of each dimension; the similarity of each dimension comprises the similarity of showing time, the similarity of director and the similarity of brief introduction;

inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the first record and a second record corresponding to the first record in the candidate record pair; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model;

judging whether the comprehensive similarity is larger than a set threshold value or not;

and if the comprehensive similarity is larger than the set threshold, determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the candidate record pair successfully matched.

Optionally, the acquiring the first data source further includes:

constructing a first data source; the first data source includes a title attribute, an alias attribute, a show time attribute, a director attribute, a lead actor attribute, and a profile attribute of the movie.

Optionally, the sequentially calculating the similarity of each dimension of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:

sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair;

sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair;

sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair;

and sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair.

Optionally, the sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:

acquiring the year weight, month weight and day weight of the showing time; the sum of the year weight, the month weight and the day weight is 1;

comparing whether the year of the showing time in the first record is the same as the year of the showing time in a second record corresponding to the first record;

if the years are the same, determining that the year similarity is 1;

if the years are different, determining that the year similarity is 0;

comparing whether the month of the showing time in the first record is the same as the month of the showing time in the second record corresponding to the first record;

if the months are the same, determining that the similarity of the months is 1;

if the months are different, determining that the similarity of the months is 0;

comparing whether the date of the showing time in the first record is the same as the date of the showing time in the second record corresponding to the first record;

if the days are the same, determining that the day similarity is 1;

if the days are different, determining that the day similarity is 0;

and summing the product of the year similarity and the year weight, the product of the month similarity and the month weight and the product of the day similarity and the day weight to obtain the showing time similarity.
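The weighted comparison described above can be sketched as follows. This is only an illustrative sketch: the function name `showing_time_similarity` and the default weights 0.6, 0.3 and 0.1 are assumptions chosen for the example, since the method only requires that the three weights sum to 1.

```python
from datetime import date


def showing_time_similarity(first: date, second: date,
                            year_w: float = 0.6,
                            month_w: float = 0.3,
                            day_w: float = 0.1) -> float:
    """Weighted sum of exact-match indicators for year, month and day.

    The default weights are hypothetical; only their sum (1) is prescribed.
    """
    if abs(year_w + month_w + day_w - 1.0) > 1e-9:
        raise ValueError("year, month and day weights must sum to 1")
    # Each component is 1 when equal and 0 otherwise.
    year_sim = 1.0 if first.year == second.year else 0.0
    month_sim = 1.0 if first.month == second.month else 0.0
    day_sim = 1.0 if first.day == second.day else 0.0
    return year_sim * year_w + month_sim * month_w + day_sim * day_w
```

With these example weights, two records that agree on year and month but differ in day would score the sum of the year weight and the month weight.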

Optionally, the sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:

comparing whether the director in the first record is the same as the director in the second record corresponding to the first record;

if the directors are the same, determining that the director similarity is 1;

and if the directors are not the same, determining that the director similarity is 0.

Optionally, the sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:

expressing the feature of the lead actor in the first record by using a TF-IDF algorithm to obtain a first lead actor feature;

expressing the feature of the lead actor in the second record corresponding to the first record by using a TF-IDF algorithm to obtain a second lead actor feature;

comparing the first lead actor feature with the second lead actor feature by using a cosine similarity algorithm to obtain a lead actor comparison result;

and carrying out normalization processing on the lead actor comparison result to obtain the lead actor similarity.
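A minimal pure-Python sketch of a TF-IDF plus cosine comparison over lead actor lists is given below. Several details are assumptions not stated in the source: each actor name is treated as one term, the inverse document frequency is computed from a hypothetical `corpus` of actor lists using a smoothed formula, and the normalization step is reduced to clamping the cosine value to [0, 1].

```python
import math
from collections import Counter


def build_idf(corpus):
    """Smoothed inverse document frequency per actor name (assumed formula)."""
    n = len(corpus)
    df = Counter(actor for doc in corpus for actor in set(doc))
    return {actor: math.log((1 + n) / (1 + d)) + 1.0 for actor, d in df.items()}


def tfidf_vector(actors, idf):
    """Sparse TF-IDF vector; actors unknown to the corpus get weight 0."""
    counts = Counter(actors)
    total = sum(counts.values())
    return {a: (c / total) * idf.get(a, 0.0) for a, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lead_actor_similarity(first_actors, second_actors, corpus):
    idf = build_idf(corpus)
    raw = cosine(tfidf_vector(first_actors, idf),
                 tfidf_vector(second_actors, idf))
    # Normalization (assumption): clamp to [0, 1] to absorb rounding error.
    return min(1.0, max(0.0, raw))
```

Because TF-IDF weights are non-negative, the cosine value already lies in [0, 1]; the clamp only guards against floating-point drift.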

Optionally, the sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:

using LSI algorithm to express the characteristics of the brief introduction in the first record to obtain first brief introduction characteristics;

using LSI algorithm to express the characteristics of the brief introduction in the second record corresponding to the first record, and obtaining second brief introduction characteristics;

comparing the first brief introduction characteristic with the second brief introduction characteristic by using a cosine similarity algorithm to obtain a brief introduction comparison result;

and carrying out normalization processing on the comparison result of the brief introduction to obtain the similarity of the brief introduction.
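The profile comparison can be sketched with a small latent semantic indexing (LSI) example built on a truncated singular value decomposition. This is a sketch under stated assumptions, not the patented implementation: it depends on NumPy, tokenizes by whitespace (Chinese text would need a word segmenter), fits the decomposition on a hypothetical `corpus` together with the two profiles, uses `k` latent dimensions, and normalizes the cosine value from [-1, 1] to [0, 1] with (cos + 1) / 2.

```python
import numpy as np


def lsi_profile_similarity(first_profile, second_profile, corpus, k=2):
    """Cosine similarity of two profiles in a k-dimensional LSI space."""
    docs = [d.lower().split() for d in list(corpus) + [first_profile, second_profile]]
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}

    # Term-document count matrix: rows are terms, columns are documents.
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            counts[index[term], j] += 1.0

    # Truncated SVD: keep the k largest singular values.
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (s[:k, None] * vt[:k]).T  # one latent vector per document

    a, b = doc_vecs[-2], doc_vecs[-1]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    cos = float(a @ b / denom) if denom > 0 else 0.0
    cos = max(-1.0, min(1.0, cos))
    # Normalization (assumption): map [-1, 1] onto [0, 1].
    return (cos + 1.0) / 2.0
```

In a production setting the LSI model would more likely be fitted once on a large profile corpus and reused, rather than refitted for every record pair as in this sketch.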

Optionally, the inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the candidate record pair specifically includes:

splicing the showing time similarity, the director similarity and the introduction similarity into a long vector;

inputting the long vector into a multilayer perceptron model, and performing dimensionality reduction and feature fusion on the long vector to obtain a low-dimensional vector;

and inputting the low-dimensional vector into a logistic regression model, and performing feature fusion on the low-dimensional vector to obtain comprehensive similarity.
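The fusion step can be illustrated with a forward pass through one hidden layer followed by a logistic regression unit. Everything concrete here is an assumption for illustration: the `PARAMS` values are hand-picked rather than trained, the hidden layer uses ReLU, and the example input holds four per-dimension similarities (showing time, director, lead actor, profile) reduced to two hidden features.

```python
import math

# Hypothetical, untrained parameters: 4 input similarities -> 2 hidden units.
PARAMS = {
    "W1": [[1.0, 1.0, 0.5, 0.5],
           [0.5, 0.5, 1.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [2.0, 2.0],
    "b2": -4.0,
}


def mlp_layer(vec, weights, biases):
    """Dense layer with ReLU: reduces the long vector to a low-dimensional one."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, biases)]


def logistic_regression(vec, weights, bias):
    """Sigmoid of a linear combination, giving a score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fuse(similarities, params=PARAMS):
    """Comprehensive similarity from the spliced per-dimension similarities."""
    hidden = mlp_layer(similarities, params["W1"], params["b1"])
    return logistic_regression(hidden, params["w2"], params["b2"])
```

In practice the parameters would be learned from labeled record pairs; the small parameter count is what keeps the training data and compute requirements low.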

Optionally, the determining whether the comprehensive similarity is greater than a set threshold further includes:

and if the comprehensive similarity is not greater than the set threshold, determining that the first record in the candidate record pair fails to be matched with a second record corresponding to the first record.

The invention also provides the following scheme:

a system for periodic entity matching between movie and television attribute data sources, the system comprising:

the first data source acquisition module is used for acquiring a first data source;

a first record adding module, configured to add a plurality of first records to the first data source, and initialize the entity matching state dictionary of each first record to be unmatched; each first record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;

the index acquisition module is used for acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for alias attributes of a second data source; the second data source comprises a plurality of second records; each second record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;

a search result obtaining module, configured to sequentially take one of the first records, search for a title of the first record in the first index, and search for an alias of the first record in the second index, so as to obtain a search result; the search result comprises one or more identification codes of the second record;

a candidate record pair obtaining module, configured to obtain one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results;

a second record obtaining module, configured to acquire the corresponding second record in the second data source according to the identification code of the second record in the candidate record pair, so as to obtain a second record corresponding to the first record in the candidate record pair;

each dimension similarity calculation module is used for calculating the similarity of the first record and the second record corresponding to the first record in each candidate record pair in each dimension in sequence to obtain each dimension similarity; the similarity of each dimension comprises the similarity of showing time, the similarity of director and the similarity of brief introduction;

a comprehensive similarity obtaining module, configured to input the similarity of each dimension into a similarity fusion model, so as to obtain a comprehensive similarity between the first record and a second record corresponding to the first record in the candidate record pair; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model;

the judging module is used for judging whether the comprehensive similarity is greater than a set threshold value or not;

and the updating module is used for determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record when the comprehensive similarity is larger than the set threshold value according to the output result of the judging module, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the candidate record pair which is successfully matched.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention discloses a method and a system for periodic entity matching among movie and television attribute data sources. The entity matching state dictionary of each first record added to the first data source is initialized to unmatched, and is updated to matched after the first record and the second record in a candidate record pair are successfully matched. New records are periodically acquired from the first data source, and the state dictionary remembers the previous matching states so that unmatched records can be screened out and retained. A multilayer perceptron model and a logistic regression model are adopted as the similarity fusion model: the similarities of multiple dimensions are calculated and then fused to obtain the comprehensive similarity, so the matching result is interpretable. In addition, the multilayer perceptron model and the logistic regression model are simple in structure and have few parameters, which significantly reduces the requirements for computing resources, storage resources and training data resources. The entity matching task can therefore be completed efficiently under limited computing resources, storage resources and training data resources, while the interpretability of the matching result is supported.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a method for matching periodic entities between movie and television attribute data sources according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a periodic entity matching process according to the present invention;

FIG. 3 is a schematic diagram of a film and television entity similarity calculation model according to the present invention;

FIG. 4 is a block diagram of an embodiment of a system for periodic entity matching between movie and television attribute data sources.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a flowchart of an embodiment of a method for matching periodic entities between movie and television attribute data sources, and fig. 2 is a schematic diagram of a periodic entity matching process according to the present invention. Referring to fig. 1 and 2, the method for matching periodic entities between movie and television attribute data sources includes:

step 101: a first data source is obtained.

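Before step 101, the method further includes constructing the first data source; the first data source includes a title attribute, an alias attribute, a show time attribute, a director attribute, a lead actor attribute and a profile attribute of the movie.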

Step 102: adding a plurality of first records to a first data source, and initializing an entity matching state dictionary of each first record to be unmatched; each first record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.

In this embodiment, new records (first records) are periodically acquired from the first data source (video data source A, namely data source A in fig. 2). For each newly added record, the entity matching state dictionary of the record is initialized to unmatched. Each record from data source A is then traversed, the records whose state in the state dictionary is unmatched are screened out, and a target record list is generated. Data cleaning is performed on the records in the target record list, including sub-steps such as unifying data types and structures and removing duplicates. The cleaned attributes may include the title, alias, showing time, director, lead actor and brief introduction of the movie.
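The state dictionary and screening logic described in this paragraph can be sketched as below. The record layout (a dict with an `id` key), the state labels and the function names are hypothetical, and the cleaning step is reduced to whitespace trimming and de-duplication by id purely for illustration.

```python
UNMATCHED, MATCHED = "unmatched", "matched"


def register_new_records(state, new_ids):
    """Initialize the matching state of newly added records to unmatched.

    Records already present keep their remembered state.
    """
    for record_id in new_ids:
        state.setdefault(record_id, UNMATCHED)


def build_target_list(records, state):
    """Keep only unmatched records, de-duplicated and lightly cleaned."""
    seen = set()
    targets = []
    for record in records:
        record_id = record["id"]
        if state.get(record_id) != UNMATCHED or record_id in seen:
            continue
        seen.add(record_id)
        # Cleaning (assumption): trim surrounding whitespace on string fields.
        cleaned = {key: value.strip() if isinstance(value, str) else value
                   for key, value in record.items()}
        targets.append(cleaned)
    return targets
```

Because previously matched records are skipped, each periodic run only processes the increment, which is the source of the computational savings the description claims.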

Step 103: acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for the alias attribute of the second data source; the second data source comprises a plurality of second records; each second record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.

Step 104: sequentially taking a first record, searching the title of the first record in a first index, and searching the alias of the first record in a second index to obtain a search result; the search results include the identification codes of the one or more second records.

The target record list obtained in the previous step is one input of step 104; each first record in the target record list is traversed and denoted a_i. The other input of step 104 is the index built from the title attribute and alias attribute of the second data source (data source B in FIG. 2), which may also be replaced by the search system of data source B itself. This recall step searches the index of data source B for the title of a_i, yielding a plurality of candidate record pairs (a_i, b_j), where b_j represents the identification code (id) of a second record in the second data source. Finally, all candidate record pairs (a_i, b_j) are collected into a list as the output.
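The recall step can be sketched with two in-memory exact-match indexes. This is a simplification under stated assumptions: a real deployment would more likely use a full-text search engine with fuzzy matching, and the record fields (`id`, `title`, `aliases`), the `normalize` helper and the function names are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Case- and whitespace-insensitive key (assumed normalization)."""
    return " ".join(text.lower().split())


def build_index(records, field):
    """Map each normalized value of `field` to the ids of records holding it."""
    index = defaultdict(set)
    for record in records:
        value = record.get(field, [])
        values = value if isinstance(value, list) else [value]
        for item in values:
            index[normalize(item)].add(record["id"])
    return index


def recall_candidates(first_record, title_index, alias_index):
    """Search the title in the title index and the aliases in the alias index."""
    ids = set(title_index.get(normalize(first_record["title"]), ()))
    for alias in first_record.get("aliases", []):
        ids |= alias_index.get(normalize(alias), set())
    # Each candidate pair holds the first record's id and a second record's id.
    return [(first_record["id"], b_id) for b_id in sorted(ids)]
```

Restricting the expensive similarity calculation to these recalled pairs is what keeps the overall cost far below comparing every record of A with every record of B.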

Step 105: obtaining one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results.

Step 106: and acquiring a corresponding second record in the second data source according to the identification code of the second record in the candidate record pair to obtain a second record corresponding to the first record in the candidate record pair.

Step 107: sequentially calculating the similarity of the first record in each candidate record pair and the second record corresponding to the first record in each dimension to obtain the similarity of each dimension; the similarity of each dimension comprises the similarity of the showing time, the similarity of the director and the similarity of the brief introduction.

The step 107 specifically includes:

sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair;

sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair;

sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair;

and sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair.

Sequentially calculating the showing time similarity of the first record in each candidate record pair and the second record corresponding to the first record specifically includes:

acquiring the year weight, month weight and day weight of the showing time; the sum of the year weight, the month weight and the day weight is 1.

And comparing whether the year of the showing time in the first record is the same as the year of the showing time in the second record corresponding to the first record.

And if the years are the same, determining that the year similarity is 1.

And if the years are different, determining that the year similarity is 0.

And comparing whether the month of the showing time in the first record is the same as the month of the showing time in the second record corresponding to the first record.

If the months are the same, determining that the similarity of the months is 1.

If the months are different, the similarity of the months is determined to be 0.

And comparing whether the date of the showing time in the first record is the same as the date of the showing time in the second record corresponding to the first record.

And if the days are the same, determining that the day similarity is 1.

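And if the days are different, determining that the day similarity is 0.

And summing the product of the year similarity and the year weight, the product of the month similarity and the month weight and the product of the day similarity and the day weight to obtain the showing time similarity.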

Sequentially calculating the director similarity of the first record in each candidate record pair and the second record corresponding to the first record, specifically comprising:

and comparing whether the director in the first record is the same as the director in the second record corresponding to the first record.

And if the directors are the same, determining that the director similarity is 1.

And if the directors are not the same, determining that the director similarity is 0.

Calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair in turn, specifically comprising:

and expressing the feature of the lead actor in the first record by using a TF-IDF algorithm to obtain a first lead actor feature.

And expressing the feature of the lead actor in the second record corresponding to the first record by using a TF-IDF algorithm to obtain a second lead actor feature.

And comparing the first lead actor characteristic with the second lead actor characteristic by using a cosine similarity algorithm to obtain a lead actor comparison result.

And carrying out normalization processing on the lead actor comparison result to obtain the lead actor similarity.

Sequentially calculating the profile similarity of the first record in each candidate record pair and the second record corresponding to the first record, specifically comprising:

the profile in the first record is characterized using an LSI algorithm to obtain a first profile characteristic.

And representing the characteristics of the profile in the second record corresponding to the first record by using an LSI algorithm to obtain the characteristics of the second profile.

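And comparing the first brief introduction characteristic with the second brief introduction characteristic by using a cosine similarity algorithm to obtain a brief introduction comparison result.

And carrying out normalization processing on the brief introduction comparison result to obtain the brief introduction similarity.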

Step 108: inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of a first record in the candidate record pair and a second record corresponding to the first record; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model.

The step 108 specifically includes:

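splicing the showing time similarity, the director similarity and the introduction similarity into a long vector.

Inputting the long vector into the multilayer perceptron model, and performing dimensionality reduction and feature fusion on the long vector to obtain a low-dimensional vector.

Inputting the low-dimensional vector into the logistic regression model, and performing feature fusion on the low-dimensional vector to obtain the comprehensive similarity.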

Steps 107 and 108 traverse the whole candidate record pair list and, for each candidate pair (a_i, b_j), sequentially calculate the similarity between a_i and the second record identified by b_j in each dimension, such as showing time, director and introduction. The similarity of each dimension is input as a feature into the similarity fusion model. The model output is a value in the range [0,1] representing the comprehensive similarity score of a_i and the second record identified by b_j. Referring to fig. 3, the movie and television entity similarity calculation model includes the following modules:

Showing time similarity calculation: year, month and day are compared separately; each comparison is recorded as 1 if equal and 0 otherwise, and the weighted sum of the three results is calculated.

Director similarity calculation: 1 if the Chinese or English names match, otherwise 0.

Lead actor similarity calculation: the cosine similarity between the actor sets of the record pair is calculated using TF-IDF (term frequency-inverse document frequency) representations.

Brief introduction similarity calculation: the cosine similarity between the semantic representations of the introduction texts of the record pair is calculated using LSI (Latent Semantic Indexing).

Similarity fusion: the input similarities are spliced into a long vector, which is input into an MLP (multilayer perceptron) model and finally into a logistic regression (LR) model to obtain the comprehensive similarity.

Step 109: and judging whether the comprehensive similarity is greater than a set threshold value.

If the integrated similarity is greater than the set threshold, execute step 110: and determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the successfully matched candidate record pair.

If the integrated similarity is not greater than (less than or equal to) the set threshold, execute step 111: and determining that the first record in the candidate record pair fails to match with the second record corresponding to the first record.

In steps 109 to 111, for each a_i, it is determined whether the record pair (a_i, b_j) with the highest comprehensive score is successfully matched. When the score is greater than a configurable threshold (the set threshold), a_i and the second record identified by b_j are considered a valid matching record pair; otherwise, they are considered an invalid match. Where quality requirements are high, the validity of the match may optionally be confirmed again manually. The successfully matched (a_i, b_j) data are accumulated and updated, the matching state dictionary of a_i is marked as matched, and the matching state dictionary of a_i is saved into a data table.
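The decision logic of steps 109 to 111 can be sketched as follows. The input format (triples of first record id, second record id and comprehensive score), the state labels, the function name and the default threshold of 0.5 are assumptions for illustration; the source only states that the threshold is configurable.

```python
def decide_matches(scored_pairs, state, threshold=0.5):
    """Keep the best-scoring candidate per first record and apply the threshold.

    scored_pairs: iterable of (a_id, b_id, comprehensive_similarity).
    state: entity matching state dictionary, updated in place.
    Returns the list of successfully matched (a_id, b_id) pairs.
    """
    best = {}
    for a_id, b_id, score in scored_pairs:
        # Remember only the highest-scoring candidate for each first record.
        if a_id not in best or score > best[a_id][1]:
            best[a_id] = (b_id, score)

    matched = []
    for a_id, (b_id, score) in best.items():
        if score > threshold:  # strictly greater than the set threshold
            state[a_id] = "matched"
            matched.append((a_id, b_id))
        # Otherwise the state stays unmatched and the record is retried
        # in a later periodic run.
    return matched
```

Persisting `state` and the returned pairs (for example to a database table) is left out of the sketch.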

The invention provides a method for periodic entity matching among movie and television attribute data sources, with the entity alignment process shown in fig. 2. It relates to the intersection of knowledge graphs and natural language processing and belongs to the sub-field of entity matching. The method is a rule-based entity matching method aimed mainly at entity matching of movie and television attribute data.

Compared with the prior art, the invention has the following advantages:

1. The method realizes periodic dynamic updating of entity matching across multiple data sources, and periodically updates and stores each matching state dictionary.

2. For movie and television entity matching, the attributes important for matching are designed, namely title, alias, show time, director and brief introduction.

3. A multi-dimensional feature fusion model based on machine learning is designed, and is used for automatically generating similarity of candidate record pairs and judging matching effectiveness.

4. The method realizes entity matching across data sources, and can serve downstream tasks such as data source fusion and the like. The fusion data can be further applied to content recommendation scenes and e-commerce scenes, and is not limited to specific service scenes.

5. Updated records from data source A are retrieved periodically; the state dictionary remembers the previous matching states and is used to screen and retain the unmatched records. This incremental data update mode greatly reduces the amount of calculation, because record pairs that were previously calculated or matched need not be calculated again.

6. The features of multiple dimensions are calculated and then fused to obtain the comprehensive similarity; the model is interpretable and easy to trace, has few parameters, and reduces the requirements on training data resources, computing resources and storage resources.

7. The title and alias of the second data source are used as indexes to search for the title of the first data source, so that the recalled candidate record pairs retain a high recall rate under limited computing resources; similarity is then judged on more attribute features, which ensures high accuracy of the final result.
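Advantages 5 and 7 can be sketched together in Python: records already marked as matched are skipped, and the remaining first records are looked up in a title index and an alias index built on the second data source. This is only a sketch under stated assumptions. The exact-match inverted index, the `normalize` helper, the record layout and all identifiers are hypothetical; the description does not specify the index technology.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and drop whitespace so trivially different spellings collide."""
    return "".join(text.lower().split())


def build_index(records: List[dict], field: str) -> Dict[str, Set[str]]:
    """Map each normalized value of `field` to the ids of records holding it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for rec in records:
        values = rec.get(field) or []
        if isinstance(values, str):
            values = [values]
        for value in values:
            index[normalize(value)].add(rec["id"])
    return index


def recall_candidates(
    first_records: List[dict],
    state: Dict[str, str],
    title_index: Dict[str, Set[str]],
    alias_index: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (first id, second id) candidate pairs for unmatched records only."""
    pairs: List[Tuple[str, str]] = []
    for rec in first_records:
        if state.get(rec["id"]) == "matched":
            continue  # incremental update: never recompute a matched record
        hits: Set[str] = set()
        hits |= title_index.get(normalize(rec["title"]), set())
        for alias in rec.get("alias", []):
            hits |= alias_index.get(normalize(alias), set())
        pairs.extend((rec["id"], b_id) for b_id in sorted(hits))
    return pairs


second = [
    {"id": "b_1", "title": "Hero", "alias": ["Ying Xiong"]},
    {"id": "b_2", "title": "The Hero", "alias": ["Hero 2"]},
]
first = [
    {"id": "a_1", "title": "hero", "alias": ["Ying Xiong"]},
    {"id": "a_2", "title": "The Hero", "alias": []},
]
state = {"a_1": "unmatched", "a_2": "matched"}
candidates = recall_candidates(
    first, state, build_index(second, "title"), build_index(second, "alias")
)
```

Here only a_1 is searched, yielding the single candidate pair (a_1, b_1); a_2 is skipped because the state dictionary already records it as matched. A production system would more plausibly use a full-text search index with fuzzy matching so that recall stays high for near-duplicate titles.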

FIG. 4 is a block diagram of an embodiment of a system for periodic entity matching between movie and television attribute data sources. Referring to fig. 4, the system for matching periodic entities between movie and television attribute data sources includes:

A first data source obtaining module 401, configured to obtain a first data source.

A first record adding module 402, configured to add a plurality of first records to a first data source, and initialize an entity matching state dictionary of each first record to be unmatched; each first record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.

An index obtaining module 403, configured to obtain a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for the alias attribute of the second data source; the second data source comprises a plurality of second records; each second record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.

A search result obtaining module 404, configured to sequentially take a first record, search for the title of the first record in the first index, and search for the alias of the first record in the second index to obtain a search result; the search result comprises the identification codes of one or more second records.

A candidate record pair obtaining module 405, configured to obtain one or more candidate record pairs according to the search result; each candidate record pair includes a first record and the identification code of a second record in the search result.

A second record obtaining module 406, configured to obtain a second record corresponding to the first record in the candidate record pair according to the identification code of the second record in the candidate record pair.

A per-dimension similarity calculation module 407, configured to sequentially calculate the similarity in each dimension between the first record in each candidate record pair and its corresponding second record, to obtain the similarity of each dimension; the similarity of each dimension comprises the show time similarity, the director similarity and the brief introduction similarity.

A comprehensive similarity obtaining module 408, configured to input the similarity of each dimension into the similarity fusion model to obtain a comprehensive similarity between a first record in the candidate record pair and a second record corresponding to the first record; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model.

A judging module 409, configured to judge whether the comprehensive similarity is greater than a set threshold.

An updating module 410, configured to, when the output of the judging module is that the comprehensive similarity is greater than the set threshold, determine that the first record and the second record in the candidate record pair are successfully matched, update the entity matching state dictionary of the first record in the candidate record pair to matched, and store the successfully matched candidate record pair.
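The fusion performed by module 408 on the output of module 407 can be illustrated by a minimal logistic-regression sketch in Python. It covers only the logistic regression part of the similarity fusion model and omits the multilayer perceptron. The weights, the bias and the dimension keys are hypothetical values chosen for illustration, not trained parameters from the invention.

```python
import math
from typing import Dict

# Hypothetical parameters; in practice they would be learned from labeled pairs.
WEIGHTS: Dict[str, float] = {"show_time": 2.0, "director": 2.5, "profile": 2.0}
BIAS = -3.0


def fuse(similarities: Dict[str, float]) -> float:
    """Fuse per-dimension similarities in [0, 1] into one score in (0, 1)."""
    z = BIAS + sum(w * similarities.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


def contributions(similarities: Dict[str, float]) -> Dict[str, float]:
    """Per-dimension terms of the linear score, useful for tracing a decision."""
    return {name: w * similarities.get(name, 0.0) for name, w in WEIGHTS.items()}


strong = fuse({"show_time": 1.0, "director": 1.0, "profile": 1.0})
weak = fuse({"show_time": 0.0, "director": 0.0, "profile": 0.0})
trace = contributions({"show_time": 1.0, "director": 0.0, "profile": 0.5})
```

The score returned by `fuse` is what the judging module 409 would compare with the set threshold. Because the model is linear before the sigmoid, `contributions` shows how much each dimension pushed the score, which is one way to read the interpretability claimed for the fusion model.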

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method for matching periodic entities among movie and television attribute data sources is characterized by comprising the following steps:

acquiring a first data source;

2. The method of claim 1, wherein the obtaining the first data source further comprises:

3. The method according to claim 1, wherein said sequentially calculating the dimensional similarity of the first record and the second record corresponding to the first record in each candidate record pair comprises:

4. The method according to claim 3, wherein said sequentially calculating the show time similarity between the first record and the second record corresponding to the first record in each candidate record pair specifically comprises:

if the years are the same, determining that the year similarity is 1;

if the years are different, determining that the year similarity is 0;

if the months are the same, determining that the similarity of the months is 1;

if the days are the same, determining that the day similarity is 1;

if the days are different, determining that the day similarity is 0;

5. The method according to claim 3, wherein said sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair comprises:

if the directors are the same, determining that the director similarity is 1;

6. The method according to claim 3, wherein said sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically comprises:

7. The method according to claim 3, wherein said sequentially calculating profile similarity of said first record and said second record corresponding to said first record in each of said candidate record pairs comprises:

8. The method according to claim 1, wherein the inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the candidate record pairs comprises:

9. The method of claim 1, wherein said determining whether the integrated similarity is greater than a predetermined threshold further comprises:

10. A system for periodic entity matching between movie and television attribute data sources, the system comprising: