CN113901264A - Method and system for matching periodic entities among movie and television attribute data sources - Google Patents

Info

Publication number
CN113901264A
Authority
CN
China
Prior art keywords
record
similarity
candidate
pair
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111339282.4A
Other languages
Chinese (zh)
Inventor
赵春光
李凯东
林桢杰
陈珊珊
李孟禹
赵亦喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central Video Financial Media Development Co ltd
Original Assignee
Central Video Financial Media Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central Video Financial Media Development Co ltd
Priority to CN202111339282.4A
Publication of CN113901264A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for periodic entity matching among movie and television attribute data sources, wherein the method comprises the following steps: adding a plurality of first records to a first data source; acquiring a first index and a second index constructed for a second data source; searching for the title and the alias of each first record in the indexes to obtain a plurality of candidate record pairs; acquiring from the second data source the second records involved in the candidate record pairs; sequentially calculating the similarity of the first record and the second record in each candidate record pair in each dimension; inputting the similarity of each dimension into a similarity fusion model to obtain a comprehensive similarity; and if the comprehensive similarity is greater than the threshold value, determining that the first record and the second record in the candidate record pair are successfully matched, updating the entity matching state dictionary of the first record to matched, and storing the successfully matched candidate record pair. The invention can efficiently complete the entity matching task under limited training data resources, computing resources and storage resources, and supports the interpretability of the matching result.

Description

Method and system for matching periodic entities among movie and television attribute data sources
Technical Field
The invention relates to the technical field of entity matching, in particular to a method and a system for periodically matching entities among movie and television attribute data sources.
Background
Entity matching is a task at the intersection of knowledge graphs and natural language processing. It addresses the problem of knowledge fusion, i.e., the process of finding all record pairs that represent the same entity among homogeneous or heterogeneous data sources (or among knowledge graphs), or of identifying the instances that map to a given real-world entity. For example, in an e-commerce scenario, it is determined whether commodities from two platforms correspond to the same commodity; in a movie and video recommendation scenario, it is determined whether two videos correspond to the same movie; in a knowledge graph fusion scenario, the mapping relationships of all entity pairs between two graphs must be found. These services can be generalized as a need to merge external data sources in order to supplement and extend internal data sources. When dealing with such needs, the primary task is to pair the records between the data sources. This pairing is achieved by calculating the similarity of attributes between the records. Therefore, the similarity calculation model determines the matching effect, in terms of accuracy, recall and the like, to a great extent. This task is referred to as an entity matching task because it assumes a one-to-one mapping between records and entities in the real world.
Entity matching specific to movie and television attribute data can be illustrated by the following example: consider movie attribute data sets A and B from two different sources. Any record a_i in set A maps to some movie entity e_k; if set B contains a record b_j that points to the same entity e_k, the record pair (a_i, b_j) is said to be successfully matched. The results of the entity matching task can then be applied to a variety of downstream tasks, such as completing or extending the attributes of a_i using b_j.
For entity matching on movie and television attribute data, a common approach in current solutions is based on deep learning. Although deep learning methods can solve problems in many general fields, deep learning models require a large amount of training data, and are therefore difficult to converge when training data resources are scarce. Furthermore, the lack of interpretability of deep models remains a problem. In addition, as the size of the data set to be matched increases, how to reduce the computing and storage resources must also be considered in the algorithm design. Finally, since the whole process often requires manual intervention and expert knowledge, reducing the manual workload is also a direction for improving entity matching methods. Based on this, there is a need in the art for a new entity matching method for scarce training data resources, which efficiently completes the entity matching task under limited computing resources, storage resources and manual labeling effort, and supports the interpretability of the matching result.
Disclosure of Invention
The invention aims to provide a method and a system for periodically matching entities among movie and television attribute data sources, which can efficiently complete entity matching tasks under limited computing resources, storage resources and training data resources and support the interpretability of matching results.
In order to achieve the purpose, the invention provides the following scheme:
a method for matching periodic entities among movie and television attribute data sources, comprising the following steps:
acquiring a first data source;
adding a plurality of first records to the first data source, and initializing an entity matching state dictionary of each first record to be unmatched; each first record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for alias attributes of a second data source; the second data source comprises a plurality of second records; each second record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
sequentially taking one first record, searching the title of the first record in the first index, and searching the alias of the first record in the second index to obtain a search result; the search result comprises one or more identification codes of the second record;
obtaining one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results;
acquiring a corresponding second record in the second data source according to the identification code of the second record in the candidate record pair to obtain a second record corresponding to the first record in the candidate record pair;
sequentially calculating the similarity of the first record and the second record corresponding to the first record in each candidate record pair in each dimension to obtain the similarity of each dimension; the similarity of each dimension comprises the similarity of showing time, the similarity of director and the similarity of brief introduction;
inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the first record and a second record corresponding to the first record in the candidate record pair; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model;
judging whether the comprehensive similarity is larger than a set threshold value or not;
and if the comprehensive similarity is larger than the set threshold, determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the candidate record pair successfully matched.
Optionally, before the acquiring the first data source, the method further includes:
constructing a first data source; the first data source includes a title attribute, an alias attribute, a show time attribute, a director attribute, a lead actor attribute, and a profile attribute of the movie.
Optionally, the sequentially calculating the similarity of each dimension of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:
sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair;
sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair;
sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair;
and sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair.
Optionally, the sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:
acquiring the year weight, month weight and day weight of the showing time; the sum of the year weight, the month weight and the day weight is 1;
comparing whether the year of the showing time in the first record is the same as the year of the showing time in the second record corresponding to the first record;
if the years are the same, determining that the year similarity is 1;
if the years are different, determining that the year similarity is 0;
comparing whether the month of the showing time in the first record is the same as the month of the showing time in the second record corresponding to the first record;
if the months are the same, determining that the month similarity is 1;
if the months are different, determining that the month similarity is 0;
comparing whether the day of the showing time in the first record is the same as the day of the showing time in the second record corresponding to the first record;
if the days are the same, determining that the day similarity is 1;
if the days are different, determining that the day similarity is 0;
and summing the product of the year similarity and the year weight, the product of the month similarity and the month weight and the product of the day similarity and the day weight to obtain the showing time similarity.
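The weighted sum described above can be written compactly as follows. The symbols are illustrative and do not appear in the source: s_y, s_m and s_d denote the year, month and day similarities, and w_y, w_m and w_d the corresponding weights.

```latex
s_{\mathrm{time}} = w_y\, s_y + w_m\, s_m + w_d\, s_d,
\qquad w_y + w_m + w_d = 1,
\qquad s_y, s_m, s_d \in \{0, 1\}
```

For instance, with assumed weights w_y = 0.6, w_m = 0.3 and w_d = 0.1, two showing times that agree on year and month but not on day give s_time = 0.6 + 0.3 + 0 = 0.9.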
Optionally, the sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:
comparing whether the director in the first record is the same as the director in the second record corresponding to the first record;
if the directors are the same, determining that the director similarity is 1;
and if the directors are not the same, determining that the director similarity is 0.
Optionally, the sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:
expressing the feature of the lead actor in the first record by using a TF-IDF algorithm to obtain a first lead actor feature;
expressing the feature of the lead actor in the second record corresponding to the first record by using a TF-IDF algorithm to obtain a second lead actor feature;
comparing the first lead actor feature with the second lead actor feature by using a cosine similarity algorithm to obtain a lead actor comparison result;
and carrying out normalization processing on the lead actor comparison result to obtain the lead actor similarity.
Optionally, the sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:
using LSI algorithm to express the characteristics of the brief introduction in the first record to obtain first brief introduction characteristics;
using LSI algorithm to express the characteristics of the brief introduction in the second record corresponding to the first record, and obtaining second brief introduction characteristics;
comparing the first brief introduction characteristic with the second brief introduction characteristic by using a cosine similarity algorithm to obtain a brief introduction comparison result;
and carrying out normalization processing on the comparison result of the brief introduction to obtain the similarity of the brief introduction.
Optionally, the inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the candidate record pair specifically includes:
splicing the showing time similarity, the director similarity and the introduction similarity into a long vector;
inputting the long vector into a multilayer perceptron model, and performing dimensionality reduction and feature fusion on the long vector to obtain a low-dimensional vector;
and inputting the low-dimensional vector into a logistic regression model, and performing feature fusion on the low-dimensional vector to obtain comprehensive similarity.
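The fusion step can be sketched as a small feed-forward computation: the per-dimension similarities are concatenated into one vector, passed through a perceptron layer that reduces it to a lower-dimensional vector, and then through a logistic regression that outputs the comprehensive similarity. This is a minimal sketch only. The layer sizes, the ReLU activation and every weight value below are assumptions chosen for illustration; the source does not specify them, and a real model would learn its weights from labeled record pairs. The example vector holds four values, following the four per-dimension similarities listed earlier in the disclosure.

```python
import math

def dense(vec, weights, bias):
    """One fully connected layer: each weight row yields one output value."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def fuse(similarities, mlp_w, mlp_b, lr_w, lr_b):
    """Perceptron layer (dimensionality reduction) followed by logistic regression."""
    # ReLU activation is an assumption; the source does not name one.
    hidden = [max(0.0, h) for h in dense(similarities, mlp_w, mlp_b)]
    logit = sum(w * h for w, h in zip(lr_w, hidden)) + lr_b
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the score in (0, 1)

# Hypothetical, hand-picked weights: 4 inputs reduced to 2 hidden units.
MLP_W = [[1.0, 1.0, 0.5, 0.5], [-1.0, -1.0, -0.5, -0.5]]
MLP_B = [0.0, 3.0]
LR_W = [2.0, -2.0]
LR_B = -2.0

# Order assumed here: showing time, director, lead actor, brief introduction.
high = fuse([1.0, 1.0, 1.0, 1.0], MLP_W, MLP_B, LR_W, LR_B)
low = fuse([0.0, 0.0, 0.0, 0.0], MLP_W, MLP_B, LR_W, LR_B)
```

Because the inputs are named per-attribute similarities rather than opaque embeddings, a reviewer can inspect which attribute drove a given score.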
Optionally, after the determining whether the comprehensive similarity is greater than a set threshold, the method further includes:
and if the comprehensive similarity is not greater than the set threshold, determining that the first record in the candidate record pair fails to be matched with a second record corresponding to the first record.
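The two outcomes of the threshold comparison can be sketched as below. The function name, the state labels and the threshold value are hypothetical; only the strict "greater than" comparison, the state update and the storing of the matched pair come from the text.

```python
def apply_decision(pair, score, threshold, state, matched_pairs):
    """Record a successful match; leave the state untouched on failure."""
    first_id, _second_id = pair
    if score > threshold:              # strictly greater, as in the method
        state[first_id] = "matched"    # update the entity matching state dictionary
        matched_pairs.append(pair)     # store the successfully matched pair
        return True
    return False                       # matching failed; record stays unmatched

state = {"a1": "unmatched", "a2": "unmatched"}
matched_pairs = []
ok = apply_decision(("a1", "b7"), 0.93, 0.8, state, matched_pairs)
failed = apply_decision(("a2", "b9"), 0.80, 0.8, state, matched_pairs)
```

A record that fails every candidate pair keeps its unmatched state, so it is picked up again in the next matching cycle.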
The invention also provides the following scheme:
a system for periodic entity matching between movie and television attribute data sources, the system comprising:
the first data source acquisition module is used for acquiring a first data source;
a first record adding module, configured to add a plurality of first records to the first data source, and initialize the entity matching state dictionary of each first record to be unmatched; each first record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
the index acquisition module is used for acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for alias attributes of a second data source; the second data source comprises a plurality of second records; each second record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
a search result obtaining module, configured to sequentially take one of the first records, search for a title of the first record in the first index, and search for an alias of the first record in the second index, so as to obtain a search result; the search result comprises one or more identification codes of the second record;
a candidate record pair obtaining module, configured to obtain one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results;
a second record obtaining module, configured to obtain the corresponding second record in the second data source according to the identification code of the second record in the candidate record pair, so as to obtain the second record corresponding to the first record in the candidate record pair;
each dimension similarity calculation module is used for calculating the similarity of the first record and the second record corresponding to the first record in each candidate record pair in each dimension in sequence to obtain each dimension similarity; the similarity of each dimension comprises the similarity of showing time, the similarity of director and the similarity of brief introduction;
a comprehensive similarity obtaining module, configured to input the similarity of each dimension into a similarity fusion model, so as to obtain a comprehensive similarity between the first record and a second record corresponding to the first record in the candidate record pair; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model;
the judging module is used for judging whether the comprehensive similarity is greater than a set threshold value or not;
and the updating module is used for determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record when the comprehensive similarity is larger than the set threshold value according to the output result of the judging module, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the candidate record pair which is successfully matched.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a method and a system for periodic entity matching among movie and television attribute data sources. The entity matching state dictionary of each first record added to the first data source is initialized to unmatched, and is updated to matched after the first record and the second record in a candidate record pair are successfully matched. New records are periodically acquired from the first data source, and because the state dictionary memorizes the previous matching states, it can be used to screen out and retain the records that are still unmatched. A multilayer perceptron model and a logistic regression model are adopted as the similarity fusion model: the similarities of multiple dimensions are calculated separately and then fused to obtain the comprehensive similarity, so the matching result is interpretable. In addition, the multilayer perceptron model and the logistic regression model are simple in structure and have few parameters, so the requirements for computing resources, storage resources and training data resources are significantly reduced, and the entity matching task can be completed efficiently under limited computing resources, storage resources and training data resources.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a method for matching periodic entities between movie and television attribute data sources according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a periodic entity matching process according to the present invention;
FIG. 3 is a schematic diagram of a film and television entity similarity calculation model according to the present invention;
FIG. 4 is a block diagram of an embodiment of a system for periodic entity matching between movie and television attribute data sources.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for periodically matching entities among movie and television attribute data sources, which can efficiently complete entity matching tasks under limited computing resources, storage resources and training data resources and support the interpretability of matching results.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a flowchart of an embodiment of a method for matching periodic entities between movie and television attribute data sources, and FIG. 2 is a schematic diagram of a periodic entity matching process according to the present invention. Referring to FIG. 1 and FIG. 2, the method for matching periodic entities between movie and television attribute data sources includes:
step 101: a first data source is obtained.
This step 101 is preceded by:
constructing a first data source; the first data source includes a title attribute, an alias attribute, a show time attribute, a director attribute, a lead actor attribute, and a profile attribute of the movie.
Step 102: adding a plurality of first records to a first data source, and initializing an entity matching state dictionary of each first record to be unmatched; each first record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.
The embodiment periodically acquires new records (first records) from the first data source (movie and television data source A, namely data source A in FIG. 2); for each newly added record, the entity matching state dictionary of the record is initialized to unmatched. Each record from data source A is then traversed, the records whose state in the state dictionary is unmatched are selected, and a target record list is generated. Data cleaning is performed on the records in the target record list, including sub-steps such as unifying data types and structures and removing duplicates. The attributes that are cleaned may include the title, alias, showing time, director, lead actor and brief introduction of the movie.
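The role of the state dictionary across cycles can be sketched as follows. The record layout (a dict with an "id" key), the state labels and the function names are assumptions made for illustration; the source only states that new records start as unmatched and that unmatched records are selected into the target record list.

```python
UNMATCHED = "unmatched"
MATCHED = "matched"

def register_new_records(state, records):
    """Initialize the state of newly added records to unmatched."""
    for record in records:
        # setdefault preserves the remembered state of records from earlier cycles
        state.setdefault(record["id"], UNMATCHED)

def build_target_list(state, records):
    """Select the records whose remembered state is still unmatched."""
    return [r for r in records if state.get(r["id"]) == UNMATCHED]

# "a1" was matched in an earlier cycle; "a2" is newly added in this cycle.
state = {"a1": MATCHED}
source_a = [
    {"id": "a1", "title": "Hero"},
    {"id": "a2", "title": "Farewell My Concubine"},
]
register_new_records(state, source_a)
targets = build_target_list(state, source_a)
```

Only "a2" reaches the later recall and similarity steps, which is how the periodic design avoids recomputing matches that are already settled.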
Step 103: acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for the alias attribute of the second data source; the second data source comprises a plurality of second records; each second record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.
Step 104: sequentially taking a first record, searching the title of the first record in a first index, and searching the alias of the first record in a second index to obtain a search result; the search results include the identification codes of the one or more second records.
The target record list obtained in the previous step is one of the inputs of step 104; each first record in the target record list is traversed and denoted a_i. The other input of step 104 is the index built from the title attribute and the alias attribute of the second data source (data source B in FIG. 2), which may also be replaced with the search system of data source B itself. This recall step searches the index of data source B for the title of a_i, resulting in a plurality of candidate record pairs (a_i, b_j), where b_j represents the identification code (id) of a second record in the second data source. Finally, all the candidate record pairs (a_i, b_j) are collected into a list as the output.
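The recall step can be sketched with two in-memory inverted indexes, one on titles and one on aliases, as in step 103. This is an illustrative sketch under stated assumptions: the record fields ("id", "title", "aliases"), the function names and the exact-match lookup on lower-cased strings are all hypothetical, and a production system might use a full-text search engine instead.

```python
from collections import defaultdict

def build_index(records, field):
    """Map each normalized value of `field` to the ids of records carrying it."""
    index = defaultdict(set)
    for rec in records:
        values = rec.get(field) or []
        if isinstance(values, str):    # a title is a single string
            values = [values]
        for value in values:
            index[value.strip().lower()].add(rec["id"])
    return index

def recall_candidates(first_record, title_index, alias_index):
    """Return candidate pairs (first-record id, second-record id)."""
    hits = set(title_index.get(first_record["title"].strip().lower(), set()))
    for alias in first_record.get("aliases", []):
        hits |= alias_index.get(alias.strip().lower(), set())
    return [(first_record["id"], b_id) for b_id in sorted(hits)]

source_b = [
    {"id": "b1", "title": "Hero", "aliases": []},
    {"id": "b2", "title": "Ying Xiong (2002)", "aliases": ["Ying Xiong"]},
    {"id": "b3", "title": "House of Flying Daggers", "aliases": []},
]
title_index = build_index(source_b, "title")
alias_index = build_index(source_b, "aliases")

a_i = {"id": "a1", "title": "Hero", "aliases": ["Ying Xiong"]}
pairs = recall_candidates(a_i, title_index, alias_index)
```

Restricting the expensive similarity computation to recalled candidates avoids comparing every record of A with every record of B.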
Step 105: obtaining one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results.
Step 106: and acquiring a corresponding second record in the second data source according to the identification code of the second record in the candidate record pair to obtain a second record corresponding to the first record in the candidate record pair.
Step 107: sequentially calculating the similarity of the first record in each candidate record pair and the second record corresponding to the first record in each dimension to obtain the similarity of each dimension; the similarity of each dimension comprises the similarity of the showing time, the similarity of the director and the similarity of the brief introduction.
Step 107 specifically includes:
Sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair.
Sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair.
Sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair.
Sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair.
The method includes the following steps of sequentially calculating the mapping time similarity of a first record in each candidate record pair and a second record corresponding to the first record, and specifically includes:
acquiring the year weight, month weight and day weight of the showing time; the sum of the annual, monthly and daily weights is 1.
Comparing whether the year of the showing time in the first record is the same as the year of the showing time in the second record corresponding to the first record.
If the years are the same, the year similarity is determined to be 1.
If the years are different, the year similarity is determined to be 0.
Comparing whether the month of the showing time in the first record is the same as the month of the showing time in the second record corresponding to the first record.
If the months are the same, the month similarity is determined to be 1.
If the months are different, the month similarity is determined to be 0.
Comparing whether the day of the showing time in the first record is the same as the day of the showing time in the second record corresponding to the first record.
If the days are the same, the day similarity is determined to be 1.
If the days are different, the day similarity is determined to be 0.
Summing the product of the year similarity and the year weight, the product of the month similarity and the month weight, and the product of the day similarity and the day weight to obtain the showing time similarity.
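The weighted year/month/day comparison described above can be sketched in Python as follows. This is only an illustrative sketch: the function name and the default weights (0.5, 0.3, 0.2) are assumptions, since the text only requires that the three weights sum to 1.

```python
from datetime import date


def showing_time_similarity(first: date, second: date,
                            w_year: float = 0.5,
                            w_month: float = 0.3,
                            w_day: float = 0.2) -> float:
    """Weighted sum of exact year, month and day agreement.

    The default weights are illustrative assumptions; the only
    constraint taken from the text is that they sum to 1.
    """
    if abs(w_year + w_month + w_day - 1.0) > 1e-9:
        raise ValueError("year, month and day weights must sum to 1")
    year_sim = 1.0 if first.year == second.year else 0.0
    month_sim = 1.0 if first.month == second.month else 0.0
    day_sim = 1.0 if first.day == second.day else 0.0
    return year_sim * w_year + month_sim * w_month + day_sim * w_day
```

With these assumed weights, two records that agree only on the year receive a showing time similarity of 0.5, while records with identical dates receive 1.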
Sequentially calculating the director similarity of the first record in each candidate record pair and the second record corresponding to the first record specifically includes:
comparing whether the director in the first record is the same as the director in the second record corresponding to the first record.
If the directors are the same, the director similarity is determined to be 1.
If the directors are not the same, the director similarity is determined to be 0.
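A minimal Python sketch of the binary director comparison follows. It assumes that each record may carry several name variants for a director (for example a Chinese and an English name) and that names are compared after trimming whitespace and ignoring case; both the function names and this normalization are assumptions, not details given in the text.

```python
from typing import Iterable


def _normalize(name: str) -> str:
    # Collapse runs of whitespace and ignore case (assumed normalization).
    return " ".join(name.split()).casefold()


def director_similarity(names_a: Iterable[str], names_b: Iterable[str]) -> float:
    """Return 1.0 if any director name variant is shared, otherwise 0.0."""
    a = {_normalize(n) for n in names_a if n and n.strip()}
    b = {_normalize(n) for n in names_b if n and n.strip()}
    return 1.0 if a & b else 0.0
```

For example, a record listing "Zhang Yimou" and a record listing "zhang  yimou" would be treated as having the same director under this assumed normalization.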
Sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically includes:
representing the features of the lead actors in the first record by using a TF-IDF algorithm to obtain a first lead actor feature.
Representing the features of the lead actors in the second record corresponding to the first record by using a TF-IDF algorithm to obtain a second lead actor feature.
Comparing the first lead actor feature with the second lead actor feature by using a cosine similarity algorithm to obtain a lead actor comparison result.
Normalizing the lead actor comparison result to obtain the lead actor similarity.
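The TF-IDF and cosine comparison of lead actors can be illustrated with the pure-Python sketch below. Several details are assumptions: each actor name is treated as one term, the IDF statistics are computed over a corpus of actor lists supplied by the caller, a smoothed IDF formula is used, and the normalization step is a simple clamp to [0, 1] (the text does not specify the normalization; with non-negative TF-IDF weights the cosine already lies in that range).

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def build_idf(corpus: Sequence[Sequence[str]]) -> Tuple[Dict[str, float], float]:
    """Smoothed IDF per actor name, plus the IDF used for unseen names."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for actors in corpus:
        doc_freq.update(set(actors))
    idf = {name: math.log((1 + n_docs) / (1 + df)) + 1.0
           for name, df in doc_freq.items()}
    unseen_idf = math.log(1 + n_docs) + 1.0
    return idf, unseen_idf


def tfidf_vector(actors: Sequence[str], idf: Dict[str, float],
                 unseen_idf: float) -> Dict[str, float]:
    """Sparse TF-IDF vector keyed by actor name."""
    return {name: count * idf.get(name, unseen_idf)
            for name, count in Counter(actors).items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(weight * v.get(name, 0.0) for name, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lead_actor_similarity(actors_a: List[str], actors_b: List[str],
                          corpus: Sequence[Sequence[str]]) -> float:
    idf, unseen_idf = build_idf(corpus)
    score = cosine(tfidf_vector(actors_a, idf, unseen_idf),
                   tfidf_vector(actors_b, idf, unseen_idf))
    # Assumed normalization: clamp to [0, 1] to absorb rounding error.
    return min(1.0, max(0.0, score))
```

Identical actor lists score close to 1, disjoint lists score 0, and partially overlapping lists fall in between, with rarer actor names contributing more weight.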
Sequentially calculating the profile similarity of the first record in each candidate record pair and the second record corresponding to the first record specifically includes:
representing the features of the profile in the first record by using an LSI algorithm to obtain a first profile feature.
Representing the features of the profile in the second record corresponding to the first record by using an LSI algorithm to obtain a second profile feature.
Comparing the first profile feature with the second profile feature by using a cosine similarity algorithm to obtain a profile comparison result.
Normalizing the profile comparison result to obtain the profile similarity.
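For reference, the LSI-based comparison can be written out as below. This is a sketch under stated assumptions: X denotes a term-document matrix built from profile texts, k is the number of retained latent dimensions, d is the term vector of one profile, and the final rescaling to [0, 1] is one plausible choice of normalization (the text does not say which normalization is used; a rescaling is needed because LSI coordinates can be negative, so the cosine lies in [-1, 1]).

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{\top} d, \qquad
\cos(\hat{d}_1, \hat{d}_2)
  = \frac{\hat{d}_1^{\top} \hat{d}_2}{\lVert \hat{d}_1 \rVert \, \lVert \hat{d}_2 \rVert}, \qquad
s_{\text{profile}} = \frac{1 + \cos(\hat{d}_1, \hat{d}_2)}{2} \in [0, 1]
```

Here \hat{d}_1 and \hat{d}_2 are the first and second profile features, and s_profile is the resulting profile similarity.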
Step 108: inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the first record in the candidate record pair and the second record corresponding to the first record; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model.
Step 108 specifically includes:
splicing the showing time similarity, the director similarity and the profile similarity into a long vector.
Inputting the long vector into the multilayer perceptron model, and performing dimensionality reduction and feature fusion on the long vector to obtain a low-dimensional vector.
Inputting the low-dimensional vector into the logistic regression model, and performing feature fusion on the low-dimensional vector to obtain the comprehensive similarity.
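The forward pass of such a fusion model can be sketched in plain Python as follows. The layer sizes, the ReLU activation and the example weights are all assumptions made for illustration; in practice the parameters would be learned from labelled record pairs, and the text does not specify the architecture beyond a multilayer perceptron followed by logistic regression.

```python
import math
from typing import List, Sequence


def dense_relu(x: Sequence[float], weights: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """One fully connected layer with ReLU; one weight row per output unit."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def logistic(h: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic regression output in (0, 1)."""
    z = sum(w * v for w, v in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fused_similarity(similarities: Sequence[float],
                     mlp_w: Sequence[Sequence[float]], mlp_b: Sequence[float],
                     lr_w: Sequence[float], lr_b: float) -> float:
    """Per-dimension similarities -> low-dimensional vector -> fused score."""
    low_dim = dense_relu(similarities, mlp_w, mlp_b)
    return logistic(low_dim, lr_w, lr_b)


# Hypothetical parameters: 4 input similarities reduced to 2 hidden units.
MLP_W = [[1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
MLP_B = [0.0, 0.0]
LR_W = [2.0, 2.0]
LR_B = -4.0
```

With these hypothetical parameters, `fused_similarity([1.0, 1.0, 1.0, 1.0], MLP_W, MLP_B, LR_W, LR_B)` is well above 0.5, while an all-zero input yields a score well below 0.5.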
Steps 107 and 108 traverse the whole candidate record pair list and sequentially calculate, for each candidate pair (a_i, b_j), the similarity between a_i and its corresponding second record b_j in dimensions such as showing time, director and profile. The similarity of each dimension is input into the similarity fusion model as a feature. The model output is a value in the range [0,1] that represents the comprehensive similarity score of a_i and b_j in the record pair (a_i, b_j). Referring to fig. 3, the movie entity similarity calculation model includes the following modules:
Showing time similarity calculation: the year, month and day are compared separately; each comparison is recorded as 1 if equal and 0 otherwise, and the weighted sum of the three results is calculated.
Director similarity calculation: 1 if the Chinese or English name matches, otherwise 0.
Lead actor similarity calculation: the cosine similarity between the lead actor sets of the record pair is calculated using TF-IDF (term frequency-inverse document frequency).
Profile similarity calculation: the cosine similarity between the semantic representations of the profile texts of the record pair is calculated using LSI (Latent Semantic Indexing).
Similarity fusion: the input similarities are spliced into a long vector, which is input into an MLP (multilayer perceptron) model and finally into a logistic regression (LR) model to obtain the comprehensive similarity.
Step 109: judging whether the comprehensive similarity is greater than a set threshold.
If the comprehensive similarity is greater than the set threshold, step 110 is executed: determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record, updating the entity matching state dictionary of the first record in the candidate record pair to matched, and storing the successfully matched candidate record pair.
If the comprehensive similarity is less than or equal to the set threshold, step 111 is executed: determining that the first record in the candidate record pair fails to match the second record corresponding to the first record.
In steps 109 to 111, it is determined whether a_i and b_j in the record pair (a_i, b_j) with the highest comprehensive score are successfully matched. When the comprehensive score is greater than a configurable threshold (the set threshold), a_i and b_j in (a_i, b_j) are considered a valid matching record pair; otherwise, they are considered an invalid match. Where quality requirements are high, the validity of the match may optionally be confirmed again manually. The successfully matched (a_i, b_j) data are accumulated and updated, the matching state dictionary of a_i is marked as matched, and the matching state dictionary of a_i is saved into a data table.
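The decision logic of steps 109 to 111 can be sketched as follows. The function name, the in-memory dictionary and list, and the state labels "matched" and "unmatched" are illustrative assumptions; the text only states that the highest-scoring pair is compared against a configurable threshold and that the state dictionary and the matched pairs are persisted.

```python
from typing import Dict, List, Optional, Tuple


def decide_match(first_id: str,
                 scored_candidates: Dict[str, float],
                 threshold: float,
                 match_state: Dict[str, str],
                 matched_pairs: List[Tuple[str, str]]) -> Optional[str]:
    """Keep the best-scoring candidate for a_i if it exceeds the threshold.

    scored_candidates maps second-record ids (b_j) to comprehensive scores.
    Returns the matched second-record id, or None when matching fails.
    """
    if not scored_candidates:
        return None
    best_id, best_score = max(scored_candidates.items(), key=lambda kv: kv[1])
    if best_score > threshold:  # strictly greater, as in step 109
        match_state[first_id] = "matched"
        matched_pairs.append((first_id, best_id))
        return best_id
    return None  # step 111: the state of a_i stays unmatched
```

A score exactly equal to the threshold is treated as a failed match, which mirrors the "less than or equal to" branch of step 111.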
The invention provides a periodic entity matching method among movie and television attribute data sources, whose entity alignment process is shown in figure 2. It relates to the intersection of knowledge graphs and natural language processing and belongs to the sub-field of entity matching. The periodic entity matching method among movie and television attribute data sources disclosed by the invention is a rule-based entity matching method and mainly targets entity matching of movie and television attribute data.
Compared with the prior art, the invention has the following advantages:
1. The method realizes periodic dynamic updating of entity matching across multiple data sources, and periodically updates and stores each matching state dictionary.
2. For movie and television entity matching, the important matching attributes are designed, namely title, alias, showing time, director and profile.
3. A multi-dimensional feature fusion model based on machine learning is designed, which automatically generates the similarity of candidate record pairs and judges whether a match is valid.
4. The method realizes entity matching across data sources and can serve downstream tasks such as data source fusion. The fused data can be further applied to content recommendation scenarios and e-commerce scenarios, and is not limited to a specific service scenario.
5. Updated records are retrieved periodically from data source A; the state dictionary remembers the previous matching states and is used to screen out and retain the unmatched records. This incremental update mode greatly reduces the amount of computation, since record pairs that were previously calculated or matched do not need to be recalculated.
6. Features of multiple dimensions are calculated and then fused to obtain the comprehensive similarity. The model is interpretable and easy to trace, has few parameters, and reduces the requirements on training data, computing resources and storage resources.
7. The title and alias of the second data source are used as indexes in which the title and alias of the first data source are searched, so that the recalled candidate record pairs still have a high recall rate under limited computing resources; similarity is then judged on more attribute features, which ensures high accuracy of the final result.
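Advantages 5 and 7 can be illustrated together with a small Python sketch of index-based candidate recall that skips already matched records. It is a simplification under stated assumptions: the indexes are exact-match dictionaries over normalized strings (a production system might use a full-text search engine instead), records are plain dictionaries with hypothetical "title" and "aliases" fields, and the state labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def _key(text: str) -> str:
    # Assumed normalization: collapse whitespace and ignore case.
    return " ".join(text.split()).casefold()


def build_indexes(second_records: Dict[str, dict]):
    """Build the title index and the alias index of the second data source."""
    title_index: Dict[str, Set[str]] = defaultdict(set)
    alias_index: Dict[str, Set[str]] = defaultdict(set)
    for rec_id, rec in second_records.items():
        title_index[_key(rec["title"])].add(rec_id)
        for alias in rec.get("aliases", []):
            alias_index[_key(alias)].add(rec_id)
    return title_index, alias_index


def recall_candidates(first_records: Dict[str, dict],
                      match_state: Dict[str, str],
                      title_index: Dict[str, Set[str]],
                      alias_index: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (first_id, second_id) candidate pairs for unmatched records only."""
    pairs: List[Tuple[str, str]] = []
    for first_id, rec in first_records.items():
        if match_state.get(first_id, "unmatched") == "matched":
            continue  # incremental update: previously matched records are skipped
        hits = set(title_index.get(_key(rec["title"]), set()))
        for alias in rec.get("aliases", []):
            hits |= alias_index.get(_key(alias), set())
        pairs.extend((first_id, second_id) for second_id in sorted(hits))
    return pairs
```

Only the recalled pairs are passed on to the more expensive per-dimension similarity calculation, which is where the reduction in computation comes from.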
FIG. 4 is a block diagram of an embodiment of a system for periodic entity matching between movie and television attribute data sources. Referring to fig. 4, the system for matching periodic entities between movie and television attribute data sources includes:
A first data source obtaining module 401, configured to obtain a first data source.
A first record adding module 402, configured to add a plurality of first records to a first data source, and initialize an entity matching state dictionary of each first record to be unmatched; each first record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.
An index obtaining module 403, configured to obtain a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for the alias attribute of the second data source; the second data source comprises a plurality of second records; each second record includes a title, an alias, a show time, a director, a lead actor, and a brief description of the movie.
A search result obtaining module 404, configured to sequentially take a first record, search for the title of the first record in the first index, and search for the alias of the first record in the second index to obtain a search result; the search result comprises the identification codes of one or more second records.
A candidate record pair obtaining module 405, configured to obtain one or more candidate record pairs according to the search result; the candidate record pair includes a first record and the identification code of a second record in the search result.
A second record obtaining module 406, configured to obtain a second record corresponding to the first record in the candidate record pair according to the identification code of the second record in the candidate record pair.
Each dimension similarity calculation module 407 is configured to calculate similarity of each dimension between the first record in each candidate record pair and the second record corresponding to the first record in sequence, so as to obtain similarity of each dimension; the similarity of each dimension comprises the similarity of the showing time, the similarity of the director and the similarity of the brief introduction.
A comprehensive similarity obtaining module 408, configured to input the similarity of each dimension into the similarity fusion model to obtain a comprehensive similarity between a first record in the candidate record pair and a second record corresponding to the first record; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model.
A judging module 409, configured to judge whether the comprehensive similarity is greater than a set threshold.
An updating module 410, configured to, when the output of the judging module is that the comprehensive similarity is greater than the set threshold, determine that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record, update the entity matching state dictionary of the first record in the candidate record pair to matched, and store the successfully matched candidate record pair.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A method for matching periodic entities among movie and television attribute data sources is characterized by comprising the following steps:
acquiring a first data source;
adding a plurality of first records to the first data source, and initializing an entity matching state dictionary of each first record to be unmatched; each first record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for alias attributes of a second data source; the second data source comprises a plurality of second records; each second record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
sequentially taking one first record, searching the title of the first record in the first index, and searching the alias of the first record in the second index to obtain a search result; the search result comprises one or more identification codes of the second record;
obtaining one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results;
acquiring a corresponding second record in the second data source according to the identification code of the second record in the candidate record pair to obtain a second record corresponding to the first record in the candidate record pair;
sequentially calculating the similarity of the first record and the second record corresponding to the first record in each candidate record pair in each dimension to obtain the similarity of each dimension; the similarity of each dimension comprises the similarity of showing time, the similarity of director and the similarity of brief introduction;
inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the first record and a second record corresponding to the first record in the candidate record pair; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model;
judging whether the comprehensive similarity is larger than a set threshold value or not;
and if the comprehensive similarity is larger than the set threshold, determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the candidate record pair successfully matched.
2. The method of claim 1, wherein the obtaining the first data source further comprises:
constructing a first data source; the first data source includes a title attribute, an alias attribute, a show time attribute, a director attribute, a lead actor attribute, and a profile attribute of the movie.
3. The method according to claim 1, wherein said sequentially calculating the dimensional similarity of the first record and the second record corresponding to the first record in each candidate record pair comprises:
sequentially calculating the showing time similarity of the first record and the second record corresponding to the first record in each candidate record pair;
sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair;
sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair;
and sequentially calculating the profile similarity of the first record and the second record corresponding to the first record in each candidate record pair.
4. The method according to claim 3, wherein said sequentially calculating the showing time similarity between the first record and the second record corresponding to the first record in each candidate record pair specifically comprises:
acquiring the year weight, month weight and day weight of the showing time; the sum of the year weight, the month weight and the day weight is 1;
comparing whether the year of the showing time in the first record is the same as the year of the showing time in a second record corresponding to the first record;
if the years are the same, determining that the year similarity is 1;
if the years are different, determining that the year similarity is 0;
comparing whether the month of the showing time in the first record is the same as the month of the showing time in the second record corresponding to the first record;
if the months are the same, determining that the similarity of the months is 1;
if the months are different, determining that the similarity of the months is 0;
comparing whether the date of the showing time in the first record is the same as the date of the showing time in the second record corresponding to the first record;
if the days are the same, determining that the day similarity is 1;
if the days are different, determining that the day similarity is 0;
and summing the product of the year similarity and the year weight, the product of the month similarity and the month weight and the product of the day similarity and the day weight to obtain the showing time similarity.
5. The method according to claim 3, wherein said sequentially calculating the director similarity of the first record and the second record corresponding to the first record in each candidate record pair comprises:
comparing whether the director in the first record is the same as the director in the second record corresponding to the first record;
if the directors are the same, determining that the director similarity is 1;
and if the directors are not the same, determining that the director similarity is 0.
6. The method according to claim 3, wherein said sequentially calculating the lead actor similarity of the first record and the second record corresponding to the first record in each candidate record pair specifically comprises:
expressing the feature of the lead actor in the first record by using a TF-IDF algorithm to obtain a first lead actor feature;
expressing the feature of the lead actor in the second record corresponding to the first record by using a TF-IDF algorithm to obtain a second lead actor feature;
comparing the first and second lead-actor characteristics by using a cosine similarity algorithm to obtain a lead-actor comparison result;
and carrying out normalization processing on the lead actor comparison result to obtain the lead actor similarity.
7. The method according to claim 3, wherein said sequentially calculating profile similarity of said first record and said second record corresponding to said first record in each of said candidate record pairs comprises:
using LSI algorithm to express the characteristics of the brief introduction in the first record to obtain first brief introduction characteristics;
using LSI algorithm to express the characteristics of the brief introduction in the second record corresponding to the first record, and obtaining second brief introduction characteristics;
comparing the first brief introduction characteristic with the second brief introduction characteristic by using a cosine similarity algorithm to obtain a brief introduction comparison result;
and carrying out normalization processing on the comparison result of the brief introduction to obtain the similarity of the brief introduction.
8. The method according to claim 1, wherein the inputting the similarity of each dimension into a similarity fusion model to obtain the comprehensive similarity of the candidate record pairs comprises:
splicing the showing time similarity, the director similarity and the profile similarity into a long vector;
inputting the long vector into a multilayer perceptron model, and performing dimensionality reduction and feature fusion on the long vector to obtain a low-dimensional vector;
and inputting the low-dimensional vector into a logistic regression model, and performing feature fusion on the low-dimensional vector to obtain comprehensive similarity.
9. The method of claim 1, wherein said determining whether the integrated similarity is greater than a predetermined threshold further comprises:
and if the comprehensive similarity is not greater than the set threshold, determining that the first record in the candidate record pair fails to be matched with a second record corresponding to the first record.
10. A system for periodic entity matching between movie and television attribute data sources, the system comprising:
the first data source acquisition module is used for acquiring a first data source;
a first record adding module, configured to add a plurality of first records to the first data source, and initialize the entity matching state dictionary of each first record to be unmatched; each first record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
the index acquisition module is used for acquiring a first index and a second index of a second data source; the first index is an index constructed for the title attribute of the second data source; the second index is an index constructed for alias attributes of a second data source; the second data source comprises a plurality of second records; each second record comprises a title, an alias, a showing time, a director, a lead actor and a brief introduction of the movie;
a search result obtaining module, configured to sequentially take one of the first records, search for a title of the first record in the first index, and search for an alias of the first record in the second index, so as to obtain a search result; the search result comprises one or more identification codes of the second record;
a candidate record pair obtaining module, configured to obtain one or more candidate record pairs according to the search result; the candidate record pair includes a first record and an identification code of a second record in the search results;
a second record obtaining module, configured to obtain a second record corresponding to the first record in the candidate record pair according to the identification code of the second record in the candidate record pair, so as to obtain a second record corresponding to the first record in the candidate record pair;
each dimension similarity calculation module is used for calculating the similarity of the first record and the second record corresponding to the first record in each candidate record pair in each dimension in sequence to obtain each dimension similarity; the similarity of each dimension comprises the similarity of showing time, the similarity of director and the similarity of brief introduction;
a comprehensive similarity obtaining module, configured to input the similarity of each dimension into a similarity fusion model, so as to obtain a comprehensive similarity between the first record and a second record corresponding to the first record in the candidate record pair; the similarity fusion model comprises a multilayer perceptron model and a logistic regression model;
the judging module is used for judging whether the comprehensive similarity is greater than a set threshold value or not;
and the updating module is used for determining that the first record in the candidate record pair is successfully matched with the second record corresponding to the first record when the comprehensive similarity is larger than the set threshold value according to the output result of the judging module, updating the entity matching state dictionary of the first record in the candidate record pair to be matched, and storing the candidate record pair which is successfully matched.
CN202111339282.4A 2021-11-12 2021-11-12 Method and system for matching periodic entities among movie and television attribute data sources Pending CN113901264A (en)


Publications (1)

Publication Number Publication Date
CN113901264A true CN113901264A (en) 2022-01-07

Family

ID=79194129


Country Status (1)

Country Link
CN (1) CN113901264A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809117A (en) * 2014-01-24 2015-07-29 深圳市云帆世纪科技有限公司 Video data aggregation processing method, aggregation system and video searching platform
CN107635012A (en) * 2017-10-18 2018-01-26 中汇信息技术(上海)有限公司 A kind of match messages method, server and computer-readable recording medium
CN107748799A (en) * 2017-11-08 2018-03-02 四川长虹电器股份有限公司 A kind of method of multi-data source movie data entity alignment
CN108012192A (en) * 2017-12-25 2018-05-08 北京奇艺世纪科技有限公司 A kind of method and system of identification and the polymerization of video resource
CN109446399A (en) * 2018-10-16 2019-03-08 北京信息科技大学 A kind of video display entity search method
CN109582969A (en) * 2018-12-04 2019-04-05 联想(北京)有限公司 Methodology for Entities Matching, device and electronic equipment
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN111651972A (en) * 2020-05-06 2020-09-11 腾讯科技(深圳)有限公司 Entity alignment method, device, computer readable medium and electronic equipment
CN112256882A (en) * 2020-10-16 2021-01-22 美林数据技术股份有限公司 Multi-similarity-based cross-system network entity fusion method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊晶: "《甲骨学知识图谱构建方法研究》", 31 January 2019 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220107
