CN113971403A

CN113971403A - Entity identification method and system considering text semantic information

Info

Publication number: CN113971403A
Application number: CN202111116386.9A
Authority: CN
Inventors: 宗威; 林松涛; 李兵
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-09-23
Filing date: 2021-09-23
Publication date: 2022-01-25

Abstract

The invention belongs to the technical field of data cleaning and data integration, and discloses an entity identification method and system that take text semantic information into account. For record sets A and B to be identified, the method comprises: reading and preprocessing the data; creating an inverted index for the data set; loading an SBERT model; calculating the IDF values of the words in the data set; generating the record pairs to be matched; calculating record similarity; and processing and returning the identification result. Built on an inverted index and the SBERT model, the method quickly generates the record pairs to be matched by means of the inverted index and the IDF values of the words in the data source, which improves identification efficiency; it extracts the semantic information in the text records through the SBERT model and measures the similarity between records with cosine similarity, which improves identification accuracy. Compared with traditional entity identification methods, the method improves the recall of the entity identification results on a paper data set by about 20 percent and the precision by about 10 percent.

Description

Entity identification method and system considering text semantic information

Technical Field

The invention belongs to the technical field of data cleaning and data integration, and particularly relates to an entity identification method and system considering text semantic information.

Background

With the rapid development of information technology and the steady acceleration of informatization, the data acquisition and storage capacity of enterprises and institutions keeps growing. Their information systems hold large amounts of data of great potential value; to realize that value, massive, disordered data must be converted by data cleaning into high-quality data that is consistent and accurate.

Entity identification, also known as duplicate record detection or record linkage, is the process of identifying which records in a data set represent the same real-world entity. It was first applied in fields such as health care and the census. With the arrival of the big data era, it has become a key technology for improving data quality in data integration scenarios, where it addresses the identification of duplicate records and the matching of different descriptions of the same entity during data cleaning. Its application scenarios fall into two main categories: duplicate record detection within a single data source and entity record linkage across multiple data sources. A single data source accumulates redundant data through version replacement, incomplete deletion of information and similar problems, so entity identification is applied when the data of an information system is cleaned and mined. Entity record linkage across multiple data sources is applied in data integration scenarios.

At present, most entity identification methods adopt a "blocking + comparison" strategy: the records in the data set are divided into blocks according to some rule, so that records whose contents are similar under that rule fall into the same block. Similarity is then calculated only between records that are likely to correspond, usually as a weighted similarity that combines the attributes, structural features and so on of the records. The entity identification algorithm based on multiple mappings and their combination proposed by Andreas Thor of Leipzig University in Germany is a typical rule-based, weighted entity identification method.

Compared with calculating the similarity of every pair of records, the "blocking + comparison" approach improves computational efficiency to some extent, but problems remain. When records are identified, the choice of sorting rule has a great influence on the final identification result, and the specific rule has to be determined by an experienced professional. In addition, the block size is relatively fixed during comparison: if the blocks are too large, records with very little correlation are compared and unnecessary computation is added; if they are too small, similar records are not all contained in the same window and some similar records are missed. Calculating the similarity between two records requires a different weight for each attribute, and determining those weights also requires manual work. More importantly, existing methods make insufficient use of the semantic information in the text. These problems greatly hinder the practical application and development of entity identification. With the development of natural language processing, entity identification that incorporates text semantic information is becoming a new research direction. In 2019, Nils Reimers and Iryna Gurevych proposed the SBERT model, which extracts the semantic information in text well and opens up a new approach to designing entity identification methods based on semantic information.

Therefore, in view of the poor identification results, low efficiency, poor generality and insufficient use of semantic information of conventional methods, a new entity identification method is needed.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) In existing entity identification methods that adopt the "blocking + comparison" strategy, the choice of sorting rule has a great influence on the final identification result, and the specific rule has to be determined by an experienced professional.

(2) In existing entity identification methods the block size is relatively fixed during comparison. If the blocks are too large, records with very little correlation are compared and unnecessary computation is added; if they are too small, similar records are not all contained in the same window and some similar records are missed.

(3) Existing methods need a different weight for each attribute when calculating the similarity between two records, and determining the weights requires manual work. They also make insufficient use of the semantic information in the text, so identification results are poor, efficiency is low and generality is poor, which greatly hinders the practical application and development of entity identification.

With the advent of the big data era, large amounts of duplicate data appear in daily life and data integration scenarios arise more and more often, so entity identification methods are expected to be more accurate and more efficient. Meeting these expectations requires either optimizing the current "blocking + comparison" mechanism or exploring an entity identification method that leaves the old framework behind and adopts a completely new scheme. The invention therefore proposes an "inverted index + semantic comparison" framework: the inverted index replaces the blocking operation, and semantic information is added to the comparison step, which improves both the efficiency and the accuracy of entity identification. Efficient and accurate entity identification can consolidate duplicate data, reduce enterprise storage costs, match entities across different data sources and support data integration.

Disclosure of Invention

The invention provides an entity identification method and system considering text semantic information, and in particular an entity identification method and system based on an inverted index and a Sentence-BERT model, aimed at the problems of existing entity identification methods in the context of multi-source heterogeneous data integration.

The invention is realized as follows: for record sets A and B to be identified, an entity identification method considering text semantic information comprises the following steps:

Step one, data reading and preprocessing: the contents of record sets A and B are read respectively, and preprocessing operations such as word segmentation, spelling correction, lemmatization and stop-word removal are performed on the data contained in the records, generating record sets A and B consisting of words. This step extracts the words that carry actual semantic information, reduces the size of the inverted index built later, and improves the accuracy of keyword extraction;

Step two, creating an inverted index: the words in A are de-duplicated to generate a word dictionary, and the words in the word dictionary are used as index words to create an inverted index of record set A;

Step three, loading an SBERT model: a pre-trained SBERT model is loaded into the method for later use;

Step four, calculating IDF values: for each record in record set B, the IDF value of each of its words with respect to record set A is calculated, and the three words with the highest IDF values are selected to form a keyword set that represents the record. The IDF value reflects well how important a word is within a text collection, and replacing the record with its keyword set makes it easier to match against the index;

Step five, generating the record pairs to be matched: the keywords in the keyword set are matched exactly against all the index words; for each match, the record represented by the keywords is paired in turn with every record linked to the matched index word, forming the record pairs to be matched. The generated pairs all consist of records that are likely to refer to the same entity, and calculating similarity only for these pairs effectively reduces the amount of computation;

Step six, calculating record similarity: the two records of each pair to be matched are input into the SBERT model to generate sentence vectors containing semantic information, and the similarity of the two vectors is calculated with cosine similarity. Applying the preloaded SBERT model allows the semantic information contained in the sentences to be fully extracted;

Step seven, processing the corresponding records: record pairs whose similarity exceeds the threshold are judged to describe the same entity and are matched and linked; record pairs whose similarity does not exceed the threshold are judged to describe different entities;

Step eight, after step seven is finished, it is checked whether all records in record set B have been identified; if not, the method returns to step four for the unidentified records, until all records in record set B have been identified, completing the entity identification process for record sets A and B.
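The eight steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_embed` is a bag-of-words stand-in for the SBERT encoder (it carries no semantic information), and the stop-word list, the threshold of 0.8 and all function names are hypothetical. Record set A is assumed to be non-empty.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "in", "at", "a", "an", "of", "and", "for"}

def tokenize(text):
    """Step 1 (simplified): lowercase, split into words, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def toy_embed(text):
    """Stand-in for SBERT: a bag-of-words count vector, NOT a semantic embedding."""
    return Counter(tokenize(text))

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def identify(records_a, records_b, embed=toy_embed, threshold=0.8, k=3):
    # Step 2: inverted index over A (word -> ids of the records in A containing it).
    index = defaultdict(list)
    for a_id, text in enumerate(records_a):
        for word in set(tokenize(text)):
            index[word].append(a_id)
    links = {}
    # Steps 4-8: process every record of B until none is left.
    for b_id, text in enumerate(records_b):
        words = sorted(set(tokenize(text)))
        # Step 4: IDF with respect to A, with 1 added to the denominator.
        idf = {w: math.log(len(records_a) / (1 + len(index.get(w, [])))) for w in words}
        keywords = sorted(words, key=lambda w: -idf[w])[:k]
        # Step 5: candidates are the records of A linked to any keyword.
        candidates = sorted({a_id for w in keywords for a_id in index.get(w, [])})
        # Steps 6-7: compare vectors and keep the pairs above the threshold.
        vec_b = embed(text)
        links[b_id] = [a for a in candidates if cosine(embed(records_a[a]), vec_b) > threshold]
    return links

A = ["Efficient entity resolution for large databases",
     "Deep learning for image classification",
     "A survey of graph neural networks"]
B = ["Entity resolution in large databases", "Quantum computing basics"]
print(identify(A, B))
```

Passing a real sentence encoder as `embed` (and a matching `cosine` for dense vectors) would be the natural way to bring the sketch closer to the described method.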

Further, in step two, constructing the inverted index includes:

(1) Acquiring keywords and generating a word dictionary.

Keywords are acquired from the record set to be identified and formed into a word dictionary. In practice, one record is a character string. First, all words in the string are found, that is, word segmentation is performed: English records are split on spaces, and Chinese records are segmented with an existing word segmentation tool. After the segmentation result is obtained, stop-word removal is performed: words without practical meaning and all punctuation are removed from the result, and letter case, tense, inflection and plural forms are normalized, so that the words are uniformly converted to lowercase, simple present tense and singular form.

(2) Establishing the inverted index.

After the keywords are obtained, the inverted index is established: every keyword in a record is associated with the record number through a linked list. In the implementation of the inverted index, the index words and the record set are stored as a dictionary file and a position file respectively. The dictionary file stores each word appearing in the records and also keeps a pointer into the position file, through which the position information of the keyword can be found.
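As a minimal in-memory sketch of this structure, a Python dictionary can map each index word to the list of record numbers that contain it. The function name and the sample records are hypothetical, and a plain list stands in for the linked list mentioned above.

```python
from collections import defaultdict

def build_inverted_index(records: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each word to the numbers of the records that contain it."""
    index = defaultdict(list)
    for record_id, words in records.items():
        for word in set(words):          # de-duplicate words within one record
            index[word].append(record_id)
    return dict(index)

# Hypothetical preprocessed records of set A: record number -> words.
records_a = {
    0: ["entity", "resolution"],
    1: ["entity", "matching"],
    2: ["semantic", "matching"],
}
index = build_inverted_index(records_a)
print(index["matching"])
```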

Further, in step (1), the words without practical meaning (function words) in the result include "the", "in", "at" and the like.
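A simplified version of the preprocessing in step (1) might look like the sketch below. It covers only lowercasing, punctuation removal and stop-word removal for English text; spelling correction, lemmatization and Chinese word segmentation are left out because they need external tools. The stop-word list is a hypothetical, deliberately short one.

```python
import re

# Hypothetical short stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "in", "at", "a", "an", "of", "and", "for", "on", "to"}

def preprocess(record: str) -> list[str]:
    """Lowercase the record, split it on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    return [t for t in tokens if t not in STOP_WORDS]

words = preprocess("Entity Resolution in the Databases, at Scale!")
print(words)
```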

Further, in step four, the IDF value of a word t is calculated as follows:

$$\mathrm{IDF}(t) = \log \frac{|D|}{1 + |\{d \mid d \in D \wedge t \in d\}|}$$

where |D| is the total number of records in the record set and |{d | d ∈ D ∧ t ∈ d}| is the number of records containing the target word t. If the word does not occur in the record set, the denominator would be 0; to avoid this, 1 is uniformly added to the denominator of the formula.

Further, the main idea of IDF is as follows: the fewer records contain a word, that is, the less frequently the word appears, the larger its IDF value and the better the word distinguishes between categories.
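The IDF calculation and the selection of the three highest-scoring words can be sketched as follows, assuming the usual logarithmic form of IDF with 1 added to the denominator. The function names, the alphabetical tie-breaking and the sample index are illustrative assumptions.

```python
import math

def idf(word, inverted_index, total_records):
    """IDF of a word with respect to record set A, with 1 added to the denominator."""
    containing = len(inverted_index.get(word, []))
    return math.log(total_records / (1 + containing))

def top_keywords(words, inverted_index, total_records, k=3):
    """Return the k distinct words with the highest IDF (ties broken alphabetically)."""
    unique = sorted(set(words))
    return sorted(unique, key=lambda w: -idf(w, inverted_index, total_records))[:k]

# Hypothetical inverted index of a record set A with 4 records.
index = {"entity": [0, 1, 2, 3], "matching": [1, 2], "semantic": [2], "model": [0, 1, 3]}
keywords = top_keywords(["entity", "semantic", "matching", "model", "unseen"], index, 4)
print(keywords)
```

One consequence visible in the sketch: a word that never occurs in A receives the largest IDF value under this formula, although it cannot match any index word.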

Further, in step six, to calculate the record similarity, the record pair to be matched is input into the preloaded SBERT model. SBERT introduces a twin (Siamese) neural network on top of BERT: two networks sharing the same weights map the two inputs into a new space, yielding a pair of sample vectors, and the similarity of the samples is measured by the cosine of the angle between the two vectors. In this way the semantic information in the samples is fully considered, and entity identification of the records becomes more accurate.
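The cosine measure itself is simple to state in code. The sketch below works on plain lists of floats; in the described method the two vectors would be the sentence vectors produced by the SBERT model, whereas the vectors used here are made-up toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors standing in for two sentence embeddings.
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```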

Another object of the present invention is to provide an entity identification system applying the entity identification method considering text semantic information, the entity identification system comprising:

a data reading and preprocessing module, used for reading the contents of record sets A and B respectively and performing word segmentation, spelling correction, lemmatization and stop-word removal on the data contained in the records, to generate record sets A and B consisting of words;

an inverted index creation module, used for de-duplicating the words in A to generate a word dictionary, and creating the inverted index of record set A with the words in the dictionary as index words;

an SBERT model loading module, used for loading a pre-trained SBERT model into the method for later use;

an IDF value calculation module, used for calculating, for each record in record set B, the IDF value of each of its words with respect to record set A, and selecting the three words with the highest IDF values to form a keyword set that represents the record;

a to-be-matched record pair generation module, used for matching the keywords in the keyword set exactly against all the index words and, for each match, pairing the record represented by the keywords in turn with every record linked to the matched index word, to form the record pairs to be matched;

a record similarity calculation module, used for inputting the two records of each pair to be matched into the SBERT model to generate sentence vectors containing semantic information, and calculating the similarity of the two vectors with cosine similarity;

a record processing module, used for processing the corresponding records: record pairs whose similarity exceeds the threshold are judged to describe the same entity and are matched and linked, and record pairs whose similarity does not exceed the threshold are judged to describe different entities;

a record set detection module, used for checking, after the record processing module has finished, whether all records in record set B have been identified; if not, the unidentified records are returned to the IDF value calculation module, and the process is repeated until all records in record set B have been identified, completing the entity identification process for record sets A and B.

It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:

(1) data reading and preprocessing: reading the contents of record sets A and B respectively, and performing word segmentation, spelling correction, lemmatization and stop-word removal on the data contained in the records, to generate record sets A and B consisting of words;

(2) creating an inverted index: de-duplicating the words in A to generate a word dictionary, and creating an inverted index of record set A with the words in the word dictionary as index words;

(3) loading an SBERT model: loading a pre-trained SBERT model into the method for later use;

(4) calculating IDF values: for each record in record set B, calculating the IDF value of each of its words with respect to record set A, and selecting the three words with the highest IDF values to form a keyword set that represents the record;

(5) generating the record pairs to be matched: matching the keywords in the keyword set exactly against all the index words and, for each match, pairing the record represented by the keywords in turn with every record linked to the matched index word, to form the record pairs to be matched;

(6) calculating record similarity: inputting the two records of each pair to be matched into the SBERT model to generate sentence vectors containing semantic information, and calculating the similarity of the two vectors with cosine similarity;

(7) processing the corresponding records: judging record pairs whose similarity exceeds the threshold to describe the same entity and matching and linking them, and judging record pairs whose similarity does not exceed the threshold to describe different entities;

(8) after step (7) is completed, checking whether all records in record set B have been identified; if not, returning to step (4) for the unidentified records and repeating until all records in record set B have been identified, completing the entity identification process for record sets A and B.

It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform steps (1) to (8) set out above for the computer device.

Another object of the present invention is to provide an information data processing terminal for implementing the entity identification system.

Taking all the above technical solutions together, the invention has the following advantages and positive effects. The invention provides an entity identification method considering text semantic information, namely a method that identifies and merges similar contents of data sets based on an inverted index and a Sentence-BERT (SBERT for short) model. The method is implemented in the following steps: (1) data reading and preprocessing; (2) creating an inverted index for the data set; (3) loading an SBERT model; (4) calculating the IDF values of the words in the data set; (5) generating the record pairs to be matched; (6) calculating record similarity; (7) processing and returning the identification result. Through these seven steps, entity identification based on semantic information is realized. Analysis of the experimental data shows that the method achieves high precision and recall, links and cleans duplicate records efficiently and accurately, reduces manual involvement, and offers a high degree of automation, good identification results and practical value.

To address the repeated description of the same entity during data integration, the invention realizes entity identification by building an inverted index, calculating the IDF values of keywords, and calculating the similarity between records with the SBERT model. Based on the inverted index and the SBERT model, the record pairs to be matched are first generated quickly by means of the inverted index and the IDF values of the words in the data source, which improves identification efficiency; the semantic information in the text records is then extracted through the SBERT model and the similarity between records is calculated with cosine similarity, which improves identification accuracy. Together these achieve efficient and accurate entity identification.

In the process of carrying out entity identification on two record sets A and B to be identified, the invention is mainly characterized by the following points:

(1) An inverted index is applied: the contents of set A are reorganized in "index word - records" form by means of linked lists, and hashing is used when the record contents are accessed to increase the access speed.

(2) A keyword approach is adopted: the set of the three words with the highest IDF values in a record is used to represent the record, which extracts the main content of the record and has the effect of dimensionality reduction.

(3) When the similarity of two records is calculated, the input records are processed with the SBERT model to generate vectors carrying semantic information, and the similarity of the two vectors is calculated with cosine similarity.
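The first two points combine into the candidate-generation step, which can be sketched as below. A Python `dict` is a hash table, so each keyword lookup corresponds to the hash-based access mentioned in point (1). The function name, the record numbers and the sample index are hypothetical.

```python
def candidate_pairs(b_id, keywords, inverted_index):
    """Pair record b_id of set B with every record of A linked to one of its keywords."""
    seen = set()
    pairs = []
    for word in keywords:
        for a_id in inverted_index.get(word, []):   # hash lookup of the index word
            if a_id not in seen:                     # each record of A is paired once
                seen.add(a_id)
                pairs.append((a_id, b_id))
    return pairs

# Hypothetical inverted index of set A and the keyword set of record 7 of set B.
index = {"entity": [0, 1], "matching": [1, 2]}
print(candidate_pairs(7, ["matching", "entity", "unseen"], index))
```

Only the pairs returned here go on to the similarity calculation of point (3); records of A that share no keyword with the record from B are never compared.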

The invention has the following advantages:

(1) good entity identification effect

Compared with the conventional entity identification algorithms Sorted Neighborhood + SVM and Standard Blocking + SVM, the method improves the recall of the entity identification results on a paper data set by about 20 percent and the precision by about 10 percent.
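For reference, precision and recall over matched record pairs are conventionally computed as in the sketch below. The pair sets are made-up illustration data, not the experimental data behind the figures above, and the function name is hypothetical.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision and recall of predicted record pairs against the true pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up example: pairs are (record number in A, record number in B).
predicted = {(0, 10), (1, 11), (2, 12), (3, 13)}
gold = {(0, 10), (1, 11), (4, 14)}
print(precision_recall(predicted, gold))
```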

(2) High entity identification efficiency and short response time

The method does not need to perform repeated sorting and blocking operations, so the time complexity of the computation is greatly reduced and real-time performance is better. Over multiple tests on the data set, the average execution time of the algorithm was 331 seconds; the verification data set contains 66889 records, and the average time per record is given as 202 ms.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of an entity identification method according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of an entity identification method according to an embodiment of the present invention.

Fig. 3 is a block diagram of an entity identification system according to an embodiment of the present invention.

In the figure: 1. a data reading and preprocessing module; 2. an inverted index creation module; 3. an SBERT model loading module; 4. an IDF value calculation module; 5. a to-be-matched record pair generation module; 6. a record similarity calculation module; 7. a record processing module; 8. a record set detection module.

Fig. 4 is a diagram of the results of a test experiment on identification response time according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides an entity identification method and system considering text semantic information, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the entity identification method provided in the embodiment of the present invention includes the following steps:

S101, data reading and preprocessing: reading the contents of record sets A and B respectively, and performing word segmentation, spelling correction, lemmatization and stop-word removal on the data contained in the records, to generate record sets A and B consisting of words;

S102, creating an inverted index: de-duplicating the words in A to generate a word dictionary, and creating an inverted index of record set A with the words in the word dictionary as index words;

S103, loading an SBERT model: loading a pre-trained SBERT model into the method for later use;

S104, calculating IDF values: for each record in record set B, calculating the IDF value of each of its words with respect to record set A, and selecting the three words with the highest IDF values to form a keyword set that represents the record;

S105, generating the record pairs to be matched: matching the keywords in the keyword set exactly against all the index words and, for each match, pairing the record represented by the keywords in turn with every record linked to the matched index word, to form the record pairs to be matched;

S106, calculating record similarity: inputting the two records of each pair to be matched into the SBERT model to generate sentence vectors containing semantic information, and calculating the similarity of the two vectors with cosine similarity;

S107, processing the corresponding records: judging record pairs whose similarity exceeds the threshold to describe the same entity and matching and linking them, and judging record pairs whose similarity does not exceed the threshold to describe different entities;

S108, after S107 is finished, checking whether all records in record set B have been identified; if not, returning to S104 for the unidentified records and repeating until all records in record set B have been identified, completing the entity identification process for record sets A and B.

A schematic diagram of an entity identification method provided by the embodiment of the present invention is shown in fig. 2.

As shown in fig. 3, an entity identification system provided in an embodiment of the present invention includes:

a data reading and preprocessing module 1, used for reading the contents of record sets A and B respectively and performing word segmentation, spelling correction, lemmatization and stop-word removal on the data contained in the records, to generate record sets A and B consisting of words;

an inverted index creation module 2, used for de-duplicating the words in A to generate a word dictionary, and creating the inverted index of record set A with the words in the dictionary as index words;

an SBERT model loading module 3, used for loading a pre-trained SBERT model into the method for later use;

an IDF value calculation module 4, used for calculating, for each record in record set B, the IDF value of each of its words with respect to record set A, and selecting the three words with the highest IDF values to form a keyword set that represents the record;

a to-be-matched record pair generation module 5, used for matching the keywords in the keyword set exactly against all the index words and, for each match, pairing the record represented by the keywords in turn with every record linked to the matched index word, to form the record pairs to be matched;

a record similarity calculation module 6, used for inputting the two records of each pair to be matched into the SBERT model to generate sentence vectors containing semantic information, and calculating the similarity of the two vectors with cosine similarity;

a record processing module 7, used for processing the corresponding records: record pairs whose similarity exceeds the threshold are judged to describe the same entity and are matched and linked, and record pairs whose similarity does not exceed the threshold are judged to describe different entities;

a record set detection module 8, used for checking, after the record processing module has finished, whether all records in record set B have been identified; if not, the unidentified records are returned to the IDF value calculation module, and the process is repeated until all records in record set B have been identified, completing the entity identification process for record sets A and B.

The technical solution of the present invention is further described below with reference to specific examples.

Example 1

The embodiment of the invention provides an entity identification method based on an inverted index and a Sentence-BERT (SBERT for short) model. For record sets A and B to be identified, the method comprises the following steps:

(1) Data reading and preprocessing:

respectively reading the contents of the record sets A, B, and performing preprocessing operations such as word segmentation, spelling correction, part of speech restoration, word stop removal and the like on the data contained in the records to generate record sets A and B consisting of words;

(2) creating an inverted index:

the content of the words in A is subjected to de-duplication to generate a word dictionary, and the words in the word dictionary are used as index words to create an inverted index of the record set A;

(3) loading the SBERT model.

Loading the SBERT model trained on the network into the method for standby;

(4) calculating the IDF value:

calculating the IDF value of each word in each record in the record set B in the record set A, and selecting the first three words with the highest IDF score to form a keyword set to represent the record to which the word belongs;

(5) generating a record pair to be matched:

accurately matching the keyword set with all index words, and sequentially combining records represented by the keywords and all records linked by the index words into a group of two to-be-matched record pairs for matching results;

(6) calculating record similarity:

inputting the two records of each pair to be matched into the SBERT model to generate sentence vectors containing semantic information, and calculating the similarity of the two vectors using cosine similarity;

(7) processing the corresponding records:

judging record pairs whose similarity exceeds the threshold to be records describing the same entity and linking them to each other, and judging record pairs that do not exceed the threshold to be records describing different entities.

After step (7) is finished, it is detected whether all records in record set B have been matched. If not, the method returns to step (4) for the unidentified records and repeats until all records in record set B are identified, thereby completing the entity identification process for record sets A and B.

The method for constructing the inverted index in the step (2) provided by the embodiment of the invention comprises the following specific contents:

First, keywords are obtained and a word dictionary is generated. In practice, keywords are obtained from the record set to be recognized and form the word dictionary. Each record is a character string, so all words in the string are first found by word segmentation: English records are split on spaces, while Chinese records are segmented with an existing word segmentation tool. After segmentation, stop words are removed, eliminating words without practical meaning (such as "the", "in" and "at") and all punctuation marks. Case, tense, word form and plurality are then normalized, so that every word is converted to lowercase, simple present tense and singular form.

Second, the inverted index is established. Once the keywords are obtained, the inverted index is built by linking every keyword in a record to that record's number through a linked list. In the implementation, the index words and the record set are stored as a dictionary file and a position file, respectively. The dictionary file stores each word appearing in the records together with a pointer into the position file, through which the position information of the records containing the keyword can be found.
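The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the stop-word list is a tiny illustrative sample, spelling correction and lemmatization are omitted, an in-memory `dict` stands in for the dictionary file and position file, and the names `tokenize` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; a real list would be longer).
STOP_WORDS = {"the", "in", "at", "a", "an", "of", "and", "for", "on"}


def tokenize(record: str) -> list[str]:
    """Lowercase a record, split it into words, and drop stop words and punctuation."""
    words = re.findall(r"[a-z0-9]+", record.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_inverted_index(records: list[str]) -> dict[str, list[int]]:
    """Map each index word to the sorted list of record numbers that contain it."""
    postings = defaultdict(set)
    for record_id, record in enumerate(records):
        for word in tokenize(record):
            postings[word].add(record_id)
    return {word: sorted(ids) for word, ids in postings.items()}


records_a = [
    "Entity resolution in the data lake",
    "Efficient entity matching at scale",
]
index_a = build_inverted_index(records_a)
print(index_a["entity"])  # record numbers whose text contains "entity"
```

A Python `dict` is a hash table, which matches the fast-lookup role the description assigns to the dictionary file; the posting lists play the role of the linked lists.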

The IDF value of a specific word t in step (4) provided by the embodiment of the present invention is calculated as follows:

IDF(t) = log( |D| / (1 + |{d | d ∈ D, t ∈ d}|) )

where |D| is the total number of records contained in the record set and |{d | d ∈ D, t ∈ d}| is the number of records containing the target word t. Because the word may not appear in the record set at all, 1 is added to the number of records to avoid a zero denominator. The main idea of IDF is that the fewer records contain a word, that is, the less frequently the word appears, the larger its IDF value and the better the word distinguishes between categories.
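The IDF calculation and the selection of the three highest-scoring words can be illustrated with a short Python sketch. It assumes the natural logarithm (the description does not state the base), represents each record of A as a set of words, and uses the hypothetical names `idf` and `top_keywords`.

```python
import math


def idf(word: str, records: list[set[str]]) -> float:
    """IDF of `word` in `records`, with 1 added to the denominator as described above."""
    containing = sum(1 for words in records if word in words)
    return math.log(len(records) / (1 + containing))


def top_keywords(record_words: list[str], records_a: list[set[str]], k: int = 3) -> list[str]:
    """Return the k distinct words of a record with the highest IDF in record set A."""
    unique = list(dict.fromkeys(record_words))  # de-duplicate, keep first-seen order
    # sorted() is stable, so words with equal IDF keep their order in the record.
    return sorted(unique, key=lambda w: idf(w, records_a), reverse=True)[:k]


records_a = [
    {"entity", "resolution", "data"},
    {"entity", "matching", "scale"},
    {"data", "cleaning", "entity"},
]
print(top_keywords(["entity", "matching", "data", "scale"], records_a))
```

In this toy example "entity" occurs in every record of A, so it receives the lowest score and is not selected, which is the filtering behaviour the description attributes to IDF.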

In step (6) provided by the embodiment of the invention, record similarity is calculated by inputting each record pair to be matched into the preloaded SBERT model. SBERT introduces a Siamese (twin) neural network on the basis of BERT to further improve execution efficiency. A Siamese neural network maps the two inputs into a new space through two neural networks that share weights, yielding a pair of sample representations; the similarity of the samples is then measured by the cosine of the angle between the two representations. In this way the semantic information in the samples is fully considered, and entity identification of the records is achieved more accurately.
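The comparison step can be sketched as below. To keep the example self-contained, `toy_encode` is a placeholder bag-of-words encoder and is not SBERT; in practice an SBERT encoder (for example one loaded with the sentence-transformers package) would be passed as `encode`. The threshold of 0.8 and all function names are assumptions for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_encode(text: str, vocabulary: list[str]) -> list[float]:
    """Placeholder encoder: word counts over a fixed vocabulary (NOT SBERT)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def same_entity(text_a: str, text_b: str, encode, threshold: float = 0.8) -> bool:
    """Encode both records with the same encoder and compare against a threshold."""
    return cosine_similarity(encode(text_a), encode(text_b)) > threshold


vocab = ["entity", "resolution", "matching", "survey"]
encode = lambda text: toy_encode(text, vocab)
print(same_entity("entity resolution survey", "survey entity resolution", encode))
print(same_entity("entity resolution", "matching survey", encode))
```

Passing one `encode` function for both records mirrors the shared weights of the Siamese structure: both inputs are mapped into the same vector space before the cosine is taken.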

Example 2

The efficient entity recognition method provided by the invention, which fully considers text semantic information, is based on the inverted index and the SBERT model. First, the record pairs to be matched are generated quickly by means of the inverted index and the IDF values of the words in the data source, which improves recognition efficiency. Then, the semantic information in the text records is fully extracted by the SBERT model and the similarity between records is calculated using cosine similarity, which improves recognition accuracy. Together these achieve efficient and accurate entity recognition.

The entity identification method based on the inverted index and the SBERT model provided by the embodiment of the invention takes two record sets A and B to be identified as an example, and comprises the following steps:

1. Data reading and preprocessing. The record sets are read into the model and the contents read for each record are merged into one short text. The merged data are then preprocessed, including converting all words to lowercase, correcting misspelled words, removing stop words and removing punctuation marks. Record sets A and B consisting of individual words are generated.

2. Creating an inverted index. All words contained in A are taken as index words to create an inverted index of record set A. First, the index words are acquired and the word dictionary is generated: the contents of A are de-duplicated so that the dictionary consists of unique words. The inverted index is then created: for each word in the dictionary, the records containing that word are stored and linked to the word. During index generation, a hash table provides fast access to the data, and linked lists organize the relation between each index word and its corresponding records.

3. Loading the SBERT model. The trained SBERT model is loaded.

4. Calculating the IDF value. For each word of each record in record set B, the IDF score of the word in record set A is calculated. For each record in B, the three words with the highest IDF values are extracted as keywords, and the keyword set is used to represent the record.

5. Generating the record pairs to be matched. The keyword set extracted from B is exactly matched against the index words of record set A. For each keyword that matches an index word, the record represented by the keyword is paired with each record linked to the index word, generating the record pairs to be matched.

6. Calculating record similarity. The record pairs to be matched are input into the SBERT model to generate sentence vectors containing text semantic information, and the similarity of the sentence vectors is calculated using cosine similarity.

7. Processing the corresponding records. Record pairs whose similarity exceeds the threshold are judged to describe the same entity, and information such as the contents and positions of the two records is associated and linked; record pairs that do not exceed the threshold are judged to describe different entities.

After step 7 is finished, it is determined whether record set B still contains unidentified records. If so, the method returns to step 4 to identify them, until all contents of record set B have been compared, thereby completing the entity identification task for record sets A and B.
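Step 5 of the procedure above, the generation of record pairs to be matched, can be sketched as follows. The sketch assumes that the inverted index of A is a `dict` from index word to record numbers and that the keyword sets of B have already been extracted; `candidate_pairs` is a hypothetical name.

```python
def candidate_pairs(keywords_by_record: dict[int, list[str]],
                    index_a: dict[str, list[int]]) -> list[tuple[int, int]]:
    """Pair each record of B with every record of A that shares one of its keywords."""
    pairs = []
    for b_id, keywords in keywords_by_record.items():
        matched_a_ids = set()
        for keyword in keywords:
            # Exact match against the index words; unknown keywords yield no records.
            matched_a_ids.update(index_a.get(keyword, []))
        pairs.extend((a_id, b_id) for a_id in sorted(matched_a_ids))
    return pairs


index_a = {"entity": [0, 1], "matching": [1], "cleaning": [2]}
keywords_b = {0: ["matching", "entity", "graph"], 1: ["cleaning"], 2: ["graph"]}
print(candidate_pairs(keywords_b, index_a))
```

Only the pairs returned here would be passed to the SBERT comparison, so the number of expensive similarity computations depends on the index hits rather than on the product of the sizes of A and B. Record 2 of B shares no keyword with A and therefore produces no candidate pair.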

Example 3

The invention divides the overall flow of the entity recognition algorithm into three main stages: a preparation stage, a processing stage and a verification stage. The detailed processing steps of each stage are as follows.

(1) Preparation stage:

The preparation stage mainly comprises preprocessing the data and establishing the related indexes. First, it is determined whether a cache file exists; if so, the cache file is loaded. Then the original data file is read, the fields to be processed are merged and misspellings are corrected, the SBERT model is preloaded, and an inverted index comprising a dictionary file and a position file is created. Finally, the processing results and generated contents are written to the cache file for storage. The fields in the file set are merged for three reasons. First, the algorithm can then be applied flexibly to any data records consisting mainly of short text. Second, the generated sentence vectors cover the semantics of all fields, so the sentence vectors obtained from similar duplicate records are closer together. Finally, uniform merging accommodates differences between data sets, such as missing fields in a data table and differing field types, and simplifies the code and processing logic.
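The field merging and caching of the preparation stage might look like the sketch below. It rests on several assumptions: the field names are those of the experimental data sets described later in this document, the cache is a `pickle` file, a cache hit is taken to skip rebuilding (the text does not state this explicitly), and `merge_fields` and `load_or_build` are hypothetical names. Spelling correction, model loading and index creation are left out.

```python
import os
import pickle
import tempfile

FIELDS = ("title", "authors", "venue", "year")  # the id field is dropped


def merge_fields(row: dict) -> str:
    """Join the non-empty text fields of a row into one short text."""
    parts = (str(row.get(field) or "").strip() for field in FIELDS)
    return " ".join(part for part in parts if part)


def load_or_build(cache_path: str, rows: list[dict]) -> list[str]:
    """Load merged records from the cache file if it exists, else build and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    merged = [merge_fields(row) for row in rows]
    with open(cache_path, "wb") as f:
        pickle.dump(merged, f)
    return merged


rows = [{"id": "r1", "title": "Entity Resolution", "authors": "A. Author",
         "venue": None, "year": 2020}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prepared.pkl")
    first = load_or_build(path, rows)   # builds the records and writes the cache
    second = load_or_build(path, [])    # served from the cache file
print(first, first == second)
```

Because missing fields are simply skipped during merging, no per-field filling or null handling is needed downstream, which is the simplification the paragraph above describes.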

(2) Processing stage:

This stage comprises the following steps.

First, the data set B to be integrated is read and preprocessed with operations such as lemmatization and stop-word removal, and one record is taken out of data set B.

Second, the IDF value in data set A of each processed word of the record is calculated, the keywords that meet the IDF screening condition are retained, and a keyword set is generated.

Third, the keyword set is matched against the index words of data set A one by one under an exact-match rule, and all records of data set A linked to the matched index words are extracted to form a record set.

Fourth, the SBERT model is applied to calculate, in turn, the cosine similarity between each record in the record set generated from data set A and the record from data set B, so that the semantic information in the records is fully considered. Record pairs that meet the requirement are added to the result set and judged to describe the same entity; pairs that do not are judged dissimilar. These steps are repeated until all records in data set B have been matched.

(3) Verification stage:

In the verification stage, the algorithm is checked against the known similar entities given in the data set in order to verify its performance. When the method is applied to data integration, records judged to be dissimilar are integrated into a new file, and the index is dynamically updated.
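The dynamic index update mentioned for data integration could be sketched as below. The description gives no details of the update, so this is only one plausible reading: an unmatched record is appended to A and each of its distinct words is added to the in-memory index. The name `integrate_record` and the data layout are assumptions.

```python
def integrate_record(words: list[str], records_a: list[list[str]],
                     index_a: dict[str, list[int]]) -> int:
    """Append a record that matched nothing in A and index its words; return its number."""
    new_id = len(records_a)
    records_a.append(words)
    for word in dict.fromkeys(words):  # each distinct word once, in order
        index_a.setdefault(word, []).append(new_id)
    return new_id


records_a = [["entity", "resolution"]]
index_a = {"entity": [0], "resolution": [0]}
new_id = integrate_record(["entity", "graph", "graph"], records_a, index_a)
print(new_id, index_a)
```

Since the new record number is always the largest so far, appending keeps every posting list sorted without re-sorting.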

The technical solution of the present invention is further described below with reference to simulation experiments.

Experimental verification: the experiments use two data sets of academic paper records, DBLP-Scholar and DBLP-ACM. The DBLP-Scholar data set contains 66889 records in total with 5347 pairs of duplicate records, and the DBLP-ACM data set contains 4908 records in total with 2224 pairs of duplicate records. After preprocessing, each record in these data sets consists of the following five fields: id, title, authors, venue and year, each of which is mainly short text. To make fuller use of the data in each record, and to avoid separate filling or null-check handling for missing fields, the algorithm of the invention generates the final record by field merging: the id field is discarded and the title, authors, venue and year fields are merged into one field. The method is evaluated with three indexes: recall, precision and F₁ value (F₁-measure). They are calculated as follows:

Recall = TP/(TP+FN)×100% (1)

Precision = TP/(TP+FP)×100% (2)

F₁ = 2×Precision×Recall/(Precision+Recall) (3)

where TP is the number of duplicate records correctly identified by the algorithm, (TP+FP) is the total number of duplicate records identified by the algorithm, and (TP+FN) is the total number of duplicate records in the database. Recall is the proportion of the true duplicate records that the algorithm identifies correctly, precision is the proportion of the records identified as duplicates that are correct, and the F₁-measure is the harmonic mean of recall and precision. The recognition effects are shown in Tables 1 and 2.

Table 1. Comparison of recall and precision on the DBLP-Scholar data set

Table 2. Comparison of recall and precision on the DBLP-ACM data set

In addition to the recognition effect, the response time of recognition was also tested in the experiment; the specific results are shown in Fig. 4.
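The recall, precision and F₁ definitions used in the experiment can be computed from sets of record pairs, as in the following sketch. The pair identifiers are made-up toy values, not data from DBLP-Scholar or DBLP-ACM, and `evaluate` is a hypothetical name.

```python
def evaluate(predicted: set, actual: set) -> tuple[float, float, float]:
    """Return (recall, precision, F1) in percent for predicted vs. true duplicate pairs."""
    tp = len(predicted & actual)  # pairs that are both predicted and truly duplicate
    recall = 100.0 * tp / len(actual) if actual else 0.0
    precision = 100.0 * tp / len(predicted) if predicted else 0.0
    if recall + precision == 0.0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


actual = {(0, 0), (1, 2), (3, 4), (5, 5)}
predicted = {(0, 0), (1, 2), (7, 8)}
print(evaluate(predicted, actual))
```

In this toy case two of the four true pairs are found and two of the three predictions are correct, so recall is 50% and precision is about 66.7%.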

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product that includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above description only illustrates specific embodiments of the present invention and is not to be construed as limiting its scope. All modifications, equivalents and improvements within the spirit and scope of the invention are intended to be covered by the appended claims.

Claims

1. An entity recognition method considering text semantic information, characterized in that, for record sets A and B to be recognized, the entity recognition method considering text semantic information comprises the following steps:

step one, data reading and preprocessing: respectively reading the contents of record sets A and B, and performing word segmentation, spelling correction, lemmatization and stop-word removal on the data contained in the records to generate a record set A and a record set B consisting of words;

step four, calculating an IDF value: calculating, for each word of each record in record set B, its IDF value in record set A, and selecting the three words with the highest IDF scores to form a keyword set that represents the record to which the words belong;

step five, generating record pairs to be matched: exactly matching the keyword set against all index words and, for each match, pairing the record represented by the keywords with each record linked to the matched index word, to form the record pairs to be matched;

step six, calculating record similarity: inputting the two records of each pair to be matched into the SBERT model to generate sentence vectors containing semantic information, and calculating the similarity of the two vectors using cosine similarity;

2. The entity recognition method considering text semantic information according to claim 1, wherein in step two, the inverted index construction comprises:

(1) obtaining keywords and generating a word dictionary;

acquiring keywords from the record set to be recognized, the keywords forming a word dictionary, wherein each record is a character string; first finding all words in the character string by word segmentation, wherein English records are split on spaces and Chinese records are segmented by means of an existing word segmentation tool; after the word segmentation result is obtained, performing stop-word removal to remove words without practical meaning and all punctuation marks from the result, and normalizing case, tense, word form and plurality so that the words are uniformly converted to lowercase, simple present tense and singular form;

(2) establishing an inverted index;

establishing the inverted index after the keywords are obtained, and linking all keywords in a record to the record number through a linked list; in the implementation of the inverted index, the index words and the record set are stored as a dictionary file and a position file, respectively; the dictionary file stores each word appearing in the records and also keeps a pointer into the position file, through which the position information of the records corresponding to the keyword is found.

3. The entity recognition method considering text semantic information according to claim 2, wherein in step (1), the words without practical meaning in the result comprise "the", "in" and "at".

4. The entity recognition method considering text semantic information according to claim 1, wherein in step four, the IDF value for a word t is calculated as follows:

IDF(t) = log( |D| / (1 + |{d | d ∈ D, t ∈ d}|) )

where |D| is the total number of records contained in the record set and |{d | d ∈ D, t ∈ d}| is the number of records containing the target word t; because the word may not appear in the record set, 1 is uniformly added to the denominator of the formula to avoid a zero denominator.

5. The entity recognition method considering text semantic information according to claim 4, wherein the main idea of the IDF is: the fewer records contain a word, that is, the less frequently the word appears, the larger its IDF value and the better the word distinguishes between categories.

6. The entity recognition method considering text semantic information according to claim 1, wherein in step six, the record similarity is calculated by inputting the record pair to be matched into the preloaded SBERT model, SBERT introducing a Siamese (twin) neural network on the basis of BERT; the Siamese neural network maps the inputs to a new space through two neural networks sharing weights to obtain a pair of sample representations, and the similarity of the samples is measured by calculating the cosine of the angle between the pair.

7. An entity recognition system applying the entity recognition method considering text semantic information according to any one of claims 1 to 6, the entity recognition system comprising:

the IDF value calculation module, used for calculating, for each word of each record in record set B, its IDF value in record set A, and selecting the three words with the highest IDF scores to form a keyword set that represents the record to which the words belong;

the to-be-matched record pair generation module, used for exactly matching the keyword set against all index words and, for each match, pairing the record represented by the keyword set with each record linked to the matched index word, to form the record pairs to be matched;

8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:

10. An information data processing terminal characterized by being configured to implement the entity identification system according to claim 7.