CN111414759A - Method and system for entity disambiguation - Google Patents


Publication number
CN111414759A
Authority
CN
China
Prior art keywords
clustering
entity
word vector
vector data
class
Prior art date
Legal status
Pending
Application number
CN202010169248.6A
Other languages
Chinese (zh)
Inventor
齐云飞
付骁弈
张杰
Current Assignee
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202010169248.6A priority Critical patent/CN111414759A/en
Publication of CN111414759A publication Critical patent/CN111414759A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The embodiment of the invention discloses a method and a device for entity disambiguation, applied to a distributed platform. The method comprises the following steps: dividing the word vector data of an entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain a class representation vector for each class in each part; and clustering all class representation vectors of all parts together again to obtain a final clustering result, wherein different parts of the word vector data are clustered by different nodes in the distributed platform. Entity disambiguation is thus completed through two-stage clustering on the distributed platform, meeting the requirement of disambiguating a large number of entity words.

Description

Method and system for entity disambiguation
Technical Field
The present invention relates to the field of natural language processing, and more particularly, to a method and apparatus for entity disambiguation.
Background
Information extraction is an important task in natural language processing, and is particularly important in the background of information explosion at present. Among the many entities extracted, merging words of similar meaning is an important issue, which is called entity disambiguation.
In natural language processing, entities can be simply understood as nouns such as names of people, organizations and places, and any other object identified by a name; more broadly, also numbers, dates, currencies, addresses and the like. An entity may have multiple meanings; for example, the same entity may mean different things in different contexts. A human can intuitively determine the specific meaning of each entity, but a machine needs natural language processing technology to identify the specific meaning of each entity and to distinguish different entities, i.e., entity disambiguation technology.
At present, few machine learning algorithms can run in a distributed environment: only some simple machine learning algorithms have been ported to distributed computing platforms, while a great number of other algorithms still run in a single-machine environment. Running on a single machine, however, limits computing capability and computing speed, and cannot meet the requirement of performing entity disambiguation on a large number of entity words. It can be said that the prior art offers no solution for entity disambiguation over a large number of entity words.
Disclosure of Invention
In view of this, an embodiment of the present invention provides an entity disambiguation method applied to a distributed platform, including:
dividing word vector data of an entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain class representation vectors of each class in each part;
clustering all class representation vectors of all parts together again to obtain a final clustering result;
and clustering different parts of word vector data by different nodes in the distributed platform.
The embodiment of the invention also provides a device for entity disambiguation, which is applied to a distributed platform and comprises the following steps:
the first clustering unit is used for dividing the word vector data of the entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain a class representation vector of each class in each part;
the second clustering unit is used for clustering all the class representation vectors of all the parts together again to obtain a final clustering result;
and clustering different parts of word vector data by different nodes in the distributed platform.
An embodiment of the present invention further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor implements the method of entity disambiguation described above.
An embodiment of the present invention further provides a computer-readable storage medium, where an information processing program is stored on the computer-readable storage medium, and when the information processing program is executed by a processor, the information processing program implements the steps of the entity disambiguation method.
According to the technical scheme provided by the embodiment of the invention, the entity disambiguation is completed by utilizing the distributed platform through secondary clustering, and the requirement of carrying out entity disambiguation on a large number of entity words can be met.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a flow chart illustrating a method for entity disambiguation according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for entity disambiguation according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of data flow according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for entity disambiguation according to another embodiment of the present invention;
fig. 5 is a schematic structural diagram of an apparatus for entity disambiguation according to an embodiment of the present invention.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
Fig. 1 is a schematic flowchart of a method for entity disambiguation according to an embodiment of the present invention, where the method is applied to a distributed platform, and as shown in fig. 1, the method includes:
step 101, dividing word vector data of an entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain class representation vectors of each class in each part;
Step 102, clustering all class representation vectors of all parts together again to obtain a final clustering result;
and clustering different parts of word vector data by different nodes in the distributed platform.
Optionally, the clustering the word vector data of each part to obtain a class representation vector of each class in each part includes:
the following operations are respectively carried out on the word vector data of each part:
calculating the similarity between every two entities in the part by adopting a similarity algorithm;
clustering according to the similarity between every two entities in the part by using a clustering algorithm;
and respectively adding all the word vector data in each class obtained by clustering in the part and then averaging to obtain the class representation vector of each class in the part.
Optionally, the clustering all the class representation vectors of all the parts together again to obtain a final clustering result includes:
calculating the similarity between each pair of classes by adopting a similarity algorithm;
clustering again according to the similarity between each pair of classes by using a clustering algorithm;
and adding all the word vector data in each class obtained by the re-clustering and averaging, to obtain the class representation vector of each re-clustered class.
Optionally, before dividing the word vector data of the entity to be disambiguated, the method further comprises:
screening original entity word vector data according to a preset screening rule to obtain word vector data of the entity to be disambiguated;
optionally, before the original entity word vector data is filtered according to the preset filtering rule, the method further includes:
and identifying original entity word vector data from the original data by using an entity identification NER model.
Optionally, the screening the original entity word vector data according to a preset screening rule to obtain the word vector data of the entity to be disambiguated includes:
inputting the original entity word vector data into a Kafka queue of the distributed platform; and
reading the original entity word vector data from the Kafka queue by using a Flink computing engine of the distributed platform, screening the data according to the preset screening rule to obtain the word vector data of the entity to be disambiguated, and storing the result in a distributed file system (HDFS) of the distributed platform.
Optionally, the clustering algorithm is a DBSCAN algorithm, and the similarity algorithm is a cosine similarity algorithm.
According to the technical scheme provided by the embodiment of the invention, the entity disambiguation is completed by utilizing the distributed platform through secondary clustering, and the requirement of carrying out entity disambiguation on a large number of entity words can be met.
Fig. 2 is a flowchart illustrating a method for entity disambiguation according to another embodiment of the present invention, the method being applied to a distributed platform.
As shown in fig. 2, the method includes:
step 201, recognizing original entity word vector data from original data by using an entity recognition NER model;
Named Entity Recognition (NER) refers to recognizing entities with specific meanings in text, including names of people, places and organizations, proper nouns, etc. Named entity recognition is an important basic tool in application fields such as information extraction, question answering systems, syntactic analysis, machine translation and Semantic Web-oriented metadata annotation, and plays an important role in putting natural language processing technology into practical use. Generally, the task of named entity recognition is to recognize named entities of three major categories (entities, times and numbers) and seven minor categories (person names, organization names, place names, times, dates, currencies and percentages) in the text to be processed.
The embodiment of the invention can combine entity disambiguation and a neural network, train an entity recognition NER model by using the neural network, and greatly improve the semantic description capability of the entity context.
At present, the underlying models of most mainstream entity disambiguation algorithms are based on the bag-of-words model, whose inherent limitations prevent these algorithms from fully exploiting the semantic information of the context, leaving considerable room for improving the disambiguation effect. Word embedding has been a hot topic in machine learning in recent years; its idea is to construct a distributed representation for each word, avoiding the lexical gap between words. A Neural Network (NN) is a complex network system formed by a large number of simple processing units connected widely to each other, and is a highly complex nonlinear dynamic learning system. Neural networks have large-scale parallel, distributed storage and processing, self-organizing, self-adaptive and self-learning capabilities, and are particularly suited to imprecise and fuzzy information-processing problems that must consider many factors and conditions simultaneously. For example, the convolutional neural network, a branch of neural network models, can effectively capture local features and then model them globally. If a convolutional neural network is used to model word embeddings, semantic features more effective than the bag-of-words model can be obtained. Moreover, based on the ideas of local perception and weight sharing, the number of parameters in a convolutional neural network model is greatly reduced and training is fast.
Alternatively, the entity recognition NER model may be any NER model in the prior art; for example, the entity recognition model of the embodiment of the present invention may adopt a BERT (Bidirectional Encoder Representations from Transformers) model.
The entity word data identified by the NER model can only be screened and disambiguated in subsequent steps if it is represented as word vectors. Therefore, in addition to identifying each entity word, the NER model identification process also needs to obtain a word vector for it. For each entity, the vector of each character in the entity is first obtained, then the average of these character vectors is computed and used as the word vector of the entity. For example, suppose the text to be processed is: Jobs (full name: Steve Jobs) went to Apple Inc. holding an apple of which he had eaten half. The entities extracted are "Jobs", "Steve Jobs", "Apple Inc." and "apple". The entity "Jobs" (乔布斯) consists of three characters whose vectors are (a1, a2, a3), (b1, b2, b3) and (c1, c2, c3), so the word vector of the entity "Jobs" is ((a1+b1+c1)/3, (a2+b2+c2)/3, (a3+b3+c3)/3).
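The averaging step above can be sketched as follows. The 3-dimensional character vectors and the helper name `entity_vector` are purely illustrative, not code from the patent:

```python
import numpy as np

def entity_vector(char_vectors):
    """Word vector of an entity: the element-wise average of the
    vectors of the characters that make up the entity word."""
    return np.mean(np.asarray(char_vectors, dtype=float), axis=0)

# Three illustrative character vectors for a three-character entity
chars = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]
vec = entity_vector(chars)  # ((1+4+7)/3, (2+5+8)/3, (3+6+9)/3) -> [4., 5., 6.]
```

The same call works for an entity of any length, since `np.mean(axis=0)` averages however many character vectors are supplied.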
Step 202, screening original entity word vector data according to a preset screening rule to obtain word vector data of the entity to be disambiguated;
the preset screening rule refers to a screening rule set according to actual conditions or a black and white list and the like. Because some service requirements may exist in the actual task, and in order to reduce the number of the entity words and increase the clustering speed, some screening rules or black and white lists need to be used for screening, filtering or adding the entity words. For example, when the entity word conforms to the filtering rule or appears in the black list, the entity word is directly discarded; for another example, when the entity word in the white list does not appear in the identified entity word data set, the entity word in the white list is directly added to the entity data.
In this embodiment, in order to improve the overall efficiency of the system as much as possible, a big data platform may be used for entity data acquisition. As shown in fig. 3, the identified entity word vector results are first input into a Kafka queue; a Flink Filter reads real-time entity content from the Kafka queue, screens it using a blacklist (BlackList), and writes the screened result into the HDFS (Hadoop Distributed File System).
Step 203, dividing the word vector data of the entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain a class representation vector of each class in each part;
the division is performed to reduce the data volume of single clustering of the distributed platform. The word vector data of the entity to be disambiguated may be divided equally. For example, the word vector data of the entity to be disambiguated may be divided into 3 on average. The data may be divided into categories, for example, large categories such as comment data and consumption data, or sub-categories under the large categories, and each sub-category data may be used as a separate part.
And different parts of the word vector data are clustered by different nodes in the distributed platform. A distributed platform usually comprises a plurality of nodes; one node can be responsible for clustering one part of the data, with multiple nodes each handling their own part, so that each node only needs to run the clustering task for a single part, which improves single-node clustering efficiency.
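The even division can be sketched as a round-robin split; the round-robin choice and the function name are assumptions for illustration, since the patent does not fix a specific partitioning scheme:

```python
def split_round_robin(items, n_parts):
    """Distribute word-vector records evenly across n_parts so that
    each distributed node can cluster one part."""
    parts = [[] for _ in range(n_parts)]
    for i, item in enumerate(items):
        parts[i % n_parts].append(item)
    return parts

parts = split_round_robin(list(range(7)), 3)  # [[0, 3, 6], [1, 4], [2, 5]]
```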
Optionally, the clustering the word vector data of each part to obtain a class representation vector of each class in each part includes:
the following operations are respectively carried out on the word vector data of each part:
calculating the similarity between every two entities in the part by adopting a similarity algorithm;
clustering according to the similarity between every two entities in the part by using a clustering algorithm;
and respectively adding all the word vector data in each class obtained by clustering in the part and then averaging to obtain the class representation vector of each class in the part.
The similarity calculation method may be any similarity calculation method in the prior art.
In this embodiment, a cosine similarity algorithm is used to calculate the similarity between each pair of entity words. For example, the word vector of entity A is [1, 2, 3, 4, 1] and the word vector of entity B is [1, 2, 3, 4, 3]. The cosine similarity is S = M/N, where M is the dot product of word vector A and word vector B, and N is the product of their norms:
M = 1×1 + 2×2 + 3×3 + 4×4 + 1×3 = 33
N = √(1²+2²+3²+4²+1²) × √(1²+2²+3²+4²+3²) = √31 × √39 ≈ 34.77
Finally, the cosine similarity S = 33/34.77 ≈ 0.949 is obtained.
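The cosine similarity computation can be reproduced with a short sketch in plain Python (the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))  # M: dot product of the vectors
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))  # N
    return dot / norms

s = cosine_similarity([1, 2, 3, 4, 1], [1, 2, 3, 4, 3])  # 33 / (sqrt(31) * sqrt(39))
```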
The clustering algorithm may be any algorithm that implements clustering in the prior art.
In this embodiment, the DBSCAN algorithm is used to implement clustering. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. When using DBSCAN, the distance calculation formula needs to be modified; in this embodiment it is based on the cosine similarity calculation above. The detailed clustering process is prior art and is not described here. Then the class representation vector of each cluster is calculated by adding all the entity word vectors in the class and averaging them to obtain the class average vector, which is used as the class representation vector of that class.
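A sketch of DBSCAN driven by cosine distance, using scikit-learn (the library choice is an assumption — the patent names no implementation). With `metric="cosine"`, DBSCAN measures distance as 1 − cosine similarity, matching the modified distance formula described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four toy 2-d "word vectors": two near [1, 0], two near [0, 1]
vectors = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.95]])

# eps is a cosine-distance threshold; min_samples=1 makes every point a core point
labels = DBSCAN(eps=0.05, min_samples=1, metric="cosine").fit_predict(vectors)
# points 0 and 1 end up in one cluster, points 2 and 3 in another
```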
For example, as shown in fig. 3, entity data is read from the HDFS and divided into Partitions 1, 2, 3 …, which are then clustered using DBSCAN 1, 2, 3 … respectively. After the algorithm completes, the clustering result can be output; for example, the clustering result for Partition 1 takes the form:
class 1: entity word 1, entity word 2, … …
Class 2: entity word 3, entity word 4, … …
Class 3: entity word 5, entity word 6, … …
Then all the entity word vectors in class 1 are added and averaged to obtain the class representation vector of class 1, all the entity word vectors in class 2 are added and averaged to obtain the class representation vector of class 2, all the entity word vectors in class 3 are added and averaged to obtain the class representation vector of class 3, and so on. As shown in fig. 3, the clustered data may be saved back to the HDFS.
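The per-class averaging can be sketched as below; the function name is a hypothetical helper, and noise points (DBSCAN label −1) are skipped since they belong to no class:

```python
import numpy as np

def class_representation_vectors(vectors, labels):
    """Average all entity word vectors assigned to each cluster label;
    the mean vector is that class's representation vector."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    return {lab: vectors[labels == lab].mean(axis=0)
            for lab in set(labels.tolist()) if lab != -1}  # skip DBSCAN noise

reps = class_representation_vectors([[1, 2], [3, 4], [10, 10]], [0, 0, 1])
# reps[0] -> [2., 3.],  reps[1] -> [10., 10.]
```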
And step 204, clustering all the class representation vectors of all the parts together again to obtain a final clustering result.
Optionally, the clustering all the class representation vectors of all the parts together again to obtain a final clustering result includes:
calculating the similarity between each pair of classes by adopting a similarity algorithm;
clustering again according to the similarity between each pair of classes by using a clustering algorithm;
and adding all the word vector data in each class obtained by the re-clustering and averaging, to obtain the class representation vector of each re-clustered class.
For example, all class data may be gathered on one node, or one or more nodes may jointly cluster it again using DBSCAN.
For example, suppose the original entities are divided into 3 parts and each part is clustered once into 3 classes, giving 9 classes in total. The cosine similarity between each pair of classes is then calculated, DBSCAN is run again with cosine similarity as its distance formula, and a new clustering result is obtained; assume the re-clustering yields 5 classes. As shown in fig. 3, the result of the previous clustering is read from the HDFS and DBSCAN is applied again to obtain the final clustering result FileResult.
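The two-stage pipeline can be sketched end to end. scikit-learn, the `eps` values, and the random toy data are assumptions for illustration; here the three partitions are clustered sequentially, whereas on the distributed platform each would run on its own node:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_represent(part, eps=0.3):
    """Stage 1 on one partition: DBSCAN with cosine distance, then the
    per-class mean vector as that class's representation vector."""
    labels = DBSCAN(eps=eps, min_samples=1, metric="cosine").fit_predict(part)
    return np.array([part[labels == lab].mean(axis=0)
                     for lab in sorted(set(labels))])

rng = np.random.default_rng(0)
partitions = [rng.random((20, 8)) for _ in range(3)]  # toy entity word vectors

stage1 = np.vstack([cluster_and_represent(p) for p in partitions])
# Stage 2: cluster all class representation vectors from all parts together
final_labels = DBSCAN(eps=0.3, min_samples=1, metric="cosine").fit_predict(stage1)
```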
Wherein, the above-mentioned re-clustering may be performed once, twice or more times until the service requirement is satisfied. Further, the clustering result can also be corrected by continuously adjusting the clustering parameters.
According to the technical scheme provided by the embodiment of the invention, two-stage clustering on a distributed platform accomplishes a task that cannot be performed on a single machine for large data sets, and improves clustering speed. Further, the DBSCAN algorithm can be used for new word discovery: since DBSCAN can detect outliers, outliers are treated as new words in this embodiment; for example, classes with low similarity to all other classes may be taken as new words. In this way, entity disambiguation over large data volumes is achieved through distributed clustering.
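The outlier-based new word discovery can be sketched as follows (label −1 is DBSCAN's noise marker; the helper name and sample words are illustrative):

```python
def new_word_candidates(entity_words, labels):
    """Entity words that DBSCAN marked as outliers (label -1) are
    treated as newly discovered words in this embodiment's scheme."""
    return [w for w, lab in zip(entity_words, labels) if lab == -1]

candidates = new_word_candidates(["Jobs", "qwzx-term", "apple"], [0, -1, 0])
# -> ["qwzx-term"]
```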
Fig. 4 is a flowchart illustrating a method for entity disambiguation according to another embodiment of the present invention, the method being applied to a distributed platform.
As shown in fig. 4, the method includes:
step 401, recognizing original entity word vector data from original data by using an entity recognition NER model obtained through neural network training;
the embodiment of the invention combines entity disambiguation and a neural network, trains an entity recognition NER model by using the neural network, and can greatly improve the semantic description capability of the entity context.
Named Entity Recognition (NER) refers to recognizing entities with specific meanings in text, including names of people, places and organizations, proper nouns, etc. Named entity recognition is an important basic tool in application fields such as information extraction, question answering systems, syntactic analysis, machine translation and Semantic Web-oriented metadata annotation, and plays an important role in putting natural language processing technology into practical use. Generally, the task of named entity recognition is to recognize named entities of three major categories (entities, times and numbers) and seven minor categories (person names, organization names, place names, times, dates, currencies and percentages) in the text to be processed.
At present, the underlying models of most mainstream entity disambiguation algorithms are based on the bag-of-words model, whose inherent limitations prevent these algorithms from fully exploiting the semantic information of the context, leaving considerable room for improving the disambiguation effect. Word embedding has been a hot topic in machine learning in recent years; its idea is to construct a distributed representation for each word, avoiding the lexical gap between words. A Neural Network (NN) is a complex network system formed by a large number of simple processing units connected widely to each other, and is a highly complex nonlinear dynamic learning system. Neural networks have large-scale parallel, distributed storage and processing, self-organizing, self-adaptive and self-learning capabilities, and are particularly suited to imprecise and fuzzy information-processing problems that must consider many factors and conditions simultaneously. For example, the convolutional neural network, a branch of neural network models, can effectively capture local features and then model them globally. If a convolutional neural network is used to model word embeddings, semantic features more effective than the bag-of-words model can be obtained. Moreover, based on the ideas of local perception and weight sharing, the number of parameters in a convolutional neural network model is greatly reduced and training is fast.
Alternatively, the entity recognition NER model may be any NER model in the prior art; for example, the entity recognition model of the embodiment of the present invention may adopt a BERT (Bidirectional Encoder Representations from Transformers) model.
The entity word data identified by the NER model can only be screened and disambiguated in subsequent steps if it is represented as word vectors. Therefore, in addition to identifying each entity word, the NER model identification process also needs to obtain a word vector for it. For each entity, the vector of each character in the entity is first obtained, then the average of these character vectors is computed and used as the word vector of the entity. For example, suppose the text to be processed is: Jobs (full name: Steve Jobs) went to Apple Inc. holding an apple of which he had eaten half. The entities extracted are "Jobs", "Steve Jobs", "Apple Inc." and "apple". The entity "Jobs" (乔布斯) consists of three characters whose vectors are (a1, a2, a3), (b1, b2, b3) and (c1, c2, c3), so the word vector of the entity "Jobs" is ((a1+b1+c1)/3, (a2+b2+c2)/3, (a3+b3+c3)/3).
Step 402, inputting the original entity word vector data into a Kafka queue of a distributed platform;
kafka is a distributed message publish-subscribe system designed to be fast, extensible, and persistent. Kafka, like other message publish-subscribe systems, maintains information for messages within topics. The producer writes data to the subject and the consumer reads data from the subject. Since Kafka is characterized by the support of, and is based on, distribution, the subject matter can also be partitioned and overlaid on multiple nodes.
In this embodiment, the entity recognition model may directly input the recognized entity result into the Kafka queue, and provide real-time stream data for the Flink computation engine.
Step 403, reading the original entity word vector data from the Kafka queue by using a Flink calculation engine of the distributed platform, screening the original entity word vector data according to a preset screening rule to obtain word vector data of the entity to be disambiguated, and storing the word vector data in a distributed file system HDFS of the distributed platform;
the preset screening rule refers to a screening rule set according to actual conditions or a black and white list and the like. Because some service requirements may exist in the actual task, and in order to reduce the number of the entity words and increase the clustering speed, some screening rules or black and white lists need to be used for screening, filtering or adding the entity words. For example, when the entity word conforms to the filtering rule or appears in the black list, the entity word is directly discarded; for another example, when the entity word in the white list does not appear in the identified entity word data set, the entity word in the white list is directly added to the entity data.
Flink is a distributed big data processing engine that can perform stateful computation over both finite and infinite data streams. It can be deployed in various cluster environments and performs fast computation over data of various sizes.
In order to improve the overall efficiency of the system as much as possible, in this embodiment a big data platform may be used for entity data acquisition. As shown in fig. 3, the data list, i.e. the identified entity results, is first input into a Kafka queue; a Flink Filter reads real-time entity content from the Kafka queue, performs the screening using a blacklist (BlackList), and writes the screened result into HDFS (Hadoop Distributed File System).
Step 404, dividing the word vector data of the entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain a class representation vector of each class in each part;
the division is performed to reduce the data volume of a single clustering run on the distributed platform. The word vector data of the entities to be disambiguated may be divided equally; for example, it may be divided into three equal parts. Alternatively, the data may be divided by category, for example into large categories such as comment data and consumption data, or into sub-categories under those large categories, with each sub-category serving as a separate part.
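The equal division can be sketched as follows; the chunking scheme (contiguous, near-equal parts) is one reasonable choice for illustration, not mandated by the text:

```python
def split_into_parts(items, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, rem = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        parts.append(items[start:end])
        start = end
    return parts

print(split_into_parts(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each resulting part can then be handed to a different node for clustering.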
And different parts of the word vector data are clustered by different nodes in the distributed platform. A distributed platform usually comprises a plurality of nodes; one node can be responsible for clustering one part of the data, and several nodes can cluster several parts in parallel, so that each node only needs to handle the clustering task of one part, which improves single-node clustering efficiency.
Optionally, the clustering the word vector data of each part to obtain a class representation vector of each class in each part includes:
the following operations are respectively carried out on the word vector data of each part:
calculating the similarity between every two entities in the part by adopting a similarity algorithm;
clustering according to the similarity between every two entities in the part by using a clustering algorithm;
and respectively adding all the word vector data in each class obtained by clustering in the part and then averaging to obtain the class representation vector of each class in the part.
The similarity calculation method may be any similarity calculation method in the prior art.
In this embodiment, a cosine similarity algorithm is used to calculate the similarity between each two entity words. For example, the word vector of entity A is [1, 2, 3, 4, 1] and the word vector of entity B is [1, 2, 3, 4, 3]. The cosine similarity S is calculated by the formula S = M/N, where M is the dot product of word vector A and word vector B, and N is the product of their norms:

M = 1×1 + 2×2 + 3×3 + 4×4 + 1×3 = 33

N = √(1² + 2² + 3² + 4² + 1²) × √(1² + 2² + 3² + 4² + 3²) = √31 × √39 ≈ 34.77

Finally, the cosine similarity S = 33/34.77 ≈ 0.949 is obtained.
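This cosine similarity calculation can be reproduced with a short Python sketch:

```python
import math

def cosine_similarity(a, b):
    m = sum(x * y for x, y in zip(a, b))  # dot product M
    # N: product of the two vector norms
    n = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return m / n

A = [1, 2, 3, 4, 1]
B = [1, 2, 3, 4, 3]
print(round(cosine_similarity(A, B), 3))  # 0.949
```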
The clustering algorithm may be any algorithm that implements clustering in the prior art.
In this embodiment, the DBSCAN algorithm is used to implement clustering. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. When DBSCAN is used, the distance calculation formula needs to be modified; in this embodiment the distance is based on the cosine similarity calculation described above. The detailed clustering process is prior art and is not described here. Then, the class representation vector of each resulting cluster is calculated: all the entity word vectors in the class are added and averaged to obtain the class average vector, which is used as the class representation vector of that class.
For example, as shown in fig. 3, entity data is read from the HDFS, divided into Partitions 1, 2, 3 …, and each partition is then clustered separately using DBSCAN 1, 2, 3 …. After the algorithm completes, the clustering result can be output; for example, the clustering result of Partition 1 takes the following form:
class 1: entity word 1, entity word 2, … …
Class 2: entity word 3, entity word 4, … …
Class 3: entity word 5, entity word 6, … …
Then all the entity word vectors in class 1 are added and averaged to obtain the class representation vector of class 1; all the entity word vectors in class 2 are added and averaged to obtain the class representation vector of class 2; and so on for class 3 and the rest. As shown in fig. 3, the clustered data may be saved back to the HDFS.
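The per-partition step (DBSCAN under cosine distance, then averaging each cluster into a class representation vector) can be sketched as follows. This assumes scikit-learn and NumPy are available; the toy 2-dimensional vectors and the eps/min_samples values are illustrative, not the embodiment's actual settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_partition(vectors, eps=0.3, min_samples=2):
    """Cluster one partition; return {label: class representation vector}."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)
    reps = {}
    for label in set(labels):
        if label == -1:
            continue  # noise points (potential new words) get no class vector
        reps[label] = vectors[labels == label].mean(axis=0)
    return reps

# Two tight direction groups of toy entity word vectors in one partition.
part = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
reps = cluster_partition(part)
print(len(reps))  # 2 classes found in this partition
```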
And 405, clustering all the class expression vectors of all the parts together again to obtain a final clustering result.
Optionally, the clustering all the class expression vectors of all the parts together again to obtain a final clustering result includes:
calculating the similarity between each two classes by adopting a similarity algorithm;
clustering again according to the similarity between every two classes by using a clustering algorithm;
and adding all the word vector data in each class obtained by clustering again, and averaging to obtain the class representation vector of each class after clustering again.
For example, all the class representation vectors may be gathered onto the same node, or clustered again jointly through DBSCAN by the same node or by several nodes together.
For example, suppose the original entities are divided into 3 partitions and each partition is clustered once, yielding 3 classes each, i.e. 9 classes in total. The cosine similarity between each pair of classes is then calculated, and DBSCAN clustering is performed again with cosine similarity as its distance measure, yielding a new clustering result; suppose the 9 classes are merged into 5 classes, each with its own class representation vector. As shown in fig. 3, the result of the previous clustering is read from the HDFS, and clustering is performed again using DBSCAN to obtain the final clustering result file Result.
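The second-stage merge can be sketched the same way: gather the class representation vectors from all partitions and run DBSCAN over them once more. Again an illustrative sketch assuming scikit-learn and NumPy, with invented toy vectors and parameter values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def merge_class_vectors(class_vectors, eps=0.2, min_samples=1):
    """Re-cluster class representation vectors; min_samples=1 leaves no noise,
    so every first-stage class lands in some final cluster."""
    return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(class_vectors)

# 9 class vectors from 3 partitions (3 each); near-parallel vectors should merge.
classes = np.array([
    [1.0, 0.0], [0.98, 0.05], [0.0, 1.0],
    [0.05, 0.98], [1.0, 0.02], [0.7, 0.7],
    [0.71, 0.69], [0.0, 0.97], [0.72, 0.7],
])
labels = merge_class_vectors(classes)
print(len(set(labels)))  # 3 merged classes
```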
Wherein, the above-mentioned re-clustering may be performed once, twice, or multiple times until the service requirement is satisfied. Furthermore, the clustering result can also be corrected during the clustering process by continuously adjusting the clustering parameters.
According to the technical scheme provided by the embodiment of the invention, a task that cannot be handled by single-machine clustering of large data sets is accomplished by clustering twice on a distributed platform, and the clustering speed is improved. Furthermore, the DBSCAN algorithm can be used to implement a new-word discovery function: since DBSCAN can detect outliers, the outliers are treated as new words in this embodiment; for example, classes that are not highly related to other classes may be treated as new words. In this way, disambiguation of entities over large data volumes is achieved through distributed clustering.
Fig. 5 is a schematic structural diagram of an apparatus for entity disambiguation according to an embodiment of the present invention, the apparatus being applied to a distributed platform.
As shown in fig. 5, the apparatus includes:
the first clustering unit is used for dividing the word vector data of the entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain a class representation vector of each class in each part;
the second clustering unit is used for clustering all the class expression vectors of all the parts together again to obtain a final clustering result;
and clustering different parts of word vector data by different nodes in the distributed platform.
Optionally, the clustering the word vector data of each part to obtain a class representation vector of each class in each part includes:
the following operations are respectively carried out on the word vector data of each part:
calculating the similarity between every two entities in the part by adopting a similarity algorithm;
clustering according to the similarity between every two entities in the part by using a clustering algorithm;
and respectively adding all the word vector data in each class obtained by clustering in the part and then averaging to obtain the class representation vector of each class in the part.
Optionally, the clustering all the class expression vectors of all the parts together again to obtain a final clustering result includes:
calculating the similarity between each two classes by adopting a similarity algorithm;
clustering again according to the similarity between every two classes by using a clustering algorithm;
and adding all the word vector data in each class obtained by clustering again, and averaging to obtain the class representation vector of each class after clustering again.
Optionally, the apparatus further comprises:
and the screening unit is used for screening the original entity word vector data according to a preset screening rule to obtain the word vector data of the entity to be disambiguated.
Optionally, the apparatus further comprises:
and the identification unit is used for identifying original entity word vector data from the original data by using the entity identification NER model.
Optionally, the identifying unit is further configured to input the original entity word vector data into a Kafka queue of a distributed platform;
and the screening unit is used for reading the original entity word vector data from the Kafka queue by using a Flink calculation engine of the distributed platform, screening the original entity word vector data according to a preset screening rule to obtain the word vector data of the entity to be disambiguated and storing the word vector data in a distributed file system HDFS of the distributed platform.
Optionally, the clustering algorithm is a DBSCAN algorithm, and the similarity algorithm is a cosine similarity algorithm.
According to the technical scheme provided by the embodiment of the invention, the entity disambiguation is completed by utilizing the distributed platform through secondary clustering, and the requirement of carrying out entity disambiguation on a large number of entity words can be met.
An embodiment of the present invention further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the method of disambiguating an entity of any of the above.
An embodiment of the present invention further provides a computer-readable storage medium, where an information processing program is stored on the computer-readable storage medium, and when the information processing program is executed by a processor, the information processing program implements the steps of the entity disambiguation method described in any of the above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. A method for entity disambiguation, applied to a distributed platform, comprising:
dividing word vector data of an entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain class representation vectors of each class in each part;
clustering all class expression vectors of all parts together again to obtain a final clustering result;
and clustering different parts of word vector data by different nodes in the distributed platform.
2. The method of claim 1, wherein clustering the word vector data for each portion to obtain a class representation vector for each class in each portion comprises:
the following operations are respectively carried out on the word vector data of each part:
calculating the similarity between every two entities in the part by adopting a similarity algorithm;
clustering according to the similarity between every two entities in the part by using a clustering algorithm;
and respectively adding all the word vector data in each class obtained by clustering in the part and then averaging to obtain the class representation vector of each class in the part.
3. The method of claim 1, wherein clustering all class representation vectors of all parts together again to obtain a final clustering result comprises:
calculating the similarity between each two classes by adopting a similarity algorithm;
clustering again according to the similarity between every two classes by using a clustering algorithm;
and adding all the word vector data in each class obtained by clustering again, and averaging to obtain the class representation vector of each class after clustering again.
4. The method of claim 1, wherein prior to partitioning the word vector data for the entity to be disambiguated, the method further comprises:
and screening the original entity word vector data according to a preset screening rule to obtain the word vector data of the entity to be disambiguated.
5. The method of claim 4, wherein before the filtering the original entity word vector data according to the preset filtering rule, the method further comprises:
and identifying original entity word vector data from the original data by using an entity identification NER model.
6. The method according to claim 5, wherein the screening the original entity word vector data according to a preset screening rule to obtain the word vector data of the entity to be disambiguated comprises:
inputting the original entity word vector data into a Kafka queue of a distributed platform;
and reading the original entity word vector data from the Kafka queue by using a Flink calculation engine of the distributed platform, screening the original entity word vector data according to a preset screening rule to obtain the word vector data of the entity to be disambiguated, and storing the word vector data in a distributed file system (HDFS) of the distributed platform.
7. The method according to any one of claims 2 to 3,
the clustering algorithm is a DBSCAN algorithm, and the similarity algorithm is a cosine similarity algorithm.
8. An apparatus for entity disambiguation, applied to a distributed platform, comprising:
the first clustering unit is used for dividing the word vector data of the entity to be disambiguated into a plurality of parts, and clustering the word vector data of each part to obtain a class representation vector of each class in each part;
the second clustering unit is used for clustering all the class expression vectors of all the parts together again to obtain a final clustering result;
and clustering different parts of word vector data by different nodes in the distributed platform.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, implements a method of disambiguating an entity according to any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon an information processing program which, when executed by a processor, performs the steps of a method of entity disambiguation as claimed in any of claims 1 to 7.
CN202010169248.6A 2020-03-12 2020-03-12 Method and system for entity disambiguation Pending CN111414759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010169248.6A CN111414759A (en) 2020-03-12 2020-03-12 Method and system for entity disambiguation

Publications (1)

Publication Number Publication Date
CN111414759A true CN111414759A (en) 2020-07-14

Family

ID=71491043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010169248.6A Pending CN111414759A (en) 2020-03-12 2020-03-12 Method and system for entity disambiguation

Country Status (1)

Country Link
CN (1) CN111414759A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080065623A1 (en) * 2006-09-08 2008-03-13 Microsoft Corporation Person disambiguation using name entity extraction-based clustering
US20150269494A1 (en) * 2014-03-19 2015-09-24 Intelius Inc. Graph-based organization entity resolution
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN110781247A (en) * 2019-09-23 2020-02-11 华为技术有限公司 Vector clustering method, device and storage medium
CN110837516A (en) * 2019-11-07 2020-02-25 恩亿科(北京)数据科技有限公司 Data cutting and connecting method and device, computer equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
贺依依 (He Yiyi) et al.: "Secondary Clustering of Organization Names Based on Regional Constraints", Computer and Digital Engineering (《计算机与数字工程》) *

Similar Documents

Publication Publication Date Title
Kim et al. Transparency and accountability in AI decision support: Explaining and visualizing convolutional neural networks for text information
CN109739939A (en) The data fusion method and device of knowledge mapping
CN105893349B (en) Classification tag match mapping method and device
US20160217189A1 (en) Augmenting queries when searching a semantic database
CN109033200A (en) Method, apparatus, equipment and the computer-readable medium of event extraction
US9009029B1 (en) Semantic hashing in entity resolution
CN106844341A (en) News in brief extracting method and device based on artificial intelligence
CN110162637B (en) Information map construction method, device and equipment
US20140047089A1 (en) System and method for supervised network clustering
CN104268192A (en) Webpage information extracting method, device and terminal
CN109086265A (en) A kind of semanteme training method, multi-semantic meaning word disambiguation method in short text
CN110990560B (en) Judicial data processing method and system
Alexandridis et al. A knowledge-based deep learning architecture for aspect-based sentiment analysis
CN112241458B (en) Text knowledge structuring processing method, device, equipment and readable storage medium
CN105760363A (en) Text file word sense disambiguation method and device
CN110851797A (en) Block chain-based work creation method and device and electronic equipment
CN109472022A (en) New word identification method and terminal device based on machine learning
CN112015896A (en) Emotion classification method and device based on artificial intelligence
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
CN114818724A (en) Construction method of social media disaster effective information detection model
CN109992665A (en) A kind of classification method based on the extension of problem target signature
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
US20160292282A1 (en) Detecting and responding to single entity intent queries
CN111414759A (en) Method and system for entity disambiguation
CN114519106A (en) Document level entity relation extraction method and system based on graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination