CN112395407A - Method and device for extracting enterprise entity relationship and storage medium - Google Patents


Info

Publication number
CN112395407A
CN112395407A
Authority
CN
China
Prior art keywords
entity
inputting
layer
model
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011211617.XA
Other languages
Chinese (zh)
Other versions
CN112395407B (en)
Inventor
陈家银
陈曦
麻志毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Institute of Information Technology AIIT of Peking University, Hangzhou Weiming Information Technology Co Ltd filed Critical Advanced Institute of Information Technology AIIT of Peking University
Priority to CN202011211617.XA priority Critical patent/CN112395407B/en
Publication of CN112395407A publication Critical patent/CN112395407A/en
Application granted granted Critical
Publication of CN112395407B publication Critical patent/CN112395407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, an apparatus, and a storage medium for extracting enterprise entity relationships, wherein the method comprises the following steps: acquiring text data to be extracted; inputting the text data into the coding layer of a pre-trained entity recognition model to obtain encoded word vectors; inputting the word vectors into the first entity recognition layer of the entity recognition model to obtain a main entity that carries an entity relationship; and inputting the word vectors and the main entity into the second entity recognition layer of the entity recognition model to obtain the guest entity that has a corresponding relationship with the main entity. With the disclosed extraction method, unrelated entities are not identified at all; related entities are identified directly, which greatly reduces the noise introduced by negative samples and improves both the training efficiency and the recognition performance of the model.

Description

Method and device for extracting enterprise entity relationship and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for extracting an enterprise entity relationship, and a storage medium.
Background
Information extraction refers to the process of extracting entities, events, relationships, and other types of information from a passage of text to form structured data stored in a database for users to query and use. Relationship extraction is the core of information extraction and aims to discover semantic relationships among real-world entities. In recent years the technique has been widely applied in machine learning and natural language processing tasks; for example, relationships between enterprises can be mined from news text by means of information extraction, which has important application value in building a knowledge base of upstream and downstream relationships between enterprises.
In the prior art, entity recognition is performed by supervised learning: all entities are recognized first, and a classification model is then trained to label every pair of entities. This splits the relationship extraction task into two mutually independent steps, and the error propagated between the two models makes the final error excessively large. Moreover, in an actual enterprise relationship extraction task, a news text contains a large number of enterprise entities, so the number of unassociated negative samples is large, causing a serious noise problem that degrades the recognition of entity relationships. Finally, since the relationships of all possible pairs of enterprise entities must be judged in this scenario, the complexity of the model grows exponentially with the number of entities in the text, which greatly reduces the training efficiency and recognition performance of the model.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for extracting an enterprise entity relationship and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present disclosure provides a method for extracting an enterprise entity relationship, including:
acquiring text data to be extracted;
inputting text data into a coding layer of a pre-trained entity recognition model to obtain a coded word vector;
inputting the word vector into a first entity recognition layer of the entity recognition model to obtain a main entity containing an entity relationship;
and inputting the word vector and the main entity into a second entity recognition layer of the entity recognition model to obtain a guest entity having a corresponding relation with the main entity.
In one embodiment, before inputting the text data into the coding layer of the pre-trained entity recognition model, the method further includes:
and marking the text data according to an IOBES marking criterion to obtain marked text data.
In one embodiment, the coding layer is constructed from a BERT network model.
In one embodiment, the first entity identification layer is comprised of a first BILSTM network model and a first CRF network model.
In one embodiment, inputting the word vector into a first entity recognition layer of the entity recognition model to obtain a main entity containing entity relationships, includes:
inputting the word vector into a first BILSTM network model to obtain a forward hidden layer sequence and a backward hidden layer sequence;
combining the forward hidden layer sequence and the backward hidden layer sequence to obtain a word vector sequence;
and inputting the word vector sequence into the first CRF network model to obtain a position vector of the main entity containing the entity relationship.
In one embodiment, the second entity identification layer is comprised of a second BILSTM network model and a second CRF network model.
In one embodiment, inputting the word vector and the host entity into a second entity recognition layer of the entity recognition model to obtain a guest entity having a corresponding relationship with the host entity, includes:
and inputting the word vector sequence and the position vector of the main entity containing the entity relationship into a second CRF network model to obtain a guest entity having a corresponding relationship with the main entity.
In one embodiment, before inputting the text data into the coding layer of the pre-trained entity recognition model, the method further includes:
obtaining a loss function of the entity recognition model according to the probability map model;
and training the entity recognition model according to the loss function.
In a second aspect, an embodiment of the present disclosure provides an apparatus for extracting a business entity relationship, including:
the acquisition module is used for acquiring text data to be extracted;
the input module is used for inputting text data into a coding layer of a pre-trained entity recognition model to obtain a coded word vector;
the first extraction module is used for inputting the word vectors into a first entity recognition layer of the entity recognition model to obtain a main entity containing entity relations;
and the second extraction module is used for inputting the word vectors and the main entity into a second entity recognition layer of the entity recognition model to obtain the guest entity having a corresponding relation with the main entity.
In a third aspect, the disclosed embodiment provides a computer-readable medium, on which computer-readable instructions are stored, where the computer-readable instructions are executable by a processor to implement the method for extracting a business entity relationship provided in the foregoing embodiment.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the method for extracting the entity relationship of the enterprise provided by the embodiment of the disclosure, the entity identification model comprises a first entity identification layer and a second entity identification layer, the first entity identification layer directly identifies a main entity with the relationship, and the second entity identification layer identifies a guest entity with the corresponding relationship with the main entity. The entity relation extraction method provided by the embodiment of the disclosure does not identify unrelated entities, and directly identifies related entities, thereby greatly reducing noise influence caused by negative samples and improving the training efficiency and identification effect of the model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart illustrating a method for extraction of Business entity relationships, according to an example embodiment;
FIG. 2 is an exemplary diagram illustrating entity relationships in news text in accordance with one illustrative embodiment;
FIG. 3 is a flowchart illustrating a method for extraction of Business entity relationships, in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating an entity recognition model in accordance with an exemplary embodiment.
FIG. 5 is a block diagram illustrating an apparatus for extraction of business entity relationships, according to an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating an electronic device, according to an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.
Detailed Description
So that the manner in which the features and elements of the disclosed embodiments can be understood in detail, a more particular description of the disclosed embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. In the following description of the technology, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawing.
In the prior-art entity relation extraction methods, all entities are identified first; in an actual enterprise relationship extraction task, a news text contains a large number of enterprise entities and many irrelevant negative samples, causing a serious noise problem. FIG. 2 is an exemplary diagram illustrating entity relationships in news text in accordance with one illustrative embodiment. As shown in fig. 2, six business entities appear in the news text. Among them, "Yili", "Mengniu", "Wahaha", and "Tsingtao Beer" have no relationship, while "Smithfield" and "Wanzhou International" have a "subsidiary" relationship. Existing entity relationship extraction, however, judges the relationship of entities pair by pair, which causes many negative examples to be judged repeatedly, degrades the recognition effect, and brings unnecessary computation.
The business entity relationship extraction task can be described as the extraction of (s, r, o) triples, where s and o respectively denote the two business entities acting as subject (main entity) and object (guest entity) of a relationship, and r denotes the relationship between s and o. (s, r, o) can be read as: the r of s is o.
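As a concrete illustration, such a triple can be represented directly as a small data structure; the entity names below come from the figure descriptions later in this document, and the class and field names are illustrative:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """An (s, r, o) enterprise relationship triple."""
    subject: str   # main entity s
    relation: str  # relation r
    object: str    # guest entity o

# Read as: the `relation` of `subject` is `object`.
t = Triple("Alibaba", "subsidiary", "Ant Financial")
```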
The entity relation extraction method provided by the embodiments of the disclosure first identifies, from the input text, the s and r that carry a relationship, and then identifies o. Entities without any relationship are thus filtered out at the enterprise entity recognition step, reducing the influence of negative samples; and because o is identified given an already determined s and r, the recognition efficiency and accuracy of the model are greatly improved.
The following describes in detail an extraction method of business entity relationships provided in an embodiment of the present application with reference to fig. 1 to 4.
Referring to fig. 1, the method specifically includes the following steps.
S101, text data to be extracted are obtained.
The text data may be news text data, which often contains high-value information; for example, client information, peer information, and investment information can be mined from it. The text data to be extracted can be obtained from major news websites.
Further, after the text data to be extracted is obtained, the text data is labeled. In one possible implementation, the text data is labeled according to the IOBES labeling scheme to obtain labeled text data.
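A minimal sketch of such an IOBES labeler is shown below; the span format, tag names, and function name are illustrative assumptions, not taken from the patent:

```python
def iobes_tag(text, spans):
    """Label each character of `text` with an IOBES tag.

    `spans` is a list of (start, end, entity_type) character spans,
    with `end` exclusive. Characters outside any span get "O";
    a single-character entity gets "S-*"; longer entities get
    "B-*" ... "I-*" ... "E-*".
    """
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags
```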
S102, inputting the text data into a coding layer of a pre-trained entity recognition model to obtain a coded word vector.
A given text must first be mapped into a numerical vector space. Word vectors obtained by traditional one-hot encoding are high-dimensional and sparse, and word vectors produced by bag-of-words-style encoders such as word2vec and fastText do not capture global semantic information well. In order to mine the complex relationships between entities more accurately in the subsequent steps, the coding layer in the embodiments of the disclosure adopts a BERT (Bidirectional Encoder Representations from Transformers) network model. BERT is a deep bidirectional language model pre-trained on a large corpus; it learns latent semantic and contextual information in text well, encodes the input text more accurately, and performs well in many downstream tasks.
In one possible implementation, the labeled text data is input into the BERT network model: for a given text w = [w1, w2, w3, …, wn], where n is the length of the text, the encoded word vector e = [e1, e2, e3, …, en] is obtained.
Because a document contains many sentences, the document to be extracted is first segmented, mainly on punctuation; the text length max_seq_length is set to 100, so no sentence exceeds 100 characters. Other preprocessing includes merging runs of blank characters and runs of consecutive punctuation marks. Sentences exceeding the specified length, typically caused by the nonstandard or missing punctuation common in news text, are directly truncated. This step yields the encoded word vector e, namely:
e=BERT(w)
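The segmentation and normalization described above can be sketched as follows; the punctuation set and regular expressions are illustrative assumptions, while max_seq_length = 100 follows the text:

```python
import re

MAX_SEQ_LENGTH = 100  # per the description: sentences do not exceed 100 characters

def split_sentences(document):
    """Split a document into sentences on terminal punctuation,
    collapsing whitespace runs and repeated punctuation, and
    truncating any over-long sentence to MAX_SEQ_LENGTH characters."""
    # Merge multiple blank characters and repeated punctuation marks.
    document = re.sub(r"\s+", " ", document)
    document = re.sub(r"([。！？!?])\1+", r"\1", document)
    # Split after sentence-ending punctuation, keeping the delimiter.
    parts = re.split(r"(?<=[。！？!?])", document)
    sentences = []
    for part in parts:
        part = part.strip()
        if part:
            # Directly truncate sentences over the specified length
            # (e.g. caused by missing punctuation in news text).
            sentences.append(part[:MAX_SEQ_LENGTH])
    return sentences
```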
S103, the word vector is input into a first entity recognition layer of the entity recognition model to obtain a main entity containing the entity relationship.
The entity recognition model in the embodiment of the disclosure comprises two parts, namely a first entity recognition layer and a second entity recognition layer, wherein the task of the first entity recognition layer is to recognize a main entity containing a specific relationship for a given text, and the task of the second entity recognition layer is to recognize a guest entity having a corresponding relationship with the main entity.
Specifically, the first entity identification layer consists of a first BiLSTM network model and a first CRF network model. In one possible implementation, the word vector e is input into the first BiLSTM network model, which further characterizes the text, mainly to deepen the learning of contextual information; the hidden dimension is set to 128. This yields a forward hidden sequence $\overrightarrow{h} = (\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n)$ and a backward hidden sequence $\overleftarrow{h} = (\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n)$, which are concatenated into the word vector sequence $H = [\overrightarrow{h}; \overleftarrow{h}]$, namely:

H = BiLSTM(e)

The word vector sequence H is then input into the first CRF network model to identify the entity s and the relation r; specifically, the CRF network loops over the R types of the relation r, identifying s for each, which yields the main entity s containing the entity relationship, namely:

$s = \mathrm{CRF}_r(H), \quad r = 1, \ldots, R$
The position of the main entity s is then encoded to obtain the main entity position vector P. By this step, only main entities that carry a relationship are identified directly and entities without any relationship are filtered out, greatly reducing the influence of negative samples.
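The text does not specify the exact encoding of P; a common and minimal choice, assumed here, is a 0/1 indicator vector over the characters of the sentence marking the span of s:

```python
import numpy as np

def entity_position_vector(seq_len, start, end):
    """Return a 0/1 vector P of length `seq_len` marking the characters
    occupied by the main entity s (span [start, end), end exclusive)."""
    p = np.zeros(seq_len, dtype=np.float32)
    p[start:end] = 1.0
    return p

P = entity_position_vector(6, 1, 3)  # entity occupies characters 1 and 2
```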
S104, the word vectors and the main entity are input into a second entity recognition layer of the entity recognition model, and a guest entity having a corresponding relation with the main entity is obtained.
Specifically, the second entity identification layer consists of a second BiLSTM network model and a second CRF network model. To reduce error propagation between the two entity identification layers and allow better information interaction between the layers corresponding to different relations, the second BiLSTM network model shares parameters with the first; that is, the second layer directly reuses the word vector sequence H produced by the first BiLSTM network model. The main entity position vector P is then concatenated with H to form a new vector X = [H, P], which is input into the second CRF network model for identification, yielding the guest entity o having a corresponding relationship with the host entity, namely:

$o = \mathrm{CRF}_r([H, P])$
According to this step, a guest entity having a correspondence with the host entity can be identified, thus yielding the (s, r, o) triple.
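The splice X = [H, P] is a concatenation along the feature dimension; a sketch with illustrative shapes (sequence length n; hidden size 256 from the two 128-dimensional BiLSTM directions):

```python
import numpy as np

n, hidden = 6, 256  # sequence length; 2 x 128 from the bidirectional LSTM
H = np.random.randn(n, hidden).astype(np.float32)  # shared word vector sequence
P = np.zeros((n, 1), dtype=np.float32)
P[1:3] = 1.0  # main entity s occupies positions 1..2

# X = [H, P] is what the second CRF network model receives.
X = np.concatenate([H, P], axis=-1)
```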
In order to facilitate understanding of the method for extracting the business entity relationship provided in the embodiment of the present application, the following description is made with reference to fig. 3 and 4. As shown in fig. 3, the method mainly includes the following steps.
First, news text data is collected to generate a corpus. In one possible implementation, news data is obtained from major news websites, and the collected text data is then labeled according to the IOBES labeling scheme. Optionally, the labeled text data may be used to further train the BERT pre-training model of the coding layer below, to better fit the data distribution of news text.
Secondly, training an entity recognition model according to the labeled news text data, wherein the structure of the entity recognition model is shown in fig. 4 and comprises a coding layer, a first entity recognition layer (a front NER layer) and a second entity recognition layer (a rear NER layer) which are sequentially connected.
The labeled text is input into the coding layer, which consists of a BERT network model: for a given text w = [w1, w2, w3, …, wn], where n is the length of the text, the encoded word vector e = [e1, e2, e3, …, en] is obtained.
The encoded word vector e is input into the first entity recognition layer, which consists of the first BiLSTM network model and the first CRF network model. Inputting e into the first BiLSTM yields the forward hidden sequence $\overrightarrow{h}$ and the backward hidden sequence $\overleftarrow{h}$, which are concatenated into the word vector sequence $H = [\overrightarrow{h}; \overleftarrow{h}]$. H is input into the first CRF network model to identify the entity s and the relation r; specifically, the CRF network loops over the R types of the relation r to identify s, yielding the main entity s containing the entity relationship. The position of s is encoded to obtain the main entity position vector P. As shown in fig. 4, the first entity recognition layer recognizes the main entity "Alibaba" carrying a subsidiary relationship and the main entity "Ant Financial" carrying a cooperation relationship. The loss function of the first entity identification layer is:

$loss_s = -\sum_{r=1}^{R} \sum_{j=1}^{n} \log p\big(y_{i,j}^{s,r} \mid H_i\big)$

where $p(y_{i,j}^{s,r} \mid H_i)$ denotes the probability that the j-th character of document $w_i$ predicts the correct label of the host entity s under relation r, $H_i$ denotes the word vector sequence, and R denotes the number of categories of the relation r.
The second entity identification layer consists of the second BiLSTM network model and the second CRF network model. The main entity position vector P is concatenated with H to form the new vector X = [H, P], which is input into the second CRF network model for identification, yielding the guest entity o corresponding to the host entity. As shown in fig. 4, the second entity recognition layer recognizes the guest entity "Ant Financial" having the subsidiary relationship with the host entity "Alibaba", and recognizes "vivo" having the cooperation relationship with "Ant Financial". The loss function of the second entity identification layer is:

$loss_o = -\sum_{r=1}^{R} \sum_{j=1}^{n} \log p\big(y_{i,j}^{o,r} \mid H_i, p_k\big)$

where $p(y_{i,j}^{o,r} \mid H_i, p_k)$ denotes the probability that the j-th character of document $w_i$ predicts the correct label of the guest entity o under relation r, $H_i$ denotes the word vector sequence, R the number of categories of the relation r, and $p_k$ the position vector of the master entity.
In the embodiments of the disclosure, the loss function of the entity recognition model is derived from a probabilistic graphical model, and the entity recognition model is trained with this loss function. To extract all possible (s, r, o) triples from the text and design an objective at the triple level, the likelihood can be written, by the chain rule of probability, as:

$\prod_{w_i \in D} \Big[ \prod_{s \in T_i} p(s \mid w_i) \prod_{(r,o) \in T_i} p(o \mid s, r, w_i) \Big]$

where D is the training set, $w_i$ a text in the training set, and $T_i$ the set of all possible (s, r, o) triples in document $w_i$; $s \in T_i$ means that s appears in some triple of $T_i$, and $o \in T_i$ means that o appears in some triple of $T_i$. That is, the probability of all triples appearing in the corpus is maximized. Taking the negative logarithm of the above formula as the joint loss function of the entity recognition model gives:

$loss = -\sum_{w_i \in D} \Big[ \sum_{s \in T_i} \log p(s \mid w_i) + \sum_{(r,o) \in T_i} \log p(o \mid s, r, w_i) \Big]$
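Numerically, the joint loss is just the summed negative log of the probabilities the two CRF layers assign to the correct labels; a toy sketch with made-up probabilities standing in for the CRF outputs:

```python
import math

def joint_loss(subject_probs, object_probs):
    """Negative log-likelihood over the probabilities assigned to the
    correct labels of all host entities s and all guest entities o;
    the total is the sum of the two layer losses."""
    loss_s = -sum(math.log(p) for p in subject_probs)
    loss_o = -sum(math.log(p) for p in object_probs)
    return loss_s + loss_o

# Hypothetical correct-label probabilities from the two CRF layers.
loss = joint_loss([0.9, 0.8], [0.7])
```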
and finally, extracting the entity relationship of the enterprise according to the trained entity identification model, inputting the news text data to be extracted into the entity identification model, and identifying the entity relationship of the enterprise to obtain the extracted entity relationship triple.
According to the extraction method of the enterprise entity relationship provided by the embodiment of the disclosure, unrelated entities are not identified, and related entities are directly identified, so that the noise influence caused by negative samples is greatly reduced, and the training efficiency and the identification effect of the model are improved.
An embodiment of the present disclosure further provides an apparatus for extracting an enterprise entity relationship, where the apparatus is configured to execute the method for extracting an enterprise entity relationship in the foregoing embodiment, and as shown in fig. 5, the apparatus includes:
an obtaining module 501, configured to obtain text data to be extracted;
an input module 502, configured to input text data into a coding layer of a pre-trained entity recognition model to obtain a coded word vector;
a first extraction module 503, configured to input the word vector into a first entity identification layer of the entity identification model, so as to obtain a main entity containing an entity relationship;
the second extraction module 504 is configured to input the word vector and the host entity into a second entity identification layer of the entity identification model, so as to obtain a guest entity having a corresponding relationship with the host entity.
It should be noted that, when the extraction apparatus for business entity relationship provided in the foregoing embodiment executes the extraction method for business entity relationship, the division of each functional module is merely used as an example, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules, so as to complete all or part of the functions described above. In addition, the embodiment of the extraction apparatus for enterprise entity relationships and the embodiment of the extraction method for enterprise entity relationships provided in the foregoing embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.
The embodiment of the present disclosure further provides an electronic device corresponding to the method for extracting an enterprise entity relationship provided in the foregoing embodiment, so as to execute the method for extracting an enterprise entity relationship.
Please refer to fig. 6, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 6, the electronic apparatus includes: the processor 600, the memory 601, the bus 602 and the communication interface 603, wherein the processor 600, the communication interface 603 and the memory 601 are connected through the bus 602; the memory 601 stores a computer program that can be executed on the processor 600, and the processor 600 executes the method for extracting business entity relationships according to any of the embodiments of the present application when executing the computer program.
The Memory 601 may include a high-speed Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 603 (wired or wireless), over the Internet, a wide area network, a local area network, a metropolitan area network, and the like.
Bus 602 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 601 is used for storing a program, and the processor 600 executes the program after receiving an execution instruction, and the method for extracting an enterprise entity relationship disclosed in any of the foregoing embodiments of the present application may be applied to the processor 600, or implemented by the processor 600.
The processor 600 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 600 or by instructions in the form of software. The processor 600 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, capable of implementing or performing the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 601; the processor 600 reads the information in the memory 601 and completes the steps of the above method in combination with its hardware.
The electronic device provided by the embodiment of the present application has the same beneficial effects as the method for extracting enterprise entity relationships provided by the embodiments of the present application, since the electronic device adopts, runs, or implements that method.
Referring to fig. 7, the computer-readable storage medium is shown as an optical disc 700 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the method for extracting enterprise entity relationships provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or other optical or magnetic storage media, which are not described in detail here.
The computer-readable storage medium provided by the above embodiment of the present application has the same beneficial effects as the method for extracting enterprise entity relationships provided by the embodiments of the present application, since the application program stored on it adopts, runs, or implements that method.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them that involves no contradiction should be considered within the scope of this specification.
The above examples express only several embodiments of the present invention, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art may make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for extracting a business entity relationship is characterized by comprising the following steps:
acquiring text data to be extracted;
inputting the text data into a coding layer of a pre-trained entity recognition model to obtain a coded word vector;
inputting the word vector into a first entity recognition layer of the entity recognition model to obtain a main entity containing an entity relationship;
and inputting the word vector and the main entity into a second entity recognition layer of the entity recognition model to obtain a guest entity having a corresponding relation with the main entity.
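The three inputting steps of claim 1 form a cascade: encode the text, extract the main (subject) entity, then extract the guest (object) entity conditioned on the subject. A minimal Python sketch of that data flow, with toy stand-in rules in place of the actual BERT coding layer and BiLSTM+CRF recognition layers (all names and the example sentence are illustrative, not from the patent):

```python
def encode(text):
    """Stand-in for the pre-trained coding layer: one toy vector per token."""
    return [[float(len(tok))] for tok in text.split()]

def extract_subject(word_vectors, tokens):
    """Stand-in for the first entity recognition layer: treats the token
    before the relation word as the main (subject) entity."""
    for i, tok in enumerate(tokens):
        if tok == "acquired" and i > 0:
            return tokens[i - 1]
    return None

def extract_object(word_vectors, subject, tokens):
    """Stand-in for the second entity recognition layer: conditioned on the
    subject, treats the token after the relation word as the guest entity."""
    if subject is None:
        return None
    for i, tok in enumerate(tokens):
        if tok == "acquired" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None

text = "AcmeCorp acquired BetaInc"
tokens = text.split()
vectors = encode(text)
subject = extract_subject(vectors, tokens)
obj = extract_object(vectors, subject, tokens)
print((subject, "acquired", obj))  # ('AcmeCorp', 'acquired', 'BetaInc')
```

Note that the second stage receives both the word vectors and the extracted subject, mirroring the conditioning in claim 1.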
2. The method of claim 1, wherein, before the inputting of the text data into the coding layer of the pre-trained entity recognition model, the method further comprises:
labeling the text data according to the IOBES labeling scheme to obtain labeled text data.
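The IOBES scheme of claim 2 labels each token as Begin, Inside, or End of an entity span, Single for a one-token entity, or Outside. A small illustrative tagger for one entity span (function name and interface are hypothetical, not from the patent):

```python
def iobes_tags(tokens, entity_start, entity_len):
    """Label a token sequence with IOBES tags for one entity span.
    B = begin, I = inside, E = end, S = single-token entity, O = outside."""
    tags = ["O"] * len(tokens)
    if entity_len == 1:
        tags[entity_start] = "S"
    else:
        tags[entity_start] = "B"
        for i in range(entity_start + 1, entity_start + entity_len - 1):
            tags[i] = "I"
        tags[entity_start + entity_len - 1] = "E"
    return tags

# A three-token entity starting at position 0:
print(iobes_tags(["Hangzhou", "Weiming", "Information", "Co"], 0, 3))
# ['B', 'I', 'E', 'O']
```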
3. The method of claim 1, wherein the coding layer is constructed from a BERT network model.
4. The method of claim 1, wherein the first entity recognition layer consists of a first BiLSTM network model and a first CRF network model.
5. The method of claim 4, wherein inputting the word vector into the first entity recognition layer of the entity recognition model to obtain a main entity containing an entity relationship comprises:
inputting the word vector into the first BiLSTM network model to obtain a forward hidden layer sequence and a backward hidden layer sequence;
merging the forward hidden layer sequence and the backward hidden layer sequence to obtain a word vector sequence;
and inputting the word vector sequence into the first CRF network model to obtain a position vector of the main entity containing an entity relationship.
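The merge-then-decode steps of claim 5 can be illustrated as follows: the forward and backward hidden sequences are concatenated position by position, and the CRF then selects the highest-scoring tag path. A toy pure-Python sketch (the LSTM recurrences themselves are omitted; all names are illustrative, and the emission/transition scores would come from the trained model):

```python
def bilstm_merge(forward_h, backward_h):
    """Merge step: concatenate the forward and backward hidden vectors at
    each position (the gating logic of a real BiLSTM is omitted)."""
    return [f + b for f, b in zip(forward_h, backward_h)]

def viterbi_decode(emissions, transitions):
    """Toy CRF decoding: find the highest-scoring tag path via Viterbi.
    emissions[t][j]: score of tag j at step t; transitions[i][j]: score
    of moving from tag i to tag j."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = [[0] * k for _ in range(n)]
    for t in range(1, n):
        new_score = []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            back[t][j] = best_i
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
    # trace back from the best final tag
    path = [max(range(k), key=lambda j: score[j])]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

merged = bilstm_merge([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(merged)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(viterbi_decode([[1, 0], [0, 1], [1, 0]], [[0, 0], [0, 0]]))  # [0, 1, 0]
```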
6. The method of claim 5, wherein the second entity recognition layer consists of a second BiLSTM network model and a second CRF network model.
7. The method of claim 6, wherein inputting the word vector and the main entity into the second entity recognition layer of the entity recognition model to obtain a guest entity having a corresponding relationship with the main entity comprises:
inputting the word vector sequence and the position vector of the main entity containing an entity relationship into the second CRF network model to obtain the guest entity having a corresponding relationship with the main entity.
8. The method of claim 1, wherein, before the inputting of the text data into the coding layer of the pre-trained entity recognition model, the method further comprises:
obtaining a loss function of the entity recognition model according to a probability map model;
and training the entity recognition model according to the loss function.
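For a linear-chain CRF, the probabilistic-graphical-model loss referred to in claim 8 is the negative log-likelihood of the gold tag path: log Z(x) minus the gold path score. A brute-force pure-Python sketch for toy sizes (illustrative only; real implementations compute log Z efficiently with the forward algorithm, and all names here are hypothetical):

```python
import math
from itertools import product

def path_score(emissions, transitions, path):
    """Score of one tag path: emission scores plus tag-transition scores."""
    s = emissions[0][path[0]]
    for t in range(1, len(path)):
        s += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return s

def crf_neg_log_likelihood(emissions, transitions, gold_path):
    """Negative log-likelihood of the gold path under a linear-chain CRF:
    log Z(x) - score(gold path). Z is computed by enumerating every tag
    path, which is only feasible for toy sizes."""
    n, k = len(emissions), len(emissions[0])
    log_z = math.log(sum(
        math.exp(path_score(emissions, transitions, p))
        for p in product(range(k), repeat=n)
    ))
    return log_z - path_score(emissions, transitions, gold_path)
```

With all scores zero, every path of length 2 over 2 tags is equally likely, so the loss is log 4, which can serve as a quick sanity check when training begins.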
9. An apparatus for extracting business entity relationships, comprising:
the acquisition module is used for acquiring text data to be extracted;
the input module is used for inputting the text data into a coding layer of a pre-trained entity recognition model to obtain a coded word vector;
the first extraction module is used for inputting the word vector into a first entity recognition layer of the entity recognition model to obtain a main entity containing an entity relationship;
and the second extraction module is used for inputting the word vector and the main entity into a second entity recognition layer of the entity recognition model to obtain a guest entity having a corresponding relationship with the main entity.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement a method of extracting business entity relationships according to any one of claims 1 to 8.
CN202011211617.XA 2020-11-03 2020-11-03 Business entity relation extraction method, device and storage medium Active CN112395407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011211617.XA CN112395407B (en) 2020-11-03 2020-11-03 Business entity relation extraction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011211617.XA CN112395407B (en) 2020-11-03 2020-11-03 Business entity relation extraction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112395407A true CN112395407A (en) 2021-02-23
CN112395407B CN112395407B (en) 2023-09-19

Family

ID=74597830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011211617.XA Active CN112395407B (en) 2020-11-03 2020-11-03 Business entity relation extraction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112395407B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949309A (en) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 Enterprise association relation extraction method and device, storage medium and electronic device
CN113704481A (en) * 2021-03-11 2021-11-26 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377903A (en) * 2019-06-24 2019-10-25 浙江大学 A kind of Sentence-level entity and relationship combine abstracting method
CN111079431A (en) * 2019-10-31 2020-04-28 北京航天云路有限公司 Entity relation joint extraction method based on transfer learning
WO2020113582A1 (en) * 2018-12-07 2020-06-11 Microsoft Technology Licensing, Llc Providing images with privacy label
CN111339774A (en) * 2020-02-07 2020-06-26 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111666427A (en) * 2020-06-12 2020-09-15 长沙理工大学 Entity relationship joint extraction method, device, equipment and medium
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020113582A1 (en) * 2018-12-07 2020-06-11 Microsoft Technology Licensing, Llc Providing images with privacy label
CN110377903A (en) * 2019-06-24 2019-10-25 浙江大学 A kind of Sentence-level entity and relationship combine abstracting method
CN111079431A (en) * 2019-10-31 2020-04-28 北京航天云路有限公司 Entity relation joint extraction method based on transfer learning
CN111339774A (en) * 2020-02-07 2020-06-26 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111666427A (en) * 2020-06-12 2020-09-15 长沙理工大学 Entity relationship joint extraction method, device, equipment and medium
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ren Ming et al.: "Research on Entity Relation Extraction Methods in Genealogy Texts" *
Yang Piao et al.: "Chinese Named Entity Recognition Method Based on BERT Embedding" *
Zhao Ping et al.: "Chinese Scenic Spot Named Entity Recognition Based on BERT+BiLSTM+CRF" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949309A (en) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 Enterprise association relation extraction method and device, storage medium and electronic device
CN113704481A (en) * 2021-03-11 2021-11-26 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113704481B (en) * 2021-03-11 2024-05-17 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112395407B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN107526967B (en) Risk address identification method and device and electronic equipment
CN111858843B (en) Text classification method and device
CN115618371B (en) Non-text data desensitization method, device and storage medium
CN111259144A (en) Multi-model fusion text matching method, device, equipment and storage medium
CN111859983B (en) Natural language labeling method based on artificial intelligence and related equipment
CN114596566B (en) Text recognition method and related device
CN110309301B (en) Enterprise category classification method and device and intelligent terminal
WO2017132545A1 (en) Systems and methods for generative learning
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN112395407B (en) Business entity relation extraction method, device and storage medium
CN111291551B (en) Text processing method and device, electronic equipment and computer readable storage medium
CN113360654A (en) Text classification method and device, electronic equipment and readable storage medium
CN115858773A (en) Keyword mining method, device and medium suitable for long document
US20240013516A1 (en) Method and system for deep learning based image feature extraction
CN117216279A (en) Text extraction method, device and equipment of PDF (portable document format) file and storage medium
CN111178080A (en) Named entity identification method and system based on structured information
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN116976341A (en) Entity identification method, entity identification device, electronic equipment, storage medium and program product
CN116266259A (en) Image and text structured output method and device, electronic equipment and storage medium
CN114090781A (en) Text data-based repulsion event detection method and device
CN112579774A (en) Model training method, model training device and terminal equipment
CN112256841A (en) Text matching and confrontation text recognition method, device and equipment
CN102722489A (en) System and method for extracting object identifier from webpage
CN115203620B (en) Interface migration-oriented webpage identification method, device and equipment with similar semantic theme
CN116975298B (en) NLP-based modernized society governance scheduling system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant