CN114238597A

CN114238597A - Information extraction method, device, equipment and storage medium

Info

Publication number: CN114238597A
Application number: CN202111479541.3A
Authority: CN
Inventors: 闫润强; 段素霞
Original assignee: Henan Xunfei Artificial Intelligence Technology Co ltd
Current assignee: Henan Xunfei Artificial Intelligence Technology Co ltd
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-03-25

Abstract

The application provides an information extraction method, apparatus, device and storage medium. The method comprises: selecting entities similar to the text to be extracted from a preset knowledge base as candidate entities; determining the fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity, wherein a text segment consists of a single character or two or more consecutive characters of the text to be extracted, and the fusion feature comprises a text segment feature and a candidate entity feature; and determining each entity in the text to be extracted and the relationships between the entities according to the fusion feature of each text segment in the text to be extracted. With this method, entities can be extracted from the text and the entity relationships determined at the same time, with higher information extraction accuracy.

Description

Information extraction method, device, equipment and storage medium

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to an information extraction method, apparatus, device, and storage medium.

Background

Information extraction is a principal means of obtaining useful information from natural language text, and the extraction of entities and entity relationships is the most closely watched task within it.

Conventional information extraction methods generally first extract entities from the text and then analyze the relationships between those entities. This two-step process is cumbersome. Moreover, conventional methods rely entirely on the content of the text to be extracted for entity recognition and entity relationship determination, so their accuracy in recognizing entities and entity relationships is limited.

Disclosure of Invention

In view of the above state of the art, the embodiment of the present application provides an information extraction method that can extract entities from a text to be extracted and determine the entity relationships in a single pass, with higher information extraction accuracy.

In order to achieve the above purpose, the present application specifically proposes the following technical solutions:

an information extraction method, comprising:

selecting entities similar to the text to be extracted from a preset knowledge base as candidate entities;

determining the fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity; the text segment in the text to be extracted consists of a single character or two or more consecutive characters in the text to be extracted, and the fusion feature comprises a text segment feature and a candidate entity feature;

and determining each entity in the text to be extracted and the relation between the entities according to the fusion characteristics of each text segment in the text to be extracted.

Optionally, the selecting an entity similar to the text to be extracted from the preset knowledge base as a candidate entity includes:

matching the text to be extracted with a preset knowledge base, and selecting a knowledge triple similar to the text to be extracted from the preset knowledge base;

and determining entities similar to the text to be extracted from the selected knowledge triple as candidate entities.

Optionally, the method further includes:

and performing information expansion on the text to be extracted by utilizing the information related to the text to be extracted.

Optionally, determining the fusion features of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity, including:

respectively determining the vector code of each text segment in the text to be extracted and the vector code of each candidate entity;

determining the similarity of each candidate entity and each text segment according to the vector code of each text segment and the vector code of each candidate entity;

and determining the fusion characteristics of each text segment according to the vector codes of each text segment, the vector codes of each candidate entity and the similarity of each candidate entity and each text segment.

Optionally, after determining the vector code of each text segment in the text to be extracted, the method further includes:

and filtering non-entity text segments from each text segment according to the vector codes of each text segment in the text to be extracted.

Optionally, determining the similarity between each candidate entity and each text segment according to the vector coding of each text segment and the vector coding of each candidate entity, includes:

for each text segment, respectively determining the similarity between the text segment and each candidate entity by using the vector code of the text segment and the vector code of each candidate entity;

and normalizing the similarity of the text segment and each candidate entity.

Optionally, determining each entity in the text to be extracted and the relationship between the entities according to the fusion feature of each text segment in the text to be extracted includes:

and determining entity text segments from the text segments according to the fusion characteristics of the text segments in the text to be extracted, and determining the entity types of the entity text segments and the relationship between the entity text segments.

Optionally, determining an entity text segment from each text segment according to the fusion feature of each text segment in the text to be extracted, and determining the entity type of each entity text segment and the relationship between each entity text segment, including:

classifying each text segment according to the fusion characteristics of each text segment in the text to be extracted, determining an entity text segment from each text segment and determining the entity type of each entity text segment;

and determining the relation between the entity text segments according to the determined fusion characteristics of the entity text segments.

Optionally, determining fusion features of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity; and determining each entity in the text to be extracted and the relationship among the entities according to the fusion characteristics of each text segment in the text to be extracted, wherein the determining comprises the following steps:

respectively inputting the text to be extracted and each candidate entity into a pre-trained information extraction model, enabling the information extraction model to divide text segments of the text to be extracted, and determining the fusion characteristics of each text segment in the text to be extracted according to each text segment and each candidate entity; and determining each entity in the text to be extracted and the relationship between the entities according to the fusion characteristics of each text segment in the text to be extracted.

An information extraction apparatus comprising:

the candidate entity screening unit is used for selecting entities similar to the text to be extracted from a preset knowledge base as candidate entities;

the feature extraction unit is used for determining the fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity; the text segment in the text to be extracted consists of a single character or two or more consecutive characters in the text to be extracted, and the fusion feature comprises a text segment feature and a candidate entity feature;

and the information extraction unit is used for determining each entity in the text to be extracted and the relationship among the entities according to the fusion characteristics of each text segment in the text to be extracted.

An information extraction device comprising:

a memory and a processor;

the memory is connected with the processor and used for storing programs;

the processor is used for realizing the information extraction method by running the program in the memory.

A storage medium having stored thereon a computer program which, when executed by a processor, implements the information extraction method described above.

According to the information extraction method, the text to be extracted is processed by means of the preset knowledge base, the entities in the text to be extracted can be determined at one time, and the relation between the entities can be determined. In addition, in the information extraction process, the information of the entity similar to the text to be extracted, which is extracted from the preset knowledge base, is referred to, namely, the external knowledge is referred to, and the addition of the external knowledge enriches the reference information for identifying the entity and the entity relationship, so that the information extraction accuracy is higher.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flowchart of an information extraction method according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of another information extraction method provided in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an information extraction apparatus according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an information extraction device according to an embodiment of the present application.

Detailed Description

The technical solution of the embodiment of the present application is suitable for application scenarios in which entities and entity relationships are extracted from natural language text. With this technical solution, entities can be extracted from the text to be extracted more accurately and efficiently, and the relationships between the entities are determined at the same time.

The text to be extracted may be any natural language text, for example user speech, news manuscripts, product introductions, articles, or medical records, where a medical record may be an electronic medical record generated when a patient or a pet visits a hospital. The information extraction method provided by the embodiment of the present application is therefore applicable to natural language text of any format and type. In theory, as long as the text can be processed by a processing device, entities and entity relationships can be extracted from it by executing the technical solution of the embodiment of the present application.

In the conventional natural language processing scheme, there are also corresponding entity and entity relationship extraction schemes. However, the existing entity and entity relationship extraction schemes are generally performed in two steps, i.e., extracting entities from the text and then analyzing to determine the relationships between the entities. The above two steps of information extraction process are inefficient.

In addition, the conventional information extraction schemes perform entity recognition and determine entity relationships according to the text to be extracted. However, due to the diversity and unpredictability of the text to be extracted, the information extraction model performs entity and entity relationship prediction completely based on the text to be extracted, and often causes an error in entity and entity relationship recognition due to unfamiliarity with the text to be extracted, and the overall information extraction accuracy is not high.

In view of the above technical problems, an embodiment of the present application provides an information extraction scheme, which can simultaneously determine an entity and an entity relationship in a text to be extracted, and can predict the entity and the entity relationship in combination with knowledge base information.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

An embodiment of the present application provides an information extraction method, which is shown in fig. 1 and includes:

S101, selecting entities similar to the text to be extracted from a preset knowledge base as candidate entities.

Specifically, the preset knowledge base is a database that contains text from the domain to which the text to be extracted belongs, or a database composed only of text information from that domain; it may be, for example, a knowledge graph containing knowledge of that domain. Illustratively, information in the knowledge base is stored as triples to facilitate retrieval and querying.

The text to be extracted is the text to be processed from which the entity needs to be extracted and the entity relationship needs to be determined, and may specifically be any natural language text.

As an exemplary implementation manner, an entity is identified from a text to be extracted, and then an entity similar to the identified entity is retrieved from a preset knowledge base by taking the identified entity as a retrieval condition, so as to serve as a candidate entity.

As another optional implementation manner, the text to be extracted is matched with a preset knowledge base, and an entity matched with the text to be extracted is selected from the preset knowledge base, so that the entity can be used as a candidate entity.

Specifically, firstly, matching the text to be extracted with a preset knowledge base, and selecting a knowledge triple similar to the text to be extracted from the preset knowledge base.

Because the information in the preset knowledge base is stored in the form of the triples, the text to be extracted is matched with the preset knowledge base, and the triples matched with the text to be extracted can be selected from the triples.

For example, the preset knowledge base stores not only the knowledge triples but also feature vectors of the entities and entity relationships in the knowledge triples. On this basis, the feature vector of the text to be extracted can be extracted and matched against the feature vector of each knowledge triple in the knowledge base, and the knowledge triples that match the text to be extracted are selected.

In the embodiment of the present application, a pre-trained BERT model is further trained on a large corpus from the domain to which the text to be extracted belongs, so that vectors of characters or words in that domain can be extracted accurately. The trained BERT model is then used to perform feature extraction on the text to be extracted, yielding the feature vector of the text to be extracted as well as the feature vector of each character or word in it. For the structure of the BERT model, reference may be made to descriptions of the BERT model in conventional technical solutions; it is not described in detail here.

And then, determining entities similar to the texts to be extracted from the selected knowledge triples as candidate entities.

That is, the entities in each knowledge triple selected by the matching process described above are used directly as candidate entities.

It will be appreciated that the candidate entities determined by the above process are entities in the knowledge base that are similar to the text to be extracted. Since the domain of the text in the knowledge base is the same as, or includes, the domain of the text to be extracted, the selection of a candidate entity indicates that some content in the text to be extracted is similar to that candidate entity, that is, that this content may be an entity in the text to be extracted. The selected candidate entities therefore provide a reference for determining the entities in the text to be extracted.
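The matching step can be illustrated with a minimal sketch. It assumes that each knowledge triple is stored together with one feature vector and that matching means cosine similarity above a threshold; the tuple layout, the threshold value, the function names and the toy vectors are all hypothetical and are not taken from the application.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_candidate_entities(text_vec, triples, threshold=0.8):
    """Return the entities of every triple whose vector matches the text vector.

    `triples` is a list of (head, relation, tail, vector) tuples. The layout
    and the threshold are assumptions made for this sketch.
    """
    candidates = []
    for head, _relation, tail, vec in triples:
        if cosine(text_vec, vec) >= threshold:
            # Both entities of a matched triple become candidate entities.
            for entity in (head, tail):
                if entity not in candidates:
                    candidates.append(entity)
    return candidates

# Toy knowledge base with hand-made 3-dimensional vectors (hypothetical data).
knowledge_base = [
    ("rectum", "shows", "high-density shadow", [0.9, 0.1, 0.0]),
    ("X-ray film", "examines", "rectum", [0.8, 0.3, 0.1]),
    ("vaccine", "prevents", "rabies", [0.0, 0.1, 0.9]),
]
text_vector = [1.0, 0.2, 0.0]
candidates = select_candidate_entities(text_vector, knowledge_base)
print(candidates)
```

In practice the text vector would come from the trained BERT model and the triple vectors from the knowledge base; only the selection logic is shown here.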

S102, determining fusion characteristics of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity.

Specifically, in general, an entity in the text to be extracted is a single character in the text to be extracted, or a phrase consisting of two or more continuous characters, and the number of characters specifically included in each entity is not fixed.

For entity extraction, the strategy adopted in the embodiment of the present application is to first enumerate all text segments of every length contained in the text to be extracted, and then verify whether each text segment is an entity.

Based on this idea, the embodiment of the present application first divides the text to be extracted into text segments. The division strategy is to take each character of the text to be extracted in turn as a starting character and, moving from front to back, intercept the text segments that contain 1 character, 2 consecutive characters, ..., n consecutive characters, where n is less than or equal to the number of characters from the starting character to the last character of the text to be extracted. Thus, a text containing W characters can be divided into $N = W(W+1)/2$ text segments. The i-th text segment is defined as all characters from start(i) to end(i), where start(i) is the position of the first character of the i-th text segment in the text to be extracted and end(i) is the position of its last character.
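The segment count can be checked by summing over starting positions: a segment that starts at character s can end at any of the W − s + 1 positions from s to W, so

```latex
N = \sum_{s=1}^{W} (W - s + 1) = \sum_{k=1}^{W} k = \frac{W(W+1)}{2}
```

For instance, a text of W = 4 characters yields 4 + 3 + 2 + 1 = 10 text segments.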

For example, suppose a patient's medical record contains the sentence "a re-examination X-ray film this morning shows a high-density shadow in the rectum". In the original Chinese the sentence has 19 characters, so 19 × (19 + 1) / 2 = 190 text segments can be divided from it according to the above description, as shown in Table 1:

TABLE 1

No.	Text segment	start(i)	end(i)
1	character 1	1	1
2	characters 1 to 2	1	2
3	characters 1 to 3	1	3
4	character 2	2	2
5	characters 2 to 3	2	3
…	…	…	…

It can be understood that each text segment extracted from the text to be extracted through the above processing covers all possible words or phrases in the text to be extracted, and each extracted text segment is likely to be a real word.
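The division strategy can be sketched in a few lines. This is an illustrative enumeration only; the function name is hypothetical, positions are 1-based and inclusive as with start(i) and end(i), and the placeholder string stands in for the text to be extracted.

```python
def enumerate_segments(text):
    """List every contiguous text segment as (segment, start, end), 1-based inclusive."""
    segments = []
    for start in range(1, len(text) + 1):
        # From each starting character, take 1, 2, ..., n consecutive characters.
        for end in range(start, len(text) + 1):
            segments.append((text[start - 1:end], start, end))
    return segments

segments = enumerate_segments("ABCD")  # placeholder 4-character text
for number, (segment, start, end) in enumerate(segments, 1):
    print(number, segment, start, end)
```

A 4-character text produces 10 segments and a 19-character text produces 190, in line with W(W+1)/2.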

After each text segment is extracted from the text to be extracted, the embodiment of the application further obtains the feature vector of each text segment, thereby facilitating subsequent entity identification and entity relationship determination.

Unlike the conventional approach of determining the feature vector of a text segment directly through a model, the embodiment of the present application determines the feature vector of each extracted text segment by combining the information of the text segment itself with the information of each candidate entity extracted from the knowledge base.

As can be seen from the above description, the candidate entities are entities with higher similarity to some contents in the text to be extracted, and these candidate entities may be used to determine whether the contents in the text to be extracted are entities. For example, if the feature vector of a certain text segment in the text to be extracted is highly similar to or completely identical to the feature vector of a certain candidate entity, the text segment may be determined to be identical or similar to the candidate entity, and thus the text segment may be determined to be an entity.

However, in practical situations, a text segment may be similar to a plurality of candidate entities, and in this case, the relation between the text segment and other text segments cannot be directly determined.

In order to determine the entities in a text to be extracted and the entity relationships at the same time, the embodiment of the present application determines a feature of each text segment in the text to be extracted according to that text segment and each candidate entity. The feature information of the text segment itself and the feature information of the candidate entities are fused in this feature, so the feature is called a fusion feature.

For example, for each text segment extracted from the text to be extracted, the feature of the text segment is fused with the feature of each candidate entity, specifically, for example, the feature of the text segment is fused with the feature of the candidate entity with higher similarity, and the feature is taken as the fusion feature of the text segment.

For a specific process of extracting the fusion features of each text passage, reference may be made to the detailed description of the embodiments below.

It can be understood that the fusion feature of a text segment determined in the above manner includes not only the information of the text segment itself but also external knowledge, specifically information about entity words similar to the text to be extracted. By comparison, the fusion feature of each text segment therefore carries richer information, which helps in identifying whether the text segment is an entity and in determining the relationships among the text segments.

S103, determining each entity in the text to be extracted and the relation between the entities according to the fusion characteristics of each text segment in the text to be extracted.

Specifically, each text segment in the text to be extracted is classified according to the fusion feature of each text segment in the text to be extracted, so that whether each text segment in the text to be extracted is an entity can be determined.

For example, assume that the fusion feature of text segment A includes, in addition to the feature of text segment A itself, the features of 3 similar candidate entities, and that the fusion feature of text segment B includes, in addition to the feature of text segment B itself, the features of 8 candidate entities similar to text segment B. The fusion feature of text segment B then contains more entity feature components, so text segment B is more likely to be an entity than text segment A. This also matches common intuition: if a word is similar to many entity words, the word itself is likely to be an entity word. The embodiment of the present application applies this regularity to entity recognition by technical means, using the features of the text.

Meanwhile, for each text segment classified as an entity, the relationship between the text segments is further analyzed according to the fusion characteristics of the text segments, and the entity relationship can be determined.

Therefore, the embodiment of the present application can determine the entities in the text to be extracted and the entity relationships at one time.

As can be seen from the above description, the information extraction method provided in the embodiment of the present application processes the text to be extracted by using the preset knowledge base, and can determine the entities in the text to be extracted and determine the relationships between the entities at one time. In addition, in the information extraction process, the information of the entity similar to the text to be extracted, which is extracted from the preset knowledge base, is referred to, namely, the external knowledge is referred to, and the addition of the external knowledge enriches the reference information for identifying the entity and the entity relationship, so that the information extraction accuracy is higher.

As a preferred implementation manner, in the embodiment of the present application, before information extraction is performed on a text to be extracted, information expansion is performed on the text to be extracted, and then information extraction is performed on the text to be extracted after the information expansion.

Specifically, the information expansion is carried out on the text to be extracted by utilizing the information related to the text to be extracted.

The information related to the text to be extracted refers to information similar to or related to the content of the text to be extracted.

For example, if the text to be extracted is a patient's electronic medical record, the patient's earlier diagnosis and treatment data can be analyzed and mined for information that is useful for diagnosing and treating the patient's disease and for completing the medical record. Such information includes previously purchased health care products, purchased food, vaccination status (if the hospital has a record, mainly used to supplement and correct the latest vaccination status on the medical record), diagnosis and treatment records (mainly used to supplement and verify the earlier medical record), and the like.

The information expansion is carried out on the text to be extracted by utilizing the information related to the text to be extracted, so that the information of the text to be extracted is richer, the candidate entity really similar to the text to be extracted is favorably determined, the characteristics of the text segment contained in the text to be extracted are favorably determined, and the entity and entity relation in the text to be extracted is favorably and accurately identified.

It is understood that, relatively speaking, information related to the text to be extracted is also external information other than the text to be extracted. Therefore, the information expansion is carried out on the text to be extracted by utilizing the information related to the text to be extracted, and the external information is actually used for carrying out the information extraction on the text to be extracted, so that the accuracy of the information extraction is favorably improved.

As an alternative implementation, referring to Fig. 2, after step S201 is executed, the information extraction method provided in the embodiment of the present application executes steps S202 to S204 to determine the fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity.

The specific contents of steps S202 to S204 include:

S202, respectively determining the vector code of each text segment in the text to be extracted and the vector code of each candidate entity.

For example, for each text segment in the text to be extracted, the feature vector of each character in the text to be extracted can be obtained by the trained BERT model described above.

Then, for each text segment, according to the feature vector of each character included in the text segment, the feature vector of the text segment can be obtained, that is, the vector code of the text segment is obtained, for example, the feature vector of each character included in the text segment is spliced, and the obtained vector code can be used as the vector code of the text segment.

In addition to the knowledge triples, the knowledge base also stores the entities in each triplet and the vectors of the entity relationships. Therefore, the feature vector of each candidate entity can be read directly from the knowledge base, and the read feature vector is used as the vector code of the candidate entity.

As another alternative, the embodiment of the present application determines the vector encoding of each text segment by means of a feedforward neural network.

For example, the vector code $s_i$ of the i-th text segment can be obtained by the following formula:

$$s_i = \mathrm{FFNN}\left(\left[x_{start(i)},\ x_{end(i)},\ \sum_{t=start(i)}^{end(i)} \beta_{i,t}\, x_t,\ l_i\right]\right)$$

where FFNN denotes a feed-forward neural network operation; $x_{start(i)}$ denotes the feature vector of the starting character of the i-th text segment; $x_{end(i)}$ denotes the feature vector of the ending character of the i-th text segment; $\beta_{i,t}$ denotes the contribution rate of the feature vector of the t-th character in the i-th text segment to the feature vector of that text segment; $\sum_{t} \beta_{i,t}\, x_t$ denotes the contribution of the feature vectors of all characters in the i-th text segment to the feature vector of that text segment; $l_i$ denotes the length of the i-th text segment; and $[\cdot]$ denotes vector splicing.
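A minimal sketch of this segment encoding follows. It assumes a single fully connected layer with ReLU in place of the feed-forward network, supplies the contribution rates by hand, and appends the segment length as one scalar; these choices, the function names and the toy weights are assumptions for illustration, since the application does not fix them.

```python
def ffnn(vector, weights, bias):
    """One fully connected layer with ReLU; stands in for FFNN in this sketch."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]

def encode_segment(char_vecs, start, end, contrib, weights, bias):
    """Vector code of the segment covering characters start..end (1-based, inclusive).

    contrib[t] is the assumed contribution rate of the t-th character of the segment.
    """
    span = char_vecs[start - 1:end]
    dim = len(span[0])
    # Contribution-weighted sum of the character vectors inside the segment.
    weighted = [sum(c * vec[d] for c, vec in zip(contrib, span)) for d in range(dim)]
    length = [float(end - start + 1)]
    # Splice start vector, end vector, weighted sum and length, then apply the FFNN.
    spliced = span[0] + span[-1] + weighted + length
    return ffnn(spliced, weights, bias)

# Toy 2-dimensional character vectors and hand-picked layer weights (hypothetical).
char_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_weights = [[0.1] * 7, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0]]
layer_bias = [0.0, 0.0]
s_example = encode_segment(char_vectors, 1, 2, [0.5, 0.5], layer_weights, layer_bias)
print(s_example)
```

In a trained model the character vectors would come from BERT and the contribution rates and layer weights would be learned; the sketch only shows how the spliced input is assembled.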

For each candidate entity, the feature vector of each candidate entity can be read from the knowledge base, and then the feature vector of each candidate entity is input into the feedforward neural network, so that the vector code of each candidate entity can be obtained.

See the following equations in detail:

$$n_j = \mathrm{FFNN}(k_j), \quad k_j \in K$$

where $K$ denotes the set of feature vectors of the candidate entities, $k_j$ denotes the feature vector of the j-th candidate entity read from the set $K$, and $n_j$ denotes the vector code of the j-th candidate entity.

S203, determining the similarity weight of each candidate entity and each text segment according to the vector code of each text segment and the vector code of each candidate entity.

Specifically, as described above, each candidate entity is an entity similar to the text to be extracted that has been selected from the knowledge base, but it is not yet known which candidate entity is similar to which text segment in the text to be extracted, nor how similar each candidate entity is to each text segment.

In practice, however, a candidate entity may resemble only certain text segments, and thus may play a substantial or major role only in determining whether those text segments are entities. That is, different candidate entities play different roles in determining whether different text segments are entities. In theory, the more similar a candidate entity is to a text segment, the more likely the text segment is an entity, so the similarity between a candidate entity and a text segment directly reflects the possibility that the text segment is an entity. Likewise, the higher the similarity between a candidate entity and a text segment, the higher the probability that the text segment is an entity of the same type as the candidate entity. The similarity between a text segment and each candidate entity therefore helps in accurately identifying entities among the text segments of the text to be extracted and, once text segments are determined to be entities, in determining the relationships between those entities.

Therefore, the similarity between each candidate entity and each text segment is measured in the embodiment of the application.

In a specific implementation, for each text segment, a feedforward neural network first takes the vector code of the text segment and the vector code of each candidate entity as input and outputs the similarity between the text segment and each candidate entity.

The specific formula is as follows:

α_ij＝FFNN([s_i,n_j])

where α_ij represents the similarity of the ith text segment and the jth candidate entity, and [s_i, n_j] represents the concatenation of the vector code of the ith text segment and the vector code of the jth candidate entity.

Then, the similarity of the text segment and each candidate entity is normalized.

The specific formula is as follows:

β_ij = α_ij / Σ_j α_ij

where β_ij represents the normalized similarity of the ith text segment to the jth candidate entity, and Σ_j α_ij represents the sum of the similarities of the ith text segment and each candidate entity.
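The similarity and normalization steps of S203 can be sketched as below. The scorer `make_scorer`, its random weights and the example vectors are hypothetical. The sigmoid on the output is an added assumption that keeps every α_ij positive, so that dividing by the sum yields well-defined weights.

```python
import math
import random


def make_scorer(in_dim, hidden_dim, seed=0):
    """Small feed-forward scorer mapping a concatenated vector to one positive score."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def score(x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
        logit = sum(w * h for w, h in zip(w2, hidden))
        return 1.0 / (1.0 + math.exp(-logit))  # assumption: sigmoid keeps alpha_ij > 0

    return score


def similarity_weights(S, N, score):
    """Return beta[i][j], the normalized similarity of segment i to candidate j."""
    beta = []
    for s_i in S:
        alpha_i = [score(s_i + n_j) for n_j in N]  # alpha_ij = FFNN([s_i, n_j])
        total = sum(alpha_i)
        beta.append([a / total for a in alpha_i])  # beta_ij = alpha_ij / sum_j alpha_ij
    return beta


S = [[0.1, 0.3], [0.8, -0.2], [0.0, 0.5]]  # hypothetical segment vector codes s_i
N = [[0.4, 0.4], [-0.6, 0.9]]              # hypothetical candidate vector codes n_j
beta = similarity_weights(S, N, make_scorer(in_dim=4, hidden_dim=3))
```

Each row of `beta` sums to 1, so the weights of one text segment can be read as the relative importance of each candidate entity for that segment.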

S204, determining the fusion characteristics of each text segment according to the vector codes of each text segment, the vector codes of each candidate entity and the similarity weight of each candidate entity and each text segment.

Specifically, for each text segment, the vector code of the text segment itself and the contribution of the vector code of each candidate entity to the fusion feature of the text segment are fused, that is, the fusion feature of the text segment is obtained. The contribution amount of the vector code of the candidate entity to the fusion feature of the text segment is determined by multiplying the vector code of the candidate entity by the similarity of the candidate entity and the text segment.

Illustratively, the fusion feature f_i of the ith text segment is determined by operation according to the following formula:

where s_i represents the vector code of the ith text segment, n_j represents the vector code of the jth candidate entity, β_ij represents the normalized similarity of the ith text segment to the jth candidate entity, and β_ij n_j is the contribution of the jth candidate entity to the fusion feature of the ith text segment.

As can be understood from the introduction of the above processing, the fusion features of each text segment of the text to be extracted not only include the information of the text segment itself, but also include the information of each candidate entity similar to the text segment, and the proportions of the information of each candidate entity included in the fusion features of the text segment are different according to the difference in the similarity between each candidate entity and the text segment. Because the fusion characteristic of the text segment contains external information, the method is more beneficial to identifying whether the text segment is an entity and determining the entity type when the text segment is the entity.
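A minimal sketch of step S204 follows. The text states that the segment's own vector code is fused with the weighted contributions β_ij n_j, but the exact fusion operator is not recoverable from it. The sketch therefore assumes concatenation of s_i with the weighted sum of the candidate vector codes. Element-wise addition would be another plausible reading. The function name `fuse` and the example values are hypothetical.

```python
def fuse(s_i, N, beta_i):
    """Return f_i = [s_i, sum_j beta_ij * n_j]; concatenation is an assumption."""
    dim = len(N[0])
    # Weighted sum of the candidate vector codes, one dimension at a time.
    weighted = [sum(b * n_j[d] for b, n_j in zip(beta_i, N)) for d in range(dim)]
    return list(s_i) + weighted


s_i = [1.0, 2.0]                # vector code of one text segment
N = [[1.0, 0.0], [0.0, 1.0]]    # vector codes of two candidate entities
beta_i = [0.25, 0.75]           # normalized similarity weights for this segment
f_i = fuse(s_i, N, beta_i)
print(f_i)  # [1.0, 2.0, 0.25, 0.75]
```

The first two components carry the segment's own information and the last two carry the external knowledge, in proportions set by `beta_i`.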

Further, the embodiment of the present application also provides that, after step S202 has been executed and the vector codes of the text segments in the text to be extracted have been determined, non-entity text segments, that is, text segments that clearly cannot be entities, are filtered out from the text segments according to those vector codes.

Illustratively, according to the vector coding of each text segment, entity recognition is carried out on each text segment by means of a feedforward neural network, and for the text segments with lower recognition probability, namely non-entity text segments, the text segments are filtered, so that the workload of entity extraction and entity relation extraction in the later period can be reduced.

For example, entity recognition is performed on each text segment according to the following formula to calculate the entity recognition probability e_i of each text segment:

e_i = softmax(FFNN(s_i))

where i ranges over the indices of all text segments extracted from the text to be extracted, and s_i represents the vector code of the ith text segment.

Assuming that e_i of the ith text segment is smaller than 0.4, the text segment can be determined to be a non-entity text segment and can be filtered out.
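The filtering rule can be illustrated as follows. To keep the sketch short, the FFNN is replaced by precomputed two-class logits, ordered as [non-entity, entity]. The logits, the example segments and the function name `filter_segments` are assumptions. Only the softmax and the 0.4 threshold come from the description above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_segments(segments, logits, threshold=0.4):
    """Keep segments whose entity probability e_i is at least the threshold."""
    kept = []
    for segment, logit_pair in zip(segments, logits):
        e_i = softmax(logit_pair)[1]  # index 1 = probability of the "entity" class
        if e_i >= threshold:
            kept.append(segment)
    return kept


segments = ["fever", "has a", "aspirin"]
logits = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]  # stand-ins for FFNN(s_i)
kept = filter_segments(segments, logits)
print(kept)  # ['fever', 'aspirin']
```

Here "has a" receives an entity probability of about 0.12 and is dropped, so later fusion and relation classification operate on fewer segments.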

In addition, steps S201 and S205 in the embodiment shown in fig. 2 correspond to steps S101 and S103 in the method embodiment shown in fig. 1, respectively, and specific contents thereof can refer to corresponding contents in the embodiment shown in fig. 1, and are not repeated here.

As an optional implementation manner, determining each entity in the text to be extracted and the relationship between the entities according to the fusion feature of each text segment in the text to be extracted specifically includes: determining entity text segments from the text segments according to the fusion feature of each text segment in the text to be extracted, and determining the entity type of each entity text segment and the relationship between the entity text segments.

Specifically, the entity text segment is a text segment as an entity.

According to the embodiment of the application, the text segment which can be used as the entity is selected from all the text segments of the text to be extracted, and the purpose of extracting the entity from the text to be extracted is achieved. Further, the relationship between the extracted entity text segments is analyzed, and the entity relationship of each entity extracted from the text to be extracted can be determined.

For example, entity classification is performed on each text segment according to its fusion feature, so that each text segment is classified as an entity text segment or a non-entity text segment. In this way, the entity text segments can be determined from the text segments, and at the same time the entity type of each entity text segment can be determined from its fusion feature. For example, assuming that the text to be extracted is an electronic medical record, the entity type of an extracted entity text segment may be a name, a disease, and so on.

After the entity text segments are identified from the text segments, the embodiment of the application further identifies and classifies the relationship between the entity text segments according to the fusion characteristics of the entity text segments, so as to determine the relationship between the entity text segments, namely determine the relationship between the entities extracted from the text to be extracted.

It should be noted that, the information extraction method proposed in the embodiment of the present application seeks to extract entities from the text to be extracted at one time and determine the entity relationships, and therefore, although the processing procedure is to identify the entity text segments from the text segments of the text to be extracted in steps and then determine the relationships between the entity text segments, in practical applications, the identification results of the entity text segments and the identification results of the relationships between the entity text segments may be output at the same time.

As an exemplary implementation, the embodiment of the present application trains an entity extraction model in advance, and the entity extraction model may be obtained by training based on a feedforward neural network (FFNN). The entity extraction model takes the fusion feature f_i of each text segment in the text to be extracted as input, and can simultaneously output the entity classification result of each text segment and the relation classification result r_ij of the entity text segments.

The operation formula of the entity extraction model is as follows:

r_ij = softmax(FFNN([f_i, f_j, f_i∘f_j]))

where f_i represents the fusion feature of the ith text segment, the entity classification result gives the probability that the ith text segment is an entity text segment, f_i∘f_j represents the element-wise multiplication of the fusion feature of the ith entity text segment and the fusion feature of the jth entity text segment, and r_ij represents the relation classification result of the ith entity text segment and the jth entity text segment.
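The relation classification formula can be sketched as below. The pair feature follows the formula directly. The classifier is a hypothetical stand-in: a single random linear layer is used in place of a trained FFNN, and the number of relation classes is chosen arbitrarily.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pair_feature(f_i, f_j):
    """Build [f_i, f_j, f_i∘f_j]: both fusion features plus their element-wise product."""
    return list(f_i) + list(f_j) + [a * b for a, b in zip(f_i, f_j)]


def make_relation_classifier(in_dim, num_relations, seed=0):
    """Hypothetical classifier: one random linear layer standing in for the FFNN."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(num_relations)]

    def classify(f_i, f_j):
        x = pair_feature(f_i, f_j)
        logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return softmax(logits)  # r_ij: distribution over relation classes

    return classify


f_1, f_2 = [1.0, 2.0], [3.0, 4.0]  # fusion features of two entity text segments
classify = make_relation_classifier(in_dim=6, num_relations=3)
r_12 = classify(f_1, f_2)
```

For fusion features of dimension d, the pair feature has dimension 3d, which fixes the input size of the relation classifier.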

The entity extraction model can be obtained by training with a cross-entropy loss, in which entity labels and entity relation labels serve as supervision, S represents the set of all text segments of the text to be extracted, and S' represents the set of entity text segments in the text to be extracted.
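One way to picture such a training objective is sketched below. It assumes that the total loss is the unweighted sum of an entity term over all segments in S and a relation term over pairs of entity segments in S'. The exact form and weighting of the loss are not recoverable from this text, and all names and probabilities in the sketch are hypothetical.

```python
import math


def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold class under a predicted distribution."""
    return -math.log(probs[gold])


def joint_loss(entity_probs, entity_gold, relation_probs, relation_gold):
    """Sum of entity cross entropy over S and relation cross entropy over pairs in S'."""
    entity_term = sum(cross_entropy(entity_probs[i], entity_gold[i])
                      for i in entity_gold)
    relation_term = sum(cross_entropy(relation_probs[pair], relation_gold[pair])
                        for pair in relation_gold)
    return entity_term + relation_term


# Two segments in S, both labeled with entity class 1; one labeled pair in S'.
entity_probs = {0: [0.5, 0.5], 1: [0.25, 0.75]}
entity_gold = {0: 1, 1: 1}
relation_probs = {(0, 1): [0.5, 0.5]}
relation_gold = {(0, 1): 0}
loss = joint_loss(entity_probs, entity_gold, relation_probs, relation_gold)
```

Because both terms are computed from the same fusion features, minimizing their sum trains entity recognition and relation classification jointly.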

As a more preferable implementation, the embodiment of the present application trains an information extraction model in advance. Illustratively, the model can be obtained by training a feedforward neural network.

The information extraction model takes a text to be extracted and candidate entities similar to the text to be extracted as input, can divide text segments of the text to be extracted, and determines the fusion characteristics of the text segments in the text to be extracted according to the text segments and the candidate entities.

Based on the information extraction model, when the information extraction method provided by the embodiment of the application is implemented, only entities similar to the text to be extracted need to be selected from a preset knowledge base to serve as candidate entities, and then the text to be extracted and each candidate entity are input into the information extraction model obtained through training, wherein the model can output each entity extracted from the text to be extracted and output the relationship among the extracted entities.

The information extraction model can be divided into a text processing module, a feature extraction module and an entity extraction module, wherein the text processing module is used for dividing an input text to be extracted to obtain each text segment; the feature extraction module is used for determining the fusion features of all text segments in the text to be extracted according to all the text segments divided from the text to be extracted and all the input candidate entities; and the entity extraction module is used for determining each entity in the text to be extracted and the relationship between the entities according to the fusion characteristics of each text segment in the text to be extracted.

The specific working contents of the text processing module, the feature extraction module and the entity extraction module can be referred to the corresponding processing contents in the above method embodiment. The entity extraction module may adopt the entity extraction model described in the above embodiments.
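The division performed by the text processing module can be sketched as a span enumeration, in which each text segment is a single character or a run of consecutive characters. The upper bound `max_len` and the function name are assumptions introduced to keep the number of segments manageable.

```python
def enumerate_segments(text, max_len=4):
    """List every contiguous character span of length 1..max_len, in reading order."""
    segments = []
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(text):
                break  # the span would run past the end of the text
            segments.append(text[start:end])
    return segments


print(enumerate_segments("abc", max_len=2))  # ['a', 'ab', 'b', 'bc', 'c']
```

Without a length limit the number of segments grows quadratically with the text length, which is one reason early filtering of non-entity segments is useful.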

The training process of the information extraction model may refer to the training process of the entity extraction model, for example, the information extraction model may be trained by using cross entropy in training the entity extraction model. The specific training process is not described in detail.

Furthermore, the information extraction model may further include a candidate entity screening module, which is based on a preset knowledge base and is used to select entities similar to the input text to be extracted from the preset knowledge base as candidate entities. In this case, the information extraction model comprising the candidate entity screening module, the text processing module, the feature extraction module and the entity extraction module becomes an end-to-end entity and entity relationship extraction model. The entity and entity relationship extraction result can be obtained simply by inputting the text to be extracted into the information extraction model, which further improves the information extraction efficiency.

The specific working contents of the candidate entity screening module, the text processing module, the feature extraction module and the entity extraction module can be referred to the corresponding processing contents in the above method embodiments.
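The way the four modules chain together can be sketched as follows. The function `extract_information` and the lambda stubs are purely illustrative placeholders for the trained modules. In particular, the substring match and the toy classifier are assumptions, not the screening or classification methods of this application.

```python
def extract_information(text, select_candidates, split_segments,
                        fuse_features, classify):
    """Chain the four modules: screening, text processing, feature fusion, extraction."""
    candidates = select_candidates(text)         # candidate entity screening module
    segments = split_segments(text)              # text processing module
    fused = fuse_features(segments, candidates)  # feature extraction module
    return classify(segments, fused)             # entity extraction module


knowledge_base = ["fever", "aspirin"]  # hypothetical knowledge base entries

result = extract_information(
    "fever",
    select_candidates=lambda text: [e for e in knowledge_base if e in text],
    split_segments=lambda text: [text],
    fuse_features=lambda segments, candidates: [(s, candidates) for s in segments],
    classify=lambda segments, fused: {
        "entities": [s for s, c in fused if s in c],
        "relations": [],
    },
)
print(result)  # {'entities': ['fever'], 'relations': []}
```

Only the text is supplied by the caller, which mirrors the end-to-end use described above.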

In correspondence with the above-mentioned information extraction method, an embodiment of the present application further provides an information extraction apparatus, as shown in fig. 3, the apparatus including:

a candidate entity screening unit 100, configured to select an entity similar to the text to be extracted from a preset knowledge base, as a candidate entity;

a feature extraction unit 110, configured to determine, according to each text segment in the text to be extracted and each candidate entity, a fusion feature of each text segment in the text to be extracted; the text segment in the text to be extracted consists of a single character or more than two continuous characters in the text to be extracted, and the fusion feature comprises a text segment feature and a candidate entity feature;

and the information extraction unit 120 is configured to determine each entity in the text to be extracted and a relationship between the entities according to the fusion feature of each text segment in the text to be extracted.

As an optional implementation manner, the selecting, from a preset knowledge base, an entity similar to the text to be extracted as a candidate entity includes:

As an optional implementation, the apparatus further comprises:

and the information expansion unit is used for performing information expansion on the text to be extracted by utilizing the information related to the text to be extracted.

As an optional implementation manner, determining a fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity includes:

As an optional implementation, the apparatus further comprises: and the text segment screening unit is used for filtering non-entity text segments from the text segments according to the vector codes of the text segments in the text to be extracted after the vector codes of the text segments in the text to be extracted are respectively determined.

As an alternative implementation, determining the similarity between each candidate entity and each text segment according to the vector coding of each text segment and the vector coding of each candidate entity includes:

and normalizing the similarity of the text segment and each candidate entity.

As an optional implementation manner, determining each entity in the text to be extracted and the relationship between the entities according to the fusion feature of each text segment in the text to be extracted includes:

As an optional implementation manner, determining an entity text segment from each text segment according to the fusion feature of each text segment in the text to be extracted, and determining the entity type of each entity text segment and the relationship between each entity text segment includes:

As an optional implementation manner, determining a fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity; and determining each entity in the text to be extracted and the relationship among the entities according to the fusion characteristics of each text segment in the text to be extracted, wherein the determining comprises the following steps:

Specifically, please refer to the specific content of the corresponding processing steps in the information extraction method for the specific work content of each unit of the information extraction apparatus.

Another embodiment of the present application further provides an information extraction device, as shown in fig. 4, the information extraction device includes:

a memory 200 and a processor 210;

wherein, the memory 200 is connected to the processor 210 for storing programs;

the processor 210 is configured to implement the information extraction method disclosed in any of the above embodiments by running the program stored in the memory 200.

Specifically, the information extraction device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.

The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:

a bus may include a path that transfers information between components of a computer system.

The processor 210 may be a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in accordance with the solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.

The memory 200 stores programs for executing the technical solution of the present application, and may also store an operating system and other key services. In particular, the program may include program code, and the program code includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that can store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that can store information and instructions, a disk storage, a flash memory, and so forth.

The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.

Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.

Communication interface 220 may include any device that uses any transceiver or the like to communicate with other devices or communication networks, such as an ethernet network, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.

The processor 210 executes the program stored in the memory 200 and invokes other devices, which can be used to implement the steps of the information extraction method provided by the above-mentioned embodiment of the present application.

Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the information extraction method provided in the foregoing embodiment of the present application.

Specifically, the specific working contents of each part of the information extraction device and the specific processing contents of the computer program on the storage medium when being executed by the processor can refer to the contents of each embodiment of the information extraction method, and are not described herein again.

While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combinations of acts, it will be appreciated by those skilled in the art that the present application is not limited by the described order of acts, as some steps may occur in other orders or concurrently in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by this application.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.

The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.

In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An information extraction method, comprising:

2. The method according to claim 1, wherein the selecting of entities similar to the text to be extracted from the predetermined knowledge base as candidate entities comprises:

3. The method according to claim 1 or 2, characterized in that the method further comprises:

4. The method according to claim 1, wherein determining the fusion feature of each text segment in the text to be extracted according to each text segment in the text to be extracted and each candidate entity comprises:

5. The method according to claim 4, wherein after determining the vector encoding of each text segment in the text to be extracted, the method further comprises:

6. The method of claim 4, wherein determining the similarity of each candidate entity to each text segment based on the vector coding of each text segment and the vector coding of each candidate entity comprises:

and normalizing the similarity of the text segment and each candidate entity.

7. The method according to claim 1, wherein determining each entity in the text to be extracted and the relationship between entities according to the fusion feature of each text segment in the text to be extracted comprises:

8. The method according to claim 7, wherein determining an entity text segment from each text segment according to the fusion feature of each text segment in the text to be extracted, and determining the entity type of each entity text segment and the relationship between each entity text segment comprises:

9. The method according to claim 1, wherein fusion features of each text segment in the text to be extracted are determined according to each text segment in the text to be extracted and each candidate entity; and determining each entity in the text to be extracted and the relationship among the entities according to the fusion characteristics of each text segment in the text to be extracted, wherein the determining comprises the following steps:

10. An information extraction apparatus, characterized by comprising:

11. An information extraction device characterized by comprising:

a memory and a processor;

the memory is connected with the processor and used for storing programs;

the processor is configured to implement the information extraction method according to any one of claims 1 to 9 by executing a program in the memory.

12. A storage medium having stored thereon a computer program which, when executed by a processor, implements the information extraction method according to any one of claims 1 to 9.