CN114692644A

CN114692644A - Text entity labeling method, device, equipment and storage medium

Info

Publication number: CN114692644A
Application number: CN202210242288.8A
Authority: CN
Inventors: 谢育涛; 俞声; 夏俊; 袁正
Original assignee: Tsinghua University; International Digital Economy Academy IDEA
Current assignee: Tsinghua University; International Digital Economy Academy IDEA
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2022-07-01
Anticipated expiration: 2042-03-11
Also published as: CN114692644B

Abstract

The invention relates to the technical field of text data processing, and in particular to a text entity labeling method, device, equipment and storage medium. The method first marks the target entities in a text to be labeled, then labels a semantic type on each marked target entity through a semantic type annotator, and outputs the target entities together with their labeled semantic types. On the one hand, the text to be labeled is labeled by the semantic type annotator instead of manually, which improves labeling accuracy. On the other hand, because the target entities are marked in advance, the semantic type annotator can locate them precisely by their marks and is prevented from labeling non-target entities, which improves the labeling speed and further improves the labeling accuracy.

Description

Text entity labeling method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of text data processing, in particular to a text entity labeling method, a text entity labeling device, text entity labeling equipment and a storage medium.

Background

A text contains entities such as English words, Chinese words and punctuation marks. When entity labeling is performed on a text, the useful entities (target entities) in the text must first be found, and semantic types are then labeled on the target entities, where a semantic type distinguishes the category of a target entity.

For example, in the field of biomedical information, a named entity recognition (NER) system can be applied to discover biomedical entities in medical texts. The NER system is built with a deep learning method, and training the deep learning model requires a large amount of entity-labeled text data. This data is obtained by manually labeling the biomedical entities in medical texts, so the labeling accuracy is low.

In summary, the existing text entity labeling method has low labeling accuracy.

Thus, there is a need for improvements and enhancements in the art.

Disclosure of Invention

In order to solve the above technical problem, the invention provides a text entity labeling method, device, equipment and storage medium, which solve the problem of low labeling accuracy in the existing text entity labeling method.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a text entity labeling method, including:

acquiring a text to be marked;

marking a target entity contained in the text to be marked, wherein the target entity corresponds to the characteristics of the text to be marked;

inputting the text to be labeled, in which the target entity has been marked, into a trained semantic type annotator, and performing semantic type labeling on the marked target entity through the trained semantic type annotator to obtain a labeled text, wherein the semantic type corresponds to the category of the target entity.

In one implementation manner, the marking out the target entity included in the text to be labeled includes:

acquiring the original entity library;

obtaining the information field of the original entity library according to the original entity library;

according to the information field, a white list character domain matched with the information field is constructed;

cleaning the entities in the original entity library according to the white list character domain to obtain the cleaned original entity library;

performing word segmentation on the text to be labeled;

marking the text to be marked after the word is cut through the cleaned original entity library so as to mark a target entity in the text to be marked.

In one implementation, the training method of the trained semantic type annotator comprises the following steps:

acquiring an original entity library and a sample text;

marking out sample entities contained in the sample text through the original entity library;

labeling semantic sample types for the sample texts marked with the sample entities to obtain labeled sample texts;

and training the semantic type annotator through the marked sample text to obtain the trained semantic type annotator.

In one implementation, the marking out the sample entities contained in the sample text by the original entity library includes:

marking out sample entities contained in the sample text according to the cleaned original entity library;

obtaining brackets contained in the original entity library according to the original entity library;

according to the information corresponding to the brackets, cleaning the entities containing the brackets to obtain the cleaned original entity library;

obtaining a meaningless entity and/or an entity containing abnormal head and tail characters contained in the original entity library according to the original entity library, wherein the meaningless entity is an entity without actual meaning, and the entity containing abnormal head and tail characters is an entity whose head and tail characters do not match the language of the entity;

cleaning the meaningless entities and/or the entities containing abnormal head and tail characters from the original entity library to obtain the cleaned original entity library.

In one implementation, the marking out the sample entities contained in the sample text according to the original entity library after the cleaning includes:

carrying out independent word segmentation processing on the sample text to obtain the sample text after the independent word segmentation processing;

marking sample entities contained in the sample text after the independent word segmentation processing according to the cleaned original entity library.

In one implementation, the labeling the sample text marked with the sample entity with a semantic sample type for the sample entity to obtain a labeled sample text includes:

obtaining an ambiguous entity and a non-ambiguous entity in the sample entity according to the sample entity, wherein the number of semantic types corresponding to the ambiguous entity is more than one, and the non-ambiguous entity is an entity with a unique semantic type;

obtaining context information of the ambiguous entity in the sample text according to the ambiguous entity;

according to the context information, labeling semantic sample types aiming at the ambiguous entities to obtain labeled first sample texts in the labeled sample texts;

and according to the non-ambiguous entity, labeling a semantic sample type aiming at the non-ambiguous entity to obtain a labeled second sample text in the labeled sample text.

In one implementation, the training a semantic type annotator by the labeled sample text to obtain the trained semantic type annotator includes:

training a semantic type marker through the labeled first sample text and the labeled second sample text in the labeled sample text to obtain the trained semantic type marker.

In one implementation, the method further comprises:

acquiring an original entity library containing a seed entity, wherein the original entity library is used for marking a target entity contained in the text to be marked;

extracting noun phrases containing the seed entities from the text to be labeled;

correcting the noun phrase to obtain the corrected noun phrase, wherein the corrected noun phrase is matched with the structure of an entity;

and adding the modified noun phrases into the original entity library to obtain the modified original entity library.

In one implementation, after the inputting the text to be labeled, in which the target entity has been marked, into a trained semantic type annotator and performing semantic type labeling on the marked target entity through the trained semantic type annotator to obtain a labeled text, wherein the semantic type corresponds to the category to which the target entity belongs, the method further comprises:

counting the proportion of all the target entities in the text to be labeled;

counting the probability of each target entity subjected to semantic type labeling in the text to be labeled;

and according to the proportion and the probability, performing semantic type labeling on the target entity and the non-target entity in the labeled text again to obtain an optimized labeled text, wherein the non-target entity is an entity except the target entity in the labeled text.

In one implementation, the text to be labeled is biomedical text.

In a second aspect, an embodiment of the present invention further provides a text entity tagging device, where the device includes the following components:

the text acquisition module is used for acquiring a text to be labeled;

the marking module is used for marking out a target entity contained in the text to be marked;

and the labeling module is used for inputting the text to be labeled, in which the target entity has been marked, into a trained semantic type annotator, and performing semantic type labeling on the marked target entity through the trained semantic type annotator to obtain a labeled text.

In a third aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a text entity tagging program that is stored in the memory and is executable on the processor, and when the processor executes the text entity tagging program, the steps of the text entity tagging method are implemented.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a text entity labeling program is stored on the computer-readable storage medium, and when the text entity labeling program is executed by a processor, the steps of the text entity labeling method are implemented.

Beneficial effects: the method first marks the target entities in a text to be labeled, then labels a semantic type on each marked target entity through a semantic type annotator, and outputs the target entities together with their labeled semantic types. On the one hand, the text to be labeled is labeled by the semantic type annotator instead of manually, which improves labeling accuracy. On the other hand, because the target entities are marked in advance, the semantic type annotator can locate them precisely by their marks and is prevented from labeling non-target entities, which improves the labeling speed and further improves the labeling accuracy.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a flow chart of semantic type tagger training of the present invention;

FIG. 3 is a labeling flowchart in the embodiment;

FIG. 4 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is clearly and completely described below by combining the embodiment and the attached drawings of the specification. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Research shows that a text contains entities such as English words, Chinese words and punctuation marks. When entity labeling is performed on a text, the useful entities (target entities) in the text must first be found, and semantic types are then labeled on the target entities, where a semantic type distinguishes the category of a target entity. For example, in the field of biomedical information, a named entity recognition (NER) system can be applied to discover biomedical entities in medical texts. The NER system is built with a deep learning method, and training the deep learning model requires a large amount of entity-labeled text data. This data is obtained by manually labeling the biomedical entities in medical texts, so the labeling accuracy is low.

In order to solve the above technical problem, the invention provides a text entity labeling method, device, equipment and storage medium, which solve the problem of low labeling accuracy in the existing text entity labeling method. In a specific implementation, a text to be labeled is first obtained, the target entities contained in the text to be labeled are then marked, and finally the text with the marked target entities is input into a semantic type annotator to obtain a labeled text. The labeling method of this embodiment can improve both labeling speed and labeling accuracy.

For example, the text to be labeled is the following: "Wangzhi suffers from depression, frequent insomnia, reduced food intake and low mood due to work fatigue." The wording shows that this is a medical diagnosis record. The target entities in the text (that is, the tokens that match medical characteristics) are marked first; here they are "depression", "frequent insomnia", "reduced food intake" and "low mood". These target entities are marked so that the semantic type annotator can accurately find their positions. The semantic type annotator then labels them: "depression" is labeled as a disease, and "frequent insomnia", "reduced food intake" and "low mood" are labeled as symptoms. Finally, the labeled text output by the semantic type annotator is as follows: due to work fatigue, Wangzhi suffers from depression (target entity: depression, start position: 11, end position: 13, semantic type: disease), frequent insomnia (target entity: frequent insomnia, start position: 15, end position: 18, semantic type: symptom), reduced food intake (target entity: reduced food intake, start position: 20, end position: 23, semantic type: symptom) and low mood (target entity: low mood, start position: 25, end position: 28, semantic type: symptom). The start and end positions are the indices of the characters in the sentence; the values shown do not correspond to character offsets in this English rendering.
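As a minimal illustrative sketch (not taken from the patent), a marked and labeled target entity of the kind described above could be held in a small record with an inclusive start and end character index. The class name `LabeledEntity`, the helper `find_entity` and the English sample sentence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledEntity:
    text: str           # surface form of the target entity
    start: int          # index of the first character in the sentence
    end: int            # index of the last character (inclusive)
    semantic_type: str  # category of the entity, e.g. "disease"


def find_entity(sentence: str, entity: str, semantic_type: str) -> LabeledEntity:
    """Locate the first occurrence of `entity` and record its span."""
    start = sentence.index(entity)  # raises ValueError if the entity is absent
    return LabeledEntity(entity, start, start + len(entity) - 1, semantic_type)


sentence = "Patient reports insomnia and low mood."
labeled = [
    find_entity(sentence, "insomnia", "symptom"),
    find_entity(sentence, "low mood", "symptom"),
]
```

The inclusive end index mirrors the example above, where a three-character entity spans positions 11 to 13.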

Exemplary method

The text entity labeling method of the embodiment can be applied to terminal equipment, and the terminal equipment can be a terminal product with a text data processing function, such as a computer. In this embodiment, as shown in fig. 1, the text entity labeling method specifically includes the following steps:

S100, acquiring a text to be labeled.

The text to be labeled in the embodiment is a biomedical text (i.e., a medical knowledge database), and the text to be labeled may also be a product specification. When the text to be labeled is a product specification, how the product specification is labeled is illustrated by the following example:

For example, the text to be labeled is: "A vase is displayed in a certain shopping mall; the vase body is cylindrical, and the whole vase is blue."

S200, marking out the target entity contained in the text to be marked.

The marking of the target entity in this embodiment is to mark a start position and an end position of the target entity in the sentence where the target entity is located. The position of the target entity needs to be marked, so that a subsequent semantic type marker can find the target entity in the text to be marked so as to mark the semantic type of the target entity.

In the embodiment, the target entity is an entity capable of reflecting the characteristics of the text to be annotated, and the step S200 includes the following steps S201, S202, S203, S204, S205, and S206:

S201, obtaining the original entity library.

The original entity library in this embodiment is derived from an existing medical knowledge base.

S202, obtaining the information field of the original entity library according to the original entity library.

S203, according to the information field, a white list character domain matched with the information field is constructed.

S204, cleaning the entity in the original entity library according to the white list character domain to obtain the cleaned original entity library.

S205, carrying out segmentation processing on the text to be annotated.

S206, marking the text to be marked after the word is cut through the cleaned original entity library so as to mark a target entity in the text to be marked.

In this embodiment, each target entity in the text to be labeled is marked by matching against the seed entities in the original entity library. Before marking, both the seed entities in the original entity library and the text to be labeled need to be preprocessed to remove useless data, which improves marking accuracy and marking speed.

For example, the original entity library is a medical entity library containing seed entities such as "headache", "rhinorrhea", "chronic pharyngitis" and "pi" (the circumference ratio). Since "pi" is obviously unrelated to medicine, it is deleted from the original entity library to obtain the preprocessed original entity library.

For example, the result of marking the target entities contained in the text to be labeled is as follows: a certain shopping mall displays a vase (target entity: vase, start position: 8, end position: 9); the vase body is cylindrical (target entity: cylindrical, start position: 14, end position: 16); and the whole vase is blue (target entity: blue, start position: 21, end position: 22). The start and end positions are the indices of the characters in the sentence.
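The marking in step S206 can be sketched, under simplifying assumptions, as a lookup of each segmented token in the cleaned entity library. The function name `mark_entities`, the single-token entities and the convention that offsets refer to the tokens joined by single spaces are assumptions of this sketch, so the offsets differ from the positions quoted in the example above.

```python
def mark_entities(tokens, entity_library):
    """Return (entity, start, end) for every token found in the library.

    start and end are inclusive character offsets in the text obtained by
    joining the tokens with single spaces (an assumption of this sketch).
    """
    marks = []
    offset = 0
    for token in tokens:
        if token.lower() in entity_library:
            marks.append((token, offset, offset + len(token) - 1))
        offset += len(token) + 1  # +1 accounts for the joining space
    return marks


# Hypothetical cleaned entity library for the product-specification example.
library = {"vase", "cylindrical", "blue"}
tokens = "A vase is shown ; the body is cylindrical and the whole vase is blue".split()
marks = mark_entities(tokens, library)
```

Only the positions are recorded at this stage; the semantic type is added later by the annotator in step S300.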

S300, inputting the text to be labeled, in which the target entity has been marked, into a trained semantic type annotator, and performing semantic type labeling on the marked target entity through the trained semantic type annotator to obtain a labeled text.

The labeled text corresponding to the target entities marked in step S200 is: a vase (semantic type: product name) is displayed in a certain shopping mall; the vase body is cylindrical (semantic type: product shape), and the whole vase is blue (semantic type: product color).

Steps S100 to S300 mark the target entities on a word-segmented text to obtain the labeled text. The following example illustrates the advantage of word segmentation:

For example, the text to be labeled is: "Headache and rhinorrhea appear; these conditions are common symptoms of a cold." The text to be labeled is preprocessed by separating the tokens in the text. The segmented text comprises: "appear", "headache", "rhinorrhea", "these conditions", "belong to", "cold", "of" and "common symptoms". The original entity library is also preprocessed to remove invalid entities, which increases the marking speed when the original entity library is used to mark the text to be labeled. Segmenting the text to be labeled by token prevents "headache" from being merged with neighboring text into a single unit; if it were merged in that way, the "headache" in the text to be labeled could not be marked through the "headache" entry in the original entity library.

The semantic type in this embodiment is used to reflect the category to which the target entity belongs, and this embodiment includes two parts: training a semantic type annotator (first part), and annotating the text to be annotated by using the trained semantic type annotator (second part).

Training the semantic type annotator comprises the following steps S301, S302, S303, S304, S305, S306, S307, S308, S309, S3010, S3011, S3012, S3013, S3014 and S3015:

S301, acquiring an original entity library and a sample text.

The original entity library is a database containing a plurality of entities. In this embodiment, the original entity library serves as the reference when target entities are marked in the text to be labeled, and it is also needed when the semantic type annotator is trained. The original entity library in this embodiment is derived from a biomedical knowledge base. However, some published mainstream biomedical knowledge bases, such as UMLS (Unified Medical Language System), the largest-scale biomedical knowledge graph, still contain a large number of poor-quality entities, for example non-biomedical entity words such as "between" and "68". Labeling texts directly with these words causes a serious labeling noise problem (labeling noise means that non-medical entity words in a text are labeled as medical entity words). In addition, redundant punctuation marks in a directly acquired sample text can affect the training of the semantic type annotator. Therefore, before the semantic type annotator is trained, the original entity library and the sample text need to be cleaned to remove invalid data, so that the invalid data does not interfere with training, which improves the accuracy with which the trained semantic type annotator labels the text to be labeled. In this embodiment, the original entity library is cleaned through the following steps S302 to S308:

S302, obtaining the information field of the original entity library according to the original entity library.

When the original entity library relates to medical records, the original entity library belongs to the field of medical information. When the original entity library relates to sports and fitness methods, it belongs to the field of sports information.

S303, according to the information field, a white list character domain matched with the information field is constructed.

S304, cleaning the entities in the original entity library according to the white list character domain.

In this embodiment, taking the field of medical information as an example, the white list character domain consists of characters related to medicine. For example, the characters of "rabies vaccine" are white list characters in the field of medicine, whereas those of "territorial area" are not. When an entity in the original entity library does not match the white list character domain, the unmatched entity is cleaned from the original entity library to obtain the cleaned original entity library.

For example, when the original entity library contains an entity that includes "xxx", and "xxx" is not a medically relevant character, the entity in which "xxx" is located is removed from the original entity library to complete the cleaning of the original entity library.
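The white list cleaning of steps S303 and S304 can be illustrated with the following minimal sketch, which drops every entity containing a character outside an allowed set. The concrete character set `WHITELIST`, the function name `clean_by_whitelist` and the sample entities are assumptions; the patent does not specify the contents of the white list character domain.

```python
import string

# Hypothetical white list for an English biomedical library: letters, digits,
# spaces and a few punctuation marks that occur in legitimate terms.
WHITELIST = set(string.ascii_letters + string.digits + " -,.'()/+")


def clean_by_whitelist(entities, whitelist=WHITELIST):
    """Keep only the entities whose characters all belong to the white list."""
    return [e for e in entities if all(ch in whitelist for ch in e)]


library = ["rabies vaccine", "sars-cov-1", "head#ache", "chronic pharyngitis"]
cleaned = clean_by_whitelist(library)
```

Here "head#ache" is removed as a whole because "#" is outside the assumed white list, matching the behavior described above in which the entire entity is cleaned rather than only the offending character.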

S305, obtaining brackets contained in the original entity library according to the original entity library.

And S306, cleaning the entity containing the brackets according to the information corresponding to the brackets.

In this embodiment, the information corresponding to the brackets includes whether a bracket is single (unpaired) or paired, and whether the left bracket and the right bracket are of the same type (for example, if the left bracket is an English bracket and the right bracket is also an English bracket, the bracket types are the same). When a bracket is single, or when the left bracket and the right bracket are of different types, the entity containing the brackets is removed from the original entity library to complete the cleaning of the original entity library.

For example, in "(rabies vaccine）" the left bracket is an English bracket and the right bracket is a Chinese bracket. Because the types of the left and right brackets differ, "(rabies vaccine）" is removed from the original entity library. "sars-cov-1 (2003" has only one bracket and also needs to be removed from the original entity library.

Entities containing brackets of inconsistent types (non-standard brackets) are removed for the following reason: when entity labeling is performed on the sample text, if the entities in the sample text use brackets of the standard type, an entry in the original entity library that contains non-standard brackets cannot match them. Such an entry therefore fails to mark the corresponding entity in the sample text, which affects the training of the semantic type annotator.
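The bracket check of steps S305 and S306 can be sketched as follows, assuming that "English" brackets are the half-width pair and "Chinese" brackets are the full-width pair. The function names and the stack-based check are illustrative choices, not the patent's stated implementation.

```python
# Expected closing bracket for each opening bracket.
OPENERS = {
    "(": ")",            # English (half-width) brackets
    "\uff08": "\uff09",  # Chinese (full-width) brackets
}
CLOSERS = set(OPENERS.values())


def has_valid_brackets(entity):
    """True if every bracket is paired with a closer of the same type."""
    stack = []
    for ch in entity:
        if ch in OPENERS:
            stack.append(OPENERS[ch])  # remember which closer is expected
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False  # unpaired closer, or left/right types differ
    return not stack  # any opener left on the stack is unpaired


def clean_brackets(entities):
    """Remove entities with single or type-mismatched brackets."""
    return [e for e in entities if has_valid_brackets(e)]


library = ["(rabies vaccine\uff09", "sars-cov-1 (2003", "vitamin b (b1)", "headache"]
cleaned = clean_brackets(library)
```

Entities without brackets pass through unchanged, so this step only affects the entries the paragraph above is concerned with.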

S307, obtaining the meaningless entities and/or the entities containing abnormal head and tail characters in the original entity library, wherein a meaningless entity is an entity without actual meaning, and an entity containing abnormal head and tail characters is an entity whose head and tail characters do not match the language of the entity.

S308, cleaning the meaningless entities and/or entities containing abnormal head and tail characters from the original entity library.

The meaningless entities in this embodiment are entities that contain only numbers and special symbols and carry no unit information. Such entities do not convey any valuable information; if they were retained in the original entity library, they would increase the amount of matching and searching when the original entity library is used to mark the sample text. These meaningless entities therefore need to be removed from the original entity library.

For example, "2.5", "3-" and "&" may appear in an original entity library related to medicine. No valuable information can be obtained from these numbers and symbols, and neither the sample text nor the text to be labeled contains such worthless entities, so there is no need to keep them in the original entity library.

An entity containing abnormal head and tail characters is, for example, an English word whose first or last character is a non-English character, such as "dermo-".
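The two filters of steps S307 and S308 can be sketched as below. The criteria used here (no letter at all means meaningless; a first or last character that is not an ASCII letter or digit means abnormal) are one possible reading of the description for English entities and are assumptions of this sketch, as are the function names.

```python
def is_meaningless(entity):
    """Only digits and special symbols, i.e. no letter that could carry a unit."""
    return not any(ch.isalpha() for ch in entity)


def _is_english_alnum(ch):
    return ch.isascii() and ch.isalnum()


def has_abnormal_ends(entity):
    """For an English entity: first or last character is not a letter or digit."""
    return not (_is_english_alnum(entity[0]) and _is_english_alnum(entity[-1]))


def clean_entities(entities):
    # is_meaningless is checked first, so empty strings never reach
    # has_abnormal_ends (which indexes the first and last character).
    return [
        e for e in entities
        if not is_meaningless(e) and not has_abnormal_ends(e)
    ]


library = ["2.5", "3-", "&", "dermo-", "headache", "sars-cov-1", "5 mg"]
cleaned = clean_entities(library)
```

An entry such as "5 mg" is kept because the letters provide unit information, whereas "2.5" alone is removed.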

This embodiment may first remove from the original entity library the entities containing characters that are not in the white list character domain, then remove the entities whose bracket types are inconsistent, and then remove the meaningless entities and/or the entities containing abnormal head and tail characters. In this way, both characters outside the white list character domain and inconsistent bracket types are eliminated.

For example, the original entity library contains the entity "(xxx) teriflunomide", whose left and right brackets are a Chinese bracket and an English bracket respectively. If the entities with inconsistent bracket types were handled first, the final result would be that only the bracketed part is cleaned, and "teriflunomide" would still remain in the original entity library, whereas the desired result is that the entire entity is purged from the original entity library. In this embodiment, the characters that are not in the white list character domain are removed from the original entity library first, and the entities with inconsistent bracket types are removed afterwards, which avoids this problem: the whole of "(xxx) teriflunomide" is cleaned from the original entity library.

This embodiment may alternatively first remove the meaningless entities and/or the entities containing abnormal head and tail characters, then remove from the original entity library the characters that are not in the white list character domain, and finally remove the entities whose bracket types are inconsistent.

S309, performing independent word segmentation processing on the sample text to obtain the sample text after independent word segmentation processing.

When the original entity library is used to mark the entities contained in the sample text, the sample text needs to be segmented, that is, each token (a token is a word or a Chinese character) is separated from the others. When the sample text is in English, punctuation marks often immediately follow English words. If the punctuation marks are not separated from the words, the entities contained in the sample text cannot be marked when the original entity library is used to mark the sample text.

For example, consider the sentence "chronic Pa infection is associated with reduced lung function,", in which "function" is immediately followed by ",". If the sentence is segmented on spaces as in the prior art, the result is "chronic", "Pa", "infection", "is", "associated", "with", "reduced", "lung", "function,". In this sentence "lung function" is a sample entity that needs to be marked, but because "function" in the sample text is followed by "," while the original entity library contains "lung function", the "lung function" in the sample text is not marked. With the following segmentation of this embodiment, the "lung function" in the sample text can be marked:

"chronic", "Pa", "infection", "is", "associated", "with", "reduced", "lung", "function", ",".

This embodiment separates the token "function" from the non-token "," during segmentation, so that the "lung function" in the sample text can be marked when the "lung function" entry in the original entity library is used.
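The independent word segmentation of step S309 can be approximated, for English text, by a regular expression that emits each run of word characters and each punctuation mark as its own token. The pattern and the function name `segment` are assumptions of this sketch rather than the tokenizer used in the patent.

```python
import re

# A token is either a run of word characters or a single character that is
# neither a word character nor whitespace (i.e. a punctuation mark or symbol).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def segment(text):
    """Split text into tokens, separating punctuation from adjacent words."""
    return TOKEN_PATTERN.findall(text)


sentence = "chronic Pa infection is associated with reduced lung function,"
space_tokens = sentence.split()  # prior-art style: split on whitespace only
tokens = segment(sentence)       # punctuation becomes a separate token
```

With whitespace splitting the last token is "function," and cannot match a library entry ending in "function"; with `segment` the comma is a separate token. Note that this simple pattern also splits hyphenated terms such as "sars-cov-1" at each hyphen, so a production tokenizer would need additional rules for such entities.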

S3010, marking out sample entities contained in the sample texts after the independent word segmentation processing according to the cleaned original entity library.

After the cleaned original entity library and the segmented sample text have been obtained, this embodiment completes the training of the semantic type annotator through steps S3011 to S3015:

S3011, obtaining an ambiguous entity and a non-ambiguous entity in the sample entity according to the sample entity, wherein the number of semantic types corresponding to the ambiguous entity is greater than one, and the non-ambiguous entity is an entity with a unique semantic type.

S3012, obtaining context information of the ambiguous entity in the sample text according to the ambiguous entity.

S3013, according to the context information, labeling semantic sample types for the ambiguous entities to obtain labeled first sample texts in the labeled sample texts.

For example, "rabies vaccine" is an ambiguous entity: considered in isolation, its semantic type could be either pharmacological substance or immune factor. When the context of "rabies vaccine" is taken into account, namely "someone has been injected with rabies vaccine, thus reducing the probability of getting rabies", it is known that in this sample text the semantic type corresponding to "rabies vaccine" is "immune factor".

S3014, according to the non-ambiguous entity, labeling a semantic sample type for the non-ambiguous entity to obtain a labeled second sample text in the labeled sample texts.

In this embodiment, when the sample entity in the sample text is retrieved as the non-ambiguous entity, the semantic sample type is directly labeled to the sample entity in the sample text, so as to obtain the labeled second sample text. When a sample entity in the retrieved sample text is an ambiguous entity, the sample entity is labeled with a semantic sample type in relation to the context in which the sample entity is located. After the corresponding semantic sample types are added to the ambiguous entities and the non-ambiguous entities in the sample text, the labeling of all the sample entities in the sample text is completed, and the labeled text containing the labeled first sample text and the labeled second sample text is obtained.
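The split between ambiguous and non-ambiguous sample entities in step S3011 can be sketched as follows, assuming the original entity library is available as a mapping from each entity to its set of candidate semantic types. The mapping `type_lookup` and the function names are hypothetical, and the context-based choice for ambiguous entities (steps S3012 and S3013) is deliberately left out of this sketch.

```python
def split_by_ambiguity(sample_entities, type_lookup):
    """Partition entities by how many semantic types the library lists."""
    ambiguous, unambiguous = [], []
    for entity in sample_entities:
        types = type_lookup.get(entity, set())
        if len(types) > 1:
            ambiguous.append(entity)    # needs context to pick a type
        elif len(types) == 1:
            unambiguous.append(entity)  # can be labeled directly
    return ambiguous, unambiguous


def label_unambiguous(entity, type_lookup):
    """Return the single semantic type of a non-ambiguous entity."""
    (only_type,) = type_lookup[entity]  # fails loudly if not exactly one type
    return only_type


type_lookup = {
    "rabies vaccine": {"pharmacological substance", "immune factor"},
    "depression": {"disease"},
    "insomnia": {"symptom"},
}
ambiguous, unambiguous = split_by_ambiguity(
    ["rabies vaccine", "depression", "insomnia"], type_lookup
)
```

Entities that do not appear in the lookup are ignored here, since they would not have been marked as sample entities in the first place.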

S3015, training a semantic type marker through the labeled first sample text and the labeled second sample text in the labeled sample text to obtain the trained semantic type marker.

In this embodiment, the labeled sample text is input into the semantic type annotator, and the semantic type that the annotator outputs for each sample entity is compared with the semantic type recorded for that entity in the labeled sample text. If the two differ, the semantic type annotator is adjusted until they are the same, at which point the training of the semantic type annotator is complete.

By way of example, a labeled sample text reads: the role of [ topic association ] (Therapeutic or productive process) in the management of [ actual knowledge ] (Disease or Syndrome), where [ ] encloses a labeled sample entity and ( ) encloses the semantic type labeled for that sample entity. The labeled sample text is input into the semantic type annotator, which outputs a new semantic type for each sample entity. The new semantic type is compared with the semantic type in ( ); if the two are inconsistent, the parameters of the semantic type annotator are adjusted until the new semantic type is the same as the semantic type in ( ), which completes the training of the semantic type annotator.
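The "compare, then adjust until the outputs agree" loop can be illustrated with a very small stand-in model. The text does not state which model or update rule the semantic type annotator uses, so the multiclass perceptron below, its bag-of-words features, and the names `PerceptronAnnotator` and `train` are assumptions chosen only to make the loop concrete.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One training sample: (tokens of the entity and its context, gold semantic type).
Sample = Tuple[List[str], str]


class PerceptronAnnotator:
    """Toy stand-in for the semantic type annotator (assumed model)."""

    def __init__(self, semantic_types: List[str]) -> None:
        self.semantic_types = list(semantic_types)
        self.weights: Dict[str, Dict[str, float]] = {
            t: defaultdict(float) for t in self.semantic_types
        }

    def score(self, tokens: List[str], semantic_type: str) -> float:
        weights = self.weights[semantic_type]
        return sum(weights[token] for token in tokens)

    def predict(self, tokens: List[str]) -> str:
        return max(self.semantic_types, key=lambda t: self.score(tokens, t))

    def update(self, tokens: List[str], gold: str, predicted: str) -> None:
        # Move weights toward the gold type and away from the wrong output.
        for token in tokens:
            self.weights[gold][token] += 1.0
            self.weights[predicted][token] -= 1.0


def train(annotator: PerceptronAnnotator, samples: List[Sample], max_epochs: int = 50) -> bool:
    """Adjust parameters until every output matches its label; True if that happened."""
    for _ in range(max_epochs):
        mistakes = 0
        for tokens, gold in samples:
            predicted = annotator.predict(tokens)
            if predicted != gold:
                annotator.update(tokens, gold, predicted)
                mistakes += 1
        if mistakes == 0:
            return True
    return False
```

The `max_epochs` cap is a practical safeguard added for the sketch, since data that is not separable by this simple model would otherwise never reach full agreement.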

The process of training the semantic type annotator in this embodiment is shown in fig. 2. The original entity library (entities) in fig. 2 contains terms (seed entities) and their corresponding semantic types; some of the terms are ambiguous and correspond to multiple semantic types. A dictionary tree is constructed from the original entity library and is then used to match the sample entities (labeled data) in the sample text (i.e., the text corpus); during this process, the ambiguous terms appearing in sentences and their semantic types are labeled by longest prefix matching. Meanwhile, a semantic type classification model classifies the terms in the text corpus according to their context, and the context is enhanced by masking (MASK) part of the terms. Finally, the labeled data, the ambiguous terms, and the semantic types corresponding to the ambiguous terms are input into the semantic type auxiliary annotator (the semantic type annotator), which completes its training.
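The dictionary tree and longest prefix matching mentioned above can be sketched as follows. This is a minimal token-level trie under assumed conventions (whitespace tokenization, lowercase terms); the class and function names and the sample library entries are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on nodes where a complete library term ends.
        self.semantic_types: Optional[Set[str]] = None


def build_trie(library: Dict[str, Set[str]]) -> TrieNode:
    """Insert every term of the entity library, token by token."""
    root = TrieNode()
    for term, types in library.items():
        node = root
        for token in term.split():
            node = node.children.setdefault(token, TrieNode())
        node.semantic_types = set(types)
    return root


def match_longest(tokens: List[str], root: TrieNode) -> List[Tuple[int, int, Set[str]]]:
    """Scan left to right, keeping the longest library term starting at each position."""
    matches: List[Tuple[int, int, Set[str]]] = []
    i = 0
    while i < len(tokens):
        node = root
        best_end: Optional[int] = None
        best_types: Optional[Set[str]] = None
        j = i
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.semantic_types is not None:
                best_end, best_types = j, node.semantic_types
        if best_end is None or best_types is None:
            i += 1  # no term starts here
        else:
            matches.append((i, best_end, best_types))  # end index is exclusive
            i = best_end
    return matches
```

A match that returns more than one semantic type marks an ambiguous term, which is then resolved from context as described in steps S3012 and S3013.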

The training of the semantic type annotator is completed through the steps S301-S3015, and then the trained semantic type annotator is used for annotating the text to be annotated.

While labeling the text to be labeled, this embodiment also extracts noun phrases containing a seed entity and judges whether each noun phrase is an entity; if it is, the noun phrase is added to the original entity library containing the seed entity. This enriches the original entity library so that target entities in the next text to be labeled can be marked accurately. In this embodiment, the specific process of adding a noun phrase to the original entity library includes: acquiring an original entity library containing seed entities, wherein the original entity library is used for marking the target entities contained in the text to be labeled; extracting noun phrases containing the seed entities from the text to be labeled; correcting each noun phrase to obtain a corrected noun phrase that conforms to the structure of an entity; and adding the corrected noun phrases to the original entity library to obtain the corrected original entity library.

For example, if the entity "respiratory systemic choringcholengitis" appears in the text to be labeled while the original entity library contains only the seed entity "respiratory choringcholengitis", the FMM can mark only "respiratory choringcholengitis" and cannot mark "respiratory systemic choringcholengitis", so the nested entity is labeled inaccurately. To solve this problem, this embodiment proposes a noun phrase extraction technique that dynamically corrects incomplete terms in a sentence, reduces annotation noise, and improves the accuracy of automatic annotation.

The specific implementation mode is as follows:

1) Long noun phrases containing seed entities are identified in the sentence using a noun phrase extraction method and taken as candidate entities. This allows the same original entity library to be corrected dynamically across different sentences (that is, the original entity library can be corrected once for every text to be labeled that is annotated, which makes the correction dynamic). Noun phrase extraction methods are known in the art, for example the one provided by spaCy.

2) The identified long noun phrases are corrected by rule-based methods, for example: a) the parts of speech of the first token and the last token of the long noun phrase are checked, and tokens with unreasonable parts of speech, such as pronouns and the like, are deleted; b) the validity of the brackets is checked; c) whether the characters belong to a whitelisted character set is checked; and so on.
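The correction rules a) to c) can be sketched as below, assuming the noun phrase has already been extracted and part-of-speech tagged by an external tool. The set of disallowed edge tags, the whitelist pattern, the function names, and the "chronic kidney disease" example are all assumptions for illustration; the text does not enumerate them.

```python
import re
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (text, coarse part-of-speech tag)

# Assumed list of tags that should not start or end an entity.
BAD_EDGE_POS = {"DET", "PRON", "ADP", "CCONJ", "PUNCT"}
# Assumed character whitelist: letters, digits, space, hyphen, round brackets.
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9 \-()]+")


def trim_edges(tokens: List[Token]) -> List[Token]:
    """Rule a): drop leading and trailing tokens whose tag is not acceptable."""
    start, end = 0, len(tokens)
    while start < end and tokens[start][1] in BAD_EDGE_POS:
        start += 1
    while end > start and tokens[end - 1][1] in BAD_EDGE_POS:
        end -= 1
    return tokens[start:end]


def brackets_balanced(text: str) -> bool:
    """Rule b): every opening round bracket must be closed, in order."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def correct_noun_phrase(tokens: List[Token]) -> Optional[str]:
    """Apply rules a) to c); return the corrected phrase, or None if rejected."""
    trimmed = trim_edges(tokens)
    if not trimmed:
        return None
    phrase = " ".join(text for text, _ in trimmed)
    if not brackets_balanced(phrase):
        return None
    if not ALLOWED_CHARS.fullmatch(phrase):  # rule c)
        return None
    return phrase


def enrich_library(library: Set[str], noun_phrase: List[Token], seed: str) -> Set[str]:
    """Add the corrected phrase to the library when it still contains the seed entity."""
    phrase = correct_noun_phrase(noun_phrase)
    if phrase is not None and seed in phrase:
        library.add(phrase)
    return library
```

With a library holding only the seed "kidney disease", the tagged phrase "the chronic kidney disease" would be trimmed to "chronic kidney disease" and added, so the longer entity can be matched in later texts.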

The training of the semantic type annotator is completed through the steps S301-S3015, and then the trained semantic type annotator is used for annotating the text to be annotated to obtain the annotated text. However, the labeled text obtained in this way may have an incorrect label, so that the labeled text needs to be optimized to correct the error, so as to obtain the optimized labeled text, including the following steps S401, S402, and S403:

S401, counting the proportion of all the target entities in the text to be labeled.

The target entities are the entities that need to be labeled with semantic types; this step calculates the proportion that they make up of all the entities in the text to be labeled.

S402, counting the probability of each target entity subjected to semantic type labeling in the text to be labeled.

For example, if the text to be labeled contains 10 entities, among which target entity a occurs 2 times and target entity b occurs 1 time, the probability of occurrence of target entity a is 20% and that of target entity b is 10%.
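The counts in S401 and S402 can be reproduced with a short sketch. It assumes that the "proportion" of S401 is the share of target entity occurrences among all entity occurrences, which is one reading of the text; the function name and the placeholder entity names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def entity_statistics(
    all_entities: List[str], target_entities: Set[str]
) -> Tuple[float, Dict[str, float]]:
    """Return (share of target entities, per-entity occurrence probability)."""
    total = len(all_entities)
    if total == 0:
        return 0.0, {}
    counts = Counter(e for e in all_entities if e in target_entities)
    target_ratio = sum(counts.values()) / total           # S401 (assumed reading)
    probabilities = {e: c / total for e, c in counts.items()}  # S402
    return target_ratio, probabilities


# The example from the text: 10 entities, 2 occurrences of a and 1 of b.
entities = ["a", "a", "b"] + [f"other{i}" for i in range(7)]
ratio, probabilities = entity_statistics(entities, {"a", "b"})
```

For this input the per-entity probabilities come out at 0.2 for a and 0.1 for b, matching the 20% and 10% stated above.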

S403, performing semantic type labeling on the target entity and the non-target entity in the labeled text again according to the proportion and the probability to obtain an optimized labeled text, wherein the non-target entity is an entity except the target entity in the labeled text.

The 20% probability of occurrence of target entity a and the 10% probability of occurrence of target entity b are each compared with the proportion obtained in step S401; if the probability corresponding to a target entity does not match that proportion, the semantic type of that target entity is modified.

For example, the text to be labeled in this embodiment is a biomedical document, and DF (document frequency) is the proportion of documents in a corpus that contain a specific term. When the proportion calculated in S401 for the target entity is small, a term with a high DF can generally be regarded as likely to be a common word (a non-biomedical entity). Therefore, in this embodiment, an entity matched by the FMM is labeled with its semantic type with a probability of 1-DF (the proportion corresponding to the target entity calculated in S401), and is left unlabeled with a probability of DF. This solves the problem of common high-frequency words being over-labeled as biomedical entities, which would make the samples unbalanced. For example, if the DF value of the entity word "as" in a medical knowledge base is 0.958, the entity word is labeled with its semantic type with a probability of 1 - 0.958 = 0.042 and labeled as O (the letter O) with a probability of 0.958, which prevents the model from over-learning the common word "as".
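The DF-based probabilistic labeling can be sketched as follows. The function names, the representation of documents as token lists, and the injected random generator are assumptions made so the example is self-contained and reproducible.

```python
import random
from typing import List


def document_frequency(term: str, documents: List[List[str]]) -> float:
    """Share of documents (given as token lists) that contain the term."""
    if not documents:
        return 0.0
    return sum(1 for tokens in documents if term in tokens) / len(documents)


def label_with_df(semantic_type: str, df: float, rng: random.Random) -> str:
    """Keep the semantic type with probability 1 - DF, otherwise emit "O"."""
    return semantic_type if rng.random() < 1.0 - df else "O"


# With DF = 0.958, roughly 4.2% of the occurrences keep their semantic type.
rng = random.Random(0)
labels = [label_with_df("SOME_TYPE", 0.958, rng) for _ in range(10_000)]
kept_fraction = sum(label != "O" for label in labels) / len(labels)
```

Because the decision is drawn independently for each occurrence, a very common word such as "as" still appears with its semantic type occasionally, but far less often than a rare biomedical term whose DF is close to zero.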

The whole process of labeling the text to be labeled through steps S100-S300 in this embodiment is shown in fig. 3.

In summary, the present invention firstly marks the target entity in the text to be marked, then marks the semantic type of the marked target entity by the semantic type marker, and finally outputs the target entity containing the marked semantic type by the semantic type marker. On one hand, the semantic type annotator is adopted to label the text to be labeled instead of manual labeling, so that the labeling accuracy is improved. On the other hand, the target entity is marked, so that the target entity can be accurately found by the semantic type marker according to the mark to label the target entity when the semantic type marker labels, the semantic type marker is prevented from labeling non-target entities, the labeling speed of the semantic type marker on a text to be labeled is improved, and the labeling accuracy is further improved.

In addition, the invention realizes the high-quality automatic marking of the biomedical entity in the text, and avoids the high cost of manual marking. Meanwhile, the invention provides a DF value-based method for probability labeling aiming at high-frequency words, thereby effectively improving the labeling quality. The invention reduces the labeling noise by a dynamic correction method, and the maximum noun phrase correction method of the invention dynamically corrects the entity boundary in the sentence, thereby effectively solving the problem of nested entities. The invention combines the entity and the context information in the sentence to more accurately mark the semantic type of the ambiguous entity.

Exemplary devices

The embodiment also provides a text entity labeling device, which comprises the following components:

the text acquisition module is used for acquiring a text to be labeled;

and the marking module is used for inputting the text to be labeled, in which the target entity has been marked, into a semantic type annotator, and performing semantic type labeling on the marked target entity through the semantic type annotator to obtain a labeled text.

Based on the above embodiments, the present invention further provides a terminal device, and a schematic block diagram thereof may be as shown in fig. 4. The terminal equipment comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal device is configured to provide computing and control capabilities. The memory of the terminal equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a text entity tagging method. The display screen of the terminal equipment can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal equipment is arranged in the terminal equipment in advance and used for detecting the operating temperature of the internal equipment.

It will be understood by those skilled in the art that the block diagram of fig. 4 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the terminal device to which the solution of the present invention is applied, and a specific terminal device may include more or less components than those shown in the figure, or may combine some components, or have different arrangements of components.

In one embodiment, a terminal device is provided, where the terminal device includes a memory, a processor, and a text entity annotation program stored in the memory and executable on the processor, and when the processor executes the text entity annotation program, the following operation instructions are implemented:

acquiring a text to be marked;

marking out a target entity contained in the text to be marked;

and inputting the text to be labeled, in which the target entity has been marked, into a semantic type annotator, and performing semantic type labeling on the marked target entity through the semantic type annotator to obtain a labeled text.
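The three operations above can be read as a simple pipeline. The sketch below shows only that structure; the marker and the annotator are passed in as callables, and the toy versions used here (`toy_marker`, `toy_annotator`) are hypothetical stand-ins rather than the components of the embodiment.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # token start (inclusive), token end (exclusive)
Marker = Callable[[List[str]], List[Span]]
Annotator = Callable[[List[str], Span], str]


def annotate_text(
    text: str, mark_entities: Marker, semantic_type_annotator: Annotator
) -> List[Tuple[str, str]]:
    """Acquire the text, mark target entities, then label only the marked spans."""
    tokens = text.split()          # step 1: the acquired text, whitespace-tokenized
    spans = mark_entities(tokens)  # step 2: mark the target entities
    return [                       # step 3: semantic type labeling of marked spans
        (" ".join(tokens[start:end]), semantic_type_annotator(tokens, (start, end)))
        for start, end in spans
    ]


def toy_marker(tokens: List[str]) -> List[Span]:
    """Hypothetical marker that only knows the term "rabies vaccine"."""
    return [
        (i, i + 2)
        for i in range(len(tokens) - 1)
        if tokens[i : i + 2] == ["rabies", "vaccine"]
    ]


def toy_annotator(tokens: List[str], span: Span) -> str:
    """Hypothetical annotator using a single context cue."""
    return "immune factor" if "injected" in tokens else "pharmacological substance"
```

Because the annotator is called only on marked spans, tokens outside those spans are never given a semantic type, which is the behaviour the text attributes to marking the target entity first.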

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

In summary, the present invention discloses a method, an apparatus, a device and a storage medium for text entity annotation, wherein the method comprises: acquiring a text to be labeled; marking a target entity contained in the text to be labeled, wherein the target entity corresponds to the characteristics of the text to be labeled; and inputting the text to be labeled, in which the target entity has been marked, into a semantic type annotator, and performing semantic type labeling on the marked target entity through the semantic type annotator to obtain a labeled text, wherein the semantic type corresponds to the category of the target entity. On one hand, the semantic type annotator is adopted to label the text to be labeled instead of manual labeling, so that the labeling accuracy is improved. On the other hand, because the target entity is marked, the semantic type annotator can accurately find the target entity according to the mark when labeling, and is prevented from labeling non-target entities, so that the labeling speed of the semantic type annotator on the text to be labeled is improved and the labeling accuracy is further improved.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A text entity labeling method is characterized by comprising the following steps:

acquiring a text to be marked;

marking out a target entity contained in the text to be marked;

and inputting the text to be labeled, in which the target entity has been marked, into a trained semantic type annotator, and performing semantic type labeling on the marked target entity through the trained semantic type annotator to obtain a labeled text.

2. The method for labeling text entities according to claim 1, wherein the step of labeling the target entities contained in the text to be labeled comprises:

acquiring the original entity library;

performing word segmentation on the text to be labeled;

3. The method of textual entity annotation of claim 1, wherein the manner of training the trained semantic type annotator comprises:

acquiring an original entity library and a sample text;

labeling the sample entity in the sample text with a semantic sample type to obtain a labeled sample text;

4. The method for labeling text entities according to claim 3, wherein the step of marking out the sample entities contained in the sample text by the original entity library comprises:

5. The method for labeling text entities according to claim 3, wherein the step of marking out the sample entities contained in the sample text by the original entity library comprises:

obtaining brackets contained in the original entity library according to the original entity library;

cleaning the entity containing the bracket according to the information corresponding to the bracket to obtain the cleaned original entity library;

6. The method for labeling text entities according to claim 3, wherein the step of marking out the sample entities contained in the sample text by the original entity library comprises:

obtaining a nonsense entity and/or an entity containing abnormal head and tail characters contained in the original entity library according to the original entity library, wherein the nonsense entity is an entity without actual meaning, and the entity with the abnormal head and tail characters is an entity with head and tail characters not matched with the language of the entity;

7. The method for labeling text entities according to any of claims 4, 5 or 6, wherein the step of labeling sample entities contained in the sample text according to the original entity library after washing comprises:

performing independent word segmentation processing on the sample text to obtain the sample text after the independent word segmentation processing;

marking out sample entities contained in the sample texts after the independent word segmentation processing according to the cleaned original entity library.

8. The method of claim 3, wherein the labeling the sample text labeled with the sample entity with a semantic sample type for the sample entity to obtain a labeled sample text comprises:

obtaining an ambiguous entity and a non-ambiguous entity in the sample entity according to the sample entity, wherein the ambiguous entity is an entity with semantic types more than one, and the non-ambiguous entity is an entity with a unique semantic type;

9. The method of text entity tagging of claim 8, wherein said training a semantic type annotator with said tagged sample text resulting in said trained semantic type annotator comprises:

10. The method for labeling text entities according to claim 1, further comprising:

11. The method for labeling text entities according to any of claims 1-6 or 8-10, wherein the text to be labeled marked out of the target entity is input into a trained semantic type annotator, and the labeled text is obtained by performing semantic type labeling on the marked out target entity through the trained semantic type annotator, and then further comprising:

counting the proportion of all the target entities in the text to be labeled;

12. The text entity annotation method of any one of claims 1-6 or 8-10, wherein the text to be annotated is biomedical text.

13. A text entity tagging apparatus, the apparatus comprising:

the text acquisition module is used for acquiring a text to be labeled;

and the marking module is used for inputting the text to be labeled, in which the target entity has been marked, into a trained semantic type annotator, and performing semantic type labeling on the marked target entity through the trained semantic type annotator to obtain a labeled text, wherein the semantic type corresponds to the category of the target entity.

14. A terminal device, characterized in that the terminal device comprises a memory, a processor and a text entity annotation program stored in the memory and operable on the processor, and the processor implements the steps of the text entity annotation method according to any one of claims 1 to 12 when executing the text entity annotation program.

15. A computer-readable storage medium, wherein a text entity tagging program is stored on the computer-readable storage medium, and when executed by a processor, the text entity tagging program implements the steps of the text entity tagging method according to any one of claims 1 to 12.