CN114692644B

CN114692644B - Text entity labeling method, device, equipment and storage medium

Info

Publication number: CN114692644B
Application number: CN202210242288.8A
Authority: CN
Inventors: 谢育涛; 俞声; 夏俊; 袁正
Original assignee: Tsinghua University; International Digital Economy Academy IDEA
Current assignee: Tsinghua University; International Digital Economy Academy IDEA
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2024-06-11
Anticipated expiration: 2042-03-11
Also published as: CN114692644A

Abstract

The present invention relates to the field of text data processing technologies, and in particular to a method, an apparatus, a device, and a storage medium for labeling text entities. The method first marks the target entities in a text to be labeled, then labels the marked target entities with semantic types through a semantic type annotator, and finally outputs the target entities with their labeled semantic types. On one hand, the method uses the semantic type annotator rather than manual annotation to label the text, thereby improving labeling accuracy. On the other hand, because the target entities are marked in advance, the semantic type annotator can locate them from the marks alone, which prevents it from labeling non-target entities, increases its labeling speed on the text to be labeled, and further improves labeling accuracy.

Description

Text entity labeling method, device, equipment and storage medium

Technical Field

The present invention relates to the field of text data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for labeling text entities.

Background

A text contains entities such as English words, Chinese words, punctuation marks and the like. When labeling entities in a text, the useful entities (target entities) must first be found, and the target entities are then labeled with semantic types that distinguish the categories to which they belong.

For example, in the field of biomedical information, a NER (named entity recognition) system may be applied to discover biomedical entities in medical text. The NER system is built on a deep learning method, and training the deep learning model requires a large amount of text data with labeled entities. This labeled data comes from manual labeling of biomedical entities in medical texts, so the labeling accuracy is low.

In summary, existing text entity labeling methods suffer from low labeling accuracy.

Accordingly, there is a need for improvement and advancement in the art.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a text entity labeling method, a device, equipment and a storage medium, which solve the problem of low labeling accuracy in existing text entity labeling methods.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

in a first aspect, the present invention provides a text entity labeling method, including:

acquiring a text to be marked;

Marking out target entities contained in the text to be marked, wherein the target entities correspond to the characteristics of the text to be marked;

inputting the text in which the target entity has been marked into a trained semantic type annotator, and labeling the marked target entity with a semantic type through the trained semantic type annotator to obtain a labeled text, wherein the semantic type corresponds to the category to which the target entity belongs.

In one implementation manner, the marking the target entity contained in the text to be marked includes:

Acquiring the original entity library;

Obtaining the information field of the original entity library according to the original entity library;

according to the information field, a white list character field matched with the information field is constructed;

according to the white list character domain, cleaning the entities in the original entity library to obtain the cleaned original entity library;

word segmentation processing is carried out on the text to be marked;

marking the text to be marked after the segmentation by the cleaned original entity library so as to mark a target entity in the text to be marked.

In one implementation, the training method of the trained semantic type annotators includes:

Acquiring an original entity library and a sample text;

marking out sample entities contained in the sample text through the original entity library;

Labeling semantic sample types for the sample entities to the sample texts marked with the sample entities to obtain labeled sample texts;

Training the semantic type annotators through the annotated sample text to obtain the trained semantic type annotators.

In one implementation, the marking, by the original entity library, the sample entity included in the sample text includes:

marking out the sample entities contained in the sample text according to the cleaned original entity library, wherein the original entity library is cleaned as follows:

obtaining brackets contained in the original entity library according to the original entity library;

according to the information corresponding to the brackets, cleaning the entities containing the brackets to obtain the cleaned original entity library;

according to the original entity library, obtaining nonsensical entities and/or entities containing abnormal head and tail characters contained in the original entity library, wherein the nonsensical entities are entities without actual meaning, and the entities containing abnormal head and tail characters are entities whose head and tail characters do not match the language to which the entity belongs;

removing the nonsensical entities and/or the entities containing the abnormal head and tail characters from the original entity library to obtain the cleaned original entity library.

In one implementation, the marking the sample entity included in the sample text according to the original entity library after cleaning includes:

performing word independent segmentation processing on the sample text to obtain the sample text after the word independent segmentation processing;

and marking sample entities contained in the sample text after the word independent segmentation processing according to the cleaned original entity library.

In one implementation, the labeling the sample text marked out of the sample entity with a semantic sample type for the sample entity to obtain a labeled sample text includes:

according to the sample entity, an ambiguous entity and a non-ambiguous entity in the sample entity are obtained, the number of semantic types corresponding to the ambiguous entity is greater than one, and the non-ambiguous entity is the entity with the unique corresponding semantic type;

Obtaining context information of the ambiguous entity in the sample text according to the ambiguous entity;

Labeling semantic sample types for the ambiguous entities according to the context information to obtain labeled first sample texts in the labeled sample texts;

And labeling semantic sample types for the non-ambiguous entity according to the non-ambiguous entity to obtain a labeled second sample text in the labeled sample text.

In one implementation manner, the training the semantic type annotator through the annotated sample text to obtain the trained semantic type annotator includes:

Training the semantic type annotators through the noted first sample text and the noted second sample text in the noted sample text to obtain the trained semantic type annotators.

In one implementation, the method further comprises:

acquiring an original entity library containing seed entities, wherein the original entity library is used for marking target entities contained in the text to be marked;

Extracting noun phrases containing the seed entities from the text to be marked;

correcting the noun phrase to obtain the corrected noun phrase, wherein the corrected noun phrase is matched with the structure of the entity;

and adding the noun phrases after modification into the original entity library to obtain the original entity library after modification.

In one implementation, after the text in which the target entity has been marked is input into the trained semantic type annotator and the marked target entity is labeled with a semantic type to obtain the labeled text, the semantic type corresponding to the category to which the target entity belongs, the method further includes:

counting the proportion of all the target entities in the text to be labeled;

counting the occurrence probability, in the text to be labeled, of each target entity that has been labeled with a semantic type;

and according to the proportion and the probability, labeling the target entities and the non-target entities in the labeled text with semantic types again to obtain the optimized labeled text, wherein a non-target entity is an entity in the labeled text other than the target entities.

In one implementation, the text to be annotated is biomedical text.

In a second aspect, an embodiment of the present invention further provides a text entity labeling device, where the device includes the following components:

The text acquisition module is used for acquiring a text to be marked;

The marking module is used for marking out target entities contained in the text to be marked;

The labeling module is used for inputting the text to be labeled of the target entity into a trained semantic type labeling device, and labeling the semantic type of the labeled target entity through the trained semantic type labeling device to obtain a labeled text.

In a third aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a text entity labeling program stored in the memory and capable of running on the processor, and when the processor executes the text entity labeling program, the steps of the text entity labeling method described above are implemented.

In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a text entity labeling program is stored in the computer readable storage medium, where the text entity labeling program is executed by a processor to implement the steps of the text entity labeling method described above.

The beneficial effects are as follows: the method first marks the target entities in a text to be labeled, then labels the marked target entities with semantic types through a semantic type annotator, and finally outputs the target entities with their labeled semantic types. On one hand, the method uses the semantic type annotator rather than manual annotation to label the text, thereby improving labeling accuracy. On the other hand, because the target entities are marked in advance, the semantic type annotator can locate them from the marks alone, which prevents it from labeling non-target entities, increases its labeling speed on the text to be labeled, and further improves labeling accuracy.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a semantic type annotator training flow diagram of the present invention;

FIG. 3 is a labeling flow chart in an embodiment;

Fig. 4 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is clearly and completely described below with reference to the examples and the drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Research shows that a text contains entities such as English words, Chinese words, punctuation marks and the like. When labeling entities in a text, the useful entities (target entities) must first be found, and the target entities are then labeled with semantic types that distinguish the categories to which they belong. For example, in the field of biomedical information, a NER (named entity recognition) system may be applied to discover biomedical entities in medical text. The NER system is built on a deep learning method, and training the deep learning model requires a large amount of text data with labeled entities. This labeled data comes from manual labeling of biomedical entities in medical texts, so the labeling accuracy is low.

In order to solve the above technical problems, the invention provides a text entity labeling method, a device, equipment and a storage medium, which solve the problem of low labeling accuracy in existing text entity labeling methods. In a specific implementation, a text to be labeled is first obtained, the target entities contained in the text are then marked, and finally the text with the marked target entities is input into a semantic type annotator to obtain the labeled text. The labeling method of this embodiment can improve both labeling speed and labeling accuracy.

For example, the text to be labeled is: "Wang Mou suffers from depression due to work fatigue, with frequent insomnia, reduced appetite and low mood." The text is a medical diagnosis, so the target entities (i.e. the lemmas that match the characteristics of medicine) are marked first. The target entities in this text are "depression", "frequent insomnia", "reduced appetite" and "low mood". These target entities are marked (the purpose of marking is to enable the semantic type annotator to accurately find their positions), and the semantic type annotator then labels "depression" with the semantic type "disease" and the three target entities "frequent insomnia", "reduced appetite" and "low mood" with the semantic type "symptom". Finally, the labeled text output by the semantic type annotator is: Wang Mou suffers from depression (target entity: depression, marked start position: 11, marked end position: 13, semantic type: disease), frequent insomnia (target entity: frequent insomnia, marked start position: 15, marked end position: 18, semantic type: symptom), reduced appetite (target entity: reduced appetite, marked start position: 20, marked end position: 23, semantic type: symptom), low mood (target entity: low mood, marked start position: 25, marked end position: 28, semantic type: symptom). The start and end positions are the indices of the characters in the sentence.

Exemplary method

The text entity labeling method of the embodiment can be applied to terminal equipment, and the terminal equipment can be a terminal product with a text data processing function, such as a computer and the like. In this embodiment, as shown in fig. 1, the text entity labeling method specifically includes the following steps:

S100, obtaining a text to be annotated.

The text to be annotated in the embodiment is biomedical text (i.e. medical knowledge database), and the text to be annotated can also be a product specification. When the text to be labeled is a product specification, the following examples are used to illustrate how the product specification is labeled:

For example, the text to be annotated is: "A vase is displayed in a certain market; the vase body is cylindrical, and the whole vase is blue."

And S200, marking out the target entity contained in the text to be marked.

In this embodiment, marking a target entity means marking its start position and end position in the sentence. The positions are marked so that the subsequent semantic type annotator can find the target entity in the text to be annotated and label it with a semantic type.
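The position marking described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `MarkedEntity` record, the `mark_entities` helper and the English example sentence are assumptions introduced here. The end index is inclusive, as in the examples of this description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MarkedEntity:
    """Hypothetical record for one marked target entity."""
    text: str
    start: int                           # index of the first character
    end: int                             # index of the last character (inclusive)
    semantic_type: Optional[str] = None  # filled in later by the annotator


def mark_entities(sentence: str, entity_library: List[str]) -> List[MarkedEntity]:
    """Mark every occurrence of every library entity by plain substring search."""
    marks: List[MarkedEntity] = []
    for entity in entity_library:
        pos = sentence.find(entity)
        while pos != -1:
            marks.append(MarkedEntity(entity, pos, pos + len(entity) - 1))
            pos = sentence.find(entity, pos + 1)
    return sorted(marks, key=lambda m: m.start)


sentence = "The vase body is cylindrical and the whole vase is blue."
marks = mark_entities(sentence, ["vase", "cylindrical", "blue"])
for m in marks:
    print(m.text, m.start, m.end)
```

The `semantic_type` field is left empty at this stage because, in the method described here, the semantic type is assigned afterwards in step S300.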

The target entity in the embodiment is an entity capable of reflecting the text feature to be annotated, and step S200 includes the following steps S201, S202, S203, S204, S205, S206:

S201, acquiring the original entity library.

The original entity library in this embodiment is derived from an existing medical knowledge base.

S202, obtaining the information field of the original entity library according to the original entity library.

S203, constructing a white list character domain matched with the information domain according to the information domain.

S204, according to the white list character domain, cleaning the entities in the original entity library to obtain the cleaned original entity library.

S205, segmentation processing is carried out on the text to be annotated.

S206, marking the text to be marked after the segmentation through the cleaned original entity library so as to mark out the target entity in the text to be marked.

In this embodiment, all the target entities in the text to be annotated are marked using the seed entities in the original entity library. Before marking, both the seed entities in the original entity library and the text to be annotated need to be preprocessed to remove useless data, which improves marking accuracy and marking speed.

For example, the original entity library is a medical entity library containing seed entities such as "headache", "nasal discharge", "chronic pharyngitis" and "pi" (the ratio of a circle's circumference to its diameter). "Pi" obviously does not match medicine, so it is deleted from the original entity library to obtain the preprocessed original entity library.

For example, the target entities contained in the text to be annotated are marked as follows: a vase (target entity: vase, marked start position: 8, marked end position: 9) is displayed in a certain market, the vase body is cylindrical (target entity: cylindrical, start position: 14, end position: 16), and the whole vase is blue (target entity: blue, start position: 21, end position: 22). The start and end positions are the indices of the characters in the sentence.

S300, inputting the text in which the target entity has been marked into a trained semantic type annotator, and labeling the marked target entity with a semantic type through the trained semantic type annotator to obtain a labeled text.

The labeled text corresponding to the target entities marked in step S200 is: a vase (semantic type: product name) is displayed in a certain market, the vase body is cylindrical (semantic type: product shape), and the whole vase is blue (semantic type: product color).

The steps S100 to S300 employ a segmentation method to mark the target entity and obtain the marked text, and the following examples illustrate the advantages of segmentation:

For example, the text to be annotated is: "Headache and nasal discharge appear, which are common symptoms of a cold." The text is preprocessed by splitting it into lemmas: "appear", "headache", "nasal discharge", "conditions", "belongs to", "cold", "common symptoms". The original entity library is also preprocessed to remove invalid entities, which increases the marking speed when the text is marked through the original entity library. Dividing the text by lemma prevents "headache" from being merged with adjacent characters into a single unit; if it were merged, the "headache" entry in the original entity library could not mark the "headache" in the text to be annotated.
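A minimal sketch of marking a segmented text against a cleaned entity library is given below. It assumes the library is stored as a set of token tuples and that the longest matching token sequence is preferred; the function name `mark_tokens`, the `max_len` parameter and the English token list are illustrative assumptions, not details taken from this embodiment.

```python
from typing import List, Set, Tuple


def mark_tokens(
    tokens: List[str],
    library: Set[Tuple[str, ...]],
    max_len: int = 5,
) -> List[Tuple[int, int, str]]:
    """Return (first_token, last_token, entity) spans, preferring longer matches."""
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then shrink it one token at a time.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in library:
                spans.append((i, i + n - 1, " ".join(candidate)))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


library = {("headache",), ("nasal", "discharge"), ("cold",)}
tokens = ["Headache", "and", "nasal", "discharge", "are",
          "common", "symptoms", "of", "cold", "."]
spans = mark_tokens(tokens, library)
print(spans)
```

Because matching works on whole tokens, an entity such as "headache" is found only when segmentation has isolated it as its own token, which is the advantage the example above describes.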

The semantic type in this embodiment reflects the category to which the target entity belongs. Step S300 involves two parts: training the semantic type annotator (first part), and labeling the text to be annotated with the trained semantic type annotator (second part).

Training the semantic type annotators includes the following steps S301, S302, S303, S304, S305, S306, S307, S308, S309, S3010, S3011, S3012, S3013, S3014, S3015:

S301, acquiring an original entity library and a sample text.

The original entity library is a database containing many entities. In this embodiment, the original entity library serves as the reference both when marking target entities in the text to be annotated and when training the semantic type annotator. The original entity library in this embodiment is derived from a biomedical knowledge base. However, some of the mainstream public biomedical knowledge bases, such as UMLS (Unified Medical Language System), the largest biomedical knowledge graph, still contain a large number of poor-quality entities, such as the non-biomedical entity words "betwen" and "68". Labeling text directly with these words leads to a serious labeling noise problem (labeling noise means that non-medical entity words in the text are labeled as medical entity words). Redundant punctuation in directly acquired sample text also affects the training of the semantic type annotator. Therefore, before training the semantic type annotator, the original entity library and the sample text need to be cleaned to remove invalid data, so that invalid data does not interfere with training and the trained semantic type annotator labels the text to be annotated more accurately. In this embodiment, the cleaning of the original entity library is completed through the following steps S302 to S308:

S302, obtaining the information field of the original entity library according to the original entity library.

When the original entity library relates to medical records, the original entity library belongs to the field of medical information. When the original entity library relates to a sports fitness method, the original entity library belongs to the field of kinematic information.

S303, constructing a white list character domain matched with the information domain according to the information domain.

S304, according to the white list character domain, cleaning the entity in the original entity library.

In this embodiment, taking the field of medical information as an example, the whitelist character domain consists of characters related to medicine. For example, "rabies vaccine" consists of whitelist characters in the medical field, whereas "homeland area" does not. When an entity in the original entity library does not match the whitelist character domain, the unmatched entity is removed from the original entity library, yielding the cleaned original entity library.

When the original entity library contains "t x teriflunomide", because "t x" is not a medically relevant character, the entity "t x teriflunomide" where "t x" is located is removed from the original entity library to complete the cleaning of the original entity library.
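The whitelist cleaning of step S304 can be sketched as below. The description does not list the characters of the whitelist character domain, so the `WHITELIST` set here (ASCII letters, digits, space, hyphen and round brackets) and the sample entities are purely illustrative assumptions.

```python
import string
from typing import Iterable, List, Set

# Assumed whitelist character domain; the real domain depends on the
# information field of the entity library and is not given in the text.
WHITELIST: Set[str] = set(string.ascii_letters + string.digits + " -()")


def clean_by_whitelist(entities: Iterable[str],
                       whitelist: Set[str] = WHITELIST) -> List[str]:
    """Keep only entities whose characters all belong to the whitelist."""
    return [e for e in entities if e and all(ch in whitelist for ch in e)]


cleaned = clean_by_whitelist(
    ["rabies vaccine", "sars-cov-1", "vaccine#1", "tx™ drug"]
)
print(cleaned)
```

An entity is dropped as a whole as soon as one of its characters falls outside the whitelist, which mirrors the removal of the entire entity described above.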

S305, obtaining brackets contained in the original entity library according to the original entity library.

And S306, cleaning the entity containing the brackets according to the information corresponding to the brackets.

In this embodiment, the information corresponding to the brackets includes whether a bracket is single (unpaired) or paired, and whether the left and right brackets are of the same type (for example, if the left bracket is an English bracket and the right bracket is also an English bracket, the types are the same). When an entity contains a single bracket, or its left and right brackets are of different types, the entity is removed from the original entity library to complete the cleaning.

For example, in "(rabies vaccine）" the left bracket is English and the right bracket is Chinese; because the two bracket types differ, the entity is removed from the original entity library. "sars-cov-1 (2003" has only one bracket and also needs to be removed from the original entity library.

Entities with inconsistent (non-standard) brackets are removed because, if the entities in the sample text use standard brackets, library entries with non-standard brackets cannot match them when the sample text is marked, which degrades the training of the semantic type annotator.
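The bracket check of steps S305 and S306 can be illustrated with a small stack-based sketch. The set of bracket pairs in `PAIRS` (English and full-width Chinese round and square brackets) is an assumption; the description only states that single brackets and mismatched bracket types lead to removal.

```python
from typing import Iterable, List

# Assumed bracket pairs: English (half-width) and Chinese (full-width).
PAIRS = {"(": ")", "（": "）", "[": "]", "【": "】"}
CLOSERS = set(PAIRS.values())


def has_valid_brackets(entity: str) -> bool:
    """True if every bracket is paired with a closing bracket of the same type."""
    expected: List[str] = []
    for ch in entity:
        if ch in PAIRS:
            expected.append(PAIRS[ch])
        elif ch in CLOSERS:
            # A closer with no opener, or of a different type, is invalid.
            if not expected or expected.pop() != ch:
                return False
    return not expected  # leftover openers mean a single (unpaired) bracket


def clean_by_brackets(entities: Iterable[str]) -> List[str]:
    """Remove entities that contain single or type-mismatched brackets."""
    return [e for e in entities if has_valid_brackets(e)]


kept = clean_by_brackets(
    ["(rabies vaccine)", "(rabies vaccine）", "sars-cov-1 (2003", "headache"]
)
print(kept)
```

Here the second entity mixes an English left bracket with a Chinese right bracket and the third has an unpaired bracket, so both are removed.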

S307, according to the original entity library, obtaining the nonsensical entities and/or the entities containing abnormal head and tail characters contained in the original entity library, wherein the nonsensical entities are entities without actual meaning, and the entities containing abnormal head and tail characters are entities whose head and tail characters do not match the language to which the entity belongs.

S308, cleaning the nonsensical entities and/or the entities containing abnormal head and tail characters from the original entity library.

The nonsensical entities in this embodiment are entities containing only numbers and special symbols and no unit information, and these entities with only numbers and special symbols do not deliver any valuable information, and if they are kept in the original entity library, the matching search amount is increased when the original entity library is used to mark the sample text, so that these nonsensical entities need to be removed from the original entity library.

For example, the original entity library for medicine contains "2.5", "3-" and "x &". No valuable information can be obtained from these entities, and neither the sample text nor the text to be annotated contains them as meaningful entities, so there is no need to keep them in the original entity library.

An entity with abnormal head and tail characters is, for example, an English word whose first or last character is a non-English character, such as "dermo-".
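Steps S307 and S308 can be sketched as two simple predicates. The exact rules are not spelled out in the description, so this sketch assumes that a nonsensical entity is one without any letter (so no unit information either) and that an English entity must begin and end with an ASCII letter or digit; the helper names are hypothetical.

```python
from typing import Iterable, List


def is_nonsensical(entity: str) -> bool:
    """True when the entity has no letters at all (digits and symbols only)."""
    return not any(ch.isalpha() for ch in entity)


def has_abnormal_edges(entity: str) -> bool:
    """True when the first or last character is not an ASCII letter or digit."""
    if not entity:
        return True
    return not all(ch.isascii() and ch.isalnum()
                   for ch in (entity[0], entity[-1]))


def drop_low_value(entities: Iterable[str]) -> List[str]:
    """Remove nonsensical entities and entities with abnormal head/tail characters."""
    return [e for e in entities
            if not is_nonsensical(e) and not has_abnormal_edges(e)]


kept = drop_low_value(["2.5", "3-", "dermo-", "2.5 mg", "headache"])
print(kept)
```

Under these assumptions "2.5 mg" is kept because it carries unit information, while "2.5" and "3-" are treated as nonsensical and "dermo-" is dropped for its trailing hyphen.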

This embodiment may first remove from the original entity library the characters that are not in the whitelist character domain, then remove the entities with inconsistent bracket types, and finally remove the nonsensical entities and/or the entities containing abnormal head and tail characters. This order handles both inconsistent bracket types and characters outside the whitelist character domain.

For example, suppose the original entity library contains "(" Tx ") teriflunomide", whose left and right brackets are Chinese and English respectively. If the entities with inconsistent bracket types were removed first, the final cleaning result would clean only "(" Tx ")", and "teriflunomide" would remain in the original entity library, whereas the desired result is that the whole of "(" Tx ") teriflunomide" is removed from the original entity library. In this embodiment, the characters that are not in the whitelist character domain are removed first and the entities with inconsistent bracket types are removed afterwards, which avoids this problem and cleans the whole of "(" Tx ") teriflunomide" from the original entity library.

In this embodiment, meaningless entities and/or entities containing abnormal head and tail characters may be removed first, then characters not in the whitelist character domain are removed from the original entity library, and finally entities where bracket types are inconsistent are removed.

S309, performing word independent segmentation processing on the sample text to obtain the sample text after the word independent segmentation processing.

When the original entity library is used to mark the target entities contained in the sample text, the sample text needs to be segmented, i.e. each lemma (a lemma is a word or a Chinese character) is separated. When the sample text is English, an English word is often followed by a punctuation mark; if the punctuation mark is not separated from the word, the original entity library cannot mark the target entity contained in the sample text.

For example, the sample text contains the sentence "chronic Pa infection is associated with reduced lung function," in which a comma follows "function". If the sentence is split on spaces as in the prior art, the result is "chronic", "Pa", "infection", "is", "associated", "with", "reduced", "lung", "function,". "lung function" is the sample entity that needs to be marked, but because "function" is still attached to "," after splitting, the "lung function" entry in the original entity library cannot mark the "lung function" in the sample text. The segmentation method of this embodiment produces the following result, which allows "lung function" in the sample text to be marked:

"chronic", "Pa", "infection", "is", "associated", "with", "reduced", "lung", "function", ",".

In this embodiment, the lemma "function" and the non-lemma "," are separated during segmentation, so "lung function" in the sample text can be marked using the "lung function" entry in the original entity library.
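The difference between splitting on spaces and separating lemmas from punctuation can be shown with a short sketch. The regular expression used here is one possible way to perform the word independent segmentation for English text and is an assumption, not the tokenizer of this embodiment.

```python
import re
from typing import List


def split_on_spaces(text: str) -> List[str]:
    """Prior-art style segmentation: punctuation stays glued to the word."""
    return text.split()


def split_lemmas(text: str) -> List[str]:
    """Separate runs of word characters from each punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "chronic Pa infection is associated with reduced lung function,"
print(split_on_spaces(sentence)[-2:])  # the last token still carries the comma
print(split_lemmas(sentence)[-3:])     # "function" and "," are separate tokens
```

Note that this simple pattern also splits hyphenated entities such as "sars-cov-1" into several tokens, so a production tokenizer would need additional rules for such cases.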

S3010, marking sample entities contained in the sample text after word independent segmentation processing according to the cleaned original entity library.

In the embodiment, training of the semantic type labeler is completed through steps S3011-S3015 after the cleaned original entity library and the sample text after segmentation are obtained:

S3011, according to the sample entity, obtaining an ambiguous entity and a non-ambiguous entity in the sample entity, wherein the number of semantic types corresponding to the ambiguous entity is greater than one, and the non-ambiguous entity is the entity with the unique corresponding semantic type.

S3012, obtaining context information of the ambiguous entity in the sample text according to the ambiguous entity.

S3013, labeling semantic sample types for the ambiguous entities according to the context information, and obtaining labeled first sample text in the labeled sample text.

For example, in the sample text "a person has been injected with a rabies vaccine, thus reducing the probability of suffering from rabies", the entity "rabies vaccine" is an ambiguous entity: considered on its own, its semantic type may be "pharmacologic substance" or "immune factor". Taking the context of "rabies vaccine" into account, the semantic type corresponding to "rabies vaccine" in this sample text is "immune factor".

S3014, labeling semantic sample types for the non-ambiguous entity according to the non-ambiguous entity, and obtaining a labeled second sample text in the labeled sample text.

In this embodiment, when a sample entity in the sample text is retrieved as a non-ambiguous entity, the sample entity is directly labeled with the semantic sample type in the sample text, so that a labeled second sample text is obtained. When the sample entity in the retrieved sample text is an ambiguous entity, the sample entity is labeled with a semantic sample type in connection with the context in which the sample entity is located. After the corresponding semantic sample types are added to the ambiguous entities and the non-ambiguous entities in the sample text, labeling of all the sample entities in the sample text is completed, and labeled text comprising labeled first sample text and labeled second sample text is obtained.
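The split between ambiguous and non-ambiguous entities in steps S3011 to S3014 can be sketched as follows. The description does not say how context is used to choose a type, so `keyword_resolver` below is a deliberately naive, hypothetical stand-in that counts cue words; the cue words and the "disease" type assigned to "rabies" are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Resolver = Callable[[str, str, List[str]], str]


def keyword_resolver(sentence: str, entity: str, types: List[str]) -> str:
    """Hypothetical context rule: pick the type with the most cue words present."""
    cues = {
        "immune factor": ["injected", "immunity"],
        "pharmacologic substance": ["dose", "prescribed"],
    }
    text = sentence.lower()
    scores = {t: sum(c in text for c in cues.get(t, [])) for t in types}
    return max(types, key=lambda t: scores[t])


def label_entities(
    sentence: str,
    entities: List[str],
    type_library: Dict[str, List[str]],
    resolve: Resolver = keyword_resolver,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (labels for ambiguous entities, labels for non-ambiguous entities)."""
    first: List[Tuple[str, str]] = []   # ambiguous, resolved from context
    second: List[Tuple[str, str]] = []  # non-ambiguous, labeled directly
    for entity in entities:
        types = type_library[entity]
        if len(types) == 1:
            second.append((entity, types[0]))
        else:
            first.append((entity, resolve(sentence, entity, types)))
    return first, second


type_library = {
    "rabies vaccine": ["pharmacologic substance", "immune factor"],
    "rabies": ["disease"],
}
sentence = ("A person was injected with rabies vaccine, "
            "reducing the probability of rabies.")
first, second = label_entities(sentence, ["rabies vaccine", "rabies"], type_library)
print(first, second)
```

Passing the resolver in as a parameter keeps the sketch independent of any particular disambiguation method, since the context model could equally be a trained classifier.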

S3015, training the semantic type annotator with the labeled first sample text and the labeled second sample text in the labeled sample text to obtain the trained semantic type annotator.

Specifically, the labeled sample text is input into the semantic type annotator, and the semantic type that the annotator outputs for each sample entity is compared with the semantic type recorded in the labeled sample text. If the two differ, the semantic type annotator is adjusted; this is repeated until the output semantic type is the same as the labeled semantic type, at which point training of the semantic type annotator is complete.

For example, consider the labeled sample text: "The role of [catheter ablation](Therapeutic or Preventive Procedure) in the management of [atrial fibrillation](Disease or Syndrome)", in which [ ] encloses a labeled sample entity and ( ) encloses the semantic type labeled for that sample entity. The labeled sample text is input into the semantic type annotator, which outputs a new semantic type for each sample entity. The new semantic type is compared with the semantic type in ( ), and if the two are inconsistent, the parameters of the semantic type annotator are adjusted until the two are identical, which completes training of the semantic type annotator.
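The comparison step can be sketched in Python as follows. The sketch assumes the bracket notation from the example, parses it with a regular expression, and lists the entities whose predicted type differs from the labeled one. The predicted types would come from the annotator; here they are supplied by hand, and the function names are illustrative assumptions.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches "[entity](semantic type)" as used in the labeled sample text.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_labeled_text(text: str) -> List[Tuple[str, str]]:
    """Return (entity, labeled semantic type) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(text)]


def find_mismatches(
    gold: List[Tuple[str, str]], predicted: Dict[str, str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Entities whose predicted type differs from the labeled type.
    A non-empty result means the annotator still needs adjusting."""
    return [
        (entity, gold_type, predicted.get(entity))
        for entity, gold_type in gold
        if predicted.get(entity) != gold_type
    ]


sample = ("The role of [catheter ablation](Therapeutic or Preventive Procedure) "
          "in the management of [atrial fibrillation](Disease or Syndrome)")
gold = parse_labeled_text(sample)

# Made-up annotator output with one deliberate error.
predicted = {
    "catheter ablation": "Therapeutic or Preventive Procedure",
    "atrial fibrillation": "Finding",
}
errors = find_mismatches(gold, predicted)
```

In an actual training loop the mismatches would drive a parameter update rather than being collected in a list.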

The process of training the semantic type annotator in this embodiment is shown in fig. 2. The original entity library (entities) in fig. 2 contains terms (seed entities) and their corresponding semantic types; some of the terms are ambiguous terms, each of which corresponds to multiple semantic types. A dictionary tree (trie) is constructed from the original entity library and used to match sample entities (labeling data) in the sample text (that is, the text corpus); in this process, longest prefix matching is used to mark the ambiguous terms in sentences together with their semantic types. Meanwhile, a semantic type classification model infers the semantic type of each term in the text corpus from its context, and a portion of the terms are masked (MASK) to strengthen the role of the context. Finally, the labeling data, the ambiguous terms, and their corresponding semantic types are input into the semantic type auxiliary annotator (the semantic type annotator), which completes training of the semantic type auxiliary annotator.
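The dictionary tree and longest-match step can be sketched in Python as below. This is a token-level trie under simplifying assumptions (whitespace tokenization, case-insensitive matching); the class and function names are illustrative and not taken from the embodiment.

```python
from typing import Dict, List, Optional, Set, Tuple


class TrieNode:
    __slots__ = ("children", "types")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.types: Optional[Set[str]] = None  # set when a term ends at this node


def build_trie(entity_library: Dict[str, Set[str]]) -> TrieNode:
    """Insert every term of the entity library token by token."""
    root = TrieNode()
    for term, types in entity_library.items():
        node = root
        for tok in term.lower().split():
            node = node.children.setdefault(tok, TrieNode())
        node.types = set(types)
    return root


def longest_match(tokens: List[str], root: TrieNode) -> List[Tuple[int, int, Set[str]]]:
    """Scan left to right; at each position keep the longest term found.
    Returns (start, end, semantic types) with end exclusive."""
    spans, i = [], 0
    while i < len(tokens):
        node, best_end, best_types, j = root, None, None, i
        while j < len(tokens) and tokens[j].lower() in node.children:
            node = node.children[tokens[j].lower()]
            j += 1
            if node.types is not None:
                best_end, best_types = j, node.types
        if best_end is None:
            i += 1
        else:
            spans.append((i, best_end, best_types))
            i = best_end
    return spans


library = {
    "cholangitis": {"Disease or Syndrome"},
    "sclerosing cholangitis": {"Disease or Syndrome"},
}
trie = build_trie(library)
spans = longest_match("chronic sclerosing cholangitis was diagnosed".split(), trie)
```

A term with more than one semantic type comes back with its full type set, which is what marks it as ambiguous for the later context-based step.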

Training of the semantic type annotators is completed through the steps S301-S3015, and the text to be annotated is annotated by using the semantic type annotators after training.

In this embodiment, while the text to be annotated is being labeled, noun phrases containing seed entities are extracted and checked to determine whether they are entities; if so, they are added to the original entity library of seed entities to enrich it, so that target entities in the next text to be annotated can be marked accurately. The specific process of enriching the original entity library includes: acquiring an original entity library containing seed entities, wherein the original entity library is used for marking the target entities contained in the text to be annotated; extracting noun phrases containing the seed entities from the text to be annotated; correcting the noun phrases to obtain corrected noun phrases, wherein each corrected noun phrase matches the structure of an entity; and adding the corrected noun phrases to the original entity library to obtain the corrected original entity library.

For example, if the entity "chronic sclerosing cholangitis" appears in the text to be annotated and the original entity library contains only the seed entity "sclerosing cholangitis", FMM can mark only "sclerosing cholangitis" and cannot mark "chronic sclerosing cholangitis", so the annotation of nested entities is inaccurate. To address this problem, the present embodiment uses noun phrase extraction to dynamically correct incomplete terms in sentences, which reduces labeling noise and improves the accuracy of automatic labeling.

The specific implementation mode is as follows:

1) Long noun phrases containing seed entities are identified in sentences by a noun phrase extraction method and taken as candidate entities. This allows the same original entity library to be corrected dynamically across different sentences (that is, the original entity library can be corrected once each time a text to be annotated is labeled, which constitutes dynamic correction). Noun phrase extraction methods are known in the art, for example the one provided by spaCy.

2) The recognized long noun phrases are corrected by several rule-based methods, such as: a) checking the parts of speech of the first and last tokens of the long noun phrase and deleting tokens with unsuitable parts of speech, such as conjunctions and personal pronouns; b) checking that brackets are valid; c) checking whether the characters belong to the whitelist character field; and so on.
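The rule-based correction in step 2) can be sketched in Python as follows. The sketch assumes the candidate phrase arrives as (token, part-of-speech) pairs from some tagger, uses Universal POS tag names, and uses a hypothetical whitelist of ASCII characters; none of these specifics are prescribed by the embodiment.

```python
import string
from typing import List, Optional, Tuple

# Assumed Universal POS tags for conjunctions and pronouns (rule a).
BAD_EDGE_POS = {"CCONJ", "SCONJ", "PRON"}
# Hypothetical whitelist character field (rule c).
WHITELIST = set(string.ascii_letters + string.digits + " -()'/,.")


def trim_edges(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rule a: drop leading/trailing tokens with unsuitable parts of speech."""
    start, end = 0, len(tagged)
    while start < end and tagged[start][1] in BAD_EDGE_POS:
        start += 1
    while end > start and tagged[end - 1][1] in BAD_EDGE_POS:
        end -= 1
    return tagged[start:end]


def brackets_balanced(text: str) -> bool:
    """Rule b: every '(' must be closed and no ')' may come first."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def correct_phrase(tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Apply rules a-c; return the corrected phrase or None if rejected."""
    text = " ".join(tok for tok, _ in trim_edges(tagged))
    if not text or not brackets_balanced(text) or not set(text) <= WHITELIST:
        return None
    return text


candidate = [("and", "CCONJ"), ("chronic", "ADJ"),
             ("sclerosing", "ADJ"), ("cholangitis", "NOUN")]
corrected = correct_phrase(candidate)
```

A phrase that survives all three rules would then be added to the original entity library as a new entry.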

Training of the semantic type annotators is completed through the steps S301-S3015, and the text to be annotated is annotated by using the trained semantic type annotators, so that the annotated text is obtained. However, there may be erroneous labeling of the labeled text, and therefore, the labeled text needs to be optimized to correct the errors, so as to obtain the labeled text after optimization, which includes the following steps S401, S402, and S403:

S401, calculating the proportion (duty ratio) of all the target entities in the text to be annotated.

A target entity is an entity whose semantic type needs to be annotated; this step calculates the proportion of such entities among all the entities in the text to be annotated.

S402, counting the occurrence probability of each target entity subjected to semantic type labeling in the text to be labeled.

For example, if the text to be annotated contains 10 entities, of which 2 are occurrences of target entity a and 1 is an occurrence of target entity b, then the occurrence probability of target entity a is 20% and the occurrence probability of target entity b is 10%.
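The counting in S401 and S402 can be sketched in Python as below, assuming the text has already been reduced to a list of entity mentions. The function names are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


def target_proportion(entities: List[str], targets: List[str]) -> float:
    """S401: share of entity mentions that are target entities."""
    if not entities:
        return 0.0
    target_set = set(targets)
    return sum(1 for e in entities if e in target_set) / len(entities)


def occurrence_probabilities(entities: List[str], targets: List[str]) -> Dict[str, float]:
    """S402: per-target occurrence probability among all entity mentions."""
    if not entities:
        return {}
    counts = Counter(entities)
    return {t: counts[t] / len(entities) for t in targets}


# Ten entity mentions: two of "a", one of "b", seven non-target mentions.
mentions = ["a", "a", "b", "c", "d", "e", "f", "g", "h", "i"]
proportion = target_proportion(mentions, ["a", "b"])
probabilities = occurrence_probabilities(mentions, ["a", "b"])
```

With this input, target entity a accounts for 2 of 10 mentions and target entity b for 1 of 10, matching the example.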

S403, according to the proportion and the probability, performing semantic type labeling again on the target entities and the non-target entities in the labeled text to obtain the optimized labeled text, wherein a non-target entity is any entity in the labeled text other than a target entity.

The 20% occurrence probability of target entity a and the 10% occurrence probability of target entity b are each compared with the proportion obtained in step S401; for any target entity whose probability does not match that proportion, the semantic type label of that target entity is modified.

For example, the text to be annotated in this embodiment is a biomedical document, and DF (document frequency) is the proportion of documents in the corpus that contain a specific term. When the proportion of target entities calculated in S401 is relatively small, a term with a high DF can generally be regarded as likely being a common word (a non-biomedical entity). Therefore, in this embodiment, an entity matched by FMM is labeled with its semantic type with probability 1-DF (the proportion corresponding to the target entity calculated in S401) and is left unlabeled with probability DF. This addresses the problem of common high-frequency words being excessively labeled as biomedical entities, which would make the samples unbalanced. For example, if the DF value of the entity word "as" in a certain medical knowledge base is 0.958, the entity word is labeled as an entity with probability 1-0.958=0.042 and labeled as O (the letter O) with probability 0.958, which prevents the model from learning too many instances of the common word "as".
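The DF-based probabilistic labeling can be sketched in Python as follows. The sketch assumes whitespace tokenization for computing DF and takes a random number generator as a parameter so that results are reproducible; the function names and the placeholder semantic type are illustrative assumptions.

```python
import random
from typing import List


def document_frequency(term: str, documents: List[str]) -> float:
    """DF: fraction of documents containing the term as a whole token."""
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if term.lower() in doc.lower().split())
    return hits / len(documents)


def label_with_df(semantic_type: str, df: float, rng: random.Random) -> str:
    """Keep the semantic type with probability 1 - DF, otherwise emit "O"."""
    return semantic_type if rng.random() < 1.0 - df else "O"


rng = random.Random(0)
# The word "as" with DF = 0.958 keeps its entity label in about 4.2% of draws.
labels = [label_with_df("SOME_TYPE", 0.958, rng) for _ in range(10000)]
kept_fraction = sum(1 for lab in labels if lab != "O") / len(labels)
```

Because the decision is sampled independently for each occurrence, a high-DF word still contributes a few labeled examples instead of being dropped entirely.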

The whole labeling flow of the text to be labeled in this embodiment through steps S100-S300 is shown in fig. 3.

In summary, the method first marks the target entities in the text to be annotated, then labels the marked target entities with semantic types through the semantic type annotator, and finally outputs the target entities with their labeled semantic types through the semantic type annotator. On the one hand, the method uses the semantic type annotator rather than manual annotation to label the text to be annotated, thereby improving the annotation accuracy. On the other hand, because the target entities are marked in advance, the semantic type annotator can locate the target entities from the marks alone when labeling, which prevents it from labeling non-target entities, improves the speed at which it labels the text to be annotated, and further improves the annotation accuracy.

In addition, the invention realizes high-quality automatic labeling of biomedical entities in texts, and avoids high cost of manual labeling. Meanwhile, the invention provides a DF value-based method for probability labeling aiming at high-frequency words, and the labeling quality is effectively improved. The method reduces the marking noise by a dynamic correction method, dynamically corrects the entity boundary in the sentence by the maximum noun phrase correction method, and effectively solves the problem of nested entities. The invention combines the entity and the context information of the entity in the sentence to more accurately label the semantic type of the ambiguous entity.

Exemplary apparatus

The embodiment also provides a text entity labeling device, which comprises the following components:

The text acquisition module is used for acquiring a text to be marked;

The labeling module is used for inputting the text in which the target entities have been marked into a semantic type annotator, and labeling the semantic types of the marked target entities through the semantic type annotator to obtain a labeled text.

Based on the above embodiment, the present invention also provides a terminal device, and a functional block diagram thereof may be shown in fig. 4. The terminal equipment comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal device is adapted to provide computing and control capabilities. The memory of the terminal device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the terminal device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of labeling text entities. The display screen of the terminal equipment can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal equipment is preset in the terminal equipment and is used for detecting the running temperature of the internal equipment.

It will be appreciated by persons skilled in the art that the functional block diagram shown in fig. 4 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal device to which the present inventive arrangements are applied, and that a particular terminal device may include more or fewer components than shown, or may combine some of the components, or may have a different arrangement of components.

In one embodiment, a terminal device is provided, where the terminal device includes a memory, a processor, and a text entity labeling program stored in the memory and capable of running on the processor, and when the processor executes the text entity labeling program, the processor implements the following operation instructions:

acquiring a text to be marked;

marking out target entities contained in the text to be marked;

inputting the text in which the target entities have been marked into a semantic type annotator, and labeling the semantic types of the marked target entities through the semantic type annotator to obtain a labeled text.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

In summary, the invention discloses a text entity labeling method, a device, equipment and a storage medium, wherein the method comprises the following steps: acquiring a text to be marked; marking out target entities contained in the text to be marked, wherein the target entities correspond to the characteristics of the text to be marked; inputting the text to be marked out of the target entity into a semantic type marker, marking the semantic type of the marked target entity through the semantic type marker, and obtaining marked text, wherein the semantic type corresponds to the category to which the target entity belongs. On one hand, the method adopts the semantic type annotator to annotate the text to be annotated instead of manual annotation, thereby improving the annotation accuracy. On the other hand, the method marks the target entity, so that the semantic type marker can accurately find the target entity to mark the target entity only according to the mark when marking, thereby preventing the semantic type marker from marking non-target entities, improving the marking speed of the semantic type marker on texts to be marked and further improving the marking accuracy.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for labeling text entities, comprising:

acquiring a text to be marked;

marking out target entities contained in the text to be marked;

inputting the text to be marked out of the target entity into a trained semantic type marker, and marking the semantic type of the marked target entity through the trained semantic type marker to obtain a marked text;

the marking the target entity contained in the text to be marked comprises the following steps:

Acquiring an original entity library;

According to the information field, a white list character field matched with the information field is constructed, wherein the white list character field is a character related to the information field;

word segmentation processing is carried out on the text to be marked;

Marking the text to be marked after the segmentation by the cleaned original entity library so as to mark a target entity in the text to be marked;

Inputting the text to be marked out of the target entity to a trained semantic type marker, marking the semantic type of the marked target entity by the trained semantic type marker to obtain a marked text, and then further comprising:

calculating the proportion of all the target entities in the text to be marked;

2. The method for labeling text entities according to claim 1, wherein the training mode of the trained semantic type labeler comprises:

Acquiring an original entity library and a sample text;

Labeling semantic sample types for the sample entities in the sample text to obtain labeled sample text;

3. The method for labeling text entities according to claim 2, wherein said marking out, by said original entity library, the sample entities contained in said sample text comprises:

4. The method for labeling text entities according to claim 2, wherein said marking out, by said original entity library, the sample entities contained in said sample text comprises:

obtaining brackets contained in the original entity library according to the original entity library;

5. The method for labeling text entities according to claim 2, wherein said marking out, by said original entity library, the sample entities contained in said sample text comprises:

According to the original entity library, nonsensical entities and/or entities containing abnormal head and tail characters contained in the original entity library are obtained, wherein the nonsensical entities have no practical meaning, and the entities of the abnormal head and tail characters are entities of which the head and tail characters are not matched with languages to which the entities belong;

6. The method for labeling text entities according to any one of claims 3,4 or 5, wherein labeling the sample entities contained in the sample text according to the cleaned original entity library comprises:

7. The method for labeling text entities according to claim 2, wherein labeling semantic sample types for the sample entities on the sample text labeled with the sample entities to obtain labeled sample text comprises:

According to the sample entity, an ambiguous entity and a non-ambiguous entity in the sample entity are obtained, wherein the ambiguous entity is an entity with the number of semantic types being more than one, and the non-ambiguous entity is an entity with the unique semantic type;

8. The method for labeling text entities according to claim 7, wherein training a semantic type labeler by using said labeled sample text to obtain a trained semantic type labeler comprises:

9. The method for labeling a text entity as recited in claim 1, further comprising:

Correcting the noun phrase to obtain the corrected noun phrase;

10. A method of labeling a text entity as in any of claims 1-5 or 7-9 wherein the text to be labeled is biomedical text.

11. A text entity labeling device, characterized in that the device comprises the following components:

The text acquisition module is used for acquiring a text to be marked;

the labeling module is used for inputting the text to be labeled of the target entity into a trained semantic type labeling device, and labeling the semantic type of the labeled target entity through the trained semantic type labeling device to obtain a labeled text;

Acquiring an original entity library;

word segmentation processing is carried out on the text to be marked;

calculating the proportion of all the target entities in the text to be marked;

12. A terminal device, characterized in that the terminal device comprises a memory, a processor and a text entity marking program stored in the memory and executable on the processor, the processor implementing the steps of the text entity marking method according to any one of claims 1-10 when executing the text entity marking program.

13. A computer readable storage medium, wherein a text entity labeling program is stored on the computer readable storage medium, and when the text entity labeling program is executed by a processor, the steps of the text entity labeling method according to any of claims 1-10 are implemented.