CN110008473B

CN110008473B - Medical text named entity identification and labeling method based on iteration method

Info

Publication number: CN110008473B
Application number: CN201910257482.1A
Authority: CN
Inventors: 陈储培
Original assignee: Unisound Shanghai Intelligent Technology Co Ltd
Current assignee: Unisound Shanghai Intelligent Technology Co Ltd
Priority date: 2019-04-01
Filing date: 2019-04-01
Publication date: 2022-11-25
Anticipated expiration: 2039-04-01
Also published as: CN110008473A

Abstract

The embodiment of the invention provides a medical text named entity identification and labeling method based on an iteration method, and relates to the technical field of medical information. Labeling a large-scale medical corpus with a traditional labeling tool consumes a large amount of manpower and material resources. The method combines a model with an automatic tool, is suitable for large-scale medical text labeling, and shortens the labeling period, which helps improve product research and development efficiency.

Description

Medical text named entity identification and labeling method based on iteration method

Technical Field

The invention relates to the technical field of medical information, in particular to a medical text named entity identification and labeling method based on an iteration method.

Background

The medical field differs from the general field and is highly specialized. Research in the medical field depends on medical corpora, and sequence labeling is a basic and very important task in medical research. However, named entity identification and labeling requires a large amount of manpower and material resources. Mainstream sequence labeling currently relies on open-source labeling tools, so the labeling period is long; the medical field also involves highly specialized knowledge, which makes medical sequence labeling difficult. To improve labeling efficiency, an iteration-based automatic labeling method for medical named entities is provided.

With the development of the internet, the mobile internet, and big data technology, the scale of text data resources is growing explosively. These resources mainly include unstructured data on social media (e.g. microblogs, official accounts, Facebook, Twitter) and news media websites (e.g. People's Daily, Phoenix News, Sohu News), and semi-structured data on encyclopedia websites such as online encyclopedias and wikis. Natural language processing (NLP) plays a very important role in extracting information from such text. In text mining, extracting useful information from massive text data is valuable to enterprises and users. Sequence labeling is one of the most basic and most commonly used NLP methods. Quickly and effectively predicting the label (such as noun, person name, place name, or time) of each word in a Chinese sequence plays an important role in artificial intelligence tasks such as relationship mining and knowledge graphs.

In the prior art, labeled medical corpora are scarce, which makes basic research on medical texts difficult; meanwhile, medical text labeling depends on labeling tools, the labeling period is long, and a large amount of manpower and material resources is consumed.

Disclosure of Invention

The invention aims to provide a medical text named entity identification and marking method based on an iteration method, which has the advantages of high marking efficiency, accurate marking and simple method.

In order to achieve the above object, the embodiments of the present invention adopt the following technical solutions:

a medical text named entity identification labeling method based on an iteration method comprises the following steps:

step 1: preparing an initialized seed word according to the category of the named entity, wherein the seed word is used as the basis of subsequent iteration;

step 2: based on the existing medical free text, marking a seed word label on the text, wherein in the named entity recognition task, the beginning and the end of the seed word are respectively B and E, the middle character is I, and the rest words are O;

step 3: performing model training on the first-round labeled corpus, completing prediction on the medical text corpus with the generated model, and extracting the predicted entity words;

step 4: performing webpage analysis on the newly generated round of entity words by using a search engine tool, filtering them according to whether an encyclopedia entry exists, further supplementing entity word resources with related terms from network resources, and adding the processed entity words to the dictionary base;

step 5: repeating step 2, step 3 and step 4 to complete multiple rounds of iteration, and stopping iteration when the set number of iterations is reached or no new entries are added; extracting entity words with inconsistent boundaries and inconsistent categories by using an automatic tool that compares the entity words marked by the dictionary with the entity words predicted by the model; and further correcting the extracted inconsistent entity words through rules, finally completing the labeling of the medical named entities.
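The iteration of steps 2 to 5 and its stopping rule can be sketched as follows. This is only a minimal illustration, not the patented implementation: `train_and_predict`, `keep_entity` and `expand_entity` are hypothetical callables standing in for the labeling and model stage (steps 2 and 3), the encyclopedia-entry filter, and the related-term supplement (step 4), none of which the text specifies in detail.

```python
from typing import Callable, Iterable, Set


def iterate_dictionary(
    seed_words: Iterable[str],
    train_and_predict: Callable[[Set[str]], Set[str]],
    keep_entity: Callable[[str], bool],
    expand_entity: Callable[[str], Set[str]],
    max_rounds: int = 5,
) -> Set[str]:
    """Grow an entity dictionary from seed words over several rounds.

    All three callables are assumptions made for illustration:
    - train_and_predict: labels the corpus with the current dictionary,
      trains a model, and returns the entity words it predicts.
    - keep_entity: True if a candidate passes the encyclopedia-entry check.
    - expand_entity: related terms found for an accepted candidate.
    """
    dictionary = set(seed_words)
    for _ in range(max_rounds):              # stop rule 1: set number of rounds
        predicted = train_and_predict(dictionary)
        candidates = {word for word in predicted if keep_entity(word)}
        for word in list(candidates):
            candidates.update(expand_entity(word))
        new_entries = candidates - dictionary
        if not new_entries:                  # stop rule 2: nothing new was added
            break
        dictionary |= new_entries
    return dictionary
```

In practice the callables would wrap the labeling, training, and search-engine components; keeping them as parameters makes the stopping logic easy to check in isolation.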

Further, the method for labeling the text with the seed words based on the existing medical free text comprises the following steps:

step S1: acquiring different keywords, generating keyword lists corresponding to different medical texts, and storing the keyword lists in a database;

step S2: reading a keyword list from a database, generating unique identifiers corresponding to different medical texts according to different medical texts and keywords of the medical texts, and constructing a unique dictionary tree according to the unique identifiers, wherein all the unique dictionary trees form a basic dictionary tree object pool for word segmentation service;

step S3: receiving data to be processed, and performing word segmentation on the data according to the dictionary tree, in the basic dictionary tree object pool, that corresponds to the medical text of the data to be processed; and filtering the keywords according to the word segmentation result.

Further, different keywords of different medical texts in the step S1 are maintained to the database by the user.

Further, the step S2 further includes:

the keyword list is used for constructing different word banks according to the mode that one medical text corresponds to one word bank; the format of the lexicon is X.dic, wherein X is the name of the lexicon.

Further, the step S3 includes the following sub-steps:

s31: receiving data to be processed, judging a medical text corresponding to the data to be processed, and jumping to the step S32;

s32: retrieving a dictionary tree corresponding to the medical text from a basic dictionary tree object pool according to the medical text corresponding to the data to be processed; if yes, jumping to step S33; otherwise, jumping to step S34;

s33: performing word segmentation on the data to be processed through the dictionary tree, filtering the keywords according to word segmentation results, and ending;

s34: judging whether a word stock corresponding to the medical text corresponding to the data to be processed exists or not, if so, skipping to the step S35, otherwise, skipping to the step S36;

s35: dynamically constructing a dictionary tree according to a word bank corresponding to the medical text corresponding to the data to be processed, segmenting words according to the constructed dictionary tree, realizing keyword filtering according to word segmentation results, and ending;

s36: and calling a preset general word bank, constructing a general dictionary tree according to the general word bank, segmenting words of the data to be processed according to the constructed general dictionary tree, filtering the keywords according to a word segmentation result, and ending.
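The fallback order of sub-steps S31 to S36 (pooled dictionary tree, then a lexicon file, then the general word bank) can be sketched as below. The nested-dict tree, the one-word-per-line layout of the `.dic` file, naming the file after the medical text type, and reading "keyword filtering" as "keep the tokens that are dictionary words" are all assumptions made for this illustration; the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Trie = Dict[str, dict]
END = "$end"  # multi-character marker key, so it cannot clash with a single character


def build_trie(words: Iterable[str]) -> Trie:
    """Build a nested-dict dictionary tree from a word list."""
    root: Trie = {}
    for word in words:
        if not word:
            continue
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def segment(text: str, trie: Trie) -> List[Tuple[str, bool]]:
    """Forward maximum matching; returns (token, is_dictionary_word) pairs."""
    tokens, i = [], 0
    while i < len(text):
        node, longest = trie, 0
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if END in node:
                longest = j - i + 1
        step = longest or 1  # unmatched characters become single tokens
        tokens.append((text[i:i + step], longest > 0))
        i += step
    return tokens


def select_trie(text_type: str, pool: Dict[str, Trie],
                lexicon_dir: Path, general_words: Iterable[str]) -> Trie:
    trie = pool.get(text_type)                    # S32: look in the object pool
    if trie is not None:
        return trie                               # S33: reuse the pooled tree
    lexicon = lexicon_dir / f"{text_type}.dic"    # S34: does a word bank exist?
    if lexicon.is_file():
        lines = lexicon.read_text(encoding="utf-8").splitlines()
        return build_trie(line.strip() for line in lines)  # S35: build on demand
    return build_trie(general_words)              # S36: fall back to general bank


def filter_keywords(text: str, text_type: str, pool: Dict[str, Trie],
                    lexicon_dir: Path, general_words: Iterable[str]) -> List[str]:
    trie = select_trie(text_type, pool, lexicon_dir, general_words)
    return [token for token, is_word in segment(text, trie) if is_word]
```

A caller would pass the medical text type determined in S31 together with the shared pool; trees built in S35 could also be added to the pool, which the text does not state.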

Further, the method for performing model training on the corpus after the first round of labeling, completing prediction on the medical text corpus according to the generated model, and extracting the predicted entity words executes the following steps:

step A1: preprocessing the acquired corpus;

step A2: inputting the linguistic data preprocessed in the step A1 into a preset learning model, adjusting parameters of the learning model and storing the parameters;

step A3: respectively adding corresponding prediction labels to the obtained corpora according to a sequence classification result output by the learning model, performing minimum optimization on a loss function of the learning model by using the artificial labels to fit the matching of the prediction labels and the artificial labels, performing word segmentation on unknown corpora by using a word segmentation algorithm, and performing primary labeling on the unknown corpora subjected to word segmentation by using the adjusted learning model;

step A4: adjusting the unknown corpus that was preliminarily labeled in step A3, and finally labeling the adjusted corpus.
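Once the adjusted model has labeled a corpus, the predicted entity words of step 3 can be read off the tag sequence. The sketch below assumes character-level tags in the B/I/E/O scheme of step 2, optionally suffixed with a category such as `B-DISEASE`; the suffix convention and the function name are assumptions, not part of the source.

```python
from typing import List, Tuple


def extract_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Return (entity word, category, start, end_exclusive) for each B ... E run.

    Runs that are not closed by an E tag are dropped, which is one possible
    way to handle malformed predictions.
    """
    entities = []
    start, category = None, ""
    for i, tag in enumerate(tags):
        prefix, _, suffix = tag.partition("-")
        if prefix == "B":
            start, category = i, suffix      # open a new candidate entity
        elif prefix == "E" and start is not None:
            entities.append(("".join(chars[start:i + 1]), category, start, i + 1))
            start = None
        elif prefix != "I":
            start = None                     # O or stray E: discard any open run
    return entities
```

For example, the characters of "患者有糖尿病史" (the patient has a history of diabetes) tagged `O O O B-DISEASE I-DISEASE E-DISEASE O` yield the single entity word "糖尿病" (diabetes).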

Further, the preprocessing in step A1 includes merging large-granularity word segments and unifying formats.

The medical text named entity identification and labeling method based on the iteration method has the following beneficial effects: labeling a large-scale medical corpus with a traditional labeling tool consumes a large amount of manpower and material resources. The method combines a model with an automatic tool, is suitable for large-scale medical text labeling, and shortens the labeling period, which helps improve product research and development efficiency.
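The automatic tool of step 5, which compares entity words marked by the dictionary with entity words predicted by the model, might look like the following sketch. The `Span` record and the rule "overlapping spans with different offsets are a boundary inconsistency, identical offsets with different labels are a category inconsistency" are assumptions chosen for illustration; the source does not define the comparison precisely.

```python
from typing import Dict, List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int      # index of the first character
    end: int        # index one past the last character
    category: str   # entity category label


def find_inconsistencies(
    dict_spans: List[Span], model_spans: List[Span]
) -> Dict[str, List[Tuple[Span, Span]]]:
    """Pair up overlapping dictionary and model spans that disagree."""
    report: Dict[str, List[Tuple[Span, Span]]] = {"boundary": [], "category": []}
    for d in dict_spans:
        for m in model_spans:
            if not (d.start < m.end and m.start < d.end):
                continue  # the two spans do not overlap
            if (d.start, d.end) != (m.start, m.end):
                report["boundary"].append((d, m))
            elif d.category != m.category:
                report["category"].append((d, m))
    return report
```

The pairs collected in the report are the material that the rule-based correction of step 5 would then work through.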

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 shows a method flow diagram of a medical text named entity identification and tagging method based on an iterative method according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

Example 1:

as shown in fig. 1, a medical text named entity identification and labeling method based on an iterative method performs the following steps:

step 1: preparing an initialization seed word according to the category of the named entity, wherein the seed word is used as the basis of subsequent iteration;

The technical principle of the technical scheme is as follows: the functions are realized by extracting the keywords and then matching the extracted keywords.

The technical effect of the technical scheme is as follows: thereby contributing to the improvement of the research and development efficiency of the product.
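The keyword matching principle can be illustrated with the labeling scheme of step 2 (B for the first character of a seed word, E for the last, I for the characters in between, O elsewhere). The sketch below uses forward longest matching; the category suffix on each tag, the handling of one-character seed words, and the function name are assumptions, since the text does not cover them.

```python
from typing import Dict, List, Tuple


def tag_with_seed_words(text: str, seeds: Dict[str, str]) -> List[Tuple[str, str]]:
    """Label each character of `text` by longest match against `seeds`.

    `seeds` maps a seed word to its entity category (illustrative layout).
    """
    max_len = max((len(word) for word in seeds), default=0)
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so that longer seed words win.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in seeds:
                match = text[i:i + length]
                break
        if match is None:
            i += 1
            continue
        category = seeds[match]
        last = i + len(match) - 1
        tags[i] = f"B-{category}"  # assumption: a one-character word gets only B
        if last > i:
            tags[last] = f"E-{category}"
            for j in range(i + 1, last):
                tags[j] = f"I-{category}"
        i = last + 1
    return list(zip(text, tags))
```

With the seed word "糖尿病" (diabetes) in category `DISEASE`, the text "患者有糖尿病" (the patient has diabetes) is labeled O, O, O, B-DISEASE, I-DISEASE, E-DISEASE.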

Example 2:

on the basis of the previous embodiment, the method for labeling the text with the seed words based on the existing medical free text performs the following steps:

step S1: acquiring different keywords, generating a keyword list corresponding to different medical texts, and storing the keyword list in a database;

step S2: reading the keyword list from the database, generating unique identifiers corresponding to different medical texts according to different medical texts and keywords of the medical texts, constructing a unique dictionary tree according to the unique identifiers, and forming a basic dictionary tree object pool for word segmentation service by all the unique dictionary trees;

The technical principle of the technical scheme is as follows: receiving data to be processed, and performing word segmentation on the data according to the dictionary tree, in the basic dictionary tree object pool, that corresponds to the medical text of the data to be processed; and filtering the keywords according to the word segmentation result.

The technical effect of the technical scheme is as follows: the accuracy of the method can be improved.
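Steps S1 and S2 can be sketched as building one dictionary tree per medical text and collecting the trees in a pool keyed by a unique identifier. The text does not say how the identifier is derived; hashing the text type together with its sorted keywords, as done here, is purely an assumption, and the class and function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a dictionary tree."""
    __slots__ = ("children", "is_word")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(keywords: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in keywords:
        if not word:
            continue
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def contains(trie: TrieNode, word: str) -> bool:
    node = trie
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def unique_id(text_type: str, keywords: Iterable[str]) -> str:
    """Assumed identifier: SHA-1 of the text type plus its sorted keywords."""
    payload = text_type + "\n" + "\n".join(sorted(set(keywords)))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def build_pool(keyword_lists: Dict[str, List[str]]) -> Dict[str, TrieNode]:
    """Map each medical text's identifier to its own dictionary tree."""
    return {
        unique_id(text_type, keywords): build_trie(keywords)
        for text_type, keywords in keyword_lists.items()
    }
```

Because the identifier depends on the keyword set, a changed keyword list produces a new pool entry rather than silently reusing a stale tree; whether the original system behaves this way is not stated.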

Example 3:

on the basis of the above embodiment, different keywords of different medical texts in the step S1 are maintained to the database by the user.

The technical principle of the technical scheme is as follows: the keywords are filled into the database, so that the keywords can be ensured to be effective for a long time.

The technical effect of the technical scheme is as follows: the reliability of the method is ensured.

Example 4:

on the basis of the above embodiment, the step S2 further includes:

The technical principle of the technical scheme is as follows: constructing different word banks according to the mode that one medical text corresponds to one word bank; the format of the lexicon is X.dic, wherein X is the name of the lexicon.

The technical effect of the technical scheme is as follows: the efficiency of the method is improved.

Example 5:

on the basis of the above embodiment, the step S3 includes the following sub-steps:

s35: dynamically constructing a dictionary tree according to a word bank corresponding to the medical text corresponding to the data to be processed, segmenting words according to the constructed dictionary tree, filtering keywords according to a word segmentation result, and ending;

The technical principle of the technical scheme is as follows: and calling a preset general word bank, constructing a general dictionary tree according to the general word bank, segmenting words of the data to be processed according to the constructed general dictionary tree, and filtering the keywords according to a word segmentation result.

The technical effect of the technical scheme is as follows: the accuracy of the method is improved.

Example 6:

on the basis of the previous embodiment, model training is performed on the corpus after the first round of labeling, prediction is completed on the corpus of the medical text according to the generated model, and the method for extracting the predicted entity words executes the following steps:

step A1: preprocessing the acquired corpus;

step A2: inputting the linguistic data preprocessed in the step A1 into a preset learning model, adjusting parameters of the learning model and storing the parameters;

The technical principle of the technical scheme is as follows: respectively adding corresponding prediction labels to the obtained corpora according to sequence classification results output by the learning model, performing minimum optimization on a loss function of the learning model by using the artificial labels to fit the matching of the prediction labels and the artificial labels, performing word segmentation on unknown corpora by using a word segmentation algorithm, and performing primary labeling on the unknown corpora subjected to word segmentation by using the adjusted learning model.

The technical effect of the technical scheme is as follows: the method has learning and growth promoting effects.

Example 7:

On the basis of the above embodiment, the preprocessing in step A1 includes merging large-granularity word segments and unifying formats.

The technical principle of the technical scheme is as follows: a unified format is used for merging, so that the result is more accurate.
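The two preprocessing operations can be sketched as follows. Interpreting "unifying formats" as Unicode NFKC normalization plus whitespace cleanup, and "merging large-granularity word segments" as joining adjacent tokens whose concatenation appears in a lexicon, are assumptions made for illustration; the function names and the `max_span` parameter are hypothetical.

```python
import unicodedata
from typing import List, Set


def unify_format(text: str) -> str:
    """Fold full-width characters to half-width and collapse whitespace."""
    normalized = unicodedata.normalize("NFKC", text)
    return " ".join(normalized.split())


def merge_segments(tokens: List[str], lexicon: Set[str], max_span: int = 4) -> List[str]:
    """Greedily merge adjacent tokens when their concatenation is a lexicon word."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the widest window first so the coarsest segment wins.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            merged.append(tokens[i])  # no merge possible at this position
            i += 1
    return merged
```

For instance, the tokens "2", "型", "糖尿病", "患者" are merged into "2型糖尿病" (type 2 diabetes) and "患者" (patient) when the lexicon contains "2型糖尿病".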

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a unit, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, each functional unit in the embodiments of the present invention may be integrated together to form an independent part, or each unit may exist separately, or two or more units may be integrated to form an independent part.

The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program code.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A medical text named entity identification and labeling method based on an iteration method is characterized by comprising the following steps:

step 4: performing webpage analysis on the generated new round of entity words by using a search engine tool, filtering according to the principle of whether encyclopedic entries exist or not, further supplementing entity word resources according to related terms of network resources, and supplementing the processed entity words to a dictionary base;

2. The medical text named entity recognition tagging method based on an iterative approach as recited in claim 1, wherein said method for tagging a text with seed words based on existing medical free text comprises the following steps:

3. The medical text named entity recognition tagging method based on iterative approach as claimed in claim 2, wherein different keywords of different medical texts in step S1 are maintained by the user to the database.

4. The method for medical text named entity recognition tagging based on iterative approach as recited in claim 3, wherein said step S2 further comprises:

5. The iterative method-based medical text named entity recognition tagging method of claim 4, wherein said step S3 comprises the sub-steps of:

6. The iterative process-based medical text named entity recognition tagging method of claim 5, wherein the method for model training the first labeled corpus, performing prediction on the medical text corpus according to the generated model, and extracting the predicted entity words performs the following steps:

step A1: preprocessing the acquired corpus;

step A2: inputting the corpus preprocessed in the step A1 into a preset learning model, adjusting parameters of the learning model and storing the parameters;

step A4: tuning the unknown corpus preliminarily labeled in step A3, and finally labeling the tuned corpus.

7. The iterative method-based medical text named entity recognition tagging method of claim 6, wherein said preprocessing in step A1 comprises merging large-granularity word segments and unifying formats.