CN114139610B

CN114139610B - Deep learning-based traditional Chinese medicine clinical literature data structuring method and device

Info

Publication number: CN114139610B
Application number: CN202111349067.2A
Authority: CN
Inventors: 雷蕾; 李海燕; 杨乐; 刘华云; 李小阳; 王晰
Original assignee: Institute Of Information On Traditional Chinese Medicine Cacms
Current assignee: Institute Of Information On Traditional Chinese Medicine Cacms
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2024-04-26
Anticipated expiration: 2041-11-15
Also published as: CN114139610A

Abstract

The invention discloses a deep learning-based method and device for structuring traditional Chinese medicine clinical literature data, and relates to the technical field of data processing. The method comprises the following steps: acquiring a document to be processed; inputting the document to be processed into a pre-constructed document data structuring model; and obtaining structured text based on the document to be processed and the document data structuring model. The invention addresses the following problems of the prior art: extraction results are inaccurate; the correction workload is large; upgrading is complex because extraction rules must be preset manually; and corrected content cannot be used for self-learning, so accuracy does not improve with use.

Description

Deep learning-based traditional Chinese medicine clinical literature data structuring method and device

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a device for structuring data of clinical documents of traditional Chinese medicine based on deep learning.

Background

The clinical literature of traditional Chinese medicine contains abundant textual and numerical information. A great deal of effective clinical practice experience remains to be mined, and the individualized diagnosis and treatment experience of veteran traditional Chinese medicine practitioners urgently needs to be inherited and summarized. As the informatization of traditional Chinese medicine gathers pace, several questions arise. How can this experience be combined with the direct evidence obtained from rigorous randomized controlled clinical trials? How can the "soft indices" of traditional Chinese medicine, such as symptoms and signs, be combined with the "hard indices" obtained from the physical and chemical examinations of modern medicine? How can the best evidence required by evidence-based medicine be obtained from the large body of traditional Chinese medicine clinical research data? Structuring the data of traditional Chinese medicine clinical literature is therefore necessary: it facilitates archiving of the literature, construction of knowledge bases, analysis of diagnosis and treatment experience, research and development of new medicines, and the training of personnel who study and build traditional Chinese medicine data with informatics methods. However, because existing natural language processing has not been closely combined with traditional Chinese medicine, the prior art has certain defects and shortcomings.
First, although some traditional Chinese medicine clinical literature data have been given a simple structure by manual extraction, or by regular extraction followed by manual correction, the sheer volume of such literature and the variation in content structure, writing style, dependency syntax, standard and alternative names, and the like mean that extraction and judgment still cannot be performed accurately and efficiently even at great labor cost, which hinders further research in the big data era. Second, few existing techniques apply natural language processing and deep learning to traditional Chinese medicine clinical literature, so researchers lack convenient support for studying the relationships between disease onset patterns and factors such as medicines and dosages in the field of traditional Chinese medicine.

An existing traditional Chinese medicine literature data structuring system mainly comprises three parts: word extraction from Chinese medicine literature; PDF parsing and recognition together with client identity verification; and user-defined word lists together with knowledge graph construction. On the one hand, that system extracts words by means of a Chinese medicine word list, so only words that appear in the list can be identified and out-of-vocabulary words cannot. Improving extraction accuracy requires supplementing the word list with new words, which consumes a great deal of time. On the other hand, the extraction rules must be formulated manually, and adding a new rule is a complex process.

Disclosure of Invention

The invention addresses the following problems of the prior art: extraction results are inaccurate; the checking workload is large; upgrading is complex because extraction rules are preset manually; and checked content cannot be used for self-learning, so the results do not become more accurate with use.

To solve the above technical problems, the invention provides the following technical solutions:

In one aspect, the invention provides a deep learning-based traditional Chinese medicine clinical literature data structuring method, which is implemented by an electronic device and comprises the following steps:

S1, acquiring a document to be processed.

S2, inputting the document to be processed into a pre-constructed document data structuring model.

S3, obtaining structured text based on the document to be processed and the document data structuring model.

Optionally, the construction process of the document data structured model in S2 includes:

S21, acquiring a traditional Chinese medicine clinical document sample data set, and preprocessing the sample data set.

S22, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool (a pool of regular extraction sentence patterns) and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set.

S23, constructing a neural network model based on the self-attention Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structuring model.

S24, inputting the test set into the document data structuring model to obtain a predicted target point, and extracting from the one or more sentences containing the predicted target point according to the regular pool to obtain predicted structured text.

S25, manually checking the predicted structured text; if the manual check finds it inconsistent, returning to S21; if the manual check finds it consistent, outputting the document data structuring model.

Optionally, preprocessing the sample data set in S21 includes:

splitting the sample data in the sample data set, and deleting keyword, date, header and number information from the split sample data.

Optionally, the labeling the data of the preprocessed sample data set in S22 includes:

Setting and ordering labels according to the content of the preprocessed sample data set, wherein the content of each label is a description of the content to be structured.

Labeling the content of the preprocessed sample data set according to the labels, and associating the labeled content with the corresponding label.

Optionally, obtaining the regular pool according to the obtained labeling data in S22 includes:

extracting the sentences in which the annotation data are located, removing the annotation data from those sentences, dynamically generating regular extraction sentence patterns, and storing the regular extraction sentence patterns in the regular pool.

Optionally, obtaining the labeling set according to the obtained labeling data in S22 includes:

performing sequence labeling on the labeling data using the BIO labeling method to obtain the labeling set.

In another aspect, the invention provides a deep learning-based traditional Chinese medicine clinical literature data structuring device, which is used to implement the deep learning-based traditional Chinese medicine clinical literature data structuring method and comprises:

an acquisition module, configured to acquire a document to be processed;

an input module, configured to input the document to be processed into the pre-constructed document data structuring model; and

an output module, configured to obtain structured text based on the document to be processed and the document data structuring model.

Optionally, the input module is further configured to:

In one aspect, an electronic device is provided, the electronic device comprising a processor and a memory, the memory storing at least one instruction, the at least one instruction loaded and executed by the processor to implement the above-described deep learning-based traditional Chinese medicine clinical literature data structuring method.

In one aspect, a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the above-described deep learning-based traditional Chinese medicine clinical literature data structuring method is provided.

The technical solutions provided by the embodiments of the invention have at least the following beneficial effects:

In this scheme, a machine learning method is adopted, which is convenient, fast, accurate and efficient. The output can also be manually checked, so that the system completes multiple rounds of automatic learning, saving time and optimizing the system. The dynamic regular pool concept makes secondary use of the data labeling results: expressions are extracted from the labeling results to form an expression library and an expression set (which can assist dependency syntactic analysis), and the weight of each expression in the set is determined by its number of occurrences. In the document data structuring process, model prediction plays two roles: (1) directly yielding a result, and (2) yielding the sentence in which the result lies. Which role applies is decided by the score of the result; if the second applies, the dynamic regular pool is invoked to obtain accurate structured data.

Furthermore, during data structuring, the sentence containing the target entity is located first, and the target entity is then extracted from that sentence using the content of the regular pool. The text to be extracted can thus be extracted effectively, entities at other positions in the document are not mistakenly identified, and recognition accuracy improves.

The invention remedies the loose coupling between natural language processing and traditional Chinese medicine in the prior art, and the defects of existing traditional Chinese medicine literature data structuring systems. In the method, annotators label traditional Chinese medicine clinical documents using the BIO labeling method, and a neural network is built from the labeling results and trained by deep learning into an NLP (Natural Language Processing) data model. Dynamic regular matching is then applied to the output of the NLP model according to the ontology-related attribute concepts of the traditional Chinese medicine literature field, the coordinates of the target content are located, the relevant literature data are extracted, and the extraction results are manually corrected. The corrected results are reorganized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, making the data mining results more intelligent and more precise.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a data structuring method of a traditional Chinese medicine clinical document based on deep learning;

FIG. 2 is a schematic flow chart of a method for constructing a data structuring model of the document of the present invention;

FIG. 3 is a schematic diagram of a sample of the clinical literature of the traditional Chinese medicine;

FIG. 4 is a schematic diagram of the tag contents of the present invention;

FIG. 5 is a regularized sentence pattern extraction schematic of the present invention;

FIG. 6 is a schematic diagram of the data marking results of the present invention;

FIG. 7 is a schematic structural diagram of a data structuring method of a traditional Chinese medicine clinical document based on deep learning;

FIG. 8 is a block diagram of a data structuring device of a traditional Chinese medicine clinical document based on deep learning;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the embodiment of the invention provides a data structuring method of a traditional Chinese medicine clinical document based on deep learning, which is implemented by electronic equipment, and the processing flow of the method can comprise the following steps:

S11, acquiring a document to be processed.

S12, inputting the document to be processed into a pre-constructed document data structuring model.

S13, obtaining structured text based on the document to be processed and the document data structuring model.

Optionally, the construction process of the document data structured model in S12 includes:

S121, acquiring a traditional Chinese medicine clinical document sample data set, and preprocessing the sample data set.

S122, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set.

S123, constructing a neural network model based on the self-attention Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structuring model.

S124, inputting the test set into the document data structuring model to obtain a predicted target point, and extracting from the one or more sentences containing the predicted target point according to the regular pool to obtain predicted structured text.

S125, manually checking the predicted structured text; if the manual check finds it inconsistent, returning to S121; if the manual check finds it consistent, outputting the document data structuring model.

Optionally, preprocessing the sample data set in S121 includes:

Optionally, the labeling the data of the preprocessed sample data set in S122 includes:

Optionally, obtaining the regular pool according to the obtained labeling data in S122 includes:

Optionally, obtaining the labeling set according to the obtained labeling data in S122 includes:

In the embodiment of the invention, a machine learning method is adopted, which is convenient, fast, accurate and efficient. The output can also be manually checked, so that the system completes multiple rounds of automatic learning, saving time and optimizing the system. The dynamic regular pool concept makes secondary use of the data labeling results: expressions are extracted from the labeling results to form an expression library and an expression set (which can assist dependency syntactic analysis), and the weight of each expression in the set is determined by its number of occurrences. In the document data structuring process, model prediction plays two roles: (1) directly yielding a result, and (2) yielding the sentence in which the result lies. Which role applies is decided by the score of the result; if the second applies, the dynamic regular pool is invoked to obtain accurate structured data.
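The weighting of expressions and the score-based choice between the two roles can be illustrated with a minimal sketch. It assumes that an expression's weight is its relative frequency and that the choice is made against a fixed score threshold; the text fixes neither detail, and the names `expression_weights`, `choose_route` and the value 0.9 are hypothetical.

```python
from collections import Counter


def expression_weights(expressions):
    """Weight each expression by its share of all occurrences (an assumed formula)."""
    counts = Counter(expressions)
    total = sum(counts.values())
    return {expr: n / total for expr, n in counts.items()}


def choose_route(score, threshold=0.9):
    """Use the model result directly when its score is high enough,
    otherwise fall back to the regular pool. The threshold is illustrative."""
    return "direct" if score >= threshold else "regular_pool"


weights = expression_weights(["pattern_a", "pattern_a", "pattern_b", "pattern_a"])
```

Here `weights` ranks `pattern_a` above `pattern_b` because it was seen three times out of four, so it could be tried first when the regular pool is invoked.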

As shown in fig. 2, an embodiment of the present invention provides a method for constructing a document data structured model, where the method is applied to an electronic device, and the method includes:

In one possible embodiment, as shown in FIG. 3, each traditional Chinese medicine clinical document sample can be split into three parts: "abstract", "data and method", and "result". The content to be structured lies in these three parts. During splitting, information such as keywords, dates, headers and serial numbers is filtered out, so that the traditional Chinese medicine clinical literature data take the form Map<abstract, content>, Map<data and method, content>, Map<result, content>.
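The splitting and filtering step can be sketched as follows. This is only an illustration under stated assumptions: section headings are assumed to appear on their own lines, and the noise patterns for keywords, dates, headers and serial numbers are hypothetical stand-ins for whatever rules a real implementation would use.

```python
import re

# Hypothetical section headings; real documents may word them differently.
SECTION_HEADINGS = ("abstract", "data and method", "result")

# Hypothetical patterns for lines to drop: keywords, dates, running headers, serial numbers.
NOISE_PATTERNS = [
    re.compile(r"^\s*keywords?\s*[:：]", re.IGNORECASE),
    re.compile(r"^\s*\d{4}-\d{2}-\d{2}\s*$"),
    re.compile(r"^\s*(journal|vol\.|no\.)", re.IGNORECASE),
    re.compile(r"^\s*\d+\s*$"),
]


def split_document(text):
    """Return {section heading: content}, with noise lines removed."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or any(p.search(stripped) for p in NOISE_PATTERNS):
            continue
        lowered = stripped.lower()
        heading = next((h for h in SECTION_HEADINGS if lowered.startswith(h)), None)
        # Treat a line as a heading only if it is little more than the heading itself.
        if heading is not None and len(stripped) <= len(heading) + 2:
            current = heading
            sections[current] = []
        elif current is not None:
            sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}


sample = """Journal of TCM
2021-11-15
Abstract
Objective: observe the effect.
Keywords: stroke; acupuncture
Data and methods
60 patients were enrolled.
12
Results
The total effective rate was 90%.
"""
structured_sections = split_document(sample)
```

The resulting dictionary plays the role of the three Map entries described above, with one key per section.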

S22, carrying out data annotation on the preprocessed sample data set.

In a possible implementation, tag definition and ordering are performed first, and tag content is a description of content to be structured, as shown in fig. 4. And respectively marking the content to be structured according to the three parts of content of the preprocessed sample data set, and associating the marked content with the corresponding label.

S23, obtaining a regular pool according to the obtained labeling data.

In a possible embodiment, a regular extraction sentence pattern is derived from each labeling result, as shown in FIG. 5. For example, in the sentence "the clinical main symptoms are dysphagia and dysarthria", where dysphagia and dysarthria are the target content, the regular extraction sentence pattern is what remains once the target content is removed, i.e. the frame "the clinical main symptoms are ...".
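One way to generate such a pattern is to replace each annotated span with a capture group and escape the remaining text. The sketch below assumes the annotations are given in order of appearance and that a `Counter` serves as the regular pool; `make_pattern` and `regular_pool` are hypothetical names, not the patent's implementation.

```python
import re
from collections import Counter


def make_pattern(sentence, annotations):
    """Turn a labeled sentence into a regex: annotated spans become capture groups."""
    parts = []
    cursor = 0
    for target in annotations:
        start = sentence.index(target, cursor)  # raises ValueError if the span is absent
        parts.append(re.escape(sentence[cursor:start]))
        parts.append("(.+?)")
        cursor = start + len(target)
    parts.append(re.escape(sentence[cursor:]))
    return "^" + "".join(parts) + "$"


regular_pool = Counter()
pattern = make_pattern(
    "the clinical main symptoms are dysphagia and dysarthria",
    ["dysphagia", "dysarthria"],
)
regular_pool[pattern] += 1  # occurrence counts can later serve as pattern weights

# The stored pattern now extracts targets from a new sentence with the same frame.
match = re.match(pattern, "the clinical main symptoms are cough and fever")
```

Because only the frame of the sentence is kept, the same pattern applies to any sentence that shares that frame, which is what lets the pool grow from labeled data rather than hand-written rules.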

S24, obtaining a labeling set according to the obtained labeling data.

Sequence labeling is performed on the labeling data using the BIO labeling method to obtain a labeling set. The labeling set is divided into a training set, a verification set and a test set.

In a possible embodiment, as shown in fig. 6, in the preprocessed clinical document sample data of the traditional Chinese medicine, a sequence represents a split sentence, and a structured entity represents an element in the sentence.

The BIO labeling includes: labeling each element as "B-X", "I-X", or "O"; wherein "B-X" means that the fragment in which the element is located is of the X type and that the element is at the beginning of the fragment; "I-X" means that the fragment in which the element is located is of the X type and that the element is in the middle of the fragment; "O" means not of any type.

The structured entities are extracted and placed in a word list, the document data are converted into a single line, occurrences of the word-list entries are found and replaced with tags of the form B-label, I-label and O, and the text is then split into individual characters to construct the labeling set.
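A character-level BIO tagging pass of this kind can be sketched as follows. The entity list stands in for the word list, and the label name `SYMPTOM` is a hypothetical example; overlapping entities are simply skipped, which is an assumption of this sketch.

```python
def bio_tag(text, entities):
    """Tag each character of `text` as B-label, I-label or O.

    `entities` is a list of (surface string, label) pairs from the word list.
    """
    tags = ["O"] * len(text)
    for surface, label in entities:
        if not surface:
            continue
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            # Only tag spans that do not overlap an already tagged entity.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            start = text.find(surface, end)
    return list(zip(text, tags))


# "临床主要症状为吞咽困难" means "the clinical main symptom is dysphagia".
tagged = bio_tag("临床主要症状为吞咽困难", [("吞咽困难", "SYMPTOM")])
```

Each element of `tagged` is a (character, tag) pair, which is the usual one-token-per-line input form for NER training.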

When NER (Named Entity Recognition) training is performed with the neural network model, the BIO labeling set must be divided into a training set, a verification set and a test set, for example in the ratio 7:2:1, so that metrics such as accuracy, precision, recall and F1 can be obtained and the quality of the model evaluated.
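The 7:2:1 division and the evaluation metrics can be illustrated with a short sketch. The shuffling seed and the choice to score predictions as sets of entities are assumptions made here for illustration; `split_7_2_1` and `precision_recall_f1` are hypothetical helper names.

```python
import random


def split_7_2_1(samples, seed=0):
    """Shuffle and divide samples into training, verification and test sets (7:2:1)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def precision_recall_f1(gold, predicted):
    """Compare predicted entities with gold entities, both given as iterables."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


train_set, val_set, test_set = split_7_2_1(range(100))
scores = precision_recall_f1({"dysphagia", "dysarthria"}, {"dysphagia", "cough"})
```

Integer arithmetic is used for the split sizes so that the three parts always add up to the full data set.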

S25, constructing a neural network model based on the self-attention Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structuring model.

In a possible implementation, a Transformer-based NLP-NER model is built. NLP is an important direction in the fields of computer science and artificial intelligence; it studies the theories and methods that enable effective communication between humans and computers in natural language. It mainly comprises two parts, NLU (Natural Language Understanding) and NLG (Natural Language Generation).

The Transformer adopts an Encoder-Decoder architecture. Encoder-Decoder is a model framework, a general class of algorithms rather than one specific algorithm: the encoder first converts the input sequence into a dense vector of fixed dimension, and the decoding stage then generates the target output from that encoded state. The Transformer has great advantages in parallelism and long-range dependency, but analysis of its attention mechanism shows shortcomings in directionality, relative position and sparsity. Based on the characteristics of the structured data of traditional Chinese medicine clinical literature, a simple improvement to the attention scoring function greatly improves the performance of the Transformer structure on the clinical literature NER task. The attention scoring function itself is prior art and is not described again here; only the improvement is described: after softmax(Q·K) is computed, each point is weighted once. Part of the PyTorch code is as follows:

self.time_weighting = nn.Parameter(torch.ones(self.n_head, config.window_len, config.

...

att = F.softmax(att, dim=-1)  # this is the original code

att = att * self.time_weighting[:, :T, :T]  # only this line needs to be added

att = self.attn_drop(att)  # this is the original code

The improvement has two advantages. First, tokens at different distances should contribute differently to a given position. Second, for a token near the beginning of training, the overall weight of self-attention should be reduced because the observation window is smaller and the information content relatively low. To address the particular directionality, relative-position and sparsity characteristics of traditional Chinese medicine literature data, each point is weighted once after softmax(Q·K) is computed, which makes the prediction result more accurate.
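The effect of weighting each point after the softmax can be seen in a small worked example. The sketch below is a simplification under stated assumptions: it uses plain Python lists for a single attention head instead of PyTorch tensors, omits masking and dropout, and treats the weight table as fixed numbers rather than a learned parameter. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_attention(scores, time_weighting):
    """Apply softmax to each row of a T x T score matrix, then scale every
    entry (i, j) by time_weighting[i][j], using only the top-left T x T
    corner of a possibly larger weight table."""
    T = len(scores)
    weighted = []
    for i in range(T):
        probs = softmax(scores[i])
        weighted.append([probs[j] * time_weighting[i][j] for j in range(T)])
    return weighted


scores = [[0.0, 0.0], [0.0, 0.0]]      # equal raw scores, so softmax gives 0.5 everywhere
weights = [[1.0, 0.5, 1.0],            # 3 x 3 table; only the 2 x 2 corner is used
           [1.0, 1.0, 1.0],
           [1.0, 1.0, 1.0]]
attention = weighted_attention(scores, weights)
```

Note that after weighting, a row no longer necessarily sums to 1: in this example the first row sums to 0.75, which is how the overall self-attention weight of a position can be reduced.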

Training terminates when the accuracy and F1 value of round n+1 are less than or equal to those of round n, and the model is thereby obtained. The specific training parameters are as follows:

S26, inputting the test set into a document data structuring model to obtain a predicted target point, and extracting one or more sentences of the predicted target point according to the regular pool to obtain a predicted structured text.

In a possible implementation, the sample data of the test set first undergo initialization, splitting into short sentences and length checking. The trained model is then used to predict on the test set; the predicted content is located by its coordinates in the test set to determine which sentence it belongs to, and the coordinates are expanded to the one or more sentences in which it lies to give an intermediate result. The content of the regular pool is then invoked to extract from that result, and what is extracted is the content to be structured.
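The coordinate expansion and the subsequent extraction can be sketched as follows. The sentence splitter is a deliberately naive assumption (it breaks on any period, so decimals such as "90.5" would be split), and the pool is represented as a plain list of regex strings; all names are hypothetical.

```python
import re

# Naive sentence boundary rule covering Western and Chinese end punctuation.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def expand_to_sentences(text, start, end):
    """Return the sentence(s) overlapping the predicted character span [start, end)."""
    return [m.group().strip() for m in SENTENCE.finditer(text)
            if m.start() < end and m.end() > start]


def extract_with_pool(sentence, pool):
    """Try each pattern in the pool; return the captured groups of the first match."""
    for pattern in pool:
        match = re.search(pattern, sentence)
        if match:
            return list(match.groups())
    return []


text = ("Sixty patients were enrolled. "
        "The clinical main symptoms are dysphagia and dysarthria. "
        "All patients recovered.")
start = text.index("dysphagia")          # stands in for the model's predicted span
end = start + len("dysphagia")
pool = [r"clinical main symptoms are (.+?) and (.+?)\."]

sentences = expand_to_sentences(text, start, end)
extracted = extract_with_pool(sentences[0], pool)
```

Restricting the pattern search to the located sentence is what keeps entities elsewhere in the document from being picked up by the same pattern.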

S27, manually checking the predicted structured texts, and if the manual checking results are inconsistent, turning to execute S21; and if the manual checking results are consistent, outputting the document data structured model.

In a possible implementation, as shown in FIG. 7, the structured content obtained by the method of S21-S26 undergoes a second, manual calibration. After calibration, the process is repeated from S21 until the obtained content is consistent with the calibration result, and the model is updated each time, so that the model keeps learning and becomes more accurate through this self-learning method.
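The repeat-until-consistent loop can be outlined in a few lines. This is a schematic sketch only: `train`, `predict` and `review` are hypothetical callables standing in for model training, model prediction and manual calibration, and the cap on the number of rounds is an added assumption.

```python
def self_learning_loop(train, predict, review, training_data, documents, max_rounds=10):
    """Retrain with manually corrected results until predictions match the corrections.

    train(training_data) -> model
    predict(model, doc) -> structured content
    review(doc, predicted) -> content after manual calibration
    """
    model = train(training_data)
    for _ in range(max_rounds):
        corrections = []
        for doc in documents:
            predicted = predict(model, doc)
            corrected = review(doc, predicted)
            if corrected != predicted:
                corrections.append((doc, corrected))
        if not corrections:
            return model  # predictions are consistent with the calibration result
        training_data = training_data + corrections
        model = train(training_data)
    return model


# Toy stand-ins: the "model" is a lookup table and the reviewer knows the truth.
truth = {"doc1": "dysphagia", "doc2": "cough"}
final_model = self_learning_loop(
    train=lambda data: dict(data),
    predict=lambda model, doc: model.get(doc, ""),
    review=lambda doc, predicted: truth[doc],
    training_data=[("doc1", "dysphagia")],
    documents=["doc1", "doc2"],
)
```

In the toy run, the first round finds one disagreement for `doc2`, the correction is added to the training data, and the second round ends the loop.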

As shown in fig. 8, an embodiment of the present invention provides a device 800 for structuring data of clinical documents of traditional Chinese medicine based on deep learning, where the device 800 is applied to implement a method for structuring data of clinical documents of traditional Chinese medicine based on deep learning, and the device 800 includes:

an acquisition module 810, configured to acquire a document to be processed;

an input module 820, configured to input the document to be processed into the pre-constructed document data structuring model; and

an output module 830, configured to obtain structured text based on the document to be processed and the document data structuring model.

Optionally, the input module 820 is further configured to:

FIG. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present invention. The electronic device 900 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 901 and one or more memories 902, where at least one instruction is stored in the memories 902 and is loaded and executed by the processors 901 to implement the following steps of the deep learning-based traditional Chinese medicine clinical literature data structuring method:

S1, acquiring a document to be processed.

In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the above-described deep learning-based traditional Chinese medicine clinical literature data structuring method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A method for structuring data of clinical literature of traditional Chinese medicine based on deep learning, which is characterized by comprising the following steps:

S1, acquiring a document to be processed;

S2, inputting the document to be processed into a pre-constructed document data structuring model;

S3, obtaining a structured text based on the document to be processed and the document data structuring model;

the construction process of the document data structured model in S2 includes:

S21, acquiring a sample data set of a traditional Chinese medicine clinical document, and preprocessing the sample data set, wherein the format of the preprocessed sample data set is Map < abstract, content >, map < data and method, content >, map < result, content > format;

S22, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set;

S23, constructing a neural network model based on a self-attention Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structuring model; wherein the attention scoring function of the document data structuring model weights each point once after calculating softmax;

S24, inputting the test set into the document data structuring model to obtain a predicted target point, and extracting one or more sentences of the predicted target point according to the regular pool to obtain a predicted structured text;

S25, manually checking the predicted structured text, and if the manual checking results are inconsistent, returning to S21; outputting the document data structuring model if the manual checking results are consistent;

the step S22 of obtaining the regular pool according to the obtained labeling data comprises the following steps:

extracting sentences in which the annotation data are located, removing the annotation data from the sentences, dynamically generating regular extraction sentence patterns, and storing the regular extraction sentence patterns into a regular pool.

2. The method of claim 1, wherein the preprocessing of the sample dataset in S21 comprises:

splitting the sample data in the sample data set, and deleting the keyword, date, header and number information in the split sample data.

3. The method according to claim 1, wherein the data labeling of the preprocessed sample dataset in S22 comprises:

Setting labels and ordering according to the content of the preprocessed sample data set, wherein the content of the labels is the description of the content to be structured;

4. The method according to claim 1, wherein obtaining the annotation set from the obtained annotation data in S22 comprises:

performing sequence labeling on the labeling data using the BIO labeling method to obtain the labeling set.

5. A device for structuring data of clinical literature of traditional Chinese medicine based on deep learning, characterized in that the device comprises:

The acquisition module is used for acquiring the document to be processed;

the input module is used for inputting the document to be processed into a pre-constructed document data structuring model;

The output module is used for obtaining a structured text based on the document to be processed and the document data structured model;

the construction process of the document data structured model comprises the following steps:

6. The apparatus of claim 5, wherein the preprocessing of the sample dataset in S21 comprises:

7. The apparatus of claim 5, wherein data labeling the preprocessed sample dataset comprises:

Setting labels and ordering according to the content of the preprocessed sample data set, wherein the content of the labels is the description of the content to be structured; labeling the content of the preprocessed sample data set according to the label, and associating the labeled content with the corresponding label.