CN114139610B - Deep learning-based traditional Chinese medicine clinical literature data structuring method and device - Google Patents

Publication number
CN114139610B
Authority
CN
China
Prior art keywords
data
document
content
labeling
structured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111349067.2A
Other languages
Chinese (zh)
Other versions
CN114139610A (en)
Inventor
雷蕾
李海燕
杨乐
刘华云
李小阳
王晰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Information On Traditional Chinese Medicine Cacms
Original Assignee
Institute Of Information On Traditional Chinese Medicine Cacms
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Information On Traditional Chinese Medicine Cacms
Priority to CN202111349067.2A
Publication of CN114139610A
Application granted
Publication of CN114139610B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep learning-based method and device for structuring traditional Chinese medicine clinical literature data, and relates to the technical field of data processing. The method comprises: acquiring a document to be processed; inputting the document to be processed into a pre-constructed document data structuring model; and obtaining structured text based on the document to be processed and the document data structuring model. The invention addresses the following problems of the prior art: extraction results are inaccurate; the correction workload is large; upgrading is complex because extraction rules are preset manually; and corrected content cannot be used for self-learning, so accuracy cannot improve with use.

Description

Deep learning-based traditional Chinese medicine clinical literature data structuring method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for structuring data of clinical documents of traditional Chinese medicine based on deep learning.
Background
The clinical literature of traditional Chinese medicine contains abundant textual and numerical information. A great deal of effective clinical experience remains to be mined, and the individualized diagnosis and treatment experience of veteran traditional Chinese medicine practitioners urgently needs to be summarized and passed on. As the informatization of traditional Chinese medicine gathers pace, how can this experience be combined with direct evidence from rigorous randomized controlled clinical trials? How can the "soft indicators" of traditional Chinese medicine, such as symptoms and signs, be combined with the "hard indicators" obtained from the physical and chemical examinations of modern medicine? How can the best evidence required by evidence-based medicine be obtained from the large body of traditional Chinese medicine clinical research data? Structuring the data of traditional Chinese medicine clinical literature is therefore needed: it facilitates archiving the literature, building knowledge bases, analyzing diagnosis and treatment experience, promoting the research and development of new medicines, and training personnel who study and build traditional Chinese medicine data using informatics methods. However, because existing natural language processing has not been closely integrated with traditional Chinese medicine, the prior art has certain defects and shortcomings.
First, although some traditional Chinese medicine clinical literature data has been structured in a simple way, by manual extraction or by regular-expression extraction followed by manual correction, the literature is massive and varies in content composition, writing style, dependency syntax, and formal and alternative names. Even at great labor cost, extraction and judgment still cannot be performed accurately and efficiently, which hinders further research in the era of big data. Second, few existing techniques apply natural language processing and deep learning to traditional Chinese medicine clinical literature, so researchers get little support when studying how disease patterns relate to factors such as medicines and dosages.
An existing traditional Chinese medicine literature data structuring system mainly comprises three parts: word extraction from Chinese medicine literature; PDF parsing and recognition together with client identity verification; and user-defined word lists and knowledge graph construction. On the one hand, that system extracts words by means of a Chinese medicine word list, so it can recognize only words that appear in the list and cannot recognize out-of-vocabulary words; improving extraction accuracy requires adding new words to the list, which is very time-consuming. On the other hand, its extraction rules must be formulated manually, and adding a new rule is a complex process.
Disclosure of Invention
The invention addresses the following problems of the prior art: extraction results are inaccurate; the checking workload is large; upgrading is complex because extraction rules are preset manually; and checked content cannot be used for self-learning, so accuracy cannot improve with use.
In order to solve the technical problems, the invention provides the following technical scheme:
In one aspect, the invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data, which is implemented by an electronic device and comprises the following steps:
S1, acquiring a document to be processed.
S2, inputting the document to be processed into a pre-constructed document data structuring model.
S3, obtaining a structured text based on the document to be processed and the document data structuring model.
Optionally, the construction process of the document data structuring model in S2 includes:
S21, acquiring a traditional Chinese medicine clinical document sample data set, and preprocessing the sample data set.
S22, performing data annotation on the preprocessed sample data set, obtaining a regular-expression pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a validation set and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model using the training set and the validation set to obtain the document data structuring model.
S24, inputting the test set into the document data structuring model to obtain predicted target points, and extracting from the one or more sentences containing each predicted target point according to the regular-expression pool to obtain a predicted structured text.
S25, manually checking the predicted structured text; if the manual check is inconsistent with the prediction, returning to S21; if it is consistent, outputting the document data structuring model.
Optionally, preprocessing the sample data set in S21 includes:
splitting the sample data in the sample data set, and deleting keyword, date, header and numbering information from the split sample data.
Optionally, performing data annotation on the preprocessed sample data set in S22 includes:
defining and ordering labels according to the content of the preprocessed sample data set, wherein each label describes content to be structured;
annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding label.
Optionally, obtaining the regular-expression pool from the annotation data in S22 includes:
extracting the sentences that contain the annotation data, removing the annotation data from those sentences to dynamically generate regular-expression extraction patterns, and storing the patterns in the regular-expression pool.
Optionally, obtaining the annotation set from the annotation data in S22 includes:
performing sequence labeling on the annotation data using the BIO labeling method to obtain the annotation set.
In another aspect, the invention provides a deep learning-based device for structuring traditional Chinese medicine clinical literature data, which is used to implement the deep learning-based method for structuring traditional Chinese medicine clinical literature data, and comprises:
an acquisition module, configured to acquire the document to be processed;
an input module, configured to input the document to be processed into the pre-constructed document data structuring model;
an output module, configured to obtain the structured text based on the document to be processed and the document data structuring model.
Optionally, the input module is further configured to perform:
S21, acquiring a traditional Chinese medicine clinical document sample data set, and preprocessing the sample data set.
S22, performing data annotation on the preprocessed sample data set, obtaining a regular-expression pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a validation set and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model using the training set and the validation set to obtain the document data structuring model.
S24, inputting the test set into the document data structuring model to obtain predicted target points, and extracting from the one or more sentences containing each predicted target point according to the regular-expression pool to obtain a predicted structured text.
S25, manually checking the predicted structured text; if the manual check is inconsistent with the prediction, returning to S21; if it is consistent, outputting the document data structuring model.
Optionally, the input module is further configured to:
split the sample data in the sample data set, and delete keyword, date, header and numbering information from the split sample data.
Optionally, the input module is further configured to:
define and order labels according to the content of the preprocessed sample data set, wherein each label describes content to be structured;
annotate the content of the preprocessed sample data set according to the labels, and associate the annotated content with the corresponding label.
Optionally, the input module is further configured to:
extract the sentences that contain the annotation data, remove the annotation data from those sentences to dynamically generate regular-expression extraction patterns, and store the patterns in the regular-expression pool.
Optionally, the input module is further configured to:
perform sequence labeling on the annotation data using the BIO labeling method to obtain the annotation set.
In one aspect, an electronic device is provided, the electronic device comprising a processor and a memory, the memory storing at least one instruction, the at least one instruction loaded and executed by the processor to implement the above-described deep learning-based traditional Chinese medicine clinical literature data structuring method.
In one aspect, a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the above-described deep learning-based traditional Chinese medicine clinical literature data structuring method is provided.
The technical solution provided by the embodiments of the invention has at least the following beneficial effects:
In this solution, a machine learning method is adopted, which is convenient, fast, accurate and efficient. The output can also be checked manually, so that the system completes multiple rounds of automatic learning, saving time and optimizing the system. Using the concept of a dynamic regular-expression pool, the data annotation results are reused: expressions are extracted from them to form an expression library, that is, an expression set (which can assist dependency syntactic analysis), and the weight of each expression in the set is determined by its number of occurrences. During document data structuring, model prediction plays two roles: (1) directly obtaining the result; (2) locating the sentence that contains the result. Which role is used is decided by the score of the result; if the second is used, the dynamic regular-expression pool is called to obtain accurate structured data.
Furthermore, when structuring data, the sentence containing the target entity is located first, and the target entity in that sentence is then extracted using the content of the regular-expression pool. The text to be extracted is thereby extracted effectively, entities at other positions in the document are not mistakenly recognized, and recognition accuracy is improved.
The invention addresses the loose integration of natural language processing with traditional Chinese medicine in the prior art, and the defects and shortcomings of existing traditional Chinese medicine literature data structuring systems. In the method, annotators annotate traditional Chinese medicine clinical documents using the BIO labeling method, and a neural network is built from the annotation results by deep learning to train an NLP (Natural Language Processing) data model. Based on the NLP model, dynamic regular-expression matching is applied to the training results according to the ontology-related attribute concepts of the traditional Chinese medicine literature domain, the coordinates of the target content are located, the relevant document data is extracted, and the extraction results are corrected manually. The corrected results are reorganized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, making the data mining results more intelligent and more precise.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a data structuring method of a traditional Chinese medicine clinical document based on deep learning;
FIG. 2 is a schematic flow chart of a method for constructing a data structuring model of the document of the present invention;
FIG. 3 is a schematic diagram of a sample of the clinical literature of the traditional Chinese medicine;
FIG. 4 is a schematic diagram of the tag contents of the present invention;
FIG. 5 is a schematic diagram of regular-expression sentence pattern extraction of the present invention;
FIG. 6 is a schematic diagram of the data annotation results of the present invention;
FIG. 7 is a schematic structural diagram of a data structuring method of a traditional Chinese medicine clinical document based on deep learning;
FIG. 8 is a block diagram of a data structuring device of a traditional Chinese medicine clinical document based on deep learning;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an embodiment of the invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data, which is implemented by an electronic device. The processing flow of the method may comprise the following steps:
S11, acquiring a document to be processed.
S12, inputting the document to be processed into a pre-constructed document data structuring model.
S13, obtaining a structured text based on the document to be processed and the document data structuring model.
Optionally, the construction process of the document data structuring model in S12 includes:
S121, acquiring a traditional Chinese medicine clinical document sample data set, and preprocessing the sample data set.
S122, performing data annotation on the preprocessed sample data set, obtaining a regular-expression pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a validation set and a test set.
S123, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model using the training set and the validation set to obtain the document data structuring model.
S124, inputting the test set into the document data structuring model to obtain predicted target points, and extracting from the one or more sentences containing each predicted target point according to the regular-expression pool to obtain a predicted structured text.
S125, manually checking the predicted structured text; if the manual check is inconsistent with the prediction, returning to S121; if it is consistent, outputting the document data structuring model.
Optionally, preprocessing the sample data set in S121 includes:
splitting the sample data in the sample data set, and deleting keyword, date, header and numbering information from the split sample data.
Optionally, performing data annotation on the preprocessed sample data set in S122 includes:
defining and ordering labels according to the content of the preprocessed sample data set, wherein each label describes content to be structured;
annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding label.
Optionally, obtaining the regular-expression pool from the annotation data in S122 includes:
extracting the sentences that contain the annotation data, removing the annotation data from those sentences to dynamically generate regular-expression extraction patterns, and storing the patterns in the regular-expression pool.
Optionally, obtaining the annotation set from the annotation data in S122 includes:
performing sequence labeling on the annotation data using the BIO labeling method to obtain the annotation set.
As shown in FIG. 2, an embodiment of the invention provides a method for constructing the document data structuring model, which is applied to an electronic device and comprises:
S21, acquiring a traditional Chinese medicine clinical document sample data set, and preprocessing the sample data set.
The sample data in the sample data set is split, and keyword, date, header and numbering information is deleted from the split sample data.
In one possible embodiment, as shown in FIG. 3, each traditional Chinese medicine clinical document sample can be split into three parts: "abstract", "data and method", and "result". These three parts hold the content to be structured; during splitting, information such as keywords, dates, headers and serial numbers is filtered out, so that the document data takes the form Map<abstract, content>, Map<data and method, content>, Map<result, content>.
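The splitting and filtering step can be illustrated with a minimal Python sketch. The section headings come from the description above, but how headings and noise lines are actually detected is not stated in the source, so the `NOISE_PATTERNS` list and the `split_document` helper are hypothetical.

```python
import re

# The three parts named in the description; the detection rule is an assumption.
SECTION_HEADINGS = ("abstract", "data and method", "result")

# Hypothetical patterns for lines to filter (keywords, dates, bare numbering).
NOISE_PATTERNS = [
    re.compile(r"^\s*keywords?\s*[:：]", re.IGNORECASE),
    re.compile(r"^\s*\d{4}-\d{2}-\d{2}\s*$"),
    re.compile(r"^\s*\d+\s*$"),
]


def split_document(text):
    """Return {section heading: content}, dropping header and noise lines."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.lower() in SECTION_HEADINGS:
            current = stripped.lower()
            sections[current] = []
            continue
        # Lines before the first heading are treated as header text and dropped.
        if current is None or any(p.search(stripped) for p in NOISE_PATTERNS):
            continue
        sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}
```

The returned dictionary plays the role of the Map<section, content> pairs described above.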
S22, performing data annotation on the preprocessed sample data set.
In one possible implementation, labels are first defined and ordered; each label describes content to be structured, as shown in FIG. 4. The content to be structured is then annotated in each of the three parts of the preprocessed sample data set, and the annotated content is associated with the corresponding label.
S23, obtaining the regular-expression pool from the annotation data.
The sentences that contain the annotation data are extracted, the annotation data is removed from those sentences to dynamically generate regular-expression extraction patterns, and the patterns are stored in the regular-expression pool.
In one possible embodiment, regular-expression sentence patterns are extracted from the annotation results, as shown in FIG. 5. For example, in the sentence "the clinical main symptoms are dysphagia and dysarthria", the target content is "dysphagia and dysarthria", so the extracted pattern is "the clinical main symptoms are" followed by a slot for the target content.
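As a rough illustration of how such a pattern might be generated, the sketch below replaces the annotated span with a capture group and counts how often each pattern occurs. The helper names `sentence_to_pattern` and `add_to_pool`, and the choice of a greedy `(.+)` group, are assumptions rather than the patent's actual implementation.

```python
import re
from collections import Counter


def sentence_to_pattern(sentence, target):
    """Replace the annotated target with a capture group; escape the rest."""
    start = sentence.index(target)
    prefix = re.escape(sentence[:start])
    suffix = re.escape(sentence[start + len(target):])
    return prefix + "(.+)" + suffix


def add_to_pool(pool, sentence, target):
    """Store the generated pattern in the pool and count its occurrences."""
    pattern = sentence_to_pattern(sentence, target)
    pool[pattern] += 1
    return pattern


pool = Counter()
pattern = add_to_pool(
    pool,
    "the clinical main symptoms are dysphagia and dysarthria",
    "dysphagia and dysarthria",
)
match = re.search(pattern, "the clinical main symptoms are headache and dizziness")
extracted = match.group(1)
```

Keeping the occurrence count alongside each pattern makes it possible to weight patterns later by how often they were seen.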
S24, obtaining the annotation set from the annotation data.
Sequence labeling is performed on the annotation data using the BIO labeling method to obtain the annotation set, which is divided into a training set, a validation set and a test set.
In one possible embodiment, as shown in FIG. 6, in the preprocessed traditional Chinese medicine clinical document sample data, a sequence is a split sentence and a structured entity is an element of the sentence.
BIO labeling labels each element as "B-X", "I-X", or "O", where "B-X" means that the fragment containing the element is of type X and the element is at the beginning of the fragment; "I-X" means that the fragment containing the element is of type X and the element is inside the fragment; and "O" means that the element does not belong to any type.
The structured entities are extracted and put into a word list, the document data is converted into a single line, the entries of the word list are searched for and replaced with B-label, I-label and O tags, and the text is then split into individual characters to build the annotation set.
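The character-level BIO conversion described above can be sketched as follows. This is a minimal illustration: the `bio_tag` helper, the `SYM` label and the rule that overlapping matches are skipped are assumptions not stated in the source.

```python
def bio_tag(text, entities):
    """Tag each character of text with B-X, I-X or O.

    entities is a list of (entity_text, label) pairs taken from the word list.
    """
    tags = ["O"] * len(text)
    for entity, label in entities:
        if not entity:
            continue
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            # Only tag spans that are still untagged (assumed overlap rule).
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            start = text.find(entity, end)
    return list(zip(text, tags))


tagged = bio_tag("has cough", [("cough", "SYM")])
```

Each (character, tag) pair corresponds to one row of the annotation set.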
When training the NER (Named Entity Recognition) neural network model, the BIO annotation set must be divided into a training set, a validation set and a test set, for example in a 7:2:1 ratio, so that metrics such as accuracy, precision, recall and F1 can be obtained to evaluate the quality of the model.
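A 7:2:1 split and the usual precision, recall and F1 definitions can be sketched in plain Python. The function names, the shuffling with a fixed seed and the count-based metric inputs are illustrative assumptions; the source does not say how the split or the metrics are computed.

```python
import random


def split_annotation_set(samples, ratios=(7, 2, 1), seed=0):
    """Shuffle and split samples into training, validation and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def precision_recall_f1(true_positive, false_positive, false_negative):
    """Standard entity-level precision, recall and F1 from raw counts."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


train, val, test = split_annotation_set(range(100))
```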
S25, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model using the training set and the validation set to obtain the document data structuring model.
In one possible implementation, a Transformer-based NLP-NER model is built. NLP is an important direction in computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language, and mainly comprises two parts: NLU (Natural Language Understanding) and NLG (Natural Language Generation).
The Transformer adopts an Encoder-Decoder architecture. Encoder-Decoder is a model framework, a general term for a class of algorithms rather than one specific algorithm: the encoder first converts the input sequence into a dense vector of fixed dimension, and the decoding stage generates the target translation from that activation state. The Transformer has great advantages in parallelism and long-range dependence, but analysis of its attention mechanism shows weaknesses in directionality, relative position and sparsity. Based on the characteristics of structured data in traditional Chinese medicine clinical literature, a simple modification of the attention scoring function greatly improves the performance of the Transformer structure on the clinical literature NER task. The attention scoring function itself is prior art and is not described again here; only the modification is described: after softmax(Q dot K) is computed, each position is weighted once more. The relevant PyTorch code is as follows:
self.time_weighting = nn.Parameter(torch.ones(self.n_head,
    config.window_len, config.
...
att = F.softmax(att, dim=-1)  # this is the original code
att = att * self.time_weighting[:, :T, :T]  # only this line needs to be added
att = self.attn_drop(att)  # this is the original code
This modification has two advantages. First, tokens at different distances should contribute differently to a given position. Second, for a token near the beginning of the sequence, the observation window is small and the information content relatively low, so the overall weight of self-attention should be reduced. To address the directionality, relative position and sparsity issues specific to traditional Chinese medicine literature data, each position is weighted once more after softmax(Q dot K) is computed, which makes the prediction more accurate.
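To make the effect of the extra weighting concrete, the following dependency-free sketch reproduces the idea for a single attention head using plain Python lists instead of PyTorch tensors. It assumes the raw scores are already scaled and masked, and that the weight matrix is square with side at least T; `time_weighted_attention` is a hypothetical name, not part of the patent's code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def time_weighted_attention(scores, time_weighting):
    """Apply softmax row by row, then multiply elementwise by the weights.

    scores: T x T raw attention scores for one head.
    time_weighting: L x L weights with L >= T; only the top-left T x T
    block is used, mirroring the [:T, :T] slice in the snippet above.
    """
    T = len(scores)
    att = [softmax(row) for row in scores]
    return [[att[i][j] * time_weighting[i][j] for j in range(T)] for i in range(T)]


scores = [[0.0, 0.0], [0.0, 0.0]]
weights = [[1.0, 0.0, 1.0], [0.5, 0.5, 1.0], [1.0, 1.0, 1.0]]
weighted = time_weighted_attention(scores, weights)
```

With all weights equal to one the result is ordinary softmax attention; after the weighting, rows no longer have to sum to one, which is how the total self-attention weight of a position can be reduced.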
Training is terminated when the accuracy and F1 value of round n+1 are less than or equal to those of round n, which yields the model. The specific training parameters are as follows:
S26, inputting the test set into a document data structuring model to obtain a predicted target point, and extracting one or more sentences of the predicted target point according to the regular pool to obtain a predicted structured text.
In a feasible implementation, the sample data of the test set first undergo initialization, sentence splitting, and length verification. The trained model is then used to predict on the test set, and each predicted piece of content is located by its coordinates in the test set to determine which sentence it belongs to; the coordinates are expanded to the one or several sentences in which the content is located, giving an intermediate result. The content of the regular pool (the pool of regular extraction sentence patterns) is then invoked to extract from this result, and what is extracted is the content that needs to be structured.
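The coordinate positioning and extraction described above can be sketched as follows. This is a minimal illustration using Python's standard `re` module; the sentence-splitting rule, the example text, and the single pattern in the pool are assumptions made for the example, not taken from the patent.

```python
import re

# Assumed sentence boundary rule: split on Chinese or Western end punctuation.
SENTENCE_RE = re.compile(r"[^。！？.!?]+[。！？.!?]?")

def split_sentences(text):
    """Return (start, end, sentence) triples with character offsets."""
    return [(m.start(), m.end(), m.group()) for m in SENTENCE_RE.finditer(text)]

def locate_sentence(spans, start, end):
    """Expand a predicted span [start, end) to the sentence(s) it overlaps."""
    return "".join(s for (a, b, s) in spans if a < end and start < b).strip()

def extract_with_pool(sentence, pool):
    """Try each pattern in the pool in order; return the first captured group."""
    for pattern in pool:
        m = re.search(pattern, sentence)
        if m:
            return m.group(1)
    return None

text = ("A total of 60 patients were enrolled. "
        "The treatment group received acupuncture for 4 weeks. "
        "No adverse events occurred.")
# Suppose the model predicts only a fragment of the target content.
start = text.index("for 4")
end = start + len("for 4")

sentence = locate_sentence(split_sentences(text), start, end)
pool = [r"for (\d+ (?:days|weeks|months))"]   # hypothetical pool content
structured = extract_with_pool(sentence, pool)
```

The point of the two-stage design is visible here: the model only needs to land inside the right sentence, and the pattern then recovers the complete value.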
S27, manually checking the predicted structured text; if the manual check result is inconsistent, returning to S21; if the manual check result is consistent, outputting the document data structuring model.
In a possible implementation, as shown in fig. 7, the structured content obtained by the method of S21-S26 undergoes a second, manual calibration. After calibration, execution returns to S21 and repeats until the obtained content is consistent with the calibration result, and the model is updated at the same time, so that the model keeps learning and becomes more accurate through this self-learning method.
In the embodiment of the invention, a machine learning method is adopted, which is convenient, fast, accurate, and efficient. Meanwhile, the output result can be manually checked, so that the system can complete multiple rounds of automatic learning, saving time and optimizing the system. Using the concept of a dynamic regular pool, the data labeling results are reused: they are extracted to form an expression library and an expression set (which can assist dependency syntactic analysis), and the weight of each expression in the set is determined by the number of times it occurs. In the document data structuring process, model prediction plays two roles: (1) directly obtaining the result; and (2) obtaining the sentence in which the result is located. Which of the two is used is decided according to the score of the result; if the second is used, the dynamic regular pool is invoked to obtain accurate structured data.
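The occurrence-based weighting and the score-based choice between the two roles can be sketched as follows. This is a minimal illustration: the class name, the 0.9 threshold, and the example patterns are hypothetical, since the source gives no concrete scoring rule.

```python
from collections import Counter
import re

class RegularPool:
    """Expression set in which each expression's weight is its occurrence count."""

    def __init__(self):
        self.counts = Counter()

    def add(self, pattern):
        self.counts[pattern] += 1

    def ranked(self):
        # Expressions seen most often are tried first.
        return [p for p, _ in self.counts.most_common()]

def structure(prediction, score, sentence, pool, threshold=0.9):
    """Role (1): use the prediction directly when its score is high enough.
    Role (2): otherwise extract from the located sentence with the pool."""
    if score >= threshold:
        return prediction
    for pattern in pool.ranked():
        m = re.search(pattern, sentence)
        if m:
            return m.group(1)
    return None

pool = RegularPool()
pool.add(r"(\d+) cases")
pool.add(r"enrolled (\d+) patients")
pool.add(r"enrolled (\d+) patients")   # seen twice, so it ranks first

direct = structure("60", 0.95, "We enrolled 60 patients in total.", pool)
fallback = structure("6", 0.40, "We enrolled 60 patients in total.", pool)
```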
In addition, when structuring the data, the sentence in which the target entity is located is first positioned, and the target entity is then extracted from that sentence using the content of the regular pool. In this way the text to be extracted is extracted effectively, entities at other positions in the document are not mistakenly identified, and recognition accuracy is improved.
The invention addresses the problems in the prior art that natural language processing is not closely integrated with traditional Chinese medicine and that existing systems for structuring traditional Chinese medicine literature data have defects and shortcomings. In the method, data annotators label traditional Chinese medicine clinical documents using the BIO labeling method, and the labeling results are used to construct a neural network by deep learning and train an NLP (Natural Language Processing) data model. Based on the NLP model, dynamic regular matching is performed on the training results according to the ontology-related attribute concepts of the traditional Chinese medicine literature field, coordinate positioning of the target content is completed, the relevant literature data are extracted, and the extraction results are manually corrected. The corrected results are then reorganized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, making the data mining results more intelligent and more precise.
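For readers unfamiliar with the BIO labeling method mentioned above, the following minimal sketch shows how annotated spans become a tag sequence. The tokens and the label names `DRUG` and `DURATION` are hypothetical; for Chinese text the tokens would typically be single characters.

```python
def bio_tags(tokens, spans):
    """Assign B-/I-/O tags to tokens.

    tokens : list of tokens
    spans  : list of (start, end, label) token ranges, end exclusive
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags

tokens = ["treated", "with", "Liuwei", "Dihuang", "pill", "for", "8", "weeks"]
spans = [(2, 5, "DRUG"), (6, 8, "DURATION")]
tags = bio_tags(tokens, spans)
```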
As shown in fig. 8, an embodiment of the present invention provides a deep learning-based device 800 for structuring traditional Chinese medicine clinical literature data. The device 800 is used to implement the deep learning-based method for structuring traditional Chinese medicine clinical literature data, and includes:
An obtaining module 810, configured to obtain a document to be processed.
An input module 820, configured to input the document to be processed into a pre-constructed document data structuring model.
An output module 830, configured to obtain a structured text based on the document to be processed and the document data structuring model.
Optionally, the input module 820 is further configured to:
S21, acquiring a sample data set of traditional Chinese medicine clinical documents, and preprocessing the sample data set.
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a validation set, and a test set.
S23, constructing a neural network model based on the self-attention-mechanism Transformer, and performing named entity recognition training on the neural network model according to the training set and the validation set to obtain the document data structuring model.
S24, inputting the test set into the document data structuring model to obtain predicted target points, and extracting, according to the regular pool, from the one or more sentences in which the predicted target points are located, to obtain a predicted structured text.
S25, manually checking the predicted structured text; if the manual check result is inconsistent, returning to S21; if the manual check result is consistent, outputting the document data structuring model.
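The division of the annotation set in S22 can be sketched as follows. The source does not state the split proportions; the 8:1:1 ratio, the fixed seed, and the function name below are assumptions for illustration only.

```python
import random

def split_annotation_set(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle samples and divide them into training, validation and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)      # seeded for reproducibility
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

train, val, test = split_annotation_set(range(100))
```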
Optionally, the input module 820 is further configured to:
splitting the sample data in the sample data set, and deleting the keyword, date, header, and numbering information from the split sample data.
Optionally, the input module 820 is further configured to:
setting labels and ordering them according to the content of the preprocessed sample data set, where the content of each label is a description of the content to be structured.
labeling the content of the preprocessed sample data set according to the labels, and associating the labeled content with the corresponding label.
Optionally, the input module 820 is further configured to:
extracting the sentences in which the annotation data are located, removing the annotation data from the sentences, dynamically generating regular extraction sentence patterns, and storing the regular extraction sentence patterns in the regular pool.
Optionally, the input module 820 is further configured to:
performing sequence labeling on the annotation data using the BIO labeling method to obtain the annotation set.
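The dynamic generation of a regular extraction sentence pattern, as configured above, can be sketched as follows. This is one plausible reading, in which the annotated text is removed from its sentence and replaced by a capture group; the function name and example sentences are hypothetical.

```python
import re

def make_pattern(sentence, annotation):
    """Turn an annotated sentence into a regular extraction pattern by
    replacing the annotated text with a capture group."""
    idx = sentence.find(annotation)
    if idx < 0:
        raise ValueError("annotation not found in sentence")
    # Escape the surrounding text so it is matched literally.
    before = re.escape(sentence[:idx])
    after = re.escape(sentence[idx + len(annotation):])
    return before + "(.+?)" + after

pool = []
pattern = make_pattern("The treatment group included 30 cases.", "30")
pool.append(pattern)

# The stored pattern now extracts the same slot from a new sentence.
match = re.search(pool[0], "The treatment group included 45 cases.")
value = match.group(1) if match else None
```

A pattern built this way is very literal; a practical pool would likely generalize the surrounding text, which the source leaves unspecified.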
Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present invention. The electronic device 900 may vary considerably depending on configuration or performance, and may include one or more processors (Central Processing Units, CPU) 901 and one or more memories 902, where at least one instruction is stored in the memory 902 and is loaded and executed by the processor 901 to implement the following steps of the deep learning-based method for structuring traditional Chinese medicine clinical literature data:
S1, acquiring a document to be processed.
S2, inputting the document to be processed into a pre-constructed document data structuring model.
S3, obtaining a structured text based on the document to be processed and the document data structuring model.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the above deep learning-based method for structuring traditional Chinese medicine clinical literature data. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A method for structuring data of clinical literature of traditional Chinese medicine based on deep learning, which is characterized by comprising the following steps:
S1, acquiring a document to be processed;
S2, inputting the document to be processed into a pre-constructed document data structuring model;
S3, obtaining a structured text based on the document to be processed and the document data structuring model;
the construction process of the document data structuring model in S2 includes:
S21, acquiring a sample data set of traditional Chinese medicine clinical documents, and preprocessing the sample data set, wherein the format of the preprocessed sample data set is Map<abstract, content>, Map<data and method, content>, Map<result, content>;
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a validation set, and a test set;
S23, constructing a neural network model based on the self-attention-mechanism Transformer, and performing named entity recognition training on the neural network model according to the training set and the validation set to obtain the document data structuring model; wherein the attention scoring function of the document data structuring model weights each point once after calculating softmax;
S24, inputting the test set into the document data structuring model to obtain predicted target points, and extracting, according to the regular pool, from the one or more sentences in which the predicted target points are located, to obtain a predicted structured text;
S25, manually checking the predicted structured text; if the manual check result is inconsistent, returning to S21; if the manual check result is consistent, outputting the document data structuring model;
the step S22 of obtaining the regular pool according to the obtained labeling data comprises the following steps:
extracting sentences in which the annotation data are located, removing the annotation data from the sentences, dynamically generating regular extraction sentence patterns, and storing the regular extraction sentence patterns into a regular pool.
2. The method of claim 1, wherein the preprocessing of the sample dataset in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header, and numbering information from the split sample data.
3. The method according to claim 1, wherein the data labeling of the preprocessed sample dataset in S22 comprises:
setting labels and ordering them according to the content of the preprocessed sample data set, wherein the content of each label is a description of the content to be structured;
labeling the content of the preprocessed sample data set according to the labels, and associating the labeled content with the corresponding label.
4. The method according to claim 1, wherein obtaining the annotation set from the obtained annotation data in S22 comprises:
performing sequence labeling on the annotation data using the BIO labeling method to obtain the annotation set.
5. A device for structuring data of clinical literature of traditional Chinese medicine based on deep learning, characterized in that the device comprises:
an acquisition module, configured to acquire a document to be processed;
an input module, configured to input the document to be processed into a pre-constructed document data structuring model;
an output module, configured to obtain a structured text based on the document to be processed and the document data structuring model;
the construction process of the document data structuring model comprises the following steps:
S21, acquiring a sample data set of traditional Chinese medicine clinical documents, and preprocessing the sample data set, wherein the format of the preprocessed sample data set is Map<abstract, content>, Map<data and method, content>, Map<result, content>;
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a validation set, and a test set;
S23, constructing a neural network model based on the self-attention-mechanism Transformer, and performing named entity recognition training on the neural network model according to the training set and the validation set to obtain the document data structuring model; wherein the attention scoring function of the document data structuring model weights each point once after calculating softmax;
S24, inputting the test set into the document data structuring model to obtain predicted target points, and extracting, according to the regular pool, from the one or more sentences in which the predicted target points are located, to obtain a predicted structured text;
S25, manually checking the predicted structured text; if the manual check result is inconsistent, returning to S21; if the manual check result is consistent, outputting the document data structuring model;
the step S22 of obtaining the regular pool according to the obtained labeling data comprises the following steps:
extracting sentences in which the annotation data are located, removing the annotation data from the sentences, dynamically generating regular extraction sentence patterns, and storing the regular extraction sentence patterns into a regular pool.
6. The apparatus of claim 5, wherein the preprocessing of the sample dataset in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header, and numbering information from the split sample data.
7. The apparatus of claim 5, wherein data labeling the preprocessed sample dataset comprises:
setting labels and ordering them according to the content of the preprocessed sample data set, wherein the content of each label is a description of the content to be structured; labeling the content of the preprocessed sample data set according to the labels, and associating the labeled content with the corresponding label.
CN202111349067.2A 2021-11-15 2021-11-15 Deep learning-based traditional Chinese medicine clinical literature data structuring method and device Active CN114139610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111349067.2A CN114139610B (en) 2021-11-15 2021-11-15 Deep learning-based traditional Chinese medicine clinical literature data structuring method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111349067.2A CN114139610B (en) 2021-11-15 2021-11-15 Deep learning-based traditional Chinese medicine clinical literature data structuring method and device

Publications (2)

Publication Number Publication Date
CN114139610A CN114139610A (en) 2022-03-04
CN114139610B CN114139610B (en) 2024-04-26

Family

ID=80394333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111349067.2A Active CN114139610B (en) 2021-11-15 2021-11-15 Deep learning-based traditional Chinese medicine clinical literature data structuring method and device

Country Status (1)

Country Link
CN (1) CN114139610B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116644719A (en) * 2023-05-29 2023-08-25 南通大学 Element coding method for clinical research literature and application of element coding method in diabetic retinopathy

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153664A (en) * 2016-03-04 2017-09-12 同方知网(北京)技术有限公司 A kind of method flow that research conclusion is simplified based on the scientific and technical literature mark that assemblage characteristic is weighted
CN107193798A (en) * 2017-05-17 2017-09-22 南京大学 A kind of examination question understanding method in rule-based examination question class automatically request-answering system
CN108491383A (en) * 2018-03-14 2018-09-04 昆明理工大学 A kind of Thai sentence cutting method based on maximum entropy disaggregated model and the correction of Thai syntax rule
CN110032649A (en) * 2019-04-12 2019-07-19 北京科技大学 Relation extraction method and device between a kind of entity of TCM Document
CN110866113A (en) * 2019-09-30 2020-03-06 浙江大学 Text classification method based on sparse self-attention mechanism fine-tuning Bert model
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information
CN111428036A (en) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature
CN111834012A (en) * 2020-07-14 2020-10-27 中国中医科学院中医药信息研究所 Traditional Chinese medicine syndrome diagnosis method and device based on deep learning and attention mechanism
CN112487134A (en) * 2020-12-08 2021-03-12 武汉大学 Scientific and technological text problem extraction method based on extremely simple abstract strategy
CN112685513A (en) * 2021-01-07 2021-04-20 昆明理工大学 Al-Si alloy material entity relation extraction method based on text mining
CN113220768A (en) * 2021-06-04 2021-08-06 杭州投知信息技术有限公司 Resume information structuring method and system based on deep learning
CN113420126A (en) * 2021-06-30 2021-09-21 北京法意科技有限公司 Legal rule map construction method and system based on legal text
CN113505244A (en) * 2021-09-10 2021-10-15 中国人民解放军总医院 Knowledge graph construction method, system, equipment and medium based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409911B2 (en) * 2016-04-29 2019-09-10 Cavium, Llc Systems and methods for text analytics processor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
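Named entity recognition for traditional Chinese medicine text based on BiLSTM-CRF (基于 BiLSTM-CRF 的中医文本命名实体识别); Xiao Rui (肖瑞) et al.; World Science and Technology - Modernization of Traditional Chinese Medicine (《世界科学技术-中医药现代化》); Vol. 22, No. 7; pp. 2504-2510 *
Research on a method for structuring bamboo species data based on regular extraction (基于正则抽取的竹种数据结构化方法研究); Li Xin (李欣) et al.; Computer Technology and Development (《计算机技术与发展》); 20180208; Vol. 28, No. 06; pp. 147-150+155 *
Research on a human-machine collaborative construction method for a literature database of clinical and basic acupuncture research (针刺临床基础研究文献数据库人机协同构建方法研究); Liu Huayun (刘华云); China Master's Theses Full-text Database, Medicine and Health Sciences (《中国优秀硕士学位论文全文数据库医药卫生科技辑》); 20230228 (No. 02); pp. E056-1083 *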

Also Published As

Publication number Publication date
CN114139610A (en) 2022-03-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant