CN114139610A - Traditional Chinese medicine clinical literature data structuring method and device based on deep learning - Google Patents

Info

Publication number
CN114139610A
Authority
CN
China
Prior art keywords
data
document
structured
sample data
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111349067.2A
Other languages
Chinese (zh)
Other versions
CN114139610B (en)
Inventor
雷蕾
李海燕
杨乐
刘华云
李小阳
王晰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Information On Traditional Chinese Medicine Cacms
Original Assignee
Institute Of Information On Traditional Chinese Medicine Cacms
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Information On Traditional Chinese Medicine Cacms
Priority to CN202111349067.2A
Publication of CN114139610A
Application granted
Publication of CN114139610B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a deep learning-based method and device for structuring traditional Chinese medicine clinical literature data, and relates to the technical field of data processing. The method comprises the following steps: acquiring a document to be processed; inputting the document to be processed into a pre-constructed document data structured model; and obtaining a structured text based on the document to be processed and the document data structured model. The invention addresses problems that arise in the prior art because extraction rules are preset manually: inaccurate extraction results, a heavy proofreading workload, a complicated upgrade process, and an inability to learn from corrected content so that accuracy improves with use.

Description

Traditional Chinese medicine clinical literature data structuring method and device based on deep learning
Technical Field
The invention relates to the technical field of data processing, in particular to a traditional Chinese medicine clinical literature data structuring method and device based on deep learning.
Background
The clinical literature of traditional Chinese medicine (TCM) contains abundant textual and numerical information. A great deal of effective clinical practice experience remains to be mined from it, and the individualized diagnosis and treatment experience of renowned veteran TCM practitioners needs to be summarized and passed on. As the wave of TCM informatization rises, several questions follow. How can this experience be combined organically with the direct evidence obtained from rigorous randomized controlled clinical trials? How can the "soft indexes" of TCM symptoms and signs be combined with the "hard indexes" obtained from the physical and chemical examinations of modern medicine? How can the best evidence for evidence-based medicine be obtained from the large volume of TCM clinical research data? Structured TCM clinical literature data therefore brings great convenience to the archiving of TCM clinical literature, knowledge base construction, analysis of diagnosis and treatment experience, new drug research and development, research on information methodology, and the building of a TCM data talent team. However, because natural language processing and TCM are not yet closely integrated, the prior art has certain shortcomings.
First, some TCM clinical literature data has been given a simple structure through manual extraction, or through rule-based extraction followed by manual proofreading. However, the volume of TCM clinical literature is large, and documents differ in content composition, writing conventions, dependency syntax, variant names and other factors. Under these conditions, accurate and efficient extraction and judgment remain out of reach even at great labor cost, which hinders further research in the era of big data. Second, few natural language processing and deep learning techniques currently exist for TCM clinical literature, so researchers in the TCM field get little support when studying the relationship between disease patterns and factors such as medicines and dosage.
Existing systems for structuring TCM literature data mainly comprise TCM literature word extraction, PDF parsing and recognition, client identity verification, user-defined word lists, and knowledge graph construction. On the one hand, such a system extracts words by means of a TCM word list, so only words that appear in the list can be recognized and out-of-vocabulary words cannot; improving extraction accuracy requires adding new words to the list, which consumes a great deal of time. On the other hand, extraction rules must be written by hand, and adding a new rule is a complicated process.
Disclosure of Invention
The invention is proposed to address the following problems in the prior art, which arise because extraction rules are preset manually: inaccurate extraction results, a heavy proofreading workload, a complicated updating process, and an inability to learn from corrected content so that accuracy improves with use.
In order to solve the technical problems, the invention provides the following technical scheme:
In one aspect, the present invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data, where the method is implemented by an electronic device and includes:
S1, acquiring the document to be processed.
S2, inputting the document to be processed into a pre-constructed document data structured model.
S3, obtaining a structured text based on the document to be processed and the document data structured model.
Optionally, the construction process of the document data structured model in S2 includes:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S22, annotating the preprocessed sample data set, obtaining a regular pool (a pool of regular-expression extraction patterns) and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and training the neural network model for named entity recognition on the training set and the verification set to obtain the document data structured model.
S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, from the one or more sentences in which the predicted target point is located, to obtain a predicted structured text.
S25, manually proofreading the predicted structured text; if the predicted structured text is inconsistent with the manual proofreading result, returning to S21; if they are consistent, outputting the document data structured model.
Optionally, preprocessing the sample data set in S21 includes:
splitting the sample data in the sample data set, and deleting keyword, date, header and numbering information from the split sample data.
Optionally, annotating the preprocessed sample data set in S22 includes:
defining and ordering labels according to the content of the preprocessed sample data set, where the content of each label is a description of the content to be structured;
marking the content of the preprocessed sample data set according to the labels, and associating the marked content with the corresponding label.
Optionally, obtaining the regular pool from the annotation data in S22 includes:
extracting the sentence in which the annotated data is located, removing the annotated data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the pattern in the regular pool.
Optionally, obtaining the annotation set from the annotation data in S22 includes:
performing sequence annotation on the annotated data using the BIO annotation method to obtain the annotation set.
In another aspect, the present invention provides a deep learning-based device for structuring traditional Chinese medicine clinical literature data, which is used to implement the deep learning-based method for structuring traditional Chinese medicine clinical literature data, the device comprising:
an acquisition module, used for acquiring the document to be processed;
an input module, used for inputting the document to be processed into the pre-constructed document data structured model;
an output module, used for obtaining the structured text based on the document to be processed and the document data structured model.
Optionally, the input module is further configured to perform:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S22, annotating the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and training the neural network model for named entity recognition on the training set and the verification set to obtain the document data structured model.
S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, from the one or more sentences in which the predicted target point is located, to obtain a predicted structured text.
S25, manually proofreading the predicted structured text; if the predicted structured text is inconsistent with the manual proofreading result, returning to S21; if they are consistent, outputting the document data structured model.
Optionally, the input module is further configured to:
split the sample data in the sample data set, and delete keyword, date, header and numbering information from the split sample data.
Optionally, the input module is further configured to:
define and order labels according to the content of the preprocessed sample data set, where the content of each label is a description of the content to be structured;
mark the content of the preprocessed sample data set according to the labels, and associate the marked content with the corresponding label.
Optionally, the input module is further configured to:
extract the sentence in which the annotated data is located, remove the annotated data from the sentence, dynamically generate a regular extraction sentence pattern, and store the pattern in the regular pool.
Optionally, the input module is further configured to:
perform sequence annotation on the annotated data using the BIO annotation method to obtain the annotation set.
In one aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above deep learning-based method for structuring data of clinical literature in traditional Chinese medicine.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the above deep learning-based data structuring method for clinical literature of traditional Chinese medicine.
The technical solutions provided by the embodiments of the invention have at least the following beneficial effects:
In the above scheme, the machine learning method is convenient, fast, accurate and efficient. In addition, manually proofread results can be fed back so that the system completes multiple rounds of automatic learning, which saves time and progressively optimizes the system. Using the concept of a dynamic regular pool, the data annotation results are put to a second use: they are extracted into expressions and stored to form an expression set (which can assist dependency syntactic analysis), and the expression to apply is chosen according to its number of occurrences and its weight within the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining the result, and (2) locating the sentence in which the result lies. Which role applies is decided by scoring the prediction; in case (2), the dynamic regular pool is called to obtain accurate structured data.
Furthermore, when data is structured, the sentence containing the target entity is located first, and the target entity in that sentence is then extracted using the content of the regular pool. The text segment of interest is thereby extracted effectively, entities at other positions in the document are not mistakenly identified, and recognition accuracy improves.
The invention addresses the shortcomings of the prior art that natural language processing and traditional Chinese medicine are not closely integrated and that existing systems for structuring traditional Chinese medicine literature data are deficient. In the method, data annotators annotate traditional Chinese medicine clinical documents using the BIO annotation method, and a neural network is built from the annotation results by deep learning to train an NLP (Natural Language Processing) data model. According to the ontology-related attribute concepts of the traditional Chinese medicine literature field, dynamic regular matching is applied to the output of the NLP model to locate the coordinates of the target content and extract the relevant literature data, and the extraction results are proofread manually. The proofreading results are then organized into a standard knowledge base; this standard sample knowledge base is used to refine the training source data for multiple rounds of training, making the data mining results progressively more intelligent and accurate.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the deep learning-based method for structuring traditional Chinese medicine clinical literature data according to the present invention;
FIG. 2 is a flow chart of the method for constructing the document data structured model according to the present invention;
FIG. 3 is a schematic diagram of a sample of traditional Chinese medicine clinical literature in the present invention;
FIG. 4 is a schematic diagram of the label contents of the present invention;
FIG. 5 is a schematic diagram of regular sentence pattern extraction in the present invention;
FIG. 6 is a schematic diagram of the data annotation results of the present invention;
FIG. 7 is a schematic diagram of the deep learning-based method for structuring traditional Chinese medicine clinical literature data according to the present invention;
FIG. 8 is a block diagram of the deep learning-based device for structuring traditional Chinese medicine clinical literature data according to the present invention;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an embodiment of the present invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data, where the method is implemented by an electronic device, and the processing flow of the method may include the following steps:
S11, acquiring the document to be processed.
S12, inputting the document to be processed into a pre-constructed document data structured model.
S13, obtaining a structured text based on the document to be processed and the document data structured model.
Optionally, the construction process of the document data structured model in S12 includes:
S121, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S122, annotating the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set and a test set.
S123, constructing a neural network model based on the self-attention Transformer, and training the neural network model for named entity recognition on the training set and the verification set to obtain the document data structured model.
S124, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, from the one or more sentences in which the predicted target point is located, to obtain a predicted structured text.
S125, manually proofreading the predicted structured text; if the predicted structured text is inconsistent with the manual proofreading result, returning to S121; if they are consistent, outputting the document data structured model.
Optionally, preprocessing the sample data set in S121 includes:
splitting the sample data in the sample data set, and deleting keyword, date, header and numbering information from the split sample data.
Optionally, annotating the preprocessed sample data set in S122 includes:
defining and ordering labels according to the content of the preprocessed sample data set, where the content of each label is a description of the content to be structured;
marking the content of the preprocessed sample data set according to the labels, and associating the marked content with the corresponding label.
Optionally, obtaining the regular pool from the annotation data in S122 includes:
extracting the sentence in which the annotated data is located, removing the annotated data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the pattern in the regular pool.
Optionally, obtaining the annotation set from the annotation data in S122 includes:
performing sequence annotation on the annotated data using the BIO annotation method to obtain the annotation set.
As shown in FIG. 2, an embodiment of the present invention provides a method for constructing a document data structured model, where the method is applied to an electronic device and includes:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
The sample data in the sample data set is split, and keyword, date, header and numbering information is deleted from the split sample data.
In one possible implementation, as shown in FIG. 3, each sample of traditional Chinese medicine clinical literature can be divided into three parts: abstract, data and methods, and results. The content to be structured lies in these three parts, so information such as keywords, dates, headers and numbering is filtered out during splitting, and the clinical literature data takes the form Map<abstract, content>, Map<data and methods, content> and Map<results, content>.
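By way of illustration only, the splitting and filtering described above might be sketched in Python as follows. The sketch assumes that each part begins with a heading on its own line and that noise lines can be recognized by simple patterns; the function name `split_document`, the heading strings and the patterns are hypothetical and are not taken from the invention.

```python
import re

# Hypothetical section headings; real documents may use different titles.
SECTION_TITLES = ("Abstract", "Data and methods", "Results")

# Hypothetical patterns for lines to drop: keywords, dates, and numbering.
NOISE_PATTERNS = [
    re.compile(r"^\s*Keywords?\s*[:：]", re.IGNORECASE),
    re.compile(r"^\s*\d{4}-\d{1,2}-\d{1,2}\s*$"),
    re.compile(r"^\s*(No\.|Number)\s*[:：]?\s*\S+\s*$", re.IGNORECASE),
]


def split_document(text: str) -> dict[str, str]:
    """Split raw text into a {section title: content} map, dropping noise lines."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_TITLES:
            current = stripped
            sections[current] = []
            continue
        if current is None or not stripped:
            continue  # text before the first heading (e.g. the header) is discarded
        if any(p.search(stripped) for p in NOISE_PATTERNS):
            continue
        sections[current].append(stripped)
    return {title: " ".join(lines) for title, lines in sections.items()}
```

The returned dictionary plays the role of the Map<part, content> structures; a production system would likely need more robust heading and noise detection than these fixed patterns.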
S22, annotating the preprocessed sample data set.
In one possible implementation, labels are first defined and ordered; the content of each label is a description of the content to be structured, as shown in FIG. 4. The content to be structured is then selected and marked in each of the three parts of the preprocessed sample data set, and the marked content is associated with the corresponding label.
S23, obtaining a regular pool from the annotation data.
The sentence in which the annotated data is located is extracted, the annotated data is removed from the sentence, a regular extraction sentence pattern is dynamically generated, and the pattern is stored in the regular pool.
In one possible embodiment, the annotation results are extracted as regular sentence patterns, as shown in FIG. 5. In the sentence "the clinical chief complaint is dysphagia and dysvocalization", the target content is "dysphagia and dysvocalization", so the regular sentence pattern extracted is what remains once the target content is removed: "the clinical chief complaint is".
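By way of illustration only, generating a pattern from an annotated sentence and storing it in a pool might be sketched as follows. The function names, the use of a capture group in place of the removed target, and the ordering of patterns by occurrence count are assumptions made for this sketch, not details confirmed by the invention.

```python
import re
from collections import Counter


def make_pattern(sentence: str, target: str) -> str:
    """Replace the annotated target in a sentence with a capture group."""
    start = sentence.index(target)  # raises ValueError if the target is absent
    prefix = re.escape(sentence[:start])
    suffix = re.escape(sentence[start + len(target):])
    return prefix + "(.+)" + suffix


def add_to_pool(pool: Counter, sentence: str, target: str) -> str:
    """Generate a pattern and count how often it has been seen."""
    pattern = make_pattern(sentence, target)
    pool[pattern] += 1
    return pattern


def extract(pool: Counter, sentence: str):
    """Try patterns from most to least frequent; return the first capture."""
    for pattern, _count in pool.most_common():
        match = re.search(pattern, sentence)
        if match:
            return match.group(1)
    return None
```

For example, after `add_to_pool(pool, "The clinical chief complaint is dysphagia and dysvocalization", "dysphagia and dysvocalization")`, calling `extract` on a new sentence with the same opening words returns the text that follows them.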
S24, obtaining an annotation set from the annotation data.
Sequence annotation is performed on the annotated data using the BIO annotation method to obtain the annotation set, and the annotation set is divided into a training set, a verification set and a test set.
In one possible embodiment, as shown in FIG. 6, in the preprocessed sample data of traditional Chinese medicine clinical literature, one sequence represents one sentence after splitting, and a structured entity represents an element in a sentence.
BIO annotation labels each element as "B-X", "I-X" or "O", where "B-X" indicates that the fragment containing the element is of type X and the element is at the beginning of the fragment; "I-X" indicates that the fragment containing the element is of type X and the element is inside the fragment, after its beginning; and "O" indicates that the element does not belong to any type.
The structured entities are extracted and placed in a word list; the document data is converted into a single line; occurrences of the word-list entries are found and replaced in the form of B-label, I-label and O tags; and the text is then split into individual characters to build the label set.
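By way of illustration only, the character-level BIO tagging described above might be sketched as follows. The function name `bio_tag`, the entity dictionary and the label name `SYM` are hypothetical; matching longer entities first and skipping overlapping matches are design choices of this sketch, not requirements stated by the invention.

```python
def bio_tag(text: str, entities: dict[str, str]) -> list[tuple[str, str]]:
    """Tag each character of `text` with B-X, I-X, or O.

    `entities` maps an entity string to its label X. Longer entities are
    matched first so that a short entity does not split a longer one.
    """
    tags = ["O"] * len(text)
    for entity in sorted(entities, key=len, reverse=True):
        if not entity:
            continue  # an empty string would match everywhere
        label = entities[entity]
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            if all(t == "O" for t in tags[start:end]):  # skip overlapping matches
                tags[start] = f"B-{label}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"
            start = text.find(entity, end)
    return list(zip(text, tags))


# Hypothetical example: one symptom entity in a short sentence.
tagged = bio_tag("主诉头痛三天", {"头痛": "SYM"})
```

Here the two characters of the entity receive `B-SYM` and `I-SYM`, and every other character receives `O`.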
When performing NER (Named Entity Recognition) training, the BIO label set needs to be divided into a training set, a verification set and a test set, for example in a ratio of 7:2:1, so that metrics such as accuracy, precision, recall and F1 can be obtained and the quality of the model evaluated.
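By way of illustration only, a 7:2:1 split and entity-level precision, recall and F1 might be computed as follows. Shuffling with a fixed seed and representing entities as hashable tuples are assumptions of this sketch; the function names are hypothetical.

```python
import random


def split_dataset(samples, ratios=(7, 2, 1), seed=0):
    """Shuffle samples and split them into train / verification / test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def precision_recall_f1(gold: set, predicted: set):
    """Entity-level metrics; each entity is a hashable item such as (start, end, label)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With 100 samples the split yields 70, 20 and 10 items; any remainder from integer division falls into the test part.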
S25, constructing a neural network model based on the self-attention Transformer, and training the neural network model for named entity recognition on the training set and the verification set to obtain the document data structured model.
In one feasible implementation, a Transformer-based NLP-NER model is constructed. NLP is an important direction in computer science and artificial intelligence; it studies the theories and methods that enable effective communication between people and computers in natural language. It mainly comprises two parts: NLU (Natural Language Understanding) and NLG (Natural Language Generation).
The Transformer adopts an Encoder-Decoder architecture. Encoder-Decoder is a model framework, a general term for a class of algorithms rather than one specific algorithm: in the encoding stage the encoder converts the input sequence into a dense vector of fixed dimension, and in the decoding stage the target sequence is generated from that activation state. The Transformer has great advantages in parallelism and long-range dependence, but analysis of its attention mechanism shows weaknesses in directionality, relative position and sparsity. In view of the characteristics of structured traditional Chinese medicine clinical literature data, a simple modification of the attention scoring function greatly improves the performance of the Transformer structure on the clinical literature NER task. The attention scoring function itself is prior art and is not described again here; only the modification is described. After softmax(Q dot K) is calculated, each point is weighted once. Part of the PyTorch code is as follows:
self.time_weighting = nn.Parameter(torch.ones(self.n_head,
    config.window_len, config.
...
att = F.softmax(att, dim=-1)                 # original code
att = att * self.time_weighting[:, :T, :T]   # the only line that needs to be added
att = self.attn_drop(att)                    # original code
The above improvement has two advantages. First, tokens at different distances should contribute differently to a given position. Second, for tokens near the beginning of the sequence, the overall self-attention weight should be reduced, because the observation window is small and the amount of information is relatively low. To address the weaknesses in directionality, relative position and sparsity when processing traditional Chinese medicine literature data, each point is weighted once after softmax(Q dot K) is calculated, which makes the prediction result more accurate.
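By way of illustration only, the arithmetic of this weighting step can be shown for a single attention head without any framework. This is a dependency-free sketch of the element-wise weighting, not the PyTorch module itself; the function names are hypothetical, and in a real model the weights would be learned parameters.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def time_weighted_attention(scores, time_weighting):
    """Apply softmax row by row, then weight each point once.

    scores: T x T raw attention scores (Q dot K) for one head.
    time_weighting: W x W weights with W >= T; only the top-left
    T x T block is used, mirroring the slice [:T, :T].
    """
    T = len(scores)
    att = [softmax(row) for row in scores]
    return [
        [att[i][j] * time_weighting[i][j] for j in range(T)]
        for i in range(T)
    ]
```

With all weights equal to one, the result is the ordinary softmax attention. Once the weights differ from one, a row no longer necessarily sums to one, which is how the overall attention weight of a position can be reduced.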
Training is terminated, and the model obtained, when the accuracy and F1 value of round n+1 are less than or equal to those of round n. The specific training parameters are as follows:
[Figures BDA0003355058440000091 and BDA0003355058440000101: table of training parameters; images not reproduced]
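The stopping rule stated above (training ends once the accuracy and F1 value of round n+1 do not exceed those of round n) can be sketched as follows. This is an illustrative sketch only: the helper names, the `evaluate_round` callback and the metric values are hypothetical, and it assumes that both metrics must fail to improve before training stops.

```python
def should_stop(history):
    """history: list of (accuracy, f1) tuples, one per completed round.
    Returns True when the latest round is no better than the previous
    round on both accuracy and F1."""
    if len(history) < 2:
        return False
    prev_acc, prev_f1 = history[-2]
    acc, f1 = history[-1]
    return acc <= prev_acc and f1 <= prev_f1

def train_until_plateau(evaluate_round, max_rounds=100):
    """Run training rounds until the stopping rule fires.
    `evaluate_round(n)` is a hypothetical callback that trains round n and
    returns (accuracy, f1) measured on the verification set."""
    history = []
    for n in range(max_rounds):
        history.append(evaluate_round(n))
        if should_stop(history):
            break
    return history

# Made-up metric values, for illustration only.
fake_metrics = [(0.80, 0.70), (0.85, 0.78), (0.88, 0.80), (0.88, 0.79), (0.90, 0.85)]
history = train_until_plateau(lambda n: fake_metrics[n], max_rounds=len(fake_metrics))
```

With these made-up values the loop stops after the fourth round, since neither metric improves over the third. Which checkpoint is kept at that point (round n or round n+1) is not stated in the text.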
S26, inputting the test set into the document data structured model to obtain predicted target points, and extracting, according to the regular pool, the one or more sentences in which each predicted target point is located, to obtain a predicted structured text.
In a feasible implementation, the test set sample data is first initialized and subjected to sentence splitting and length verification. The trained model is then used to predict on the test set. Each predicted item is located by its coordinates in the test set, the sentence it belongs to is determined, and the coordinates are expanded to the one or more sentences in which the item is located, which gives the result. The content of the regular pool is then called to extract from the obtained result, and the extracted result is the content that needs to be structured.
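The locate-then-extract procedure described above can be sketched in a few lines of Python. This is a hedged illustration, not the patented implementation: the sentence delimiters, the function names, the example text and the single pattern in `regex_pool` are all assumptions.

```python
import re

SENTENCE_END = "。！？；.!?;"  # assumed sentence delimiters

def sentence_spans(text):
    """Return (start, end) character spans of the sentences in `text`."""
    spans, start = [], 0
    for i, ch in enumerate(text):
        if ch in SENTENCE_END:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(text):
        spans.append((start, len(text)))
    return spans

def expand_to_sentences(text, pred_start, pred_end):
    """Expand a predicted character span [pred_start, pred_end) to the
    one or more full sentences that overlap it."""
    hits = [(s, e) for s, e in sentence_spans(text) if s < pred_end and e > pred_start]
    return text[hits[0][0]:hits[-1][1]].strip() if hits else ""

def extract_with_pool(sentence, regex_pool):
    """Try each pattern in the pool in order; return the first captured group."""
    for pattern in regex_pool:
        m = re.search(pattern, sentence)
        if m:
            return m.group(1)
    return None

# Hypothetical document and model prediction (character coordinates).
text = ("Sixty patients were enrolled. The treatment group received "
        "acupuncture for 4 weeks. No adverse events occurred.")
pred_start = text.index("acupuncture")
pred_end = pred_start + len("acupuncture")

regex_pool = [r"received (.+?) for \d+ weeks"]  # hypothetical pool content
sentence = expand_to_sentences(text, pred_start, pred_end)
extracted = extract_with_pool(sentence, regex_pool)
```

Restricting the patterns to the located sentence is what keeps similar entities elsewhere in the document from being picked up.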
S27, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S21; if the manual proofreading results are consistent, outputting the document data structured model.
In a possible implementation, as shown in fig. 7, the structured content obtained by steps S21-S26 undergoes a second, manual proofreading. After the proofreading is completed, S21 is repeated until the obtained content is consistent with the proofreading result, and the model is updated, so that the model keeps learning and becomes more accurate through this self-learning method.
In the embodiment of the invention, a machine learning method is adopted, which is convenient, fast, accurate and efficient; in addition, the result can be output manually, so that the system can complete multiple rounds of automatic learning, which saves time and optimizes the system. Using the concept of a dynamic regular pool, the data labeling results are reused: they are extracted into expressions and stored to form an expression set (which can assist dependency syntactic analysis), and the expression to use is determined according to the number of occurrences of the expression and its weight in the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining a result, and (2) locating the sentence in which the result is located. Which role applies is decided according to the score of the result; if the second applies, the dynamic regular pool is called to obtain accurate structured data.
On the other hand, when the data is structured, the sentence in which the target entity is located is found first, and the target entity in that sentence is then extracted using the content of the regular pool. In this way, the text segment to be extracted can be extracted effectively, entities at other positions in the document are not identified by mistake, and the identification accuracy is improved.
The invention addresses the problems in the prior art that natural language processing and traditional Chinese medicine are not closely combined and that systems for structuring traditional Chinese medicine literature data are deficient. In the method, data annotators label traditional Chinese medicine clinical documents using the BIO labeling method, and a neural network is built from the labeling results by deep learning to train an NLP (Natural Language Processing) data model. According to the ontology-related attribute concepts in the field of traditional Chinese medicine literature, dynamic regular matching is performed on the results of the NLP model to complete the coordinate positioning of the target content and extract the relevant document data, and the extraction results are manually proofread. The proofreading results are then organized into a standard knowledge base; this standard sample knowledge base is used to improve the training source data for multiple rounds of training, which makes the data mining results more intelligent and more accurate.
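One way to picture the dynamic regular pool is as a counter of patterns derived from annotated sentences. The sketch below is an assumption-laden illustration rather than the patent's implementation: the `RegexPool` class, the whole-sentence pattern template and the use of the occurrence count as the only weight are all hypothetical choices.

```python
import re
from collections import Counter

def make_pattern(sentence, labeled_text):
    """Turn an annotated sentence into an extraction pattern by replacing
    the labeled data with a capture group."""
    before, _, after = sentence.partition(labeled_text)
    return re.escape(before) + "(.+?)" + re.escape(after)

class RegexPool:
    """Stores patterns together with how often each one was generated."""

    def __init__(self):
        self.counts = Counter()

    def add(self, sentence, labeled_text):
        # Ignore annotations that do not occur in the given sentence.
        if labeled_text in sentence:
            self.counts[make_pattern(sentence, labeled_text)] += 1

    def ranked(self):
        """Patterns ordered by occurrence count (used here as the weight)."""
        return [p for p, _ in self.counts.most_common()]

    def extract(self, sentence):
        """Return the captured text of the highest-ranked matching pattern."""
        for pattern in self.ranked():
            m = re.fullmatch(pattern, sentence)
            if m:
                return m.group(1)
        return None

# Hypothetical annotations: (sentence, labeled data).
pool = RegexPool()
pool.add("The treatment group received acupuncture for 4 weeks.", "acupuncture")
pool.add("The treatment group received moxibustion for 4 weeks.", "moxibustion")
pool.add("Needles were retained for 30 minutes.", "30 minutes")

result = pool.extract("The treatment group received cupping for 4 weeks.")
```

The first two annotations collapse into the same pattern once the labeled data is removed, so that pattern has a count of 2 and is tried first.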
As shown in fig. 8, an embodiment of the present invention provides an apparatus 800 for structuring data of clinical literature of traditional Chinese medicine based on deep learning, where the apparatus 800 is applied to implement a method for structuring data of clinical literature of traditional Chinese medicine based on deep learning, and the apparatus 800 includes:
an obtaining module 810, configured to obtain a document to be processed;
an input module 820, configured to input the document to be processed into a pre-constructed document data structured model;
an output module 830, configured to obtain a structured text based on the document to be processed and the document data structured model.
Optionally, the input module 820 is further configured to:
S21, acquire a sample data set of traditional Chinese medicine clinical literature, and preprocess the sample data set.
S22, perform data annotation on the preprocessed sample data set, obtain a regular pool and an annotation set from the resulting annotation data, and divide the annotation set into a training set, a verification set and a test set.
S23, construct a neural network model based on the self-attention mechanism Transformer, and perform named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structured model.
S24, input the test set into the document data structured model to obtain predicted target points, and extract, according to the regular pool, the one or more sentences in which each predicted target point is located, to obtain a predicted structured text.
S25, manually proofread the predicted structured text; if the manual proofreading results are inconsistent, execute S21; if the manual proofreading results are consistent, output the document data structured model.
Optionally, the input module 820 is further configured to:
split the sample data in the sample data set, and delete the keyword, date, header and number information from the split sample data.
Optionally, the input module 820 is further configured to:
set labels and order them according to the content of the preprocessed sample data set, where the content of each label is a description of the structured content.
annotate the content of the preprocessed sample data set according to the labels, and associate the annotated content with the corresponding label.
Optionally, the input module 820 is further configured to:
extract the sentence in which the annotated data is located, remove the annotated data from the sentence, dynamically generate a regular extraction sentence pattern, and store the regular extraction sentence pattern in the regular pool.
Optionally, the input module 820 is further configured to:
perform sequence annotation on the annotated data using the BIO annotation method to obtain the annotation set.
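The BIO sequence annotation mentioned above marks the first unit of each annotated span with `B-<label>`, the remaining units of the span with `I-<label>`, and everything else with `O`. The following minimal sketch tags at character level, which is a common choice for Chinese text but an assumption here; the function name, the example phrase and the `ACUPOINT` label are hypothetical.

```python
def bio_tags(text, annotations):
    """Character-level BIO tags.
    annotations: list of (start, end, label) with `end` exclusive."""
    tags = ["O"] * len(text)
    for start, end, label in annotations:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags

# "针刺足三里": needling at the acupoint Zusanli (characters 2..4).
text = "针刺足三里"
tags = bio_tags(text, [(2, 5, "ACUPOINT")])
```

Each character paired with its tag then forms one training example row for the named entity recognition model.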
Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present invention. The electronic device 900 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the following steps of the deep learning-based method for structuring traditional Chinese medicine clinical literature data:
S1, acquiring the document to be processed.
S2, inputting the document to be processed into a pre-constructed document data structured model.
S3, obtaining a structured text based on the document to be processed and the document data structured model.
In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory, comprising instructions executable by a processor in a terminal to perform the above deep learning based method of structuring data of clinical literature of traditional Chinese medicine. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A traditional Chinese medicine clinical literature data structuring method based on deep learning, characterized by comprising the following steps:
S1, acquiring a document to be processed;
S2, inputting the document to be processed into a document data structured model which is constructed in advance;
S3, obtaining a structured text based on the document to be processed and the document data structured model.
2. The method according to claim 1, wherein the construction process of the document data structured model in S2 comprises:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set;
S22, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set;
S23, constructing a neural network model based on a self-attention mechanism Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structured model;
S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, the one or more sentences in which the predicted target point is located, to obtain a predicted structured text;
S25, manually proofreading the predicted structured text, and if the manual proofreading results are inconsistent, executing S21; and if the manual proofreading results are consistent, outputting the document data structured model.
3. The method according to claim 2, wherein the preprocessing of the sample data set in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header and number information in the split sample data.
4. The method according to claim 2, wherein the data annotation of the preprocessed sample data set in S22 comprises:
setting labels and ordering them according to the content of the preprocessed sample data set, wherein the content of each label is a description of the structured content;
annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding label.
5. The method according to claim 2, wherein obtaining the regular pool according to the obtained annotation data in S22 comprises:
extracting the sentence in which the annotation data is located, removing the annotation data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the regular extraction sentence pattern in the regular pool.
6. The method according to claim 2, wherein obtaining the annotation set according to the obtained annotation data in S22 comprises:
carrying out sequence annotation on the annotation data by adopting a BIO annotation method to obtain the annotation set.
7. A traditional Chinese medicine clinical literature data structuring device based on deep learning, characterized in that the device comprises:
an acquisition module, used for acquiring a document to be processed;
an input module, used for inputting the document to be processed into a document data structured model which is constructed in advance;
an output module, used for obtaining a structured text based on the document to be processed and the document data structured model.
8. The apparatus of claim 7, wherein the construction process of the document data structured model comprises:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set;
S22, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set;
S23, constructing a neural network model based on a self-attention mechanism Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain the document data structured model;
S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, the one or more sentences in which the predicted target point is located, to obtain a predicted structured text;
S25, manually proofreading the predicted structured text, and if the manual proofreading results are inconsistent, executing S21; and if the manual proofreading results are consistent, outputting the document data structured model.
9. The apparatus according to claim 7, wherein the preprocessing of the sample data set in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header and number information in the split sample data.
10. The apparatus of claim 7, wherein the data annotation of the preprocessed sample data set comprises:
setting labels and ordering them according to the content of the preprocessed sample data set, wherein the content of each label is a description of the structured content; and annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding label.
CN202111349067.2A 2021-11-15 2021-11-15 Deep learning-based traditional Chinese medicine clinical literature data structuring method and device Active CN114139610B (en)


Publications (2)

Publication Number Publication Date
CN114139610A true CN114139610A (en) 2022-03-04
CN114139610B CN114139610B (en) 2024-04-26

Family

ID=80394333

Country Status (1)

Country Link
CN (1) CN114139610B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant