CN115374787A - Model training method and device for continuous learning based on medical named entity recognition - Google Patents


Info

Publication number
CN115374787A
Authority
CN
China
Prior art keywords: data, sentences, medical, model, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211294936.0A
Other languages
Chinese (zh)
Other versions
CN115374787B (en)
Inventor
宋佳祥
杨雅婷
白焜太
刘硕
许娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Health China Technologies Co Ltd
Original Assignee
Digital Health China Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Health China Technologies Co Ltd filed Critical Digital Health China Technologies Co Ltd
Priority to CN202211294936.0A priority Critical patent/CN115374787B/en
Publication of CN115374787A publication Critical patent/CN115374787A/en
Application granted granted Critical
Publication of CN115374787B publication Critical patent/CN115374787B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data


Abstract

The invention discloses a model training method and device for continuous learning based on medical named entity recognition. Seed data is retained during the continual-learning training process; when the model is trained on new data, the seed data is trained together with the new data, so that the newly trained model retains the old knowledge and possesses both the new and the old knowledge at the same time. Layers 0, 4 and 8 of the bert model and their parameter information are frozen and not updated, which preserves previously learned information and reduces the forgetting of old knowledge, so the obtained training result has the lowest forgetting rate and the highest accuracy. In the medical field, the model of an original hospital can thus be trained for a new hospital without requiring the full data, and the knowledge learned at the original hospital is not forgotten, so the original hospital's model becomes suitable for the new hospital; large-scale text labeling at the new hospital is avoided, training time is saved, training efficiency and the accuracy of the training result are improved, and medical named entity recognition becomes more accurate.

Description

Model training method and device for continuous learning based on medical named entity recognition
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a model training method and device for continuous learning based on medical named entity recognition.
Background
The application of artificial intelligence to pathological diagnosis makes the analysis and diagnosis of diseases more scientific and efficient. Medical named entity recognition is one component of artificial-intelligence pathological diagnosis: it extracts important information such as disease names, clinical manifestations and disease duration from a passage of diagnostic text.
However, different hospitals express the same disease differently; for example, gastric cancer is also called gastric malignancy. Moreover, for regional or other reasons, a disease seen at hospital B may never appear at hospital A, so an entity recognition model trained at hospital A is not necessarily suitable for hospital B. Training a new model for hospital B from scratch not only wastes time but also requires a large amount of labeled text from hospital B, which is time-consuming and labor-intensive. Even without considering time and personnel costs, merging the data of hospital A and hospital B to train one model suitable for both is obviously impractical. If, instead, the model of hospital A is further trained using only the data of hospital B, the model eventually fits only hospital B and the knowledge learned from hospital A is forgotten; this is the problem of catastrophic forgetting when a model learns new knowledge.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present application provides a model training method and apparatus for continuous learning based on medical named entity recognition.
In a first aspect, the present application provides a model training method for continuous learning based on medical named entity recognition, comprising the following steps:
acquiring medical text corpora from a plurality of data sources;
processing the medical text corpus by adopting a binary language statistical model to construct a medical knowledge map;
extracting a sentence to be trained from the medical knowledge graph;
inputting the sentence to be trained into a bert language model for continuous learning training, reserving seed data in the training process, and fusing the reserved seed data and new data;
and freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
In some optional implementations of some embodiments, the plurality of data sources includes at least: a target hospital data source, a diagnosis and treatment data source and a medical professional book data source.
In some optional implementations of some embodiments, the constructing a medical knowledge graph after processing the medical text corpus using a binary language statistical model includes:
performing word segmentation processing on the medical text corpus by using the binary language statistical model to obtain collocation information between adjacent words;
constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information;
and graphically reconstructing the dictionary to obtain a medical knowledge map corresponding to the binary language statistical model.
In some optional implementations of some embodiments, the constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information includes:
traversing the medical text corpus according to the collocation information, and calculating the word frequency of the collocation information;
and establishing a corresponding relation between the collocation information and the word frequency, and storing the corresponding relation to form the medical dictionary.
In some optional implementations of some embodiments, the graphically reconstructing the dictionary to obtain the knowledge graph of the binary language statistical model includes:
and taking adjacent words contained in the collocation information in the medical dictionary as two adjacent nodes, connecting the two adjacent nodes according to the collocation relationship of the adjacent words to form edges, and marking the edges by the word frequency of the collocation information to construct and obtain the medical knowledge map.
In some optional implementations of some embodiments, the extracting the sentence to be trained from the medical knowledge-graph includes:
calculating joint probability of natural sentences in the neural network based on the binary language statistical model;
extracting and adjusting the natural sentences according to the joint probability to obtain reasonable sentences of which the joint probability is not zero;
and performing path search on the reasonable sentences according to the medical knowledge graph, and mapping according to a search result to obtain the sentences to be trained.
In some optional implementations of some embodiments, the inputting the sentence to be trained into the bert language model for continuous learning training, and the retaining seed data in the training process includes:
extracting any two sentences to be trained from the sentences to be trained as sentences to be judged;
calculating the similarity between the sentences to be judged through cosine similarity to obtain a similarity calculation result;
screening the sentences to be judged according to the similarity calculation result and a preset similarity threshold to obtain reserved sentences of which the similarity calculation result is lower than the similarity threshold;
performing the above calculation and screening on all the sentences to be trained, and setting a retention quantity threshold for the seed data; if the number of reserved sentences finally obtained is smaller than or equal to the retention quantity threshold, all the reserved sentences are stored in a json file as the seed data, and if the number of reserved sentences finally obtained is larger than the retention quantity threshold, a number of reserved sentences equal to the threshold is randomly selected and stored in the json file as the seed data.
In some optional implementation manners of some embodiments, the calculating similarity between the sentences to be determined through cosine similarity to obtain a similarity calculation result includes:
the sentences to be judged comprise a first sentence to be judged and a second sentence to be judged;
using a language processing tool to perform text splitting on the first sentence to be judged and the second sentence to be judged to obtain a first word segmentation result and a second word segmentation result;
merging the first word segmentation result and the second word segmentation result to obtain a word segmentation list;
converting the first sentence to be judged and the second sentence to be judged into digital vectors by using one-hot coding, and performing duplication degree comparison by combining the first sentence to be judged, the second sentence to be judged and the word segmentation list to obtain a first sentence vector representation and a second sentence vector representation;
and substituting the first sentence vector representation and the second sentence vector representation into a cosine similarity formula to obtain a similarity calculation result.
In some optional implementations of some embodiments, the fusing the reserved seed data and the new data includes:
acquiring new data generated in the continuous training process;
acquiring reserved seed data by loading a json file;
and merging the new data and the seed data to obtain fused data, wherein the fused data has the characteristics of both the new data and the seed data.
In some optional implementation manners of some embodiments, the freezing a preset number of layers and parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result includes:
traversing layers 0 to 11 of the encoder of the bert language model during the continual-training process; when layers 0, 4 and 8 are reached, setting the gradient updates of layers 0, 4 and 8 to stop, thereby completing the freezing of layers 0, 4 and 8 and their parameter information;
and inputting the fusion data into the frozen model for training to obtain a final training result.
In a second aspect of the disclosed embodiments, a model training apparatus for continuous learning based on medical named entity recognition is provided, including:
the data acquisition module is used for acquiring medical text corpora from a plurality of data sources;
the medical knowledge map building module is used for building a medical knowledge map after processing the medical text corpus by adopting a binary language statistical model;
the sentence extracting module is used for extracting the sentences to be trained from the medical knowledge graph;
the data processing module is used for inputting the sentence to be trained into the bert language model for continuous learning training, reserving seed data in the training process and fusing the reserved seed data with new data;
and the model processing module is used for freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-mentioned method.
The invention has the beneficial effects that:
the method is characterized in that seed data is reserved in the model training process of continuous learning, when new data is trained by using a model, model training is carried out together with the new data, the forgetting degree of the model to old knowledge is reduced, after the new model obtained by training has the old knowledge, the new model can simultaneously have the capacity of the new knowledge and the old knowledge, the bert layers and parameter information of the 0 th layer, the 4 th layer and the 8 th layer are frozen, the parameter updating is not carried out, the information learned before is reserved, the forgetting performance of the old knowledge is reduced, the forgetting rate of the obtained training result is lowest and the accuracy is highest, in the medical field, the model of the original hospital can be trained without needing the whole amount of data, and the knowledge learned in the original hospital can not be forgotten, so that the model of the original hospital can be suitable for the new hospital to be trained, a large amount of text labeling is avoided for the new hospital, the model training time is saved, the training efficiency and the accuracy of the training result are improved, and the medical naming entity recognition is more accurate.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Fig. 2 is a block diagram of the system of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device implementing some embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, the present application provides a model training method for continuous learning based on medical named entity recognition, as shown in fig. 1, including steps S100-S500:
s100: acquiring medical text corpora from a plurality of data sources;
in some optional implementations of some embodiments, the plurality of data sources includes at least: a target hospital data source, a diagnosis and treatment data source and a medical professional book data source.
The following describes the process of acquiring the medical text corpus for different data sources respectively:
(1) Target hospital data source
This data is provided by cooperating hospitals and stored in an SQL database. The specific acquisition method is to connect to the SQL database of the target hospital and retrieve the data with the corresponding query commands.
(2) Diagnosis and treatment data source
As an example, medical record data can be exported from an existing electronic medical record system database and parsed, and the parsing result can be converted into text format to obtain the medical text corpus.
(3) Medical professional book data source
As an example, for an electronic format, for example, a medical professional book in a text format, the medical text corpus may be directly obtained without processing, and for a non-electronic format, for example, a paper medical professional book, the medical professional book may be converted into a text format, so as to obtain the medical text corpus.
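By way of illustration only, the following Python sketch shows one possible way to gather the medical text corpus from a target-hospital SQL database and from medical professional books that are already plain-text files. The database engine, table name, column name and directory layout are assumptions made for this example and are not specified in the present application.

import sqlite3
from pathlib import Path

def collect_medical_corpus(db_path, book_dir):
    texts = []
    # (1) target hospital data source: read free-text fields from the SQL database
    with sqlite3.connect(db_path) as conn:
        for (record_text,) in conn.execute("SELECT diagnosis_text FROM medical_records"):
            if record_text:
                texts.append(record_text)
    # (3) medical professional books already available in text format
    for book in Path(book_dir).glob("*.txt"):
        texts.append(book.read_text(encoding="utf-8"))
    return texts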
S200: processing the medical text corpus by adopting a binary language statistical model and then constructing a medical knowledge map;
in some optional implementations of some embodiments, the constructing a medical knowledge graph after processing the medical text corpus by using a binary language statistical model includes:
performing word segmentation processing on the medical text corpus by using the binary language statistical model to obtain collocation information between adjacent words;
the formula of the binary language statistical model calculation statement is as follows:
Figure DEST_PATH_IMAGE001
wherein the content of the first and second substances,
Figure 841117DEST_PATH_IMAGE002
representing the probability of n words occurring at the same time,
Figure 211050DEST_PATH_IMAGE003
indicates the probability of the 1 st word occurring,
Figure 25553DEST_PATH_IMAGE004
representing the probability of the 2 nd word appearing simultaneously with the 1 st word, and so on.
Figure 834240DEST_PATH_IMAGE005
The equiprobability may be further obtained by counting the number of times the words occur simultaneously in the collected text corpus.
It should be clear that the collocation information between adjacent words records the adjacent words in a sentence together with their collocation relationship, i.e. the order in which the two words are reasonably combined. For example, "lung cancer" and "patient" are adjacent words whose reasonable collocation is "lung cancer patient", and their collocation relationship is that "lung cancer" comes before "patient". The collocation information between adjacent words therefore reflects their collocation relationship: from it, the two adjacent words it contains can be identified together with the order in which they are reasonably combined.
In this embodiment, word segmentation of the text corpus is implemented with the binary language statistical model. Specifically, the probability that adjacent words in a segmented sentence occur together is calculated with the binary language statistical model, and the most appropriate collocation information between adjacent words is obtained from the maximum calculated probability.
For example, "lung cancer" and "patient" are adjacent words. If they occur together in the order "lung cancer patient", the probability calculated by the binary language statistical model is relatively large; if they occur in the reversed order "patient lung cancer", the calculated probability is zero. Following the maximum-probability principle, the collocation information between the adjacent words "lung cancer" and "patient" is therefore "lung cancer patient", i.e. the two words are reasonably collocated with "lung cancer" before "patient".
Constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information;
in some optional implementations of some embodiments, the constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information includes:
traversing the medical text corpus according to the collocation information, and calculating the word frequency of the collocation information;
the collocation information between adjacent words can reflect the collocation relationship between the adjacent words, that is, two adjacent words contained therein can be known through the collocation information between the adjacent words, and the two adjacent words are reasonably collocated according to which sequence.
Therefore, the word frequency represents the number of times that adjacent words in the collocation information appear simultaneously according to the collocation relationship, for this reason, the text corpus is traversed according to the collocation information, that is, the adjacent words in all sentences of the text corpus are traversed according to the collocation relationship between the adjacent words and the adjacent words in the collocation information, and the number of times that the adjacent words in the collocation information appear simultaneously in the text corpus according to the collocation relationship is counted, so that the word frequency of the collocation information can be calculated.
And establishing a corresponding relation between the collocation information and the word frequency, and storing the corresponding relation to form the medical dictionary.
After the word frequency of the collocation information is obtained, the corresponding relationship between the two can be established and stored, and a medical dictionary such as that shown in table 1 below is formed.
TABLE 1 Medical dictionary corresponding to the binary language statistical model

Collocation information | Word frequency
lung cancer & patient | 138
patient & symptoms | 113
symptoms & include | 98
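Continuing the sketch above, the medical dictionary of Table 1 can be derived directly from the same adjacent-pair counts; the "&"-joined key format merely mirrors the table layout and is an assumption of this example.

def build_medical_dictionary(bigram_counts):
    # bigram_counts: the Counter of ordered adjacent word pairs from the previous sketch
    return {f"{w1}&{w2}": frequency for (w1, w2), frequency in bigram_counts.items()}

# a dictionary built from a real corpus might then contain, as in Table 1,
# medical_dictionary["lung cancer&patient"] == 138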
And graphically reconstructing the dictionary to obtain a medical knowledge map corresponding to the binary language statistical model.
In some optional implementations of some embodiments, the graphically reconstructing the dictionary to obtain the knowledge graph of the binary language statistical model includes:
and taking adjacent words contained in the collocation information in the medical dictionary as two adjacent nodes, connecting the two adjacent nodes according to the collocation relationship of the adjacent words to form edges, and marking the edges by the word frequency of the collocation information to construct and obtain the medical knowledge map.
Since the knowledge graph of the binary language statistical model is in a graph form, after the corresponding medical dictionary is constructed, the medical dictionary needs to be reconstructed graphically.
Furthermore, the collocation information between adjacent words contained in the medical dictionary is reconstructed graphically.
The adjacent words contained in the collocation information serve as nodes, and the edges connecting the nodes represent the collocation relationships between adjacent words. The graphical reconstruction can use the probability or frequency, recorded in the medical dictionary, with which the adjacent words occur together in their collocation order; for example, each edge is labeled with that probability or frequency. If the nodes are "lung cancer" and "patient", the edge formed between these two adjacent nodes represents the collocation relationship between them, namely "lung cancer" before "patient", and a number on the edge represents how often the two adjacent words appear together in the text corpus in that order.
As described above, each node of the medical knowledge graph represents a word and each edge represents the collocation relationship between words. In this embodiment, based on the binary language statistical model, two adjacent nodes represent the adjacent words in a piece of collocation information, the edge connecting them represents their collocation relationship, and the edge is labeled according to how often the adjacent words occur together; in this way the medical knowledge graph is constructed. Because an edge connects two adjacent nodes according to the collocation relationship of the adjacent words, the edge is directed, and its direction is closely tied to that collocation relationship. For example, the collocation of the adjacent words "lung cancer" and "patient" is "lung cancer patient", so the corresponding edge points from the node "lung cancer" to the node "patient". Whenever the probability of the adjacent words occurring together is positive, the word frequency counting their co-occurrences is greater than zero; therefore, when constructing the medical knowledge graph, each edge is labeled with the word frequency of the collocation information instead of the probability, which helps improve the efficiency of generating natural sentences.
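A minimal sketch of this graphical reconstruction is given below, assuming the "word1&word2" dictionary format of the earlier sketch; nodes are words and each directed edge carries the word frequency of the corresponding collocation.

def dictionary_to_graph(medical_dictionary):
    graph = {}                                        # node -> {successor node: word frequency}
    for collocation, frequency in medical_dictionary.items():
        w1, w2 = collocation.split("&")
        graph.setdefault(w1, {})[w2] = frequency      # directed edge w1 -> w2 labeled with the frequency
        graph.setdefault(w2, {})                      # make sure the tail word is also a node
    return graph

For the dictionary of Table 1, graph["lung cancer"]["patient"] would then be 138, i.e. the edge from "lung cancer" to "patient" labeled with that word frequency.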
S300: extracting a sentence to be trained from the medical knowledge graph;
in some optional implementations of some embodiments, the extracting the sentence to be trained from the medical knowledge-graph includes:
calculating joint probability of natural sentences in the neural network based on the binary language statistical model;
the neural network trains the collected text corpora to enable the machine to learn various characteristics of the language, and further enable the machine to generate natural sentences on the premise of no manual intervention.
Extracting and adjusting the natural sentences according to the joint probability to obtain reasonable sentences of which the joint probability is not zero;
the reasonable sentence refers to a natural sentence with smooth or reasonable grammar, for example, "lung cancer" and "patient" in "symptoms possessed by lung cancer patient" belong to reasonable collocation. In other words, two words in the form of "patient lung cancer" cannot exist in the originally collected text corpus, i.e., the two words in the form of "patient lung cancer" are counted to have zero number of simultaneous occurrences in the originally collected text corpus.
Based on the method, after the joint probability of the natural sentences is obtained through calculation, reasonable sentences can be screened out from the generated natural sentences according to the principle that the joint probability is not zero.
And performing path search on the reasonable sentences according to the medical knowledge graph, and mapping according to a search result to obtain the sentences to be trained.
In the medical knowledge graph, a reasonable sentence can be mapped to a path formed by nodes and the edges connecting them; for example, the reasonable sentence "the symptoms possessed by a lung cancer patient" is mapped by the path formed by the nodes corresponding to its words, such as "lung cancer", "patient" and "symptoms", and the corresponding edges.
Based on this, a path search is performed on the reasonable sentences through the medical knowledge graph to obtain the search result corresponding to each path. Using the mapping relationship between paths in the medical knowledge graph and reasonable sentences, the paths in the search result are then mapped back to reasonable sentences, and the sentences so obtained are stored as the sentences to be trained, so that model training for continuous learning can subsequently be carried out with them.
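The following sketch, under the same assumptions as the earlier ones, shows one way to compute the joint probability of a candidate word sequence with the binary language statistical model and to check that the sequence corresponds to a path in the medical knowledge graph.

def joint_probability(words, unigram_counts, bigram_counts):
    # P(w1, ..., wn) = P(w1) * P(w2|w1) * ... * P(wn|w(n-1)); any unseen collocation makes it zero
    total = sum(unigram_counts.values())
    probability = unigram_counts[words[0]] / total if total else 0.0
    for w1, w2 in zip(words, words[1:]):
        probability *= conditional_probability(w1, w2, unigram_counts, bigram_counts)
    return probability

def path_exists(words, graph):
    # a reasonable sentence maps to a chain of connected nodes in the medical knowledge graph
    return all(w2 in graph.get(w1, {}) for w1, w2 in zip(words, words[1:]))

Sequences whose joint probability is zero are discarded as unreasonable, and the remaining sequences whose paths are found in the graph are stored as sentences to be trained.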
S400: inputting the sentence to be trained into a bert language model for continuous learning training, reserving seed data in the training process, and fusing the reserved seed data and new data;
in some optional implementations of some embodiments, the inputting the sentence to be trained into the bert language model for continuous learning training, and the retaining seed data in the training process includes:
extracting any two sentences to be trained from the sentences to be trained as sentences to be judged;
the two extracted sentences to be trained may be, for example:
sentence1 = "symptoms possessed by a lung cancer patient"
sentence2 = "symptoms of a lung cancer patient include"
Calculating the similarity between the sentences to be judged through cosine similarity to obtain a similarity calculation result;
in some optional implementation manners of some embodiments, the calculating similarity between the sentences to be determined by cosine similarity to obtain a similarity calculation result includes:
the sentences to be judged comprise a first sentence to be judged and a second sentence to be judged;
In this embodiment, sentence1 is used as the first sentence to be judged and sentence2 as the second sentence to be judged;
using a language processing tool to perform text splitting on the first sentence to be judged and the second sentence to be judged to obtain a first word segmentation result and a second word segmentation result;
In this embodiment, the jieba word segmentation tool is used to split the first sentence to be judged and the second sentence to be judged. The first word segmentation result, from sentence1, is: ["lung cancer", "patient", "possessed", "of", "symptoms"]; the second word segmentation result, from sentence2, is: ["lung cancer", "patient", "of", "symptoms", "include"].
Merging the first word segmentation result and the second word segmentation result to obtain a word segmentation list;
Merging the first word segmentation result and the second word segmentation result yields the word segmentation list word_list = ["lung cancer", "patient", "possessed", "of", "symptoms", "include"].
Converting the first sentence to be judged and the second sentence to be judged into digital vectors by using one-hot coding, and performing duplication degree comparison by combining the first sentence to be judged, the second sentence to be judged and the word segmentation list to obtain a first sentence vector representation and a second sentence vector representation;
In this embodiment, each position of a sentence vector is set to 1 if the corresponding word in word_list appears in that sentence to be judged, and to 0 otherwise, giving:
word_vec_1 = [1,1,1,1,1,0],word_vec_2 = [1,1,0,1,1,1];
and substituting the first sentence vector representation and the second sentence vector representation into a cosine similarity formula to obtain a similarity calculation result.
According to the cosine similarity formula:

cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σ(i=1..n) Ai·Bi / ( √(Σ(i=1..n) Ai²) · √(Σ(i=1..n) Bi²) )

where A and B are the vector representations of the sentences to be judged, Ai and Bi are their i-th components, and n is the dimension of the sentence vectors.
The cosine similarity is calculated as follows: compute the dot product of the two vectors and the modulus of each vector, then divide the dot product by the product of the two moduli to obtain the cosine of the included angle; the similarity of the two sentences to be judged is judged from the size of this angle.
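A compact sketch of this similarity computation (jieba segmentation, one-hot style sentence vectors and the cosine formula) is given below; it is an illustration of the described procedure rather than a fixed implementation.

import math
import jieba

def sentence_similarity(sentence_1, sentence_2):
    words_1, words_2 = list(jieba.cut(sentence_1)), list(jieba.cut(sentence_2))
    word_list = list(dict.fromkeys(words_1 + words_2))                # merged word segmentation list, order preserved
    word_vec_1 = [1 if word in words_1 else 0 for word in word_list]  # one-hot style sentence vectors
    word_vec_2 = [1 if word in words_2 else 0 for word in word_list]
    dot = sum(a * b for a, b in zip(word_vec_1, word_vec_2))
    norm = math.sqrt(sum(a * a for a in word_vec_1)) * math.sqrt(sum(b * b for b in word_vec_2))
    return dot / norm if norm else 0.0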
Screening the sentences to be judged according to the similarity calculation result and a preset similarity threshold to obtain reserved sentences of which the similarity calculation result is lower than the similarity threshold;
calculating and screening all the sentences to be trained, setting a reserved quantity threshold value of seed data, if the quantity of the reserved sentences obtained finally is smaller than or equal to the reserved quantity threshold value, all the sentences are stored in a json file as the seed data, and if the quantity of the reserved sentences obtained finally is larger than the reserved quantity threshold value, the reserved sentences with the same value as the reserved quantity threshold value are randomly selected and stored in the json file as the seed data.
In this embodiment, the similarity of two sentences to be trained is calculated with the cosine similarity algorithm: the more similar the two sentences, the closer the result is to 1, and the more dissimilar, the closer it is to -1. If the number of training texts (i.e. the total number of stored sentences to be trained) is z, then z × (z-1)/2 comparisons are needed, each text being compared with every other text. Texts whose similarity is lower than 0 (the threshold can be self-defined) are retained, and after de-duplication a set of mutually dissimilar texts is finally obtained.
The retention quantity threshold of the seed data is set to t: if the number of mutually dissimilar texts finally retained is not greater than t, all of them are kept; if it is greater than t, t texts are randomly selected and stored in the json file.
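The seed-data retention step can be sketched as below, reusing sentence_similarity from the previous sketch; the similarity threshold and the retention quantity threshold t are user-defined values, and the greedy pairwise comparison is one way of realizing the z × (z-1)/2 comparisons described above.

import json
import random

def retain_seed_data(sentences, similarity_threshold, t, path="seed_data.json"):
    retained = []
    for sentence in sentences:
        # keep a sentence only if it is dissimilar to every sentence already retained
        if all(sentence_similarity(sentence, kept) < similarity_threshold for kept in retained):
            retained.append(sentence)
    if len(retained) > t:
        retained = random.sample(retained, t)         # cap the seed data at the retention threshold t
    with open(path, "w", encoding="utf-8") as f:
        json.dump(retained, f, ensure_ascii=False)
    return retained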
As can be seen from Table 2 below, when seed data is retained and model training for continuous learning is performed, the accuracy on old knowledge is significantly higher than when seed data is not retained.
TABLE 2 Accuracy comparison with and without seed data

Seed data added | Old knowledge accuracy | New knowledge accuracy
No | 88.95% | 95.95%
Yes | 95.42% | 95.44%
In some optional implementations of some embodiments, the fusing the reserved seed data and the new data includes:
acquiring new data generated in the continuous training process;
acquiring reserved seed data by loading a json file;
and merging the new data and the seed data to obtain fused data, wherein the fused data has the characteristics of both the new data and the seed data.
For example: in the continual-training stage, a new data set containing x items is added (x bears no particular size relationship to z). Retaining seed data yields t text items that represent the original z items; these t seed items are obtained by loading the json file. Merging the x new items with the t seed items gives s representative items, and model training is performed with these s fused items, so the training data carries characteristics of both the z old items and the x new items, and the newly trained model retains the old knowledge to a certain extent.
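A minimal sketch of this fusion step, assuming the seed data was saved to a json file as above, is:

import json

def fuse_with_seed_data(new_sentences, seed_path="seed_data.json"):
    with open(seed_path, encoding="utf-8") as f:
        seed_sentences = json.load(f)                 # the t retained seed items representing the old data
    return new_sentences + seed_sentences             # s fused items = x new items + t seed items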
S500: and freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
In some optional implementation manners of some embodiments, the freezing a preset number of layers and parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result includes:
Traversing layers 0 to 11 of the encoder of the bert language model during the continual-training process; when layers 0, 4 and 8 are reached, their gradient updates are set to stop, completing the freezing of layers 0, 4 and 8 and the corresponding parameter information;
and inputting the fusion data into the frozen model for training to obtain a final training result.
In order to better enable the new model to possess both the new knowledge and the old knowledge at the same time, part of the layers of the bert language model are frozen and the embeddings parameters are frozen, so that these parameters are not updated.
Each layer in the bert language model retains the parameter information it has learned; if the parameters of some layers of the bert language model are kept fixed and receive no gradient updates during the continual-training stage, the previously learned information is preserved and the forgetting of old knowledge is reduced.
The specific implementation steps are as follows:
When layers 0, 4 and 8 of the bert encoder are reached during the traversal, the parameters of those layers are traversed and their gradient update is set to stop updating (False); that is, the parameters of layers 0, 4 and 8 receive no gradient updates during model training and their original values are retained.
And inputting the fusion data into the frozen model for training to obtain a final training result.
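A minimal sketch of the freezing step is shown below using the PyTorch-based HuggingFace transformers library; the library choice and the "bert-base-chinese" checkpoint are assumptions of the example, not requirements of the method.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese")
for layer_index, layer in enumerate(model.encoder.layer):   # encoder layers 0 to 11
    if layer_index in (0, 4, 8):
        for param in layer.parameters():
            param.requires_grad = False                      # stop gradient updates; original parameters are retained

The model is then trained on the fused data with an ordinary training loop; only the unfrozen layers receive gradient updates.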
The comparative training results are shown in table 3 below:
TABLE 3 Comparison of training results

Frozen layers | Old knowledge accuracy | New knowledge accuracy
Odd-numbered layers | 95.60% | 95.00%
Even-numbered layers | 95.60% | 95.38%
First 6 layers | 95.43% | 94.50%
Last 6 layers | 95.50% | 95.62%
First 10 layers | 95.28% | 92.08%
Layers 0, 3, 6 and 9 | 95.40% | 95.81%
Layers 1, 4, 7 and 10 | 95.53% | 95.31%
Layers 0, 4 and 8 | 95.72% | 95.62%
As can be seen from Table 3 above, the model with layers 0, 4 and 8 frozen achieves the highest knowledge accuracy and the lowest knowledge forgetting rate after training on the fused data, and medical named entity recognition using the resulting model has the lowest error rate, so the efficiency and accuracy of medical named entity recognition can be effectively improved.
In a second aspect of the embodiments of the present disclosure, there is provided a model training apparatus for continuous learning based on medical named entity recognition, as shown in fig. 2, including:
the data acquisition module is used for acquiring medical text corpora from a plurality of data sources;
the medical knowledge map building module is used for building a medical knowledge map after processing the medical text corpus by adopting a binary language statistical model;
the sentence extracting module is used for extracting the sentences to be trained from the medical knowledge graph;
the data processing module is used for inputting the sentence to be trained into the bert language model for continuous learning training, reserving seed data in the training process and fusing the reserved seed data with new data;
and the model processing module is used for freezing the preset layer number and the parameter information in the bert language model, inputting the fused data into the processed bert language model and obtaining a final training result.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-mentioned method.
Fig. 3 is a schematic diagram of a computer device 3 provided by the embodiment of the present disclosure. As shown in fig. 3, the computer device 3 of this embodiment includes: a processor 601, a memory 602, and a computer program 603 stored in the memory 602 and operable on the processor 601. The steps in the various method embodiments described above are implemented when the processor 601 executes the computer program 603. Alternatively, the processor 601 realizes the functions of each module/unit in the above-described apparatus embodiments when executing the computer program 603.
Illustratively, the computer program 603 may be partitioned into one or more modules/units, which are stored in the memory 602 and executed by the processor 601 to accomplish the present disclosure. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 603 in the computer device 3.
The computer device 3 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computer devices. The computer device 3 may include, but is not limited to, a processor 601 and a memory 602. Those skilled in the art will appreciate that fig. 3 is merely an example of a computer device 3 and is not intended to limit the computer device 3 and may include more or fewer components than shown, or some of the components may be combined, or different components, e.g., the computer device may also include input output devices, network access devices, buses, etc.
The Processor 601 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 602 may be an internal storage unit of the computer device 3, for example, a hard disk or a memory of the computer device 3. The memory 602 may also be an external storage device of the computer device 3, such as a plug-in hard disk provided on the computer device 3, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 602 may also include both an internal storage unit of the computer device 3 and an external storage device. The memory 602 is used for storing computer programs and other programs and data required by the computer device. The memory 602 may also be used to temporarily store data that has been output or is to be output.
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the device is divided into different functional units or modules, so as to perform all or part of the above described functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the apparatus/computer device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions may be adopted in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, the present disclosure may implement all or part of the flows of the methods in the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program may comprise computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions the computer-readable medium does not include electrical carrier signals or telecommunications signals according to legislation and patent practice.
The above examples are only intended to illustrate the technical solution of the present disclosure, not to limit it; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (13)

1. The model training method of continuous learning based on medical named entity recognition is characterized by comprising the following steps: the method comprises the following steps:
acquiring medical text corpora from a plurality of data sources;
processing the medical text corpus by adopting a binary language statistical model to construct a medical knowledge map;
extracting sentences to be trained from the medical knowledge graph;
inputting the sentence to be trained into a bert language model for continuous learning training, reserving seed data in the training process, and fusing the reserved seed data and new data;
and freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
2. The method of claim 1, wherein: the plurality of data sources includes at least: a target hospital data source, a diagnosis and treatment data source and a medical professional book data source.
3. The method of claim 2, wherein: the construction of the medical knowledge graph after the medical text corpus is processed by adopting a binary language statistical model comprises the following steps:
performing word segmentation processing on the medical text corpus by using the binary language statistical model to obtain collocation information between adjacent words;
constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information;
and graphically reconstructing the dictionary to obtain the medical knowledge map corresponding to the binary language statistical model.
4. The method of claim 3, wherein: the constructing of the medical dictionary corresponding to the binary language statistical model according to the collocation information includes:
traversing the medical text corpus according to the collocation information, and calculating the word frequency of the collocation information;
and establishing a corresponding relation between the collocation information and the word frequency, and storing the corresponding relation to form the medical dictionary.
5. The method of claim 4, wherein: the graphically reconstructing the dictionary to obtain the knowledge graph of the binary language statistical model comprises the following steps:
and taking adjacent words contained in the collocation information in the medical dictionary as two adjacent nodes, connecting the two adjacent nodes according to the collocation relationship of the adjacent words to form an edge, and marking the edge by the word frequency of the collocation information to construct and obtain the medical knowledge map.
6. The method of claim 5, wherein: the extracting of the sentences to be trained from the medical knowledge graph comprises the following steps:
calculating joint probability of natural sentences in the neural network based on the binary language statistical model;
extracting and adjusting the natural sentences according to the joint probability to obtain reasonable sentences of which the joint probability is not zero;
and performing path search on the reasonable sentences according to the medical knowledge graph, and mapping according to a search result to obtain the sentences to be trained.
7. The method of claim 6, wherein: inputting the sentence to be trained into a bert language model for continuous learning training, and reserving seed data in the training process, wherein the continuous learning training comprises the following steps:
randomly extracting two sentences to be trained from the sentences to be trained as sentences to be judged;
calculating the similarity between the sentences to be judged according to the cosine similarity to obtain a similarity calculation result;
screening the sentences to be judged according to the similarity calculation result and a preset similarity threshold value to obtain reserved sentences of which the similarity calculation result is lower than the similarity threshold value;
performing the above calculation and screening on all the sentences to be trained, and setting a retention quantity threshold for the seed data; if the number of reserved sentences finally obtained is smaller than or equal to the retention quantity threshold, all the reserved sentences are stored in a json file as the seed data, and if the number of reserved sentences finally obtained is larger than the retention quantity threshold, a number of reserved sentences equal to the threshold is randomly selected and stored in the json file as the seed data.
8. The method of claim 7, wherein calculating the similarity between the sentences to be judged by cosine similarity to obtain the similarity calculation result comprises:
the sentences to be judged comprising a first sentence to be judged and a second sentence to be judged;
splitting the first sentence to be judged and the second sentence to be judged into words with a language processing tool to obtain a first word segmentation result and a second word segmentation result;
merging the first word segmentation result and the second word segmentation result to obtain a word segmentation list;
converting the first sentence to be judged and the second sentence to be judged into numerical vectors by one-hot coding, and comparing the degree of overlap of the two sentences against the word segmentation list to obtain a first sentence vector representation and a second sentence vector representation; and
substituting the first sentence vector representation and the second sentence vector representation into the cosine similarity formula to obtain the similarity calculation result.
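A minimal sketch of the one-hot (bag-of-words) vectorization over the merged word list and the cosine similarity formula cos(a, b) = (a · b) / (|a| |b|); the word segmentation is again assumed to be done elsewhere:

```python
import math

def sentence_vectors(first_words, second_words):
    """Merge the two word-segmentation results into one word list, then encode
    each sentence as a 0/1 presence vector over that shared list."""
    word_list = list(dict.fromkeys(first_words + second_words))  # merged, order-preserving
    first_vec = [1 if w in first_words else 0 for w in word_list]
    second_vec = [1 if w in second_words else 0 for w in word_list]
    return first_vec, second_vec

def cosine_similarity(vec_a, vec_b):
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```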
9. The method of claim 8, wherein fusing the reserved seed data with the new data comprises:
acquiring the new data generated in the continuous training process;
acquiring the reserved seed data by loading the json file; and
merging the new data and the seed data to obtain fused data, the fused data having the characteristics of both the new data and the seed data.
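A minimal sketch of loading the reserved seed data back from the json file and merging it with the newly generated data; the file name and the flat-list record format are assumptions of this sketch:

```python
import json

def fuse_seed_and_new_data(new_data, seed_path="seed_data.json"):
    """Load the reserved seed sentences and concatenate them with the new
    training data so that the fused set carries characteristics of both."""
    with open(seed_path, "r", encoding="utf-8") as f:
        seed_data = json.load(f)
    return list(new_data) + list(seed_data)
```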
10. The method of claim 9, wherein freezing the preset number of layers and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain the final training result, comprises:
traversing layers 1-11 of the encoder of the bert language model during the continuous training process, and when layers 0, 4 and 8 are traversed, setting the gradient update of layers 0, 4 and 8 to stop updating, thereby freezing the parameter information of layers 0, 4 and 8; and
inputting the fused data into the frozen model for training to obtain the final training result.
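A sketch of freezing encoder layers 0, 4 and 8 of a BERT model, shown with the Hugging Face transformers library only as one common way to stop gradient updates; the library and the checkpoint name are assumptions, not part of the claim:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese")   # checkpoint name assumed

FROZEN_LAYERS = {0, 4, 8}
for index, layer in enumerate(model.encoder.layer):      # encoder layers 0-11
    if index in FROZEN_LAYERS:
        for param in layer.parameters():
            param.requires_grad = False                   # stop gradient updates for this layer
```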
11. A model training device for continuous learning based on medical named entity recognition, characterized by comprising:
a data acquisition module, configured to acquire medical text corpora from a plurality of data sources;
a medical knowledge graph construction module, configured to construct a medical knowledge graph after processing the medical text corpora with a binary language statistical model;
a sentence extraction module, configured to extract and adjust sentences to be trained from the medical knowledge graph;
a data processing module, configured to input the sentences to be trained into a bert language model for continuous learning training, reserve seed data in the training process, and fuse the reserved seed data with new data; and
a model processing module, configured to freeze a preset number of layers and parameter information in the bert language model, and input the fused data into the processed bert language model to obtain a final training result.
12. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
CN202211294936.0A 2022-10-21 2022-10-21 Model training method and device for continuous learning based on medical named entity recognition Active CN115374787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211294936.0A CN115374787B (en) 2022-10-21 2022-10-21 Model training method and device for continuous learning based on medical named entity recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211294936.0A CN115374787B (en) 2022-10-21 2022-10-21 Model training method and device for continuous learning based on medical named entity recognition

Publications (2)

Publication Number Publication Date
CN115374787A true CN115374787A (en) 2022-11-22
CN115374787B CN115374787B (en) 2023-01-31

Family

ID=84073748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211294936.0A Active CN115374787B (en) 2022-10-21 2022-10-21 Model training method and device for continuous learning based on medical named entity recognition

Country Status (1)

Country Link
CN (1) CN115374787B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428044A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes
CN111563534A (en) * 2020-04-09 2020-08-21 华南理工大学 Task-oriented word embedding vector fusion method based on self-encoder
CN114298050A (en) * 2021-12-31 2022-04-08 天津开心生活科技有限公司 Model training method, entity relation extraction method, device, medium and equipment
US20220114198A1 (en) * 2020-09-22 2022-04-14 Cognism Limited System and method for entity disambiguation for customer relationship management

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428044A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes
CN111563534A (en) * 2020-04-09 2020-08-21 华南理工大学 Task-oriented word embedding vector fusion method based on self-encoder
US20220114198A1 (en) * 2020-09-22 2022-04-14 Cognism Limited System and method for entity disambiguation for customer relationship management
CN114298050A (en) * 2021-12-31 2022-04-08 天津开心生活科技有限公司 Model training method, entity relation extraction method, device, medium and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAL LACLAVÍK et al.: "A Search Based Approach to Entity Recognition: Magnetic and IISAS Team at ERD Challenge", ERD '14 *
WU HUI et al.: "Chinese Named Entity Recognition Based on Transfer Learning and BiLSTM-CRF", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Also Published As

Publication number Publication date
CN115374787B (en) 2023-01-31

Similar Documents

Publication Publication Date Title
CN107731269B (en) Disease coding method and system based on original diagnosis data and medical record file data
CN107705839B (en) Disease automatic coding method and system
CN109920501B (en) Electronic medical record classification method and system based on convolutional neural network and active learning
JP7459386B2 (en) Disease diagnosis prediction system based on graph neural network
CN107577826B (en) Classification of diseases coding method and system based on raw diagnostic data
CN108831559B (en) Chinese electronic medical record text analysis method and system
EP3567605A1 (en) Structured report data from a medical text report
CN110335647A (en) A kind of clinical data standards system and standardized data acquisition method
CN105760874B (en) CT image processing system and its CT image processing method towards pneumoconiosis
CN107247881A (en) A kind of multi-modal intelligent analysis method and system
CN111584021A (en) Medical record information verification method and device, electronic equipment and storage medium
CN112183026A (en) ICD (interface control document) encoding method and device, electronic device and storage medium
CN115062165B (en) Medical image diagnosis method and device based on film reading knowledge graph
CN114912887B (en) Clinical data input method and device based on electronic medical record
CN112735544A (en) Medical record data processing method and device and storage medium
CN114330267A (en) Structural report template design method based on semantic association
CN112507138A (en) Method and device for constructing disease-specific knowledge map, medium and electronic equipment
CN111128388A (en) Value domain data matching method and device and related products
CN112071431B (en) Clinical path automatic generation method and system based on deep learning and knowledge graph
CN113658720A (en) Method, apparatus, electronic device and storage medium for matching diagnostic name and ICD code
CN113343680A (en) Structured information extraction method based on multi-type case history texts
CN115374787B (en) Model training method and device for continuous learning based on medical named entity recognition
CN110610766A (en) Apparatus and storage medium for deriving probability of disease based on symptom feature weight
CN116206767A (en) Disease knowledge mining method, device, electronic equipment and storage medium
CN115841861A (en) Similar medical record recommendation method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant