CN115374787B - Model training method and device for continuous learning based on medical named entity recognition - Google Patents
- Publication number: CN115374787B (application CN202211294936A)
- Authority: CN (China)
- Prior art keywords: sentences, data, medical, training, model
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/295—Natural language analysis; Named entity recognition
- G06F16/367—Information retrieval of unstructured textual data; Creation of semantic tools; Ontology
- G06F16/374—Information retrieval of unstructured textual data; Creation of semantic tools; Thesaurus
- G06F40/242—Natural language analysis; Lexical tools; Dictionaries
- G06N3/082—Neural network learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
Abstract
The invention discloses a model training method and device for continuous learning based on medical named entity recognition. Seed data is retained during the continuous-learning training process and is trained together with new data when the model is trained on the new data, so that the new model obtained by training keeps the old knowledge and therefore has the capability of both the new knowledge and the old knowledge. Layers 0, 4 and 8 of the bert encoder and their parameter information are frozen and excluded from parameter updating, so previously learned information is retained and the forgetting of old knowledge is reduced; the obtained training result has the lowest forgetting rate and the highest accuracy. In the medical field, the model of an original hospital can therefore be adapted to a new hospital without requiring the full data and without forgetting the knowledge learned in the original hospital, which spares the new hospital from large-scale text labeling, saves training time, improves training efficiency and the accuracy of the training result, and makes medical named entity recognition more accurate.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a model training method and device for continuous learning based on medical named entity recognition.
Background
The application of artificial intelligence to pathological diagnosis makes the analysis and diagnosis of diseases more scientific and efficient. Medical named entity recognition is one component of artificial-intelligence pathological diagnosis: it extracts important information such as disease names, clinical manifestations and disease duration from a passage of diagnostic text.
However, different hospitals express the same disease differently; for example, gastric cancer is also called gastric malignancy. Moreover, for regional or other reasons, a disease seen in hospital B may not appear in hospital A, so an entity recognition model trained in hospital A is not necessarily suitable for hospital B. Training a new model for hospital B from scratch not only wastes time but also requires a large amount of labeled text from hospital B, which is time-consuming and labor-intensive. Even leaving time cost and personnel cost aside, merging the data of hospitals A and B to train one model that fits both is obviously not realistic. And if a model for hospital A is subsequently trained using only data from hospital B, the resulting model will eventually fit only hospital B and the knowledge learned from hospital A will be forgotten; this is the problem of catastrophic forgetting that arises while the model learns new knowledge.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present application provides a model training method and apparatus for continuous learning based on medical named entity recognition.
In a first aspect, the present application provides a model training method for continuous learning based on medical named entity recognition, comprising the following steps:
acquiring medical text corpora from a plurality of data sources;
processing the medical text corpus by adopting a binary language statistical model and then constructing a medical knowledge map;
extracting a sentence to be trained from the medical knowledge graph;
inputting the sentence to be trained into a bert language model for continuous learning training, reserving seed data in the training process, and fusing the reserved seed data with new data;
and freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
In some optional implementations of some embodiments, the plurality of data sources includes at least: a target hospital data source, a diagnosis and treatment data source and a medical professional book data source.
In some optional implementations of some embodiments, the constructing a medical knowledge graph after processing the medical text corpus by using a binary language statistical model includes:
performing word segmentation processing on the medical text corpus by using the binary language statistical model to obtain collocation information between adjacent words;
constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information;
and graphically reconstructing the dictionary to obtain a medical knowledge map corresponding to the binary language statistical model.
In some optional implementations of some embodiments, the constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information includes:
traversing the medical text corpus according to the collocation information, and calculating the word frequency of the collocation information;
and establishing a corresponding relation between the collocation information and the word frequency, and storing the corresponding relation to form the medical dictionary.
In some optional implementations of some embodiments, the graphically reconstructing the dictionary to obtain the knowledge graph of the binary language statistical model includes:
and taking adjacent words contained in the collocation information in the medical dictionary as two adjacent nodes, connecting the two adjacent nodes according to the collocation relationship of the adjacent words to form edges, and marking the edges by the word frequency of the collocation information to construct and obtain the medical knowledge map.
In some optional implementations of some embodiments, the extracting the sentence to be trained from the medical knowledge-graph includes:
calculating joint probability of natural sentences in the neural network based on the binary language statistical model;
extracting and adjusting the natural sentences according to the joint probability to obtain reasonable sentences of which the joint probability is not zero;
and performing path search on the reasonable sentences according to the medical knowledge graph, and mapping according to a search result to obtain the sentences to be trained.
In some optional implementations of some embodiments, the inputting the sentence to be trained into the bert language model for continuous learning training, and the retaining seed data in the training process includes:
extracting any two sentences to be trained from the sentences to be trained as sentences to be judged;
calculating the similarity between the sentences to be judged through cosine similarity to obtain a similarity calculation result;
screening the sentences to be judged according to the similarity calculation result and a preset similarity threshold value to obtain reserved sentences of which the similarity calculation result is lower than the similarity threshold value;
calculating and screening all the sentences to be trained in this way, and setting a retention quantity threshold for the seed data; if the number of retained sentences finally obtained is smaller than or equal to the retention quantity threshold, all of the retained sentences are stored in a json file as the seed data, and if it is larger than the retention quantity threshold, a number of retained sentences equal to the retention quantity threshold is randomly selected and stored in the json file as the seed data.
In some optional implementation manners of some embodiments, the calculating similarity between the sentences to be determined by cosine similarity to obtain a similarity calculation result includes:
the sentences to be judged comprise a first sentence to be judged and a second sentence to be judged;
using a language processing tool to perform text splitting on the first sentence to be judged and the second sentence to be judged to obtain a first word segmentation result and a second word segmentation result;
merging the first word segmentation result and the second word segmentation result to obtain a word segmentation list;
converting the first sentence to be judged and the second sentence to be judged into digital vectors by using one-hot coding, and performing duplication degree comparison by combining the first sentence to be judged, the second sentence to be judged and the word segmentation list to obtain a first sentence vector representation and a second sentence vector representation;
and substituting the first sentence vector representation and the second sentence vector representation into a cosine similarity formula to obtain a similarity calculation result.
In some optional implementations of some embodiments, the fusing the reserved seed data and the new data includes:
acquiring new data generated in the continuous training process;
acquiring reserved seed data by loading a json file;
and merging the new data and the seed data to obtain fused data, wherein the fused data has the characteristics of both the new data and the seed data.
In some optional implementation manners of some embodiments, the freezing a preset number of layers and parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result includes:
traversing the 12 encoder layers (layers 0 to 11) of the bert language model in the continuous training process, setting the gradient updating of layers 0, 4 and 8 to stop updating when these layers are reached, and thereby completing the freezing of layers 0, 4 and 8 and of their parameter information;
and inputting the fusion data into the frozen model for training to obtain a final training result.
In a second aspect of the disclosed embodiments, a model training apparatus for continuous learning based on medical named entity recognition is provided, including:
the data acquisition module is used for acquiring medical text corpora from a plurality of data sources;
the medical knowledge map building module is used for building a medical knowledge map after processing the medical text corpus by adopting a binary language statistical model;
the sentence extracting module is used for extracting the sentences to be trained from the medical knowledge graph;
the data processing module is used for inputting the sentence to be trained into the bert language model for continuous learning training, reserving seed data in the training process and fusing the reserved seed data with new data;
and the model processing module is used for freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-mentioned method.
The invention has the beneficial effects that:
the method is characterized in that seed data is reserved in the model training process of continuous learning, model training is carried out with new data when the model is used for training the new data, the forgetting degree of the model on old knowledge is reduced, after the new model obtained through training has the old knowledge, the new model can have the capacity of the new knowledge and the old knowledge, the 0 th, 4 th and 8 th layers of bert layers and parameter information are frozen, the new model is not subjected to parameter updating, the information learned before is reserved, the forgetting performance of the old knowledge is reduced, the forgetting rate of the obtained training result is lowest and the accuracy is highest, in the medical field, the model of the original hospital can be trained without needing all data and forgetting the knowledge learned in the original hospital, the model of the original hospital can be suitable for the new hospital to be trained, a large number of text labels are avoided for the new hospital, the training time of the model is saved, the training efficiency and the accuracy of the training result are improved, and the medical naming entity is identified more accurately.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Fig. 2 is a block diagram of the system of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device implementing some embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, the present application proposes a model training method based on continuous learning of medical named entity recognition, as shown in fig. 1, including steps S100-S500:
s100: acquiring medical text corpora from a plurality of data sources;
in some optional implementations of some embodiments, the plurality of data sources includes at least: a target hospital data source, a diagnosis and treatment data source and a medical professional book data source.
The following describes the process of acquiring the medical text corpus for different data sources respectively:
(1) Target hospital data source
The data is provided by the cooperating institution and stored in an SQL database; specifically, the data is acquired by connecting to the SQL database of the target hospital and retrieving it with the corresponding commands.
(2) Diagnosis and treatment data source
As an example, medical record data can be derived from an existing electronic medical record system database, the medical record data is analyzed, and an analysis result is converted into a text format, so that a medical text corpus is obtained.
(3) Medical professional book data source
As an example, for an electronic format, for example, a medical professional book in text format, the medical text corpus may be directly obtained without processing, and for a non-electronic format, for example, a paper medical professional book, the medical text corpus may be obtained by converting the medical professional book into text format.
S200: processing the medical text corpus by adopting a binary language statistical model to construct a medical knowledge map;
in some optional implementations of some embodiments, the constructing a medical knowledge graph after processing the medical text corpus using a binary language statistical model includes:
performing word segmentation processing on the medical text corpus by using the binary language statistical model to obtain collocation information between adjacent words;
the formula of the binary language statistical model calculation statement is as follows:
wherein,representing the probability of n words occurring at the same time,indicates the probability of the 1 st word occurring,indicating the probability that the 2 nd word occurs simultaneously with the 1 st word, and so on.The equiprobability may be further obtained by counting the number of times the words occur simultaneously in the collected text corpus.
It should be clear that the collocation information between adjacent words refers to adjacent words in the sentence and the collocation relationship between the adjacent words, and the collocation relationship between the adjacent words refers to the reasonable collocation of the two words in the front-back order, for example, the "lung cancer" and the "patient" are adjacent words, and the reasonable collocation of the two words is the "lung cancer patient", and the collocation relationship between the two words is that the "lung cancer" is before the "patient". Therefore, the collocation information between adjacent words can reflect the collocation relationship between the adjacent words, that is, the collocation information between the adjacent words can know two adjacent words contained therein, so that the two adjacent words are reasonably collocated according to the front and back sequence.
In this embodiment, the word segmentation processing on the text corpus is implemented by a binary language statistical model. Specifically, the probability of simultaneous occurrence of adjacent words in the sentence processed by word segmentation is calculated through a binary language statistical model, and the most appropriate collocation information between adjacent words can be obtained according to the calculated maximum probability.
For example, "lung cancer" and "patient" are adjacent words, if they occur simultaneously according to the matching sequence of "lung cancer patient", the probability calculated by the binary language statistical model for this is larger, and if they occur simultaneously according to the matching sequence of "lung cancer patient", the probability calculated by the binary language statistical model for this is zero. Therefore, according to the principle of high probability, the matching information between two adjacent words of the lung cancer and the patient is obtained as the lung cancer patient, namely the adjacent words of the lung cancer and the patient are reasonably matched according to the sequence of the lung cancer before the patient.
Constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information;
in some optional implementations of some embodiments, the constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information includes:
traversing the medical text corpus according to the collocation information, and calculating the word frequency of the collocation information;
the collocation information between adjacent words can reflect the collocation relationship between the adjacent words, that is, two adjacent words contained in the collocation information can be known through the collocation information between the adjacent words, and the two adjacent words are reasonably collocated according to the front and back sequence.
Therefore, the word frequency represents the number of times that adjacent words in the collocation information appear simultaneously according to the collocation relationship, for this reason, the text corpus is traversed according to the collocation information, that is, the adjacent words in all sentences of the text corpus are traversed according to the collocation relationship between the adjacent words and the adjacent words in the collocation information, and the number of times that the adjacent words in the collocation information appear simultaneously in the text corpus according to the collocation relationship is counted, so that the word frequency of the collocation information can be calculated.
And establishing a corresponding relation between the collocation information and the word frequency, and storing the corresponding relation to form the medical dictionary.
After the word frequency of the collocation information is obtained, the corresponding relationship between the two can be established and stored, and a medical dictionary such as that shown in table 1 below is formed.
TABLE 1 medical dictionary corresponding to binary language statistical model
Collocation information | Word frequency
lung cancer & patient | 138
patient & symptoms | 113
symptoms & include | 98
… | …
And graphically reconstructing the dictionary to obtain the medical knowledge map corresponding to the binary language statistical model.
In some optional implementations of some embodiments, the graphically reconstructing the dictionary to obtain the knowledge graph of the binary language statistical model includes:
and taking adjacent words contained in the collocation information in the medical dictionary as two adjacent nodes, connecting the two adjacent nodes according to the collocation relationship of the adjacent words to form edges, and marking the edges by the word frequency of the collocation information to construct and obtain the medical knowledge map.
Since the knowledge graph of the binary language statistical model is in a graph form, after the corresponding medical dictionary is constructed, the medical dictionary needs to be reconstructed graphically.
Furthermore, the collocation information between adjacent words contained in the medical dictionary is reconstructed graphically.
The adjacent words contained in the collocation information are used as nodes, and the edges connecting the nodes represent the collocation relationships between the adjacent words. The graphical reconstruction can use the probability or frequency with which the adjacent words contained in the medical dictionary occur together according to their collocation relationship; that is, each edge is labelled with this probability or frequency. For example, the nodes include "lung cancer" and "patient"; the edge formed between these two adjacent nodes represents the collocation relationship between them, namely that "lung cancer" comes before "patient". The number on the edge represents the number of times the two adjacent words occur together in the text corpus according to this collocation relationship.
As described above, each node of the medical knowledge graph represents a word in a sentence and each edge represents the collocation relationship between words. In this embodiment, based on the binary language statistical model, two adjacent nodes represent the adjacent words in a piece of collocation information, and the edge connecting them represents the collocation relationship between those adjacent words; the medical knowledge graph is constructed by labelling each edge according to the probability that the adjacent words occur together. Because an edge connects two adjacent nodes according to the collocation relationship between the adjacent words, the edge is directed, and its direction is closely related to that collocation relationship. For example, the collocation relationship between the adjacent words "lung cancer" and "patient" is "lung cancer patient", so the corresponding edge points from the node "lung cancer" to the node "patient". Since the probability of the adjacent words occurring together is positive exactly when the word frequency counting their co-occurrences is greater than zero, the edge can be labelled with the word frequency of the collocation information instead of the probability when the medical knowledge graph is constructed, which helps to improve the efficiency of generating natural sentences.
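A minimal sketch of this graphical reconstruction, assuming the medical dictionary is held as a mapping from (previous word, next word) pairs to word frequencies; the representation as a plain dictionary of directed edges is an illustrative choice, not something prescribed by the patent.

```python
from collections import defaultdict


def build_knowledge_graph(medical_dictionary):
    """Turn a {(prev_word, next_word): word_frequency} dictionary into a directed graph.

    Every word becomes a node; every collocation becomes a directed edge from
    prev_word to next_word labelled with its word frequency.
    """
    graph = defaultdict(dict)
    for (prev_word, next_word), frequency in medical_dictionary.items():
        graph[prev_word][next_word] = frequency
    return graph


# medical_dictionary = {("lung cancer", "patient"): 138,
#                       ("patient", "symptoms"): 113,
#                       ("symptoms", "include"): 98}
# graph = build_knowledge_graph(medical_dictionary)
# graph["lung cancer"]["patient"]  # -> 138, the label of the directed edge
```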
S300: extracting a sentence to be trained from the medical knowledge graph;
in some optional implementations of some embodiments, the extracting the sentence to be trained from the medical knowledge-graph includes:
calculating joint probability of natural sentences in the neural network based on the binary language statistical model;
the neural network trains the collected text corpora to enable the machine to learn various characteristics of the language, and further enable the machine to generate natural sentences on the premise of no manual intervention.
Extracting and adjusting the natural sentences according to the joint probability to obtain reasonable sentences of which the joint probability is not zero;
the rational sentence is a natural sentence with a simple or reasonable grammar, for example, "lung cancer" and "patient" in "symptom possessed by lung cancer patient" belong to rational collocation. In other words, two words in the form of "patient lung cancer" may not exist in the originally collected text corpus, i.e., the two words in the form of "patient lung cancer" are counted as occurring at the same time in the originally collected text corpus with zero number of times.
Based on the method, after the joint probability of the natural sentences is obtained through calculation, reasonable sentences can be screened out from the generated natural sentences according to the principle that the joint probability is not zero.
And performing path search on the reasonable sentences according to the medical knowledge graph, and mapping according to a search result to obtain the sentences to be trained.
In the medical knowledge graph, a reasonable sentence can be mapped to a path formed by nodes and the edges connecting them; for example, the reasonable sentence "symptoms possessed by lung cancer patient" is mapped by the path formed by the four nodes "lung cancer", "patient", "possessed" and "symptoms" and the corresponding edges.
Based on this, a path search is performed on the reasonable sentences through the medical knowledge graph to obtain the search results corresponding to the paths. After the search results are obtained, the paths in the search results are mapped back to reasonable sentences according to the mapping relationship between paths in the medical knowledge graph and reasonable sentences, and the reasonable sentences obtained in this way are stored as the sentences to be trained, so that model training for continuous learning can subsequently be carried out with them.
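The path search can be sketched as follows: a candidate sentence is kept as a sentence to be trained only if every adjacent word pair in it corresponds to an edge of the medical knowledge graph. This is an illustrative sketch built on the graph representation assumed above, with function names chosen for illustration.

```python
def path_exists(graph, words):
    """Return True if the word sequence forms a connected path in the knowledge graph."""
    return all(graph.get(prev_word, {}).get(next_word, 0) > 0
               for prev_word, next_word in zip(words, words[1:]))


def extract_training_sentences(graph, candidate_sentences):
    """Keep only the candidates whose full word path is found in the knowledge graph."""
    return [words for words in candidate_sentences if path_exists(graph, words)]
```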
S400: inputting the sentence to be trained into a bert language model for continuous learning training, reserving seed data in the training process, and fusing the reserved seed data and new data;
in some optional implementations of some embodiments, the inputting the sentence to be trained into the bert language model for continuous learning training, and the retaining seed data in the training process includes:
extracting any two sentences to be trained from the sentences to be trained as sentences to be judged;
the two extracted sentences to be trained may be, for example:
sensor 1= "symptoms possessed by lung cancer patient"
sensor 2= "symptoms of lung cancer patient include"
Calculating the similarity between the sentences to be judged through cosine similarity to obtain a similarity calculation result;
in some optional implementation manners of some embodiments, the calculating similarity between the sentences to be determined by cosine similarity to obtain a similarity calculation result includes:
the sentences to be judged comprise a first sentence to be judged and a second sentence to be judged;
in this embodiment, sentence1 is used as the first sentence to be judged, and sentence2 is used as the second sentence to be judged;
using a language processing tool to perform text splitting on the first sentence to be judged and the second sentence to be judged to obtain a first word segmentation result and a second word segmentation result;
in this embodiment, the jieba word segmentation tool is used to split the first sentence to be judged and the second sentence to be judged. The first word segmentation result, obtained from sentence1, is: ["lung cancer", "patient", "possessed", "of", "symptoms"]; the second word segmentation result, obtained from sentence2, is: ["lung cancer", "patient", "of", "symptoms", "include"].
Merging the first word segmentation result and the second word segmentation result to obtain a word segmentation list;
merging the first word segmentation result and the second word segmentation result gives the word segmentation list word_list = ["lung cancer", "patient", "possessed", "of", "symptoms", "include"]
Converting the first sentence to be judged and the second sentence to be judged into digital vectors by using one-hot coding, and performing duplication degree comparison by combining the first sentence to be judged, the second sentence to be judged and the word segmentation list to obtain a first sentence vector representation and a second sentence vector representation;
in this embodiment, each position of a sentence vector is set to 1 if the corresponding word of word_list appears in that sentence to be judged, and to 0 otherwise, giving:
word_vec_1 = [1, 1, 1, 1, 1, 0], word_vec_2 = [1, 1, 0, 1, 1, 1];
and substituting the first sentence vector representation and the second sentence vector representation into a cosine similarity formula to obtain a similarity calculation result.
According to the cosine similarity formula:

cos(A, B) = (A · B) / (|A| × |B|) = (Σ Ai × Bi) / (√(Σ Ai²) × √(Σ Bi²)), with the sums running over i = 1, …, n,

where A and B are the vector representations of the two sentences to be judged, Ai and Bi are the i-th components of those vectors, and n is the number of components.
The calculation method of the cosine similarity formula comprises the following steps: and calculating the dot product of the two vectors, calculating the modulus of each vector, dividing the dot product by the product of the two vector moduli to obtain the cosine included angle, and judging the similarity of the two sentences to be judged according to the size of the included angle.
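The worked example above can be reproduced with the sketch below; it uses jieba and one-hot style vectors as described, while the helper names themselves are assumptions made for illustration.

```python
import math

import jieba


def sentence_vectors(first_sentence, second_sentence):
    """Split both sentences, merge the results into a word list and build one-hot vectors."""
    first_words = list(jieba.cut(first_sentence))
    second_words = list(jieba.cut(second_sentence))
    word_list = list(dict.fromkeys(first_words + second_words))  # merged list, order preserved
    first_vector = [1 if word in first_words else 0 for word in word_list]
    second_vector = [1 if word in second_words else 0 for word in word_list]
    return first_vector, second_vector


def cosine_similarity(first_vector, second_vector):
    """cos = (A . B) / (|A| * |B|)."""
    dot_product = sum(a * b for a, b in zip(first_vector, second_vector))
    first_norm = math.sqrt(sum(a * a for a in first_vector))
    second_norm = math.sqrt(sum(b * b for b in second_vector))
    if first_norm == 0 or second_norm == 0:
        return 0.0
    return dot_product / (first_norm * second_norm)


# first_vector, second_vector = sentence_vectors(sentence1, sentence2)
# similarity = cosine_similarity(first_vector, second_vector)
```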
Screening the sentences to be judged according to the similarity calculation result and a preset similarity threshold value to obtain reserved sentences of which the similarity calculation result is lower than the similarity threshold value;
calculating and screening all the sentences to be trained in this way, and setting a retention quantity threshold for the seed data; if the number of retained sentences finally obtained is smaller than or equal to the retention quantity threshold, all of the retained sentences are stored in a json file as the seed data, and if it is larger than the retention quantity threshold, a number of retained sentences equal to the retention quantity threshold is randomly selected and stored in the json file as the seed data.
In this embodiment, the similarity of two sentences to be trained is calculated by the cosine similarity algorithm; the more similar the two sentences are, the closer the result is to 1, and otherwise the closer it is to -1. If the number of training texts (i.e. the total number of stored sentences to be trained) is z, then z*(z-1)/2 comparisons are needed so that each text is compared with every other text. Texts whose similarity is lower than 0 (the threshold can be customized) are retained, and after de-duplication the texts that are mutually dissimilar are finally obtained.
And setting the number of the seed data as a threshold value t, if the number of the finally reserved texts which are dissimilar to each other is less than t, reserving all the texts, and if the number of the finally reserved texts is greater than t, randomly selecting t texts to be stored in the json file.
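One greedy reading of the screening and retention procedure described above is sketched below; the similarity function is the cosine similarity from the previous step, while the threshold values, file name and function names are assumptions for illustration.

```python
import json
import random


def select_seed_data(sentences, similarity_fn, similarity_threshold=0.0,
                     max_seeds=100, seed_file="seed_data.json"):
    """Retain mutually dissimilar sentences and store at most max_seeds of them as seed data."""
    retained = []
    for sentence in sentences:
        # keep the sentence only if it is dissimilar to every sentence retained so far
        if all(similarity_fn(sentence, kept) < similarity_threshold for kept in retained):
            retained.append(sentence)
    if len(retained) > max_seeds:
        retained = random.sample(retained, max_seeds)  # randomly select t retained sentences
    with open(seed_file, "w", encoding="utf-8") as f:
        json.dump(retained, f, ensure_ascii=False)
    return retained
```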
As can be seen from table 2 below, after the seed data is retained and model training for continuous learning is performed, the knowledge accuracy is significantly higher than that of the seed data that is not retained.
TABLE 2 seed data accuracy comparison table
Seed data added | Old knowledge accuracy | New knowledge accuracy
No | 88.95% | 95.95%
Yes | 95.42% | 95.44%
In some optional implementations of some embodiments, the fusing the reserved seed data and the new data includes:
acquiring new data generated in the continuous training process;
acquiring reserved seed data by loading a json file;
and merging the new data and the seed data to obtain fused data, wherein the fused data has the characteristics of both the new data and the seed data.
For example: in the continuous training stage, a batch of new data containing x samples is added; x has no required size relationship with z. The t text samples representing the z old samples are obtained by retaining seed data, and these t seed samples are obtained by loading the json file. The x new samples and the t seed samples are merged to obtain s representative samples, and model training is performed with these s fused samples, so the training data has characteristics of both the z old data and the x new data, and the new model obtained by training retains the old knowledge to a certain extent.
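The fusion step then amounts to loading the seed file and concatenating it with the newly collected data; a minimal sketch with assumed names:

```python
import json
import random


def fuse_with_seed_data(new_data, seed_file="seed_data.json"):
    """Merge the new training data with the retained seed data loaded from the json file."""
    with open(seed_file, "r", encoding="utf-8") as f:
        seed_data = json.load(f)
    fused_data = list(new_data) + list(seed_data)
    random.shuffle(fused_data)  # mix old and new samples before training
    return fused_data
```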
S500: and freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
In some optional implementation manners of some embodiments, the freezing a preset number of layers and parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result includes:
traversing the 12 encoder layers (layers 0 to 11) of the bert language model in the continuous training process, and setting the gradient updating of layers 0, 4 and 8 to stop updating when these layers are reached, so as to complete the freezing of layers 0, 4 and 8 and the corresponding parameter information;
and inputting the fusion data into the frozen model for training to obtain a final training result.
In order to better enable the new model to have the ability of both new knowledge and old knowledge, part of the layers of the bert language model are frozen, and the embeddings parameters are frozen as well, so that these parameters are not updated.
Each layer in the bert language model keeps learned parameter information, if parameters of partial layers of the bert language model are kept unchanged in a continuous training phase and gradient updating is not carried out, the previously learned information can be kept, and the forgetting performance of old knowledge can be reduced.
The specific implementation steps are as follows:
When layers 0, 4 and 8 of the bert encoder are reached during the traversal, the parameters of these layers are traversed in turn and their gradient updating is set to stop (False); that is, during model training the parameters of layers 0, 4 and 8 receive no gradient updates and the original parameters are retained.
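Using the HuggingFace transformers implementation of bert (an assumption; the patent does not name a specific framework or checkpoint), the freezing step can be sketched as:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese")  # checkpoint name is an assumption

frozen_layers = {0, 4, 8}
for index, layer in enumerate(model.encoder.layer):
    if index in frozen_layers:
        for param in layer.parameters():
            param.requires_grad = False  # no gradient update: original parameters are retained

# As mentioned above, the embeddings parameters can be frozen in the same way:
# for param in model.embeddings.parameters():
#     param.requires_grad = False
```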
And inputting the fusion data into the frozen model for training to obtain a final training result.
The comparative training results are shown in table 3 below:
TABLE 3 comparative training results
Frozen layers | Old knowledge accuracy | New knowledge accuracy
Odd layers | 95.60% | 95.00%
Even layers | 95.60% | 95.38%
First 6 layers | 95.43% | 94.50%
Last 6 layers | 95.50% | 95.62%
First 10 layers | 95.28% | 92.08%
Layers 0, 3, 6, 9 | 95.40% | 95.81%
Layers 1, 4, 7, 10 | 95.53% | 95.31%
Layers 0, 4, 8 | 95.72% | 95.62%
As can be seen from table 3 above, the model with layers 0, 4 and 8 frozen achieves the highest knowledge accuracy and the lowest knowledge forgetting rate after training on the fusion data, and medical named entity recognition performed with the resulting model has the lowest error rate, so the efficiency and accuracy of medical named entity recognition can be effectively improved.
In a second aspect of the embodiments of the present disclosure, there is provided a model training apparatus for continuous learning based on medical named entity recognition, as shown in fig. 2, including:
the data acquisition module is used for acquiring medical text corpora from a plurality of data sources;
the medical knowledge map building module is used for building a medical knowledge map after processing the medical text corpus by adopting a binary language statistical model;
the sentence extracting module is used for extracting the sentences to be trained from the medical knowledge graph;
the data processing module is used for inputting the sentence to be trained into the bert language model for continuous learning training, reserving seed data in the training process and fusing the reserved seed data with new data;
and the model processing module is used for freezing the preset layer number and the parameter information in the bert language model, inputting the fused data into the processed bert language model and obtaining a final training result.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.
Fig. 3 is a schematic diagram of a computer device 3 provided by the embodiment of the present disclosure. As shown in fig. 3, the computer device 3 of this embodiment includes: a processor 601, a memory 602, and a computer program 603 stored in the memory 602 and operable on the processor 601. The steps in the various method embodiments described above are implemented when the computer program 603 is executed by the processor 601. Alternatively, the processor 601 realizes the functions of each module/unit in the above-described apparatus embodiments when executing the computer program 603.
Illustratively, the computer program 603 may be partitioned into one or more modules/units, which are stored in the memory 602 and executed by the processor 601 to accomplish the present disclosure. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 603 in the computer device 3.
The computer device 3 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computer devices. The computer device 3 may include, but is not limited to, a processor 601 and a memory 602. Those skilled in the art will appreciate that fig. 3 is merely an example of a computer device 3 and is not intended to limit the computer device 3 and may include more or fewer components than shown, or some of the components may be combined, or different components, e.g., the computer device may also include input output devices, network access devices, buses, etc.
The Processor 601 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 602 may be an internal storage unit of the computer device 3, for example, a hard disk or a memory of the computer device 3. The memory 602 may also be an external storage device of the computer device 3, such as a plug-in hard disk provided on the computer device 3, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 602 may also include both an internal storage unit of the computer device 3 and an external storage device. The memory 602 is used for storing computer programs and other programs and data required by the computer device. The memory 602 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the above-described apparatus/computer device embodiments are merely illustrative, and for example, a module or a unit may be divided into only one logical function, another division may be made in actual implementation, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the method in the above embodiments, which may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor it may implement the steps of the above methods and embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution media, and the like. It should be noted that the content of the computer readable medium may be subject to appropriate additions or deletions as required by legislative and patent practice within the jurisdiction; for example, in some jurisdictions, computer readable media may not include electrical carrier signals or telecommunications signals in accordance with legislative and patent practice.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.
Claims (12)
1. The model training method of continuous learning based on medical named entity recognition is characterized by comprising the following steps:
acquiring medical text corpora from a plurality of data sources;
processing the medical text corpus by adopting a binary language statistical model to construct a medical knowledge map;
extracting a sentence to be trained from the medical knowledge graph;
inputting the sentences to be trained into a bert language model for continuous learning training, reserving seed data in the training process, and fusing the reserved seed data and new data, wherein two sentences to be trained are arbitrarily extracted from the sentences to be trained to serve as sentences to be judged; calculating the similarity between the sentences to be judged through cosine similarity to obtain a similarity calculation result; screening the sentences to be judged according to the similarity calculation result and a preset similarity threshold value to obtain reserved sentences of which the similarity calculation result is lower than the similarity threshold value; calculating and screening all the sentences to be trained, setting a reserved quantity threshold value of seed data, if the quantity of the reserved sentences obtained finally is smaller than or equal to the reserved quantity threshold value, all the sentences are stored in a json file as the seed data, if the quantity of the reserved sentences obtained finally is larger than the reserved quantity threshold value, the reserved sentences with the same value as the reserved quantity threshold value are randomly selected as the seed data to be stored in the json file, and in the continuous training stage, a batch of new data sets with x data sets can be added;
and freezing the preset layer number and the parameter information in the bert language model, and inputting the fused data into the processed bert language model to obtain a final training result.
2. The method of claim 1, wherein: the plurality of data sources includes at least: a target hospital data source, a diagnosis and treatment data source and a medical professional book data source.
3. The method of claim 2, wherein: the construction of the medical knowledge map after the medical text corpus is processed by adopting the binary language statistical model comprises the following steps:
performing word segmentation processing on the medical text corpus by using the binary language statistical model to obtain collocation information between adjacent words;
constructing a medical dictionary corresponding to the binary language statistical model according to the collocation information;
and graphically reconstructing the dictionary to obtain the medical knowledge map corresponding to the binary language statistical model.
4. The method of claim 3, wherein: the constructing of the medical dictionary corresponding to the binary language statistical model according to the collocation information includes:
traversing the medical text corpus according to the collocation information, and calculating the word frequency of the collocation information;
and establishing a corresponding relation between the collocation information and the word frequency, and storing the corresponding relation to form the medical dictionary.
5. The method of claim 4, wherein: the graphically reconstructing the dictionary to obtain the knowledge graph of the binary language statistical model comprises the following steps:
taking the adjacent words contained in each piece of collocation information in the medical dictionary as two adjacent nodes, connecting the two adjacent nodes according to their collocation relationship to form an edge, and labelling the edge with the word frequency of the collocation information, thereby constructing the medical knowledge graph.
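A minimal sketch of the graphical reconstruction in claim 5, continuing from the dictionary layout assumed above: adjacent words become nodes, their collocation becomes an edge, and the word frequency labels the edge. A plain adjacency dictionary stands in for whatever graph store is actually used; that choice is an assumption.

```python
def dictionary_to_knowledge_graph(medical_dictionary):
    """Reconstruct the {(left, right): frequency} dictionary as a graph.

    Adjacent words are nodes; each collocation is an edge labelled with its
    word frequency, stored here as a nested adjacency dict.
    """
    graph = {}
    for (left_word, right_word), frequency in medical_dictionary.items():
        graph.setdefault(left_word, {})[right_word] = frequency
    return graph
```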
6. The method of claim 5, wherein extracting the sentences to be trained from the medical knowledge graph comprises:
calculating joint probability of natural sentences in the neural network based on the binary language statistical model;
extracting and adjusting the natural sentences according to the joint probability to obtain reasonable sentences of which the joint probability is not zero;
and performing a path search on the reasonable sentences according to the medical knowledge graph, and mapping the search result to obtain the sentences to be trained.
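The joint-probability filter of claim 6 can be sketched as a standard unsmoothed bigram computation, so that a sentence containing any unseen collocation gets probability zero and is discarded. The function name and the companion unigram count table `word_freq` are assumptions for illustration only.

```python
def sentence_joint_probability(words, medical_dictionary, word_freq):
    """Joint probability of a word sequence under the bigram model, without
    smoothing, so unseen collocations yield a probability of zero."""
    if not words or not word_freq:
        return 0.0
    total = sum(word_freq.values())
    probability = word_freq.get(words[0], 0) / total
    for prev_word, word in zip(words, words[1:]):
        prev_count = word_freq.get(prev_word, 0)
        if prev_count == 0:
            return 0.0
        # P(word | prev_word) estimated from collocation and unigram counts
        probability *= medical_dictionary.get((prev_word, word), 0) / prev_count
    return probability
```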
7. The method of claim 6, wherein calculating the similarity between the sentences to be judged by cosine similarity to obtain the similarity calculation result comprises:
the sentences to be judged comprise a first sentence to be judged and a second sentence to be judged;
using a language processing tool to perform text splitting on the first sentence to be judged and the second sentence to be judged to obtain a first word segmentation result and a second word segmentation result;
merging the first word segmentation result and the second word segmentation result to obtain a word segmentation list;
converting the first sentence to be judged and the second sentence to be judged into numeric vectors using one-hot coding, and comparing the degree of overlap of the two sentences against the word segmentation list to obtain a first sentence vector representation and a second sentence vector representation;
and substituting the first sentence vector representation and the second sentence vector representation into a cosine similarity formula to obtain a similarity calculation result.
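The one-hot comparison of claim 7 amounts to bag-of-words vectors over the merged word segmentation list, followed by the cosine formula. The sketch below assumes the two sentences have already been segmented into token lists; the function name is illustrative.

```python
import math

def cosine_similarity(first_tokens, second_tokens):
    """Cosine similarity of two segmented sentences using one-hot vectors
    built over the merged word segmentation list."""
    word_list = sorted(set(first_tokens) | set(second_tokens))
    first_vec = [1 if word in first_tokens else 0 for word in word_list]
    second_vec = [1 if word in second_tokens else 0 for word in word_list]
    dot = sum(a * b for a, b in zip(first_vec, second_vec))
    norm = (math.sqrt(sum(a * a for a in first_vec))
            * math.sqrt(sum(b * b for b in second_vec)))
    return dot / norm if norm else 0.0
```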
8. The method of claim 7, wherein fusing the retained seed data with the new data comprises:
acquiring new data generated in the continuous training process;
acquiring the retained seed data by loading the json file;
and merging the new data and the seed data to obtain fused data, wherein the fused data has the characteristics of both the new data and the seed data.
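A minimal sketch of the fusion step in claim 8: load the retained seed data from the json file and concatenate it with the new data, so the fused set carries the characteristics of both. The file name and the simple list-based layout are assumptions.

```python
import json

def fuse_with_seed_data(new_data, seed_path="seed_data.json"):
    """Merge new training data with the retained seed data loaded from json."""
    with open(seed_path, "r", encoding="utf-8") as f:
        seed_data = json.load(f)
    # the fused data keeps the characteristics of both the seed and the new data
    return seed_data + new_data
```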
9. The method of claim 8, wherein freezing the preset number of layers and the parameter information in the bert language model and inputting the fused data into the processed bert language model to obtain the final training result comprises:
traversing layers 1-11 of the encoder of the bert language model during the continuous training process, and when layers 0, 4 and 8 are reached, setting their gradient updates to stop, thereby freezing the parameter information of layers 0, 4 and 8;
and inputting the fused data into the frozen model for training to obtain the final training result.
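Layer freezing as described in claim 9 could look like the following sketch using the Hugging Face transformers API. The checkpoint name bert-base-chinese and the bare BertModel (rather than a task-specific head) are assumptions; only the frozen layer indices come from the claim.

```python
from transformers import BertModel

def freeze_selected_layers(model_name="bert-base-chinese", frozen_layers=(0, 4, 8)):
    """Stop gradient updates for selected encoder layers before continual training."""
    model = BertModel.from_pretrained(model_name)
    for index, layer in enumerate(model.encoder.layer):
        if index in frozen_layers:
            for parameter in layer.parameters():
                parameter.requires_grad = False  # freeze this layer's parameters
    return model
```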
10. A model training device for continuous learning based on medical named entity recognition, characterized by comprising:
a data acquisition module, configured to acquire medical text corpora from a plurality of data sources;
a medical knowledge graph building module, configured to construct a medical knowledge graph after processing the medical text corpus with a binary language statistical model;
a sentence extraction module, configured to extract sentences to be trained from the medical knowledge graph;
a data processing module, configured to input the sentences to be trained into a bert language model for continuous learning training, retain seed data during the training process, and fuse the retained seed data with new data, wherein any two of the sentences to be trained are extracted as sentences to be judged; the similarity between the sentences to be judged is calculated by cosine similarity to obtain a similarity calculation result; the sentences to be judged are screened according to the similarity calculation result and a preset similarity threshold to obtain retained sentences whose similarity calculation result is below the similarity threshold; all the sentences to be trained are calculated and screened in this way, and a retention threshold for the seed data is set: if the number of retained sentences finally obtained is less than or equal to the retention threshold, all of them are stored in a json file as the seed data; if the number of retained sentences finally obtained is greater than the retention threshold, a number of retained sentences equal to the retention threshold are randomly selected as the seed data and stored in the json file; and in the continuous training stage, a new batch of x data sets can be added;
and a model processing module, configured to freeze a preset number of layers and their parameter information in the bert language model, and to input the fused data into the processed bert language model to obtain a final training result.
11. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211294936.0A CN115374787B (en) | 2022-10-21 | 2022-10-21 | Model training method and device for continuous learning based on medical named entity recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115374787A CN115374787A (en) | 2022-11-22 |
CN115374787B (en) | 2023-01-31 |
Family
ID=84073748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211294936.0A Active CN115374787B (en) | 2022-10-21 | 2022-10-21 | Model training method and device for continuous learning based on medical named entity recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115374787B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118114718A (en) * | 2024-01-30 | 2024-05-31 | 中国电子投资控股有限公司 | Medical text information treatment method and system based on large model |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220114198A1 (en) * | 2020-09-22 | 2022-04-14 | Cognism Limited | System and method for entity disambiguation for customer relationship management |
CN114298050A (en) * | 2021-12-31 | 2022-04-08 | 天津开心生活科技有限公司 | Model training method, entity relation extraction method, device, medium and equipment |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111428044A (en) * | 2020-03-06 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes |
CN111563534A (en) * | 2020-04-09 | 2020-08-21 | 华南理工大学 | Task-oriented word embedding vector fusion method based on self-encoder |
Non-Patent Citations (2)
Title |
---|
A Search Based Approach to Entity Recognition: Magnetic and IISAS Team at ERD Challenge; Michal Laclavík et al.; ERD '14; 2014-07-11; pp. 63-68 *
Chinese Named Entity Recognition Based on Transfer Learning and BiLSTM-CRF (基于迁移学习和BiLSTM-CRF的中文命名实体识别); Wu Hui et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2019-06-30; Vol. 40, No. 6; pp. 1142-1147 *
Also Published As
Publication number | Publication date |
---|---|
CN115374787A (en) | 2022-11-22 |
Similar Documents
Publication | Title |
---|---|
CN107731269B (en) | Disease coding method and system based on original diagnosis data and medical record file data |
CN107705839B (en) | Disease automatic coding method and system |
CN109920501B (en) | Electronic medical record classification method and system based on convolutional neural network and active learning |
CN110335665B (en) | Image searching method and system applied to medical image auxiliary diagnosis analysis |
CN105760874B (en) | CT image processing system and its CT image processing method towards pneumoconiosis |
EP3567605A1 (en) | Structured report data from a medical text report |
CN108182972B (en) | Intelligent coding method and system for Chinese disease diagnosis based on word segmentation network |
CN107247881A (en) | A kind of multi-modal intelligent analysis method and system |
CN111540468A (en) | ICD automatic coding method and system for visualization of diagnosis reason |
CN115062165B (en) | Medical image diagnosis method and device based on film reading knowledge graph |
CN112507138B (en) | Method and device for constructing special disease knowledge map, medium and electronic equipment |
CN115374787B (en) | Model training method and device for continuous learning based on medical named entity recognition |
CN114912887B (en) | Clinical data input method and device based on electronic medical record |
CN112735544A (en) | Medical record data processing method and device and storage medium |
CN115841861A (en) | Similar medical record recommendation method and system |
CN114330267A (en) | Structural report template design method based on semantic association |
CN113658720A (en) | Method, apparatus, electronic device and storage medium for matching diagnostic name and ICD code |
CN118171653B (en) | Health physical examination text treatment method based on deep neural network |
WO2014130287A1 (en) | Method and system for propagating labels to patient encounter data |
CN110610766A (en) | Apparatus and storage medium for deriving probability of disease based on symptom feature weight |
CN117194604B (en) | Intelligent medical patient inquiry corpus construction method |
CN114238639A (en) | Construction method and device of medical term standardized framework and electronic equipment |
CN112071431B (en) | Clinical path automatic generation method and system based on deep learning and knowledge graph |
CN116737945B (en) | Mapping method for EMR knowledge map of patient |
CN116721699B (en) | Intelligent recommendation method based on tumor gene detection result |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |