CN114676258B

CN114676258B - Disease classification method based on symptom description text and not aiming at diagnosis

Info

Publication number: CN114676258B
Application number: CN202210354283.4A
Authority: CN
Inventors: 吴文峻; 辛治旻; 汪群博; 庄予彰
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2022-04-06
Filing date: 2022-04-06
Publication date: 2024-05-31
Anticipated expiration: 2042-04-06
Also published as: CN114676258A

Abstract

The invention provides a disease classification service method based on symptom description text and not aiming at diagnosis, belonging to the field of artificial intelligence. The method comprises the following steps: first, entities are extracted from description sentences of user symptoms in existing data, their types are labeled, and the results are stored as samples in a graph database; next, for each sample a generator produces and labels new samples of the same type, and a checker judges each new sample so that the usable ones are added to the labeled data set; then an entity recognition micro-service is constructed, which takes the symptom description text of an actual user as input and outputs the entity recognition result; a disease query micro-service is constructed, which queries the graph database for diseases related to the recognized entities, ranks them, and outputs them as candidate disease results; finally, a user interaction micro-service is constructed, which packages the entity recognition result, the candidate disease results, and historical health data and returns them to the user. The invention reduces operation and maintenance cost and makes subsequent updating and improvement of the data convenient.

Description

Disease classification method based on symptom description text and not aiming at diagnosis

Technical Field

The invention relates to the field of artificial intelligence, in particular to a disease classification method based on symptom description text and not aiming at diagnosis.

Background

As China's population ages, more and more people are paying attention to their own health; however, medical resources in China are unevenly distributed, and many patients must spend heavily just to determine which disease they have.

If a patient's disease can be inferred automatically from a symptom description text, the patient can be helped to choose which hospital or department to visit, which greatly improves the efficiency of treatment.

An automatic inquiry system combines medical domain knowledge with computer science and technology to diagnose a patient's disease. To realize automatic diagnosis, the computer must normalize disease and symptom data, the relationships between symptoms and diseases must be mined and analyzed, and those relationships must be modeled and expressed in a computer language.

The knowledge graph, an important branch of computer science, has been applied successfully and is widely accepted by researchers in natural language processing, information retrieval, data mining, artificial intelligence, big data, and other fields.

A knowledge graph uses nodes to represent entities and concepts and edges to represent relationships among nodes, thereby expressing the logical relationships within a body of knowledge. A medical knowledge graph built from a disease data set can formally describe the relationships between diseases and symptoms: upper-level symptoms in the graph express what diseases have in common, and lower-level symptoms express how they differ.

A diagnosis is reached by inducing, classifying, reasoning about, and judging various pieces of disease information in the human brain; this mental process resembles artificial intelligence and is likewise a black-box process. Recently, many approaches have attempted to describe this process with deep neural networks. However, training a deep neural network model requires a large amount of labeled data, which costs a great deal of time and labor.

A newer idea is to accumulate data gradually in practical use and to iterate and upgrade the model step by step. In practice, however, this is hindered by the tight coupling between the model and the knowledge graph and between the model and the data set, and by the high cost of iterative deployment and operation.

Disclosure of Invention

The invention discloses a disease classification method based on symptom description text and not aiming at diagnosis. Built on a micro-service framework, it combines an AC automaton with a deep model to recognize medical entities and introduces a data enhancement mechanism. This makes an intelligent natural-language disease identification method practical to implement, improves the accuracy of the identification results, reduces the amount of labeled data required, and effectively lowers deployment and operation cost, which is significant for deploying and applying the method on a cluster platform.

The disease classification method based on the symptom description text and not aiming at diagnosis comprises the following steps:

Step one, description sentences of user symptoms are extracted from existing data in the medical field, entities are extracted from the sentences, and the type of each entity is labeled; each sentence together with its entity-type labeling result forms a sample, which is stored in a NoSQL graph database;

entities include hospitals, departments, diseases, symptoms, medicines, and the like; each entity has its own attributes;

a sample is S = <X, Y>, i.e. the sample comprises a sentence X and the labeling result Y of each entity type.

Step two, each sample is passed through a generator to generate new samples, each consisting of a new sentence and the type label of each entity in that sentence;

the generator performs word replacement according to the type labels of the entities in the original sample and automatically generates new samples of the same type.

Step three, for each new sample S' generated from sample S, a checker containing availability rules is applied and the minimum edit distance between the two samples is computed to obtain a score for each new sample; the new samples are sorted in descending order of score and the first K are added to the labeled data set.

The value of K is chosen according to actual experimental data and is greater than or equal to 3.

Step four, an entity recognition micro-service is constructed; the symptom description text of an actual user is input into it, and it outputs the entity recognition result.

The entity recognition micro-service comprises an AC automaton and a deep model built on the BiLSTM+CRF architecture;

The AC automaton builds a Trie from all entities in the NoSQL graph database and then builds mismatch pointers on the Trie; it performs keyword matching on the user's preprocessed Chinese symptom description text and finds the medical entities appearing in the sentences, yielding recognition results that consist of entity position, name, and type;

The deep model consists of three layers in sequence. The first, an embedding layer, converts the description text entered by the user into a word vector matrix. The second, a BiLSTM layer, receives the word vector matrix from the embedding layer, concatenates the two sets of outputs from the forward and backward LSTMs, and passes the result in sequence through a dropout layer, a ReLU layer, and a flatten layer to obtain the output feature vector. The third, a CRF layer, receives the feature vector output by the BiLSTM layer and uses a CRF model to obtain the entity type label of each word in the user's description text, which yields the medical entity recognition result;

Finally, the entity recognition micro-service interface merges the recognition results obtained by the AC automaton and the deep model: identical results are merged directly; conflicting results are resolved by the confidence of the deep model, whose result is kept when its confidence is greater than or equal to a threshold; when the confidence is below the threshold, the result of the AC automaton is kept as the final result.

Step five, a disease query micro-service is constructed; according to the entity recognition result for the actual user, it queries the NoSQL graph database for diseases related to the entities, ranks them, and outputs them as candidate disease results;

The specific process is as follows:

First, the entity recognition results are taken as input, and the NoSQL graph database is queried using SQL query statements;

Then, for the diseases related to each entity in the query results, the corresponding symptom description sentences are extracted, and their similarity to the symptom description text entered by the actual user is calculated one by one.

Finally, the candidate disease names are returned in descending order of similarity.

Step six, a user interaction micro-service is constructed, which packages the user's entity recognition result, the candidate disease results, and the user's historical health data and returns them to the user.

The disease classification method based on symptom description text and not aiming at diagnosis has the advantages that:

(1) The invention enhances the data through the generator, and improves the use efficiency of the existing labeling data under the condition of limited labeling conditions.

(2) The invention recognizes medical entities by combining an AC automaton with a deep model and takes ambiguity in natural language into account, which improves the accuracy of medical entity recognition, the user experience, and the accuracy of disease diagnosis.

(3) The invention uses a micro-service architecture to reduce the coupling within the intelligent disease classification service; each functional micro-service is relatively independent and can be deployed and updated separately, which reduces operation and maintenance cost and makes it easier to add labeled data, update the entity recognition model, and improve the knowledge graph later.

Drawings

FIG. 1 is a flow chart of a non-diagnostic disease classification method based on symptom descriptive text in accordance with the present invention;

FIG. 2 is a schematic structural diagram of the BiLSTM+CRF model of the present invention;

FIG. 3 is a flow chart of a process taken by the entity recognition micro-service interface of the present invention on an input natural language text;

FIG. 4 is a diagram of the call relationship between all micro-services constructed in accordance with the present invention;

FIG. 5 is a flow chart of interactions between microservices of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples.

To address the problem of automatic disease diagnosis from a user's symptom description text, the invention provides a disease classification method built on a micro-service architecture. For a user's symptom description text, an AC automaton and a BiLSTM+CRF-based deep model perform entity recognition; based on the recognition result, related candidate diseases are obtained by querying a knowledge graph; the candidate diseases are then ranked by cosine similarity and finally returned to the user.

The disease classification method based on the symptom description text and not aiming at diagnosis is as shown in fig. 1, and comprises the following steps:

Step one, description sentences of user symptoms are extracted from existing data in the medical field, entities are extracted from the sentences, and the type of each entity is labeled; each sentence together with its entity-type labeling result forms a sample, which is stored in a NoSQL graph database to construct a knowledge graph;

entities include hospitals, departments, diseases, symptoms, medicines, and the like; each entity has its own attributes; for example, the attributes of a disease entity include its name, disease features, and so on;

The relationships between entities include relationships between departments, between hospitals and departments, between departments and diseases, and the like.

According to these entities and relationships, a knowledge graph of the medical field is constructed in a NoSQL database (such as Neo4j), and a general database query interface is exposed externally.
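The entity and relationship model of step one can be illustrated with a small in-memory stand-in for the graph database. This is only a sketch: the class `KnowledgeGraph`, its methods, the relation names (`symptom_of`, `treats`), and the sample entities are hypothetical and are not taken from the patent, which stores the graph in a NoSQL database such as Neo4j.

```python
from collections import defaultdict


class KnowledgeGraph:
    """In-memory stand-in for the graph database (illustration only)."""

    def __init__(self):
        self.nodes = {}                 # entity name -> {"type": ..., attributes}
        self.edges = defaultdict(set)   # (source name, relation) -> target names

    def add_entity(self, name, entity_type, **attributes):
        # Each entity has a type (disease, symptom, ...) and its own attributes.
        self.nodes[name] = {"type": entity_type, **attributes}

    def add_relation(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def related(self, source, relation):
        # Return the targets reachable from `source` over `relation`, sorted.
        return sorted(self.edges.get((source, relation), set()))


# Hypothetical sample data.
kg = KnowledgeGraph()
kg.add_entity("influenza", "disease", feature="acute respiratory infection")
kg.add_entity("fever", "symptom")
kg.add_entity("respiratory medicine", "department")
kg.add_relation("fever", "symptom_of", "influenza")
kg.add_relation("respiratory medicine", "treats", "influenza")

print(kg.related("fever", "symptom_of"))
```

A real deployment would replace this class with queries against the graph database; the sketch only shows which pieces of information a node and an edge carry.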

Step two, a labeling management micro-service is constructed; each sample is passed through a generator to generate new samples, each consisting of a new sentence and the type label of each entity in that sentence;

The micro-service acquires medical-domain text data in which named entities have been labeled, performs data enhancement on each piece of data, and exposes a data acquisition interface externally. Because training a deep model requires a large amount of labeled data, a data enhancement method is introduced: more labeled data is generated from the existing labeled data, and both the original and the enhanced data are provided to the model for learning, so that model performance improves further at the same manual labeling cost.

The data enhancement process comprises two steps: data generation and checking;

Data enhancement is performed by the generator, which produces more labeled data from the existing labeled data by applying word replacement rules to the original data. The generator takes as input the necessary parameters and data, including PWS (percentage of words to swap per augmented example, i.e. the proportion of characters replaced in each generated sample) and a Chinese synonym dictionary (used to look up replacement synonyms, comprising a semantic synonym dictionary and an embedding-space synonym dictionary). When a new sample is generated from a labeled sample, the labels of the new sample, i.e. the entity type of each word, are generated at the same time. Synonym replacement does not change the entity type of a word, so the entity type labels of the new sample can be derived automatically from the labels of the original sample and the record of word replacements.

For example, in sample S, sentence X is the five-character Chinese sentence meaning "I have some fever", and the entity type label Y is: O O O B-Sym E-Sym;

Here the first three characters ("I", "have", "some") are of category O, i.e. not medical entities; the fourth character is the beginning of a symptom (Begin-Symptom) and the fifth is the end of the symptom (End-Symptom), the two together forming the word for "fever";

In a new sample S' generated from sample S, sentence X' means "I have some dizziness", and the entity type label Y' is: O O O B-Sym E-Sym;
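The generation step can be sketched in Python as follows. Several things here are assumptions rather than the patent's implementation: the function names, a BIOES-style tag set (the text only shows `B-` and `E-` tags; `I-` and `S-` are assumed), PWS applied to spans rather than characters, and the Chinese strings, which are a rendering of the translated example.

```python
import random


def to_spans(chars, labels):
    """Group characters into (text, entity_type) spans; type is None for 'O'."""
    spans, i = [], 0
    while i < len(chars):
        if labels[i] == "O":
            spans.append((chars[i], None))
            i += 1
            continue
        etype = labels[i].split("-", 1)[1]
        j = i
        # Extend the span until an End or Single tag closes it.
        while j < len(chars) - 1 and not labels[j].startswith(("E-", "S-")):
            j += 1
        spans.append(("".join(chars[i:j + 1]), etype))
        i = j + 1
    return spans


def tag_span(text, etype):
    """Regenerate character-level labels for one span."""
    if etype is None:
        return ["O"] * len(text)
    if len(text) == 1:
        return ["S-" + etype]
    return ["B-" + etype] + ["I-" + etype] * (len(text) - 2) + ["E-" + etype]


def generate_sample(chars, labels, synonyms, pws=0.2, rng=random):
    """Replace some spans with dictionary entries and rebuild the labels."""
    spans = to_spans(chars, labels)
    candidates = [k for k, (text, _) in enumerate(spans) if text in synonyms]
    n_swap = min(len(candidates), max(1, round(pws * len(spans))))
    chosen = set(rng.sample(candidates, n_swap))
    new_chars, new_labels = [], []
    for k, (text, etype) in enumerate(spans):
        if k in chosen:
            text = rng.choice(synonyms[text])   # entity type stays the same
        new_chars.extend(text)
        new_labels.extend(tag_span(text, etype))
    return new_chars, new_labels


chars = list("我有些发热")                      # assumed original of the example
labels = ["O", "O", "O", "B-Sym", "E-Sym"]
new_chars, new_labels = generate_sample(chars, labels, {"发热": ["头晕"]})
print("".join(new_chars), new_labels)
```

Because the labels are rebuilt from the span type rather than copied position by position, the sketch also covers replacements whose length differs from the original word.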

Step three, a checker containing availability rules checks the sentences of the generated new samples, and the accepted samples are stored in the labeled data set;

The checker includes general rules, namely duplicate detection (the same word may not be replaced more than once in a generated sample) and stop-word detection (stop words may not be replaced). It also includes availability rules specific to the medical scenario, namely medical proper-noun detection (certain medical proper nouns in the original sentence may not be replaced). To use the data more efficiently, the generated sentences are scored and ranked according to the availability rules, and for each original sample the top K generated samples are added to the labeled data set.

For the new samples S' generated from sample S, the checker applies the availability rules to each one in turn and assigns a score of 0 to every unusable new sample;

For the remaining usable new samples, a score is computed from the minimum edit distance between sentence X of the original sample S and sentence X' of the generated sample S'; the samples are sorted by score and the first K are added to the labeled data set.

The value of K is set to 3 according to actual experimental data.

The minimum edit distance is the minimum number of edits required to transform string A into string B, where each edit inserts, deletes, or substitutes exactly one character.

The score is calculated as:

Score(S, S') = MED(S, S') * Useable(S, S')

MED(S, S') is the minimum edit distance between sample S and the new sample S'; Useable(S, S') takes the value 1 when the new sample S' passes all availability checks and 0 otherwise.

Finally, a data acquisition interface is built and exposed externally as a RESTful API in the form of a micro-service, which makes subsequent calls convenient; the interface is written with the Flask framework, and external callers can obtain a specified number of <original sample, enhanced sample> pairs with an HTTP GET request.

Step four, an entity recognition micro-service is constructed to implement named entity recognition in the medical field; the symptom description text of an actual user is input into the micro-service, and it outputs the entity recognition result.

Entity keyword retrieval guarantees accurate recognition of medical entities whose names are known, but it is easily misled by ambiguity in natural language. A deep named entity recognition model can exploit the contextual semantics of a sentence to recognize entities effectively, but it depends on a large amount of training data and performs poorly when the data set is small.

To improve the accuracy of medical entity recognition, the invention combines entity keyword retrieval with a deep named entity recognition model; the proposed entity recognition micro-service comprises an AC automaton and a deep model built on the BiLSTM+CRF architecture;

The AC automaton (Aho-Corasick automaton), a multi-pattern matching algorithm, builds a Trie from all entities in the NoSQL graph database and then builds mismatch pointers on the Trie;

The Trie uses the string of each entity name as the key from which nodes are constructed; the key of each node is determined by the node's position in the tree, and the value of a node points to the corresponding entity of the knowledge graph. When matching fails at the current node, matching continues from the node indicated by the mismatch pointer. Using mismatch pointers avoids the time wasted by backtracking.

The node structure in the Trie is as follows:

class TrieNode {
    char c;           // character of the current node
    Entity entity;    // entity in the knowledge base that this node points to
    TrieNode[] son;   // array of pointers to all child nodes
}
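A minimal Aho-Corasick sketch in Python shows how the Trie and the mismatch pointers work together. The names `build_automaton` and `find_entities`, the `fail` field, and the English sample entities are assumptions for illustration; the patent builds the Trie from the entity names in its graph database.

```python
from collections import deque


class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.fail = None     # mismatch pointer
        self.entities = []   # (name, type) of every entity ending at this node


def build_automaton(entities):
    """Build the Trie from (name, type) pairs, then add mismatch pointers by BFS."""
    root = TrieNode()
    for name, etype in entities:
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.entities.append((name, etype))
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail
            while f is not None and ch not in f.children:
                f = f.fail
            child.fail = f.children[ch] if f is not None else root
            # Inherit matches that end at the mismatch target (proper suffixes).
            child.entities = child.entities + child.fail.entities
            queue.append(child)
    return root


def find_entities(root, text):
    """Return (start position, name, type) for every entity occurrence in text."""
    node, hits = root, []
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail          # follow mismatch pointers, no backtracking
        node = node.children.get(ch, root)
        for name, etype in node.entities:
            hits.append((i - len(name) + 1, name, etype))
    return hits


automaton = build_automaton([("fever", "symptom"), ("hay fever", "disease")])
print(find_entities(automaton, "i have hay fever"))
```

The text is scanned once from left to right; overlapping entities such as "fever" inside "hay fever" are both reported, which is one source of the conflicts that the later merging step has to resolve.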

The BiLSTM+CRF (Bi-directional Long Short-Term Memory + Conditional Random Fields) model, as shown in FIG. 2, includes the following layers. The first layer is an embedding layer, which converts the input description text into a word vector matrix;

A maximum sentence length max_length is set for the word vector matrix: sentences longer than max_length are truncated to their first max_length words, and sentences shorter than max_length are padded with meaningless words, so that all word vector matrices have the same size.
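The truncation and padding rule can be sketched in a few lines. The function name and the padding token `<PAD>` are assumptions; the patent only says that short sentences are completed with meaningless words.

```python
def pad_or_truncate(tokens, max_length, pad_token="<PAD>"):
    """Return exactly max_length tokens: keep the first ones, pad the rest."""
    clipped = tokens[:max_length]
    return clipped + [pad_token] * (max_length - len(clipped))


print(pad_or_truncate(["I", "have", "a", "fever"], 6))
print(pad_or_truncate(["I", "have", "a", "fever"], 3))
```

Applying this to every sentence before the embedding lookup gives all inputs the same length, so the word vector matrices share one shape.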

The second layer is a BiLSTM layer, which receives the word vector matrix from the embedding layer, concatenates the two sets of outputs from the forward and backward LSTMs, and passes the result in sequence through a dropout layer (which improves robustness by randomly discarding part of the parameters), a ReLU layer (which uses the ReLU function as the activation function to increase the fitting capacity of the model), and a flatten layer (which integrates the multi-dimensional outputs into a one-dimensional feature vector) to obtain the output feature vector.

ReLU(x) = max(0, x)

The third layer is a CRF layer, which receives the feature vector output by the BiLSTM layer and uses a CRF model to obtain the final prediction result.

The final layer requires a loss function and an objective function: the log-likelihood of the label sequence output by the CRF model is used as the loss function, and the negative of the mean loss is used as the objective function. This design follows the objective function of logistic regression; it reduces computational complexity and can reduce the time and computational cost of model training.

Finally, the BiLSTM+CRF model is trained on the labeled data set obtained through the data acquisition interface: forward propagation is performed according to the objective function and loss function, the parameters are updated by back propagation, and the cycle is repeated for several rounds until the model converges.

The two models are combined to build the entity recognition micro-service interface, which is written with the Flask framework and serves HTTP GET requests; the processing flow inside the interface is shown in FIG. 3. The Chinese symptom description text of the actual user is first preprocessed;

preprocessing comprises: converting half-width characters to full-width characters, Chinese numerals to Arabic numerals, uppercase letters to lowercase letters, and traditional Chinese to simplified Chinese;
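Two of the four preprocessing steps can be sketched with the standard library alone. This is a partial illustration under stated assumptions: the function names and the order of the steps are not given in the patent, and the conversion of Chinese numerals and of traditional to simplified characters is omitted because it needs a mapping table or an external library.

```python
def to_full_width(text):
    """Map half-width ASCII characters to their full-width Unicode forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")              # ideographic space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))    # '!'..'~' -> full-width block
        else:
            out.append(ch)                    # leave all other characters as-is
    return "".join(out)


def preprocess(text):
    """Lowercase letters first, then widen half-width characters."""
    return to_full_width(text.lower())


print(preprocess("Fever 39C"))
```

Whatever normalization is chosen, the entity names loaded into the AC automaton have to be normalized the same way, otherwise keyword matching on the preprocessed text will miss them.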

Then, the AC automaton performs keyword matching on the preprocessed symptom description text and finds the medical entities appearing in the sentence, yielding recognition results that consist of the position, name, and type of each entity;

At the same time, the deep model passes the Chinese symptom description text of the actual user through the embedding layer, the BiLSTM layer, and the CRF layer in sequence to obtain the entity type label of each word in the sentence, and from these labels the medical entity recognition result, which likewise consists of entity position, name, and type;

Finally, the entity recognition micro-service interface merges the recognition results obtained by the AC automaton and the deep model: identical results are merged directly; for entities whose results conflict, the decision is made according to the confidence output by the deep model, whose result is kept when its confidence is greater than or equal to a threshold; when the confidence is below the threshold, the result of the AC automaton is kept as the final result.
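The merging rule can be sketched as follows. The data layout is an assumption: AC automaton results are `(start, name, type)` tuples, deep-model results carry an extra confidence value, two results are treated as conflicting when their character ranges overlap but the tuples differ, and the threshold value 0.5 is purely illustrative since the patent does not state one.

```python
def overlaps(a, b):
    """True when two results cover at least one common character position."""
    return a[0] < b[0] + len(b[1]) and b[0] < a[0] + len(a[1])


def merge_results(ac_hits, model_hits, threshold=0.5):
    merged, used_ac = [], set()
    for hit in model_hits:
        start, name, etype, confidence = hit
        key = (start, name, etype)
        same = [k for k, ac in enumerate(ac_hits) if ac == key]
        clashing = [k for k, ac in enumerate(ac_hits)
                    if ac != key and overlaps(ac, hit)]
        used_ac.update(same)                 # identical results merge directly
        if not clashing or confidence >= threshold:
            merged.append(key)               # keep the deep model's result
            used_ac.update(clashing)         # overruled AC results are dropped
        # otherwise the clashing AC results are kept by the loop below
    for k, ac in enumerate(ac_hits):
        if k not in used_ac and ac not in merged:
            merged.append(ac)
    return sorted(merged)


ac = [(0, "cold", "disease"), (12, "fever", "symptom")]
model = [(0, "cold", "symptom", 0.3), (12, "fever", "symptom", 0.9)]
print(merge_results(ac, model))
```

In the usage example the deep model is unsure about "cold" (0.3 is below the threshold), so the AC automaton's type is kept, while the agreed "fever" result appears once.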

For example, for sentence X in sample S, the input is X = {X1, X2, X3, X4, X5}, the five characters of the sentence meaning "I have some fever", and the output is Y = {Y1, Y2, Y3, Y4, Y5} = {O, O, O, B-Sym, E-Sym}.
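Turning the per-character labels into an entity recognition result (position, name, type) can be sketched as below. The function name and the handling of `I-` and `S-` tags are assumptions; the example in the text only shows `O`, `B-Sym`, and `E-Sym`.

```python
def decode_entities(chars, labels):
    """Collect (start position, name, type) from BIOES-style character labels."""
    entities, start = [], None
    for i, label in enumerate(labels):
        if label == "O":
            start = None
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "S":                       # single-character entity
            entities.append((i, chars[i], etype))
            start = None
        elif prefix == "B":                     # entity begins here
            start = i
        elif prefix == "E" and start is not None:
            entities.append((start, "".join(chars[start:i + 1]), etype))
            start = None
        # "I-" labels continue the current entity and need no action
    return entities


# Stand-in characters for the five-character example sentence.
chars = ["x1", "x2", "x3", "x4", "x5"]
labels = ["O", "O", "O", "B-Sym", "E-Sym"]
print(decode_entities(chars, labels))
```

The sketch does not verify that the begin and end tags carry the same entity type; a production decoder would likely reject or repair such inconsistent sequences.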

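Step five, a disease query micro-service is constructed; according to the entity recognition result for the actual user, it queries the NoSQL graph database for diseases related to the entities, ranks them, and outputs them as candidate disease results. The specific process is as follows:

First, the entity recognition results are taken as input, and the NoSQL graph database is queried using SQL query statements to find the diseases related to each entity;

then, the symptom description sentences corresponding to the diseases in the query result are extracted, and their similarity to the symptom description text entered by the actual user is calculated one by one.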

The similarity calculation method comprises the following steps:

for the two input texts, the deep model is used to compute the mean of the word embeddings of all words in each text, giving two vectors, denoted here v1 and v2;

Then the cosine similarity S_cos between the two vectors is calculated with the formula:

S_cos = (v1 · v2) / (||v1|| * ||v2||)

the larger the result, the higher the similarity.

The word embeddings are the output of the embedding layer of the BiLSTM+CRF model.
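The similarity ranking can be sketched with plain Python lists. The function names, the toy two-dimensional embeddings, and the disease names are hypothetical; in the patent the embeddings come from the embedding layer of the trained model.

```python
import math


def mean_embedding(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have one (zero vector if none)."""
    total, count = [0.0] * dim, 0
    for token in tokens:
        if token in embeddings:
            total = [x + y for x, y in zip(total, embeddings[token])]
            count += 1
    return [x / count for x in total] if count else total


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0   # define similarity with a zero vector as 0


def rank_diseases(user_tokens, disease_descriptions, embeddings, dim):
    """Return disease names sorted from highest to lowest similarity."""
    query = mean_embedding(user_tokens, embeddings, dim)
    scored = [
        (cosine_similarity(query, mean_embedding(tokens, embeddings, dim)), name)
        for name, tokens in disease_descriptions.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy data for illustration only.
embeddings = {"fever": [1.0, 0.0], "cough": [0.0, 1.0], "rash": [-1.0, 0.0]}
descriptions = {
    "disease A": ["fever", "cough"],
    "disease B": ["rash"],
    "disease C": ["fever"],
}
print(rank_diseases(["fever"], descriptions, embeddings, dim=2))
```

Averaging embeddings ignores word order, which keeps the ranking cheap to compute for every candidate disease returned by the graph query.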

Finally, the calling framework is written with the Flask framework, and the symptom description sentences are returned in descending order of similarity.

Step six, a user interaction micro-service is constructed, which packages the user's entity recognition result, the candidate disease results, and the user's historical health data and returns them to the user.

To keep the micro-service lightweight, the server does not store the feature data produced by the user during multi-round question answering; instead, on each call it merges the newly recognized medical entities into the history data passed in by the caller, and attaches the feature data extracted by the current model to the response.

The feature data may include the extracted user symptoms, the user's historical health data extracted from the questions and answers, and the like. The flow is shown in FIG. 5.

The specific process is as follows:

First, the input data is parsed, including the history features and the natural language text entered by the user in the current round; then the entity recognition micro-service is called with the natural language text as input to obtain the medical entities in the dialogue.

Next, the newly recognized medical entities are merged with the history data, and the disease query micro-service is called with the merged result as input to obtain a group of candidate diseases ranked from high to low by relevance.

Finally, the user feature information extracted in the current call (i.e. the recognized medical entities) is packaged as the history information generated in this round, together with the output candidate diseases, and returned to the caller of the service.

The entities include user symptoms, historical health data, and the like.
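The stateless handling described above can be sketched as a single function. The payload keys (`history`, `text`), the response keys, and the two callables standing in for the entity recognition and disease query micro-services are all hypothetical; in the patent these would be RESTful calls to the other micro-services.

```python
def handle_request(payload, recognize_entities, query_diseases):
    """Process one round without keeping any per-user state on the server."""
    history = list(payload.get("history", []))   # passed in by the caller
    new_entities = recognize_entities(payload["text"])
    # Merge newly recognized entities into the history, skipping duplicates.
    merged = history + [e for e in new_entities if e not in history]
    candidates = query_diseases(merged)
    return {
        "entities": new_entities,    # recognized in this round
        "history": merged,           # caller sends this back next round
        "candidates": candidates,    # ranked candidate diseases
    }


# Stub services for illustration only.
def fake_recognize(text):
    return [word for word in ("fever", "cough") if word in text]


def fake_query(entities):
    return ["disease A"] if "fever" in entities else []


first = handle_request({"text": "i have a fever"}, fake_recognize, fake_query)
second = handle_request(
    {"text": "and a cough", "history": first["history"]},
    fake_recognize,
    fake_query,
)
print(second)
```

Because the caller carries the history between rounds, any replica of the micro-service can serve any request, which fits the containerized deployment described next.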

The invention uses a micro-service architecture to deploy the micro-services that make up the intelligent disease classification service. First, each constructed micro-service is packaged into an image with Docker; the containers (pods) running these images are deployed and maintained with the Kubernetes (K8s) container orchestration engine; YAML configuration files are written according to K8s rules and imported into the K8s cluster for automated deployment of the system;

The configuration files specify, for each micro-service, information such as the number of pods, service host, port number, access permissions, and mounted storage. Chart.yaml mainly configures the name, description, and version number of the service; values.yaml configures the number of pods, the image used, the service host, port number, access permissions, mounted storage, node selection, and similar information.

As shown in fig. 4, the present invention has 5 micro services, and the responsible functions are respectively:

1) User interaction micro-service: serves as the external interface of the whole system; it receives the user's input, calls the other modules to obtain possible disease information, and outputs that information to the user.

2) Entity recognition micro-service: recognizes medical entities; it uses the trained machine learning model to label the medical entities in the input natural language text and outputs them.

3) Annotation management micro-service: responsible for storing, retrieving, and enhancing labeled data, and provides the labeled data required to train the model.

4) Disease query micro-service: queries the knowledge graph for diseases according to the entity recognition result, scores and ranks the possible diseases, and provides a query interface.

5) Knowledge graph micro-service: stores the medical domain knowledge required by the system in a NoSQL database and provides a data retrieval interface externally.

The micro-services interact through RESTful APIs, and the system as a whole is also exposed as a micro-service with an API for users to call. The user calls the system through the API and passes in parameters (a natural language description and optional historical diagnosis information). Micro-service 1) parses the parameters and calls micro-service 2) to obtain the named entity recognition result; it then calls micro-service 4) with that result to query related diseases, which are scored and ranked, and the result is output to the user. Training and updating of the model in 2) rely on the data in micro-service 3); the query and ranking in 4) rely on the knowledge graph in 5).

Because a micro-service architecture is used, each functionally independent module can be deployed and updated separately. For later updates such as adding labeled data, updating the entity recognition model, improving the knowledge graph, or optimizing system replies, the corresponding module can be updated on its own, provided the API exposed by that module remains unchanged.

Claims

1. A disease classification method based on symptom description text and not aiming at diagnosis, which is characterized by comprising the following specific steps:

Firstly, extracting a description statement of a user symptom from the existing related data in the medical field, extracting an entity from the statement, marking each entity type, forming a sample by the statement and the entity type marking result, and storing the sample into a NoSQL graphic database;

The entities include hospitals, departments, diseases, symptoms and medicines; sample s= < X, Y >; namely, the sample comprises a sentence X and a labeling result Y of each entity type;

Then, each sample is respectively passed through a generator to generate a new sample, and a new sentence and the type label of each entity in the sentence are correspondingly generated; for each new sample S' generated by the sample S, calculating the minimum editing distance of the two samples one by using a checker containing an availability rule to obtain the score of each new sample, and selecting the first K new samples to be added into the marked data set after descending arrangement;

the K value is obtained according to actual experimental data; the generator is used for carrying out word replacement according to the type label of each entity in the original sample and automatically generating a new sample with the same type;

then, constructing an entity identification micro-service, training by using a labeling data set, inputting a symptom description text of an actual user into the entity identification micro-service, and outputting an identification result of the entity;

The entity identification micro-service comprises an AC automaton and a deep model built on the BiLSTM+CRF architecture;

finally, merging entity identification results respectively obtained by the AC automaton and the depth model through the entity identification micro-service interface; direct merging of the same recognition results; judging whether the recognition results conflict or not according to the confidence coefficient of the depth model, and if the confidence coefficient of the depth model is greater than or equal to a threshold value, reserving the result of the depth model; when the confidence is lower than the threshold, reserving the result of the AC automaton as a final result;

Constructing a disease inquiry micro-service, inquiring and sequencing the diseases related to the entities in a NoSQL graphic database according to the entity identification result of the actual user, and outputting the disease as a candidate disease result;

Finally, constructing a user interaction micro-service, identifying the entity of the user, candidate disease results, and packaging historical health data of the user and returning the historical health data to the user;

The specific acquisition process of the candidate disease results is as follows:

Then, respectively extracting symptom description statement results corresponding to the related diseases of each entity in the query results, and calculating similarity with symptom description texts input by actual users one by one;