Disclosure of Invention
In view of the above, the invention provides a system for intelligently identifying infectious diseases based on legal diagnostic standards. The system addresses the missed reports and false reports that can arise from traditional manual comparison and screening of infectious diseases, and assists medical staff in the auxiliary diagnosis of infectious diseases.
In a first aspect of the present invention, a system for intelligent identification of infectious diseases based on legal diagnostic criteria is presented, the system comprising:
an index construction module: used for sorting out and constructing the legal infectious disease case classifications and the specific indexes of the diagnostic standards according to the legal infectious disease diagnostic standards;
the information extraction module: used for extracting the main feature information contained in each infectious disease case classification according to the legal infectious disease case classifications and the specific indexes of the diagnostic standards;
a standard database: used for establishing the association among the diagnostic standards of the various infectious diseases, the different case classification types of the same infectious disease, and the corresponding main feature information;
a first text mining module: used for performing text mining on the main feature information of the standard database, performing weight calculation and first core feature word extraction, and constructing a vector space model to obtain a first feature vector set corresponding to the main feature information of each infectious disease case classification;
the second text mining module: used for constructing a feature selection model based on conditional mutual information, performing text mining on the main feature information of the standard database using a TF-IDF function, performing weight calculation according to the degree of correlation between the entries of the main feature information and the case classifications, selecting second core feature words, and constructing a vector space model to obtain a second feature vector set corresponding to the main feature information of each infectious disease case classification;
a feature matching module: used for calculating the cosine similarity between the text to be classified and each element of the first feature vector set; calculating the mutual information correlation between the text to be classified and each element of the second feature vector set; and classifying the text to be classified into case classifications according to the cosine similarity and the mutual information correlation.
Preferably, in the first text mining module, a TF-IDF function is used to calculate the entry weight of the main feature information:
let $D$ be a document set comprising $m$ documents and $D_i$ be the feature vector of the $i$-th document; then $D = \{D_1, D_2, \dots, D_m\}$ and $D_i = (d_{i1}, d_{i2}, \dots, d_{in})$, $i = 1, 2, \dots, m$, where $d_{ij}$ is the TF-IDF weight of the $j$-th entry $t_j$ in document $D_i$,
where $i = 1, 2, \dots, m$; $j = 1, 2, \dots, n$; $N$ is the total number of documents in the document database; and $N_j$ is the number of documents in the document database that contain entry $t_j$.
Preferably, in the second text mining module, performing weight calculation according to the correlation between the entries of the main feature information and the case classifications and selecting the second core feature words specifically includes:
calculating the mutual information correlation between each entry of the main feature information contained in a case classification and that case classification, the correlation being computed from the following quantities:
A is the number of documents in case classification category c in which entry t appears; B is the number of documents in categories other than c in which entry t appears; C is the number of documents in category c in which entry t does not appear; N is the total number of documents in all categories; if the number of categories is m, each entry obtains m correlation values;
and taking the average of the m values as the weight of each entry, sorting the entries from low to high by word frequency, removing words that appear in only a single category and whose word frequency is below a preset word frequency threshold, sorting the remaining entries from high to low by weight, and taking the words whose weight is above a preset weight threshold as the second core feature words.
Preferably, in the feature matching module, the case classification of the text to be classified according to the cosine similarity and the mutual information correlation specifically includes:
for each case classification category, taking the larger of the cosine similarity and the mutual information correlation as the output probability value of that category, setting a first probability threshold, and taking the categories whose probability value exceeds the first probability threshold as the recognition recommendation result.
Preferably, in the feature matching module, the case classification of the text to be classified according to the cosine similarity and the mutual information correlation specifically includes:
for each case classification category, taking the weighted sum of the cosine similarity and the mutual information correlation as the output probability value of that category, setting a second probability threshold, and taking the categories whose probability value exceeds the second probability threshold as the recognition recommendation result.
In a second aspect of the present invention, an electronic device is disclosed, comprising: at least one processor, at least one memory, a communication interface, and a bus;
the processor, the memory and the communication interface complete mutual communication through the bus;
the memory stores program instructions executable by the processor which are invoked by the processor to implement the system according to the first aspect of the invention.
In a third aspect of the invention, a computer-readable storage medium is disclosed, which stores computer instructions for causing a computer to implement the system of the first aspect of the invention.
Compared with the prior art, the invention has the following beneficial effects:
1) based on the current legal diagnostic standards for infectious diseases, the invention establishes a standard database of the associations among the diagnostic standards of different infectious diseases, the different case classification types of the same infectious disease, and the corresponding main feature information; the standard database provides a comprehensive standard feature information base for different infectious diseases and for the different case classification types of the same infectious disease, and provides a basis for the auxiliary diagnosis and accurate identification of various infectious diseases;
2) based on the standard database, the vector space model is applied to feature extraction from infectious disease diagnostic standards, which effectively addresses the two key problems of type classification and feature information extraction, greatly reduces the loss of feature information, and improves the accuracy of intelligent identification and diagnosis;
3) the invention performs intelligent identification through cosine similarity, mutual information correlation, and their combination; cross-comparison across these approaches further improves diagnostic accuracy, provides medical staff with reliable auxiliary diagnosis results, and reduces missed reports and false reports.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Referring to fig. 1, the present invention provides a system for intelligently identifying infectious diseases based on legal diagnostic criteria, the system comprising:
an index construction module: used for sorting out and constructing the legal infectious disease case classifications and the specific indexes of the diagnostic standards according to the legal infectious disease diagnostic standards;
the information extraction module: used for extracting the main feature information contained in each infectious disease case classification according to the legal infectious disease case classifications and the specific indexes of the diagnostic standards;
a standard database: used for establishing the association among the diagnostic standards of the various infectious diseases, the different case classification types of the same infectious disease, and the corresponding main feature information;
a first text mining module: text mining is carried out on the main characteristic information of the standard database, weight calculation and first core characteristic word extraction are carried out by adopting TF-IDF, a vector space model is constructed, and a first characteristic vector set corresponding to the main characteristic information of each infectious disease case classification is obtained;
the second text mining module: constructing a feature selection model based on conditional mutual information, performing text mining on main feature information of the standard database, performing weight calculation according to the correlation between entries of the main feature information and case classifications, selecting a second core feature word, constructing a vector space model, and obtaining a second feature vector set corresponding to the main feature information of each infectious disease case classification;
a feature matching module: respectively calculating cosine similarity between the text to be classified and elements in the first feature vector set; respectively calculating the mutual information correlation between the text to be classified and the second feature vector set element; and classifying cases of the texts to be classified according to the cosine similarity and the mutual information correlation.
Embodiments of the invention are further described below in connection with specific classes of infectious diseases.
The index construction module sorts out and constructs the legal infectious disease case classifications and the specific indexes of the diagnostic standards. Accurate classification of the case types in the legal infectious disease diagnostic standards determines whether the identification system can quickly and accurately retrieve the features of each infectious disease, and thereby improves matching speed. Take the (trial) diagnostic standard for the legal infectious disease infectious atypical pneumonia as an example:
1. Epidemiological history. Two points are noted here: 1.1 a history of close contact with a patient, or being one of a cluster of infected patients, or clear evidence of having infected others; 1.2 within two weeks before onset, having visited or lived in an area where infectious atypical pneumonia patients were reported and secondary infections occurred;
2. Symptoms and signs: acute onset, with fever as the first symptom and body temperature generally above 38 °C, occasionally with chills; may be accompanied by headache, joint pain, muscle pain, fatigue, and diarrhea; usually without upper respiratory catarrhal symptoms; cough may be present, mostly dry cough with little sputum, occasionally with blood-streaked sputum; chest tightness may occur, and severe cases show accelerated breathing, shortness of breath, or obvious respiratory distress. Lung signs are not obvious; in some patients a few moist rales may be heard, or signs of pulmonary consolidation are present. Note: a few patients do not have fever as the first symptom, especially those with a recent history of surgery or with underlying diseases;
3. Laboratory examination results: the peripheral blood leukocyte count is generally not elevated, or is decreased, and the lymphocyte count is often decreased;
4. Chest X-ray examination results: the lungs show flaky or patchy infiltrative shadows or reticular changes of varying degree; some patients progress rapidly to large flaky shadows; the changes are often multilobar or bilateral, the shadows resolve slowly, and the lung shadows may be inconsistent with the symptoms and signs. If the examination result is negative, the examination should be repeated after 1-2 days;
5. Antibacterial drugs have no obvious therapeutic effect.
From the above legal diagnostic standard, the case classification types of the (trial) diagnostic standard for infectious atypical pneumonia can be determined:
1) Suspected diagnosis standard: meeting items 1.1+2+3, or 1.2+2+4, or 2+3+4 above;
2) Clinical diagnosis standard: meeting items 1.1+2+4 or more, or 1.2+2+4+5, or 1.2+2+3+4 above;
3) Medical observation diagnosis standard: meeting items 1.2+2+3 above;
4) Differential diagnosis: respiratory diseases with similar clinical manifestations must be excluded clinically, such as upper respiratory infection, influenza, bacterial or fungal pneumonia, AIDS complicated with lung infection, legionnaires' disease, tuberculosis, epidemic hemorrhagic fever, lung tumor, noninfectious interstitial disease, pulmonary edema, pulmonary atelectasis, pulmonary embolism, pulmonary eosinophilic infiltration, pulmonary vasculitis and the like;
5) Diagnosis standard for severe atypical pneumonia: severe "atypical pneumonia" can be diagnosed when any one of the following criteria is met: A. dyspnea, with a respiratory rate >30 breaths/minute; B. hypoxemia, with arterial partial pressure of oxygen PaO2 <70 mmHg or pulse oximetry oxygen saturation SpO2 <93% under oxygen inhalation at 3-5 L/min, or an existing diagnosis of acute lung injury (ALI) or acute respiratory distress syndrome (ARDS); C. multilobar lesions with a lesion range exceeding 1/3, or chest X-ray showing >50% lesion progression within 48 hours; D. shock or multiple organ dysfunction syndrome (MODS); E. severe underlying disease, or co-infection with other pathogens, or age >50 years.
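The item combinations listed for the suspected, clinical, and medical observation standards can be read as sets of criterion numbers. The following is a minimal sketch of that reading, not the system's actual implementation; the names `SARS_RULES` and `matching_classifications` are hypothetical, and the subset test is an assumed interpretation of "meeting items ... or more".

```python
# Criterion identifiers follow the numbering of the quoted standard.
# Each classification lists alternative combinations; meeting any one suffices.
SARS_RULES = {
    "suspected": [{"1.1", "2", "3"}, {"1.2", "2", "4"}, {"2", "3", "4"}],
    "clinical": [{"1.1", "2", "4"}, {"1.2", "2", "4", "5"}, {"1.2", "2", "3", "4"}],
    "medical observation": [{"1.2", "2", "3"}],
}


def matching_classifications(satisfied, rules=SARS_RULES):
    """Return the classifications for which at least one combination is fully met.

    `satisfied` is the collection of criterion numbers a case fulfils.
    A combination counts as met when it is a subset of `satisfied`
    (assumed reading of "items ... or more").
    """
    satisfied = set(satisfied)
    return [
        name
        for name, combos in rules.items()
        if any(combo <= satisfied for combo in combos)
    ]


# A case meeting items 1.2, 2 and 3 only matches the medical observation standard.
print(matching_classifications({"1.2", "2", "3"}))
```

Differential diagnosis and the severe-case criteria are left out of the sketch because they are exclusion and any-one-of rules rather than item combinations.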
After the legal infectious disease case classifications and the specific indexes of the diagnostic standards are determined, the information extraction module extracts the main feature information contained in each infectious disease case classification, which serves as the core feature information for identifying or distinguishing the different case classification standards of the same infectious disease.
The core feature information of each infectious disease case classification serves as the most essential detail for identifying or distinguishing infectious diseases, and plays an important role in the identification system. For example, human avian influenza can be diagnosed, after other diseases are excluded, on the basis of epidemiological history, clinical manifestations, and laboratory test results. The main feature information contained in the case classifications of human infection with highly pathogenic avian influenza is then: 1. medical observation cases: an epidemiological history, with clinical manifestations appearing within 1 week; or a history of close contact with a human avian influenza patient, with clinical manifestations appearing within 1 week; 2. suspected cases: respiratory secretion specimens of the patient test positive for antigen using influenza A virus and H subtype monoclonal antibodies; 3. confirmed cases: an epidemiological history and clinical manifestations, with a specific virus isolated from the patient's respiratory secretion specimen or avian influenza H subtype virus genes detected by RT-PCR, and a 4-fold or greater rise in anti-avian-influenza-virus antibody titer between paired sera from the early stage of onset and the convalescent period.
As another example, the core feature information of the legal cholera diagnostic standard includes: 1. feature information of the suspected cholera diagnosis standard: a. a first case with typical clinical symptoms, such as severe diarrhea, watery stool (yellow water-like, clear water-like, rice-water-like, or blood-water-like) accompanied by vomiting, and rapid onset of severe dehydration, circulatory failure, and muscle spasm (especially of the gastrocnemius), before etiological examination has confirmed the diagnosis; b. during a cholera epidemic, a definite contact history (such as sharing meals, living together, or caregiving) followed by diarrhea and vomiting for which no other cause can be found. Meeting either item gives a diagnosis of suspected cholera; 2. feature information of the confirmed diagnosis standard: a. diarrhea symptoms, with stool culture positive for Vibrio cholerae O1 or O139; b. typical cholera symptoms (see 1a) in an epidemic area during a cholera epidemic, with stool culture negative for Vibrio cholerae O1 and O139 but no other cause found; c. diarrhea symptoms in an epidemic area during the epidemic period, with paired serum antibody titers showing a rise of 4-fold or more in the serum agglutination test or 8-fold or more in the vibriocidal antibody assay; d. during epidemic source investigation, diarrhea symptoms within 5 days before or after the first stool culture that detects Vibrio cholerae O1 or O139. Clinical diagnosis: meets b; confirmed case: meets a, c, or d.
A standard database is then established from the associations among the diagnostic standards of the various infectious diseases, the different case classification types of the same infectious disease, and the corresponding main feature information. The standard database is comprehensive and standardized, and serves as the standard feature information base for the auxiliary diagnosis and identification of infectious diseases.
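One way to picture the association held by the standard database is a lookup keyed by disease and case classification type. The sketch below is illustrative only: the class names `CaseRecord` and `StandardDatabase`, the field names, and the shortened feature strings are assumptions, not structures described in the source.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    disease: str                 # e.g. "cholera"
    classification: str          # e.g. "confirmed case"
    criteria: str                # which items of the standard must be met
    features: list = field(default_factory=list)  # main feature information


class StandardDatabase:
    """Associates (disease, case classification) with criteria and features."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        self._records[(record.disease, record.classification)] = record

    def classifications(self, disease):
        # All case classification types stored for one disease, in insertion order.
        return [c for d, c in self._records if d == disease]

    def get(self, disease, classification):
        return self._records[(disease, classification)]


db = StandardDatabase()
db.add(CaseRecord("cholera", "clinical diagnosis", "item b",
                  ["typical cholera symptoms in an epidemic area",
                   "stool culture negative for O1 and O139"]))
db.add(CaseRecord("cholera", "confirmed case", "item a or c or d",
                  ["diarrhea symptoms",
                   "stool culture positive for O1 or O139"]))
print(db.classifications("cholera"))
```

Keying on the pair keeps the different case classification types of the same disease separate, which is what the later text mining steps operate on.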
The invention respectively carries out text mining on the characteristic information of the standard database through a first text mining module and a second text mining module to construct a vector space model.
Existing feature information extraction algorithms usually require a series of preprocessing steps supported by prior knowledge, and these steps often cause substantial information loss, leading to missed and erroneous extraction of detail nodes (feature information) and in turn reducing the identification accuracy of the whole system. To overcome these defects, the invention applies the vector space model to feature extraction from infectious disease diagnostic standards, which effectively addresses the two key problems of type classification and feature information extraction, greatly reduces the loss of feature information, and improves the accuracy of intelligent identification and diagnosis. Specifically, the feature entries $(T_1, T_2, \dots, T_n)$ and their weights $\omega_i$ represent the main feature information corresponding to the diagnostic standard of one case classification type in the database, forming a space vector; during information matching, the feature entries are used to evaluate the degree of correlation between an unknown text and the space vectors in the database.
The first text mining module adopts TF-IDF to carry out weight calculation and first core feature word extraction, and a vector space model is constructed to obtain a first feature vector set corresponding to main feature information of each infectious disease case classification; in the first feature vector set, each feature vector represents a diagnosis standard of an infectious disease case classification and corresponding main feature information.
Let $D$ be a document set comprising $m$ documents and $D_i$ be the feature vector of the $i$-th document; then $D = \{D_1, D_2, \dots, D_m\}$ and $D_i = (d_{i1}, d_{i2}, \dots, d_{in})$, $i = 1, 2, \dots, m$, where $d_{ij}$ is the TF-IDF weight of the $j$-th entry $t_j$ in document $D_i$,
where $i = 1, 2, \dots, m$; $j = 1, 2, \dots, n$; $N$ is the total number of documents in the document database; and $N_j$ is the number of documents in the document database that contain entry $t_j$.
After the entry weights are calculated, the first core feature words are screened out according to their weights, and the first core feature words together with their weights form the vector space model. Through the vector space model, text data is converted into structured data that a computer can process, and the problem of similarity between two documents becomes the problem of similarity between two vectors.
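The weight formula itself is not reproduced in the text. A common TF-IDF form consistent with the stated definitions of $N$ and $N_j$ is $d_{ij} = tf_{ij} \cdot \log(N / N_j)$, where $tf_{ij}$ is the number of times entry $t_j$ occurs in document $D_i$; this form is an assumption here. The sketch below applies it to already-tokenized documents (word segmentation is outside its scope); the function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {entry: weight} dict per document.

    docs: list of token lists. Uses the assumed form tf * log(N / N_j),
    where N is the number of documents and N_j the number containing the entry.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return vectors


def core_terms(vector, min_weight):
    """Keep only the entries whose weight exceeds the screening threshold."""
    return {term: w for term, w in vector.items() if w > min_weight}


docs = [
    ["fever", "cough", "dyspnea"],
    ["diarrhea", "vomiting", "dehydration"],
    ["fever", "diarrhea"],
]
vecs = tfidf_vectors(docs)
first_core = core_terms(vecs[0], min_weight=0.5)
print(sorted(first_core))
```

Under this form an entry that appears in every document receives weight zero, so it never survives screening; entries concentrated in few documents are favoured.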
Suppose that the feature vector corresponding to a certain class in the first feature vector set of the standard database is $V_k$ and the feature vector of the text to be classified is $V_0$. The similarity between the two vectors can be measured by the cosine of the angle between them; a smaller angle indicates a higher similarity.
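The cosine measure is the standard $\cos\theta = \dfrac{V_k \cdot V_0}{\lVert V_k \rVert \, \lVert V_0 \rVert}$. A minimal sketch follows, assuming vectors are stored sparsely as entry-to-weight dictionaries; that storage choice, the function name, and the sample weights are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {entry: weight}."""
    dot = sum(w * v[term] for term, w in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector shares nothing with any class
    return dot / (norm_u * norm_v)


v_k = {"fever": 0.4, "cough": 1.1, "dyspnea": 1.1}  # a class vector (made-up weights)
v_0 = {"fever": 1.0, "cough": 1.0}                  # the text to be classified
print(round(cosine_similarity(v_k, v_0), 3))
```

Entries absent from one vector contribute nothing to the dot product, so texts sharing no core feature words with a class score zero against it.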
The second text mining module constructs a feature selection model based on conditional mutual information and performs text mining on the main feature information of the standard database. It performs weight calculation according to the correlation between the entries of the main feature information and the case classifications, selects the second core feature words, and constructs a vector space model to obtain a second feature vector set corresponding to the main feature information of each infectious disease case classification.
Taking the case classifications of a certain infectious disease as an example: the main feature information corpora of the case classifications, such as the suspected diagnosis standard, clinical diagnosis standard, confirmed diagnosis standard, medical observation diagnosis standard, severe diagnosis standard, and differential diagnosis standard, are selected, and words are chosen through mutual information to build the space vector model.
First, the mutual information correlation between each entry of the main feature information contained in a case classification and that case classification is calculated from the following quantities:
A is the number of documents in case classification category c in which entry t appears; B is the number of documents in categories other than c in which entry t appears; C is the number of documents in category c in which entry t does not appear; N is the total number of documents in all categories. If the number of categories is m, each entry obtains m correlation values.
The average of the m values is taken as the weight of each entry. The entries are sorted from low to high by word frequency, and words that appear in only a single category and whose word frequency is below a preset word frequency threshold are removed. The remaining entries are sorted from high to low by weight, and the words whose weight is above a preset weight threshold are taken as the second core feature words. A feature vector is constructed from the second core feature words and their weights.
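The correlation formula is not reproduced in the text. The counts A, B, C, and N match a widely used document-count estimate of mutual information, $MI(t, c) \approx \log \dfrac{A \cdot N}{(A + C)(A + B)}$, which is assumed in the sketch below. Treating a category in which the entry never appears (A = 0) as contributing 0 is a further assumption made only to keep the average finite; the function names and sample corpus are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b, c, n):
    """Assumed estimate log(A*N / ((A+C)*(A+B))); 0.0 when A == 0 (assumption)."""
    if a == 0:
        return 0.0
    return math.log(a * n / ((a + c) * (a + b)))


def select_core_terms(docs_by_category, min_freq, min_weight):
    """docs_by_category: {category: list of token lists}.

    Returns {entry: weight} sorted by weight, highest first.
    """
    n = sum(len(docs) for docs in docs_by_category.values())
    freq = Counter(t for docs in docs_by_category.values() for d in docs for t in d)
    selected = {}
    for term, term_freq in freq.items():
        # Number of documents containing the entry, per category.
        hits = {cat: sum(term in d for d in docs)
                for cat, docs in docs_by_category.items()}
        total_hits = sum(hits.values())
        in_single_category = sum(h > 0 for h in hits.values()) == 1
        if in_single_category and term_freq < min_freq:
            continue  # rare entry confined to one category: removed
        scores = [
            mutual_information(hits[cat], total_hits - hits[cat],
                               len(docs) - hits[cat], n)
            for cat, docs in docs_by_category.items()
        ]
        weight = sum(scores) / len(scores)  # average of the m values
        if weight > min_weight:
            selected[term] = weight
    return dict(sorted(selected.items(), key=lambda kv: kv[1], reverse=True))


corpus = {
    "suspected": [["fever", "cough"], ["fever", "diarrhea"]],
    "confirmed": [["fever", "virus"], ["virus", "antibody"]],
}
second_core = select_core_terms(corpus, min_freq=2, min_weight=0.0)
print(second_core)
```

In this toy corpus "fever" is spread across both categories and averages to a negative weight, so it is dropped, while "virus" is frequent enough and concentrated in one category, so it is kept.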
The feature matching module performs feature matching according to results of the first text mining module and the second text mining module, and respectively calculates cosine similarity between the text to be classified and elements in the first feature vector set; respectively calculating the mutual information correlation between the text to be classified and the second feature vector set element; mutual information is used for measuring the correlation between certain characteristic information and a specific class, and if the mutual information is larger, the correlation between the characteristic information and the class is larger, and the probability of belonging to the class is larger. The reverse is also true. And then, carrying out case classification on the text to be classified according to the cosine similarity and the mutual information correlation.
The concrete way of case classification of the text to be classified according to the cosine similarity and the mutual information correlation degree has multiple choices:
1. For each case classification category, the weighted sum of the cosine similarity and the mutual information correlation is taken as the first output probability value of that category; the output probability values are arranged in descending order, a first probability threshold is set, and the categories whose probability value exceeds the first probability threshold are taken as the identification recommendation result.
2. For each case classification category, the larger of the cosine similarity and the mutual information correlation is taken as the second output probability value of that category; the output probability values are arranged in descending order, a second probability threshold is set, and the categories whose probability value exceeds the second probability threshold are taken as the identification recommendation result.
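The two options above can be sketched as one function with a mode switch. The sketch assumes both scores have already been scaled to a comparable range such as [0, 1], which the text does not specify; the weighting factor `alpha`, the function name `recommend`, and the sample scores are hypothetical.

```python
def recommend(cosine, mutual_info, threshold, mode="weighted", alpha=0.5):
    """Rank categories and keep those whose combined score exceeds the threshold.

    cosine, mutual_info: {category: score}, assumed to be on a comparable scale.
    mode "weighted": alpha * cosine + (1 - alpha) * mutual information.
    mode "max":      the larger of the two scores.
    Returns [(category, score)] in descending order of score.
    """
    scores = {}
    for cat, cos in cosine.items():
        mi = mutual_info.get(cat, 0.0)
        if mode == "weighted":
            scores[cat] = alpha * cos + (1 - alpha) * mi
        else:
            scores[cat] = max(cos, mi)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(cat, s) for cat, s in ranked if s > threshold]


cos_scores = {"suspected": 0.8, "clinical": 0.5, "confirmed": 0.2}
mi_scores = {"suspected": 0.6, "clinical": 0.7, "confirmed": 0.1}
by_weighted_sum = recommend(cos_scores, mi_scores, threshold=0.5)
by_maximum = recommend(cos_scores, mi_scores, threshold=0.75, mode="max")
print([cat for cat, _ in by_weighted_sum], [cat for cat, _ in by_maximum])
```

The weighted sum rewards categories supported by both measures, while the maximum keeps a category that is strongly supported by either one; comparing the two lists is one way to carry out the cross-comparison the description calls for.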
There may be one or more identification recommendation results, arranged in descending order. The categories with higher similarity under the individual modes, or under the combination of the two modes, are selected through cross-comparison as the recommended diagnosis results, providing medical staff with a multi-angle auxiliary diagnosis reference. For feature information that is ambiguous or only superficially plausible, which medical staff sometimes cannot judge or determine correctly, efficient feature information matching yields an accurate auxiliary diagnosis, helps medical staff make a correct judgment, and reduces missed reports and false reports.
The present invention also discloses an electronic device, comprising: at least one processor, at least one memory, a communication interface, and a bus;
the processor, the memory and the communication interface complete mutual communication through the bus;
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to implement the system for intelligently identifying infectious diseases based on legal diagnostic standards, which comprises an index construction module, an information extraction module, a standard database, a first text mining module, a second text mining module, and a feature matching module.
The invention also discloses a computer-readable storage medium storing computer instructions that cause a computer to implement all or part of the system according to the embodiments of the invention, for example the index construction module, the information extraction module, the standard database, the first text mining module, the second text mining module, and the feature matching module. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.