CN111949759A - Method and system for retrieving medical record text similarity and computer equipment - Google Patents
- Publication number
- CN111949759A (application CN201910407594.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention provides a method, a system and computer equipment for retrieving medical record text similarity. The method comprises the following steps: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information from a database according to the long text vector. With this method, medical knowledge is mined and learned automatically from the database by a medical artificial intelligence method, without expert participation, and a model for comparing similar medical records is built. The model can integrate the comparison results of various types of free text and obtain similar-record recommendations efficiently and accurately. The comparison results are highly consistent with those obtained by doctors' manual comparison, so the method can provide doctors with a clinical pathway reference of practical value and effectively relieves the problem of doctors spending a large amount of time looking up previous medical records.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method, a system and computer equipment for retrieving medical record text similarity.
Background
An Electronic Medical Record (EMR) is the medical record generated when a patient visits a medical institution. It carries doctors' medical experience and practice patterns, and its core value lies in auxiliary diagnosis and decision support for doctors. EMR data mainly takes the form of tables, free text and images, and the free text is mostly unstructured. As hospitals have become increasingly information-based, they have accumulated large amounts of unstructured EMR free text containing much valuable medical and clinical information, and as medical information becomes more standardized, the free text covers more standard and complete patient information. Many scholars, organizations and enterprises at home and abroad are now researching EMR-based auxiliary diagnosis systems. This field can cover the complete medical process and plays an important role in optimizing workflows, improving work efficiency, reducing medical errors and improving medical quality. Domestic applied research on Chinese EMRs focuses on the development of EMR systems on the one hand, and on EMR-based clinical pathway optimization and similar-EMR retrieval on the other. In the related art, the core technique for retrieving similar Chinese medical record text mainly compares records through keywords or an ontology model. This relies on the knowledge of medical experts and does not make good use of the information already contained in large-scale EMR data.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
Therefore, the invention provides a method for searching medical record text similarity in a first aspect.
The invention provides a system for searching medical record text similarity in a second aspect.
A third aspect of the invention provides a computer apparatus.
A fourth aspect of the invention provides a computer-readable storage medium.
In view of this, the first aspect of the present invention provides a method for retrieving similarity between medical records, including: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information in the database according to the long text vector.
The method for retrieving medical record text similarity performs word segmentation on the received text information. Word segmentation includes resolving segmentation ambiguity and recognizing unknown (out-of-vocabulary) words, so that terms such as disease names can be segmented out. The segmented words are used in the subsequent training, and the accuracy of the segmentation determines the accuracy of the subsequent steps. The generated words are trained into a long text vector, which gives a numerical representation of the long text, and medical record information similar to the text information is then obtained from the database according to the long text vector. Retrieving medical record information in this way lets medical knowledge be mined and learned automatically from the database by a medical artificial intelligence method, without expert participation, and a model for comparing similar medical records is built. The model can integrate the comparison results of various types of free text and obtain similar-record recommendations efficiently and accurately. The results are highly consistent with those obtained by doctors' manual comparison, so the method can provide doctors with a clinical pathway reference of practical value and effectively relieves the problem of doctors spending a large amount of time looking up previous medical records. The method can also assist doctors who lack clinical experience, so that patients receive better and more timely diagnosis and treatment, which further improves the efficiency and accuracy of clinical diagnosis.
Specifically, the method mainly processes the chief complaint, history of present illness, past history, personal history, family history and general examination results in the free text, so that a more complete auxiliary diagnosis of the patient can be obtained.
According to the medical record text similarity retrieval method provided by the invention, the method can also have the following additional technical characteristics:
In the above technical solution, preferably, after the step of performing word segmentation processing on the text information to generate words, the method for retrieving medical record text similarity further includes: tagging the part of speech of each word; and classifying the words according to their part-of-speech tags.
In this technical solution, the text information is preprocessed by a named entity recognition application: the part of speech of each word is tagged and the words are classified according to the tags, so that every word in a sentence receives a correct lexical tag and a category. The named entity recognition application can also segment unknown words accurately. Part-of-speech tagging methods are mainly either rule-based or statistics-based. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above technical solutions, preferably, the step of performing word segmentation processing on the text information to generate words specifically includes: performing word segmentation processing on the text information according to a disease dictionary, regular expressions and stop-word removal to generate words.
In this technical solution, the text information is segmented using the disease dictionary, regular expressions and stop-word removal, which removes interfering words; in addition, a maximum matching method is used to improve the accuracy of word segmentation.
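The dictionary-based segmentation described above can be illustrated with a minimal sketch. It assumes forward maximum matching (the source only says "a maximum matching method"); the function name `segment`, the sample dictionary, the stop-word list and the numeric regular expression are all hypothetical and chosen only for illustration.

```python
import re

def segment(text, dictionary, stop_words, max_len=None):
    """Forward maximum matching: at each position take the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    # Hypothetical regular expression: keep numbers such as "38.5" as single tokens.
    number = re.compile(r"\d+(?:\.\d+)?")
    words, i = [], 0
    while i < len(text):
        m = number.match(text, i)
        if m:
            words.append(m.group())
            i = m.end()
            continue
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    # Stop-word removal drops interfering words (and stray whitespace).
    return [w for w in words if w not in stop_words and not w.isspace()]

# Illustrative data only, not taken from the patent.
disease_dictionary = {"患者", "发热", "咳嗽"}
stop_words = {"伴"}
print(segment("患者发热伴咳嗽3天", disease_dictionary, stop_words))
```

A production system would likely combine this with the ambiguity handling and unknown-word recognition mentioned earlier, which this sketch does not attempt.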
In any of the above technical solutions, preferably, the step of training the words into long text vectors specifically includes: training the words into word vectors; and combining the word vectors into a long text vector.
In this technical solution, the segmented words are first trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which gives a numerical representation of the long text of the medical record.
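As a rough illustration of combining word vectors into a long text vector, the sketch below averages the vectors of the words in a text. Averaging is an assumption (the source says only that the word vectors are "combined"), and the tiny `word_vectors` table stands in for vectors that would in practice come from a trained embedding model.

```python
def long_text_vector(words, word_vectors):
    """Combine word vectors into one text vector by element-wise averaging (assumed)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return []  # no known words, so no vector can be formed
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]

# Hypothetical two-dimensional word vectors for illustration only.
word_vectors = {"fever": [1.0, 0.0], "cough": [0.0, 1.0]}
print(long_text_vector(["fever", "cough", "unknown"], word_vectors))
```

Words missing from the table are skipped here; other choices, such as weighting words by importance, would fit the same description.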
In any of the above technical solutions, preferably, the step of obtaining medical record information similar to the text information from the database according to the long text vector specifically includes: acquiring from the database a plurality of long texts similar to the text information, and segmenting each of them into a word set, the word sets forming a screening set; acquiring, from the screening set, the long texts whose word sets match the word set obtained by segmenting the text information, and taking them as priority results; calculating, according to the long text vectors, the relevance between each word set in the screening set that does not match the text information and the word set obtained by segmenting the text information; judging whether the relevance is greater than a preset threshold; and if the relevance is greater than the preset threshold, ranking the unmatched long texts according to the relevance.
In this technical solution, edit distance is first used to rank the EMRs (electronic medical records) by literal similarity, and each EMR is segmented into a corresponding word set. The Jaccard distance is used to find the long texts whose word sets completely match that of the text information, and these are given the highest priority. For long texts that do not match completely, cosine distance is used to compute the relevance between words, and a preset threshold is set: if the relevance is smaller than the preset threshold, it is set to 0 and the words are regarded as unrelated. The distances of the related words are added to the ranking to obtain the long text matches with the next highest priority. For example, if the word set of the current long text is {A, B} and a word set in the library is {C, A}, the weighted similarity distance obtained by the cosine calculation is: $\dfrac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
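The matching and ranking logic above can be sketched as follows. This is one reading of the description under stated assumptions: exact matches are detected with a Jaccard similarity of 1, the relevance of a partial match is the cosine of the averaged vectors of the unshared words (for {A, B} versus {C, A} this reduces to the cosine of B and C), and related records are listed most relevant first. The helper names, the threshold value and the sample vectors are hypothetical.

```python
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def mean_vector(words, wv):
    vecs = [wv[w] for w in sorted(words) if w in wv]
    return [sum(col) / len(vecs) for col in zip(*vecs)] if vecs else []

def rank(query_words, records, wv, threshold=0.5):
    """records maps a record id to its word set; returns ids in priority order."""
    exact, related = [], []
    for rid, words in records.items():
        if jaccard(query_words, words) == 1.0:
            exact.append(rid)  # complete match: highest priority
            continue
        q_only = set(query_words) - set(words)
        r_only = set(words) - set(query_words)
        score = cosine(mean_vector(q_only, wv), mean_vector(r_only, wv))
        if score > threshold:  # at or below the threshold counts as unrelated
            related.append((score, rid))
    related.sort(key=lambda pair: -pair[0])  # assumed: most relevant first
    return exact + [rid for _, rid in related]

# Hypothetical word vectors and records for illustration only.
wv = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [1.0, 0.0], "D": [0.0, -1.0]}
records = {"r1": {"C", "A"}, "r2": {"A", "B"}, "r3": {"A", "D"}}
print(rank({"A", "B"}, records, wv))
```

With these sample vectors, `r2` is a complete match, `r1` passes the threshold through the similarity of B and C, and `r3` is dropped as unrelated.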
In a second aspect of the present invention, a system for retrieving medical record text similarity is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following when executing the computer program: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information from the database according to the long text vector.
The system for retrieving medical record text similarity performs word segmentation on the received text information. Word segmentation includes resolving segmentation ambiguity and recognizing unknown (out-of-vocabulary) words, so that terms such as disease names and times can be segmented out. The segmented words are used in the subsequent training, and the accuracy of the segmentation determines the accuracy of the subsequent steps. The generated words are trained into a long text vector, which gives a numerical representation of the long text, and medical record information similar to the text information is then obtained from the database according to the long text vector. When the system retrieves medical record information, medical knowledge is mined and learned automatically from the database by a medical artificial intelligence method, without expert participation, and a model for comparing similar medical records is built. The model can integrate the comparison results of various types of free text and obtain similar-record recommendations efficiently and accurately. The results are highly consistent with those obtained by doctors' manual comparison, so the system can provide doctors with a clinical pathway reference of practical value and effectively relieves the problem of doctors spending a large amount of time looking up previous medical records. The system can also assist doctors who lack clinical experience, so that patients receive better and more timely diagnosis and treatment, which further improves the efficiency and accuracy of clinical diagnosis.
Specifically, the system mainly processes the chief complaint, history of present illness, past history, personal history, family history and general examination results in the free text, so that a more complete auxiliary diagnosis of the patient can be obtained.
The system for searching the medical record text similarity provided by the invention can also have the following additional technical characteristics:
In the above technical solution, preferably, when executing the computer program, the processor further implements, after the step of performing word segmentation processing on the text information to generate words: tagging the part of speech of each word; and classifying the words according to their part-of-speech tags.
In this technical solution, the text information is preprocessed by a named entity recognition application: the part of speech of each word is tagged and the words are classified according to the tags, so that every word in a sentence receives a correct lexical tag and a category. The named entity recognition application can also segment unknown words accurately. Part-of-speech tagging methods are mainly either rule-based or statistics-based. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above technical solutions, preferably, the step of performing word segmentation processing on the text information to generate words, implemented when the processor executes the computer program, specifically includes: performing word segmentation processing on the text information according to a disease dictionary, regular expressions and stop-word removal to generate words.
In this technical solution, the text information is segmented using the disease dictionary, regular expressions and stop-word removal, which removes interfering words; in addition, a maximum matching method is used to improve the accuracy of word segmentation.
In any of the above technical solutions, preferably, the step of training the words into long text vectors, implemented when the processor executes the computer program, specifically includes: training the words into word vectors; and combining the word vectors into a long text vector.
In this technical solution, the segmented words are first trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which gives a numerical representation of the long text of the medical record.
In any of the above technical solutions, preferably, the step of obtaining medical record information similar to the text information from the database according to the long text vector, implemented when the processor executes the computer program, specifically includes: acquiring from the database a plurality of long texts similar to the text information, and segmenting each of them into a word set, the word sets forming a screening set; acquiring, from the screening set, the long texts whose word sets match the word set obtained by segmenting the text information, and taking them as priority results; calculating, according to the long text vectors, the relevance between each word set in the screening set that does not match the text information and the word set obtained by segmenting the text information; judging whether the relevance is greater than a preset threshold; and if the relevance is greater than the preset threshold, ranking the unmatched long texts according to the relevance.
In this technical solution, edit distance is first used to rank the EMRs (electronic medical records) by literal similarity, and each EMR is segmented into a corresponding word set. The Jaccard distance is used to find the long texts whose word sets completely match that of the text information, and these are given the highest priority. For long texts that do not match completely, cosine distance is used to compute the relevance between words, and a preset threshold is set: if the relevance is smaller than the preset threshold, it is set to 0 and the words are regarded as unrelated. The distances of the related words are added to the ranking to obtain the long text matches with the next highest priority. For example, if the word set of the current long text is {A, B} and a word set in the library is {C, A}, the weighted similarity distance obtained by the cosine calculation is: $\dfrac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
In a third aspect of the present invention, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for retrieving the similarity between the medical record texts according to any one of the above technical solutions.
Because the computer device provided by the invention implements the method for retrieving medical record text similarity according to any one of the technical solutions of the first aspect, it has all the beneficial effects of that method.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the method according to any of the above technical solutions, so the storage medium has all the technical effects of the method for retrieving medical record text similarity, which are not repeated here.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating a method for retrieving medical record text similarity according to an embodiment of the present application;
FIG. 2 is another schematic flow chart of a method for retrieving medical record text similarity according to an embodiment of the present application;
FIG. 3 is another schematic flow chart of a method for retrieving medical record text similarity according to an embodiment of the present application;
FIG. 4 is another flowchart illustrating a method for retrieving medical record text similarity according to an embodiment of the present application;
FIG. 5 is a block diagram of a system for retrieving medical record text similarity according to an embodiment of the present application;
FIG. 6 is another block diagram of a system for retrieving medical record text similarity according to one embodiment of the present application;
FIG. 7 is another block diagram of a system for retrieving medical record text similarity according to one embodiment of the present application;
FIG. 8 shows a schematic block diagram of a computer device of one embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The following describes a method, a system and a computer device for retrieving medical record text similarity according to some embodiments of the present invention with reference to fig. 1 to 8.
Fig. 1 is a flowchart illustrating a medical record text similarity retrieval method according to an embodiment of the present application. As shown in fig. 1, the method includes:
step 102, receiving text information;
step 104, performing word segmentation processing on the text information to generate words;
step 106, training the words into long text vectors;
and step 108, acquiring medical record information similar to the text information from the database according to the long text vector.
The method for retrieving medical record text similarity performs word segmentation on the received text information. Word segmentation includes resolving segmentation ambiguity and recognizing unknown (out-of-vocabulary) words, so that terms such as disease names can be segmented out. The segmented words are used in the subsequent training, and the accuracy of the segmentation determines the accuracy of the subsequent steps. The generated words are trained into a long text vector, which gives a numerical representation of the long text, and medical record information similar to the text information is then obtained from the database according to the long text vector. Retrieving medical record information in this way lets medical knowledge be mined and learned automatically from the database by a medical artificial intelligence method, without expert participation, and a model for comparing similar medical records is built. The model can integrate the comparison results of various types of free text and obtain similar-record recommendations efficiently and accurately. The results are highly consistent with those obtained by doctors' manual comparison, so the method can provide doctors with a clinical pathway reference of practical value and effectively relieves the problem of doctors spending a large amount of time looking up previous medical records. The method can also assist doctors who lack clinical experience, so that patients receive better and more timely diagnosis and treatment, which further improves the efficiency and accuracy of clinical diagnosis.
Specifically, the method mainly processes the chief complaint, history of present illness, past history, personal history, family history and general examination results in the free text, so that a more complete auxiliary diagnosis of the patient can be obtained.
In the foregoing embodiment, preferably, after the step of performing word segmentation processing on the text information to generate words, the method further includes: tagging the part of speech of each word; and classifying the words according to their part-of-speech tags.
In this embodiment, the text information is preprocessed by a named entity recognition application: the part of speech of each word is tagged and the words are classified according to the tags, so that every word in a sentence receives a correct lexical tag and a category. The named entity recognition application can also segment unknown words accurately. Part-of-speech tagging methods are mainly either rule-based or statistics-based. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
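The final classification step can be pictured with a small sketch that groups already-tagged words by their tag. It does not implement the CRF tagger or the RNN; it assumes the tagging has been done upstream, and the tag names (`disease`, `symptom`, `time`) and sample words are hypothetical.

```python
from collections import defaultdict

def classify_by_tag(tagged_words):
    """Group (word, tag) pairs by tag, preserving the order of appearance."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return dict(groups)

# Hypothetical output of an upstream tagger, for illustration only.
tagged = [("pneumonia", "disease"), ("fever", "symptom"),
          ("cough", "symptom"), ("3 days", "time")]
print(classify_by_tag(tagged))
```

The resulting groups give the disease and symptom vocabulary of a record in a form that later comparison steps could consume.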
In any of the above embodiments, preferably, the step of performing word segmentation processing on the text information to generate words specifically includes: and performing word segmentation processing on the text information according to the disease dictionary and the regular expression and the removal disabled words to generate words.
In the embodiment, the text information is subjected to word segmentation according to the disease dictionary, the regular expression and the removal stop words, so that the effect of removing interference words is achieved, and meanwhile, the accuracy rate of word segmentation is improved by using a maximum matching method.
In any of the above embodiments, preferably, the step of training the words into long text vectors specifically includes: training the words into word vectors; and combining the word vectors into a long text vector.
In this embodiment, the segmented words are trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which yields a numerical representation of the long text of the medical record.
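As a rough illustration of combining word vectors into a long text vector, the sketch below averages the vectors of the known words. Averaging is an assumption made for illustration; the application only states that the word vectors are combined. The vectors themselves are hypothetical toy values standing in for trained embeddings.

```python
def long_text_vector(words, word_vectors):
    """Combine word vectors into one long text vector by element-wise averaging.

    Averaging is an illustrative assumption; words without a vector are skipped.
    """
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Hypothetical two-dimensional word vectors standing in for trained embeddings.
toy_vectors = {"chest pain": [1.0, 0.0], "fever": [0.0, 1.0]}
record_vector = long_text_vector(["chest pain", "fever", "unknown"], toy_vectors)
```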
In any of the above embodiments, preferably, the step of obtaining medical record information similar to the text information in the database according to the long text vector specifically includes: acquiring a plurality of long texts similar to the text information from a database, and respectively segmenting the long texts into word sets as a screening set; acquiring a long text matched with a word set obtained after word segmentation processing is carried out on text information in the screening set, and taking the long text as a priority result; calculating the relevance of a word set which is not matched with the text information in the screening set and the word set after the text information is subjected to word segmentation processing according to the long text vector; judging whether the relevance is greater than a preset threshold value or not; and if the relevance is greater than a preset threshold value, arranging the long texts which are not matched with the text information in a positive sequence according to the relevance.
In this embodiment, edit distance is first used to order the EMRs (electronic medical records) by literal similarity, and each EMR is segmented into a corresponding word set. Jaccard distance is used to find the long texts whose word sets completely match the text information, and these long texts are given the highest priority. For long texts that do not completely match, cosine distance is used to compute the relevance between words, and a preset threshold is set: if the relevance is smaller than the preset threshold, the relevance is set to 0 and the words are considered unrelated. The distances of the related words are added to the ordering, which yields the long text matches of the next-highest priority. Specifically, for example, if the current long text word set {A, B} is compared with a set {C, A} in the library, the weighted similarity distance obtained by cosine distance calculation is: $\frac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
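The ranking logic in this embodiment can be sketched as follows. The sketch assumes that an exact match means a Jaccard distance of 0, that relevance is the cosine similarity of the long text vectors, and that results are listed most relevant first; the function names, the threshold value, and the record layout are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance between two word sets (0.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity (u . v) / (|u| |v|); 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(query_words, query_vec, candidates, threshold=0.5):
    """Return record ids: exact word-set matches first, then by relevance.

    candidates is a list of (record_id, words, vector) tuples.
    """
    exact, partial = [], []
    for record_id, words, vec in candidates:
        if jaccard_distance(query_words, words) == 0.0:
            exact.append(record_id)          # completely matched: top priority
            continue
        relevance = cosine(query_vec, vec)
        if relevance < threshold:
            relevance = 0.0                  # below threshold: treated as unrelated
        if relevance > 0.0:
            partial.append((relevance, record_id))
    partial.sort(key=lambda pair: pair[0], reverse=True)
    return exact + [record_id for _, record_id in partial]
```

Records whose relevance falls below the threshold are dropped from the result rather than listed with a relevance of 0; that is a design choice of this sketch.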
Fig. 2 is another flowchart illustrating a method for retrieving similarity between medical records according to an embodiment of the present application. As shown in fig. 2, the method includes:
step 210, calculating the chief complaint similarity according to a combined distance algorithm.
In this embodiment, the received data objects are the patient's chief complaint data (text type) and disease history (numerical type). The similarity of the chief complaint data is calculated first. As shown in fig. 2, the patient's chief complaint entered by the doctor is trained into a text vector using a CRF (conditional random field) algorithm, an RNN (recurrent neural network) and Doc2Vec (a document embedding model). The retrieval range is screened according to whether the chief complaint contains a disease name or specific characters; edit distance is used to narrow the retrieval range, which reduces the time complexity and enables fast retrieval. The chief complaint similarity is then calculated by combining the Jaccard distance and the cosine distance.
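The use of edit distance to narrow the retrieval range can be sketched as below. The sketch computes the standard Levenshtein distance and keeps only the closest stored chief complaints; the function names, the `keep` parameter, and the sample complaints are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def screen(query, complaints, keep=2):
    """Keep the `keep` stored complaints literally closest to the query."""
    return sorted(complaints, key=lambda c: edit_distance(query, c))[:keep]


stored = ["chest pains", "headache", "chest pain 3 days"]
closest = screen("chest pain", stored, keep=1)
```

Only the screened candidates would then be passed to the more expensive Jaccard and cosine comparison.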
Fig. 3 is another flowchart illustrating a method for retrieving similarity between medical records according to an embodiment of the present application. As shown in fig. 3, the method includes:
step 302, acquiring medical history information in a medical record of a patient according to medical history statistics;
step 304, automatically encoding the medical history;
step 306, performing word segmentation processing on the medical history to generate words;
step 308, training the words into long text vectors;
step 310, calculating the medical history similarity according to the long text vector.
In this embodiment, the medical history records in the medical record are obtained through medical history statistics, the medical histories are encoded using one-hot encoding, and the similarity between the encoded medical histories is then calculated to obtain the medical history similarity.
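A minimal sketch of this medical history branch is given below. The history vocabulary is hypothetical, and cosine similarity between the encoded vectors is an assumption, since the application does not name the measure used for the encoded histories.

```python
import math

# Hypothetical vocabulary of medical history items.
HISTORY_VOCAB = ["hypertension", "diabetes", "asthma", "smoking"]


def encode_history(items):
    """Encode a patient's history as a binary vector over HISTORY_VOCAB."""
    present = set(items)
    return [1 if term in present else 0 for term in HISTORY_VOCAB]


def history_similarity(a, b):
    """Cosine similarity between two encoded histories (an assumed measure)."""
    u, v = encode_history(a), encode_history(b)
    dot = sum(x * y for x, y in zip(u, v))
    # For binary vectors the squared norm equals the number of ones.
    norm = math.sqrt(sum(u)) * math.sqrt(sum(v))
    return dot / norm if norm else 0.0
```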
Fig. 4 is another flowchart illustrating a method for retrieving similarity between medical records according to an embodiment of the present application. As shown in fig. 4, the method includes:
step 404, performing word segmentation processing on the text information to generate words;
step 410, normalizing the similarity of the chief complaints and the similarity of the medical histories;
and step 416, weighting and summing the chief complaint similarity and the medical history similarity according to the obtained weight ratio to obtain the comprehensive similarity.
In this embodiment, after the chief complaint similarity and the medical history similarity are obtained, the comprehensive similarity is calculated from them. As shown in fig. 4, the chief complaint similarity and the medical history similarity are normalized so that the input data format is standardized; the weight ratio of each feature is calculated through feature selection; and the chief complaint similarity and the medical history similarity are weighted and summed according to the obtained weight ratio to obtain the comprehensive similarity.
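The normalization and weighted summation can be sketched as follows. Min-max normalization and the weights 0.6 and 0.4 are assumptions made for illustration; the application derives the weight ratio through feature selection and does not state the normalization formula.

```python
def min_max(values):
    """Scale values to the range [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def comprehensive_similarity(complaint_sims, history_sims,
                             w_complaint=0.6, w_history=0.4):
    """Weighted sum of normalized similarities, one value per candidate record.

    The default weights are hypothetical placeholders for the weight ratio
    that the application obtains through feature selection.
    """
    c = min_max(complaint_sims)
    h = min_max(history_sims)
    return [w_complaint * x + w_history * y for x, y in zip(c, h)]


scores = comprehensive_similarity([0.0, 0.5, 1.0], [1.0, 3.0, 2.0])
```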
In a second aspect of the present invention, a system 50 for retrieving medical record text similarity is provided, including: a memory 502, a processor 504, and a computer program stored on the memory 502 and executable on the processor 504, the processor 504 when executing the computer program implementing: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information in the database according to the long text vector.
As shown in fig. 5, the medical record text similarity retrieval system 50 provided by the present invention performs word segmentation on the received text information. The word segmentation includes resolving ambiguous segmentations and identifying unknown words, so that diseases, disorders and time are segmented out. The segmented words are used in the subsequent training, and the accuracy of the word segmentation determines the accuracy of that training. The generated words are trained into long text vectors to obtain the corresponding numerical representation of the long text, and medical record information similar to the text information is then acquired from the database according to the long text vectors. When the system retrieves medical record information, medical knowledge is automatically mined and learned from the database by a medical artificial intelligence method without the participation of experts, and a model for comparing similar medical records is constructed. The model can integrate the comparison results of various types of free text and obtain similar medical record recommendations efficiently and accurately, and its results are highly consistent with those obtained by manual comparison by doctors. It can therefore provide doctors with clinical pathway references of practical value and effectively solve the problem of doctors spending a large amount of time looking up previous medical records. The system can also assist doctors who lack medical experience, so that patients obtain diagnosis and treatment better and sooner, which improves the efficiency and accuracy of clinical diagnosis.
Specifically, the main objects processed by the system are the chief complaint, history of present illness, past history, personal history, family history and general examination results in free text, from which a complete auxiliary diagnosis of the patient is obtained.
In the above embodiment, preferably, the processor 504, when executing the computer program, further implements, after the step of performing word segmentation processing on the text information to generate words: tagging the part of speech of each word; and classifying the words according to their part-of-speech tags.
In this embodiment, the text information is preprocessed by named entity recognition: the part of speech of each word is tagged and the words are classified according to the tags, so that each word in a sentence receives a correct lexical tag and a category. Named entity recognition can also accurately segment unknown words. Part-of-speech tagging methods are mainly divided into rule-based and statistics-based methods. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above embodiments, preferably, when the processor 504 executes the computer program, the step of performing word segmentation processing on the text information to generate words specifically includes: performing word segmentation processing on the text information according to a disease dictionary and regular expressions, and removing stop words, to generate words.
In this embodiment, the text information is segmented according to the disease dictionary and the regular expressions, and stop words are removed, so that interfering words are eliminated; at the same time, a maximum matching method is used to improve the accuracy of word segmentation.
In any of the above embodiments, preferably, when the processor 504 executes the computer program, the step of training the words into long text vectors specifically includes: training the words into word vectors; and combining the word vectors into a long text vector.
In this embodiment, the segmented words are trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which yields a numerical representation of the long text of the medical record.
In any of the above embodiments, preferably, when the processor executes the computer program, the step of obtaining medical record information similar to the text information in the database according to the long text vector is implemented, and specifically includes: acquiring a plurality of long texts similar to the text information from a database, and respectively segmenting the long texts into word sets as a screening set; acquiring a long text matched with a word set obtained after word segmentation processing is carried out on text information in the screening set, and taking the long text as a priority result; calculating the relevance of a word set which is not matched with the text information in the screening set and the word set after the text information is subjected to word segmentation processing according to the long text vector; judging whether the relevance is greater than a preset threshold value or not; and if the relevance is greater than a preset threshold value, arranging the long texts which are not matched with the text information in a positive sequence according to the relevance.
In this embodiment, edit distance is first used to order the EMRs (electronic medical records) by literal similarity, and each EMR is segmented into a corresponding word set. Jaccard distance is used to find the long texts whose word sets completely match the text information, and these long texts are given the highest priority. For long texts that do not completely match, cosine distance is used to compute the relevance between words, and a preset threshold is set: if the relevance is smaller than the preset threshold, the relevance is set to 0 and the words are considered unrelated. The distances of the related words are added to the ordering, which yields the long text matches of the next-highest priority. Specifically, for example, if the current long text word set {A, B} is compared with a set {C, A} in the library, the weighted similarity distance obtained by cosine distance calculation is: $\frac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
Specifically, as shown in fig. 6, the patient medical record 6 is entered, a similar medical record 62 is obtained from the medical record database 60, and the result is returned to the doctor. After the patient describes the illness, the doctor can search for similar long-text medical records and, drawing on experience, make the corresponding clinical diagnosis and provide a suitable treatment plan for the patient.
Specifically, as shown in fig. 7, based on the medical record data 7 entered for a new patient, the chief complaint input data 70, the patient disease history data 72 and the general examination data 74 are separated from the medical record data; chief complaint similarity calculation 702, medical history similarity calculation 722 and comprehensive similarity calculation 742 are performed on the separated data; similar medical records are acquired from the Chinese electronic medical record database 78; and the result 76 is returned to assist the doctor in making a clinical diagnosis.
As shown in fig. 8, in a third aspect of the present invention, a computer device 8 is provided, which includes a memory 80, a processor 82, and a computer program stored on the memory 80 and executable on the processor 82, and when the processor 82 executes the computer program, the method for retrieving the similarity between the texts of the medical records according to any of the above embodiments is implemented.
The embodiment provided by the invention comprises the medical record text similarity retrieval method in any embodiment, so that the embodiment has all the beneficial effects of the medical record text similarity retrieval method.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the method according to any of the above embodiments, so that the storage medium has all the technical effects of the method for retrieving medical record text similarity; details are not repeated here.
In the present invention, the term "plurality" means two or more unless explicitly defined otherwise. The terms "mounted," "connected," "fixed," and the like are to be construed broadly, and for example, "connected" may be a fixed connection, a removable connection, or an integral connection; "coupled" may be direct or indirect through an intermediary. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (12)
1. A method for retrieving medical record text similarity is characterized by comprising the following steps:
receiving text information;
performing word segmentation processing on the text information to generate words;
training the words into long text vectors;
and acquiring medical record information similar to the text information in a database according to the long text vector.
2. The medical record text similarity retrieval method according to claim 1, wherein after the step of performing word segmentation processing on the text information to generate words, the method further comprises:
performing tagging processing on the part of speech of the word;
and classifying the words according to the part-of-speech labels of the words.
3. The medical record text similarity retrieval method according to claim 1, wherein the step of performing word segmentation processing on the text information to generate words specifically comprises:
and performing word segmentation processing on the text information according to a disease dictionary and regular expressions, and removing stop words, to generate words.
4. The medical record text similarity retrieval method according to claim 2, wherein the step of training the words into long text vectors specifically comprises:
training the words into word vectors;
combining the word vectors into the long text vector.
5. The medical record text similarity retrieval method according to any one of claims 1 to 4, wherein the step of obtaining medical record information similar to the text information in a database according to the long text vector specifically includes:
acquiring a plurality of long texts similar to the text information from the database, and respectively segmenting the long texts into word sets as a screening set;
acquiring a long text matched with the word set after the word segmentation processing is carried out on the text information in the screening set, and taking the long text as a priority result;
calculating the relevance of a word set which is not matched with the text information in the screening set and a word set after word segmentation processing is carried out on the text information according to the long text vector;
judging whether the relevance is greater than a preset threshold value or not;
and if the relevance is greater than the preset threshold value, arranging the long texts which are not matched with the text information in a positive sequence according to the relevance.
6. A system for retrieving medical record text similarity is characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing:
receiving text information;
performing word segmentation processing on the text information to generate words;
training the words into long text vectors;
and acquiring medical record information similar to the text information in a database according to the long text vector.
7. The medical record text similarity retrieval system according to claim 6, wherein the processor, when executing the computer program, implements the word segmentation processing on the text information, and after the step of generating words, further comprises:
performing tagging processing on the part of speech of the word;
and classifying the words according to the part-of-speech labels of the words.
8. The medical record text similarity retrieval system according to claim 6, wherein the processor implements the step of performing word segmentation processing on the text information to generate words when executing the computer program, and specifically comprises:
and performing word segmentation processing on the text information according to a disease dictionary and regular expressions, and removing stop words, to generate words.
9. The system for retrieving medical record text similarity according to claim 7, wherein the processor, when executing the computer program, implements the step of training the words into long text vectors, specifically comprising:
training the words into word vectors;
combining the word vectors into the long text vector.
10. The system for retrieving medical record text similarity according to any one of claims 6 to 9, wherein the processor, when executing the computer program, implements the step of obtaining medical record information similar to the text information in the database according to the long text vector, specifically comprising:
acquiring a plurality of long texts similar to the text information from the database, and respectively segmenting the long texts into word sets as a screening set;
acquiring a long text matched with the word set after the word segmentation processing is carried out on the text information in the screening set, and taking the long text as a priority result;
calculating the relevance of a word set which is not matched with the text information in the screening set and a word set after word segmentation processing is carried out on the text information according to the long text vector;
judging whether the relevance is greater than a preset threshold value or not;
and if the relevance is greater than the preset threshold value, arranging the long texts which are not matched with the text information in a positive sequence according to the relevance.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the method for retrieving similarity between medical record texts according to any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for retrieving the similarity of medical record texts according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910407594.0A CN111949759A (en) | 2019-05-16 | 2019-05-16 | Method and system for retrieving medical record text similarity and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910407594.0A CN111949759A (en) | 2019-05-16 | 2019-05-16 | Method and system for retrieving medical record text similarity and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111949759A true CN111949759A (en) | 2020-11-17 |
Family
ID=73336902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910407594.0A Pending CN111949759A (en) | 2019-05-16 | 2019-05-16 | Method and system for retrieving medical record text similarity and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111949759A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329461A (en) * | 2020-11-24 | 2021-02-05 | 汤学民 | Similar medical record determination method, computer equipment and computer storage medium |
CN112466472A (en) * | 2021-02-03 | 2021-03-09 | 北京伯仲叔季科技有限公司 | Case text information retrieval system |
CN112579750A (en) * | 2020-11-30 | 2021-03-30 | 百度健康(北京)科技有限公司 | Similar medical record retrieval method, device, equipment and storage medium |
CN113254658A (en) * | 2021-07-07 | 2021-08-13 | 明品云(北京)数据科技有限公司 | Text information processing method, system, medium, and apparatus |
CN113610112A (en) * | 2021-07-09 | 2021-11-05 | 中国商用飞机有限责任公司上海飞机设计研究院 | Auxiliary decision-making method for airplane assembly quality defects |
CN113689924A (en) * | 2021-08-24 | 2021-11-23 | 平安国际智慧城市科技股份有限公司 | Similar medical record retrieval method and device, electronic equipment and readable storage medium |
CN114020874A (en) * | 2021-11-11 | 2022-02-08 | 万里云医疗信息科技(北京)有限公司 | Medical record retrieval system, method, equipment and computer readable storage medium |
CN114218955A (en) * | 2021-12-28 | 2022-03-22 | 上海柯林布瑞信息技术有限公司 | Medical knowledge graph-based auxiliary reference information determination method and system |
CN114300083A (en) * | 2021-11-16 | 2022-04-08 | 北京左医科技有限公司 | Medical record construction method and system |
CN115083550A (en) * | 2022-06-29 | 2022-09-20 | 西安理工大学 | Patient similarity classification method based on multi-source information |
CN115269613A (en) * | 2022-09-27 | 2022-11-01 | 四川互慧软件有限公司 | Patient main index construction method, system, equipment and storage medium |
CN115662607A (en) * | 2022-12-13 | 2023-01-31 | 四川大学 | Internet online inquiry recommendation method based on big data analysis and server |
CN115983233A (en) * | 2023-01-04 | 2023-04-18 | 重庆邮电大学 | Electronic medical record duplication rate estimation method based on data stream matching |
CN116631614A (en) * | 2023-07-24 | 2023-08-22 | 北京惠每云科技有限公司 | Treatment scheme generation method, treatment scheme generation device, electronic equipment and storage medium |
CN116682526A (en) * | 2023-08-03 | 2023-09-01 | 中国中医科学院中国医史文献研究所 | Traditional Chinese medicine knowledge recommendation system based on ancient book knowledge unit processing |
CN117690545A (en) * | 2023-12-12 | 2024-03-12 | 北京健康有益科技有限公司 | Treatment scheme generation method and device based on large model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105843818A (en) * | 2015-01-15 | 2016-08-10 | 富士通株式会社 | Training device, training method, determining device, and recommendation device |
CN109299469A (en) * | 2018-10-29 | 2019-02-01 | 复旦大学 | A method of identifying complicated address in long text |
CN109657062A (en) * | 2018-12-24 | 2019-04-19 | 万达信息股份有限公司 | A kind of electronic health record text resolution closed-loop policy based on big data technology |
Non-Patent Citations (1)
Title |
---|
段旭磊: "微博文本处理及话题分析方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, 15 December 2017 (2017-12-15), pages 3 - 4 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||