CN111949759A - Method and system for retrieving medical record text similarity and computer equipment - Google Patents


Info

Publication number
CN111949759A
Authority
CN
China
Prior art keywords
text
word
medical record
words
long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910407594.0A
Other languages
Chinese (zh)
Inventor
郭士成
王�琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Medical Information Technology Co ltd
Original Assignee
Peking University Medical Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Medical Information Technology Co ltd filed Critical Peking University Medical Information Technology Co ltd
Priority to CN201910407594.0A priority Critical patent/CN111949759A/en
Publication of CN111949759A publication Critical patent/CN111949759A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a method, a system and computer equipment for retrieving similar medical record text. The method comprises the following steps: receiving text information; performing word segmentation on the text information to generate words; training the words into a long text vector; and retrieving, from a database, medical record information similar to the text information according to the long text vector. With this method, medical knowledge is mined and learned automatically from the database by a medical artificial intelligence method, without the participation of experts, and a model for comparing similar medical records is constructed. The model can integrate the comparison results of various types of free text and can obtain similar medical record recommendations efficiently and accurately. The comparison results are highly consistent with those obtained by doctors' manual comparison, so the method can provide doctors with a clinical path reference of practical value and addresses the problem that doctors spend a large amount of time looking up previous medical records.

Description

Method and system for retrieving medical record text similarity and computer equipment
Technical Field
The invention relates to the technical field of computers, in particular to a method, a system and computer equipment for retrieving medical record text similarity.
Background
An Electronic Medical Record (EMR) is the medical record generated when a patient visits a medical institution. It carries doctors' medical experience and practice patterns, its core value lies in auxiliary diagnosis, and it provides decision support for doctors. EMR data mainly takes the form of tables, free text and images, of which free text is mostly unstructured. As hospitals have become more information-based, they have accumulated large amounts of unstructured EMR free text that contains much valuable medical and clinical information, and as medical information has become more standardized, free text has come to cover more standard and complete patient information. Many scholars, organizations and enterprises in China and abroad now work on EMR-based auxiliary diagnosis systems. This research can cover the complete medical process and plays an important role in optimizing workflows, improving efficiency, reducing medical errors and improving the quality of care. In China, applied research on Chinese EMRs focuses on the development of EMR systems on the one hand, and on EMR-based clinical path optimization and similar-EMR search on the other. In the related art, the core techniques for retrieving similar Chinese medical record text mainly compare records through keywords or an ontology model. These techniques rely on the knowledge of medical experts and do not make good use of the information already contained in large-scale EMR data.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
Therefore, the invention provides a method for searching medical record text similarity in a first aspect.
The invention provides a system for searching medical record text similarity in a second aspect.
A third aspect of the invention provides a computer apparatus.
A fourth aspect of the invention provides a computer-readable storage medium.
In view of this, the first aspect of the present invention provides a method for retrieving similarity between medical records, including: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information in the database according to the long text vector.
In this method for retrieving similar medical record text, word segmentation is performed on the received text information. Word segmentation includes resolving segmentation ambiguity and identifying unknown words, so that terms such as disease names can be segmented out. The segmented words are used in the subsequent training, and the accuracy of segmentation determines the accuracy of the subsequent steps. The generated words are trained into a long text vector, which gives a numerical representation of the long text, and medical record information similar to the text information is retrieved from the database according to the long text vector. When medical record information is retrieved in this way, medical knowledge is mined and learned automatically from the database by a medical artificial intelligence method, without the participation of experts, and a model for comparing similar medical records is constructed. The model can integrate the comparison results of various types of free text and can obtain similar medical record recommendations efficiently and accurately. The results are highly consistent with those obtained by doctors' manual comparison, provide doctors with a clinical path reference of practical value, and address the problem that doctors spend a large amount of time looking up previous medical records. The method can also assist doctors who lack clinical experience, so that patients receive diagnosis and treatment better and sooner, which further improves the efficiency and accuracy of clinical diagnosis.
Specifically, the method mainly processes the chief complaint, history of present illness, past history, personal history, family history and general examination results recorded as free text, in order to support a more complete auxiliary diagnosis of the patient.
According to the medical record text similarity retrieval method provided by the invention, the method can also have the following additional technical characteristics:
in the above technical solution, preferably, the method for retrieving medical record text similarity further includes: after the step of performing word segmentation processing on the text information to generate words, the method further comprises the following steps: performing tagging processing on the part of speech of the word; and classifying the words according to the part-of-speech labels of the words.
In this technical solution, the text information is preprocessed by a named entity recognition application: the part of speech of each word is tagged and the words are classified according to the tags, so that each word in a sentence receives a correct lexical tag and a category. The named entity recognition application can also accurately segment unknown words. Part-of-speech tagging methods are mainly divided into rule-based and statistics-based methods. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above technical solutions, preferably, the step of performing word segmentation processing on the text information to generate words specifically includes: performing word segmentation processing on the text information according to a disease dictionary, regular expressions and stop-word removal to generate words.
In this technical solution, word segmentation is performed on the text information using the disease dictionary, regular expressions and stop-word removal, which removes interfering words; the maximum matching method is also used to improve the accuracy of word segmentation.
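The segmentation step described above can be illustrated with a minimal sketch. The dictionary entries, stop-word list and duration pattern below are hypothetical placeholders, and forward maximum matching is assumed as the matching variant; the patent does not specify these details.

```python
import re

# Hypothetical resources: a tiny disease/symptom dictionary, a stop-word list,
# and a regular expression for durations such as "3天" (3 days).
DISEASE_DICT = {"高血压", "糖尿病", "头痛", "胸闷", "咳嗽"}
STOP_WORDS = {"的", "了", "，", "。", "伴"}
DURATION_RE = re.compile(r"\d+(?:天|周|月|年)")
MAX_WORD_LEN = max(len(w) for w in DISEASE_DICT)


def segment(text):
    """Forward maximum matching with a regex pass and stop-word removal."""
    words = []
    i = 0
    while i < len(text):
        # A regular-expression match takes priority at the current position.
        m = DURATION_RE.match(text, i)
        if m:
            words.append(m.group())
            i = m.end()
            continue
        # Try the longest dictionary word first, then shorter candidates.
        # A single character is always accepted as a fallback.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DISEASE_DICT:
                words.append(candidate)
                i += size
                break
    # Drop interfering words after segmentation.
    return [w for w in words if w not in STOP_WORDS]


print(segment("头痛3天，伴咳嗽"))
```

Trying the longest candidate first is what makes this "maximum" matching: a multi-character disease name is preferred over its individual characters.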
In any of the above technical solutions, preferably, the step of training the words into long text vectors specifically includes: training the words into word vectors; the word vectors are grouped into long text vectors.
In this technical solution, the segmented words are first trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which gives a numerical representation of the long text of the medical record.
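As a rough illustration of combining word vectors into one long text vector, the sketch below averages them. Averaging is an assumption made here for simplicity; the patent does not state how the vectors are combined, and the vector values are made up.

```python
# Hypothetical pre-trained word vectors. In practice these would come from a
# trained embedding model; the values here are invented for illustration.
WORD_VECTORS = {
    "头痛": [1.0, 0.0, 0.0],
    "咳嗽": [0.0, 1.0, 0.0],
    "发热": [0.0, 0.0, 1.0],
}


def long_text_vector(words, vectors=WORD_VECTORS):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known word: return a zero vector of the embedding dimension.
        return [0.0] * len(next(iter(vectors.values())))
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


print(long_text_vector(["头痛", "咳嗽"]))
```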
In any of the above technical solutions, preferably, the step of obtaining medical record information similar to the text information in the database according to the long text vector specifically includes: acquiring a plurality of long texts similar to the text information from a database, and respectively segmenting the long texts into word sets as a screening set; acquiring a long text matched with a word set obtained after word segmentation processing is carried out on text information in the screening set, and taking the long text as a priority result; calculating the relevance of a word set which is not matched with the text information in the screening set and the word set after the text information is subjected to word segmentation processing according to the long text vector; judging whether the relevance is greater than a preset threshold value or not; and if the relevance is greater than a preset threshold value, arranging the long texts which are not matched with the text information in a positive sequence according to the relevance.
In this technical solution, edit distance is first used to sort the EMRs (electronic medical records) in ascending order, so that the records whose wording is literally most similar come first, and each EMR is segmented into a corresponding word set. The Jaccard distance is used to find the long texts whose word sets completely match the text information, and these long texts are given the highest priority. For long texts that do not match completely, the cosine distance is used to compute the relevance between words. A preset threshold is set: if the relevance is smaller than the threshold, it is set to 0 and the words are considered unrelated. The distances of the related words are then added to the ordering to obtain the long texts with the next-highest priority. For example, if the word set of the current long text is {A, B} and a set in the library is {C, A}, the weighted similarity distance obtained from the cosine calculation is: $\frac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
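The ranking logic can be sketched as follows. This is a simplified illustration under stated assumptions: the cosine score is computed between whole long text vectors rather than between individual unmatched words, related records are listed from most to least relevant, and the names `rank`, `candidates` and the default `threshold` value are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means the word sets match completely."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_similarity(u, v):
    """(u · v) / (||u|| ||v||), or 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(query_words, query_vec, candidates, threshold=0.5):
    """candidates: list of (record_id, words, vector) tuples."""
    exact, related = [], []
    for rec_id, words, vec in candidates:
        # Complete word-set matches get the highest priority.
        if jaccard_distance(query_words, words) == 0.0:
            exact.append(rec_id)
            continue
        # Otherwise keep the record only if its relevance exceeds the threshold.
        score = cosine_similarity(query_vec, vec)
        if score > threshold:
            related.append((score, rec_id))
    # Assumption: most relevant first among the non-exact matches.
    related.sort(key=lambda pair: pair[0], reverse=True)
    return exact + [rec_id for _, rec_id in related]


candidates = [
    ("r1", ["A", "B"], [1.0, 0.0]),
    ("r2", ["A", "C"], [0.6, 0.8]),
    ("r3", ["D"], [0.0, 1.0]),
]
print(rank(["A", "B"], [1.0, 0.0], candidates))
```

In this toy run, `r1` matches the query word set exactly, `r2` passes the threshold, and `r3` is dropped because its relevance is 0.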
In a second aspect of the present invention, a system for retrieving similarity between medical records and texts is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information in the database according to the long text vector.
In this system for retrieving similar medical record text, word segmentation is performed on the received text information. Word segmentation includes resolving segmentation ambiguity and identifying unknown words, so that diseases and time expressions can be segmented out. The segmented words are used in the subsequent training, and the accuracy of segmentation determines the accuracy of the subsequent steps. The generated words are trained into a long text vector, which gives a numerical representation of the long text, and medical record information similar to the text information is retrieved from the database according to the long text vector. When the system retrieves medical record information, medical knowledge is mined and learned automatically from the database by a medical artificial intelligence method, without the participation of experts, and a model for comparing similar medical records is constructed. The model can integrate the comparison results of various types of free text and can obtain similar medical record recommendations efficiently and accurately. The results are highly consistent with those obtained by doctors' manual comparison, provide doctors with a clinical path reference of practical value, and address the problem that doctors spend a large amount of time looking up previous medical records. The system can also assist doctors who lack clinical experience, so that patients receive diagnosis and treatment better and sooner, which further improves the efficiency and accuracy of clinical diagnosis.
Specifically, the system mainly processes the chief complaint, history of present illness, past history, personal history, family history and general examination results recorded as free text, in order to support a more complete auxiliary diagnosis of the patient.
The system for searching the medical record text similarity provided by the invention can also have the following additional technical characteristics:
in the above technical solution, preferably, the processor further implements, when executing the computer program: after the step of performing word segmentation processing on the text information to generate words, the method further comprises the following steps: performing tagging processing on the part of speech of the word; and classifying the words according to the part-of-speech labels of the words.
In this technical solution, the text information is preprocessed by a named entity recognition application: the part of speech of each word is tagged and the words are classified according to the tags, so that each word in a sentence receives a correct lexical tag and a category. The named entity recognition application can also accurately segment unknown words. Part-of-speech tagging methods are mainly divided into rule-based and statistics-based methods. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above technical solutions, preferably, the step of performing word segmentation processing on the text information to generate words, implemented when the processor executes the computer program, specifically includes: performing word segmentation processing on the text information according to a disease dictionary, regular expressions and stop-word removal to generate words.
In this technical solution, word segmentation is performed on the text information using the disease dictionary, regular expressions and stop-word removal, which removes interfering words; the maximum matching method is also used to improve the accuracy of word segmentation.
In any of the above technical solutions, preferably, the step of training the words into long text vectors is implemented when the processor executes the computer program, and specifically includes: training the words into word vectors; the word vectors are grouped into long text vectors.
In this technical solution, the segmented words are first trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which gives a numerical representation of the long text of the medical record.
In any of the above technical solutions, preferably, the step of obtaining medical record information similar to the text information in the database according to the long text vector is implemented when the processor executes the computer program, and specifically includes: acquiring a plurality of long texts similar to the text information from a database, and respectively segmenting the long texts into word sets as a screening set; acquiring a long text matched with a word set obtained after word segmentation processing is carried out on text information in the screening set, and taking the long text as a priority result; calculating the relevance of a word set which is not matched with the text information in the screening set and the word set after the text information is subjected to word segmentation processing according to the long text vector; judging whether the relevance is greater than a preset threshold value or not; and if the relevance is greater than a preset threshold value, arranging the long texts which are not matched with the text information in a positive sequence according to the relevance.
In this technical solution, edit distance is first used to sort the EMRs (electronic medical records) in ascending order, so that the records whose wording is literally most similar come first, and each EMR is segmented into a corresponding word set. The Jaccard distance is used to find the long texts whose word sets completely match the text information, and these long texts are given the highest priority. For long texts that do not match completely, the cosine distance is used to compute the relevance between words. A preset threshold is set: if the relevance is smaller than the threshold, it is set to 0 and the words are considered unrelated. The distances of the related words are then added to the ordering to obtain the long texts with the next-highest priority. For example, if the word set of the current long text is {A, B} and a set in the library is {C, A}, the weighted similarity distance obtained from the cosine calculation is: $\frac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
In a third aspect of the present invention, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for retrieving the similarity between the medical record texts according to any one of the above technical solutions.
The computer device provided by the invention implements the medical record text similarity retrieval method according to any one of the technical solutions of the first aspect, and therefore has all the beneficial effects of that method.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and when being executed by a processor, the computer program implements the steps of the method according to any of the above technical solutions, so that the method has all the technical effects of a medical history text similarity retrieval method, and details are not repeated herein.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a flowchart illustrating a method for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 2 is another flowchart illustrating a method for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 3 is another flowchart illustrating a method for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 4 is another flowchart illustrating a method for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 5 is a block diagram of a system for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 6 is another block diagram of a system for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 7 is another block diagram of a system for retrieving medical record text similarity according to an embodiment of the present application;
Fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The following describes a method, a system and a computer device for retrieving medical record text similarity according to some embodiments of the present invention with reference to fig. 1 to 8.
Fig. 1 is a flowchart illustrating a medical record text similarity retrieval method according to an embodiment of the present application. As shown in fig. 1, the method includes:
step 102, receiving text information;
step 104, performing word segmentation processing on the text information to generate words;
step 106, training the words into long text vectors;
and step 108, acquiring medical record information similar to the text information in the database according to the long text vector.
In this method for retrieving similar medical record text, word segmentation is performed on the received text information. Word segmentation includes resolving segmentation ambiguity and identifying unknown words, so that terms such as disease names can be segmented out. The segmented words are used in the subsequent training, and the accuracy of segmentation determines the accuracy of the subsequent steps. The generated words are trained into a long text vector, which gives a numerical representation of the long text, and medical record information similar to the text information is retrieved from the database according to the long text vector. When medical record information is retrieved in this way, medical knowledge is mined and learned automatically from the database by a medical artificial intelligence method, without the participation of experts, and a model for comparing similar medical records is constructed. The model can integrate the comparison results of various types of free text and can obtain similar medical record recommendations efficiently and accurately. The results are highly consistent with those obtained by doctors' manual comparison, provide doctors with a clinical path reference of practical value, and address the problem that doctors spend a large amount of time looking up previous medical records. The method can also assist doctors who lack clinical experience, so that patients receive diagnosis and treatment better and sooner, which further improves the efficiency and accuracy of clinical diagnosis.
Specifically, the method mainly processes the chief complaint, history of present illness, past history, personal history, family history and general examination results recorded as free text, in order to support a more complete auxiliary diagnosis of the patient.
In the foregoing embodiment, preferably, after the step of performing word segmentation processing on the text information and generating words, the method further includes: performing tagging processing on the part of speech of the word; and classifying the words according to the part-of-speech labels of the words.
In this embodiment, the text information is preprocessed by a named entity recognition application: the part of speech of each word is tagged and the words are classified according to the tags, so that each word in a sentence receives a correct lexical tag and a category. The named entity recognition application can also accurately segment unknown words. Part-of-speech tagging methods are mainly divided into rule-based and statistics-based methods. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above embodiments, preferably, the step of performing word segmentation processing on the text information to generate words specifically includes: performing word segmentation processing on the text information according to a disease dictionary, regular expressions and stop-word removal to generate words.
In this embodiment, word segmentation is performed on the text information using the disease dictionary, regular expressions and stop-word removal, which removes interfering words; the maximum matching method is also used to improve the accuracy of word segmentation.
In any of the above embodiments, preferably, the step of training the words into long text vectors specifically includes: training the words into word vectors; the word vectors are grouped into long text vectors.
In this embodiment, the segmented words are first trained into word vectors, and the word vectors in each sentence are then combined to form a long text vector, which gives a numerical representation of the long text of the medical record.
In any of the above embodiments, preferably, the step of obtaining medical record information similar to the text information in the database according to the long text vector specifically includes: acquiring, from the database, a plurality of long texts similar to the text information, and segmenting each of them into a word set, the word sets forming a screening set; acquiring, from the screening set, the long texts whose word sets match the word set obtained by segmenting the text information, and taking these long texts as priority results; calculating, according to the long text vectors, the relevance between each word set in the screening set that does not match the text information and the word set obtained by segmenting the text information; judging whether the relevance is greater than a preset threshold; and, if the relevance is greater than the preset threshold, ranking the long texts that do not match the text information in order of relevance.
In this embodiment, edit distance is first used to rank the EMRs (electronic medical records) by literal similarity to the input, and each EMR is segmented into a corresponding word set. The Jaccard distance is used to find the long texts whose word sets completely match that of the text information, and these long texts are given the highest priority. For long texts that do not completely match, the cosine distance is used to compute the relevance between words, and a preset threshold is applied: if the relevance is smaller than the preset threshold, it is set to 0 and the words are considered unrelated. The distances of the related words are added to the ranking to determine the long text matches of next-highest priority. For example, if the current long-text word set {A, B} shares the word A with a set {C, A} in the library, the weighted similarity distance obtained from the cosine calculation on the remaining words B and C is: $\frac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
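The ranking logic of this embodiment can be sketched as below, assuming the candidates have already been narrowed by edit distance. The function names, the tuple layout of `candidates`, the default threshold and the choice to list the most relevant partial matches first are all assumptions made for illustration.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two word sets (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(u, v) -> float:
    """Cosine similarity (u . v) / (|u| |v|); 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(query_words, query_vec, candidates, threshold=0.5):
    """Rank candidates given as (record_id, word_set, long_text_vector)."""
    exact, partial = [], []
    for record_id, words, vec in candidates:
        if jaccard(query_words, words) == 1.0:
            exact.append(record_id)  # complete match: highest priority
        else:
            relevance = cosine(query_vec, vec)
            if relevance > threshold:  # below threshold counts as unrelated
                partial.append((relevance, record_id))
    # Assumption: the most relevant partial matches are listed first.
    partial.sort(key=lambda p: p[0], reverse=True)
    return exact + [rid for _, rid in partial]
```

Complete matches are returned ahead of any partial match regardless of cosine score, which mirrors the priority rule stated in the embodiment.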
Fig. 2 is another flowchart illustrating a method for retrieving similarity between medical records according to an embodiment of the present application. As shown in fig. 2, the method includes:
step 202, receiving patient medical record chief complaint information;
step 204, performing word segmentation processing on patient medical record chief complaint information to generate words;
step 206, training the words into long text vectors;
step 208, screening a search range according to whether the disease name or the specific character is included;
and step 210, calculating the chief complaint similarity according to a combined distance algorithm.
In this embodiment, the received data objects are the patient's chief complaint data (text type) and disease history (numerical type). The chief complaint similarity is calculated first. As shown in fig. 2, the chief complaint entered by the doctor is trained into a text vector using a CRF (conditional random field) algorithm, an RNN (recurrent neural network) and Doc2Vec (a document embedding model). The retrieval range is then screened according to whether the chief complaint contains a disease name or specific characters; edit distance is used to narrow the retrieval range, which reduces time complexity and speeds up retrieval. Finally, the chief complaint similarity is calculated by combining the Jaccard distance and the cosine distance.
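The screening of the retrieval range can be illustrated with the sketch below. It assumes that records are plain strings, that a disease name counts as "contained" when it appears as a substring, and that the `k` literally closest records are kept; none of these details are specified in the text, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def narrow(query, records, disease_names, k=2):
    """Keep records sharing a disease name, then the k closest by edits."""
    mentioned = [d for d in disease_names if d in query]
    if mentioned:
        pool = [r for r in records if any(d in r for d in mentioned)]
    else:
        pool = list(records)  # no disease name found: keep every record
    return sorted(pool, key=lambda r: edit_distance(query, r))[:k]
```

Filtering on the disease name first keeps the quadratic-time edit distance from being computed against every record in the database.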
Fig. 3 is another flowchart illustrating a method for retrieving similarity between medical records according to an embodiment of the present application. As shown in fig. 3, the method includes:
step 302, acquiring medical history information in a medical record of a patient according to medical history statistics;
step 304, automatically encoding the medical history;
step 306, performing word segmentation processing on the medical history to generate words;
step 308, training the words into long text vectors;
step 310, calculating the medical history similarity according to the long text vector.
In this embodiment, the medical history records in the medical record are obtained through medical history statistics, the medical histories are encoded with one-hot codes, and the similarity between the encoded histories is then calculated to obtain the medical history similarity.
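A minimal sketch of the encoding and comparison follows. The vocabulary `HISTORY_ITEMS` is hypothetical, a patient's history is represented as the combination of the one-hot codes of its items, and cosine similarity is assumed as the measure, since the text does not name one.

```python
import math

# Hypothetical vocabulary of medical history items.
HISTORY_ITEMS = ["hypertension", "diabetes", "smoking", "surgery"]


def encode(history):
    """Binary vector: position i is 1 when HISTORY_ITEMS[i] is present."""
    present = set(history)
    return [1 if item in present else 0 for item in HISTORY_ITEMS]


def history_similarity(a, b):
    """Cosine similarity between two encoded histories."""
    u, v = encode(a), encode(b)
    dot = sum(x * y for x, y in zip(u, v))
    # For binary vectors the squared norm equals the number of ones.
    norm = math.sqrt(sum(u)) * math.sqrt(sum(v))
    return dot / norm if norm else 0.0
```

Two patients with no recorded history in common score 0.0, and an empty history is treated as dissimilar to everything rather than raising a division error.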
Fig. 4 is another flowchart illustrating a method for retrieving similarity between medical records according to an embodiment of the present application. As shown in fig. 4, the method includes:
step 402, receiving text information;
step 404, performing word segmentation processing on the text information to generate words;
step 406, training the words into long text vectors;
step 408, calculating the similarity of the chief complaints and the similarity of the medical histories;
step 410, normalizing the similarity of the chief complaints and the similarity of the medical histories;
step 412, feature selection;
step 414, calculating the weight ratio of each feature through feature selection;
and step 416, weighting and summing the chief complaint similarity and the medical history similarity according to the obtained weight ratios to obtain a comprehensive similarity.
In this embodiment, after the chief complaint similarity and the medical history similarity are obtained, the comprehensive similarity is calculated. As shown in fig. 4, the chief complaint similarity and the medical history similarity are normalized so that the input data format is standardized; the weight ratio of each feature is calculated through feature selection; and the two similarities are weighted and summed according to the obtained weight ratios to obtain the comprehensive similarity.
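The normalization and weighted summation can be sketched as follows. Min-max scaling is assumed as the normalization, and the weights `w_complaint` and `w_history` are placeholder values; in the described method they would come from the feature selection step, which is not reproduced here.

```python
def min_max(values):
    """Scale values linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_similarity(complaint_sims, history_sims,
                        w_complaint=0.6, w_history=0.4):
    """Weighted sum of the normalized similarities, one value per record.

    The default weights are placeholders, not values from the source.
    """
    c = min_max(complaint_sims)
    h = min_max(history_sims)
    return [w_complaint * x + w_history * y for x, y in zip(c, h)]
```

Normalizing first matters because the two similarities may be on different scales; without it the larger-valued feature would dominate the sum regardless of the weights.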
In a second aspect of the present invention, a system 50 for retrieving medical record text similarity is provided, including: a memory 502, a processor 504, and a computer program stored on the memory 502 and executable on the processor 504, the processor 504 when executing the computer program implementing: receiving text information; performing word segmentation processing on the text information to generate words; training the words into long text vectors; and acquiring medical record information similar to the text information in the database according to the long text vector.
As shown in fig. 5, the medical record text similarity retrieval system 50 provided by the present invention performs word segmentation on the received text information. The word segmentation includes resolving ambiguous segmentations and identifying unknown words, so that diseases, symptoms and times are segmented out and the segmented words can be used in the subsequent training; accurate word segmentation determines the accuracy of that training step. The generated words are trained into long text vectors to obtain the corresponding numerical representations of the long texts, and medical record information similar to the text information is then obtained from the database according to the long text vectors. In retrieving medical record information, the system uses medical artificial intelligence methods to mine and learn medical knowledge automatically from the database without the participation of experts, and constructs a model for comparing similar medical records. The model can integrate the comparison results of various types of free text and obtain similar medical record recommendations efficiently and accurately, with results highly consistent with those obtained by manual comparison by doctors. It can therefore provide doctors with clinical pathway references of practical value and effectively addresses the problem that doctors spend a large amount of time looking up previous medical records. The system can also assist doctors who lack clinical experience, so that patients receive better and more timely diagnosis and treatment, which in turn improves the efficiency and accuracy of clinical diagnosis.
Specifically, the main objects processed by the system are the chief complaint, history of present illness, past history, personal history, family history and general examination results in free text, from which a complete auxiliary diagnosis of the patient is obtained.
In the above embodiment, preferably, the processor 504, when executing the computer program, further implements, after the step of performing word segmentation processing on the text information to generate words: tagging the parts of speech of the words; and classifying the words according to their part-of-speech tags.
In this embodiment, the text information is preprocessed by a named entity recognition application: the parts of speech of the words are tagged and the words are classified according to the tags, so that each word in a sentence is given a correct lexical tag and a category. Further, the named entity recognition application can accurately segment unknown words. Part-of-speech tagging methods are mainly divided into rule-based and statistics-based methods. Specifically, the words segmented from the long text are first part-of-speech tagged using a CRF (conditional random field) algorithm; the tagged words are then used as input to an RNN (recurrent neural network), which returns the classification of the disease and symptom vocabulary appearing in the long text according to the part-of-speech category.
In any of the above embodiments, preferably, when the processor 504 executes the computer program, the step of performing word segmentation processing on the text information to generate words is implemented, and specifically includes: performing word segmentation processing on the text information according to a disease dictionary and regular expressions, and removing stop words, to generate the words.
In this embodiment, the text information is segmented according to the disease dictionary and the regular expressions, and stop words are removed, which eliminates interfering words; in addition, a maximum matching method is used to improve the accuracy of the word segmentation.
In any of the above embodiments, preferably, when the processor 504 executes the computer program, the step of training the words into long text vectors is implemented, and specifically includes: training the words into word vectors; and combining the word vectors into the long text vector.
In this embodiment, the segmented words are trained into word vectors, and the word vectors of each sentence are then combined to form a long text vector, which yields a numerical representation of the long medical record text.
In any of the above embodiments, preferably, when the processor executes the computer program, the step of obtaining medical record information similar to the text information in the database according to the long text vector is implemented, and specifically includes: acquiring, from the database, a plurality of long texts similar to the text information, and segmenting each of them into a word set, the word sets forming a screening set; acquiring, from the screening set, the long texts whose word sets match the word set obtained by segmenting the text information, and taking these long texts as priority results; calculating, according to the long text vectors, the relevance between each word set in the screening set that does not match the text information and the word set obtained by segmenting the text information; judging whether the relevance is greater than a preset threshold; and, if the relevance is greater than the preset threshold, ranking the long texts that do not match the text information in order of relevance.
In this embodiment, edit distance is first used to rank the EMRs (electronic medical records) by literal similarity to the input, and each EMR is segmented into a corresponding word set. The Jaccard distance is used to find the long texts whose word sets completely match that of the text information, and these long texts are given the highest priority. For long texts that do not completely match, the cosine distance is used to compute the relevance between words, and a preset threshold is applied: if the relevance is smaller than the preset threshold, it is set to 0 and the words are considered unrelated. The distances of the related words are added to the ranking to determine the long text matches of next-highest priority. For example, if the current long-text word set {A, B} shares the word A with a set {C, A} in the library, the weighted similarity distance obtained from the cosine calculation on the remaining words B and C is: $\frac{B \cdot C}{\lVert B \rVert \, \lVert C \rVert}$.
Specifically, as shown in fig. 6, the patient medical record 6 is entered, a similar medical record 62 is obtained from the medical record database 60, and the result is returned to the doctor. After the patient describes the condition, the doctor can search for similar long-text medical records and, drawing on experience, make a corresponding clinical diagnosis and provide a suitable treatment plan for the patient.
Specifically, as shown in fig. 7, based on the medical record data 7 entered by the doctor for a new patient, the chief complaint input data 70, the patient disease history data 72 and the general examination data 74 are separated from the medical record data; chief complaint similarity calculation 702, medical history similarity calculation 722 and comprehensive similarity calculation 742 are performed on the separated data; similar medical records are acquired from the Chinese electronic medical record database 78; and the retrieval result 76 is returned to assist the doctor in making a clinical diagnosis.
As shown in fig. 8, in a third aspect of the present invention, a computer device 8 is provided, which includes a memory 80, a processor 82, and a computer program stored on the memory 80 and executable on the processor 82, and when the processor 82 executes the computer program, the method for retrieving the similarity between the texts of the medical records according to any of the above embodiments is implemented.
The embodiment provided by the invention comprises the medical record text similarity retrieval method in any embodiment, so that the embodiment has all the beneficial effects of the medical record text similarity retrieval method.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the method according to any of the above embodiments, and therefore has all the technical effects of the medical record text similarity retrieval method; details are not repeated here.
In the present invention, the term "plurality" means two or more unless explicitly defined otherwise. The terms "mounted," "connected," "fixed," and the like are to be construed broadly, and for example, "connected" may be a fixed connection, a removable connection, or an integral connection; "coupled" may be direct or indirect through an intermediary. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for retrieving medical record text similarity is characterized by comprising the following steps:
receiving text information;
performing word segmentation processing on the text information to generate words;
training the words into long text vectors;
and acquiring medical record information similar to the text information in a database according to the long text vector.
2. The medical record text similarity retrieval method according to claim 1, wherein after the step of performing word segmentation processing on the text information to generate words, the method further comprises:
performing tagging processing on the part of speech of the word;
and classifying the words according to the part-of-speech labels of the words.
3. The medical record text similarity retrieval method according to claim 1, wherein the step of performing word segmentation processing on the text information to generate words specifically comprises:
performing word segmentation processing on the text information according to a disease dictionary and regular expressions, and removing stop words, to generate the words.
4. The medical record text similarity retrieval method according to claim 2, wherein the step of training the words into long text vectors specifically comprises:
training the words into word vectors;
combining the word vectors into the long text vector.
5. The medical record text similarity retrieval method according to any one of claims 1 to 4, wherein the step of obtaining medical record information similar to the text information in a database according to the long text vector specifically includes:
acquiring a plurality of long texts similar to the text information from the database, and respectively segmenting the long texts into word sets as a screening set;
acquiring a long text matched with the word set after the word segmentation processing is carried out on the text information in the screening set, and taking the long text as a priority result;
calculating the relevance of a word set which is not matched with the text information in the screening set and a word set after word segmentation processing is carried out on the text information according to the long text vector;
judging whether the relevance is greater than a preset threshold value or not;
and if the relevance is greater than the preset threshold value, ranking the long texts which are not matched with the text information in order of relevance.
6. A system for retrieving medical record text similarity is characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing:
receiving text information;
performing word segmentation processing on the text information to generate words;
training the words into long text vectors;
and acquiring medical record information similar to the text information in a database according to the long text vector.
7. The medical record text similarity retrieval system according to claim 6, wherein the processor, when executing the computer program, further implements, after the step of performing word segmentation processing on the text information to generate words:
performing tagging processing on the part of speech of the word;
and classifying the words according to the part-of-speech labels of the words.
8. The medical record text similarity retrieval system according to claim 6, wherein the processor implements the step of performing word segmentation processing on the text information to generate words when executing the computer program, and specifically comprises:
performing word segmentation processing on the text information according to a disease dictionary and regular expressions, and removing stop words, to generate the words.
9. The system for retrieving medical record text similarity according to claim 7, wherein the processor, when executing the computer program, implements the step of training the words into long text vectors, specifically comprising:
training the words into word vectors;
combining the word vectors into the long text vector.
10. The system for retrieving medical record text similarity according to any one of claims 6 to 9, wherein the processor, when executing the computer program, implements the step of obtaining medical record information similar to the text information in the database according to the long text vector, specifically comprising:
acquiring a plurality of long texts similar to the text information from the database, and respectively segmenting the long texts into word sets as a screening set;
acquiring a long text matched with the word set after the word segmentation processing is carried out on the text information in the screening set, and taking the long text as a priority result;
calculating the relevance of a word set which is not matched with the text information in the screening set and a word set after word segmentation processing is carried out on the text information according to the long text vector;
judging whether the relevance is greater than a preset threshold value or not;
and if the relevance is greater than the preset threshold value, ranking the long texts which are not matched with the text information in order of relevance.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the method for retrieving similarity between medical record texts according to any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for retrieving the similarity of medical record texts according to any one of claims 1 to 5.
CN201910407594.0A 2019-05-16 2019-05-16 Method and system for retrieving medical record text similarity and computer equipment Pending CN111949759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407594.0A CN111949759A (en) 2019-05-16 2019-05-16 Method and system for retrieving medical record text similarity and computer equipment


Publications (1)

Publication Number Publication Date
CN111949759A true CN111949759A (en) 2020-11-17

Family

ID=73336902


Country Status (1)

Country Link
CN (1) CN111949759A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination