CN109935337B

CN109935337B - Medical record searching method and system based on similarity measurement

Info

Publication number: CN109935337B
Application number: CN201910137294.5A
Authority: CN
Inventors: 朱培栋; 张振宇; 王平; 熊荫乔; 刘欣; 郭敏捷; 冯璐; 郑昱; 李勇
Original assignee: Changsha University
Current assignee: Changsha University
Priority date: 2019-02-25
Filing date: 2019-02-25
Publication date: 2021-01-15
Anticipated expiration: 2039-02-25
Also published as: CN109935337A

Abstract

The invention discloses a medical record searching method and system based on similarity measurement. The method comprises: constructing medical record groups from a query medical record set A to obtain a medical record group set C; generating similarity labels for the medical record group set C to obtain a labeled medical record group data set D; constructing a machine learning model and training it on the data set D; and inputting the target medical record together with all medical records in the query medical record set A into the trained model to obtain similarity metric values between the target medical record and every medical record in A, and outputting the N medical records with the highest similarity metric values. The method makes full use of both the information in the medical records and the related theoretical knowledge, which can improve the precision of medical record similarity measurement and the accuracy of medical record ranking. It has the advantages of high precision, wide applicability, good robustness, and potential for continued optimization.

Description

Medical record searching method and system based on similarity measurement

Technical Field

The invention relates to a similar medical record searching technology in the medical field, in particular to a medical record searching method and system based on similarity measurement, which can realize similar medical record searching under heterogeneous medical information and sort the similar medical records based on similarity.

Background

A medical record is a file that records, in a standardized way, a patient's symptoms and the diagnosis and treatment received. It mainly comprises the patient's basic information, medical history, examination information, medical orders, diagnosis information, treatment plan, and feedback on the course of the disease. A medical record describes the complete course of a patient's illness during a visit and stores it as data. Research and analysis that take medical records as their object are therefore of great significance, and medical record searching is no exception. For example, when a doctor at a hospital below the county level is unsure about a patient's condition, searching similar medical records for the treatment plans adopted by experts can assist the diagnosis and treatment of the current patient. Research on medical record searching thus has both theoretical and practical significance.

Searching for similar medical records rests mainly on ranking medical records by similarity. Such ranking plays a fundamental role in understanding the types of medical records, identifying relationships between them, and predicting the course of the conditions they describe; it is the premise and basis for applying medical records. Ranking by similarity means that, for a given medical record, every medical record in the library is compared with it and the library is then sorted by similarity. The central task is to determine the similarity value between two medical records, that is, the medical record similarity measure: the closer two medical records are, the larger their similarity measure, and the further apart they are, the smaller it is.

Existing methods for measuring the similarity of medical records fall into two categories: machine learning models based on medical record data and traditional theoretical models based on domain knowledge. A traditional theoretical model starts from medical knowledge and judges the relative similarity of medical records through pathological analysis. It offers good interpretability and high precision when only a small number of medical records are involved, but it demands considerable professional knowledge, is limited by the state of that knowledge, and is difficult to improve in precision. A machine learning model starts from the data: it analyzes a large number of medical records whose similarity relations are already established and learns the similarity relation from them. In practice, therefore, medical record similarity measurement faces two problems: the precision of the model is difficult to improve, and similarity labels for medical records are difficult to obtain. Both limit the precision of similarity measurement and, in turn, the accuracy of similarity ranking. A similarity measurement method that can obtain similarity labels automatically is therefore particularly important, in both theoretical and practical terms.

Chinese patent publication No. CN104572675B discloses a system and method for retrieving similar medical records. That scheme designs a pathology-based method for calculating medical record similarity and uses it to retrieve similar medical records from a medical record library. Because the similarity is calculated from the pathological angle only, the scheme ignores the data in the medical records themselves, and its similarity measure cannot optimize itself based on errors. In addition, the scheme performs similarity retrieval on medical records that do not completely reflect the patient's condition, so the similarity it finds is only a partial similarity of the patient's condition. Yanghe et al., in Southeast National Defense Medicine, disclose a similar medical record retrieval system based on a medical big data platform. That scheme relies mainly on natural language processing to retrieve similar medical records from the library on the platform and, like CN104572675B, falls short in the accuracy of its similarity measure and in how completely the medical records reflect the patient's condition. The accuracy and completeness of medical record queries therefore leave room for further optimization.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a medical record searching method and system based on similarity measurement. The invention is well suited to measuring the similarity of medical records and addresses practical problems such as the lack of medical record similarity labels and the incompleteness of theoretical knowledge about medical record similarity. It makes full use of both the information in the medical records and the related theoretical knowledge, which can improve the precision of similarity measurement and the accuracy of medical record ranking. Because the similarity measure is based on machine learning, it also has good potential for further precision improvement. The invention therefore has the advantages of high precision, wide applicability, good robustness, and potential for continued optimization.

In order to solve the technical problems, the invention adopts the technical scheme that:

A medical record searching method based on similarity measurement comprises the following implementation steps:

1) constructing medical record groups from the query medical record set A to obtain a medical record group set C;

2) assigning similarity labels to the medical record groups in the medical record group set C to obtain a medical record group data set D with similarity labels;

3) constructing a machine learning model and training it on the medical record group data set D with similarity labels, so that through training the model establishes a mapping between medical record groups and their similarity;

4) inputting the target medical record together with all medical records in the query medical record set A into the machine learning model to obtain similarity metric values between the target medical record and all medical records in the query medical record set A, and selecting the N medical records with the highest similarity metric values for output.
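
The retrieval in step 4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `top_n_similar` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the trained machine learning model.

```python
from typing import Callable, List, Sequence, Tuple

Record = Sequence[float]


def top_n_similar(
    target: Record,
    query_set: List[Record],
    score: Callable[[Record, Record], float],
    n: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the n records most similar to target."""
    scored = [(i, score(target, rec)) for i, rec in enumerate(query_set)]
    # Sort by similarity, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


def toy_score(a: Record, b: Record) -> float:
    """Placeholder for the trained model (assumption): 1 / (1 + L1 distance)."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


if __name__ == "__main__":
    library = [[1.0, 2.0], [5.0, 5.0], [1.0, 2.5]]
    for index, similarity in top_n_similar([1.0, 2.0], library, toy_score, n=2):
        print(index, similarity)
```

Any callable that maps two records to a similarity value can be passed as `score`, so the trained model can replace the placeholder without changing the ranking logic.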

Preferably, the detailed steps of step 1) include:

1.1) for all medical records in the query medical record set A, forming every pairwise combination of two medical records to obtain a medical record group set B, wherein each element of the medical record group set B is a medical record group consisting of two medical records;

1.2) randomly selecting part of the medical record groups from the medical record group set B to obtain a medical record group selection set C₀; for the medical record group set B, calculating a similarity index value under each of a plurality of specified similarity indexes, and sorting the groups in descending order of the index value to obtain, for each similarity index, a medical record group ordered set B_i; for each ordered set B_i, selecting groups according to a probability distribution over the similarity index values to generate a medical record group selection set C_i; the specified similarity indexes comprise at least two of Euclidean distance, cosine distance, Jaccard distance and adjusted cosine distance;

1.3) taking the union of the medical record group selection set C₀ and all the medical record group selection sets C_i to obtain the medical record group set C.

Preferably, when the medical record group selection set C_i is generated by selection based on the probability distribution of the similarity index values, the probability that each medical record group in the ordered set B_i is randomly selected is given by formula (1);

in formula (1), P(SM_i) is the probability that the i-th medical record group in the ordered set B_i is selected, SM_i is the similarity index value of the i-th medical record group in the ordered set B_i, f(SM_j) is the normalized similarity index value of the j-th medical record group in the ordered set B_i, and m is the number of medical record groups in the ordered set B_i; the functional expression of f(SM_j) is given by formula (2);

in formula (2), SM₁ is the similarity index value of the 1st medical record group in the ordered set B_i, SM_m is the similarity index value of the m-th medical record group, and m is the number of medical record groups in the ordered set B_i.
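
The formula images for (1) and (2) are not reproduced in this text. One reading that is consistent with the symbol definitions above is sketched here; it assumes that the normalization in formula (2) is min-max scaling between the largest value SM_1 and the smallest value SM_m of the descending ordered set, which the text does not state explicitly.

```latex
P(SM_i) = \frac{f(SM_i)}{\sum_{j=1}^{m} f(SM_j)} \tag{1}

f(SM_j) = \frac{SM_j - SM_m}{SM_1 - SM_m} \tag{2}
```

Under this assumption the probabilities over the m groups sum to 1, a group's chance of selection grows with its index value, and the last group in the ordering has probability 0.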

Preferably, the detailed steps of step 2) include:

2.1) representing the medical record group set C as a medical record group set matrix b, wherein each row of the matrix b represents one medical record, the first n columns represent the n features of the medical record, and the last column s is the diagnosis information of the medical record;

2.2) determining the weight of each feature of the medical records from the medical record group set matrix b;

2.3) calculating the similarity value s of each medical record group in the medical record group set C from the similarity between the features of its two medical records and the weights of those features, thereby obtaining the medical record group data set D with similarity labels.

Preferably, the detailed steps of step 2.2) include: calculating an original weight for each feature of the medical records according to formula (4), and normalizing the original weights of all the features to obtain the final weight of each feature;

in formula (4), y_i' is the original weight of the i-th feature of the medical records; the two vector symbols denote, respectively, the feature vector formed by the i-th column of features of all medical records in the medical record group set matrix b and the diagnosis information feature vector formed by the diagnosis information s of all medical records in the matrix b; and σ_i is the variance of the i-th column of features over all medical records in the matrix b.

Preferably, the functional expression for the similarity value s of each medical record group in the medical record group set C calculated in step 2.3) is given by formula (6);

in formula (6), s_ij is the similarity of the medical record group consisting of medical record i and medical record j, n is the total number of features, y_x is the normalized weight of the x-th feature, and the remaining symbols denote, in order, the value of the x-th feature of medical record i, the value of the x-th feature of medical record j, the maximum value of the x-th feature, and the minimum value of the x-th feature.
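
Formula (6) is not reproduced in this text, so the sketch below assumes one common form that matches the symbol list: a weighted sum, over the n features, of one minus the range-normalized absolute difference between the two records. The function name `weak_label_similarity`, the array layout and the handling of constant features are illustrative assumptions, and the raw weights are taken as given rather than computed from formula (4).

```python
import numpy as np


def weak_label_similarity(features: np.ndarray, raw_weights: np.ndarray) -> np.ndarray:
    """Pairwise similarity matrix s[i, j] from weighted per-feature closeness.

    features: shape (num_records, n), one row per medical record.
    raw_weights: shape (n,), non-negative feature weights with a positive sum.
    """
    weights = raw_weights / raw_weights.sum()          # normalized weights y_x
    spans = features.max(axis=0) - features.min(axis=0)
    spans = np.where(spans == 0, 1.0, spans)           # constant feature: avoid 0/0
    # Range-normalized absolute differences, shape (records, records, n).
    diffs = np.abs(features[:, None, :] - features[None, :, :]) / spans
    # Assumed form of formula (6): sum over x of y_x * (1 - normalized difference).
    return ((1.0 - diffs) * weights).sum(axis=2)


if __name__ == "__main__":
    records = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    print(weak_label_similarity(records, np.array([1.0, 3.0])))
```

With this form every similarity lies in [0, 1], identical records score 1, and the resulting values can serve as the weak labels of data set D.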

Preferably, the detailed step of constructing the machine learning model in step 3) includes:

3.1) designing three loss functions: a scoring loss function, a sorting loss function and a sorting probability loss function; the input of the scoring loss function is the data of two medical records and its output is a similarity score, the scoring loss being the absolute value of the difference between the predicted score and the label score; the input of the sorting loss function is the data of three medical records, namely one query medical record and two medical records to be sorted, and its output is a similarity score; the input of the sorting probability loss function is likewise the data of three medical records, namely one query medical record and two medical records to be sorted, and its output is a score comparison probability, the sorting probability loss being expressed through the difference between the probabilities obtained from the predicted scores and from the label scores of the query medical record paired with each of the two medical records to be sorted;

3.2) constructing a neural network model consisting of an input layer, a hidden layer and an output layer, wherein the input layer feeds all dimensions of the medical records into the network, the hidden layer is a fully connected network used for feature processing of the medical records, and the output layer outputs the similarity metric value between two medical records;

3.3) taking the scoring loss function, the sorting loss function and the sorting probability loss function in turn as the loss function of the neural network model, and selecting the loss function that performs best;

3.4) selecting the activation function of the neural network model according to the selected loss function: a linear activation function when the scoring loss function is selected, the tanh function when the sorting loss function is selected, and the sigmoid function when the sorting probability loss function is selected.
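
The network of steps 3.2) and 3.4) can be sketched as a small fully connected model. The pairing of loss type and activation follows the text; everything else is an assumption made for illustration: the class name `PairScorer`, concatenating the two records as the input, a single hidden layer with ReLU, applying the selected activation at the output layer, and the random initialization. Training is omitted.

```python
import numpy as np

# Activation per selected loss function, as listed in step 3.4).
OUTPUT_ACTIVATIONS = {
    "scoring": lambda z: z,                                      # linear
    "sorting": np.tanh,                                          # tanh
    "sorting_probability": lambda z: 1.0 / (1.0 + np.exp(-z)),   # sigmoid
}


class PairScorer:
    """Fully connected network: two records in, one similarity value out."""

    def __init__(self, n_features: int, hidden: int, loss_type: str, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(2 * n_features, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, 1))
        self.b2 = np.zeros(1)
        self.out_act = OUTPUT_ACTIVATIONS[loss_type]

    def forward(self, rec_a: np.ndarray, rec_b: np.ndarray) -> float:
        x = np.concatenate([rec_a, rec_b])            # all dimensions of both records
        h = np.maximum(0.0, x @ self.w1 + self.b1)    # hidden layer (ReLU is an assumption)
        return float(self.out_act(h @ self.w2 + self.b2)[0])


if __name__ == "__main__":
    model = PairScorer(n_features=3, hidden=4, loss_type="sorting_probability")
    print(model.forward(np.ones(3), np.zeros(3)))
```

Keeping the activation in a lookup table makes it straightforward to train one model per loss function and compare them, as step 3.3) describes.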

Preferably, when three loss functions of the scoring loss function, the sorting loss function and the sorting probability loss function are designed in the step 3.1), a function expression of the scoring loss function is shown as a formula (7), a function expression of the sorting loss function is shown as a formula (8), and a function expression of the sorting probability loss function is shown as a formula (9);

in formula (7), L(θ) is the loss value of the scoring loss function and t is the number of medical record groups; the remaining two symbols denote, respectively, the model prediction score and the label score of the two medical records in a group;

in formula (8), L(θ) is the loss value of the sorting loss function and t is the number of medical record groups; the remaining symbols denote the neural network prediction scores of the query medical record q_i paired with each of the two medical records to be sorted, and the label scores of the same two pairs; sign is the sign function;

in formula (9), L(θ) is the loss value of the sorting probability loss function and t is the number of medical record groups; one probability term denotes, under the label scores, the probability that the similarity between the medical record q_i and the first medical record to be sorted is greater than the similarity between q_i and the second medical record to be sorted, and the other denotes the same probability under the model prediction scores; the functional expressions of the two probability terms are given by formulas (10) and (11) respectively;

in formulas (10) and (11), the symbols denote the label scores of the medical record q_i paired with each of the two medical records to be sorted, and the neural network prediction scores of the same two pairs.
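
Formulas (7) to (11) are not reproduced in this text, so the three functions below are stand-ins chosen to match the verbal descriptions, not the patent's exact expressions. The assumptions are: the scoring loss is a mean absolute error; the sorting loss is a hinge on the predicted score difference, signed by the label order; and the sorting probability loss is a RankNet-style cross-entropy in which both probabilities are a sigmoid of the score difference. All function names are hypothetical.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def scoring_loss(pred, label) -> float:
    """Mean absolute difference between predicted and label scores."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(label))))


def sorting_loss(pred_1, pred_2, label_1, label_2) -> float:
    """Penalize triples whose predicted order disagrees with the label order."""
    order = np.sign(np.asarray(label_1) - np.asarray(label_2))   # +1, 0 or -1
    margin = np.asarray(pred_1) - np.asarray(pred_2)
    return float(np.mean(np.maximum(0.0, -order * margin)))


def sorting_probability_loss(pred_1, pred_2, label_1, label_2) -> float:
    """Cross-entropy between label-side and prediction-side order probabilities."""
    p_label = _sigmoid(np.asarray(label_1) - np.asarray(label_2))
    p_pred = np.clip(_sigmoid(np.asarray(pred_1) - np.asarray(pred_2)), 1e-12, 1 - 1e-12)
    return float(np.mean(-p_label * np.log(p_pred) - (1 - p_label) * np.log(1 - p_pred)))


if __name__ == "__main__":
    # One query record; scores of (query, record 1) and (query, record 2).
    print(scoring_loss([0.5, 1.0], [0.0, 1.0]))
    print(sorting_loss([0.2], [0.8], [0.9], [0.1]))
    print(sorting_probability_loss([0.2], [0.8], [0.9], [0.1]))
```

In each of the two three-record losses, `pred_1` and `label_1` refer to the query record paired with the first record to be sorted, and `pred_2` and `label_2` to the query record paired with the second.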

The invention also provides a medical record searching system based on the similarity measurement, which comprises a computer device, wherein the computer device is programmed to execute the steps of the medical record searching method based on the similarity measurement, or a storage medium of the computer device is stored with a computer program which is programmed to execute the medical record searching method based on the similarity measurement.

The present invention also provides a computer readable storage medium having stored therein a computer program programmed to execute the aforementioned similarity metric-based medical record searching method of the present invention.

Compared with the prior art, the invention has the following advantages:

1. Medical record searching based on similarity measurement meets a real need in practice. Medical resources are at present unevenly distributed: most medical and expert resources are concentrated in a small number of large hospitals, while most hospitals below the county level have few medical resources and, on the whole, a lower level of professional skill among their doctors. Yet these hospitals are where the majority of patients are treated, so most patients cannot receive high-level medical services. The invention can relieve this problem to a certain extent. The medical record library contains a large amount of medical record data, including experts' diagnosis and treatment of patients, and this is in effect a medical resource. When a doctor at a hospital below the county level is unsure about the condition of a patient, the doctor can carry out a preliminary examination and enter the patient's basic information, symptoms, medical history, examination information and so on into the system to form a preliminary medical record, which is then input as a whole into the medical record searching system. The invention outputs similar medical records from the library based on the similarity between medical records, so that the doctor can analyze the diagnosis and treatment plans in those records and draw on how experts diagnosed and treated similar patients when diagnosing and treating the current patient. In this way the invention shares the knowledge of expert resources in the form of electronic data, assists medical treatment, and better serves the medical field.

2. The method is well suited to measuring the similarity of medical records and addresses practical problems such as the lack of medical record similarity labels and the incompleteness of theoretical knowledge about medical record similarity. It makes full use of both the information in the medical records and the related theoretical knowledge, which can improve the precision of similarity measurement and the accuracy of medical record ranking. Because the similarity measure is based on machine learning, it also has good potential for further precision improvement, giving the method the advantages of high precision, wide applicability, good robustness, and potential for continued optimization.

3. Machine learning methods place high demands on the distribution of the training data. For the case where the distribution of the medical records is unknown, the method provides a multi-index probability distribution scheme for selecting medical record groups, which to a certain extent avoids the distribution bias that medical record group data would show under a single index.

4. Starting from the actual state of medical record data, the invention integrates the traditional theoretical model with the machine learning model: the traditional theoretical model produces the weak labels, and the machine learning model learns the similarity of medical records. The strengths of each model are fully used, and the accuracy of medical record similarity measurement is improved.

5. Compared with the traditional theoretical model, the use of machine learning makes it easier to improve the precision of medical record similarity measurement. Optimizing the data, tuning the parameters and improving the learning method are all means of optimization that traditional similarity measurement methods cannot offer. The method thus provides a basis for the continued optimization of medical record similarity measurement and increases the potential for improving its precision.

Drawings

FIG. 1 is a schematic diagram of the basic principle of the method of the embodiment of the present invention.

FIG. 2 is a schematic diagram of a process of generating a medical record group set C according to an embodiment of the present invention.

FIG. 3 is a schematic structural diagram of the scoring loss function in the method according to the embodiment of the present invention.

FIG. 4 is a schematic structural diagram of the sorting loss function in the method according to the embodiment of the present invention.

FIG. 5 is a schematic structural diagram of the sorting probability loss function in the method according to the embodiment of the present invention.

FIG. 6 is a schematic structural diagram of the network model in the method according to the embodiment of the present invention.

Detailed Description

As shown in fig. 1, the implementation steps of the medical record searching method based on similarity measurement in this embodiment include:

In practical application, a medical record is a file that records, in a standardized way, a patient's symptoms and the diagnosis and treatment received. It mainly comprises the patient's basic information, medical history, examination information, medical orders, diagnosis information, treatment plan, and feedback on the course of the disease. Medical record data also come in various formats: formatted key-value pair data, text data, image data, audio data, and so on. The medical record data used in this embodiment are formatted key-value pair data obtained by collating the original medical record data.

Machine learning methods place high demands on the distribution of the training data. Because the distribution of the medical records is unknown, this embodiment constructs the medical record groups from the query medical record set A, and thereby obtains the medical record group set C, using a scheme based on multi-index probability distributions.

In this embodiment, the detailed steps of step 1) include:

1.2) randomly selecting part of the medical record groups from the medical record group set B to obtain a medical record group selection set C₀; for the medical record group set B, calculating a similarity index value under each of a plurality of specified similarity indexes, and sorting the groups in descending order of the index value to obtain, for each similarity index, a medical record group ordered set B_i; for each ordered set B_i, selecting groups according to a probability distribution over the similarity index values to generate a medical record group selection set C_i;

1.3) taking the union of the medical record group selection set C₀ and all the medical record group selection sets C_i to obtain the medical record group set C.

In this embodiment, when the medical record group selection set C_i is generated by selection based on the probability distribution of the similarity index values, the probability that each medical record group in the ordered set B_i is randomly selected is given by formula (1);

in formula (1), P(SM_i) is the probability that the i-th medical record group in the ordered set B_i is selected, SM_i is the similarity index value of the i-th medical record group in the ordered set B_i, f(SM_j) is the normalized similarity index value of the j-th medical record group in the ordered set B_i, and m is the number of medical record groups in the ordered set B_i; the functional expression of f(SM_j) is given by formula (2);

In this embodiment, the specified similarity indexes are the Euclidean distance, the cosine distance, the Jaccard distance and the adjusted cosine distance. At least two of them may be used, and other similarity indexes may be added as needed.

Euclidean distance: referring to FIG. 2, for the medical record group set B, the Euclidean distance of each medical record group is calculated as its similarity index value, and the groups are sorted in descending order of that value to obtain the first medical record group ordered set B₁. The similarity index values of the elements of B₁ are denoted SM₁, SM₂, …, SM_m from largest to smallest, the probability of selecting a given medical record group is given by formulas (1) and (2), and selection according to this probability distribution generates the first medical record group selection set C₁.

Cosine distance: referring to FIG. 2, for the medical record group set B, the cosine distance of each medical record group is calculated as its similarity index value, and the groups are sorted in descending order of that value to obtain the second medical record group ordered set B₂. The similarity index values of the elements of B₂ are denoted SM₁, SM₂, …, SM_m from largest to smallest, the probability of selecting a given medical record group is given by formulas (1) and (2), and selection according to this probability distribution generates the second medical record group selection set C₂.

Jaccard distance: referring to FIG. 2, for the medical record group set B, the Jaccard distance of each medical record group is calculated as its similarity index value, and the groups are sorted in descending order of that value to obtain the third medical record group ordered set B₃. The similarity index values of the elements of B₃ are denoted SM₁, SM₂, …, SM_m from largest to smallest, the probability of selecting a given medical record group is given by formulas (1) and (2), and selection according to this probability distribution generates the third medical record group selection set C₃.

Adjusted cosine distance: referring to FIG. 2, for the medical record group set B, the adjusted cosine distance of each medical record group is calculated as its similarity index value, and the groups are sorted in descending order of that value to obtain the fourth medical record group ordered set B₄. The similarity index values of the elements of B₄ are denoted SM₁, SM₂, …, SM_m from largest to smallest, the probability of selecting a given medical record group is given by formulas (1) and (2), and selection according to this probability distribution generates the fourth medical record group selection set C₄.

Referring to FIG. 2, finally the union of the medical record group selection set C₀, the first selection set C₁, the second selection set C₂, the third selection set C₃ and the fourth selection set C₄ is taken to obtain the medical record group set C. Constructing the medical record group set C from multi-index probability distributions in this way yields a large number of suitable medical record groups and to a certain extent avoids the distribution bias of medical record group data under a single index, which reduces data distribution error.
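
The construction of the medical record group set C can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's code: only two of the four indexes are implemented, the selection probability assumes min-max normalization for formula (2), each selection set draws `k` groups with replacement, and all function names are hypothetical.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

Pair = Tuple[int, int]
Metric = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(a, b) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cosine_distance(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return 1.0 - dot / norm if norm else 1.0


def selection_probabilities(values: List[float]) -> List[float]:
    """Min-max normalize the index values (assumed), then scale to sum to 1."""
    hi, lo = max(values), min(values)
    if hi == lo:                                    # all equal: uniform choice
        return [1.0 / len(values)] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    total = sum(scaled)
    return [s / total for s in scaled]


def build_group_set(records, metrics: Dict[str, Metric], k: int,
                    rng: random.Random) -> Set[Pair]:
    pairs = list(itertools.combinations(range(len(records)), 2))    # set B
    chosen = set(rng.sample(pairs, min(k, len(pairs))))             # C0: uniform random
    for metric in metrics.values():
        # Ordered set B_i: groups in descending order of the index value.
        ranked = sorted(pairs, key=lambda p: metric(records[p[0]], records[p[1]]),
                        reverse=True)
        values = [metric(records[i], records[j]) for i, j in ranked]
        probs = selection_probabilities(values)
        # C_i: draw k groups with replacement; the set removes duplicates.
        chosen.update(rng.choices(ranked, weights=probs, k=k))
    return chosen                                                   # union C


if __name__ == "__main__":
    data = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 3.0]]
    indexes = {"euclidean": euclidean, "cosine": cosine_distance}
    print(sorted(build_group_set(data, indexes, k=3, rng=random.Random(0))))
```

Groups are stored as index pairs so that the union across C₀ and the per-index selection sets deduplicates naturally.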

In this embodiment, the detailed steps of step 2) include:

2.1) expressing the medical record group set C as a medical record group set matrix b, wherein each row of the medical record group set matrix b represents one medical record, the first n columns respectively represent n features of the medical record, and the last column, the feature s, is the diagnosis information of the medical record;

An accurate description of the medical records is the basis for ranking them by similarity. In this embodiment, each medical record b_i (medical record i) is described as a vector composed of its attribute features followed by its diagnosis information s_i, which is a special feature of the medical record. The functional expression of the medical record group set matrix b is therefore as shown in formula (3);

referring to formula (3), each row of the medical record group set matrix b represents one medical record, the first n columns respectively represent n features of the medical record, and the last column, the feature s, is the diagnosis information of the medical record. Taking the medical record in row 1 as an example, its first n entries are the n features of that medical record and its last entry s₁ is the diagnosis information of that medical record.
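As a small illustration of this layout, the matrix can be held as a list of rows whose last entry is the diagnosis. The numeric values and the helper names `split_record` and `column` are hypothetical and serve only to show the row and column roles.

```python
# Hypothetical example data: each row is one medical record, the first n
# columns are attribute features and the last column is the diagnosis s.
b = [
    [36.8, 120.0, 1.0, 0.0],
    [38.5, 135.0, 0.0, 1.0],
    [37.1, 118.0, 1.0, 0.0],
]


def split_record(row):
    """Return (features, diagnosis) for one row of the matrix b."""
    return row[:-1], row[-1]


def column(matrix, index):
    """Return one column of the matrix as a list (a per-feature vector)."""
    return [row[index] for row in matrix]


features_1, s_1 = split_record(b[0])   # the n features and s_1 of row 1
diagnosis_vector = column(b, -1)       # diagnosis information of all records
```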

In this embodiment, the detailed steps of step 2.2) include: calculating an original weight value of each characteristic of the medical record according to the formula (4), and performing normalization processing on the original weight values of all the characteristics to serve as a final weight value of each characteristic, wherein a function expression of the normalization processing is shown as a formula (5);

The quantities appearing in formula (4) are: the feature vector formed by the i-th column features of all medical records of the medical record group set matrix b; the diagnosis information feature vector formed by the diagnosis information s of all medical records of the medical record group set matrix b; and σ_i, the variance of the i-th column features of all medical records in the medical record group set matrix b.

Multi-factor weighting is a widely applied but difficult problem. In measuring medical record similarity, the contents of the different data structures in the medical record data must be weighted; because modeling is difficult, weighting from the perspective of a model is hard, while a general subjective weighting method depends excessively on subjective factors and easily introduces subjective errors. In this embodiment, the original weight value of each feature of the medical record is calculated according to formula (4), which is an objective weighting method based on data stability. This method can reduce such errors to a certain extent, so that the weighting error is reduced.

In the formula (4), y_i' represents the original weight value of the i-th feature of the medical records, n represents the total number of all medical records in the medical record group set matrix b, and y_j' represents the original weight value of the j-th feature of the medical records.
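The normalization step can be sketched as below. Formula (5) is not reproduced in this text, so the sketch assumes the common choice of dividing each original weight by the sum of all original weights; the original weights themselves (formula (4)) are taken as given, and the name `normalize_weights` is hypothetical.

```python
def normalize_weights(raw_weights):
    """Scale the original weights y_i' so that the final weights sum to 1.

    Assumption: final weight y_i = y_i' / sum_j y_j'.
    """
    total = sum(raw_weights)
    if total == 0:
        raise ValueError("original weights must not all be zero")
    return [weight / total for weight in raw_weights]


raw = [2.0, 1.0, 1.0]              # hypothetical original weights
final = normalize_weights(raw)     # [0.5, 0.25, 0.25]
```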

In this embodiment, the functional expression of the similarity value s of each medical record group in the medical record group set C calculated in step 2.3) is as shown in formula (6);

where the quantities in formula (6) are the value of the x-th feature of medical record i, the maximum value of the x-th feature, and the minimum value of the x-th feature.

In this embodiment, for the medical record group data in the medical record group set C {(b1, b2), (b3, b4), …}, the similarity between medical records is characterized from the similarity between features and the weights corresponding to the features, following the approach of the BM25 algorithm; the specific calculation is given by formula (6). Calculating the similarity value s of each medical record group in the set C yields the medical record group data set D with similarity labels, {(b1, b2, sm₁₂), (b3, b4, sm₃₄), …}, where sm₁₂ denotes the similarity of the medical record group consisting of medical record b1 and medical record b2, that is, the similarity between medical record b1 and medical record b2.
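The labelling step can be illustrated with the following sketch. Formula (6) is not reproduced in this text, so the code uses an assumed stand-in: for each feature, closeness is one minus the absolute difference divided by the feature's range (maximum minus minimum), and the similarity is the weighted sum of the closeness values. The names `feature_ranges`, `similarity` and `label_groups` and the sample data are hypothetical.

```python
def feature_ranges(records):
    """Return (minimum, maximum) of every feature column."""
    n_features = len(records[0])
    return [
        (min(r[x] for r in records), max(r[x] for r in records))
        for x in range(n_features)
    ]


def similarity(rec_a, rec_b, weights, ranges):
    """Weighted sum of per-feature closeness (assumed form, not formula (6))."""
    score = 0.0
    for x, (lo, hi) in enumerate(ranges):
        if hi == lo:
            closeness = 1.0  # a constant feature cannot separate records
        else:
            closeness = 1.0 - abs(rec_a[x] - rec_b[x]) / (hi - lo)
        score += weights[x] * closeness
    return score


def label_groups(groups, records, weights):
    """Turn the group set C into the labelled data set D of (a, b, sm) triples."""
    ranges = feature_ranges(list(records.values()))
    return [
        (a, b, similarity(records[a], records[b], weights, ranges))
        for a, b in groups
    ]


records = {
    "b1": [0.0, 10.0],
    "b2": [1.0, 10.0],
    "b3": [0.5, 20.0],
    "b4": [0.0, 20.0],
}
weights = [0.5, 0.5]                      # final feature weights
C = [("b1", "b2"), ("b3", "b4")]
D = label_groups(C, records, weights)     # [("b1","b2",0.5), ("b3","b4",0.75)]
```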

In this embodiment, the detailed steps of constructing the machine learning model in step 3) include:

3.1) respectively designing three loss functions, namely a scoring loss function, a ranking loss function and a ranking probability loss function, wherein:

as shown in Fig. 3, the input of the scoring loss function is the data of two medical records and the output is a similarity score value; the scoring loss function is expressed as the absolute value of the difference between the prediction score value and the label score value;

as shown in Fig. 4, the input of the ranking loss function is the data of three medical records, namely one query medical record and two medical records to be ranked, and the output is the similarity score values; the ranking loss function is expressed through the differences, in prediction score value and in label score value respectively, between the two medical records to be ranked;

as shown in Fig. 5, the input of the ranking probability loss function is the data of three medical records, namely one query medical record and two medical records to be ranked, and the output is a score comparison probability value; the ranking probability loss function is expressed through the difference between the probabilities obtained from the prediction score values and from the label score values of the two medical records to be ranked with respect to the query medical record;

3.2) constructing a neural network model, wherein, as shown in Fig. 6, the neural network model consists of an input layer, a hidden layer and an output layer; the input layer is used to input all dimensions of the medical records into the network in full; the hidden layer is a fully connected network used for feature processing of the medical records; the output layer is used to output the similarity metric value between the two medical records;
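A forward pass through such a three-layer network can be sketched in plain Python as below. The text does not state how the two medical records are combined, how wide the hidden layer is, or which activations are used, so the sketch assumes that the two feature vectors are concatenated, that there is a single hidden layer with tanh activation, and that a sigmoid produces the output score. All names are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create a fully connected layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, activation=None):
    """Apply one fully connected layer to a list of input values."""
    weights, biases = layer
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * v for w, v in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_similarity(record_a, record_b, hidden_layer, output_layer):
    """Input layer: all dimensions of both records; output: one similarity value."""
    x = list(record_a) + list(record_b)
    hidden = dense(x, hidden_layer, math.tanh)
    return dense(hidden, output_layer, sigmoid)[0]


rng = random.Random(0)
n_features = 3
hidden_layer = init_layer(2 * n_features, 4, rng)   # assumed hidden width of 4
output_layer = init_layer(4, 1, rng)
score = predict_similarity([0.1, 0.5, 0.9], [0.2, 0.4, 0.8], hidden_layer, output_layer)
```

The parameters here are untrained; training against one of the three loss functions is what turns this structure into the similarity regressor.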

In this embodiment, when the three loss functions, namely the scoring loss function, the ranking loss function and the ranking probability loss function, are designed in step 3.1), the scoring loss function takes the similarity score of the medical records as the loss, and its functional expression is shown as formula (7); the ranking loss function takes the ranking relation of the similarity values of the medical records as the loss, and its functional expression is shown as formula (8); the ranking probability loss function takes the ranking probability of the similarity values of the medical records as the loss, and its functional expression is shown as formula (9);

in formula (7), the two quantities are, respectively, the model prediction score value and the label score value between the two medical records of a medical record group;

in formula (8), the quantities are the neural network model prediction score values between the medical record q_i and each of the two medical records being ranked, and the label score values between the medical record q_i and each of the two medical records being ranked; sign is the sign function;

in formula (9), the two quantities are the probabilities of inter-similarity between the medical record q_i and the medical records being ranked, one under the label score values and the other under the model prediction score values; these probabilities are given by formulas (10) and (11);

in formulas (10) and (11), the quantities are the label score values between the medical record q_i and each of the two medical records being ranked, and the neural network model prediction score values between the medical record q_i and each of the two medical records being ranked.
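The three loss functions can be sketched as below. Formulas (7) to (11) are not reproduced in this text, so only the scoring loss follows the description directly (absolute difference between prediction and label). The ranking loss is written here as a hinge on the sign-weighted prediction difference, and the ranking probability loss as a cross-entropy between pairwise probabilities obtained by a logistic function of score differences; both are assumed stand-ins, and every name and the `margin` parameter are hypothetical.

```python
import math


def scoring_loss(pred, label):
    """Absolute difference between prediction score and label score."""
    return abs(pred - label)


def sign(value):
    """Sign function returning -1, 0 or 1."""
    return (value > 0) - (value < 0)


def ranking_loss(pred_j, pred_k, label_j, label_k, margin=0.0):
    """Assumed hinge form: penalize predictions ordered against the labels."""
    return max(0.0, margin - sign(label_j - label_k) * (pred_j - pred_k))


def pair_probability(score_j, score_k):
    """Assumed logistic probability that record j ranks above record k."""
    return 1.0 / (1.0 + math.exp(-(score_j - score_k)))


def ranking_probability_loss(pred_j, pred_k, label_j, label_k):
    """Assumed cross-entropy between label-based and predicted probabilities."""
    p = pair_probability(label_j, label_k)
    p_hat = pair_probability(pred_j, pred_k)
    eps = 1e-12  # guards against log(0)
    return -(p * math.log(p_hat + eps) + (1.0 - p) * math.log(1.0 - p_hat + eps))


# Scores of two ranked records relative to one query record (hypothetical).
agree = ranking_probability_loss(0.9, 0.2, 0.8, 0.1)     # same order as labels
disagree = ranking_probability_loss(0.2, 0.9, 0.8, 0.1)  # reversed order
```

With these stand-ins, a prediction that preserves the label order gives a smaller ranking probability loss than one that reverses it.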

The medical record group data set D with similarity labels is input into the neural network model, and the model is trained on medical record similarity according to the designed loss function, finally yielding a neural network model that can measure the similarity of two medical records, namely a regressor. The regressor is a network structure with determined parameters. It is a means of judging the similarity value of two medical records: it takes two medical records as input and outputs their similarity value, for example taking medical records b1 and b2 as input and outputting their similarity value s(b1, b2). On this basis, for the target medical record and the query medical record set A, each medical record in the query medical record set A is input into the regressor together with the target medical record to obtain the similarity values between the target medical record and all medical records in the query medical record set A. The top N similarity values are then found, giving the N medical records with the highest similarity to the target medical record, where the value of N can be specified as required and may be one or more. It should be noted that the letters used in the names query medical record set A, medical record group set B, medical record group selection set C₀, first medical record group selection set C₁, second medical record group selection set C₂, third medical record group selection set C₃, fourth medical record group selection set C₄, medical record group set C and medical record group data set D serve only to distinguish the data sets and should not be construed as limiting the data sets themselves.
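The retrieval step can be sketched as follows. The trained regressor is represented by an arbitrary callable; the stand-in used here (a similarity derived from the L1 distance) is purely hypothetical and is not the trained network, and the name `find_similar_records` is likewise an assumption.

```python
import heapq


def find_similar_records(target, query_set, regressor, n):
    """Score every record in the query set against the target and keep the top N."""
    scored = (
        (regressor(target, record), record_id)
        for record_id, record in query_set.items()
    )
    best = heapq.nlargest(n, scored)  # highest similarity first
    return [(record_id, score) for score, record_id in best]


def stand_in_regressor(record_a, record_b):
    """Hypothetical placeholder for the trained model: 1 / (1 + L1 distance)."""
    distance = sum(abs(a - b) for a, b in zip(record_a, record_b))
    return 1.0 / (1.0 + distance)


query_set_a = {
    "a1": [1.0, 2.0],
    "a2": [1.0, 2.5],
    "a3": [5.0, 5.0],
}
target_record = [1.0, 2.0]
top_two = find_similar_records(target_record, query_set_a, stand_in_regressor, n=2)
```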

This embodiment fully considers the various practical conditions encountered in measuring medical record similarity, describes the practical difficulties of the process in detail, discusses each stage of the measurement comprehensively, identifies the technical difficulties of each stage together with their solutions, and finally integrates the analysis of all stages into a medical record similarity measurement method based on weakly supervised machine learning. Weakly supervised machine learning attempts to build a predictive machine learning model through weak supervision; it integrates a theoretical model based on theoretical knowledge with a machine learning model based on data. The method of this embodiment uses the theoretical model's advantage of not depending on data to create weakly supervised labels, and uses the machine learning model's advantage of not being limited by domain knowledge to perform weakly supervised learning. The method is well suited to measuring medical record similarity and can address practical problems such as the absence of similarity labels for medical records and the incompleteness of theoretical knowledge about medical record similarity.

In order to further verify the medical record searching method based on similarity measurement of this embodiment, the following experiments use actual medical record data from the JK medical data center and the public data set Robust04. The evaluation indexes are MAP, P@20 and nDCG@20, where MAP is the mean average precision over all retrieved medical records, P@20 is the precision over the first 20 retrieved medical records, and nDCG@20 is the normalized discounted cumulative gain over the first 20 retrieved medical records, that is, the precision of each medical record carries a different weight, with earlier-ranked medical records weighted more heavily. Experiments are carried out on the two types of data sets, and the MAP, P@20 and nDCG@20 of each algorithm on the different data sets are shown in Table 1.
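For reference, the three evaluation indexes can be computed as in the sketch below. The input is a list of relevance values in retrieved order; the function names are hypothetical, the average precision is normalized by the number of relevant records found in the list, and nDCG uses the linear-gain variant with a log2 rank discount, which is one common convention and not necessarily the one used in the experiments.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the first k retrieved records that are relevant."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def average_precision(relevances):
    """Mean of the precision values at each rank where a relevant record occurs."""
    hits, total = 0, 0.0
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: earlier ranks carry larger weights."""
    return sum(
        r / math.log2(rank + 1)
        for rank, r in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


ranked = [1, 0, 1, 0]                 # hypothetical relevance of retrieved records
p_at_2 = precision_at_k(ranked, 2)
ap = average_precision(ranked)
ndcg_at_4 = ndcg_at_k(ranked, 4)
```

MAP is then the mean of `average_precision` over all queries.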

Table 1: Precision comparison of medical record similarity measurement.

As can be seen from Table 1, compared with the existing theory-based model method (BM25) and the SVM-based weakly supervised learning algorithm (RankSVM), the medical record searching method based on similarity measurement of this embodiment (method) has a clear advantage under every evaluation index when the ranking probability loss function is adopted as the loss function. The advantage is greatest under the index nDCG@20 and relatively smaller under the index MAP, which indicates that the medical record searching method based on similarity measurement is more sensitive to medical record groups with greater similarity.

Compared with the Chinese patent document with publication number CN104572675B, the medical record searching method based on similarity measurement of this embodiment measures medical record similarity from the data perspective using a machine learning method, makes full use of the data information in the medical records, and provides an error-based self-optimization function; at the same time, by taking the medical record as the retrieval object, it solves the problem that the medical record in the Chinese patent document with publication number CN104572675B cannot completely reflect the patient's condition. Yanghe et al. disclose a similar medical record retrieval system based on a medical big data platform in Southeast National Defense Medicine. Like the Chinese patent document with publication number CN104572675B, that technical scheme has certain shortcomings in the accuracy of the similarity measurement and in how completely the medical record reflects the patient's condition, and the medical record searching method based on similarity measurement of this embodiment solves these problems to a certain extent.

In addition, the present embodiment further provides a medical record searching system based on similarity measurement, which includes a computer device programmed to execute the steps of the aforementioned medical record searching method based on similarity measurement according to the present embodiment. The present embodiment further provides a medical record searching system based on similarity measurement, which includes a computer device, where a storage medium of the computer device stores a computer program programmed to execute the aforementioned medical record searching method based on similarity measurement according to the present embodiment. The present embodiment also provides a computer-readable storage medium, in which a computer program is stored, which is programmed to execute the aforementioned medical record searching method based on similarity measurement of the present embodiment.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A medical record searching method based on similarity measurement is characterized by comprising the following implementation steps:

4) inputting the target medical records and all medical records in the query medical record set A into a machine learning model together to obtain similarity metric values between the target medical records and all medical records in the query medical record set A, and selecting N medical records with the highest similarity metric values to output;

the detailed steps of the step 1) comprise:

1.2) randomly selecting a part of the medical record groups from the medical record group set B to obtain a medical record group selection set C₀; for the medical record group set B, respectively calculating similarity index values according to multiple specified similarity indexes, and, for the different similarity indexes, respectively arranging the medical record groups in descending order of the similarity index values to obtain medical record group ordered sets B_i; for all the medical record group ordered sets B_i, respectively generating medical record group selection sets C_i by selection based on the probability distribution of the similarity index values, the specified multiple similarity indexes comprising at least two of Euclidean distance, cosine distance, Jaccard distance and adjusted cosine distance;

1.3) merging the medical record group selection set C₀ and all the medical record group selection sets C_i by set union to obtain the medical record group set C;

the detailed steps of the step 2) comprise:

2. The method of claim 1, wherein, when the medical record group selection set C_i is generated by selection based on the probability distribution of the similarity index values, the probability with which each medical record group in the medical record group ordered set B_i is randomly selected is shown as formula (1);

in formula (1), P(SM_i) is the probability that the i-th medical record group in the medical record group ordered set B_i is selected, SM_i is the similarity index value of the i-th medical record group in the medical record group ordered set B_i, f(SM_j) is the normalized similarity value of the j-th medical record group in the medical record group ordered set B_i, m is the number of medical record groups in the medical record group ordered set B_i, and the functional expression of f(SM_j) is shown as formula (2);

3. The method for finding medical records based on similarity measurement according to claim 1, wherein the detailed steps of step 2.2) include: calculating an original weight value of each feature of the medical record according to the formula (4), and performing normalization processing on the original weight values of all the features to serve as a final weight value of each feature;

a diagnosis information feature vector formed by the diagnosis information s of all medical records of the medical record group set matrix b; σ_i represents the variance of the i-th column features of all medical records of the medical record group set matrix b.

4. The method for finding medical records based on similarity measurement according to claim 1, wherein the functional expression of the similarity value s of each medical record group in the medical record group set C calculated in step 2.3) is shown in formula (6);

where the quantities in formula (6) are the value of the x-th feature of medical record i, the maximum value of the x-th feature, and the minimum value of the x-th feature.

5. The medical record searching method based on similarity measurement as claimed in claim 1, wherein the detailed step of constructing the machine learning model in step 3) comprises:

6. The medical record searching method based on similarity measurement according to claim 5, wherein when three kinds of loss functions of a scoring loss function, a sorting loss function and a sorting probability loss function are designed in step 3.1), a functional expression of the scoring loss function is shown as formula (7), a functional expression of the sorting loss function is shown as formula (8), and a functional expression of the sorting probability loss function is shown as formula (9);

in formula (7), the two quantities are, respectively, the model prediction score value and the label score value between the two medical records of a medical record group;

in formula (8), the quantities are the neural network model prediction score values between the medical record q_i and each of the two medical records being ranked, and the label score values between the medical record q_i and each of the two medical records being ranked; sign is the sign function;

in formula (9), the two quantities are the probabilities of inter-similarity between the medical record q_i and the medical records being ranked, one under the label score values and the other under the model prediction score values; these probabilities are given by formulas (10) and (11);

in formulas (10) and (11), the quantities are the label score values between the medical record q_i and each of the two medical records being ranked, and the neural network model prediction score values between the medical record q_i and each of the two medical records being ranked.

7. A medical record searching system based on similarity measurement, comprising a computer device, characterized in that: the computer device is programmed to perform the steps of the similarity metric-based medical record searching method according to any one of claims 1 to 6, or a storage medium of the computer device has stored therein a computer program programmed to perform the similarity metric-based medical record searching method according to any one of claims 1 to 6.

8. A computer-readable storage medium characterized by: the computer readable storage medium has stored therein a computer program programmed to execute the similarity metric based medical record searching method according to any one of claims 1 to 6.