CN116796046B

CN116796046B - Case retrieval method and device based on rare characteristics

Info

Publication number: CN116796046B
Application number: CN202311096421.4A
Authority: CN
Inventors: 于红刚; 姚理文; 王静; 肖冰
Original assignee: Renmin Hospital of Wuhan University
Current assignee: Renmin Hospital of Wuhan University
Priority date: 2023-08-29
Filing date: 2023-08-29
Publication date: 2023-11-10
Anticipated expiration: 2043-08-29
Also published as: CN116796046A

Abstract

The application provides a case retrieval method and a case retrieval device based on rare characteristics, wherein the case retrieval method based on rare characteristics comprises the following steps: acquiring a plurality of first historical cases of a target case type; acquiring the occurrence frequency of each historical case feature in the plurality of first historical cases; obtaining a rare feature set and a standard feature set of each first historical case; acquiring a plurality of second historical cases and obtaining the rare feature set and the standard feature set of each second historical case; acquiring the feature set of the case to be searched; matching the feature set of the case to be searched with the rare feature sets to obtain a plurality of first matching cases; matching the feature set of the case to be searched with the standard feature sets to obtain a plurality of second matching cases; and outputting a first preset number of top-ranked first matching cases and a second preset number of top-ranked second matching cases. The application can improve case retrieval accuracy.

Description

Case retrieval method and device based on rare characteristics

Technical Field

The application mainly relates to the technical field of big data, in particular to a case retrieval method and device based on rare characteristics.

Background

In the medical work of hospitals, most patients present with common symptoms, but some diseases present with rare symptoms. Establishing the etiology in such cases often consumes large amounts of medical resources, is difficult and time-consuming, and the disease may even be life-threatening. Take pheochromocytoma as an example: the typical symptoms that first come to mind are "paroxysmal or persistent hypertension, headache, palpitations, sweating" and so on; in actual diagnosis, however, some patients present with "fever, right upper abdominal pain, elevated white blood cells" as the main clinical manifestations and are therefore misdiagnosed with biliary tract infection. Rare presentations of this kind increase the probability of misdiagnosis.

In the prior art, case retrieval is used to obtain, from a large number of historical cases, cases similar to the case to be searched, so as to assist doctors. However, existing case retrieval methods retrieve mainly on common features, so the retrieval accuracy is not high.

That is, the case retrieval accuracy in the prior art is not high.

Disclosure of Invention

The application provides a case retrieval method and device based on rare characteristics, and aims to solve the problem of low case retrieval accuracy in the prior art.

In a first aspect, the present application provides a rare feature-based case retrieval method, the rare feature-based case retrieval method comprising:

acquiring a plurality of first historical cases of a target case type;

acquiring the occurrence frequency of each historical case feature in a plurality of first historical cases;

determining the historical case characteristics with the occurrence frequency being greater than the preset frequency as rare characteristics of the target case type, and determining the historical case characteristics with the occurrence frequency not greater than the preset frequency as standard characteristics of the target case type, so as to obtain rare characteristic sets and standard characteristic sets of all first historical cases;

acquiring a plurality of second historical cases, wherein the plurality of second historical cases comprise cases of at least two case types;

determining a second history case of each case type in the plurality of second history cases as a plurality of first history cases of the target case type, obtaining a rare feature set and a standard feature set of each second history case;

acquiring a feature set of the case to be searched, which is formed by the features of each case in the case to be searched;

matching the characteristic set of the case to be searched with the rare characteristic set of each second historical case to obtain a plurality of first matched cases matched with the characteristic set of the case to be searched;

matching the characteristic set of the case to be searched with the standard characteristic set of each second historical case to obtain a plurality of second matched cases matched with the characteristic set of the case to be searched;

respectively sorting the plurality of second matching cases and the plurality of first matching cases to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases;

outputting a first preset number of first matching cases which are ranked forward in the ranked plurality of first matching cases and a second preset number of second matching cases which are ranked forward in the ranked plurality of second matching cases.

Optionally, the matching the feature set of the case to be searched with the rare feature set of each second historical case to obtain a plurality of first matching cases matched with the feature set of the case to be searched, including:

converting each case feature in the case feature set to be searched into a first case feature vector, wherein each element in the first case feature vector is each case feature in the case feature set to be searched;

converting each rare feature in the rare feature set of each second historical case into a second case feature vector of each second historical case, wherein each element in the second case feature vector is each rare feature in the rare feature set;

calculating first vector similarity between the first case feature vector and the second case feature vector of each second historical case to obtain first vector similarity corresponding to each second historical case;

and determining a plurality of second historical cases with the first vector similarity being greater than the first preset similarity as a plurality of matched first cases.

Optionally, the determining the plurality of second historical cases with the first vector similarity greater than the first preset similarity as the plurality of matched first cases includes:

dividing each case feature in the case feature set to be searched into a plurality of first case feature sets according to the feature categories of each case feature in the case feature set to be searched, wherein the case features in the same first case feature set belong to the same feature category, and the feature categories of each case feature comprise a medical history feature category, a clinical presentation feature category, a physical examination feature category, a laboratory examination feature category and an imaging examination feature category;

converting the plurality of first case feature sets into a corresponding plurality of third case feature vectors, respectively;

dividing each case feature in the rare feature set into a plurality of second case feature sets according to the feature class of each rare feature in the rare feature set, wherein case features in the same second case feature set belong to the same feature class;

converting the plurality of second case feature sets into a corresponding plurality of fourth case feature vectors, respectively;

respectively calculating second vector similarity between the third case feature vector and the fourth case feature vector belonging to the same feature class to obtain a plurality of second vector similarity corresponding to a plurality of feature classes in the rare feature set;

acquiring preset weight coefficients of each feature class;

carrying out weighted average on a plurality of second vector similarities according to preset weight coefficients of each feature class to obtain a similarity weighted average value corresponding to the rare feature set of the second historical case, and obtaining a plurality of similarity weighted average values corresponding to the second historical case;

and determining a plurality of second historical cases with the first vector similarity being greater than the first preset similarity and the weighted average of the similarity being greater than the second preset similarity as a plurality of matched first cases.
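The category-wise weighting described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes that each feature set is encoded as a 0/1 indicator vector, that the second vector similarity is a cosine similarity, and that a category missing on either side contributes a similarity of 0. The function names and the dictionary layout are hypothetical.

```python
import math


def binary_cosine(a, b):
    """Cosine similarity of the 0/1 indicator vectors of two feature sets.

    For indicator vectors the dot product is the size of the intersection
    and each norm is the square root of the set size.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def weighted_similarity(query_by_category, rare_by_category, weights):
    """Weighted average of the per-category similarities.

    query_by_category / rare_by_category: {feature category: set of features}
    weights: {feature category: preset weight coefficient}
    Assumption: a category absent from either case scores 0 for that category.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = 0.0
    for category, weight in weights.items():
        similarity = binary_cosine(
            query_by_category.get(category, set()),
            rare_by_category.get(category, set()),
        )
        weighted_sum += weight * similarity
    return weighted_sum / total_weight


def is_first_matching_case(first_similarity, weighted_average,
                           first_preset_similarity, second_preset_similarity):
    """A case matches only if both thresholds are exceeded."""
    return (first_similarity > first_preset_similarity
            and weighted_average > second_preset_similarity)
```

Requiring both the overall similarity and the weighted per-category average to exceed their thresholds makes the filter stricter than the overall similarity alone; the threshold values themselves are left to the caller.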

Optionally, the obtaining preset weight coefficients of each feature class includes:

determining the plurality of second historical cases and their corresponding case categories as a preset training set, wherein the preset training set comprises a plurality of training samples and corresponding labels, the training samples are the second historical cases, and the label corresponding to each training sample is the case category of that second historical case;

training a preset decision tree model based on the preset training set to obtain a target decision tree model;

obtaining importance coefficients of the characteristics of each case in a target decision tree model;

and determining preset weight coefficients of the target feature classes according to the importance coefficients of the case features belonging to the target feature classes in the target decision tree model, so as to obtain the preset weight coefficients of the feature classes.

Optionally, determining the preset weight coefficient of the target feature class according to the importance coefficient of each case feature belonging to the target feature class in the target decision tree model, to obtain the preset weight coefficient of each feature class, including:

and determining the average value of the importance coefficients of the case features belonging to the target feature class in the target decision tree model as a preset weight coefficient of the target feature class to obtain the preset weight coefficient of each feature class.
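The averaging step above can be sketched as follows, assuming the importance coefficients have already been read from the trained decision tree model (for instance from the `feature_importances_` attribute of a scikit-learn `DecisionTreeClassifier`, which is only one possible source and is not named in the application). The function name and both input mappings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def category_weights(importances, feature_category):
    """Average the importance coefficients of the features in each category.

    importances: {case feature: importance coefficient from the tree model}
    feature_category: {case feature: feature category}
    Returns {feature category: preset weight coefficient}.
    """
    grouped = defaultdict(list)
    for feature, importance in importances.items():
        grouped[feature_category[feature]].append(importance)
    # The preset weight coefficient of a category is the mean importance
    # of the case features that belong to it.
    return {category: mean(values) for category, values in grouped.items()}
```

Averaging rather than summing keeps a category with many low-importance features from outweighing a category with a few decisive ones.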

Optionally, the sorting the second matching cases and the first matching cases to obtain sorted second matching cases and sorted first matching cases includes:

acquiring the risk level of each first matching case;

and sorting the first matching cases from high to low according to the risk level of each first matching case to obtain the plurality of sorted first matching cases.
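As a small illustration of this ordering, the sketch below sorts matching cases by risk level from high to low and keeps a preset number of them. It assumes each case is a dictionary carrying a numeric `risk_level` field; that field name and the function name are hypothetical, and how the risk level is obtained is outside the sketch.

```python
def top_ranked(matching_cases, preset_number, key="risk_level"):
    """Sort cases from high to low risk level and keep the first preset_number.

    sorted() is stable, so cases with equal risk keep their original order.
    """
    ranked = sorted(matching_cases, key=lambda case: case[key], reverse=True)
    return ranked[:preset_number]
```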

Optionally, the outputting the first preset number of first matching cases ranked earlier in the ranked plurality of first matching cases and the second preset number of second matching cases ranked earlier in the ranked plurality of second matching cases includes:

acquiring the number value of each first case belonging to each case category in a first preset number of first matching cases;

acquiring the number value of each second case belonging to each case category in a second preset number of second matching cases;

weighting each second case number value according to a preset coefficient to obtain each third case number value corresponding to each second case number value, wherein the preset coefficient is larger than 1;

determining each fourth case number value corresponding to each case category according to each first case number value and each third case number value corresponding to each second case number value of each case category;

and determining the case type with the largest fourth case quantity value as the case type of the case to be searched.
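The counting scheme above can be sketched as follows. The text does not state how the first and third case number values are combined; the sketch assumes they are added. The default coefficient of 1.5 and the function name are likewise assumptions chosen for illustration.

```python
from collections import Counter


def infer_case_type(first_matching_types, second_matching_types,
                    preset_coefficient=1.5):
    """Pick the case type with the largest combined (fourth) case number value.

    first_matching_types: case types of the output first matching cases
    second_matching_types: case types of the output second matching cases
    Assumption: fourth value = first value + preset_coefficient * second value.
    """
    if preset_coefficient <= 1:
        raise ValueError("the preset coefficient must be larger than 1")
    first_counts = Counter(first_matching_types)
    second_counts = Counter(second_matching_types)
    fourth_values = {}
    # Sorted iteration keeps the result deterministic when values tie.
    for case_type in sorted(set(first_counts) | set(second_counts)):
        third_value = preset_coefficient * second_counts[case_type]
        fourth_values[case_type] = first_counts[case_type] + third_value
    if not fourth_values:
        return None
    return max(fourth_values, key=fourth_values.get)
```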

In a second aspect, the present application provides a rare feature-based case retrieval device, the rare feature-based case retrieval device comprising:

a first acquisition unit configured to acquire a plurality of first history cases of a target case type;

a second acquisition unit configured to acquire an occurrence frequency of each of the characteristics of the plurality of first history cases;

a first determining unit, configured to determine, as rare features of a target case type, a history case feature having an occurrence frequency greater than a preset frequency, and determine, as standard features of the target case type, a history case feature having an occurrence frequency not greater than the preset frequency, to obtain rare feature sets and standard feature sets of each first history case;

a third acquisition unit configured to acquire a plurality of second history cases including cases of at least two case types;

a second determining unit configured to determine a second history case of each case type of the plurality of second history cases as a plurality of first history cases of the target case type, and obtain a rare feature set and a standard feature set of each second history case;

the fourth acquisition unit is used for acquiring a to-be-searched case feature set formed by the case features in the to-be-searched case;

the first matching unit is used for matching the characteristic set of the case to be searched with the rare characteristic set of each second historical case to obtain a plurality of first matching cases matched with the characteristic set of the case to be searched;

the second matching unit is used for matching the characteristic set of the case to be searched with the standard characteristic set of each second historical case to obtain a plurality of second matching cases matched with the characteristic set of the case to be searched;

the sorting unit is used for sorting the plurality of second matching cases and the plurality of first matching cases respectively to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases;

the output unit is used for outputting a first preset number of first matched cases which are ranked forward in the ranked plurality of first matched cases and a second preset number of second matched cases which are ranked forward in the ranked plurality of second matched cases.

In a third aspect, the present application provides a computer apparatus comprising:

one or more processors;

a memory; and

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the rare feature-based case retrieval method of any of the first aspects.

In a fourth aspect, the present application provides a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the rare feature-based case retrieval method of any one of the first aspects.

The application provides a case retrieval method and a case retrieval device based on rare characteristics, wherein the case retrieval method based on rare characteristics comprises the following steps: acquiring a plurality of first historical cases of a target case type; acquiring the occurrence frequency of each historical case feature in a plurality of first historical cases; determining the historical case characteristics with the occurrence frequency being greater than the preset frequency as rare characteristics of the target case type, and determining the historical case characteristics with the occurrence frequency not greater than the preset frequency as standard characteristics of the target case type, so as to obtain rare characteristic sets and standard characteristic sets of all first historical cases; acquiring a plurality of second historical cases, wherein the plurality of second historical cases comprise cases of at least two case types; determining a second history case of each case type in the plurality of second history cases as a plurality of first history cases of the target case type, obtaining a rare feature set and a standard feature set of each second history case; acquiring a feature set of the case to be searched, which is formed by the features of each case in the case to be searched; matching the characteristic set of the case to be searched with the rare characteristic set of each second historical case to obtain a plurality of first matched cases matched with the characteristic set of the case to be searched; matching the characteristic set of the case to be searched with the standard characteristic set of each second historical case to obtain a plurality of second matched cases matched with the characteristic set of the case to be searched; respectively sorting the plurality of second matching cases and the plurality of first matching cases to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases; outputting a first preset number of first matching cases which are ranked forward in the ranked plurality of first matching cases and a second preset number of second matching cases which are ranked forward in the ranked plurality of second matching cases. The application can improve the case retrieval accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic view of a scenario of a rare feature-based case retrieval system provided by an embodiment of the present application;

FIG. 2 is a flow chart of an embodiment of a rare feature-based case retrieval method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of one embodiment of a rare feature-based case retrieval device provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of an embodiment of a computer device provided in an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

In the description of the present application, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the drawings are merely for convenience in describing the present application and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

In the present application, the term "exemplary" is used to mean "serving as an example, instance, or illustration." Any embodiment described as "exemplary" in this disclosure is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for purposes of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes have not been described in detail so as not to obscure the description of the application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The embodiment of the application provides a case retrieval method and device based on rare characteristics, and the method and device are respectively described in detail below.

Referring to fig. 1, fig. 1 is a schematic view of a case retrieval system based on rare features according to an embodiment of the present application, where the case retrieval system based on rare features may include a computer device 100, and a case retrieval device based on rare features is integrated in the computer device 100.

In the embodiment of the present application, the computer device 100 may be an independent server, or a server network or server cluster formed by servers. For example, the computer device 100 described in the embodiment of the present application includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server formed by a plurality of servers, where the cloud server is composed of a large number of computers or network servers based on cloud computing.

In the embodiment of the present application, the computer device 100 may be a general-purpose computer device or a special-purpose computer device. In a specific implementation, the computer device 100 may be a desktop, a portable computer, a network server, a palm computer (Personal Digital Assistant, PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc., and the embodiment is not limited to the type of the computer device 100.

It will be appreciated by those skilled in the art that the application environment shown in fig. 1 is only one application scenario of the present application and does not limit the application scenarios of the present application. Other application environments may include more or fewer computer devices than shown in fig. 1; for example, only one computer device is shown in fig. 1, but the rare feature-based case retrieval system may also include one or more other computer devices that can process data, which is not specifically limited herein.

In addition, as shown in FIG. 1, the rare feature-based case retrieval system may also include a memory 200 for storing data.

It should be noted that, the schematic view of the case retrieval system based on rare features shown in fig. 1 is only an example, and the case retrieval system based on rare features and the scene described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, as the case retrieval system based on rare features evolves and new business scenarios appear, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.

Firstly, in the embodiment of the application, a case retrieval method based on rare characteristics is provided, and the case retrieval method based on rare characteristics comprises the following steps: acquiring a plurality of first historical cases of a target case type; acquiring the occurrence frequency of each historical case feature in a plurality of first historical cases; determining the historical case characteristics with the occurrence frequency being greater than the preset frequency as rare characteristics of the target case type, and determining the historical case characteristics with the occurrence frequency not greater than the preset frequency as standard characteristics of the target case type, so as to obtain rare characteristic sets and standard characteristic sets of all first historical cases; acquiring a plurality of second historical cases, wherein the plurality of second historical cases comprise cases of at least two case types; determining a second history case of each case type in the plurality of second history cases as a plurality of first history cases of the target case type, obtaining a rare feature set and a standard feature set of each second history case; acquiring a feature set of the case to be searched, which is formed by the features of each case in the case to be searched; matching the characteristic set of the case to be searched with the rare characteristic set of each second historical case to obtain a plurality of first matched cases matched with the characteristic set of the case to be searched; matching the characteristic set of the case to be searched with the standard characteristic set of each second historical case to obtain a plurality of second matched cases matched with the characteristic set of the case to be searched; respectively sorting the plurality of second matching cases and the plurality of first matching cases to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases; outputting a first preset number of first matching cases which are ranked forward in the ranked plurality of first matching cases and a second preset number of second matching cases which are ranked forward in the ranked plurality of second matching cases.

Referring to fig. 2, fig. 2 is a flowchart of an embodiment of a case retrieval method based on rare features according to an embodiment of the present application, where the case retrieval method based on rare features includes steps S201 to S210 as follows:

S201, acquiring a plurality of first historical cases of a target case type.

In an embodiment of the present application, the first historical case includes a plurality of historical case features, such as pancreatic edema, peripancreatic exudation, pancreatic and/or peripancreatic tissue necrosis, clubbed fingers, Velcro rales, and the like.

The target case type may be the disease type corresponding to the first historical case; the historical cases may also be classified according to other factors. The first historical case may be data obtained from patients' historical medical visits.

S202, obtaining the frequency of occurrence of each historical case feature in the plurality of first historical cases.

S203, determining the historical case characteristics with the occurrence frequency being greater than the preset frequency as rare characteristics of the target case type, and determining the historical case characteristics with the occurrence frequency not greater than the preset frequency as standard characteristics of the target case type, so as to obtain rare characteristic sets and standard characteristic sets of all the first historical cases.

The preset frequency may be a preset proportion of the number of the plurality of first historical cases; the preset proportion may be, for example, 1%, and the preset frequency may be set according to the specific situation. Once each historical case feature has been classified by occurrence frequency as a rare feature or a standard feature of the target case type, the historical case features of each first historical case can be divided into two sets: a rare feature set and a standard feature set.
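Steps S202 and S203 can be sketched as follows. The sketch makes two assumptions that should be checked against the intended method: the occurrence frequency is measured as the share of first historical cases containing a feature, and a feature counts as rare when that share is at or below the preset proportion (the reading suggested by the term "rare" and by the 1% example, although the step as worded states the comparison the other way around). The function names are hypothetical.

```python
from collections import Counter


def split_feature_vocabulary(first_cases, preset_proportion=0.01):
    """Classify the features of one case type as rare or standard.

    first_cases: list of feature sets, one per first historical case.
    Assumption: rare means occurring in at most preset_proportion of the
    cases; flip the comparison if the opposite reading is intended.
    """
    if not first_cases:
        return set(), set()
    # Count each feature once per case in which it occurs.
    counts = Counter(feature for case in first_cases for feature in set(case))
    total = len(first_cases)
    rare = {f for f, n in counts.items() if n / total <= preset_proportion}
    standard = set(counts) - rare
    return rare, standard


def split_case(case_features, rare, standard):
    """Divide one case's features into its rare and standard feature sets."""
    features = set(case_features)
    return features & rare, features & standard
```

Under the same assumptions, step S205 would amount to calling `split_feature_vocabulary` once per case type on the second historical cases of that type and then `split_case` on each of those cases.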

S204, acquiring a plurality of second historical cases, wherein the plurality of second historical cases comprise cases of at least two case types.

The second historical case may be data obtained from patients' historical medical visits.

S205, determining the second historical case of each case type in the plurality of second historical cases as a plurality of first historical cases of the target case type, and obtaining a rare feature set and a standard feature set of each second historical case.

By taking the second historical cases of each case type in turn as the plurality of first historical cases of the target case type, the division described in steps S201 to S203 can be applied to each case type, so that the historical case features of each second historical case are divided into two sets: a rare feature set and a standard feature set.

S206, acquiring a feature set of the case to be searched, which is composed of the features of each case in the case to be searched.

The case to be searched is the case of a new patient, whose case type has not yet been confirmed.

S207, matching the feature set of the case to be searched with the rare feature set of each second historical case to obtain a plurality of first matching cases matched with the feature set of the case to be searched.

In a specific embodiment, matching the feature set of the case to be searched with the rare feature set of each second historical case to obtain a plurality of first matching cases matched with the feature set of the case to be searched includes:

(1) Each case feature in the feature set of the case to be searched is converted into a first case feature vector, wherein each element in the first case feature vector is a case feature in the feature set of the case to be searched.

(2) Each rare feature in the rare feature set of each second historical case is converted into a second case feature vector of each second historical case, wherein each element in the second case feature vector is each rare feature in the rare feature set.

(3) First vector similarities between the first case feature vector and the second case feature vectors of the respective second historical cases are calculated.

In a specific embodiment, the first vector similarity is a cosine similarity between the first case feature vector and the second case feature vector.

(4) The second historical cases corresponding to the second case feature vectors whose first vector similarity is greater than the first preset similarity are determined as the plurality of first matching cases.

The first preset similarity is set according to specific conditions.
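Steps (1) to (4) can be sketched as follows. The description does not say how a feature set becomes a numeric vector; the sketch assumes a 0/1 indicator encoding over a shared vocabulary built from the query and all rare feature sets, and uses the cosine similarity named in the specific embodiment. The function names and the `case_id` keys are hypothetical.

```python
import math


def to_binary_vector(features, vocabulary):
    """Encode a feature set as a 0/1 vector over a fixed vocabulary order."""
    return [1.0 if term in features else 0.0 for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_rare(query_features, rare_sets_by_case, first_preset_similarity):
    """Return {case_id: similarity} for cases above the preset similarity.

    rare_sets_by_case: {case_id: rare feature set of a second historical case}
    """
    query = set(query_features)
    vocabulary = sorted(query.union(*rare_sets_by_case.values()))
    query_vector = to_binary_vector(query, vocabulary)
    matches = {}
    for case_id, rare_set in rare_sets_by_case.items():
        case_vector = to_binary_vector(rare_set, vocabulary)
        similarity = cosine_similarity(query_vector, case_vector)
        if similarity > first_preset_similarity:
            matches[case_id] = similarity
    return matches
```

Matching against the standard feature sets in step S208 could reuse the same functions with the standard feature sets in place of the rare ones, if the same encoding and similarity are intended there.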

Further, to more accurately determine the matched first matching cases, determining the second historical cases corresponding to the second case feature vectors having the first vector similarity greater than the first preset similarity as the matched first matching cases may include:

(1) Each case feature in the feature set of the case to be searched is divided into a plurality of first case feature sets according to its feature category, wherein the case features in the same first case feature set belong to the same feature category, and the feature categories comprise a medical history feature category, a clinical presentation feature category, a physical examination feature category, a laboratory examination feature category and an imaging examination feature category.

In a specific embodiment, a feature classification model is trained on a preset feature training set, and each case feature in the feature set of the case to be searched is input into the feature classification model to obtain the feature category of each case feature. The preset feature training set is manually labeled.

For example, in the preset feature training set, the case type is "ileus", and the guideline mentions that this case type often occurs in patients with a past "history of abdominal tumors", "history of hernia or hernia repair", "history of inflammatory bowel disease", or "history of short-term abdominal surgery", so the above 4 case features are included in the medical history feature category.

For example, the case type is "tubal pregnancy", and the guideline lists "amenorrhea", "abdominal pain" and "vaginal bleeding" as the main clinical manifestations, so the above 3 clinical manifestations are included in the clinical manifestation feature category.

For example, the case type is "idiopathic pulmonary fibrosis", and the guideline states that physical examination of the patient shows "clubbed fingers" and "Velcro rales", so the above 2 physical examination findings are included in the physical examination feature category.

For example, the case type is "iron deficiency anemia", and the guidelines describe "Hb, mean red blood cell volume (MCV), mean red blood cell hemoglobin content (MCH), mean red blood cell hemoglobin concentration (MCHC) all decrease", "serum ferritin <20 μg/L", "serum iron decrease, total iron binding capacity increase and transferrin saturation decrease", "zinc protoporphyrin level increase", "soluble transferrin receptor level increase", so the laboratory test results described above are included in the laboratory test characterization category.

For example, the case type is "acute pancreatitis", and the guideline describes the typical early imaging findings as "pancreatic edema", "peripancreatic exudation", "pancreatic and/or peripancreatic tissue necrosis", etc., so the above 3 imaging findings are included in the imaging examination feature category.

(2) The plurality of first case feature sets are respectively converted into a corresponding plurality of third case feature vectors.

The elements in the third case feature vector are individual case features in the first case feature set.

For example, the plurality of third case feature vectors respectively represent features of the laboratory test feature class and features of the imaging test feature class.

(3) Each case feature in the rare feature set is divided into a plurality of second case feature sets according to feature categories of each rare feature in the rare feature set, wherein case features in the same second case feature set belong to the same feature category.

(4) The plurality of second case feature sets are respectively converted into a corresponding plurality of fourth case feature vectors.

For example, the plurality of fourth case feature vectors respectively represent features of the laboratory test feature class and features of the imaging test feature class.

(5) Calculating the second vector similarity between the third case feature vector and the fourth case feature vector belonging to the same feature category, to obtain a plurality of second vector similarities corresponding to the plurality of feature categories in the rare feature set.

For example, for the case feature set to be searched and the rare feature set of a second historical case, calculating the similarity of the third case feature vector and the fourth case feature vector belonging to the laboratory examination feature class to obtain a second vector similarity; and calculating the similarity of the third case feature vector and the fourth case feature vector belonging to the imaging examination feature class to obtain another second vector similarity, so as to obtain a plurality of second vector similarities corresponding to the plurality of feature classes.

(6) Acquiring the preset weight coefficient of each feature category.

The preset weight coefficient of each feature category may be set according to the specific situation; for example, the preset weight coefficient of each feature category may be 1.

(7) Performing a weighted average of the plurality of second vector similarities according to the preset weight coefficients of the feature categories to obtain the similarity weighted average corresponding to the rare feature set of a second historical case, and thereby obtaining a plurality of similarity weighted averages corresponding to the plurality of second historical cases.

For example, the second vector similarities corresponding to the feature categories are weighted and averaged to obtain the similarity weighted average between the feature set of the case to be searched and the rare feature set of one second historical case. Repeating this for the rare feature set of each second historical case gives the similarity weighted averages between the rare feature sets of the plurality of second historical cases and the feature set of the case to be searched, i.e., the plurality of similarity weighted averages corresponding to the plurality of second historical cases.

Weighting each feature class can improve the accuracy of calculation of the second vector similarity between the third case feature vector and the fourth case feature vector.

(8) Determining the second historical cases for which the first vector similarity is greater than the first preset similarity and the similarity weighted average is greater than a second preset similarity as the plurality of first matching cases.
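The category-wise refinement in steps (1) to (8) can be sketched as below. This is only an illustration under stated assumptions: features are already grouped by category (the classification model is not shown), vectors are multi-hot encoded, and the category names, weights and thresholds are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_category_similarity(query_by_category, rare_by_category, weights):
    """Weighted average of per-category similarities over the categories
    present in the rare feature set of one historical case."""
    weighted_sum = 0.0
    weight_total = 0.0
    for category, rare_features in rare_by_category.items():
        query_features = set(query_by_category.get(category, ()))
        rare_features = set(rare_features)
        vocabulary = sorted(query_features | rare_features)
        u = [1.0 if t in query_features else 0.0 for t in vocabulary]  # third vector
        v = [1.0 if t in rare_features else 0.0 for t in vocabulary]   # fourth vector
        w = weights.get(category, 1.0)  # default weight 1, as the text allows
        weighted_sum += w * cosine_similarity(u, v)
        weight_total += w
    return weighted_sum / weight_total if weight_total else 0.0

def is_first_match(first_similarity, weighted_average, first_threshold, second_threshold):
    """Step (8): both the overall and the category-weighted similarity must pass."""
    return first_similarity > first_threshold and weighted_average > second_threshold

# Hypothetical, pre-categorised feature sets.
query_by_category = {
    "laboratory": ["elevated white blood cells"],
    "clinical": ["fever", "upper right abdominal pain"],
}
rare_by_category = {
    "laboratory": ["elevated white blood cells"],
    "clinical": ["fever"],
}
weights = {"laboratory": 1.0, "clinical": 1.0}
score = weighted_category_similarity(query_by_category, rare_by_category, weights)
matched = is_first_match(0.82, score, first_threshold=0.5, second_threshold=0.6)
```

Averaging only over the categories that occur in the historical case's rare feature set follows the wording of step (5); averaging over all five categories would be an equally plausible reading.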

Further, to more accurately determine the preset weight coefficient of each feature class, obtaining the preset weight coefficient of each feature class may include:

(1) Determining the plurality of second historical cases and their corresponding case categories as a preset training set, wherein the preset training set comprises a plurality of training samples and corresponding labels, each training sample is a second historical case, and the label of each training sample is the case category of that second historical case.

(2) Training a preset decision tree model based on a preset training set to obtain a target decision tree model.

A decision tree model represents the classification process with a tree-like structure. Building the tree can be regarded as variable selection: each internal node represents a variable selected for splitting, each leaf node represents a class label, and the topmost node is the root node. A decision tree classifies data through a series of rules, describing which values are obtained under which conditions. Decision tree algorithms belong to supervised learning, i.e., the raw data must contain both predictor variables and a target variable. Decision trees are divided into classification trees (the target variable is categorical) and regression trees (the target variable is continuous). For a classification tree, the mode of the output variable among the samples in a leaf node is the classification result; for a regression tree, the mean of the output variable among the samples in a leaf node is the prediction result.

In an alternative embodiment, the preset decision tree model is a CART (Classification And Regression Trees) decision tree model. Of course, the preset decision tree model may also be an ID3 model, a C4.5 model, or the like. The CART decision tree is also called a classification and regression tree. When the dependent variable of the data set is continuous, the tree is a regression tree, and the mean of the observations in a leaf node can be used as the predicted value; when the dependent variable of the data set is discrete, the tree is a classification tree, which is well suited to classification problems.

(3) Obtaining the importance coefficient of each case feature in the target decision tree model.

In the embodiment of the application, the importance coefficient is a feature importance score (feature_importances) of the target decision tree model. Specifically, in the decision tree formation process, a policy needs to be adopted in the splitting process of each node to select one feature from m features as a splitting attribute. In order to make the selected feature an optimal splitting property, the strategy may include, for example, using an ID3 algorithm based on information gain, a C4.5 algorithm based on information gain ratio, and a branching algorithm based on the Gini Index (Gini Index), etc.

In a specific embodiment, the importance coefficient of a case feature in the target decision tree model is the sum of the importance coefficients of that case feature at the respective nodes of the target decision tree model. The importance coefficient of the case feature at a node is the change in the Gini index before and after branching at that node. The Gini index (also referred to as Gini impurity) is an index reflecting the uncertainty of the data. The smaller the Gini index, the more certain the sample set and the lower the error probability; the larger the Gini index, the more uncertain the sample set and the higher the error probability. Thus, it will be appreciated that a case feature is more important if the difference between the Gini index of the sample set at a node (e.g., node m) and the Gini index of the sample sets of the left and right child nodes after splitting on the selected case feature is greater.
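To make the Gini-based importance concrete, the following sketch computes the Gini index of a node and its decrease when the node is split on one binary case feature. It is a simplified single-node illustration with hypothetical data; a full tree would sum such decreases over every node that splits on the feature, and libraries typically also weight them by node size and normalise them.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label list: 1 - sum(p_k^2) over the classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(samples, labels, feature):
    """Drop in Gini index when the node is split on presence of `feature`."""
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for x, y in zip(samples, labels) if feature in x]       # feature present
    right = [y for x, y in zip(samples, labels) if feature not in x]  # feature absent
    after = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - after

# Hypothetical training samples (sets of case features) and case-category labels.
samples = [{"fever"}, {"fever", "headache"}, {"headache"}, {"headache"}]
labels = ["A", "A", "B", "B"]
fever_importance = gini_decrease(samples, labels, "fever")
headache_importance = gini_decrease(samples, labels, "headache")
```

Here splitting on "fever" separates the two categories perfectly, so it removes all of the node's impurity, whereas "headache" removes only part of it.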

(4) Determining the preset weight coefficient of a target feature category according to the importance coefficients, in the target decision tree model, of the case features belonging to the target feature category, so as to obtain the preset weight coefficient of each feature category.

Specifically, an average value of importance coefficients of each case feature belonging to the target feature class in the target decision tree model is determined as a preset weight coefficient of the target feature class, and the preset weight coefficient of each feature class is obtained.
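The averaging rule can be sketched as follows. The importance scores and the feature-to-category mapping are hypothetical inputs; in practice they would come from the trained decision tree and the feature classification model.

```python
from collections import defaultdict

def category_weights(importances, feature_category):
    """Preset weight of each category = mean importance of its case features."""
    grouped = defaultdict(list)
    for feature, score in importances.items():
        grouped[feature_category[feature]].append(score)
    return {category: sum(scores) / len(scores) for category, scores in grouped.items()}

# Hypothetical importance coefficients and category assignments.
importances = {"fever": 0.5, "headache": 0.1, "serum ferritin < 20 ug/L": 0.4}
feature_category = {
    "fever": "clinical",
    "headache": "clinical",
    "serum ferritin < 20 ug/L": "laboratory",
}
weights = category_weights(importances, feature_category)
```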

Further, the preset decision tree model is trained based on the preset training set to obtain the SHAP value of each case feature in the preset training set. SHAP (SHapley Additive exPlanations) is a post-hoc model interpretation method whose core idea is to calculate the marginal contribution of each feature to the model output and to interpret the "black box" model at both the global and local levels. SHAP builds an additive explanation model in which all features are regarded as "contributors". For each training sample, the model generates a predicted value, and the SHAP value is the value attributed to each feature of that sample. The basic idea is as follows: calculate the marginal contribution of a feature when it is added to the model, and then average the different marginal contributions of that feature over all feature orderings; this average is the SHAP value of the feature.
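The "average marginal contribution over all feature orderings" idea can be illustrated with a brute-force Shapley computation. This is a teaching sketch, not how SHAP libraries work in practice (they use much faster approximations or tree-specific algorithms); `toy_model` and its scores are invented for the example.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: mean marginal contribution over all orderings.
    `value` maps a frozenset of present features to a model output."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value(frozenset(present))
            present.add(feature)
            totals[feature] += value(frozenset(present)) - before
    return {feature: total / len(orderings) for feature, total in totals.items()}

def toy_model(present):
    """Hypothetical model output with an interaction between two features."""
    score = 0.0
    if "fever" in present:
        score += 0.2
    if "leukocytosis" in present:
        score += 0.1
    if "fever" in present and "leukocytosis" in present:
        score += 0.3  # joint effect, split between the two features by Shapley
    return score

phi = shapley_values(["fever", "leukocytosis"], toy_model)
```

The values add up to the difference between the model output with all features and with none, which is the additive property the paragraph above refers to. The cost grows factorially with the number of features, so this form is only usable for very small feature sets.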

A weighted average of the importance coefficients, in the target decision tree model, of the case features belonging to the target feature category is then computed based on the SHAP values of those case features, so as to obtain the preset weight coefficient of each feature category.

S208, matching the feature set of the case to be searched with the standard feature set of each second historical case to obtain a plurality of second matching cases matched with the feature set of the case to be searched.
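One possible reading of this SHAP-based weighting is sketched below. The text does not say how the SHAP values enter the average, so using each feature's mean absolute SHAP value as its weight is an assumption of this example, as are all names and numbers.

```python
def shap_weighted_category_weights(importances, shap_magnitude, feature_category):
    """Per-category weighted mean of importances, weighted by mean |SHAP| (assumed)."""
    sums = {}
    for feature, importance in importances.items():
        category = feature_category[feature]
        weight = shap_magnitude.get(feature, 0.0)
        numerator, denominator = sums.get(category, (0.0, 0.0))
        sums[category] = (numerator + weight * importance, denominator + weight)
    return {
        category: (numerator / denominator if denominator else 0.0)
        for category, (numerator, denominator) in sums.items()
    }

# Hypothetical inputs: tree importances and mean absolute SHAP values per feature.
importances = {"fever": 0.5, "headache": 0.1, "serum ferritin < 20 ug/L": 0.4}
shap_magnitude = {"fever": 0.3, "headache": 0.1, "serum ferritin < 20 ug/L": 0.2}
feature_category = {
    "fever": "clinical",
    "headache": "clinical",
    "serum ferritin < 20 ug/L": "laboratory",
}
shap_weights = shap_weighted_category_weights(importances, shap_magnitude, feature_category)
```

Compared with a plain mean, features that contribute more to the model output pull the category weight towards their own importance coefficient.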

For the implementation of S208, refer to S207; details are not repeated here.

S209, respectively sorting the plurality of second matching cases and the plurality of first matching cases to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases.

Specifically, the risk level of each first matching case is acquired, and the first matching cases are sorted from high to low by risk level to obtain the plurality of sorted first matching cases. Likewise, the risk level of each second matching case is acquired, and the second matching cases are sorted from high to low by risk level to obtain the plurality of sorted second matching cases.

S210, outputting a first preset number of first matching cases which are ranked forward in the ranked plurality of first matching cases and a second preset number of second matching cases which are ranked forward in the ranked plurality of second matching cases.

The first preset number and the second preset number are both larger than 1, and the first preset number and the second preset number are set according to the specific conditions.
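Sorting by risk level and outputting the top-ranked cases might look like the sketch below. Representing each case as a dictionary with a numeric `risk_level` field is an assumption; the text does not specify how the risk level is encoded.

```python
def top_ranked(cases, limit):
    """Sort cases from high to low risk level and keep the first `limit`."""
    ranked = sorted(cases, key=lambda case: case["risk_level"], reverse=True)
    return ranked[:limit]

# Hypothetical matching results.
first_matching = [
    {"id": "r1", "risk_level": 2},
    {"id": "r2", "risk_level": 5},
    {"id": "r3", "risk_level": 3},
]
second_matching = [
    {"id": "s1", "risk_level": 1},
    {"id": "s2", "risk_level": 4},
]
output = {
    "first": top_ranked(first_matching, limit=2),    # first preset number = 2
    "second": top_ranked(second_matching, limit=2),  # second preset number = 2
}
```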

Further, after outputting the first preset number of top-ranked first matching cases among the sorted plurality of first matching cases and the second preset number of top-ranked second matching cases among the sorted plurality of second matching cases, the method may further include:

(1) Acquiring the first case number value of each case category among the first preset number of first matching cases.

For example, there are 10 first matching cases belonging to 2 case categories, and the first case number values of the case categories are: 3 for case category A and 7 for case category B.

(2) Acquiring the second case number value of each case category among the second preset number of second matching cases.

For example, there are 7 second matching cases belonging to 2 case categories, and the second case number values of the case categories are: 3 for case category B and 4 for case category C.

(3) Weighting each second case number value by a preset coefficient to obtain the third case number value corresponding to each second case number value, wherein the preset coefficient is greater than 1.

The preset coefficient may be set according to the specific situation. For example, with a preset coefficient of 2, weighting the second case number values gives the third case number values: 6 for case category B and 8 for case category C.

(4) Determining the fourth case number value of each case category according to the first case number value of that case category and the third case number value corresponding to its second case number value.

The first case number value and the third case number value of each case category are summed to obtain the fourth case number value of that case category: 3 for case category A, 13 for case category B, and 8 for case category C.

(5) Determining the case category with the largest fourth case number value as the case category of the case to be searched.

In this example, case category B has the largest fourth case number value and is therefore determined to be the case category of the case to be searched.
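Steps (1) to (5) amount to a weighted vote over case categories, which can be sketched as follows. The function name and the list-of-labels input format are illustrative assumptions; the category labels and the coefficient of 2 mirror the worked example in the text.

```python
from collections import Counter

def predict_category(first_categories, second_categories, coefficient=2):
    """Combine category counts from both match lists; the second list's
    counts are multiplied by a preset coefficient greater than 1."""
    combined = Counter(first_categories)                  # first case number values
    for category, count in Counter(second_categories).items():
        combined[category] += coefficient * count         # add third case number values
    if not combined:
        return None, {}
    best = max(combined, key=combined.get)                # largest fourth case number value
    return best, dict(combined)

# Categories of the output first and second matching cases.
first_categories = ["A"] * 3 + ["B"] * 7
second_categories = ["B"] * 3 + ["C"] * 4
best_category, fourth_values = predict_category(first_categories, second_categories)
```

If two categories tie, `max` returns the one encountered first; the text does not define a tie-breaking rule, so a real implementation would need to choose one.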

In order to better implement the case retrieval method based on rare features in the embodiment of the present application, on the basis of the case retrieval method based on rare features, the embodiment of the present application further provides a case retrieval device based on rare features, as shown in fig. 3, a case retrieval device 300 based on rare features includes:

a first acquiring unit 301, configured to acquire a plurality of first historical cases of a target case type;

a second acquiring unit 302 configured to acquire an occurrence frequency of each of the characteristics of the first historical cases;

a first determining unit 303, configured to determine a history case feature with an occurrence frequency greater than a preset frequency as a rare feature of a target case type, and determine a history case feature with an occurrence frequency not greater than the preset frequency as a standard feature of the target case type, so as to obtain a rare feature set and a standard feature set of each first history case;

a third obtaining unit 304 for obtaining a plurality of second history cases, the plurality of second history cases including cases of at least two case types;

a second determining unit 305, configured to take the second historical cases of each case type among the plurality of second historical cases as the plurality of first historical cases of the target case type, so as to obtain a rare feature set and a standard feature set of each second historical case;

a fourth obtaining unit 306, configured to obtain a feature set of the case to be searched, where the feature set of the case to be searched is formed by features of each case in the case to be searched;

a first matching unit 307, configured to match the feature set of the case to be searched with the rare feature sets of the second historical cases, so as to obtain a plurality of first matching cases matched with the feature set of the case to be searched;

a second matching unit 308, configured to match the feature set of the case to be searched with the standard feature set of each second historical case, so as to obtain a plurality of second matching cases matched with the feature set of the case to be searched;

a sorting unit 309, configured to sort the plurality of second matching cases and the plurality of first matching cases respectively, to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases;

an output unit 310, configured to output a first preset number of first matching cases, which are ranked earlier, from among the ranked plurality of first matching cases, and a second preset number of second matching cases, which are ranked earlier, from among the ranked plurality of second matching cases.

The embodiment of the application also provides a computer device, which integrates any of the case retrieval devices based on rare characteristics, and the computer device comprises:

one or more processors;

a memory; and

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to perform the steps in the rare feature-based case retrieval method in any of the rare feature-based case retrieval method embodiments described above.

As shown in fig. 4, a schematic structural diagram of a computer device according to an embodiment of the present application is shown, specifically:

the computer device may include a processor 401 having one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the computer device structure shown in the figure does not limit the computer device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:

the processor 401 is the control center of the computer device. It connects the various parts of the entire computer device using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, the processor 401 may include one or more processing cores. The processor 401 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. Preferably, the processor 401 may integrate an application processor, which primarily handles the operating system, user interface, application programs, and the like, with a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the computer device, and the like. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The computer device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of charge, discharge, and power consumption management may be performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:

acquiring a plurality of first historical cases of a target case type; acquiring the occurrence frequency of each historical case feature in a plurality of first historical cases; determining the historical case characteristics with the occurrence frequency being greater than the preset frequency as rare characteristics of the target case type, and determining the historical case characteristics with the occurrence frequency not greater than the preset frequency as standard characteristics of the target case type, so as to obtain rare characteristic sets and standard characteristic sets of all first historical cases; acquiring a plurality of second historical cases, wherein the plurality of second historical cases comprise cases of at least two case types; determining a second history case of each case type in the plurality of second history cases as a plurality of first history cases of the target case type, obtaining a rare feature set and a standard feature set of each second history case; acquiring a feature set of the case to be searched, which is formed by the features of each case in the case to be searched; matching the characteristic set of the case to be searched with the rare characteristic set of each second historical case to obtain a plurality of first matched cases matched with the characteristic set of the case to be searched; matching the characteristic set of the case to be searched with the standard characteristic set of each second historical case to obtain a plurality of second matched cases matched with the characteristic set of the case to be searched; respectively sorting the plurality of second matching cases and the plurality of first matching cases to obtain a plurality of sorted second matching cases and a plurality of sorted first matching cases; outputting a first preset number of first matching cases which are ranked forward in the ranked plurality of first matching cases and a second preset number of second matching cases which are ranked forward in the ranked plurality of second matching cases.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present application provides a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like. A computer program is stored on the storage medium and is loaded by a processor to perform the steps of any of the rare feature-based case retrieval methods provided by embodiments of the present application. For example, the computer program, when loaded by the processor, may perform the steps of the rare feature-based case retrieval method described above.

In the foregoing embodiments, each embodiment is described with its own emphasis. For portions not described in detail in one embodiment, refer to the detailed descriptions of the other embodiments above; they are not repeated here.

In the implementation, each unit or structure may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit or structure may be referred to the foregoing method embodiments and will not be repeated herein.

The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.

The case retrieval method and device based on rare features provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help in understanding the method and core ideas of the present application. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present application.

Claims

1. A rare feature-based case retrieval method, characterized in that the rare feature-based case retrieval method comprises:

Acquiring a plurality of first historical cases of a target case type;

2. The rare feature-based case retrieval method according to claim 1, wherein the matching the feature set of the case to be retrieved with the rare feature set of each second historical case to obtain a plurality of first matching cases matching the feature set of the case to be retrieved, includes:

3. The rare feature-based case retrieval method according to claim 2, wherein the determining a plurality of second history cases having a first vector similarity greater than a first preset similarity as a plurality of first matching cases includes:

acquiring preset weight coefficients of each feature class;

4. The rare feature-based case retrieval method according to claim 3, wherein the obtaining preset weight coefficients of each feature class comprises:

5. The rare feature-based case retrieval method according to claim 4, wherein determining the preset weight coefficient of the target feature class according to the importance coefficient of each case feature belonging to the target feature class in the target decision tree model, to obtain the preset weight coefficient of each feature class, comprises:

6. The rare feature-based case retrieval method according to claim 1, wherein the sorting the second plurality of matching cases and the first plurality of matching cases to obtain the sorted second plurality of matching cases and the sorted first plurality of matching cases, respectively, comprises:

acquiring the risk level of each first matching case;

sequencing each first matching case from high to low according to the risk level of each first matching case to obtain a plurality of sequenced first matching cases;

acquiring the risk level of each second matching case;

and sequencing the second matching cases from high to low according to the risk level of the second matching cases to obtain a plurality of sequenced second matching cases.

7. The rare feature-based case retrieval method according to claim 1, wherein, after the outputting of the first preset number of first matching cases from the ranked plurality of first matching cases and the second preset number of second matching cases from the ranked plurality of second matching cases, the method further comprises:

8. A rare feature-based case retrieval device, the rare feature-based case retrieval device comprising:

9. A computer device, the computer device comprising:

one or more processors;

a memory; and

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the rare feature-based case retrieval method of any one of claims 1 to 7.

10. A computer-readable storage medium, having stored thereon a computer program, the computer program being loaded by a processor to perform the steps in the rare feature-based case retrieval method of any one of claims 1 to 7.