CN111429289B

CN111429289B - Single disease identification method and device, computer equipment and storage medium

Info

Publication number: CN111429289B
Application number: CN202010208503.3A
Authority: CN
Inventors: 董奕; 张旭
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Ping An Medical and Healthcare Management Co Ltd
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2023-03-24
Anticipated expiration: 2040-03-23
Also published as: CN111429289A

Abstract

An embodiment of the invention provides a single disease type identification method, comprising: acquiring medical insurance document data, and preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data; calculating a similarity value between the first document item vector and each second document item vector in a pre-constructed single disease category set; normalizing the calculated similarity values to obtain a first probability set of the single disease types corresponding to the medical insurance document data; inputting the first document item vector into a pre-trained single disease type prediction model, which predicts and outputs a second probability set of the single disease types corresponding to the medical insurance document data; and determining a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set. The embodiment of the invention can improve the accuracy of single disease fraud identification.

Description

Single disease identification method and device, computer equipment and storage medium

Technical Field

The embodiments of the invention relate to the technical field of data processing, and in particular to a single disease identification method, a single disease identification device, computer equipment and a computer-readable storage medium.

Background

Medical insurance has become a major expenditure in many countries, but part of this expenditure is lost to medical fraud. Medical insurance fraud has a heavy impact on medical insurance funds in China and causes economic losses of billions of yuan every year. One of the more common forms is single disease fraud. A single disease is a disease without complications; common examples include non-suppurative appendicitis, cholecystitis, gallstones and caesarean section. In single disease fraud, documents that do not meet the standard for a single disease are declared as that single disease in order to defraud the medical insurance fund. Specifically, the patient's actual diagnosis is single disease A, but the hospital reports the case as another single disease B that has a higher reimbursement limit.

Currently, single disease fraud is generally identified using rules derived from medical experience. However, producing such rules depends heavily on business experience, so it is difficult to ensure that single disease fraud is identified accurately.

Disclosure of Invention

In view of the above, an object of the embodiments of the present invention is to provide a single-disease identification method, a single-disease identification device, a computer device, and a computer-readable storage medium, which are used to solve the problem of low accuracy of the existing single-disease fraud identification method.

In order to achieve the above object, an embodiment of the present invention provides a single disease category identification method, including:

acquiring medical insurance document data, and preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data, wherein the medical insurance document data is the document data required for declaring the medical insurance funds corresponding to a single disease type;

calculating a similarity value between the first document item vector and each second document item vector in a pre-constructed single disease category set, wherein the single disease category set comprises multiple single disease types and the second document item vector mapped to each single disease type;

normalizing the calculated similarity values to obtain a first probability set of the single disease types corresponding to the medical insurance document data;

inputting the first document item vector into a pre-trained single disease type prediction model, and predicting and outputting a second probability set of each single disease type corresponding to the medical insurance document data through the single disease type prediction model;

and determining a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set.

Optionally, the inputting the first document item vector into a pre-trained single disease type prediction model to output a second probability set of each single disease type corresponding to the medical insurance document data through prediction by the single disease type prediction model includes:

performing dimension reduction processing on the first document item vector by adopting a preset dimension reduction algorithm to obtain a compressed third document item vector;

and inputting the third document item vector into a pre-trained single disease type prediction model so as to predict and output a second probability set of each single disease type corresponding to the medical insurance document data through the single disease type prediction model.

Optionally, the preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data includes:

extracting a catalog list contained in the medical insurance document data, wherein the catalog list is list data of the medicines, diagnosis and treatment items or medical service facilities that can be paid for by the medical insurance pooling fund;

and performing one-hot encoding on the catalog list to obtain a first document item vector corresponding to the medical insurance document data.

Optionally, constructing a second document item vector in the single disease category set includes:

acquiring a single disease category prescription set corresponding to each single disease category, wherein the single disease category prescription set comprises one or more prescriptions corresponding to the single disease category;

calculating the frequency of occurrence of each catalog list entry contained in the single disease prescription set;

and forming the frequencies into a second document item vector corresponding to the single disease type.

Optionally, the determining, according to the first probability set and the second probability set, a target single disease type corresponding to the medical insurance document data includes:

weighting and summing the first probability set and the second probability set through a preset weighted summing formula to obtain a third probability set, wherein each probability value in the third probability set is obtained by weighting and summing each probability value in the first probability set and each probability value in the second probability set in a one-to-one correspondence manner;

and selecting the single disease species corresponding to the maximum probability value in the third probability set as the target single disease species.

Optionally, the determining, according to the first probability set and the second probability set, a target single disease category corresponding to the medical insurance document data includes:

selecting the single disease type mapped to the maximum probability value in the first probability set as a first intermediate single disease type;

selecting the single disease type mapped to the maximum probability value in the second probability set as a second intermediate single disease type;

and comparing the probability values corresponding to the first intermediate single disease type and the second intermediate single disease type, and taking the single disease type mapped to the larger probability value as the target single disease type.
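As an illustration only, the selection rule in this optional variant can be sketched in Python. The function name `fuse_by_max`, the dictionary representation of the probability sets and the tie-breaking rule (prefer the first probability set) are assumptions of this sketch and are not specified in the text.

```python
from typing import Dict


def fuse_by_max(first: Dict[str, float], second: Dict[str, float]) -> str:
    """Pick the target type from the larger of the two per-set maxima."""
    # First and second intermediate single disease types.
    first_best = max(first, key=first.get)
    second_best = max(second, key=second.get)
    # Assumption of this sketch: on a tie, the first probability set wins.
    if first[first_best] >= second[second_best]:
        return first_best
    return second_best


# Hypothetical probability sets for three single disease types.
first_set = {"A": 0.6, "B": 0.2, "C": 0.2}
second_set = {"A": 0.3, "B": 0.7, "C": 0.0}
target = fuse_by_max(first_set, second_set)
print(target)  # B
```

Here the first set favours type A with 0.6 and the second set favours type B with 0.7, so type B is returned.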

Optionally, the single disease species identification method further includes:

comparing the target single disease type with the single disease type corresponding to the medical insurance document data;

and if the target single disease type is different from the single disease type corresponding to the medical insurance document data, marking the medical insurance document data as an abnormal document.

In order to achieve the above object, an embodiment of the present invention further provides a single disease category identification device, including:

an acquisition module, used for acquiring medical insurance document data and preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data, wherein the medical insurance document data is the document data required for declaring the medical insurance funds corresponding to a single disease type;

a calculating module, used for calculating a similarity value between the first document item vector and each second document item vector in a pre-constructed single disease category set, wherein the single disease category set comprises multiple single disease types and the second document item vector mapped to each single disease type;

a normalization module, used for normalizing the calculated similarity values to obtain a first probability set of the single disease types corresponding to the medical insurance document data;

an input module, used for inputting the first document item vector into a pre-trained single disease type prediction model, so that the single disease type prediction model predicts and outputs a second probability set of the single disease types corresponding to the medical insurance document data;

and a determining module, used for determining the target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set.

In order to achieve the above object, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the single disease identification method as described above when executing the computer program.

To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the single-disease identification method described above.

According to the single disease identification method, device, computer equipment and computer-readable storage medium provided by the embodiments of the invention, a first probability that the medical insurance document data is a document of each single disease type is determined by calculating the similarity value between the first document item vector corresponding to the medical insurance document data and the second document item vector corresponding to each single disease type; a second probability that the medical insurance document data is a document of each single disease type is then predicted by a pre-trained single disease type prediction model; and finally the target single disease type corresponding to the medical insurance document data is determined from the first and second probabilities. Because the first probability and the second probability each represent the probability that the medical insurance document data belongs to each single disease type, combining them allows the target single disease type to be identified more accurately. The single disease type corresponding to the medical insurance document data no longer needs to be identified from business experience, so the accuracy of identifying single disease fraud can be improved.

Drawings

Fig. 1 is a schematic step flow diagram of a single disease identification method according to an embodiment of the present invention.

Fig. 2 is a detailed flowchart of the step of preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data in an embodiment of the present invention.

Fig. 3 is a detailed flowchart of the step of constructing a second document item vector in the single disease category set in an embodiment of the present invention.

Fig. 4 is a detailed flowchart of the step of determining a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set in an embodiment of the present invention.

Fig. 5 is a detailed flowchart of the step of determining a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set in another embodiment of the present invention.

Fig. 6 is a schematic diagram of program modules of a single disease type identification apparatus according to an embodiment of the present invention.

Fig. 7 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.

The terminology used in the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at the time of ...", "when ..." or "in response to a determination", depending on the context.

In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present invention and to distinguish each step, and thus should not be construed as limiting the present invention.

Referring to Fig. 1, a flowchart of a single disease identification method according to an embodiment of the present invention is shown. It is to be understood that the flowcharts in the method embodiments are not intended to limit the order in which the steps are performed. The following description takes a computer device as the execution subject, where the computer device may be a device with a data transmission function, such as a mobile phone, a tablet computer, a laptop computer or a server. The specific steps are as follows:

Step S10, acquiring medical insurance document data, and preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data.

Specifically, the medical insurance document data can be acquired from a medical insurance database. The medical insurance document data is the document data that a hospital needs to prepare when declaring the medical insurance funds corresponding to a single disease type.

A single disease is a disease without complications; common examples include non-suppurative appendicitis, cholecystitis, gallstones and caesarean section. In practical application, the medical insurance document data comprises the catalog list used by the doctor in treating the single disease, that is, the medicines, diagnosis and treatment items and medical service facilities that can be paid for by the medical insurance pooling fund.

In this embodiment, after acquiring the medical insurance document data, in order to perform subsequent processing on the medical insurance document data, the medical insurance document data needs to be preprocessed to be converted into a first document item vector of the medical insurance document data.

In an embodiment, referring to fig. 2, the preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data includes:

Step S20, extracting the catalog list contained in the medical insurance document data.

Specifically, the catalog list is list data of the medicines, diagnosis and treatment items or medical service facilities that can be paid for by the medical insurance pooling fund. In this embodiment, the catalog list entries can be extracted by searching the medical insurance document data for the names of medicines, diagnosis and treatment items and medical service facilities that can be paid for by the medical insurance pooling fund, or for the codes corresponding to them; every entry that is found is extracted.

Step S21, performing one-hot encoding on each catalog list entry to obtain a first document item vector corresponding to the medical insurance document data.

Specifically, one-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time.

One-hot encoding can be understood as follows: if a feature has m possible values, it becomes m binary features after one-hot encoding (for example, a feature with three possible values is encoded as 100, 010 and 001). These binary features are mutually exclusive, with only one active at a time, so the resulting data is sparse.

In this embodiment, one-hot encoding is performed on each catalog list entry to obtain the corresponding feature data, and all the feature data are then combined according to a preset rule to obtain the first document item vector corresponding to the medical insurance document data.
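A minimal Python sketch of steps S20 and S21 follows. It assumes a fixed, ordered vocabulary of catalog item codes (the codes shown are hypothetical) and assumes that the "preset rule" for combining the per-item one-hot codes is an element-wise OR; the text does not specify either.

```python
from typing import Dict, List, Sequence

# Hypothetical catalog item codes; a real system would load the actual
# medical insurance catalog (medicines, treatment items, service facilities).
CATALOG: List[str] = ["drug_001", "drug_002", "exam_010", "surgery_100", "bed_day"]
INDEX: Dict[str, int] = {code: i for i, code in enumerate(CATALOG)}


def one_hot(code: str) -> List[int]:
    """One-hot code of a single catalog item: exactly one position is 1."""
    vec = [0] * len(CATALOG)
    vec[INDEX[code]] = 1
    return vec


def document_item_vector(items: Sequence[str]) -> List[int]:
    """Combine the one-hot codes of all items on one document (element-wise OR).

    Items that are not in the catalog are ignored (an assumption of this sketch).
    """
    vec = [0] * len(CATALOG)
    for code in items:
        if code in INDEX:
            vec = [a | b for a, b in zip(vec, one_hot(code))]
    return vec


first_vector = document_item_vector(["drug_002", "surgery_100", "unknown_item"])
print(first_vector)  # [0, 1, 0, 1, 0]
```

Each position of the resulting vector corresponds to one catalog item, so vectors built from different documents are directly comparable.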

Step S11, calculating a similarity value between the first document item vector and each second document item vector in a pre-constructed single disease category set.

Specifically, the single disease category set can be constructed in advance, so that the second document item vectors can be read directly from it when the similarity values are calculated. The medical insurance document data of each single disease type then does not need to be processed again to obtain its second document item vector, which speeds up the calculation. The single disease category set comprises multiple single disease types and the second document item vector mapped to each single disease type. In this embodiment, either the cosine similarity or the Euclidean distance between the first document item vector and each second document item vector may be used as the similarity value, which is not limited in this embodiment.
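The two measures named above can be sketched in Python as follows. The contents of `category_set` are hypothetical values chosen for illustration. Note that the Euclidean distance is smaller for more similar vectors, so this sketch assumes it would have to be converted (for example, negated) before being treated as a similarity value; the text does not describe such a conversion.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical single disease category set: type -> second document item vector.
category_set: Dict[str, List[float]] = {
    "A": [0.9, 0.8, 0.1, 0.0],
    "B": [0.1, 0.2, 0.9, 0.7],
}
first_vector = [1, 1, 0, 0]

similarities = {
    name: cosine_similarity(first_vector, vec) for name, vec in category_set.items()
}
print(max(similarities, key=similarities.get))  # A
```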

Further, in one embodiment, with reference to fig. 3, constructing a second document item vector in the single disease category set comprises:

Step S30, obtaining the single disease prescription set corresponding to each single disease type.

Specifically, the single disease prescription set includes one or more prescriptions corresponding to a single disease type. For example, if the second document item vector corresponding to single disease type A is being constructed, the single disease prescription set includes one or more prescriptions corresponding to single disease type A, and the contents of the prescriptions in the set may be the same or different. In a specific application, all prescriptions for treating the single disease type that were collected in a city within one year can be used as the single disease prescription set corresponding to that single disease type.

Step S31, calculating the frequency of occurrence of each catalog list entry contained in the single disease prescription set.

Specifically, the number of times each catalog list entry occurs across all prescriptions in the single disease prescription set may be counted first, and the frequency of occurrence of each entry may then be calculated from this count and the total number of prescriptions in the set.

Step S32, forming the frequencies into a second document item vector corresponding to the single disease type.

Specifically, after the occurrence frequency of each catalog list is obtained, the frequency values may be arranged according to a preset rule, and then combined to obtain a second document item vector corresponding to the single disease type.

In this embodiment, when constructing the second document item vectors corresponding to other single disease species in the single disease species set, the method is also performed according to the above method, and details are not described in this embodiment.

The rule for combining the frequency values in this embodiment is the same as the rule for combining the feature data described above.
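Steps S30 to S32 can be illustrated with a short Python sketch. The catalog codes and prescriptions are hypothetical, and the sketch assumes that an item is counted at most once per prescription, so that each frequency is the fraction of prescriptions containing the item.

```python
from typing import List, Sequence

# Hypothetical catalog item codes, in the same fixed order used for the
# first document item vector so that the two vectors are comparable.
CATALOG: List[str] = ["drug_001", "drug_002", "exam_010", "surgery_100"]


def frequency_vector(prescriptions: Sequence[Sequence[str]]) -> List[float]:
    """Fraction of prescriptions in which each catalog item appears."""
    total = len(prescriptions)
    if total == 0:
        return [0.0] * len(CATALOG)
    counts = [0] * len(CATALOG)
    for prescription in prescriptions:
        present = set(prescription)  # count an item at most once per prescription
        for i, code in enumerate(CATALOG):
            if code in present:
                counts[i] += 1
    return [count / total for count in counts]


# Hypothetical prescription set for single disease type A.
prescriptions_a = [
    ["drug_001", "surgery_100"],
    ["drug_001", "exam_010", "surgery_100"],
    ["drug_002", "surgery_100"],
    ["drug_001", "surgery_100"],
]
second_vector_a = frequency_vector(prescriptions_a)
print(second_vector_a)  # [0.75, 0.25, 0.25, 1.0]
```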

Step S12, normalizing the calculated similarity values to obtain a first probability set of the single disease types corresponding to the medical insurance document data.

Specifically, the normalization processing may be performed on the acquired respective similarity values by a normalization exponential function (Softmax function). Wherein the Softmax function can compress a K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector σ (z) such that each element ranges between (0, 1) and the sum of all elements is 1.

After the similarity values are normalized, a first probability set is obtained, which gives the probability that the medical insurance document data is a document of each single disease type. For example, after normalization, the probability that the medical insurance document data is a document of single disease type A is 0.6, the probability that it is a document of single disease type B is 0.2, and the probability that it is a document of single disease type C is 0.2.
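The Softmax normalization of step S12 can be sketched in Python as follows. The similarity values are hypothetical, and subtracting the maximum before exponentiation is a common numerical-stability measure added by this sketch, not something stated in the text.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Map raw similarity values to probabilities in (0, 1) that sum to 1."""
    if not scores:
        return {}
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical similarity values for three single disease types.
similarities = {"A": 0.99, "B": 0.18, "C": 0.18}
first_probabilities = softmax(similarities)
print(max(first_probabilities, key=first_probabilities.get))  # A
```

Softmax preserves the ranking of the similarity values, so the most similar single disease type also receives the highest probability.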

Step S13, inputting the first document item vector into a pre-trained single disease type prediction model, and predicting and outputting a second probability set of the single disease types corresponding to the medical insurance document data through the single disease type prediction model.

Specifically, the single disease type prediction model is a classifier that predicts a single disease type from a document item vector. In this embodiment, the classifier may be implemented with the XGBoost algorithm, the LightGBM algorithm, an SVM (Support Vector Machine) model, a DNN (Deep Neural Network) model, a CNN (Convolutional Neural Network) model or a logistic regression algorithm.

In a specific application, the single disease species prediction model can be constructed by the following steps:

Step 1, acquiring n groups of training sample data, wherein each group of training sample data comprises a document item vector and label data, and the label data indicates which single disease type the document item vector corresponds to;

Step 2, constructing a single disease type prediction model, wherein the model can be an XGBoost, LightGBM, SVM, DNN, CNN or logistic regression model;

Step 4, training the single disease type prediction model with the training data set, continuously adjusting the parameters of the model according to its output during training until the model converges;

Step 5, acquiring m groups of verification sample data and using them to verify whether the trained single disease type prediction model is valid; if the model is valid, it is stored, and if it is not, training continues until it is valid.
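The training and verification procedure above can be sketched with one of the listed model families, logistic regression, written here as a small multinomial (softmax) regression in plain Python. The class `SoftmaxRegression`, the learning rate, the epoch count and all sample data are assumptions made for illustration; a production system would more likely use an established library implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SoftmaxRegression:
    """Multinomial logistic regression trained with batch gradient descent."""

    def __init__(self, n_features: int, n_classes: int) -> None:
        self.weights = [[0.0] * n_features for _ in range(n_classes)]
        self.bias = [0.0] * n_classes

    def predict_proba(self, x: Sequence[float]) -> List[float]:
        logits = [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)

    def fit(self, xs, ys, lr: float = 0.5, epochs: int = 200) -> None:
        n = len(xs)
        for _ in range(epochs):
            grad_w = [[0.0] * len(row) for row in self.weights]
            grad_b = [0.0] * len(self.bias)
            for x, y in zip(xs, ys):
                for k, p in enumerate(self.predict_proba(x)):
                    err = p - (1.0 if k == y else 0.0)  # cross-entropy gradient
                    grad_b[k] += err
                    for j, xj in enumerate(x):
                        grad_w[k][j] += err * xj
            # Update parameters once per epoch with the averaged gradient.
            for k in range(len(self.bias)):
                self.bias[k] -= lr * grad_b[k] / n
                for j in range(len(self.weights[k])):
                    self.weights[k][j] -= lr * grad_w[k][j] / n


def accuracy(model: SoftmaxRegression, xs, ys) -> float:
    """Share of samples whose most probable class equals the label."""
    hits = 0
    for x, y in zip(xs, ys):
        probs = model.predict_proba(x)
        hits += int(probs.index(max(probs)) == y)
    return hits / len(xs)


LABELS = ["A", "B", "C"]  # hypothetical single disease types
# Hypothetical training samples: document item vectors and label indices.
train_x = [
    [1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0], [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0],
]
train_y = [0, 0, 1, 1, 2, 2]
# Hypothetical verification samples.
valid_x = [[0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1]]
valid_y = [0, 1, 2]

model = SoftmaxRegression(n_features=6, n_classes=3)
model.fit(train_x, train_y)
valid_accuracy = accuracy(model, valid_x, valid_y)
second_probabilities = dict(zip(LABELS, model.predict_proba([1, 1, 0, 0, 0, 0])))
```

In this sketch the model would be considered valid when `valid_accuracy` reaches a chosen threshold; the threshold itself is a design decision that the text leaves open.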

In this embodiment, the first document item vector is input into the single disease type prediction model, and the model predicts the second probability set of the single disease types corresponding to the medical insurance document data. For example, the model may output that the medical insurance document data is a document of single disease type A with a probability of 0.65, a document of single disease type B with a probability of 0.15, and a document of single disease type C with a probability of 0.2.

Further, in an embodiment, in order to reduce the complexity of the data and save space, inputting the first document item vector into a pre-trained single disease type prediction model, so as to predict and output a second probability set of the single disease types corresponding to the medical insurance document data through the single disease type prediction model, specifically includes:

performing dimension reduction processing on the first document item vector by adopting a preset dimension reduction algorithm to obtain a compressed third document item vector; and inputting the third document item vector into the pre-trained single disease type prediction model, so as to predict and output a second probability set of the single disease types corresponding to the medical insurance document data through the single disease type prediction model.

Specifically, the dimension reduction algorithm may be a Principal Component Analysis (PCA) algorithm, a Singular Value Decomposition (SVD) algorithm, an autoencoder, or the like.

In the embodiment, the dimension reduction processing is performed on the first document item vector input to the pre-trained single-disease prediction model, so that the complexity of data can be reduced and the storage space of the data can be saved on the premise of keeping the effective information of the first document item vector.
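As one possible illustration of the dimension reduction step, the following Python sketch performs PCA through NumPy's singular value decomposition. It assumes NumPy is available; the function names `fit_pca` and `reduce`, the historical vectors and the choice of two components are hypothetical.

```python
import numpy as np


def fit_pca(samples: np.ndarray, n_components: int):
    """Return (mean, components) learned from the rows of `samples`."""
    mean = samples.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_components]


def reduce(vector, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project one document item vector onto the principal directions."""
    return components @ (np.asarray(vector, dtype=float) - mean)


# Hypothetical historical document item vectors (one per row) used to fit
# the projection before any new document is processed.
history = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
], dtype=float)

mean, components = fit_pca(history, n_components=2)
third_vector = reduce([1, 1, 0, 0, 0, 0], mean, components)
print(third_vector.shape)  # (2,)
```

If dimension reduction is used, the prediction model would need to be trained on vectors reduced with the same fitted projection, so that training and prediction inputs have the same dimensions.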

Step S14, determining a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set.

Specifically, after the first probability set and the second probability set are obtained, the two probability sets may be analyzed together, and the target single disease type corresponding to the medical insurance document data may then be determined according to the analysis result.

In an embodiment, referring to Fig. 4, determining the target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set includes:

Step S40, performing weighted summation on the first probability set and the second probability set through a preset weighted summation formula to obtain a third probability set, where each probability value in the third probability set is obtained by performing weighted summation on each probability value in the first probability set and each probability value in the second probability set in a one-to-one correspondence manner.

Specifically, the first probability set includes a plurality of first probability values P1 and the second probability set includes a plurality of second probability values P2. After the first probability values P1 and the second probability values P2 are obtained, the preset weighted summation formula P3 = P1 × a + P2 × b may be applied to each first probability value P1 and the corresponding second probability value P2 to obtain each probability value in the third probability set, where a is a preset first weight value, b is a preset second weight value, a + b = 1, the specific values of a and b can be set according to actual conditions, and P3 is a probability value in the third probability set.

In an example, assuming that the first probability set includes 5 probability values A1, A2, A3, A4 and A5, and the second probability set includes 5 probability values B1, B2, B3, B4 and B5, the third probability set also includes 5 probability values C1, C2, C3, C4 and C5, where C1 = A1 × a + B1 × b, C2 = A2 × a + B2 × b, C3 = A3 × a + B3 × b, C4 = A4 × a + B4 × b, and C5 = A5 × a + B5 × b.

Step S41, selecting the single disease type corresponding to the maximum probability value in the third probability set as the target single disease type.

Specifically, after the third probability set is obtained through weighted summation, the single disease species corresponding to the maximum probability in the third probability set can be selected as the target single disease species.

In one example, assume that the first probability that the medical insurance document data is a document of single disease type A is 0.6, the first probability that it is a document of single disease type B is 0.2, and the first probability that it is a document of single disease type C is 0.2; the second probability that it is a document of single disease type A is 0.65, the second probability that it is a document of single disease type B is 0.15, and the second probability that it is a document of single disease type C is 0.2; and the weighted summation formula is P3 = P1 × 0.5 + P2 × 0.5. After weighted summation, the third probability for single disease type A is 0.6 × 0.5 + 0.65 × 0.5 = 0.625, the third probability for single disease type B is 0.2 × 0.5 + 0.15 × 0.5 = 0.175, and the third probability for single disease type C is 0.2 × 0.5 + 0.2 × 0.5 = 0.2. The maximum probability value is 0.625, and the single disease type corresponding to it is single disease type A, so single disease type A is selected as the target single disease type.
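The weighted summation of step S40 and the selection of step S41 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `fuse_and_select`, the dictionary representation of the probability sets, and the input checks are assumptions introduced here; only the formula P3 = P1 × a + P2 × b and the numbers of the example come from the description.

```python
from typing import Dict, Tuple


def fuse_and_select(
    first: Dict[str, float],
    second: Dict[str, float],
    a: float = 0.5,
    b: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Compute P3 = P1 * a + P2 * b per disease type and pick the maximum."""
    # The description requires a + b = 1; rejecting other weights is an assumption.
    if abs(a + b - 1.0) > 1e-9:
        raise ValueError("weights must satisfy a + b = 1")
    # One-to-one correspondence: both sets must cover the same disease types.
    if first.keys() != second.keys():
        raise ValueError("probability sets must cover the same disease types")
    third = {disease: first[disease] * a + second[disease] * b for disease in first}
    # Step S41: the disease type with the largest third probability is the target.
    target = max(third, key=third.get)
    return target, third


# Values taken from the example in the description.
first_set = {"A": 0.6, "B": 0.2, "C": 0.2}
second_set = {"A": 0.65, "B": 0.15, "C": 0.2}
target, third_set = fuse_and_select(first_set, second_set)
print(target, third_set)
```

With equal weights the two probability sources contribute equally; shifting weight toward a or b would favor the similarity-based or the model-based estimate respectively.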

In another embodiment, referring to fig. 5, the determining the target individual disease category corresponding to the medical insurance document data according to the first probability set and the second probability set includes:

Step S50, selecting the single disease type mapped by the maximum probability value in the first probability set as a first intermediate single disease type.

Specifically, after obtaining the first probability set composed of the first probabilities, the largest first probability value is found, and then the single disease species mapped by the largest first probability value is taken as the first intermediate single disease species.

Step S51, selecting the single disease species mapped by the maximum probability value in the second probability set as a second intermediate single disease species.

Specifically, after a second probability set composed of the second probabilities is obtained, a maximum second probability value is found, and then the single disease species mapped by the maximum second probability value is used as a second intermediate single disease species.

Step S52, comparing the probability values corresponding to the first intermediate single disease species and the second intermediate single disease species, and taking the single disease species mapped by the larger probability value as the target single disease species.

Specifically, after the probability values corresponding to the first intermediate individual disease type and the second intermediate individual disease type are obtained, the two obtained probability values may be compared, and then the mapped individual disease type with the larger probability value is used as the target individual disease type.

In one example, assume that the first probability that the medical insurance document data is a document of single disease type A is 0.6, the first probability that it is a document of single disease type B is 0.2, and the first probability that it is a document of single disease type C is 0.2; the second probability that it is a document of single disease type A is 0.15, the second probability that it is a document of single disease type B is 0.65, and the second probability that it is a document of single disease type C is 0.2. Because the first probability for single disease type A, 0.6, is the maximum value in the first probability set, single disease type A is selected as the first intermediate single disease type. Because the second probability for single disease type B, 0.65, is the maximum value in the second probability set, single disease type B is selected as the second intermediate single disease type. Since the second probability of 0.65 for single disease type B is greater than the first probability of 0.6 for single disease type A, single disease type B is finally selected as the target single disease type.
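Steps S50 to S52 can be sketched as follows. The function name `select_by_larger_maximum` and the dictionary representation are assumptions made for illustration. The description does not say what happens when the two maxima are equal; this sketch keeps the first intermediate single disease type in that case, which is likewise an assumption.

```python
from typing import Dict


def select_by_larger_maximum(first: Dict[str, float], second: Dict[str, float]) -> str:
    """Pick the disease type whose per-set maximum probability is larger."""
    # Step S50: disease type mapped by the maximum of the first probability set.
    first_intermediate = max(first, key=first.get)
    # Step S51: disease type mapped by the maximum of the second probability set.
    second_intermediate = max(second, key=second.get)
    # Step S52: compare the two maxima; ties fall back to the first set (assumption).
    if second[second_intermediate] > first[first_intermediate]:
        return second_intermediate
    return first_intermediate


# Values taken from the example in the description.
first_set = {"A": 0.6, "B": 0.2, "C": 0.2}
second_set = {"A": 0.15, "B": 0.65, "C": 0.2}
chosen = select_by_larger_maximum(first_set, second_set)
print(chosen)
```

Unlike the weighted summation embodiment, this rule lets a single confident source decide the result even when the other source disagrees.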

Further, in an embodiment, the single disease species identification method further includes the following steps:

comparing the target single disease type with the single disease type corresponding to the medical insurance document data;

Specifically, after the target single disease type is obtained, it may be compared with the single disease type corresponding to the medical insurance document data, that is, the single disease type under which the document was declared, to determine whether the document was declared under the correct single disease type. In this embodiment, if the target single disease type is the same as the single disease type corresponding to the medical insurance document data, it can be determined that the document was declared under the correct single disease type; if the target single disease type is different from the single disease type corresponding to the medical insurance document data, it can be determined that the document was declared under an incorrect single disease type, that is, the medical insurance document data is an abnormal document. In this embodiment, in order to distinguish normal documents from abnormal documents, when it is determined that the document was declared under an incorrect single disease type, the medical insurance document data may be marked as an abnormal document. Various marking manners are possible; for example, the text "this document is an abnormal document" may be added to the medical insurance document data, or the medical insurance document data may be marked by coloring, which is not limited in this embodiment.
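The comparison and marking described above can be sketched as follows. The `Document` record and its field names (`doc_id`, `declared_disease`, `abnormal`, `note`) are hypothetical and not taken from the patent; the sketch only illustrates marking by added text, which is one of the marking manners the description mentions.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical container for one medical insurance document."""
    doc_id: str
    declared_disease: str  # single disease type the document was declared under
    abnormal: bool = False
    note: str = ""


def mark_if_abnormal(doc: Document, target_disease: str) -> Document:
    """Mark the document when the identified and declared disease types differ."""
    if doc.declared_disease != target_disease:
        doc.abnormal = True
        doc.note = "this document is an abnormal document"
    return doc


normal = mark_if_abnormal(Document("doc-1", declared_disease="A"), target_disease="A")
suspect = mark_if_abnormal(Document("doc-2", declared_disease="A"), target_disease="B")
print(normal.abnormal, suspect.abnormal, suspect.note)
```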

According to the single disease type identification method provided by the embodiment of the invention, the similarity value between a first document item vector corresponding to the medical insurance document data and a second document item vector corresponding to each single disease type is calculated to determine the first probability that the medical insurance document data is a medical insurance document of each single disease type; the second probability that the medical insurance document data is a medical insurance document of each single disease type is then predicted by a pre-trained single disease type prediction model; and finally the target single disease type corresponding to the medical insurance document data is determined based on the obtained first probability and second probability. Because the first probability and the second probability each represent the probability that the medical insurance document data belongs to each single disease type, combining them allows the target single disease type corresponding to the medical insurance document data to be identified more accurately.

Referring to fig. 6, a schematic diagram of program modules of a single disease type identification apparatus 600 (hereinafter referred to as "identification apparatus 600") according to an embodiment of the invention is shown. The identification apparatus 600 may be applied to a computer device, which may be a mobile phone, a tablet personal computer, a laptop computer, a server, or another device having a data transmission function; the computer device is preferably a server. In this embodiment, the identification apparatus 600 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the present invention and the above-mentioned single disease species identification method. A program module, as referred to in the embodiments of the present invention, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing how the single disease identification method is executed in the storage medium. The functions of the program modules of the present embodiment are described below:

the obtaining module 601 is configured to obtain medical insurance document data, and preprocess the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data.

In an embodiment, the obtaining module 601 is further configured to extract a directory list included in the medical insurance document data;

Specifically, the catalog list (directory list) is list data of the drugs, diagnosis and treatment items, or medical service facility ranges that can be paid for by the medical insurance pooling fund. In this embodiment, when extracting each directory list from the medical insurance document data, the data may be searched for the names of drugs, diagnosis and treatment items, and medical service facility ranges payable by the medical insurance pooling fund, or for the codes corresponding to those drugs, items, and facility ranges; if any are found, each directory list contained in the data is extracted.

The obtaining module 601 is further configured to perform one-hot encoding on each directory list to obtain a first document item vector corresponding to the medical insurance document data.

One-hot encoding can be understood as follows: for each feature, if it has m possible values, then after one-hot encoding it becomes m binary features (for example, if a "grade" feature takes the values good, medium, and poor, the corresponding one-hot codes are 100, 010, and 001). These binary features are mutually exclusive, with only one active at a time. The resulting data may therefore be sparse.
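The one-hot encoding step can be sketched as follows. The catalog entries (`drug_x`, `drug_y`, `exam_z`) and the function names are hypothetical. The description does not state how the per-item one-hot codes are combined into the first document item vector; this sketch assumes an element-wise OR over the codes of all items found in the document, so that each position indicates whether the corresponding catalog entry is present.

```python
from typing import Iterable, List, Sequence


def one_hot(entry: str, catalog: Sequence[str]) -> List[int]:
    """Return the one-hot code of a single catalog entry (all zeros if unknown)."""
    return [1 if candidate == entry else 0 for candidate in catalog]


def first_document_item_vector(items: Iterable[str], catalog: Sequence[str]) -> List[int]:
    """Merge the one-hot codes of all extracted items by element-wise OR (assumption)."""
    vector = [0] * len(catalog)
    for entry in items:
        for position, bit in enumerate(one_hot(entry, catalog)):
            vector[position] |= bit
    return vector


# Hypothetical catalog of payable items, in a fixed order.
catalog = ["drug_x", "drug_y", "exam_z"]
single_code = one_hot("drug_y", catalog)
document_vector = first_document_item_vector(["drug_x", "exam_z"], catalog)
print(single_code, document_vector)
```

Because a real catalog contains many entries and a document lists only a few of them, most positions in the vector are zero, which is the sparsity noted above.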

The calculating module 602 is configured to calculate a similarity value between the first document item vector and a second document item vector in a pre-constructed single disease category set.

Further, in an embodiment, the calculating module 602 is further configured to obtain a single disease category prescription set corresponding to each single disease category.

The calculating module 602 is further configured to calculate the frequency of occurrence of each directory list included in the single disease prescription set.

Specifically, when calculating the frequency of occurrence of each directory list, the number of times each directory list occurs across all prescriptions of the single disease type may be counted first, and the frequency of occurrence may then be calculated from that number of occurrences and the total number of prescriptions contained in the single disease category prescription set.

The calculating module 602 is further configured to combine the frequencies into a second document item vector corresponding to the single disease category.

Specifically, after the occurrence frequency of each list is obtained, the frequency values may be arranged according to a preset rule, and then combined to obtain a second document item vector corresponding to the single disease category.

In this embodiment, when second document item vectors corresponding to other single disease types in the single disease type set are constructed, the method is also performed according to the above method, and details are not described in this embodiment.

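Note that the rule for arranging and combining the frequency (probability) values in the present embodiment is the same as the rule for combining the feature data described above, so that corresponding positions in the first and second document item vectors refer to the same directory list.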
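The construction of a second document item vector from a single disease category prescription set can be sketched as follows. The function name, the representation of a prescription as a set of catalog entries, and the sample data are assumptions for illustration; the frequency is computed, as described, from the number of prescriptions containing an entry and the total number of prescriptions.

```python
from typing import List, Sequence, Set


def second_document_item_vector(
    prescriptions: Sequence[Set[str]], catalog: Sequence[str]
) -> List[float]:
    """Frequency of each catalog entry across the prescriptions of one disease type."""
    if not prescriptions:
        raise ValueError("the prescription set must contain at least one prescription")
    total = len(prescriptions)
    # Frequencies follow the catalog order so that positions line up with the
    # first document item vector.
    return [
        sum(1 for prescription in prescriptions if entry in prescription) / total
        for entry in catalog
    ]


# Hypothetical prescription set for one single disease type.
catalog = ["drug_x", "drug_y", "exam_z"]
prescriptions = [
    {"drug_x", "exam_z"},
    {"drug_x"},
    {"drug_x", "drug_y"},
    {"exam_z"},
]
frequency_vector = second_document_item_vector(prescriptions, catalog)
print(frequency_vector)
```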

The normalization module 603 is configured to perform normalization processing on each calculated similarity value to obtain a first probability set of each single disease type corresponding to the medical insurance document data.
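The similarity calculation of module 602 and the normalization of module 603 can be sketched together as follows. This passage does not fix the similarity measure or the normalization formula, so both are assumptions here: cosine similarity is used as the similarity value, and each similarity is divided by the sum of all similarities so that the first probability set sums to 1. The uniform fallback for an all-zero case is also an assumption.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def first_probability_set(
    first_vector: Sequence[float], disease_vectors: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Similarity to every disease type, normalized so the values sum to 1."""
    similarities = {
        disease: cosine_similarity(first_vector, vector)
        for disease, vector in disease_vectors.items()
    }
    total = sum(similarities.values())
    if total == 0:
        # Assumption: with no similarity signal, fall back to a uniform distribution.
        return {disease: 1.0 / len(similarities) for disease in similarities}
    return {disease: value / total for disease, value in similarities.items()}


# Hypothetical vectors: one document and two single disease types.
probabilities = first_probability_set(
    [1, 0, 1],
    {"A": [0.75, 0.25, 0.5], "B": [0.0, 1.0, 0.0]},
)
print(probabilities)
```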

The input module 604 is configured to input the first document item vector into a pre-trained single disease category prediction model, so as to output a second probability set of each single disease category corresponding to the medical insurance document data through prediction of the single disease category prediction model.

Further, in an embodiment, in order to reduce the complexity of data and save space, the input module 604 is further configured to perform dimension reduction processing on the first document item vector by using a preset dimension reduction algorithm to obtain a compressed third document item vector; and the third document item vector is input into a pre-trained single disease type prediction model, so that a second probability of each single disease type corresponding to the medical insurance document data is predicted and output through the single disease type prediction model.

The determining module 605 is configured to determine a target single disease category corresponding to the medical insurance document data according to the first probability set and the second probability set.

Specifically, after the first probability set and the second probability set are obtained, the two probability sets may be comprehensively analyzed, and the target single disease type corresponding to the medical insurance document data may then be determined according to the analysis result.

In an embodiment, the determining module 605 is further configured to perform weighted summation on the first probability set and the second probability set through a preset weighted summation formula to obtain a third probability set, where each probability value in the third probability set is obtained by performing weighted summation on each probability value in the first probability set and each probability value in the second probability set in a one-to-one correspondence manner.

Specifically, the first probability set includes a plurality of first probability values P1, and the second probability set includes a plurality of second probability values P2. After each first probability value P1 and each second probability value P2 is obtained, a preset weighted summation formula, P3 = P1 × a + P2 × b, may be used to perform weighted summation on the first probability values P1 and the second probability values P2 in a one-to-one correspondence manner to obtain each probability value in the third probability set. Here, a is a preset first weight value, b is a preset second weight value, a + b = 1, the specific values of a and b may be set according to actual situations, and P3 is a probability value in the third probability set.

The determining module 605 is further configured to select the single disease species corresponding to the maximum probability value in the third probability set as the target single disease species.

In another embodiment, the determining module 605 is further configured to select the single disease type mapped by the maximum probability value in the first probability set as the first intermediate single disease type.

The determining module 605 is further configured to select the single disease species mapped by the maximum probability value in the second probability set as a second intermediate single disease species.

The determining module 605 is further configured to compare the probability values corresponding to the first intermediate single disease species and the second intermediate single disease species, and use the single disease species mapped with the larger probability value as the target single disease species.

Specifically, after the probability values corresponding to the first intermediate single disease species and the second intermediate single disease species are obtained, the two probability values may be compared, and the single disease species mapped by the larger probability value may then be used as the target single disease species.

In one example, assume that the first probability that the medical insurance document data is a document of single disease type A is 0.6, the first probability that it is a document of single disease type B is 0.2, and the first probability that it is a document of single disease type C is 0.2; the second probability that it is a document of single disease type A is 0.15, the second probability that it is a document of single disease type B is 0.65, and the second probability that it is a document of single disease type C is 0.2. Because the first probability for single disease type A, 0.6, is the maximum value in the first probability set, single disease type A is selected as the first intermediate single disease type. Because the second probability for single disease type B, 0.65, is the maximum value in the second probability set, single disease type B is selected as the second intermediate single disease type. Since the second probability of 0.65 for single disease type B is greater than the first probability of 0.6 for single disease type A, single disease type B is finally selected as the target single disease type.

Further, in an embodiment, the identification apparatus 600 further includes: a comparison module and a marking module.

The comparison module is configured to compare the target single disease species with the single disease species corresponding to the medical insurance document data.

The marking module is configured to mark the medical insurance document data as an abnormal document if the target single disease type is different from the single disease type corresponding to the medical insurance document data.

Specifically, after the target single disease type is obtained, it may be compared with the single disease type corresponding to the medical insurance document data, that is, the single disease type under which the document was declared, to determine whether the document was declared under the correct single disease type. In this embodiment, if the target single disease type is the same as the single disease type corresponding to the medical insurance document data, it can be determined that the document was declared under the correct single disease type; if the target single disease type is different from the single disease type corresponding to the medical insurance document data, it can be determined that the document was declared under an incorrect single disease type, that is, the medical insurance document data is an abnormal document. In this embodiment, in order to distinguish normal documents from abnormal documents, when it is determined that the document was declared under an incorrect single disease type, the medical insurance document data may be marked as an abnormal document. Various marking manners are possible; for example, the text "this document is an abnormal document" may be added to the medical insurance document data, or the medical insurance document data may be marked by coloring, which is not limited in this embodiment.

According to the single disease type identification device provided by the embodiment of the invention, the similarity value between a first document item vector corresponding to the medical insurance document data and a second document item vector corresponding to each single disease type is calculated to determine the first probability that the medical insurance document data is a medical insurance document of each single disease type; the second probability that the medical insurance document data is a medical insurance document of each single disease type is then predicted by a pre-trained single disease type prediction model; and finally the target single disease type corresponding to the medical insurance document data is determined based on the obtained first probability and second probability. Because the first probability and the second probability each represent the probability that the medical insurance document data belongs to each single disease type, combining them allows the target single disease type corresponding to the medical insurance document data to be identified more accurately.
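The division of the identification apparatus 600 into program modules can be sketched as a small pipeline. This is an illustrative structure only: the class name `IdentificationApparatus`, the use of plain callables for the modules, and the stub functions in the usage example are all assumptions, and a real prediction module would wrap the pre-trained single disease type prediction model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]
ProbabilitySet = Dict[str, float]


@dataclass
class IdentificationApparatus:
    """Hypothetical wiring of modules 601 to 605 as interchangeable callables."""
    obtain: Callable[[dict], Vector]                            # obtaining module 601
    similarity: Callable[[Vector], Dict[str, float]]            # calculating module 602
    normalize: Callable[[Dict[str, float]], ProbabilitySet]     # normalization module 603
    predict: Callable[[Vector], ProbabilitySet]                 # input module 604
    determine: Callable[[ProbabilitySet, ProbabilitySet], str]  # determining module 605

    def identify(self, document: dict) -> str:
        vector = self.obtain(document)
        first = self.normalize(self.similarity(vector))
        second = self.predict(vector)
        return self.determine(first, second)


# Stub modules standing in for the real ones (all values are made up).
apparatus = IdentificationApparatus(
    obtain=lambda doc: [1.0 if doc.get(key) else 0.0 for key in ("drug_x", "drug_y")],
    similarity=lambda v: {"A": v[0], "B": v[1]},
    normalize=lambda s: {d: x / (sum(s.values()) or 1.0) for d, x in s.items()},
    predict=lambda v: {"A": 0.7, "B": 0.3},
    determine=lambda p1, p2: max(p1, key=lambda d: 0.5 * p1[d] + 0.5 * p2[d]),
)
result = apparatus.identify({"drug_x": True})
print(result)
```

Keeping each module behind a callable makes it straightforward to swap the determining rule, for example between the weighted summation embodiment and the larger-maximum embodiment.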

Fig. 7 is a schematic diagram of a hardware architecture of a computer device 700 according to an embodiment of the present invention. In the present embodiment, the computer device 700 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance. As shown, the computer device 700 includes at least, but is not limited to, a memory 701, a processor 702, and a network interface 703 communicatively coupled to each other via a system bus. Wherein:

In this embodiment, the memory 701 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 701 may be an internal storage unit of the computer device 700, such as a hard disk or a memory of the computer device 700. In other embodiments, the memory 701 may also be an external storage device of the computer device 700, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 700. Of course, the memory 701 may also include both internal and external storage units of the computer device 700. In this embodiment, the memory 701 is generally used for storing the operating system and various application software installed on the computer device 700, such as the program code of the single disease identification apparatus 600. In addition, the memory 701 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 702 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 702 is generally configured to control the overall operation of the computer device 700. In this embodiment, the processor 702 is configured to run the program code stored in the memory 701 or process data, for example, run the single disease identification apparatus 600, so as to implement the single disease identification method in the foregoing embodiments.

The network interface 703 may include a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the computer device 700 and other electronic devices. For example, the network interface 703 is used to connect the computer device 700 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 700 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It is noted that fig. 7 only shows a computer device 700 with components 701-703, but it is to be understood that not all shown components need be implemented, and that more or fewer components may be implemented instead.

In this embodiment, the single disease identification apparatus 600 stored in the memory 701 may be further divided into one or more program modules, and the one or more program modules are stored in the memory 701 and executed by one or more processors (in this embodiment, the processor 702) to implement the single disease identification method of the present invention.

The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of this embodiment is used for storing the single disease identification apparatus 600, so that the single disease identification method of the present invention is implemented when the stored program is executed by a processor.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims

1. A single disease species identification method is characterized by comprising the following steps:

acquiring medical insurance document data, and preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data, wherein the medical insurance document data is document data required for applying medical insurance funds corresponding to the medical insurance type of a disease; the step of preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data comprises the following steps: extracting a directory list contained in the medical insurance document data;

inputting the first document item vector into a pre-trained single disease type prediction model so as to predict and output a second probability set of each single disease type corresponding to the medical insurance document data through the single disease type prediction model;

determining a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set;

constructing a second document item vector in the single disease category set comprises: acquiring a single disease category prescription set corresponding to each single disease category, wherein the single disease category prescription set comprises one or more prescriptions corresponding to a certain single disease category; calculating the frequency of occurrence of each directory list contained in the single disease category prescription set; and forming the frequencies into a second document item vector corresponding to the single disease category.

2. The method for identifying the single disease category according to claim 1, wherein the step of inputting the first document item vector into a pre-trained single disease category prediction model to output the second probability set of each single disease category corresponding to the medical insurance document data through the single disease category prediction model comprises the steps of:

3. The single disease category identification method of claim 2, wherein the preprocessing the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data comprises:

extracting a catalog list contained in the medical insurance document data, wherein the catalog list is medicine, diagnosis and treatment items or medical service facility range list data which can be paid by medical insurance overall funds;

4. The method for identifying the single disease category according to claim 1, wherein the determining the target single disease category corresponding to the medical insurance document data according to the first probability set and the second probability set comprises:

5. The method for identifying the single disease category according to claim 1, wherein the determining the target single disease category corresponding to the medical insurance document data according to the first probability set and the second probability set comprises:

6. The single disease species identification method according to any one of claims 1 to 5, further comprising:

7. An individual disease species identification device, comprising:

an obtaining module, configured to obtain medical insurance document data and preprocess the medical insurance document data to obtain a first document item vector corresponding to the medical insurance document data, wherein the medical insurance document data is document data required to claim the medical insurance fund corresponding to a declared disease type; wherein the obtaining module is further configured to: extract a directory list contained in the medical insurance document data;

a calculation module, configured to calculate the similarity value of the first document item vector and a second document item vector in a pre-constructed single disease type set, wherein the single disease type set comprises multiple single disease types and the second document item vectors mapped by the single disease types, and the frequencies are formed into the second document item vector corresponding to the single disease type;

a determining module, configured to determine a target single disease type corresponding to the medical insurance document data according to the first probability set and the second probability set;

the calculation module is further configured to obtain a single disease category prescription set corresponding to each single disease category, where the single disease category prescription set includes one or more prescriptions corresponding to a certain single disease category; calculate the frequency of occurrence of each directory list contained in the single disease category prescription set; and form the frequencies into a second document item vector corresponding to the single disease category.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the single disease species identification method of any one of claims 1 to 6 when executing the computer program.

9. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor to cause the at least one processor to perform the steps of the single disease species identification method according to any one of claims 1 to 6.