CN116561183B - Intelligent information retrieval system for massive medical insurance data - Google Patents

Intelligent information retrieval system for massive medical insurance data

Info

Publication number
CN116561183B
Authority
CN
China
Prior art keywords
data
insurance
insurance data
retrieval
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310833085.0A
Other languages
Chinese (zh)
Other versions
CN116561183A (en
Inventor
刘利锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Universal Medical Rescue Co ltd
Original Assignee
Beijing Universal Medical Rescue Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Universal Medical Rescue Co ltd
Priority to CN202310833085.0A
Publication of CN116561183A
Application granted
Publication of CN116561183B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/27 Regression, e.g. linear or logistic regression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to the technical field of electronic digital data processing, and in particular to an intelligent information retrieval system for massive medical insurance data. The system obtains the risk rate and the retrieval probability of insurance data from the relationship between the numbers of insurance records associated with different past cases at different ages and for different sexes, and encodes and compresses the insurance data according to the retrieval probability. By linking the different features in the insurance data, the invention improves the stability and accuracy of the logistic regression model's risk rate assessment of the insurance data. Encoding and compressing the insurance data according to the retrieval probability avoids assigning overly long codes to insurance data with a high retrieval probability, which improves the efficiency and speed of data retrieval.

Description

Intelligent information retrieval system for massive medical insurance data
Technical Field
The invention relates to the technical field of electronic digital data processing, in particular to an intelligent information retrieval system for massive medical insurance data.
Background
With social development and an aging population, the medical insurance industry plays an increasingly important role. Data is at the core of medical insurance, and processing and managing that data is a critical task for insurance companies. Conventional data management methods, however, struggle to process and retrieve very large volumes of insurance data efficiently: processing consumes substantial time and resources, redundant and duplicated data arise easily, and retrieval efficiency is low. A new intelligent information retrieval system for massive medical insurance data is therefore needed, one that uses machine learning to analyze and encode the data, achieving compression and structured encoding while improving retrieval efficiency and accuracy to meet the needs of the medical insurance industry.
At present, medical insurance information is retrieved with existing character matching techniques, which have the following drawbacks: 1. Wasted storage space. In a conventional relational database, every record stores the value of each attribute, so a large amount of redundant data exists. 2. Low retrieval efficiency. As the data volume grows, the inefficiency of conventional string matching and fuzzy queries becomes increasingly pronounced.
Disclosure of Invention
The invention provides an intelligent information retrieval system for massive medical insurance data, which aims to solve the existing problems.
The intelligent information retrieval system for massive medical insurance data of the invention adopts the following technical scheme:
The invention provides an intelligent information retrieval system for massive medical insurance data, which comprises the following modules:
a data preparation module: used for acquiring insurance data from a medical insurance information database to obtain a first data set and a second data set;
a data dividing module: used for dividing the first data set to obtain a training set and a verification set;
a probability analysis module: used for acquiring a plurality of past cases in the first data set, and obtaining the correlation factor between each past case and age from the number of people of each age who have that past case; obtaining the contact parameter between the past case and age in combination with the correlation factor; obtaining the characteristic parameter of the past case in combination with the contact parameter; obtaining the risk rate of the insurance data in the second data set from the characteristic parameters; and obtaining the retrieval probability of the insurance data in combination with the risk rate;
a data storage module: used for obtaining primary encoded data according to the magnitude of the retrieval probability, and encoding, compressing and storing the primary encoded data, thereby realizing fast retrieval of the insurance data.
Further, the first data set and the second data set are acquired by the following steps:
recording the set of all insurance data in the medical insurance database as the insurance data set;
recording the set of insurance data in the insurance data set for which a medical insurance claim has already been paid as the first data set;
and recording the set of all insurance data in the insurance data set for medical insurance currently in use by applicants as the second data set.
Further, the training set and the verification set are obtained by the following steps:
first, clustering all insurance data in the first data set with the K-means++ algorithm, using distances in three dimensions (the age, sex, and past cases of the corresponding applicant), to obtain a plurality of clusters;
then, shuffling all clusters with a random shuffling algorithm;
finally, dividing each cluster according to a preset ratio to obtain the training set and the verification set.
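The clustering step can be illustrated with a minimal pure-Python sketch of K-means with K-means++ seeding. It is an illustration under stated assumptions, not the patented implementation: the numeric encoding of each record as an `(age, sex, past_case)` tuple, the function names, and the iteration count are all hypothetical, and the source does not say how sex and past cases are mapped to coordinates or how many clusters are used.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with existing centers; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd iterations starting from K-means++ centers."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical encoded records: (scaled age, sex as 0/1, past-case code).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 1.0, 1.0), (5.1, 1.0, 1.0)]
labels, centers = kmeans(points, k=2)
```

In practice the three dimensions would need comparable scales before distances are meaningful; that scaling choice is left open here because the source does not specify it.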
Further, the probability analysis module comprises the following units:
a multi-data set unit: used for extracting the case names of the past cases of the applicants corresponding to the insurance data in the training set, obtaining the set of all past cases, and recording it as the multi-element data set;
a contact parameter unit: used for obtaining the correlation factor between a past case and age from the differences between the number of people with that past case at each age and the mean number of people across all past cases; obtaining the reimbursement probability (odds) of the past case at different ages; and obtaining the contact parameter between the past case and age by combining the differences between the numbers of people with the past case at different ages with the correlation factor;
a characteristic parameter unit: used for obtaining the characteristic parameter of the past case from the reimbursement probability and the contact parameter;
a risk rate unit: used for training a logistic regression model with the characteristic parameters of all past cases in the training set as independent variables, and optimizing the trained model with the characteristic parameters of all past cases in the verification set, to obtain a logistic regression model for assessing the risk rate of insurance data; and for acquiring the characteristic parameters of all past cases in the second data set, feeding them to the logistic regression model as input, and outputting the risk rates of the insurance data corresponding to those past cases;
a retrieval probability unit: used for acquiring the times at which the insurance data were retrieved and the update time of the medical insurance information database, and obtaining the retrieval probability of the insurance data in the second data set in combination with the risk rate.
Further, the correlation factor is obtained from a formula whose terms are defined as follows:
the correlation factor between the i-th past case and age; the total number of people of age a in the training set who have the i-th past case, where a lies in the age interval of the applicants corresponding to all insurance data in the training set; the maximum age of the applicants corresponding to the insurance data in the training set; the mean number of people across all past cases; and the hyperbolic tangent function.
Further, the contact parameter is obtained as follows:
first, for each age and each sex in the first data set, acquiring the number of people having any given past case who received a claim payout;
then, for each of two age intervals in the first data set, recording as the reimbursement probability (odds) the ratio between the total number of people with the i-th past case who received a payout and the total number of people with the i-th past case;
finally, the contact parameter is computed from a formula whose terms are defined as follows:
the contact parameter between the i-th past case and age; the total number of people of age a in the training set who have the i-th past case; the maximum age of the applicants corresponding to the insurance data in the training set; the reimbursement probability of the i-th past case within the first age interval; the reimbursement probability of the i-th past case within the second age interval; and the correlation factor between the i-th past case and age.
Further, the characteristic parameters are obtained by the following steps:
first, acquiring the numbers of reimbursements for men and for women who have any given past case in the first data set, and recording the ratio of each number to the total number of reimbursements in the first data set as the male reimbursement probability and the female reimbursement probability, respectively;
then, recording the product of the contact parameter and the sum of 1 and the male (or female) reimbursement probability as the characteristic parameter of the past case.
Further, the retrieval probability is obtained from a formula whose terms are defined as follows:
the retrieval probability of the j-th insurance data in the second data set; the risk rate of the j-th insurance data in the second data set; the time of the k-th retrieval of the j-th insurance data in the second data set; the time at which the j-th insurance data in the second data set was last retrieved; the last update time of the medical insurance information database; and the natural constant e.
Further, obtaining primary encoded data according to the magnitude of the retrieval probability, and encoding, compressing and storing the primary encoded data to realize fast retrieval of the insurance data, comprises the following specific steps:
first, applying linear normalization to the retrieval probabilities of all insurance data in the second data set to obtain normalized retrieval probabilities, and presetting a retrieval probability threshold empirically;
then, recording insurance data whose normalized retrieval probability is greater than the threshold as primary encoded data, and insurance data whose normalized retrieval probability is smaller than the threshold as non-primary encoded data; identifying the repeated characters in all primary encoded data by character statistics, and encoding the primary encoded data and the repeated characters with the short codes of a variable-length code; and encoding the non-repeated characters and the non-primary encoded data with the long codes of the variable-length code, to obtain the encoded compressed data corresponding to all insurance data;
finally, storing all the encoded compressed data in the medical insurance information database to realize fast retrieval of the insurance data.
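The thresholding and code-length assignment described above can be sketched as follows. This is a minimal illustration, not the patented encoder: the source does not name a specific variable-length code, so Huffman coding is swapped in here as one familiar choice. The threshold value `0.5`, the sample records, the function names, and the rule of giving every character outside the repeated primary characters the minimum weight of 1 are all assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import count


def min_max_normalize(values):
    """Linear (min-max) normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_by_threshold(records, probabilities, threshold=0.5):
    """Records above the threshold are 'primary'; the rest are not.
    The source leaves the equal-to-threshold case open; here it is non-primary."""
    normalized = min_max_normalize(probabilities)
    primary = [r for r, p in zip(records, normalized) if p > threshold]
    other = [r for r, p in zip(records, normalized) if p <= threshold]
    return primary, other


def huffman_code(weights):
    """Build a prefix code in which heavier symbols get shorter codes."""
    if not weights:
        return {}
    tie = count()  # unique tie-breaker so dicts are never compared
    heap = [(w, next(tie), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


def build_code(primary, other):
    """Weight characters by how often they repeat in primary records;
    characters seen only in non-primary records get the minimum weight."""
    weights = Counter("".join(primary))
    for ch in "".join(other):
        weights.setdefault(ch, 1)
    return huffman_code(weights)


def encode(record, code):
    return "".join(code[ch] for ch in record)


records = ["AAB", "AAC", "XYZ"]          # hypothetical record strings
probabilities = [0.9, 0.8, 0.1]          # hypothetical retrieval probabilities
primary, other = split_by_threshold(records, probabilities)
code = build_code(primary, other)
```

With these inputs the character `A`, which repeats across the primary records, receives the shortest code, so the primary records encode to fewer bits than the rarely retrieved one.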
The technical scheme of the invention has the beneficial effects that:
(1) Compared with machine learning on a single feature, the invention combines features by linking the relationships between different features in the insurance data. The results of machine-learning risk rate analysis are therefore more stable, the risk rate assessment of the insurance data is more accurate, resistance to noise is stronger, and robustness to abnormal data is higher.
(2) The retrieval probability of each piece of insurance data is obtained from its risk rate and from statistics on how often it has been retrieved, and variable-length codes of different lengths are assigned according to the retrieval probability. Insurance data with a higher retrieval probability thus has a sufficiently small encoded size and is retrieved faster, which improves retrieval efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block flow diagram of an intelligent information retrieval system for massive medical insurance data according to the present invention;
FIG. 2 is a schematic diagram of the refined module structure of the probability analysis module.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following detailed description refers to the specific implementation, structure, characteristics and effects of an intelligent information retrieval system for mass medical insurance data according to the invention by combining the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the intelligent information retrieval system for massive medical insurance data provided by the invention with reference to the accompanying drawings.
Referring to FIG. 1, which shows a block flow diagram of an intelligent information retrieval system for massive medical insurance data according to an embodiment of the present invention, the system includes the following modules:
A data preparation module: used for acquiring insurance data from the medical insurance information database and obtaining a first data set and a second data set.
In this embodiment, when medical insurance information is retrieved, the risk rate is calculated from the characteristics of the medical insurance so that the insurance data can later be encoded and compressed more effectively. The insurance data and the corresponding data sets therefore need to be acquired first:
All insurance data in the medical insurance database is acquired. Each piece of insurance data contains the applicant's personal information and the corresponding insurance information, including the applicant's age, sex, and past cases (prior medical conditions). The set of all insurance data is recorded as the insurance data set.
The insurance data set is divided into two parts, a first data set and a second data set, as follows:
the set of insurance data in the insurance data set for which a medical insurance claim has already been paid is recorded as the first data set, which contains M pieces of insurance data;
the set of all insurance data in the insurance data set for medical insurance currently in use by applicants is recorded as the second data set, and the number of pieces of insurance data it contains is likewise recorded.
In addition, for each age and each sex in the first data set, the number of people having any given past case who received a claim payout is acquired.
At this point, the data sets corresponding to all insurance data have been obtained.
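The data preparation step can be pictured with a short sketch. It assumes a hypothetical record layout: the field names `claim_paid` and `policy_active`, the `InsuranceRecord` class, and the sample values are invented for illustration and do not come from the source, which only states that each record carries the applicant's age, sex, and past cases.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InsuranceRecord:
    age: int
    sex: str             # "M" or "F"
    past_case: str       # case name of a prior condition
    claim_paid: bool     # hypothetical flag: a claim has been paid
    policy_active: bool  # hypothetical flag: the insurance is in use


def split_data_sets(records):
    """First data set: records with a paid claim.
    Second data set: records for insurance currently in use.
    A record with both flags lands in both sets; the source does not
    say how such overlap is handled."""
    first = [r for r in records if r.claim_paid]
    second = [r for r in records if r.policy_active]
    return first, second


def payout_counts(first_data_set):
    """Number of paid claims per (past case, age, sex) combination."""
    return Counter((r.past_case, r.age, r.sex) for r in first_data_set)


records = [
    InsuranceRecord(45, "M", "hypertension", True, False),
    InsuranceRecord(45, "M", "hypertension", True, True),
    InsuranceRecord(30, "F", "asthma", False, True),
]
first, second = split_data_sets(records)
counts = payout_counts(first)
```

The per-combination counts are the raw material for the age- and sex-related quantities used by the later units.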
A data dividing module: used for dividing the first data set to obtain a training set and a verification set for training the machine learning model.
Insurance data in the medical insurance information database is typically retrieved for the following reasons:
(1) The insurance data has been retrieved many times recently, which indicates a high recent demand for consulting it;
(2) The medical insurance carries a relatively high risk; because a claim may be required, insurance data with a high risk rate is more likely to be retrieved.
This embodiment therefore uses machine learning to analyze the risk rate of the insurance data of different applicants, and derives the probability that a piece of insurance data will be retrieved, recorded as the retrieval probability, from its risk rate and its recent retrieval count. Variable-length coding is then applied to the insurance data according to the similarity among insurance data with higher retrieval probability and their difference from insurance data with lower retrieval probability, so that data with a higher retrieval probability receives as short a code as possible and can be found quickly at retrieval time, which improves retrieval efficiency.
In addition, when machine learning is used to analyze the risk rate of applicants' insurance data, overfitting often occurs: the model generalizes poorly and cannot accurately estimate the risk rate of new insurance data. The subsequent retrieval probability is then misjudged, which alters the variable-length coding rule, so that insurance data whose true retrieval probability is high is assigned a longer code and takes longer to retrieve.
The risk rate refers to a risk of disease occurrence of the applicant or a risk that the insurance company needs to pay for the applicant.
The first data set therefore needs to be divided into a training set and a verification set, with the verification set used to validate the model learned from the training set and so improve the generalization ability of the machine learning model. The specific division method is as follows:
First, features of the insurance data are extracted. Because this embodiment uses machine learning to analyze the risk rate of the insurance data, the extracted features must be related to the risk rate. Among the factors that commonly affect the risk rate of medical insurance, the invention uses the age, sex, and past cases of the applicant as the features of the corresponding insurance data.
Then, all insurance data in the first data set is clustered with the K-means++ algorithm, using distances in three dimensions (the age, sex, and past cases of the corresponding applicant), to obtain a plurality of clusters.
Finally, this embodiment adopts a conventional division ratio for the numbers of insurance records in the training set and the verification set, so that the two sets together contain the M records of the first data set. The training set and the verification set are divided as follows: all clusters are shuffled with a random shuffling algorithm, and each cluster is then divided according to the division ratio to obtain the training set and the verification set.
At this point, the training set and the verification set have been obtained by dividing the first data set of the acquired insurance data.
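The shuffle-then-split step can be sketched as below. Cluster labels are taken as given (for example, from a K-means++ run), and the division ratio is passed in as a parameter because its value is not stated in this text; the `0.8` used in the example, the function name, and the fixed seed are hypothetical choices.

```python
import random
from collections import defaultdict


def split_by_cluster(records, cluster_labels, train_fraction, seed=0):
    """Shuffle the members of every cluster, then cut each cluster at the
    same fraction so both sets keep the cluster proportions."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for record, label in zip(records, cluster_labels):
        clusters[label].append(record)

    train, validation = [], []
    for label in sorted(clusters):
        members = clusters[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        validation.extend(members[cut:])
    return train, validation


# Ten hypothetical record ids in two clusters of five.
records = list(range(10))
cluster_labels = [0] * 5 + [1] * 5
train, validation = split_by_cluster(records, cluster_labels, train_fraction=0.8)
```

Splitting inside each cluster, rather than over the pooled data, keeps every age, sex, and past-case grouping represented in both sets, which is the point of clustering before dividing.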
A probability analysis module: used for training and validating the machine learning model, acquiring the risk rate of the insurance data, and from it the retrieval probability of the insurance data.
Specifically, as shown in FIG. 2, the refined module structure of the probability analysis module includes a multi-data set unit, a contact parameter unit, a characteristic parameter unit, a risk rate unit, and a retrieval probability unit.
A multi-data set unit: to speed up retrieval of the insurance data, this embodiment analyzes the relationships among the basic medical insurance features of all insurance data in the training set, assesses the risk rate of the insurance data with a machine learning model, and then uses the risk rate to obtain the retrieval probability of each piece of insurance data.
In selecting the machine learning model, this embodiment chooses a logistic regression model because assessing the risk rate of medical insurance is essentially a binary classification problem. Existing logistic regression approaches to insurance risk assessment usually rely on a single low-level feature, which makes the result insufficiently accurate; this embodiment therefore uses multi-feature fusion to make the risk rate assessment more accurate.
Among the features of medical insurance, the past case is a multi-valued parameter and directly influences the risk rate of the insurance data. This embodiment therefore constructs contact parameters linking past cases with age and sex, and uses them as independent variables to build the logistic regression model.
The case names of the past cases of the applicants corresponding to the insurance data in the training set are extracted to obtain the multi-element data set of past cases,
whose i-th element is the case name of the i-th past case in the training set, with i running from 1 to the total number of case names of all past cases in the training set.
A contact parameter unit: used for analyzing the relationships among the insurance data and obtaining the contact parameters.
First, the correlation factor between each past case and age in the first data set is obtained from the relationship between that past case and the different ages. The correlation factor is computed from a formula whose terms are defined as follows:
the correlation factor between the i-th past case and age; the total number of people of age a in the training set who have the i-th past case, where a lies in the age interval of the applicants corresponding to all insurance data in the training set; the maximum age of the applicants corresponding to the insurance data in the training set; the mean number of people across all past cases; and the hyperbolic tangent function.
Then, for each of two age intervals in the first data set, the ratio between the total number of people with the i-th past case who received a payout and the total number of people with the i-th past case is recorded as the reimbursement probability (odds). Combining the correlation factor, the contact parameter between the past case and age is computed from a formula whose terms are defined as follows:
the contact parameter between the i-th past case and age; the total number of people of age a in the training set who have the i-th past case, where a lies in the age interval of the applicants corresponding to all insurance data in the training set; the maximum age of the applicants corresponding to the insurance data in the training set; the reimbursement probability of the i-th past case within the first age interval; the reimbursement probability of the i-th past case within the second age interval; and the correlation factor between the i-th past case and age.
When the contact parameter between the i-th past case and age is acquired, this embodiment introduces three logical relationships: "age is unrelated to the past case", "the incidence relationship of the past case at younger ages relative to older ages", and "the incidence relationship of the past case at older ages relative to younger ages".
The logical relationship "age is unrelated to the past case" is represented by the correlation factor, which acts as a constraint between age and the incidence of the i-th past case. The factor is computed by normalizing the variance of the number of occurrences of each past case across age stages. The smaller the variance, the more uniform the incidence of that past case across age stages, and so the weaker the correlation between the past case and age;
the larger the variance, the more the incidence of the past case is concentrated in a particular age group, that is, the stronger the correlation between the i-th past case and age.
Two different values are then used to represent, for the i-th past case, the "incidence relationship at younger ages relative to older ages" and the "incidence relationship at older ages relative to younger ages".
Taking the first of these two relationships as an example: if the past case has a higher incidence at younger ages than at older ages, then the larger the correlation factor, the stronger the relationship between that past case and age. The reimbursement probability is then multiplied in to obtain the contact parameter between the past case and age. The larger the contact parameter, the greater the probability of a claim payout for the i-th past case at that age stage, and the greater the probability that the data will be retrieved from the database.
In the subsequent machine learning training and verification, the corresponding logical relationship is selected according to the age of the applicant corresponding to each insurance data.
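Two ingredients of the contact parameter described above can be sketched in code. The exact formulas are not reproduced in this text, so this is a sketch under stated assumptions rather than the patented computation: the correlation factor is taken to be the hyperbolic tangent of the mean squared deviation of the per-age counts from a supplied mean, and the reimbursement probability is the ratio of paid claims to people with the past case inside one age interval. Function names, argument layouts, and the sample numbers are hypothetical.

```python
import math


def correlation_factor(counts_by_age, mean_count):
    """tanh of the variance of per-age counts around `mean_count`.
    Returns 0.0 for perfectly uniform counts and approaches 1.0 as the
    counts concentrate in particular ages."""
    deviations = [(c - mean_count) ** 2 for c in counts_by_age]
    variance = sum(deviations) / len(counts_by_age)
    return math.tanh(variance)


def reimbursement_probability(paid_by_age, total_by_age, ages):
    """Share of people with the past case, within the given ages,
    who received a claim payout."""
    paid = sum(paid_by_age.get(a, 0) for a in ages)
    total = sum(total_by_age.get(a, 0) for a in ages)
    return paid / total if total else 0.0


# Hypothetical counts for one past case across three ages.
uniform = correlation_factor([2, 2, 2], mean_count=2.0)
peaked = correlation_factor([0, 6, 0], mean_count=2.0)

paid_by_age = {30: 1, 31: 1, 60: 4}
total_by_age = {30: 2, 31: 2, 60: 5}
younger = reimbursement_probability(paid_by_age, total_by_age, range(30, 32))
older = reimbursement_probability(paid_by_age, total_by_age, range(60, 61))
```

Because tanh saturates quickly, raw counts in the hundreds would all map to values near 1; a working system would presumably rescale the counts first, which this sketch leaves out.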
Characteristic parameter unit: and obtaining characteristic parameters by combining the contact parameters, and training the logistic regression model by combining the characteristic parameters to obtain the logistic regression model corresponding to the odds ratio of the insurance data.
Firstly, respectively acquiring the number of reimbursements of men and women when any past case exists in a first data set, and respectively marking the ratio of the number of reimbursements to all the number of reimbursements in the first data set as male reimbursement probability and female reimbursement probability;
then, sex connection is carried out by utilizing the connection parameters of the age and the previous case to obtain the characteristic parameters of the previous case, and the specific obtaining method comprises the following steps:
wherein ,indicate->Characteristic parameters of the previous cases; />Indicate->The connection parameters between the past cases and the ages; />Indicate->Male odds of past cases; />Indicate->Probability of female reimbursement for past cases.
For applicants of different sexes who have the given past case, the greater the reimbursement probability, the greater the likelihood that the corresponding insurance data will be retrieved.
For example, hypertension is a common condition, but the likelihood of an insurance claim differs across sexes and ages; that is, the higher the risk, the higher the likelihood of a claim, and the higher the probability that the corresponding insurance data will be retrieved.
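The sex-specific step can be sketched as follows. This is a minimal illustration, assuming the characteristic parameter is the contact parameter multiplied by one plus the reimbursement probability of the applicant's sex, as claim 6 states in words. The record layout (dictionaries with `sex` and `cases` keys) and the function names are hypothetical, and every record is assumed to belong to the first data set, i.e. to be already reimbursed.

```python
from collections import Counter


def reimbursement_probability_by_sex(first_data_set, case):
    """Share of all reimbursed records that are men/women having `case`.

    `first_data_set` is assumed to contain only reimbursed insurance data,
    each record being a dict with hypothetical keys 'sex' and 'cases'.
    """
    total = len(first_data_set)
    if total == 0:
        return {"male": 0.0, "female": 0.0}
    counts = Counter(r["sex"] for r in first_data_set if case in r["cases"])
    return {sex: counts.get(sex, 0) / total for sex in ("male", "female")}


def characteristic_parameter(contact_parameter, sex_probability):
    """(1 + sex-specific reimbursement probability) * contact parameter."""
    return (1.0 + sex_probability) * contact_parameter


# Illustrative records only; not data from the patent.
records = [
    {"sex": "male", "cases": {"hypertension"}},
    {"sex": "female", "cases": {"hypertension"}},
    {"sex": "male", "cases": {"diabetes"}},
    {"sex": "male", "cases": {"hypertension"}},
]
probs = reimbursement_probability_by_sex(records, "hypertension")
feature_male = characteristic_parameter(2.0, probs["male"])
```

Adding 1 before multiplying keeps the contact parameter as a floor: a past case with a zero sex-specific reimbursement probability still contributes its age-related contact parameter.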
Finally, a logistic regression model is trained with the characteristic parameters of all past cases in the training set as independent variables, and the trained model is optimized using the characteristic parameters of all past cases in the verification set, yielding the logistic regression model for risk rate assessment of the insurance data.
It should be noted that the training and optimization of a logistic regression model are prior art and are not described further in this embodiment.
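For readers unfamiliar with that prior art, a minimal sketch of logistic regression trained by batch gradient descent is given below. It is illustrative only: the text does not state the dependent variable, so the binary label (whether a claim occurred) is an assumption, as are the learning rate, the epoch count, and all function names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    X: list of feature vectors (characteristic parameters).
    y: list of 0/1 labels (assumed here: 1 if a claim occurred).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_risk_rate(w, b, x):
    """Model output in (0, 1), interpreted as the risk rate."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def accuracy(w, b, X, y):
    """Fraction of verification samples classified correctly at 0.5."""
    hits = sum((predict_risk_rate(w, b, xi) >= 0.5) == bool(yi)
               for xi, yi in zip(X, y))
    return hits / len(X)


# Illustrative one-dimensional features; not data from the patent.
X_train = [[0.1], [0.2], [0.3], [1.5], [1.8], [2.0]]
y_train = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(X_train, y_train)
val_acc = accuracy(w, b, [[0.15], [1.9]], [0, 1])
```

In practice the verification set would be used to choose hyperparameters such as the learning rate or a regularization strength; the accuracy check here only stands in for that optimization step.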
Risk rate unit: the characteristic parameters of all past cases in the second data set are acquired and used as the input of the logistic regression model; the output is the risk rate of each past case, which is taken as the risk rate of the insurance data having that past case and recorded as the risk rate of the corresponding insurance data in the second data set.
Search probability unit: for obtaining a retrieval probability of the insurance data in the second data set.
The time of each retrieval of each insurance data item in the medical insurance information database and the time at which the medical insurance information database was updated are acquired; the retrieval probability of the insurance data in the second data set is then obtained by combining these times with the risk rate of the insurance data, and the specific obtaining method is as follows:
where the terms denote, respectively: the retrieval probability of the given insurance data in the second data set; the risk rate of that insurance data; the time of a given (indexed) retrieval of that insurance data; the time at which that insurance data was last retrieved; the last update time of the medical insurance information database; and the natural constant e.
The search probability is obtained by the interaction of two parts:
(1) The first part is the risk rate of the given insurance data in the second data set, as assessed by the machine learning model: the greater the risk rate, the greater the possibility that the insurance data will require reimbursement, and hence the greater the possibility that it is retrieved, i.e., the greater the retrieval probability;
(2) The second part is the retrieval history of that insurance data in the second data set: the more frequently it has been retrieved, and the closer its retrievals are to the last update time of the medical insurance information database, the greater the likelihood that it will be retrieved again.
So far, the retrieval probability of the insurance data in the second data set is obtained.
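The two-part interaction can be illustrated with a short sketch. The scoring form used here is an assumption chosen only to reproduce the qualitative behaviour described above (higher risk rate, more retrievals, and retrievals closer to the last database update all raise the score); it is not the patent's formula, and the function name and the `time_scale` parameter are hypothetical.

```python
import math


def retrieval_score(risk_rate, retrieval_times, last_update_time, time_scale=1.0):
    """Unnormalized retrieval score under an assumed exponential-decay form.

    Each past retrieval contributes exp(-|gap| / time_scale), where gap is
    the distance between that retrieval and the last database update, so
    both frequency and closeness to the update raise the sum. The sum is
    then weighted by the risk rate from the logistic regression model.
    """
    closeness = sum(
        math.exp(-abs(last_update_time - t) / time_scale)
        for t in retrieval_times
    )
    return risk_rate * closeness


# Illustrative timestamps in arbitrary units; not data from the patent.
high_risk = retrieval_score(0.8, [9.0, 10.0], last_update_time=10.0)
low_risk = retrieval_score(0.4, [9.0, 10.0], last_update_time=10.0)
```

A negative exponential is a convenient choice here because it is monotonically decreasing in the time gap and bounded by 1 per retrieval, but any decreasing, bounded function would express the same relationship.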
Data storage module: used to intelligently encode the insurance data and store the encoded compressed data, thereby realizing intelligent information retrieval of the insurance data.
The insurance data is classified according to the retrieval probability, and the insurance data with higher retrieval probability is variable-length coded; the specific method is as follows:
firstly, linear normalization is applied to the retrieval probabilities of all insurance data in the second data set to obtain normalized retrieval probabilities, and a retrieval probability threshold is preset empirically, with its empirical value determined from the number of insurance data items in the second data set;
then, insurance data whose normalized retrieval probability is larger than the retrieval probability threshold is recorded as primary coding data, and insurance data whose normalized retrieval probability is smaller than the retrieval probability threshold is recorded as non-primary coding data; the repeated characters in all primary coding data are obtained by a character statistics method, and the primary coding data and the repeated characters are encoded with the short codes of a variable-length code; the non-repeated characters and the non-primary coding data are encoded with the long codes of the variable-length code, yielding the encoded compressed data corresponding to all insurance data;
finally, all the encoded compressed data is stored in the medical insurance information database, so that staff can conveniently retrieve the insurance data;
it should be noted that the character statistics method and variable-length coding are both prior art and are not described further in this embodiment.
All insurance data is thus encoded and compressed in combination with the retrieval probability. Because the insurance data with higher retrieval probability is variable-length coded according to the frequency of occurrence of its characters, its code length is shortened after encoding and compression, the overall data volume is smaller, and a retrieval can be completed in a shorter time by statement matching alone.
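The storage steps above can be sketched as follows. Huffman coding is swapped in here as one well-known variable-length code; the text does not name a specific code. The threshold value, the function names, and the rule that characters seen only in non-primary data receive a count of 1 (so that they end up with long codes) are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def min_max_normalize(values):
    """Linear normalization of retrieval probabilities to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_by_threshold(items, probabilities, threshold):
    """Primary coding data: normalized retrieval probability > threshold."""
    norm = min_max_normalize(probabilities)
    primary = [it for it, p in zip(items, norm) if p > threshold]
    other = [it for it, p in zip(items, norm) if p <= threshold]
    return primary, other


def huffman_code(freq):
    """Map each symbol to a bit string; frequent symbols get shorter codes."""
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


def build_code(primary, other):
    """Count characters in primary data; unseen characters get weight 1."""
    freq = Counter("".join(primary))
    for ch in "".join(other):
        freq.setdefault(ch, 1)
    return huffman_code(freq)


def encode(text, code):
    return "".join(code[ch] for ch in text)


# Illustrative strings and probabilities; not data from the patent.
primary, other = split_by_threshold(["aaab", "cd"], [0.9, 0.1], threshold=0.5)
code = build_code(primary, other)
compressed = encode("aaab", code)
```

Because character weights come only from the primary coding data, characters repeated in frequently retrieved records receive the shortest codes, while non-repeated characters and characters from rarely retrieved records fall to the long end of the code table.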
The model used in this embodiment serves only to express a negative correlation and to constrain the model output to lie within a fixed interval; other models serving the same purpose may be substituted in a specific implementation. This embodiment describes that model only as an example, without specific limitation, and its argument refers to the input of the model.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (8)

1. An intelligent information retrieval system for massive medical insurance data is characterized by comprising the following modules:
a data preparation module: acquires insurance data from a medical insurance information database to obtain a first data set and a second data set;
a data dividing module: divides the first data set to obtain a training set and a verification set;
a probability analysis module: acquires a plurality of past cases in the first data set, and obtains correlation factors between the past cases and age according to the number of people of different ages having any given past case; obtains the contact parameters between the past cases and age by combining the correlation factors; further combines the contact parameters to obtain the characteristic parameters of the past cases; obtains the risk rate of the insurance data in the second data set according to the characteristic parameters, and obtains the retrieval probability of the insurance data by combining the risk rate;
a data storage module: obtains primary coding data according to the size of the retrieval probability, and encodes, compresses, and stores the primary coding data, thereby realizing rapid retrieval of the insurance data;
the probability analysis module comprises the following units:
a multi-data set unit: extracts the case names of the past cases of the applicants corresponding to the different insurance data in the training set, obtains the set formed by all the past cases, and records it as a multi-element data set;
a contact parameter unit: obtains the correlation factor between the past cases and age according to the difference between the number of people of different ages having a past case and the mean number of people over all past cases; obtains the reimbursement probabilities of the past cases at different ages, and obtains the contact parameters between the past cases and age by combining the differences among the numbers of people of different ages having the past cases with the correlation factors;
a characteristic parameter unit: obtains the characteristic parameters of the past cases according to the reimbursement probabilities and the contact parameters;
a risk rate unit: trains a logistic regression model with the characteristic parameters of all past cases in the training set as independent variables, and optimizes the trained logistic regression model using the characteristic parameters of all past cases in the verification set to obtain a logistic regression model for risk rate assessment of insurance data; acquires the characteristic parameters of all past cases in the second data set, uses them as the input of the logistic regression model, and outputs the risk rates of the insurance data corresponding to the past cases;
a retrieval probability unit: acquires the retrieval times of the insurance data and the update time of the medical insurance information database, and obtains the retrieval probability of the insurance data in the second data set by combining the risk rate.
2. The intelligent information retrieval system of mass medical insurance data according to claim 1, wherein the first data set and the second data set are obtained by the following steps:
recording the set formed by all insurance data in the medical insurance information database as the insurance data set;
recording the set formed by the insurance data in the insurance data set for which medical insurance has been reimbursed as the first data set;
and recording the set formed by all data in the insurance data set corresponding to medical insurance currently in use by applicants as the second data set.
3. The intelligent information retrieval system of massive medical insurance data according to claim 1, wherein the training set and the verification set are obtained by the following steps:
firstly, clustering all insurance data in the first data set with the K-means++ algorithm according to the distances over three dimensions, namely the age, the sex, and the past cases of the applicant corresponding to each insurance data item, to obtain a plurality of clusters;
then, scrambling all clusters by using a random shuffling algorithm;
finally, each cluster is divided according to a preset proportion to respectively obtain a training set and a verification set.
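The shuffle-and-split steps of this claim can be sketched as below. The sketch assumes the cluster labels have already been produced by a K-means++ clustering step, which is not implemented here; the 0.8 training proportion, the fixed seed, and the function name are hypothetical, since the claim only refers to a preset proportion.

```python
import random
from collections import defaultdict


def split_by_cluster(records, cluster_labels, train_ratio=0.8, seed=0):
    """Shuffle each cluster, then divide it by a preset proportion.

    Splitting inside every cluster keeps the age/sex/past-case mix of the
    training set and the verification set roughly the same.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for record, label in zip(records, cluster_labels):
        clusters[label].append(record)

    train, verification = [], []
    for label in sorted(clusters):
        members = clusters[label][:]
        rng.shuffle(members)                      # random shuffling step
        cut = int(len(members) * train_ratio)     # preset proportion
        train.extend(members[:cut])
        verification.extend(members[cut:])
    return train, verification


# Ten illustrative record ids in two clusters; not data from the patent.
records = list(range(10))
labels = [0] * 5 + [1] * 5
train_set, verification_set = split_by_cluster(records, labels)
```

Dividing each cluster separately, rather than the pooled data, is what makes the split stratified: every cluster contributes to both sets in the same proportion.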
4. The intelligent information retrieval system for massive medical insurance data according to claim 1, wherein the correlation factor is obtained by the following method:
where the terms denote, respectively: the correlation factor between the given past case and age; the total number of people in the training set of a given age who have the given past case, the age ranging over the age interval of the applicants corresponding to all insurance data in the training set; the maximum age of the applicants corresponding to the insurance data in the training set; the mean number of people over all past cases; and the hyperbolic tangent function.
5. The intelligent information retrieval system for massive medical insurance data according to claim 1, wherein the contact parameters are obtained by the following steps:
firstly, acquiring the number of persons who pay after any past case exists under each age and different property in a first data set;
then, for each of the two age intervals in the first data set, the ratio between the total number of people reimbursed who have the given past case and the total number of people who have that past case is recorded as the reimbursement probability;
finally, the specific acquisition method of the contact parameters comprises the following steps:
where the terms denote, respectively: the contact parameter between the given past case and age; the total number of people in the training set of a given age who have the given past case; the maximum age of the applicants corresponding to the insurance data in the training set; the reimbursement probability of the given past case within the first age interval; the reimbursement probability of the given past case within the second age interval; and the correlation factor between the given past case and age.
6. The intelligent information retrieval system for massive medical insurance data according to claim 1, wherein the characteristic parameters are obtained by the following steps:
firstly, acquiring from the first data set the numbers of men and of women who were reimbursed while having any given past case, and recording the ratios of these numbers to the total number of people reimbursed in the first data set as the male reimbursement probability and the female reimbursement probability, respectively;
and then, recording the product of the contact parameter and the sum of 1 and the male or female reimbursement probability as the characteristic parameter of the past case.
7. The intelligent information retrieval system for massive medical insurance data according to claim 1, wherein the retrieval probability is obtained by the following steps:
where the terms denote, respectively: the retrieval probability of the given insurance data in the second data set; the risk rate of that insurance data; the time of a given (indexed) retrieval of that insurance data; the time at which that insurance data was last retrieved; the last update time of the medical insurance information database; and the natural constant e.
8. The intelligent information retrieval system of massive medical insurance data according to claim 1, wherein the primary coding data is obtained according to the size of the retrieval probability, the primary coding data is coded, compressed and stored, and further the quick retrieval of the insurance data is realized, and the method comprises the following specific steps:
firstly, applying linear normalization to the retrieval probabilities of all insurance data in the second data set to obtain normalized retrieval probabilities, and presetting a retrieval probability threshold empirically;
then, recording insurance data whose normalized retrieval probability is larger than the retrieval probability threshold as primary coding data, and recording insurance data whose normalized retrieval probability is smaller than the retrieval probability threshold as non-primary coding data; obtaining the repeated characters in all primary coding data by a character statistics method, and encoding the primary coding data and the repeated characters with the short codes of a variable-length code; encoding the non-repeated characters and the non-primary coding data with the long codes of the variable-length code to obtain the encoded compressed data corresponding to all insurance data;
and finally, storing all the encoded compressed data in the medical insurance information database to realize quick retrieval of the insurance data.
CN202310833085.0A 2023-07-10 2023-07-10 Intelligent information retrieval system for massive medical insurance data Active CN116561183B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310833085.0A CN116561183B (en) 2023-07-10 2023-07-10 Intelligent information retrieval system for massive medical insurance data


Publications (2)

Publication Number Publication Date
CN116561183A CN116561183A (en) 2023-08-08
CN116561183B CN116561183B (en) 2023-09-19

Family

ID=87503868


Country Status (1)

Country Link
CN (1) CN116561183B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant