WO2022257458A1 - Vehicle insurance claim behavior recognition method, apparatus, and device, and storage medium - Google Patents

Info

Publication number
WO2022257458A1
Authority
WO
WIPO (PCT)
Prior art keywords
behavior
behavior recognition
data set
samples
feature
Prior art date
Application number
PCT/CN2022/071477
Other languages
French (fr)
Chinese (zh)
Inventor
朱磊
徐赛奕
张霖
俞丽娟
朱艳乔
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022257458A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Definitions

  • The present application relates to the field of big data, and in particular to a method, apparatus, device, and storage medium for recognizing auto insurance claim behavior.
  • Automobile insurance is a kind of commercial insurance that covers liability for personal injury or property loss caused by motor vehicles as a result of natural disasters or accidents. It emerged and developed alongside the spread of automobiles. As living standards have improved, more and more people are buying cars. However, some offenders defraud insurance companies by taking out auto insurance and staging accident scenes. Insurance fraud occurs frequently, especially in auto insurance; the methods used are varied and keep evolving, causing heavy losses to auto insurers. Fraudsters often operate in gangs, colluding with auto repair companies and even bribing loss assessors.
  • The inventors realized that, in existing auto insurance anti-fraud data sets, fraudulent claims are still far fewer than normal claims. When such a data set is used for machine learning to identify whether an auto insurance claim is abnormal, the fraud cases and the normal cases are unbalanced, so the numbers of positive and negative samples are unbalanced. A general machine learning modeling method performs poorly in this situation: the model caters to the data distribution and tends to assign results to the category with more samples. In short, the imbalance between positive and negative samples in existing auto insurance anti-fraud data sets leads to low accuracy when training models to recognize abnormal auto insurance claims.
  • The main purpose of this application is to solve the problem that the imbalance between positive and negative samples in existing auto insurance anti-fraud data sets leads to low accuracy when training models to recognize abnormal auto insurance claims.
  • A first aspect of the present application provides an auto insurance claim behavior recognition method, including: obtaining historical auto insurance claim data, and dividing the historical auto insurance claim data into positive samples and negative samples; performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples; combining each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set; inputting the first data set and the second data set respectively into a preset behavior recognition model to identify behavior types, and obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; calculating, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, and calculating the relative entropy loss between the first behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and obtaining auto insurance claim data to be identified, inputting it into the behavior recognition model, and identifying the behavior category corresponding to the auto insurance claim data to be identified.
  • A second aspect of the present application provides an auto insurance claim behavior recognition device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: obtaining historical auto insurance claim data, and dividing the historical auto insurance claim data into positive samples and negative samples; performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples; combining each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set; inputting the first data set and the second data set respectively into a preset behavior recognition model to identify behavior types, and obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; calculating, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, and calculating the relative entropy loss between the first behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and obtaining auto insurance claim data to be identified, inputting it into the behavior recognition model, and identifying the behavior category corresponding to the auto insurance claim data to be identified.
  • A third aspect of the present application provides a computer-readable storage medium storing computer instructions. When the computer instructions are run on a computer, the computer is caused to perform the following steps: obtaining historical auto insurance claim data, and dividing the historical auto insurance claim data into positive samples and negative samples; performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples; combining each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set; inputting the first data set and the second data set respectively into a preset behavior recognition model to identify behavior types, and obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; calculating, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, and calculating the relative entropy loss between the first behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and obtaining auto insurance claim data to be identified, inputting it into the behavior recognition model, and identifying the behavior category corresponding to the auto insurance claim data to be identified.
  • A fourth aspect of the present application provides an auto insurance claim behavior recognition apparatus, including: an expansion module, configured to obtain historical auto insurance claim data, divide the historical auto insurance claim data into positive samples and negative samples, perform neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and perform down-sampling processing on the negative samples to obtain a plurality of sub-samples; a combination module, configured to combine each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set; a training module, configured to input the first data set and the second data set respectively into a preset behavior recognition model to identify behavior types, and obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; an update module, configured to calculate, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result, and update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and an identification module, configured to obtain auto insurance claim data to be identified, input it into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
  • In the technical solution provided by this application, the imbalance between positive and negative samples is addressed by expanding the small number of positive samples and down-sampling the large number of negative samples. The sub-samples are combined with the positive samples and with the expanded samples to obtain the first and second data sets, which are input into two identical behavior recognition models for training to obtain the first and second distribution probabilities. The outputs are then processed: the misclassification rate measures the accuracy on the first data set, and the relative entropy loss measures the difference between the results for the first data set and the second data set. The behavior recognition model obtained in this way substantially weakens the recognition bias caused by sample imbalance. Finally, the behavior category of the auto insurance claim to be identified is recognized by the behavior recognition model, and the recognition result is more accurate.
  • FIG. 1 is a schematic diagram of the first embodiment of the method for identifying the behavior of auto insurance claims in the embodiment of the present application;
  • FIG. 2 is a schematic diagram of the second embodiment of the method for identifying the behavior of auto insurance claims in the embodiment of the present application;
  • FIG. 3 is a schematic diagram of the third embodiment of the method for identifying the behavior of auto insurance claims in the embodiment of the present application;
  • FIG. 4 is a schematic diagram of an embodiment of an auto insurance claim settlement behavior recognition device in the embodiment of the present application;
  • FIG. 5 is a schematic diagram of another embodiment of the auto insurance claim settlement behavior recognition device in the embodiment of the present application;
  • FIG. 6 is a schematic diagram of an embodiment of an auto insurance claim settlement behavior recognition device in the embodiment of the present application.
  • The embodiments of the present application provide a method, apparatus, device, and storage medium for recognizing auto insurance claim behavior. Historical auto insurance claim data is divided into positive samples and negative samples; the positive samples are processed to obtain expanded samples, and the negative samples are down-sampled to obtain sub-samples. The sub-samples are combined with the positive samples and with the expanded samples to obtain the first and second data sets, which are input into the behavior recognition model to obtain the first and second behavior recognition results. From these, the misclassification rate and the relative entropy loss of the first and second distribution probabilities are calculated. The behavior recognition model is updated according to the misclassification rate and the relative entropy loss until it converges. Finally, the auto insurance claim data to be identified is input into the behavior recognition model to identify the corresponding behavior category.
  • The present application addresses the imbalance between positive and negative samples in auto insurance anti-fraud data sets, thereby improving the accuracy of recognizing abnormal auto insurance claims.
  • The first embodiment of the method for identifying the behavior of auto insurance claims in the embodiment of the present application includes:
  • It can be understood that the execution subject of the present application may be an auto insurance claim behavior recognition apparatus, or may be a terminal or a server, which is not specifically limited here.
  • The embodiments of the present application are described using a server as the execution subject.
  • In this embodiment, historical auto insurance claim data refers to an enterprise's records of past auto insurance claims, including normal claim records and abnormal claim records (such as insurance fraud). It may specifically include historical claim records, accident scene records, maintenance records, policy records, policyholder information, and labels marking each claim as normal or abnormal.
  • For the historical auto insurance claim information above, each underlying factor is first defined and described in words. The underlying factors are then processed, for example through LBS (Location Based Services) factor processing, WIFI factor expansion, and processing of other fraud-related factors. The processed factors undergo further feature cleaning, and those that meet the requirements for data saturation and model correlation are selected. Finally, the selected factors are encoded through feature engineering into numerical underlying factors, which is the final form of the first data set.
  • LBS factor processing derives the customer's recent life trajectory from the customer's latitude and longitude and POI (Point of Interest) data. WIFI factor expansion uses the user's WIFI connection information and historical blacklist records to derive the association between the current customer and the insurance fraud blacklist. Other fraud-related factors may include the Euclidean distance between the accident location and the repair shop, whether the driver in the accident and the insured are the same person, and so on.
  • Feature engineering can encode the underlying factors through data feature normalization, data binning, and conversion of discrete features to numerical values.
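  • The three encoding operations named above (normalization, binning, and conversion of discrete features to numbers) can be sketched as follows. This is a minimal illustration only: the factor names, the bin edges, and the choice of min-max scaling are assumptions made for the example and are not specified in the application.

```python
from bisect import bisect_right

def min_max_normalize(values):
    """Scale numeric values to [0, 1] (one common normalization choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def bin_values(values, edges):
    """Replace each value by the index of the bin it falls into.

    `edges` must be sorted ascending; a value below edges[0] gets bin 0.
    """
    return [bisect_right(edges, v) for v in values]

def encode_categories(values):
    """Map each distinct discrete value to an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        codes.append(mapping[v])
    return codes, mapping

# Hypothetical underlying factors for four claims.
claim_amounts = [1200.0, 5400.0, 300.0, 9800.0]       # numeric factor
accident_to_shop_km = [0.5, 12.0, 48.0, 3.2]           # distance factor
driver_is_insured = ["yes", "no", "no", "yes"]         # discrete factor

normalized = min_max_normalize(claim_amounts)
binned = bin_values(accident_to_shop_km, edges=[1.0, 10.0, 30.0])
codes, mapping = encode_categories(driver_is_insured)
```

  • Each claim then becomes a purely numerical row, for example the first claim is represented by `normalized[0]`, `binned[0]`, and `codes[0]`.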
  • In this embodiment, the historical auto insurance claim data from normal claim records is used as negative samples, and the data from abnormal claim records is used as positive samples. Because positive samples are far fewer than negative samples, a model trained directly on this data tends to predict normal. The positive samples are therefore expanded and the negative samples are down-sampled. This increases the number of positive samples and reduces the number of negative samples, which narrows the gap between the two and reduces the bias that the volume of training data would otherwise introduce into the model's results.
  • In this embodiment, the behavior recognition model for auto insurance claims is trained by stacking two identical training models.
  • The first model is trained on the first data set and predicts the behavior type of the auto insurance claim in each sample; the second model is trained on the second data set, and its predictions are compared with those of the first model in order to iterate on the first model.
  • In this embodiment, the K nearest neighbors of each positive sample can be calculated with the K-Nearest Neighbor (KNN) classification algorithm, and the positive samples that can be used to synthesize new samples, such as positive samples close to the classification boundary, are selected. The number selected can be set according to the ratio of positive samples to negative samples. New positive samples are then constructed from the selected samples together with the original positive samples.
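  • A minimal sketch of this kind of nearest-neighbor expansion is shown below. It assumes a SMOTE-style scheme (Euclidean distance to find neighbors, then linear interpolation between a positive sample and a randomly chosen neighbor); the function names, the sample values, and the parameters `k` and `per_sample` are hypothetical. The screening of samples near the classification boundary is omitted for brevity.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(samples, index, k):
    """Indices of the k samples closest to samples[index], excluding itself."""
    others = [(euclidean(samples[index], s), j)
              for j, s in enumerate(samples) if j != index]
    others.sort()
    return [j for _, j in others[:k]]

def expand_positive_samples(samples, k, per_sample, rng):
    """Synthesize new samples on the segment between each sample and a random neighbor."""
    expanded = []
    for i, base in enumerate(samples):
        neighbors = k_nearest(samples, i, k)
        for _ in range(per_sample):
            neighbor = samples[rng.choice(neighbors)]
            t = rng.random()  # interpolation weight in [0, 1)
            expanded.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return expanded

# Hypothetical, already-encoded positive (abnormal claim) samples.
positives = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.95], [0.9, 0.1]]
rng = random.Random(0)
expanded = expand_positive_samples(positives, k=2, per_sample=3, rng=rng)
```

  • Because each synthetic sample lies between two existing positive samples, the expanded set stays inside the region already occupied by the positive class.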
  • In this embodiment, two identical behavior recognition models are trained on the first data set and the second data set respectively. They identify the recognition distribution probability of the behavior type of each sample in the first data set and of each sample in the second data set. Linear fitting is then performed on the recognition distribution probabilities of each data set to obtain the first distribution probability and the second distribution probability.
  • The behavior recognition model can be trained using the Random Forest algorithm.
  • Each sample in the first data set or the second data set is trained by a learner to obtain multiple decision trees, that is, one sample corresponds to one decision tree, and the nodes in the decision tree correspond to the feature attributes of the sample. According to the decision tree, the recognition probability of the behavior type for each feature attribute in each sample is calculated, and the recognition probabilities of the feature attributes are fitted to obtain the recognition distribution probability of the behavior type of each sample.
  • According to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set is calculated, the relative entropy loss between the first behavior recognition result and the second behavior recognition result is calculated, and the behavior recognition model is updated according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
  • In this embodiment, the misclassification rate of each sample in the first data set is calculated from the first distribution probability. Here the misclassification rate is the ratio of the number of wrongly predicted behavior types to the total number of predictions made for the same sample by the different learners.
  • The behavior type predicted most often for a sample is taken as the correctly predicted behavior type, and the other behavior types are taken as wrongly predicted.
  • The behavior types of a sample may include only normal and abnormal. In that case, when the sample is classified as normal, the misclassification rate is the number of feature attributes predicted abnormal divided by the total number of feature attributes.
  • In this embodiment, the auto insurance claim data to be identified is input directly into the behavior recognition model, which directly outputs the behavior type of the claim; the behavior types may include only normal or abnormal.
  • In the embodiment of the present application, the imbalance between positive and negative samples is addressed by expanding the small number of positive samples and down-sampling the large number of negative samples. The sub-samples are combined with the positive samples and with the expanded samples to obtain the first and second data sets, which are input into two identical behavior recognition models for training to obtain the first and second distribution probabilities. The outputs are then processed: the misclassification rate measures the accuracy on the first data set, and the relative entropy loss measures the difference between the results for the first data set and the second data set. The behavior recognition model obtained in this way substantially weakens the recognition bias caused by sample imbalance. Finally, the behavior recognition model is used to identify the behavior category of the auto insurance claim to be identified, and the recognition result is more accurate.
  • The second embodiment of the method for identifying the behavior of auto insurance claims in the embodiment of the present application includes:
  • In this embodiment, the K-Nearest Neighbor (KNN) classification algorithm can be used to calculate the K nearest neighbors of each positive sample and to select the positive samples that can be used to synthesize new samples, such as positive samples close to the classification boundary. The number selected can be set according to the ratio of positive samples to negative samples. New positive samples are then constructed from the selected samples together with the original positive samples.
  • In this embodiment, when down-sampling the negative samples, the negative samples can be randomly sampled with replacement according to a preset sampling multiple, so that their number is reduced to a level that is relatively balanced with the number of positive samples.
  • The sampling multiple can be set according to the ratio of positive samples to negative samples.
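  • The down-sampling step can be sketched as follows. The text does not say exactly how the sampling multiple is applied, so this example assumes that each sub-sample contains `sampling_multiple` times as many negatives as there are positive samples; the function name and all numbers are hypothetical.

```python
import random

def downsample_negatives(negatives, n_positive, sampling_multiple, n_subsamples, rng):
    """Draw several sub-samples of the negative class, sampling with replacement.

    Assumption: each sub-sample holds n_positive * sampling_multiple negatives,
    so the classes are roughly balanced when the multiple is close to 1.
    """
    size = int(n_positive * sampling_multiple)
    return [rng.choices(negatives, k=size) for _ in range(n_subsamples)]

negatives = list(range(1000))   # stand-ins for 1000 normal-claim samples
rng = random.Random(42)
sub_samples = downsample_negatives(
    negatives, n_positive=20, sampling_multiple=1.5, n_subsamples=2, rng=rng
)
```

  • Drawing with replacement means the same negative sample may appear more than once in a sub-sample, and different sub-samples may overlap.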
  • In this embodiment, the data set keeps the positive and negative samples balanced. Further segmenting the data set yields multiple feature subsets, and the positive and negative samples in each feature subset remain relatively balanced, just as they are in the data set.
  • Each learner outputs the recognition probability of the behavior type for each feature attribute in each sample. These probabilities are fitted to obtain the recognition distribution probability, which is then further fitted to obtain the distribution probability of behavior type recognition for the data set.
  • In this embodiment, training a feature subset with a learner specifically includes the following steps:
  • The recognition distribution probability of the behavior type corresponding to each learner is calculated and output by the learner.
  • Each sample is placed at the root node of a decision tree, that is, a sample node is constructed. If each sample contains K attributes, then whenever a sample node of the decision tree needs to branch, k feature attributes are randomly selected from the K attributes, where k is less than K. A preset strategy such as information gain is used to select one of the k feature attributes as the branch attribute of the node, that is, a feature node is constructed. These steps are repeated until the number of feature nodes reaches m. The decision trees obtained by the multiple learners together form the entire random forest.
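  • The branching rule described above (pick k of the K attributes at random, then choose the one with the highest information gain) can be illustrated with the sketch below. It covers a single node only, not a full tree or forest, and the attribute values, labels, and function names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting on attribute index `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def choose_branch_attribute(rows, labels, k, rng):
    """Randomly pick k candidate attributes, then return the one with the highest gain."""
    candidates = rng.sample(range(len(rows[0])), k)
    best = max(candidates, key=lambda a: information_gain(rows, labels, a))
    return best, candidates

# Hypothetical encoded samples with K = 4 discrete attributes each.
rows = [
    ["near", "same",  "listed", "day"],
    ["far",  "other", "listed", "night"],
    ["near", "other", "clean",  "day"],
    ["far",  "same",  "clean",  "night"],
]
labels = ["abnormal", "abnormal", "normal", "normal"]

rng = random.Random(7)
best, candidates = choose_branch_attribute(rows, labels, k=2, rng=rng)
```

  • In this toy data only attribute index 2 separates the two classes, so it is chosen whenever it is among the k random candidates. Restricting each node to a random subset is what makes the trees in the forest differ from one another.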
  • In the embodiment of the present application, the small number of positive samples is expanded to obtain expanded samples and the large number of negative samples is down-sampled, which addresses the imbalance between positive and negative samples and improves the accuracy of model training. The output accuracy of the behavior recognition model is measured by the misclassification rate and the loss value, so that the model fully considers the characteristics of the positive samples even when the samples are unbalanced, making the output more accurate.
  • The third embodiment of the method for identifying the behavior of auto insurance claims in the embodiment of the present application includes:
  • For example, suppose sample 1 contains A feature attributes, of which a1 are predicted to be normal and a2 are predicted to be abnormal. If a1 > a2, the classification result of this sample is normal, and its misclassification rate is a2/A.
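  • The a2/A calculation can be written as a short sketch: the majority prediction becomes the classification result, and the share of minority predictions is the misclassification rate. The function name and the example counts (A = 10, a1 = 7, a2 = 3) are hypothetical.

```python
from collections import Counter

def classify_and_misclassification_rate(predictions):
    """Return the majority label and the fraction of predictions that disagree with it."""
    counts = Counter(predictions)
    majority, majority_count = counts.most_common(1)[0]
    return majority, (len(predictions) - majority_count) / len(predictions)

# A = 10 feature attributes: a1 = 7 predicted normal, a2 = 3 predicted abnormal.
predictions = ["normal"] * 7 + ["abnormal"] * 3
label, rate = classify_and_misclassification_rate(predictions)
```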
  • The relative entropy loss of the first distribution probability and the second distribution probability measures how far the prediction results of the two models differ, that is, how far the predictions differ when the positive samples and the expanded samples derived from them are each used for model training. This difference is used to iteratively update the model.
  • The relative entropy loss of the first distribution probability and the second distribution probability is calculated by a formula in which R(P‖Q) is the relative entropy loss, a balance coefficient accounts for the positive sample expansion, p(x1) is each probability value in the first distribution probability, and q(x1) is each probability value in the second distribution probability.
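  • For orientation, the standard relative entropy (Kullback-Leibler divergence) between two discrete distributions is R(P‖Q) = Σ p(x) log(p(x) / q(x)). The sketch below implements that standard form. How the application's balance coefficient enters the formula is not recoverable from this text, so applying it as a simple multiplicative factor (`balance`) is purely an assumption, as are the function name and the example probabilities.

```python
import math

def relative_entropy(p, q, balance=1.0, eps=1e-12):
    """Standard KL divergence sum(p * log(p / q)), scaled by an assumed balance factor.

    `eps` guards against division by zero when q contains zeros.
    """
    return balance * sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical distribution probabilities over (normal, abnormal).
first_distribution = [0.9, 0.1]    # from the model trained on the first data set
second_distribution = [0.6, 0.4]   # from the model trained on the second data set

loss = relative_entropy(first_distribution, second_distribution)
same = relative_entropy(first_distribution, first_distribution)
```

  • The loss is zero when the two distributions are identical and grows as they diverge, which is why it can serve as a measure of how differently the two models behave.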
  • In this embodiment, the accuracy on the first data set is measured by the misclassification rate, and the difference between the two models is measured by the relative entropy loss.
  • When both the accuracy and the difference between the two models meet the conditions, the behavior recognition model can be judged to have converged. Specifically, a misclassification rate threshold and a relative entropy loss threshold can be set to determine whether the misclassification rate and the relative entropy loss meet the loss conditions. The misclassification rate can be checked first: if it does not meet the condition, the accuracy on the first data set is insufficient and there is no need to check the relative entropy loss.
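  • The two-stage check can be sketched as below. The threshold values are hypothetical, and treating "meets the condition" as "is at or below the threshold" is an assumption.

```python
def has_converged(misclassification_rate, relative_entropy_loss,
                  max_misclassification=0.1, max_relative_entropy=0.05):
    """Check the misclassification rate first; only then check the relative entropy loss."""
    if misclassification_rate > max_misclassification:
        # Accuracy on the first data set is insufficient; skip the second check.
        return False
    return relative_entropy_loss <= max_relative_entropy

accurate_and_close = has_converged(0.04, 0.01)
inaccurate = has_converged(0.30, 0.01)
accurate_but_divergent = has_converged(0.04, 0.20)
```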
  • In the embodiment of the present application, multiple learners in the training model learn decision trees from the data set to identify the category probability of the auto insurance claim behavior, which reduces the result bias caused by sample imbalance and corrects the bias in the model output.
  • An embodiment of the auto insurance claim settlement behavior identification device in the embodiment of the application includes:
  • the expansion module 401 is used to obtain historical auto insurance claims data, and divide the historical auto insurance claims data into positive samples and negative samples; perform neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and perform down-sampling processing on the negative samples to obtain multiple sub-samples;
  • a combination module 402 configured to combine each of the sub-samples with the positive sample and the expanded sample to obtain a first data set and a second data set;
  • a training module 403 configured to respectively input the first data set and the second data set into a preset behavior recognition model for behavior type recognition, and obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
  • An update module 404 configured to calculate the misclassification rate of the first data set by the behavior recognition model according to the first behavior recognition result and the behavior type corresponding to the first data set, and calculate the first A relative entropy loss between the behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
  • the identification module 405 is configured to acquire the auto insurance claim data to be identified, input the auto insurance claim data to be identified into the behavior identification model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
  • In the embodiment of the present application, the imbalance between positive and negative samples is addressed by expanding the small number of positive samples and down-sampling the large number of negative samples. The sub-samples are combined with the positive samples and with the expanded samples to obtain the first and second data sets, which are input into two identical behavior recognition models for training to obtain the first and second distribution probabilities. The outputs are then processed: the misclassification rate measures the accuracy on the first data set, and the relative entropy loss measures the difference between the results for the first data set and the second data set. The behavior recognition model obtained in this way substantially weakens the recognition bias caused by sample imbalance. Finally, the behavior recognition model is used to identify the behavior category of the auto insurance claim to be identified, and the recognition result is more accurate.
  • Referring to FIG. 5, another embodiment of the auto insurance claim settlement behavior recognition device in the embodiment of the present application includes:
  • the expansion module 401 is used to obtain historical auto insurance claims data, and divide the historical auto insurance claims data into positive samples and negative samples; perform neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and perform down-sampling processing on the negative samples to obtain multiple sub-samples;
  • a combination module 402 configured to combine each of the sub-samples with the positive sample and the expanded sample to obtain a first data set and a second data set;
  • a training module 403 configured to respectively input the first data set and the second data set into a preset behavior recognition model for behavior type recognition, and obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
  • An update module 404 configured to calculate the misclassification rate of the first data set by the behavior recognition model according to the first behavior recognition result and the behavior type corresponding to the first data set, and calculate the first A relative entropy loss between the behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
  • the identification module 405 is configured to acquire the auto insurance claim data to be identified, input the auto insurance claim data to be identified into the behavior identification model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
  • the expansion module 401 includes:
  • the distance calculation unit 4011 is used to sequentially calculate the Euclidean distance between every two positive samples, and determine the neighbor samples of each positive sample according to the Euclidean distance;
  • the interpolation processing unit 4012 is configured to randomly select a preset number of neighboring samples for linear interpolation processing, and construct extended samples according to the processing results.
  • the training module 403 includes:
  • the input unit 4031 is used to input the data set into the preset behavior recognition model, wherein the behavior recognition model includes an input layer and a decision layer, and the data set includes the first data set and the second data set;
  • the training unit 4032 is configured to perform random sampling processing on the data set through the input layer to obtain multiple feature subsets; input each of the feature subsets into a different learner in the decision layer, recognize each of the feature subsets through the learners, and output the recognition result of each learner for its corresponding feature subset;
  • the output unit 4033 is configured to determine the recognition result of the behavior recognition model for the data set according to the recognition results output by each of the learners, wherein the recognition result of the behavior recognition model for the data set includes the first behavior recognition result and the second behavior recognition result.
  • the update module 404 includes:
  • a statistics unit 4041 configured to perform statistics on the behavior types in the first behavior recognition result, to obtain a first distribution probability of behavior types in the first data set;
  • a ratio calculation unit 4042 configured to determine the number of samples of the first data set misclassified by the behavior recognition model according to the first distribution probability and the behavior type corresponding to the first data set; calculate the ratio between the number of misclassified samples and the total number of samples in the first data set, and use the ratio as the misclassification rate of the behavior recognition model for the first data set.
  • the training unit is also used for:
  • a feature attribute is randomly selected from the selected m feature attributes by the learner to construct a child node under the sample node;
  • the updating module 404 also includes:
  • a loss calculation unit 4043 configured to calculate a cross-entropy loss between the misclassification rate and the relative entropy loss, and determine whether the cross-entropy loss and the misclassification rate meet a preset loss condition
  • the adjustment unit 4044 is used to adjust the feature selection parameters in the behavior recognition model according to the cross-entropy loss and the misclassification rate if not satisfied;
  • the determining unit 4045 is configured to update the behavior recognition model according to the adjusted feature selection parameters until the behavior recognition model converges.
  • the small number of positive samples is expanded to obtain expanded samples, and the large number of negative samples is down-sampled, which addresses the imbalance between positive and negative samples and increases the accuracy of model training; the output accuracy of the behavior recognition model is measured through the misclassification rate and the loss value, so that the model fully considers the characteristics of the positive samples even when the samples are unbalanced, making the output more accurate;
  • multiple learners in the model each learn a decision tree from the data set to estimate the category probability of auto insurance claims, which reduces the result bias caused by sample imbalance and corrects the bias in the model output.
  • FIGS. 4 and 5 above describe the auto insurance claim settlement behavior recognition device in the embodiment of the present application in detail from the perspective of modular functional entities.
  • the following describes the auto insurance claim settlement behavior recognition device in the embodiment of the present application in detail from the perspective of hardware processing.
  • Fig. 6 is a schematic structural diagram of an auto insurance claim settlement behavior recognition device provided by an embodiment of the present application.
  • the auto insurance claim settlement behavior recognition device 600 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 610, a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) for storing application programs 633 or data 632.
  • the memory 620 and the storage medium 630 may be temporary storage or persistent storage.
  • the program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the auto insurance claim settlement behavior recognition device 600 .
  • the processor 610 may be configured to communicate with the storage medium 630 , and execute a series of instruction operations in the storage medium 630 on the auto insurance claim settlement behavior recognition device 600 .
  • the auto insurance claim settlement behavior recognition device 600 can also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input and output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • the present application also provides an auto insurance claim settlement behavior recognition device.
  • the computer device includes a memory and a processor, and computer readable instructions are stored in the memory.
  • when the computer-readable instructions are executed by the processor, the processor performs the steps of the auto insurance claim settlement behavior recognition method in the above-mentioned embodiments.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on the computer, the computer is made to execute the steps of the method for identifying the auto insurance claim settlement behavior.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Finance (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The present invention relates to the field of big data. Disclosed are a vehicle insurance claim behavior recognition method, apparatus, and device, and a storage medium. The method comprises: dividing historical vehicle insurance claim data into positive samples and negative samples, performing near-neighbor processing on the positive samples to obtain expanded samples, and downsampling the negative samples to obtain sub-samples; combining the sub-samples with the positive samples and the expanded samples respectively to obtain first and second data sets and inputting the first and second data sets to a behavior recognition model for recognition so as to obtain first and second behavior recognition results, on the basis of which the error rate of the behavior recognition model and the relative entropy loss of the first and second recognition results are calculated; updating the behavior recognition model according to the error rate and the relative entropy loss until the behavior recognition model converges, and then stopping; and finally, inputting vehicle insurance claim data to be recognized into the behavior recognition model to recognize a behavior category corresponding to the vehicle insurance claim data. The present invention solves the imbalance between positive and negative samples stored in vehicle insurance anti-fraud data sets, thereby improving the accuracy of recognizing abnormal vehicle insurance compensation.

Description

Auto insurance claim settlement behavior recognition method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 202110635315.3, entitled "Auto insurance claim settlement behavior recognition method, apparatus, device and storage medium", filed with the China Patent Office on June 8, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of big data, and in particular to an auto insurance claim settlement behavior recognition method, apparatus, device and storage medium.
Background
Auto insurance is a kind of commercial insurance that bears liability for personal injury or property loss caused to motor vehicles by natural disasters or accidents; it emerged and developed along with the appearance and popularization of automobiles. As society develops and living standards rise, more and more people buy cars. However, some lawbreakers take out auto insurance and stage accident scenes to defraud insurance companies of claim payments. Fraud methods keep multiplying and insurance fraud incidents occur frequently. In the auto insurance field in particular, fraud schemes emerge endlessly and in many varieties, causing heavy losses to auto insurers. Fraudsters mostly operate in gangs, colluding with auto repair companies and even bribing loss assessors to commit insurance fraud.
The inventor realized that, in existing auto insurance anti-fraud data sets, fraudulent claims remain a minority compared with normal claims. That is, when a data set is used for machine learning to determine whether an auto insurance payout is abnormal, the cases of fraud and the cases of normal payout are unbalanced, so the numbers of positive and negative samples are unbalanced. In this situation, a general machine learning modeling method performs very poorly: the model caters to the characteristics of the data and tends to assign results to the class with more samples. In other words, existing auto insurance anti-fraud data sets suffer from an imbalance of positive and negative samples, which leads to low training accuracy for models that recognize abnormal auto insurance payouts.
Summary of the Invention
The main purpose of the present application is to solve the problem that existing auto insurance anti-fraud data sets suffer from an imbalance of positive and negative samples, which leads to low training accuracy for models that recognize abnormal auto insurance payouts.
A first aspect of the present application provides an auto insurance claim settlement behavior recognition method, including: obtaining historical auto insurance claims data, and dividing the historical auto insurance claims data into positive samples and negative samples; performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples; combining each of the sub-samples with the positive samples and the expanded samples respectively to obtain a first data set and a second data set; inputting the first data set and the second data set respectively into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; calculating, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, and calculating the relative entropy loss between the first behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and acquiring auto insurance claims data to be recognized, inputting the auto insurance claims data to be recognized into the behavior recognition model, and recognizing the behavior category corresponding to the auto insurance claims data to be recognized.
A second aspect of the present application provides an auto insurance claim settlement behavior recognition device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions: obtaining historical auto insurance claims data, and dividing the historical auto insurance claims data into positive samples and negative samples; performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples; combining each of the sub-samples with the positive samples and the expanded samples respectively to obtain a first data set and a second data set; inputting the first data set and the second data set respectively into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; calculating, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, and calculating the relative entropy loss between the first behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and acquiring auto insurance claims data to be recognized, inputting the auto insurance claims data to be recognized into the behavior recognition model, and recognizing the behavior category corresponding to the auto insurance claims data to be recognized.
A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: obtaining historical auto insurance claims data, and dividing the historical auto insurance claims data into positive samples and negative samples; performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples; combining each of the sub-samples with the positive samples and the expanded samples respectively to obtain a first data set and a second data set; inputting the first data set and the second data set respectively into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; calculating, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, and calculating the relative entropy loss between the first behavior recognition result and the second behavior recognition result; updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and acquiring auto insurance claims data to be recognized, inputting the auto insurance claims data to be recognized into the behavior recognition model, and recognizing the behavior category corresponding to the auto insurance claims data to be recognized.
A fourth aspect of the present application provides an auto insurance claim settlement behavior recognition apparatus, including: an expansion module, configured to obtain historical auto insurance claims data and divide the historical auto insurance claims data into positive samples and negative samples, to perform neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and to perform down-sampling processing on the negative samples to obtain a plurality of sub-samples; a combination module, configured to combine each of the sub-samples with the positive samples and the expanded samples respectively to obtain a first data set and a second data set; a training module, configured to input the first data set and the second data set respectively into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set; an update module, configured to calculate, according to the first behavior recognition result and the behavior type corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set, to calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result, and to update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges; and a recognition module, configured to acquire auto insurance claims data to be recognized, input the auto insurance claims data to be recognized into the behavior recognition model, and recognize the behavior category corresponding to the auto insurance claims data to be recognized.
In the technical solution provided by the present application, the imbalance between positive and negative samples is addressed by expanding the small number of positive samples and down-sampling the large number of negative samples. The sub-samples are then combined with the expanded samples and the positive samples respectively to obtain the first and second data sets, which are input into two identical behavior recognition models for training to obtain the corresponding first and second distribution probabilities. The output results are then processed: the accuracy on the first data set is measured by the misclassification rate, and the difference between the first data set and the second data set is measured by the relative entropy loss. When both the accuracy and the difference between the two data sets meet the conditions, the behavior recognition model is obtained, which substantially weakens the recognition bias caused by sample imbalance. Finally, the behavior recognition model is used to recognize the behavior category of the auto insurance claim to be recognized, and the recognition result obtained is more accurate.
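The two quantities named above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the label convention (1 for an abnormal claim, 0 for a normal claim), and the small smoothing constant `eps` are assumptions introduced here, and the relative entropy is computed in its usual Kullback-Leibler form over discrete class probabilities.

```python
import math


def misclassification_rate(predicted, actual):
    """Fraction of samples whose predicted label differs from the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions.

    eps is an assumed smoothing constant that avoids division by zero.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical labels: 1 = abnormal claim (positive), 0 = normal claim (negative).
rate = misclassification_rate([1, 0, 0, 1], [1, 0, 1, 1])

# Hypothetical class-probability distributions obtained from the two data sets.
divergence = relative_entropy([0.7, 0.3], [0.6, 0.4])
```

A relative entropy of zero means the two distributions agree exactly; larger values indicate that the outputs for the first and second data sets diverge more.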
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a first embodiment of the auto insurance claim settlement behavior recognition method in the embodiments of the present application;
FIG. 2 is a schematic diagram of a second embodiment of the auto insurance claim settlement behavior recognition method in the embodiments of the present application;
FIG. 3 is a schematic diagram of a third embodiment of the auto insurance claim settlement behavior recognition method in the embodiments of the present application;
FIG. 4 is a schematic diagram of one embodiment of the auto insurance claim settlement behavior recognition apparatus in the embodiments of the present application;
FIG. 5 is a schematic diagram of another embodiment of the auto insurance claim settlement behavior recognition apparatus in the embodiments of the present application;
FIG. 6 is a schematic diagram of one embodiment of the auto insurance claim settlement behavior recognition device in the embodiments of the present application.
Detailed Description
The embodiments of the present application provide an auto insurance claim settlement behavior recognition method, apparatus, device and storage medium. Historical auto insurance claims data is divided into positive samples and negative samples; the positive samples undergo nearest-neighbor processing to obtain expanded samples, and the negative samples are down-sampled to obtain sub-samples. The sub-samples are combined with the positive samples and the expanded samples respectively to obtain first and second data sets, which are input into a behavior recognition model for recognition to obtain first and second behavior recognition results; from these, the misclassification rate of the behavior recognition model and the relative entropy loss of the first and second distribution probabilities are calculated. The behavior recognition model is updated according to the misclassification rate and the relative entropy loss until it converges. Finally, the auto insurance claims data to be recognized is input into the behavior recognition model to recognize the corresponding behavior category. The present application addresses the imbalance of positive and negative samples in auto insurance anti-fraud data sets, thereby improving the accuracy of recognizing abnormal auto insurance payouts.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, a first embodiment of the auto insurance claim settlement behavior recognition method in the embodiments of the present application includes:
101. Obtain historical auto insurance claims data, divide the historical auto insurance claims data into positive samples and negative samples, perform neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and perform down-sampling processing on the negative samples to obtain a plurality of sub-samples;
It can be understood that the execution subject of the present application may be an auto insurance claim settlement behavior recognition apparatus, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described taking a server as the execution subject.
In this embodiment, historical auto insurance claims refer to an enterprise's past records of auto insurance claims, including records of normal claims and records of abnormal claims (such as insurance fraud). Specifically, they may include historical claim records, accident scene records, repair records, policy records, information on the insured, and labels marking each claim as normal or abnormal.
The historical auto insurance claims information is first described as individual underlying factors expressed in words. These underlying factors are then processed, for example through LBS (Location Based Services) factor processing, WIFI factor expansion, and processing of other fraud-association factors. Next, feature cleaning is performed on the processed underlying factors to select those that meet the data saturation and model relevance requirements. Finally, the selected underlying factors are encoded through feature engineering to obtain numerical underlying factors, which are the final representation of the first data set.
Further, LBS factor processing refers to deriving the customer's recent movement trajectory from the customer's latitude and longitude and POIs (Points of Interest). WIFI factor expansion refers to deriving the correlation between the current customer and the insurance fraud blacklist from the user's WIFI connection information and historical blacklist records. Processing of other fraud-association factors may include the Euclidean distance between the accident location and the repair shop, whether the driver involved in the accident and the insured are the same person, and so on. Feature engineering can encode the underlying factors through data feature normalization, data binning, and numerical encoding of discrete features.
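The encoding operations mentioned above (normalization, binning, numerical encoding of discrete features) and the Euclidean distance factor can be sketched as below. This is only an illustration under stated assumptions: the function names, the min-max form of normalization, the bin edges, and the use of planar coordinates for the distance are all hypothetical choices, since the source does not specify them.

```python
import math


def min_max_normalize(values):
    """Scale numeric values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_index(value, edges):
    """Return the index of the first bin whose upper edge is >= value.

    Values above every edge fall into the last bin, len(edges).
    """
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def encode_categories(values):
    """Map each distinct discrete value to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping


def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical factor values for illustration only.
claim_amounts = min_max_normalize([10, 20, 30])
codes, code_map = encode_categories(["repair_shop_a", "repair_shop_b", "repair_shop_a"])
distance = euclidean_distance((0, 0), (3, 4))
```

If real latitude and longitude are used, a great-circle distance may be more appropriate than a planar Euclidean distance; the source names only the Euclidean form.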
In this embodiment, within the historical auto insurance claim data, data from normal claim records serves as negative samples and data from abnormal claim records serves as positive samples. Positive samples are far fewer than negative samples, so a model trained directly on this data tends to predict "normal". The positive samples are therefore expanded and the negative samples are down-sampled: the number of positive samples increases, the number of negative samples decreases, the gap between the two narrows, and the bias that the imbalance in training data introduces into the model's recognition results is reduced.
102. Combine each sub-sample with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set;
In this embodiment, training of the behavior recognition model for auto insurance claims uses two identical training models stacked together. The first model is trained on the first data set and predicts the behavior type of the auto insurance claim in each sample. The second model is trained on the second data set, and its predictions are compared with those of the first model in order to iterate on the first model.
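One possible reading of the combination step is sketched below in Python. It assumes that the first data set pairs each negative sub-sample with the original positive samples and that the second data set pairs the same sub-sample with the expanded samples; the application does not spell out this split, and the label convention (0 for normal, 1 for abnormal) and the function name are also assumptions.

```python
def build_data_sets(sub_samples, positives, expanded):
    """Pair each negative sub-sample with the positives and with the expanded samples.

    Returns two lists of labelled data sets: (features, 0) marks a normal claim,
    (features, 1) marks an abnormal claim.
    """
    first, second = [], []
    for sub in sub_samples:
        negatives = [(x, 0) for x in sub]
        first.append(negatives + [(x, 1) for x in positives])
        second.append(negatives + [(x, 1) for x in expanded])
    return first, second
```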
When expanding the positive samples, the K nearest neighbors of each positive sample can be computed with the K-Nearest Neighbor (KNN) classification algorithm, and positive samples suitable for synthesizing new samples, such as those close to the classification boundary, are selected. The number selected can be set according to the ratio of positive to negative samples. New positive samples are then constructed from the selected samples together with the original positive samples.
103. Input the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
In this embodiment, two identical behavior recognition models are trained on the first data set and the second data set, respectively. They yield the recognition probability distribution over behavior types for each sample in the first data set and for each sample in the second data set. A linear fit is then applied to the recognition probability distributions of each data set to obtain the first distribution probability and the second distribution probability.
Specifically, the behavior recognition model may be trained with the Random Forest algorithm. Each sample in the first or second data set is trained by one learner, producing multiple decision trees, that is, one decision tree per sample. The nodes of a decision tree correspond to the feature attributes of the sample. From the decision tree, the recognition probability of the behavior type is computed for each feature attribute of each sample, and these per-attribute recognition probabilities are fitted to obtain the recognition probability distribution over behavior types for each sample.
104. Calculate the misclassification rate of the behavior recognition model on the first data set from the first behavior recognition result and the behavior types corresponding to the first data set, calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result, and update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
In this embodiment, the misclassification rate of each sample in the first data set is calculated from the first distribution probability. Here the misclassification rate is the ratio of the number of incorrectly predicted behavior types for the same sample across the different learners to the total number of predictions. The behavior type predicted most often for the sample is treated as the correct prediction, and all other behavior types are treated as incorrect predictions.
Specifically, the behavior types of auto insurance claims for a sample may include only normal and abnormal, in which case the misclassification rate is calculated as: number of abnormal feature attributes / total number of feature attributes.
105. Obtain the auto insurance claim data to be identified, input it into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
In this embodiment, once the behavior recognition model has been trained, the auto insurance claim data to be identified is input directly into the model, which directly outputs the behavior type of that claim. The behavior type may be limited to normal or abnormal.
In this embodiment of the present application, the imbalance between positive and negative samples is addressed by expanding the scarce positive samples and down-sampling the abundant negative samples. The sub-samples are then combined with the positive samples and with the expanded samples, respectively, to obtain the first and second data sets, which are input into two identical behavior recognition models for training to obtain the first and second distribution probabilities. The outputs are then processed: the misclassification rate measures the accuracy on the first data set, and the relative entropy loss measures the difference between the first and second data sets. When both the accuracy and the difference between the two data sets satisfy the conditions, the behavior recognition model is obtained, which substantially reduces the recognition bias caused by sample imbalance. Finally, the behavior recognition model identifies the behavior category of the auto insurance claim to be identified, yielding a more accurate recognition result.
Referring to FIG. 2, a second embodiment of the auto insurance claim behavior recognition method in the embodiments of the present application includes:
201. Obtain historical auto insurance claim data, divide it into positive samples and negative samples, and down-sample the negative samples to obtain multiple sub-samples;
202. Calculate the Euclidean distance between every pair of positive samples in turn, and determine the neighbor samples of each positive sample according to the Euclidean distances;
203. Randomly select a preset number of neighbor samples for linear interpolation, and construct expanded samples from the interpolation results;
In this embodiment, when expanding the positive samples, the K nearest neighbors of each positive sample can be computed with the K-Nearest Neighbor (KNN) classification algorithm, and positive samples suitable for synthesizing new samples, such as those close to the classification boundary, are selected. The number selected can be set according to the ratio of positive to negative samples. New positive samples are then constructed from the selected samples together with the original positive samples.
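The neighbor search and linear interpolation of steps 202 and 203 resemble the well-known SMOTE over-sampling technique. The following Python sketch shows one way this could work, assuming numeric feature vectors; the function names, the uniform interpolation weight, and the seed parameter are illustrative assumptions, and the sketch does not implement the boundary-based selection mentioned above.

```python
import math
import random

def k_nearest(samples, index, k):
    """Indices of the k samples closest (Euclidean distance) to samples[index]."""
    others = [i for i in range(len(samples)) if i != index]
    others.sort(key=lambda i: math.dist(samples[index], samples[i]))
    return others[:k]

def expand_positives(positives, k, n_new, seed=0):
    """Build n_new synthetic positives by interpolating toward random neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        j = rng.choice(k_nearest(positives, i, k))
        t = rng.random()  # interpolation weight in [0, 1)
        base, neighbor = positives[i], positives[j]
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic
```

Each synthetic sample lies on the line segment between a positive sample and one of its neighbors, so it stays inside the region already occupied by positive samples.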
Specifically, when down-sampling the negative samples, random sampling with replacement can be applied to the negative samples according to a preset sampling multiple, reducing the negative samples until their number is reasonably balanced against the number of positive samples. The sampling multiple can be set according to the ratio of positive to negative samples.
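A minimal Python sketch of down-sampling with replacement follows. It assumes the caller chooses the number of sub-samples and the size of each (for example, a size close to the number of positive samples); these parameters and the function name are hypothetical stand-ins for the "sampling multiple" mentioned above.

```python
import random

def downsample_negatives(negatives, n_subsets, subset_size, seed=0):
    """Draw n_subsets sub-samples of subset_size items each, with replacement."""
    rng = random.Random(seed)
    # random.Random.choices samples with replacement, so an item may repeat.
    return [rng.choices(negatives, k=subset_size) for _ in range(n_subsets)]
```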
204. Combine each sub-sample with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set;
205. Input the data sets separately into a preset behavior recognition model, where the behavior recognition model includes an input layer and a decision layer, and the data sets include the first data set and the second data set;
206. Randomly sample the data set through the input layer to obtain multiple feature subsets;
207. Input each feature subset into a different learner in the decision layer, have each learner perform recognition on its feature subset, and output each learner's recognition result for the corresponding feature subset;
In this embodiment, the data sets keep the positive and negative samples balanced. Further partitioning each data set yields multiple feature subsets, and the positive and negative samples in each feature subset remain relatively balanced, just as they are in the data set.
In addition, each learner outputs the recognition probability of the behavior type for each feature attribute of each sample. These probabilities are fitted to obtain the recognition probability distribution, which is then further fitted to obtain the distribution probability of behavior type recognition for the data set.
Further, the process of training on a feature subset with the learners specifically includes the following steps:
(1) The current learner selects a sub-sample, expanded sample, or positive sample from the feature subset to construct a sample node and, according to a preset feature selection parameter, selects m feature attributes from the selected sample;
(2) The learner selects one feature attribute from the selected feature attributes to construct a feature node under the sample node;
(3) The learner again selects m feature attributes from the selected sample and constructs lower-level feature nodes under the feature node, stopping when the number of feature nodes reaches m, which yields the corresponding decision tree;
(4) The next learner selects a previously unselected sub-sample, expanded sample, or positive sample from the feature subset to construct a sample node and build a decision tree, stopping once a decision tree has been built for every sub-sample, expanded sample, and/or positive sample in the feature subset;
(5) From the multiple decision trees corresponding to the feature subset, the learners calculate and output the recognition probability distribution over behavior types corresponding to each learner.
In this embodiment, when the Random Forest algorithm is used for model training, suppose the feature subset contains A samples (the total of sub-samples, expanded samples, and positive samples). One sample is drawn at random, with replacement, from the A samples to serve as the sample at the root node of a decision tree, which constructs the sample node. If each sample contains K attributes, then whenever a sample node of the decision tree needs to branch, k feature attributes are drawn at random from the K attributes, where k ≤ K, and a preset strategy such as information gain selects one of the k feature attributes as the branching attribute of that node, which constructs a feature node. These steps repeat until the number of feature nodes reaches m. The decision trees obtained by the multiple learners together form the random forest.
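The branching rule described above (draw k of the K attributes at random, then keep the one with the highest information gain) can be sketched in Python as follows. The sketch assumes discrete attribute values and class labels over a set of rows, which is the conventional random forest setting; the function names and the seed parameter are illustrative assumptions.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting rows on the discrete attribute at index attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def choose_split_attribute(rows, labels, k, seed=0):
    """Randomly draw k of the K attribute indices, then keep the highest-gain one."""
    rng = random.Random(seed)
    candidates = rng.sample(range(len(rows[0])), k)
    return max(candidates, key=lambda a: information_gain(rows, labels, a))
```

Restricting each split to a random subset of attributes is what decorrelates the individual trees; the parameter k plays the role of the feature selection parameter that the later embodiments adjust.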
208. Determine the recognition result of the behavior recognition model for the data set from the recognition results output by the learners, where the recognition result of the behavior recognition model for the data set includes a first behavior recognition result and a second behavior recognition result;
209. Calculate the misclassification rate of the behavior recognition model on the first data set from the first behavior recognition result and the behavior types corresponding to the first data set, calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result, and update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
210. Obtain the auto insurance claim data to be identified, input it into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
In this embodiment of the present application, the scarce positive samples are expanded by sampling neighbor samples and applying linear interpolation, and the abundant negative samples are down-sampled. This addresses the imbalance between positive and negative samples and improves the accuracy of model training. The misclassification rate and the loss value measure the output accuracy of the behavior recognition model, so that even with imbalanced samples the model takes full account of the characteristics of the positive samples and produces more accurate output.
Referring to FIG. 3, a third embodiment of the auto insurance claim behavior recognition method in the embodiments of the present application includes:
301. Obtain historical auto insurance claim data, divide it into positive samples and negative samples, perform neighbor propagation processing on the positive samples to obtain multiple expanded samples, and down-sample the negative samples to obtain multiple sub-samples;
302. Combine each sub-sample with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set;
303. Input the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
304. Compile statistics on the behavior types in the first behavior recognition result to obtain the first distribution probability of behavior types in the first data set, and determine the number of samples in the first data set misclassified by the behavior recognition model from the first distribution probability and the behavior types corresponding to the first data set;
305. Calculate the ratio of the number of misclassified samples to the total number of samples in the first data set, take this ratio as the misclassification rate of the behavior recognition model on the first data set, and calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result;
In this embodiment, suppose the behavior types of auto insurance claims are normal and abnormal, and sample 1 contains A feature attributes, of which a1 are predicted normal and a2 are predicted abnormal, with a1 > a2. The classification result for the sample is then normal, and its misclassification rate is a2/A.
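The two ratios described here, the per-sample rate a2/A and the data-set rate of step 305, can be sketched in Python as follows. The function names are illustrative, and the majority-vote tie-breaking (first label encountered wins) is an assumption the application does not address.

```python
from collections import Counter

def sample_misclassification_rate(predictions):
    """Return (majority label, share of predictions that disagree with it)."""
    counts = Counter(predictions)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, (len(predictions) - majority_count) / len(predictions)

def data_set_misclassification_rate(predicted, actual):
    """Number of misclassified samples divided by the total number of samples."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)
```

With A = 10 feature attributes, a1 = 7 predicted normal and a2 = 3 predicted abnormal, the first function classifies the sample as normal with a rate of 3/10.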
306. Calculate the cross-entropy loss between the misclassification rate and the relative entropy loss, and determine whether the cross-entropy loss and the misclassification rate satisfy a preset loss condition;
307. If not, adjust the feature selection parameters in the behavior recognition model according to the cross-entropy loss and the misclassification rate;
308. Update the behavior recognition model according to the adjusted feature selection parameters until the behavior recognition model converges;
In this embodiment, the relative entropy loss between the first distribution probability and the second distribution probability measures how much the predictions of the two models differ, that is, how much the predictions differ when the model is trained with the positive samples versus with the expanded samples derived from them. This difference is used to iteratively update the model. Specifically, the relative entropy loss between the first distribution probability and the second distribution probability is calculated as follows:
Figure PCTCN2022071477-appb-000001
where R(P‖Q) is the relative entropy loss, λ is the balance coefficient for positive sample expansion, p(x1) denotes the probability values in the first distribution probability, and q(x1) denotes the probability values in the second distribution probability.
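The formula itself survives only as an image reference, so its exact form cannot be recovered from the text. The Python sketch below assumes the standard relative entropy (Kullback-Leibler divergence) scaled by λ, that is, λ · Σ p(x) · log(p(x) / q(x)). The placement of λ as a multiplicative factor, the natural logarithm, and the small constant `eps` are all assumptions.

```python
import math

def relative_entropy(p, q, lam=1.0, eps=1e-12):
    """lam * sum_i p_i * log(p_i / q_i); eps guards against division by zero."""
    return lam * sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing by convention
    )
```

The value is zero when the two distributions coincide and grows as the predictions of the two models diverge, which matches the role described for the relative entropy loss.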
The misclassification rate measures the accuracy on the first data set, and the cross-entropy loss measures the difference between the two models. When both the accuracy and the difference between the two models satisfy the conditions, the behavior recognition model is judged to have converged. Specifically, a misclassification rate threshold and a cross-entropy loss threshold can be set to determine whether both quantities satisfy the loss condition. The misclassification rate can be checked first: if it does not satisfy the loss condition, the accuracy on the first data set is insufficient and the subsequent cross-entropy loss check is unnecessary.
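The two-stage threshold check can be sketched in Python as below. The threshold values are placeholders chosen for illustration only; the application does not state concrete values.

```python
def has_converged(misclassification_rate, cross_entropy_loss,
                  rate_threshold=0.1, loss_threshold=0.05):
    """Check the misclassification rate first, then the cross-entropy loss."""
    if misclassification_rate > rate_threshold:
        # Accuracy on the first data set is insufficient; skip the loss check.
        return False
    return cross_entropy_loss <= loss_threshold
```

When the function returns False, the training loop would adjust the feature selection parameters and retrain, as in steps 307 and 308.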
309. Obtain the auto insurance claim data to be identified, input it into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
In this embodiment of the present application, multiple learners in the training model learn decision trees from the data sets and thereby identify the category probabilities of auto insurance claim behavior, which reduces the skew in results caused by sample imbalance and corrects the bias in the model output.
The auto insurance claim behavior recognition method in the embodiments of the present application has been described above. The auto insurance claim behavior recognition apparatus is described below. Referring to FIG. 4, one embodiment of the auto insurance claim behavior recognition apparatus in the embodiments of the present application includes:
an expansion module 401, configured to obtain historical auto insurance claim data and divide the historical auto insurance claim data into positive samples and negative samples; perform neighbor propagation processing on the positive samples to obtain multiple expanded samples; and down-sample the negative samples to obtain multiple sub-samples;
a combination module 402, configured to combine each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set;
a training module 403, configured to input the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
an update module 404, configured to calculate the misclassification rate of the behavior recognition model on the first data set from the first behavior recognition result and the behavior types corresponding to the first data set, and calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result; and update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
a recognition module 405, configured to obtain the auto insurance claim data to be identified, input it into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
In this embodiment of the present application, the imbalance between positive and negative samples is addressed by expanding the scarce positive samples and down-sampling the abundant negative samples. The sub-samples are then combined with the positive samples and with the expanded samples, respectively, to obtain the first and second data sets, which are input into two identical behavior recognition models for training to obtain the first and second distribution probabilities. The outputs are then processed: the misclassification rate measures the accuracy on the first data set, and the relative entropy loss measures the difference between the first and second data sets. When both the accuracy and the difference between the two data sets satisfy the conditions, the behavior recognition model is obtained, which substantially reduces the recognition bias caused by sample imbalance. Finally, the behavior recognition model identifies the behavior category of the auto insurance claim to be identified, yielding a more accurate recognition result.
Referring to FIG. 5, another embodiment of the auto insurance claim behavior recognition apparatus in the embodiments of the present application includes:
an expansion module 401, configured to obtain historical auto insurance claim data and divide the historical auto insurance claim data into positive samples and negative samples; perform neighbor propagation processing on the positive samples to obtain multiple expanded samples; and down-sample the negative samples to obtain multiple sub-samples;
a combination module 402, configured to combine each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set;
a training module 403, configured to input the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, obtaining a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
an update module 404, configured to calculate the misclassification rate of the behavior recognition model on the first data set from the first behavior recognition result and the behavior types corresponding to the first data set, and calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result; and update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
a recognition module 405, configured to obtain the auto insurance claim data to be identified, input it into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
Specifically, the expansion module 401 includes:
a distance calculation unit 4011, configured to calculate the Euclidean distance between every pair of positive samples in turn, and determine the neighbor samples of each positive sample according to the Euclidean distances;
an interpolation processing unit 4012, configured to randomly select a preset number of neighbor samples for linear interpolation, and construct expanded samples from the interpolation results.
Specifically, the training module 403 includes:
an input unit 4031, configured to input the data sets separately into the preset behavior recognition model, where the behavior recognition model includes an input layer and a decision layer, and the data sets include the first data set and the second data set;
a training unit 4032, configured to randomly sample the data set through the input layer to obtain multiple feature subsets; input each of the feature subsets into a different learner in the decision layer; have each learner perform recognition on its feature subset; and output each learner's recognition result for the corresponding feature subset;
an output unit 4033, configured to determine the recognition result of the behavior recognition model for the data set from the recognition results output by the learners, where the recognition result of the behavior recognition model for the data set includes the first behavior recognition result and the second behavior recognition result.
Specifically, the update module 404 includes:
a statistics unit 4041, configured to compile statistics on the behavior types in the first behavior recognition result to obtain the first distribution probability of behavior types in the first data set;
a ratio calculation unit 4042, configured to determine the number of samples in the first data set misclassified by the behavior recognition model from the first distribution probability and the behavior types corresponding to the first data set; calculate the ratio of the number of misclassified samples to the total number of samples in the first data set; and take this ratio as the misclassification rate of the behavior recognition model on the first data set.
Specifically, the training unit is further configured to:
Select, through the current learner, one feature sample from the feature subset to construct a sample node, and select m feature attributes from the selected feature sample according to a preset feature selection parameter;
Randomly select, through the learner, one feature attribute from the m selected feature attributes to construct a child node under the sample node;
Select, through the learner, m feature attributes from the selected feature sample again and construct lower-level child nodes under the child node, stopping when the number of child nodes reaches m, to obtain the corresponding decision tree;
Select, through the next learner, a feature sample that has not yet been selected from the feature subset and construct a decision tree, stopping when a decision tree has been obtained for each feature sample in the feature subset;
Identify, using each decision tree, the behavior type of the corresponding feature sample in the feature subset to obtain the recognition result of the feature subset.
Specifically, the update module 404 further includes:
A loss calculation unit 4043, configured to calculate a cross-entropy loss between the misclassification rate and the relative entropy loss, and to determine whether the cross-entropy loss and the misclassification rate satisfy a preset loss condition;
An adjustment unit 4044, configured to adjust, if the condition is not satisfied, the feature selection parameter in the behavior recognition model according to the cross-entropy loss and the misclassification rate;
A determination unit 4045, configured to update the behavior recognition model according to the adjusted feature selection parameter, stopping when the behavior recognition model converges.
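As a non-limiting illustration, the relative entropy (Kullback-Leibler divergence) between two behavior type distributions, together with a simple check of a loss condition, may be sketched as follows. The threshold values, the function names, and the form of the condition are assumptions made for this sketch. The application does not give the formula by which the misclassification rate and the relative entropy loss are combined, so that step is not reproduced here.

```python
import math


def relative_entropy(p, q, eps=1e-12):
    """Relative entropy D(p || q) between two discrete distributions.

    eps is a small smoothing constant (assumption) that avoids log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def needs_update(misclassification_rate, kl_loss, max_rate=0.1, max_kl=0.05):
    """Return True when the hypothetical preset loss condition is not met."""
    return misclassification_rate > max_rate or kl_loss > max_kl
```

Under this sketch, the feature selection parameter would be adjusted and the model retrained for as long as `needs_update` returns True.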
In the embodiments of the present application, the less numerous positive samples are expanded by sampling neighboring samples and applying linear interpolation to obtain expanded samples, and the more numerous negative samples are down-sampled. This addresses the imbalance between positive and negative samples and improves the accuracy of model training. The misclassification rate and the loss value are used to measure the output accuracy of the behavior recognition model, so that when the samples are imbalanced the model can take full account of the characteristics of the positive samples and produce more accurate output. In addition, multiple learners in the model learn decision trees from the data sets in order to identify the category probability of auto insurance claim behavior, which reduces the bias in the results caused by sample imbalance and corrects the bias of the model output.
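As a non-limiting illustration, the expansion of positive samples by neighbor sampling and linear interpolation, and the down-sampling of negative samples, may be sketched as follows. The function names, the neighbor count `k`, and the subset sizes are hypothetical. Samples are assumed to be numeric feature vectors, and at least two positive samples are assumed to exist.

```python
import math
import random


def expand_positives(positives, k=2, n_new=4, seed=0):
    """Create n_new synthetic positive samples by linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(positives)
        # The k nearest neighbours of base by Euclidean distance.
        others = [p for p in positives if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def downsample_negatives(negatives, n_subsets=2, size=3, seed=0):
    """Draw n_subsets random sub-samples (without replacement) of the negatives."""
    rng = random.Random(seed)
    return [rng.sample(negatives, size) for _ in range(n_subsets)]
```

Each sub-sample of negatives can then be combined with the original and expanded positive samples to form the first data set and the second data set.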
Figures 4 and 5 above describe the auto insurance claim settlement behavior recognition apparatus in the embodiments of the present application in detail from the perspective of modular functional entities. The following describes the auto insurance claim settlement behavior recognition device in the embodiments of the present application in detail from the perspective of hardware processing.
Fig. 6 is a schematic structural diagram of an auto insurance claim settlement behavior recognition device provided by an embodiment of the present application. The auto insurance claim settlement behavior recognition device 600 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 610, a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) that store application programs 633 or data 632. The memory 620 and the storage medium 630 may provide transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the auto insurance claim settlement behavior recognition device 600. Further, the processor 610 may be configured to communicate with the storage medium 630 and to execute, on the auto insurance claim settlement behavior recognition device 600, the series of instruction operations in the storage medium 630.
The auto insurance claim settlement behavior recognition device 600 may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in Fig. 6 does not limit the auto insurance claim settlement behavior recognition device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides an auto insurance claim settlement behavior recognition device. The device includes a memory and a processor, and computer-readable instructions are stored in the memory. When the computer-readable instructions are executed by the processor, the processor performs the steps of the auto insurance claim settlement behavior recognition method in the above embodiments.
The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer performs the steps of the auto insurance claim settlement behavior recognition method.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. An auto insurance claim settlement behavior recognition method, wherein the method comprises:
    acquiring historical auto insurance claim data, and dividing the historical auto insurance claim data into positive samples and negative samples;
    performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples;
    combining each of the sub-samples with the positive samples and the expanded samples, respectively, to obtain a first data set and a second data set;
    inputting the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
    calculating, according to the first behavior recognition result and the behavior types corresponding to the first data set, a misclassification rate of the behavior recognition model on the first data set, and calculating a relative entropy loss between the first behavior recognition result and the second behavior recognition result;
    updating the behavior recognition model according to the misclassification rate and the relative entropy loss, stopping when the behavior recognition model converges;
    acquiring auto insurance claim data to be identified, inputting the auto insurance claim data to be identified into the behavior recognition model, and identifying the behavior category corresponding to the auto insurance claim data to be identified.
  2. The auto insurance claim settlement behavior recognition method according to claim 1, wherein performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples comprises:
    calculating, in turn, the Euclidean distance between every two positive samples, and determining the neighboring samples of each positive sample according to the Euclidean distances;
    randomly selecting a preset number of neighboring samples for linear interpolation processing, and constructing the expanded samples according to the result of the processing.
  3. The auto insurance claim settlement behavior recognition method according to claim 1, wherein inputting the first data set and the second data set separately into the preset behavior recognition model for behavior type recognition, to obtain the first behavior recognition result corresponding to the first data set and the second behavior recognition result corresponding to the second data set, comprises:
    inputting each data set into the preset behavior recognition model, wherein the behavior recognition model includes an input layer and a decision layer, and the data sets include the first data set and the second data set;
    performing random sampling on the data set through the input layer to obtain a plurality of feature subsets;
    inputting each feature subset into a different learner in the decision layer, identifying each feature subset through the learners, and outputting each learner's recognition result for its corresponding feature subset;
    determining the recognition result of the behavior recognition model for the data set according to the recognition results output by the learners, wherein the recognition results of the behavior recognition model for the data sets include the first behavior recognition result and the second behavior recognition result.
  4. The auto insurance claim settlement behavior recognition method according to claim 1, wherein calculating, according to the first behavior recognition result and the behavior types corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set comprises:
    compiling statistics on the behavior types in the first behavior recognition result to obtain a first distribution probability of the behavior types in the first data set;
    determining, according to the first distribution probability and the behavior types corresponding to the first data set, the number of samples of the first data set that the behavior recognition model misclassifies;
    calculating the ratio of the number of misclassified samples to the total number of samples in the first data set, and taking the ratio as the misclassification rate of the behavior recognition model on the first data set.
  5. The auto insurance claim settlement behavior recognition method according to claim 3, wherein identifying each feature subset through the learners and outputting each learner's recognition result for its corresponding feature subset comprises:
    selecting, through the current learner, one feature sample from the feature subset to construct a sample node, and selecting m feature attributes from the selected feature sample according to a preset feature selection parameter;
    randomly selecting, through the learner, one feature attribute from the m selected feature attributes to construct a child node under the sample node;
    selecting, through the learner, m feature attributes from the selected feature sample again and constructing lower-level child nodes under the child node, stopping when the number of child nodes reaches m, to obtain the corresponding decision tree;
    selecting, through the next learner, a feature sample that has not yet been selected from the feature subset and constructing a decision tree, stopping when a decision tree has been obtained for each feature sample in the feature subset;
    identifying, using each decision tree, the behavior type of the corresponding feature sample in the feature subset to obtain the recognition result of the feature subset.
  6. The auto insurance claim settlement behavior recognition method according to any one of claims 1-5, wherein updating the behavior recognition model according to the misclassification rate and the relative entropy loss, stopping when the behavior recognition model converges, comprises:
    calculating a cross-entropy loss between the misclassification rate and the relative entropy loss, and determining whether the cross-entropy loss and the misclassification rate satisfy a preset loss condition;
    if not, adjusting the feature selection parameter in the behavior recognition model according to the cross-entropy loss and the misclassification rate;
    updating the behavior recognition model according to the adjusted feature selection parameter, stopping when the behavior recognition model converges.
    a recognition module, configured to acquire auto insurance claim data to be identified, input the auto insurance claim data to be identified into the behavior recognition model, and identify the behavior category corresponding to the auto insurance claim data to be identified.
  7. An auto insurance claim settlement behavior recognition device, wherein the device comprises a memory and at least one processor, and instructions are stored in the memory;
    the at least one processor invokes the instructions in the memory, so that the auto insurance claim settlement behavior recognition device performs the following auto insurance claim settlement behavior recognition method:
    acquiring historical auto insurance claim data, and dividing the historical auto insurance claim data into positive samples and negative samples;
    performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples;
    combining each of the sub-samples with the positive samples and the expanded samples, respectively, to obtain a first data set and a second data set;
    inputting the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
    calculating, according to the first behavior recognition result and the behavior types corresponding to the first data set, a misclassification rate of the behavior recognition model on the first data set, and calculating a relative entropy loss between the first behavior recognition result and the second behavior recognition result;
    updating the behavior recognition model according to the misclassification rate and the relative entropy loss, stopping when the behavior recognition model converges;
    acquiring auto insurance claim data to be identified, inputting the auto insurance claim data to be identified into the behavior recognition model, and identifying the behavior category corresponding to the auto insurance claim data to be identified.
  8. The auto insurance claim settlement behavior recognition device according to claim 7, wherein performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples comprises:
    calculating, in turn, the Euclidean distance between every two positive samples, and determining the neighboring samples of each positive sample according to the Euclidean distances;
    randomly selecting a preset number of neighboring samples for linear interpolation processing, and constructing the expanded samples according to the result of the processing.
  9. The auto insurance claim settlement behavior recognition device according to claim 7, wherein inputting the first data set and the second data set separately into the preset behavior recognition model for behavior type recognition, to obtain the first behavior recognition result corresponding to the first data set and the second behavior recognition result corresponding to the second data set, comprises:
    inputting each data set into the preset behavior recognition model, wherein the behavior recognition model includes an input layer and a decision layer, and the data sets include the first data set and the second data set;
    performing random sampling on the data set through the input layer to obtain a plurality of feature subsets;
    inputting each feature subset into a different learner in the decision layer, identifying each feature subset through the learners, and outputting each learner's recognition result for its corresponding feature subset;
    determining the recognition result of the behavior recognition model for the data set according to the recognition results output by the learners, wherein the recognition results of the behavior recognition model for the data sets include the first behavior recognition result and the second behavior recognition result.
  10. The auto insurance claim settlement behavior recognition device according to claim 7, wherein calculating, according to the first behavior recognition result and the behavior types corresponding to the first data set, the misclassification rate of the behavior recognition model on the first data set comprises:
    compiling statistics on the behavior types in the first behavior recognition result to obtain a first distribution probability of the behavior types in the first data set;
    determining, according to the first distribution probability and the behavior types corresponding to the first data set, the number of samples of the first data set that the behavior recognition model misclassifies;
    calculating the ratio of the number of misclassified samples to the total number of samples in the first data set, and taking the ratio as the misclassification rate of the behavior recognition model on the first data set.
  11. The auto insurance claim settlement behavior recognition device according to claim 10, wherein identifying each feature subset through the learners and outputting each learner's recognition result for its corresponding feature subset comprises:
    selecting, through the current learner, one feature sample from the feature subset to construct a sample node, and selecting m feature attributes from the selected feature sample according to a preset feature selection parameter;
    randomly selecting, through the learner, one feature attribute from the m selected feature attributes to construct a child node under the sample node;
    selecting, through the learner, m feature attributes from the selected feature sample again and constructing lower-level child nodes under the child node, stopping when the number of child nodes reaches m, to obtain the corresponding decision tree;
    selecting, through the next learner, a feature sample that has not yet been selected from the feature subset and constructing a decision tree, stopping when a decision tree has been obtained for each feature sample in the feature subset;
    identifying, using each decision tree, the behavior type of the corresponding feature sample in the feature subset to obtain the recognition result of the feature subset.
  12. The auto insurance claim settlement behavior recognition device according to any one of claims 7-11, wherein updating the behavior recognition model according to the misclassification rate and the relative entropy loss, stopping when the behavior recognition model converges, comprises:
    calculating a cross-entropy loss between the misclassification rate and the relative entropy loss, and determining whether the cross-entropy loss and the misclassification rate satisfy a preset loss condition;
    if not, adjusting the feature selection parameter in the behavior recognition model according to the cross-entropy loss and the misclassification rate;
    updating the behavior recognition model according to the adjusted feature selection parameter, stopping when the behavior recognition model converges.
  13. A computer-readable storage medium storing instructions, wherein the instructions, when executed by a processor, implement the following auto insurance claim settlement behavior recognition method:
    acquiring historical auto insurance claim data, and dividing the historical auto insurance claim data into positive samples and negative samples;
    performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples, and performing down-sampling processing on the negative samples to obtain a plurality of sub-samples;
    combining each of the sub-samples with the positive samples and the expanded samples, respectively, to obtain a first data set and a second data set;
    inputting the first data set and the second data set separately into a preset behavior recognition model for behavior type recognition, to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
    calculating, according to the first behavior recognition result and the behavior types corresponding to the first data set, a misclassification rate of the behavior recognition model on the first data set, and calculating a relative entropy loss between the first behavior recognition result and the second behavior recognition result;
    updating the behavior recognition model according to the misclassification rate and the relative entropy loss, stopping when the behavior recognition model converges;
    acquiring auto insurance claim data to be identified, inputting the auto insurance claim data to be identified into the behavior recognition model, and identifying the behavior category corresponding to the auto insurance claim data to be identified.
  14. The computer-readable storage medium according to claim 13, wherein performing neighbor propagation processing on the positive samples to obtain a plurality of expanded samples comprises:
    calculating, in turn, the Euclidean distance between every two positive samples, and determining the neighboring samples of each positive sample according to the Euclidean distances;
    randomly selecting a preset number of neighboring samples for linear interpolation processing, and constructing the expanded samples according to the result of the processing.
  15. The computer-readable storage medium according to claim 13, wherein inputting the first data set and the second data set separately into the preset behavior recognition model for behavior type recognition, to obtain the first behavior recognition result corresponding to the first data set and the second behavior recognition result corresponding to the second data set, comprises:
    inputting each data set into the preset behavior recognition model, wherein the behavior recognition model includes an input layer and a decision layer, and the data sets include the first data set and the second data set;
    performing random sampling on the data set through the input layer to obtain a plurality of feature subsets;
    inputting each feature subset into a different learner in the decision layer, identifying each feature subset through the learners, and outputting each learner's recognition result for its corresponding feature subset;
    determining the recognition result of the behavior recognition model for the data set according to the recognition results output by the learners, wherein the recognition results of the behavior recognition model for the data sets include the first behavior recognition result and the second behavior recognition result.
  16. The computer-readable storage medium according to claim 13, wherein calculating the misclassification rate of the behavior recognition model on the first data set according to the first behavior recognition result and the behavior types corresponding to the first data set comprises:
    counting the behavior types in the first behavior recognition result to obtain a first distribution probability of the behavior types in the first data set;
    determining, according to the first distribution probability and the behavior types corresponding to the first data set, the number of samples of the first data set misclassified by the behavior recognition model;
    calculating the ratio of the number of misclassified samples to the total number of samples in the first data set, and taking the ratio as the misclassification rate of the behavior recognition model on the first data set.
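As a worked illustration of this claim, the sketch below tallies the predicted behavior types into a distribution and computes the misclassification rate as misclassified samples divided by total samples. It assumes per-sample predictions and true behavior types are both available and compares them directly; the claim does not spell out how the misclassified count is derived from the distribution, so that comparison, the function names, and the labels are assumptions.

```python
from collections import Counter


def type_distribution(predicted):
    """Share of each predicted behavior type in the recognition result."""
    counts = Counter(predicted)
    total = len(predicted)
    return {label: count / total for label, count in counts.items()}


def misclassification_rate(predicted, actual):
    """Number of misclassified samples divided by the total sample count."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equally long")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Hypothetical recognition result and true behavior types for four claims.
predicted = ["fraud", "normal", "normal", "fraud"]
actual = ["fraud", "normal", "fraud", "fraud"]

distribution = type_distribution(predicted)
rate = misclassification_rate(predicted, actual)
```

Here one of the four samples is misclassified, so the rate is 1/4.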
  17. The computer-readable storage medium according to claim 16, wherein recognizing each feature subset with the learner and outputting each learner's recognition result for the corresponding feature subset comprises:
    selecting, by the current learner, one feature sample from the feature subset to construct a sample node, and selecting m feature attributes from the selected feature sample according to a preset feature selection parameter;
    randomly selecting, by the learner, one feature attribute from the m selected feature attributes to construct a child node under the sample node;
    selecting again, by the learner, m feature attributes from the selected feature sample and constructing lower-level child nodes under the child node, stopping when the number of child nodes reaches m, to obtain a corresponding decision tree;
    selecting, by the next learner, a not-yet-selected feature sample from the feature subset to construct a decision tree, stopping when a decision tree has been obtained for each feature sample in the feature subset;
    recognizing, with each decision tree, the behavior type of the corresponding feature sample in the feature subset, to obtain the recognition result of the feature subset.
  18. The computer-readable storage medium according to any one of claims 13-17, wherein updating the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges comprises:
    calculating a cross-entropy loss between the misclassification rate and the relative entropy loss, and determining whether the cross-entropy loss and the misclassification rate satisfy a preset loss condition;
    if not, adjusting the feature selection parameter of the behavior recognition model according to the cross-entropy loss and the misclassification rate;
    updating the behavior recognition model according to the adjusted feature selection parameter until the behavior recognition model converges.
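The relative entropy loss used in this update step is the Kullback-Leibler divergence, which the claims compute between the first and second behavior recognition results. A minimal sketch follows, assuming each recognition result has already been summarized as a probability distribution over behavior types; the dictionary representation, the smoothing constant `eps`, and the example values are assumptions, and the sketch does not attempt to reproduce the claimed cross-entropy step or the preset loss condition.

```python
import math


def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) between two behavior-type distributions.

    p and q map behavior types to probabilities. eps is a hypothetical
    smoothing constant that avoids division by zero when q lacks a type.
    """
    total = 0.0
    for label in set(p) | set(q):
        p_val = p.get(label, 0.0)
        q_val = q.get(label, 0.0)
        if p_val > 0.0:
            total += p_val * math.log((p_val + eps) / (q_val + eps))
    return total


# Hypothetical distributions of predicted types on the two data sets.
first_result = {"fraud": 0.5, "normal": 0.5}
second_result = {"fraud": 0.25, "normal": 0.75}

loss = relative_entropy(first_result, second_result)
```

The divergence is zero when the two distributions coincide and grows as the model's outputs on the two data sets drift apart, which is why it can serve as a consistency signal during the update.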
  19. A vehicle insurance claim behavior recognition apparatus, wherein the vehicle insurance claim behavior recognition apparatus comprises:
    an expansion module, configured to acquire historical vehicle insurance claim data and divide the historical vehicle insurance claim data into positive samples and negative samples; to perform neighbor propagation processing on the positive samples to obtain a plurality of expanded samples; and to perform downsampling on the negative samples to obtain a plurality of sub-samples;
    a combination module, configured to combine each of the sub-samples with the positive samples and with the expanded samples, respectively, to obtain a first data set and a second data set;
    a training module, configured to input the first data set and the second data set respectively into a preset behavior recognition model to recognize behavior types, and to obtain a first behavior recognition result corresponding to the first data set and a second behavior recognition result corresponding to the second data set;
    an update module, configured to calculate the misclassification rate of the behavior recognition model on the first data set according to the first behavior recognition result and the behavior types corresponding to the first data set; to calculate the relative entropy loss between the first behavior recognition result and the second behavior recognition result; and to update the behavior recognition model according to the misclassification rate and the relative entropy loss until the behavior recognition model converges;
    a recognition module, configured to acquire vehicle insurance claim data to be recognized, input the vehicle insurance claim data to be recognized into the behavior recognition model, and recognize the behavior category corresponding to the vehicle insurance claim data to be recognized.
  20. The vehicle insurance claim behavior recognition apparatus according to claim 19, wherein the expansion module comprises:
    a distance calculation unit, configured to calculate, in turn, the Euclidean distance between every two positive samples, and to determine the neighbor samples of each positive sample according to the Euclidean distances;
    an interpolation processing unit, configured to randomly select a preset number of the neighbor samples for linear interpolation, and to construct the expanded samples according to the interpolation results.
PCT/CN2022/071477 2021-06-08 2022-01-12 Vehicle insurance claim behavior recognition method, apparatus, and device, and storage medium WO2022257458A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110635315.3 2021-06-08
CN202110635315.3A CN113256434B (en) 2021-06-08 2021-06-08 Method, device, equipment and storage medium for recognizing vehicle insurance claim settlement behaviors

Publications (1)

Publication Number Publication Date
WO2022257458A1 (en) 2022-12-15

Family

ID=77186966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071477 WO2022257458A1 (en) 2021-06-08 2022-01-12 Vehicle insurance claim behavior recognition method, apparatus, and device, and storage medium

Country Status (2)

Country Link
CN (1) CN113256434B (en)
WO (1) WO2022257458A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720577A (en) * 2023-08-09 2023-09-08 凯泰铭科技(北京)有限公司 Decision tree-based vehicle insurance rule writing and deploying method and system
CN117577214A (en) * 2023-05-19 2024-02-20 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113256434B (en) * 2021-06-08 2021-11-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing vehicle insurance claim settlement behaviors
TWI809635B (en) * 2021-12-29 2023-07-21 國泰世紀產物保險股份有限公司 Insurance claims fraud detecting system and method for assessing the risk of insurance claims fraud using the same

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545275A (en) * 2017-07-27 2018-01-05 华南理工大学 The unbalanced data Ensemble classifier method that resampling is merged with cost sensitive learning
CN108764346A (en) * 2018-05-30 2018-11-06 华东理工大学 A kind of mixing sampling integrated classifier based on entropy
US20190005586A1 (en) * 2017-06-30 2019-01-03 Alibaba Group Holding Limited Prediction algorithm based attribute data processing
CN109523412A (en) * 2018-11-14 2019-03-26 平安科技(深圳)有限公司 Intelligent core protects method, apparatus, computer equipment and computer readable storage medium
CN109886388A (en) * 2019-01-09 2019-06-14 平安科技(深圳)有限公司 A kind of training sample data extending method and device based on variation self-encoding encoder
CN110390348A (en) * 2019-06-11 2019-10-29 仲恺农业工程学院 A kind of unbalanced dataset classification method, system, device and storage medium
CN111062806A (en) * 2019-12-13 2020-04-24 合肥工业大学 Personal finance credit risk evaluation method, system and storage medium
CN111582651A (en) * 2020-04-09 2020-08-25 上海淇毓信息科技有限公司 User risk analysis model training method and device and electronic equipment
CN111612640A (en) * 2020-05-27 2020-09-01 上海海事大学 Data-driven vehicle insurance fraud identification method
CN111782472A (en) * 2020-06-30 2020-10-16 平安科技(深圳)有限公司 System abnormality detection method, device, equipment and storage medium
CN113256434A (en) * 2021-06-08 2021-08-13 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing vehicle insurance claim settlement behaviors

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109544190A (en) * 2018-11-28 2019-03-29 北京芯盾时代科技有限公司 A kind of fraud identification model training method, fraud recognition methods and device
CN111461168A (en) * 2020-03-02 2020-07-28 平安科技(深圳)有限公司 Training sample expansion method and device, electronic equipment and storage medium
CN111881991B (en) * 2020-08-03 2023-11-10 联仁健康医疗大数据科技股份有限公司 Method and device for identifying fraud and electronic equipment
CN112766319A (en) * 2020-12-31 2021-05-07 平安科技(深圳)有限公司 Dialogue intention recognition model training method and device, computer equipment and medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190005586A1 (en) * 2017-06-30 2019-01-03 Alibaba Group Holding Limited Prediction algorithm based attribute data processing
CN107545275A (en) * 2017-07-27 2018-01-05 华南理工大学 The unbalanced data Ensemble classifier method that resampling is merged with cost sensitive learning
CN108764346A (en) * 2018-05-30 2018-11-06 华东理工大学 A kind of mixing sampling integrated classifier based on entropy
CN109523412A (en) * 2018-11-14 2019-03-26 平安科技(深圳)有限公司 Intelligent core protects method, apparatus, computer equipment and computer readable storage medium
CN109886388A (en) * 2019-01-09 2019-06-14 平安科技(深圳)有限公司 A kind of training sample data extending method and device based on variation self-encoding encoder
CN110390348A (en) * 2019-06-11 2019-10-29 仲恺农业工程学院 A kind of unbalanced dataset classification method, system, device and storage medium
CN111062806A (en) * 2019-12-13 2020-04-24 合肥工业大学 Personal finance credit risk evaluation method, system and storage medium
CN111582651A (en) * 2020-04-09 2020-08-25 上海淇毓信息科技有限公司 User risk analysis model training method and device and electronic equipment
CN111612640A (en) * 2020-05-27 2020-09-01 上海海事大学 Data-driven vehicle insurance fraud identification method
CN111782472A (en) * 2020-06-30 2020-10-16 平安科技(深圳)有限公司 System abnormality detection method, device, equipment and storage medium
CN113256434A (en) * 2021-06-08 2021-08-13 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing vehicle insurance claim settlement behaviors

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117577214A (en) * 2023-05-19 2024-02-20 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm
CN117577214B (en) * 2023-05-19 2024-04-12 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm
CN116720577A (en) * 2023-08-09 2023-09-08 凯泰铭科技(北京)有限公司 Decision tree-based vehicle insurance rule writing and deploying method and system
CN116720577B (en) * 2023-08-09 2023-10-27 凯泰铭科技(北京)有限公司 Decision tree-based vehicle insurance rule writing and deploying method and system

Also Published As

Publication number Publication date
CN113256434B (en) 2021-11-23
CN113256434A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
WO2022257458A1 (en) Vehicle insurance claim behavior recognition method, apparatus, and device, and storage medium
US7296018B2 (en) Resource-light method and apparatus for outlier detection
CN109739844B (en) Data classification method based on attenuation weight
CN113255815B (en) User behavior abnormity analysis method, device, equipment and storage medium
KR20050007306A (en) Processing mixed numeric and/or non-numeric data
WO2023279696A1 (en) Service risk customer group identification method, apparatus and device, and storage medium
CN108364463B (en) Traffic flow prediction method and system
CN110991474A (en) Machine learning modeling platform
CN111105241B (en) Identification method for anti-fraud of credit card transaction
CN114221790A (en) BGP (Border gateway protocol) anomaly detection method and system based on graph attention network
CA3037941A1 (en) Method and system for generating and using vehicle pricing models
CN109189876A (en) A kind of data processing method and device
WO2023279694A1 (en) Vehicle trade-in prediction method, apparatus, device, and storage medium
CN107392217B (en) Computer-implemented information processing method and device
CN111046930A (en) Power supply service satisfaction influence factor identification method based on decision tree algorithm
CN111695824A (en) Risk tail end client analysis method, device, equipment and computer storage medium
CN114299742B (en) Speed limit information dynamic identification and update recommendation method for expressway
CN114519519A (en) Method, device and medium for assessing enterprise default risk based on GBDT algorithm and logistic regression model
CN111639688B (en) Local interpretation method of Internet of things intelligent model based on linear kernel SVM
JP4343140B2 (en) Evaluation apparatus and computer program therefor
CN115018210B (en) Service data classification prediction method and device, computer equipment and storage medium
CN112733903B (en) SVM-RF-DT combination-based air quality monitoring and alarming method, system, device and medium
US20230196133A1 (en) Systems and methods for weight of evidence based feature engineering and machine learning
CN113657441A (en) Classification algorithm based on weighted Pearson correlation coefficient and combined with feature screening
CN114429172A (en) Load clustering method, device, equipment and medium based on transformer substation user constitution

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application; Ref document number: 22819066; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase; Ref country code: DE