CN117056782A - Data anomaly identification method, device, equipment and storage medium thereof - Google Patents

Data anomaly identification method, device, equipment and storage medium thereof

Info

Publication number
CN117056782A
CN117056782A (application CN202311037408.1A)
Authority
CN
China
Prior art keywords
data
model
test
anomaly identification
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311037408.1A
Other languages
Chinese (zh)
Inventor
陈奕宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202311037408.1A priority Critical patent/CN117056782A/en
Publication of CN117056782A publication Critical patent/CN117056782A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/08 Insurance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Accounting & Taxation (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The embodiment of the application belongs to the technical fields of artificial intelligence and financial technology, is applied to claim risk prediction business, and relates to a data anomaly identification method, device, equipment and storage medium. The method comprises: obtaining test data to be identified; performing time-series processing and feature extraction on the test data to obtain a test feature set; updating the test feature set; and inputting the updated test feature set into a trained data anomaly identification model, which predicts whether the test data are anomalous. A long short-term memory (LSTM) network is introduced to address the fact that traditional vehicle-claim business data span many links and are therefore poorly suited to time-series training. A gradient descent algorithm is also introduced to screen out the model hyperparameter combination under which the cost function decreases fastest, guaranteeing training convergence speed and prediction speed, so that whether claim data are anomalous can be predicted quickly, facilitating intelligent anomaly analysis of claim data.

Description

Data anomaly identification method, device, equipment and storage medium thereof
Technical Field
The application relates to the technical fields of artificial intelligence and financial technology, is applied to claim risk prediction business, and in particular relates to a data anomaly identification method, device, equipment and storage medium.
Background
With the development of the computer industry, traditional financial business is gradually transforming into financial-technology business, and in the financial industry this is most visible in insurance claim settlement, most of which occurs in the vehicle insurance field. In the insurance industry, vehicle insurance has the largest number of policies and the largest amounts paid out, so risk control for claim settlement business is the most important, whether through core underwriting rules applied before a policy is issued, risk-control models applied during settlement, or post-hoc enterprise graphs, all of which aim to reduce risk leakage.
In existing vehicle-insurance risk control, most deep-learning models are RNN (Recurrent Neural Network) models. An RNN excels at processing continuous features, where the output of one step is the input of the next: each step is tightly coupled, the previous result directly influences the next, and this suits business processes with strong continuity. For insurance claim settlement business, however, each link is relatively independent and cannot be chained in this way; the output of one link is not the input of the next, and the time-series data corresponding to a claim are numerous and long, so gradient vanishing becomes extremely severe. Continuing to use an RNN model therefore seriously degrades the model's prediction quality and reduces its accuracy.
Disclosure of Invention
The embodiment of the application aims to provide a data anomaly identification method, device, equipment and storage medium, so as to solve the problem in the prior art that using an RNN model for claim settlement prediction seriously degrades the model's prediction quality and reduces its accuracy.
In order to solve the above technical problems, the embodiment of the present application provides a data anomaly identification method, which adopts the following technical scheme:
a data anomaly identification method comprising the steps of:
acquiring test data to be identified;
performing time sequence processing and feature extraction on the test data to obtain a test feature set;
performing feature engineering processing on the test feature set to obtain derived features;
incorporating the derived features into the test feature set to update the test feature set;
and inputting the updated test feature set into a trained data anomaly identification model, and predicting whether the test data is anomaly data according to the trained data anomaly identification model.
Further, the step of performing time sequence processing and feature extraction on the test data to obtain a test feature set specifically includes:
According to a preset arrangement rule, the test data are arranged into time sequence data;
extracting feature data in the time sequence data according to a preset feature keyword table to obtain a test feature set;
The step of extracting feature data from the time-series data according to a preset feature keyword table to obtain a test feature set specifically comprises:
judging whether the feature data extracted according to the feature keyword table are all target extraction results;
if the non-target extraction result exists, carrying out missing value filling processing on the non-target extraction result, and taking filling data as a corresponding target extraction result;
and obtaining all target extraction results, adding all target extraction results into a preset ordered set according to the arrangement sequence of the time sequence data, and generating a test feature set.
Further, before performing the step of inputting the updated set of test features into the trained data anomaly identification model, the method further comprises:
acquiring full sample data, wherein the full sample data comprises positive sample data and negative sample data, the positive sample data is abnormal case setting data in a claim case, and the negative sample data is normal case setting data in the claim case;
Resampling the positive sample data according to a preset sampling mode to obtain a resampling result, and obtaining new positive sample data according to the resampling result;
inputting the negative sample data and the new positive sample data together into a pre-built data anomaly identification model, performing model training, and obtaining a pre-trained data anomaly identification model, wherein the data anomaly identification model is constructed from a long short-term memory network and a gradient descent algorithm;
inputting the full sample data into the pre-trained data anomaly identification model to obtain a model output result;
performing output verification on the pre-trained data anomaly identification model according to the labeling result of the full sample data and the model output result;
if verification fails, performing super-parameter tuning on the pre-constructed data anomaly identification model, and performing iterative training until a model loss value meets a preset loss threshold, and stopping iterative training to obtain the data anomaly identification model with successful training.
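The model described in the claims above is built on a long short-term memory network, but the patent does not disclose its concrete architecture. As a purely illustrative sketch in pure Python (hidden size 1, arbitrary placeholder weights, all assumptions), one LSTM cell step applied over a short claim-link sequence looks like this:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM cell step for scalar input and state (hidden size 1).

    w holds, for each gate, an input weight (*x), a recurrent
    weight (*h) and a bias (*b); all values here are placeholders.
    """
    i = sigmoid(w['ix'] * x + w['ih'] * h_prev + w['ib'])    # input gate
    f = sigmoid(w['fx'] * x + w['fh'] * h_prev + w['fb'])    # forget gate
    o = sigmoid(w['ox'] * x + w['oh'] * h_prev + w['ob'])    # output gate
    g = math.tanh(w['cx'] * x + w['ch'] * h_prev + w['cb'])  # candidate state
    c = f * c_prev + i * g   # gated memory update: this additive path is
                             # what mitigates the RNN's vanishing gradients
    h = o * math.tanh(c)     # new hidden state
    return h, c

# Run a toy claim-feature sequence (one value per link, in time order).
weights = {k: 0.5 for k in
           ('ix', 'ih', 'ib', 'fx', 'fh', 'fb',
            'ox', 'oh', 'ob', 'cx', 'ch', 'cb')}
h, c = 0.0, 0.0
for x in [0.2, 0.7, 0.1]:
    h, c = lstm_step(x, h, c, weights)
```

The forget gate's additive cell-state update is the mechanism that lets long claim sequences be trained without the severe gradient attenuation described in the background section.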
Further, the step of resampling the positive sample data according to a preset sampling mode to obtain a resampling result and obtaining new positive sample data according to the resampling result specifically includes:
Respectively up-sampling each piece of data in the positive sample data according to a preset sampling frequency to obtain a re-sampling result;
combining the resampling result into the positive sample data to jointly form new positive sample data;
the step of inputting the negative sample data and the new positive sample data together into a pre-built data anomaly identification model for model training to obtain a pre-trained data anomaly identification model specifically comprises:
inputting the negative sample data and the new positive sample data together into a pre-constructed data anomaly identification model for model training;
and screening out a target recognition model according to a preset screening rule, and taking the target recognition model as the data anomaly recognition model after the pre-training is completed.
Further, the step of screening out a target recognition model according to a preset screening rule and taking it as the pre-trained data anomaly recognition model specifically includes:
initializing model hyperparameters, where the hyperparameters are configuration parameters set outside the model, including the learning rate, number of iterations, batch size and number of hidden layers of the long short-term memory network;
randomly combining the model hyperparameters to obtain all hyperparameter combinations, and applying each combination in turn to the pre-built data anomaly identification model;
calculating, according to the gradient descent algorithm, the rate at which the cost function decreases under each hyperparameter combination during model training;
screening out the hyperparameter combination under which the cost function decreases fastest as the target hyperparameter combination;
and taking the data anomaly recognition model corresponding to the target hyperparameter combination as the target recognition model, which serves as the pre-trained data anomaly recognition model.
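The screening described above keeps the hyperparameter combination under which the cost function falls fastest. A minimal sketch, substituting a toy quadratic cost for the real training loss and using the learning rate as the only hyperparameter (both assumptions made purely for illustration):

```python
def cost(theta):
    """Toy convex cost standing in for the model's training loss."""
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

def descent_rate(lr, steps=5, theta0=0.0):
    """Average cost decrease per gradient-descent step under one
    hyperparameter setting (here, just the learning rate)."""
    theta = theta0
    start = cost(theta)
    for _ in range(steps):
        theta -= lr * grad(theta)       # plain gradient descent update
    return (start - cost(theta)) / steps

# Screen the combination whose cost decreases fastest.
candidate_lrs = [0.01, 0.1, 0.3]
best_lr = max(candidate_lrs, key=descent_rate)
```

In the patent's setting the candidate set would be combinations of learning rate, iteration count, batch size and hidden-layer count, and `descent_rate` would be measured over the first training iterations of the LSTM model.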
Further, the step of performing output verification on the pre-trained data anomaly identification model according to the labeling result of the full sample data and the model output result specifically includes:
counting the number of positive and negative sample data output by the model according to the model output result;
counting the number of actual positive and negative sample data in the full sample data according to the labeling result of the full sample data;
inputting the number of positive and negative sample data output by the model and the number of actual positive and negative sample data in the full sample data into a preset loss function as parameters to obtain a model loss value;
comparing the model loss value against the preset loss threshold;
if the model loss value is smaller than the preset loss threshold value, the verification is successful;
if the model loss value is not smaller than the preset loss threshold value, verification fails.
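The verification steps above can be sketched as follows. The text does not name the preset loss function, so a count-mismatch rate over the full sample is assumed here purely for illustration:

```python
def count_loss(pred_pos, pred_neg, true_pos, true_neg):
    """Illustrative loss over sample counts: the mismatch between the
    model's predicted positive/negative counts and the labeled counts,
    normalised by the full-sample size (an assumed loss, not the
    patent's)."""
    total = true_pos + true_neg
    return (abs(pred_pos - true_pos) + abs(pred_neg - true_neg)) / total

def verify(pred_pos, pred_neg, true_pos, true_neg, threshold=0.05):
    """Verification succeeds only if the loss is below the threshold."""
    return count_loss(pred_pos, pred_neg, true_pos, true_neg) < threshold

# A near-correct model passes; a badly miscounting one fails.
ok = verify(pred_pos=98, pred_neg=902, true_pos=100, true_neg=900)
```

On failure, the claims prescribe hyperparameter tuning and further iterative training until the loss meets the threshold.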
Further, the step of predicting whether the test data is abnormal data according to the trained data abnormality recognition model specifically includes:
obtaining an output result of the trained data anomaly identification model;
if the output result shows that the test data is positive sample data, the test data is abnormal data;
and if the output result shows that the test data are negative sample data, the test data are normal data.
In order to solve the above technical problems, the embodiment of the present application further provides a data anomaly identification device, which adopts the following technical scheme:
a data anomaly identification device, comprising:
the test data acquisition module is used for acquiring test data to be identified;
the feature extraction module is used for carrying out time sequence processing and feature extraction on the test data to obtain a test feature set;
the feature engineering module is used for carrying out feature engineering treatment on the test feature set to obtain derived features;
A feature update module for incorporating the derived features into the test feature set to update the test feature set;
the model identification module is used for inputting the updated test feature set into the trained data anomaly identification model, and predicting whether the test data are anomaly data according to the trained data anomaly identification model.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the data anomaly identification method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor perform the steps of a data anomaly identification method as described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
according to the data anomaly identification method, test data to be identified are obtained; performing time sequence processing and feature extraction on the test data to obtain a test feature set; performing feature engineering processing on the test feature set to obtain derived features; incorporating the derived features into the test feature set to update the test feature set; and inputting the updated test feature set into a trained data anomaly identification model, and predicting whether the test data is anomaly data according to the trained data anomaly identification model. The problem that the model prediction effect is greatly reduced due to the adoption of a common RNN processing model is solved by considering the fact that a long-term and short-term memory network is introduced to solve the problems that the time sequence data are predicted in a plurality of links. Meanwhile, the corresponding relation between the characteristic data of each link in the vehicle insurance claim settlement business and the positive and negative output nodes is predicted by combining with the LSTM neural network, a gradient decreasing algorithm is introduced into the LSTM neural network, the model super-parameter combination with the highest decreasing speed of the cost function is screened out by the gradient decreasing algorithm, the training convergence speed and the prediction speed of the model are further ensured, each link of the claim settlement is tightly connected, whether the claim settlement data is abnormal or not is rapidly predicted, and the abnormal intelligent analysis of the claim settlement data is facilitated.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the application, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a data anomaly identification method in accordance with the present application;
FIG. 3 is a flow chart of one embodiment of step 202 of FIG. 2;
FIG. 4 is a flow chart of one particular embodiment of training a pre-constructed data anomaly identification model according to a data anomaly identification method in accordance with an embodiment of the present application;
FIG. 5 is a flow chart of one embodiment of step 402 shown in FIG. 4;
FIG. 6 is a flow chart of one embodiment of step 403 shown in FIG. 4;
FIG. 7 is a flow chart of one embodiment of step 602 shown in FIG. 6;
FIG. 8 is a flow chart of one embodiment of step 405 of FIG. 4;
FIG. 9 is a schematic diagram of a data anomaly identification device according to one embodiment of the present application;
FIG. 10 is a schematic structural view of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the data anomaly identification method provided in the embodiment of the present application is generally executed by a server, and accordingly, the data anomaly identification device is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a data anomaly identification method in accordance with the present application is shown. The data anomaly identification method comprises the following steps:
in step 201, test data to be identified is obtained.
In this embodiment, the test data to be identified includes all audit data related to the claim settlement service after the accident of the vehicle insurance, including report link data, investigation link data, accident responsibility fixing link data, policy audit link data, etc.
In this embodiment, the test data to be identified are a series of time-series data used for fraud identification; for example, the report link data, investigation link data, accident responsibility determination link data and policy audit link data each occupy a definite position in the time sequence of the whole claim settlement event.
Step 202, performing time sequence processing and feature extraction on the test data to obtain a test feature set.
With continued reference to FIG. 3, FIG. 3 is a flow chart of one embodiment of step 202 shown in FIG. 2, comprising:
step 301, according to a preset arrangement rule, sorting the test data into time sequence data;
in this embodiment, the test data is arranged into time series data according to a preset arrangement rule, that is, according to the time sequence of the report link data, the investigation link data, the accident responsibility determining link data and the policy checking link data in the whole claim settlement business process, the test data is arranged into time series data.
Time-series data are data arranged in chronological order. In the whole claim settlement process, the report link data, investigation link data, accident responsibility determination link data and policy audit link data follow a definite temporal order, so arranging them chronologically turns the test data into time-series data.
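A minimal sketch of the arrangement rule, assuming English names for the four links (report, investigation, responsibility determination, policy audit); the record fields are invented for illustration:

```python
# Preset arrangement rule: the fixed temporal order of the four claim links.
LINK_ORDER = {'report': 0, 'investigation': 1,
              'responsibility': 2, 'policy_audit': 3}

def arrange(records):
    """Sort raw claim records into time-series order by link stage."""
    return sorted(records, key=lambda r: LINK_ORDER[r['link']])

records = [
    {'link': 'policy_audit', 'payload': 'audit notes'},
    {'link': 'report', 'payload': 'accident reported'},
    {'link': 'investigation', 'payload': 'site photos'},
    {'link': 'responsibility', 'payload': 'liability fixed'},
]
ordered = arrange(records)
```

A fixed stage-order table rather than raw timestamps reflects the rule described here: the link sequence of the claim process itself defines the time order.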
Step 302, extracting feature data in the time sequence data according to a preset feature keyword table to obtain a test feature set;
Specifically, feature keywords are set for the report link data, the investigation link data, the accident responsibility fixing link data and the policy audit link data respectively in advance according to the whole claim settlement business flow, and feature keyword tables are built according to the feature keywords respectively corresponding to the report link data, the investigation link data, the accident responsibility fixing link data and the policy audit link data. The characteristic keywords can be set by taking batch historical claim data as experience data, and determining characteristic keywords corresponding to the report link data, the investigation link data, the accident responsibility determining link data and the policy auditing link data respectively in a word frequency calculation mode.
Extracting feature data in the time sequence data according to a preset feature keyword table to obtain a test feature set, wherein the method specifically comprises the following steps of:
step 3021, judging whether the feature data extracted according to the feature keyword table are all target extraction results;
step 3022, if there is a non-target extraction result, performing missing value filling processing on the non-target extraction result, and taking the filling data as a corresponding target extraction result;
Specifically, if the feature data corresponding to a keyword cannot be extracted, or the extracted data are erroneous, those feature data are not a target extraction result. Missing-value filling is then performed on the non-target extraction result, and the filled data are taken as the corresponding target extraction result. That is, during feature extraction some feature data may be unavailable, leaving missing values; in that case the mode of the target feature in historical test data can be obtained and used to fill the gap, where the mode refers to the most frequent value of that feature in historical claim data.
Step 3023, obtaining all the target extraction results, and adding all the target extraction results into a preset ordered set according to the arrangement sequence of the time series data to generate a test feature set.
In this embodiment, by adding all feature data into a preset ordered set according to the arrangement sequence of the time series data, a test feature set is generated, so that the whole data text to be identified is not required to be input during testing, the input data quantity is reduced to a certain extent, most of non-feature data is removed, and the computing power resource consumption of a computer is saved.
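The keyword-based extraction with mode filling of missing values can be sketched as follows; the keyword table and the historical values are invented purely for illustration:

```python
from statistics import mode

# Hypothetical feature keyword table, in time-series order, and
# hypothetical historical values used for mode-based filling.
KEYWORDS = ['report_time', 'location', 'damage_level']
HISTORY = {'damage_level': ['minor', 'minor', 'severe', 'minor']}

def extract_features(record):
    """Extract each keyword's value in order; fill a missing value
    (a non-target extraction result) with the mode of that feature
    in historical claim data."""
    features = []
    for key in KEYWORDS:
        value = record.get(key)          # None marks a non-target result
        if value is None:
            value = mode(HISTORY[key])   # missing-value filling by mode
        features.append(value)           # ordered set preserves sequence
    return features

feats = extract_features({'report_time': '2023-08-01 09:00',
                          'location': 'city ring road'})
```

Only keyword-matched features enter the ordered set, which is what reduces the input volume and compute cost noted above.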
And 203, performing feature engineering processing on the test feature set to obtain derived features.
Specifically, performing feature engineering on the test feature set to obtain derived features means creating new features from existing ones. For example, if the test feature set contains the report time and the incident time, a new derived feature such as the interval between them can be created. Similarly, from a date one can derive whether it is a holiday; from the incident location, whether it is a bar; and from the incident time, whether it occurred at night.
Specifically, different derivation rules may be applied to the test feature set to obtain derived features. For numerical feature data, polynomial derivation may also be used. Feature derivation methods include, but are not limited to, univariate, bivariate and multivariate derivation, where the prefix denotes the number of variables a derived feature depends on. Alternatively, text feature derivation may be used: the semantic information of feature data in the test feature set is analyzed, and features are derived from the analyzed semantics.
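A sketch of deriving new features from two existing ones (report time and incident time); the time format and the 22:00–06:00 night window are assumptions, not taken from the patent:

```python
from datetime import datetime

def derive(report_time, incident_time):
    """Create derived features from existing ones: the report/incident
    interval in hours, and whether the incident happened at night."""
    fmt = '%Y-%m-%d %H:%M'
    r = datetime.strptime(report_time, fmt)
    i = datetime.strptime(incident_time, fmt)
    interval_hours = (r - i).total_seconds() / 3600.0
    is_night = i.hour >= 22 or i.hour < 6   # assumed night window
    return {'interval_hours': interval_hours, 'is_night': is_night}

# A claim reported ten hours after a late-night incident.
d = derive('2023-08-02 09:30', '2023-08-01 23:30')
```

Both derived values would then be merged back into the test feature set in the update step, raising the feature dimension available to the model.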
Step 204, incorporating the derived features into the test feature set to update the test feature set.
The derived features are merged into the test feature set to update it, which further enriches the data in the set and raises the dimensionality of the feature data, in turn improving the accuracy of model identification.
Step 205, inputting the updated test feature set to a trained data anomaly identification model, and predicting whether the test data is anomaly data according to the trained data anomaly identification model.
In this embodiment, before the step of inputting the updated test feature set into the trained data anomaly identification model is performed, the method further includes: training a pre-constructed data anomaly identification model.
With continued reference to fig. 4, fig. 4 shows a flowchart of a specific embodiment of training a pre-constructed data anomaly identification model according to a data anomaly identification method according to an embodiment of the present application, including:
step 401, acquiring full sample data, wherein the full sample data comprises positive sample data and negative sample data, the positive sample data is abnormal case setting data in a claim case, and the negative sample data is normal case setting data in the claim case;
specifically, the acquiring of the full sample data refers to acquiring all the data of the settlement of the claim service in the car insurance accident settlement service system, wherein the full sample data comprises marked true and false positive and negative sample data, the positive sample data is abnormal settlement data in the claim case, the negative sample data is normal settlement data in the claim case, namely, all the data of the settlement of the claim service in the car insurance accident settlement service system comprises data of normal settlement through a claim settlement program and data of abnormal settlement identified as the result of the claim making, wherein the negative sample data is the data of the normal settlement through the claim settlement program, and the positive sample data is the data of the abnormal settlement identified as the result of the claim making.
Step 402, resampling the positive sample data according to a preset sampling mode to obtain a resampling result, and obtaining new positive sample data according to the resampling result;
Because cases settled normally through the claim procedure usually dominate all settlement data in the car insurance claim system (possibly 98% of the total), while cases identified as abnormal account for only about 1%, the ratio of positive to negative sample data is extremely unbalanced. Resampling the positive sample data is therefore considered, to increase its volume and balance the ratio of positive and negative samples.
With continued reference to fig. 5, fig. 5 is a flow chart of one embodiment of step 402 shown in fig. 4, comprising:
step 501, up-sampling each piece of data in the positive sample data according to a preset sampling frequency to obtain a resampling result;
In this embodiment, upsampling increases the sampling frequency, lifting low-dimensional data to a higher dimension; that is, dense data is obtained by resampling sparse data at a higher frequency.
The resampling process includes resampling the positive sample data using the resample function of the Pandas data analysis tool. The Pandas library is a free, open-source third-party Python library for data analysis that supports time-series data. Because the positive sample data is scarce, its scarcity can be treated as missing data: the sampling frequency is passed as a parameter to the resample function, and upsampling is performed to enrich the volume of the positive sample data.
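A minimal sketch of this upsampling with `pandas.DataFrame.resample`; the hourly target frequency, the forward-fill strategy, and the column name are illustrative assumptions:

```python
import pandas as pd

# Two sparse positive samples, six hours apart, on a datetime index.
positive = pd.DataFrame(
    {"claim_amount": [1000.0, 2500.0]},
    index=pd.to_datetime(["2023-01-01 00:00", "2023-01-01 06:00"]),
)

# Upsample to an hourly grid; the gaps created by the higher sampling
# frequency are treated like missing values and forward-filled from the
# nearest earlier sample, enriching the positive-sample volume.
upsampled = positive.resample("1h").ffill()
print(len(upsampled))
```

In practice the filled rows would be merged back into the positive sample data, as step 502 describes.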
Step 502, merging the resampling result into the positive sample data to jointly form new positive sample data.
Step 403, inputting the negative sample data and the new positive sample data together into a pre-constructed data anomaly identification model, performing model training, and obtaining a pre-trained data anomaly identification model, wherein the data anomaly identification model is constructed from a long short-term memory network and a gradient descent algorithm;
The long short-term memory network may be an LSTM (Long Short-Term Memory) neural network, a type of recurrent neural network for time series. The whole claim settlement process combines several links, such as the case reporting link, the investigation link, the accident liability determination link, and the policy checking link, so training involves data from multiple links and the prediction concerns long-span time-series data. The LSTM network is introduced because an ordinary RNN model handling such long sequences would greatly reduce the prediction effect. Meanwhile, a gradient descent algorithm is introduced into the long short-term memory network, so that during model training the rate of change of the model's cost function can be determined from the network's processing results.
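The reason an LSTM handles long sequences better than a plain RNN is its gated cell state. The following is a didactic NumPy sketch of a single LSTM time step (standard gate equations, not the patent's implementation), showing how the cell state carries information across many process links:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with gate ordering [input, forget, cell, output].
    A didactic sketch of how the gated cell state carries information
    across many claim-process links; not the patent's implementation."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # joint pre-activation, shape (4n,)
    i = sigmoid(z[0:n])               # input gate
    f = sigmoid(z[n:2*n])             # forget gate
    g = np.tanh(z[2*n:3*n])           # candidate cell update
    o = sigmoid(z[3*n:4*n])           # output gate
    c = f * c_prev + i * g            # cell state: long-term memory path
    h = o * np.tanh(c)                # hidden state: short-term output
    return h, c
```

The additive update `c = f * c_prev + i * g` is what lets gradients survive over long spans, which is the property the embodiment relies on for multi-link claim data.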
With continued reference to fig. 6, fig. 6 is a flow chart of one embodiment of step 403 shown in fig. 4, comprising:
step 601, inputting the negative sample data and the new positive sample data together into a pre-constructed data anomaly identification model for model training;
In this embodiment, before the step of inputting the negative sample data and the new positive sample data together into a pre-built data anomaly identification model for model training, the method further includes: performing time sequence processing and feature extraction on the negative sample data and the new positive sample data to obtain a negative sample feature set and a positive sample feature set, respectively; performing feature engineering processing on the negative and positive sample feature sets to obtain derived features; and merging the derived features into the corresponding sample feature sets to update the negative sample feature set and the positive sample feature set.
Specifically, the time sequence processing, feature extraction, and feature engineering used at test time are consistent with those used during training.
Step 602, screening out a target recognition model according to a preset screening rule, and taking the target recognition model as the data anomaly recognition model after pre-training.
With continued reference to fig. 7, fig. 7 is a flow chart of one embodiment of step 602 shown in fig. 6, comprising:
step 701, initializing model hyperparameters, wherein the model hyperparameters refer to configuration parameters external to the model, including the learning rate, the number of iterations, the batch size, and the number of hidden layers of the long short-term memory network;
step 702, randomly combining the model hyperparameters, obtaining all hyperparameter combinations, and setting each combination on the pre-constructed data anomaly identification model;
step 703, calculating, according to the gradient descent algorithm, the rate at which the cost function decreases for each hyperparameter combination during model training;
step 704, screening out the hyperparameter combination whose cost function decreases fastest as the target hyperparameter combination;
step 705, obtaining the data anomaly identification model corresponding to the target hyperparameter combination as the target recognition model, and using the target recognition model as the pre-trained data anomaly identification model.
The gradient descent algorithm screens out the hyperparameter combination whose cost function decreases fastest as the target combination, thereby determining the target model and ensuring the model's training convergence speed and prediction speed.
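Steps 701 through 705 amount to a grid search scored by how fast the cost falls. A hedged sketch, where `train_probe` is a hypothetical callable returning a short cost history for one combination and the average-decrease-per-epoch criterion is an illustrative assumption:

```python
import itertools

def select_fastest_combo(grid, train_probe, probe_epochs=3):
    """Pick the hyperparameter combination whose cost function decreases
    fastest over a few probe epochs. `train_probe` is a hypothetical
    callable returning the cost history for one combination; the average
    decrease-per-epoch criterion is an illustrative assumption."""
    keys = sorted(grid)
    best_combo, best_drop = None, float("-inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        combo = dict(zip(keys, values))
        costs = train_probe(combo, probe_epochs)      # e.g. [0.9, 0.6, 0.4]
        drop = (costs[0] - costs[-1]) / probe_epochs  # average decrease per epoch
        if drop > best_drop:
            best_combo, best_drop = combo, drop
    return best_combo
```

The combination returned plays the role of the "target hyperparameter combination" of step 704.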
Step 404, inputting the full sample data into the pre-trained data anomaly identification model to obtain a model output result;
step 405, according to the labeling result of the full sample data and the model output result, performing output verification on the pre-trained data anomaly identification model;
with continued reference to fig. 8, fig. 8 is a flow chart of one embodiment of step 405 shown in fig. 4, comprising:
step 801, counting the number of positive and negative sample data output by the model according to the model output result;
step 802, counting the number of actual positive and negative sample data in the full sample data according to the labeling result of the full sample data;
step 803, the number of positive and negative sample data output by the model and the number of actual positive and negative sample data in the full sample data are input as parameters into a preset loss function to obtain a model loss value;
step 804, comparing the model loss value with the preset loss threshold;
step 805, if the model loss value is smaller than the preset loss threshold, the verification is successful;
step 806, if the model loss value is not less than the preset loss threshold, the verification fails.
In this embodiment, when the loss function is invoked, it can calculate a verification metric, namely the proportion of unrecognized samples in the full sample data, and can also calculate the recognition degree of positive and negative sample data, which is obtained by computing the ratio of the number of positive and negative samples output by the model to the actual number of positive and negative samples in the full sample data.
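A count-level verification of this kind can be sketched as below; the absolute-deviation form of the loss is an illustrative assumption, since the embodiment specifies only that the loss is computed from the model's output counts and the labeled counts:

```python
def count_based_loss(pred_pos, pred_neg, true_pos, true_neg):
    """Count-level verification sketch: compare the numbers of positive and
    negative samples the model outputs with the labeled counts. The
    recognition degree is the ratio of model counts to actual counts; the
    absolute-deviation loss form is an illustrative assumption."""
    pos_recognition = pred_pos / true_pos if true_pos else 0.0
    neg_recognition = pred_neg / true_neg if true_neg else 0.0
    # loss grows as either recognition ratio drifts from 1 (a perfect count match)
    return abs(1.0 - pos_recognition) + abs(1.0 - neg_recognition)
```

Verification then succeeds when this value falls below the preset loss threshold, as steps 805 and 806 describe.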
And step 406, if verification fails, performing hyperparameter tuning on the pre-constructed data anomaly identification model and continuing iterative training until the model loss value meets the preset loss threshold, at which point iterative training stops and the successfully trained data anomaly identification model is obtained.
In this embodiment, training the pre-constructed data anomaly identification model is mainly aimed at using the LSTM neural network to predict the correspondence between the feature data of each link of the vehicle insurance claim service and the positive and negative output nodes.
In this embodiment, the step of predicting whether the test data is abnormal data according to the trained data anomaly identification model specifically includes: obtaining an output result of the trained data anomaly identification model; if the output result shows that the test data is positive sample data, the test data is abnormal data; and if the output result shows that the test data is negative sample data, the test data is normal data.
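A minimal sketch of this final decision step, assuming the model emits a positive-class probability and using a 0.5 cutoff (both assumptions, not stated in the embodiment):

```python
def interpret_output(positive_prob: float, threshold: float = 0.5) -> str:
    """If the model classifies the test data as a positive sample it is
    abnormal; as a negative sample, normal. The probability output and the
    0.5 threshold are illustrative assumptions."""
    return "abnormal" if positive_prob >= threshold else "normal"
```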
The application acquires test data to be identified; performs time sequence processing and feature extraction on the test data to obtain a test feature set; performs feature engineering processing on the test feature set to obtain derived features; merges the derived features into the test feature set to update it; and inputs the updated test feature set into a trained data anomaly identification model, which predicts whether the test data is abnormal. Because the prediction involves long time-series data spanning multiple links, a long short-term memory network is introduced to avoid the large drop in prediction effect that an ordinary RNN model would cause. Meanwhile, the LSTM neural network is used to predict the correspondence between the feature data of each link of the vehicle insurance claim service and the positive and negative output nodes, and a gradient descent algorithm is introduced into the LSTM network to screen out the hyperparameter combination whose cost function decreases fastest, further ensuring the model's training convergence speed and prediction speed. Each link of the claim settlement process is tightly connected, anomalies in claim data are rapidly predicted, and intelligent analysis of abnormal claim data is facilitated.
The embodiments of the application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the application, test data to be identified is acquired; time sequence processing and feature extraction are performed on the test data to obtain a test feature set; feature engineering processing is performed on the test feature set to obtain derived features; the derived features are merged into the test feature set to update it; and the updated test feature set is input into a trained data anomaly identification model, which predicts whether the test data is abnormal. Because the prediction involves long time-series data spanning multiple links, a long short-term memory network is introduced to avoid the large drop in prediction effect that an ordinary RNN model would cause. Meanwhile, a gradient descent algorithm is introduced into the long short-term memory network, and the hyperparameter combination whose cost function decreases fastest is screened out, further ensuring the model's training convergence speed and prediction speed.
With further reference to fig. 9, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a data anomaly identification device, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device is specifically applicable to various electronic devices.
As shown in fig. 9, the data anomaly identification device 900 according to the present embodiment includes: a test data acquisition module 901, a feature extraction module 902, a feature engineering module 903, a feature update module 904, and a model identification module 905. Wherein:
the test data acquisition module 901 is used for acquiring test data to be identified;
the feature extraction module 902 is configured to perform time sequence processing and feature extraction on the test data to obtain a test feature set;
the feature engineering module 903 is configured to perform feature engineering processing on the test feature set to obtain a derived feature;
a feature update module 904 for incorporating the derived features into the test feature set to update the test feature set;
the model recognition module 905 is configured to input the updated test feature set to a trained data anomaly recognition model, and predict whether the test data is anomaly data according to the trained data anomaly recognition model.
The application acquires test data to be identified; performs time sequence processing and feature extraction on the test data to obtain a test feature set; performs feature engineering processing on the test feature set to obtain derived features; merges the derived features into the test feature set to update it; and inputs the updated test feature set into a trained data anomaly identification model, which predicts whether the test data is abnormal. Because the prediction involves long time-series data spanning multiple links, a long short-term memory network is introduced to avoid the large drop in prediction effect that an ordinary RNN model would cause. Meanwhile, the LSTM neural network is used to predict the correspondence between the feature data of each link of the vehicle insurance claim service and the positive and negative output nodes, and a gradient descent algorithm is introduced into the LSTM network to screen out the hyperparameter combination whose cost function decreases fastest, further ensuring the model's training convergence speed and prediction speed. Each link of the claim settlement process is tightly connected, anomalies in claim data are rapidly predicted, and intelligent analysis of abnormal claim data is facilitated.
Those skilled in the art will appreciate that all or part of the processes of the above embodiment methods may be implemented by computer-readable instructions instructing related hardware; the instructions may be stored in a computer-readable storage medium, and when executed may include the flows of the embodiments of the above methods. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 10, fig. 10 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 10 includes a memory 10a, a processor 10b, and a network interface 10c communicatively coupled to each other via a system bus. Note that only a computer device 10 with components 10a-10c is shown in the figure, but not all of the illustrated components need be implemented; more or fewer components may be implemented instead. As understood by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 10a includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 10a may be an internal storage unit of the computer device 10, such as a hard disk or memory of the computer device 10. In other embodiments, the memory 10a may also be an external storage device of the computer device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 10. Of course, the memory 10a may also include both an internal storage unit of the computer device 10 and an external storage device. In this embodiment, the memory 10a is generally used to store the operating system and various application software installed on the computer device 10, such as computer-readable instructions of a data anomaly identification method. In addition, the memory 10a may be used to temporarily store various types of data that have been output or are to be output.
The processor 10b may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 10b is generally used to control the overall operation of the computer device 10. In this embodiment, the processor 10b is configured to execute computer readable instructions stored in the memory 10a or process data, such as computer readable instructions for executing the data anomaly identification method.
The network interface 10c may comprise a wireless network interface or a wired network interface, the network interface 10c typically being used to establish a communication connection between the computer device 10 and other electronic devices.
The computer device provided by this embodiment belongs to the technical fields of artificial intelligence and financial technology and is applied to claim risk prediction services. The application acquires test data to be identified; performs time sequence processing and feature extraction on the test data to obtain a test feature set; performs feature engineering processing on the test feature set to obtain derived features; merges the derived features into the test feature set to update it; and inputs the updated test feature set into a trained data anomaly identification model, which predicts whether the test data is abnormal. Because the prediction involves long time-series data spanning multiple links, a long short-term memory network is introduced to avoid the large drop in prediction effect that an ordinary RNN model would cause. Meanwhile, a gradient descent algorithm is introduced into the long short-term memory network, and the hyperparameter combination whose cost function decreases fastest is screened out, further ensuring the model's training convergence speed and prediction speed.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by a processor to cause the processor to perform the steps of the data anomaly identification method as described above.
The computer-readable storage medium provided by this embodiment belongs to the technical fields of artificial intelligence and financial technology and is applied to claim risk prediction services. The application acquires test data to be identified; performs time sequence processing and feature extraction on the test data to obtain a test feature set; performs feature engineering processing on the test feature set to obtain derived features; merges the derived features into the test feature set to update it; and inputs the updated test feature set into a trained data anomaly identification model, which predicts whether the test data is abnormal. Because the prediction involves long time-series data spanning multiple links, a long short-term memory network is introduced to avoid the large drop in prediction effect that an ordinary RNN model would cause. Meanwhile, a gradient descent algorithm is introduced into the long short-term memory network, and the hyperparameter combination whose cost function decreases fastest is screened out, further ensuring the model's training convergence speed and prediction speed.
From the above description of the embodiments, it will be clear to those skilled in the art that the above embodiment methods may be implemented by software plus a necessary general hardware platform, or by hardware, though in many cases the former is preferred. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) that includes instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
It is apparent that the embodiments described above are only some embodiments of the present application, not all of them; the preferred embodiments are shown in the drawings, which do not limit the scope of the claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. All equivalent structures made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims (10)

1. A method for identifying data anomalies, comprising the steps of:
acquiring test data to be identified;
performing time sequence processing and feature extraction on the test data to obtain a test feature set;
performing feature engineering processing on the test feature set to obtain derived features;
incorporating the derived features into the test feature set to update the test feature set;
and inputting the updated test feature set into a trained data anomaly identification model, and predicting whether the test data is anomaly data according to the trained data anomaly identification model.
2. The method for identifying data anomalies according to claim 1, wherein the step of performing time sequence processing and feature extraction on the test data to obtain a test feature set includes:
according to a preset arrangement rule, the test data are arranged into time sequence data;
extracting feature data in the time sequence data according to a preset feature keyword table to obtain a test feature set;
extracting feature data in the time sequence data according to a preset feature keyword table to obtain a test feature set, wherein the method specifically comprises the following steps of:
Judging whether the feature data extracted according to the feature keyword table are all target extraction results;
if the non-target extraction result exists, carrying out missing value filling processing on the non-target extraction result, and taking filling data as a corresponding target extraction result;
and obtaining all target extraction results, adding all target extraction results into a preset ordered set according to the arrangement sequence of the time sequence data, and generating a test feature set.
3. The data anomaly identification method of claim 1, wherein prior to performing the step of inputting the updated set of test features into the trained data anomaly identification model, the method further comprises:
acquiring full sample data, wherein the full sample data comprises positive sample data and negative sample data, the positive sample data is abnormal case setting data in a claim case, and the negative sample data is normal case setting data in the claim case;
resampling the positive sample data according to a preset sampling mode to obtain a resampling result, and obtaining new positive sample data according to the resampling result;
inputting the negative sample data and the new positive sample data together into a pre-built data anomaly identification model, performing model training, and obtaining a pre-trained data anomaly identification model, wherein the data anomaly identification model is constructed from a long short-term memory network and a gradient descent algorithm;
Inputting the full sample data into the pre-trained data anomaly identification model to obtain a model output result;
according to the labeling result of the full sample data and the model output result, carrying out output verification on the pre-trained data anomaly identification model;
if verification fails, performing hyperparameter tuning on the pre-constructed data anomaly identification model and carrying out iterative training until the model loss value meets the preset loss threshold, then stopping iterative training to obtain the successfully trained data anomaly identification model.
4. The method for identifying data anomalies according to claim 3, wherein the step of resampling the positive sample data according to a preset sampling mode, obtaining a resampling result, and obtaining new positive sample data according to the resampling result specifically comprises:
respectively up-sampling each piece of data in the positive sample data according to a preset sampling frequency to obtain a re-sampling result;
combining the resampling result into the positive sample data to jointly form new positive sample data;
the step of inputting the negative sample data and the new positive sample data together into a pre-built data anomaly identification model for model training to obtain a pre-trained data anomaly identification model specifically comprises the following steps:
Inputting the negative sample data and the new positive sample data together into a pre-constructed data anomaly identification model for model training;
and screening out a target identification model according to a preset screening rule, and taking the target identification model as the pre-trained data anomaly identification model.
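The up-sampling step above (replicating each positive record a preset number of times and merging the copies back into the originals) can be sketched as follows; the function name and the flat-list representation of the records are hypothetical:

```python
def upsample_positives(positive_data, sampling_frequency):
    """Replicate each positive record `sampling_frequency` times (the
    resampling result), then merge the copies back into the originals."""
    resampled = [record for record in positive_data
                 for _ in range(sampling_frequency)]
    # New positive set = original records plus the resampled copies.
    return positive_data + resampled
```

With a sampling frequency of 2, two positive records become six, tilting the class balance toward the rare abnormal class.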
5. The data anomaly identification method according to claim 4, wherein the step of screening out a target identification model according to a preset screening rule and taking the target identification model as the pre-trained data anomaly identification model specifically comprises:
initializing model hyperparameters, wherein the model hyperparameters refer to configuration parameters set outside the model, including the learning rate, the number of iterations, the batch size, and the number of hidden layers of the long short-term memory network;
combining the model hyperparameters to obtain all model hyperparameter combinations, and applying each model hyperparameter combination to the pre-built data anomaly identification model;
calculating, according to the gradient descent algorithm, the rate at which the cost function decreases for each model hyperparameter combination during model training;
selecting the model hyperparameter combination whose cost function decreases fastest as the target hyperparameter combination;
and acquiring the data anomaly identification model corresponding to the target hyperparameter combination as the target identification model, and taking the target identification model as the pre-trained data anomaly identification model.
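The hyperparameter screening in this claim (enumerate the combinations, measure how fast the cost function falls for each, keep the fastest) can be sketched with `itertools.product`; the grid values and the stand-in rate measure are illustrative, since in practice the rate would be observed while running gradient descent on the real LSTM model:

```python
from itertools import product

def select_hyperparameters(grid, cost_decrease_rate):
    """Enumerate every hyperparameter combination and keep the one
    whose cost function decreases fastest."""
    combos = [dict(zip(grid, values)) for values in product(*grid.values())]
    return max(combos, key=cost_decrease_rate)

# Grid over the hyperparameters the claim lists (values illustrative).
grid = {
    "learning_rate": [0.01, 0.1],
    "iterations": [10, 50],
    "batch_size": [32, 64],
    "hidden_layers": [1, 2],
}

# Stand-in rate measure; a real run would record the per-step drop in
# the cost function during gradient descent for each combination.
best = select_hyperparameters(
    grid, lambda c: c["learning_rate"] * c["hidden_layers"]
)
```

Exhaustive enumeration is fine for small grids like this one; the claim's "random combination" wording would correspond to sampling a subset of `combos` instead.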
6. The data anomaly identification method according to claim 3, wherein the step of performing output verification on the pre-trained data anomaly identification model according to the labeling result of the full sample data and the model output result specifically comprises:
counting the numbers of positive and negative sample data output by the model according to the model output result;
counting the actual numbers of positive and negative sample data in the full sample data according to the labeling result of the full sample data;
inputting the numbers of positive and negative sample data output by the model and the actual numbers of positive and negative sample data in the full sample data into a preset loss function as its arguments to obtain a model loss value;
comparing the model loss value with the preset loss threshold;
if the model loss value is smaller than the preset loss threshold, the verification succeeds;
if the model loss value is not smaller than the preset loss threshold, the verification fails.
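A minimal sketch of the count-based verification above; the claim leaves the loss function unspecified, so the relative-count error used here is only a stand-in, and the function name is hypothetical:

```python
def verify_output(predicted_counts, actual_counts, loss_threshold):
    """Count-based output verification: compare the model's predicted
    positive/negative counts with the labeled actual counts via a
    relative-error loss, then check it against the preset threshold."""
    pred_pos, pred_neg = predicted_counts
    actual_pos, actual_neg = actual_counts
    total = actual_pos + actual_neg
    loss = (abs(pred_pos - actual_pos) + abs(pred_neg - actual_neg)) / total
    # Verification succeeds only when the loss beats the threshold.
    return loss < loss_threshold, loss
```

For instance, predicting 95 positives and 905 negatives against actual counts of 100 and 900 gives a loss of 0.01, which passes a 0.05 threshold.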
7. The data anomaly identification method according to claim 1, wherein the step of predicting whether the test data is abnormal data according to the trained data anomaly identification model specifically comprises:
obtaining an output result of the trained data anomaly identification model;
if the output result indicates that the test data is positive sample data, the test data is abnormal data;
and if the output result indicates that the test data is negative sample data, the test data is normal data.
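The decision rule of this claim reduces to a small mapping from the model's output class to the abnormal/normal verdict; the string labels are illustrative stand-ins for however the trained model encodes its classes:

```python
def interpret_prediction(predicted_class):
    """Map the trained model's output class to the claimed verdict:
    a positive-sample prediction marks the test data as abnormal,
    a negative-sample prediction marks it as normal."""
    return "abnormal" if predicted_class == "positive" else "normal"
```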
8. A data anomaly identification device, comprising:
the test data acquisition module is used for acquiring test data to be identified;
the feature extraction module is used for carrying out time sequence processing and feature extraction on the test data to obtain a test feature set;
the feature engineering module is used for performing feature engineering processing on the test feature set to obtain derived features;
a feature update module for incorporating the derived features into the test feature set to update the test feature set;
the model identification module is used for inputting the updated test feature set into the trained data anomaly identification model and predicting whether the test data is abnormal data according to the trained data anomaly identification model.
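The five modules of the claimed device can be sketched as one class with one method per module; the feature computations and the callable model are illustrative stand-ins, not the patent's actual feature engineering:

```python
class DataAnomalyIdentificationDevice:
    """Sketch of the claimed device: one method per claimed module."""

    def __init__(self, model):
        self.model = model  # trained data anomaly identification model

    def acquire_test_data(self, raw):
        # Test data acquisition module.
        return list(raw)

    def extract_features(self, data):
        # Time-series processing and feature extraction module (stand-in).
        return {"mean": sum(data) / len(data), "maximum": max(data)}

    def derive_features(self, features):
        # Feature engineering module producing derived features.
        return {"spread": features["maximum"] - features["mean"]}

    def update_features(self, features, derived):
        # Feature update module: merge derived features into the set.
        return {**features, **derived}

    def identify(self, raw):
        # Model identification module: run the full pipeline end to end.
        data = self.acquire_test_data(raw)
        features = self.extract_features(data)
        features = self.update_features(features, self.derive_features(features))
        return self.model(features)
```

With a toy model such as `lambda f: f["maximum"] > 100`, the device flags a series containing an outlier and passes a quiet one.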
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the data anomaly identification method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the data anomaly identification method of any one of claims 1 to 7.
CN202311037408.1A 2023-08-16 2023-08-16 Data anomaly identification method, device, equipment and storage medium thereof Pending CN117056782A (en)


Publications (1)

Publication Number Publication Date
CN117056782A true CN117056782A (en) 2023-11-14

Family

ID=88662228




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination