CN112288025A - Abnormal case identification method, device and equipment based on tree structure and storage medium


Info

Publication number: CN112288025A (application CN202011211514.3A; granted publication CN112288025B)
Authority: CN (China)
Prior art keywords: original training, abnormal, model, sample, initial
Legal status: Granted; currently active
Other languages: Chinese (zh)
Inventor: 殷振滔
Original and current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Application filed by Ping An Property and Casualty Insurance Company of China Ltd, with priority to CN202011211514.3A

Classifications

    • G06F18/24323 Tree-organised classifiers (under G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques; G06F18/243 Classification techniques relating to the number of classes)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting (under G06F18/21 Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation)
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection (under G06F18/243 Classification techniques relating to the number of classes)


Abstract

The application discloses an abnormal case identification method, apparatus, device and storage medium based on a tree structure, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring original training samples from an initial case database; calculating the anomaly score of each original training sample based on the IForest algorithm; comparing the anomaly score of each original training sample with a preset threshold, classifying the original training samples according to the comparison result, and forming a target training set; performing model training on an initial recognition model with the target training set, and outputting an anomaly recognition model; and importing the case data of a case to be identified into the anomaly recognition model, and outputting an identification result. The application also relates to blockchain technology: the anomaly scores of the original training samples can be stored in a blockchain. Because the anomaly score of each original training sample is calculated by the IForest algorithm, the process of classifying the original training samples by anomaly score effectively eliminates interference from human factors and improves the accuracy of the case anomaly recognition model.

Description

Abnormal case identification method, device and equipment based on tree structure and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method, a device, equipment and a storage medium for identifying abnormal cases based on a tree structure.
Background
Traditional case anomaly identification usually relies on surveyor investigation and statistical models, but both take human judgment as the benchmark, and human judgment is hard to quantify and carries subjective factors. In manual judgment, abnormal cases are marked as positive samples and all other cases are treated as non-abnormal cases, i.e. negative samples; the proportion of abnormal cases is usually far smaller than that of non-abnormal cases, so a binary classifier for case anomaly recognition trained on manually labeled data is not accurate enough. Because the proportion of positive samples (i.e. abnormal cases) in the training data is often very small and anomalies take many forms, the non-abnormal samples are impure: manual judgment may miss abnormal cases, leaving some of them mixed into the negative samples. This means the distribution of abnormal cases in the original historical data differs from the actual distribution, and the unidentified abnormal samples become dirty data that degrades the classifier.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, equipment and a storage medium for identifying abnormal cases based on a tree structure, so as to solve the technical problem that existing case recognition models trained on manually labeled data are not accurate enough.
In order to solve the above technical problem, an embodiment of the present application provides a method for identifying an abnormal case based on a tree structure, which adopts the following technical solutions:
a method for identifying abnormal cases based on a tree structure comprises the following steps:
acquiring an original training set for a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
calculating the anomaly score of each original training sample in the original training set based on a random isolation forest algorithm;
comparing the anomaly score of each original training sample with a preset threshold, and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
randomly combining the identified positive samples and negative samples to form a target training set;
constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an anomaly recognition model;
and acquiring case data of a case to be identified, importing the case data into the anomaly recognition model, and outputting an identification result.
Further, the step of calculating the anomaly score of each original training sample in the original training set based on the random isolation forest algorithm specifically comprises:
constructing a binary tree from a plurality of original training samples in the original training set;
and calculating the path length of each original training sample in the binary tree, and calculating the anomaly score of each original training sample based on the path length.
Further, the step of constructing the binary tree from a plurality of original training samples in the original training set specifically comprises:
extracting a plurality of original training samples from the original training set, and importing the extracted original training samples into a preset initial binary tree model;
acquiring the sample features of each original training sample, and combining the acquired sample features to form a feature set;
and dividing the original training set by the feature set until the original training samples of the original training set are indivisible, and outputting a binary tree.
Further, the step of dividing the original training set by the feature set until the original training samples of the original training set are indivisible and outputting a binary tree specifically comprises:
randomly extracting sample features from the feature set in sequence, and determining the maximum value and the minimum value of each extracted sample feature;
randomly selecting a value between the maximum value and the minimum value as a cut point, and dividing the original training samples of the original training set;
and traversing the sample features of the feature set until the depth of the binary tree reaches the preset depth, obtaining a binary tree whose depth meets the requirement.
Further, the step of calculating the path length of each original training sample in the binary tree and calculating the anomaly score of each original training sample based on the path length specifically comprises:
counting the number of edges traversed by each original training sample in the binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
calculating a path correction value, and correcting the initial path length of each original training sample by the path correction value to obtain the path length of each original training sample in the binary tree;
and calculating the anomaly score of each original training sample according to the path length.
Further, the step of constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an anomaly recognition model specifically comprises:
randomly cutting the target training set into K equal training subsets, wherein K is a positive integer;
randomly extracting K-1 training subsets to form a model training set, and performing model training on the initial recognition model;
taking the remaining training subset as a cross-validation set, cross-validating the trained initial recognition model, and outputting a first validation result;
and iteratively updating the initial recognition model according to the first validation result until the initial recognition model converges, and outputting the converged anomaly recognition model.
Further, the step of iteratively updating the initial recognition model according to the first validation result until the initial recognition model converges and outputting the converged anomaly recognition model specifically comprises:
adjusting the model parameters of the initial recognition model, and training the parameter-adjusted initial recognition model on the model training set; and
cross-validating the trained initial recognition model on the cross-validation set, and outputting a second validation result;
and comparing the first validation result with the second validation result; if they differ, continuing to adjust the model parameters of the initial recognition model until the first and second validation results obtained after training are identical, and outputting the converged anomaly recognition model.
In order to solve the above technical problem, an embodiment of the present application further provides an abnormal case identification apparatus based on a tree structure, which adopts the following technical solutions:
an abnormal case recognition apparatus based on a tree structure, comprising:
an acquisition module for acquiring an original training set for a preset time period from an initial case database, the original training set comprising a plurality of original training samples;
a calculation module for calculating the anomaly score of each original training sample in the original training set based on a random isolation forest algorithm;
a comparison module for comparing the anomaly score of each original training sample with a preset threshold and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
a combination module for randomly combining the identified positive samples and negative samples to form a target training set;
a training module for constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an anomaly recognition model;
and a recognition module for acquiring case data of a case to be identified, importing the case data into the anomaly recognition model, and outputting a recognition result.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
A computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, implement the steps of the tree structure based abnormal case identification method described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the steps of the tree structure based abnormal case identification method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
When the model training set is constructed, the anomaly score of each original training sample is calculated by the IForest algorithm; the anomaly score of each original training sample is then compared in turn with a preset threshold, and the original training samples are classified according to the comparison result to obtain positive samples and negative samples. An anomaly recognition model is trained on the training set obtained in this way, and the trained model then identifies whether a case to be identified is abnormal. Because the anomaly score of each original training sample is calculated by the IForest algorithm, the process of classifying samples by anomaly score effectively eliminates interference from human factors and reduces the influence of subjectivity. Moreover, the preset threshold can be adjusted to the actual situation, which improves the proportion of positive to negative samples in the training set and avoids the prior-art problem that too few positive samples make the trained case anomaly recognition model insufficiently accurate.
Drawings
In order to more clearly illustrate the solutions of the present application, the drawings needed for describing the embodiments of the present application are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art from them without inventive effort.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow diagram of one embodiment of a tree structure based abnormal case identification method according to the present application;
FIG. 3 is a flow diagram illustrating one embodiment of step S202 of FIG. 2;
FIG. 4 is a schematic diagram illustrating construction of a binary tree in an embodiment of the present application;
FIG. 5 illustrates a schematic structural diagram of one embodiment of a tree structure based abnormal case identification apparatus according to the present application;
FIG. 6 shows a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the abnormal case identification method based on the tree structure provided in the embodiment of the present application is generally executed by a server, and accordingly, the abnormal case identification apparatus based on the tree structure is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flowchart of one embodiment of a method for tree structure based identification of abnormal cases according to the present application is shown. The abnormal case identification method based on the tree structure comprises the following steps:
S201, acquiring an original training set for a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
specifically, the original training set is a data set of all cases in a preset time in an initial case database, wherein data information of all cases is stored in the initial case database, the cases in the initial case database can be regarded as original training samples during model training, and the cases in the initial case database are unprocessed cases. The case in the initial case database, such as an automobile insurance claim abnormal case, has the relevant information including case number, case involved personnel, case involved vehicles, relevant certificates and the like, and the case involved personnel mainly include insured, repair shop personnel, insurance company personnel, relevant traffic police and the like, which are the original information recorded when the case occurs. It should be noted that the cases in the initial case database may also be financial claim cases or serious illness claim cases, and the application is not limited herein.
In this embodiment, the electronic device (for example, the server shown in FIG. 1) on which the tree structure based abnormal case identification method runs may receive user requests through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, Zigbee and UWB (ultra wideband) connections, and other wireless connection means now known or developed in the future.
S202, calculating the anomaly score of each original training sample in the original training set based on a random isolation forest algorithm;
Anomaly score analysis is the process of checking whether the data contains abnormal data, and ignoring anomalies during model training is very risky. In a specific embodiment of the present application, the anomaly score of a case can be calculated by the IForest (isolation forest) algorithm, a hybrid anomaly analysis method combining non-parametric statistics and unsupervised learning, i.e. it requires neither a predefined mathematical model nor labeled training data. An IForest consists of t iTrees (isolation trees), each of which is a binary tree structure. To find out which points are easy to isolate, IForest uses a very efficient strategy: suppose a random hyperplane is used to cut the data space; one cut produces two subspaces, each subspace is then cut again by a random hyperplane, and the process repeats until each subspace contains only one data point. Intuitively, clusters of very high density require many cuts before the cutting stops, whereas points of very low density easily end up alone in a subspace very early.
Specifically, binary trees are constructed from the original training samples in the original training set, the data information of the resulting binary trees is acquired, and the anomaly score of each original training sample is calculated from this data information.
S203, comparing the anomaly score of each original training sample with a preset threshold, and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
Specifically, the anomaly scores of all original training samples in the original training set are obtained by the IForest algorithm, where the anomaly score ranges over [0,1]. A preset threshold is set, for example 0.8, and the anomaly score of each original training sample is compared with it in turn to identify the type of the sample: cases scoring above the threshold are selected from the original training set and added to the positive samples (marked as abnormal cases), and cases scoring below the threshold are added to the negative samples (marked as normal cases). In a specific embodiment of the present application, the preset threshold may be set according to the requirements of the actual scenario; if the identification requirement is strict, the preset threshold is adjusted upwards.
S204, randomly combining the identified positive samples and negative samples to form a target training set;
Specifically, after the original training samples are classified into positive samples and negative samples, the positive and negative samples are recombined to generate a target training set, on which the initial recognition model is trained and validated.
S205, constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an anomaly recognition model;
Specifically, the initial recognition model can be constructed under a K-fold cross-validation framework to facilitate subsequent cross validation: the training set data is randomly divided into K parts, K-1 parts serve as the model training set and the remaining part serves as the cross-validation set; the initial recognition model is trained and then validated by the cross-validation method, which reduces model overfitting and improves robustness. The K supervised classifiers together form the mature anomaly recognition model.
It should be noted that a single machine learning model often does not perform well, so in the specific embodiment of the present application a stacking model is used in training the anomaly recognition model: multiple base classifiers are trained together to form the anomaly recognition model, and another secondary classifier then organizes and exploits the base classifiers, i.e. the answers of the base-layer models are taken as input, and the secondary classifier learns to assign weights to those answers, thereby reducing the generalization error. The base classifiers and the secondary classifier are all binary classifiers. Specifically, the stacking model uses K base classifiers and 1 secondary classifier: the training set data is randomly divided into K parts, K-1 parts serve as the model training set and the remaining part as the cross-validation set for cyclically training the base classifiers, finally yielding K trained base classifiers and their validation results; K-1 of the validation results are randomly combined as the training data set of the secondary classifier, and the remaining one serves as its cross-validation set for cyclic training, finally yielding the trained secondary classifier. The 1 secondary classifier combined with the K base classifiers forms the converged case anomaly recognition model. In a specific embodiment of the application, the base classifiers use a LightGBM or CatBoost model, and the secondary classifier uses a Logistic Regression model.
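One plausible reading of this stacking scheme is sketched below: K LightGBM base classifiers trained on K-fold splits, with a logistic regression secondary classifier fitted on their out-of-fold answers. The exact fold protocol and the hyperparameters are illustrative assumptions, not the application's prescribed values:

```python
# Sketch of the stacking setup: K base classifiers + 1 secondary classifier.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier   # a CatBoostClassifier would also fit here

def train_stacking(X, y, K=10):
    """Train K base classifiers on K-fold splits, then fit the secondary
    classifier on their out-of-fold predicted probabilities."""
    kf = KFold(n_splits=K, shuffle=True, random_state=42)
    base_models, oof = [], np.zeros(len(y))
    for train_idx, val_idx in kf.split(X):
        base = LGBMClassifier(n_estimators=200)
        base.fit(X[train_idx], y[train_idx])                 # K-1 folds to train
        oof[val_idx] = base.predict_proba(X[val_idx])[:, 1]  # held-out fold
        base_models.append(base)
    secondary = LogisticRegression().fit(oof.reshape(-1, 1), y)
    return base_models, secondary
```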
S206, acquiring case data of a case to be identified, importing the case data into the anomaly recognition model, and outputting an identification result.
Specifically, after the anomaly recognition model has been trained, the case data of a case to be identified is acquired and imported into the anomaly recognition model, and the anomaly identification result of the case can be obtained directly.
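Continuing the stacking sketch above, inference for one case might look like this (identify_case and the 0/1 label convention are illustrative assumptions):

```python
import numpy as np

def identify_case(base_models, secondary, case_features):
    """Sketch of step S206: run one case's data through the trained model."""
    case = np.asarray(case_features, dtype=float).reshape(1, -1)
    base_answer = np.mean([m.predict_proba(case)[0, 1] for m in base_models])
    return secondary.predict([[base_answer]])[0]   # 1 = abnormal, 0 = normal
```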
The embodiment of the application discloses an abnormal case identification method based on a tree structure. When the model training set is constructed, the anomaly score of each original training sample is calculated by the IForest algorithm; the anomaly scores are then compared in turn with a preset threshold, and the original training samples are classified according to the comparison results to obtain positive and negative samples. An anomaly recognition model is trained on the training set obtained in this way, and the trained model then identifies whether a case to be identified is abnormal. Classifying samples by IForest anomaly scores effectively eliminates interference from human factors and reduces subjective influence; moreover, the preset threshold can be adjusted to the actual situation, improving the proportion of positive to negative samples in the training set and avoiding the prior-art problem that too few positive samples yield an insufficiently accurate case anomaly recognition model.
Further, referring to FIG. 3, which is a flowchart of one embodiment of step S202 in FIG. 2, the step of calculating the anomaly score of each original training sample in the original training set based on the random isolation forest algorithm specifically comprises:
S301, constructing a binary tree from a plurality of original training samples in the original training set;
S302, calculating the path length of each original training sample in the binary tree, and calculating the anomaly score of each original training sample based on the path length.
In the above embodiment, the original training samples in the original training set are imported into the root node of the tree model, then divided according to certain conditions and filled into the leaf nodes of the tree model to form a binary tree; the path length of each original training sample in the binary tree is counted, and the anomaly score of each original training sample can be calculated from the obtained path length.
Further, the step of constructing the binary tree from a plurality of original training samples in the original training set specifically comprises:
extracting a plurality of original training samples from the original training set, and importing the extracted original training samples into a preset initial binary tree model;
acquiring the sample features of each original training sample, and combining the acquired sample features to form a feature set;
and dividing the original training set by the feature set until the original training samples of the original training set are indivisible, and outputting a binary tree.
In the above embodiment, the sample features of each original training sample are obtained and combined to form a feature set, by which the original training set is divided. It should be noted that the feature set may contain multiple sample features. When dividing the original training set by the feature set, one sample feature is randomly taken from the feature set to divide the original training set into subsets; another sample feature is then randomly taken to divide those subsets, and so on through all sample features in the feature set, until the original training samples of the original training set cannot be divided further. At this point the divided original training samples are filled into the leaf nodes of the tree model, yielding the binary tree. It should also be noted that by varying the order of the sample features, multiple binary trees can be obtained; using multiple binary trees to jointly calculate the anomaly score improves the accuracy of sample division.
Further, the step of dividing the original training set by the feature set until the original training samples of the original training set are indivisible and outputting a binary tree specifically comprises:
randomly extracting sample features from the feature set in sequence, and determining the maximum value and the minimum value of each extracted sample feature;
randomly selecting a value between the maximum value and the minimum value as a cut point, and dividing the original training samples of the original training set;
and traversing the sample features of the feature set until the depth of the binary tree reaches the preset depth, obtaining a binary tree whose depth meets the requirement.
In the above embodiment, when a sample feature is randomly taken from the feature set to divide the original training set, its maximum and minimum values are determined first. For example, referring to FIG. 4, which is a schematic diagram of constructing a binary tree in an embodiment of the present application, suppose there are 10 original training samples whose ages range from a minimum of 28 to a maximum of 56. A value (e.g. 40) is randomly selected between 28 and 56 as the cut point to divide the 10 original training samples: samples with age less than 40 are placed into a new leaf node 1 and samples with age greater than or equal to 40 into a new leaf node 2, where leaf node 1 lies to the left of the root node and leaf node 2 to its right. The original training samples in leaf node 1 and leaf node 2 are then divided by the sample feature "vehicle age", and the results are placed into leaf node 3, leaf node 4, leaf node 5 and leaf node 6 respectively. The sample features of the feature set are traversed in this way until the depth of the binary tree reaches the preset depth, i.e. the original training samples cannot be divided further, yielding a binary tree whose depth meets the requirement.
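A minimal sketch of this tree-building procedure (the dict-based sample representation and the depth limit of 8 are assumptions for the example):

```python
# Sketch of binary (isolation) tree construction: pick a random feature,
# pick a random cut point between its min and max, split, and recurse
# until the preset depth is reached or a sample set is indivisible.
import random

class Node:
    def __init__(self, left=None, right=None, feature=None, cut=None, size=0):
        self.left, self.right = left, right
        self.feature, self.cut = feature, cut
        self.size = size                     # number of samples in a leaf

def build_tree(samples, features, depth=0, max_depth=8):
    if depth >= max_depth or len(samples) <= 1:
        return Node(size=len(samples))       # leaf: no further division
    feature = random.choice(features)        # e.g. "age", then "vehicle_age"
    values = [s[feature] for s in samples]
    lo, hi = min(values), max(values)        # e.g. 28 and 56
    if lo == hi:
        return Node(size=len(samples))       # feature cannot split this set
    cut = random.uniform(lo, hi)             # e.g. 40
    left = [s for s in samples if s[feature] < cut]     # leaf node 1 side
    right = [s for s in samples if s[feature] >= cut]   # leaf node 2 side
    return Node(build_tree(left, features, depth + 1, max_depth),
                build_tree(right, features, depth + 1, max_depth),
                feature, cut)
```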
Further, the step of calculating the path length of each original training sample in the binary tree and calculating the anomaly score of each original training sample based on the path length specifically comprises:
counting the number of edges traversed by each original training sample in the binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
calculating a path correction value, and correcting the initial path length of each original training sample by the path correction value to obtain the path length of each original training sample in the binary tree;
and calculating the anomaly score of each original training sample according to the path length.
In the above embodiment, when calculating the anomaly score of an original training sample x, the path length (also called depth) of x in each binary tree is calculated first. Specifically, the binary tree is followed from top to bottom according to the values of the different sample features until a leaf node is reached, and the number e of edges that x passes through from the root node to the leaf node is counted, giving the initial path length h0(x), i.e. h0(x) = e. To obtain an accurate anomaly score for x, the initial path length h0(x) must be corrected by adding a path correction value. Assuming that n original training samples fall on the same leaf node as x, the path length h(x) of x in the binary tree is calculated by the following formula:
h(x) = h0(x) + C(n)
i.e. h(x) = e + C(n), where e is the number of edges that the original training sample x passes through from the root node to the leaf node, and C(n) is the path correction value, representing the average path length of the n samples that fall on the same leaf node as x. In general, C(n) is computed as:
C(n) = 2H(n-1) - 2(n-1)/n
where H(n-1) can be estimated as ln(n-1) + M, the constant M being the Euler constant with value 0.5772156649. The final anomaly score s(x) of the original training sample x is calculated by combining the results of the multiple binary trees:
s(x) = 2^(-E(h(x)) / C(ψ))
where E(h(x)) is the average of the path lengths of x over the multiple binary trees, ψ is the number of samples used to train a single binary tree, and C(ψ) is the average path length of a binary tree constructed from ψ samples, used mainly for normalization.
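The formulas above translate directly into code; the sketch below reuses the Node structure from the earlier tree-building sketch and assumes ψ > 1:

```python
# Sketch of the scoring formulas: h(x) = e + C(n) and
# s(x) = 2 ** (-E(h(x)) / C(psi)).
import math

M = 0.5772156649                      # Euler constant

def correction(n):
    """Path correction value C(n) = 2H(n-1) - 2(n-1)/n, with H(i) ~ ln(i) + M."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + M) - 2.0 * (n - 1) / n

def path_length(x, node, e=0):
    """h(x): edges from root to leaf, plus C(n) for the n samples in the leaf."""
    if node.left is None and node.right is None:
        return e + correction(node.size)
    child = node.left if x[node.feature] < node.cut else node.right
    return path_length(x, child, e + 1)

def anomaly_score(x, trees, psi):
    """s(x), combining the average path length of x over all binary trees."""
    mean_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_h / correction(psi))
```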
Further, the step of constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an anomaly recognition model specifically comprises:
randomly cutting the target training set into K equal training subsets, wherein K is a positive integer;
randomly extracting K-1 training subsets to form a model training set, and performing model training on the initial recognition model;
taking the remaining training subset as a cross-validation set, cross-validating the trained initial recognition model, and outputting a first validation result;
and iteratively updating the initial recognition model according to the first validation result until the initial recognition model converges, and outputting the converged anomaly recognition model.
In the above embodiment, the training set data is randomly divided into K parts: K-1 parts serve as the model training set and the remaining part as the cross-validation set for cyclically training the base classifiers, finally yielding K trained base classifiers, which are integrated by 1 secondary classifier into a converged case anomaly recognition model. In a specific embodiment of the application, the base classifiers use a LightGBM or CatBoost model and the secondary classifier uses a Logistic Regression model. For example, the training set data is randomly divided into 10 parts [1,2,3,4,5,6,7,8,9,10]; [1,2,3,4,5,6,7,8,9] is used to train classifier K1 and [10] to validate the trained K1; [1,2,3,4,5,6,7,8,10] is used to train classifier K2 and [9] to validate the trained K2; in this manner 10 classifiers K1, ..., K10 can be trained, and the anomaly recognition model is obtained by integrating the 10 trained base classifiers through 1 secondary classifier. Meanwhile, cross validation of the 10 classifiers yields 10 validation results, whose mean is taken as the first validation result.
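A sketch of this K-fold loop, with classification accuracy standing in for whatever validation metric the implementation actually uses:

```python
# Sketch: train on K-1 subsets, validate on the held-out subset, and take
# the mean of the K fold scores as the "first validation result".
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def first_validation_result(make_model, X, y, K=10):
    scores = []
    folds = KFold(n_splits=K, shuffle=True, random_state=0).split(X)
    for train_idx, val_idx in folds:
        model = make_model()                    # fresh base classifier per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))
    return sum(scores) / K
```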
In the above specific embodiment, the validation results of the 10 base classifiers K1, ..., K10 are output on the validation sets; 9 of the validation results are then randomly combined as the training data set of the secondary classifier, and the remaining validation result serves as its cross-validation set for cyclic training, finally yielding the trained secondary classifier. The case anomaly recognition model is obtained by combining the 1 secondary classifier with the 10 base classifiers.
Further, the step of iteratively updating the initial recognition model according to the first validation result until the initial recognition model converges and outputting the converged anomaly recognition model specifically comprises:
adjusting the model parameters of the initial recognition model, and training the parameter-adjusted initial recognition model on the model training set; and
cross-validating the trained initial recognition model on the cross-validation set, and outputting a second validation result;
and comparing the first validation result with the second validation result; if they differ, continuing to adjust the model parameters of the initial recognition model until the first and second validation results obtained after training are identical, and outputting the converged anomaly recognition model.
In the above embodiment, after the first validation result is obtained, the initial recognition model is iteratively updated according to it. Specifically, the model parameters of the initial recognition model are adjusted, i.e. a step parameter step is added to the model parameters; the parameter-adjusted initial recognition model is trained on the model training set and then cross-validated on the cross-validation set, outputting a second validation result. The first and second validation results are compared; if they differ, the model parameters continue to be adjusted until the two validation results obtained after training are identical, and the converged anomaly recognition model is output.
It should be noted that the original training samples screened by the IForest algorithm need secondary verification using the step parameter: the step parameter is increased by a small amount, e.g. 0.01, and through continuous iteration a state is found in which the mean of the prediction results no longer changes. This guarantees the quality of the abnormal samples, so that an optimal recognition result can be achieved.
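The iteration described above might be sketched as follows; the tuned parameter, the tolerance, and the round limit are all illustrative assumptions:

```python
# Sketch: nudge a model parameter by a small step, retrain, and stop once
# the mean of the predictions no longer changes between rounds.
import numpy as np

def tune_until_stable(make_model, X_train, y_train, X_val,
                      param=0.1, step=0.01, tol=1e-6, max_rounds=100):
    prev_mean, model = None, None
    for _ in range(max_rounds):
        model = make_model(param)              # e.g. param as a learning rate
        model.fit(X_train, y_train)
        mean_pred = float(np.mean(model.predict(X_val)))
        if prev_mean is not None and abs(mean_pred - prev_mean) < tol:
            break                              # validation results now match
        prev_mean, param = mean_pred, param + step
    return model, param
```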
It is emphasized that, in order to further ensure the privacy and security of the anomaly scores of the original training samples, the anomaly scores may also be stored in nodes of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, each data block containing information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware through computer readable instructions, which can be stored in a computer readable storage medium; when executed, the instructions can include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 5, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an abnormal case identification apparatus based on a tree structure, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the abnormal case identification apparatus based on tree structure according to this embodiment includes:
an obtaining module 501, configured to obtain an original training set in a predetermined time period from an initial case database, where the original training set includes a plurality of original training samples;
a calculating module 502, configured to calculate the anomaly score of each original training sample in the original training set based on a random isolation forest algorithm;
a comparison module 503, configured to compare the anomaly score of each original training sample with a preset threshold, and classify the original training samples according to the comparison result to obtain positive samples and negative samples;
a combination module 504, configured to randomly combine the positive samples and the negative samples obtained through identification to form a target training set;
a training module 505, configured to construct an initial recognition model, perform model training on the initial recognition model through the target training set, and output an anomaly recognition model;
the recognition module 506 is configured to obtain case data of a case to be recognized, import the case data of the case to be recognized to the anomaly recognition model, and output a recognition result.
Further, the calculating module 502 specifically includes:
the binary tree construction sub-module is used for constructing a binary tree through a plurality of original training samples in an original training set;
and the path calculation sub-module is used for calculating the path length of each original training sample in the binary tree and calculating the anomaly score of each original training sample based on the path length.
Further, the binary tree building submodule specifically includes:
the system comprises a sample leading-in unit, a training unit and a training unit, wherein the sample leading-in unit is used for extracting a plurality of original training samples in an original training set and leading the extracted original training samples into a preset initial binary tree model;
the characteristic combination unit is used for acquiring the sample characteristics of each original training sample and combining the acquired sample characteristics to form a characteristic set;
and the sample dividing unit is used for dividing the original training set by the feature set until the original training samples of the original training set are indivisible, and outputting a binary tree.
Further, the sample dividing unit specifically includes:
the characteristic extraction subunit is used for randomly extracting sample characteristics in the characteristic set in sequence and determining the maximum value and the minimum value of the extracted sample characteristics;
the sample dividing subunit is used for randomly selecting a numerical value between the maximum value and the minimum value as a cutting point and dividing the original training samples of the original training set;
and the binary tree output subunit is used for traversing the sample characteristics of the characteristic set until the depth of the binary tree meets the preset depth, and acquiring the binary tree with the depth meeting the requirement.
Further, the path calculation sub-module specifically includes:
the statistical unit is used for counting the number of edges of each original training sample in the binary tree and calculating the initial path length of each original training sample in the binary tree according to the number of the edges;
the correction unit is used for calculating a path correction value and correcting the initial path length of each original training sample through the path correction value to obtain the path length of each original training sample in the binary tree;
and the calculating unit is used for calculating the anomaly score of each original training sample according to the path length.
Further, the training module 505 specifically includes:
the segmentation submodule is used for randomly segmenting the target training set into K equal training subsets, wherein K is a positive integer;
the training submodule is used for randomly extracting K-1 parts of training subsets to form a model training set and carrying out model training on the initial recognition model;
the validation submodule is used for taking the remaining training subset as a cross-validation set, cross-validating the trained initial recognition model, and outputting a first validation result;
and the iteration submodule is used for iteratively updating the initial recognition model according to the first validation result until the initial recognition model converges, and outputting the converged anomaly recognition model.
Further, the iteration sub-module specifically includes:
the parameter adjusting unit is used for adjusting the model parameters of the initial recognition model and training the parameter-adjusted initial recognition model on the model training set; and
the cross-validation unit is used for cross-validating the trained initial recognition model on the cross-validation set and outputting a second validation result;
and the comparison unit is used for comparing the first validation result with the second validation result; if they differ, the model parameters of the initial recognition model continue to be adjusted until the first and second validation results obtained after training are identical, and the converged anomaly recognition model is output.
The embodiment of the application discloses an abnormal case recognition apparatus based on a tree structure. When the model training set is constructed, the anomaly score of each original training sample is calculated by the IForest algorithm; the anomaly scores are then compared in turn with a preset threshold, and the original training samples are classified according to the comparison results to obtain positive and negative samples. An anomaly recognition model is trained on the training set obtained in this way, and the trained model then identifies whether a case to be identified is abnormal. Classifying samples by IForest anomaly scores effectively eliminates interference from human factors and reduces subjective influence; moreover, the preset threshold can be adjusted to the actual situation, improving the proportion of positive to negative samples in the training set and avoiding the prior-art problem that too few positive samples yield an insufficiently accurate case anomaly recognition model.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 6, fig. 6 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62 and a network interface 63, communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded devices, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing an operating system installed in the computer device 6 and various application software, such as computer readable instructions of the abnormal case identification method based on the tree structure. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the computer readable instructions stored in the memory 61 or to process data, for example, to execute the computer readable instructions of the abnormal case identification method based on the tree structure.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The embodiment of the application discloses computer equipment. When the model training set is constructed, the abnormal score of each original training sample is calculated through the IForest algorithm, the abnormal score of each original training sample is then compared with a preset threshold value in turn, and the original training samples are classified according to the comparison results, wherein the types of the original training samples include positive samples and negative samples. An abnormal recognition model is then trained on the model training set obtained in the above manner, and the trained abnormal recognition model is used to recognize the abnormal condition of the case to be recognized. Because the abnormal scores of the original training samples are calculated by the IForest algorithm and the samples are classified according to these scores, interference from human factors is effectively eliminated and the influence of subjective factors is reduced. In addition, the preset threshold value can be changed according to the actual situation, so the proportion of positive and negative samples in the training set can be adjusted, which avoids the problem in the prior art that the accuracy of the trained abnormal case recognition model is not high enough because there are too few positive samples.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so as to cause the at least one processor to perform the steps of the tree structure based abnormal case identification method described above.
The embodiment of the application discloses a computer-readable storage medium. When the model training set is constructed, the abnormal score of each original training sample is calculated through the IForest algorithm, the abnormal score of each original training sample is then compared with a preset threshold value in turn, and the original training samples are classified according to the comparison results, wherein the types of the original training samples include positive samples and negative samples. An abnormal recognition model is then trained on the model training set obtained in the above manner, and the trained abnormal recognition model is used to recognize the abnormal condition of the case to be recognized. Because the abnormal scores of the original training samples are calculated by the IForest algorithm and the samples are classified according to these scores, interference from human factors is effectively eliminated and the influence of subjective factors is reduced. In addition, the preset threshold value can be changed according to the actual situation, so the proportion of positive and negative samples in the training set can be adjusted, which avoids the problem in the prior art that the accuracy of the trained abnormal case recognition model is not high enough because there are too few positive samples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method described in each embodiment of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their features with equivalents. Any equivalent structure made by using the contents of the specification and drawings of the present application, and applied directly or indirectly in other related technical fields, falls within the protection scope of the present application.

Claims (10)

1. A method for identifying abnormal cases based on a tree structure is characterized by comprising the following steps:
acquiring an original training set in a preset time period from an initial case database, wherein the original training set comprises a plurality of original training samples;
calculating the abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
comparing the abnormal score of each original training sample with a preset threshold value, and classifying the original training samples according to the comparison result to obtain positive samples and negative samples;
randomly combining the positive samples and the negative samples obtained by the classification to form a target training set;
constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an abnormal recognition model;
and acquiring case data of a case to be identified, importing the case data of the case to be identified into the abnormal identification model, and outputting an identification result.
2. The method for identifying abnormal cases based on tree structures as claimed in claim 1, wherein the step of calculating the abnormal score of each original training sample in the original training set based on the random isolated forest algorithm specifically comprises:
constructing a binary tree through a plurality of original training samples in the original training set;
calculating the path length of each original training sample in the binary tree, and calculating the abnormal score of each original training sample based on the path length.
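For reference, the path-length-to-score mapping in claim 2 matches the standard isolation forest formulation (Liu et al., 2008); the patent does not spell out its exact formula, so the following is the standard one it appears to follow. With E(h(x)) the average path length of sample x over the trees and n the number of samples:

```latex
s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad
c(n) = 2H(n - 1) - \frac{2(n - 1)}{n}, \qquad
H(i) \approx \ln i + 0.5772156649
```

Here c(n), the average path length of an unsuccessful search in a binary search tree over n samples, plays the role of the path correction value referred to in claim 5; scores near 1 mark likely anomalies, while scores well below 0.5 mark normal samples.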
3. The method for identifying abnormal cases based on tree structure as claimed in claim 2, wherein said step of constructing a binary tree from a plurality of original training samples in said original training set specifically comprises:
extracting a plurality of original training samples from the original training set, and introducing the extracted original training samples into a preset initial binary tree model;
acquiring sample characteristics of each original training sample, and combining the acquired sample characteristics to form a characteristic set;
and dividing the original training set through the feature set until the original training samples of the original training set can no longer be divided, and outputting the binary tree.
4. The method for identifying abnormal cases based on tree structure as claimed in claim 3, wherein said step of dividing said original training set by said feature set until the original training samples of said original training set can no longer be divided, and outputting said binary tree, specifically comprises:
randomly extracting sample features from the feature set in turn, and determining the maximum value and the minimum value of each extracted sample feature;
randomly selecting a numerical value between the maximum value and the minimum value as a cutting point, and dividing original training samples of the original training set;
and traversing the sample features of the feature set until the depth of the binary tree reaches a preset depth, and acquiring the binary tree whose depth meets the requirement.
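Claims 3 and 4 together describe the standard isolation-tree split: randomly pick a sample feature, pick a random cut point between that feature's minimum and maximum, and recurse until the node can no longer be divided or a preset depth is reached. A minimal sketch, with all names illustrative rather than taken from the patent:

```python
import random

def build_itree(samples, features, depth=0, max_depth=10):
    """Recursively split samples into an isolation-tree node (illustrative)."""
    # Stop when the node can no longer be divided or the preset depth is met.
    if depth >= max_depth or len(samples) <= 1:
        return {"size": len(samples)}
    f = random.choice(features)                 # randomly extract a sample feature
    lo = min(s[f] for s in samples)             # minimum of the extracted feature
    hi = max(s[f] for s in samples)             # maximum of the extracted feature
    if lo == hi:                                # feature cannot split this node
        return {"size": len(samples)}
    cut = random.uniform(lo, hi)                # random cut point between min and max
    left = [s for s in samples if s[f] < cut]
    right = [s for s in samples if s[f] >= cut]
    return {
        "feature": f, "cut": cut,
        "left": build_itree(left, features, depth + 1, max_depth),
        "right": build_itree(right, features, depth + 1, max_depth),
    }

# e.g. tree = build_itree([[1.0, 2.0], [1.5, 0.2], [9.0, 8.5]], features=[0, 1])
```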
5. The method for identifying abnormal cases based on tree structure as claimed in claim 2, wherein said step of calculating the path length of each of said original training samples in said binary tree and calculating the abnormal score of each of said original training samples based on said path length specifically comprises:
counting the number of edges of each original training sample in the binary tree, and calculating the initial path length of each original training sample in the binary tree according to the number of edges;
calculating a path correction value, and correcting the initial path length of each original training sample through the path correction value to obtain the path length of each original training sample in the binary tree;
and calculating the abnormal score of each original training sample according to the path length.
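A sketch of one plausible reading of claim 5, continuing the illustrative node format of the tree sketch above: the initial path length is the number of edges from the root to the leaf that isolates a sample, and the correction value c(size) compensates for leaves that still hold several samples because the depth limit was reached. The formula follows the standard IForest definition; the patent may differ in detail:

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n samples
    (the standard IForest correction; implementations often special-case small n)."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(x, node, edges=0):
    """Count edges down to x's leaf, then add the correction for leaf size."""
    if "feature" not in node:                  # reached a leaf
        return edges + c(node["size"])
    if x[node["feature"]] < node["cut"]:
        return path_length(x, node["left"], edges + 1)
    return path_length(x, node["right"], edges + 1)

def anomaly_score(avg_path_length, n):
    """Abnormal score from the averaged, corrected path length over all trees."""
    return 2.0 ** (-avg_path_length / c(n))
```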
6. The method for identifying abnormal cases based on tree structures according to any one of claims 1 to 5, wherein the step of constructing an initial recognition model, performing model training on the initial recognition model through the target training set, and outputting an abnormal recognition model specifically comprises:
randomly cutting the target training set into K equal parts of training subsets, wherein K is a positive integer;
randomly extracting K-1 parts of the training subsets to form a model training set, and performing model training on the initial recognition model;
taking the remaining training subset as a cross validation set, performing cross validation on the trained initial recognition model, and outputting a first verification result;
and iteratively updating the initial recognition model according to the first verification result until the initial recognition model converges, and outputting the abnormal recognition model after the model converges.
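A minimal sketch of the K-fold procedure in claim 6, using scikit-learn's KFold; the choice K = 5 and the decision-tree classifier standing in for the initial recognition model are assumptions, since the patent fixes neither:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))        # hypothetical target training set
y = rng.integers(0, 2, size=500)     # positive/negative sample labels

K = 5                                # cut the set into K equal training subsets
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0)   # stand-in recognition model
    model.fit(X[train_idx], y[train_idx])            # train on K-1 subsets
    result = accuracy_score(y[val_idx], model.predict(X[val_idx]))  # validate on the rest
    print(f"verification result: {result:.3f}")
```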
7. The method for identifying abnormal cases based on tree structure as claimed in claim 6, wherein the step of iteratively updating the initial recognition model according to the first verification result until the initial recognition model converges, and outputting the abnormal recognition model after the model converges comprises:
adjusting model parameters of the initial recognition model, and training the initial recognition model with the adjusted parameters through the model training set; and
performing cross validation on the trained initial recognition model through the cross validation set, and outputting a second verification result;
and comparing the first verification result with the second verification result; if the two results differ, continuing to adjust the model parameters of the initial recognition model until the second verification result obtained by training is the same as the first verification result, and outputting the abnormal recognition model after the model converges.
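Claim 7's loop can be read as: retrain with adjusted parameters and re-validate until two successive verification results agree. A generic sketch, where every callable (train_fn, validate_fn, adjust_fn) is an illustrative placeholder rather than anything named in the patent:

```python
def tune_until_converged(train_fn, validate_fn, params, adjust_fn, max_rounds=50):
    """Adjust parameters until two consecutive verification results agree.

    train_fn(params) -> model; validate_fn(model) -> result;
    adjust_fn(params, result) -> new params. All callables are illustrative.
    """
    first = validate_fn(train_fn(params))
    for _ in range(max_rounds):
        params = adjust_fn(params, first)      # adjust the model parameters
        second = validate_fn(train_fn(params))
        if second == first:                    # results match: treat as converged
            return train_fn(params)
        first = second                         # otherwise keep adjusting
    return train_fn(params)
```

In practice "the same" would usually mean agreement within a tolerance on a validation metric rather than strict equality.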
8. An abnormal case recognition device based on a tree structure is characterized by comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an original training set in a preset time period from an initial case database, and the original training set comprises a plurality of original training samples;
the calculating module is used for calculating the abnormal score of each original training sample in the original training set based on a random isolated forest algorithm;
the comparison module is used for comparing the abnormal score of each original training sample with a preset threshold value and classifying the original training samples according to the comparison result, wherein the types of the original training samples comprise positive samples and negative samples;
the combination module is used for randomly combining the positive samples and the negative samples obtained by the classification to form a target training set;
the training module is used for constructing an initial recognition model, performing model training on the initial recognition model through the target training set and outputting an abnormal recognition model;
and the recognition module is used for acquiring case data of a case to be recognized, importing the case data of the case to be recognized into the abnormal recognition model, and outputting a recognition result.
9. A computer device comprising a memory having computer readable instructions stored therein and a processor which, when executing the computer readable instructions, implements the steps of the tree structure based abnormal case identification method of any one of claims 1 to 7.
10. A computer-readable storage medium, having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of the tree structure based abnormal case identification method according to any one of claims 1 to 7.
CN202011211514.3A 2020-11-03 2020-11-03 Abnormal case identification method, device, equipment and storage medium based on tree structure Active CN112288025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011211514.3A CN112288025B (en) 2020-11-03 2020-11-03 Abnormal case identification method, device, equipment and storage medium based on tree structure

Publications (2)

Publication Number Publication Date
CN112288025A true CN112288025A (en) 2021-01-29
CN112288025B CN112288025B (en) 2024-04-30

Family

ID=74350537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011211514.3A Active CN112288025B (en) 2020-11-03 2020-11-03 Abnormal case identification method, device, equipment and storage medium based on tree structure

Country Status (1)

Country Link
CN (1) CN112288025B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382430A (en) * 2018-12-28 2020-07-07 卡巴斯基实验室股份制公司 System and method for classifying objects of a computer system
CN109948669A (en) * 2019-03-04 2019-06-28 腾讯科技(深圳)有限公司 A kind of abnormal deviation data examination method and device
CN110110757A (en) * 2019-04-12 2019-08-09 国电南瑞科技股份有限公司 A kind of power transmission and transformation suspicious data screening method and equipment based on Random Forest model
CN111581877A (en) * 2020-03-25 2020-08-25 中国平安人寿保险股份有限公司 Sample model training method, sample generation method, device, equipment and medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488187A (en) * 2021-08-03 2021-10-08 南通市第二人民医院 Anesthesia accident case collecting and analyzing method and system
CN113488187B (en) * 2021-08-03 2024-02-20 南通市第二人民医院 Anesthesia accident case collecting and analyzing method and system
CN113836128A (en) * 2021-09-24 2021-12-24 北京拾味岛信息科技有限公司 Abnormal data identification method, system, equipment and storage medium
CN115018607A (en) * 2022-07-01 2022-09-06 吉林工程技术师范学院 Accounting data processing method and system based on artificial intelligence
CN115018607B (en) * 2022-07-01 2023-01-24 吉林工程技术师范学院 Accounting data processing method and system based on artificial intelligence
CN116048912A (en) * 2022-12-20 2023-05-02 中科南京信息高铁研究院 Cloud server configuration anomaly identification method based on weak supervision learning
CN116048912B (en) * 2022-12-20 2024-07-30 中科南京信息高铁研究院 Cloud server configuration anomaly identification method based on weak supervision learning
CN117195139A (en) * 2023-11-08 2023-12-08 北京珺安惠尔健康科技有限公司 Chronic disease health data dynamic monitoring method based on machine learning
CN117195139B (en) * 2023-11-08 2024-02-09 北京珺安惠尔健康科技有限公司 Chronic disease health data dynamic monitoring method based on machine learning

Also Published As

Publication number Publication date
CN112288025B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN112148987B (en) Message pushing method based on target object activity and related equipment
CN112288025B (en) Abnormal case identification method, device, equipment and storage medium based on tree structure
WO2021155713A1 (en) Weight grafting model fusion-based facial recognition method, and related device
WO2021120677A1 (en) Warehousing model training method and device, computer device and storage medium
CN112307472A (en) Abnormal user identification method and device based on intelligent decision and computer equipment
CN110276369B (en) Feature selection method, device and equipment based on machine learning and storage medium
CN111784528A (en) Abnormal community detection method and device, computer equipment and storage medium
WO2022126961A1 (en) Method for target object behavior prediction of data offset and related device thereof
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN112686301A (en) Data annotation method based on cross validation and related equipment
CN112085087A (en) Method and device for generating business rules, computer equipment and storage medium
CN111931848B (en) Data feature extraction method and device, computer equipment and storage medium
CN112507170A (en) Data asset directory construction method based on intelligent decision and related equipment thereof
CN112634158A (en) Face image recovery method and device, computer equipment and storage medium
CN112990583A (en) Method and equipment for determining mold entering characteristics of data prediction model
CN112668482A (en) Face recognition training method and device, computer equipment and storage medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN116977095A (en) Dynamic wind control early warning method and device, computer equipment and storage medium
CN114971642A (en) Knowledge graph-based anomaly identification method, device, equipment and storage medium
CN114219664A (en) Product recommendation method and device, computer equipment and storage medium
CN113420161A (en) Node text fusion method and device, computer equipment and storage medium
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium
CN116777646A (en) Artificial intelligence-based risk identification method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant