WO2019019630A1

WO2019019630A1 - Anti-fraud identification method, storage medium, and server and device carrying the Ping An Brain

Info

Publication number: WO2019019630A1
Application number: PCT/CN2018/077230
Authority: WO
Inventors: 肖京; 王健宗; 王建明; 徐亮; 汪伟; 周宝; 李想
Original assignee: 平安科技（深圳）有限公司
Priority date: 2017-07-24
Filing date: 2018-02-26
Publication date: 2019-01-31
Also published as: CN107785058A

Abstract

Disclosed in the present application is an anti-fraud identification method, used for solving the problem of insufficient anti-fraud capabilities in the medical field. The method provided in the present application comprises: determining a target event; extracting target data related to the target event; and using at least two methods from among a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior to process the target data. Further provided in the present application are a storage medium and a server carrying a Ping An Brain.

Description

Anti-fraud identification method, storage medium, and server and device carrying the Ping An Brain

This application claims priority to Chinese Patent Application No. CN201710605531.7, filed with the Chinese Patent Office on July 24, 2017 and entitled "Anti-fraud identification method, storage medium and server carrying the Ping An Brain", the content of which is incorporated herein by reference.

Technical field

The present application relates to the medical field, and in particular to an anti-fraud identification method, a storage medium, and a server and device carrying the Ping An Brain.

Background technique

In the medical field, fraud is common, for example drug-mouse behavior, claims fraud, and illegal credit card reimbursement. Such fraud wastes medical resources and intensifies social conflict.

However, there is currently no comprehensive method for identifying such fraudulent behavior, so anti-fraud capabilities in the medical field are insufficient and fraud is difficult to control. Finding an anti-fraud method that further improves anti-fraud capabilities in the medical field has therefore become an urgent problem for those skilled in the art.

Summary of invention

technical problem

The embodiments of the present application provide an anti-fraud identification method, a storage medium, and a server and device carrying the Ping An Brain, so as to solve the problem that there is currently no comprehensive way to identify fraudulent behavior, which leaves anti-fraud capabilities in the medical field insufficient and makes fraudulent behavior difficult to control.

Problem solution

Technical solution

In a first aspect, an anti-fraud identification method is provided, including:

Determining a target event;

Extracting target data related to the target event;

Processing the target data by using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;

The method of constructing a decision model includes:

Obtaining rule template data, and extracting each variable object and each template sample in the rule template data;

Perform cluster analysis on the variable object to obtain a clustering result;

Matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as the first feature;

Calculating the black sample probability of each variable object separately, and using the black sample probability of each variable object as the second feature;

Constructing a decision model by the first feature and the second feature;

The method for identifying the fraud data includes:

Training a preset training data set by using a preset continuous model training method to establish a continuous anti-fraud model;

Performing training on the data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested;

The method for identifying fraudulent behavior includes:

Establishing a doctor-patient and drug-diagnosis relationship network based on social security medical treatment data, wherein the relationship network includes nodes, and each node belongs to a different relationship;

Performing group medical treatment behavior analysis on each node in the relationship network to extract multi-dimensional group medical treatment features corresponding to each node;

Importing the extracted multi-dimensional group medical treatment features into a preset classification model to identify the fraud rate of each node according to the classification model;

The target data includes at least two of data to be tested, rule template data, and social security visit data.

In a second aspect, a computer readable storage medium is provided, the computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the anti-fraud identification method described above.

In a third aspect, a server carrying the Ping An Brain big data platform is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the steps of the anti-fraud identification method described above are implemented when the processor executes the computer readable instructions.

In a fourth aspect, an apparatus for identifying fraud data is provided, including:

A modeling module, configured to train a preset training data set by using a preset continuous model training method to establish a continuous anti-fraud model;

An identification module, configured to perform training on the data to be tested based on the continuous anti-fraud model and identify fraud data in the data to be tested.

Advantageous effects of the invention

Beneficial effect

In the embodiments of the present application, the target data of a target event is processed by using at least two of a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior, so that anti-fraud decision-making and identification for events in the medical field can be more comprehensive and complete.

Brief description of the drawing

DRAWINGS

FIG. 1 is a flowchart of an anti-fraud identification method in an embodiment of the present application;

FIG. 2 is a schematic flowchart of a method for constructing a decision model in an embodiment;

FIG. 3 is a schematic flowchart of a method for constructing a decision model in another embodiment;

FIG. 4 is a schematic flowchart of constructing the decision model in an embodiment;

FIG. 5 is a schematic flowchart of cluster analysis of a variable object in an embodiment;

FIG. 6 is a schematic structural diagram of an apparatus for constructing a decision model in an embodiment;

FIG. 7 is a schematic flowchart of a first embodiment of a method for identifying fraud data according to the present application;

FIG. 8 is a schematic flowchart of a second embodiment of a method for identifying fraud data according to the present application;

FIG. 9 is a schematic diagram of functional modules of a first embodiment of a device for identifying fraud data according to the present application;

FIG. 10 is a schematic flowchart of a first embodiment of a method for identifying social security fraud behavior according to the present application;

FIG. 11 is a schematic flowchart of the refinement of step S10 in FIG. 10;

FIG. 12 is a schematic flowchart of the refinement of step S30 in FIG. 10;

FIG. 13 is a schematic flowchart of a second embodiment of a method for identifying social security fraud behavior according to the present application;

FIG. 14 is a schematic diagram of functional modules of a first embodiment of an apparatus for identifying social security fraud behavior according to the present application;

FIG. 15 is a schematic diagram of a relationship network of the present application;

FIG. 16 is a schematic diagram of a server carrying a Ping An Brain Big Data platform according to an embodiment of the present application.

Invention embodiment

Embodiments of the invention

The embodiments of the present application provide an anti-fraud identification method, a storage medium, and a server and device carrying the Ping An Brain, for more comprehensive and complete anti-fraud decision-making and identification of events in the medical field.

The Ping An Brain big data platform provided by the present invention draws on the group's big data resources in financial and non-financial fields, combines the advantages of its own technology and platform, and uses internationally advanced big data technologies such as data mining, machine learning and deep learning to refine and classify various structured and unstructured data in order to mine the value of the data.

As shown in FIG. 1, an anti-fraud identification method provided by the present application includes: 101, determining a target event; 102, extracting target data related to the target event; 103, processing the target data by using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior. The target data includes at least two of data to be tested, rule template data, and social security visit data.
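The requirement in step 103 that at least two of the three methods be applied can be illustrated with a small dispatcher. This is only a sketch: the function name `process_target_data`, the dictionary keys, and the placeholder callables are hypothetical and stand in for the three methods described in the text.

```python
def process_target_data(target_data, methods, selected):
    """Run the selected methods on the target data; at least two are required."""
    chosen = list(dict.fromkeys(selected))  # drop duplicates, keep order
    if len(chosen) < 2:
        raise ValueError("at least two of the three methods must be selected")
    return {name: methods[name](target_data) for name in chosen}


# Placeholder callables (hypothetical) standing in for the three methods.
methods = {
    "decision_model": lambda data: "used " + ", ".join(sorted(data)),
    "fraud_data": lambda data: "used " + ", ".join(sorted(data)),
    "fraud_behavior": lambda data: "used " + ", ".join(sorted(data)),
}

# Hypothetical target data holding two of the three data kinds named in the text.
target_data = {"rule_template_data": [], "data_to_be_tested": []}

results = process_target_data(target_data, methods, ["decision_model", "fraud_data"])
```

Passing a single method name raises `ValueError`, which mirrors the "at least two" condition.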

Embodiment 1:

Referring to FIG. 2, an embodiment of a method for constructing a decision model in an embodiment of the present application includes:

Step S110: Obtain rule template data, and extract each variable object and each template sample in the rule template data.

Specifically, a rule template is a set of criteria used to help determine the result of an audit. The review of a document or project may correspond to one or more rule templates. For example, reviewing a lender's credit rating may involve rule templates such as "at which branches has the lender taken out loans" and "at which institutions does the lender have a bad record". Each rule template has its own rule template data, which may include the variable objects, the template samples, and the matching relationship between variable objects and template samples. A variable object is a qualitative (categorical) variable, and each variable object corresponds to a different category in the rule template. For example, if the rule template is "at which branches has the user taken out loans", the corresponding rule template data may record that user 1 took out a loan at branch A, user 2 at branch B, and user 3 at branch C. Here the branches (branch A, branch B, branch C) are the variable objects, and user 1, user 2, and user 3 are the template samples.
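As an illustration of step S110, the branch example above can be written as a list of records from which the variable objects and template samples are extracted. The record layout and the names `rule_template_data` and `extract` are assumptions made for this sketch; the text does not prescribe a storage format.

```python
# Hypothetical rule template data for the rule template
# "at which branches has the user taken out loans".
# Each record pairs a template sample (user) with a variable object (branch).
rule_template_data = [
    {"sample": "user 1", "variable_object": "branch A"},
    {"sample": "user 2", "variable_object": "branch B"},
    {"sample": "user 3", "variable_object": "branch C"},
]


def extract(records):
    """Return distinct variable objects and template samples in first-seen order."""
    variable_objects = list(dict.fromkeys(r["variable_object"] for r in records))
    template_samples = list(dict.fromkeys(r["sample"] for r in records))
    return variable_objects, template_samples


variable_objects, template_samples = extract(rule_template_data)
```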

In step S120, cluster analysis is performed on the variable object to obtain a clustering result.

Specifically, multi-dimensional data can be extracted for each variable object and the variable objects clustered according to it. Multi-dimensional data refers to the data of a variable object in each dimension; for example, if the variable objects are branches, the multi-dimensional data may include each branch's total number of loans, total loan amount, average loan period, size, and geographical location. Cluster analysis is the process of grouping physical or abstract objects into classes of similar objects. Clustering the variable objects groups identical or similar variable objects together and so reduces the number of levels of the variable object. For example, if the variable objects are branch A, branch B, branch C, branch D, and so on, cluster analysis may find that branch A is similar to branch B and assign both to group A, and that branch C is similar to branch D and assign both to group B, so that the variable object is reduced from the level of individual branches to the level of groups. After the variable objects are clustered, a clustering result composed of the individual clusters is obtained.

Step S130, matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as the first feature.

Specifically, after cluster analysis has been performed on the variable objects and the clustering result obtained, the clustering result can be matched with each template sample according to the matching relationship between variable objects and template samples in the rule template data. For example, suppose the rule template is "at which institutions does the lender have a bad record" and the rule template data records that user 1 has a bad record at the FK organization, user 2 at the CE organization, user 3 at the KD organization, and so on. The variable objects (FK organization, CE organization, KD organization, ...) are clustered into clusters named group A, group B, group C, ..., and the clustering result is matched with the template samples user 1, user 2, user 3, .... Table 1 shows the matching relationship between the variable objects and the template samples in the rule template data, and Table 2 shows the matching relationship between the clustering result and each template sample. A "1" may be used to indicate that a template sample matches a variable object or cluster, but the representation is not limited to this.

Table 1

	FK	CE	KD	...
User 1	1	0	0	...
User 2	0	1	0	...
User 3	0	1	1	...
...	...	...	...	...

Table 2

	Group A	Group B	Group C	...
User 1	1	0	0	...
User 2	1	0	0	...
User 3	0	1	0	...
...	...	...	...	...

Clustering the variable objects clearly reduces the number of levels of the variable object, which benefits modeling.
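The matching in step S130 can be sketched as replacing each variable object by the cluster it was assigned to and then building a 0/1 indicator row per template sample. The user-to-organization matches follow the prose example (user 1 at FK, user 2 at CE, user 3 at KD); the cluster assignment `object_to_cluster` is a hypothetical clustering result chosen for illustration.

```python
# Matching relationship taken from the prose example: each template sample
# (user) is matched to the organizations where a bad record exists.
sample_to_objects = {
    "user 1": {"FK"},
    "user 2": {"CE"},
    "user 3": {"KD"},
}

# Hypothetical clustering result: the cluster each variable object fell into.
object_to_cluster = {"FK": "group A", "CE": "group A", "KD": "group B"}
clusters = ["group A", "group B", "group C"]


def first_feature(sample_to_objects, object_to_cluster, clusters):
    """Build one 0/1 row per template sample, with one column per cluster."""
    rows = {}
    for sample, objects in sample_to_objects.items():
        matched = {object_to_cluster[obj] for obj in objects}
        rows[sample] = [1 if cluster in matched else 0 for cluster in clusters]
    return rows


feature = first_feature(sample_to_objects, object_to_cluster, clusters)
```

Under these assumptions the rows for user 1, user 2, and user 3 are `[1, 0, 0]`, `[1, 0, 0]`, and `[0, 1, 0]`, the same pattern as Table 2.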

In step S140, the black sample probabilities of the respective variable objects are respectively calculated, and the black sample probabilities of the respective variable objects are taken as the second features.

Specifically, the output of the decision model is usually a black sample or a white sample. A black sample is a sample that fails the audit and a white sample is one that passes. For example, in a decision model for bank loan qualification review, black samples are users who fail the loan qualification review and white samples are users who pass it. Calculating the black sample probability of each variable object means calculating, for each variable object, the proportion of its template samples in the rule template data whose result type is black sample. For example, if the rule template is "at which institutions does the lender have a bad record", one can calculate "the probability that a user with a bad record at the KD organization ends up as a black sample". The black sample probability of a variable object can be calculated as: black sample probability = number of black samples of the variable object / total number of template samples of the variable object. The calculated black sample probability of each variable object can be used as the second feature in the form of a continuous variable. In other embodiments, the WOE (weight of evidence) value of each variable object may be calculated instead, as WOE = ln[(number of black samples of the variable object / total number of black samples) / (number of white samples of the variable object / total number of white samples)]. With the ratio written this way, the higher the WOE value, the higher the probability that a template sample of the variable object is a black sample.
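The two quantities in step S140 can be computed directly from counts. The sketch below follows the formulas as written in the text, with the black-sample share in the numerator of the WOE ratio; the counts for the KD organization and the overall totals are invented for illustration.

```python
import math


def black_sample_probability(black, total):
    """Black samples of a variable object divided by all of its template samples."""
    return black / total


def weight_of_evidence(black, white, total_black, total_white):
    """WOE with the black-sample share over the white-sample share, as in the text."""
    return math.log((black / total_black) / (white / total_white))


# Hypothetical counts: 100 template samples have a bad record at the KD
# organization, 20 of them black; overall there are 100 black and 900 white samples.
kd_probability = black_sample_probability(black=20, total=100)
kd_woe = weight_of_evidence(black=20, white=80, total_black=100, total_white=900)
```

Here `kd_probability` is 0.2 and `kd_woe` is ln(2.25). A positive value means the variable object's share of black samples exceeds its share of white samples.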

Step S150, constructing a decision model by using the first feature and the second feature.

Specifically, the conventional way to construct a decision model is to feed in all of the rule template data for modeling. Rule template data has many levels and is complex, which hinders modeling and degrades model performance. Using the matched clustering result as the first feature and the black sample probability of each variable object as the second feature, in place of the original rule template data, as the input for constructing the decision model both reduces the number of levels in the data and retains the influence of each variable object on the decision result, making the decision result more accurate. The decision model may be a machine learning model such as a decision tree, a GBDT (Gradient Boosting Decision Tree) model, or an LDA (Linear Discriminant Analysis) model. An audit decision model for a document or project may correspond to one or more rule templates; the first feature and the second feature corresponding to each rule template are then obtained separately and used in place of the original rule template data as input for constructing the decision model. When a rule template has few variable objects, its rule template data can be input directly to build the model.

In the above method for constructing a decision model, each variable object and each template sample in the rule template data are extracted, cluster analysis is performed on the variable objects to obtain a clustering result, the clustering result is matched with each template sample according to the rule template data, and the matched clustering result is used as the first feature; the black sample probability of each variable object is calculated and used as the second feature; and the decision model is then constructed from the first feature and the second feature. Cluster analysis of the variable objects reduces the dimensions and levels in the data, which facilitates constructing the decision model and lessens the impact on model performance. In addition, constructing the decision model from the first feature and the second feature makes the model more accurate, helps to quickly process services that require complex rule review, and improves decision efficiency.

As shown in FIG. 3, in an embodiment, the foregoing method for constructing a decision model further includes:

Step S210: Mapping each variable object to a predefined label according to a preset algorithm.

Specifically, the labels may be defined in advance and each variable object mapped to one of them. The preset algorithm may include a hash function such as MD5 (Message-Digest Algorithm 5) or SHA (Secure Hash Algorithm), but is not limited to these. Each variable object is mapped to a predefined label according to the preset algorithm. For example, if the variable objects are branch A, branch B, branch C, and so on, the SHA algorithm may map branch A and branch C to label A and branch B to label K. The number of labels can be set according to the actual situation so that no label contains too many variable objects, which reduces the dimensions and levels in the data while retaining part of the original information.
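A minimal sketch of the mapping in step S210, using SHA-256 from Python's `hashlib` as one member of the SHA family. Reducing the digest modulo the number of labels is an assumption, since the text does not say how a hash value is turned into a label; the label names are hypothetical.

```python
import hashlib


def map_to_label(variable_object, labels):
    """Hash the variable object's name and pick one of the predefined labels."""
    digest = hashlib.sha256(variable_object.encode("utf-8")).digest()
    index = int.from_bytes(digest, "big") % len(labels)  # assumed reduction step
    return labels[index]


labels = ["label A", "label K"]  # hypothetical predefined labels
assigned = {name: map_to_label(name, labels)
            for name in ["branch A", "branch B", "branch C"]}
```

The mapping is deterministic, so the same variable object always receives the same label, and several variable objects can share one label.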

Step S220: Match the label with each template sample according to the rule template data, and use the matched label as the third feature.

Specifically, the label is matched with each template sample according to the matching relationship between the variable object and the template sample in the rule template data, and the matched label is used as the third feature for modeling.

Step S230, constructing a decision model by using the first feature, the second feature, and the third feature.

Specifically, the matched clustering result is used as the first feature, the black sample probability of each variable object as the second feature, and the matched label as the third feature. The first, second, and third features replace the full rule template data as the input for constructing the decision model, which both reduces the number of levels in the data and retains the influence of each variable object on the decision result, making the decision result more accurate.

The above method constructs the decision model from the first feature, the second feature, and the third feature. Clustering the variable objects and mapping them to predefined labels reduces the dimensions and levels in the data, which facilitates constructing the decision model, lessens the impact on model performance, makes the model more accurate, helps to quickly process services that require complex rule review, and improves decision efficiency.

As shown in FIG. 4, in an embodiment, step S230 constructs a decision model by using the first feature, the second feature, and the third feature, including the following steps:

In step S302, the original node is established.

Specifically, in this embodiment, the decision model may be a decision tree model, and the original node of the decision tree may be established first.

Step S304: Acquire a result type of each template sample according to the rule template data.

Specifically, the result type of the template sample refers to the final result of the template sample, such as a black sample, a white sample, etc., and the result type of each template sample can be obtained from the rule template data.

Step S306, traversing and reading the first feature, the second feature, and the third feature, respectively, to generate a read record.

Specifically, the first feature, the second feature, and the third feature are each traversed to generate read records, that is, every possible decision tree branch is traversed. For example, traversing the first feature generates read records such as "user 1 has a non-performing loan record in group A" and "user 2 has a non-performing loan record in group A", and traversing the second feature generates read records such as "the black sample probability of the FK organization is 20%" and "the black sample probability of the CE organization is 15%". Each read record may become a branch of the decision tree.

Step S308, calculating the segmentation purity of each piece of the read record according to the result type of each template sample, and determining the segmentation point according to the segmentation purity.

Specifically, the segmentation purity of each read record can be determined by calculating Gini impurity, entropy, information gain, or the like. Gini impurity is the expected error rate when a result drawn at random from the set is applied to a data item in the set; entropy measures the degree of disorder of a system; and information gain measures how well a read record distinguishes template samples. Calculating the segmentation purity of a read record can be understood as measuring, if the template samples are divided according to that read record, how much the predicted result types differ from the real result types: the smaller the difference, the greater the segmentation purity and the purer the read record. For example, Gini impurity can be calculated as:

Gini = 1 - [P(1)² + P(2)² + ... + P(i)² + ... + P(m)²]

Then segmentation purity = 1 - Gini impurity, where i ∈ {1, 2, ..., m} indexes the m final results of the decision model, and P(i) is the proportion of template samples whose result type is the i-th final result when the read record is used as the judgment condition.

The optimal segmentation point can be determined from the segmentation purity of each read record: the read record with the higher segmentation purity is given priority as a branch, and the original node is split accordingly.
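The Gini calculation and the choice of segmentation point in step S308 can be sketched as follows. Each candidate read record is represented here simply by the list of result types of the template samples that satisfy it; that representation, the record names, and the sample counts are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(result_types):
    """Gini = 1 - sum of squared proportions of each result type."""
    total = len(result_types)
    counts = Counter(result_types)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def segmentation_purity(result_types):
    """Segmentation purity = 1 - Gini impurity."""
    return 1.0 - gini_impurity(result_types)


# Hypothetical read records with the result types of the samples they select.
read_records = {
    "bad loan record in group A": ["black", "black", "black", "white"],
    "bad loan record in group B": ["black", "white"],
}

purities = {name: segmentation_purity(types) for name, types in read_records.items()}
best_record = max(purities, key=purities.get)  # highest purity is split first
```

With these counts the purities are 0.625 and 0.5, so the group A record is chosen as the first segmentation point.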

Step S310, acquiring features corresponding to the segmentation points, and establishing a new node.

Specifically, the feature corresponding to the segmentation point can be acquired and a new node established. For example, after the segmentation purity of every read record is calculated, suppose the read record with the maximum segmentation purity is "user 1 has a bad loan record in group A". The original node can then be split into two branches, one for having a bad loan record in group A and the other for having no bad loan record in group A, and the corresponding nodes are generated. The next segmentation point is then sought at each new node and the split is repeated until all read records have been added to the decision tree. After the decision tree model is constructed, the tree can be pruned: nodes corresponding to read records whose segmentation purity is below a preset purity value are cut off, so that every branch of the decision tree has a high segmentation purity. In other embodiments, the number of nodes in the decision tree may be set in advance, and construction of the decision tree stops when the number of nodes reaches the set number.

The above method for constructing a decision model traverses the first feature, the second feature, and the third feature to generate read records, calculates the segmentation purity of each read record according to the result type of each template sample, and determines the segmentation points from the segmentation purity to construct the decision model. This makes the model more accurate, helps to quickly process services that require complex rule review, and improves decision efficiency.

As shown in FIG. 5, in an embodiment, step S120 performs cluster analysis on the variable object to obtain a clustering result, including:

Step S402, randomly selecting a plurality of variable objects from the variable object as the first cluster center of the cluster.

Specifically, a plurality of variable objects can be randomly selected from all the variable objects, each selected variable object is used as the first cluster center of one cluster, and each cluster is named separately. Each first cluster center corresponds to one cluster, that is, the number of clusters equals the number of selected variable objects.

In step S404, the distances of the respective variable objects to the respective first cluster centers are respectively calculated.

In one embodiment, step S404 separately calculates the distances of the respective variable objects to the respective first cluster centers, including:

(a) Obtain multidimensional data of each variable object based on the rule template data.

Specifically, the multi-dimensional data of each variable object can be obtained from the rule template data. Multi-dimensional data refers to the data of a variable object in each dimension; for example, if the variable objects are branches, the multi-dimensional data may include each branch's total number of loans, total loan amount, average loan period, size, and geographic location.

(b) Calculating the distances of the respective variable objects to the respective first cluster centers according to the multidimensional data of the respective variable objects.

Specifically, based on the multi-dimensional data obtained for each variable object, the distance between two variable objects can be calculated with a measure such as Euclidean distance or cosine similarity, and the distance from each variable object to each first cluster center is calculated in this way. For example, if there are 4 clusters, there are 4 first cluster centers, and for each variable object the distance to the first of these centers, the distance to the second, and so on are calculated.
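The two measures named for step S404 can be written in a few lines. The sketch assumes each variable object's multi-dimensional data has already been turned into a numeric vector of equal length; how non-numeric dimensions such as geographic location are encoded is not covered here.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


distance = euclidean_distance((0.0, 0.0), (3.0, 4.0))
similarity = cosine_similarity((1.0, 2.0), (2.0, 4.0))
```

Note that cosine similarity grows as vectors become more alike, so a distance derived from it is usually taken as 1 minus the similarity.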

Step S406, dividing each variable object according to the calculation result, and dividing the variable object into clusters corresponding to the first cluster center with the shortest distance.

Specifically, after the distance from each variable object to each first cluster center has been calculated, each variable object may be assigned to the cluster corresponding to the first cluster center at the shortest distance. In other embodiments, the calculated distance may instead be compared with a preset distance threshold: when the distance between a variable object and a first cluster center is less than the distance threshold, the variable object is assigned to the cluster corresponding to that first cluster center.

Step S408, respectively calculating the second cluster centers of the divided clusters.

Specifically, after the division is completed, each cluster may include one or more variable objects, and the second cluster center of each cluster may be recalculated by using the mean value formula, and the center of each cluster is reselected.

In step S410, it is determined whether the distance between the first cluster center and the second cluster center in each cluster is less than a preset threshold. If yes, step S414 is performed, and if no, step S412 is performed.

Specifically, the distance between the first cluster center and the second cluster center of each cluster is calculated and compared with the preset threshold. If this distance is less than the preset threshold for all clusters, the clusters have stabilized and no longer change, and each cluster can be output as the clustering result. If the distance for any cluster is not less than the preset threshold, the variable objects need to be re-divided among the clusters.

Step S412, using the second cluster center as the new first cluster center of the corresponding cluster, and continuing to perform step S404.

Specifically, if the distance between the first cluster center and the second cluster center of a cluster is not less than the preset threshold, the second cluster center replaces the first cluster center of that cluster, the step of calculating the distance from each variable object to each first cluster center is performed again, and steps S404 to S412 are repeated until every cluster is stable and no longer changes.

In step S414, each cluster is output as a clustering result.

In the above method for constructing a decision model, cluster analysis of the variable objects merges similar variable objects into one cluster, which reduces the number of levels in the data and facilitates constructing the decision model.
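Steps S402 to S414 can be sketched as the loop below, using Euclidean distance and the mean of each cluster's members as the second cluster center. The optional `initial` argument (fixed starting centers for reproducibility), the `seed` argument, and the `max_rounds` safeguard are additions for this sketch and are not part of the described method; the branch vectors are invented.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def cluster_variable_objects(data, k, threshold=1e-9, initial=None,
                             seed=None, max_rounds=100):
    """data maps each variable object's name to its numeric vector."""
    names = list(data)
    if initial is None:
        initial = random.Random(seed).sample(names, k)  # S402: random first centers
    centers = [data[name] for name in initial]
    clusters = []
    for _ in range(max_rounds):                         # safeguard, not in the text
        clusters = [[] for _ in centers]
        for name in names:                              # S404 and S406
            distances = [euclidean(data[name], c) for c in centers]
            clusters[distances.index(min(distances))].append(name)
        new_centers = [                                 # S408: second centers
            mean_point([data[n] for n in members]) if members else old
            for members, old in zip(clusters, centers)
        ]
        if all(euclidean(old, new) < threshold          # S410
               for old, new in zip(centers, new_centers)):
            break                                       # S414: output clusters
        centers = new_centers                           # S412
    return clusters


# Hypothetical two-dimensional data for four branches.
data = {
    "branch A": (1.0, 1.0),
    "branch B": (1.1, 0.9),
    "branch C": (10.0, 10.0),
    "branch D": (10.2, 9.8),
}
result = cluster_variable_objects(data, k=2, initial=["branch A", "branch C"])
```

Starting from branch A and branch C, the loop groups branch A with branch B and branch C with branch D. With randomly chosen starting centers the outcome can differ, because this kind of iteration only converges to a local optimum.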

Embodiment 2:

As shown in FIG. 6, an apparatus for constructing a decision model includes an extraction module 510, a clustering module 520, a first feature module 530, a second feature module 540, and a building module 550.

The extraction module 510 is configured to acquire rule template data, and extract each variable object and each template sample in the rule template data.

The clustering module 520 is configured to perform cluster analysis on the variable object to obtain a clustering result.

The first feature module 530 is configured to match the clustering result with each template sample according to the rule template data, and use the matched clustering result as the first feature.

The second feature module 540 is configured to calculate the black sample probability of each variable object and use the black sample probability of each variable object as the second feature.

The building module 550 is configured to construct a decision model by using the first feature and the second feature.

Embodiment 3:

Referring to FIG. 7, an embodiment of a method for identifying fraud data in an embodiment of the present application includes:

Step K10: training the preset training data set by using a preset continuous model training manner to establish a continuous anti-fraud model;

In this embodiment, the preset continuous model training manner first combines data analysis theories such as decision trees and random forests with data analysis tools such as R and SAS, and trains the preset training data set to establish the continuous anti-fraud model. For example, the preset training data set can be divided into multiple groups, on which training and intermediate testing are performed separately to establish the continuous anti-fraud model. In one implementation manner, the preset training data set is divided into multiple groups and model training and testing are performed within each group separately, so that the training results of the groups are independent and do not affect one another. The models obtained after training and testing are then integrated to obtain the final continuous anti-fraud model.

In another implementation manner, the preset training data set is divided into multiple groups that are trained and tested in turn, with the training and testing results of the previous group serving as the basis for the training and testing of the next group; that is, the training results of adjacent groups are related to each other. Throughout the training process, the model is continuously optimized and improved to obtain the final continuous anti-fraud model.

Of course, other model training manners may also be used to train the preset training data set and establish the continuous anti-fraud model; no limitation is imposed here.

Step K20: training the data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested.

After the continuous anti-fraud model is established, it can be used to train the data to be tested so as to analyze and identify the fraud data in the data to be tested. For example, the data to be tested may be trained and tested with the established continuous anti-fraud model in a manner that is the same as or similar to the manner used on the preset training data set when the model was established, and the fraud data in the data to be tested is identified from the results of that training and testing.

In some scenarios where fraud is prone to occur, such as malicious social security reimbursement, the proportion of fraud data in the entire body of social security big data is extremely small; that is, the fraud data is highly unbalanced. If an ordinary single model is used to identify the fraud data, the unbalanced nature of the fraud data leads to low recognition accuracy and recall. Therefore, in this embodiment, a continuous anti-fraud model is established to address the unbalanced nature of the fraud data when identifying the data to be tested. For example, fraud data can be identified by having multiple models vote together, which can effectively improve the recognition accuracy and recall for fraud data, determine fraud cases more accurately, and thereby narrow the scope and reduce the cost of manual review.

In this embodiment, a continuous anti-fraud model is established by using a preset continuous model training manner, and the continuous anti-fraud model is used to train the data to be tested so as to identify the fraud data in it. Because fraud data is unbalanced within the data to be tested, using the continuous anti-fraud model to analyze and identify the fraud data improves recognition accuracy and recall compared with an ordinary single model, determines fraud cases more accurately, and thereby narrows the scope and reduces the cost of manual review.
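The idea of training one model per group and letting the models vote together can be illustrated with the following sketch. It is not the model of the application: the one-feature threshold model, the round-robin grouping, and the names `split_groups`, `train_threshold_model`, and `vote` are all hypothetical and chosen only to keep the example self-contained. It assumes every group contains both fraud (label 1) and non-fraud (label 0) records and that an odd number of models is used so that votes cannot tie.

```python
from collections import Counter

def split_groups(records, n_groups):
    """Divide the training records into n_groups round-robin groups."""
    return [records[i::n_groups] for i in range(n_groups)]

def train_threshold_model(group):
    """Hypothetical stand-in for a classic model such as a decision tree.

    Each record is (value, label); the model flags values above the
    midpoint between the mean fraud value and the mean normal value.
    """
    fraud = [x for x, y in group if y == 1]
    normal = [x for x, y in group if y == 0]
    cut = (sum(fraud) / len(fraud) + sum(normal) / len(normal)) / 2
    return lambda x: 1 if x > cut else 0

def vote(models, x):
    """Return the label predicted by the majority of the models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

In practice the per-group model would be replaced by whichever classic model is preset, and the integration step could weight the models rather than count votes equally.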

Further, in other embodiments, the continuous anti-fraud model used to analyze and identify the fraud data in the data to be tested is a direct continuous model, and the above step K10 can be replaced by:

Decomposing the preset training data set into a training set and a test set according to a preset ratio;

Retaining the test set, and further decomposing the training set into two sub-training sets according to a preset ratio, the two sub-training sets serving respectively as the training set and the test set of the next-layer model;

Repeating the division of the training set a preset number of times;

Training models on the divided multi-layer training sets by using preset classic models, and testing them on the retained multi-layer test sets, to establish a direct continuous model.

In this embodiment, N-layer continuous model training can be performed to establish the direct continuous model, where N is a positive integer greater than or equal to 2. For example, the direct continuous model can be trained as follows:

Step 1: Decompose the preset training data set into a training set Train_set and a test set Test_set according to a preset ratio, and retain the test set Test_set.

Step 2: Further decompose the training set Train_set into two sub-training sets Train_setl1 and Train_setl2 according to a preset ratio; Train_setl1 and Train_setl2 serve respectively as the training set and the test set of the next-layer model.

Repeat Step 2 until the training set has been divided a preset number of times.

Step 3: Using the preset common classic models, train and tune a model on each of the N layers of training sets, test it on the corresponding one of the N layers of test sets to adjust the parameters, and retain the model. The classic models include but are not limited to a decision tree model, a random forest model, and the like.

Step 4: Collect and calibrate the retained models to obtain the direct continuous model.
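The repeated division in Steps 1 and 2 can be sketched as below. This is a minimal sketch under stated assumptions: the data is an in-memory list that has already been shuffled, the same ratio is used at every layer, and `fit` is a hypothetical callable standing in for whichever classic model is preset. The names `layered_split` and `train_layers` are illustrative only.

```python
def split(data, ratio):
    """Split a list into (training part, test part) by the given ratio."""
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]

def layered_split(data, ratio, n_layers):
    """Return [(train_1, test_1), ..., (train_N, test_N)].

    Each layer's training set is split again to produce the next layer,
    while the test set of every layer is retained.
    """
    layers = []
    train = data
    for _ in range(n_layers):
        train, test = split(train, ratio)
        layers.append((train, test))
    return layers

def train_layers(layers, fit):
    """Train one model per layer; `fit` is a hypothetical training callable."""
    return [fit(train) for train, _test in layers]
```

With 100 records, a ratio of 0.8, and three layers, the layer sizes are 80/20, 64/16, and 51/13. Tuning each model on its retained test set and calibrating the collected models (Steps 3 and 4) are left out of the sketch because the application does not specify how they are carried out.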

Further, the above step K20 can be replaced by:

Performing a multi-layer division of the data to be tested in the same ratio as that used for the training set in the training data set, using the direct continuous model to train each layer of the divided data to be tested separately, and identifying fraud data in the data to be tested.

After the direct continuous model is established, it can be used to train the data to be tested so as to analyze and identify the fraud data in it. Specifically, the data to be tested may be randomly divided into the same number of layers as the training set was divided into when the model was established, using the same ratio as for the training set in the training data set. The established direct continuous model then performs the corresponding model training on each layer of the divided data to be tested, and the training results for the layers are summarized. From these results, the fraud data identified in each layer of the divided data to be tested is obtained, and the fraud data identified in all layers is combined to obtain the final fraud data in the data to be tested.

Further, in other embodiments, the continuous anti-fraud model used to analyze and identify the fraud data in the data to be tested is an optimized continuous model, and the above step K10 can be replaced by:

Retaining the test set, and further decomposing the training set into two sub-training sets according to a preset ratio, the two sub-training sets serving respectively as the lower training set and the lower test set of the next-layer model;

Training a model on the lower training set and testing it on the lower test set, obtaining positive samples according to the test result, retaining the trained model, and using the obtained positive samples as a new training set;

Repeating the steps of dividing the training set and testing until the number of positive samples obtained is zero or the multiple training models have all been established;

Collecting and optimizing the established multiple training models to obtain an optimized continuous model.

In this embodiment, N-layer continuous model training can be performed to establish the optimized continuous model. For example, the optimized continuous model can be trained with the following steps:

Step 3: Use the lower training set Train_setl1 as the training set to train and tune a model, test it on the lower test set Train_setl2, obtain the positive samples according to the test result, and retain the model.

Step 4: Extract the positive samples obtained in Step 3 to form a new training set.

Step 5: Repeat Steps 2 through 4 until the Nth model has been constructed or the number of positive samples is zero, where N is a positive integer greater than or equal to 2.

Step 6: Collect and tune the N constructed models, that is, the multiple training models, to obtain the optimized continuous model.

Further, the above step K20 can be replaced by:

Performing a top-down test on the data to be tested by using the optimized continuous model, and obtaining and retaining positive samples according to the test result, so as to identify fraud data in the data to be tested according to the positive samples.

After the optimized continuous model is established, it can be used to train the data to be tested so as to analyze and identify the fraud data in it. Specifically, the established optimized continuous model can be used to predict directly on the data to be tested from top to bottom, retaining the positive samples at each stage of the prediction, and the loop continues until the Nth model of the optimized continuous model is reached. The positive samples predicted by each layer of model for the data to be tested are then summarized to obtain the final fraud data in the data to be tested.
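One possible reading of the optimized continuous model is a cascade in which each layer is trained on the positive samples kept by the layer above it, and prediction passes samples down the layers while keeping only those still predicted positive. The sketch below follows that reading and is an assumption, not the method as claimed: `fit` is a hypothetical callable that returns a model mapping a sample to 1 (positive) or 0, and the names `train_cascade` and `predict_cascade` are illustrative.

```python
def train_cascade(train, fit, ratio=0.5, n_models=3):
    """Build up to n_models models, each trained on the previous positives.

    `train` is a list of (sample, label) records; `fit` is a hypothetical
    training callable that returns a function sample -> 0 or 1.
    """
    models = []
    for _ in range(n_models):
        cut = int(len(train) * ratio)
        lower_train, lower_test = train[:cut], train[cut:]
        if not lower_train or not lower_test:
            break  # not enough data left to split again
        model = fit(lower_train)
        models.append(model)
        # Positive samples from the lower test set form the new training set.
        positives = [(x, y) for x, y in lower_test if model(x) == 1]
        if not positives:
            break  # stop when the number of positive samples is zero
        train = positives
    return models

def predict_cascade(models, samples):
    """Top-down prediction: keep only samples every layer marks positive."""
    kept = list(samples)
    for model in models:
        kept = [x for x in kept if model(x) == 1]
        if not kept:
            break
    return kept
```

Whether the final result should be the samples that survive every layer, as here, or the union of positives reported by each layer, is not made explicit in the text; the filter-style cascade is chosen only for illustration.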

As shown in FIG. 8, on the basis of the foregoing embodiment 3, after the step K20, the method further includes:

Step K30, marking the type and/or source of the fraudulent data.

In this embodiment, after the fraud data in the data to be tested is identified by using the established continuous anti-fraud model, the type and/or source of the identified fraud data is further marked to indicate the characteristics of the fraud data. The marked type and/or source enables the relevant review department or staff to focus on other data whose type and source are the same as or similar to those of the fraud data, narrowing the scope of manual review. For example, the social security medical reimbursement system contains some malicious or illegal card-swiping and reimbursement behavior. After the fraud data in the social security medical reimbursement data to be tested is identified by using the established continuous anti-fraud model, the type and/or source of the identified fraud data may be marked, such as Chinese medicine, western medicine, diagnosis and treatment, and the like. In this way, the social security department can strictly control Chinese medicine, western medicine, and diagnosis and treatment as high-risk areas where false reimbursement may occur, thereby narrowing the scope of review and improving the accuracy and efficiency of fraud data identification.

Embodiment 4:

Referring to FIG. 9, an embodiment of an apparatus for identifying fraud data in an embodiment of the present application includes:

The modeling module 01 is configured to train the preset training data set by using a preset continuous model training manner to establish a continuous anti-fraud model;

The identification module 02 is configured to train the data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested.

Embodiment 5:

Referring to FIG. 10, an embodiment of a method for identifying fraudulent behavior in an embodiment of the present application includes:

Establishing a doctor-patient and drug-diagnosis relationship network based on social security medical treatment data, wherein the relationship network includes a plurality of nodes and each node is subject to different relationships; analyzing the group medical treatment behavior of each node in the relationship network to extract the multi-dimensional group medical treatment features corresponding to each node; and inputting the extracted multi-dimensional group medical treatment features into a preset classification model to identify the fraud rate of each node according to the classification model.

The specific steps for identifying social security fraud behavior in this embodiment are as follows:

Step Y10, establishing a doctor-patient and drug-diagnosis relationship network based on the social security medical treatment data, wherein the relationship network includes a plurality of nodes and each node is subject to different relationships;

In this embodiment, the social security medical treatment data is first obtained from a database, after which the doctor-patient and drug-diagnosis relationship network can be established directly based on the social security medical treatment data. The nodes of the relationship network include, but are not limited to, hospitals, doctors, patients, regions, diseases, and medicine items.

Further, after the social security medical treatment data is obtained, sensitive information processing may be performed on it. Sensitive information processing means deforming the sensitive information in the data according to sensitive information processing rules so as to protect sensitive private data. The doctor-patient and drug-diagnosis relationship network can then be established based on the social security medical treatment data after sensitive information processing. Preferably, the social security medical treatment data referred to below is the data after sensitive information processing; this is not repeated below.

Specifically, referring to FIG. 11, the step Y10 includes:

Step Y11, performing data processing on the social security medical treatment data;

Step Y12, establishing the doctor-patient and drug-diagnosis relationship network according to the social security medical treatment data after the data processing.

In this embodiment, after the social security medical treatment data is obtained, data processing is performed on it. The data processing may include denoising and interference removal, so as to facilitate the subsequent establishment of the relationship network. After the data processing, the doctor-patient and drug-diagnosis relationship network is established based on the processed social security medical treatment data.

In this embodiment, the relationship network established based on the social security medical treatment data is shown in FIG. 15. As shown in FIG. 15, the relationship network includes a plurality of nodes, namely hospitals, doctors, patients, regions, diseases, medicine items, and the like. As can be seen from FIG. 15, each node in the relationship network is subject to different relationships. For example, the relationship between a doctor and a hospital is that the doctor belongs to (BELONG) the hospital; the relationship between a doctor and a disease is that the doctor diagnoses (DIAGNOSE) the disease; the relationship between a patient and a medicine item is that the patient purchases (BUY) the medicine item; the relationship between a patient and a disease is that the patient has (HAS) the disease; and so on. Through the relationship network, the patient's medical treatment behavior can be monitored comprehensively. It should be understood that the relationship network illustrated in FIG. 15 is only a preferred schematic diagram and shows only a small part of the relationship network of this embodiment. In FIG. 15 each node is of a different type, so each node has different attributes. However, the relationship network of this embodiment may in practice include multiple nodes with the same attribute, such as multiple doctor nodes or multiple patient nodes, and nodes with the same attribute are likewise subject to different relationships. Therefore, the nodes in this embodiment are not limited to the examples above. When the social security medical treatment data changes, different relationship networks and nodes are obtained; these are not listed exhaustively here.
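A relationship network of this kind can be held in memory as a typed graph. The following is a minimal sketch under assumptions: the class name `RelationNetwork`, its methods, and the node identifiers are hypothetical, while the relation names BELONG, DIAGNOSE, BUY, and HAS are taken from the description of FIG. 15.

```python
from collections import defaultdict

class RelationNetwork:
    """Directed graph whose nodes have a type and whose edges have a relation."""

    def __init__(self):
        self.node_type = {}              # node id -> type, e.g. "doctor"
        self.edges = defaultdict(set)    # (source id, relation) -> target ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def neighbors(self, source, relation):
        """Nodes reached from `source` through the given relation."""
        return self.edges.get((source, relation), set())

    def nodes_of_type(self, node_type):
        """All nodes with the same attribute, e.g. every patient node."""
        return {n for n, t in self.node_type.items() if t == node_type}
```

For example, after `add_edge("patient_1", "BUY", "drug_1")`, calling `neighbors("patient_1", "BUY")` returns the medicine items that patient purchased, which is the kind of lookup the later feature extraction relies on.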

Step Y20, analyzing the group medical treatment behavior of each node in the relationship network, to extract the multi-dimensional group medical treatment characteristics corresponding to each node;

In this embodiment, after the doctor-patient and drug-diagnosis relationship network is established based on the social security medical treatment data, the group medical treatment behavior of each node in the relationship network is analyzed. Continuing with FIG. 15 as an example, analyzing the group medical treatment behavior of each node means analyzing the medical treatment behavior presented in the relationship network, which amounts to analyzing patients' medical treatment behavior, doctors' treatment behavior, or the treatment methods used for diseases. Because each node in the relationship network is subject to different relationships, and each node is influenced not by a single dimension but by the combined influence of the other nodes in the network, analyzing the group medical treatment behavior of each node yields the multi-dimensional group medical treatment features of that node; a medical treatment feature is a feature extracted from medical treatment behavior. Taking the patient node in FIG. 15 as an example, the group medical treatment behavior of the patient node includes: the region where the patient is located, the hospitals the patient visits, the quantity of medicine items the patient purchases and the specific times of purchase, the diseases the patient has, the doctors who treat the patient, and so on. Analyzing the group medical treatment behavior of the patient node amounts to a comprehensive analysis of the region where the patient is located, the quantity and specific times of the patient's medicine purchases, the diseases the patient has, and so on. If it is found that the patient has repeatedly purchased large quantities of medicine at different hospitals and that the types of medicine differ, the group medical treatment features can be determined to be: the user's medicine purchase quantity is large, the user purchases many types of medicine, and so on.
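The patient example above can be made concrete with a small feature-extraction sketch. The record layout (dictionaries with `patient`, `hospital`, `drug`, and `quantity` keys) and the three feature names are assumptions made for illustration; the application does not fix a data format or a feature list.

```python
def patient_features(visits, patient_id):
    """Aggregate one patient's purchase records into group features.

    `visits` is a list of dicts with the hypothetical keys
    "patient", "hospital", "drug", and "quantity".
    """
    rows = [v for v in visits if v["patient"] == patient_id]
    return {
        # How many different hospitals the patient bought medicine at.
        "hospital_count": len({v["hospital"] for v in rows}),
        # How many different types of medicine were bought.
        "drug_type_count": len({v["drug"] for v in rows}),
        # Total quantity of medicine purchased.
        "total_quantity": sum(v["quantity"] for v in rows),
    }
```

A patient with high values on all three features matches the pattern described in the text: large purchase quantity, many medicine types, and purchases spread across different hospitals.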

In step Y30, the extracted multi-dimensional group medical treatment features are input to a preset classification model to identify the fraud rate of each node according to the classification model.

After the multi-dimensional group medical treatment features corresponding to each node are extracted, they are input into a preset classification model to identify the fraud rate of each node according to the classification model. Specifically, referring to FIG. 12, the step Y30 includes:

Step Y31, calculating the similarity of the multi-dimensional group medical treatment features of nodes with the same attribute according to the multi-dimensional group medical treatment features corresponding to each node;

Step Y32, inputting the calculated similarity of each node into the preset classification model to calculate the fraud rate of each node according to a fraud detection formula preset in the classification model.

That is to say, after the multi-dimensional group medical treatment features corresponding to each node are extracted, the similarity of the multi-dimensional group medical treatment features of nodes with the same attribute is calculated. Nodes with the same attribute are, for example, a doctor node and another doctor node, or a patient node and another patient node.

In this embodiment, the similarity of the multi-dimensional group medical treatment features of nodes with the same attribute is preferably calculated with the following algorithms:

1) Jaccard Similarity (representing generalized similarity):

Jaccard(A,B) = |A ∩ B| / |A ∪ B|

Here, ∩ denotes intersection and ∪ denotes union, and A and B represent nodes with the same attribute; for example, A and B both represent doctor nodes in FIG. 15, or both represent patient nodes.

2) Euclidean similarity (similarity of Euclidean distance):

Euclidean(A,B)=1-euclidean_distance(A,B)

Among them, A and B represent nodes of the same attribute.

The two algorithms listed above for calculating the similarity of the multi-dimensional group medical treatment features of nodes with the same attribute are merely exemplary. Other algorithms that those skilled in the art propose according to their specific needs using the technical idea of the present application all fall within the scope of protection of the present application and are not listed exhaustively here.

Through the above similarity calculation formula, the similarity of the multi-dimensional group medical treatment features of any two nodes of the same attribute can be determined.
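The two similarity formulas can be written out as follows. This is a sketch under assumptions: for the Jaccard similarity, each node is represented by a set of items (for example, the medicine items a patient bought); for the Euclidean similarity, each node is represented by a numeric feature vector that has been scaled so that distances fall between 0 and 1, since otherwise 1 minus the distance can be negative. The function names are hypothetical.

```python
import math

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0

def euclidean_similarity(a, b):
    """1 - Euclidean distance between two equal-length feature vectors."""
    return 1.0 - math.dist(a, b)
```

For instance, two patients whose medicine sets are {aspirin, insulin} and {insulin, statin} share one item out of three in total, giving a Jaccard similarity of 1/3.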

After the similarity of the multi-dimensional group medical treatment features of nodes with the same attribute is determined, the calculated similarity of each node is input into the preset classification model, and the fraud rate of each node is calculated according to the fraud detection formula preset in the classification model. The fraud detection formula preferably includes: the KNN (k-Nearest Neighbor) algorithm formula, with K set to 5; the binary Kmeans algorithm formula; the Shewhart methods algorithm formula; and the like. Since these are all existing formulas, the calculation process is not described here.
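The application does not state how the KNN formula turns similarities into a fraud rate, so the following is only one plausible sketch: the fraud rate of a node is taken as the share of its K most similar same-attribute nodes that are already labelled fraudulent, with K set to 5 as in the text. The function name `knn_fraud_rate`, the input dictionaries, and the use of previously labelled nodes are all assumptions.

```python
def knn_fraud_rate(similarities, labels, k=5):
    """Estimate a node's fraud rate from its k most similar labelled nodes.

    similarities: {neighbor id: similarity to the node being scored}
    labels:       {neighbor id: 1 if known fraudulent, else 0}
    """
    # Rank neighbors from most to least similar and keep the top k.
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    if not nearest:
        return 0.0
    return sum(labels[n] for n in nearest) / len(nearest)
```

The labels would come from verified cases, which fits the later step of feeding verification conclusions back into the classification model.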

Further, in order to improve the accuracy with which the classification model calculates node fraud rates, in this embodiment, after the step Y32, the method for identifying social security fraud behavior further includes:

Step A, verifying the fraud rate of each node, and adding the verification conclusion to the fraud rate of each node;

Step B, re-inputting the fraud rate with the verification conclusion added into the classification model so as to train the classification model.

That is to say, after the fraud rate of each node is calculated according to the fraud detection formula preset in the classification model, the fraud rate of each node can also be verified. In this embodiment, the verification mode is preferably offline approval verification. After the fraud rate of each node is verified, the verification conclusion is added to the fraud rate of each node, and the fraud rate with the verification conclusion added is re-input into the classification model to train it, so that the classification model subsequently identifies node fraud rates more accurately.

In this embodiment, identifying social security fraud behavior based on the relationship network means establishing a medical treatment network for group medical treatment behavior in the group dimension and designing an algorithm model that identifies fraud behavior from the group dimension, so as to obtain node fraud rates and identify social security behavior in the group dimension. It can be understood that, when a user's social security medical treatment data is analyzed, if the fraud rates of multiple nodes are found to be high and only those of a few individual nodes are low, the user may be considered to have committed social security fraud. Compared with a single rule-triggering mechanism, determining whether the user has committed social security fraud through group medical treatment behavior yields higher recognition accuracy.

The method for identifying social security fraud behavior proposed in this embodiment first establishes a doctor-patient and drug-diagnosis relationship network based on the social security medical treatment data, then analyzes the group medical treatment behavior of each node in the relationship network to extract multi-dimensional group medical treatment features, and finally inputs the extracted features into a preset classification model to identify the fraud rate of each node according to the classification model. This solution identifies social security fraud behavior from multiple dimensions and perspectives, and achieves higher recognition accuracy than traditional single-rule identification.

Further, in order to improve the accuracy of the identification of the social security fraud behavior, another embodiment of the identification method of the fraudulent activity of the present application is proposed based on the previous embodiment.

In this embodiment, referring to FIG. 13, before the step Y20, the method for identifying the social security fraud behavior further includes:

Step Y40, determining an external factor feature to be supplemented in the relationship network, and acquiring the external factor feature from the Internet;

Step Y50: Generate a new node based on the obtained external factor feature;

In step Y60, the new node is added to the relationship network to update the relationship network.

In this embodiment, the external factor feature to be supplemented is first determined in the relationship network, and the external factor feature is obtained from the Internet. The external factor feature refers to external information associated with a node; for example, if the node is a hospital, the external factor feature is hospital-related information, such as the hospital's address. After the external factor feature is acquired, a new node is generated based on it, and the new node is added to the relationship network to update the network. The nodes in the updated relationship network are thus more detailed, and the subsequent identification of each node's fraud rate is more accurate.

It is worth noting that although each algorithm involved in this application is an existing algorithm, the overall process used to identify social security fraud differs from existing approaches to identifying social security fraud, and the application overcomes the problem of low accuracy in existing social security fraud identification.

Embodiment 6:

The application further provides an apparatus for identifying fraudulent behavior.

Referring to FIG. 14, FIG. 14 is a schematic diagram of the functional modules of a first embodiment of an apparatus 100 for identifying fraudulent behavior of the present application.

In this embodiment, the apparatus 100 for identifying fraudulent behavior includes:

The establishing module 10 is configured to establish a doctor-patient and drug-diagnosis relationship network based on the social security medical treatment data, wherein the relationship network includes a plurality of nodes and each node is subject to different relationships;

The analysis extraction module 20 is configured to analyze the group medical treatment behavior of each node in the relationship network, so as to extract the multi-dimensional group medical treatment characteristics corresponding to each node;

The input identification module 30 is configured to input the extracted multi-dimensional group medical treatment features into a preset classification model to identify the fraud rate of each node according to the classification model.

In this embodiment, the relationship network established based on the social security medical treatment data is shown in FIG. 15. As shown in FIG. 15, the relationship network includes a plurality of nodes, namely hospitals, doctors, patients, regions, diseases, medicine items, and the like. As can be seen from FIG. 15, each node in the relationship network is subject to different relationships. For example, the relationship between a doctor and a hospital is that the doctor belongs to (BELONG) the hospital; the relationship between a doctor and a disease is that the doctor diagnoses (DIAGNOSE) the disease; the relationship between a patient and a medicine item is that the patient purchases (BUY) the medicine item; the relationship between a patient and a disease is that the patient has (HAS) the disease; and so on. Through the relationship network, the patient's medical treatment behavior can be monitored comprehensively.

FIG. 16 is a schematic diagram of a server carrying a Ping An Brain big data platform according to an embodiment of the present application. As shown in FIG. 16, the server 21 of this embodiment includes a processor 210, a memory 211, and computer readable instructions 212 stored in the memory 211 and executable on the processor 210, for example, a program for performing the anti-fraud identification method. When the processor 210 executes the computer readable instructions 212, the steps in the embodiments of the anti-fraud identification methods described above are implemented, such as steps 101 to 103 shown in FIG. Alternatively, when the processor 210 executes the computer readable instructions 212, the functions of the modules/units in the foregoing device embodiments are implemented, for example, the functions of the modules 510 to 550 shown in FIG. 6, the modules 01 to 02 shown in FIG. 9, and the modules 10 to 30 shown in FIG. 14.

Illustratively, the computer readable instructions 212 may be partitioned into one or more modules/units, which are stored in a computer readable storage medium, such as the memory 211, and executed by the processor 210 to complete the present application.

The server 21 may be a computing device such as a cloud server that carries the Ping An Brain big data platform. The server may include, but is not limited to, the processor 210 and the memory 211.

Claims

An anti-fraud identification method, comprising:

Determining a target event;

Extracting target data related to the target event;

Processing the target data by using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;

The method of constructing a decision model includes:

Obtaining rule template data, and extracting each variable object and each template sample in the rule template data;

Perform cluster analysis on the variable object to obtain a clustering result;

Matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as the first feature;

Calculating the black sample probability of each variable object separately, and using the black sample probability of each variable object as the second feature;

Constructing a decision model by the first feature and the second feature;

The method for identifying the fraud data includes:

Training a preset training data set by using a preset continuous model training manner to establish a continuous anti-fraud model;

Training the data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested;

The method for identifying the fraud behavior includes:

Establishing a network of doctor-patient and drug diagnosis based on social security medical treatment data, wherein the relationship network includes each node, and each node belongs to a different relationship;

Performing group medical treatment behaviors of each node in the relationship network to extract multi-dimensional group medical treatment characteristics corresponding to each node;

Importing the extracted multi-dimensional group medical treatment features into a preset classification model to identify the fraud rate of each node according to the classification model;

The target data includes at least two of data to be tested, rule template data, and social security visit data.
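
By way of non-limiting illustration, the second feature of the method for constructing a decision model may be sketched in Python as follows. The claim does not define how the black sample probability is computed; this sketch assumes it is the fraction of template samples containing a given variable object that are labeled black (fraudulent). The function name `black_sample_probability`, the `(variable_objects, is_black)` sample layout, and the example variable objects are all hypothetical.

```python
from collections import defaultdict


def black_sample_probability(samples):
    """Map each variable object to the share of its samples labeled black.

    samples: iterable of (variable_objects, is_black) pairs (assumed layout).
    """
    total = defaultdict(int)
    black = defaultdict(int)
    for variable_objects, is_black in samples:
        # Count each variable object at most once per template sample.
        for obj in set(variable_objects):
            total[obj] += 1
            if is_black:
                black[obj] += 1
    return {obj: black[obj] / total[obj] for obj in total}


# Hypothetical template samples: (variable objects, labeled black?)
samples = [
    (["late_night_visit", "high_cost_drug"], True),
    (["high_cost_drug"], False),
    (["late_night_visit"], True),
    (["routine_checkup"], False),
]
probs = black_sample_probability(samples)
```

Under this assumption, each probability can be attached to its variable object and used directly as the second feature.
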
The anti-fraud identification method according to claim 1, wherein, before the decision model is constructed from the first feature and the second feature, the method for constructing a decision model further comprises:

Mapping each variable object to a predefined label according to a preset algorithm;

Matching the label with each template sample according to the rule template data, and using the matched label as a third feature;

Constructing the decision model from the first feature and the second feature comprises:

Constructing the decision model from the first feature, the second feature, and the third feature.
The anti-fraud identification method according to claim 2, wherein constructing the decision model from the first feature, the second feature, and the third feature comprises:

Establishing an original node;

Obtaining a result type of each template sample according to the rule template data;

Traversing the first feature, the second feature, and the third feature in turn to generate read records;

Calculating a segmentation purity of each read record according to the result type of each template sample, and determining a segmentation point according to the segmentation purity;

Obtaining features corresponding to the segmentation points and establishing new nodes.
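
By way of non-limiting illustration, the selection of a segmentation point may be sketched in Python as follows. The claim does not name a purity measure; this sketch assumes Gini impurity and binary splits of the form `value <= threshold`, and it treats cluster and label identifiers as ordered numbers for simplicity. The helper names `gini`, `best_split`, and `choose_feature`, and the example feature values, are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of result types (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the purest binary split."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


def choose_feature(features, labels):
    """Traverse every feature and keep the one whose best split is purest."""
    results = {name: best_split(vals, labels) for name, vals in features.items()}
    name = min(results, key=lambda k: results[k][1])
    return name, results[name][0]


# Hypothetical read records for four template samples.
features = {
    "first_feature": [0, 0, 1, 1],            # matched cluster id
    "second_feature": [0.1, 0.8, 0.2, 0.9],   # black sample probability
    "third_feature": [2, 2, 2, 3],            # matched label id
}
labels = ["normal", "fraud", "normal", "fraud"]
split_feature, split_point = choose_feature(features, labels)
```

In this example the second feature separates the two result types completely at 0.2, so a new node would be established on that feature; a full tree would repeat the same selection on each resulting subset.
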
The anti-fraud identification method according to any one of claims 1 to 3, wherein performing cluster analysis on the variable objects to obtain a clustering result comprises:

Randomly selecting a plurality of variable objects from the variable objects as first cluster centers, each first cluster center corresponding to one cluster;

Separately calculating the distance from each variable object to each first cluster center;

Dividing the variable objects according to the calculation result, each variable object being assigned to the cluster corresponding to the first cluster center at the shortest distance;

Separately calculating a second cluster center of each of the divided clusters;

Determining whether the distance between the first cluster center and the second cluster center in each cluster is less than a preset threshold; if so, outputting each cluster as the clustering result; if not, replacing the first cluster center of each cluster with its second cluster center, and returning to the step of separately calculating the distance from each variable object to each first cluster center.
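
By way of non-limiting illustration, the clustering steps above correspond to a k-means style iteration and may be sketched in Python as follows. The sketch assumes that each variable object is a numeric tuple, that distance is Euclidean, and that the second cluster center is the mean of the cluster members; none of these choices is fixed by the claim. The function names, the `seed` parameter, and the example points are hypothetical.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of numeric tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def cluster(objects, k, threshold=1e-6, seed=0):
    """Group objects into k clusters following the claimed iteration."""
    rng = random.Random(seed)
    centers = rng.sample(objects, k)  # randomly selected first cluster centers
    while True:
        clusters = [[] for _ in range(k)]
        for obj in objects:
            distances = [math.dist(obj, c) for c in centers]
            clusters[distances.index(min(distances))].append(obj)
        # Second cluster centers; an empty cluster keeps its previous center.
        new_centers = [mean_point(m) if m else c
                       for m, c in zip(clusters, centers)]
        if all(math.dist(c, n) < threshold
               for c, n in zip(centers, new_centers)):
            return clusters
        centers = new_centers  # replace first centers with second centers


objects = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = cluster(objects, k=2)
```

The loop terminates once no center moves by the preset threshold or more, which is the stopping condition recited in the final step.
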
The anti-fraud identification method according to claim 1, wherein the continuous anti-fraud model is a direct continuous model;

Training the preset training data set by using the preset continuous model training mode to establish the continuous anti-fraud model comprises:

Decomposing the preset training data set into a training set and a test set according to a preset ratio;

Retaining the test set, and further decomposing the training set into two sub-training sets according to a preset ratio, the two sub-training sets respectively serving as the training set and the test set of the next-level model;

Repeating the division of the training set a preset number of times;

Training a preset classic model on the multi-layer training sets obtained by the division, and testing it on the retained multi-layer test sets, to establish the direct continuous model.
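
By way of non-limiting illustration, the repeated division used for the direct continuous model may be sketched in Python as follows. The "preset classic model" is not specified in the claim; a one-dimensional threshold classifier is used here purely as a hypothetical stand-in. The names `layered_splits`, `train_threshold_model`, and `accuracy`, the ratio of 0.8, the three levels, and the synthetic data are all assumptions.

```python
def layered_splits(data, ratio=0.8, levels=3):
    """Return one (train, test) pair per level; each level re-splits the
    training set of the level above and retains that level's test set."""
    layers = []
    current = list(data)
    for _ in range(levels):
        cut = int(len(current) * ratio)
        train, test = current[:cut], current[cut:]
        layers.append((train, test))
        current = train
    return layers


def train_threshold_model(train):
    """Hypothetical classic model: midpoint between the two class means."""
    fraud = [x for x, y in train if y == 1]
    normal = [x for x, y in train if y == 0]
    return (sum(fraud) / len(fraud) + sum(normal) / len(normal)) / 2


def accuracy(threshold, test):
    """Share of test samples whose label matches `score > threshold`."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)


# Synthetic (score, label) data with alternating normal and fraud samples.
data = [(0.1 + 0.01 * (i % 5), 0) if i % 2 == 0 else (0.9 - 0.01 * (i % 5), 1)
        for i in range(100)]
layers = layered_splits(data)
models = [train_threshold_model(train) for train, _ in layers]
scores = [accuracy(m, test) for m, (_, test) in zip(models, layers)]
```

Each level yields one trained model and one score on its retained test set; how the per-level models are finally combined is left open here, as the claim does not specify it.
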
The anti-fraud identification method according to claim 1, wherein the continuous anti-fraud model is an optimized continuous model;

Training the preset training data set by using the preset continuous model training mode to establish the continuous anti-fraud model comprises:

Decomposing the preset training data set into a training set and a test set according to a preset ratio;

Retaining the test set, and further decomposing the training set into two sub-training sets according to a preset ratio, the two sub-training sets respectively serving as a lower training set and a lower test set of the next-level model;

Training a model on the lower training set and testing it on the lower test set, obtaining positive samples according to the test result, retaining the trained model, and using the obtained positive samples as a new training set;

Repeating the steps of dividing the training set and testing until the number of positive samples obtained is zero or the multiple training models have been established;

Collecting and optimizing the established multiple training models to obtain the optimized continuous model.
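
By way of non-limiting illustration, the iteration used for the optimized continuous model may be sketched in Python as a cascade. Several points are assumptions of this sketch rather than statements of the claimed method: the base learner is a hypothetical one-dimensional threshold classifier, "positive samples" are taken to be the lower-test samples that the model flags, training also stops when a lower training set no longer contains both classes, and the retained models are combined by requiring every model to flag a sample. The names `fit_threshold`, `build_cascade`, and `predict` are hypothetical.

```python
def fit_threshold(train):
    """Hypothetical base learner: midpoint between the two class means.
    Returns None when one class is missing (extra guard, not in the claim)."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        return None
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def build_cascade(data, ratio=0.5, max_models=3):
    """Train successive models, each on the positives kept by the last."""
    models = []
    current = list(data)
    while len(models) < max_models:
        cut = int(len(current) * ratio)
        lower_train, lower_test = current[:cut], current[cut:]
        threshold = fit_threshold(lower_train)
        if threshold is None:
            break
        models.append(threshold)  # retain the trained model
        positives = [(x, y) for x, y in lower_test if x > threshold]
        if not positives:
            break  # no positive samples left
        current = positives  # positives become the new training set
    return models


def predict(models, x):
    """Assumed combination rule: flag only if every retained model flags."""
    return all(x > t for t in models)


# Synthetic (score, label) data; score 0.6 is a normal sample that looks risky.
data = [((i % 10) / 10, int(i % 10 >= 5 and i % 10 != 6)) for i in range(40)]
models = build_cascade(data)
```

In this example the first model still flags the misleading score 0.6, and the second model, trained only on first-stage positives, rejects it.
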
The anti-fraud identification method according to claim 1, wherein the step of establishing the doctor-patient and drug-diagnosis relationship network based on the social security medical treatment data comprises:

Performing data processing on the social security medical treatment data;

Establishing the doctor-patient and drug-diagnosis relationship network based on the processed social security medical treatment data.
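
By way of non-limiting illustration, the two steps above may be sketched in Python as follows. The claim does not say what the data processing consists of; this sketch assumes it means discarding incomplete records and exact duplicates. The record fields `patient`, `doctor`, `drug`, and `diagnosis`, the function names, and the adjacency-set representation of the network are hypothetical.

```python
from collections import defaultdict

REQUIRED = ("patient", "doctor", "drug", "diagnosis")  # assumed record fields


def preprocess(records):
    """Drop records with a missing field and remove exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        key = tuple(record.get(field) for field in REQUIRED)
        if None in key or key in seen:
            continue
        seen.add(key)
        cleaned.append(dict(zip(REQUIRED, key)))
    return cleaned


def build_relationship_network(records):
    """Link doctor-patient and drug-diagnosis nodes as undirected edges."""
    network = defaultdict(set)
    for r in records:
        edges = [
            (("patient", r["patient"]), ("doctor", r["doctor"])),
            (("drug", r["drug"]), ("diagnosis", r["diagnosis"])),
        ]
        for a, b in edges:
            network[a].add(b)
            network[b].add(a)
    return network


records = [
    {"patient": "p1", "doctor": "d1", "drug": "drugA", "diagnosis": "dx1"},
    {"patient": "p1", "doctor": "d1", "drug": "drugA", "diagnosis": "dx1"},
    {"patient": "p2", "doctor": "d1", "drug": "drugB"},  # missing diagnosis
    {"patient": "p3", "doctor": "d1", "drug": "drugA", "diagnosis": "dx2"},
]
network = build_relationship_network(preprocess(records))
```

Each node is keyed by its kind and identifier, so nodes of the same kind can later be compared with one another.
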
The anti-fraud identification method according to claim 1, wherein the step of inputting the extracted multi-dimensional group medical treatment features into a preset classification model to identify the fraud rate of each node according to the classification model comprises:

Calculating the similarity of the multi-dimensional group medical treatment characteristics of each node of the same attribute according to the multi-dimensional group medical treatment characteristics corresponding to each node;

The calculated similarity of each node is input into a preset classification model to calculate a fraud rate of each node according to a fraud detection formula preset in the classification model.
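
By way of non-limiting illustration, the similarity calculation and fraud rate may be sketched in Python as follows. Neither the similarity measure nor the fraud detection formula is given in the claim; this sketch assumes cosine similarity between nodes of the same attribute and, purely as a placeholder formula, a fraud rate of one minus the mean similarity to peers, so that nodes behaving unlike their peers score higher. The function names, feature vectors, and node identifiers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fraud_rates(features, attributes):
    """Placeholder formula: 1 - mean similarity to same-attribute peers."""
    rates = {}
    for node, vec in features.items():
        peers = [features[p] for p in features
                 if p != node and attributes[p] == attributes[node]]
        if not peers:
            rates[node] = 0.0  # no peer to compare against
            continue
        mean_similarity = sum(cosine(vec, p) for p in peers) / len(peers)
        rates[node] = 1.0 - mean_similarity
    return rates


# Hypothetical two-dimensional group medical treatment features per doctor.
features = {"doc_a": (10, 1), "doc_b": (9, 1), "doc_c": (1, 12)}
attributes = {"doc_a": "doctor", "doc_b": "doctor", "doc_c": "doctor"}
rates = fraud_rates(features, attributes)
most_suspicious = max(rates, key=rates.get)
```

A deployed classification model could equally treat unusually high similarity within a group as the suspicious signal; the direction chosen here is only one possibility.
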
A computer readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the following steps:

Determining a target event;

Extracting target data related to the target event;

Processing the target data by using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;

The method of constructing a decision model includes:

Obtaining rule template data, and extracting each variable object and each template sample in the rule template data;

Performing cluster analysis on the variable objects to obtain a clustering result;

Matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as a first feature;

Separately calculating a black sample probability of each variable object, and using the black sample probability of each variable object as a second feature;

Constructing a decision model from the first feature and the second feature;

The method for identifying the fraud data includes:

Training a preset training data set by using a preset continuous model training mode to establish a continuous anti-fraud model;

Performing training on data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested;

The method for identifying fraudulent behavior includes:

Establishing a doctor-patient and drug-diagnosis relationship network based on social security medical treatment data, wherein the relationship network includes a plurality of nodes, and each node belongs to a different relationship;

Analyzing the group medical treatment behavior of each node in the relationship network to extract multi-dimensional group medical treatment features corresponding to each node;

Inputting the extracted multi-dimensional group medical treatment features into a preset classification model to identify a fraud rate of each node according to the classification model;

The target data includes at least two of the data to be tested, the rule template data, and the social security medical treatment data.
The computer readable storage medium according to claim 9, wherein, before the decision model is constructed from the first feature and the second feature, the method for constructing a decision model further comprises:

Mapping each variable object to a predefined label according to a preset algorithm;

Matching the label with each template sample according to the rule template data, and using the matched label as a third feature;

Constructing the decision model from the first feature and the second feature comprises:

Constructing the decision model from the first feature, the second feature, and the third feature.
The computer readable storage medium according to claim 10, wherein constructing the decision model from the first feature, the second feature, and the third feature comprises:

Establishing an original node;

Obtaining a result type of each template sample according to the rule template data;

Traversing the first feature, the second feature, and the third feature in turn to generate read records;

Calculating a segmentation purity of each read record according to the result type of each template sample, and determining a segmentation point according to the segmentation purity;

Obtaining features corresponding to the segmentation points and establishing new nodes.
The computer readable storage medium according to any one of claims 9 to 11, wherein performing cluster analysis on the variable objects to obtain a clustering result comprises:

Randomly selecting a plurality of variable objects from the variable objects as first cluster centers, each first cluster center corresponding to one cluster;

Separately calculating the distance from each variable object to each first cluster center;

Dividing the variable objects according to the calculation result, each variable object being assigned to the cluster corresponding to the first cluster center at the shortest distance;

Separately calculating a second cluster center of each of the divided clusters;

Determining whether the distance between the first cluster center and the second cluster center in each cluster is less than a preset threshold; if so, outputting each cluster as the clustering result; if not, replacing the first cluster center of each cluster with its second cluster center, and returning to the step of separately calculating the distance from each variable object to each first cluster center.
A server carrying a Ping An Brain Big Data platform, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer readable instructions:

Determining a target event;

Extracting target data related to the target event;

Processing the target data by using at least two of the following: a method for constructing a decision model, a method for identifying fraud data, and a method for identifying fraudulent behavior;

The method of constructing a decision model includes:

Obtaining rule template data, and extracting each variable object and each template sample in the rule template data;

Performing cluster analysis on the variable objects to obtain a clustering result;

Matching the clustering result with each template sample according to the rule template data, and using the matched clustering result as a first feature;

Separately calculating a black sample probability of each variable object, and using the black sample probability of each variable object as a second feature;

Constructing a decision model from the first feature and the second feature;

The method for identifying the fraud data includes:

Training a preset training data set by using a preset continuous model training mode to establish a continuous anti-fraud model;

Performing training on data to be tested based on the continuous anti-fraud model to identify fraud data in the data to be tested;

The method for identifying fraudulent behavior includes:

Establishing a doctor-patient and drug-diagnosis relationship network based on social security medical treatment data, wherein the relationship network includes a plurality of nodes, and each node belongs to a different relationship;

Analyzing the group medical treatment behavior of each node in the relationship network to extract multi-dimensional group medical treatment features corresponding to each node;

Inputting the extracted multi-dimensional group medical treatment features into a preset classification model to identify a fraud rate of each node according to the classification model;

The target data includes at least two of the data to be tested, the rule template data, and the social security medical treatment data.
The server according to claim 13, wherein, before the decision model is constructed from the first feature and the second feature, the method for constructing a decision model further comprises:

Mapping each variable object to a predefined label according to a preset algorithm;

Matching the label with each template sample according to the rule template data, and using the matched label as a third feature;

Constructing the decision model from the first feature and the second feature comprises:

Constructing the decision model from the first feature, the second feature, and the third feature.
The server according to claim 14, wherein constructing the decision model from the first feature, the second feature, and the third feature comprises:

Establishing an original node;

Obtaining a result type of each template sample according to the rule template data;

Traversing the first feature, the second feature, and the third feature in turn to generate read records;

Calculating a segmentation purity of each read record according to the result type of each template sample, and determining a segmentation point according to the segmentation purity;

Obtaining features corresponding to the segmentation points and establishing new nodes.
The server according to any one of claims 13 to 15, wherein performing cluster analysis on the variable objects to obtain a clustering result comprises:

Randomly selecting a plurality of variable objects from the variable objects as first cluster centers, each first cluster center corresponding to one cluster;

Separately calculating the distance from each variable object to each first cluster center;

Dividing the variable objects according to the calculation result, each variable object being assigned to the cluster corresponding to the first cluster center at the shortest distance;

Separately calculating a second cluster center of each of the divided clusters;

Determining whether the distance between the first cluster center and the second cluster center in each cluster is less than a preset threshold; if so, outputting each cluster as the clustering result; if not, replacing the first cluster center of each cluster with its second cluster center, and returning to the step of separately calculating the distance from each variable object to each first cluster center.
The server according to claim 13, wherein said continuous anti-fraud model is a direct continuous model;

Training the preset training data set by using the preset continuous model training mode to establish the continuous anti-fraud model comprises:

Decomposing the preset training data set into a training set and a test set according to a preset ratio;

Retaining the test set, and further decomposing the training set into two sub-training sets according to a preset ratio, the two sub-training sets respectively serving as the training set and the test set of the next-level model;

Repeating the division of the training set a preset number of times;

Training a preset classic model on the multi-layer training sets obtained by the division, and testing it on the retained multi-layer test sets, to establish the direct continuous model.
An apparatus for constructing a decision model, comprising:

An extraction module, configured to acquire rule template data and extract each variable object and each template sample in the rule template data;

A clustering module, configured to perform cluster analysis on the variable objects to obtain a clustering result;

A first feature module, configured to match the clustering result with each template sample according to the rule template data, and use the matched clustering result as a first feature;

A second feature module, configured to separately calculate a black sample probability of each variable object, and use the black sample probability of each variable object as a second feature;

And a building module, configured to construct a decision model from the first feature and the second feature.
A device for identifying fraud data, comprising:

A modeling module, configured to train a preset training data set by using a preset continuous model training mode to establish a continuous anti-fraud model;

And an identification module, configured to perform training on data to be tested based on the continuous anti-fraud model, and identify fraud data in the data to be tested.
A device for identifying fraudulent behavior, comprising:

An establishing module, configured to establish a doctor-patient and drug-diagnosis relationship network based on social security medical treatment data, wherein the relationship network includes a plurality of nodes, and each node belongs to a different relationship;

An analysis and extraction module, configured to analyze the group medical treatment behavior of each node in the relationship network, so as to extract the multi-dimensional group medical treatment features corresponding to each node;

And an input identification module, configured to input the extracted multi-dimensional group medical treatment features into a preset classification model to identify a fraud rate of each node according to the classification model.