WO2021120775A1

WO2021120775A1 - Method and device for detecting data abnormality

Info

Publication number: WO2021120775A1
Application number: PCT/CN2020/118432
Authority: WO
Inventors: 臧大卫
Original assignee: 中国银联股份有限公司
Priority date: 2019-12-19
Filing date: 2020-09-28
Publication date: 2021-06-24
Also published as: CN111126622B; CN111126622A

Abstract

A method and device for detecting data abnormality, for use in increasing the accuracy and precision of data detection. The method comprises: acquiring detection sample data of an object to be tested (201); determining, on the basis of the detection sample data, a first detection eigenvalue of said object corresponding to a first machine learning model and a second detection eigenvalue corresponding to a rule algorithm, the rule algorithm comprising at least one determination logic (202); inputting the first detection eigenvalue corresponding to the first machine learning model into a trained machine learning model to produce a first output vector of said object, and inputting the second detection eigenvalue corresponding to the rule algorithm into the rule algorithm to produce a second output vector of said object (203); inputting the first output vector and the second output vector into a trained second machine learning model, determining an output risk index of said object (204); and determining an abnormality ascertainment result of said object on the basis of the output risk index (205).

Description

Method and device for detecting data abnormality

Cross-references to related applications

This application claims priority to Chinese patent application No. 201911317683.2, entitled "A data anomaly detection method and device", filed with the Chinese Patent Office on December 19, 2019, the entire content of which is incorporated herein by reference.

Technical field

The present invention relates to the technical field of data processing, in particular to a method and device for detecting data abnormalities.

Background art

The rapid development of the Internet and Internet finance has brought unprecedented challenges to the risk control system. The forms and methods of fraudulent transactions have become more diverse, highly concealed and difficult to mine, and traditional rule-engine risk control methods have become increasingly weak. The rapid development of deep learning in recent years has provided another way to solve this problem. The development of deep engines and the construction of deep learning models to mine hidden information and identify fraudulent transactions have achieved good results.

Formulating rules to detect abnormal data still has irreplaceable advantages in some scenarios. However, most current abnormal data detection uses deep learning algorithms alone, and its accuracy and precision leave room for improvement.

Summary of the invention

This application provides a data abnormality detection method and device to improve the accuracy and precision of data detection.

A data abnormality detection method provided by an embodiment of the present invention includes:

Obtain test sample data of the object to be tested;

According to the detection sample data, determine a first detection feature value of the object to be tested corresponding to a first machine learning model and a second detection feature value corresponding to a rule algorithm, where the rule algorithm includes at least one judgment logic;

Input the first detection feature value corresponding to the first machine learning model into the trained first machine learning model to obtain the first output vector of the object to be tested, and input the second detection feature value corresponding to the rule algorithm into the rule algorithm to obtain the second output vector of the object to be tested;

Input the first output vector and the second output vector into the trained second machine learning model to determine the output risk index of the object to be tested;

According to the output risk index, the abnormal determination result of the object to be tested is determined.

In an optional embodiment, the second output vector includes at least one output identifier; and inputting the second detection feature value of the object to be tested into the rule algorithm to obtain the second output vector of the object to be tested includes:

Determine the correspondence between judgment results and output identifiers;

For each judgment logic in the rule algorithm, use the corresponding second detection characteristic value to make a judgment according to the judgment logic to obtain a corresponding judgment result, and determine the corresponding output identifier according to the judgment result;

All output identifiers are combined into the second output vector in a predetermined order.

In an optional embodiment, the first machine learning model is a neural network model, and the second machine learning model is a logistic regression model.

In an optional embodiment, the neural network model is trained in the following manner:

Obtain the training sample data in the historical time period;

According to the training sample data, selecting a training object corresponding to the first training feature of the neural network model, and determining the first training feature value corresponding to the first training feature;

Input the first training feature value into the initial neural network model, and calculate the loss function according to the obtained machine risk index and the abnormality determination result of the training object. When the loss function is less than a preset threshold, determine that the corresponding first parameter is the first parameter of the neural network model, and obtain the trained neural network model;

The logistic regression model is trained in the following manner:

Obtaining the first output vector of the training object from the trained neural network model;

According to the training sample data, selecting a training object corresponding to the second training feature of the rule algorithm, and determining a second training feature value corresponding to the second training feature;

Input the second training feature value into the rule algorithm to obtain the second output vector of the training object;

Input the first output vector and the second output vector into the initial logistic regression model, and calculate a loss function according to the obtained output risk index and the abnormality determination result of the training object. When the loss function is less than a preset threshold, determine that the corresponding second parameter is the second parameter of the logistic regression model, and obtain the trained logistic regression model.
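By way of non-limiting illustration, the second training stage described above can be sketched as follows. The sketch assumes that the first output vectors (from the already trained neural network model) and the second output vectors (from the rule algorithm) are available as plain lists; the cross-entropy loss, the gradient-descent update, the learning rate, the epoch limit, and all sample values are assumptions made for the example and are not specified in this application.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Output risk index for one spliced input [c, d]."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logistic_regression(
    inputs: List[List[float]],
    labels: List[int],
    loss_threshold: float = 0.1,   # the "preset threshold" (value assumed)
    learning_rate: float = 0.5,    # assumed
    max_epochs: int = 10_000,      # assumed safety limit
) -> Tuple[List[float], float, float]:
    """Gradient descent that stops once the loss drops below the threshold."""
    n = len(inputs[0])
    w, b, loss = [0.0] * n, 0.0, float("inf")
    for _ in range(max_epochs):
        preds = [predict(w, b, x) for x in inputs]
        # Mean cross-entropy loss (assumed); clamp to avoid log(0).
        clamped = [min(max(p, 1e-12), 1.0 - 1e-12) for p in preds]
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1.0 - p)
            for p, y in zip(clamped, labels)
        ) / len(labels)
        if loss < loss_threshold:
            break  # the current parameters become the second parameter
        errors = [p - y for p, y in zip(preds, labels)]
        for j in range(n):
            grad = sum(e * x[j] for e, x in zip(errors, inputs)) / len(labels)
            w[j] -= learning_rate * grad
        b -= learning_rate * sum(errors) / len(labels)
    return w, b, loss

# Hypothetical training objects: first output vector c spliced with rule vector d.
X = [
    [0.9, 0.8] + [1, 1],
    [0.8, 0.6] + [1, 0],
    [0.2, 0.1] + [0, 0],
    [0.3, 0.2] + [0, 1],
]
y = [1, 1, 0, 0]  # abnormality determination results of the training objects
w, b, final_loss = train_logistic_regression(X, y)
```

In this sketch the neural network parameters stay fixed during the second stage, which matches the two-stage embodiment; the joint-training embodiment would instead update both sets of parameters from the same loss.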

In an optional embodiment, the neural network model and the logistic regression model are trained in the following manner:

Obtain the training sample data in the historical time period;

Input the first training feature value into an initial neural network model to obtain a first output vector of the training object;

Input the first output vector and the second output vector into the initial logistic regression model, and calculate a loss function according to the obtained output risk index and the abnormality determination result of the training object. When the loss function is less than a preset threshold, determine that the corresponding first parameter is the first parameter of the neural network model to obtain the trained neural network model, and determine that the corresponding second parameter is the second parameter of the logistic regression model to obtain the trained logistic regression model.

In an optional embodiment, the first machine learning model includes a plurality of different machine learning sub-models.

In an optional embodiment, it further includes:

Get all the judgment logic in the rule algorithm;

Obtaining a weight parameter corresponding to each judgment logic from the second machine learning model;

For each judgment logic, the rationality of the judgment logic is determined according to the relationship between the judgment logic and other judgment logics, and the weight parameters corresponding to the judgment logic.

A data abnormality detection device includes:

The acquiring unit is used to acquire the test sample data of the object to be tested;

The processing unit is configured to determine, according to the detection sample data, a first detection feature value of the object to be tested corresponding to a first machine learning model and a second detection feature value corresponding to a rule algorithm, where the rule algorithm includes at least one judgment logic;

The calculation unit is configured to input the first detection feature value corresponding to the first machine learning model into the trained first machine learning model to obtain the first output vector of the object to be tested, and to input the second detection feature value corresponding to the rule algorithm into the rule algorithm to obtain the second output vector of the object to be tested;

An output unit, configured to input the first output vector and the second output vector into the trained second machine learning model to determine the output risk index of the object to be tested;

The determining unit is configured to determine the abnormal determination result of the object to be tested according to the output risk index.

In an optional embodiment, the second output vector includes at least one output identifier; the calculation unit is specifically configured to:

In an optional embodiment, it further includes a training unit for training the neural network model in the following manner:

Obtain the training sample data in the historical time period;

The training unit is also used to train the logistic regression model in the following manner:

In an optional embodiment, the training unit is further configured to train the neural network model and the logistic regression model in the following manner:

Obtain the training sample data in the historical time period;

In an optional embodiment, an analysis unit is further included for:

Get all the judgment logic in the rule algorithm;

The embodiment of the present invention also provides an electronic device, including:

At least one processor; and,

A memory communicatively connected with the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method as described above.

The embodiment of the present invention also provides a non-transitory computer-readable storage medium, the non-transitory computer-readable storage medium storing computer instructions, and the computer instructions are used to make the computer execute the method as described above.

In the embodiment of the present invention, to detect abnormalities of the object to be tested, the risk control system determines, according to the detection sample data, the first detection feature value of the object to be tested corresponding to the first machine learning model and the second detection feature value corresponding to the rule algorithm, where the rule algorithm contains at least one judgment logic. The first detection feature value corresponding to the first machine learning model is input into the trained first machine learning model to obtain the first output vector of the object to be tested. The second detection feature value corresponding to the rule algorithm is input into the rule algorithm to obtain the second output vector of the object to be tested. The first output vector and the second output vector are input into the trained second machine learning model to determine the output risk index of the object to be tested, and the abnormality determination result of the object to be tested is determined based on the output risk index. In the embodiment of the present invention, the machine learning algorithm and the rule algorithm are closely connected: the output of the first machine learning model and the output of the rule algorithm are input into the second machine learning model, which effectively combines the two. The resulting accuracy and precision are higher than those of a machine learning model alone, and the recall is also better than that of a general machine learning model system.

Description of the drawings

In order to explain the technical solutions in the embodiments of the present invention more clearly, the following will briefly introduce the drawings needed in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.

FIG. 1 is an architecture diagram of a data anomaly detection system provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a method for detecting data anomaly according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a rule tree provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a rule tree of a rule algorithm that needs to be optimized according to an embodiment of the present invention;

FIG. 5 is a schematic flowchart of a method for detecting data risk anomalies according to a specific embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data anomaly detection device provided by an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

Please refer to FIG. 1, which shows an architecture diagram of a data anomaly detection system provided by an embodiment of the present application. It includes five subsystems, namely a transaction collection component, a historical feature calculation component, a rule sub-engine, a deep sub-engine, and an output module. The transaction collection component collects the detection sample data of the object to be tested through a MySQL proxy or a Kafka queue, applies preliminary filter conditions, filters out low-risk objects and channels that do not require risk control by comparing key fields, and then sends the data over TCP socket communication to the historical feature calculation component, the rule sub-engine, and the deep sub-engine.

The historical feature calculation component updates the context information and the statistical information according to the information of the object to be tested. The context information stores the information of the user's last specific behavior; the statistical information includes statistics in multiple dimensions such as card number, merchant, and mobile phone number.

The rule sub-engine obtains all the features required for the rule calculation from the historical feature calculation component, traverses all the rule trees, records the calculation results of the judgment logic in all the rule trees in in-order traversal order, and sends them to the output module.

The deep sub-engine loads the trained neural network model and requests the features it needs from the historical feature calculation component on demand; it calculates the features interactively and applies One-Hot encoding to obtain the input of the neural network model; it then runs the forward propagation algorithm of the neural network model and sends the output to the output module.

The output module loads the trained logistic regression model, splices the outputs of the rule sub-engine and the deep sub-engine, and feeds the result into the logistic regression model for regression calculation to obtain a risk index between 0 and 1. If the risk index is greater than the preset risk threshold, the transaction is determined to be a risky transaction and is stored in the risk transaction table.

It should be noted that the application scenarios mentioned above are only shown to facilitate the understanding of the spirit and principle of the present application, and the embodiments of the present application are not limited in this respect. On the contrary, the embodiments of the present application can be applied to any applicable scenarios.

The following introduces some concepts involved in the embodiments of the present application.

One-Hot Encoding is a coding scheme that uses as many bits as there are states; exactly one bit is 1 and all the others are 0. In the embodiment of the present invention, it is used to convert the detection sample data into current feature values that are then input into the machine learning model.
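As a non-limiting illustration of the encoding just described, the following minimal sketch encodes one categorical feature. The feature (a transaction channel) and its three states are hypothetical and chosen only for the example.

```python
from typing import List, Sequence

def one_hot(value: str, states: Sequence[str]) -> List[int]:
    """Return one bit per state; only the bit for `value` is 1."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if state == value else 0 for state in states]

# Hypothetical categorical feature: the channel a transaction arrived on.
CHANNELS = ["pos", "online", "atm"]
encoded = one_hot("online", CHANNELS)
print(encoded)  # [0, 1, 0]
```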

TCP (Transmission Control Protocol) is a connection-oriented, reliable, byte-stream-based transport layer communication protocol.

In order to monitor system data and improve the accuracy of anomaly detection, an embodiment of the present invention provides a data anomaly detection method. As shown in FIG. 2, the data anomaly detection method provided by the embodiment of the present invention includes the following steps:

Step 201: Obtain test sample data of the object to be tested.

Among them, the detection sample data includes historical detection sample data and current detection sample data of the object to be tested. The object to be tested can be a transaction, or a user, or a bank account, etc.

The current detection sample data and historical detection sample data in the embodiment of the present invention may be a user's transaction sequence. By inputting the user's current transaction sequence into the data anomaly detection system, the risk of the current transaction can be predicted.

The historical detection sample data is the detection sample data of the object to be tested in a historical time period. The historical time period is a time period before the current time point corresponding to the object to be tested. For example, if the current time point is 10:00 am on June 3, 2019, the historical time period may be from 10:00 am on June 3, 2018 to 10:00 am on June 3, 2019. In the specific implementation process, the length of the historical time period can be selected according to needs and the required accuracy. The longer the historical time period, the higher the detection accuracy, but the greater the amount of calculation required; the shorter the historical time period, the smaller the amount of calculation required for detection, but the lower the accuracy.
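As a non-limiting illustration, selecting the historical detection sample data for the one-year window in the example above might look like the following sketch. The record layout (a dictionary with `id` and `time` keys) and the choice to exclude the current time point itself from the window are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def select_history(samples: List[Dict], now: datetime, length: timedelta) -> List[Dict]:
    """Keep samples whose time lies in [now - length, now)."""
    start = now - length
    return [s for s in samples if start <= s["time"] < now]

now = datetime(2019, 6, 3, 10, 0)  # current time point from the example
samples = [
    {"id": "t1", "time": datetime(2018, 1, 1, 9, 0)},     # before the window
    {"id": "t2", "time": datetime(2019, 1, 15, 14, 30)},  # inside the window
    {"id": "t3", "time": datetime(2019, 6, 3, 10, 0)},    # the current sample
]
history = select_history(samples, now, timedelta(days=365))
print([s["id"] for s in history])  # ['t2']
```

A longer `length` admits more samples and therefore more computation, which is the trade-off described above.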

Step 202: According to the detection sample data, determine a first detection feature value of the object to be tested corresponding to a first machine learning model and a second detection feature value corresponding to a rule algorithm, where the rule algorithm includes at least one judgment logic.

In the specific implementation process, the first machine learning model can be selected according to requirements, and can be a neural network model, a PCA (principal component analysis) model, and so on. Preferably, the neural network model is used as the first machine learning model in the embodiment of the present invention.

Step 203: Input the first detection feature value corresponding to the first machine learning model into the trained first machine learning model to obtain the first output vector of the object to be tested, and input the second detection feature value corresponding to the rule algorithm into the rule algorithm to obtain the second output vector of the object to be tested.

For the neural network model, it is necessary to determine, according to the detection sample data, the historical feature values corresponding to the historical features and the current feature values corresponding to the real-time features. Specifically, for a given object to be tested, its historical feature values and real-time feature values are combined as needed, One-Hot encoded, and then input into the neural network model.

For the rule algorithm, for one or more judgment logics in the rule algorithm, the corresponding second detection feature value is calculated according to the detection sample data, and then the second detection feature value is judged according to the judgment logic.

Step 204: Input the first output vector and the second output vector into the trained second machine learning model, and determine the output risk index of the object to be tested.

Among them, the second machine learning model can also be selected as needed, and can be a logistic regression model, a neural network model, and the like. Preferably, in the embodiment of the present invention, a logistic regression model is used as the second machine learning model.

Step 205: Determine the abnormality determination result of the object to be tested according to the output risk index.

Among them, if the risk index is greater than the risk threshold, it indicates that the risk is greater, that is, the object to be tested is abnormal. At this time, the corresponding personnel can be notified by mail, internal process documents of the company, etc. On the other hand, if the risk index is less than or equal to the risk threshold, it indicates that the object to be tested is normal.
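As a non-limiting illustration, the comparison in step 205 reduces to the following sketch. The threshold value of 0.8 is hypothetical; this application does not fix a particular value.

```python
def is_abnormal(risk_index: float, risk_threshold: float) -> bool:
    """Abnormal only when the risk index is strictly greater than the threshold."""
    if not 0.0 <= risk_index <= 1.0:
        raise ValueError("risk index is expected to lie between 0 and 1")
    return risk_index > risk_threshold

RISK_THRESHOLD = 0.8  # hypothetical preset risk threshold

print(is_abnormal(0.93, RISK_THRESHOLD))  # True: abnormal
print(is_abnormal(0.80, RISK_THRESHOLD))  # False: equal to the threshold is normal
```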

In the embodiment of the present invention, to detect abnormalities of the object to be tested, the risk control system determines, according to the detection sample data, the first detection feature value of the object to be tested corresponding to the first machine learning model and the second detection feature value corresponding to the rule algorithm, where the rule algorithm contains at least one judgment logic. The first detection feature value corresponding to the first machine learning model is input into the trained first machine learning model to obtain the first output vector of the object to be tested. The second detection feature value corresponding to the rule algorithm is input into the rule algorithm to obtain the second output vector of the object to be tested. The first output vector and the second output vector are input into the trained second machine learning model to determine the output risk index of the object to be tested, and the abnormality determination result of the object to be tested is determined according to the output risk index. In the embodiment of the present invention, the machine learning algorithm and the rule algorithm are closely connected: the output of the first machine learning model and the output of the rule algorithm are input into the second machine learning model, which effectively combines the two. The resulting accuracy and precision are higher than those of a machine learning model alone, and the recall is also better than that of a general machine learning model system.

For the traditional rule algorithm, there are only two possible output results, risky or risk-free; that is, the output is only 1 or 0, and the confidence of the rule algorithm cannot be quantified. Therefore, the embodiment of the present invention introduces the machine learning algorithm while still using the rule algorithm, and closely combines the two. To fit the input and output of the machine learning algorithm, the output of the rule algorithm needs to be transformed. In the embodiment of the present invention, the second output vector is calculated by the rule algorithm and includes at least one output identifier. In the above step 203, inputting the second detection feature value of the object to be tested into the rule algorithm to obtain the second output vector of the object to be tested includes:

Specifically, in the embodiment of the present invention, the output identifier is used to digitize the judgment result. Since the judgment result in the rule algorithm is generally either risky or risk-free, the judgment result is digitized as 1 or 0. Generally speaking, if the judgment result is risky, the corresponding output identifier is 1; if the judgment result is risk-free, the corresponding output identifier is 0. In addition, to increase accuracy and to facilitate subsequent optimization of the rule algorithm, the embodiment of the present invention does not use the overall judgment result of the rule algorithm as its rule output result; instead, one rule output result is determined for each judgment logic in the rule algorithm, and all the rule output results are combined into the second output vector.

For example, the rule algorithm contains two rules: "A+B>8" and "C|(D>(E-F))". Under the traditional rule algorithm, as long as any one of the rules is satisfied, the transaction is judged to be a risky transaction. Therefore, the traditional rule algorithm outputs only one result, 1 or 0.

In the embodiment of the present invention, the rule algorithm traverses all the judgment logics in the rules in a predetermined order, which may be in-order, pre-order, post-order, and so on. A judgment result is generated for each judgment logic, and the corresponding output identifier is then determined according to the correspondence between judgment results and output identifiers.

Take the above rules "A+B>8" and "C|(D>(E-F))" as an example. FIG. 3 is a schematic diagram of the rule trees of these rules. As shown in FIG. 3, each rule corresponds to one rule tree. The first rule tree contains one judgment logic, and the second rule tree contains three judgment logics. Therefore, the second output vector d corresponding to the rule algorithm contains 4 output identifiers, denoted as $[s_1, s_2, s_3, s_4]$. From left to right in FIG. 3, the first judgment logic is whether A+B>8 holds, which has two possible judgment results, yes or no: if yes, the corresponding output identifier $s_1$ is 1; if not, $s_1$ is 0. The second judgment logic is whether C is included in the second detection feature value of the object to be tested: if yes, the corresponding output identifier $s_2$ is 1; if not, $s_2$ is 0. The third judgment logic is whether C|(D>(E-F)) holds: if yes, the corresponding output identifier $s_3$ is 1; if not, $s_3$ is 0. The fourth judgment logic is whether D>(E-F) holds: if yes, the corresponding output identifier $s_4$ is 1; if not, $s_4$ is 0. After all the judgment logics have been traversed, the final second output vector is obtained, and each element in the second output vector is 1 or 0.
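As a non-limiting illustration, the in-order traversal of the two example rule trees can be sketched as follows. The `Node` class, the way each judgment logic is expressed as a Python callable, and the feature values are assumptions made for the example; only the two rules and the left-to-right ordering of the four output identifiers come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional

Features = Mapping[str, float]

@dataclass
class Node:
    """One judgment logic in a rule tree (hypothetical representation)."""
    name: str
    judge: Callable[[Features], bool]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def in_order(node: Optional[Node], features: Features, out: List[int]) -> None:
    """Visit left subtree, then the node, then the right subtree."""
    if node is None:
        return
    in_order(node.left, features, out)
    out.append(1 if node.judge(features) else 0)  # judgment result -> identifier
    in_order(node.right, features, out)

def rule_output_vector(trees: List[Node], features: Features) -> List[int]:
    """Combine the output identifiers of all trees in a fixed order."""
    out: List[int] = []
    for tree in trees:
        in_order(tree, features, out)
    return out

# Rule 1: A+B>8 (a single judgment logic).
rule1 = Node("A+B>8", lambda f: f["A"] + f["B"] > 8)
# Rule 2: C|(D>(E-F)) (root plus two child judgment logics).
node_c = Node("C", lambda f: bool(f["C"]))
node_d = Node("D>(E-F)", lambda f: f["D"] > f["E"] - f["F"])
rule2 = Node("C|(D>(E-F))", lambda f: node_c.judge(f) or node_d.judge(f),
             left=node_c, right=node_d)
TREES = [rule1, rule2]

features = {"A": 5, "B": 4, "C": 0, "D": 3, "E": 10, "F": 2}  # hypothetical values
second_output_vector = rule_output_vector(TREES, features)
traditional_output = 1 if any(tree.judge(features) for tree in TREES) else 0
print(second_output_vector)  # [1, 0, 0, 0]
print(traditional_output)    # 1
```

The traditional output collapses both rules into a single bit, whereas the vector keeps one identifier per judgment logic, so a downstream model can weight each judgment logic separately.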

In the embodiment of the present invention, not only is the rule algorithm adapted, but the first machine learning algorithm is also adapted to the input requirements of the second machine learning algorithm. The following description takes a neural network model as an example of the first machine learning algorithm.

The output of a traditional neural network model is the risk index, and the risk index $y_t$ can be calculated by the following formula:

$$y_t = \sigma\bigl(W_d\,(W_c\,\mathrm{ReLU}(W_b\,\mathrm{ReLU}(W_a \cdot x + b_a) + b_b) + b_c) + b_d\bigr) \qquad \text{(Formula 1)}$$

Here, $x$ is the first detection feature value of the object to be tested corresponding to the neural network model; $b_a$ to $b_d$ are the bias vectors of the neural network model; $W_a$ to $W_d$ are the weight matrices of the neural network model; $\sigma$ is the sigmoid function, which is fixed; and ReLU is the activation function.

In the embodiment of the present invention, in order to meet the requirement that the input of the second machine learning algorithm is a vector, only the j-dimensional output vector in formula 1 is taken; that is, the first output vector $c$ satisfies the following formula:

$$c = \sigma\bigl(W_c\,\mathrm{ReLU}(W_b\,\mathrm{ReLU}(W_a \cdot x + b_a) + b_b) + b_c\bigr) \qquad \text{(Formula 2)}$$

Here, $c$ is the first output vector corresponding to the neural network model.

Comparing formula 1 and formula 2, it can be seen that formula 1 yields a single value, the risk index, while formula 2 yields a vector, the first output vector $c$.
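As a non-limiting illustration, the difference between formula 1 and formula 2 can be sketched in plain Python as follows. The layer sizes, the randomly initialised weights (standing in for trained parameters), and the input vector are all hypothetical; the sketch follows the formulas as written, in which formula 1 applies the last layer to the pre-activation of the j-dimensional layer.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W·x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v: Vector) -> Vector:
    return [max(0.0, z) for z in v]

def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-z)) for z in v]

def random_layer(rng: random.Random, n_out: int, n_in: int) -> Tuple[Matrix, Vector]:
    """Stand-in for trained parameters: small random weights, zero biases."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
N_IN, H, J = 6, 5, 3  # hypothetical sizes; J is the dimension of c
W_a, b_a = random_layer(rng, H, N_IN)
W_b, b_b = random_layer(rng, H, H)
W_c, b_c = random_layer(rng, J, H)
W_d, b_d = random_layer(rng, 1, J)

def layer_c(x: Vector) -> Vector:
    """W_c ReLU(W_b ReLU(W_a·x + b_a) + b_b) + b_c, shared by both formulas."""
    h1 = relu(affine(W_a, x, b_a))
    h2 = relu(affine(W_b, h1, b_b))
    return affine(W_c, h2, b_c)

def risk_index(x: Vector) -> float:
    """Formula 1: a single value."""
    return sigmoid(affine(W_d, layer_c(x), b_d))[0]

def first_output_vector(x: Vector) -> Vector:
    """Formula 2: a J-dimensional vector."""
    return sigmoid(layer_c(x))

x = [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]  # hypothetical One-Hot encoded input
y_t = risk_index(x)
c = first_output_vector(x)
```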

In the embodiment of the present invention, the output of the first machine learning model and the output of the rule algorithm are used as the input of the second machine learning model. The first machine learning model and the rule algorithm are combined through the second machine learning model, so that the machine learning model and the rule algorithm effectively complement each other. The following description takes a logistic regression model as an example of the second machine learning model.

In the specific implementation process, the logistic regression model regresses the output of the neural network model and the output of the rule algorithm to obtain the final prediction of the risk of the object to be tested. In an optional embodiment, the logistic regression model uses the following formula to calculate the output risk index:

$$y = \sigma\bigl(W_0\,[c, d] + b_0\bigr) \qquad \text{(Formula 3)}$$

Here, $y$ is the output risk index calculated by the logistic regression model; $b_0$ is the bias vector corresponding to the logistic regression model; $c$ is the first output vector of the neural network model; $d$ is the second output vector of the rule algorithm; and $W_0$ is the weight matrix corresponding to the logistic regression model, which includes i weight values, where the number of weight values is equal to the sum of the number of elements in the first output vector and the number of elements in the second output vector.
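As a non-limiting illustration, formula 3 can be evaluated as in the following sketch. All numeric values are hypothetical, and the bias is written as a scalar because the model produces a single output in this example.

```python
import math
from typing import Sequence

def output_risk_index(c: Sequence[float], d: Sequence[float],
                      W0: Sequence[float], b0: float) -> float:
    """Formula 3: y = sigmoid(W0·[c, d] + b0)."""
    spliced = list(c) + list(d)  # [c, d]: splice the two output vectors
    if len(W0) != len(spliced):
        raise ValueError("W0 needs one weight per element of c and d")
    z = sum(w * v for w, v in zip(W0, spliced)) + b0
    return 1.0 / (1.0 + math.exp(-z))

c = [0.2, 0.7, 0.5]  # hypothetical first output vector (neural network)
d = [1, 0, 0, 0]     # hypothetical second output vector (rule algorithm)
W0 = [0.5, 1.0, -0.3, 1.2, 0.4, 0.9, 0.1]  # 3 + 4 = 7 weights
b0 = -1.0
y = output_risk_index(c, d, W0, b0)
```

Because each element of `d` has its own entry in `W0`, the last four weights in this example are the per-judgment-logic weights.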

In formula 3, each weight parameter in the weight matrix $W_0$ is the weight of one input of the logistic regression model. For the second output vector corresponding to the rule algorithm, each output identifier s corresponds to a weight parameter w. The higher the weight parameter w, the more important the judgment logic corresponding to that output identifier and the higher the accuracy of its risk judgment. Conversely, if w is low or negative, the judgment logic is inferior and the rule needs to be adjusted.

Further, the first machine learning model in the embodiment of the present invention may include a plurality of different machine learning sub-models, which further increases the accuracy of risk judgment and widens the range of suitable scenarios.

From the above analysis, it can be seen that in the logistic regression model, the weight parameters corresponding to the rule algorithm can be used as the basis for adjusting the judgment logic in the rule algorithm. Further, the embodiment of the present invention further includes:

Get all the judgment logic in the rule algorithm;

In the specific implementation process, the calculated weight parameters corresponding to the rule algorithm are stored in the logistic regression model. When the rationality of the rule algorithm needs to be evaluated or optimized, the user sends an analysis request through a front-end user interface, such as a client or a browser; the analysis request contains a rule set consisting of one or more rules. After receiving the request, the system's rule-assisted analysis main controller parses the rule set, determines all the judgment logic in the rule set, and determines the weight parameter of each judgment logic in the logistic regression model. The rationality of each judgment logic is then determined according to the relationship between that judgment logic and the other judgment logics, and according to the weight parameter corresponding to that judgment logic.

Fig. 4 shows the rule trees of a rule algorithm that needs to be optimized in an embodiment of the present invention. After receiving the request, the rule-assisted analysis main controller parses the rule set and loads the weight parameters of the rule trees from the logistic regression model; it then takes each judgment logic as metadata, analyzes the rule trees, and identifies the judgment logic that can be optimized. As shown in Figure 4, the rule algorithm contains two rule trees, and each rule tree contains one or more judgment logics. The rule tree is analyzed with each judgment logic node as metadata. For example, in the rule tree on the left in Figure 4, if w₁ ≤ w₂, it is recommended to prune the node corresponding to w₁ and keep only the right branch. As another example, rule trees can be compared with one another to analyze the weights of nodes with similar structures. In the two rule trees in Figure 4, w₄ and w₈ correspond to nodes with similar structures. If w₄ ≤ w₈, it is recommended to use the structure corresponding to w₈.
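The within-tree comparison described above can be sketched in Python as follows. This is only an illustration: the `LogicNode` structure, the node names, and the weights are hypothetical, and the sketch assumes that w₁ and w₂ belong to the left and right child of the same parent node. Only the case stated in the text (prune the left node when its weight does not exceed the right one) is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LogicNode:
    """One judgment logic node of a rule tree (hypothetical structure)."""
    name: str
    weight: float  # weight parameter w read from the logistic regression model
    children: List["LogicNode"] = field(default_factory=list)


def prune_suggestions(node: LogicNode) -> List[Tuple[str, str]]:
    """Return (node_to_prune, node_to_keep) pairs found anywhere in the tree."""
    suggestions: List[Tuple[str, str]] = []
    if len(node.children) == 2:
        left, right = node.children
        # Case stated in the text: if w_left <= w_right, prune the left node
        # and keep only the right branch.
        if left.weight <= right.weight:
            suggestions.append((left.name, right.name))
    for child in node.children:
        suggestions.extend(prune_suggestions(child))
    return suggestions


# Hypothetical tree: a root with two judgment logic children.
tree = LogicNode("root", 0.0, [LogicNode("n1", 0.2), LogicNode("n2", 0.9)])
suggestions = prune_suggestions(tree)
print(suggestions)
```

The function only produces recommendations; whether a node is actually removed is left to the person maintaining the rules, in line with the advisory role of the rule-assisted analysis.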

The rule-assisted analysis main controller also sends the metadata of the current batch to the historical rule analysis module, which searches the historical rule library for structures similar to the metadata of the current batch. For a batch of similar historical metadata, one or a group of historical metadata that is exactly the same as the current metadata is first selected and used as a reference to convert the weights of that batch of historical metadata and the weights of the current batch of metadata, so that the two are comparable. The interchangeability of the current batch metadata is then analyzed: if, for a given piece of metadata, the historical rule library contains a similar structure with a greater weight, it is recommended to replace the current structure with it. The analysis results for the current batch and for the historical batches are sent to the suggestion generation module, which generates visual results and descriptive suggestions and returns them to the front-end interface.

Further, since the embodiment of the present invention contains at least two machine learning models, the two can be trained in either of two ways. One or more first machine learning models can be trained separately, after which all of their output vectors and the output of the rule algorithm are combined to train the second machine learning model. Alternatively, all the first machine learning models and the second machine learning model can be trained jointly. The following description takes the neural network model and the logistic regression model as examples.
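The text does not specify how the historical weights are converted. One possible reading, sketched below in Python, is a simple rescaling: the judgment logic that appears identically in both batches serves as an anchor, and every historical weight is multiplied by the ratio of the anchor's current weight to its historical weight. The rescaling rule, the function names, and the example rules and weights are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def rescale_historical(hist: Dict[str, float], cur: Dict[str, float],
                       anchor: str) -> Dict[str, float]:
    """Scale historical weights so that the shared anchor has the same weight
    in both batches (assumed conversion rule, not taken from the source)."""
    if hist[anchor] == 0:
        raise ValueError("anchor weight must be non-zero")
    scale = cur[anchor] / hist[anchor]
    return {name: w * scale for name, w in hist.items()}


def replacement_suggestions(cur: Dict[str, float], hist_scaled: Dict[str, float],
                            similar: Dict[str, str]) -> List[Tuple[str, str]]:
    """Suggest replacing a current structure when a similar historical
    structure carries a greater (converted) weight."""
    return [(c, h) for c, h in similar.items() if hist_scaled[h] > cur[c]]


# Hypothetical weights; "A+B>8" is identical in both batches and is the anchor.
cur = {"A+B>8": 0.6, "D>5": 0.2}
hist = {"A+B>8": 1.2, "D>4": 0.9}
hist_scaled = rescale_historical(hist, cur, anchor="A+B>8")
result = replacement_suggestions(cur, hist_scaled, similar={"D>5": "D>4"})
print(result)
```

Here the historical structure "D>4" still outweighs the current "D>5" after conversion, so the sketch recommends the replacement.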

For separate training, the neural network model is trained in the following ways:

Obtain the training sample data in the historical time period;

According to the training sample data, select the training object corresponding to the first training feature of the neural network model, and determine the first training feature value corresponding to the first training feature;

Input the first training feature value into the initial neural network model, and calculate the loss function according to the obtained machine risk index and the abnormal determination result of the training object. When the loss function is less than the preset threshold, determine the corresponding first parameter to be the first parameter of the neural network model, thereby obtaining the trained neural network model.

The logistic regression model is trained in the following ways:

Obtain the first output vector of the training object from the trained neural network model;

According to the training sample data, select the training object corresponding to the second training feature of the rule algorithm, and determine the second training feature value corresponding to the second training feature;

The first output vector and the second output vector are input to the initial logistic regression model, and the loss function is calculated according to the obtained output risk index and the abnormal determination result of the training object. When the loss function is less than the preset threshold, the corresponding second parameter is determined to be the second parameter of the logistic regression model, thereby obtaining the trained logistic regression model.
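The second stage of separate training can be sketched in Python as follows. The sketch assumes a binary cross-entropy loss and plain gradient descent, neither of which is specified in the text; the stand-in values for the first output vector, the rule identifiers, the learning rate, and the threshold are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(labels, preds, eps=1e-12):
    """Mean binary cross-entropy (assumed loss function)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(labels, preds)) / len(labels)


def train_logistic(samples, labels, lr=0.5, loss_threshold=0.3, max_epochs=5000):
    """Gradient descent that stops once the loss drops below the threshold."""
    n_features = len(samples[0])
    w, b, loss = [0.0] * n_features, 0.0, float("inf")
    for _ in range(max_epochs):
        preds = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in samples]
        loss = bce(labels, preds)
        if loss < loss_threshold:
            break  # the current parameters become the "second parameter"
        errs = [p - t for p, t in zip(preds, labels)]
        for j in range(n_features):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errs, samples)) / len(samples)
        b -= lr * sum(errs) / len(samples)
    return w, b, loss


# Each sample is [c, d1, d2]: one stand-in neural-network output followed by
# two rule output identifiers. Labels are the abnormal determination results.
samples = [[0.9, 1, 1], [0.8, 1, 0], [0.2, 0, 0], [0.1, 0, 1]]
labels = [1, 1, 0, 0]
w, b, final_loss = train_logistic(samples, labels)
print(final_loss < 0.3)
```

Because the neural network is already trained at this stage, its outputs enter only as fixed inputs; only the logistic regression weights and bias are updated.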

For co-training, the neural network model and logistic regression model are trained in the following ways:

Obtain the training sample data in the historical time period;

Input the first training feature value into the initial neural network model to obtain the first output vector of the training object;

The first output vector and the second output vector are input to the initial logistic regression model, and the loss function is calculated according to the obtained output risk index and the abnormal determination result of the training object. When the loss function is less than the preset threshold, the corresponding first parameter is determined to be the first parameter of the neural network model, thereby obtaining the trained neural network model, and the corresponding second parameter is determined to be the second parameter of the logistic regression model, thereby obtaining the trained logistic regression model.

In order to understand the present invention more clearly, the foregoing process is described in detail below using a specific embodiment. In this specific embodiment, the first machine learning model is a neural network model, and the second machine learning model is a logistic regression model. Fig. 5 shows a schematic flowchart of a method for detecting data risk anomalies in the specific embodiment. As shown in Figure 5, the core of the data risk anomaly detection method is a dual-engine model, which includes four parts: a rule sub-engine, a deep sub-engine, an output module, and a rule-assisted analysis module, of which:
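Joint training can be sketched in Python as follows: a single loss on the output risk index drives updates to both the neural network parameters and the logistic regression parameters. The one-hidden-layer tanh network, the binary cross-entropy loss, full-batch gradient descent, and all data and hyperparameters are assumptions chosen to keep the example small; the text does not prescribe any of them.

```python
import math
import random


def forward(x, d, W1, b1, w0, b0):
    """c = tanh(W1 x + b1) is the first output vector; y = sigma(w0 [c, d] + b0)."""
    c = [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    z = sum(w * v for w, v in zip(w0, c + list(d))) + b0
    return c, 1.0 / (1.0 + math.exp(-z))


def joint_train(X, D, T, hidden=2, lr=0.1, loss_threshold=0.2,
                max_epochs=3000, seed=0):
    rng = random.Random(seed)
    n_in, n_rule, n = len(X[0]), len(D[0]), len(X)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w0 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + n_rule)]
    b0 = 0.0
    history = []
    for _ in range(max_epochs):
        gW1 = [[0.0] * n_in for _ in range(hidden)]
        gb1, gw0, gb0, loss = [0.0] * hidden, [0.0] * (hidden + n_rule), 0.0, 0.0
        for x, d, t in zip(X, D, T):
            c, y = forward(x, d, W1, b1, w0, b0)
            loss -= t * math.log(y + 1e-12) + (1 - t) * math.log(1 - y + 1e-12)
            dz = y - t  # gradient of the loss w.r.t. the pre-sigmoid value
            for j, v in enumerate(c + list(d)):
                gw0[j] += dz * v
            gb0 += dz
            for i in range(hidden):
                dpre = dz * w0[i] * (1.0 - c[i] ** 2)  # back-propagate through tanh
                gb1[i] += dpre
                for j in range(n_in):
                    gW1[i][j] += dpre * x[j]
        history.append(loss / n)
        if history[-1] < loss_threshold:
            break  # keep the current first and second parameters
        for i in range(hidden):
            b1[i] -= lr * gb1[i] / n
            for j in range(n_in):
                W1[i][j] -= lr * gW1[i][j] / n
        w0 = [w - lr * g / n for w, g in zip(w0, gw0)]
        b0 -= lr * gb0 / n
    return (W1, b1), (w0, b0), history


X = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]  # first training feature values
D = [[1, 1], [1, 0], [0, 0], [0, 1]]                  # second output vectors
T = [1, 1, 0, 0]                                      # abnormal determination results
nn_params, lr_params, history = joint_train(X, D, T)
print(history[0], history[-1])
```

The difference from separate training is that the error signal `dz` flows back through the logistic regression weights into the network, so the first output vector is shaped by the same objective as the final risk index.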

The rule sub-engine contains a set of rules. For the transaction to be tested, it traverses all the rules in the rule set and evaluates the risk of the transaction. For the two rules shown in Figure 5, "A+B>8" and "C|(D>(EF))", the engine traverses the rule trees in order and records the calculation results of all the judgment logic nodes in order as the output of the rule sub-engine, d=[s₁, s₂, s₃, s₄].

The deep sub-engine uses the trained neural network model to evaluate the risk of the transaction to be tested. For the transaction to be tested, the historical features and the real-time features are combined as needed and one-hot encoded, the result is input into the neural network model, and the vector c is output.

The output module uses the trained logistic regression model to regress the output of the rule sub-engine and the deep sub-engine to obtain the final prediction of the transaction risk.

In addition, the rule-assisted analysis module receives front-end instructions to compare multiple rules and assist in rule formulation. It analyzes the weights of the judgment logic nodes within a single rule, the weights of the judgment logic nodes across multiple rules, and the weights of similar rules in the historical rule library, then generates visual results and gives suggestions for improving the existing rules.
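The traversal performed by the rule sub-engine can be sketched in Python as follows. Several points are assumptions: the operator between E and F in the second rule is not recoverable from the text, so a product is used purely for illustration; the split of the two rules into four judgment logic nodes is one plausible reading of d=[s₁, s₂, s₃, s₄]; and the mapping of a true result to 1 and a false result to 0 is a hypothetical choice of output identifiers.

```python
from typing import Callable, Dict, List, Tuple

Transaction = Dict[str, float]
# A predicate receives the transaction and the results recorded so far.
Predicate = Callable[[Transaction, List[bool]], bool]

# Judgment logic nodes in traversal order; the order fixes the positions in d.
JUDGMENT_NODES: List[Tuple[str, Predicate]] = [
    ("s1: A+B>8", lambda t, r: t["A"] + t["B"] > 8),
    ("s2: C", lambda t, r: bool(t["C"])),
    # Assumption: "EF" is treated as E*F only for the sake of the example.
    ("s3: D>(E*F)", lambda t, r: t["D"] > t["E"] * t["F"]),
    ("s4: s2|s3", lambda t, r: r[1] or r[2]),
]


def rule_sub_engine(tx: Transaction) -> List[int]:
    """Evaluate every judgment logic node in order and return d."""
    results: List[bool] = []
    for _, predicate in JUDGMENT_NODES:
        results.append(predicate(tx, results))
    # Hypothetical output identifiers: 1 for a hit, 0 otherwise.
    return [1 if hit else 0 for hit in results]


tx = {"A": 5, "B": 4, "C": 0, "D": 10, "E": 2, "F": 3}
d = rule_sub_engine(tx)
print(d)  # [1, 0, 1, 1]
```

Recording every node rather than only the final verdict of each rule is what lets the logistic regression model learn a separate weight per judgment logic.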
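The feature preparation step for the deep sub-engine can be sketched in Python as follows. The feature names (`card_level`, `channel`), their category lists, and the fixed alphabetical feature order are hypothetical; the text states only that historical and real-time features are combined and one-hot encoded.

```python
from typing import Dict, List


def one_hot(value: str, categories: List[str]) -> List[float]:
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == cat else 0.0 for cat in categories]


def encode_transaction(historical: Dict[str, str], realtime: Dict[str, str],
                       vocab: Dict[str, List[str]]) -> List[float]:
    """Combine historical and real-time features, then one-hot encode them
    in a fixed (alphabetical) feature order."""
    combined = {**historical, **realtime}
    vector: List[float] = []
    for name in sorted(vocab):
        vector.extend(one_hot(combined[name], vocab[name]))
    return vector


# Hypothetical categorical features of a transaction.
vocab = {"channel": ["app", "pos", "web"], "card_level": ["gold", "standard"]}
historical = {"card_level": "gold"}
realtime = {"channel": "pos"}
encoded = encode_transaction(historical, realtime, vocab)
print(encoded)  # [1.0, 0.0, 0.0, 1.0, 0.0]
```

The resulting vector would be the input to the neural network model, whose output is the vector c.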

The embodiment of the present invention also provides a data abnormality detection device, as shown in FIG. 6, including:

The obtaining unit 601 is configured to obtain detection sample data of the object to be tested;

The processing unit 602 is configured to determine, according to the detection sample data, a first detection feature value of the object to be tested corresponding to a first machine learning model and a second detection feature value corresponding to a rule algorithm, where the rule algorithm contains at least one judgment logic;

The calculation unit 603 is configured to input the first detection feature value corresponding to the first machine learning model into the trained first machine learning model to obtain the first output vector of the object to be tested, and to input the second detection feature value corresponding to the rule algorithm into the rule algorithm to obtain the second output vector of the object to be tested;

The output unit 604 is configured to input the first output vector and the second output vector into the trained second machine learning model to determine the output risk index of the object to be tested;

The determining unit 605 is configured to determine the abnormal determination result of the object to be tested according to the output risk index.
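How the calculation, output, and determining units fit together can be sketched in Python as follows. The class name, the stand-in first model and rule algorithm, the weights, and the rule "risk index at or above a threshold means abnormal" are all hypothetical; the text does not state how the abnormal determination result is derived from the output risk index.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


class DualEngineDetector:
    """Chains the first model, the rule algorithm, and the second model."""

    def __init__(self, first_model: Callable, rule_algorithm: Callable,
                 w0: Sequence[float], b0: float, threshold: float = 0.5):
        self.first_model = first_model
        self.rule_algorithm = rule_algorithm
        self.w0 = list(w0)
        self.b0 = b0
        self.threshold = threshold  # assumed decision rule, see lead-in

    def detect(self, first_features, second_features) -> Tuple[float, bool]:
        # Calculation unit: first and second output vectors.
        c = list(self.first_model(first_features))
        d = list(self.rule_algorithm(second_features))
        # Output unit: logistic regression over the concatenation [c, d].
        z = sum(w * v for w, v in zip(self.w0, c + d)) + self.b0
        risk = 1.0 / (1.0 + math.exp(-z))
        # Determining unit: compare the risk index with the threshold.
        return risk, risk >= self.threshold


# Stand-ins for a trained first model and a one-rule rule algorithm.
def toy_first_model(features: Sequence[float]):
    return [sum(features) / len(features)]


def toy_rule_algorithm(features: Dict[str, float]):
    return [1 if features["amount"] > 1000 else 0]


detector = DualEngineDetector(toy_first_model, toy_rule_algorithm,
                              w0=[1.0, 2.0], b0=-1.5)
risk, abnormal = detector.detect([0.4, 0.6], {"amount": 5000})
print(round(risk, 3), abnormal)
```

With the same first-model input but an amount of 10, the rule identifier becomes 0 and the sketch reports the object as not abnormal.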

In an optional embodiment, it further includes a training unit 606, configured to train the neural network model in the following manner:

Obtain the training sample data in the historical time period;

In an optional embodiment, the training unit 606 is further configured to train the neural network model and the logistic regression model in the following manner:

Obtain the training sample data in the historical time period;

In an optional embodiment, an analysis unit 607 is further included, configured to:

Get all the judgment logic in the rule algorithm;

Based on the same principle, the present invention also provides an electronic device, as shown in FIG. 7, including:

A processor 701, a memory 702, a transceiver 703, and a bus interface 704, wherein the processor 701, the memory 702 and the transceiver 703 are connected through the bus interface 704;

The processor 701 is configured to read a program in the memory 702 and execute the following method:

Obtain detection sample data of the object to be tested;

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram, can be realized by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.

These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.

These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.

Although the preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic creative concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. In this way, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these modifications and variations.

Claims

A data anomaly detection method, which is characterized in that it comprises:

Obtain detection sample data of the object to be tested;

According to the detection sample data, determine a first detection feature value of the object to be tested corresponding to a first machine learning model and a second detection feature value corresponding to a rule algorithm, where the rule algorithm includes at least one judgment logic;

Input the first detection feature value corresponding to the first machine learning model into the trained first machine learning model to obtain the first output vector of the object to be tested, and input the second detection feature value corresponding to the rule algorithm into the rule algorithm to obtain the second output vector of the object to be tested;

Input the first output vector and the second output vector into the trained second machine learning model to determine the output risk index of the object to be tested;

According to the output risk index, the abnormal determination result of the object to be tested is determined.
The method of claim 1, wherein the second output vector contains at least one output identifier; and inputting the second detection feature value of the object to be tested into the rule algorithm to obtain the second output vector of the object to be tested includes:

Determine the corresponding relationship between the judgment result and the output identification;

For each judgment logic in the rule algorithm, use the corresponding second detection characteristic value to make a judgment according to the judgment logic to obtain a corresponding judgment result, and determine the corresponding output identifier according to the judgment result;

All output identifiers are combined into the second output vector in a predetermined order.
The method of claim 1, wherein the first machine learning model is a neural network model, and the second machine learning model is a logistic regression model.
The method of claim 3, wherein the neural network model is trained in the following manner:

Obtain the training sample data in the historical time period;

According to the training sample data, selecting a training object corresponding to the first training feature of the neural network model, and determining the first training feature value corresponding to the first training feature;

The first training feature value is input into the initial neural network model, and the loss function is calculated according to the obtained machine risk index and the abnormal determination result of the training object. When the loss function is less than a preset threshold, the corresponding first parameter is determined to be the first parameter corresponding to the neural network model, and the trained neural network model is obtained;

The logistic regression model is trained in the following manner:

Obtaining the first output vector of the training object from the trained neural network model;

According to the training sample data, selecting a training object corresponding to the second training feature of the rule algorithm, and determining a second training feature value corresponding to the second training feature;

Input the second training feature value into the rule algorithm to obtain the second output vector of the training object;

Input the first output vector and the second output vector to the initial logistic regression model, and calculate a loss function according to the obtained output risk index and the abnormality determination result of the training object; when the loss function is less than a preset threshold, determine that the corresponding second parameter is the second parameter corresponding to the logistic regression model, and obtain the trained logistic regression model.
The method of claim 3, wherein the neural network model and the logistic regression model are trained in the following manner:

Obtain the training sample data in the historical time period;

According to the training sample data, selecting a training object corresponding to the first training feature of the neural network model, and determining the first training feature value corresponding to the first training feature;

Input the first training feature value into an initial neural network model to obtain a first output vector of the training object;

According to the training sample data, selecting a training object corresponding to the second training feature of the rule algorithm, and determining a second training feature value corresponding to the second training feature;

Input the second training feature value into the rule algorithm to obtain the second output vector of the training object;

Input the first output vector and the second output vector to the initial logistic regression model, and calculate a loss function according to the obtained output risk index and the abnormality determination result of the training object; when the loss function is less than a preset threshold, determine that the corresponding first parameter is the first parameter corresponding to the neural network model to obtain the trained neural network model, and determine that the corresponding second parameter is the second parameter corresponding to the logistic regression model to obtain the trained logistic regression model.
The method of claim 1, wherein the first machine learning model includes a plurality of different machine learning sub-models.
The method according to any one of claims 1 to 6, further comprising:

Get all the judgment logic in the rule algorithm;

Obtaining a weight parameter corresponding to each judgment logic from the second machine learning model;

For each judgment logic, the rationality of the judgment logic is determined according to the relationship between the judgment logic and other judgment logics, and the weight parameters corresponding to the judgment logic.
A data abnormality detection device, which is characterized in that it comprises:

The acquiring unit is used to acquire the detection sample data of the object to be tested;

The processing unit is configured to determine, according to the detection sample data, a first detection feature value of the object to be tested corresponding to a first machine learning model and a second detection feature value corresponding to a rule algorithm, where the rule algorithm includes at least one judgment logic;

The calculation unit is configured to input the first detection feature value corresponding to the first machine learning model into the trained first machine learning model to obtain the first output vector of the object to be tested, and to input the second detection feature value corresponding to the rule algorithm into the rule algorithm to obtain the second output vector of the object to be tested;

An output unit, configured to input the first output vector and the second output vector into the trained second machine learning model to determine the output risk index of the object to be tested;

The determining unit is configured to determine the abnormal determination result of the object to be tested according to the output risk index.
The device according to claim 8, wherein the second output vector includes at least one output identifier; and the calculation unit is specifically configured to:

Determine the corresponding relationship between the judgment result and the output identification;

For each judgment logic in the rule algorithm, use the corresponding second detection characteristic value to make a judgment according to the judgment logic to obtain a corresponding judgment result, and determine the corresponding output identifier according to the judgment result;

All output identifiers are combined into the second output vector in a predetermined order.
The device of claim 8, wherein the first machine learning model is a neural network model, and the second machine learning model is a logistic regression model.
The device of claim 10, further comprising a training unit for training the neural network model in the following manner:

Obtain the training sample data in the historical time period;

According to the training sample data, selecting a training object corresponding to the first training feature of the neural network model, and determining the first training feature value corresponding to the first training feature;

The first training feature value is input into the initial neural network model, and the loss function is calculated according to the obtained machine risk index and the abnormal determination result of the training object. When the loss function is less than a preset threshold, the corresponding first parameter is determined to be the first parameter corresponding to the neural network model, and the trained neural network model is obtained;

The training unit is also used to train the logistic regression model in the following manner:

Obtaining the first output vector of the training object from the trained neural network model;

According to the training sample data, selecting a training object corresponding to the second training feature of the rule algorithm, and determining a second training feature value corresponding to the second training feature;

Input the second training feature value into the rule algorithm to obtain the second output vector of the training object;

Input the first output vector and the second output vector to the initial logistic regression model, and calculate a loss function according to the obtained output risk index and the abnormality determination result of the training object; when the loss function is less than a preset threshold, determine that the corresponding second parameter is the second parameter corresponding to the logistic regression model, and obtain the trained logistic regression model.
The device according to claim 10, wherein the training unit is further configured to train the neural network model and the logistic regression model in the following manner:

Obtain the training sample data in the historical time period;

According to the training sample data, selecting a training object corresponding to the first training feature of the neural network model, and determining the first training feature value corresponding to the first training feature;

Input the first training feature value into an initial neural network model to obtain a first output vector of the training object;

According to the training sample data, selecting a training object corresponding to the second training feature of the rule algorithm, and determining a second training feature value corresponding to the second training feature;

Input the second training feature value into the rule algorithm to obtain the second output vector of the training object;

Input the first output vector and the second output vector to the initial logistic regression model, and calculate a loss function according to the obtained output risk index and the abnormality determination result of the training object; when the loss function is less than a preset threshold, determine that the corresponding first parameter is the first parameter corresponding to the neural network model to obtain the trained neural network model, and determine that the corresponding second parameter is the second parameter corresponding to the logistic regression model to obtain the trained logistic regression model.
The apparatus of claim 8, wherein the first machine learning model includes a plurality of different machine learning sub-models.
The device according to any one of claims 8 to 13, characterized in that it further comprises an analysis unit for:

Get all the judgment logic in the rule algorithm;

Obtaining a weight parameter corresponding to each judgment logic from the second machine learning model;

For each judgment logic, the rationality of the judgment logic is determined according to the relationship between the judgment logic and other judgment logics, and the weight parameters corresponding to the judgment logic.
An electronic device, characterized in that it comprises:

At least one processor; and,

A memory communicatively connected with the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method according to any one of claims 1-7.
A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute the method according to any one of claims 1-7.