CN112085588B - Method and device for determining safety of rule model and data processing method

Info

Publication number: CN112085588B
Application number: CN202010908613.0A
Authority: CN
Inventors: 张文彬; 殷山; 李翰林; 李漓春
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2022-11-29
Anticipated expiration: 2040-09-02
Also published as: CN112085588A

Abstract

The specification provides a method and a device for determining the safety of a rule model and a data processing method. The method comprises the steps of determining first distribution of target attributes according to a sample set; meanwhile, processing the sample set by using a rule model to determine a second distribution of the target attributes under various hit conditions; then, according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions, calculating a safety indication parameter capable of reflecting the difference degree between the second distribution of the target attributes under multiple hit conditions and the original first distribution; and determining whether the rule model has safety risks or not according to the safety indication parameters. Therefore, the difference degree between the second distribution and the first distribution of the target attributes under various hit conditions can be quantified by determining and utilizing the safety indication parameters, and whether the rule model has safety risks or not can be determined more accurately according to the safety indication parameters.

Description

Method and device for determining safety of rule model and data processing method

Technical Field

The present specification belongs to the field of internet technologies, and in particular, relates to a method and an apparatus for determining security of a rule model, and a data processing method.

Background

In some data processing scenarios, the model generator is often separate from the data provider.

Usually, in response to a request from the model generator, the data provider uses its own data resources to run the rule model provided by the model generator, obtains the corresponding processing result, and feeds the processing result back to the model generator. The model generator can thus obtain the processing result without accessing the data resources owned by the data provider, and can perform its specific data processing according to that result.

However, if the rule model itself is not secure, the data provider may leak the data resources owned by the data provider in the course of running the rule model.

Disclosure of Invention

The specification provides a method and a device for determining the safety of a rule model and a data processing method, so that whether the rule model has safety risks can be determined more accurately.

The method, the device and the data processing method for determining the safety of the rule model provided by the specification are realized as follows:

A method of determining security of a rule model, comprising: obtaining a rule model and a sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data; determining a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions; determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk or not according to the security indication parameters under various hit conditions.

A method of data processing, comprising: obtaining a first distribution of the target attribute and a second distribution of the target attribute; calculating a security indication parameter according to the first distribution of the target attribute and the second distribution of the target attribute; determining a degree of difference between the first distribution of the target attribute and the second distribution of the target attribute according to the safety indication parameter.

A device for determining the security of a rule model, comprising: the acquisition module is used for acquiring the rule model and the sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data;

the first determining module is used for determining first distribution of the target attribute according to the sample set; determining a second distribution of the target attributes under various hit conditions according to the sample set and the rule model; the second determining module is used for determining the safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and the third determining module is used for determining whether the rule model has safety risks or not according to the safety indication parameters under various hit conditions.

A server comprising a processor and a memory for storing processor-executable instructions that, when executed by the processor, implement obtaining a rule model and a sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data; determining a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions; determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk or not according to the security indication parameters under various hit conditions.

A computer readable storage medium having stored thereon computer instructions that, when executed, implement obtaining a rule model and a sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data; determining a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions; determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk or not according to the security indication parameters under various hit conditions.

According to the method, the device and the data processing method for determining the safety of the rule model, a first distribution of target attributes is determined according to a sample set; meanwhile, processing the sample set by using a rule model to determine a second distribution of the target attributes under various hit conditions; then according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions, calculating a safety indication parameter capable of reflecting the difference degree between the second distribution of the target attributes under multiple hit conditions and the original first distribution; and further, whether the rule model has the safety risk or not can be determined according to the safety indication parameters. Therefore, the difference degree between the second distribution and the first distribution of the target attributes under various hit conditions can be quantified by utilizing the safety indication parameters, whether the rule model has a safety risk or not can be determined more accurately according to the safety indication parameters, and the risk of data leakage when the rule model is operated by a data provider is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present specification, the drawings required for the embodiments will be briefly described below, the drawings in the following description are only some of the embodiments described in the present specification, and other drawings may be obtained by those skilled in the art without inventive labor.

FIG. 1 is a diagram illustrating an embodiment of the structural components of a system to which a method for determining the security of a model provided in an embodiment of the present specification is applied;

FIG. 2 is a flow diagram of a method for determining security of a model provided by one embodiment of the present description;

FIG. 3 is a diagram illustrating an embodiment of a method for determining security of a model using an embodiment of the present disclosure, in an example scenario;

FIG. 4 is a diagram illustrating an embodiment of a method for determining security of a model using an embodiment of the present disclosure, in an example scenario;

FIG. 5 is a flow diagram illustrating a data processing method according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural component diagram of a server provided in an embodiment of the present description;

FIG. 7 is a schematic structural composition diagram of a device for determining security of a model according to an embodiment of the present specification.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without making any creative effort shall fall within the protection scope of the present specification.

The embodiment of the specification provides a method for determining the safety of a rule model, which can be particularly applied to a system comprising a first server and a second server.

In particular, reference may be made to fig. 1. The first server may specifically include a server disposed on the model generator side. The second server may specifically include a server disposed on the data provider side.

In particular, in order to perform corresponding data processing (for example, determining credit risk of a user) by using data resources owned by a data provider, a first server may configure and construct a rule model including one or more rule sets, and send the rule model to a second server.

The second server may check the security of the rule model before running the rule model using its own data resources.

Specifically, the second server may obtain a sample set including a plurality of sample data and the rule model. The second server can determine a first distribution of the target attributes according to the sample set; and processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions. And determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions. And then, whether the rule model has a security risk or not can be determined according to the security indication parameters under various hit conditions.

Under the condition that the rule model is determined to have no security risk, the second server can normally use data resources owned by the second server to operate the rule model, and a corresponding processing result is obtained; and feeding back the processing result to the first server. The first server can complete corresponding data processing according to the processing result.

Under the condition that the rule model is determined to have security risks, the second server can refuse to use the data resources owned by the second server to run the rule model, and therefore the data resources owned by the data provider can be effectively prevented from being leaked.

In this embodiment, the first server and the second server may specifically include a server that is applied to a data processing system side and is capable of implementing functions such as data transmission and data processing. Specifically, the first server and the second server may be, for example, an electronic device having data operation, storage functions and network interaction functions. Alternatively, the first server and the second server may also be software programs running in the electronic device and providing support for data processing, storage and network interaction. In this embodiment, the number of the servers included in the first server and the second server is not specifically limited. The first server and the second server may be specifically one server, or several servers, or a server cluster formed by several servers.

Referring to fig. 2, an embodiment of the present disclosure provides a method for determining security of a rule model. In particular implementations, the method may include the following.

S201: obtaining a rule model and a sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data.

In some embodiments, the method may be applied in particular to a second server arranged on the data provider side.

In some embodiments, the method may also be applied to a third server disposed on the third party side. The third party may be a service provider which is independent of the data provider and the model generator and is trusted by the data provider and the model generator together to detect the security of the rule model. Specifically, for example, the first server sends the generated rule model to the second server and also sends the same rule model to the third server, the third server generates and sends the security prompt message to the second server when determining that the rule model has no security risk by applying the method, and the second server runs the rule model by using the own data resource after receiving the security prompt message.

In some embodiments, in the case that the model generator needs to perform self-checking on the generated rule model, the method may also be applied to a first server arranged on the model generator side. That is, the model generator may detect the security of the generated rule model via the first server. Specifically, the first server sends the rule model to the second server only when the first server determines that the generated rule model has no security risk through detection.

The embodiments of the present specification will be described specifically mainly by taking an example of applying the method to the second server. For the case of application to the third server, the first server, reference may be made to the following embodiments applied to the second server.

In some embodiments, the rule model may be specifically understood as a data model generated or provided by a model generator for detecting whether the attributes of a data object (e.g., a user object, etc.) satisfy certain decision rules.

In particular, the rule model may specifically include one or more rule sets. Each rule set may further include one or more rules.

In some embodiments, the rules are used to detect whether a certain attribute characteristic of a data object satisfies a certain predetermined data value range. The rule may specifically include: attributes, operators, and data elements such as data thresholds.

The attribute may be understood as parameter data characterizing a certain attribute characteristic of the data object, for example monthly income, default rate, height, or occupation. The data threshold may be understood as an upper limit and/or a lower limit set for the data value of an attribute in a rule, for example 1000 yuan, 15 times, or 5%. The operator may be understood as the symbol in a rule that defines the decision relation between an attribute and a data threshold, for example > (greater than), < (less than), or ≥ (greater than or equal to). Of course, the attributes, operators, and data thresholds listed above are only illustrative.

Specifically, for example, in rule 1, "monthly income of the user > 1000 yuan", the attribute is "monthly income", the operator is ">", and the data threshold is "1000 yuan". If a user's monthly income is 2000 yuan, which is greater than 1000 yuan, the user is understood to hit rule 1. If a user's monthly income is 500 yuan, which is less than 1000 yuan, the user is understood not to hit rule 1.

In some embodiments, the rule set may include only one rule. For example, rule set 1 may contain only one rule, rule 1. If a user hits rule 1, it can be understood that the user hits rule set 1. If a user does not hit rule 1, then the user is understood to not hit rule set 1.

In some embodiments, the rule set may also include a plurality of different rules. The plurality of different rules can be connected together through preset logic connection words to form a rule set. The preset logical connection words may specifically include a connection word such as "and" (e.g., and), "or" (e.g., or).

For example, in rule set 2, "number of defaults of the user > 5 (denoted as rule 2), or default rate of the user > 0.5 (denoted as rule 3)", rule 2 and rule 3 are joined by the logical connective "or" to form a rule set, i.e., rule set 2. If a user hits at least one of rule 2 and rule 3, the user is understood to hit rule set 2. If a user hits neither rule 2 nor rule 3, the user is understood not to hit rule set 2.
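The rule and rule-set structure described above can be sketched in Python as follows. This is only an illustrative sketch: the class names `Rule` and `RuleSet`, the field names, the `OPS` table, and the attribute keys `default_count` and `default_rate` are assumptions introduced here and do not come from the specification.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical operator table covering the operators the text lists as examples.
OPS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
}


@dataclass
class Rule:
    """One rule: an attribute, an operator, and a data threshold."""
    attribute: str
    op: str
    threshold: float

    def hit(self, record: Dict[str, float]) -> bool:
        # The record hits the rule when its attribute value satisfies the comparison.
        return OPS[self.op](record[self.attribute], self.threshold)


@dataclass
class RuleSet:
    """One or more rules joined by a single logical connective ("and" / "or")."""
    rules: List[Rule]
    connective: str = "and"

    def hit(self, record: Dict[str, float]) -> bool:
        results = [rule.hit(record) for rule in self.rules]
        return all(results) if self.connective == "and" else any(results)


# Rule set 2 from the text: default count > 5, or default rate > 0.5.
rule_set_2 = RuleSet(
    rules=[Rule("default_count", ">", 5), Rule("default_rate", ">", 0.5)],
    connective="or",
)

print(rule_set_2.hit({"default_count": 6, "default_rate": 0.1}))  # hits rule 2
print(rule_set_2.hit({"default_count": 2, "default_rate": 0.3}))  # hits neither rule
```

A rule set containing a single rule, such as rule set 1, is simply a `RuleSet` with one entry in `rules`; the connective then has no effect.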

In some embodiments, the model generator may configure corresponding rules according to specific application scenarios and data processing requirements; combining the rules to obtain a corresponding rule set; then, a corresponding rule model (which may also be referred to as a rule-based model) is constructed according to the rule set. And the data provider runs the rule model by using the owned data resources to obtain a corresponding processing result so as to perform corresponding data processing.

In some embodiments, the model generator and the data provider tend to be separate. In this case, the model generator may transmit the rule model described above to the data provider. The data provider can utilize the data resources owned by the data provider, such as a database containing information data of a large number of data objects, to run the rule model to obtain the corresponding processing result; and feeding the processing result back to the model generator so that the model generator can obtain and utilize the processing result to complete corresponding data processing. This also reduces the risk of data resources owned by the data provider being compromised.

In some embodiments, some rule models themselves carry security risks. When a data provider uses its own data resources to run such a rule model and obtain the corresponding processing result, the data provider still faces a risk of data leakage, which threatens the data security of the data provider.

For example, when generating the rule model, the model generator intentionally configures rule set n in the rule model as "monthly income of the user = 5000 yuan". Suppose the data provider then directly uses its own data resources to query the information data of the user L to be detected (for example, the monthly income of user L is 5000 yuan), inputs that information data into the rule model, obtains a processing result indicating whether user L hits rule set n (for example, that user L hits rule set n), and feeds the processing result back to the model generator. In this case, although the data provider has not directly disclosed to the model generator that the monthly income of user L is 5000 yuan, the model generator can accurately infer from the processing result that the monthly income of user L is 5000 yuan. That is, the data resources of the data provider have been leaked.
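The leakage in this example can be illustrated with a small sketch that compares what a model generator could infer from a hit result under an exact-match rule and under an ordinary threshold rule. The candidate value list and the function name `consistent_values` are hypothetical and are used only for illustration.

```python
# Hypothetical set of values the model generator considers possible for monthly income.
CANDIDATE_INCOMES = [500, 1000, 2000, 5000]


def consistent_values(rule, observed_hit, candidates=CANDIDATE_INCOMES):
    """Return the candidate values that agree with the observed hit result."""
    return [value for value in candidates if rule(value) == observed_hit]


def exact_rule(income):
    # Rule set n from the text: monthly income = 5000 yuan.
    return income == 5000


def threshold_rule(income):
    # An ordinary threshold rule: monthly income > 1000 yuan.
    return income > 1000


user_l_income = 5000  # known only to the data provider

# The model generator only sees the hit result, yet can narrow down the value.
leak = consistent_values(exact_rule, exact_rule(user_l_income))
benign = consistent_values(threshold_rule, threshold_rule(user_l_income))

print(leak)    # [5000]
print(benign)  # [2000, 5000]
```

With the exact-match rule, a single hit result leaves only one possible value, whereas the threshold rule still leaves several candidates. This is the kind of difference the security check is meant to detect.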

Therefore, to avoid leaking its data resources when running the rule model and to protect its data security, the data provider can check the security of the rule model before running it on its own data resources, and run the rule model on those data resources only after determining that the rule model has no security risk.

In some embodiments, the second server may be specifically understood as a server disposed on the data provider side. The method for determining the security of the rule model provided in the embodiment of the present specification may be specifically applied to the second server side. On the other hand, a first server may be disposed on the model generator side.

In some embodiments, the sample set may be specifically understood as a set of sample data for detecting whether the rule model has a security risk. The sample set may include a plurality of sample data.

Specifically, each sample data may include information data related to a sample object (also referred to as a sample data object). For example, in a credit risk detection scenario, the sample data may specifically be information data related to the credit of the sample user, such as number of default times data, default rate data, monthly income data, and the like of the sample user.

In some embodiments, the second server may collect and construct the sample set from the information data of the real data objects before the implementation. The second server may also use the generated information data of the virtual data object to construct a sample set by means of simulation.

In some embodiments, when embodied, the second server may receive a data processing request from the first server. The data processing request may carry a rule model provided by the first server. Accordingly, the second server may obtain the rule model by receiving a data processing request.

Further, the data processing request may also carry an identity of the data object to be processed (e.g., an identity ID of the user to be detected).

The data processing request may be specifically configured to request the second server to query and run a rule model according to the information data of the data object by using a data resource owned by the second server, so as to determine a rule set hit by the data object to be processed.

In some embodiments, the second server may further obtain the rule model to be detected directly sent by the first server.

In some embodiments, the second server may obtain the rule model from the data processing request after receiving the data processing request. Further, before the second server runs the rule model according to the information data of the queried data object by using the own data resource, a sample set (for example, a sample set of information data including an attribute appearing in the rule model) matching the rule model may be obtained to detect whether the rule model has a security risk.

Specifically, the second server may obtain the sample set by receiving sample data provided by a third party; by extracting information data of a plurality of real data objects from its own data resources; or by simulation, using the information data of generated virtual data objects.

S202: determining a first distribution of target attributes according to the sample set; processing the sample set using the rule model to determine a second distribution of target attributes for a plurality of hit cases.

In some embodiments, while the first server sends the rule model to the second server, some information related to the rule model, which is allowed to be disclosed to the second server, for example, identification information of a rule set included in the rule model, identification information of attributes in the rule set, occurrence times of each attribute in the rule model, and the like, may also be sent to the second server as basic information of the rule model to assist the second server in performing security detection on the rule model. Accordingly, the second server can obtain the basic information of the rule model through the first server.

The identification information of the rule set may specifically be a name of the rule set, or may be a number of the rule set. The identification information of one rule set corresponds to one rule set. The identification information of the attribute may specifically be a name of the attribute, a number of the attribute, or the like. The identification information of one attribute corresponds to one attribute.

In some embodiments, after obtaining the rule model, the second server may also input the test sample set into the rule model for testing; and determining the identification information of the rule set contained in the rule model according to the processing result output by the rule model.

In some embodiments, in a case where the second server is allowed to split the rule model to obtain the rule set, the second server may first split a specific rule set from the rule model; further splitting the rule set into rules; and finally, acquiring the attributes appearing in the rule.

In some embodiments, the second server may determine the target attribute according to the identification information of the attributes in the rule set.

In some embodiments, the target attribute may include one attribute or a plurality of attributes. In the case where the target attribute includes multiple attributes, a target attribute set of the object may be constructed, which may be denoted as Xin.

In some embodiments, when implemented, the second server may determine the attribute appearing in the rule model as the target attribute according to the identification information of the attribute in the rule set.

In some embodiments, in specific implementation, the second server may further screen out, as target attributes, one or more attributes with higher importance or with higher attention of the user from the attributes appearing in the rule model according to specific situations and processing needs.

In some embodiments, the hit condition may be understood as the combination of rule sets in the rule model that a piece of sample data hits or does not hit.

In some embodiments, in specific implementation, the second server may determine, according to identification information of a rule set in the rule model, a rule set included in the rule model; and determining multiple possible hit conditions according to the rule set contained in the rule model.

Specifically, for example, if it is determined from the identification information of the rule sets in the rule model that the rule model contains only one rule set, rule set 1 (which may be denoted as RuleSet_1), it may be determined that 2^1 = 2 hit conditions exist, namely: hit rule set 1, and not hit rule set 1.

If it is determined from the identification information of the rule sets in the rule model that the rule model contains two rule sets, rule set 1 (RuleSet_1) and rule set 2 (which may be denoted as RuleSet_2), it can be determined that there are 2^2 = 4 hit conditions, namely: hit both rule set 1 and rule set 2, hit rule set 1 but not rule set 2, hit rule set 2 but not rule set 1, and hit neither rule set 1 nor rule set 2.

By analogy, if the rule model contains n rule sets, rule set 1 (RuleSet_1), rule set 2 (RuleSet_2), ..., rule set n (which may be denoted as RuleSet_n), then it can be determined that 2^n hit conditions exist.
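The enumeration of hit conditions can be sketched as follows. The function name `enumerate_hit_conditions` and the dictionary representation (rule-set identifier mapped to a hit flag) are assumptions made for this sketch, not part of the specification.

```python
from itertools import product
from typing import Dict, List


def enumerate_hit_conditions(rule_set_ids: List[str]) -> List[Dict[str, bool]]:
    """List every hit / not-hit combination over the given rule sets (2^n in total)."""
    return [
        dict(zip(rule_set_ids, flags))
        for flags in product([True, False], repeat=len(rule_set_ids))
    ]


conditions = enumerate_hit_conditions(["RuleSet_1", "RuleSet_2"])
for condition in conditions:
    print(condition)
print(len(conditions))  # 4
```

Because the number of hit conditions doubles with each additional rule set, a practical implementation might only materialize the hit conditions that actually occur in the sample set.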

In some embodiments, in specific implementation, the number of sample data corresponding to each data value of the target attribute in the sample set may be determined by performing data statistics on the data values of the target attribute appearing in the sample set; and then, according to the number of sample data corresponding to each data value of the target attribute in the sample set, determining the data value distribution of the target attribute in the sample set as the first distribution of the target attribute.

In some embodiments, in the case where the data value of the target attribute is discrete data, the sample set may be retrieved, and the data value of the target attribute of the sample object is obtained from the sample set; according to the sample set, counting the number of sample objects corresponding to each data value of the target attribute; and determining the data value distribution of the target attribute according to the number of the sample objects corresponding to each data value of the target attribute.

Specifically, for example, for the target attribute monthly income, retrieving the sample set shows that the monthly income takes three data values: 500 yuan, 1000 yuan and 2000 yuan. Data statistics on the sample set show that 20 users have a monthly income of 500 yuan, 50 users have a monthly income of 1000 yuan, and 30 users have a monthly income of 2000 yuan. The data values (500, 1000 and 2000) of the target attribute monthly income can therefore be determined to be distributed as 2:5:3.
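For discrete data values, the counting step described above can be sketched with `collections.Counter`. The function name `value_distribution` is hypothetical, and expressing the distribution as proportions rather than as a ratio is a choice made for this sketch.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def value_distribution(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Count the samples per data value and normalize the counts to proportions."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}


# Sample set from the text: 20 users at 500 yuan, 50 at 1000 yuan, 30 at 2000 yuan.
sample_incomes = [500] * 20 + [1000] * 50 + [2000] * 30
first_distribution = value_distribution(sample_incomes)
print(first_distribution)  # {500: 0.2, 1000: 0.5, 2000: 0.3}
```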

In some embodiments, in the case that the data value of the target attribute is continuous data, a sample set may be retrieved first, and a maximum value and a minimum value of the data value of the target attribute may be determined from the sample set; dividing a plurality of data value intervals between the maximum value and the minimum value of the target attribute according to a preset numerical interval; according to the sample set, respectively dividing sample objects (or sample data) into corresponding data value intervals; and counting and determining the data value distribution of the target attribute according to the number of the sample objects in each data value interval.
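For continuous data values, the interval-based counting can be sketched as follows. The function name `binned_distribution`, the equal-width intervals starting at the minimum, and the decision to place the maximum value in the last interval are assumptions of this sketch; the specification only states that intervals are divided between the minimum and maximum according to a preset numerical interval.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def binned_distribution(
    values: Sequence[float], interval: float
) -> Dict[Tuple[float, float], float]:
    """Split [min, max] into equal-width intervals and return the share of samples in each."""
    lo, hi = min(values), max(values)
    n_bins = max(1, math.ceil((hi - lo) / interval))
    counts: Counter = Counter()
    for value in values:
        # Clamp the index so that the maximum value falls into the last interval.
        index = min(int((value - lo) // interval), n_bins - 1)
        counts[index] += 1
    total = len(values)
    return {
        (lo + i * interval, lo + (i + 1) * interval): counts[i] / total
        for i in range(n_bins)
    }


binned = binned_distribution([0, 5, 10, 15, 20], interval=10)
print(binned)  # {(0, 10): 0.4, (10, 20): 0.6}
```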

In some embodiments, referring to fig. 3, a distribution of data values of the target attribute may be determined as a first distribution (which may also be referred to as an original distribution or an original first distribution) of the target attribute based on the sample set in the manner described above. The first distribution of the target attribute may be specifically understood as a distribution ratio of different data values of the target attribute without knowing the hit rule set.

The first distribution of the target attribute may be used to reflect the probability of guessing the data value of the target attribute of a sample object in the sample set without rule model processing.

In some embodiments, in specific implementation, sample data included in the sample set may be input into the rule model, the rule model is run, and the identification information of the rule set hit by each sample data is output as a processing result.

Further, the second server may assign each sample data to the sub-sample set of the corresponding hit condition according to the rule set hit by that sample data (or by the data object corresponding to it) in the processing result, thereby establishing sub-sample sets for multiple hit conditions. Then, for each hit condition, the data value distribution of the target attribute is determined through data statistics on the corresponding sub-sample set and is used as the second distribution of the target attribute under that hit condition.

Specifically, taking the sub-sample set under the current hit condition as an example, data statistics may be performed on that sub-sample set to determine each data value of the target attribute that it contains and to count the number of sample data corresponding to each data value. The data value distribution of the target attribute in the sub-sample set under the current hit condition is then determined from these counts and used as the second distribution of the target attribute under the current hit condition.

For example, suppose the rule model includes only one rule set, rule set 1, and rule set 1 includes only one rule: "the user's monthly income > 500 yuan". The current hit condition is that rule set 1 is hit. The target attribute is the user's monthly income. In this case, the user data included in the sub-sample set under the current hit condition is the user data in the sample set whose monthly income is greater than 500 yuan, that is, the user data with a monthly income of 2000 yuan (30 items in total) and the user data with a monthly income of 1000 yuan (50 items in total). From the sub-sample set under the current hit condition, the data values of the target attribute are determined to be 2000 yuan and 1000 yuan. Further, data statistics on the user data in this sub-sample set show that the distribution of the data values (1000 yuan and 2000 yuan) of the target attribute under the current hit condition is 5:3.

In this manner, a second distribution of target attributes for a plurality of hit conditions may be determined.
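The grouping of samples by hit condition and the per-group statistics can be sketched as follows. The names `second_distributions`, `rule_sets`, and `rule_set_1` are hypothetical, and treating a rule set as hit only when all of its rules match is an assumption made for this sketch.

```python
from collections import Counter, defaultdict


def second_distributions(samples, rule_sets, target):
    """Group samples by the rule sets they hit; return the target's distribution per group."""
    groups = defaultdict(list)
    for sample in samples:
        # Assumption: a rule set is hit when every rule in it matches the sample.
        hit = frozenset(
            name for name, rules in rule_sets.items()
            if all(rule(sample) for rule in rules)
        )
        groups[hit].append(sample[target])
    result = {}
    for hit, values in groups.items():
        counts = Counter(values)
        result[hit] = {v: c / len(values) for v, c in counts.items()}
    return result


# Monthly-income example: one rule set with the single rule "income > 500".
samples = [{"income": 500}] * 20 + [{"income": 1000}] * 50 + [{"income": 2000}] * 30
rule_sets = {"rule_set_1": [lambda s: s["income"] > 500]}
dists = second_distributions(samples, rule_sets, "income")
```

Here each hit condition is keyed by the set of rule-set names that were hit, so the empty set stands for samples that hit no rule set.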

In some embodiments, the second distribution of the target attribute may be specifically understood as a distribution ratio of different data values of the target attribute given the hit rule set. The second distribution of the target attribute can reflect the probability of guessing the data value of the target attribute of a sample object under the condition of determining the rule set hit by the sample object after the rule model processing.

S203: determining safety indication parameters under multiple hit conditions according to the first distribution of the target attribute and the second distribution of the target attribute under the multiple hit conditions.

In some embodiments, the above-mentioned safety indication parameter may be understood as an indication parameter for characterizing a degree of difference between the first distribution of the target property and the second distribution of the target property in the case of multiple hits, respectively.

In some embodiments, the safety indication parameter may specifically include at least one of: information entropy difference, Gini index difference, purity difference, KL divergence, and the like. Of course, it should be noted that the safety indication parameters listed above are only illustrative examples. In specific implementation, other suitable parameters may also be introduced as safety indication parameters according to the specific situation.

Generally, the larger the value of the safety indication parameter for a given hit condition, the more the second distribution of the target attribute under that hit condition differs from the first distribution of the target attribute. Accordingly, once the sample data has been processed by the rule model and is known to belong to that hit condition, more data information may be leaked to the model generator or other third parties. When the value of the safety indication parameter exceeds a certain threshold (e.g., a preset safety threshold), the data information leaked by running the rule model may exceed the tolerance range and seriously damage the data resources of the data provider. In this case, the rule model can be judged to have a security risk.

Conversely, the smaller the value of the safety indication parameter for a given hit condition (possibly even 0), the less the second distribution of the target attribute under that hit condition differs from the first distribution of the target attribute. Correspondingly, once the sample data has been processed by the rule model and is known to belong to that hit condition, less data information is leaked to the model generator or other third parties. When the value of the safety indication parameter equals or approaches 0, it may be considered that no additional data information is leaked to the model generator or other third parties even when the sample data is known to belong to that hit condition. When the value of the safety indication parameter is less than or equal to a certain threshold (e.g., a preset safety threshold), the data information leaked by running the rule model is relatively small and does not exceed the tolerance range. In this case, the rule model can be judged not to have a security risk.

In some embodiments, the following describes how to determine the safety indication parameters in the multiple hit situations according to the first distribution of the target attributes and the second distribution of the target attributes in the multiple hit situations, by taking the example that the safety indication parameters include the information entropy difference. The information entropy may be specifically understood as a parameter used for measuring the amount of information. Generally, a larger value of the information entropy indicates a larger amount of information carried.

In some embodiments, in specific implementation, a probability value of each data value of the target attribute in the sample set may be determined according to the first distribution of the target attribute; and calculating the information entropy of the target attribute based on the sample set according to the probability value of each data value of the target attribute, wherein the information entropy is used as the first information entropy.

Specifically, for example, the target attribute X includes m different data values, which may be denoted as $v_1, v_2, \ldots, v_m$, respectively. The probability values of these data values determined from the first distribution of X are $p_1, p_2, \ldots, p_m$, respectively. From the probability values of the data values of X, the first information entropy can be calculated according to the following equation: $E = -\sum_i p_i \log p_i$. Here $E$ denotes the first information entropy, and $p_i$ denotes the probability value of the data value of X numbered $i$, where $1 \le i \le m$.

Taking the information entropy of the target attribute under the current hit condition in the multiple hit conditions as an example, the probability value of each data value of the target attribute in the subsample set under the current hit condition can be determined; and calculating the information entropy of the target attribute based on the sub-sample set under the current hit condition according to the probability value of each data value of the target attribute, and taking the information entropy as a second information entropy.

Specifically, for example, the target attribute X under the current hit condition includes $m'$ different data values, which may be denoted as $v_1', v_2', \ldots, v_{m'}'$, respectively. The probability values of these data values determined from the second distribution of X under the current hit condition are $p_{1'}, p_{2'}, \ldots, p_{m'}$, respectively. From the probability values of the data values of X, the second information entropy corresponding to the current hit condition can be calculated according to the following equation: $E' = -\sum_{i'} p_{i'} \log p_{i'}$. Here $E'$ denotes the second information entropy under the current hit condition, and $p_{i'}$ denotes the probability value of the data value of X numbered $i'$ under the current hit condition, where $1 \le i' \le m'$.

In a similar manner, the second information entropies under other hit conditions can be respectively calculated according to the subsamples under other hit conditions, so that the second information entropies under multiple hit conditions are obtained.

Further, the difference between the first information entropy and the second information entropy under multiple kinds of hit conditions is respectively obtained, and the information entropy difference under multiple kinds of hit conditions is obtained and used as a safety indication parameter under multiple kinds of hit conditions.

Specifically, for example, taking the calculation of the information entropy difference in the current hit case as an example, the information entropy difference may be calculated according to the following equation: E-E'. Wherein E is expressed as a first information entropy, and E' is expressed as a second information entropy under the current hit condition. And then the information entropy difference can be determined as a safety indication parameter under the current hit condition.

In a similar manner, information entropy differences between the first information entropy and the second information entropy in other hit situations can be respectively calculated, so that the safety indication parameters in multiple hit situations can be obtained.
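The entropy-difference computation can be sketched as follows for the monthly-income example. The sketch assumes the standard Shannon entropy with a base-2 logarithm (the text does not state the base), and the names `entropy`, `second_by_hit`, and `entropy_diff` are hypothetical.

```python
import math


def entropy(distribution):
    """Shannon entropy -sum(p * log2(p)) of a {value: probability} mapping."""
    # Assumption: base-2 logarithm; zero-probability values contribute nothing.
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# First distribution (whole sample set) and second distributions per hit condition.
first = {500: 0.2, 1000: 0.5, 2000: 0.3}
second_by_hit = {
    "hit rule set 1": {1000: 0.625, 2000: 0.375},
    "no hit": {500: 1.0},
}

e_first = entropy(first)
# Safety indication parameter per hit condition: E - E'.
entropy_diff = {hit: e_first - entropy(dist) for hit, dist in second_by_hit.items()}
```

In this example the "no hit" condition yields the larger difference, because knowing that a user did not hit rule set 1 pins the monthly income down to a single value.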

It should be understood, of course, that the above-listed manner for determining the safety indication parameter is only an exemplary illustration. In specific implementation, according to specific situations and processing requirements, other suitable manners may also be adopted to calculate the safety indication parameters under various hit conditions. The present specification is not limited to these.

S204: determining whether the rule model has a security risk according to the safety indication parameters under the multiple hit conditions.

In some embodiments, in specific implementation, the safety indication parameters in the multiple hit situations may be compared with a preset safety threshold, respectively, to determine whether the safety indication parameters in the various hit situations are greater than the preset safety threshold. And under the condition that at least one of the safety indication parameters under the multiple hit conditions is determined to be greater than a preset safety threshold, determining that the rule model has safety risks.

In some embodiments, the specific value of the preset safety threshold may be determined according to factors such as sensitivity of change of the data value of the target attribute, and tolerance of error.

Specifically, for example, for some application scenarios with higher accuracy requirements, the tolerance for errors is usually small, and the tolerance range is also relatively small; meanwhile, if the target attribute data value has a small variation range and high sensitivity, the preset safety threshold value can be set to be relatively small, so that the variation of the data value distribution of the target attribute can be sensitively and accurately found.

In contrast, for some application scenarios with low precision requirements, the tolerance for errors is usually large, and the tolerance range is also relatively large; meanwhile, if the target attribute data value has a large variation range and low sensitivity, the preset safety threshold value can be set to be relatively large, so that the false alarm rate is reduced.

In some embodiments, when it is determined in the above manner that the safety indication parameters under the multiple hit conditions are all less than or equal to the preset safety threshold, it may be determined that the rule model does not have a security risk. In this case, the second server may run the rule model normally using the data resources it owns.
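The threshold comparison described above amounts to a single check over all hit conditions. A minimal sketch, in which the function name `has_security_risk` and the example values are hypothetical:

```python
def has_security_risk(indicators, threshold):
    """True if the indicator of at least one hit condition exceeds the preset threshold."""
    return any(value > threshold for value in indicators.values())


# Hypothetical indicator values for two hit conditions.
risky = has_security_risk({"hit rule set 1": 0.2, "no hit": 0.9}, threshold=0.5)
safe = has_security_risk({"hit rule set 1": 0.2, "no hit": 0.5}, threshold=0.5)
```

A value exactly equal to the threshold is treated as acceptable, matching the "less than or equal to" wording of the text.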

In some embodiments, in a case where it is determined that the rule model does not have a security risk, in a specific implementation, the second server may retrieve, from an owned data resource (e.g., a database), information data matching an identity of the data object according to the identity of the data object (e.g., an identity ID of a user) carried in the data processing request. And inputting the information data into a rule model, operating the rule model, and outputting the identification information of the rule set hit by the data object as a processing result.

And the second server may feed back the processing result to the first server. The first server may perform corresponding data processing according to the processing result. For example, the first server may determine the specific credit risk of the user according to a preset credit risk rating rule according to a rule set hit by the user object in the processing result.

In some embodiments, when it is determined in the above manner that the safety indication parameter under at least one hit condition is greater than the preset safety threshold, it may be determined that the rule model has a security risk. In this case, the second server may refuse to run the rule model using the data resources it owns, so as to avoid leaking those data resources, or leaking information data beyond the tolerance range, when the rule model is run. The data security of the data provider can thus be effectively protected, and the risk that the data resources owned by the data provider are leaked is reduced.

In the embodiment, a first distribution of the target attribute is determined according to the sample set; meanwhile, processing the sample set by using a rule model to determine a second distribution of the target attributes under various hit conditions; then according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions, calculating a safety indication parameter capable of reflecting the difference degree between the second distribution of the target attributes under multiple hit conditions and the original first distribution; and determining whether the rule model has a safety risk according to the safety indication parameters. Therefore, the difference degree between the second distribution and the first distribution of the target attributes under various hit conditions can be quantified by utilizing the safety indication parameters, whether the rule model has a safety risk or not can be determined more accurately according to the safety indication parameters, and the risk of data leakage when the rule model is operated by a data provider is reduced.

In some embodiments, the safety indication parameter may specifically include at least one of: information entropy difference, Gini index difference, purity difference, KL divergence, and the like. Of course, it should be noted that the safety indication parameters listed above are only illustrative examples. In specific implementation, other parameters may also be introduced as safety indication parameters according to the specific situation and processing requirements. For example, parameters such as EMD distance, Gini index ratio, information entropy gain, and purity ratio may also be introduced as safety indication parameters. The present specification is not limited thereto.

In some embodiments, when the safety indication parameter includes an information entropy difference, determining the safety indication parameters under multiple hit conditions according to the first distribution of the target attribute and the second distribution of the target attribute under the multiple hit conditions may, in specific implementation, include the following: calculating the information entropy of the target attribute from the first distribution of the target attribute as a first information entropy; calculating the information entropy of the target attribute under each hit condition from the second distribution of the target attribute under that hit condition, to obtain second information entropies under the multiple hit conditions; and subtracting the second information entropy under each hit condition from the first information entropy to obtain multiple difference values, which are used as the information entropy differences under the multiple hit conditions, namely the safety indication parameters under the multiple hit conditions.

In some embodiments, the safety indication parameter may further include an information entropy ratio. When the safety indication parameter includes an information entropy ratio, after the first information entropy and the second information entropies under multiple hit conditions are calculated, the first information entropy may be divided by the second information entropy under each hit condition to obtain multiple quotient values, which are used as the information entropy ratios under the multiple hit conditions, namely the safety indication parameters under the multiple hit conditions. For example, the information entropy ratio under the current hit condition may be denoted as $E/E'$, where $E$ denotes the first information entropy and $E'$ denotes the second information entropy under the current hit condition.

In some embodiments, the safety indication parameter may further include an information entropy gain. Under the condition that the safety indication parameters comprise information entropy gains, besides calculating a first information entropy and a second information entropy under various hit conditions, determining a third information entropy under various hit conditions according to a complementary set of a sub sample set under various hit conditions; and then, the first information entropy, and the second information entropy and the third information entropy under various hit conditions are integrated to calculate the information entropy gain under various hit conditions, and the information entropy gain is used as a safety indication parameter under the corresponding hit conditions.

Specifically, take the current hit condition among the multiple hit conditions as an example. The sample set is SampleSet, and the sub-sample set under the current hit condition is SampleSet_1. Correspondingly, the complement of the sub-sample set under the current hit condition is SampleSet_2 = SampleSet - SampleSet_1. A first information entropy $E$ is calculated from the sample set, and a second information entropy $E'$ under the current hit condition is calculated from the sub-sample set under the current hit condition. Further, a third information entropy (which may be denoted as $E''$) under the current hit condition may be determined through data statistics on the complement of the sub-sample set (e.g., SampleSet_2) under the current hit condition.

In some embodiments, the third information entropy ($E''$) may also be determined using the sub-sample set of the condition opposite to the current hit condition, instead of the complement of the sub-sample set under the current hit condition described above. For example, the rule model includes 3 different rule sets: rule set 1, rule set 2, and rule set 3. The current hit condition (denoted as hit condition 1) is that only rule set 1 is hit and rule sets 2 and 3 are not. Another hit condition (denoted as hit condition 2) is that only rule sets 2 and 3 are hit and rule set 1 is not. In this case, hit condition 2 may be used as the opposite of the current hit condition, and the third information entropy under the current hit condition is determined through data statistics on the sub-sample set of hit condition 2.

Specifically, the third information entropy under the current hit condition is determined in a manner similar to the second information entropy under the current hit condition. Based on SampleSet_2, the target attribute X may be determined to include $m''$ different data values, which may be denoted as $v_1'', v_2'', \ldots, v_{m''}''$, respectively. The data value distribution of X is determined from SampleSet_2, and the probability values of the data values are then determined from that distribution as $p_{1''}, p_{2''}, \ldots, p_{m''}$, respectively. From the probability values of the data values of X, the third information entropy corresponding to the current hit condition can be calculated according to the following equation: $E'' = -\sum_{i''} p_{i''} \log p_{i''}$. Here $E''$ denotes the third information entropy under the current hit condition, and $p_{i''}$ denotes the probability value of the data value of X numbered $i''$ under the current hit condition, where $1 \le i'' \le m''$.

Further, the information entropy gain under the current hit condition may be determined from the first information entropy, the second information entropy under the current hit condition, and the third information entropy under the current hit condition according to the following formula:

The information entropy gain is then used as a safety indication parameter under the current hit condition.

Here $N$ denotes the number of sample data in the sample set, $N_1$ denotes the number of sample data in the sub-sample set under the current hit condition, and $N_2$ denotes the number of sample data in the complement of the sub-sample set under the current hit condition.
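The formula itself is not reproduced in this text. The sketch below assumes the conventional information-gain form $E - \frac{N_1}{N}E' - \frac{N_2}{N}E''$, which uses exactly the quantities defined above; this form, the base-2 logarithm, and the names `entropy_of` and `entropy_gain` are assumptions of the sketch rather than statements of the specification.

```python
import math
from collections import Counter


def entropy_of(values):
    """Shannon entropy (base 2, assumed) of the empirical distribution of values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def entropy_gain(all_values, hit_values):
    """Assumed form: E - (N1/N) * E' - (N2/N) * E''."""
    # Complement of the sub-sample set, taken as a multiset difference.
    complement = list((Counter(all_values) - Counter(hit_values)).elements())
    n, n1, n2 = len(all_values), len(hit_values), len(complement)
    gain = entropy_of(all_values)
    if n1:
        gain -= n1 / n * entropy_of(hit_values)
    if n2:
        gain -= n2 / n * entropy_of(complement)
    return gain


# Monthly-income example: the sub-sample set holds the users with income > 500.
all_incomes = [500] * 20 + [1000] * 50 + [2000] * 30
hit_incomes = [1000] * 50 + [2000] * 30
gain = entropy_gain(all_incomes, hit_incomes)
```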

In some embodiments, the safety indication parameter may further include an information entropy gain ratio. In a case that the security indication parameter includes an information entropy gain ratio, the information entropy gain ratio may be determined according to the following equation based on a first information entropy, a second information entropy in a current hit situation, and a third information entropy in the current hit situation:

In some embodiments, the safety indication parameter may further include a Gini index difference. When the safety indication parameter includes the Gini index difference, determining the safety indication parameters under multiple hit conditions according to the first distribution of the target attribute and the second distribution of the target attribute under the multiple hit conditions may, in specific implementation, include the following: determining the probability value of each data value of the target attribute in the sample set from the first distribution of the target attribute, and calculating the Gini index of the target attribute from these probability values as a first Gini index; determining the probability value of each data value of the target attribute under each hit condition from the second distribution of the target attribute under that hit condition, and calculating the Gini index of the target attribute under each hit condition from these probability values as the second Gini index under the corresponding hit condition; and subtracting the second Gini index under each hit condition from the first Gini index to obtain multiple difference values, which are used as the Gini index differences under the multiple hit conditions, namely the safety indication parameters under the multiple hit conditions.

In some embodiments, the aforementioned Gini index may be understood as a parameter describing the probability that sample data is misclassified. Generally, the smaller the value of the Gini index, the smaller the probability that sample data in the set is misclassified, and accordingly the purer the sample data in the set.

In some embodiments, the first Gini index may be calculated from the first distribution of the target attribute according to the following equation: $Gini = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$. Here $p_i$ denotes the probability value of the data value of the target attribute X numbered $i$, determined from the sample set.

In some embodiments, taking the current hit condition as an example, the second Gini index may be calculated from the second distribution of the target attribute under the current hit condition according to the following equation: $Gini' = \sum_{i'} p_{i'} (1 - p_{i'}) = 1 - \sum_{i'} p_{i'}^2$. Here $p_{i'}$ denotes the probability value of the data value of the target attribute X numbered $i'$, determined from the sub-sample set under the current hit condition.

In some embodiments, in specific implementation, the Gini index difference under the current hit condition may be calculated according to the following equation: $Gini - Gini'$. The Gini index difference is then used as the safety indication parameter under the current hit condition.

In some embodiments, the safety indication parameter may further include a Gini index ratio. When the safety indication parameter includes a Gini index ratio, in specific implementation, the first Gini index may be divided by the second Gini index under each hit condition to obtain multiple quotient values, which are used as the Gini index ratios under the multiple hit conditions, namely the safety indication parameters under the multiple hit conditions.

Specifically, taking the current hit condition as an example, the Gini index ratio under the current hit condition may be calculated according to the following formula: $Gini / Gini'$. This ratio is used as the safety indication parameter under the current hit condition.
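The Gini index, the Gini index difference, and the Gini index ratio can be sketched as follows for the monthly-income example. The function names are hypothetical, and returning infinity when the second Gini index is 0 is an assumption of this sketch, since the text does not address that case.

```python
import math


def gini(distribution):
    """Gini index 1 - sum(p_i^2) of a {value: probability} mapping."""
    return 1.0 - sum(p * p for p in distribution.values())


def gini_difference(first, second):
    """Gini - Gini' for one hit condition."""
    return gini(first) - gini(second)


def gini_ratio(first, second):
    """Gini / Gini' for one hit condition."""
    g2 = gini(second)
    # Assumption: a perfectly pure sub-sample set (Gini' = 0) gives an infinite ratio.
    return math.inf if g2 == 0 else gini(first) / g2


first = {500: 0.2, 1000: 0.5, 2000: 0.3}
second = {1000: 0.625, 2000: 0.375}
diff = gini_difference(first, second)
ratio = gini_ratio(first, second)
```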

In some embodiments, the safety indication parameter may further include a purity difference. When the safety indication parameter includes a purity difference, in specific implementation, after the first Gini index and the second Gini indices under multiple hit conditions are determined, a third Gini index (which may be denoted as $Gini''$) under each hit condition may be calculated from the complement of the sub-sample set under that hit condition.

Specifically, taking the current hit condition as an example, the third Gini index under the current hit condition may be calculated according to the following formula: $Gini'' = \sum_{i''} p_{i''} (1 - p_{i''}) = 1 - \sum_{i''} p_{i''}^2$. Here $p_{i''}$ denotes the probability value of the data value of the target attribute X numbered $i''$, determined from the complement of the sub-sample set under the current hit condition.

The purity difference under each hit condition is then calculated from the first Gini index, the second Gini index under that hit condition, and the third Gini index under that hit condition.

Specifically, the purity difference in the current hit case can be calculated according to the following equation:

Here $N$ denotes the number of sample data in the sample set, $N_1$ denotes the number of sample data in the sub-sample set under the current hit condition, and $N_2$ denotes the number of sample data in the complement of the sub-sample set under the current hit condition.

The purity difference under various hit conditions can be calculated as a safety indication parameter under various hit conditions.
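The purity-difference equation is not reproduced in this text. The sketch below assumes a form analogous to the usual Gini gain, $Gini - \frac{N_1}{N}Gini' - \frac{N_2}{N}Gini''$, built only from the quantities defined above; this form and the names `gini_from_counts` and `purity_difference` are assumptions of the sketch.

```python
def gini_from_counts(counts):
    """Gini index of a {value: count} mapping; 0.0 for an empty set."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def purity_difference(all_counts, hit_counts):
    """Assumed form: Gini - (N1/N) * Gini' - (N2/N) * Gini''."""
    # Complement of the sub-sample set, value by value.
    complement = {v: c - hit_counts.get(v, 0) for v, c in all_counts.items()}
    n = sum(all_counts.values())
    n1 = sum(hit_counts.values())
    n2 = n - n1
    return (
        gini_from_counts(all_counts)
        - n1 / n * gini_from_counts(hit_counts)
        - n2 / n * gini_from_counts(complement)
    )


# Monthly-income example: 80 users hit rule set 1, 20 do not.
purity = purity_difference({500: 20, 1000: 50, 2000: 30}, {1000: 50, 2000: 30})
```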

In some embodiments, the safety indication parameter may also include a purity ratio. When the safety indication parameter includes a purity ratio, taking the current hit condition as an example, the purity ratio under the current hit condition may, in specific implementation, be calculated according to the following equation:

In the above manner, the purity ratios under multiple hit conditions can be determined as the safety indication parameters under the multiple hit conditions.

In some embodiments, the safety indication parameter may also include a KL divergence (which may also be referred to as a Kullback-Leibler divergence). The KL divergence may be understood as an asymmetric measure of the difference (or relative entropy) between two distributions (e.g., the first distribution of the target attribute and the second distribution under a hit condition).

In some embodiments, in the case that the safety indication parameter includes a KL divergence, taking the current hit case as an example, the KL divergence in the current hit case may be calculated according to the following equation:

Here $p_i$ denotes the probability value of the data value numbered $i$ of the target attribute X, determined from the sub-sample set under the current hit condition, and $p_{i'}$ denotes the probability value of the data value numbered $i'$ of the target attribute X, determined from the complement of the sub-sample set under the current hit condition.

In the above manner, KL divergences in a plurality of hit situations can be calculated respectively as the safety indication parameters in a plurality of hit situations.
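The KL divergence equation is not reproduced in this text. The sketch below assumes the conventional definition $D_{KL}(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}$ with a base-2 logarithm; the function name `kl_divergence` is hypothetical, and which two distributions are passed in is left to the caller. The usage example compares the second distribution of the monthly-income example against the first distribution, which is an illustrative choice.

```python
import math


def kl_divergence(p, q):
    """sum_i p_i * log2(p_i / q_i); infinite if q gives zero mass to a value p uses."""
    total = 0.0
    for value, p_i in p.items():
        if p_i == 0:
            continue  # zero-probability terms contribute nothing
        q_i = q.get(value, 0.0)
        if q_i == 0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total


second = {1000: 0.625, 2000: 0.375}
first = {500: 0.2, 1000: 0.5, 2000: 0.3}
kl = kl_divergence(second, first)
```

Because the measure is asymmetric, swapping the arguments generally gives a different value; here the reverse direction is infinite, since the first distribution assigns mass to 500 yuan and the second does not.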

In some embodiments, the safety indication parameter may further include an EMD distance. The EMD (Earth Mover's Distance) represents the minimum cost of transforming one distribution into another, and may be used to describe the distance between two distributions (e.g., the first distribution of the target attribute and the second distribution under a hit condition).

In some embodiments, in the case that the safety indication parameter includes an EMD distance, taking the current hit case as an example, it may be determined whether the target attribute is a non-numeric attribute or a numeric attribute according to the data value of the target attribute.

A numeric attribute may be understood as an attribute whose data values are numbers, such as body temperature, blood pressure, or number of defaults. A non-numeric attribute may be understood as an attribute whose data values are not numbers, such as native place, sex, or occupation.

In the case where the target attribute is determined to be a non-numeric attribute, the EMD distance in the current hit case may be calculated according to the following equation:

Here $p_i$ denotes the probability value of the data value numbered $i$ of the target attribute X, determined from the sub-sample set under the current hit condition, and $p_{i'}$ denotes the probability value of the data value numbered $i'$ of the target attribute X, determined from the complement of the sub-sample set under the current hit condition.

In the case where the target attribute is determined to be a numeric attribute, the EMD distance in the current hit case may be calculated according to the following equation:

where $r_i = p_i - p_{i'}$.

In specific implementation, the EMD distances under various hit conditions can be calculated according to the above-mentioned method, and used as the safety indication parameters under various hit conditions.
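The equations for the two cases are not reproduced in this text. As a hedged sketch only, the following assumes the forms commonly used for EMD over discrete distributions: for a non-numeric attribute, half the sum of absolute probability differences (equal ground distance between any two values); for a numeric attribute with m sorted data values, the sum of the absolute running totals of $r_i$ divided by $m - 1$ (ordered ground distance). The specification's own equations may differ, and the function names are hypothetical.

```python
def emd_non_numeric(p, q):
    """Assumed equal-distance form: 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

def emd_numeric(p, q):
    """Assumed ordered-distance form over m sorted values:
    (1 / (m - 1)) * sum_i |r_1 + ... + r_i|, with r_i = p_i - q_i.
    """
    m = len(p)
    if m < 2:
        return 0.0  # a single data value leaves nothing to move
    running, total = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        running += p_i - q_i   # running sum of r_i
        total += abs(running)
    return total / (m - 1)

# Hypothetical distributions over three data values (sorted when numeric).
subsample = [0.5, 0.5, 0.0]
complement = [0.0, 0.5, 0.5]
emd_categorical = emd_non_numeric(subsample, complement)
emd_ordered = emd_numeric(subsample, complement)
```

Under the assumed numeric form, moving probability mass between distant values costs more than moving it between neighbouring values, which the non-numeric form does not distinguish.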

In some embodiments, the determining whether the rule model has a security risk according to the security indication parameters in multiple hit situations may include the following steps: comparing the safety indication parameters under various hit conditions with preset safety threshold values respectively; and under the condition that at least one of the safety indication parameters under the multiple hit conditions is determined to be greater than a preset safety threshold, determining that the rule model has safety risks. The specific value of the preset safety threshold can be determined according to the sensitivity of the data value change of the target attribute, the error tolerance and other factors.

In some embodiments, the target attribute may specifically include a plurality of attributes. In particular, a first distribution of the plurality of attributes and a second distribution of the plurality of attributes in a plurality of hit cases may be determined with reference to an embodiment based on one target attribute. When the current hit condition in multiple hit conditions is processed, the first distribution of the multiple attributes may be compared with the second distribution of the multiple attributes in the current hit condition, and when it is found that the difference degree between the first distribution of at least one attribute and the second distribution of the same attribute in the current hit condition is greater than a preset safety threshold, it may be determined that the rule model has a safety risk. The preset safety threshold values corresponding to different attributes may be different.

In some embodiments, it is also considered that, in a specific application scenario, although the data value of the target attribute of a data object is relatively easy to guess in some hit cases, the number of data objects belonging to such hit cases may be relatively small. For example, only a very small portion of the sample data in the sample set may fall into such a hit case. The amount of information leaked through that portion of the sample data is then relatively small and remains within a tolerable range, so the security of the rule model is acceptable.

In some embodiments, in order to determine the security of the rule model more accurately, when it is determined that the security indication parameter in at least one of the multiple hit situations is greater than the preset security threshold, the method may further include: determining the hit condition of the safety indication parameter larger than a preset safety threshold as a risk hit condition; counting the proportion of the sample data belonging to the risk hit condition in the sample set as the proportion of the risk sample; comparing the risk sample proportion with a preset proportion threshold; and determining that the rule model has a security risk under the condition that the risk sample proportion is determined to be larger than the preset proportion threshold.

In some embodiments, the preset proportion threshold may be determined according to a tolerance. For example, the preset proportion threshold may be 30%. The present specification does not limit the specific value of the preset proportion threshold.

In some embodiments, when the risk sample proportion is determined to be greater than the preset proportion threshold, it may be concluded that, when the rule model is run for actual data processing, a relatively large number of data objects will fall into risk hit cases, and more information is therefore likely to be leaked. The impact on the data security of the data provider's data resources is relatively large and exceeds the tolerable range. In this case, it can be determined that the rule model has a security risk.

In some embodiments, when the risk sample proportion is determined to be less than or equal to the preset proportion threshold, it may be concluded that, when the rule model is run for actual data processing, only a relatively small number of data objects will fall into risk hit cases, and the amount of leaked information is therefore relatively small. The impact on the data security of the data provider's data resources is relatively small and within the tolerable range. In this case, it can be determined that the rule model does not have a security risk.
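The two-stage decision described above (safety threshold first, risk sample proportion second) can be sketched as follows. This is a minimal illustration: the function name `assess_rule_model`, the hit-case labels, the safety threshold of 0.5, and the sample counts are hypothetical; only the 30% proportion threshold is taken from the example in the text.

```python
def assess_rule_model(params_by_hit_case, samples_by_hit_case,
                      safety_threshold, proportion_threshold=0.30):
    """Return (has_risk, risk_hit_cases, risk_proportion).

    params_by_hit_case: safety indication parameter for each hit case.
    samples_by_hit_case: number of sample data divided into each hit case.
    """
    # Stage 1: hit cases whose parameter exceeds the preset safety threshold.
    risk_hit_cases = [case for case, value in params_by_hit_case.items()
                      if value > safety_threshold]
    # Stage 2: proportion of the sample set belonging to those hit cases.
    total = sum(samples_by_hit_case.values())
    risk_samples = sum(samples_by_hit_case[case] for case in risk_hit_cases)
    risk_proportion = risk_samples / total if total else 0.0
    has_risk = risk_proportion > proportion_threshold
    return has_risk, risk_hit_cases, risk_proportion

params = {"hit_A": 0.8, "hit_B": 0.1, "hit_C": 0.05}
few_risky = assess_rule_model(
    params, {"hit_A": 50, "hit_B": 600, "hit_C": 350}, safety_threshold=0.5)
many_risky = assess_rule_model(
    params, {"hit_A": 400, "hit_B": 300, "hit_C": 300}, safety_threshold=0.5)
```

In the first call the risk hit case covers 5% of the samples, so the rule model is treated as acceptable; in the second call it covers 40%, so the rule model is treated as having a security risk.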

In some embodiments, in the case that the rule model is determined to be at risk, it may be further determined more finely which rule in the rule model is at risk.

In specific implementation, the method may further include: retrieving the rule sets hit in the risk hit case, and determining, from those rule sets, the rule set containing the target attribute as a risk rule set.

The risk rule set may be specifically understood as a rule set which has a security risk and may reveal data values about target attributes owned by a data provider.

In some embodiments, in a case that the rule model is allowed to be disassembled to obtain a specific rule in the rule model, in order to reduce data processing amount and more efficiently discover the rule model with a security risk, the method may further include the following steps when implemented: detecting whether a rule set in the rule model contains a preset operator or not; and determining that the rule model has a security risk in the case that at least one rule set in the rule model contains a preset operator.

The preset operator may be understood as an operator with a relatively high probability of revealing data information, for example the equals sign ("=") or the approximately-equals sign ("≈"). Of course, the preset operators listed above are only illustrative. In specific implementation, other suitable operators may also be introduced as preset operators according to the specific situation and processing requirements. The present specification is not limited in this respect.

In some embodiments, when it is detected that at least one rule set in the rule model contains a preset operator, it may be directly determined that the rule model has a security risk. There is then no need to spend processing time and processing resources on calculating the safety indication parameters under multiple hit conditions, so the data processing amount can be effectively reduced.

When no rule set in the rule model contains a preset operator, whether the rule model has a security risk can then be determined by calculating the safety indication parameters under multiple hit conditions. Because this calculation is performed only when the operator check finds nothing, the data processing amount can be effectively reduced.
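A minimal sketch of the operator pre-check follows. The representation of a rule model as a list of rule sets, each a list of rules with `attribute`, `operator`, and `value` fields, is a hypothetical data layout chosen for illustration; the specification does not prescribe one. The two operators are the examples named in the text.

```python
# Preset operators taken from the examples in the text; others may be added.
PRESET_OPERATORS = {"=", "≈"}

def contains_preset_operator(rule_model):
    """True if at least one rule in any rule set uses a preset operator."""
    return any(rule["operator"] in PRESET_OPERATORS
               for rule_set in rule_model
               for rule in rule_set)

# Hypothetical rule models.
model_with_equals = [
    [{"attribute": "age", "operator": ">", "value": 60}],
    [{"attribute": "number_of_defaults", "operator": "=", "value": 3}],
]
model_without = [
    [{"attribute": "age", "operator": ">", "value": 60},
     {"attribute": "blood_pressure", "operator": "<", "value": 140}],
]

flagged = contains_preset_operator(model_with_equals)
not_flagged = contains_preset_operator(model_without)
```

Only when this check returns false would the distribution-based calculation be needed.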

In some embodiments, after determining whether the rule model has a security risk according to the first distribution of the target attribute and the second distribution of the target attribute in the multiple hit cases, the method may further include the following: in the event that the rule model is determined to have a security risk, risk prompt information may be generated. The risk prompt information is used to prompt the data provider to refuse to run the rule model. Correspondingly, the second server can refuse to run the rule model according to the risk prompt information, which reduces the risk of the data provider's data resources being leaked by running the rule model.

In some embodiments, security prompt information may be generated upon determining, based on the first distribution of the target attribute and the second distribution of the target attribute in the multiple hit cases, that the rule model has no security risk. Correspondingly, the second server can, according to the security prompt information, run the rule model normally on the data resources it owns and obtain the corresponding processing result, and then send the processing result to the first server to be fed back to the model generator.

In some embodiments, the determining a first distribution of the target attribute according to the sample set may include, in specific implementation, the following: counting the number of sample data of each data value of the target attribute in the sample set; and determining the distribution of each data value of the target attribute in the sample set according to the sample data quantity of each data value of the target attribute in the sample set, wherein the distribution is used as the first distribution of the target attribute.
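The counting step described above can be sketched as follows, assuming each sample data item is a dictionary keyed by attribute name. The function name `first_distribution` and the example records are hypothetical.

```python
from collections import Counter

def first_distribution(sample_set, target_attribute):
    """Relative frequency of each data value of the target attribute."""
    # Count the sample data for each data value of the target attribute.
    counts = Counter(sample[target_attribute] for sample in sample_set)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalise the counts into a distribution over data values.
    return {value: count / total for value, count in counts.items()}

# Hypothetical sample set with one non-numeric target attribute.
samples = [
    {"occupation": "teacher"},
    {"occupation": "teacher"},
    {"occupation": "driver"},
    {"occupation": "teacher"},
]
occupation_distribution = first_distribution(samples, "occupation")
```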

In some embodiments, the processing the sample set by using the rule model to determine the second distribution of the target attribute in the multiple hit situations may include the following: processing a plurality of sample data in the sample set by using the rule model to obtain processing results of the plurality of sample data; the processing result comprises identification information of a rule set hit by the sample data; determining a rule set hit by the sample data according to the processing results of the plurality of sample data; dividing the plurality of sample data into a plurality of sub-sample sets under the hit condition according to the rule set hit by the sample data; and determining the data value distribution of the target attributes under various hit conditions according to the subsample set under various hit conditions, and taking the data value distribution as a second distribution of the target attributes under various hit conditions.
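As a non-limiting sketch of the steps above, the rule model is represented here by a callable that returns the identification information of the rule set hit by a sample. That callable (`toy_rule_model`), the attribute names, and the example records are hypothetical stand-ins, not part of the specification.

```python
from collections import Counter, defaultdict

def second_distributions(sample_set, run_rule_model, target_attribute):
    """Distribution of the target attribute within each hit case."""
    # Divide the sample data into subsample sets by the rule set they hit.
    subsample_sets = defaultdict(list)
    for sample in sample_set:
        subsample_sets[run_rule_model(sample)].append(sample)
    # Compute the data value distribution inside each subsample set.
    distributions = {}
    for hit_id, subsamples in subsample_sets.items():
        counts = Counter(s[target_attribute] for s in subsamples)
        distributions[hit_id] = {value: count / len(subsamples)
                                 for value, count in counts.items()}
    return distributions

def toy_rule_model(sample):
    # Hypothetical rule model with two rule sets, split on age.
    return "rule_set_1" if sample["age"] >= 60 else "rule_set_2"

samples = [
    {"age": 65, "defaulted": "yes"},
    {"age": 70, "defaulted": "yes"},
    {"age": 30, "defaulted": "no"},
    {"age": 40, "defaulted": "yes"},
    {"age": 25, "defaulted": "no"},
]
by_hit_case = second_distributions(samples, toy_rule_model, "defaulted")
```

In this toy data every sample hitting `rule_set_1` has the same value of the target attribute, which is the kind of skew the safety indication parameters are meant to expose.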

In some embodiments, the above method for determining the safety of the rule model can be specifically applied to a disease detection scenario. Specifically, in a disease detection scenario, the model generator may be a disease detection organization providing a disease detection service for users, and the data provider may be an organization storing health data of a large number of users, such as a hospital, a physical examination center, and the like. Accordingly, the rule model may specifically comprise a rule model generated by the model generator for detecting a disease. The rule set in the rule model may specifically include a decision rule related to the health data of the user.

In particular, reference may be made to FIG. 4. A first server deployed on the side of the disease detection institution may generate a rule model for detecting a disease and send the rule model to a second server deployed on the side of XX hospital. XX hospital owns and manages a user health database, which stores a large amount of user health data. XX hospital has entered into a cooperation agreement with the disease detection institution in advance.

After receiving the rule model, the second server may detect the security of the rule model by using the method for determining the security of the rule model provided in the embodiments of the present specification.

When the rule model is determined to have a security risk, risk prompt information can be generated and fed back to the first server to indicate that the rule model has a security risk, and the second server refuses to run the rule model.

When the rule model is determined to have no security risk, the second server runs the rule model on the health database it owns in accordance with the cooperation agreement, generates the corresponding processing result, and feeds it back to the first server.

In some embodiments, when it is determined that the rule model does not have a security risk according to the security indication parameters in the multiple hit situations, the following may be further included in implementation: acquiring an identity of a target user; inquiring a health database of the user according to the identity of the target user to acquire health data of the target user; processing the health data of the target user by utilizing the rule model to obtain a corresponding processing result; wherein the processing result is used for determining the risk of the target user suffering from the preset disease.

Specifically, the second server may generate and send the security prompt message to the first server when it is determined that the rule model does not have the security risk, so as to prompt that the rule model does not have the security risk, and the rule model may be normally operated. The first server can respond to the received safety prompt information and send the detection request carrying the identity of the target user to be detected to the second server.

The second server may obtain the identity of the target user from the detection request sent by the first server, and then query the user health database it owns according to the identity to acquire the health data of the target user. The health data of the target user can then be input into the rule model, the rule model is run, and the identification information of the rule set hit by the target user is output as the processing result corresponding to the target user. The processing result is fed back to the first server.

The first server can determine the rule set hit by the target user according to the processing result, and then determine the risk that the target user suffers from the preset disease according to that rule set, for example a predicted probability that the target user suffers from cancer.
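The exchange between the two servers described above can be sketched as follows. All names (`handle_detection_request`, `risk_from_result`, the dictionary-based health database, the toy rule model, and the risk table) are hypothetical and serve only to show that the identification information of the hit rule set, rather than the underlying health data, is what is returned to the first server.

```python
def handle_detection_request(request, health_database, run_rule_model):
    """Second-server side: look up the user and return only the hit rule set."""
    user_id = request["user_id"]
    health_data = health_database.get(user_id)
    if health_data is None:
        return {"user_id": user_id, "hit_rule_set": None}
    # The raw health data stays on the second server.
    return {"user_id": user_id, "hit_rule_set": run_rule_model(health_data)}

def risk_from_result(result, risk_by_rule_set):
    """First-server side: map the hit rule set to a disease risk."""
    return risk_by_rule_set.get(result["hit_rule_set"])

# Hypothetical data held by the second server.
health_database = {"user_001": {"body_temperature": 38.5, "blood_pressure": 150}}

def toy_rule_model(health_data):
    return "rule_set_1" if health_data["blood_pressure"] > 140 else "rule_set_2"

# Hypothetical risk table held by the first server (the model generator).
risk_by_rule_set = {"rule_set_1": 0.7, "rule_set_2": 0.1}

result = handle_detection_request({"user_id": "user_001"},
                                  health_database, toy_rule_model)
predicted_risk = risk_from_result(result, risk_by_rule_set)
```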

In this way, XX hospital can cooperate with the disease detection institution and assist it in determining the risk that the target user suffers from the preset disease, while the data security of its user health database is protected.

In some embodiments, the above method for determining the security of the rule model may be applied in a scenario of detecting a credit risk of a user. Specifically, in a scenario of detecting a credit risk of a user, the model generator may be a shopping website, a financial platform, or the like that needs to determine the credit risk of the user, and the data provider may be a financial institution such as a bank that has a large amount of information data related to the credit of the user. Accordingly, the rule model may specifically include a rule model generated by the model generator for determining the credit risk of the user. The rule set in the rule model may specifically include a decision rule related to the credit information of the user.

Of course, it should be noted that the above listed application scenarios and the rule models and rule sets used are only schematic illustrations. In specific implementation, the method for determining the safety of the rule model can be applied to other application scenarios according to specific situations. The present specification is not limited to these.

In some embodiments, in order to further reduce errors and determine more accurately whether the rule model has a security risk, the security of the rule model may be determined not only according to a single safety indication parameter but also according to a combination of safety indication parameters.

In some embodiments, the method, when implemented, may further include: screening a plurality of safety indication parameters to construct a safety indication parameter combination; determining a safety indication parameter combination under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk or not according to the combination of the security indication parameters under various hit conditions.

Specifically, according to the application scenario, several safety indication parameters that match the application scenario and perform well are screened from the available safety indication parameters and combined to serve as the safety indication parameter combination for that scenario. The safety indication parameter combination under multiple hit conditions is then determined according to the first distribution of the target attribute and the second distribution of the target attribute under multiple hit conditions, so as to quantify more finely and comprehensively the difference degree between the first distribution and the second distribution of the target attribute under multiple hit conditions.

The safety indication parameter combination under the multiple hit conditions can then be used in place of a single safety indication parameter, so that whether the rule model has a security risk can be determined more accurately and errors are reduced.
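The text does not state how a combination of safety indication parameters is evaluated. The sketch below assumes one simple possibility: each parameter in the combination has its own preset threshold, and a hit case is treated as risky when any parameter exceeds its threshold. The function name, parameter names, values, and thresholds are hypothetical.

```python
def risky_hit_cases(combination_by_hit_case, thresholds):
    """Map each risky hit case to the parameters that exceeded their thresholds.

    Assumed rule: a hit case is risky if ANY parameter in the combination
    exceeds its own threshold. Other combination rules are equally possible.
    """
    risky = {}
    for hit_case, params in combination_by_hit_case.items():
        exceeded = [name for name, value in params.items()
                    if value > thresholds[name]]
        if exceeded:
            risky[hit_case] = exceeded
    return risky

# Hypothetical combination screened for one application scenario.
thresholds = {"kl_divergence": 0.5, "gini_index_difference": 0.3}
combination_by_hit_case = {
    "hit_A": {"kl_divergence": 0.9, "gini_index_difference": 0.1},
    "hit_B": {"kl_divergence": 0.2, "gini_index_difference": 0.1},
}
risky = risky_hit_cases(combination_by_hit_case, thresholds)
rule_model_has_risk = bool(risky)
```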

As can be seen from the above, in the method for determining the security of the rule model provided in the embodiment of the present specification, a first distribution of target attributes is determined according to a sample set; meanwhile, processing the sample set by using a rule model to determine a second distribution of the target attributes under various hit conditions; then according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions, calculating a safety indication parameter capable of reflecting the difference degree between the second distribution of the target attributes under multiple hit conditions and the original first distribution; and determining whether the rule model has a security risk according to the security indication parameters. Therefore, the difference degree between the second distribution and the first distribution of the target attributes under various hit conditions can be quantified by utilizing the safety indication parameters, whether the rule model has a safety risk or not can be accurately determined according to the safety indication parameters, and the risk of data leakage when the rule model is operated by a data provider is reduced. A plurality of safety indication parameters are screened out from the information entropy difference, the Gini index difference, the purity difference and the KL divergence to construct a safety indication parameter combination; and according to the safety indication parameter combination, the difference degree between the first distribution and the second distribution of the target attribute under various hit conditions is more finely quantized, so that whether the rule model has safety risks or not can be more accurately determined.

Referring to fig. 5, an embodiment of the present disclosure further provides a data processing method. When the method is implemented, the following contents may be included.

S501: a first distribution of the target attribute is obtained, as well as a second distribution of the target attribute.

S502: calculating a safety indication parameter according to the first distribution of the target attribute and the second distribution of the target attribute.

S503: determining a degree of difference between the first distribution of the target attribute and the second distribution of the target attribute according to the safety indication parameter.

In some embodiments, the safety indication parameter may specifically include at least one of: information entropy difference, Gini index difference, purity difference, KL divergence, and the like.

By the method, the difference degree between the two kinds of distribution of the same target attribute can be quantified by using the safety indication parameter, so that more accurate data processing can be performed according to the difference degree between the two kinds of distribution of the same target attribute.
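Steps S501 to S503 can be illustrated with two of the listed parameters. The sketch assumes the usual textbook definitions of information entropy (base-2 logarithm) and Gini index, and takes each difference as the value for the first distribution minus the value for the second; the definitions given elsewhere in the specification may differ in sign or logarithm base. Function names and example distributions are hypothetical.

```python
import math

def entropy(dist):
    """Information entropy in bits: -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def gini_index(dist):
    """Gini index: 1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in dist)

def difference_degree(first, second):
    """S502/S503: compute safety indication parameters for two distributions."""
    return {
        "information_entropy_difference": entropy(first) - entropy(second),
        "gini_index_difference": gini_index(first) - gini_index(second),
    }

# S501: hypothetical first and second distributions of one target attribute.
first = [0.5, 0.5]
second = [1.0, 0.0]
degree = difference_degree(first, second)
```

A second distribution concentrated on a single data value has zero entropy and zero Gini index, so both differences are at their largest for a two-valued attribute in this example.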

Embodiments of the present specification further provide a server, including a processor and a memory for storing processor-executable instructions, where the processor, when executing the instructions, may perform the following steps: obtaining a rule model and a sample set, wherein the rule model comprises a rule set and the sample set comprises a plurality of sample data; determining a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions; determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk according to the safety indication parameters under multiple hit conditions.

In order to more accurately complete the above instructions, referring to fig. 6, another specific server is provided in the embodiments of the present specification, where the server includes a network communication port 601, a processor 602, and a memory 603, and the above structures are connected by an internal cable, so that the structures may perform specific data interaction.

The network communication port 601 may be specifically configured to obtain a rule model and a sample set; wherein the rule model comprises a set of rules, the set of samples comprising a plurality of sample data.

The processor 602 may be specifically configured to determine a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions; determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk or not according to the security indication parameters under various hit conditions.

The memory 603 may be specifically configured to store a corresponding instruction program.

In this embodiment, the network communication port 601 may be a virtual port bound to different communication protocols, so that different data can be sent or received. For example, the network communication port may be a port responsible for web data communication, a port responsible for FTP data communication, or a port responsible for mail data communication. In addition, the network communication port may also be a physical communication interface or communication chip. For example, it may be a wireless mobile network communication chip, such as a GSM or CDMA chip; it may also be a Wi-Fi chip; it may also be a Bluetooth chip.

In this embodiment, the processor 602 may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The present specification is not limited in this respect.

In this embodiment, the memory 603 may exist at multiple levels. In a digital system, anything that can store binary data may serve as a memory; in an integrated circuit, a circuit that has a storage function but no physical form of its own is also called a memory, such as a RAM or a FIFO; in a system, a storage device in physical form is also called a memory, such as a memory module or a TF card.

The present specification further provides a computer storage medium based on the above method for determining the security of a rule model, where the computer storage medium stores computer program instructions that, when executed, implement: obtaining a rule model and a sample set, wherein the rule model comprises a rule set and the sample set comprises a plurality of sample data; determining a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions; determining safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions; and determining whether the rule model has a security risk according to the safety indication parameters under multiple hit conditions.

In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.

In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.

Referring to fig. 7, on a software level, the embodiment of the present specification further provides a device for determining security of a rule model, which may specifically include the following structural modules.

The obtaining module 701 may be specifically configured to obtain a rule model and a sample set, wherein the rule model comprises a rule set and the sample set comprises a plurality of sample data.

The first determining module 702 may be specifically configured to determine a first distribution of the target attribute according to the sample set, and to determine the second distribution of the target attribute under multiple hit conditions according to the sample set and the rule model.

The second determining module 703 may be specifically configured to determine the safety indication parameters under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions.

The third determining module 704 may be specifically configured to determine whether the rule model has a security risk according to the security indication parameters in multiple hit situations.

It should be noted that, the units, devices, modules, and the like described in the foregoing embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. It is to be understood that, in implementing the present specification, functions of each module may be implemented in one or more pieces of software and/or hardware, or a module that implements the same function may be implemented by a combination of a plurality of sub-modules or sub-units, or the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Therefore, the device for determining the safety of the rule model provided by the embodiment of the specification can determine whether the rule model has a safety risk or not more accurately, and reduces the risk of data leakage when a data provider operates the rule model.

An embodiment of the present specification further provides a data processing apparatus, including: the acquisition module is used for acquiring a first distribution of the target attribute and a second distribution of the target attribute; the calculation module is used for calculating a safety indication parameter according to the first distribution of the target attribute and the second distribution of the target attribute; a determining module for determining a degree of difference between the first distribution of the target attribute and the second distribution of the target attribute according to the security indication parameter.

The processing device can quantify the difference degree between two kinds of distributions of the same target attribute by utilizing the safety indication parameter, so that more accurate data processing can be carried out according to the difference degree between the two kinds of distributions of the same target attribute.

Although the present specification provides method steps as described in the examples or flowcharts, additional or fewer steps may be included based on conventional or non-inventive approaches. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When implemented in practice, an apparatus or client product may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) in accordance with the embodiments or methods depicted in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded. The terms first, second, etc. are used to denote names, but not any particular order.

Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus a necessary general-purpose hardware platform. On this understanding, the technical solutions in the present specification may be embodied essentially in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, a network device, etc.) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present specification.

The embodiments in the present specification are described in a progressive manner; the same or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. The specification is operational with numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

While the specification has been described by way of embodiments, those skilled in the art will appreciate that numerous variations and modifications are possible without departing from the spirit of the specification, and it is intended that the appended claims encompass such variations and modifications.

Claims

1. A method of determining security of a rule model, comprising:

obtaining a rule model and a sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data;

determining a first distribution of target attributes according to the sample set; processing the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions;

determining safety indication parameters under the multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under the multiple hit conditions; wherein the safety indication parameter comprises at least one of: an information entropy difference, a Gini index difference, an impurity difference, and a KL divergence; wherein the determining of the safety indication parameters under the multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under the multiple hit conditions comprises: determining a parameter type of the safety indication parameter; and, according to the parameter type of the safety indication parameter, determining the safety indication parameters under the multiple hit conditions in a matching manner by using the first distribution of the target attributes and the second distribution of the target attributes under the multiple hit conditions;

and determining whether the rule model has a security risk according to the safety indication parameters under the multiple hit conditions.
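The selection of a matching calculation by parameter type can be illustrated with a minimal Python sketch. It is not taken from the specification: the names `CALCULATORS`, `safety_parameters`, `gini_difference`, and `kl_divergence`, the string keys for the parameter types, the sign convention of the Gini index difference, and the direction of the KL divergence are all assumptions made for illustration. Only two of the four recited parameter types are shown.

```python
import math
from typing import Callable, Dict

# A distribution maps each data value of the target attribute to its probability.
Distribution = Dict[str, float]


def gini(dist: Distribution) -> float:
    """Gini index: 1 minus the sum of squared probabilities."""
    return 1.0 - sum(p * p for p in dist.values())


def gini_difference(first: Distribution, second: Distribution) -> float:
    # Assumed sign convention: value under the hit condition minus the original value.
    return gini(second) - gini(first)


def kl_divergence(first: Distribution, second: Distribution) -> float:
    """KL(second || first) in bits (direction is an assumption).

    Values with zero probability in `second` contribute nothing. Every value
    with positive probability in `second` is expected to appear in `first`,
    which holds when `second` is computed on a subset of the sample set.
    """
    return sum(q * math.log2(q / first[v]) for v, q in second.items() if q > 0)


# Hypothetical registry: parameter type -> matching calculation.
CALCULATORS: Dict[str, Callable[[Distribution, Distribution], float]] = {
    "gini_difference": gini_difference,
    "kl_divergence": kl_divergence,
}


def safety_parameters(
    param_type: str,
    first: Distribution,
    second_by_hit: Dict[str, Distribution],
) -> Dict[str, float]:
    """Compute one safety indication parameter per hit condition."""
    calculate = CALCULATORS[param_type]  # raises KeyError for an unknown type
    return {hit: calculate(first, second) for hit, second in second_by_hit.items()}
```

With a registry of this kind, supporting a further parameter type only requires adding one entry; the rest of the pipeline is unchanged.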

2. The method of claim 1, wherein, in the case that the safety indication parameter comprises an information entropy difference, the determining of the safety indication parameters under the multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under the multiple hit conditions comprises:

calculating the information entropy of the target attribute as a first information entropy according to the first distribution of the target attribute;

respectively calculating the information entropies of the target attributes under various hit conditions according to the second distribution of the target attributes under various hit conditions to obtain second information entropies under various hit conditions;

and respectively subtracting the first information entropy from the second information entropy under the multiple hit conditions to obtain multiple difference values, and using the multiple difference values as safety indication parameters under the multiple hit conditions.
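As a rough illustration of these three steps, the following Python sketch computes the information entropy difference for each hit condition. The function names `entropy` and `entropy_differences` and the use of base-2 logarithms are assumptions; the subtraction follows the claim wording literally (second information entropy minus first information entropy), so a hit condition that narrows the target attribute yields a negative value.

```python
import math
from typing import Dict

Distribution = Dict[str, float]  # data value -> probability


def entropy(dist: Distribution) -> float:
    """Shannon entropy in bits; zero-probability values are skipped."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def entropy_differences(
    first: Distribution,
    second_by_hit: Dict[str, Distribution],
) -> Dict[str, float]:
    """Second information entropy minus first information entropy, per hit condition."""
    first_entropy = entropy(first)
    return {
        hit: entropy(second) - first_entropy
        for hit, second in second_by_hit.items()
    }
```

For example, if the target attribute is evenly split between two values in the sample set (1 bit of entropy) and a given rule set is hit only by samples with one of those values (0 bits), the difference for that hit condition is -1 bit.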

3. The method of claim 1, wherein the determining of whether the rule model has a security risk according to the safety indication parameters under the multiple hit conditions comprises:

comparing the safety indication parameters under the multiple hit conditions with preset safety thresholds respectively;

and under the condition that at least one of the safety indication parameters under the multiple hit conditions is determined to be greater than a preset safety threshold, determining that the rule model has safety risks.

4. The method of claim 3, wherein, in the case that at least one of the safety indication parameters under the multiple hit conditions is determined to be greater than the preset safety threshold, the method further comprises:

determining the hit condition of the safety indication parameter larger than a preset safety threshold as a risk hit condition;

counting the proportion of the sample data belonging to the risk hit condition in the sample set as the proportion of the risk sample;

comparing the risk sample proportion with a preset proportion threshold;

determining that the rule model has a security risk if it is determined that the risk sample proportion is greater than the preset proportion threshold.
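The threshold comparison of claim 3 and the risk sample proportion check of claim 4 could be combined along the following lines. This is a minimal sketch under stated assumptions: the function names, the representation of hit conditions as string identifiers, and the `hit_counts` mapping (number of samples falling under each hit condition) are all hypothetical, and the final decision follows the refinement in claim 4 rather than claim 3 alone.

```python
from typing import Dict, List


def risk_hit_conditions(params: Dict[str, float], safety_threshold: float) -> List[str]:
    """Hit conditions whose safety indication parameter exceeds the threshold."""
    return [hit for hit, value in params.items() if value > safety_threshold]


def risk_sample_proportion(risk_hits: List[str], hit_counts: Dict[str, int]) -> float:
    """Share of all samples that fall under a risk hit condition."""
    total = sum(hit_counts.values())
    if total == 0:
        return 0.0
    return sum(hit_counts.get(hit, 0) for hit in risk_hits) / total


def has_security_risk(
    params: Dict[str, float],
    hit_counts: Dict[str, int],
    safety_threshold: float,
    proportion_threshold: float,
) -> bool:
    """True when risk hit conditions exist and cover a large enough share of samples."""
    risk_hits = risk_hit_conditions(params, safety_threshold)
    if not risk_hits:
        return False
    return risk_sample_proportion(risk_hits, hit_counts) > proportion_threshold
```

The second check keeps a rule model from being flagged solely because of a hit condition that very few samples ever reach.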

5. The method of claim 1, wherein the determining of a first distribution of target attributes according to the sample set comprises:

counting the number of sample data of each data value of the target attribute in the sample set;

and determining the distribution of each data value of the target attribute in the sample set according to the sample data quantity of each data value of the target attribute in the sample set, wherein the distribution is used as the first distribution of the target attribute.
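A minimal sketch of this counting step is given below. It assumes, purely for illustration, that each sample is a dictionary and that the target attribute is a key in it; the function name `value_distribution` and the attribute name `income_level` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Mapping


def value_distribution(
    samples: List[Mapping[str, Hashable]],
    target_attribute: str,
) -> Dict[Hashable, float]:
    """Relative frequency of each data value of the target attribute."""
    counts = Counter(sample[target_attribute] for sample in samples)
    total = sum(counts.values())
    # An empty sample set yields an empty distribution (no division occurs).
    return {value: n / total for value, n in counts.items()}


samples = [
    {"income_level": "high"},
    {"income_level": "low"},
    {"income_level": "low"},
    {"income_level": "low"},
]
first_distribution = value_distribution(samples, "income_level")
```

Here one of four samples has the value `high` and three have `low`, so the first distribution assigns them 0.25 and 0.75 respectively.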

6. The method of claim 1, wherein the processing of the sample set by using the rule model to determine a second distribution of the target attributes under multiple hit conditions comprises:

processing a plurality of sample data in the sample set by using the rule model to obtain processing results of the plurality of sample data; the processing result comprises identification information of a rule set hit by the sample data;

determining a rule set hit by the sample data according to the processing results of the plurality of sample data;

dividing the plurality of sample data into sub-sample sets under the multiple hit conditions according to the rule set hit by the sample data;

and determining the data value distribution of the target attributes under various hit conditions according to the subsample set under various hit conditions, and taking the data value distribution as a second distribution of the target attributes under various hit conditions.
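The partitioning described in this claim might look as follows in Python. The sketch assumes that the rule model can be called on a single sample and returns the identification information of the rule set that the sample hits; that interface, the toy rule `toy_rule_model`, and the attribute names are illustrative assumptions rather than details from the specification.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, List, Mapping

Sample = Mapping[str, Hashable]


def second_distributions(
    samples: List[Sample],
    rule_model: Callable[[Sample], str],
    target_attribute: str,
) -> Dict[str, Dict[Hashable, float]]:
    """Distribution of the target attribute within each hit condition."""
    # Divide the samples into sub-sample sets keyed by the hit rule set.
    subsets: Dict[str, List[Hashable]] = defaultdict(list)
    for sample in samples:
        hit_id = rule_model(sample)
        subsets[hit_id].append(sample[target_attribute])

    # Normalise the value counts inside each sub-sample set.
    result: Dict[str, Dict[Hashable, float]] = {}
    for hit_id, values in subsets.items():
        counts = Counter(values)
        result[hit_id] = {value: n / len(values) for value, n in counts.items()}
    return result


def toy_rule_model(sample: Sample) -> str:
    # Hypothetical single-rule model used only for this example.
    return "rule_high_spend" if sample["monthly_spend"] > 1000 else "no_hit"


samples = [
    {"monthly_spend": 1500, "income_level": "high"},
    {"monthly_spend": 1200, "income_level": "high"},
    {"monthly_spend": 300, "income_level": "low"},
    {"monthly_spend": 200, "income_level": "high"},
]
by_hit = second_distributions(samples, toy_rule_model, "income_level")
```

In this toy data every sample that hits `rule_high_spend` has a high income level, which is the kind of concentration that a safety indication parameter is meant to expose.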

7. The method of claim 1, wherein the rule model comprises a rule model for detecting a disease and, correspondingly, the rule set comprises decision rules related to health data of a user.

8. The method of claim 7, wherein, in the case that it is determined according to the safety indication parameters under the multiple hit conditions that the rule model does not have a security risk, the method further comprises:

acquiring an identity of a target user;

inquiring a health database of the user according to the identity of the target user to acquire health data of the target user;

processing the health data of the target user by utilizing the rule model to obtain a corresponding processing result; wherein the processing result is used for determining the risk of the target user suffering from the preset disease.

9. The method of claim 1, further comprising:

screening a plurality of safety indication parameters to construct a safety indication parameter combination;

determining a safety indication parameter combination under multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under multiple hit conditions;

and determining whether the rule model has a security risk according to the safety indication parameter combination under the multiple hit conditions.

10. A device for determining security of a rule model, comprising:

the acquisition module is used for acquiring the rule model and the sample set; wherein the rule model comprises a rule set, and the sample set comprises a plurality of sample data;

the first determining module is used for determining first distribution of the target attribute according to the sample set; determining a second distribution of the target attributes under various hit conditions according to the sample set and the rule model;

the second determining module is used for determining the safety indication parameters under the multiple hit conditions according to the first distribution of the target attributes and the second distribution of the target attributes under the multiple hit conditions; wherein the safety indication parameter comprises at least one of: an information entropy difference, a Gini index difference, an impurity difference, and a KL divergence; the second determining module is specifically configured to determine a parameter type of the safety indication parameter, and, according to the parameter type of the safety indication parameter, to determine the safety indication parameters under the multiple hit conditions in a matching manner by using the first distribution of the target attributes and the second distribution of the target attributes under the multiple hit conditions;

and the third determining module is used for determining whether the rule model has a security risk according to the safety indication parameters under the multiple hit conditions.

11. A server comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 9.

12. A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 9.