CN112580106A

CN112580106A - Multi-source data processing system and multi-source data processing method

Info

Publication number: CN112580106A
Application number: CN202110103428.9A
Authority: CN
Inventors: 任静涵
Original assignee: E Capital Transfer Co ltd
Current assignee: E Capital Transfer Co ltd
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2021-03-30

Abstract

The invention relates to a multi-source data processing method and system. The method comprises: mixing a redundant attribute set into a key attribute set to form an initial attribute set; performing format conversion on the initial attribute set to obtain a to-be-processed attribute set in a predetermined format; selecting two or more clients as simulation execution clients, and simulating execution of the to-be-processed attribute set by the simulation execution clients to obtain a simulation completion attribute set; sending the simulation completion attribute set, as noise, together with the to-be-processed attribute set to the remaining clients other than the simulation execution clients, the remaining clients executing the to-be-processed attribute set to obtain an execution completion attribute set; and inputting the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result. The invention can integrate information from a plurality of clients to obtain a decision analysis result while protecting both the private information of the clients and the privacy of the server's decision analysis model.

Description

Multi-source data processing system and multi-source data processing method

Technical Field

The invention relates to computer technology, in particular to a multi-source data processing system and a multi-source data processing method for data processing of multi-source data (namely data from a plurality of clients).

Background

In a securities industry cloud service environment, comprehensive data processing sometimes needs to draw on the private data of multiple dealer organizations. A method is therefore needed for analyzing and processing data while protecting the private data of each dealer organization.

On the other hand, cloud service providers generally use their own decision analysis models for the relevant data analysis and processing, so the data analysis and processing method must also avoid revealing the cloud service provider's decision analysis model.

Disclosure of Invention

In view of the above problems, the present invention aims to provide a multi-source data processing system and a multi-source data processing method capable of processing multi-source data without revealing the private information of the multiple sources (i.e., the clients) and, at the same time, without revealing the private information of the server.

The multi-source data processing method of the present invention is characterized by being implemented by a server and a plurality of clients, and comprises the following steps:

a redundancy adding step, in which the server mixes a redundant attribute set into a key attribute set to form an initial attribute set;

a format conversion step, in which the server performs format conversion on the initial attribute set to obtain a to-be-processed attribute set in a predetermined format;

a simulation execution step, in which the server selects two or more clients as simulation execution clients based on a first random algorithm, and simulates execution of the to-be-processed attribute set by the simulation execution clients to obtain a simulation completion attribute set;

a real execution step, in which the server sends the simulation completion attribute set, as noise, together with the to-be-processed attribute set to the remaining clients other than the simulation execution clients among the plurality of clients, and the remaining clients execute the to-be-processed attribute set to obtain an execution completion attribute set; and

an analysis decision step, in which the server inputs the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result.

Optionally, the decision analysis model makes step-by-step judgments on the attributes of the execution completion attribute set and finally obtains a decision result.

Optionally, a unique client identifier is preset for each client,

in the simulation execution step, two or more clients are selected as simulation execution clients from the client identification numbers of the plurality of clients by a first random algorithm.

Optionally, the first random algorithm comprises any one of:

numerical probability algorithms, the Las Vegas algorithm, the Monte Carlo algorithm, and the Sherwood algorithm.

Optionally, in the format conversion step, the following format conversion is performed on the attribute fields in the initial attribute set:

for a discrete field, a question set is generated;

for a field with a linear attribute, a question set is generated after the field has been discretized by a discretization technique.

Optionally, the real execution step comprises the following sub-steps:

substep 1: the server uses a second random algorithm to select one client from the remaining clients other than the simulation execution clients among the plurality of clients;

substep 2: the server sends the to-be-processed attribute set and the simulation completion attribute set to the selected client;

substep 3: the selected client executes the to-be-processed attribute set, adds the execution result to the to-be-processed attribute set, and returns the result to the server;

substep 4: the server repeats substeps 1 to 3 until all of the remaining clients have executed the to-be-processed attribute set, thereby obtaining an execution completion attribute set.

The multi-source data processing system of the present invention is characterized by comprising a server and a plurality of clients,

wherein the server includes:

a redundancy adding module, configured to mix a redundant attribute set into a key attribute set to form an initial attribute set;

the format conversion module is used for carrying out format conversion on the initial attribute set to obtain a to-be-processed attribute set with a preset format;

the simulation execution module is used for selecting two or more clients as simulation execution clients based on a first random algorithm, simulating the simulation execution clients to execute the attribute set to be processed and obtain a simulation completion attribute set;

a first communication module, communicatively connected to the client, configured to send the simulation completion attribute set as noise together with the to-be-processed attribute set to remaining clients, except the simulation execution client, among the plurality of clients, and configured to accept a returned execution completion attribute set; and

an analysis decision module, configured to input the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result,

wherein each client comprises:

a second communication module, communicatively connected to the server, configured to receive the simulation completion attribute set and the to-be-processed attribute set from the server and to return to the server the execution completion attribute set obtained by the execution module; and

an execution module, configured to execute the to-be-processed attribute set and obtain the execution completion attribute set.

Optionally, a unique client identifier is preset for each client,

the simulation execution module selects two or more clients as simulation execution clients from the client identification numbers of the plurality of clients through a first random algorithm.

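Optionally, the first random algorithm comprises any one of: numerical probability algorithms, the Las Vegas algorithm, the Monte Carlo algorithm, and the Sherwood algorithm.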

Optionally, the format conversion module performs the following format conversion on the attribute fields in the initial attribute set:

for a discrete field, a question set is generated;

The server of the present invention is a server for communicating with a plurality of clients, and includes:

the simulation execution module is used for selecting two or more clients from the plurality of clients as simulation execution clients based on a first random algorithm, simulating the simulation execution clients to execute the attribute set to be processed and obtain a simulation completion attribute set;

a first communication module, configured to send the simulation completion attribute set as noise together with the to-be-processed attribute set to remaining clients other than the simulation execution client among the plurality of clients and to accept a returned execution completion attribute set; and

an analysis decision module, configured to input the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result.

The computer-readable medium of the present invention, on which a computer program is stored, is characterized in that,

the computer program, when executed by a processor, implements the multi-source data processing method described above.

The computer device of the present invention includes a storage module, a processor, and a computer program stored on the storage module and executable on the processor, and is characterized in that the processor implements the above-mentioned multi-source data processing method when executing the computer program.

As described above, the multi-source data processing system and the multi-source data processing method of the present invention provide a policy for protecting the private data of the clients. A partial redundant attribute set is added to the key attribute set that is input to the decision analysis model, and the combined set is converted into a question set whose answers are yes or no. The question set is transmitted to the clients in random order, each client updates the answers according to its own private data, and the last client to finish transmits the answers to the server, completing attribute collection for the same object. In this way the key attribute information held by all clients about the analysis object is obtained while the privacy of that information is preserved and the key attributes are not leaked. The server is likewise given a policy for protecting its decision analysis model: no client can obtain the model, so the server's privacy is also protected.

Moreover, the client with a selected client identifier is used as a simulation execution client, so that other, real data can be concealed.

Therefore, according to the invention, neither the private information of the multiple sources (namely the clients) nor the private information of the server (namely the decision analysis model) is disclosed, and, with private data thus protected, the information from the multiple sources can be combined to obtain a final decision analysis result.

Drawings

FIG. 1 is a flow diagram illustrating a multi-source data processing method of the present invention.

FIG. 2 is a block diagram showing the architecture of a multi-source data processing system of the present invention.

FIG. 3 is a flow diagram illustrating a multi-source data processing method according to one embodiment of the invention.

Detailed Description

The following description is of some of the several embodiments of the invention and is intended to provide a basic understanding of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention.

For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of multi-source data processing systems and multi-source data processing methods, and that these same principles, as well as any such variations, may be implemented therein without departing from the true spirit and scope of the present patent application.

As shown in FIG. 1, the multi-source data processing method of the present invention includes:

redundancy adding step S100: the server mixes a redundant attribute set into a key attribute set to form an initial attribute set;

format conversion step S200: the server performs format conversion on the initial attribute set to obtain a to-be-processed attribute set in a predetermined format;

simulation execution step S300: the server selects two or more clients as simulation execution clients based on a first random algorithm, and simulates execution of the to-be-processed attribute set by the simulation execution clients to obtain a simulation completion attribute set;

real execution step S400: the server sends the simulation completion attribute set, as noise, together with the to-be-processed attribute set to the remaining clients other than the simulation execution clients, and the remaining clients execute the to-be-processed attribute set to obtain an execution completion attribute set; and

analysis decision step S500: the server inputs the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result.

In the redundancy adding step S100, the redundant attribute set is mixed into the key attribute set as noise to form the initial attribute set. Because the clients cannot tell which attributes belong to the key attribute set, they cannot determine which information the server actually intends to collect.
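The redundancy adding step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `add_redundancy`, the example attribute names, and the shuffling of the combined list are all assumptions introduced here, since the text only states that the redundant attribute set is mixed into the key attribute set.

```python
import random

def add_redundancy(key_attrs, redundant_attrs, rng=None):
    """Mix redundant attributes into the key attributes (hypothetical helper)."""
    rng = rng or random.Random()
    # Skip redundant attributes that duplicate a key attribute.
    initial = list(key_attrs) + [a for a in redundant_attrs if a not in key_attrs]
    # Assumption: shuffle so that position does not reveal which attributes are key.
    rng.shuffle(initial)
    return initial

# Hypothetical attribute names, used only for illustration.
key = ["risk_level", "account_type"]
redundant = ["branch_code", "login_channel", "statement_language"]
initial_set = add_redundancy(key, redundant, random.Random(0))
```

Only the server keeps the list `key`, so a client that receives the initial attribute set cannot separate key attributes from redundant ones.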

The decision analysis model is established in the server and is only visible to the server, so that the decision analysis model can be prevented from being leaked to a plurality of clients serving as multiple sources.

Furthermore, a unique client identifier is preset and distributed to each client, and the client identifiers of the plurality of clients are recorded and stored in the server. In the simulation execution step S300, the server selects two or more client identifiers from the client identifiers of the plurality of clients through a first random algorithm, and uses the client with the selected client identifier as a simulation execution client.

Here, any one of the following may be employed as the first random algorithm: a numerical probability algorithm, a Las Vegas algorithm, a Monte Carlo algorithm, a Sherwood algorithm, and the like.
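As a rough sketch of the selection in step S300, the following uses uniform sampling without replacement from Python's standard `random` module as a stand-in for the first random algorithm. The text does not prescribe this particular sampler, and the function name `select_simulation_clients` and its parameters are assumptions.

```python
import random

def select_simulation_clients(client_ids, count=2, rng=None):
    """Pick `count` client identifiers to be simulated; return (simulated, remaining)."""
    ids = sorted(client_ids)
    if count < 2:
        raise ValueError("two or more simulation execution clients are required")
    if count >= len(ids):
        raise ValueError("at least one real client must remain")
    rng = rng or random.Random()
    simulated = set(rng.sample(ids, count))  # uniform sampling without replacement
    remaining = [c for c in ids if c not in simulated]
    return simulated, remaining

simulated, remaining = select_simulation_clients(range(1, 31), 2, random.Random(1))
```

The remaining identifiers are the clients that actually execute the to-be-processed attribute set in step S400.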

In the format conversion step S200, the following format conversion is performed on the attribute fields in the initial attribute set:

for a discrete field, a question set is generated;

For example, for a field A that is a discrete attribute with values Value1, Value2, …, ValueN, the question set { If field A = Value1, If field A = Value2, …, If field A = ValueN } is generated, where N is a natural number. A field B that is a linear attribute is first discretized by a discretization technique and is then processed in the same way as the discrete attribute field A.

By performing the above format processing in the format conversion step S200, the original information is converted into a form that a computer can easily read and count, and the private data of the client itself is not leaked. The technical effect is that data processing is faster and data privacy is improved at the same time.
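The format conversion can be illustrated with the sketch below. It assumes that a linear field is discretized with fixed, sorted bin boundaries, which is only one possible discretization technique; the text does not name one. All function names and the `binK` labels are hypothetical.

```python
from bisect import bisect_right

def questions_for_discrete(field, values):
    """One yes/no question per possible value of a discrete field."""
    return [f"{field} = {v}" for v in values]

def discretize(value, boundaries):
    """Map a numeric value to a bin label, given sorted bin boundaries."""
    return f"bin{bisect_right(boundaries, value)}"

def questions_for_linear(field, boundaries):
    """Discretize a linear field into bins, then treat it like a discrete field."""
    labels = [f"bin{i}" for i in range(len(boundaries) + 1)]
    return questions_for_discrete(field, labels)

def build_question_set(questions):
    """Every answer starts at 0 (no); clients later set it to 1 (yes)."""
    return {q: 0 for q in questions}

questions = questions_for_discrete("A", ["Value1", "Value2"]) + questions_for_linear("B", [10, 20])
question_set = build_question_set(questions)
```

A client holding the value 15 for field B would compute `discretize(15, [10, 20])` and answer yes only to the question for the resulting bin.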

The significance of the simulation execution step S300 is as follows. If no simulation execution client were set, the first client would execute the to-be-processed attribute set and pass the updated data to a second client, which could then infer the first client's private information from that data. Adding simulation execution clients as noise protects the private information of the first client that actually executes the to-be-processed attribute set.

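The real execution step S400 includes the following sub-steps:

substep 1: the server uses a second random algorithm to select one client from the remaining clients other than the simulation execution clients among the plurality of clients;

substep 2: the server sends the to-be-processed attribute set and the simulation completion attribute set to the selected client;

substep 3: the selected client executes the to-be-processed attribute set, adds the execution result to the to-be-processed attribute set, and returns the result to the server; and

substep 4: the server repeats substeps 1 to 3 until all of the remaining clients have executed the to-be-processed attribute set, thereby obtaining an execution completion attribute set.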

Here, the second random algorithm may be the same algorithm as the first random algorithm or may be a different algorithm.
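A minimal sketch of the sub-steps of S400 is given below, under several assumptions: the second random algorithm is again replaced by uniform random choice, the client is modelled as a local callback instead of a networked peer, and the message layout (`queried`, `questions`) and all function names are hypothetical.

```python
import random

def real_execution(pending, simulated_ids, remaining_ids, execute_on_client, rng=None):
    """Server-side loop over the remaining (real) clients in random order."""
    rng = rng or random.Random()
    # The simulated clients are listed as already queried; this is the noise.
    message = {"queried": set(simulated_ids), "questions": dict(pending)}
    waiting = list(remaining_ids)
    while waiting:
        client_id = waiting.pop(rng.randrange(len(waiting)))  # substep 1
        message = execute_on_client(client_id, message)       # substeps 2 and 3
    return message["questions"]                               # result of substep 4

def make_stub_client(private_yes):
    """Stand-in for real clients: each answers yes to the questions it holds."""
    def execute(client_id, message):
        questions = dict(message["questions"])
        for q in private_yes.get(client_id, ()):
            if q in questions:
                questions[q] = 1
        return {"queried": message["queried"] | {client_id}, "questions": questions}
    return execute

pending = {"q1": 0, "q2": 0, "q3": 0}
stub = make_stub_client({8: ["q2"], 5: ["q3"]})
completed = real_execution(pending, {3, 24}, [5, 8, 9], stub, random.Random(0))
```

Because an answer is only ever raised from 0 to 1, the final result does not depend on the order in which the clients are visited.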

In addition, data transmitted between the client and the server may be encrypted, for example with an asymmetric algorithm such as RSA, DSA, or ECC.

In the analysis decision step S500, the server inputs the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result. The decision analysis model makes step-by-step judgments on the attributes of the execution completion attribute set until it reaches a decision result. Specifically, an attribute value is tested at each internal node of the tree, the test result determines which branch node to enter, and the process continues until a leaf node is reached, which gives the decision result. ID3, C4.5, CART, and similar algorithms may be used as the decision algorithm.
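The traversal described above can be sketched with a small hand-written tree. This is an illustration only: the tree is not trained with ID3, C4.5, or CART, and the dictionary layout (`question`, `yes`, `no`, `result`) and the function name `decide` are assumptions made for the example.

```python
def decide(node, answers):
    """Walk from the root to a leaf, branching on the yes/no answer at each node."""
    while "result" not in node:
        # Unanswered questions are treated as 0 (no).
        branch = "yes" if answers.get(node["question"], 0) == 1 else "no"
        node = node[branch]
    return node["result"]

# Hypothetical tree over two questions of the execution completion attribute set.
tree = {
    "question": "q2",
    "yes": {
        "question": "q3",
        "yes": {"result": "conclusion 1"},
        "no": {"result": "conclusion 2"},
    },
    "no": {"result": "conclusion 3"},
}

decision = decide(tree, {"q1": 0, "q2": 1, "q3": 0})
```

Questions that never appear in the tree, such as `q1` here, play the role of the redundant attributes: they are collected but have no influence on the decision.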

In addition, optionally, after the analysis decision step S500, the method can further include: the server returns the decision analysis result to each client. This has the technical effect that the relevant results can be shared.

As shown in FIG. 2, the multi-source data processing system of the present invention comprises: a server 100 and a plurality of clients 200.

The server 100 includes:

a redundancy adding module 110, configured to mix a redundancy attribute set in the key attribute set as an initial attribute set;

a format conversion module 120, configured to perform format conversion on the initial attribute set to obtain a to-be-processed attribute set in a predetermined format;

a simulation execution module 130, configured to select two or more clients as simulation execution clients based on a first random algorithm, and simulate the simulation execution clients to execute the to-be-processed attribute set and obtain a simulation completion attribute set;

a first communication module 140, communicatively connected to the client, for sending the simulation completion attribute set as noise to the remaining clients except the simulation execution client among the plurality of clients together with the to-be-processed attribute set and for accepting a returned execution completion attribute set; and

an analysis decision module 150, configured to input the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result.

Each of the plurality of clients 200 includes:

a second communication module 210, communicatively connected to the server 100, configured to receive the simulation completion attribute set and the pending attribute set from the server 100 and return an execution completion attribute set obtained by an execution module 220 described below to the server 100; and

an execution module 220, configured to execute the to-be-processed attribute set and obtain an execution completion attribute set.

In the server 100, the redundancy adding module 110 mixes the redundant attribute set into the key attribute set as noise to form the initial attribute set. Because the clients 200 cannot tell which attributes belong to the key attribute set, they cannot determine which information the server 100 actually intends to collect.

Furthermore, the decision analysis model is established by the analysis decision module 150 in the server 100 and is only visible to the server, which can ensure that the decision analysis model is not revealed to multiple clients as multiple sources.

In the analysis decision module 150, the execution completion attribute set is input to a decision analysis model for calculation and analysis to obtain a decision analysis result. The decision analysis model makes step-by-step judgments on the attributes of the execution completion attribute set until it reaches a decision result. Specifically, an attribute value is tested at each internal node of the tree, the test result determines which branch node to enter, and the process continues until a leaf node is reached, which gives the decision result. ID3, C4.5, CART, and similar algorithms may be used as the decision algorithm.

Furthermore, a unique client identifier is preset and assigned to each client, and the client identifiers of the plurality of clients are recorded and stored in the server. The simulation execution module 130 of the server 100 selects two or more client identifiers from those of the plurality of clients by a first random algorithm and uses the clients with the selected identifiers as simulation execution clients. It simulates execution of the to-be-processed attribute set by these clients to obtain a simulation completion attribute set, which is sent as noise, together with the to-be-processed attribute set, to the remaining clients other than the simulation execution clients. The technical effect is that, even though the remaining clients receive the simulation completion attribute set, they cannot obtain real information from it, because the set is produced by the server's simulation and is not real. Providing it to the remaining clients as noise is thus a technical means of concealing real data.

Here, a numerical probability algorithm, a Las Vegas algorithm, a Monte Carlo algorithm, or a Sherwood algorithm can be adopted as the first random algorithm. Selecting the simulation execution clients by a random algorithm avoids manual selection and keeps the final decision result relatively accurate.

In the format conversion module 120, the following format conversion is performed on the attribute fields in the initial attribute set:

for a discrete field, a question set is generated;

By performing the above format processing by the format conversion module 120, the original information can be converted into a form that can be easily read and counted by a computer, and moreover, the private data of the client itself can not be leaked, whereby the technical effect that the data processing speed is increased and the privacy of the data is improved at the same time can be obtained.

Next, a multi-source data processing method according to an embodiment of the present invention will be described. The embodiment applies the multi-source data processing method to a securities industry cloud service environment and protects the privacy of the server's decision analysis model while protecting the data privacy of each client (dealer client).

As shown in FIG. 3, step S1: the server mixes a partial redundant attribute set into the key attribute set that is input to the model;

step S2: the server converts the attribute set mixed with the partial redundant attribute set into a question set whose answers are yes or no, where 1 denotes yes and 0 denotes no;

step S3: the client of each dealer has a unique identification number, such as serial number 1, 2, 3, 4, …; the server uses a random algorithm to select the identification numbers of two or more dealer clients, for example 3 and 24, as simulated queried dealers, that is, dealer clients 3 and 24 are the randomly selected clients whose execution of the attribute set is simulated;

step S4: the server randomly selects another dealer client, for example the one with identification number 8, and sends it a message { "queried dealer": {3,24}, "object": obj1, "question set": { question 1:0, question 2:0, question 3:0, … } };

step S5: after receiving the message, the dealer client with identification number 8 adds its own identification number to the queried dealers and answers the questions in the question set in sequence using its own data about obj1, updating the answer flag of each question whose answer is yes; the updated message is { "queried dealer": {3,8,24}, "object": obj1, "question set": { question 1:0, question 2:1, question 3:0, … } };

step S6: the dealer client with identification number 8 randomly selects a dealer client that is not among the queried dealers and sends the message to it;

step S7: the last client feeds the information back to the server, that is, after all dealers other than the simulated dealers 3 and 24 have answered, the last client sends the message to the server, and the server inputs the finally returned information into the decision analysis model to obtain a decision analysis result.
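Steps S4 to S7 can be sketched as a message that is passed from one dealer client to the next. The sketch runs in a single process and makes several assumptions: the key names `queried_dealers` and `question_set`, the function names, the mapping `private_yes` of each client to the questions it answers yes, and uniform random choice as the random algorithm are all introduced here for illustration.

```python
import random

def answer_message(client_id, message, yes_questions):
    """Step S5: add own ID to the queried dealers and set yes answers to 1."""
    questions = dict(message["question_set"])
    for q in yes_questions:
        if q in questions:
            questions[q] = 1  # answers are only ever raised from 0 to 1
    return {
        "queried_dealers": sorted(set(message["queried_dealers"]) | {client_id}),
        "object": message["object"],
        "question_set": questions,
    }

def run_chain(all_ids, simulated_ids, message, private_yes, rng=None):
    """Steps S4 to S7: pass the message between real clients in random order."""
    rng = rng or random.Random()
    real_ids = sorted(set(all_ids) - set(simulated_ids))
    if not real_ids:
        raise ValueError("at least one real client is required")
    current = rng.choice(real_ids)  # S4: the server picks the first real client
    while True:
        message = answer_message(current, message, private_yes.get(current, ()))
        candidates = sorted(set(all_ids) - set(message["queried_dealers"]))
        if not candidates:
            return message  # S7: the last client reports back to the server
        current = rng.choice(candidates)  # S6: the client picks the next client

start = {
    "queried_dealers": [3, 24],
    "object": "obj1",
    "question_set": {"question 1": 0, "question 2": 0, "question 3": 0},
}
after_8 = answer_message(8, start, {"question 2"})
final = run_chain([3, 8, 11, 24], [3, 24], start, {8: ["question 2"]}, random.Random(0))
```

Here `after_8` reproduces the updated message of step S5. A client that receives it sees dealers 3, 8, and 24 as already queried and cannot tell which of them answered yes to question 2.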

Here, the server completes the analysis decision about the object obj1 by using the obtained key attribute set and the decision analysis model it has established. Optionally, after the analysis is completed, an analysis conclusion set may be sent to the plurality of clients, where each conclusion includes conclusion information and an estimated probability that the conclusion is correct, for example, { "object": obj1, "conclusion set": { (conclusion 1, 89%), (conclusion 2, 60%), (conclusion 3, 76%), … } }.

As described above, the multi-source data processing system and the multi-source data processing method of the embodiment provide a policy for protecting the private data of the dealer clients. A partial redundant attribute set is added to the key attribute set that is input to the decision analysis model, the combined set is converted into a question set whose answers are yes or no, and the question set is transmitted to the clients in random order. Each client updates the answers according to its own private data, and the last client to finish transmits the result to the server, completing attribute collection for the same object.

Thus, once redundant attributes have been added to the key attributes that are input to the decision analysis model and the result has been converted into a yes-or-no question set, the key attribute information held by all clients about the analysis object can be obtained while the privacy of that information is preserved and the key attributes are not disclosed. The server is likewise given a policy for protecting its decision analysis model: no client can obtain the model, so the server's privacy is also protected.

Further, the client with a selected client identifier is taken as a simulation execution client (a simulated completed query). Because the simulation completion attribute set is produced by the server's simulation and is not real, providing it to the remaining clients as noise conceals other, real data.

The present invention also provides a computer-readable medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the above-described multi-source data processing method.

The invention also provides computer equipment which comprises a storage module, a processor and a computer program which is stored on the storage module and can run on the processor, and is characterized in that the processor realizes the multi-source data processing method when executing the computer program.

The above examples have mainly explained the multi-source data processing system and the multi-source data processing method of the present invention. Although only a few embodiments of the present invention have been described in detail, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A multi-source data processing method, comprising:

a simulation execution step, in which the server selects two or more clients in the plurality of clients as simulation execution clients based on a first random algorithm, and simulates the simulation execution clients to execute the attribute set to be processed and obtain a simulation completion attribute set;

a real execution step, in which the server sends the simulation completion attribute set as noise to the remaining clients except the simulation execution client among the plurality of clients together with the to-be-processed attribute set, and receives the execution completion attribute set from the remaining clients, wherein the execution completion attribute set is obtained by the remaining clients executing the to-be-processed attribute set; and

an analysis decision step, in which the server inputs the execution completion attribute set into a decision analysis model for calculation and analysis to obtain a decision analysis result.

2. The multi-source data processing method of claim 1,

and the decision analysis model performs gradual judgment according to the attributes of the execution completion attribute set to finally obtain a decision result.

3. The multi-source data processing method of claim 1,

a unique client identifier is preset for each client,

4. The multi-source data processing method of claim 3,

the first stochastic algorithm comprises any one of:

5. The multi-source data processing method of claim 1,

in the format conversion step, the following format conversion is performed on the attribute fields in the initial attribute set:

for the discrete field, generating a problem set;

6. The multi-source data processing method of claim 1,

the step of actually executing comprises the following substeps:

7. A server configured to communicate with a plurality of clients, comprising:

8. The server according to claim 7,

9. The server according to claim 7,

a unique client identifier is preset for each client,

10. The server according to claim 9,

the first stochastic algorithm comprises any one of:

11. The server according to claim 7,

the format conversion module performs the following format conversion on the attribute fields in the initial attribute set:

for the discrete field, generating a problem set;

12. A multi-source data processing system, comprising: a server and a plurality of clients,

wherein, the server side includes:

wherein the client comprises:

13. A computer-readable medium, having stored thereon a computer program,

the computer program, when executed by a processor, implements the multi-source data processing method of any of claims 1-6.

14. A computer device comprising a storage module, a processor and a computer program stored on the storage module and executable on the processor, wherein the processor implements the multi-source data processing method of any one of claims 1 to 6 when executing the computer program.