CN110674373A - Big data processing method, device, equipment and storage medium based on sensitive data

Info

Publication number: CN110674373A
Application number: CN201910876650.5A
Authority: CN
Inventors: 张少典; 马汉东
Original assignee: Shanghai Sen Sen Medical Technology Co Ltd
Current assignee: Shanghai Sen Sen Medical Technology Co Ltd
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2020-01-10
Anticipated expiration: 2039-09-17
Also published as: CN110674373B

Abstract

The application provides a big data processing method, apparatus, device and storage medium based on sensitive data. The number of samples is determined according to a preset condition, and a state function is determined according to the number of samples; seed numbers are screened according to the state function, and the seed numbers meeting the screening condition are added to a parameter set; and it is judged whether the parameter set meets the number of samples, and if so, the parameter set is output, and otherwise, the method jumps to the previous step. By establishing the number of samples and the state function, the required sample data set can be quickly screened out from an original data set from which sensitive data has been removed, and the state function can be optimized using unsatisfied samples, so that the data characteristics represented by the sample data set are highly consistent with the true data characteristics of the original data set. The method and apparatus therefore offer efficient screening while preserving the authenticity of the original data set.

Description

Big data processing method, device, equipment and storage medium based on sensitive data

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a big data processing method, apparatus, device, and storage medium based on sensitive data.

Background

Currently, the big data field generally involves data holders, data providers, and data users. The data holder has the right to use the data and owns it, but does not know how to use the data to generate value; the data provider has data analysis capability and can analyze the original data to draw conclusions; the data user has neither data ownership nor data analysis capability, but needs to carry out practical applications based on the analysis results of the original data.

The data holder can seek cooperation with a data provider, and the data user can purchase data. In fields involving sensitive data, such as medical data or government identity data, however, the data contains sensitive information and cannot be disclosed directly to the data user, so the data user instead needs to purchase the analysis conclusions that the data provider derives from the data.

At present, data providers generally use random sampling to improve the value density of big data. Analysis results obtained in this way often deviate to some extent from the true characteristics expressed by the full data set. The error can be reduced by enlarging the number of samples, but only at the expense of computational analysis cost. As a result, the data user cannot effectively obtain comprehensive information about the big data and cannot apply it in a targeted manner, the data cannot deliver its maximum value, and the data user does not receive effective analysis data, which causes asymmetry in the flow of information. This information asymmetry hinders information exchange, so that the analysis process of the data provider becomes long and difficult, the requirements of the data user cannot be met, and the expected effect cannot be achieved.

Therefore, how to keep the characteristics of the sample data set consistent with the true characteristics of the original data set while accelerating screening is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above-mentioned shortcomings in the prior art, it is an object of the present application to provide a method, an apparatus, a device and a storage medium for big data processing based on sensitive data, so as to solve at least one problem existing in the prior art.

To achieve the above and other related objects, the present application provides a big data processing method based on sensitive data, the method comprising: establishing the number of samples according to a preset condition, and establishing a state function according to the number of samples; screening seed numbers according to the state function, and adding the seed numbers meeting the screening condition to a parameter set; and judging whether the parameter set meets the number of samples, and if so, outputting the parameter set, and otherwise, jumping to the previous step.

In an embodiment of the present application, screening the seed numbers according to the state function and adding the seed numbers meeting the screening condition to a parameter set includes: calling an original data set; randomly drawing a sample as the seed number, and substituting the seed number into the state function for calculation; judging whether the evaluation indexes corresponding to the various parameter requirements in the screening condition are met, and if so, proceeding to the next step, and otherwise, returning to the previous step; calculating whether the state function meets the requirement, and if so, proceeding to the next step, and otherwise, skipping to the last step; adding the seed number meeting the requirement to the parameter set; and decomposing the state function to analyze why the condition is not met, and adding an optimal sample to the parameter set.

In an embodiment of the present application, the original data set is a big data set from which sensitive data has been removed; the parameter set is a sample data set.

In an embodiment of the present application, the screening condition is established according to a specific parameter type in the original data set.

In an embodiment of the present application, the state function is decomposed by a dynamic programming algorithm.

In an embodiment of the present application, decomposing the state function to analyze why the condition is not met and adding the optimal sample to the parameter set includes: randomly calling a sample that does not meet the screening condition; splitting the large problem of not meeting the screening condition into a plurality of small problems; working backward step by step from the last of the small problems, finding the reasons the condition is not met according to the state function, analyzing, according to those reasons, the imperfect conditions in the screening condition corresponding to the state function, and repeating these steps to obtain a plurality of unsatisfied samples; selecting, according to the screening condition, the optimal solution among the unsatisfied samples that can optimize the state function as the optimal sample; and outputting the optimal sample and adding it to the parameter set.

In an embodiment of the present application, the state function is a screening process established according to the number of the samples, and can be adjusted in real time according to the unsatisfied samples.

To achieve the above and other related objects, the present application provides a big data processing apparatus, comprising: an establishing module, configured to establish the number of samples according to a preset condition and to establish a state function according to the number of samples; and a processing module, configured to screen seed numbers according to the state function and add the seed numbers meeting the screening condition to a parameter set, and to judge whether the parameter set meets the number of samples, and if so, output the parameter set, and otherwise, jump to the previous step.

To achieve the above and other related objects, the present application provides a computer apparatus, comprising: a memory, and a processor; the memory is to store computer instructions; the processor executes computer instructions to implement the method as described above.

To achieve the above and other related objects, the present application provides a computer readable storage medium storing computer instructions which, when executed, perform the method as described above.

In summary, according to the big data processing method, apparatus, device and storage medium based on sensitive data of the present application, the number of samples is determined according to a preset condition, and the state function is determined according to the number of samples; seed numbers are screened according to the state function, and the seed numbers meeting the screening condition are added to a parameter set; and it is judged whether the parameter set meets the number of samples, and if so, the parameter set is output, and otherwise, the method jumps to the previous step.

The present application has the following beneficial effects:

1. According to the big data processing method based on sensitive data, by establishing the number of samples and the state function, the required sample data set can be quickly screened out from the original data set from which sensitive data has been removed, and the state function can be optimized using the unsatisfied samples, so that the data characteristics represented by the sample data set are highly consistent with the true data characteristics of the original data set. This helps the big data user understand the big data more comprehensively and avoids asymmetry of information flow in statistics. The method thus offers high screening efficiency while preserving the authenticity of the original data set.

2. According to the big data processing method based on sensitive data, the reasons why unsatisfied samples fail the conditions can be analyzed by means of a dynamic programming algorithm, and sample data that does not directly satisfy the conditions but still has reference value is added to the parameter set, which further improves the value of the processed parameter set.

3. According to the big data processing method based on sensitive data, the big data from which sensitive data has been removed is processed as the original data set, so that attributes of low value density in the big data can be efficiently removed and a sample data set of high reference value is retained, which further improves the authenticity of the sample data set.

Drawings

Fig. 1 is a flowchart illustrating a big data processing method based on sensitive data according to an embodiment of the present application.

Fig. 2 is a flowchart illustrating step S2 of the sensitive data-based big data processing method according to an embodiment of the present application.

Fig. 3 is a flowchart illustrating step S26 of the sensitive data-based big data processing method according to an embodiment of the present application.

Fig. 4 is a block diagram of a big data processing apparatus according to an embodiment of the present application.

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Embodiments of the present application will be described in detail below with reference to the accompanying drawings so that those skilled in the art to which the present application pertains can easily carry out the present application. The present application may be embodied in many different forms and is not limited to the embodiments described herein.

In order to clearly explain the present application, components that are not related to the description are omitted, and the same reference numerals are given to the same or similar components throughout the specification.

Throughout the specification, when a component is referred to as being "connected" to another component, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another element interposed therebetween. In addition, when a component is referred to as "including" a certain constituent element, unless otherwise stated, it means that the component may include other constituent elements, without excluding other constituent elements.

When an element is referred to as being "on" another element, it can be directly on the other element, or intervening elements may also be present. When a component is referred to as being "directly on" another component, there are no intervening components present.

Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by these terms. These terms are only used to distinguish one element from another, for example, a first interface from a second interface. Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition will occur only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" include plural forms as long as the words do not expressly indicate a contrary meaning. The term "comprises/comprising" when used in this specification is taken to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.

Terms indicating relative space, such as "lower" and "upper", may be used to more easily describe the relationship of one component to another component illustrated in the drawings. Such terms are intended to include not only the meanings indicated in the drawings, but also other meanings or operations of the device in use. For example, if the device in the figures is turned over, elements described as "below" other elements would then be oriented "above" the other elements. Thus, the exemplary terms "below" and "beneath" can encompass both above and below. The device may be rotated by 90 degrees or by other angles, and the terms representing relative space are to be interpreted accordingly.

Fig. 1 is a schematic flow chart of a big data processing method based on sensitive data according to an embodiment of the present application. As shown in the figure, the method includes steps S1 to S3, which are specifically as follows:

Step S1: establishing the number of samples according to a preset condition, and establishing a state function according to the number of samples.

In this embodiment, the preset condition is a condition for screening target data to determine the required samples, such as data type, attributes, or categories.

The state function is a function constructed from a plurality of state attributes and used to characterize changes in the data system. When the state of the system changes, a series of properties of the system change accordingly, and the amount of change depends only on the initial state and the final state, not on the path taken during the change.
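The path-independence property described above can be illustrated with a minimal Python sketch. The application does not give a formula for the state function, so the function `state_value`, the category labels and the target counts below are hypothetical and chosen only for illustration: the value depends on which samples have been selected, not on the order in which they were added.

```python
from collections import Counter


def state_value(selected, target_counts):
    """Hypothetical state function: total shortfall between the target count
    per category and the count of that category among the selected samples."""
    have = Counter(selected)
    return sum(max(0, need - have[cat]) for cat, need in target_counts.items())


# Assumed targets: two samples of category "A" and one of category "B".
target = {"A": 2, "B": 1}

# Two different paths that end in the same final state.
path_one = ["A", "B", "A"]
path_two = ["B", "A", "A"]

initial = state_value([], target)
change_one = state_value(path_one, target) - initial
change_two = state_value(path_two, target) - initial
```

Both paths produce the same change in the state value because only the initial and final states enter the calculation.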

In this embodiment, by establishing the number of samples and the state function, a required sample data set can be quickly screened from an original data set from which sensitive data is removed, and the state function can be optimized by unsatisfied samples, so that data characteristics represented by the sample data set are highly consistent with authenticity data characteristics of the original data set, thereby facilitating a big data user to more comprehensively know big data information, avoiding asymmetry of information circulation in statistics, and having the advantages of high screening efficiency and keeping authenticity of the original data set.

Step S2: screening seed numbers according to the state function, and adding the seed numbers meeting the screening condition to a parameter set.

As shown in fig. 2, in the present embodiment, the step S2 includes steps S21 to S26, which are as follows:

Step S21: calling an original data set.

In an embodiment of the present application, the original data set is a big data set from which sensitive data has been removed. Generally, in fields involving sensitive data, such as medical data or government identity data, the data cannot be disclosed directly to a data user because it contains sensitive information.

In this embodiment, the big data from which the sensitive data is removed is processed as the original data set, so that the low-value density attribute in the big data can be efficiently removed, and a sample data set with high-value referential property is left, which has the advantage of further improving the authenticity of the sample data set.

Step S22: randomly drawing a sample as the seed number, and substituting the seed number into the state function for calculation.

In this embodiment, the random numbers generated by the computer are simulated by a long sequence of numbers and are therefore called pseudo-random numbers. In practical applications, such numbers generally have all the probabilistic and statistical properties of true random numbers, so a large number of pseudo-random sequences can be generated. The first random number of a sequence corresponds to a number, and this number is called the seed number.
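As a brief illustration of how a seed determines a pseudo-random sequence, the following Python sketch uses the standard library `random` module. The helper name `draw_sequence` and the seed values 42 and 43 are arbitrary choices made for illustration and do not come from the application.

```python
import random


def draw_sequence(seed, count):
    """Return `count` pseudo-random integers from a generator started at `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randrange(1000) for _ in range(count)]


# The same seed reproduces the same sequence.
first = draw_sequence(42, 5)
second = draw_sequence(42, 5)

# A different seed starts a different sequence.
other = draw_sequence(43, 5)
```

Fixing the seed in this way makes a random draw repeatable, which can be convenient when a screening run needs to be reproduced.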

Step S23: judging whether the evaluation indexes corresponding to the various parameter requirements in the screening condition are met; if so, proceeding to step S24, and otherwise, returning to step S22.

Step S24: calculating whether the state function meets the requirement; if so, proceeding to step S25, and otherwise, skipping to the last step S26.

Step S25: adding the seed number meeting the requirement to the parameter set.

In an embodiment of the present application, the parameter set is a sample data set.

Step S26: decomposing the state function to analyze why the condition is not met, and adding an optimal sample to the parameter set.

Dynamic programming is a method used in mathematics, computer science and economics to solve complex problems by decomposing the original problem into relatively simple sub-problems. A dynamic programming algorithm divides the problem, defines the problem states and the relations between them, and solves the problem recursively (or in a divide-and-conquer manner). Dynamic programming is particularly effective when sub-problems overlap, because it saves the solutions of the sub-problems in a table and, when the solution of a sub-problem is needed again, reads the value directly, thus avoiding repeated calculation.
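The table-based reuse of sub-problem solutions can be shown with a generic textbook example that is unrelated to the screening conditions of the application: finding the fewest coins that sum to a given amount. The function name `min_coins` and the coin values are assumptions made for illustration only.

```python
def min_coins(coins, amount):
    """Return the fewest coins summing to `amount`, or None if impossible."""
    unreachable = float("inf")
    # table[a] holds the solved sub-problem "fewest coins for amount a".
    table = [0] + [unreachable] * amount
    for a in range(1, amount + 1):
        for c in coins:
            # Reuse the stored solution for the smaller amount a - c.
            if c <= a and table[a - c] + 1 < table[a]:
                table[a] = table[a - c] + 1
    return None if table[amount] == unreachable else table[amount]


fewest = min_coins([1, 3, 4], 6)
```

Each entry of `table` is computed once and then read directly whenever a larger amount depends on it, which is the saving the paragraph above describes.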

In this embodiment, the reasons why unsatisfied samples fail the conditions can be analyzed by means of a dynamic programming algorithm, and sample data that does not directly satisfy the conditions but still has reference value is added to the parameter set, which further improves the value of the processed parameter set.

In an embodiment of the present application, as shown in fig. 3, the step S26, that is, the dynamic programming algorithm, specifically includes steps S261 to S265, which are as follows:

Step S261: randomly calling a sample that does not meet the screening condition;

Step S262: splitting the large problem of not meeting the screening condition into a plurality of small problems;

Step S263: working backward step by step from the last of the small problems, finding the reasons the condition is not met according to the state function, analyzing, according to those reasons, the imperfect conditions in the screening condition corresponding to the state function, and repeating these steps to obtain a plurality of unsatisfied samples;

Step S264: selecting, according to the screening condition, the optimal solution among the unsatisfied samples that can optimize the state function as the optimal sample;

Step S265: outputting the optimal sample and adding it to the parameter set.

In this embodiment, the state function is decomposed by a dynamic programming algorithm. The dynamic programming algorithm splits the problem and defines the problem states and the relations between them, so that the problem can be solved recursively. When any sub-problem is solved, the possible local solutions are listed, those local solutions that may lead to the optimum are retained through decision-making, and the others are discarded. The sub-problems are solved in sequence, and the solution of the last sub-problem is the solution of the initial problem.
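One possible reading of steps S261 to S265 is sketched below in Python. It is not a dynamic programming implementation and is not taken from the application; it only illustrates splitting the overall screening check into one small check per criterion, recording which criteria a rejected sample fails, and selecting the rejected sample with the fewest failures. The criteria, field names and records are all hypothetical.

```python
def failed_criteria(sample, criteria):
    """Run one small check per criterion and return the names of those failed."""
    return [name for name, check in criteria.items() if not check(sample)]


def pick_optimal_sample(rejected, criteria):
    """Return the rejected sample failing the fewest criteria and its reasons.
    Ties go to the earliest sample; an empty input yields (None, [])."""
    if not rejected:
        return None, []
    best = min(rejected, key=lambda s: len(failed_criteria(s, criteria)))
    return best, failed_criteria(best, criteria)


# Hypothetical screening criteria applied to made-up, non-sensitive records.
criteria = {
    "age_in_range": lambda s: 18 <= s["age"] <= 65,
    "has_lab_value": lambda s: s["lab"] is not None,
    "enough_visits": lambda s: s["visits"] >= 3,
}
rejected = [
    {"age": 70, "lab": None, "visits": 1},
    {"age": 40, "lab": 5.2, "visits": 2},
    {"age": 17, "lab": 4.8, "visits": 0},
]
optimal, reasons = pick_optimal_sample(rejected, criteria)
```

Here the second record is chosen because it fails only the visit-count criterion, and the list of failed criteria indicates which part of the screening condition might be adjusted.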

Step S3: judging whether the parameter set meets the number of samples; if so, outputting the parameter set, and otherwise, jumping to the previous step.

In this embodiment, the parameter set is a sample data set.
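The overall flow of steps S1 to S3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation of the application: the function `screen_samples`, the predicates `meets_conditions` and `state_ok`, the target mean of 4.5, the tolerance of 3.0 and the records are all hypothetical, samples are drawn without replacement so that the loop is guaranteed to end, and the analysis of step S26 is reduced to collecting the rejected samples.

```python
import random


def screen_samples(dataset, sample_count, meets_conditions, state_ok, seed=0):
    """Collect up to `sample_count` accepted samples (steps S1 to S3, simplified)."""
    rng = random.Random(seed)
    pool = list(dataset)      # S21: call the original (de-identified) data set
    rng.shuffle(pool)         # assumption: draw without replacement so the loop ends
    parameter_set, rejected = [], []
    for candidate in pool:    # S22: randomly draw one sample
        if len(parameter_set) >= sample_count:   # S3: number of samples reached
            break
        if not meets_conditions(candidate):      # S23: screening condition not met
            continue
        if state_ok(parameter_set, candidate):   # S24: state function satisfied
            parameter_set.append(candidate)      # S25: add to the parameter set
        else:
            rejected.append(candidate)           # left for the S26 analysis
    return parameter_set, rejected


# Made-up, non-sensitive records and a hypothetical state check: the running
# mean of "value" must stay within 3.0 of an assumed target mean of 4.5.
dataset = [{"id": i, "value": i % 10} for i in range(100)]


def meets_conditions(sample):
    return sample["value"] is not None


def state_ok(selected, candidate):
    values = [s["value"] for s in selected] + [candidate["value"]]
    return abs(sum(values) / len(values) - 4.5) <= 3.0


selected, rejected = screen_samples(dataset, 10, meets_conditions, state_ok)
```

In this sketch a candidate is accepted only if the selected set stays close to the assumed target after it is added, which mirrors the goal of keeping the sample data set consistent with the characteristics of the original data set.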

In summary, by establishing the number of samples and the state function, the method of the present application can quickly screen out the required sample data set from the original data set from which sensitive data has been removed, and the state function can be optimized using the unsatisfied samples, so that the data characteristics expressed by the sample data set are highly consistent with the true data characteristics of the original data set. The method thus offers high screening efficiency while preserving the authenticity of the original data set.

Fig. 4 is a block diagram of a big data processing apparatus according to an embodiment of the present application. As shown, the apparatus 400 includes:

an establishing module 401, configured to establish a number of samples according to a preset condition, and establish a state function according to the number of samples;

a processing module 402, configured to filter seed numbers according to the state function, and add the seed numbers meeting the filtering condition to a parameter set; and judging whether the parameter set meets the sample number, if so, outputting the parameter set, and otherwise, jumping to the previous step.

It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiment described in the present application, the technical effect brought by the contents is the same as the method embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.

It should be further noted that the division of the above apparatus into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity, or may be physically separate. These modules may be implemented entirely in software invoked by a processing element, or entirely in hardware; alternatively, some of the modules may be implemented as software invoked by a processing element and others in hardware. For example, the processing module 402 may be a separate processing element, may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, with a processing element of the apparatus calling and executing the functions of the processing module 402. Other modules are implemented similarly. In addition, all or part of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each of the above modules may be carried out by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown, the computer device 500 includes: a memory 501 and a processor 502; the memory 501 is used for storing computer instructions; the processor 502 executes the computer instructions to implement the method described in fig. 1.

In some embodiments, the computer device 500 may include one or more memories 501, one or more processors 502, and one or more communicators 503; fig. 5 is taken as one example.

In an embodiment of the present application, the processor 502 in the computer device 500 loads one or more instructions corresponding to processes of an application program into the memory 501 according to the steps described in fig. 1, and the processor 502 executes the application program stored in the memory 501, thereby implementing the method described in fig. 1.

The memory 501 may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory. The memory 501 stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an expanded set thereof, wherein the operating instructions may include various operating instructions for implementing various operations. The operating system may include various system programs for implementing various basic services and for handling hardware-based tasks.

The processor 502 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In some specific applications, the various components of the computer device 500 are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of explanation, however, the various buses are collectively shown in fig. 5 as a bus system.

In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method described in fig. 1.

As will be appreciated by one of ordinary skill in the art, the embodiments realizing the functions of the above system and units may be implemented by hardware associated with a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. When the program is executed, the embodiments including the functions of the system and the units are carried out. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.

The application effectively overcomes various defects in the prior art and has high industrial utilization value.

The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the invention. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present application.

Claims

1. A big data processing method based on sensitive data is characterized by comprising the following steps:

establishing the number of samples according to a preset condition, and establishing a state function according to the number of the samples;

screening the number of seeds according to the state function, and adding the number of seeds meeting the screening condition into a parameter set;

and judging whether the parameter set meets the sample number, if so, outputting the parameter set, and otherwise, jumping to the previous step.

2. The method of claim 1, wherein the screening the number of seeds according to the state function, and adding the number of seeds satisfying the screening condition to a parameter set comprises:

calling an original data set;

randomly extracting a sample as the seed number, and substituting the seed number into the state function for calculation;

judging whether the evaluation indexes corresponding to various parameter requirements in the screening conditions are met or not; if yes, carrying out the next step, otherwise, skipping to the previous step;

calculating whether the state function meets the requirement, if so, carrying out the next step, otherwise, skipping to the last step;

adding the seed number meeting the requirement into the parameter set;

and decomposing the state function to analyze why the condition is not met, and adding an optimal sample to the parameter set.

3. The method of claim 2, wherein the raw data set is a big data set with sensitive data removed; the parameter set is a sample data set.

4. The method of claim 2, wherein the screening criteria are established based on specific parameter classes in the raw data set.

5. The method according to claim 2, characterized in that the state function is decomposed by a dynamic programming algorithm.

6. The method of claim 2, wherein said decomposing the state function to analyze why the condition is not met, and adding an optimal sample to the parameter set comprises:

randomly calling a sample which does not meet the screening condition;

splitting a big problem which does not meet the screening condition into a plurality of small problems;

backward pushing from the last step of the minor problems according to the steps, finding out reasons which do not meet the conditions according to the state function, analyzing imperfect conditions in the screening conditions corresponding to the state function according to the reasons, and repeating the steps to obtain a plurality of unsatisfied samples;

selecting the optimal solution which can optimize the state function in the unsatisfied samples as the optimal sample according to the screening conditions;

outputting the optimal sample to add to the set of parameters.

7. The method of claim 6, wherein the state function is a screening process established according to the number of samples, and is adjustable in real time according to the unsatisfied samples.

8. A big data processing apparatus, the apparatus comprising:

the establishing module is used for establishing the number of samples according to a preset condition and establishing a state function according to the number of the samples;

the processing module is used for screening the seed number according to the state function and adding the seed number meeting the screening condition into a parameter set; and judging whether the parameter set meets the sample number, if so, outputting the parameter set, and otherwise, jumping to the previous step.

9. A computer device, the device comprising: a memory, and a processor; the memory is to store computer instructions; the processor executes computer instructions to implement the method of any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon computer instructions which, when executed, perform the method of any one of claims 1 to 7.