CN110399209B - Data processing method, system, electronic device and storage medium - Google Patents

Data processing method, system, electronic device and storage medium

Info

Publication number
CN110399209B
Authority
CN
China
Prior art keywords
data
source cluster
area
desensitization
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910688165.5A
Other languages
Chinese (zh)
Other versions
CN110399209A (en
Inventor
张世瑛
曹伟
梁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN201910688165.5A
Publication of CN110399209A
Application granted
Publication of CN110399209B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45579 I/O management, e.g. providing access to device drivers or storage
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data processing method applied to a scheduling device, the method including: acquiring configuration information, wherein the configuration information comprises object information and a sampling rule of a target object to be sampled, the target object is stored in a source cluster device, the source cluster device comprises a sandbox area and a non-sandbox area which are independent of each other, and the target object is stored in the non-sandbox area; generating a control instruction based on the sampling rule and the object information; and sending the control instruction to the source cluster device to enable the source cluster device to sample source data in the target object, store the sampled data obtained by sampling into the sandbox area of the source cluster device, and copy the sampled data from the sandbox area to a target cluster device. The present disclosure also provides a data processing system, an electronic device, and a computer-readable storage medium.

Description

Data processing method, system, electronic device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, system, electronic device, and storage medium.
Background
Data preparation for big data platforms typically consists of copying data from a source cluster to a target cluster. In the prior art, this copying operation is complex and its processing flow is long, so data preparation takes a long time and requires a large amount of hardware resources, network resources and the like.
Disclosure of Invention
In view of the above, the present disclosure provides a data processing method, system, electronic device, and storage medium.
One aspect of the present disclosure provides a data processing method applied to a scheduling device, the method including: acquiring configuration information, wherein the configuration information comprises object information and a sampling rule of a target object to be sampled, the target object is stored in a source cluster device, the source cluster device comprises a sandbox area and a non-sandbox area which are independent of each other, and the target object is stored in the non-sandbox area; generating a control instruction based on the sampling rule and the object information; and sending the control instruction to the source cluster device to enable the source cluster device to sample the source data in the target object, store the sampled data obtained by sampling into the sandbox area of the source cluster device, and copy the sampled data from the sandbox area to a target cluster device.
According to an embodiment of the present disclosure, the configuration information further comprises a desensitization configuration; generating the control instruction based on the sampling rule and the object information includes: determining metadata of the target object based on the object information; establishing a data table based on the metadata; determining a desensitization function according to the desensitization configuration, the desensitization function being used for data desensitization of the sampled data; and generating a control instruction according to the sampling rule, the data table and the desensitization function.
According to an embodiment of the present disclosure, the control instruction is generated so that the following operations are performed: obtaining sampled data from the target object according to the sampling rule and the object information; performing data desensitization on the sampled data to obtain desensitized data; storing the desensitized data in the sandbox area; and copying the desensitized data from the sandbox area to the target cluster device.
According to an embodiment of the present disclosure, the method further includes: acquiring a concurrency configuration parameter in a case where there are a plurality of control instructions for respectively executing different tasks; determining, based on the concurrency configuration parameter, the number of tasks executed by the source cluster device at the same time; and controlling the source cluster device to execute the plurality of control instructions based on the number of tasks.
According to an embodiment of the present disclosure, controlling the source cluster device to execute the plurality of control instructions based on the number of tasks includes: acquiring currently available resources in the source cluster device; and determining the currently available resources allocated to each task based on the currently available resources and the number of tasks, so as to run the control instruction of the task by using the allocated resources.
According to an embodiment of the present disclosure, the method further includes generating an acquisition record each time the scheduling device acquires the currently available resources in the source cluster device, so that the acquisition records can be queried for abnormal acquisition records.
According to the embodiment of the disclosure, the method further comprises verifying whether the amount of data copied to the target cluster is consistent with the amount of original data stored in the sandbox area; and sending out alarm information under the condition that the data volume is inconsistent with the original data volume.
Another aspect of the present disclosure provides a data processing system comprising: a source cluster device comprising a sandbox area and a non-sandbox area which are independent of each other, a target object being stored in the non-sandbox area; a target cluster device; and a scheduling device for executing the above method, wherein the source cluster device is configured to, in response to the control instruction, sample the target object stored in the non-sandbox area to obtain sampled data, store the sampled data in the sandbox area, and copy the sampled data from the sandbox area to the target cluster device.
Another aspect of the present disclosure provides an electronic device including: one or more processors; a storage device to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above-described method.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
According to the embodiment of the disclosure, the problems that the processing flow of copying the data in the source cluster to the target cluster is long, a large amount of time is consumed, and a large amount of hardware resources are required can be at least partially solved, and therefore the technical effects of reducing the processing steps required for copying the data in the source cluster to the target cluster and reducing resource consumption can be achieved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a schematic diagram of a data processing method;
FIG. 2 schematically illustrates an exemplary system architecture of a data processing method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a method of generating control instructions according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a data processing method according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates an operating principle diagram of a scheduling device 230 according to an embodiment of the present disclosure;
FIG. 7 schematically shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure; and
FIG. 8 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.).
Fig. 1 schematically shows a schematic diagram of a data processing method. As shown in fig. 1, the data preparation process may include the following steps: 1) sampling source data stored in a source cluster, and storing the sampled data in a temporary data storage unit; 2) desensitizing the sampled data; 3) exporting the desensitized data from the source cluster device 110 to a general storage unit A corresponding to the source cluster, such as a SAN (Storage Area Network) or DAS (Direct-Attached Storage); 4) importing the data of the general storage unit A onto a magnetic tape; 5) transferring the data to a target cluster by means of the tape; 6) restoring the tape data to a general storage unit B corresponding to the target cluster; and 7) importing the data from the general storage unit B into the target cluster.
Therefore, the processing flow of the data preparation process is long, and a large amount of hardware resources, network resources and the like are required.
The embodiment of the disclosure provides a data processing method applied to a scheduling device. The method comprises the processes of obtaining configuration information, generating a control instruction and sending the control instruction to a source cluster device. The source cluster device samples source data in the target object in response to the control instruction, stores the sampled data obtained by sampling into a sandbox area of the source cluster device, and copies the sampled data from the sandbox area to the target cluster device. The configuration information comprises object information and a sampling rule of a target object to be sampled, so that a control instruction is generated according to the sampling rule and the object information. The target object is stored in a source cluster device, the source cluster device comprises a sandbox area and a non-sandbox area which are mutually independent, and the target object is stored in the non-sandbox area.
Fig. 2 schematically illustrates an exemplary system architecture 200 of a data processing method according to an embodiment of the present disclosure. It should be noted that fig. 2 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 2, a system architecture 200 according to this embodiment may include a source cluster 210, a target cluster 220, and a scheduling device 230.
According to embodiments of the present disclosure, the source cluster 210 may include, for example, a plurality of node devices that collectively maintain one or more databases.
According to an embodiment of the present disclosure, a sandbox area 211 is created in the storage area of the source cluster 210, and the operating environment of the sandbox area 211 is isolated from the non-sandbox area 212 of the source cluster 210.
According to an embodiment of the present disclosure, the scheduling device 230 is configured to generate a control instruction and send the control instruction to the source cluster 210, so that the source cluster 210 executes the control instruction. The operations performed by the source cluster 210 according to the control instruction include sampling data in the non-sandboxed area 212 of the source cluster 210, storing the sampled data in the sandboxed area 211, and copying the data in the sandboxed area 211 to the target cluster 220, so that software testing, model training, analysis mining, and the like can be performed by using the data in the target cluster 220.
Fig. 3 schematically shows a flow chart of a data processing method according to an embodiment of the present disclosure. The data processing method may be performed by the scheduling device 230 shown in fig. 2, for example.
As shown in fig. 3, the method may include operations S310 to S330.
In operation S310, configuration information is acquired. The configuration information includes object information and a sampling rule of a target object to be sampled. The target object is stored in the source cluster device. The storage area of the source cluster device may include a sandbox area and a non-sandbox area that are independent of each other, and the target object is stored in the non-sandbox area.
According to an embodiment of the present disclosure, the configuration information may be, for example, input by a user on the terminal device. The configuration information may include object information and sampling rules for the target object.
According to an embodiment of the present disclosure, the target object may be, for example, a data table stored in the non-sandboxed area 212. The object information of the target object may be, for example, a table name of a data table. The data table may have various forms, and embodiments of the present disclosure do not limit this.
The sampling rule may be, for example, to extract from the data table the data containing the string "beijing", or to extract from the data table the data dated May 30, or the like.
For example, in the system architecture shown in FIG. 2, the storage area of the source cluster 210 may include a sandbox area 211 and a non-sandbox area 212. The target object may be stored in the non-sandboxed area 212.
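The configuration information acquired in operation S310 can be pictured as a small record. The following is a minimal sketch only: the class name `ConfigurationInfo` and its field names are hypothetical, and the optional `desensitization` and `concurrency` fields anticipate the desensitization configuration and concurrency configuration parameter that later embodiments of this disclosure describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigurationInfo:
    """Hypothetical container for the configuration information."""
    table_names: list[str]                 # object information: tables to sample
    sampling_rule: str                     # e.g. a filter predicate
    desensitization: Optional[str] = None  # identifier of a desensitization function
    concurrency: int = 1                   # tasks executed at the same time

    def validate(self) -> None:
        # Reject configurations that cannot produce a control instruction.
        if not self.table_names:
            raise ValueError("at least one target object is required")
        if not self.sampling_rule.strip():
            raise ValueError("a sampling rule is required")
        if self.concurrency < 1:
            raise ValueError("concurrency must be at least 1")


config = ConfigurationInfo(table_names=["customer"], sampling_rule="city = 'beijing'")
config.validate()
```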
In operation S320, a control instruction is generated based on the sampling rule and the object information.
According to an embodiment of the present disclosure, the control instruction may be, for example, a Structured Query Language (SQL) script.
According to an embodiment of the present disclosure, for example, the target object may be determined according to the object information, and the metadata of the target object may be determined, so that a data table whose metadata is identical to that of the target object is established, and an SQL script is generated according to the established data table and the sampling rule.
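As a minimal sketch of operation S320, the following Python snippet builds an SQL script from a table name and a sampling rule. The schema names `prod` and `sandbox`, the function name `build_control_instruction`, and the `CREATE TABLE ... LIKE` syntax (available in some engines such as MySQL and Hive, but not all) are assumptions for illustration; the disclosure does not specify an SQL dialect.

```python
def build_control_instruction(table_name: str, sampling_rule: str,
                              source_schema: str = "prod",
                              sandbox_schema: str = "sandbox") -> str:
    """Return an SQL script that samples a table into the sandbox area.

    sampling_rule is an SQL predicate, e.g. "city = 'beijing'".
    The schema names are hypothetical placeholders.
    """
    src = f"{source_schema}.{table_name}"
    dst = f"{sandbox_schema}.{table_name}"
    statements = [
        # Create a sandbox table with the same structure as the source table.
        f"CREATE TABLE {dst} LIKE {src};",
        # Copy only the rows selected by the sampling rule.
        f"INSERT INTO {dst} SELECT * FROM {src} WHERE {sampling_rule};",
    ]
    return "\n".join(statements)


script = build_control_instruction("customer", "city = 'beijing'")
print(script)
```

Because the script is assembled by string interpolation, a real scheduler would need to validate table names and sampling rules before generating it.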
In operation S330, a control instruction is sent to the source cluster device to cause the source cluster device to sample source data in the target object, store the sampled data obtained by sampling into a sandbox area of the source cluster device, and copy the sampled data from the sandbox area to the target cluster device.
According to the embodiment of the disclosure, the data processing method establishes the sandbox area in the source cluster, controls the source cluster to sample data through the scheduling device, and stores the sampled data in the sandbox area, so that the data in the sandbox area can be directly copied to the target cluster. Therefore, on one hand, the data processing method does not need to carry out multiple data accesses in the source cluster, so that not only are a large amount of CPU and I/O resources saved, but also the time for extracting data from the source cluster to the target cluster is saved. On the other hand, the sandbox area is established in the source cluster, and the sandbox area and the non-sandbox area are independent from each other, so that the data in the sandbox area can be directly copied to the target cluster without being transmitted through a disk under the condition that the safety of the data in the source cluster is guaranteed.
Fig. 4 schematically illustrates a flowchart of operation S320, according to an embodiment of the present disclosure, for the case in which the configuration information further includes a desensitization configuration.
As shown in fig. 4, operation S320 may further include operations S321 to S324.
In operation S321, metadata of the target object is determined based on the object information. For example, the source data table stored in the source cluster may be located according to its table name, and the metadata of that source data table may then be determined.
In operation S322, a data table is built based on the metadata. For example, the scheduling device may create a data table whose metadata is at least partially identical to that of the source data table.
In operation S323, a desensitization function for data desensitization of the sampled data is determined according to the desensitization configuration.
According to embodiments of the present disclosure, a unique desensitization function may be determined, for example, based on a desensitization configuration entered by a user. The desensitization configuration may be, for example, user-entered identification information of the desensitization function. As will be understood by those skilled in the art, a "desensitization function" refers to a function used to desensitize data.
In operation S324, a control instruction is generated according to the sampling rule, the data table, and the desensitization function. For example, the desensitization function is added to the generated SQL statements for the data table, and an SQL script is generated according to the sampling rule.
According to the embodiment of the disclosure, the data processing method includes the desensitization function in the generated control instruction, so that data extraction and desensitization of the sampled data can be further completed, and hardware resources such as a CPU (central processing unit), an I/O (input/output) and the like are further saved.
According to an embodiment of the present disclosure, the scheduling device sends the generated control instruction to the source cluster, so that the source cluster executes the control instruction.
The operations executed by the source cluster according to the control instruction include: obtaining sampled data from the target object according to the sampling rule and the object information; performing data desensitization on the sampled data to obtain desensitized data; storing the desensitized data in the sandbox area; and copying the desensitized data from the sandbox area to the target cluster.
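The sample, desensitize, and store steps above can be simulated locally with Python's built-in `sqlite3` module. This is a sketch under stated assumptions, not the disclosed implementation: two in-memory SQLite databases stand in for the non-sandbox and sandbox areas, the `mask_name` function and the `customer` table with its rows are hypothetical, and the final copy to the target cluster is not shown.

```python
import sqlite3


def mask_name(value):
    """Hypothetical desensitization function: keep the first character, mask the rest."""
    if not value:
        return value
    return value[0] + "*" * (len(value) - 1)


# Identifier -> function; the desensitization configuration selects one entry.
DESENSITIZERS = {"mask_name": mask_name}

conn = sqlite3.connect(":memory:")                     # stands in for the non-sandbox area
conn.execute("ATTACH DATABASE ':memory:' AS sandbox")  # stands in for the sandbox area
conn.create_function("mask_name", 1, DESENSITIZERS["mask_name"])

conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Alice", "beijing"), ("Bob", "shanghai"), ("Carol", "beijing")],
)

# Sample, desensitize, and store in the sandbox in a single pass over the source.
conn.execute("CREATE TABLE sandbox.customer (name TEXT, city TEXT)")
conn.execute(
    "INSERT INTO sandbox.customer "
    "SELECT mask_name(name), city FROM main.customer WHERE city = 'beijing'"
)

rows = conn.execute("SELECT name, city FROM sandbox.customer ORDER BY name").fetchall()
print(rows)
```

Putting the desensitization function inside the `SELECT` means the source rows are read once, which mirrors the resource saving the disclosure attributes to including the desensitization function in the control instruction.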
Fig. 5 schematically illustrates a data processing method according to another embodiment of the present disclosure.
As shown in fig. 5, the data processing method may further include operations S510 to S530 on the basis of operations S310 to S330 shown in fig. 3. For example, these operations may be performed after operation S320.
In operation S510, in the case where there are a plurality of control instructions for respectively executing different tasks, a concurrency configuration parameter is acquired.
In operation S520, the number of tasks performed by the source cluster device at the same time is determined based on the concurrency configuration parameter.
According to the embodiment of the present disclosure, for example, the configuration information may include object information of a plurality of target objects, and one control instruction is generated according to the sampling rule, desensitization configuration, and metadata for each target object, so that each control instruction is used to perform different tasks, respectively.
According to an embodiment of the present disclosure, in operations S510 and S520, a user may set a concurrency configuration parameter to control the number of tasks simultaneously executed by a source cluster. For example, the concurrency configuration parameter may be 3, and the source cluster executes the control instructions corresponding to 3 tasks at the same time.
In operation S530, the source cluster device is controlled to execute the plurality of control instructions based on the number of tasks.
According to the embodiment of the disclosure, the method may enable the user to manage the concurrency amount of the control instructions executed by the source cluster 210 by setting the concurrency configuration parameter.
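One way to picture operations S510 to S530 is a worker pool whose size equals the concurrency configuration parameter. The sketch below uses Python's standard `ThreadPoolExecutor`; the function name `run_control_instructions` and the placeholder task body are assumptions, and a real scheduler would send each instruction to the source cluster instead of returning a string.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_control_instructions(instructions, concurrency):
    """Run instructions with at most `concurrency` executing at once.

    Returns (results, peak), where peak is the highest number of
    instructions observed running simultaneously.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def execute(instruction):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            # Placeholder for sending the instruction to the source cluster.
            return f"done: {instruction}"
        finally:
            with lock:
                running -= 1

    # The pool size enforces the concurrency configuration parameter.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(execute, instructions))
    return results, peak


results, peak = run_control_instructions([f"task-{i}" for i in range(10)], concurrency=3)
print(len(results), peak <= 3)
```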
According to an embodiment of the present disclosure, operation S530 may further include: acquiring currently available resources in the source cluster device; and determining the currently available resources allocated to each task based on the currently available resources and the number of tasks, so as to run the control instruction of the task using the allocated resources.
The currently available resources may include, for example, currently available CPU resources, memory resources, and the like.
According to the embodiments of the present disclosure, the CPU resource allocated to each task may be determined, for example, according to the currently available CPU resource and the number of concurrently executed tasks. Specifically, for example, there may be 100 currently available CPUs in the source cluster, and if the number of concurrently executed tasks is 100, one CPU resource may be allocated to each task.
According to the embodiment of the disclosure, the method can reasonably allocate the current available resources in the process of simultaneously executing a plurality of tasks.
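The CPU example above (100 available CPUs shared by 100 concurrent tasks) can be generalized as an even split. The even-split policy and the function name `allocate_cpus` are assumptions for illustration; the disclosure only states that the allocation depends on the currently available resources and the number of tasks.

```python
def allocate_cpus(available_cpus: int, task_count: int) -> list[int]:
    """Split available CPUs as evenly as possible across concurrent tasks."""
    if task_count <= 0:
        raise ValueError("task_count must be positive")
    base, extra = divmod(available_cpus, task_count)
    # The first `extra` tasks receive one additional CPU.
    return [base + 1 if i < extra else base for i in range(task_count)]


print(allocate_cpus(100, 100)[:3])  # [1, 1, 1]
print(allocate_cpus(100, 3))        # [34, 33, 33]
```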
According to an embodiment of the present disclosure, the data processing method may further include: generating an acquisition record each time the scheduling device acquires the currently available resources in the source cluster device, so that the acquisition records can later be queried for abnormal entries.
For example, an acquisition record may record the access source, the access time, and the access object of an access to the currently available resources of the source cluster.
For example, the IP address of the access source should be the IP address of the scheduling device; if a query of the acquisition records finds a record whose access source is any other IP address, that record is determined to be an abnormal acquisition record. The access object may be, for example, the table name of the accessed source data table.
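As a non-limiting illustration, the query for abnormal acquisition records can be sketched as a filter on the access source. The record fields mirror the access source, access time, and access object named above; the class name, function name, and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquisitionRecord:
    source_ip: str      # access source
    access_time: str    # access time
    access_object: str  # e.g. table name of the accessed source data table


def find_abnormal_records(records, scheduler_ip):
    """Return the records whose access source is not the scheduling device."""
    return [r for r in records if r.source_ip != scheduler_ip]


# Hypothetical sample data.
records = [
    AcquisitionRecord("10.0.0.5", "2019-07-26T10:00:00", "customer"),
    AcquisitionRecord("10.0.0.9", "2019-07-26T10:05:00", "customer"),
    AcquisitionRecord("10.0.0.5", "2019-07-26T10:10:00", "account"),
]
abnormal = find_abnormal_records(records, scheduler_ip="10.0.0.5")
```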
According to an embodiment of the present disclosure, the data processing method may further include: checking whether the data volume copied to the target cluster is consistent with the original data volume stored in the sandbox area; and issuing alarm information if the data volume is inconsistent with the original data volume.
For example, the scheduling device may read the data volume in the target cluster and compare it with the original data volume stored in the sandbox area.
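As a non-limiting illustration, the consistency check can be sketched as a comparison of two counts with an alarm callback. Measuring the data volume as a row count, the function name `check_copied_volume`, and the `alarm` callback are assumptions made for this sketch.

```python
def check_copied_volume(sandbox_rows, target_rows, alarm):
    """Compare the volume in the sandbox area with the volume copied to the target.

    Calls `alarm` with a message and returns False when the two differ.
    """
    if sandbox_rows != target_rows:
        alarm(f"data volume mismatch: sandbox={sandbox_rows}, target={target_rows}")
        return False
    return True


alarms = []
consistent = check_copied_volume(1000, 1000, alarms.append)
inconsistent = check_copied_volume(1000, 998, alarms.append)
```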
According to an embodiment of the present disclosure, the data processing method may further include generating an archive log of the current data preparation for later review. The log may record, for example, the time of the data preparation, the source data, the target cluster, and the like.
Fig. 6 schematically illustrates an operation principle diagram of the scheduling device 230 according to an embodiment of the present disclosure.
As shown in fig. 6, the scheduling device 230 may obtain input information, for example by performing operation S310 described above with reference to fig. 3. The input information may include, for example, a target object list, sampling rules, and a desensitization configuration. The target object list may contain, for example, the table name of the source data table from which data is extracted, and may include a plurality of table names so that data is extracted from a plurality of data tables.
According to an embodiment of the present disclosure, the scheduling device 230 builds a data table whose metadata is identical to the metadata of the source data table, and determines a desensitization function according to the desensitization configuration. The scheduling master control program can then generate control instructions according to the sampling rule, the desensitization function, and the metadata, for example by performing operation S320 described above with reference to fig. 3.
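As a non-limiting illustration, generating a control instruction from the metadata, the sampling rule, and the desensitization function can be sketched as building two SQL-like statements. The disclosure does not state that control instructions are SQL; the statement shape, the `sandbox.` prefix, the function name `build_control_instruction`, and the desensitization function name `mask_name` are all hypothetical.

```python
def build_control_instruction(table, metadata, sampling_rule, desensitization):
    """Build statements that create a matching table and fill it with sampled data.

    metadata: list of (column_name, column_type) taken from the source table.
    sampling_rule: a filter expression selecting the rows to sample.
    desensitization: mapping of column name to desensitization function name.
    """
    # The new table reuses the source table's metadata unchanged.
    columns = ", ".join(f"{name} {ctype}" for name, ctype in metadata)
    # Sensitive columns are wrapped in their desensitization function.
    select_list = ", ".join(
        f"{desensitization[name]}({name}) AS {name}" if name in desensitization else name
        for name, _ in metadata
    )
    create = f"CREATE TABLE sandbox.{table} ({columns})"
    insert = (
        f"INSERT INTO sandbox.{table} SELECT {select_list} "
        f"FROM {table} WHERE {sampling_rule}"
    )
    return [create, insert]


instruction = build_control_instruction(
    table="customer",
    metadata=[("id", "BIGINT"), ("name", "STRING")],
    sampling_rule="id % 100 = 0",
    desensitization={"name": "mask_name"},
)
```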
As shown in fig. 6, the scheduling device 230 may further obtain the concurrency configuration parameter, and determine the number of tasks executed by the source cluster device 210 in batch based on the concurrency configuration parameter. For example, operations S510-S530 described above with reference to FIG. 5 may be performed.
According to an embodiment of the present disclosure, the scheduling master control program may, for example, generate the control instructions according to the sampling rule, the desensitization function, the metadata, and the concurrency configuration, so that the source cluster device 210 is controlled through the control instructions to execute that number of tasks in each batch.
As shown in fig. 6, the scheduling device 230 may further acquire the currently available resources in the source cluster device 210 and determine the resources allocated to each task based on the currently available resources and the number of tasks, so that each task runs using its allocated resources. According to an embodiment of the present disclosure, the scheduling device 230 may, for example, pass the acquired currently available resources to the scheduling master control program, which allocates them among the tasks according to the currently available resources and the number of tasks, so as to control the source cluster 210 to run the multiple tasks through the control instructions.
According to an embodiment of the present disclosure, after the scheduling device 230 has generated the complete scheduling master control program, the scheduling master control program may be sent to the source cluster device 210. The source cluster device 210 performs sampling and desensitization operations on the source data of the target object stored in the non-sandbox area 212 according to the scheduling master control program, stores the desensitized data in the sandbox area 211, and then copies the desensitized data stored in the sandbox area 211 to the target cluster device 220.
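As a non-limiting illustration, the sequence performed by the source cluster device (sample from the non-sandbox area, desensitize, store in the sandbox area, copy to the target cluster) can be sketched with in-memory lists standing in for the storage areas. The every-n-th-row sampling rule, the keep-last-four masking function, and all names and sample values are assumptions made for this sketch.

```python
def mask_keep_last4(value):
    """Hypothetical desensitization function: hide all but the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def prepare_data(non_sandbox_rows, sample_every, sensitive_field):
    """Sample, desensitize, store in the sandbox, then copy to the target.

    Returns (sandbox, target); the rows in the non-sandbox area are not modified.
    """
    # Sampling rule (assumed): take every n-th row of the source data.
    sampled = non_sandbox_rows[::sample_every]
    # Desensitized copies are what gets stored in the sandbox area.
    sandbox = [
        {**row, sensitive_field: mask_keep_last4(row[sensitive_field])}
        for row in sampled
    ]
    # Only sandbox contents are copied to the target cluster.
    target = [dict(row) for row in sandbox]
    return sandbox, target


source_rows = [{"id": i, "account": f"62220000{i:08d}"} for i in range(10)]
sandbox, target = prepare_data(source_rows, sample_every=5, sensitive_field="account")
```

Because the target is populated only from the sandbox list, the unmasked values in the non-sandbox list never reach it in this sketch.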
As shown in fig. 6, the scheduling device 230 may also check whether the target data copied to the target cluster 220 is consistent with the original data in the source data, and send alarm information if the target data is inconsistent with the original data.
According to an embodiment of the present disclosure, the scheduling device 230 may, for example, access the target cluster 220 to obtain the amount of data copied into the target cluster 220, so as to compare whether the amount of data copied into the target cluster is consistent with the original amount of data in the source data.
As shown in fig. 6, the scheduling device 230 may further generate an acquisition record each time it acquires the currently available resources in the source cluster device 210, so that the acquisition records can be queried for abnormal entries. For example, an acquisition record may record the access source, the access time, and the access object of an access to the currently available resources of the source cluster. For example, the IP address of the access source should be the IP address of the scheduling device; if a query of the acquisition records finds a record whose access source is any other IP address, that record is determined to be an abnormal acquisition record.
As shown in fig. 6, the scheduling device 230 may also generate an archive log of the current data preparation for later review. The log may record, for example, the time of the data preparation, the source data, the target cluster, and the like.
Another aspect of the present disclosure discloses a data processing system.
The data processing system may include a source cluster device, a scheduling device, and a target cluster device.
The scheduling device is used for acquiring configuration information, wherein the configuration information comprises object information and a sampling rule of a target object to be sampled, the target object is stored in the source cluster device, and the scheduling device is used for generating a control instruction based on the sampling rule and the object information and sending the control instruction to the source cluster device. The scheduling device may be, for example, the scheduling device 230 shown in fig. 6.
The source cluster device comprises a sandbox area and a non-sandbox area, wherein the sandbox area and the non-sandbox area are mutually independent, and a target object is stored in the non-sandbox area. The source cluster device may be, for example, the source cluster device 210 shown in fig. 6.
The source cluster device is used for responding to the control instruction, sampling the target object stored in the non-sandbox area to obtain sampling data, storing the sampling data in the sandbox area, and copying the sampling data from the sandbox area to the target cluster device. The target cluster device may be, for example, the target cluster device 220 shown in fig. 6.
According to an embodiment of the present disclosure, the scheduling device may perform any of the data processing methods described above.
Another aspect of the present disclosure discloses a data processing apparatus.
Fig. 7 schematically shows a schematic diagram of a data processing apparatus 700 according to an embodiment of the present disclosure.
As shown in fig. 7, the data processing apparatus 700 includes an obtaining module 710, a generating module 720, and a transmitting module 730.
The obtaining module 710 may, for example, perform operation S310 described above with reference to fig. 3, for obtaining configuration information, where the configuration information includes object information and sampling rules of a target object to be sampled, the target object being stored in a source cluster device, the source cluster device including a sandbox area and a non-sandbox area which are independent of each other, and the target object being stored in the non-sandbox area.
The generating module 720, for example, may perform operation S320 described above with reference to fig. 3 for generating a control instruction based on the sampling rule and the object information.
The sending module 730, for example, may perform operation S330 described above with reference to fig. 3, and is configured to send the control instruction to the source cluster device, so that the source cluster device samples the source data in the target object, stores the sampled data obtained by sampling in a sandbox area of the source cluster device, and copies the sampled data from the sandbox area to the target cluster device.
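As a non-limiting illustration, the division of the data processing apparatus 700 into obtaining, generating, and sending modules can be sketched as three methods of one class. The configuration keys, the dictionary form of the control instruction, and the `transport` callable are hypothetical; the disclosure does not prescribe any of them.

```python
class DataProcessingApparatus:
    """Sketch of the obtain / generate / send split described for apparatus 700."""

    REQUIRED_KEYS = {"object_info", "sampling_rule"}

    def __init__(self, transport):
        # Hypothetical callable that delivers an instruction to the source cluster.
        self._transport = transport

    def obtain(self, raw_config):
        """Obtaining module: validate and return the configuration information."""
        missing = self.REQUIRED_KEYS - raw_config.keys()
        if missing:
            raise ValueError(f"missing configuration keys: {sorted(missing)}")
        return dict(raw_config)

    def generate(self, config):
        """Generating module: build a control instruction from the configuration."""
        return {
            "object": config["object_info"],
            "sampling_rule": config["sampling_rule"],
            "store_in": "sandbox",
        }

    def send(self, instruction):
        """Sending module: hand the control instruction to the source cluster."""
        self._transport(instruction)


sent = []
apparatus = DataProcessingApparatus(transport=sent.append)
config = apparatus.obtain({"object_info": "customer", "sampling_rule": "first 1000 rows"})
apparatus.send(apparatus.generate(config))
```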
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any of the obtaining module 710, the generating module 720 and the sending module 730 may be combined and implemented in one module, or any one of the modules may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 710, the generating module 720, and the sending module 730 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of the three implementations. Alternatively, at least one of the obtaining module 710, the generating module 720 and the sending module 730 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
FIG. 8 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 802 and/or RAM 803 described above and/or one or more memories other than the ROM 802 and RAM 803.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or incorporated in various ways, even if such combinations or incorporations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or incorporated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or incorporations are within the scope of the present disclosure. The embodiments of the present disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (9)

1. A data processing method, applied to a scheduling device, comprising:
acquiring configuration information, wherein the configuration information comprises object information and sampling rules of a target object to be sampled, the target object is stored in a source cluster device, the source cluster device comprises a sandbox area and a non-sandbox area which are independent of each other, and the target object is stored in the non-sandbox area;
generating a control instruction based on the sampling rule and the object information; and
sending the control instruction to the source cluster device to enable the source cluster device to sample source data in the target object, store sampled data obtained by sampling into the sandbox area of the source cluster device, and copy the sampled data from the sandbox area to a target cluster device;
wherein the configuration information further comprises a desensitization configuration; the generating a control instruction based on the sampling rule and the object information includes: determining metadata of the target object based on the object information; establishing a data table based on the metadata; determining a desensitization function according to the desensitization configuration, the desensitization function being for data desensitization of the sampled data; and
generating the control instruction according to the sampling rule, the data table and the desensitization function.
2. The method of claim 1, wherein the control instructions are generated to:
obtaining sampling data from the target object according to the sampling rule and the object information;
performing data desensitization on the sampled data to obtain desensitization data;
saving the desensitization data to the sandboxed area; and
copying the desensitization data from the sandboxed area to the target cluster device.
3. The method of claim 1, further comprising:
acquiring a concurrency configuration parameter in a case where there are a plurality of control instructions for respectively executing different tasks;
determining, based on the concurrency configuration parameter, the number of tasks executed simultaneously by the source cluster device; and
controlling the source cluster device to execute the plurality of control instructions based on the number of tasks.
4. The method of claim 3, wherein said controlling the source cluster device to execute the plurality of control instructions based on the number of tasks comprises:
acquiring the currently available resources in the source cluster device; and
determining the currently available resources allocated to each task based on the currently available resources and the number of tasks, so as to run the control instruction of the task using the allocated currently available resources.
5. The method of claim 4, further comprising:
generating an acquisition record of the scheduling device acquiring the currently available resources in the source cluster device, so as to query whether an abnormal acquisition record exists among the acquisition records.
6. The method of claim 1, further comprising:
checking whether the data volume copied to the target cluster is consistent with the original data volume stored in the sandbox area; and
issuing alarm information in a case where the data volume is inconsistent with the original data volume.
7. A data processing system comprising:
a source cluster device comprising a sandbox area and a non-sandbox area which are independent of each other, wherein a target object is stored in the non-sandbox area;
a target cluster device; and
a scheduling device for performing the method of any one of claims 1 to 6,
wherein the source cluster device is configured to sample a target object stored in the non-sandboxed area in response to the control instruction to obtain sampled data, store the sampled data to the sandboxed area, and copy the sampled data from the sandboxed area to the target cluster device.
8. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
CN201910688165.5A 2019-07-26 2019-07-26 Data processing method, system, electronic device and storage medium Active CN110399209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910688165.5A CN110399209B (en) 2019-07-26 2019-07-26 Data processing method, system, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN110399209A CN110399209A (en) 2019-11-01
CN110399209B CN110399209B (en) 2022-02-25

Family

ID=68326347

Country Status (1)

Country Link
CN (1) CN110399209B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800473B (en) * 2021-03-17 2022-01-04 好人生(上海)健康科技有限公司 Data processing method based on big data safety house
CN112988604B (en) * 2021-04-30 2024-04-02 中国工商银行股份有限公司 Object testing method, testing system, electronic device and readable storage medium
CN114817390A (en) * 2022-04-27 2022-07-29 中国农业银行股份有限公司 Data processing method and device based on Sqoop program

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868389A (en) * 2016-04-15 2016-08-17 北京思特奇信息技术股份有限公司 Method and system for implementing data sandbox based on mongoDB
CN106650424A (en) * 2016-11-28 2017-05-10 北京奇虎科技有限公司 Method and device for detecting target sample file
CN106776143A (en) * 2016-12-27 2017-05-31 北京奇虎科技有限公司 The method and terminal device of a kind of mirror back-up for end application
CN107247741A (en) * 2017-05-14 2017-10-13 四川盛世天成信息技术有限公司 A kind of concentrating type textual magnanimity sensitive data processing method and system
CN109491989A (en) * 2018-11-12 2019-03-19 北京懿医云科技有限公司 Data processing method and device, electronic equipment, storage medium
CN109635024A (en) * 2018-11-23 2019-04-16 华迪计算机集团有限公司 A kind of data migration method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mining Sandboxes;Konrad Jamrozik;《2016 IEEE/ACM 38th International Conference on Software Engineering》;20170403;全文 *
基于SequoiaDB的金融业历史数据存储与查询解决方案;谢欣;《金融电子化》;20170430(第4期);第72-73页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant