CN114417408A

CN114417408A - Data processing method, device, equipment and storage medium

Info

Publication number: CN114417408A
Application number: CN202210053524.1A
Authority: CN
Inventors: 徐立国; 陈治宇
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2022-01-18
Filing date: 2022-01-18
Publication date: 2022-04-29
Anticipated expiration: 2042-01-18
Also published as: CN114417408B

Abstract

The disclosure provides a data processing method, apparatus, device and storage medium, and relates to the field of computer technology, in particular to fields such as big data, cloud computing and information security. The specific implementation scheme is as follows: acquiring target data from each data source, wherein the target data comprises at least one type of data; performing format conversion on the different types of data in each target data to obtain a first data stream of each target data, wherein the format of the first data stream is a preset target format; performing privacy calculation processing on each first data stream to obtain a second data stream of each target data; and generating data to be stored of each target data based on each second data stream. The technical scheme of the disclosure can improve the efficiency of processing data from diversified data sources.

Description

Data processing method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technology, and more particularly to the field of big data, cloud computing, information security, and the like.

Background

With the advent of the big data era, processing data from diversified data sources has become a difficult problem. In particular, privacy calculation processing needs to be performed on data from diversified data sources, which are characterized by large data volumes and many files. As a result, processing efficiency is low.

Disclosure of Invention

The present disclosure provides a data processing method, apparatus, device, storage medium, and computer program product.

According to a first aspect of the present disclosure, there is provided a data processing method, including:

acquiring target data from each data source, wherein the target data comprises at least one type of data;

respectively carrying out format conversion on different types of data in each target data to obtain a first data stream of each target data, wherein the format of the first data stream is a preset target format;

privacy calculation processing is carried out on each first data stream respectively to obtain a second data stream of each target data;

and generating data to be stored of each target data based on each second data stream.

According to a second aspect of the present disclosure, there is provided a data processing apparatus comprising:

the acquisition module is used for acquiring target data from each data source, and the target data comprises at least one type of data;

the first format conversion module is used for respectively carrying out format conversion on different types of data in each target data to obtain a first data stream of each target data, and the format of the first data stream is a preset target format;

the privacy calculation module is used for respectively carrying out privacy calculation processing on each first data stream to obtain a second data stream of each target data;

and the generating module is used for generating the data to be stored of each target data based on each second data stream.

According to a third aspect of the present disclosure, there is provided an electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method provided by the first aspect.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method provided by the first aspect described above.

According to the technical scheme of the present disclosure, at least the processing efficiency of data from diversified data sources can be improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic flow diagram of a data processing method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of formatting before and after privacy computing processing according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of normal processing and exception processing flow according to an embodiment of the present disclosure;

FIG. 4 is an overall logic block diagram of a diverse data source streaming process according to an embodiment of the disclosure;

FIG. 5 is an interface schematic of data source processing according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a data write-out module interface according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of an ORC file format according to an embodiment of the present disclosure;

FIG. 8 is a first block diagram of a data processing apparatus according to an embodiment of the present disclosure;

FIG. 9 is a second block diagram of a data processing apparatus according to an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a data processing scenario according to an embodiment of the present disclosure;

FIG. 11 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terms "first," "second," and "third," etc. in the description and claims of the present disclosure and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. A method, system, article, or apparatus is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, system, article, or apparatus.

The embodiment of the present disclosure provides a data processing method, which may be applied to an electronic device. The electronic device includes, but is not limited to, a fixed device and/or a mobile device. For example, the fixed device includes, but is not limited to, a server, which may be a cloud server or an ordinary server; the mobile device includes, but is not limited to, one or more of a mobile phone or a tablet computer. As shown in fig. 1, the data processing method includes:

S101, acquiring target data from each data source, wherein the target data comprises at least one type of data;

S102, performing format conversion on different types of data in each target data respectively to obtain a first data stream of each target data, wherein the format of the first data stream is a preset target format;

S103, performing privacy calculation processing on each first data stream to obtain a second data stream of each target data;

and S104, generating data to be stored of each target data based on each second data stream.

The present disclosure does not limit the type of data source, which may be, for example, a JDBC data source, an S3 data source, an Oracle data source, a PostgreSQL data source or a MySQL data source.

Here, the target data is the data to be processed that is acquired from a data source, and may be represented in the form of a data stream. It will be appreciated that target data acquired from different data sources may include the same data types, but the target data itself differs between data sources.

The present disclosure likewise does not limit the type of data, which may be, for example, picture data, text data or voice data.

Here, the different types of data correspond to different format converters. The format converter may be subdivided into an input format converter and an output format converter. The first data stream in S102 is converted by the input format converter.

For example, the input format converters include at least: a sample input format (Sample Input Format) converter, an annotation input format (Annotation Input Format) converter, a full data input format (Full Data Input Format) converter, a CSV input format (Csv Input Format) converter, and a JDBC input format (Jdbc Input Format) converter.

Illustratively, target data whose data type is sample data is converted by the sample input format converter, target data whose data type is annotation data is converted by the annotation input format converter, target data whose data type is full data is converted by the full data input format converter, and target data whose data type is CSV data is converted by the CSV input format converter. That is, data of type CSV is converted by the CSV input format converter regardless of its data source; likewise, data of type TXT is converted by the TXT input format converter regardless of its data source.

Here, the preset target format is a virtual table format. The content represented by each column is the same for virtual tables of different target data. Of course, the content represented by each column in the virtual table can be set or adjusted according to actual needs and user requirements.

Illustratively, in a virtual table, the first column represents the file name, the second column represents the data of the picture itself, and the third column represents the binary information of the annotation file.

Here, the privacy calculation processing includes, but is not limited to, at least one of the following types: sampling processing and desensitization processing.

The embodiments of the present disclosure do not limit the type of privacy calculation processing. Through the privacy calculation processing, part of the data can be replaced so as to protect the security of the data.

Here, the second data stream is a data stream obtained by performing privacy computation processing on at least part of data in the first data stream. That is, at least a portion of the data in the second data stream is different from the first data stream. The second data stream is in the same format as the first data stream.

Here, the data to be stored is data conforming to the target storage format. The target storage format is a columnar storage format, such as the Optimized Row Columnar (ORC) file format.

According to the technical scheme of the embodiment of the disclosure, target data from each data source is acquired; respectively carrying out format conversion on different types of data in each target data to obtain a first data stream of each target data; privacy calculation processing is carried out on each first data stream respectively to obtain a second data stream of each target data; and generating data to be stored of each target data based on each second data stream, so that the target data of different data formats of different data sources are converted into a unified preset target format, and then a series of operations such as privacy calculation processing, storage and the like are performed, so that the processing efficiency of the data of the diversified data sources can be improved, including the efficiency of the privacy calculation processing.
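
As a non-limiting illustration, steps S101 to S104 may be sketched as a chain of lazy generators. This is only a minimal sketch under stated assumptions: the function names, the three-column row layout and the masking rule are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical virtual-table row: (file name, payload, annotation bytes).
Row = Tuple[str, bytes, bytes]


def to_virtual_table(records: Iterable[dict]) -> Iterator[Row]:
    """S102: convert source records into rows of the preset target format."""
    for rec in records:
        yield (rec["name"], rec["data"], rec.get("annotation", b""))


def privacy_compute(rows: Iterable[Row]) -> Iterator[Row]:
    """S103: toy desensitization that masks every byte of the payload."""
    for name, data, annotation in rows:
        yield (name, b"*" * len(data), annotation)


def to_stored(rows: Iterable[Row]) -> List[Row]:
    """S104: materialize the second data stream as data to be stored."""
    return list(rows)


def process(sources: Dict[str, Iterable[dict]]) -> Dict[str, List[Row]]:
    """Run S101 to S104 independently for each data source."""
    return {
        name: to_stored(privacy_compute(to_virtual_table(records)))
        for name, records in sources.items()
    }


# S101: two in-memory stand-ins for real data sources (assumed for the example).
sources = {
    "s3": [{"name": "a.jpg", "data": b"abc"}],
    "mysql": [{"name": "row1", "data": b"secret", "annotation": b"x"}],
}
result = process(sources)
```

Because every stage consumes and produces rows in the same format, the privacy step does not need to know which data source a row came from.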

In some embodiments, acquiring target data from each data source comprises: determining a parser adapted to each data source; and acquiring the target data from each data source in a streaming manner through the corresponding parser.

The parser can identify the type of the data source to which it is connected, and can thereby determine whether it is able to parse data from that data source. In practical applications, each data source is connected to the parser adapted to it, so that the parser can parse the data acquired from the data source to which it is connected.

Here, streaming acquisition may be understood as follows: each parser acquires one piece of target data at a time, and acquires the next piece of target data from its data source only after the currently acquired piece has been processed.

Illustratively, the parser is determined according to the data source type: a JDBC data source uses a JDBC parser, an S3 data source uses an S3 parser, an Oracle data source uses an Oracle parser, a PostgreSQL data source uses a PostgreSQL parser, and a MySQL data source uses a MySQL parser.

As yet another example, the parser is determined based on the data source type and the data type. For example, the parser corresponding to the csv format data of the S3 data source is the csv sub parser of S3. The parser corresponding to the SQL format data of the Oracle data source is an SQL sub parser of Oracle.

Although different parsers parse data in different ways, they all read data in a streaming manner. Streaming reads reduce the use of cache, thereby improving resource utilization and read efficiency.

Therefore, target data from different data sources is acquired through different parsers, and each parser acquires the target data from the data source to which it is connected in a streaming manner. This provides basic support for the streaming processing of data from diversified data sources, and thereby enables privacy calculation that supports many files and large data volumes.
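
The selection of a parser by data source type, combined with streaming reads, may be sketched as follows. This is a minimal sketch under stated assumptions: the registry `PARSERS`, the helper `get_parser` and the line-based parser are hypothetical names, and an in-memory text stream stands in for a real S3 or MySQL connection.

```python
import io
from typing import Callable, Dict, Iterator, TextIO


def stream_lines(handle: TextIO) -> Iterator[str]:
    """Yield one record at a time instead of loading the whole source."""
    for line in handle:
        yield line.rstrip("\n")


# Hypothetical registry mapping a data source type to its adapted parser.
# Both entries reuse the same toy parser purely for illustration.
PARSERS: Dict[str, Callable[[TextIO], Iterator[str]]] = {
    "s3": stream_lines,
    "mysql": stream_lines,
}


def get_parser(source_type: str) -> Callable[[TextIO], Iterator[str]]:
    """Return the parser adapted to the given data source type."""
    try:
        return PARSERS[source_type.lower()]
    except KeyError:
        raise ValueError(f"no parser adapted to data source type {source_type!r}") from None


records = get_parser("S3")(io.StringIO("a,1\nb,2\n"))
first = next(records)     # only the first record has been read so far
remaining = list(records)
```

The generator reads the next record only when the consumer asks for it, which is the property that keeps cache usage low.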

In some embodiments, the preset target format is a virtual table format, and format conversion is performed on different types of data in each target data to obtain a first data stream of each target data, including: respectively determining input format converters matched with different types of data in the target data; and respectively carrying out format conversion on different types of data in the target data through each input format converter to obtain a first data stream of each target data, wherein the first data stream is represented by a virtual table format.

Here, the first data streams of the respective target data are represented by the same kind of virtual table, that is, the content represented by each column is the same for the first data streams of the respective target data. It is understood that the content represented by each column in the virtual table can be set or adjusted according to actual needs, such as user requirements.

Illustratively, in a virtual table, the first column represents the file name, the second column represents the data of the picture itself, and the third column represents the binary information of the annotation file.

Illustratively, sample-type data is passed through the sample input format converter to obtain a first data stream, annotation-type data is passed through the annotation input format converter to obtain a first data stream, CSV-type data is passed through the CSV input format converter to obtain a first data stream, and JDBC-type data is passed through the JDBC input format converter to obtain a first data stream.

Therefore, different input format converters perform format conversion on the different types of data in each target data to obtain a first data stream of each target data represented in the virtual table format. The subsequent privacy calculation processing then only has to handle first data streams in the virtual table format, and does not need a different form of privacy calculation processing for each data stream format. This not only broadens the range of data sources covered by the privacy calculation processing, but also simplifies the privacy calculation process and improves its processing efficiency.
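
One way to picture the matching of input format converters to data types is a lookup table keyed by data type, where every converter emits rows with the same columns. This is a minimal sketch under stated assumptions: the names `INPUT_FORMATS`, `csv_input_format`, `txt_input_format` and `convert` are hypothetical, and the three-column row follows the illustrative virtual table described above.

```python
import csv
import io
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical virtual-table row: (file name, record bytes, annotation bytes).
Row = Tuple[str, bytes, bytes]


def csv_input_format(name: str, payload: bytes) -> Iterator[Row]:
    """Convert CSV bytes into virtual-table rows, one row per CSV record."""
    reader = csv.reader(io.StringIO(payload.decode("utf-8")))
    for record in reader:
        yield (name, ",".join(record).encode("utf-8"), b"")


def txt_input_format(name: str, payload: bytes) -> Iterator[Row]:
    """Convert plain text into virtual-table rows, one row per line."""
    for line in payload.decode("utf-8").splitlines():
        yield (name, line.encode("utf-8"), b"")


# The converter is chosen by data type only, never by data source.
INPUT_FORMATS: Dict[str, Callable[[str, bytes], Iterator[Row]]] = {
    "csv": csv_input_format,
    "txt": txt_input_format,
}


def convert(data_type: str, name: str, payload: bytes) -> Iterator[Row]:
    """Produce the first data stream for one piece of target data."""
    return INPUT_FORMATS[data_type](name, payload)


csv_rows = list(convert("csv", "t.csv", b"a,1\nb,2\n"))
txt_rows = list(convert("txt", "n.txt", b"hello\nworld"))
```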

In some embodiments, performing privacy computation processing on each first data stream to obtain a second data stream of each target data, includes: privacy calculation processing is carried out on each first data stream respectively to obtain data which are subjected to privacy calculation processing and correspond to each first data stream; and obtaining a second data stream of each target data based on the data which is subjected to privacy calculation processing and corresponds to each first data stream, wherein the format of the second data stream is a preset target format.

Here, the second data stream has the same format as the first data stream, and is a preset target format, such as a virtual table format.

Here, the privacy calculation process includes, but is not limited to, a sampling process and a desensitization process.

Therefore, the data handled by the privacy calculation processing is already uniformly formatted, the diversity of the data is hidden from the privacy calculation, and the difficulty of the privacy calculation is reduced.
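
Sampling and desensitization over a first data stream may be sketched as two composable generators that preserve the row format. This is a minimal sketch under stated assumptions: keeping every second row and masking digits are stand-in rules chosen for the example, and the names `sample_every` and `desensitize` are hypothetical.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, bytes, bytes]  # hypothetical virtual-table row

_DIGITS = re.compile(rb"\d")


def sample_every(rows: Iterable[Row], step: int) -> Iterator[Row]:
    """Sampling: keep every step-th row, starting with the first one."""
    return islice(rows, 0, None, step)


def desensitize(rows: Iterable[Row]) -> Iterator[Row]:
    """Desensitization: replace every digit in the payload with '*'."""
    for name, data, annotation in rows:
        yield (name, _DIGITS.sub(b"*", data), annotation)


first_stream = [("u.csv", b"id=%d" % i, b"") for i in range(4)]
second_stream = list(desensitize(sample_every(first_stream, 2)))
```

The second data stream has the same three columns as the first; only part of the content differs.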

In some embodiments, the data processing method may further include: respectively determining output format converters matched with different types of data in the target data; and passing the second data streams corresponding to the target data through the output format converters to obtain the updated second data streams. Generating data to be stored of each target data based on each second data stream, including: and generating the data to be stored of each target data, which meets the target storage format, based on each updated second data stream.

Here, the output format converters include, but are not limited to, a sample output format (Sample Output Format) converter, an annotation output format (Annotation Output Format) converter, and a full data output format (Full Data Output Format) converter.

Here, the number of kinds of output format converters is equal to or less than the number of kinds of input format converters.

Here, the target storage format is a columnar storage format, for example, the ORC file format. In some embodiments, the data table is divided into row groups by row, and each row group is then stored in a columnar manner. When the data is queried, it is not necessary to scan all the data; only the specified columns need to be read.
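
The idea of grouping rows and then storing each group by column may be sketched with plain Python containers. This is only a conceptual sketch, not the ORC on-disk layout; the function names `to_row_groups` and `read_column` and the sample columns are hypothetical.

```python
from typing import Any, Dict, List


def to_row_groups(rows: List[Dict[str, Any]], group_size: int) -> List[Dict[str, List[Any]]]:
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


def read_column(groups: List[Dict[str, List[Any]]], name: str) -> List[Any]:
    """Read a single column without touching the other columns."""
    values: List[Any] = []
    for group in groups:
        values.extend(group[name])
    return values


table = [
    {"name": "a", "size": 1},
    {"name": "b", "size": 2},
    {"name": "c", "size": 3},
]
groups = to_row_groups(table, group_size=2)
sizes = read_column(groups, "size")
```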

Fig. 2 shows a schematic diagram of formatting before and after privacy calculation processing. As shown in fig. 2, target data is acquired from a data source in a streaming manner and undergoes format conversion, followed by privacy calculation processing; the data after the privacy calculation processing undergoes a further format conversion, and finally the data to be stored of each target data, which conforms to the target storage format, is generated. Here, some identification processing is performed on part of the data after the privacy calculation processing, for example adding a prefix identifier to picture data, but the content of the data after the privacy calculation processing is not changed.

Thus, format conversion is carried out on the second data streams respectively corresponding to the target data through the output format converters to obtain the updated second data streams; generating data to be stored of each target data, which meets the target storage format, based on each updated second data stream; therefore, different storage processing modes are not needed to be adopted for different data sources during storage, and only a unified storage mode is needed, so that diversified data can be stored in a unified format, and the storage processing efficiency is improved.

In some embodiments, the data processing method may further include: and storing the data to be stored of each target data according to a preset storage mode.

Here, the preset storage manner includes the following: each data source is allocated an independent memory, and each memory is divided into independent data slices used for storing the data to be stored. Each data slice contains: index metadata, which stores the start position information of the data slice, the attributes of each column and the header information of each column; data information; and row data (Row Data), which stores each row of data in columnar form. Row Data is also called a Row Group, and one Stripe corresponds to a plurality of Row Groups.

In this way, after the diversified data source data is subjected to unified privacy calculation processing, unified format storage can be performed, and the overall streaming processing of the diversified data source data can be realized.

In some embodiments, storing the data to be stored of each target data according to a preset storage manner includes: writing the data to be stored of each target data into a data slice of the memory corresponding to each data source; and, when the data volume of a data slice corresponding to a data source reaches a preset threshold, uploading the data to be stored in that data slice to a storage server.

Here, the preset threshold may be set or adjusted according to actual conditions. For example, it may be set according to the size of the memory allocated by the system for each data source. The preset threshold is less than or equal to the memory allocated by the system for each data source.

For example, if the preset threshold is 10 megabits, each row of data is first written into a data slice in the memory, and when the amount of data in the data slice reaches 10 megabits, the data in the data slice is uploaded to the storage server.

Here, the storage server is used for merging the data uploaded from all the data slices of each data source.

Therefore, each row of data is first written into the memory corresponding to its data source and is uploaded slice by slice, so that the cache and the input/output interface are used reasonably and resources are fully utilized.

In some embodiments, the storing the data to be stored of each target data according to a preset storage mode further includes: and uploading the data to be stored in the data slice of the memory corresponding to each data source to a storage server under the condition that the reading of each target data corresponding to each data source is finished.

Taking a data slice size of 10 megabits as an example, if the data in data slice 1 corresponding to data source 1 has not reached 10 megabits, but all the target data corresponding to data source 1 has been read or the connection with data source 1 has been closed, the data in data slice 1 is uploaded to the storage server.

Therefore, the integrity of the target data of each stored data source can be ensured.
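
The two upload conditions described above, namely a slice reaching the preset threshold and the data source being fully read, may be sketched with a small buffering writer. This is a minimal sketch under stated assumptions: the class name `SliceWriter`, the byte-based threshold and the `upload` callback are hypothetical, and a Python list stands in for the storage server.

```python
from typing import Callable, List


class SliceWriter:
    """Buffer rows in memory and upload them one data slice at a time."""

    def __init__(self, threshold_bytes: int, upload: Callable[[bytes], None]) -> None:
        self.threshold = threshold_bytes
        self.upload = upload
        self.buffer: List[bytes] = []
        self.size = 0

    def write(self, row: bytes) -> None:
        """Append one row; upload the slice once the threshold is reached."""
        self.buffer.append(row)
        self.size += len(row)
        if self.size >= self.threshold:
            self.flush()

    def flush(self) -> None:
        """Upload whatever is buffered, even if the slice is not full."""
        if self.buffer:
            self.upload(b"".join(self.buffer))
            self.buffer = []
            self.size = 0

    def close(self) -> None:
        """Called when the data source has been fully read."""
        self.flush()


uploaded: List[bytes] = []            # stand-in for the storage server
writer = SliceWriter(threshold_bytes=4, upload=uploaded.append)
for row in [b"ab", b"cd", b"e"]:
    writer.write(row)
writer.close()                         # uploads the final, partial slice
```

Without the final `close()`, the last partial slice would stay in memory, which is the integrity problem the end-of-read upload avoids.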

In some embodiments, the storing the data to be stored of each target data according to a preset storage mode further includes: and respectively merging the data of all the data slices corresponding to each data source through the storage server to obtain target storage data of each data source.

For example, data source 1 includes target data 1 to 2n. The storage server receives the data to be stored 1 to n uploaded from data slice 1 of the memory corresponding to data source 1, and the data to be stored n+1 to 2n uploaded from data slice 2 of that memory, and merges the data to be stored of data slice 1 and data slice 2 to obtain the target storage data of data source 1.

Therefore, since the data is uploaded to the storage server in slices and the slices are merged at the end, the cache and the input/output interface are used reasonably and the utilization of resources is improved.
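
The merging of slices on the storage server may be sketched as follows. This is a minimal sketch under stated assumptions: the class `StorageServer`, its methods and the use of a slice number to restore order are hypothetical; the disclosure does not specify how slices are ordered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class StorageServer:
    """Collect uploaded slices per data source and merge them on request."""

    def __init__(self) -> None:
        # source name -> {slice number -> rows of that slice}
        self._slices: Dict[str, Dict[int, List[str]]] = defaultdict(dict)

    def upload(self, source: str, slice_no: int, rows: Iterable[str]) -> None:
        """Store one uploaded data slice for the given data source."""
        self._slices[source][slice_no] = list(rows)

    def merge(self, source: str) -> List[str]:
        """Concatenate all slices of one data source in slice order."""
        merged: List[str] = []
        for slice_no in sorted(self._slices[source]):
            merged.extend(self._slices[source][slice_no])
        return merged


server = StorageServer()
server.upload("source1", 2, ["row3", "row4"])   # slices may arrive out of order
server.upload("source1", 1, ["row1", "row2"])
server.upload("source2", 1, ["other"])
target_storage_data = server.merge("source1")
```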

In some embodiments, the data processing method may further include: determining a target database of each data source; and storing the target storage data of each data source to a target database of each data source.

The target database corresponding to each data source can be set according to configuration information.

Here, the target database is the final storage location of the target data of a data source after the privacy calculation processing. The present disclosure does not limit the type of target database, which may be, for example, a BOS database, a MinIO database, an OSS database or a Hadoop Distributed File System (HDFS) database.

The target databases corresponding to the data sources may be the same or different. For example, data source 1 corresponds to target database 1, and data source 2 and data source 3 correspond to target database 2.

In this way, the target data of the data source can be stored to the designated database according to the configuration information.

In some embodiments, the data processing method may further include: generating a fault event and reporting the fault event layer by layer under the condition that an abnormality occurs in the data processing process; and determining the fault occurrence reason according to the fault event.

Here, reporting layer by layer means that the node at which the fault occurs reports the fault event to its preceding node, which in turn reports it to its own preceding node, and so on until the fault event reaches the first-level node. According to the data processing flow, the nodes include a task distribution node, a target data acquisition node, an input format conversion node, a privacy calculation processing node, an output format conversion node, a data write-out processing node and a data storage node.

Fig. 3 shows a schematic diagram of the normal processing flow and the exception processing flow. The normal flow is represented by solid lines in fig. 3: a task distributor distributes a task to a data source parser, the data source parser generates a data stream, and the data stream is written into data storage after privacy calculation processing. The abnormal flow is represented by dotted lines in fig. 3: whichever module an abnormality occurs in, the error is thrown upward layer by layer and the error information is recorded; the error finally reaches the task distributor, which updates the task status to failed and records the failure reason.

Therefore, exceptions in the data processing flow can be reported in time and fault events can be obtained in time, which helps improve the efficiency of handling fault events.
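
Layer-by-layer reporting may be sketched with chained exceptions, where each layer wraps the error it receives and the task distributor records the full chain as the failure reason. This is a minimal sketch under stated assumptions: the exception class `StageError`, the two stage functions and the task dictionary are hypothetical, and only two layers are modelled.

```python
from typing import Optional


class StageError(Exception):
    """Fault event raised by one processing node."""

    def __init__(self, stage: str, message: str) -> None:
        super().__init__(f"{stage}: {message}")
        self.stage = stage


def privacy_stage(row: Optional[bytes]) -> bytes:
    """Inner layer: fails when there is nothing to process."""
    if row is None:
        raise StageError("privacy computation", "empty row")
    return row


def parser_stage(row: Optional[bytes]) -> bytes:
    """Outer layer: re-raises the fault with its own context attached."""
    try:
        return privacy_stage(row)
    except StageError as err:
        raise StageError("data source parser", "downstream stage failed") from err


def run_task(task: dict, row: Optional[bytes]) -> dict:
    """Task distributor: mark the task failed and record the failure reason."""
    try:
        parser_stage(row)
        task["status"] = "succeeded"
    except StageError as err:
        chain = []
        cause: Optional[BaseException] = err
        while cause is not None:
            chain.append(str(cause))
            cause = cause.__cause__
        task["status"] = "failed"
        task["reason"] = " <- ".join(chain)
    return task


failed = run_task({"id": 1}, None)
ok = run_task({"id": 2}, b"data")
```

Walking `__cause__` from the outermost error back to the original one reproduces the path the fault event took through the layers.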

In order to facilitate understanding of the above data processing method, a specific implementation is given below.

The overall logic block diagram of the streaming processing of diversified data sources is shown in fig. 4. A task distributor is responsible for handling the tasks configured by a user and processes different tasks concurrently with multiple threads. The task distributor acquires tasks and allocates threads to process them. In each task, a data source parser obtains the information of the user's data source, reads the data and converts it into an iterator, and packages the iterator into a data structure (such as a virtual table). The data structure further includes metadata of the data, such as the number of rows, the number of columns and the type of the data. While the data is being packaged into the iterator, files of different types are converted by different format converters: an input formatting tool (namely, the input format converter) converts them into a unified format, and the data is then dispatched to different processing implementations for privacy calculation processing such as sampling and desensitization. The processed data is packaged into a new data structure that contains an iterator, is converted into the format to be stored by an output formatting tool (namely, the output format converter), and is written into a file storage system in slices by a data write-out processing module. When the task distributor processes data, tasks are scheduled based on the multiple data types included in the different data sources, and each task corresponds to processing the data of one data type of one data source.
The same data source may have multiple data types; only one task can be configured for each data type, so multiple tasks can be configured for multiple data types and scheduled together. The data read in a streaming manner is constructed into an iterator, and a nested iterator design is used for multiple files, so that the subsequent privacy calculation can process the data as a stream through the iterator. For example, for three files a, b and c, the list of the three file names is first constructed into an iterator (three iterations); each iteration reads the corresponding file, constructs the corresponding file stream into a file stream iterator and iterates over it, after which the outer iterator moves on to the next file.
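
The nested iterator over the three files a, b and c may be sketched as a generator with an outer loop over file names and an inner loop over each file stream. This is a minimal sketch under stated assumptions: the function `iter_rows` and the `open_file` callback are hypothetical, and in-memory text streams stand in for real files.

```python
import io
from typing import Callable, Dict, Iterable, Iterator, TextIO, Tuple


def iter_rows(
    file_names: Iterable[str],
    open_file: Callable[[str], TextIO],
) -> Iterator[Tuple[str, str]]:
    """Outer iterator walks the file names; inner iterator walks each file stream."""
    for name in file_names:                  # outer iterator: one step per file
        with open_file(name) as stream:
            for line in stream:              # inner iterator: one step per row
                yield name, line.rstrip("\n")


# In-memory stand-ins for the three files of the example.
files: Dict[str, str] = {"a": "a1\na2\n", "b": "b1\n", "c": ""}
rows = list(iter_rows(["a", "b", "c"], lambda name: io.StringIO(files[name])))
```

Only one file is open at a time, and a row is produced only when the downstream privacy step asks for it.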

In this framework, data reading, format conversion, privacy calculation and storage are all performed in a streaming manner: data is read row by row, desensitized and sampled row by row, and each row is written out as soon as it has been processed.

Fig. 5 shows a schematic diagram of the data source processing interfaces. In fig. 5, rectangles represent core interfaces, parallelograms represent abstract classes, and ovals represent the different implementation classes of an interface. The data source parser (Data Source) serves as a core interface and has different implementation classes for parsing different data sources, such as S3, Mysql and Oracle. The Data Source interface is managed through a data source parser factory interface (Data Source Factory), which finds the corresponding parser according to the data source configured by the user. The Data Source interface performs format conversion of the input data stream through an input formatting tool, which has different implementation classes according to the file type, such as example file format converters, Json file format converters, csv file format converters, and the like. The Data Source interface is also connected with a data stream encapsulator (Data Frame), through which an iterator over each row of data can be obtained, together with meta information of the data such as the number of rows, the number of columns and the data types. The data stream encapsulator is further divided into a data read stream encapsulator, a data write stream encapsulator and a data storage stream encapsulator. The data read stream encapsulates the stream generated from the user's data source data, the data write stream encapsulates the data generated after the privacy calculation, and the data storage stream encapsulates the data to be stored. The data read stream is divided into different implementation classes such as an S3 read stream and a Mysql read stream. Similarly, the data write stream is divided into different implementation classes such as an S3 write stream and a Mysql write stream, and the data storage stream is divided into different implementation classes such as an S3 storage stream and a Mysql storage stream.
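The relationship between the parser interface, its factory, the input formatting tools and the data stream encapsulator can be sketched as follows. This is only an illustrative sketch: the class names (`InMemorySource`, `CsvFormatter`, `JsonLinesFormatter`), the configuration keys and the in-memory source are hypothetical stand-ins for real S3 or Mysql implementations, and only column names are exposed as meta information.

```python
import csv
import io
import itertools
import json
from abc import ABC, abstractmethod


class InputFormatter(ABC):
    """Input formatting tool: turns a raw text stream into rows of a virtual table."""
    @abstractmethod
    def rows(self, stream): ...


class CsvFormatter(InputFormatter):
    def rows(self, stream):
        yield from csv.DictReader(stream)


class JsonLinesFormatter(InputFormatter):
    def rows(self, stream):
        for line in stream:
            if line.strip():
                yield json.loads(line)


class DataFrame:
    """Data stream encapsulator: a row iterator plus meta information (column names)."""
    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns

    def __iter__(self):
        return iter(self._rows)


class DataSource(ABC):
    """Core parser interface; real implementations would target S3, Mysql, and so on."""
    @abstractmethod
    def open(self, config): ...


class InMemorySource(DataSource):
    def open(self, config):
        return io.StringIO(config["content"])


class DataSourceFactory:
    """Finds the parser and formatter that match the user's configuration."""
    _sources = {"memory": InMemorySource}
    _formatters = {"csv": CsvFormatter, "json": JsonLinesFormatter}

    @classmethod
    def read(cls, config):
        stream = cls._sources[config["source"]]().open(config)
        rows = cls._formatters[config["type"]]().rows(stream)
        first = next(rows, None)             # peek one row to learn the column names
        if first is None:
            return DataFrame(iter(()), [])
        return DataFrame(itertools.chain([first], rows), list(first))


frame = DataSourceFactory.read(
    {"source": "memory", "type": "csv", "content": "id,name\n1,alice\n2,bob\n"}
)
```

Adding a new data source or file type in this sketch means registering one more class in the factory, without touching downstream code that consumes the `DataFrame`.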

Fig. 6 shows a schematic diagram of the data writing-out module interfaces. In fig. 6, rectangles represent core interfaces, parallelograms represent abstract classes, and ovals represent the different implementation classes of an interface. The data writing-out module (Data Sink) serves as a core interface; it may have an implementation class that writes to file storage over the S3 protocol (MinIO, BOS, OSS, etc.), and may also have implementation classes that write to storage systems using other protocols. The data writing-out module is managed through a data writing-out module factory interface (Data Sink Factory), which selects different data writing-out implementation classes according to the user's configuration. The data writing-out module is connected with an output formatting tool, which converts the processed data into a uniform format for storage. For example, a general output formatting tool may be implemented that supports most formatted data, and an unstructured output format converter may be implemented for data such as picture data sets and large texts.
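A minimal sketch of the writing-out side, mirroring the description of fig. 6, is given below. The names `S3LikeSink` and `OutputFormatter`, the `"sink"` configuration key and the choice of one JSON line per row as the uniform format are assumptions for illustration; a real sink would write to an S3-compatible store rather than an in-memory buffer.

```python
import io
import json
from abc import ABC, abstractmethod


class OutputFormatter:
    """General output formatting tool: one JSON line per row (an assumed uniform format)."""
    def format(self, row):
        return json.dumps(row, sort_keys=True) + "\n"


class DataSink(ABC):
    """Core writing-out interface."""
    @abstractmethod
    def write(self, rows): ...


class S3LikeSink(DataSink):
    """Stand-in for an S3-protocol sink; it writes to an in-memory buffer instead."""
    def __init__(self, formatter):
        self.formatter = formatter
        self.buffer = io.StringIO()

    def write(self, rows):
        count = 0
        for row in rows:                     # rows are consumed one at a time
            self.buffer.write(self.formatter.format(row))
            count += 1
        return count


class DataSinkFactory:
    """Selects a writing-out implementation class according to the user's configuration."""
    _sinks = {"s3": S3LikeSink}

    @classmethod
    def create(cls, config):
        try:
            sink_cls = cls._sinks[config["sink"]]
        except KeyError:
            raise ValueError(f"unsupported sink: {config.get('sink')!r}") from None
        return sink_cls(OutputFormatter())


sink = DataSinkFactory.create({"sink": "s3"})
written = sink.write([{"b": 2, "a": 1}])
```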

Fig. 7 shows a schematic diagram of the ORC file format. ORC is not a purely columnar storage format: the entire table is first partitioned into row groups, and the data is then stored by column within each row group. The ORC file is self-describing, its metadata is serialized using Protocol Buffers, and the data in the file is compressed as much as possible to reduce storage consumption; the format is also supported by query engines such as Spark SQL and Presto. In a Hive table in the ORC format, records are first split horizontally into multiple stripes; within each stripe, data is stored column by column, and the contents of all columns are stored in the same file. When reading an ORC file, two pieces of position information are required to perform the read accurately. Since there are multiple row groups in each stripe, the ORC reader needs to know the metadata streams of each row group and the starting position of that row group recorded in the data streams. Since an ORC file can contain multiple stripes, and an HDFS block can also contain multiple stripes, the starting position of each stripe must be known in order to locate a given stripe quickly; this information can be saved in the File Footer of the ORC file, as shown by the dashed line in the middle of fig. 7. Storage in the ORC file format has the following advantages:

1) ORC files use columnar storage, support multiple file compression modes, and achieve high compression ratios. For example, the formatted data is stored in the ORC file in the form of fragments; each fragment holds a certain number of rows, and within each fragment the data is stored by column. Since compression is supported, the storage footprint is reduced.

2) ORC files can be fragmented. Privacy calculation processing usually handles large batches of data; using fragmentation, the complete data of a data source can be stored in fragments, and at query time the relevant fragments can be located directly through the index and the search algorithm, so that not all of the data needs to be traversed and query efficiency is improved.

3) The ORC file provides various complex data formats, and different data formats can be stored according to the supported data sources and business scenarios, so as to better realize the uniform processing of diversified data.

4) Unified storage. Different data sources and data types, such as structured data types (Csv, Json, and the like), unstructured data (pictures and large texts), or machine learning data sets (VOC, COCO, and the like), can be uniformly stored in columns by using the characteristics of the ORC file, so that the format of the data source is shielded when the data is read and used, and subsequent processing can follow the uniform format.
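The stripe idea described for fig. 7 (rows split into stripes, columns stored together within a stripe, stripe start positions kept in a footer) can be illustrated with a toy layout. This is not the real ORC binary format and does not use an ORC library; JSON encoding, the footer dictionary and the function names are assumptions chosen to keep the sketch self-contained.

```python
import json


def write_striped(rows, columns, rows_per_stripe):
    """Split rows into stripes, store each stripe column by column, index stripes in a footer."""
    body = bytearray()
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        columnar = {c: [r[c] for r in chunk] for c in columns}   # column-wise within a stripe
        encoded = json.dumps(columnar).encode("utf-8")
        stripes.append({"offset": len(body), "length": len(encoded),
                        "first_row": start, "rows": len(chunk)})
        body += encoded
    return bytes(body), {"columns": columns, "stripes": stripes}


def read_value(body, footer, column, row_index):
    """Use the footer to locate the stripe holding row_index and decode only that stripe."""
    for s in footer["stripes"]:
        if s["first_row"] <= row_index < s["first_row"] + s["rows"]:
            stripe = json.loads(body[s["offset"]:s["offset"] + s["length"]])
            return stripe[column][row_index - s["first_row"]]
    raise IndexError(row_index)


rows = [{"id": i, "city": f"c{i}"} for i in range(5)]
body, footer = write_striped(rows, ["id", "city"], rows_per_stripe=2)
value = read_value(body, footer, "city", 3)
```

Looking up a row in this sketch touches a single stripe, which is the property the query-efficiency advantage above relies on.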

In the following, an overall closed-loop flow of streaming privacy calculation processing and unified storage in an isolated domain scenario is provided, specifically as follows:

1) the user selects a corresponding data source and data type, for example an S3 data source and the Csv data type;

2) data source data is acquired in a streaming manner from the different data sources according to the user's configuration, and an iterator is constructed that iterates over multiple files in a nested manner;

3) the data is formatted through different formatting tools according to the different data types;

4) privacy calculation processing is performed: the formatted virtual table data is obtained through the iterator and the formatting tool, and streaming calculation, such as desensitization calculation and statistics on specified fields, is performed on the data;

5) the processed data is written to the data output layer in a streaming manner;

6) the data output layer receives the streaming data, constructs an ORC file, and uploads the ORC file to the storage server in fragments;

7) the user parses the data through a unified data reading mode to perform subsequent algorithm operations.
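Steps 2) to 5) of the flow above can be sketched as a chain of generators, so that each row passes through formatting, desensitization and statistics before the next row is read. The field names, the masking rule and the use of in-memory CSV files are hypothetical; ORC construction and upload in step 6) are deliberately left out.

```python
import csv
import io
from collections import Counter


def iter_rows(files):
    """Nested iteration: the outer loop walks files, the inner loop walks rows of one file."""
    for handle in files:
        yield from csv.DictReader(handle)


def desensitize(rows, field):
    """Hypothetical rule: replace all but the first character of the field with '*'."""
    for row in rows:
        value = row[field]
        yield {**row, field: value[:1] + "*" * (len(value) - 1)}


def count_field(rows, field, stats):
    """Pass rows through unchanged while updating a running count per field value."""
    for row in rows:
        stats[row[field]] += 1
        yield row


files = [
    io.StringIO("name,city\nalice,beijing\n"),
    io.StringIO("name,city\nbob,shanghai\ncarol,beijing\n"),
]
stats = Counter()
pipeline = count_field(desensitize(iter_rows(files), "name"), "city", stats)
output = list(pipeline)   # a real data output layer would consume the iterator row by row
```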

Therefore, through stream processing, the method can support privacy calculation over many files and large volumes of data while using little temporary memory and achieving good calculation efficiency; in storage, the privacy calculation results of diversified data sources can be stored in a unified format.

An embodiment of the present disclosure discloses a data processing apparatus, as shown in fig. 8, the data processing apparatus may include:

an obtaining module 810, configured to obtain target data from each data source, where the target data includes at least one type of data;

a first format conversion module 820, configured to perform format conversion on different types of data in each target data, respectively, to obtain a first data stream of each target data, where a format of the first data stream is a preset target format;

the privacy calculation module 830 is configured to perform privacy calculation processing on each first data stream, so as to obtain a second data stream of each target data;

a generating module 840, configured to generate data to be stored of each target data based on each second data stream.

In some embodiments, the obtaining module 810 is configured to:

determining the parsers respectively adapted to each data source;

and streaming target data from each data source through each parser.

In some embodiments, the preset target format is a virtual table format, and the first format conversion module 820 is configured to:

respectively determining input format converters matched with different types of data in the target data;

and respectively carrying out format conversion on different types of data in the target data through each input format converter to obtain a first data stream of each target data, wherein the first data stream is represented by a virtual table format.

In some embodiments, the privacy calculation module 830 is configured to:

privacy calculation processing is carried out on each first data stream respectively to obtain data which are subjected to privacy calculation processing and correspond to each first data stream;

and obtaining a second data stream of each target data based on the data which is subjected to privacy calculation processing and corresponds to each first data stream, wherein the format of the second data stream is a preset target format.

In some embodiments, as shown in fig. 9, the data processing apparatus may further include:

a second format conversion module 850 for:

respectively determining output format converters matched with different types of data in the target data;

passing the second data streams corresponding to the target data through output format converters to obtain updated second data streams;

the generating module is further configured to generate data to be stored of each target data, which satisfies the target storage format, based on each updated second data stream.

and the storage module 860 is configured to store the data to be stored of each target data according to a preset storage manner.

In some embodiments, the storage module 860 includes:

the writing submodule is used for respectively writing the data to be stored of each target data into the data slice of the memory corresponding to each data source;

and the uploading sub-module is used for uploading the data to be stored in the data slices reaching the preset threshold value to the storage server under the condition that the data size of the data slices corresponding to each data source reaches the preset threshold value.

In some embodiments, the upload sub-module is further configured to:

and uploading the data to be stored in the data slice of the memory corresponding to each data source to a storage server under the condition that the reading of each target data corresponding to each data source is finished.

In some embodiments, the storage module 860 further comprises:

and the merging submodule is used for merging the data of all the data slices corresponding to each data source through the storage server to obtain target storage data of each data source.
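The cooperation of the writing, uploading and merging submodules can be sketched as follows: data accumulates in an in-memory slice, the slice is uploaded when it reaches a threshold or when reading finishes, and the server merges all slices of a data source. `FakeStorageServer`, `SliceWriter` and the byte-count threshold are hypothetical names and choices for illustration, not an actual storage server interface.

```python
class FakeStorageServer:
    """In-memory stand-in for a storage server that accepts slices and merges them."""
    def __init__(self):
        self.slices = {}

    def upload(self, source, data):
        self.slices.setdefault(source, []).append(data)

    def merge(self, source):
        return b"".join(self.slices.get(source, []))


class SliceWriter:
    """Buffers data for one data source and uploads a slice once it reaches the threshold."""
    def __init__(self, server, source, threshold):
        self.server = server
        self.source = source
        self.threshold = threshold
        self.buffer = bytearray()

    def write(self, data):
        self.buffer += data
        if len(self.buffer) >= self.threshold:   # slice is full: upload and start a new one
            self._upload()

    def close(self):
        if self.buffer:                          # reading finished: upload the remainder
            self._upload()

    def _upload(self):
        self.server.upload(self.source, bytes(self.buffer))
        self.buffer.clear()


server = FakeStorageServer()
writer = SliceWriter(server, "s3-demo", threshold=8)
for record in (b"aaaa", b"bbbb", b"cccc"):
    writer.write(record)
writer.close()
merged = server.merge("s3-demo")
```

In this sketch at most one slice per data source is held in memory at any time, which is consistent with the low memory occupancy described for the storage process.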

In some embodiments, the storage module 860 further comprises:

the determining submodule is used for determining a target database of each data source;

and the storage submodule is used for storing the target storage data of each data source to the target database of each data source.

an exception handling module 870 to:

generating a fault event and reporting the fault event layer by layer under the condition that an abnormality occurs in the data processing process;

and determining the fault occurrence reason according to the fault event.
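One possible reading of layer-by-layer fault reporting is sketched below: the layer where the abnormality occurs creates a fault event, each enclosing layer appends its own name as the event propagates upward, and the top level inspects the event to determine the cause. The `FaultEvent` structure, the layer names and the `report_through` helper are assumptions for illustration; the disclosure does not specify the mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class FaultEvent:
    reason: str
    path: list = field(default_factory=list)   # layers the event has passed through


class PipelineFault(Exception):
    def __init__(self, event):
        super().__init__(event.reason)
        self.event = event


def report_through(layer, func, *args):
    """Run func; on failure, create or extend a fault event and re-raise it to the caller."""
    try:
        return func(*args)
    except PipelineFault as fault:
        fault.event.path.append(layer)          # an inner layer already reported: add ours
        raise
    except Exception as exc:
        event = FaultEvent(f"{type(exc).__name__}: {exc}", [layer])
        raise PipelineFault(event) from exc


def parse_row(line):
    return int(line)


def format_layer(line):
    return report_through("formatter", parse_row, line)


def source_layer(line):
    return report_through("data_source", format_layer, line)


try:
    source_layer("not-a-number")
except PipelineFault as fault:
    event = fault.event                         # top level: inspect the cause and the path
```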

It should be understood by those skilled in the art that the functions of the processing modules in the data processing apparatus according to the embodiments of the present disclosure may be understood by referring to the description of the foregoing data processing method, and the processing modules in the data processing apparatus according to the embodiments of the present disclosure may be implemented by analog circuits that implement the functions described in the embodiments of the present disclosure, or by running software that performs the functions described in the embodiments of the present disclosure on electronic devices.

The data processing device of the embodiment of the disclosure can process diversified data sources in a unified format, thereby improving the processing efficiency of the diversified data sources; in addition, in the storage process, the occupied memory is less, and the diversified data sources can be stored in a unified format, so that when the stored data is read or used, the format of the data sources can be shielded, and the data can be processed according to the unified format.

Fig. 10 is a schematic view of a data processing scenario, and as can be seen from fig. 10, an electronic device such as a cloud server receives target data from a diversified data source, and performs format conversion on different types of data in each target data to obtain a first data stream of each target data, where the format of the first data stream is a preset target format; privacy calculation processing such as sampling processing or desensitization processing is carried out on each first data stream respectively to obtain a second data stream of each target data; generating data to be stored of each target data based on each second data stream; respectively storing the data to be stored of each target data according to a preset storage mode, for example, writing the data into the data slice of the memory corresponding to each data source, uploading the data stored in the data slice reaching the preset threshold to a storage server under the condition that the size of the data in the data slice corresponding to each data source reaches the preset threshold, and merging the data of all the data slices corresponding to each data source through the storage server to obtain the target storage data in the ORC file format of each data source; and, the target storage data of each data source can be stored in the target database of each data source.

It should be understood that the scene diagram shown in fig. 10 is only illustrative and not restrictive, and those skilled in the art may make various obvious changes and/or substitutions based on the example of fig. 10, and the obtained technical solution still belongs to the disclosure scope of the embodiments of the present disclosure.

In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 11, the device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read-Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to the bus 1104.

A number of components in device 1100 connect to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, and the like; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108 such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 1101 can be any of a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of data processing, comprising:

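acquiring target data from each data source, wherein the target data comprises at least one type of data;

respectively carrying out format conversion on different types of data in the target data to obtain a first data stream of the target data, wherein the format of the first data stream is a preset target format;

respectively carrying out privacy calculation processing on each first data stream to obtain a second data stream of each target data;

and generating data to be stored of each target data based on each second data stream.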

2. The method of claim 1, wherein the obtaining target data from each data source comprises:

determining the parsers respectively adapted to each data source;

and streaming the target data from each data source through each parser.

3. The method according to claim 1, wherein the preset target format is a virtual table format, and the performing format conversion on different types of data in each target data to obtain a first data stream of each target data includes:

determining input format converters matched with different types of data in the target data respectively;

and respectively carrying out format conversion on different types of data in the target data through the input format converters to obtain a first data stream of the target data, wherein the first data stream is represented by a virtual table format.

4. The method according to claim 1, wherein the performing privacy computation processing on each of the first data streams to obtain a second data stream of each of the target data comprises:

performing privacy calculation processing on each first data stream respectively to obtain data which is subjected to privacy calculation processing and corresponds to each first data stream;

and obtaining the second data stream of each target data based on the data which is subjected to the privacy computation processing and corresponds to each first data stream, wherein the format of the second data stream is the preset target format.

5. The method of claim 1, further comprising:

determining output format converters matched with different types of data in the target data respectively;

passing the second data streams corresponding to the target data through the output format converters to obtain updated second data streams;

generating data to be stored of each target data based on each second data stream includes:

and generating the data to be stored of each target data, which meet the target storage format, based on each updated second data stream.

6. The method of claim 1, further comprising:

and storing the data to be stored of each target data according to a preset storage mode.

7. The method according to claim 6, wherein the storing the data to be stored of each target data according to a preset storage mode respectively comprises:

respectively writing the data to be stored of each target data into the data slice of the memory corresponding to each data source;

and uploading the data to be stored in the data slices reaching the preset threshold value to a storage server under the condition that the data size of the data slices corresponding to each data source reaches the preset threshold value.

8. The method according to claim 7, wherein the data to be stored of each target data is stored according to a preset storage mode, and the method further comprises:

and uploading the data to be stored in the data slice of the memory corresponding to each data source to the storage server under the condition that the reading of the target data corresponding to each data source is finished.

9. The method according to claim 7 or 8, wherein the storing the data to be stored of each target data according to a preset storage mode further comprises:

and merging the data of all the data slices corresponding to the data sources through the storage server to obtain target storage data of the data sources.

10. The method of claim 9, further comprising:

determining a target database of each data source;

and storing the target storage data of each data source to the target database of each data source.

11. The method of claim 1, further comprising:

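generating a fault event and reporting the fault event layer by layer under the condition that an abnormality occurs in the data processing process;

and determining the fault occurrence reason according to the fault event.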

12. A data processing apparatus comprising:

the acquisition module is used for acquiring target data from each data source, wherein the target data comprises at least one type of data;

the first format conversion module is used for respectively carrying out format conversion on different types of data in the target data to obtain a first data stream of the target data, wherein the format of the first data stream is a preset target format;

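the privacy computation module is used for respectively carrying out privacy calculation processing on each first data stream to obtain a second data stream of each target data;

and the generating module is used for generating data to be stored of each target data based on each second data stream.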

13. The apparatus of claim 12, wherein the acquisition module is configured to:

determining the parsers respectively adapted to each data source;

and streaming the target data from each data source through each parser.

14. The apparatus of claim 12, wherein the preset target format is a virtual table format, and the first format conversion module is configured to:

15. The apparatus of claim 12, wherein the privacy computation module is to:

16. The apparatus of claim 12, further comprising:

a second format conversion module to:

the generating module is further configured to generate the to-be-stored data, which satisfies the target storage format, of each target data based on each updated second data stream.

17. The apparatus of claim 16, further comprising:

and the storage module is used for storing the data to be stored of each target data according to a preset storage mode.

18. The apparatus of claim 17, the storage module comprising:

and the uploading sub-module is used for uploading the data to be stored in the data slices reaching the preset threshold to a storage server under the condition that the size of the data in the data slices corresponding to the data sources reaches the preset threshold.

19. The apparatus of claim 18, wherein the upload sub-module is further configured to:

20. The apparatus of claim 18 or 19, wherein the storage module further comprises:

and the merging submodule is used for merging the data of all the data slices corresponding to the data sources through the storage server to obtain target storage data of the data sources.

21. The apparatus of claim 20, the storage module further comprising:

22. The apparatus of claim 12, further comprising:

an exception handling module to:

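generating a fault event and reporting the fault event layer by layer under the condition that an abnormality occurs in the data processing process;

and determining the fault occurrence reason according to the fault event.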

23. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.

24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-11.

25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.