CN111813846A - Data analysis processing system and data processing method

Info

Publication number: CN111813846A
Application number: CN202010611247.2A
Authority: CN
Inventors: 焦悦光; 胡宗星; 邱剑生; 郭璐; 崔静
Original assignee: Beijing Zetyun Tech Co ltd
Current assignee: Beijing Zetyun Tech Co ltd
Priority date: 2020-06-29
Filing date: 2020-06-29
Publication date: 2020-10-23
Anticipated expiration: 2040-06-29
Also published as: CN111813846B

Abstract

The invention provides a data analysis processing system and a data processing method. The method comprises: obtaining input data of a first data structure of a stream task; converting the input data of the first data structure into intermediate data of a second data structure; and processing the intermediate data by using an operator of the stream task, and outputting a processing result; wherein the second data structure includes a static data area and a dynamic data area. The data analysis processing system in the embodiment of the invention can thereby process dynamic data or complex data, which improves data processing efficiency.

Description

Data analysis processing system and data processing method

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data analysis processing system and a data processing method.

Background

In recent years, big data processing and analysis have become a global challenge. As the economy and society become increasingly informatized and automated, big data problems arise in many fields, such as government management, public services, scientific research, and commercial applications, and targeted, cost-effective solutions are needed. A big data platform provides processing capacity for industry big data and integrates functions such as data access, data processing, data storage, query and retrieval, analysis and mining, and application interfaces.

The existing data analysis processing system can only process single-layer data or static data, but cannot process dynamic data or complex data (nested data), and has low data processing efficiency and single data processing type.

Disclosure of Invention

The embodiment of the invention provides a data analysis processing system and a data processing method, which solve the problems of low data processing efficiency and single type of processed data of the conventional data analysis processing system.

In order to solve the above technical problem, the present invention provides a data processing method applied to a data analysis processing system, the method comprising:

obtaining input data of a first data structure of a streaming task;

converting input data of the first data structure into intermediate data of a second data structure;

processing the intermediate data by using an operator of the stream task, and outputting a processing result;

wherein the second data structure includes a static data region and a dynamic data region.

Preferably, in the above method, the converting the input data of the first data structure into the intermediate data of the second data structure includes:

acquiring the data type of the input data;

and converting the input data of the first data structure into intermediate data of a second data structure according to the data type.

Preferably, in the above method, the converting the input data of the first data structure into the intermediate data of the second data structure according to the data type includes:

determining a target data type corresponding to each field in the second data structure according to the original data type of each field of the input data, wherein the target data type comprises a static data type and a dynamic data type;

uniformly and sequentially numbering corresponding static data and dynamic data in the second data structure to obtain a static area index, and individually and sequentially numbering the dynamic data to obtain a dynamic area index;

and converting the input data of the first data structure into intermediate data of a second data structure according to the static area index, the dynamic area index and the corresponding target data type of each field in the second data structure.

Preferably, in the above method, the step of determining, according to the original data type of each field of the input data, a corresponding target data type of each field in the second data structure includes:

substep a: if the original data type of a field of the input data is static and the type is scalar, marking the field as static data;

substep b: if the original data type of a field of the input data is static and the type is non-scalar, recursively repeating substeps a and b for each subfield of the field;

substep c: if the original data type of a field of the input data is dynamic and the number and names of the subfields contained in the field are determined, recursively repeating substeps a, b, and c for each subfield of the field;

substep d: if the original data type of a field of the input data is dynamic and the number or names of the subfields of the field are undetermined, marking the field as dynamic data.

Preferably, in the above method, before the step of converting the input data of the first data structure into the intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure, the method further includes:

establishing a static data area with a corresponding length according to the number of the static area indexes;

and establishing a dynamic data area with a corresponding length according to the number of the dynamic area indexes.

Preferably, in the above method, the static data area is a variable-length array, and the dynamic data area is a variable-length array.

Preferably, in the above method, the step of converting the input data of the first data structure into the intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure includes:

mapping the value of the field marked as the static data into the array element of which the static area index corresponding to the field of the static data is a subscript in the static data area;

mapping the value of the field marked as the dynamic data into an array element in the dynamic data area, wherein the dynamic area index corresponding to the field of the dynamic data is a subscript;

and setting the value of an array element with the static area index corresponding to the field of the dynamic data as a subscript in the static data area as the dynamic area index.

Preferably, in the above method, the acquiring the data type of the input data includes:

obtaining a data type of the input data based on a user configuration input; or

determining a data type of the input data based on a pre-established data type prediction model.

Preferably, in the above method, the input data includes nested data and/or dynamic data.

Preferably, in the above method, after the step of converting the input data of the first data structure into the intermediate data of the second data structure, the method further includes:

determining a computing mode of the stream task based on the target data type;

the step of processing the intermediate data by using the operator of the stream task and outputting a processing result comprises the following steps:

and processing the intermediate data by using the operator of the stream task based on the calculation mode, and outputting a processing result.

Preferably, in the above method, before the step of obtaining the input data of the first data structure of the streaming task, the method further includes: and acquiring input data of the streaming task, and performing deserialization processing on the input data.

Preferably, in the method, the step of processing the intermediate data by using an operator of the stream task and outputting a processing result includes:

accessing a value corresponding to the intermediate data through the subscript of the array element by using an operator of the stream task;

calculating by using the value to obtain a calculation result;

and converting the calculation result into data of the first data structure to obtain output data.

Preferably, in the above method, after the step of converting the calculation result into the data of the first data structure and obtaining the output data, the method further includes:

carrying out serialization processing on the output data;

and outputting the output data after the serialization processing.

Preferably, in the above method, the stream task runs in a distributed manner, the step of processing the intermediate data by using an operator of the stream task and outputting a processing result includes:

calculating the intermediate data by using a first operator of the stream task, and performing serialization processing on the calculated data to obtain a byte stream;

inputting the byte stream into a second operator, and deserializing the byte stream to obtain calculation data; and processing the calculation data by using the second operator, and outputting a calculation result.

Preferably, in the above method, the second data structure further includes intrinsic attributes;

if a field of the input data is a field common to at least two data structures among the first data structures, the field is mapped to the intrinsic attributes.

The embodiment of the present invention further provides a data analysis processing system, where the data analysis processing system includes:

the acquisition module is used for acquiring input data of a first data structure of the stream task;

a conversion module for converting the input data of the first data structure into intermediate data of a second data structure;

the processing module is used for processing the intermediate data by using the operator of the flow task and outputting a processing result;

Preferably, in the data analysis processing system, the conversion module includes:

the acquisition subunit is used for acquiring the data type of the input data;

and the conversion subunit is used for converting the input data of the first data structure into the intermediate data of the second data structure according to the data type.

Preferably, in the data analysis processing system, the conversion subunit is specifically configured to:

Preferably, in the data analysis processing system, the step of obtaining, according to the original data type of each field of the input data, a target data type corresponding to each field in the second data structure includes:

Preferably, in the data analysis processing system, before the step of converting the input data of the first data structure into the intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure, the method further includes:

Preferably, in the data analysis processing system, the static data area is a variable-length array, and the dynamic data area is a variable-length array.

Preferably, in the data analysis processing system, the step of converting the input data of the first data structure into the intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure includes:

Preferably, in the data analysis processing system, the obtaining subunit is specifically configured to:

obtaining a data type of the input data based on a user configuration input; or

Preferably, in the data analysis processing system, the input data includes nested data and/or dynamic data.

Preferably, in the data analysis processing system, after the step of converting the input data of the first data structure into the intermediate data of the second data structure, the data analysis processing system further includes:

determining a computing mode of the stream task based on the target data type;

the processing module is specifically configured to:

Preferably, the data analysis processing system further includes:

and the deserializing module is used for acquiring the input data of the stream task and deserializing the input data.

Preferably, in the data analysis processing system, the processing module is further specifically configured to:

calculating by using the value to obtain a calculation result;

carrying out serialization processing on the output data;

and outputting the output data after the serialization processing.

Preferably, in the data analysis processing system, the stream task runs in a distributed manner, and the processing module is further specifically configured to:

Preferably, in the data analysis processing system, the second data structure further includes intrinsic attributes.

Preferably, in the data analysis processing system, the step of determining, according to the original data type of each field of the input data, a target data type corresponding to each field in the second data structure includes:

The embodiment of the present invention further provides a data analysis processing system, where the data analysis processing system includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when the computer program is executed by the processor, the steps of the data processing method are implemented.

An embodiment of the present invention further provides a readable storage medium, where a computer program is stored, and when the computer program is executed, the steps of the data processing method are implemented.

The invention provides a data analysis processing system and a data processing method, wherein the method comprises: obtaining input data of a first data structure of a stream task; converting the input data of the first data structure into intermediate data of a second data structure; processing the intermediate data by using an operator of the stream task, and outputting a processing result; wherein the second data structure includes a static data area and a dynamic data area. By converting the input data from the first data structure into the second data structure, the embodiment of the invention enables the data analysis processing system to process dynamic data or complex data, and improves data processing efficiency.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a data processing method provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of a graphical user interface for defining data structures provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a graphical user interface for defining a data structure according to an embodiment of the present invention;

FIG. 4 is a flow chart of step 102 of a data processing method provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a streaming task provided by an embodiment of the present invention;

FIG. 6 is a graphical configuration interface of a stream task operator provided by embodiments of the present invention;

FIG. 7 is a graphical configuration interface of yet another stream task operator provided by an embodiment of the present invention;

FIG. 8 is a block diagram of a data analysis processing system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart of a data processing method provided by an embodiment of the present invention, where the data processing method is applied to a data analysis processing system, and as shown in fig. 1, the data processing method includes the following steps:

step 101, input data of a first data structure of a stream task is obtained.

Optionally, the input data includes real-time data, and the input data may be nested data, or may be dynamic data. The nested data refers to data comprising at least two layers of data structures, and the field of the nested data is a non-scalar.

Wherein the first data structure includes a field name and a type of value. In an embodiment of the present invention, the first data structure includes at least one of: dynamic data structures, static data structures, nested data structures. If the field information in the first data structure is pre-known, i.e. the name, type and number of the fields can be determined, and the type of each field is either scalar or static, the data structure is called a static data structure, whereas if the field information in the first data structure is unpredictable, the data structure is called a dynamic data structure.

Illustratively, single-layer and nested data structures are described below using a "student achievement" example. The following data structure, called "student achievement", represents a student's score record. The first line is the name of the data structure, each following line gives a field name, and the type of the field follows the colon:

student achievement

    Student ID: string

    Name: string

    Achievement: integer

The value of each field in this data structure is a scalar (i.e., a value that can be processed directly without decomposition, such as an integer, a floating-point number, or a string), so it is referred to as a single-layer (flat) data structure.

Now change the type of the "achievement" field to another data structure, "subject scores":

subject scores

    Chinese: integer

    Mathematics: integer

    English: integer

The overall definition of the "student achievement" data structure becomes:

student achievement

    Student ID: string

    Name: string

    Achievement: subject scores

        Chinese: integer

        Mathematics: integer

        English: integer

The type of the "achievement" field is now no longer a scalar, and the overall "student achievement" data structure is referred to as a nested (i.e., multi-level) data structure.

Static data structures and dynamic data structures are further described below in conjunction with the above example.

In the "student achievement" example above, the field names, the field types, and the number of fields are all fixed, that is, the field information of the data structure is known in advance, so "student achievement" is a static data structure. If the "subject scores" data structure additionally contains a field "other subjects" whose type is a dynamic data structure (for example, the scores of other subjects are stored as key-value pairs, where the key is the subject name and the value is the corresponding score, and the number and names of the keys are unpredictable), then the "subject scores" data structure is dynamic, and consequently the "student achievement" data structure that contains it is also dynamic.

Optionally, obtaining the input data of the first data structure of the stream task specifically includes: defining a data structure according to the input data to be processed, and processing the input data based on the defined data structure to obtain the input data of the first data structure. The data structure used in the stream task may be defined by the user in a data structure description language (for example, as JSON code) or through a graphical user interface (GUI). FIG. 2 is a schematic diagram of a graphical user interface defining the "subject scores" data structure. The "subject scores" data structure shown in FIG. 2 has three fields of integer type, "Chinese", "mathematics", and "English", and one field, "other subjects", whose type is key-value type dynamic data.

FIG. 3 is a schematic diagram of a graphical user interface defining the "student achievement" data structure. It references the previously defined "subject scores" type, which forms a nested definition. The resulting overall definition of the "student achievement" data structure is:

student achievement

    Student ID: string

    Name: string

    Achievement: subject scores

        Chinese: integer

        Mathematics: integer

        English: integer

        Other subjects: key-value type dynamic data
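As one illustration of how such a definition might be written in a description language, the sketch below expresses the structure above as Python objects and checks whether a type is static under the rule stated earlier (a scalar, or a structure composed only of static fields). The class names `Scalar`, `Dynamic`, and `Struct` and the helper `is_static` are hypothetical and are not part of the described system.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Scalar:
    """A directly processable value type such as "string" or "integer"."""
    kind: str


@dataclass
class Dynamic:
    """A type whose sub-field names and count are not known in advance."""
    kind: str = "key-value"


@dataclass
class Struct:
    """A named data structure whose field names and types are known."""
    name: str
    fields: Dict[str, "FieldType"]


FieldType = Union[Scalar, Dynamic, Struct]


def is_static(ftype: FieldType) -> bool:
    """Static means scalar, or a structure whose fields are all static."""
    if isinstance(ftype, Scalar):
        return True
    if isinstance(ftype, Dynamic):
        return False
    return all(is_static(sub) for sub in ftype.fields.values())


subject_scores = Struct("subject scores", {
    "Chinese": Scalar("integer"),
    "Mathematics": Scalar("integer"),
    "English": Scalar("integer"),
    "Other subjects": Dynamic("key-value"),
})
student_achievement = Struct("student achievement", {
    "Student ID": Scalar("string"),
    "Name": Scalar("string"),
    "Achievement": subject_scores,
})

# The dynamic "Other subjects" field makes both enclosing structures dynamic.
print(is_static(subject_scores), is_static(student_achievement))
```

Removing the "Other subjects" entry from `subject_scores` would make both structures static again, which mirrors the distinction drawn in the text.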

Optionally, in step 101, before obtaining the input data of the first data structure of the stream task, the data processing method further includes: acquiring input data of the stream task, and performing deserialization processing on the input data.

Specifically, the input data of the streaming task is usually in the form of a byte stream, and the data analysis processing system cannot directly process the input data and needs to deserialize the input data into data of the first data structure.

Step 102, converting the input data of the first data structure into intermediate data of a second data structure.

Here, the second data structure includes a static data area and a dynamic data area. The static data area is a variable-length array whose values are scalars; the types of the values include, but are not limited to, integers, strings, and Boolean values. The dynamic data area is a variable-length array whose values are dynamic data structures of various kinds, for example key-value type dynamic data. The dynamic data area can allocate storage through pointers (for example, an array that grows dynamically by means of a linked list, or key-value data that grows dynamically by means of a hash table), so its storage space can grow dynamically.
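A minimal sketch of such a container is given below, assuming a Python representation in which both areas are lists. The class name `IntermediateData` and the sample values are hypothetical. Python's `dict` is a hash table, so it plays the role of the dynamically growing key-value data mentioned above; Python's `list` is a dynamic array rather than a linked list.

```python
from dataclasses import dataclass, field
from typing import Any, List, Union

ScalarValue = Union[int, float, str, bool]


@dataclass
class IntermediateData:
    """Hypothetical second data structure with two variable-length areas."""
    static_area: List[ScalarValue] = field(default_factory=list)
    dynamic_area: List[Any] = field(default_factory=list)


row = IntermediateData()
row.static_area.extend(["2020001", 88])   # scalars only
row.dynamic_area.append({"physics": 70})  # one key-value element
row.dynamic_area[0]["chemistry"] = 78     # grows without touching the static area
```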

Step 103, processing the intermediate data by using the operator of the stream task, and outputting a processing result.

The embodiment of the invention uses the second data structure to accommodate dynamic and nested data structures: data of a dynamic data structure, a nested data structure, or the like is converted into data of the second data structure, which the data analysis processing system supports, so that data of dynamic and nested data structures can be processed in real time.

The implementation of each step of the method is described in detail below:

Optionally, as shown in FIG. 4, step 102 includes:

step 1021, acquiring the data type of the input data.

Wherein, the step 1021 of obtaining the data type of the input data of the first data structure specifically includes: obtaining a data type of the input data based on a user configuration input; or processing the input data based on a data type prediction model established in advance in the data analysis processing system so as to determine the data type of the input data.

Specifically, the obtaining of the data type of the input data based on the user configuration input includes: and displaying a user interface for configuring the data type, and acquiring the configuration operation of the user on the user interface so as to acquire the data type of the input data of the first data structure.

Specifically, processing the input data based on a data type prediction model in the data analysis processing system, so as to determine the data type of the input data includes: the user inputs sample data, and the data analysis processing system utilizes a pre-trained data type prediction model to automatically infer the data type according to the sample data input by the user. Furthermore, the user can perform custom adjustment and modification on the data type automatically inferred by the data analysis processing system to obtain the final data type.

Step 1022, converting the input data of the first data structure into the intermediate data of the second data structure according to the data type.

Specifically, the conversion of the input data of the first data structure into the intermediate data of the second data structure includes two processes, marking the data type and establishing an index. The process of establishing the index is as follows:

in step 1022, converting the input data into intermediate data of the second data structure according to the data type comprises:

The process of marking data types is as follows:

the step of determining a target data type corresponding to each field in the second data structure according to the original data type of each field of the input data includes:

The marking process is for determining whether to place the fields in the static data area or the dynamic data area of the second data structure.
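The marking and indexing rules stated in the disclosure can be sketched as follows: scalar fields are marked static, fields with known sub-fields are traversed recursively, and fields with unpredictable sub-fields are marked dynamic; every marked field is numbered in one common sequence to give the static area index, and dynamic fields are additionally numbered on their own to give the dynamic area index. The schema encoding (nested dictionaries with type names as strings), the `Slot` tuple, and the function name `mark_and_index` are assumptions made for illustration only.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple, Union

SCALAR_TYPES = {"string", "integer", "float", "boolean"}
DYNAMIC_TYPE = "key-value"  # a field whose sub-fields are unpredictable

Schema = Dict[str, Union[str, "Schema"]]


class Slot(NamedTuple):
    path: Tuple[str, ...]         # field names from the root to this field
    mark: str                     # "static" or "dynamic"
    static_index: int             # position in the static data area
    dynamic_index: Optional[int]  # position in the dynamic data area, if any


def mark_and_index(schema: Schema) -> List[Slot]:
    slots: List[Slot] = []
    dynamic_count = 0

    def visit(fields: Schema, path: Tuple[str, ...]) -> None:
        nonlocal dynamic_count
        for name, ftype in fields.items():
            here = path + (name,)
            if isinstance(ftype, dict):
                # Known sub-fields: recurse instead of marking the field itself.
                visit(ftype, here)
            elif ftype in SCALAR_TYPES:
                slots.append(Slot(here, "static", len(slots), None))
            elif ftype == DYNAMIC_TYPE:
                # Dynamic fields take a slot in both numbering sequences.
                slots.append(Slot(here, "dynamic", len(slots), dynamic_count))
                dynamic_count += 1
            else:
                raise ValueError(f"unknown field type: {ftype!r}")

    visit(schema, ())
    return slots


STUDENT_ACHIEVEMENT: Schema = {
    "student_id": "string",
    "name": "string",
    "achievement": {
        "chinese": "integer",
        "mathematics": "integer",
        "english": "integer",
        "other_subjects": "key-value",
    },
}

slots = mark_and_index(STUDENT_ACHIEVEMENT)
for slot in slots:
    print(slot)
```

For this schema the sketch yields six slots with static area indexes 0 to 5, and only the last one, `other_subjects`, also receives a dynamic area index (0).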

After completing the marking of the data type and the indexing, converting the input data of the first data structure into the intermediate data of the second data structure further comprises two processes of establishing a data area and mapping data. The process of establishing the data area is as follows:

before the step of converting the input data into intermediate data of the second data structure according to the static area index, the dynamic area index, and the corresponding data type of each field in the second data structure, the method further includes:

The static data area is a variable-length array, and the dynamic data area is a variable-length array.

The process of mapping data is as follows:

the step of converting the input data into intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure includes:

Further, the second data structure further includes intrinsic attributes. Intrinsic attributes are fields common to at least two of the data structures being adapted, where these data structures include the first data structure of the stream task. For example, every event in a stream task of the data analysis processing system has a timestamp, so the timestamp can be treated as an intrinsic attribute. Intrinsic attribute fields are static, and the way they are stored is fixed at the source code design stage of the data analysis processing system.

Further, where the second data structure includes intrinsic attributes, the step of determining a target data type corresponding to each field in the second data structure according to the original data type of each field of the input data further includes:

Because fields common to at least two data structures among the first data structures are mapped directly to the intrinsic attributes, without going through substeps b, c, d, and e of the data type marking, the intrinsic attributes speed up the data structure conversion.

Example one: taking the "student achievement" data structure as an example, a field "other subjects" is added to the "subject scores" data structure. The type of the "other subjects" field is a dynamic data structure stored as key-value pairs, where the key is a subject name and the value is the score for that subject. The overall definition of the "student achievement" data structure becomes:

student achievement

    Student ID: string

    Name: string

    Achievement: subject scores

        Chinese: integer

        Mathematics: integer

        English: integer

        Other subjects: key-value type dynamic data

Table 1 is an illustration of the tagging and indexing of fields in the "student achievements" data structure.

TABLE 1

The "student ID" and "name" fields are marked as "static data" because they are scalars. The "achievement" field is a compound type and is not itself marked, but its "Chinese", "mathematics", and "English" subfields are marked as "static data", and its "other subjects" subfield, whose type is key-value type dynamic data, is marked as "dynamic data". The "static data" and "dynamic data" are then numbered.

The second data structure thus obtained is schematically shown in table 2 below:

TABLE 2

In Table 2, the "Chinese" entry refers to the "Chinese" field inside the "achievement" field, and the other entries are read in the same way.

Illustratively, a piece of "student achievement" data in JSON format (first data structure) is as follows:

The data after parsing into the second data structure is as follows:

Static data area: "2020001", "Zhang San", 88, 92, 59, 0

Dynamic data area: { "physics": 70, "chemistry": 78 }.
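The mapping that produces two such areas can be sketched as below. The input record is hypothetical, chosen to mirror the values listed above, and the slot table (field path, static area index, dynamic area index) is written out by hand as an assumed result of the marking and indexing step. The names `SLOTS`, `RECORD`, and `to_intermediate` are illustrative only.

```python
from typing import Any, Dict, List, Optional, Tuple

# (field path, static area index, dynamic area index or None)
SlotSpec = Tuple[Tuple[str, ...], int, Optional[int]]

SLOTS: List[SlotSpec] = [
    (("student_id",), 0, None),
    (("name",), 1, None),
    (("achievement", "chinese"), 2, None),
    (("achievement", "mathematics"), 3, None),
    (("achievement", "english"), 4, None),
    (("achievement", "other_subjects"), 5, 0),
]

RECORD: Dict[str, Any] = {  # hypothetical input in the first data structure
    "student_id": "2020001",
    "name": "Zhang San",
    "achievement": {
        "chinese": 88,
        "mathematics": 92,
        "english": 59,
        "other_subjects": {"physics": 70, "chemistry": 78},
    },
}


def to_intermediate(record: Dict[str, Any], slots: List[SlotSpec]):
    # Size each area from the number of indexes of the corresponding kind.
    static_area: List[Any] = [None] * len(slots)
    dynamic_area: List[Any] = [None] * sum(1 for _, _, d in slots if d is not None)
    for path, s_idx, d_idx in slots:
        value: Any = record
        for key in path:  # walk the nested record once, at conversion time
            value = value[key]
        if d_idx is None:
            static_area[s_idx] = value   # static field: store the scalar itself
        else:
            dynamic_area[d_idx] = value  # dynamic field: store the value here...
            static_area[s_idx] = d_idx   # ...and its dynamic index in the static area
    return static_area, dynamic_area


static_area, dynamic_area = to_intermediate(RECORD, SLOTS)
print(static_area)   # ['2020001', 'Zhang San', 88, 92, 59, 0]
print(dynamic_area)  # [{'physics': 70, 'chemistry': 78}]
```

The trailing `0` in the static area is the dynamic area index of "other subjects", which is how an operator can reach the dynamic value starting from the static area.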

Example two: two data structures used in a stream task are redefined as follows:

student failed-subject counts

    Student ID: string

    Name: string

    Failed-subject counts:

        Chinese: integer

        Mathematics: integer

        English: integer

        Other subjects: integer

student total failed subjects

    Student ID: string

    Name: string

    Number of failed subjects: integer

The marking and indexing process for the "student failed-subject counts" data structure is shown schematically in Table 3 below.

TABLE 3

The second data structure corresponding to the "student failed-subject counts" data structure is shown in Table 4 below:

TABLE 4

The marking and indexing process for the "student total failed subjects" data structure is shown schematically in Table 5 below:

TABLE 5

The second data structure corresponding to the "student total failed subjects" data structure is shown in Table 6 below:

TABLE 6

Optionally, the embodiment of the present invention provides the following feasible implementation of step 103, in which processing the intermediate data by using the operator of the stream task and outputting the processing result specifically includes:

calculating by using the value to obtain a calculation result;

In the process of converting the data of the first data structure into the intermediate data of the second data structure, the index is constructed and is used as the subscript of the array element corresponding to the second data structure, so that the data can be directly obtained through the subscript of the array element without searching step by step through field names, the data can be quickly obtained, the waiting time is shortened, and the calculation speed of the stream task operator is increased.
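The benefit of subscript access can be illustrated with a small operator sketch. Field paths are resolved to array subscripts once, when the operator is configured, and each record is then read purely by subscript. The index table, the operator, and the pass mark of 60 are assumptions made for this example and do not come from the described system.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Path = Tuple[str, ...]

# Assumed outcome of the indexing step: field path -> static area subscript.
INDEX_OF: Dict[Path, int] = {
    ("student_id",): 0,
    ("name",): 1,
    ("achievement", "chinese"): 2,
    ("achievement", "mathematics"): 3,
    ("achievement", "english"): 4,
}


def make_failed_count_operator(
    paths: Sequence[Path], pass_mark: int = 60
) -> Callable[[List], int]:
    # Name lookup happens once here, not once per record.
    subscripts = [INDEX_OF[p] for p in paths]

    def operator(static_area: List) -> int:
        # Per-record work uses array subscripts only.
        return sum(1 for i in subscripts if static_area[i] < pass_mark)

    return operator


count_failed = make_failed_count_operator([
    ("achievement", "chinese"),
    ("achievement", "mathematics"),
    ("achievement", "english"),
])

failed = count_failed(["2020001", "Zhang San", 88, 92, 59, 0])
print(failed)  # 1
```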

Furthermore, it should be noted that, in the process of processing the intermediate data of the second data structure by the operator in the streaming task to obtain the calculation result, multiple serialization and deserialization processes are performed on the intermediate data of the second data structure based on the requirements of the operator on input and output data, so that the data output by the upstream operator can be transmitted to the downstream operator as input through the network.

Specifically, the serialization and deserialization of the intrinsic properties of the second data structure is done at the source code level.

The static data area of the second data structure is a variable-length array, and it can be serialized as follows: first, a serialized integer value is output to represent the number of elements in the array, and then the serialized value of each element is output in sequence. Each element is of a scalar type and uses the native serialization of the corresponding data type in the programming language (for example, in Java, an integer is output directly as four bytes, and a character string is output as the code of each of its characters). During deserialization, an integer value is deserialized first to obtain the length of the array, and then the value of each element is deserialized according to the predetermined sequence of data types.
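A minimal sketch of this count-then-values layout is given below. It rests on assumptions the text does not specify: integers are written as four big-endian bytes (as Java's `DataOutputStream.writeInt` does), strings are written as a four-byte length followed by UTF-8 bytes, and the receiver already knows the type sequence. The function names are hypothetical.

```python
import struct


def serialize_static(values: list) -> bytes:
    """Write the element count, then each scalar value in order."""
    out = [struct.pack(">i", len(values))]
    for v in values:
        if isinstance(v, int):
            out.append(struct.pack(">i", v))          # four bytes per integer
        else:
            raw = v.encode("utf-8")
            out.append(struct.pack(">i", len(raw)) + raw)  # assumed string layout
    return b"".join(out)


def deserialize_static(data: bytes, types: list) -> list:
    """Read the count, then decode each value using the known type sequence."""
    (count,) = struct.unpack_from(">i", data, 0)
    offset = 4
    values = []
    for t in types[:count]:
        if t is int:
            (v,) = struct.unpack_from(">i", data, offset)
            offset += 4
        else:
            (n,) = struct.unpack_from(">i", data, offset)
            offset += 4
            v = data[offset:offset + n].decode("utf-8")
            offset += n
        values.append(v)
    return values


static_area = ["2020001", "Zhang III", 88, 92, 59, 0]
static_types = [str, str, int, int, int, int]
payload = serialize_static(static_area)
```

Because the type sequence is agreed in advance, no per-element type tag or field name is written to the byte stream.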

The dynamic data area of the second data structure is also a variable-length array, and it can be serialized as follows: first, a serialized integer value is output to represent the number of elements in the array, and then the serialized value of each element is output in sequence. Each element can be serialized as follows: first, a serialized integer value is output to represent the length of the byte string produced by serializing the element's value (because that length is not fixed), and then the serialized value of the element is output. During deserialization, an integer value is deserialized first to obtain the length of the array, and then each element is deserialized in sequence: an integer value is deserialized to obtain the length of the byte string to be read, and then a byte string of that length is read and deserialized into the value of the element.
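The length-prefixed layout of the dynamic data area can be sketched in the same way. The text does not say how an individual element value is encoded, so this sketch assumes JSON as a stand-in element encoding; the function names and the four-byte big-endian integers are likewise assumptions.

```python
import json
import struct


def serialize_dynamic(elements: list) -> bytes:
    """Write the element count, then (byte length, bytes) for each element."""
    out = [struct.pack(">i", len(elements))]
    for element in elements:
        raw = json.dumps(element).encode("utf-8")   # assumed element encoding
        out.append(struct.pack(">i", len(raw)))     # length prefix, since size varies
        out.append(raw)
    return b"".join(out)


def deserialize_dynamic(data: bytes) -> list:
    """Read the count, then read each element by its length prefix."""
    (count,) = struct.unpack_from(">i", data, 0)
    offset = 4
    elements = []
    for _ in range(count):
        (n,) = struct.unpack_from(">i", data, offset)
        offset += 4
        elements.append(json.loads(data[offset:offset + n].decode("utf-8")))
        offset += n
    return elements


dynamic_area = [{"physical": 70, "chemical": 78}]
payload = serialize_dynamic(dynamic_area)
```

The length prefix lets the reader skip or extract an element without understanding its internal encoding, which is what makes variable-size values workable here.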

Further, after the step of converting the calculation result into the data of the first data structure to obtain the output data, the method further includes:

carrying out serialization processing on the output data;

and outputting the output data after the serialization processing.

Specifically, for each input/output field in the first data structure of the stream task's input data, the corresponding static region index and dynamic region index in the second data structure can be obtained based on the user's configuration operation, so that when the stream task runs, the operator of the stream task directly accesses the corresponding value through the subscript of the array element. The operator of the stream task then calculates on the value to obtain a calculation result. The calculation result is converted into calculation data of the first data structure based on the second data structure, the static region index and the dynamic region index, and the calculation data is serialized into a byte stream, that is, field-value data is obtained as the output data.

Optionally, in this embodiment of the present invention, after converting the input data of the first data structure into the intermediate data of the second data structure in step 102, the method further includes:

determining a computing mode of the stream task based on the target data type;

the step of processing the intermediate data by using the operator of the stream task and outputting a processing result comprises the following steps: and processing the intermediate data by using the operator of the stream task based on the calculation mode, and outputting a processing result.

Based on the target data types of the second data structure, the calculation method can be optimized: the calculation method can be determined in advance from the types of the values in the data structure, so that the data analysis processing system processes the data directly with the predetermined method and the operation speed is improved. For example, when summing two values, if both are known in advance to be integers, the runtime can use integer addition directly to obtain the result; if the types of the two values cannot be predicted, the runtime has to check the possible type combinations one by one, apply the appropriate type conversions to the original data, and then use the addition operation of the corresponding type, which reduces the running speed.
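The idea of fixing the calculation method ahead of time can be sketched as below. The helper names `choose_add` and `generic_add` are hypothetical, and the type combinations handled by the fallback are only examples. In Python the speed benefit is merely illustrative; the point is that the type inspection moves from every record to a single decision made when the task is configured.

```python
import operator


def generic_add(a, b):
    """Fallback: inspect the operand types on every call and convert as needed."""
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return float(a) + float(b)
    if isinstance(a, str) and isinstance(b, str):
        return a + b
    raise TypeError(f"cannot add {type(a).__name__} and {type(b).__name__}")


def choose_add(left_type, right_type):
    """Pick the addition routine once, from the types declared in the schema."""
    if left_type is int and right_type is int:
        return operator.add      # no per-record type checks
    return generic_add           # types unknown: decide at run time


add = choose_add(int, int)       # decided before any record is processed
total = add(88, 92)
```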

Optionally, the stream task runs in a distributed manner, in step 103, the processing the intermediate data by using an operator of the stream task, and outputting the processing result may further include:

Specifically, when the stream task runs on the distributed platform, the instances of the operators may run on different hosts, the output data of the upstream operator needs to be serialized into a byte stream, the byte stream is transmitted to the host where the downstream operator is located through the network, and then deserialization operation is performed to restore the byte stream to the original data. Because the intermediate data of the second data structure is data of an array structure, and the subscript of the array structure is the index corresponding to the intermediate data field, the field name of the data does not need to be saved when the calculation data obtained by calculation based on the intermediate data is subjected to serialization processing, the size of the generated byte stream can be reduced, the network bandwidth during data transmission is saved, and the data processing efficiency can be further improved.
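The bandwidth argument can be illustrated with a small comparison. This sketch uses JSON purely as a stand-in wire encoding, which is an assumption rather than the format used by the system; `FIELD_ORDER` represents schema information assumed to be known to both hosts.

```python
import json

record = {"student_number": "2020001", "name": "Zhang III",
          "language": 88, "mathematics": 92, "english": 59}

# Field-value form: every record repeats the field names on the wire.
with_names = json.dumps(record).encode("utf-8")

# Array form: only the values travel; the position carries the meaning.
FIELD_ORDER = list(record)       # shared by upstream and downstream hosts
values_only = json.dumps([record[f] for f in FIELD_ORDER]).encode("utf-8")

# The downstream host rebuilds the field-value view from the shared schema.
restored = dict(zip(FIELD_ORDER, json.loads(values_only)))
```

The saving grows with the number of records, since the field names are paid for once in the schema instead of once per record.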

Illustratively, a stream task is defined to process the above-mentioned "student achievement" data structure and "total number of failed subjects" data structure.

The streaming task is illustrated in fig. 5, where both the input data format and the output data format are specified as JSON. After passing through the "resolve to RT Event" operator, the input data is converted into an RT Event (the RT Event is the intermediate data of the second data structure), and the corresponding data structure is "student achievement"; after the "field value mapping" operator, the corresponding data structure becomes "student failed-subject item count"; after the "sum" operator, the corresponding data structure becomes "student total number of failed subjects"; finally, the "construction output" operator converts the data into JSON format for output.

The specific operations performed by the operators of the stream task may be configured in a manner defined by a user input code (e.g., in a manner defined by a programming language such as Java, Python, or R) or in a manner defined by a graphical user interface. FIG. 6 below shows a graphical configuration interface for the operation of the "field value mapping" operator.

In the graphical configuration interface shown in fig. 6, the "student number" and "name" fields of the output data are configured to take the values of the same-named fields of the input data directly. The subfields "language", "mathematics" and "english" under the "failing subject count" field of the output data are generated by a conditional value calculation on the same-named subfields under the "score of each subject" field of the input data: when the original field value is less than 60, the result value is 1; otherwise, the result value is 0. The "other subjects" subfield under the "failing subject count" field of the output data is generated by a conditional count on the corresponding field under the "score of each subject" field of the input data: the count is the number of dynamic subfields of the original field (which is field-value type dynamic data) whose values are less than 60.

FIG. 7 is a schematic diagram of the graphical configuration interface of the "sum" operator.

According to the embodiment of the invention, the first data structure of the input data is converted into the second data structure, so that the data analysis processing system can process dynamic data or complex data, and the data processing efficiency is improved.

In the graphical configuration interface shown in fig. 7, the "student number" and "name" fields of the output data are configured to take the values of the same-named fields of the input data directly, and the "number of failed subjects" field of the output data is the sum of all subfields under the "failed subject count" field of the input data.
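The two operator configurations described for fig. 6 and fig. 7 can be sketched as plain functions over the static and dynamic data areas. This is an illustrative approximation, not the system's operator API: the function names, the array layout (student number, name, three subject scores) and the representation of the dynamic area as a list holding one field-value mapping are all assumptions, and the record is illustrative.

```python
PASS_MARK = 60  # threshold taken from the configuration described above


def field_value_mapping(static_area: list, dynamic_area: list):
    """Map each score to 1 (failed) or 0, and count failed dynamic subfields."""
    student_number, name, language, mathematics, english = static_area
    other_scores = dynamic_area[0] if dynamic_area else {}
    flags = [1 if score < PASS_MARK else 0
             for score in (language, mathematics, english)]
    other_failed = sum(1 for score in other_scores.values() if score < PASS_MARK)
    # All outputs are static scalars, so the dynamic area becomes empty.
    return [student_number, name, *flags, other_failed], []


def sum_operator(static_area: list, dynamic_area: list):
    """Add up every failed-subject counter after the two identity fields."""
    student_number, name, *counts = static_area
    return [student_number, name, sum(counts)], []


mapped = field_value_mapping(["2020001", "Zhang III", 88, 92, 59],
                             [{"physical": 70, "chemical": 78}])
summed = sum_operator(*mapped)
```

Both operators read their inputs by position only, so neither needs the field names at run time.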

Taking a specific input data as an example, the change of the data after passing through each operator is explained.

Input data in JSON format:

The data after the "resolve to RT Event" operator is as follows:

Static data area: "2020001", "Zhang III", 88, 92, 59, 0

Dynamic data area: { "physical": 70, "chemical": 78 }

The data after the "field value mapping" operator is as follows:

Static data area: "2020001", "Zhang III", 0, 0, 1, 0

Dynamic data area: empty

The data after the "sum" operator is as follows:

Static data area: "2020001", "Zhang III", 1

Dynamic data area: empty

The JSON format data output after the 'construction output' operator is as follows:

According to the invention, the first data structure of the input data is converted into the second data structure, so that the data analysis processing system can process dynamic data or complex data, and the data processing efficiency is improved. Meanwhile, the user operation is simple and convenient, and the user operation threshold is reduced.

Based on the data processing method provided in the above embodiment, an embodiment of the present invention further provides a data analysis processing system for implementing the above method. Referring to fig. 8, a data analysis processing system 800 provided in an embodiment of the present invention includes:

An obtaining module 801, configured to obtain input data of a first data structure of a streaming task.

A conversion module 802, configured to convert the input data of the first data structure into intermediate data of a second data structure.

And the processing module 803 is configured to process the intermediate data by using an operator of the stream task, and output a processing result. Wherein the second data structure includes a static data region and a dynamic data region.

Optionally, in the data analysis processing system, the conversion module includes:

the acquisition subunit is used for acquiring the data type of the input data;

and the conversion subunit is used for converting the input data of the first data structure into the intermediate data of the second data structure according to the data type of the first data structure.

Optionally, in the data analysis processing system, the conversion subunit is specifically configured to: determining a target data type corresponding to each field in the second data structure according to the original data type of each field of the input data, wherein the target data type comprises a static data type and a dynamic data type;

Optionally, in the data analysis processing system, the step of determining, according to the original data type of each field of the input data, a target data type corresponding to each field in the second data structure includes:

a substep: if the original data type of the field of the input data is static and the type is a scalar, marking the field as static data;

Optionally, in the data analysis processing system, the first data structure includes a field name and a type of a value.

Optionally, in the data analysis processing system, before the step of converting the input data of the first data structure into the intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure, the method further includes:

Optionally, the static data area is a variable length array, and the dynamic data area is a variable length array.

Optionally, in the data analysis processing system, the step of converting the input data of the first data structure into the intermediate data of the second data structure according to the static area index, the dynamic area index, and the data type corresponding to each field in the second data structure includes:

mapping the value of the field marked as the static data to an array element in the static data area, wherein the static area index corresponding to the field of the static data is a subscript;

Optionally, in the data analysis processing system, the obtaining subunit is specifically configured to:

obtaining a data type of the input data based on a user configuration input; or

Optionally, in the data analysis processing system, the input data includes nested data and/or dynamic data.

Optionally, in the data analysis processing system, after the step of converting the input data of the first data structure into the intermediate data of the second data structure, the method further includes:

determining a computing mode of the stream task based on the target data type;

the processing module is specifically configured to:

Optionally, the data analysis processing system further includes: and the deserializing module is used for acquiring the input data of the stream task and deserializing the input data.

Optionally, in the data analysis processing system, the processing module 803 is further specifically configured to:

calculating by using the value to obtain a calculation result;

carrying out serialization processing on the output data;

and outputting the output data after the serialization processing.

Optionally, in the data analysis processing system, the stream task runs in a distributed manner, and the processing module 803 is further specifically configured to:

Optionally, in the data analysis processing system, the second data structure further includes intrinsic attributes.

Optionally, in the data analysis processing system, the determining, according to the original data type of each field of the input data, a target type corresponding to each field in the second data structure includes:

The data analysis processing system provided by the invention has the advantages that the first data structure of the input data is converted into the second data structure, so that the data analysis processing system can process dynamic data or complex data, and the data processing efficiency is improved. Meanwhile, the user operation is simple and convenient, and the user operation threshold is reduced.

An embodiment of the present invention provides a data analysis processing system, which includes a processor, a memory, and a computer program stored on the memory and capable of running on the processor, and when executed by the processor, the computer program implements the steps of the data processing method according to the above embodiment.

An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the data processing method according to the above embodiment.

The embodiment of the present invention further provides a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the data processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data processing method is applied to a data analysis processing system and is characterized by comprising the following steps:

obtaining input data of a first data structure of a streaming task;

2. The method of claim 1, wherein converting the input data of the first data structure into the intermediate data of the second data structure comprises:

acquiring the data type of the input data;

3. The method of claim 2, wherein converting the input data of the first data structure into the intermediate data of the second data structure according to the data type comprises:

4. The method of claim 1, wherein the input data comprises nested data and/or dynamic data.

5. The method of claim 3, wherein after the step of converting the input data of the first data structure to the intermediate data of the second data structure, the method further comprises:

determining a computing mode of the stream task based on the target data type;

6. A data analysis processing system, characterized in that the data analysis processing system comprises:

7. The data analysis processing system of claim 6, wherein the conversion module comprises:

the acquisition subunit is used for acquiring the data type of the input data;

a conversion subunit, configured to convert input data of a first data structure into intermediate data of a second data structure according to the data type of the first data structure.

8. The data analysis processing system according to claim 7, wherein the conversion subunit is specifically configured to: determining a target data type corresponding to each field in the second data structure according to the original data type of each field of the input data, wherein the target data type comprises a static data type and a dynamic data type;

9. The data analysis processing system of claim 6, wherein the input data comprises nested data and/or dynamic data.

10. The data analysis processing system of claim 8, wherein the step of converting the input data of the first data structure into the intermediate data of the second data structure is followed by further comprising:

determining a computing mode of the stream task based on the target data type;

the processing module is specifically configured to: