CN109343833B - Data processing platform and data processing method - Google Patents

Data processing platform and data processing method

Info

Publication number
CN109343833B
CN109343833B CN201811117521.XA
Authority
CN
China
Prior art keywords
algorithm
component
calculation result
data
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811117521.XA
Other languages
Chinese (zh)
Other versions
CN109343833A (en
Inventor
罗志国 (Luo Zhiguo)
王文军 (Wang Wenjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dinfo Beijing Science Development Co ltd
Original Assignee
Dingfu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co ltd filed Critical Dingfu Intelligent Technology Co ltd
Priority to CN201811117521.XA priority Critical patent/CN109343833B/en
Publication of CN109343833A publication Critical patent/CN109343833A/en
Application granted granted Critical
Publication of CN109343833B publication Critical patent/CN109343833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/20 Software design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/31 Programming languages or programming paradigms

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data processing platform comprising: a first component containing a first algorithm encapsulated with a standard interface, and a second component containing a second algorithm encapsulated with the same standard interface. The standard interface enables the first algorithm and the second algorithm to each calculate using data with a data structure of Dataset and to output a calculation result with a data structure of Dataset, so that the first component and the second component can be connected in series. The first algorithm and the second algorithm are machine learning algorithms written in different programming languages. On this data processing platform, a user can directly connect machine learning algorithms written in two different programming languages in series, without having to contend with language barriers or master professional programming knowledge; cross-language invocation and execution of algorithms is thereby realized, and the cost of use is reduced.

Description

Data processing platform and data processing method
Technical Field
The invention relates to the technical field of big data processing, in particular to a data processing platform and a data processing method.
Background
With the rapid growth of structured, semi-structured, and unstructured data, mining valuable information from massive data has become a challenge.
In different application scenarios, the data a computer must process differ, as do their storage formats and the information the user needs to extract. As a result, processing data often requires a variety of machine learning algorithms for different tasks, such as neural networks, support vector machines, logistic regression, decision trees, random forests, naive Bayes, hidden Markov models, and conditional random fields. Combining several different algorithms yields a data processing method for a specific application scenario, from which information meeting the user's requirements can be mined out of massive data.
However, different algorithms may be written in different programming languages, such as Java, Python, and R, and such algorithms cannot be directly connected in series into a complete data processing method with which users of a data processing platform can process data. This is the problem to be solved by those skilled in the art.
Disclosure of Invention
To solve this technical problem, the present application provides a data processing platform whose components contain algorithms written in different programming languages. The components can be flexibly combined, so that a user can directly connect machine learning algorithms written in different programming languages in series.
In a first aspect, a data processing platform is provided, comprising:
a first component comprising a first algorithm encapsulated with a standard interface;
a second component comprising a second algorithm encapsulated with a standard interface;
the standard interface is used for enabling the first algorithm and the second algorithm to respectively calculate by using data with a data structure of Dataset and outputting a calculation result with the data structure of Dataset, so that the first component and the second component can be connected in series; the first and second algorithms are machine learning algorithms written in different programming languages.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the data processing platform further includes:
an interaction unit configured to: send the first component and the first data set to an executor according to a first node in the processing flow; receive a first calculation result, the first calculation result being the calculation result with a data structure of Dataset obtained by the executor processing the first data set with the first component; and send the second component and the first calculation result to an executor according to a second node in the processing flow, wherein the second node is connected after the first node, the first node corresponds to the first component, and the second node corresponds to the second component.
With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the data processing platform further includes:
an actuator for: processing the first data set by using the first component to obtain a first calculation result with a data structure as Dataset; sending the first calculation result to the interaction unit; receiving a second assembly and a first calculation result sent by the interaction unit; and processing the first calculation result by using a second component to obtain a second calculation result with a data structure of Dataset.
With reference to the first aspect and the first and second implementation manners, in a third possible implementation manner of the first aspect, the data processing platform further includes:
a third component comprising a third algorithm encapsulated with a standard interface; the third algorithm is a machine learning algorithm written in the same programming language as the second algorithm; and the standard interface is further used to enable the third algorithm to calculate using data with a data structure of Dataset and to output a calculation result with a data structure of Dataset.
With reference to the third implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the data processing platform further includes:
an interaction unit configured to: send the first component and the first data set to an executor according to a first node in the processing flow; receive a first calculation result, the first calculation result being the calculation result with a data structure of Dataset obtained by the executor processing the first data set with the first component; and send the second component, the third component, and the first calculation result to an executor according to a second node and a third node in the processing flow, wherein the second node is connected after the first node, the third node is connected after the second node, and the first, second, and third nodes correspond to the first, second, and third components, respectively.
With reference to the fourth implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the data processing platform further includes:
an executor configured to: process the first data set with the first component to obtain a first calculation result with a data structure of Dataset; send the first calculation result, stored in a first memory, to the interaction unit; receive the second component, the third component, and the first calculation result sent by the interaction unit; process the first calculation result with the second component to obtain a second calculation result with a data structure of Dataset; store the second calculation result in a second memory; and process the second calculation result in the second memory with the third component to obtain a third calculation result with a data structure of Dataset.
With reference to the first aspect and the foregoing possible implementation manners, in a sixth possible implementation manner of the first aspect, the interaction unit is further configured to: acquire a first dependency file corresponding to the first component according to the first node in the processing flow, and send the first dependency file, the first component, and the first data set to an executor;
and/or,
the interaction unit is further configured to: acquire a second dependency file corresponding to the second component according to the second node in the processing flow, and send the second dependency file, the second component, and the second calculation result to an executor.
With reference to the first aspect and the foregoing possible implementation manners, in a seventh possible implementation manner of the first aspect, the data processing platform further includes:
a component publishing unit configured to: acquire a fourth dependency file corresponding to a fourth component; perform conflict detection between the fourth dependency file and the dependency files of already published components; output conflict information in case of a conflict; acquire a merge instruction entered by the user in response to the conflict information; and update the dependency file that conflicts with the fourth dependency file to the fourth dependency file according to the merge instruction. The fourth component comprises a fourth algorithm encapsulated with the standard interface; the fourth algorithm is a machine learning algorithm; and the standard interface is further used to enable the fourth algorithm to calculate using data with a data structure of Dataset and to output a calculation result with a data structure of Dataset.
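The publish-time conflict check can be sketched roughly as follows. Modeling dependency files as name-to-version maps, and the function names and merge policy, are illustrative assumptions rather than the platform's actual behaviour:

```python
def detect_conflicts(new_deps, published_deps):
    # A conflict is a dependency that is already published under a
    # different version.
    return {name for name, version in new_deps.items()
            if name in published_deps and published_deps[name] != version}

def publish(new_deps, published_deps, merge=False):
    conflicts = detect_conflicts(new_deps, published_deps)
    if conflicts and not merge:
        # Output conflict information and wait for a merge instruction.
        return conflicts
    # Merge instruction given (or no conflict): conflicting entries are
    # updated to the newly published component's versions.
    published_deps.update(new_deps)
    return set()
```

A first call to `publish` without a merge instruction returns the conflicting names; repeating it with `merge=True` carries out the update described above.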
In a second aspect, a data processing method is provided, including:
acquiring input data for a first algorithm from a first data set with a data structure of Dataset, and calculating with the first algorithm to obtain a first calculation result with a data structure of Dataset;
acquiring input data for a second algorithm from the first calculation result, and calculating with the second algorithm to obtain a second calculation result with a data structure of Dataset;
wherein the first algorithm and the second algorithm are machine learning algorithms written in different programming languages.
With reference to the second aspect, in a first possible implementation manner of the second aspect, after the step of calculating the first calculation result with a data structure of Dataset using the first algorithm, the method further includes:
writing the first calculation result stored in a first memory to HDFS or disk;
reading the first calculation result stored in HDFS or on disk into a second memory;
and after the step of calculating the second calculation result with a data structure of Dataset using the second algorithm, the method further includes:
acquiring input data for a third algorithm from the second calculation result in the second memory, and calculating with the third algorithm to obtain a third calculation result with a data structure of Dataset.
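The memory-to-storage hand-off between the two steps can be sketched as below; a local temporary file stands in for HDFS, and the JSON encoding and function name are illustrative assumptions:

```python
import json
import os
import tempfile

def handoff(first_result):
    # First memory -> HDFS/disk: persist the first calculation result.
    path = os.path.join(tempfile.mkdtemp(), "first_result.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(first_result, f)
    # HDFS/disk -> second memory: reload it for the next algorithm.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The reloaded rows are what the third algorithm then reads from the second memory.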
The data processing platform of this technical scheme comprises a first component and a second component, containing respectively a first algorithm and a second algorithm encapsulated with a standard interface. Because both components are encapsulated with the standard interface, the two algorithms, although written in different programming languages, can invoke data stored in the Dataset data structure through that interface; their calculation results are likewise output in Dataset form, so the calculation result of one component can serve directly as the input data of another. In this way a user can directly connect machine learning algorithms written in two different programming languages in series on the data processing platform, without regard to language barriers and without professional programming knowledge; cross-language invocation and execution of algorithms is realized, and the cost of use is reduced.
Drawings
To explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. Those skilled in the art can derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic block diagram of one embodiment of a data processing platform according to the present application;
FIG. 2 is a schematic diagram of data ingress and egress during execution of a first component and a second component connected in series in a data processing platform according to the present application;
FIG. 3 is a schematic diagram of a second embodiment of a data processing platform according to the present application;
FIG. 4 is a diagram illustrating an example of a data processing method implemented in a process customization unit in the data processing platform of the present application;
FIG. 5 is a schematic diagram of a third embodiment of a data processing platform according to the present application;
FIG. 6 is a flowchart of one embodiment of a data processing method according to the present application.
Detailed Description
The following provides a detailed description of the embodiments of the present application.
In the prior art, machine learning algorithms written in different programming languages cannot be concatenated directly, mainly for the following reason: many existing algorithms provide only basic computational capability and must be packaged a second time for a given service before real use, and because there is no uniform specification for this packaging, the input and output parameters have different data structures across languages. For example, suppose the first algorithm is written in Java and the second in Python. The first algorithm takes an input parameter of Java's List structure and produces an output parameter of the List structure; the second algorithm takes an input parameter of Python's tuple structure and produces a tuple output. Since the data structure of the first algorithm's output parameters does not match that of the second algorithm's input parameters, the user cannot concatenate the two directly. Concatenation here means using the output parameters of the preceding algorithm directly as the input parameters of the following algorithm, without data structure conversion.
Spark is a Hadoop-like distributed parallel computing framework: a general-purpose computing engine for rapidly processing large-scale data. A data structure is the way data is stored and organized in a computer; commonly used data structures include arrays, stacks, queues, linked lists, trees, graphs, heaps, and hash tables. The Resilient Distributed Dataset (RDD) is Spark's basic data structure, a read-only, partitionable distributed data set, and all data in Spark can be organized in RDD form. Dataset is a data structure introduced in Spark 1.6 and retained in subsequent versions, with supported languages including Java, Scala, Python, and R.
To solve the problem that machine learning algorithms written in different programming languages cannot be used directly in series, this application applies the Spark data structure Dataset to a standard interface for the first time: machine learning algorithms written in different programming languages are packaged with the standard interface, and each machine learning algorithm is packaged independently into a component. The components are then published to a data processing platform, so that a user can directly combine different components on the platform according to different requirements, connecting machine learning algorithms written in different languages in series into a complete data processing method for processing massive data.
Specifically, referring to FIG. 1 to FIG. 3, a first embodiment of the present application provides a data processing platform, including:
a first component 1 comprising a first algorithm 11 encapsulated with a standard interface 8;
a second component 2 comprising a second algorithm 21 encapsulated with a standard interface 8;
the standard interface 8 is configured to enable the first algorithm 11 and the second algorithm 21 to perform calculation by using data with a data structure of Dataset, and output a calculation result with a data structure of Dataset, so that the first component 1 and the second component 2 can be connected in series; the first algorithm 11 and the second algorithm 21 are machine learning algorithms written in different programming languages.
The above-mentioned Dataset is a data structure in Spark 1.6 and subsequent versions, and it imposes a type constraint on each row of data of an RDD. For example, data stored in an RDD is shown in Table 1, and data stored in a Dataset in Table 2. In the Dataset of Table 2, each row of data is actually an object entity; for example, a piece of user information is an object entity, and the row records the type of the entity (e.g., "People"), the elements in the entity (e.g., "id", "name", "age"), and the type of each element (e.g., "bigint", "string"). The corresponding data (such as "1", "Zhang San", "23") are stored according to the type of the entity, the elements it contains, and the types of those elements.
Table 1: example data in an RDD
1, Zhang San, 23
2, Li Si, 35
Table 2: example data in a Dataset
value: People[id: bigint, name: string, age: bigint]
People(id=1, name="Zhang San", age=23)
People(id=2, name="Li Si", age=35)
Note: in Table 2, bigint is a numeric type, string is a character type, id is an account number, name is a user name, and age is an age.
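The typed row of Table 2 can be mirrored in Python with a dataclass standing in for the Dataset's entity type; the class and field names follow the table and are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class People:
    """Entity type from Table 2: each Dataset row is one typed object."""
    id: int     # bigint in the table
    name: str   # string in the table
    age: int    # bigint in the table

# The two rows of Table 2 as typed entities rather than untyped tuples.
rows = [People(id=1, name="Zhang San", age=23),
        People(id=2, name="Li Si", age=35)]
```

Unlike the bare comma-separated rows of Table 1, each row here carries its entity type, its elements, and their types.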
The standard interface 8 in this embodiment may be understood as a unified entry function ("spark execution"): packaging a machine learning algorithm with the standard interface 8 means the algorithm is actually accessed through this same entry function. The standard interface defines a specific data structure; that is, through the standard interface, data with a data structure of Dataset can be called, and data with a data structure of Dataset can be output. Whichever of Java, Python, Scala, or R the machine learning algorithm is written in, once it is encapsulated with the standard interface it can perform the corresponding calculation on data with a data structure of Dataset, because the underlying Dataset data structure is consistent across programming languages, while the Dataset's outer layer provides multi-language APIs for invocation; the Dataset data structure thus achieves a cross-language effect. In this embodiment, parameters and attributes, such as the type of object entity a Dataset contains, the elements of that entity, and the types of those elements, may be adjusted by the user for different application scenarios.
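As a loose sketch of this packaging idea (a Python stand-in, not the platform's actual code: the `Dataset` alias, the `Component` class, and the entry-function name `spark_execution` are all illustrative assumptions), a component can expose its wrapped algorithm through one uniform entry point that keeps the Dataset-in, Dataset-out contract:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# List-of-rows stand-in for Spark's Dataset data structure.
Dataset = List[Dict[str, Any]]

@dataclass
class Component:
    name: str
    algorithm: Callable[[Dataset], Dataset]  # the wrapped ML algorithm

    def spark_execution(self, data: Dataset) -> Dataset:
        # Unified entry function: whatever language the wrapped algorithm
        # was originally written in, it is invoked the same way and must
        # consume and produce Dataset-shaped data.
        return self.algorithm(data)

# Example component: an "algorithm" that upper-cases a text column.
upper = Component("upper",
                  lambda ds: [{**r, "text": r["text"].upper()} for r in ds])
```

Because every component is reached through the same entry function on the same data shape, any other component obeying this contract can consume `upper`'s output directly.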
The first and second algorithms may each be selected from existing machine learning algorithms, such as neural networks, support vector machines, logistic regression, decision trees, random forests, naive Bayes, hidden Markov models, and conditional random fields. The second algorithm and the first algorithm are machine learning algorithms written in different programming languages; the programming languages in this application include Java, Python, Scala, and R, and the first and second algorithms may use any combination of two of these four languages. For example, the first algorithm may be written in Java and the second in Python.
The first algorithm and the second algorithm are each packaged with the standard interface to obtain the first component and the second component, which are then published to the data processing platform. Because both components are packaged with the standard interface, the first and second algorithms, although written in different programming languages, can call data stored in the Dataset data structure through that interface, and their calculation results are still output in Dataset form, so the calculation result of one component can be used directly as the input data of another. In this way a user can directly connect machine learning algorithms written in two different programming languages in series on the data processing platform, without regard to language barriers and without professional programming knowledge; cross-language invocation and execution of algorithms is realized, and the cost of use is reduced.
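The series connection described above can be sketched as follows. The two functions are hypothetical stand-ins (one imagined as wrapping a Java-written algorithm, the other a Python-written one), both honouring the shared Dataset-shaped (list-of-rows) contract so their outputs and inputs line up without conversion:

```python
def first_algorithm(dataset):
    # Stand-in for the first component's algorithm (imagine it wraps a
    # Java-written model): score each row by the length of its text.
    return [{**row, "score": len(row["text"])} for row in dataset]

def second_algorithm(dataset):
    # Stand-in for the second component's algorithm (imagine it wraps a
    # Python-written model): keep rows whose score exceeds a threshold.
    return [row for row in dataset if row["score"] > 3]

# Direct series connection: the Dataset out of one component goes
# straight into the next, with no data structure conversion in between.
result = second_algorithm(first_algorithm([{"text": "spark"}, {"text": "ml"}]))
```

Here `result` is `[{"text": "spark", "score": 5}]`; both ends of the chain speak Dataset, so the concatenation is direct.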
Besides these, the data processing platform may include other components, such as a third component 3, a fourth component (not shown in FIG. 1), and a fifth component (not shown in FIG. 1). Each contains a machine learning algorithm packaged with the standard interface, and the programming languages of these algorithms may be the same as or different from those of the first and second algorithms. According to the requirements of the data processing task, the user selects all or some of the components and combines them in series into a complete data processing method for processing massive data.
Optionally, referring to FIG. 1 and FIG. 3, the data processing platform may further include a flow customization unit 6 for formulating the data processing flow. In the flow customization unit 6, the user can represent different components with different nodes and connect them as required, indicating that different machine learning algorithms are connected in series into a complete data processing method. For example, referring to FIG. 4, a first node represents the first component and a second node represents the second component; connecting the second node after the first node indicates that data are processed first by the first component and then by the second component. Similarly, a third node may represent the third component and be connected after the second node, indicating that data processed by the second component are then processed by the third component; a fourth node may represent a fifth component and be connected after the third node, indicating that data processed by the third component are then processed by the fifth component. A user-defined processing flow may include all or only some of the components already on the platform. In this way, the user can visually formulate the dependency relationships between components, without having to run the components while the flow is being formulated.
In addition, differences between machine learning algorithms in their specific parameters and in the number of those parameters may also hinder their serial use. For this purpose, constraints on the input and output parameters of components, such as defining the types of the output parameters, may also be set in the flow customization unit 6, so that defining the input and output parameters makes the connection between components smoother when they run to process data. For example, the second component shown in FIG. 4 has 2 output parameters while the third component has 3 input parameters, so the numbers of output and input parameters differ; the names of the input parameters of the third component also differ from those of the output parameters of the second component. As a result, the output parameters of the preceding component cannot be passed successfully to the next component. In this case, a correspondence between the first output parameter of the second component and the first input parameter of the third component may be constructed in the flow customization unit 6, designating the first output parameter of the second component as the first input parameter of the third component, so that the two components can pass parameters smoothly when used in series.
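A minimal sketch of such a parameter correspondence follows; all names are hypothetical, and the real flow customization unit is a visual tool rather than this function:

```python
def map_parameters(upstream_outputs, mapping, downstream_defaults=None):
    """Build the downstream component's input parameters.

    mapping: {upstream_output_name: downstream_input_name}; any downstream
    inputs not covered by the mapping fall back to the supplied defaults.
    """
    inputs = dict(downstream_defaults or {})
    for src, dst in mapping.items():
        inputs[dst] = upstream_outputs[src]
    return inputs

# The second component emits 2 outputs; the third expects 3 inputs with
# different names, so the user designates the correspondences explicitly.
second_out = {"model": "m1", "labels": ["pos", "neg"]}
third_in = map_parameters(second_out,
                          {"model": "classifier", "labels": "classes"},
                          {"threshold": 0.5})
```

`third_in` becomes `{"threshold": 0.5, "classifier": "m1", "classes": ["pos", "neg"]}`: each upstream output lands under the downstream name the user designated, and the uncovered input keeps its default.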
In this application, a method formed by combining different components in series may run on the local data processing platform or on a remote Spark cluster. That is, a data processing task may be executed by at least one executor (Executor) on the local execution engine, or submitted to a remote Spark cluster and executed by at least one remote executor. When processing locally, each component must be packaged and submitted together with the other files necessary for running Spark, according to the dependency relationships in the specified processing flow, after which a local executor completes the computing task. When executing on a remote Spark cluster, each component is likewise packaged and submitted with the necessary Spark runtime files according to its dependency relationships in the specified processing flow and distributed to executors on the remote cluster, which complete the process and finally return the calculation result to the data processing platform. Whether execution is local or remote, an interaction unit may be provided to interact with the executors and transfer data to and from them.
Therefore, optionally, referring to FIG. 1 and FIG. 3, the data processing platform further includes an interaction unit 4 configured to send the components in the processing flow, together with a first data set with a data structure of Dataset, to the executor 5, and to receive from the executor 5 calculation results with a data structure of Dataset. More specifically, when sending components and data, the interaction unit 4 transmits step by step according to the dependency relationships between the components in the processing flow. Referring to the example processing flow of FIG. 4, which includes a first node and a second node connected after it, the first node corresponding to the first component and the second node to the second component, the interaction unit 4 is configured to: send the first component 1 and a first data set to an executor 5 according to the first node in the processing flow; receive a first calculation result, being the Dataset-structured calculation result the executor 5 obtained by processing the first data set with the first component 1; and send the second component 2 and the first calculation result to an executor 5 according to the second node in the processing flow.
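The interaction unit's stepwise dispatch can be sketched as a loop over flow nodes; this is an assumption about the control flow with hypothetical names, and the executor here is any callable that applies a component to a Dataset and returns a Dataset:

```python
def run_flow(flow_nodes, components, first_dataset, executor):
    # Walk the customized flow in order: for each node, send its component
    # and the upstream result to an executor, then carry the returned
    # Dataset forward as input to the next node's component.
    result = first_dataset
    for node in flow_nodes:
        component = components[node]
        result = executor(component, result)
    return result

# Two toy components appended in series, with a trivial local "executor".
components = {"n1": lambda ds: ds + ["from-n1"],
              "n2": lambda ds: ds + ["from-n2"]}
final = run_flow(["n1", "n2"], components, [], lambda comp, data: comp(data))
```

The same loop works whether `executor` dispatches to a local execution engine or submits to a remote cluster; only the callable changes.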
When Spark executes a task, one task usually generates multiple jobs, each job is divided into multiple stages, and the stages are allocated to the executors on the Spark cluster that have free computing resources and can participate in the work. The executor in this embodiment should therefore be understood as a general term for any executor, remote or local, that can participate in the work; it does not refer to a single specific executor or a few of them.
In one implementation, the interaction unit divides the tasks contained in the components and then distributes the resulting stages to multiple executors. In another implementation, the interaction unit may package and submit the components and data to a separate driver unit (not shown in the figures), which performs the task division and distributes the stages to multiple executors. In this application, transmission and reception between the interaction unit and the executors cover both direct and indirect sending/receiving implementations, and the application does not limit this.
Depending on the conversion functions applied, the dependency relationship between Datasets can be divided into narrow dependency and wide dependency. Narrow dependency means that each partition of a parent Dataset is used by at most one partition of the child Dataset, and a child Dataset partition typically corresponds to a limited number of fixed parent partitions. Wide dependency, correspondingly, means that each partition of the parent Dataset may be used by multiple child Dataset partitions, and the child partitions typically correspond to all parent partitions. When dividing stages, the data processing flow formulated in the flow customization unit 6 can be searched from back to front, grouping all operations linked by narrow dependencies into one stage. All data of such a stage can be kept in memory during execution, so that intermediate data never need to be persisted, reducing the I/O overhead of reading data.
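The grouping can be illustrated with a simplified linear flow; the two-valued dependency tags and the function name are assumptions for illustration, and real Spark stage division works over a DAG rather than a list:

```python
def group_stages(ops):
    """ops: list of (name, dep) pairs, where dep is 'narrow' or 'wide'
    relative to the previous operation. Operations joined by narrow
    dependencies share a stage; a wide dependency (shuffle) opens a new one."""
    stages = []
    for name, dep in ops:
        if not stages or dep == "wide":
            stages.append([name])      # shuffle boundary: start a new stage
        else:
            stages[-1].append(name)    # narrow dependency: stay in-stage
    return stages

plan = group_stages([("read", "narrow"), ("map", "narrow"),
                     ("groupBy", "wide"), ("agg", "narrow")])
```

`plan` is `[["read", "map"], ["groupBy", "agg"]]`: the narrow-linked operations stay together, so their intermediate data can remain in memory within each stage.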
Referring to fig. 1, when the data processing is performed locally, any of the foregoing data processing platforms may further include at least one executor 5, configured to: process the first data set by using the first component 1 to obtain a first calculation result whose data structure is Dataset; send the first calculation result to the interaction unit 4; receive the second component and the first calculation result sent by the interaction unit 4; and process the first calculation result by using the second component 2 to obtain a second calculation result whose data structure is Dataset.
In the data processing platform, the components are decoupled from the scheduling operations on them, so the two are not tightly bound and excessive coupling is avoided.
In addition to the first component 1 and the second component 2, the data processing platform may further include components written in any of the Java, Scala, Python and R languages as extension components, for the user to select flexibly according to the actual usage scenario.
Optionally, in one implementation, referring to fig. 1, the data processing platform includes a third component 3, and the third component 3 includes a third algorithm encapsulated with a standard interface; the third algorithm is a machine learning algorithm written in the same programming language as the second algorithm; the standard interface is also used to enable the third algorithm to compute with data whose data structure is Dataset and to output a calculation result whose data structure is Dataset.
In this case, referring to fig. 4 and fig. 1, if in the processing flow the second node is connected after the first node, the third node is connected after the second node, and the first node corresponds to the first component, the second node to the second component, and the third node to the third component, then the interaction unit 4 may specifically be configured to: send the first component 1 and the first data set to an executor 5 according to the first node in the processing flow; receive a first calculation result, which is the calculation result, whose data structure is Dataset, obtained by the executor 5 processing the first data set with the first component 1; and send the second component 2, the third component 3 and the first calculation result to the executor 5 according to the second node and the third node in the processing flow.
When the data processing is performed locally, the data processing platform may further include an executor 5 configured to: process the first data set by using the first component 1 to obtain a first calculation result whose data structure is Dataset; send the first calculation result stored in the first memory to the interaction unit 4; receive the second component 2, the third component 3 and the first calculation result sent by the interaction unit 4; process the first calculation result by using the second component 2 to obtain a second calculation result whose data structure is Dataset; store the second calculation result in a second memory; and process the second calculation result stored in the second memory by using the third component 3 to obtain a third calculation result whose data structure is Dataset. Here, the second memory is the memory space corresponding to the programming language of the second and third algorithms, and it is completely isolated from the memory space corresponding to the programming language of the first algorithm (which may be referred to as the first memory).
Unlike Hadoop, Spark itself does not provide a distributed file system for storing data, so Spark tasks mostly depend on Hadoop's Distributed File System (HDFS). On a typical Spark executor, intermediate data is kept in memory while an algorithm executes, which makes reading and writing convenient, speeds up computation, and reduces the overhead of data reads. In this data processing platform, however, algorithms written in two different languages are connected in series, and at runtime they are stored in different, completely isolated memory spaces, so parameters cannot be passed through memory. For example, if the first algorithm is written in JAVA and the second algorithm in Python, then when the first component computes with the first data set as input, the intermediate data of the whole execution is stored in the JVM heap memory; when the second component computes with the first calculation result, the intermediate data of the whole execution is stored in another memory dedicated to Python. Therefore, after the executor 5 processes the first data set with the first component 1 to obtain the first calculation result, the first calculation result must be read out of the JVM heap memory and sent to the interaction unit 4, which first writes the data into HDFS or to disk. When the second component 2 is needed, the interaction unit 4 reads the first calculation result from HDFS/disk and writes it into the memory dedicated to Python, for use by the executor 5 when executing the second component 2.
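The cross-language hand-off described above can be sketched as follows. In the real platform the first calculation result travels from JVM heap memory through HDFS (or local disk) into Python's own memory space; here a temporary file plays the role of the shared storage, since the two runtimes cannot share memory. The function names are illustrative, not the platform's API.

```python
import json
import os
import tempfile

def jvm_side_write(first_result, path):
    # Stands in for: the executor reads the result out of the JVM heap
    # and the interaction unit persists it to HDFS/disk.
    with open(path, "w") as f:
        json.dump(first_result, f)

def python_side_read(path):
    # Stands in for: the interaction unit reads the result back and the
    # executor loads it into Python's dedicated memory space.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "first_result.json")
jvm_side_write([{"id": 1, "tokens": ["hello", "world"]}], path)
print(python_side_read(path))  # the second component now computes on this
```

The file round trip is exactly the extra cost that chaining same-language components (as with the second and third components below) avoids.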
In this embodiment, because the programming language of the second algorithm is the same as that of the third algorithm, the intermediate data produced while executing the second component 2 and the third component 3 is stored in the same memory space. Since the processing flow shows that the third component 3 is executed after the second component 2, the parameters between the two components can be passed directly within that memory, without being written to disk or HDFS. In this way, no data touches disk during the whole process of the executor 5 executing the second component 2 and the third component 3, achieving efficient execution: the read-write round trips to disk/HDFS are reduced, the extra performance cost of accessing a file system (such as the IO overhead of data reads) is cut, and computation is sped up.
It should be noted that, in the embodiments of this application, the data structure DataFrame may be regarded as a special case of Dataset, because Dataset and DataFrame have identical member functions; the difference is the type of each row entity. DataFrame may also be written as Dataset[Row], where the type of each row is Row. In a Dataset, by contrast, the type of each row is not fixed and may be defined by the developer; for example, the entity type in Table 2 may be defined as "People".
It should be noted that, in China, skilled persons do not use unified Chinese translations for "Job", "Stage", "Dataset" and "DataFrame" in Spark, but generally use the original English terms. To avoid translation ambiguity, this embodiment also uses the original English terms, which skilled persons will understand.
Different dependency files are required when computing with algorithms written in different programming languages. Therefore, optionally, referring to fig. 5, any of the interaction units 4 described above is further configured to: acquire a first dependency file 71 corresponding to the first component 1 according to the first node in the processing flow; and send said first dependency file 71 to the executor 5 together with said first component 1 and said first data set;
and/or,
optionally, any of the aforementioned interaction units 4 is further configured to: acquire a second dependency file 72 corresponding to the second component 2 according to the second node in the processing flow; and send said second dependency file 72 to the executor 5 together with said second component 2 and said first calculation result.
The dependency files here mainly include two types: the component's entry file and third-party dependency packages. The entry file is used to assemble the component's parameters, dynamically import the dependency packages, and submit them to a remote or local Spark cluster for distributed computing. For example, in this embodiment each programming language has a corresponding entry file: the entry file for algorithms written in the R language is "dinfomain.R", and the entry file for algorithms written in Python is "dinfomain.py". A third-party dependency package is a module provided by a third party that executing a certain component depends on, for example the Python-version langdetect language-detection package. The dependency files are generally stored at preset locations in HDFS; when they need to be submitted, the entry file and the third-party dependency packages are packaged into the dependency file corresponding to the component and submitted together with the component.
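A hedged sketch of how such a dependency bundle might be assembled: the language-specific entry file plus any third-party packages are grouped into one artifact and submitted with the component. The entry-file names and the structure below are illustrative assumptions, not the platform's actual artifacts.

```python
# Assumed per-language entry-file names (the document's names are garbled
# in translation; "dinfomain.*" is a normalized guess).
ENTRY_FILES = {"python": "dinfomain.py", "r": "dinfomain.R"}

def build_dependency_bundle(language, third_party_packages):
    """Bundle the entry file and third-party packages for one component."""
    entry = ENTRY_FILES[language.lower()]
    return {
        "entry_file": entry,                    # assembles params, imports deps
        "third_party": list(third_party_packages),
        "submit_with_component": True,          # shipped alongside the component
    }

bundle = build_dependency_bundle("python", ["langdetect"])
print(bundle["entry_file"], bundle["third_party"])
```

In practice this corresponds to submitting a component together with its `--py-files`/jar-style dependencies to the cluster.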
For example, assume the second algorithm is a corpus analysis algorithm written in Python, and the corresponding second dependency file includes the entry file "dinfomain.py" and the langdetect language-detection package. After the executor receives the second component and the second dependency file, it first runs the entry file "dinfomain.py" in the second dependency file; through the entry file it finds the second component to be executed and, at the same time, dynamically imports the langdetect language-detection package at runtime. Language detection is thus performed before corpus analysis, to determine whether the corpus to be processed is Chinese or English. The second algorithm suitable for the language of the corpus is then called from the second component according to the detection result.
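The entry-file flow just described can be sketched as follows: the entry point locates the component to run, and a language check runs before corpus analysis so that the algorithm matching the corpus language is called. A trivial character test stands in for the real langdetect package here, and all names are illustrative.

```python
def detect_language(corpus):
    # Stand-in for langdetect: CJK codepoints -> Chinese, otherwise English.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in corpus) else "en"

# Per-language corpus-analysis algorithms inside the second component
# (placeholders for the real machine learning algorithms).
SECOND_COMPONENT = {
    "zh": lambda corpus: f"zh-analysis({len(corpus)} chars)",
    "en": lambda corpus: f"en-analysis({len(corpus)} chars)",
}

def entry_main(corpus):
    # Plays the role of the "dinfomain.py" entry file: detect the language,
    # then dispatch to the suitable algorithm in the second component.
    return SECOND_COMPONENT[detect_language(corpus)](corpus)

print(entry_main("credit card"))  # -> en-analysis(11 chars)
```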
The foregoing only describes how the interaction unit acquires and submits the dependency files corresponding to the first component and the second component; if the processing flow includes other components, the corresponding dependency files are acquired and submitted similarly.
Optionally, referring to fig. 5, the data processing platform further includes a component publishing unit 9 configured to: acquire a fourth dependency file corresponding to a fourth component, wherein the fourth component includes a fourth algorithm encapsulated with a standard interface, the fourth algorithm is a machine learning algorithm, and the standard interface is further used to enable the fourth algorithm to compute with data whose data structure is Dataset and to output a calculation result whose data structure is Dataset; perform conflict detection between the fourth dependency file and the dependency files of published components; output conflict information if a conflict occurs; acquire a merge instruction input by the user according to the conflict information; and update the dependency file that conflicts with the fourth dependency file to the fourth dependency file according to the merge instruction.
Through the component publishing unit 9, a component developer can publish a new component to the data processing platform for users to call flexibly. When publishing a component, the component publishing unit 9 can also automatically detect whether the dependency file corresponding to the fourth component to be published conflicts with the dependency files corresponding to the components already published on the data processing platform. For example, suppose the dependency files of the published components include java jar2.0, SCALA jar1.0 and R script 1.0, and the fourth dependency file is java jar3.0. Conflict detection finds that the fourth dependency file conflicts with a dependency file of a published component, so conflict information is output, such as "conflicts with java jar2.0!". After seeing the conflict information, the user can input a merge instruction, which may include information on which dependency file of the published components is to be updated, such as java jar2.0. According to the merge instruction, the dependency file java jar2.0 on the data processing platform is updated to java jar3.0. As another example, suppose the dependency files of the published components include java jar2.0, SCALA jar1.0 and R script 1.0, and the fourth dependency file is python2.0. Conflict detection finds that the fourth dependency file does not conflict with any dependency file of the published components, so the fourth dependency file is published to the data processing platform directly.
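A minimal sketch of the conflict check described above: a new dependency file conflicts with a published one when they share a name but differ in version, and on a user-confirmed merge instruction the old version is replaced. The version-string format ("java jar2.0" etc.) follows the example; the parsing logic is an illustrative assumption.

```python
def parse(dep):
    """Split "java jar2.0" into ("java jar", "2.0")."""
    name = dep.rstrip("0123456789.")
    return name, dep[len(name):]

def detect_conflict(published, new_dep):
    """Return the published dependency that conflicts with new_dep, if any."""
    new_name, _ = parse(new_dep)
    for dep in published:
        if parse(dep)[0] == new_name and dep != new_dep:
            return dep
    return None

published = ["java jar2.0", "SCALA jar1.0", "R script 1.0"]
clash = detect_conflict(published, "java jar3.0")
print(clash)                                    # -> java jar2.0
if clash:                                       # after the user's merge instruction
    published[published.index(clash)] = "java jar3.0"
print(published)
```

With `detect_conflict(published, "python2.0")` no published name matches, so publishing proceeds without a merge, as in the second example above.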
Here, after the dependency files are merged, the replaced dependency file may be stored in an independent area on the data processing platform and retained, and its correspondence with the fourth dependency file that replaced it is recorded. For example, in the foregoing example, when java jar2.0 is updated to java jar3.0, java jar2.0 is stored in another area, the update relationship between java jar3.0 and java jar2.0 is recorded, and other information such as the time of the update may also be recorded. If java jar2.0 itself was merged with the previously published java jar1.0 because of a conflict when it was published, the correspondence java jar1.0 - java jar2.0 - java jar3.0 can be established, thereby recording the evolution across versions of the dependency file. On one hand, this makes it easy to restore a specified historical version promptly when a new dependency file has a problem, improving the stability of the data processing platform; on the other hand, it preserves the logical consistency and relative independence of the different dependency file versions, so that creating or removing one version does not affect the contents of the others.
In a second embodiment of the present application, a data processing method is provided to describe the process by which an executor processes data. Specifically, please refer to fig. 6; the method includes the following steps S100 to S200.
S100: acquiring input data of a first algorithm from a first data set with a data structure as Dataset, and calculating by using the first algorithm to obtain a first calculation result with the data structure as Dataset;
S200: acquiring input data of a second algorithm from the first calculation result, and calculating by using the second algorithm to obtain a second calculation result with a data structure as Dataset;
wherein the first algorithm and the second algorithm are machine learning algorithms written in different programming languages.
In step S100, the data in the first data set may be text, speech, pictures, or the like; for example, it may be text converted from bank customer-service call speech, text extracted from financial news on the internet, or data obtained by further processing such text. This application does not limit the source of the data in the first data set, as long as it is stored in a data structure that meets the Dataset requirement.
First input data is obtained from the first data set, whose data structure is Dataset, to serve as the input data of the first algorithm. For example, Table 2 shows part of a first data set; some or all of its information can be taken as the first input data, and exactly which contents are taken may vary with the actual first algorithm.
The first algorithm may be selected from existing machine learning algorithms, and the first calculation result is computed with the first input data as the input to the first algorithm. Although the first calculation result is still output in Dataset form, because of the computation performed by the first algorithm, the output data may differ from the input data in the parameters and attributes of the entities it contains, the entity types, the elements each entity contains, the element types, and so on.
To explain briefly with the example of Table 2: the first input data obtained from the first data set includes the three elements "id", "name" and "age" and their corresponding values. If the first algorithm computes the average age over all ids and classifies the ids into those above the average and those at or below it, then the output data after computation (i.e., the first calculation result) may contain two entities: the first containing all usernames above the average, and the second containing all usernames at or below the average. That is, the final calculation result is still stored in a Dataset-form data structure in which each entity contains element names and corresponding values. Note that this first algorithm is only an example to illustrate the first input data and the first algorithm's calculation result; in fact, the first algorithm in this application is a machine learning algorithm whose actual computation is far more complex.
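The Table 2 example can be sketched in code: compute the average age over all ids, then split the usernames into those above the average and those at or below it, keeping a Dataset-like shape of named entities. The rows below are illustrative sample data, not Table 2 itself.

```python
def first_algorithm(first_dataset):
    """Toy stand-in for the first algorithm in the Table 2 example."""
    avg = sum(r["age"] for r in first_dataset) / len(first_dataset)
    return {
        "above_average": [r["name"] for r in first_dataset if r["age"] > avg],
        "at_or_below":   [r["name"] for r in first_dataset if r["age"] <= avg],
    }

rows = [{"id": 1, "name": "Ann", "age": 30},
        {"id": 2, "name": "Bob", "age": 40},
        {"id": 3, "name": "Eve", "age": 20}]
print(first_algorithm(rows))  # -> {'above_average': ['Bob'], 'at_or_below': ['Ann', 'Eve']}
```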
In step S200, part or all of the information is acquired from the first calculation result as the input data of the second algorithm, i.e. the second input data. Exactly which contents are acquired may vary with the actual second algorithm. The second algorithm then computes with the second input data as input, and finally outputs a calculation result in Dataset form, i.e. the second calculation result.
In this method, although the first algorithm and the second algorithm are machine learning algorithms written in different programming languages, once they are encapsulated into the first component and the second component, each can obtain its input from data whose data structure is Dataset and output its calculation result in Dataset form. They can therefore be connected directly in series into a complete method for processing massive data, realizing cross-language invocation of algorithms.
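The composability that the standard interface provides can be sketched as follows: each component, whatever language its algorithm is written in, takes rows in Dataset form and returns rows in Dataset form, so any two components can be chained. "BaseCom" is the interface name used in this application's example; the methods and toy algorithms below are illustrative assumptions.

```python
class BaseCom:
    """Assumed shape of the standard interface: Dataset in -> Dataset out."""
    def run(self, dataset):
        raise NotImplementedError

class TokenizeCom(BaseCom):      # stands in for the first (e.g. JAVA) algorithm
    def run(self, dataset):
        return [{**row, "tokens": row["text"].split()} for row in dataset]

class CountCom(BaseCom):         # stands in for the second (e.g. Python) algorithm
    def run(self, dataset):
        return [{**row, "n_tokens": len(row["tokens"])} for row in dataset]

first = TokenizeCom().run([{"text": "credit card payment"}])
second = CountCom().run(first)   # series connection works because both use Dataset form
print(second[0]["n_tokens"])     # -> 3
```

Because both `run` methods speak the same Dataset-shaped contract, the caller never needs to know which language implements which algorithm.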
As described in the first embodiment, because the first algorithm and the second algorithm are written in different programming languages, the intermediate data and results produced while the algorithms run must be stored in two completely isolated memory spaces. Therefore, the intermediate data and calculation results generated in steps S100 and S200 above are stored in different memories: the first calculation result obtained in S100 is stored in the first memory, and the second calculation result of S200 in the second memory. When the executor needs to compute with the first calculation result as the input data of the second algorithm, HDFS/disk must be written and read in between. Specifically, steps S300 and S400 are performed between steps S100 and S200:
S300: writing the first calculation result stored in the first memory to HDFS/disk;
S400: reading the first calculation result stored in HDFS/disk into the second memory.
The first memory is the memory space corresponding to the programming language of the first algorithm; the second memory is the memory space corresponding to the programming language of the second algorithm; the two are completely isolated. In one implementation, the executor may directly store the first calculation result from the first memory to disk or HDFS, and when the second component needs it, directly read it from disk or HDFS into the second memory. In another implementation, the executor sends the first calculation result in the first memory to the interaction unit, which stores it to disk or HDFS; when the second component needs it, the interaction unit reads the first calculation result from disk or HDFS and sends it to the executor, which stores it in the second memory.
Optionally, after the step of S200, the method may further include:
S500: acquiring input data of a third algorithm from the second calculation result in the second memory, and computing with the third algorithm to obtain a third calculation result whose data structure is Dataset, wherein the third algorithm is a machine learning algorithm written in the same programming language as the second algorithm.
Here, the programming languages of the second and third algorithms are the same, so they correspond to the same memory space, the second memory, and the intermediate data and the second calculation result produced while executing the second and third components are both stored in the second memory. Meanwhile, the processing flow shows that the third algorithm is executed after the second algorithm, so the second calculation result does not need to be stored to disk or HDFS and read back; it can be passed directly within the second memory.
In this way, no data touches disk during the whole process of the executor running the second and third algorithms, achieving efficient execution: the read-write round trips to disk/HDFS are further reduced, the IO overhead of data reads is cut, and computation is sped up.
It should be noted that the data processing method may further include other machine learning algorithms, which may be connected in series before the first algorithm, or after the second or third algorithm; this application does not limit this. For reading and writing calculation results between two machine learning algorithms written in the same language, refer to step S500; between two machine learning algorithms written in different languages, refer to steps S300 and S400. Details are not repeated here.
The following describes the data processing method with a specific application scenario.
The text converted from a certain bank's customer-service call recordings over the past two years exceeds one hundred GB and grows at a rate of 6-7 GB per month. This data contains valuable latent information that can significantly help or guide work such as intelligent service optimization, customer-service interaction, intelligent dispatching, potential customer acquisition, and risk management. The method described above can be used to process these texts.
The data processing platform includes a first component, a second component and a third component. The first algorithm in the first component is a machine learning algorithm for word segmentation of text, encapsulated with the standard interface BaseCom. The second algorithm in the second component is a machine learning algorithm for classifying texts according to the word segmentation results, such as the Single-Pass algorithm (also called the single-channel or single-pass method), likewise encapsulated with the standard interface BaseCom. The third algorithm in the third component is a machine learning algorithm that parses the classified text by category, such as the Earley parsing algorithm, also encapsulated with the standard interface BaseCom. The first algorithm is written in JAVA, and the second and third algorithms are written in Python. The text to be processed is stored in a Dataset data structure, i.e. the first data set.
The texts in the first data set are first segmented by the first algorithm, and the corresponding word segmentation results are stored in a Dataset data structure to obtain the first calculation result. The second algorithm then uses the word segmentation results to classify the texts in the first data set into three categories, such as complaint, consultation and praise, each category comprising a set. That is, the second calculation result includes three sets, one per category, each containing the texts classified into that category. Finally, the third algorithm mines product-oriented topics from each category, such as debit card transactions, debit card loss reports, credit card repayment, and credit card consumption offers.
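The three-component flow in this scenario can be sketched as a toy pipeline: segment, classify into category sets, then (not shown) mine per-category topics. The real components wrap machine learning algorithms (e.g. Single-Pass clustering, Earley parsing); trivial keyword rules stand in for them here, and all data is illustrative.

```python
def segment(texts):
    """Stands in for the first component (word segmentation, JAVA algorithm)."""
    return [t.split() for t in texts]

def classify(texts, tokens):
    """Stands in for the second component: group texts into category sets."""
    cats = {"complaint": [], "consultation": [], "other": []}
    for text, toks in zip(texts, tokens):
        if "complain" in toks:
            cats["complaint"].append(text)
        elif "how" in toks:
            cats["consultation"].append(text)
        else:
            cats["other"].append(text)
    return cats

texts = ["I complain about my credit card payment",
         "how to report a lost debit card"]
cats = classify(texts, segment(texts))
print(len(cats["complaint"]), len(cats["consultation"]))  # -> 1 1
```

A third stage would then walk each category set and extract topics such as "credit card payment" from its texts.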
The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims (6)

1. A data processing system, comprising:
a first component comprising a first algorithm encapsulated with a standard interface;
a second component comprising a second algorithm encapsulated with a standard interface;
the standard interface is used for enabling the first algorithm and the second algorithm to respectively calculate by using data with a data structure of Dataset and outputting a calculation result with the data structure of Dataset, so that the first component and the second component can be connected in series; the first algorithm and the second algorithm are machine learning algorithms written in different programming languages;
the data processing system further comprises a flow customization unit for:
connecting different nodes according to requirements to formulate a data processing flow, wherein the different nodes represent different components;
setting constraint conditions of input parameters and output parameters of the components;
the data processing system further comprises:
a third component comprising a third algorithm encapsulated with a standard interface; the third algorithm is a machine learning algorithm written by the same programming language as the second algorithm; the standard interface is also used for enabling the third algorithm to calculate by using data with a data structure of Dataset and outputting a calculation result with the data structure of Dataset;
an interaction unit to: sending the first component and the first data set to an executor according to a first node in a processing flow; receiving a first calculation result, wherein the first calculation result is the calculation result, whose data structure is Dataset, obtained by the executor processing the first data set with the first component; sending the second component, the third component and the first calculation result to the executor according to a second node and a third node in the processing flow; wherein, in the processing flow, the second node is connected after the first node, the third node is connected after the second node, and the first node corresponds to the first component, the second node corresponds to the second component, and the third node corresponds to the third component;
an executor for: processing the first data set by using the first component to obtain a first calculation result whose data structure is Dataset; sending the first calculation result stored in a first memory to the interaction unit; receiving the second component, the third component and the first calculation result sent by the interaction unit; processing the first calculation result by using the second component to obtain a second calculation result whose data structure is Dataset; storing the second calculation result in a second memory; and processing the second calculation result in the second memory by using the third component to obtain a third calculation result whose data structure is Dataset.
2. The data processing system of claim 1, further comprising:
an interaction unit to: sending the first component and the first data set to an executor according to a first node in the processing flow; receiving a first calculation result, wherein the first calculation result is the calculation result, whose data structure is Dataset, obtained by the executor processing the first data set with the first component; sending the second component and the first calculation result to the executor according to a second node in the processing flow; wherein the second node in the processing flow is connected after the first node, the first node corresponds to the first component, and the second node corresponds to the second component.
3. The data processing system of claim 2, further comprising:
an executor for: processing the first data set by using the first component to obtain a first calculation result whose data structure is Dataset; sending the first calculation result to the interaction unit; receiving the second component and the first calculation result sent by the interaction unit; and processing the first calculation result by using the second component to obtain a second calculation result whose data structure is Dataset.
4. The data processing system of claim 2, wherein the interaction unit is further configured to: acquiring a first dependent file corresponding to the first component according to a first node in a processing flow; and sending the first dependency file, the first component, and the first data set to an executor;
and/or,
the interaction unit is further configured to: acquiring a second dependent file corresponding to the second component according to a second node in the processing flow; and sending the second dependent file, the second component and the first calculation result to an executor.
5. The data processing system of claim 4, further comprising:
a component issuing unit configured to: acquiring a fourth dependent file corresponding to the fourth component; performing conflict detection on the fourth dependent file and the dependent file of the published component; outputting conflict information in case of a conflict; acquiring a merging instruction input by a user according to the conflict information; updating the dependent file which conflicts with the fourth dependent file into the fourth dependent file according to the merging instruction; the fourth component comprises a fourth algorithm packaged by a standard interface, the fourth algorithm is a machine learning algorithm, and the standard interface is further used for enabling the fourth algorithm to calculate by using data with a data structure of Dataset and outputting a calculation result with the data structure of Dataset.
6. A data processing method implemented using the data processing system of claim 1, the data processing method comprising:
acquiring input data of a first algorithm from a first data set with a data structure as Dataset, and calculating by using the first algorithm to obtain a first calculation result with the data structure as Dataset;
acquiring input data of a second algorithm from the first calculation result according to a data processing flow formulated by the flow customizing unit and set constraint conditions, and calculating by using the second algorithm to obtain a second calculation result with a data structure as a Dataset;
wherein the first algorithm and the second algorithm are machine learning algorithms written in different programming languages;
after the step of calculating the first calculation result with the data structure as Dataset by using the first algorithm, the method further includes:
writing the first calculation result stored in the first memory into the HDFS/disk;
reading the first calculation result stored in the HDFS/disk into a second memory;
after the step of calculating a second calculation result with a data structure as Dataset by using a second algorithm, the method further comprises the following steps:
and acquiring input data of a third algorithm from the second calculation result in the second memory, and calculating by using the third algorithm to obtain a third calculation result with a data structure of Dataset, wherein the third algorithm is a machine learning algorithm written in the same programming language as the second algorithm.
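The method's handoff can be sketched language-agnostically: the first algorithm's Dataset-shaped result is persisted from the first memory to HDFS/disk, read back into the second memory, and then consumed in-memory by the second and third algorithms, which share a runtime. Plain Python stands in for both runtimes here, JSON on local disk stands in for a Dataset on HDFS, and all function names are illustrative.

```python
# Illustrative sketch of the cross-language handoff in the claimed method.
import json
import os
import tempfile

def first_algorithm(dataset):
    # stand-in for the first (e.g. differently-written) algorithm
    return [{"id": r["id"], "feature": r["value"] * 2} for r in dataset]

def second_algorithm(dataset):
    return [{"id": r["id"], "score": r["feature"] + 1} for r in dataset]

def third_algorithm(dataset):
    # same runtime as the second algorithm: consumes its result directly
    return sum(r["score"] for r in dataset)

first_result = first_algorithm([{"id": 1, "value": 3}, {"id": 2, "value": 4}])

# first memory -> HDFS/disk: persist the first calculation result
path = os.path.join(tempfile.mkdtemp(), "first_result.json")
with open(path, "w") as f:
    json.dump(first_result, f)

# HDFS/disk -> second memory: the second runtime reads the result back
with open(path) as f:
    second_input = json.load(f)

second_result = second_algorithm(second_input)
# no disk round-trip: second and third algorithms share a language/runtime
third_result = third_algorithm(second_result)
```

The design point the claims encode is that the disk round-trip is only needed at the language boundary; once the second and third algorithms share a runtime, the second calculation result is read straight from the second memory.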
CN201811117521.XA 2018-09-20 2018-09-20 Data processing platform and data processing method Active CN109343833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811117521.XA CN109343833B (en) 2018-09-20 2018-09-20 Data processing platform and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811117521.XA CN109343833B (en) 2018-09-20 2018-09-20 Data processing platform and data processing method

Publications (2)

Publication Number Publication Date
CN109343833A CN109343833A (en) 2019-02-15
CN109343833B true CN109343833B (en) 2022-12-16

Family

ID=65306293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811117521.XA Active CN109343833B (en) 2018-09-20 2018-09-20 Data processing platform and data processing method

Country Status (1)

Country Link
CN (1) CN109343833B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111198679B (en) * 2019-12-18 2024-04-05 北京航天测控技术有限公司 Fusion method and device of multi-language health management model
CN111291053B (en) * 2020-02-12 2020-08-25 广州天辰信息科技有限公司 Workflow data processing method and device and workflow data processing platform
CN113761548B (en) * 2020-06-12 2024-03-08 北京沃东天骏信息技术有限公司 Data transmission method and device for Shuffle process
CN114528263A (en) * 2022-02-11 2022-05-24 上海森亿医疗科技有限公司 Data processing method, system, medium and device based on different computer languages

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779088A (en) * 2016-12-06 2017-05-31 北京物思创想科技有限公司 Perform the method and system of machine learning flow
CN107169575A (en) * 2017-06-27 2017-09-15 北京天机数测数据科技有限公司 A kind of modeling and method for visualizing machine learning training pattern
CN107562421A (en) * 2017-09-28 2018-01-09 北京神州泰岳软件股份有限公司 A kind of natural language processing method and processing platform
CN107612886A (en) * 2017-08-15 2018-01-19 中国科学院大学 A kind of Spark platforms Shuffle process compresses algorithm decision-making techniques
EP3376361A2 (en) * 2017-10-19 2018-09-19 Pure Storage, Inc. Ensuring reproducibility in an artificial intelligence infrastructure

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256486A (en) * 2008-03-11 2008-09-03 浪潮集团山东通用软件有限公司 Method for self-defining service system facing to component
WO2011159759A1 (en) * 2010-06-15 2011-12-22 Ab Initio Technology Llc Dynamically loading graph-based computations
CN103297485B (en) * 2012-03-05 2016-02-24 日电(中国)有限公司 Distributed caching automated management system and distributed caching automatic management method
US10496514B2 (en) * 2014-11-20 2019-12-03 Kevin D. Howard System and method for parallel processing prediction
CN107092962B (en) * 2016-02-17 2021-01-26 创新先进技术有限公司 Distributed machine learning method and platform


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CryptoML: Secure outsourcing of big data machine learning applications; Azalia Mirhoseini; 2016 IEEE International Symposium on Hardware Oriented Security and Trust (HOST); 2016-06-23; 149-154 *
A survey of typical machine learning platforms for big data; Jiao Jiafeng et al.; Journal of Computer Applications; 2017-11-10; 3039-3047 *

Also Published As

Publication number Publication date
CN109343833A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN109343833B (en) Data processing platform and data processing method
US11132359B2 (en) Systems and methods for grouping and enriching data items accessed from one or more databases for presentation in a user interface
US11816100B2 (en) Dynamically materialized views for sheets based data
US11755606B2 (en) Dynamically updated data sheets using row links
US8438190B2 (en) Generating web services from business intelligence queries
CA3033859C (en) Method and system for automatically extracting relevant tax terms from forms and instructions
US8103684B2 (en) Abstract pipeline component connection
US7861222B2 (en) Discoscript: a simplified distributed computing scripting language
EP2960789A2 (en) Unified mapreduce framework for large-scale data processing
US20110173203A1 (en) Providing web services from business intelligence queries
US10740550B1 (en) Network-accessible data management service with web-of-sheets data model
US6513072B1 (en) Method for associating data bearing objects with user interface objects
US20130031139A1 (en) Multi-input, multi-output-per-input user-defined-function-based database operations
CN115516443A (en) Generating optimization logic from architecture
US6820268B2 (en) Method for associating data bearing objects with user interface objects
US7213014B2 (en) Apparatus and method for using a predefined database operation as a data source for a different database operation
US20060129985A1 (en) Development and execution platform
EP2960790A2 (en) Datastore mechanism for managing out-of-memory data
US20110296373A1 (en) Command line shell command generation based on schema
US11182137B2 (en) Techniques for approximate neighbor selection in tree models for computing co-occurrence
US20050144087A1 (en) Disparate sales system integration and method
US9507567B2 (en) Facilitating use of techniques implemented by external library functions in process definitions of analytical applications
US20090150907A1 (en) Mapping between disparate data models via anonymous functions
Settle et al. aMatReader: Importing adjacency matrices via Cytoscape Automation
US8250090B2 (en) Method for performing message-based distributed computing, involves executing expression of matched message response on members identified by set instance definition using dimension values derived from message

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190905

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

Address before: Room 601, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: ULTRAPOWER SOFTWARE Co.,Ltd.

CB02 Change of applicant information

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co.,Ltd.

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant