CN112130813A

CN112130813A - Multi-user collaborative modeling method, system and equipment for big data analysis

Info

Publication number: CN112130813A
Application number: CN202010772721.XA
Authority: CN
Inventors: 陈欣洁; 李建广; 余智华; 袁宝东; 冯凯
Original assignee: Golaxy Data Technology Co ltd
Current assignee: Golaxy Data Technology Co ltd
Priority date: 2020-08-04
Filing date: 2020-08-04
Publication date: 2020-12-25

Abstract

The invention provides a multi-user collaborative modeling method, system and device for big data analysis. The method comprises the following steps: acquiring data information, and constructing a current task analysis flow according to the data information, wherein the current task analysis flow comprises common nodes and super nodes, and each super node supports parameter input and output; receiving a first operation instruction, determining cooperative users, and sharing and assigning the current task analysis flow to the cooperative users; receiving second operation instructions of the cooperative users on the current task analysis flow to generate a final task analysis flow; and receiving a third operation instruction, and running the final task analysis flow. The super node solves the technical problem that the links in data analysis application construction are interdependent and cannot be processed in parallel, so that each role can make full use of its strengths and construction efficiency is improved.

Description

Multi-user collaborative modeling method, system and equipment for big data analysis

Technical field:

The invention relates to the technical field of data analysis, and in particular to a multi-user collaborative modeling method, system and device for big data analysis.

Background art:

With the rapid development and adoption of big data technology, the scale of business data is growing exponentially, and its large volume, complex structure and diverse sources are increasingly evident. Enterprise analysts face many problems and challenges when mining the value of big data assets with analysis means such as big data technology and artificial intelligence algorithms. For example, business scenarios are numerous and complex, business data is high-dimensional and huge in volume, and complex business data often involves feature dimensions on the order of hundreds of millions. Screening and integrating features from such high-dimensional data consumes a large amount of time and labor, and the analysis process is long and its links are complex. A complex analysis task therefore cannot be completed efficiently by one person alone, and completing such tasks through multi-person cooperation is an inevitable trend in the field of big data analysis.

In the traditional collaborative modeling process for big data analysis, several roles must complete the work jointly. Throughout the construction of an analysis application, business personnel take the lead and data analysts and coding personnel cooperate with them; the links depend on one another and cannot be processed in parallel. The strengths of the various roles cannot be brought into full play, so construction of the analysis application is time-consuming and costly. This directly lowers the application development efficiency of enterprises and affects their production cost and operating efficiency.

The invention is provided in view of the above.

Summary of the invention:

The invention provides a multi-user collaborative modeling method, system and device for big data analysis, which solve at least one of the above technical problems.

The invention provides a multi-user collaborative modeling method for big data analysis, which comprises the following steps:

acquiring data information, and constructing a current task analysis flow according to the data information, wherein the current task analysis flow comprises a common node and a super node, and the super node can carry out parameter input and output;

receiving a first operation instruction, determining cooperative users, and sharing and assigning the current task analysis flow to the cooperative users;

receiving a second operation instruction of the cooperative user to the task analysis flow to generate a final task analysis flow;

and receiving a third operation instruction, and running a final task analysis flow.

By adopting the scheme, the data information can be organized as a data flow graph. The super node is an independent, draggable component in the data flow graph, and the component has no execution module of its own. When the component is created, only its component type needs to be designated as super node; no execution command or execution parameters need to be defined, and no execution program package needs to be uploaded.
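The distinction between common nodes and super nodes can be illustrated with a minimal sketch. The class names `Node` and `FlowGraph`, the field names and the example command `load.sh` are hypothetical and do not come from the patent; the sketch only shows that a super node carries a sub-flow instead of an execution command.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A component in the data flow graph (hypothetical model)."""
    name: str
    kind: str = "common"                     # "common" or "super"
    command: Optional[str] = None            # execution command; never set on a super node
    subgraph: Optional["FlowGraph"] = None   # sub-flow edited inside a super node


@dataclass
class FlowGraph:
    nodes: dict = field(default_factory=dict)   # node name -> Node
    edges: list = field(default_factory=list)   # (source name, target name)

    def add_node(self, node: Node) -> Node:
        if node.kind == "super":
            # A super node has no execution command or package, only a sub-flow.
            if node.command is not None:
                raise ValueError("a super node has no execution command")
            if node.subgraph is None:
                node.subgraph = FlowGraph()
        self.nodes[node.name] = node
        return node

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist in the graph")
        self.edges.append((src, dst))


# Main graph with one common node feeding one super node.
main = FlowGraph()
main.add_node(Node("load_data", command="load.sh"))
main.add_node(Node("data_preprocessing", kind="super"))
main.add_edge("load_data", "data_preprocessing")
```

In this sketch a cooperative user would later fill `main.nodes["data_preprocessing"].subgraph` with the nodes of the sub-flow.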

The second operation instruction is an edit made by a cooperative user to a super node in the current task analysis flow. Cooperative users with different roles each complete their own tasks, and the task analysis flow formed after all cooperative users have executed their second operation instructions is the final task analysis flow. The third operation instruction is sent by an administrator; after receiving it, the system submits and runs the final task analysis flow, and the running results and states of all nodes in the task, including all nodes inside the super nodes, can be viewed in real time while it runs.

Further, determining the cooperative users, and sharing and assigning the current task analysis flow to the cooperative users includes the steps of:

generating a user set containing all users;

loading the super node, and determining cooperative users matched with the super node from the user set;

and sharing and distributing the current task analysis flow to the cooperative users.

By adopting the scheme, the user set contains users of all roles. Each module is loaded while the super node is loaded, and for each module one or more suitable cooperative users can be selected from the initially defined sharing range to complete the corresponding module task.
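The selection of cooperative users for each super node can be sketched as follows. The patent does not state the matching criterion, so matching by role is an assumption made here for illustration, and the function name `match_collaborators`, the user names and the role labels are all hypothetical.

```python
from typing import Dict, List, Set


def match_collaborators(users: Dict[str, str],
                        required_roles: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each super node, pick the users whose role is required by that node.

    users maps a user name to a role; required_roles maps a super node name
    to the set of roles that may work on it (assumed matching rule).
    """
    return {
        node: sorted(name for name, role in users.items() if role in roles)
        for node, roles in required_roles.items()
    }


# Hypothetical user set covering all roles.
users = {"user_a": "data scientist", "user_b": "data analyst", "user_c": "coder"}
required_roles = {
    "feature_engineering": {"data scientist", "data analyst"},
    "data_preprocessing": {"coder"},
}
assignment = match_collaborators(users, required_roles)
```

A super node with no matching user receives an empty list, which a real system would presumably report to the task owner.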

Preferably, before the step of sharing and assigning the current task analysis flow to the cooperative users, the method further includes performing permission configuration on the cooperative users.

By adopting the scheme, the operation permissions of each cooperative user on the main graph and the subgraphs can be configured. Permission configuration makes the operating range of each cooperative user explicit and prevents unauthorized operations, whether intentional or accidental.

Preferably, the permission configuration includes at least one of a viewing permission, an editing permission, and an execution permission.

By adopting the scheme, the viewing permission only allows viewing the main graph, the subgraph flows, the logs and the running results, and does not allow running or modifying the flow; the editing permission allows modifying and editing the task flow, but does not allow running it; the execution permission is the highest operation permission and allows modifying, executing and debugging the task.
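The three permission levels can be sketched as a set of flags. The names `Permission`, `LEVELS` and `is_allowed` are hypothetical, and the assumption that the editing permission also includes viewing is an illustrative choice that the text does not state explicitly.

```python
from enum import Flag, auto


class Permission(Flag):
    VIEW = auto()   # view main graph, subgraph flows, logs and running results
    EDIT = auto()   # modify and edit the task flow
    RUN = auto()    # run and debug the task


# Hypothetical mapping from the three configured levels to allowed actions.
# Assumption: each level includes the actions of the levels below it.
LEVELS = {
    "view": Permission.VIEW,
    "edit": Permission.VIEW | Permission.EDIT,
    "execute": Permission.VIEW | Permission.EDIT | Permission.RUN,
}


def is_allowed(level: str, action: Permission) -> bool:
    """Return True if the configured level includes the requested action."""
    return bool(LEVELS[level] & action)
```

For example, `is_allowed("edit", Permission.RUN)` is false, which matches the statement that a user with the editing permission cannot run the flow.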

Further, receiving a second operation instruction of the collaborative user on the current task analysis flow, and generating a final task analysis flow includes the steps of:

when a second operation instruction of a first cooperative user is received, setting a shackle mechanism (that is, an exclusive edit lock) on the current task analysis flow;

when the first cooperative user completes the second operation instruction, releasing the shackle mechanism;

repeating the above steps until the Nth cooperative user completes a second operation instruction, wherein N belongs to {2, 3, 4, 5, …};

and generating a final task analysis flow.

By adopting the scheme, the shackle mechanism can prevent a plurality of users from editing or operating a certain super node or main graph at the same time, ensure that only one user edits or operates and executes the flow chart at the same time, and further prevent data inconsistency caused by multi-user operation.
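The behavior described for the shackle mechanism matches that of an exclusive edit lock, which can be sketched as below. The class name `EditLock` and its methods are hypothetical, and a real multi-user system would keep the lock state in shared storage rather than in process memory.

```python
import threading
from typing import Optional


class EditLock:
    """Exclusive edit lock for one main graph or super node (illustrative)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()      # protects _holder against races
        self._holder: Optional[str] = None  # user currently editing, if any

    def acquire(self, user: str) -> bool:
        """Try to take the lock; return False if another user holds it."""
        with self._guard:
            if self._holder is None:
                self._holder = user
                return True
            return self._holder == user

    def release(self, user: str) -> None:
        """Release the lock once the user has finished editing."""
        with self._guard:
            if self._holder != user:
                raise PermissionError(f"{user} does not hold the lock")
            self._holder = None
```

A second user who calls `acquire` while the first user is still editing receives `False` and has to wait until `release` is called.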

Preferably, when the (N-1)th cooperative user completes the second operation instruction, a message reminder is sent to the Nth cooperative user.

By adopting the scheme, the system automatically detects when the (N-1)th cooperative user has completed the second operation instruction and sends a message reminder to the Nth cooperative user, so that the Nth cooperative user can finish editing the current task analysis flow in the shortest time.
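The hand-over between consecutive cooperative users can be sketched as a simple loop. The function `run_editing_round` and its `edit` and `notify` callbacks are hypothetical placeholders for whatever editing session and messaging channel a real system uses.

```python
from typing import Callable, List


def run_editing_round(users: List[str],
                      edit: Callable[[str], None],
                      notify: Callable[[str, str], None]) -> None:
    """Let each user edit in turn and remind the next user when one finishes."""
    for i, user in enumerate(users):
        edit(user)                       # the user's second operation instruction
        if i + 1 < len(users):           # the last user has nobody to remind
            notify(users[i + 1], f"{user} has finished; it is your turn to edit")


# Record what happens instead of opening real editing sessions.
log = []
run_editing_round(
    ["scientist", "analyst", "coder"],
    edit=lambda user: log.append(("edit", user)),
    notify=lambda user, message: log.append(("notify", user)),
)
```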

The super node being able to carry out parameter input and output means that the super node internally comprises a temporary input node and a temporary output node.

By adopting the scheme, the input and the output of each super node can be well defined according to the data flow conversion of the processing link in the configuration process.

Further, receiving a third operation instruction, and running the final task analysis flow includes the steps of:

receiving a third operation instruction, unfolding the super nodes, and constructing a final task analysis flow big diagram;

uploading the final task analysis flow big diagram;

and running the final task analysis flow according to the final task analysis flow big diagram.

By adopting the scheme, a final task analysis flow chart already exists before the third operation instruction is received, but it still contains super nodes and temporary data input and output nodes, which makes later viewing and reference inconvenient. The following steps are therefore carried out to construct the final task analysis flow big diagram:

S41: traversing all edges in the final task analysis flow chart, and executing steps S42-S46 for each edge cur_conn;

S42: if the start node conn_src_node of the edge is a common node and the end node conn_desc_node is a super node, executing step S43; if the start node conn_src_node of the edge is a super node, executing step S44; otherwise, returning to step S41 and continuing to traverse the edges;

S43: according to the sequence number of the super node input associated with the edge, recursively searching the super node for the common node sub_dest_node connected to the data input temporary node, adding a new edge from conn_src_node to sub_dest_node, then returning to step S41 and continuing to traverse the edges;

S44: according to the sequence number of the super node output associated with the edge, recursively searching the super node for the common node sub_src_node connected to the data output temporary node; if the end node conn_desc_node of the edge is a common node, executing step S45, otherwise executing step S46;

S45: adding a new edge from sub_src_node to conn_desc_node, then returning to step S41 and continuing to traverse the edges;

S46: according to the sequence number of the super node input associated with the edge, recursively searching the super node for the common node sub_dest_node connected to the data input temporary node, adding a new edge from sub_src_node to sub_dest_node, then returning to step S41 and continuing to traverse the edges;

after the traversal is completed, the following steps are executed:

S47: traversing all nodes in the final task analysis flow chart, and deleting the super nodes, the data input temporary nodes and the data output temporary nodes;

S48: traversing all edges in the final task analysis flow chart, and deleting every edge whose start node or end node is a super node, a data input temporary node or a data output temporary node.
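Steps S41 to S48 can be sketched in Python as follows. This is an illustrative reading of the steps, not the patented implementation: the `Node` and `Edge` classes, the port fields, the helper names and the example graph are hypothetical, the chart is assumed to hold the edges of the main graph and of all subgraphs in one list, and the helpers return lists so that one super node input or output may connect to several common nodes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    kind: str                  # "common", "super", "di" (data input) or "do" (data output)
    di: Optional[str] = None   # for a super node: id of its data input temporary node
    do: Optional[str] = None   # for a super node: id of its data output temporary node


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    out_port: int = 0          # output sequence number on src
    in_port: int = 0           # input sequence number on dst


def input_targets(nodes, edges, super_id, port) -> List[str]:
    """Common nodes fed by input `port` of a super node (recursive search)."""
    found = []
    for e in edges:
        if e.src == nodes[super_id].di and e.out_port == port:
            if nodes[e.dst].kind == "common":
                found.append(e.dst)
            elif nodes[e.dst].kind == "super":      # nested super node
                found += input_targets(nodes, edges, e.dst, e.in_port)
    return found


def output_sources(nodes, edges, super_id, port) -> List[str]:
    """Common nodes that feed output `port` of a super node (recursive search)."""
    found = []
    for e in edges:
        if e.dst == nodes[super_id].do and e.in_port == port:
            if nodes[e.src].kind == "common":
                found.append(e.src)
            elif nodes[e.src].kind == "super":      # nested super node
                found += output_sources(nodes, edges, e.src, e.out_port)
    return found


def expand(nodes: Dict[str, Node], edges: List[Edge]):
    new_edges = []
    for e in edges:                                              # S41
        src_kind, dst_kind = nodes[e.src].kind, nodes[e.dst].kind
        if src_kind == "common" and dst_kind == "super":         # S42 -> S43
            targets = input_targets(nodes, edges, e.dst, e.in_port)
            new_edges += [Edge(e.src, t) for t in targets]
        elif src_kind == "super" and dst_kind in ("common", "super"):  # S42 -> S44
            sources = output_sources(nodes, edges, e.src, e.out_port)
            if dst_kind == "common":                             # S45
                targets = [e.dst]
            else:                                                # S46
                targets = input_targets(nodes, edges, e.dst, e.in_port)
            new_edges += [Edge(s, t) for s in sources for t in targets]
    kept = {k: v for k, v in nodes.items() if v.kind == "common"}        # S47
    flat = [e for e in edges + new_edges if e.src in kept and e.dst in kept]  # S48
    return kept, flat


# Example: load -> [prep: clean] -> [train: fit] -> report
nodes = {
    "load": Node("common"), "report": Node("common"),
    "prep": Node("super", di="prep.in", do="prep.out"),
    "prep.in": Node("di"), "clean": Node("common"), "prep.out": Node("do"),
    "train": Node("super", di="train.in", do="train.out"),
    "train.in": Node("di"), "fit": Node("common"), "train.out": Node("do"),
}
edges = [
    Edge("load", "prep"), Edge("prep.in", "clean"), Edge("clean", "prep.out"),
    Edge("prep", "train"), Edge("train.in", "fit"), Edge("fit", "train.out"),
    Edge("train", "report"),
]
flat_nodes, flat_edges = expand(nodes, edges)
```

In this example only the four common nodes remain and the expanded chart is the chain load, clean, fit, report. Edges that touch a temporary node are skipped during the traversal because they are resolved through the recursive helpers when the enclosing super node is unfolded.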

The invention has the following beneficial effects:

1. The super node solves the technical problem that the links in data analysis application construction are interdependent and cannot be processed in parallel, so that each role can make full use of its strengths and analysis applications are constructed efficiently.

2. The permission configuration solves the technical problem of unauthorized operations by cooperative users, whether intentional or accidental, and improves the accuracy of the analysis and construction result.

3. The shackle mechanism solves the technical problem of data inconsistency caused by multi-person operation, and improves both construction efficiency and the accuracy of the construction result.

Description of the drawings:

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of one embodiment of the present invention;

FIG. 2 is an exemplary diagram of an embodiment of the present invention;

FIG. 3 is a sub-flow diagram of the data preprocessing of FIG. 2;

FIG. 4 is a feature engineering sub-flow diagram of FIG. 2;

FIG. 5 is a sub-flow diagram of the model training in FIG. 2;

FIG. 6 is a sub-flow diagram of the model evaluation in FIG. 2;

FIG. 7 is a schematic diagram of a process for generating a final task analysis;

FIG. 8 is a flow chart of the final task analysis of FIGS. 2-7.

Detailed description:

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

The invention protects a multi-user collaborative modeling method for big data analysis.

Referring to fig. 1 and 2, in some embodiments of the invention, the following steps are included:

S1: acquiring data information, and constructing a current task analysis flow according to the data information, wherein the current task analysis flow comprises common nodes and super nodes, and each super node supports parameter input and output;

S2: receiving a first operation instruction, determining cooperative users, and sharing and assigning the current task analysis flow to the cooperative users;

S3: receiving second operation instructions of the cooperative users on the current task analysis flow to generate a final task analysis flow;

S4: receiving a third operation instruction, and running the final task analysis flow.

By adopting the scheme, the data information can be organized as a data flow graph. FIG. 2 is the main graph of a current task analysis flow built from spam SMS data and used to classify spam SMS messages. In this graph, data preprocessing, feature engineering, model training and model evaluation are super nodes; a super node is an independent, draggable component in the data flow graph and has no execution module of its own. When the component is created, only its component type needs to be designated as super node; no execution command or execution parameters need to be defined, and no execution program package needs to be uploaded. For convenience of explanation, data preprocessing, feature engineering, model training and model evaluation are taken as the main task flow, and FIGS. 3-6 show, in that order, the subtask flows inside the data preprocessing, feature engineering, model training and model evaluation super nodes, which are called subgraphs. For example, in FIG. 6 the nodes other than the data input node are common nodes; common nodes may also be placed in the main task flow. In practical applications, the users may include professional data scientists, data analysts, coding personnel and the like, but not every user necessarily takes part in a given task, so the required cooperative users need to be selected from among them. After the current task analysis flow has been shared with and assigned to the cooperative users, they can see the current task analysis flow and operate on it according to their assignments.

The scheme refines the granularity of the analysis process on the basis of the data flow graph. In the vertical dimension, the processing links of a complex analysis task are decomposed, and the input, output and dependency relationships of each processing link are defined. In the horizontal dimension, a sub-flow can be designed for each super node, and each super node task can be assigned to one person or to several people working together. Decomposing and assigning a complex analysis task along these two dimensions enables multiple users to build the analysis model collaboratively. Building a complex business analysis flow and assigning its tasks by nesting sub-flows inside a large flow reduces the analysis dimension of the complex business flow; at the same time, workers in different roles have a clear division of labor and a clear position within the analysis task, knowledge and experience are shared within the team through the cooperation and knowledge sharing mechanism, and the efficiency of application construction is improved.

Step S2, determining cooperative users, and sharing and assigning the current task analysis flow to the cooperative users, comprises the following steps:

S201: generating a user set containing all users;

S202: loading the super node, and determining cooperative users matched with the super node from the user set;

S203: sharing and assigning the current task analysis flow to the cooperative users.

By adopting the scheme, the user set contains users of all roles, including professional data scientists, data analysts, coding personnel and the like. The cooperative users are screened out of this set, either by adding and removing individual users or by setting up a user group. Each module is loaded while the super node is loaded, and for each module one or more suitable cooperative users can be selected from the initially defined sharing range to complete the corresponding module task.

Before step S203, in which the current task analysis flow is shared with and assigned to the cooperative users, permission configuration is performed on the cooperative users.

The permission configuration comprises at least one of viewing permission, editing permission and executing permission.

Step S3, receiving a second operation instruction of the cooperative user for the current task analysis flow, and generating a final task analysis flow includes the steps of:

and generating a final task analysis flow.

By adopting the scheme, the shackle mechanism can prevent a plurality of users from editing or operating a certain super node or main graph at the same time, ensure that only one user edits or operates and executes the flow chart at the same time, and further prevent data inconsistency caused by multi-user operation. The first cooperative user is a role which is operated firstly at a certain node according to an operation sequence, can be a professional data scientist or a data analyst and the like, and turns to other roles according to the operation sequence after the operation is finished until the Nth user finishes, so that a final task analysis flow is generated.

And when the (N-1) th cooperative user finishes the second operation instruction, performing message reminding on the Nth cooperative user.

By adopting the scheme, the input and the output of each super node can be well defined according to the data flow conversion of the processing link in the configuration process. The super node can dynamically configure input and output parameters. Assume that its input parameters are $SupInput = \{sup\_in_1, sup\_in_2, \dots, sup\_in_m\}$ and its output parameters are $SupOutput = \{sup\_out_1, sup\_out_2, \dots, sup\_out_n\}$, where m and n are the numbers of input and output parameters of the super node respectively. Inside the super node, two temporary node components, data input DINode and data output DONode, are automatically generated. For DINode, assume that its output parameters are $DIOutput = \{di\_out_1, di\_out_2, \dots, di\_out_m\}$; similarly, for DONode, assume that its input parameters are $DOInput = \{do\_in_1, do\_in_2, \dots, do\_in_n\}$. The input and output parameters of the super node correspond one to one with the output parameters of the internal temporary data input node and the input parameters of the internal temporary data output node respectively, and the data they point to are consistent, namely, the following relationship is satisfied:

$$sup\_in_i = di\_out_i \quad (i = 1, 2, \dots, m), \qquad sup\_out_j = do\_in_j \quad (j = 1, 2, \dots, n)$$

Wherein m and n are the number of input and output parameters of the super node component respectively. After the input and the output of the super nodes in each processing link are defined, the data flow dependency relationship among the super nodes can be constructed in the main task flow chart based on the data flow, and for each super node sub-model, the corresponding processing flow can be designed according to the matched input and output.
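The one-to-one correspondence between the super node parameters and the parameters of its temporary nodes can be sketched as follows. The class `SuperNodeIO`, its methods `feed` and `collect`, and the example parameter and table names are hypothetical; the sketch only shows positional pairing of the m inputs with the DINode outputs and of the n outputs with the DONode inputs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SuperNodeIO:
    sup_input: List[str]                          # m input parameter names
    sup_output: List[str]                         # n output parameter names
    di_output: List[str] = field(init=False)      # outputs of the temporary DINode
    do_input: List[str] = field(init=False)       # inputs of the temporary DONode

    def __post_init__(self) -> None:
        # The temporary nodes are generated automatically, one parameter
        # per super node parameter, in the same order.
        self.di_output = [f"di_out_{i}" for i in range(1, len(self.sup_input) + 1)]
        self.do_input = [f"do_in_{j}" for j in range(1, len(self.sup_output) + 1)]

    def feed(self, data: Dict[str, object]) -> Dict[str, object]:
        """Pass data arriving on the super node inputs to the DINode outputs."""
        return {di: data[sup] for sup, di in zip(self.sup_input, self.di_output)}

    def collect(self, data: Dict[str, object]) -> Dict[str, object]:
        """Pass data arriving on the DONode inputs to the super node outputs."""
        return {sup: data[do] for sup, do in zip(self.sup_output, self.do_input)}


io = SuperNodeIO(sup_input=["raw_sms"], sup_output=["clean_sms", "stats"])
inside = io.feed({"raw_sms": "table_a"})
outside = io.collect({"do_in_1": "table_b", "do_in_2": "table_c"})
```

Because the same object is passed through on each side, the data referenced by a super node parameter and by its paired temporary node parameter are identical.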

Referring to FIGS. 7 and 8, receiving a third operation instruction and running the final task analysis flow includes the steps of:

uploading the final task analysis flow chart;

after the completion, the following steps are executed

In the above steps, S52 to S56 add new edges to the final task analysis flow chart, and the start node and end node of each added edge are operator nodes or data nodes that can actually run.

Step S53 covers the case where the start node of the edge is a common node and the end node is a super node. In this case, the first common node connected to the data input inside the super node is found and a new edge is constructed, which is equivalent to unfolding the super node from the data input layer.

Steps S54 and S55 cover the case where the start node of the edge is a super node and the end node is a common node. In this case, the first common node connected to the data output inside the super node is found and a new edge is constructed, which is equivalent to unfolding the super node from the data output layer.

Steps S54 and S56 cover the case where both the start node and the end node of the edge are super nodes. In this case, the first common node connected to the data output is found inside the start super node, the first common node connected to the data input is found inside the end super node, and an edge between these two nodes is constructed; this is a special case combining the two cases above.

After the super nodes in the final task analysis flow chart have been unfolded to form the final task analysis flow big diagram, the big diagram is converted into a task flow graph that the data-flow-based scheduling engine can parse, and is then distributed and scheduled for execution.

During task execution, the running results and states of all nodes in the task, including all nodes inside the super nodes, can be viewed in real time. The state of a super node depends on the state of the sub-flow inside it: the super node counts as successfully executed only if every node in its sub-flow has executed successfully.
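The rule that a super node succeeds only when every node in its sub-flow succeeds can be sketched as a small aggregation function. The state names and the handling of failed, running and empty sub-flows are assumptions added for illustration; the text only specifies the success condition.

```python
from typing import Iterable


def super_node_state(child_states: Iterable[str]) -> str:
    """Derive a super node's state from the states of its sub-flow nodes."""
    states = list(child_states)
    if any(s == "failed" for s in states):
        return "failed"                      # assumption: one failure fails the super node
    if states and all(s == "succeeded" for s in states):
        return "succeeded"                   # stated rule: all sub-flow nodes succeeded
    if any(s in ("running", "succeeded") for s in states):
        return "running"                     # assumption: partially complete sub-flow
    return "pending"                         # assumption: nothing has started yet
```

For nested super nodes the function can be applied bottom-up, using the derived state of an inner super node as one of the child states of the outer one.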

All users within the sharing range can check the overall execution status of the task, monitor its running log in real time, and view the results in the main graph and in the super node subgraphs to which they have access. All results in the graph can be reused, that is, nodes that have already been executed do not need to be executed again, and the analysis results can be shared with other users.
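Reuse of results from already executed nodes can be sketched with a result cache and a topological traversal. The function `run_flow`, the cache layout and the example actions are hypothetical, and the sketch ignores cache invalidation after a node is edited; it relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_flow(deps: Dict[str, Set[str]],
             actions: Dict[str, Callable],
             cache: Dict[str, object]) -> List[str]:
    """Run nodes in dependency order, skipping nodes whose result is cached.

    deps maps a node to the set of nodes it depends on. Returns the list of
    nodes that were actually executed in this call.
    """
    executed = []
    for node in TopologicalSorter(deps).static_order():
        if node in cache:
            continue                                  # result is reused
        inputs = [cache[p] for p in sorted(deps.get(node, ()))]
        cache[node] = actions[node](*inputs)
        executed.append(node)
    return executed


deps = {"clean": {"load"}, "fit": {"clean"}}
actions = {
    "load": lambda: [3, 1, 2],
    "clean": lambda rows: sorted(rows),
    "fit": lambda rows: sum(rows),
}
cache: Dict[str, object] = {}
first_run = run_flow(deps, actions, cache)    # executes every node
second_run = run_flow(deps, actions, cache)   # executes nothing, results are reused
```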

The invention also protects a system applying the big data analysis-oriented multi-user collaborative modeling method, the system comprising:

the construction unit is used for constructing the current task analysis process;

a screening unit, configured to determine the cooperative user;

the sharing unit shares and distributes the current task analysis process to the collaborative users;

a generating unit for generating a final task analysis flow;

and the operation unit is used for operating the final task analysis flow.

The invention also discloses a device for the big data analysis-oriented multi-user collaborative modeling method, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above method when executing the program.

It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the protection scope of the claims of the present invention.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

It should be understood that the technical problems can be solved by combining the features of the embodiments with those of the claims.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-user collaborative modeling method for big data analysis is characterized by comprising the following steps:

2. The big data analysis-oriented multi-user collaborative modeling method according to claim 1, wherein determining collaborative users, and sharing and allocating the current task analysis process to the collaborative users comprises the steps of:

generating a user set containing all users;

3. The big-data-analysis-oriented multi-user collaborative modeling method according to claim 2, wherein, before the step of sharing and assigning the current task analysis flow to the collaborative users, the method further comprises: configuring the permissions of the collaborative users.

4. The big-data-analysis-oriented multi-user collaborative modeling method according to claim 3, wherein: the permission configuration comprises at least one of viewing permission, editing permission and executing permission.

5. The big data analysis-oriented multi-user collaborative modeling method according to claim 4, wherein the step of receiving a second operation instruction of the collaborative user on the current task analysis flow to generate a final task analysis flow comprises the steps of:

and generating a final task analysis flow.

6. The big-data-analysis-oriented multi-user collaborative modeling method according to claim 5, wherein when an (N-1) th collaborative user completes a second operation instruction, a message prompt is performed on the Nth collaborative user.

7. The big-data-analysis-oriented multi-user collaborative modeling method according to claim 6, wherein the super node capable of parameter input and output comprises: the super node internally comprises a temporary input node and a temporary output node.

8. The big-data-analysis-oriented multi-user collaborative modeling method according to any one of claims 1 to 7, wherein receiving a third operation instruction and executing a final task analysis flow comprises the steps of:

uploading the final task analysis flow chart;

9. A system applying the big data analysis oriented multi-user collaborative modeling method according to any one of claims 1 to 8, the system comprising:

a screening unit, configured to determine the cooperative user;

a generating unit for generating a final task analysis flow;

and the operation unit is used for operating the final task analysis flow.

10. A device for a multi-user collaborative modeling method for big data analysis, characterized by comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of the preceding claims 1 to 8.