CN105956087B - Data version management system and method - Google Patents

Publication number
CN105956087B
Authority
CN
China
Prior art keywords
data
code
data set
execution
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610282533.2A
Other languages
Chinese (zh)
Other versions
CN105956087A (en)
Inventor
徐葳
徐方舟
张炀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cross Information Core Technology Research Institute (Xi'an) Co., Ltd.
Original Assignee
Cross Information Core Technology Research Institute (xi'an) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cross Information Core Technology Research Institute (Xi'an) Co., Ltd.
Priority to CN201610282533.2A
Priority to CN201910359068.1A (CN110119393B)
Publication of CN105956087A
Application granted
Publication of CN105956087B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/219 Managing data history or versioning

Abstract

The present invention provides a data and code version management system and method. The system comprises a data management module, a code management module, an execution engine module, and a system core module. The data management module stores at least one data set. The code management module stores at least one piece of execution code, and either stores code pushed by a user or sends a code processing request according to the code pushed by the user. The execution engine module is configured with at least one execution back-end engine; according to a received command, it invokes an execution back-end engine and runs a piece of execution code to operate on at least one data set in the data management module. When the system core module receives a data processing request submitted by a user, it processes the data set in the data management module, creates the data workflow of the data set, and records the resulting data version information and code version information. The invention effectively addresses problems such as inefficient or chaotic version management of data and code.

Description

Data version management system and method
Technical field
The present invention relates to the field of data analysis, and in particular to a data and code version management system and method.
Background art
In recent years, people have collected large amounts of data. Meanwhile, data scientist has become one of the most sought-after jobs at major companies. However, there are currently not enough tools to help data scientists analyze data flows. As data science tasks grow more complex, many data analysts have begun adapting code version control tools such as Git. But Git cannot fully handle the tasks of data science.
First, data science is data-centric. A data set may go through several operations such as cleaning, labeling, and preprocessing, so it produces multiple versions. Data scientists need to record these versions and be able to modify the data at any time. A common but not recommended approach is to save multiple copies named, for example, data.csv, data-version1.csv, data-final-version.csv, and data-last-version.csv. This naming scheme is confusing, and mixing up versions or data sets often leads to errors.
Second, a machine learning model generally has many parameters, and training these parameters is a very common task in data science. Parameters such as the learning rate, initial values, and regularization are often poorly understood, so those who take over the work later tend to forget their meaning and importance.
Third, as data sets grow, data scientists often need to build a distributed platform and repeat their experiments on it. They may also use third-party software packages. Unfortunately, installing and configuring these packages across different software and hardware environments is usually tedious.
Finally, sharing data sets and experience among data scientists is difficult. They can of course share their code and results, but this does little to help others understand the data sets in depth or make full use of that code and those results.
MIT's DataHub project supports version control of data sets but cannot handle the whole process of data set analysis and development, so it is more of a database management tool than a software development tool. Harvard Dataverse, on the other hand, is a data publishing and sharing platform but lacks version control and analysis functions for data.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a data and code version management system and method that solve problems in the prior art such as inefficient or chaotic version management of data and code.
To achieve the above and other related objects, a first aspect of the application provides a data version management system comprising a data management module, a code management module, an execution engine module, and a system core module. The data management module stores at least one data set. The code management module stores at least one piece of execution code, which is used to operate on the at least one data set. The execution engine module is configured with at least one execution back-end engine; according to a received command, it invokes an execution back-end engine and runs a piece of execution code to operate on at least one data set in the data management module. When the system core module receives a data processing request submitted by a user, it processes the data set in the data management module, creates the data workflow of the data set, and records the resulting data version information.
In an embodiment of the application, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID.
In an embodiment of the application, when the data processing request received by the system core module is a request to submit a new data set, the system core module extracts the data of the new data set and stores it in the first storage unit, extracts the metadata of the new data set and stores it in the second storage unit, forms a new data ID that associates the data and metadata of the new data set, creates the data workflow of the data set, and records the resulting data version information.
In an embodiment of the application, when the data processing request received by the system core module is a request to modify a data set stored in the data management module, the system core module copies a piece of execution code to the execution engine module according to the request and sends an execution command to the execution engine module so that it runs the execution code to form a new data set. The system core module then extracts the data of the new data set and stores it in the first storage unit, extracts the metadata of the new data set and stores it in the second storage unit, forms a new data ID that associates the data and metadata of the new data set, forms a code ID that associates the execution code with the new data set, creates the data workflow of the data set, and records the resulting data version information.
Another aspect of the application provides a data version management method comprising the following steps: pre-storing at least one data set and at least one piece of execution code for operating on the at least one data set, and configuring at least one execution back-end engine for running the execution code; and, when a data processing request submitted by a user is received, invoking an execution back-end engine to run at least one piece of the execution code to process at least one data set, creating the data workflow of the data set, and recording the resulting data version information.
In an embodiment of the application, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID.
In an embodiment of the application, when the received data processing request is a request to submit a new data set, the data of the new data set is extracted and stored in the first storage unit, the metadata of the new data set is extracted and stored in the second storage unit, a new data ID is formed that associates the data and metadata of the new data set, and the resulting data version information is recorded.
In an embodiment of the application, when the received data processing request is a request to modify a stored data set, a piece of execution code is copied to the execution back-end engine according to the request, and an execution command is sent to the execution back-end engine so that it runs the execution code to form a new data set. The data of the new data set is extracted and stored in the first storage unit, the metadata of the new data set is extracted and stored in the second storage unit, a new data ID is formed that associates the data and metadata of the new data set, a code ID is formed that associates the execution code with the new data set, and the resulting data version information is recorded.
A further aspect of the application provides a code version management system comprising a data management module, a code management module, an execution engine module, and a system core module. The data management module stores at least one data set. The code management module stores at least one piece of execution code, which is used to operate on the at least one data set stored in the data management module; the code management module also stores code pushed by a user or sends a code processing request according to the code pushed by the user. The execution engine module is configured with at least one execution back-end engine; when it receives an execution command, it invokes an execution back-end engine according to the command and runs a piece of execution code to operate on the data set in the data management module. The system core module records the code pushed by the user and forms code version information. When it receives a code processing request from the code management module, it sends an execution command to the execution engine module so that the execution engine module runs the execution code in the code management module, and it records the resulting code version information after the execution code has operated on the data set in the data management module.
In an embodiment of the application, the system core module also copies a piece of execution code to the execution engine module according to a data processing request submitted by a user, sends an execution command to the execution engine module so that it runs the execution code to form a new data set, forms a code ID that associates the execution code with the new data set, and records the resulting code version information.
In an embodiment of the application, the execution code that the system core module copies to the execution engine module according to the data processing request is either new execution code submitted by the user or execution code retrieved from the code management module.
Yet another aspect of the application provides a code version management method comprising the following steps: pre-storing at least one data set and at least one piece of execution code for operating on the at least one data set, and configuring at least one execution back-end engine for running the execution code; and either receiving and storing code pushed by a user and recording the resulting code version information, or sending a code processing request according to the code pushed by the user, sending an execution command to the execution back-end engine so that it runs the pre-stored execution code, and recording the resulting code version information after the execution code has operated on the pre-stored data set.
In an embodiment of the application, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID.
In an embodiment of the application, the code version management method further comprises the steps of copying a piece of execution code to the execution back-end engine according to a data processing request submitted by a user, sending an execution command so that the engine runs the execution code to form a new data set, forming a code ID that associates the execution code with the new data set, and recording the resulting code version information.
In an embodiment of the application, the execution code copied to the execution back-end engine according to the data processing request is either new execution code submitted by the user or execution code retrieved from the stored execution code.
In an embodiment of the application, the code version management method further comprises a step of configuring multiple user interfaces (UIs) for receiving requests submitted by different users or for feeding request information back to different users.
As described above, the data and code version management system and method of the invention have the following advantages. By providing separate version management for data sets and code, providing a directed acyclic workflow for each data set and piece of code, and constructing the association between the two, the invention effectively addresses problems such as inefficient or chaotic version management of data and code. In addition, the UI design gives users a convenient way to compare and analyze historical data sets. Finally, distributing the units across different servers helps reduce the load on each server.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the data version management system of the invention.
Fig. 2 is a flow chart of the data version management method of the invention.
Fig. 3 is a structural schematic diagram of the code version management system of the invention.
Fig. 4 is a flow chart of the code version management method of the invention.
Fig. 5 is a schematic diagram of the composition of a data workflow in a specific embodiment of the invention.
Description of reference numerals
1 Data version management system
11 Data management module
12 Code management module
13 Execution engine module
14 System core module
2 Code version management system
21 Data management module
22 Code management module
23 Execution engine module
24 System core module
S11 to S12, S21 to S22 Steps
Detailed description of the embodiments
Embodiments of the present invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the invention. Note that, where no conflict arises, the following embodiments and the features within them can be combined with one another.
Note also that the drawings provided with the following embodiments illustrate the basic concept of the invention only schematically. They show only the components related to the invention and are not drawn according to the number, shape, and size of components in an actual implementation. In practice, the form, quantity, and proportion of each component can vary freely, and the component layout may be more complex.
Embodiment one
Referring to Fig. 1, which shows a structural schematic diagram of the data version management system of the invention, a first aspect of the application provides a data version management system. The system can be deployed on a single server, a server cluster, servers based on a cloud computing architecture, or distributed servers. A server cluster gathers many servers to perform data version management together and can use multiple computers for parallel computation to increase computing speed. Servers based on a cloud computing architecture are pooled through virtualization technology, so that the servers hosting the modules of the data version management system share computing resources. Distributed servers spread the data and programs of the data version management system across multiple servers that run in coordination.
Each module of the data version management system can be deployed on any of the above kinds of server according to actual design needs. Specifically, the data version management system 1 includes a data management module 11, a code management module 12, an execution engine module 13, and a system core module 14.
The data management module 11 stores at least one data set. A data set is a collection of data under version management. The data includes, but is not limited to, text data and/or multimedia data. In a specific embodiment, the text data is, for example, code data or system logs, and the multimedia data is, for example, image data or video data. If the data management module 11 stores multiple data sets, the data sets may be unrelated or related to one another. For example, among data sets A1, A2, and A3, data set A3 is derived from data sets A1 and A2, and is associated with A1 and A2 through an index or an association field.
A data set can also include metadata that indexes or describes the data. Each item of data in the data set and its corresponding metadata can be associated through a data ID. Metadata, also called intermediary data or relay data, is data about data. It mainly describes the properties of data and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata can be regarded as an electronic catalog: to serve the purpose of cataloging, it must describe and collect the content or characteristics of the data, thereby assisting data retrieval.
In one optional scheme, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID. The first storage unit and the second storage unit can be configured in the same database server, or in different servers according to actual needs. In an optional embodiment, for example, the first storage unit is configured in the Hadoop distributed file system (a distributed file system for big data) and the second storage unit is configured in a NoSQL (non-relational) database.
A distributed file system (Distributed File System) is designed on the client/server model: the physical storage resources managed by the file system are not necessarily attached directly to the local node, but are connected to the node through a computer network. The NoSQL database is, for example, a key-value store, a column store, a document database, a graph database, or a MongoDB database.
In this application, a distributed file system allows a data set to be scanned efficiently, but random access is inefficient. To solve this problem, the application stores the identifier and annotations of each picture, such as its file name, size, and content description, in the NoSQL database to speed up queries; that is, the raw data and the metadata are linked by the data ID.
The data version management system of the application records each field of the data; for example, a new data item is a set consisting of a new name and a version number. The embodiment specifically uses a MongoDB database to store the metadata, but is not limited to it; in other implementations, the metadata may be moved to a column store, a key-value store, a document database, or a graph database to improve efficiency.
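The split between the two storage units can be illustrated with a minimal sketch. It is not the patented implementation: the two dictionaries are hypothetical in-memory stand-ins for the distributed file system and the NoSQL database, and the function names `store_dataset` and `find_by_metadata` are assumptions introduced for illustration.

```python
import uuid

# Hypothetical in-memory stand-ins for the two storage units.
first_storage = {}   # data ID -> raw data (a distributed file system in the text)
second_storage = {}  # data ID -> metadata (a NoSQL database in the text)


def store_dataset(raw_data, metadata):
    """Store data and metadata separately, linked by one new data ID."""
    data_id = uuid.uuid4().hex
    first_storage[data_id] = raw_data
    second_storage[data_id] = dict(metadata)
    return data_id


def find_by_metadata(key, value):
    """Query only the metadata store and return the matching data IDs."""
    return [i for i, m in second_storage.items() if m.get(key) == value]


data_id = store_dataset(b"a,b\n1,2\n", {"filename": "data.csv", "size": 8})
matches = find_by_metadata("filename", "data.csv")
raw = first_storage[matches[0]]  # follow the data ID back to the raw data
```

The lookup touches only the small metadata store and then follows the data ID to the raw data, which mirrors the reason the text gives for keeping annotations out of the file system.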
The code management module 12 stores at least one piece of execution code, which is used to operate on the at least one data set. When execution code is invoked, it performs operations such as adding, deleting, and modifying on a data set and on the data within it. For example, the execution code includes, but is not limited to: code that adds a new data set, code that deletes a data set, code that adds labels or characters to the data in a given data set, code that deletes labels or characters from the data in a given data set, and code that replaces labels or characters in the data of a given data set.
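As a rough illustration of what such execution code might look like, the sketch below implements the add, delete, and replace label operations on a data set modeled as a list of dictionaries. The record layout and the function names are assumptions, not part of the patent. Each function returns a new list rather than mutating its input, so the earlier version of the data set remains available.

```python
def add_label(records, label):
    """Return new records with `label` added to every record."""
    return [
        {**r, "labels": sorted(set(r.get("labels", [])) | {label})}
        for r in records
    ]


def delete_label(records, label):
    """Return new records with `label` removed wherever it occurs."""
    return [
        {**r, "labels": [l for l in r.get("labels", []) if l != label]}
        for r in records
    ]


def replace_label(records, old, new):
    """Return new records with label `old` replaced by `new`."""
    return [
        {**r, "labels": sorted({new if l == old else l for l in r.get("labels", [])})}
        for r in records
    ]


records = [
    {"name": "img1.png", "labels": ["cat"]},
    {"name": "img2.png", "labels": []},
]
tagged = add_label(records, "reviewed")
renamed = replace_label(tagged, "cat", "feline")
cleaned = delete_label(renamed, "reviewed")
```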
In an optional embodiment, the execution code is stored in, for example, GitLab, and the system interacts with it through the GitLab API. GitLab is an open-source version management system built with Ruby on Rails. It implements a self-hosted Git project repository whose public or private projects can be accessed through a web interface. GitLab offers functions similar to GitHub, such as browsing source code and managing defects and comments. It can manage team access to repositories, makes it easy to browse committed versions, and provides a file history. Team members can communicate using the built-in simple chat program (Wall). GitLab also provides a code snippet collection feature that makes code reuse easy and lets snippets be found again when needed.
The execution engine module 13 is configured with at least one execution back-end engine. According to a received command, it invokes an execution back-end engine and runs a piece of execution code to operate on at least one data set in the data management module 11. An execution back-end engine is provided for each programming language in which execution code is written. The execution back-end engines include single-machine engines and distributed engines.
A single-machine engine is, for example, Python on a single machine. Python is free software whose source code and CPython interpreter are distributed under a GPL-compatible license. When executed, it first compiles the source code in a .py file into Python bytecode, which is then executed by the Python Virtual Machine.
A distributed engine is, for example, Spark on a cluster. Spark is a fast, general-purpose cluster computing framework whose core is written in Scala. It provides high-level APIs for Scala, Java, and Python, with which parallel applications can be developed easily.
In specific embodiments, configuring the execution back-end engine is necessary: it not only spares users from setting up a distributed cluster environment themselves, it also allows code and result data sets to be linked automatically. That is, by executing code within the data version management system of the application, any intermediate result can be reproduced as long as the user keeps the original data and code.
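The dispatch from a received command to a configured back end can be sketched as follows. This is a simplified illustration under stated assumptions: the class `ExecutionEngine`, the command layout, and the back-end name `local-python` are hypothetical, and the execution code is modeled as an ordinary Python callable rather than a script launched on a real Python or Spark runtime.

```python
class ExecutionEngine:
    """Hypothetical engine module holding a registry of back-end runners."""

    def __init__(self):
        self._backends = {}

    def register(self, name, runner):
        # A runner takes (code, dataset, params) and returns a new data set.
        self._backends[name] = runner

    def execute(self, command, code, dataset):
        name = command["backend"]
        if name not in self._backends:
            raise ValueError(f"no execution back end configured for {name!r}")
        return self._backends[name](code, dataset, command.get("params", {}))


def local_runner(code, dataset, params):
    """Single-machine back end: call the execution code in this process."""
    return code(dataset, **params)


def keep_rows_above(dataset, threshold):
    """Example execution code: filter a data set of numbers."""
    return [row for row in dataset if row > threshold]


engine = ExecutionEngine()
engine.register("local-python", local_runner)
new_dataset = engine.execute(
    {"backend": "local-python", "params": {"threshold": 2}},
    keep_rows_above,
    [1, 2, 3, 4],
)
```

A distributed back end would be registered under another name with its own runner; the caller only changes the `backend` field of the command.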
When the system core module 14 receives a data processing request submitted by a user, it processes the data set in the data management module 11, creates the data workflow of the data set, and records the resulting data version information.
The data workflow (Data Work Flow, DWF) is a directed acyclic graph used to mark data sets, and/or the data within data sets, during version management. For a new data set, the corresponding node in the data workflow is a root node. For data sets or data that have undergone version updates, the data workflow represents the derivation relationship between two items of data in a data set and/or between two data sets. The derivation relationship includes the execution history and the versions of a data set. The execution history includes, but is not limited to, the parent-child node relationship between the data before and after a change, the execution code invoked for the change, and the execution time. In short, the data workflow captures the logical relationships among data sets in the data version management system of the application, and the data sets form dependencies accordingly. The data workflow is the core function for reproducing data in the data version management system of the application.
In the data workflow, a node represents a particular version of a data set. A directed edge connecting two nodes indicates that one data set is derived from the other. The label on an edge indicates the code version of the experiment that was run. Fig. 5 shows an example of a data workflow, which is a directed acyclic graph. Fig. 5 illustrates two common data workflow structures: one-to-one and many-to-one.
In the one-to-one structure, one data set is derived from another. For example, a user can create a new data set based on an existing one, add new labels to it, and share it with other users. The many-to-one structure indicates that one data set can be derived from two or more data sets, as with an operation that merges two data tables.
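The one-to-one and many-to-one structures can be sketched as a small directed acyclic graph in which each node is a data set version and each edge carries the code version that produced the child. The class `DataWorkflow` and its method names are assumptions made for illustration. Because a child must be a new node and every parent must already exist, no cycle can be introduced.

```python
from dataclasses import dataclass, field


@dataclass
class DataWorkflow:
    # child version -> list of (parent version, code ID labelling the edge)
    parents: dict = field(default_factory=dict)

    def add_root(self, version):
        """Register a newly submitted data set as a root node."""
        self.parents.setdefault(version, [])

    def derive(self, child, parent_versions, code_id):
        """Record that `child` was derived from the parents by `code_id`."""
        for p in parent_versions:
            if p not in self.parents:
                raise KeyError(f"unknown parent version {p!r}")
        if child in self.parents:
            raise ValueError(f"version {child!r} already exists")
        self.parents[child] = [(p, code_id) for p in parent_versions]

    def ancestors(self, version):
        """Return every version that `version` was derived from."""
        seen, stack = set(), [version]
        while stack:
            for parent, _code_id in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


dwf = DataWorkflow()
dwf.add_root("A1")
dwf.add_root("A2")
dwf.derive("A3", ["A1", "A2"], code_id="c1")  # many-to-one, e.g. a merge
dwf.derive("A4", ["A3"], code_id="c2")        # one-to-one, e.g. relabeling
lineage = dwf.ancestors("A4")
```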
When implementing the data workflow structure described above, the data version management system of the application introduces a parent-child node relationship attribute, which records which data sets a given data set is derived from. The system can also compare the differences between parent and child data sets. These functions help users discover more easily what results their code changes produced. The structure graph of the data workflow therefore not only clarifies the relationships between data sets but also helps manage users' execution records, including generating results by version number.
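The text does not specify how parent and child data sets are compared. One possible sketch, assuming each data set is a mapping from a record key to a record value, reports added, removed, and changed records; the function name `diff_datasets` is hypothetical.

```python
def diff_datasets(parent, child):
    """Compare two data sets given as {record key: record value} mappings."""
    added = {k: child[k] for k in child.keys() - parent.keys()}
    removed = {k: parent[k] for k in parent.keys() - child.keys()}
    changed = {
        k: (parent[k], child[k])
        for k in parent.keys() & child.keys()
        if parent[k] != child[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


parent = {"img1.png": "cat", "img2.png": "dog", "img3.png": "bird"}
child = {"img1.png": "cat", "img2.png": "wolf", "img4.png": "fish"}
delta = diff_datasets(parent, child)
```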
The system core module 14 uses the identifying information (such as ID values) of the data set and data before the change, together with the data workflow, to obtain and record the data version information after the change. The data version information includes, but is not limited to, at least one of the data set name, data ID, code ID, creation time, and run log.
The system core module 14 can capture the data processing request submitted by the user through a network or submission interface. According to the request, it sends the corresponding execution command and the execution code selected from the code management module 12 to the execution engine module 13, so that the execution engine module 13 invokes the corresponding execution back-end engine to run the selected execution code and thereby carry out version management of the data set.
For example, in a specific implementation, whenever a user pushes execution code to the GitLab server, the GitLab server notifies the system core module 14 through a webhook. The system core module 14 pushes the user request onto its own queue and takes requests from the head of the queue for processing. It copies the execution code of the request to the execution engine module 13, which then runs the execution code with the parameters and input provided by the user. After the task ends, the system core module 14 records the information of the current request, including the commit ID of the current push on the GitLab server, the parameters specified by the user, and any other experiment-related information. In some cases, an experiment generates a new data set; the system core module 14 then also records the relationships between these data sets, namely the data workflow described above.
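The queue-based handling of push notifications can be sketched as below. This is an assumption-laden illustration: `SystemCore`, `on_push`, and `process_next` are hypothetical names, the engine is any callable that takes a commit ID and parameters, and the webhook transport itself is left out.

```python
from collections import deque


class SystemCore:
    """Hypothetical core module that queues push requests and logs runs."""

    def __init__(self, engine):
        self.engine = engine   # callable(commit_id, params) -> run output
        self.queue = deque()
        self.history = []

    def on_push(self, commit_id, params):
        """Called when the code server reports a push; enqueue the request."""
        self.queue.append({"commit_id": commit_id, "params": params})

    def process_next(self):
        """Take the request at the head of the queue, run it, record it."""
        if not self.queue:
            return None
        request = self.queue.popleft()
        output = self.engine(request["commit_id"], request["params"])
        record = {**request, "output": output}
        self.history.append(record)
        return record


def fake_engine(commit_id, params):
    # Stand-in for the execution engine module.
    return f"ran {commit_id} with lr={params['lr']}"


core = SystemCore(fake_engine)
core.on_push("abc123", {"lr": 0.1})
core.on_push("def456", {"lr": 0.01})
first = core.process_next()
second = core.process_next()
```

Requests are handled in arrival order, and each history record keeps the commit ID and the user-specified parameters together with the run output.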
A user may submit several kinds of data processing requests; each is described below:
In one case, when the data processing request is to submit a new data set, the system core module 14 may store the new data set directly in the first storage unit, create the data workflow of the new data set, and record the data version information formed. Alternatively, the system core module 14 selects the corresponding execution code from the code management module 12 according to the data processing request and sends the corresponding execution command for submitting a new data set to the execution engine module 13. The execution engine module 13 executes the selected execution code according to the received execution command and stores the data of the acquired new data set in the first storage unit. At the same time, the system core module 14 also creates the data workflow of the data set and records the data version information formed.
In an optional scheme, the new data set includes data and metadata. When selecting execution code, the system core module 14 selects execution code that saves data and metadata separately. Running the selected execution code extracts the data and the metadata from the new data set, stores the extracted data in the first storage unit and the extracted metadata in the second storage unit, and forms a new data ID that associates the data and metadata of the new data set. The data workflow of the data set is then created and the data version information formed is recorded. Here, the data workflow contains the data and the corresponding metadata as a root node. The recorded data version information includes: data set name, data ID, metadata ID, the correspondence between data ID and metadata ID, the code ID of the code that added the data set, formation time, and running log.
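The submission path described above (split data from metadata, link them by ID, create a root node, record the version) can be sketched as follows. Plain dictionaries stand in for the two storage units, and every name (`submit_new_dataset`, `data_store`, `metadata_store`, `workflow`, `version_log`) is an assumption made for illustration.

```python
import uuid

data_store = {}      # stand-in for the first storage unit (raw data)
metadata_store = {}  # stand-in for the second storage unit (metadata)
workflow = {}        # data ID -> list of parent data IDs
version_log = []     # recorded data version information


def submit_new_dataset(name, payload, metadata, code_id):
    """Store data and metadata separately and register a root node."""
    data_id = uuid.uuid4().hex
    metadata_id = uuid.uuid4().hex
    data_store[data_id] = payload
    # The metadata document carries the data ID so the two halves can be rejoined.
    metadata_store[metadata_id] = dict(metadata, data_id=data_id)
    workflow[data_id] = []  # no parents: the new data set is a root node
    version_log.append({
        "dataset_name": name,
        "data_id": data_id,
        "metadata_id": metadata_id,
        "code_id": code_id,
    })
    return data_id


root_id = submit_new_dataset(
    "photos", b"raw image bytes", {"file_name": "cat.png", "size": 15}, "c-1"
)
```

Formation time and running log are omitted here for brevity; a fuller record would carry them as well.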
In another case, when the data processing request submitted by the user is to modify a data set stored in the data management module 11, the system core module 14 copies a piece of execution code into the execution engine module 13 according to the data processing request, and sends an execution command to the execution engine module 13 so that it runs the execution code and forms a new data set. The new data set is then stored in the first storage unit, a data workflow of the new data set relative to the data set before modification is created, and the data version information formed is recorded.
In one optional approach, the system core module 14 extracts the data of the new data set and stores it in the first storage unit, extracts the metadata of the new data set and stores it in the second storage unit, forms a new data ID that associates the data and metadata of the new data set, and forms a code ID that associates the execution code with the new data set. It then creates the data workflow of the data set and records the data version information formed.
More preferably, the execution code that the system core module 14 copies to the execution engine module 13 according to the data processing request is either new execution code submitted by the user or execution code already stored in, and called from, the code management module 12.
Specifically, a user may also submit new execution code in advance according to their own needs, and adjust, manually or through the system core module 14, the correspondence between the new execution code and the execution back-end engines in the execution engine module 13. Thus, when the data processing request submitted by the user is to modify a data set stored in the data management module 11, the system core module 14 decides, according to the data processing request, to copy the new execution code into the execution engine module 13. The corresponding execution back-end engine executes that code, the new data set is saved, a data workflow of the new data set relative to the data set before modification is created, and the data version information formed is recorded.
When a user needs to analyze and compute over data sets of different versions, a preferred approach in this embodiment is that the data version management system 1 further includes a user interface module (not illustrated). The user interface module is configured with multiple user UIs, so as to receive requests submitted by different users or to feed back the requested information to different users.
Specifically, users often need to analyze data sets and compute metrics, such as accuracy in natural language processing or the daily return on investment in stock market backtesting. After the data workflow and data version information of a data set have been created, the user interface module can present to the user the data workflow and the data in each associated data set. The displayed data can help the user compare a pair of historical analysis results and show the differences in code and/or parameters. Multiple UI designs aimed at different kinds of users can help each user arrive at the best algorithm and parameters.
In summary, the data version management system provided by the present application makes it possible to manage data versions and run user code within a single integrated system. It retains both the user's code and the user's data, and can compare two versions and find the differences. In addition, the system stores data and metadata separately, so that data can be filtered more efficiently. Furthermore, by providing separate version management for data sets and for code, the invention provides a directed acyclic workflow for each data set and its code and builds the association between the two, which effectively addresses problems such as inefficient or chaotic version management of data and code. Moreover, the multiple UI designs give users a convenient way to compare and analyze historical data sets. Finally, distributing the units across different servers helps relieve the operating load on each server.
Embodiment two
Referring to Fig. 2, which shows a flowchart of the data version management method of the present invention, the second aspect of the present application provides a data version management method. The method is mainly executed by a data version management system. The data version management system may be deployed on a single server, a server cluster, a server based on a cloud computing architecture, or distributed servers. A server cluster gathers many servers together to perform data version management jointly and can use multiple computers for parallel computation to increase computing speed. A server based on a cloud computing architecture pools the storage of the servers through virtualization, so that the servers hosting the modules of the data version management system share computing resources. Distributed servers spread the data and programs of the data version management system over multiple servers that operate in coordination.
Each module of the data version management system can be deployed on any of the above kinds of server according to actual design needs. The data version management system executes the method according to the following steps.
In step S11, at least one data set and at least one piece of execution code for operating on the at least one data set are pre-stored, and at least one execution back-end engine for running the execution code is configured.
Here, a data set is a collection of data under version management. The data includes, but is not limited to, text data and/or multimedia data. In a specific embodiment, examples of text data are code data and system logs; examples of multimedia data are image data and video data. If multiple data sets are stored, the data sets may be unrelated or related to one another. For example, among data sets A1, A2 and A3, data set A3 is derived from data sets A1 and A2, and is associated with A1 and A2 through an index or an association field.
A data set may also include metadata for indexing or describing the data. Each item of data in the data set and its corresponding metadata can be associated through a data ID. Specifically, metadata, also called intermediary data or relay data, is data that describes data (data about data). It mainly describes the properties of the data and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata can be regarded as a kind of electronic catalog: to serve the purpose of cataloguing, it must describe and record the content or characteristics of the data, and it thereby assists data retrieval.
In an optional scheme, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID. Here, the first storage unit and the second storage unit may be deployed on the same database server, or on different servers according to actual needs. For example, in an optional embodiment, the first storage unit is deployed in the Hadoop distributed file system (a distributed file system for big data), and the second storage unit is deployed in a NoSQL (non-relational) database.
A distributed file system (Distributed File System) is designed on the client/server model; specifically, the physical storage resources managed by the file system are not necessarily attached directly to the local node, but are connected to the node through a computer network. The NoSQL database is, for example, a key-value store, a column store, a document database, a graph database, or a MongoDB database.
In this application, a distributed file system allows a data set to be scanned efficiently, but random access to it is inefficient. To solve this problem, the scheme provided by the present application stores the labels and annotations of each picture, such as file name, size, and content description, in the NoSQL database to speed up queries; the raw data and the metadata are connected by the data ID.
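The lookup pattern described above, filtering on small metadata documents first and fetching raw data by data ID afterwards, can be sketched as follows. An in-memory list and dictionary stand in for the NoSQL database and the distributed file system; the names `find_data_ids` and `load_matching` and the sample records are assumptions made for illustration.

```python
# Stand-in for the distributed file system: data ID -> raw bytes.
raw_files = {
    "img-1": b"<bytes of cat.png>",
    "img-2": b"<bytes of dog.png>",
    "img-3": b"<bytes of cat2.png>",
}

# Stand-in for the NoSQL metadata collection: one small document per picture.
metadata = [
    {"data_id": "img-1", "file_name": "cat.png", "size": 2048, "description": "cat"},
    {"data_id": "img-2", "file_name": "dog.png", "size": 4096, "description": "dog"},
    {"data_id": "img-3", "file_name": "cat2.png", "size": 1024, "description": "cat"},
]


def find_data_ids(**criteria):
    """Filter on metadata only; no raw file is touched at this stage."""
    return [
        doc["data_id"]
        for doc in metadata
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def load_matching(**criteria):
    """Fetch raw data only for the IDs that survived the metadata filter."""
    return {data_id: raw_files[data_id] for data_id in find_data_ids(**criteria)}


cat_pictures = load_matching(description="cat")
```

The point of the split is that the expensive store is read only for records that already matched, rather than being scanned in full.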
The data version management system of the present application records each field of the data; for example, a new data item is a set consisting of a new name and a version number. The embodiment specifically adopted in the present application uses a MongoDB database to store metadata, but is not limited to this; in other implementations, the metadata may instead be moved to a column store, a key-value store, a document database, or a graph database to improve efficiency.
Here, the execution code is used to operate on the at least one data set. When called, the execution code performs operations such as adding, deleting, and modifying on a data set and on the data within it. For example, the execution code includes, but is not limited to: execution code that adds a new data set, execution code that deletes a data set, execution code that adds labels/characters to the data in a given data set, execution code that deletes labels/characters from the data in a given data set, and execution code that replaces labels/characters in the data in a given data set.
In an optional embodiment, the execution code is stored in, for example, GitLab, and is interacted with through the GitLab API. GitLab is an open-source version management system developed with Ruby on Rails that implements a self-hosted Git project repository; public or private projects can be accessed through a web interface. GitLab offers functionality similar to GitHub: it can browse source code and manage defects and comments. It can manage team access to repositories, makes it easy to browse committed versions, and provides a file history. Team members can communicate using the built-in simple chat program (Wall). GitLab also provides a code snippet collection function that makes code reuse easy and simplifies later lookup. Here, the execution back-end engines are set according to the programming language of each piece of execution code. The execution back-end engines include single-machine engines and distributed engines.
The single-machine engine is, for example, Python on a single machine. Python is free software; its source code and the CPython interpreter are released under the Python Software Foundation License, which is compatible with the GPL (GNU General Public License). When executing, CPython first compiles the source code in a .py file into Python bytecode, and the Python virtual machine then executes the compiled bytecode.
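The two-stage behaviour described above (source compiled to bytecode, bytecode run by the virtual machine) can be observed with Python's standard library alone. This small sketch uses the built-in `compile` and `exec` functions and the `dis` module; the source string and variable names are arbitrary examples.

```python
import dis

source = "result = 6 * 7\n"

# Stage 1: compile the source text into a code object holding bytecode.
code_object = compile(source, "<example>", "exec")

# Stage 2: let the Python virtual machine execute the compiled code object.
namespace = {}
exec(code_object, namespace)

# The bytecode can be inspected; the exact opcode names vary between versions.
instruction_names = [ins.opname for ins in dis.get_instructions(code_object)]
```

Because opcode names differ between CPython releases, the sketch only collects them and makes no claim about which ones appear.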
The distributed engine is, for example, Spark on a cluster. Spark is a fast, general-purpose cluster computing framework whose core is written in Scala. It provides high-level APIs for Scala, Java and Python, with which parallel processing applications can be developed easily.
In a specific embodiment, configuring the execution back-end engines is necessary: it not only spares users the work of setting up a distributed cluster environment themselves, but also allows code and result data sets to be linked automatically. That is, by executing code in the data version management system of the present application, any intermediate result can be obtained again, as long as the user has retained the original data and code.
In step S12, when a data processing request submitted by a user is received, an execution back-end engine is called to run the at least one piece of execution code so as to process the at least one data set, the data workflow of the data set is created, and the data version information formed is recorded.
Specifically, when the data version management system receives a data processing request submitted by a user, it processes the data set in the data management module, creates the data workflow of the data set, and records the data version information formed.
Here, the data workflow (Data Work Flow, DWF) is directed acyclic label information used to mark the data sets and/or the data within data sets during version management. For a new data set, the label information corresponding to the data workflow is a root node. For a data set or data that has undergone version updates, the data workflow indicates the version succession relationship between two items of data in a data set and/or between two data sets.
The version succession relationship includes the execution history and the version of a data set. The execution history includes, but is not limited to: the pointing relationship (i.e., the parent-child node relationship) between the data in the data set before and after a change, the execution code called for the change, and the execution time. In short, the data workflow is the concept that expresses the logical relationships among data sets in the data version management system of the present application, and data sets form their dependencies accordingly. The data workflow is the core function that allows data to be reproduced in the data version management system of the present application.
In the data workflow, a node represents a particular version of a data set. A directed edge connecting two nodes indicates that one data set was derived from the other. The label on an edge indicates the code version of one experiment. Referring to Fig. 5, which shows an example of a data workflow: a data workflow is a directed acyclic graph. Fig. 5 illustrates the two common data workflow structures, one-to-one and many-to-one.
In the one-to-one structure, one data set is derived from one other data set. For example, a user can create a new data set based on an existing data set, add some new labels to the new data set, and share it with other users. The many-to-one structure indicates that a data set is derived from two or more data sets; an operation that merges two data tables is an example.
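The graph structure described above, with nodes as data set versions and labelled directed edges as derivations, can be sketched as follows. The class `DataWorkflow` and its methods are illustrative assumptions; the node names A1 to A4 and the code versions `c-7` and `c-8` are made up for the demo.

```python
class DataWorkflow:
    """Directed acyclic graph of data set versions (hypothetical sketch)."""

    def __init__(self):
        # node -> {parent node: code version labelling the edge}
        self.parents = {}

    def add_root(self, node):
        """Register a brand-new data set, which has no parents."""
        if node in self.parents:
            raise ValueError(f"{node} already exists")
        self.parents[node] = {}

    def derive(self, child, parents, code_version):
        """Register a data set derived from one or more existing data sets."""
        if child in self.parents:
            raise ValueError(f"{child} already exists")
        missing = [p for p in parents if p not in self.parents]
        if missing:
            raise KeyError(f"unknown parent data sets: {missing}")
        self.parents[child] = {p: code_version for p in parents}

    def ancestors(self, node):
        """Return every data set that the given node was derived from."""
        seen = set()
        stack = list(self.parents[node])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen


flow = DataWorkflow()
flow.add_root("A1")
flow.add_root("A2")
flow.derive("A3", ["A1", "A2"], code_version="c-7")  # many-to-one
flow.derive("A4", ["A3"], code_version="c-8")        # one-to-one
lineage = flow.ancestors("A4")
```

In this sketch a node can only point at nodes that already exist and its parent set is never changed afterwards, so no cycle can be formed; the graph stays acyclic by construction rather than by an explicit check.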
To implement the data workflow structure described above, the data version management system of the present application introduces a parent-child node relationship: a parent attribute records which data set(s) a given data set was derived from. The system can also compare the differences between a parent data set and a child data set. These functions make it easier for users to see what results a modification of their code produced. The graph of the data workflow therefore not only clarifies the relationships between data sets but also helps manage a user's execution records, including regenerating results according to a version number.
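The parent-child comparison mentioned above can be illustrated with a small sketch. It assumes, purely for the example, that a data set is a mapping from record ID to label; the function name `diff_datasets` and the sample records are hypothetical.

```python
def diff_datasets(parent: dict, child: dict) -> dict:
    """Report what changed between a parent data set and its child."""
    added = {key: child[key] for key in child.keys() - parent.keys()}
    removed = {key: parent[key] for key in parent.keys() - child.keys()}
    changed = {
        key: (parent[key], child[key])
        for key in parent.keys() & child.keys()
        if parent[key] != child[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


parent_version = {"r1": "positive", "r2": "negative", "r3": "neutral"}
child_version = {"r1": "positive", "r2": "positive", "r4": "negative"}

difference = diff_datasets(parent_version, child_version)
```

Reading the result next to the code version on the connecting edge shows the user which records a given code change touched.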
The data version management system obtains and records the data version information formed after a change, based on the identification information (for example, ID values) of the data set and data before the change and on the data workflow. The data version information includes, but is not limited to, at least one of: data set name, data ID, code ID, formation time, and running log.
Here, the data version management system may capture, through a web/submission interface, a data processing request submitted by a user. According to that request, it sends to the execution engine module a corresponding execution command together with the execution code selected from the code management module, so that the execution engine module calls the corresponding execution back-end engine to run the selected execution code and thereby performs version management of the data set.
For example, in one specific implementation, whenever a user pushes execution code to the GitLab server, the GitLab server notifies the system core module 14 through a web hook. The system core module 14 pushes the user request into its own queue and takes requests from the head of the queue for processing. The system core module 14 copies the execution code of the request to the execution engine module 13, which then runs the execution code with the parameters and inputs provided by the user. After the task ends, the system core module 14 records the information of the current request, including the commit ID of the current push on the GitLab server, the parameters specified by the user, and any other information specifically related to the experiment. In some cases an experiment generates new data sets; the system core module 14 then also records the relationships between these data sets, i.e., the aforementioned data workflow.
A user may submit several kinds of data processing requests; each is described below:
In one case, when the data processing request is to submit a new data set, the data version management system may store the new data set directly in the first storage unit, create the data workflow of the new data set, and record the data version information formed. Alternatively, the data version management system selects the corresponding execution code according to the data processing request, executes it, and stores the data of the acquired new data set in the first storage unit. At the same time, the data version management system also creates the data workflow of the data set and records the data version information formed.
In an optional scheme, the new data set includes data and metadata. When selecting execution code, the data version management system selects execution code that saves data and metadata separately. Running the selected execution code extracts the data and the metadata from the new data set, stores the extracted data in the first storage unit and the extracted metadata in the second storage unit, and forms a new data ID that associates the data and metadata of the new data set. The data workflow of the data set is then created and the data version information formed is recorded. Here, the data workflow contains the data and the corresponding metadata as a root node. The recorded data version information includes: data set name, data ID, metadata ID, the correspondence between data ID and metadata ID, the code ID of the code that added the data set, formation time, and running log.
In another case, when the data processing request submitted by the user is to modify a stored data set, the data version management system copies a piece of execution code into the corresponding execution back-end engine according to the data processing request, and sends an execution command to that execution back-end engine so that it runs the execution code and forms a new data set. The new data set is then stored in the first storage unit, a data workflow of the new data set relative to the data set before modification is created, and the data version information formed is recorded.
In one optional approach, the data version management system extracts the data of the new data set and stores it in the first storage unit, extracts the metadata of the new data set and stores it in the second storage unit, forms a new data ID that associates the data and metadata of the new data set, and forms a code ID that associates the execution code with the new data set. It then creates the data workflow of the data set and records the data version information formed.
More preferably, the execution code that the data version management system copies to the execution back-end engine according to the data processing request is either new execution code submitted by the user or execution code already stored in, and called from, the code management module.
Specifically, a user may also submit new execution code in advance according to their own needs, and adjust, through the data version management system, the correspondence between the new execution code and each execution back-end engine. Thus, when the data processing request submitted by the user is to modify a stored data set, the data version management system decides, according to the data processing request, to copy the new execution code into the corresponding execution back-end engine. That execution back-end engine executes the code, the new data set is saved, a data workflow of the new data set relative to the data set before modification is created, and the data version information formed is recorded.
When a user needs to analyze and compute over data sets of different versions, a preferred approach in this embodiment is that the data version management method further includes a step of configuring multiple user UIs. In this step, multiple user UIs are configured so as to receive requests submitted by different users or to feed back the requested information to different users.
Specifically, users often need to analyze data sets and compute metrics, such as accuracy in natural language processing or the daily return on investment in stock market backtesting. After the data workflow and data version information of a data set have been created, the data version management system can present to the user the data workflow and the data in each associated data set. The displayed data can help the user compare a pair of historical analysis results and show the differences in code and/or parameters. Multiple UI designs aimed at different kinds of users can help each user arrive at the best algorithm and parameters.
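Comparing a pair of historical runs, as described above, might look like the following sketch. It assumes each run is stored as a dictionary with `code_id`, `code` and `params` keys; that layout, the function name `compare_runs` and the sample runs are all assumptions. The line diff uses `difflib.unified_diff` from the Python standard library.

```python
import difflib


def compare_runs(run_a: dict, run_b: dict):
    """Return the code diff and the parameter differences between two runs."""
    code_diff = list(difflib.unified_diff(
        run_a["code"].splitlines(),
        run_b["code"].splitlines(),
        fromfile=run_a["code_id"],
        tofile=run_b["code_id"],
        lineterm="",
    ))
    keys = sorted(run_a["params"].keys() | run_b["params"].keys())
    param_diff = {
        key: (run_a["params"].get(key), run_b["params"].get(key))
        for key in keys
        if run_a["params"].get(key) != run_b["params"].get(key)
    }
    return code_diff, param_diff


earlier = {"code_id": "c-1", "code": "model = train(data)\n",
           "params": {"lr": 0.1, "epochs": 5}}
later = {"code_id": "c-2", "code": "model = train(data, dropout=True)\n",
         "params": {"lr": 0.01, "epochs": 5}}

code_changes, param_changes = compare_runs(earlier, later)
```

A UI could render `code_changes` as a conventional diff view and `param_changes` as a two-column table next to each run's metric.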
In summary, the data version management system provided by the present application makes it possible to manage data versions and run user code within a single integrated system. It retains both the user's code and the user's data, and can compare two versions and find the differences. In addition, the system stores data and metadata separately, so that data can be filtered more efficiently. Furthermore, by providing separate version management for data sets and for code, the invention provides a directed acyclic workflow for each data set and its code and builds the association between the two, which effectively addresses problems such as inefficient or chaotic version management of data and code. Moreover, the multiple UI designs give users a convenient way to compare and analyze historical data sets. Finally, distributing the units across different servers helps relieve the operating load on each server.
Embodiment three
Referring to Fig. 3, which shows a structural schematic diagram of the code version management system of the present invention, the third aspect of the present application provides a code version management system. The code version management system may be deployed on a single server, a server cluster, a server based on a cloud computing architecture, or distributed servers. A server cluster gathers many servers together to perform data version management jointly and can use multiple computers for parallel computation to increase computing speed. A server based on a cloud computing architecture pools the storage of the servers through virtualization, so that the servers hosting the modules of the code version management system share computing resources. Distributed servers spread the data and programs of the code version management system over multiple servers that operate in coordination.
Each module of the code version management system can be deployed on any of the above kinds of server according to actual design needs. Specifically, the code version management system 2 includes: a data management module 21, a code management module 22, an execution engine module 23, and a system core module 24.
The data management module 21 stores at least one data set.
Here, a data set is a collection of data under version management. The data includes, but is not limited to, text data and/or multimedia data. In a specific embodiment, examples of text data are code data and system logs; examples of multimedia data are image data and video data. If the data management module 21 stores multiple data sets, the data sets may be unrelated or related to one another. For example, among data sets A1, A2 and A3, data set A3 is derived from data sets A1 and A2, and is associated with A1 and A2 through an index or an association field.
A data set may also include metadata for indexing or describing the data. Each item of data in the data set and its corresponding metadata can be associated through a data ID. Specifically, metadata, also called intermediary data or relay data, is data that describes data (data about data). It mainly describes the properties of the data and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata can be regarded as a kind of electronic catalog: to serve the purpose of cataloguing, it must describe and record the content or characteristics of the data, and it thereby assists data retrieval.
In an optional scheme, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID. Here, the first storage unit and the second storage unit may be deployed on the same database server, or on different servers according to actual needs. For example, in an optional embodiment, the first storage unit is deployed in the Hadoop distributed file system (a distributed file system for big data), and the second storage unit is deployed in a NoSQL (non-relational) database.
A distributed file system (Distributed File System) is designed on the client/server model; specifically, the physical storage resources managed by the file system are not necessarily attached directly to the local node, but are connected to the node through a computer network. The NoSQL database is, for example, a key-value store, a column store, a document database, a graph database, or a MongoDB database.
In this application, a distributed file system allows a data set to be scanned efficiently, but random access to it is inefficient. To solve this problem, the scheme provided by the present application stores the labels and annotations of each picture, such as file name, size, and content description, in the NoSQL database to speed up queries; the raw data and the metadata are connected by the data ID.
The code version management system of the present application records each field of the data; for example, a new data item is a set consisting of a new name and a version number. The embodiment specifically adopted in the present application uses a MongoDB database to store metadata, but is not limited to this; in other implementations, the metadata may instead be moved to a column store, a key-value store, a document database, or a graph database to improve efficiency.
The code management module 22 stores at least one piece of execution code, which is used to operate on at least one data set stored in the data management module 21.
When called, the execution code performs operations such as adding, deleting, and modifying on a data set and on the data within it. For example, the execution code includes, but is not limited to: execution code that adds a new data set, execution code that deletes a data set, execution code that adds labels/characters to the data in a given data set, execution code that deletes labels/characters from the data in a given data set, and execution code that replaces labels/characters in the data in a given data set.
In an optional embodiment, the execution code is stored in, for example, GitLab, and is interacted with through the GitLab API. GitLab is an open-source version management system developed with Ruby on Rails that implements a self-hosted Git project repository; public or private projects can be accessed through a web interface. GitLab offers functionality similar to GitHub: it can browse source code and manage defects and comments. It can manage team access to repositories, makes it easy to browse committed versions, and provides a file history. Team members can communicate using the built-in simple chat program (Wall). GitLab also provides a code snippet collection function that makes code reuse easy and simplifies later lookup.
In addition, the code management module 22 is also used to receive and store code pushed by a user, or to send a code processing request according to the code pushed by the user.
In one case, the code management module 22 receives and stores the code pushed by the user.
Specifically, the user uploads code to the code management module 22 from a user terminal, and the code management module 22 saves the received code. The code is execution code that the user has created or adapted according to the API provided by the code version management system 2. For example, the user improves a piece of execution code in the code management module 22 and uploads the improved execution code to the code management module 22 through the system core module 24 (described in detail below); the code management module 22 then updates and saves the received code.
In another case, the code management module 22 sends a code processing request according to the code pushed by the user.
Here, the code pushed by the user does not need to be configured in advance with the execution engine module 23 in the system core module 24. When the user selects previously pushed execution code to manage a data set, the code management module 22 sends a code processing request to the system core module 24 to inform it of the corresponding execution back-end engine.
The execution engine module 23 is configured with at least one execution back-end engine. On receiving an execution command, it calls one of the execution back-end engines according to the command and runs a piece of execution code to perform an operation on a data set in the data management module 21.
Here, an execution back-end engine is provided for each programming language in which execution code is written. The execution back-end engines include single-machine engines and distributed engines.
The single-machine engine is, for example, Python on a single machine. Python is free software; its source code and its interpreter, CPython, are released under the GPL-compatible Python Software Foundation License. When a program is executed, the source code in the .py file is first compiled into Python bytecode, and the Python Virtual Machine then executes the compiled bytecode.
The distributed engine is, for example, Spark on a cluster. Spark is a fast, general-purpose cluster computing framework whose core is written in Scala. It provides high-level APIs for the Scala, Java, and Python programming languages, with which parallel-processing applications can be developed easily.
In specific embodiments, configuring the execution back-end engines is necessary, because this not only spares the user from setting up a distributed cluster environment but also allows code and result data sets to be linked automatically. That is, in the code version management system of the present application, any intermediate result of the execution code can be obtained again as long as the user has retained the original data and code.
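The choice between a single-machine engine and a distributed engine can be sketched as a lookup from a back-end name to the command that would launch the code. The registry, the back-end names, and `build_command` are assumptions for illustration; the sketch only builds the command line and does not run anything. `spark-submit` is the standard Spark launcher, used here in its simplest form.

```python
import sys

# Hypothetical registry: back-end name -> function that builds the launch command.
BACKENDS = {
    "python-single": lambda script, args: [sys.executable, script, *args],
    "spark-cluster": lambda script, args: ["spark-submit", script, *args],
}

def build_command(backend, script, args):
    """Return the command line that would run `script` on the chosen back end."""
    if backend not in BACKENDS:
        raise ValueError(f"no execution back-end engine configured for {backend!r}")
    return BACKENDS[backend](script, list(args))

cmd = build_command("spark-cluster", "train.py", ["--epochs", "3"])
```

A real deployment would also need cluster options (master URL, resources) that the text does not specify, so they are left out here.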
The system core module 24 records the code pushed by the user and forms code version information. On receiving a code processing request from the code management module 22, it sends an execution command to the execution engine module 23, causing it to run the execution code in the code management module 22, and records the code version information formed after the execution code has performed an operation on a data set in the data management module 21.
Here, the functions of the system core module 24 are described in turn. In one mode, when the user pushes code, the system core module 24 passes the received code to the code management module 22 to be saved, and at the same time forms the code version information for that code. As with the data version information in the embodiments above, the code version information includes but is not limited to at least one of the following: code name, code ID, creation time, specified parameters, and run log. The system core module 24 can manage versions of the code in the same way that data sets are managed through the data set workflow. For example, if the user improves a stored piece of execution code, the code version management system 2 updates and saves the received code. At the same time, the code version management system also records the correspondence between the two pieces of execution code before and after the modification, forming a code workflow, and determines and records the modified code version information on the basis of the code version information before the modification.
In another mode, when the user selects pushed code to manage a data set, the system core module 24, based on the code processing request from the code management module 22, sends an execution command to the execution engine module 23, causing it to run the execution code in the code management module 22, and records the code version information formed after the execution code has performed an operation on a data set in the data management module 21. For example, the user selects previously pushed code, by entering parameters, code version information, and so on, to manage changes to the data in a data set; the system core module 24 then supplies the received parameters, code version information, and so on to the code management module 22. The code management module 22 determines from this information the execution code to be run and sends the corresponding code processing request to the system core module 24. The system core module 24 sends an execution command to the execution engine module 23 according to the code processing request. The execution engine module 23 then selects the corresponding execution back-end engine according to the execution command to run the specified execution code, and after the execution code has performed an operation on the data set in the data management module 21, the code version information formed is recorded. At least one of the code version information, the data workflow, and the data version information also records the correspondence between the executed code and the data set.
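The code version information and the code workflow described above can be pictured as a record that points back to the version it modifies. The following is a minimal sketch; the class name, field names, and the `revise` helper are assumptions for illustration, chosen to mirror the listed fields (code name, code ID, creation time, specified parameters, run log).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class CodeVersion:
    code_name: str
    code_id: str
    parameters: dict
    run_log: str = ""
    parent_id: Optional[str] = None  # previous version in the code workflow
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

def revise(previous, new_code_id, parameters):
    """Create the next version, linked back to the version it modifies."""
    return CodeVersion(previous.code_name, new_code_id, parameters,
                       parent_id=previous.code_id)

v1 = CodeVersion("tokenize", "c1", {"lowercase": True})
v2 = revise(v1, "c2", {"lowercase": False})
```

Following `parent_id` from any version back to `None` reconstructs the chain of modifications for one piece of execution code.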
In a specific implementation, whenever the user submits execution code to the GitLab server, the GitLab server notifies the system core module 24 through a web hook. The system core module 24 pushes the user's request onto its own queue while taking requests from the head of the queue for processing. The system core module 24 copies the request's execution code to the execution engine module 23 and sends a code processing request to the execution engine module 23 to inform it of the selected execution back-end engine. The execution engine module 23 then runs the execution code using the parameters and input provided by the user. After the task finishes, the system core module 24 records the information of the specific request, including the commit ID of the current push on the GitLab server, the execution code version information, the parameters specified by the user, and any other information specific to the experiment. In some cases an experiment generates new data sets; the system core module 24 then also records the relationships between these data sets, that is, the data workflow described in the foregoing embodiments.
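The queueing behaviour in this implementation can be sketched with a first-in, first-out queue. The class `CoreQueue`, its method names, and the `fake_run` stand-in for the execution engine are all hypothetical; the sketch does not use the GitLab API and only models what happens after a push notification arrives.

```python
from collections import deque

class CoreQueue:
    """Hypothetical FIFO of push notifications plus a log of finished runs."""

    def __init__(self):
        self.pending = deque()
        self.records = []

    def on_push(self, commit_id, backend, parameters):
        # Called when the (assumed) web hook reports a new push.
        self.pending.append({"commit_id": commit_id,
                             "backend": backend,
                             "parameters": parameters})

    def process_next(self, run):
        # Take the request at the head of the queue and run it.
        request = self.pending.popleft()
        outputs = run(request)  # names of any data sets the run produced
        record = dict(request, output_datasets=list(outputs))
        self.records.append(record)
        return record

def fake_run(request):
    # Stand-in for the execution engine; pretends one data set was produced.
    return ["result-of-" + request["commit_id"]]

core = CoreQueue()
core.on_push("abc123", "python-single", {"lr": 0.1})
core.on_push("def456", "spark-cluster", {"lr": 0.01})
first = core.process_next(fake_run)
```

The stored record keeps the commit ID, the parameters, and the names of the data sets produced, which is the information the paragraph above says is recorded after each task.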
It should be noted that the execution code that the system core module 24 copies to the execution engine module 23 according to the data processing request may be, besides new execution code submitted by the user, execution code already stored in the code management module 22. For example, the user designates a piece of execution code in the code management module 22 to process the corresponding data set, rather than using the default execution code in the code management module 22.
In an optional scheme, the system core module 24 also, according to a data processing request submitted by the user, copies a piece of execution code to the execution engine module 23 and sends an execution command to the execution engine module 23, causing it to run the execution code to form a new data set; it forms a code ID that associates the execution code with the new data set, and records the code version information formed.
The data processing request includes but is not limited to: submitting a new data set, or modifying a data set stored in the data management module 21.
Specifically, the data version management system described above forms the data workflow and the data version information, and the system core module 24 in this embodiment copies a piece of execution code to the execution engine module 23 and sends an execution command that causes the execution engine module 23 to run the execution code and form a new data set. On this basis, the system core module 24 also associates the ID of the executed code with the new data set and records the code version information formed. Thus, when the user needs to analyze the execution history of several associated data sets, the information about the execution code corresponding to each data set can be obtained, which gives the user more intermediate information for data set analysis and statistics.
As a preferred scheme, the code version management system 2 further includes a user interface module configured with multiple user UIs, so as to receive the requests submitted by different users or to feed back the requested information to different users.
Specifically, users often need to analyze data sets and compute certain metrics, such as accuracy in natural language processing or the daily return on investment in stock-market backtesting. After the data workflow and data version information of a data set have been created, the user interface module can present to the user, based on the data workflow, the data in each associated data set, the execution code, and so on. The displayed data helps the user compare a pair of historical analysis results and shows the differences in code and/or parameters. Multiple UI designs for different kinds of users help each user find the best algorithm and parameters.
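The comparison of a pair of historical results can be sketched as two small functions, one for parameters and one for code. The run records and function names below are invented for illustration; `difflib.unified_diff` is part of the Python standard library.

```python
import difflib

def diff_parameters(old, new):
    """Return {name: (old_value, new_value)} for every parameter that differs."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n))
            for n in names if old.get(n) != new.get(n)}

def diff_code(old_source, new_source):
    """Return a unified diff of two code versions as a list of lines."""
    return list(difflib.unified_diff(
        old_source.splitlines(), new_source.splitlines(),
        fromfile="run-1", tofile="run-2", lineterm=""))

run1 = {"params": {"lr": 0.1, "epochs": 5},
        "code": "x = load()\ntrain(x)\n"}
run2 = {"params": {"lr": 0.01, "epochs": 5, "seed": 7},
        "code": "x = load()\nx = clean(x)\ntrain(x)\n"}

param_changes = diff_parameters(run1["params"], run2["params"])
code_changes = diff_code(run1["code"], run2["code"])
```

A UI could render `param_changes` as a table and `code_changes` as a coloured diff; how the described system actually presents the differences is not specified in the text.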
In summary, the code version management system provided by the present application can manage code versions and run user code within one integrated system. It retains both the user's code and data and can compare two versions to find the differences. In addition, the code version management system of the invention stores data and metadata separately, so that data can be filtered more efficiently. Furthermore, by providing separate version management for data sets and for code, the invention gives each data set and each piece of code a directed acyclic workflow and builds the association between the two, which effectively solves problems such as inefficient or chaotic version management of data and code. The multiple UI designs give users a convenient way to compare and analyze historical data sets. Finally, distributing the units over different servers relieves the load on each server.
Example IV
Referring to Fig. 4, which shows a flowchart of the code version management method of the invention, the fourth aspect of the present application provides a code version management method. The method is mainly executed by a code version management system. The code version management system may be deployed on a single server, a server cluster, servers based on a cloud computing architecture, or distributed servers. A server cluster is a group of servers brought together to perform data version management jointly; it can use multiple computers for parallel computation to increase computing speed. Servers based on a cloud computing architecture pool the storage of the individual servers through virtualization, so that the servers hosting the modules of the code version management system share computing resources. Distributed servers spread the data and programs of the code version management system over multiple servers that operate in coordination.
Each module of the code version management system can be deployed on any of the above kinds of server according to actual design needs. Specifically, the code version management system carries out the method by executing the following steps.
In step S21, at least one data set and at least one piece of execution code for operating on the at least one data set are pre-stored, and at least one execution back-end engine for running the execution code is configured.
The data set is a set of data under version management. The data includes but is not limited to text data and/or multimedia data. In a specific embodiment, examples of text data are code data and system logs, and examples of multimedia data are image data and video data. If the data management module 21 stores multiple data sets, the data sets may be either unrelated or related to one another. For example, among data sets A1, A2, and A3, data set A3 is derived from data sets A1 and A2, and is associated with them through an index or an association field.
The data set may also include metadata for indexing or describing the data. Each data item in the data set and its corresponding metadata can be associated through a data ID. Specifically, metadata, also called intermediary data or relay data, is data that describes data (data about data). It mainly describes the properties of the data and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata is a kind of electronic catalogue: to serve as a catalogue, it must describe and collect the content or characteristics of the data, and in this way it assists data retrieval.
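The derivation relationship in the A1, A2, A3 example can be sketched as a mapping from each data set to the data sets it was derived from. The mapping name, the extra data set A4, and the `ancestors` function are assumptions added for illustration.

```python
# Hypothetical derivation record: data set -> the data sets it was derived from.
# A4 is not in the text; it is added only to show an indirect dependency.
DERIVED_FROM = {"A1": [], "A2": [], "A3": ["A1", "A2"], "A4": ["A3"]}

def ancestors(name, derived_from):
    """Return every data set that `name` depends on, directly or indirectly."""
    seen = set()
    stack = list(derived_from[name])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(derived_from[current])
    return seen

a3_sources = ancestors("A3", DERIVED_FROM)
a4_sources = ancestors("A4", DERIVED_FROM)
```

Because each data set only points at data sets that existed before it, the structure stays acyclic and the traversal always terminates.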
In an optional scheme, the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and metadata of the data set are associated through a data ID. The first storage unit and the second storage unit may be deployed on the same database server, or on different servers according to actual needs. In an optional embodiment, for example, the first storage unit is deployed in the Hadoop Distributed File System (a distributed file system for big data), and the second storage unit is deployed in a NoSQL (non-relational) database.
A distributed file system is designed on the client/server model; specifically, the physical storage resources managed by the file system are not necessarily attached directly to the local node but are connected to the node through a computer network. The NoSQL database is, for example, a key-value store, a column-store database, a document database, a graph database, or a MongoDB database.
In the present application, a distributed file system can scan a data set efficiently, but random access to it is inefficient. To solve this problem, the scheme provided by the present application stores the identifier and annotations of each picture, such as file name, size, and content description, in the NoSQL database to speed up queries; the original data and the metadata are linked by the data ID.
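The split between bulk data and queryable metadata can be sketched as follows. The two in-memory containers merely stand in for the distributed file system and the NoSQL database, and every name and record in the example is hypothetical.

```python
# Hypothetical stores: a dict standing in for the file system (data ID -> bytes)
# and a list of documents standing in for the metadata database.
data_store = {"d1": b"<jpeg bytes 1>", "d2": b"<jpeg bytes 2>"}
metadata_store = [
    {"data_id": "d1", "filename": "cat.jpg", "size": 14, "description": "a cat"},
    {"data_id": "d2", "filename": "dog.jpg", "size": 14, "description": "a dog"},
]

def find_ids(metadata, **conditions):
    """Return the data IDs whose metadata matches every condition."""
    return [m["data_id"] for m in metadata
            if all(m.get(k) == v for k, v in conditions.items())]

def fetch(ids, store):
    """Load only the selected raw data, addressed by data ID."""
    return {i: store[i] for i in ids}

cat_ids = find_ids(metadata_store, description="a cat")
cats = fetch(cat_ids, data_store)
```

The query touches only the small metadata records; the large raw files are read afterwards, and only for the matching data IDs.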
The code version management system of the present application records each field of the data; for example, a new data item is a set consisting of a new name and a version number. In the embodiment specifically adopted in the present application, the metadata is stored in a MongoDB database, but the application is not limited to this: in other implementations, the metadata may be moved to a column-store database, a key-value store, a document database, or a graph database to improve efficiency.
The execution code describes add, delete, modify, and similar operations on data sets and on the data in them. For example, the execution code includes but is not limited to: execution code that adds a new data set, execution code that deletes a data set, execution code that adds labels/characters to the data in a preset data set, execution code that deletes labels/characters from the data in a preset data set, and execution code that replaces labels/characters in the data in a preset data set.
In an optional embodiment, the execution code is stored in, for example, GitLab, and the system interacts with it through the GitLab API. GitLab is an open-source version management system built with Ruby on Rails. It implements a self-hosted Git project repository whose public or private projects can be accessed through a web interface. GitLab offers functionality similar to GitHub: it can browse source code and manage defects and comments. It can manage team access to repositories, makes it easy to browse committed versions, and provides a file history. Team members can communicate using the built-in simple chat program (Wall). GitLab also provides a code snippet collection feature that makes code reuse easy and lets snippets be looked up later when needed.
Here, an execution back-end engine is provided for each programming language in which execution code is written. The execution back-end engines include single-machine engines and distributed engines.
The single-machine engine is, for example, Python on a single machine. Python is free software; its source code and its interpreter, CPython, are released under the GPL-compatible Python Software Foundation License. When a program is executed, the source code in the .py file is first compiled into Python bytecode, and the Python Virtual Machine then executes the compiled bytecode.
The distributed engine is, for example, Spark on a cluster. Spark is a fast, general-purpose cluster computing framework whose core is written in Scala. It provides high-level APIs for the Scala, Java, and Python programming languages, with which parallel-processing applications can be developed easily.
In specific embodiments, configuring the execution back-end engines is necessary, because this not only spares the user from setting up a distributed cluster environment but also allows code and result data sets to be linked automatically. That is, in the code version management system of the present application, any intermediate result of the execution code can be obtained again as long as the user has retained the original data and code.
In step S22, code pushed by the user is received and stored, and the code version information formed is recorded; or a code processing request is sent according to the code pushed by the user, an execution command is sent to the execution back-end engine to cause it to run the pre-stored execution code, and the code version information formed is recorded after the execution code has performed an operation on the pre-stored data set.
The code version management system receives and stores the code pushed by the user and records the code version information formed, specifically as follows:
The user uploads code to the code version management system from a user terminal; the code management module then saves the received code, and the code version information for that code is formed at the same time. The code may be execution code that the user has created or adapted according to the API provided by the code version management system. As with the data version information in the foregoing embodiments, the code version information includes but is not limited to at least one of the following: code name, code ID, creation time, specified parameters, and run log. The code version management system can manage versions of the code in the same way that data sets are managed through the data set workflow.
For example, if the user improves a stored piece of execution code, the code version management system updates and saves the received code. At the same time, the code version management system also records the correspondence between the two pieces of execution code before and after the modification, forming a code workflow, and determines and records the modified code version information on the basis of the code version information before the modification.
Alternatively, the code version management system sends a code processing request according to the code pushed by the user, sends an execution command to the execution back-end engine to cause it to run the pre-stored execution code, and records the code version information formed after the execution code has performed an operation on the pre-stored data set.
Specifically, the code pushed by the user need not be configured in advance with a corresponding execution back-end engine. Instead, when the user selects previously pushed execution code to manage a data set, the code version management system generates a code processing request that identifies the corresponding execution back-end engine, starts that engine, has it run the pushed execution code, and, after the execution code has performed an operation on a data set, records the code version information formed.
For example, the user selects previously pushed code, by entering parameters, code version information, and so on, to manage changes to the data in a data set. The code version management system determines the execution code to be run from the received parameters, code version information, and so on, and generates the corresponding code processing request. The system core module sends an execution command to the corresponding execution back-end engine according to the code processing request. The execution back-end engine then runs the specified execution code according to the execution command, and after the execution code has performed an operation on the data set in the data management module, the code version information formed is recorded. At least one of the code version information, the data workflow, and the data version information also records the correspondence between the executed code and the data set.
In a specific implementation, whenever the user submits execution code to the GitLab server, the GitLab server notifies the code version management system through a web hook. The code version management system pushes the user's request onto its own queue while taking requests from the head of the queue for processing, and copies the request's execution code to the corresponding execution back-end engine, thereby informing the selected execution back-end engine. The execution back-end engine then runs the execution code using the parameters and input provided by the user. After the task finishes, the code version management system records the information of the specific request, including the commit ID of the current push on the GitLab server, the execution code version information, the parameters specified by the user, and any other information specific to the experiment. In some cases an experiment generates new data sets; the code version management system then also records the relationships between these data sets, that is, the data workflow described in the foregoing embodiments.
It should be noted that the execution code that the code version management system copies to the corresponding execution back-end engine according to the data processing request may be, besides new execution code submitted by the user, execution code that is already stored. For example, the user designates a stored piece of execution code to process the corresponding data set, rather than using the default execution code in the code management module.
In an optional scheme, the code version management system, according to a data processing request submitted by the user, copies a piece of execution code to the execution back-end engine and sends an execution command to the execution back-end engine, causing it to run the execution code to form a new data set; it forms a code ID that associates the execution code with the new data set, and records the code version information formed.
The data processing request includes but is not limited to: submitting a new data set, or modifying a data set stored in the data management module.
Specifically, the data version management system described above forms the data workflow and the data version information, and the code version management system copies a piece of execution code to the execution back-end engine and sends an execution command that causes the execution back-end engine to run the execution code and form a new data set. On this basis, the code version management system also associates the ID of the executed code with the new data set and records the code version information formed. Thus, when the user needs to analyze the execution history of several associated data sets, the information about the execution code corresponding to each data set can be obtained, which gives the user more intermediate information for data set analysis and statistics.
As a preferred scheme, the code version management method further includes a step of configuring multiple user UIs. In this step, multiple user UIs are configured so as to receive the requests submitted by different users or to feed back the requested information to different users.
Specifically, users often need to analyze data sets and compute certain metrics, such as accuracy in natural language processing or the daily return on investment in stock-market backtesting. After the data workflow and data version information of a data set have been created, the code version management system can present to the user, based on the data workflow, the data in each associated data set, the execution code, and so on. The displayed data helps the user compare a pair of historical analysis results and shows the differences in code and/or parameters. Multiple UI designs for different kinds of users help each user find the best algorithm and parameters.
It should be noted that modules with the same name in the data version management system and the code version management system described in the embodiments of the invention can be shared, so that the two version management systems can simultaneously manage the respective versions of data sets and execution code.
In summary, the code version management system provided by the present application can manage code versions and run user code within one integrated system. It retains both the user's code and data and can compare two versions to find the differences. In addition, by providing separate version management for data sets and for code, the invention gives each data set and each piece of code a directed acyclic workflow and builds the association between the two, which effectively solves problems such as inefficient or chaotic version management of data and code. The UI design gives users a convenient way to compare and analyze historical data sets. Finally, distributing the units over different servers relieves the load on each server. The invention therefore effectively overcomes various shortcomings of the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the invention. Therefore, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (18)

1. A data version management system, characterized by comprising:
a data management module, which stores at least one data set;
a code management module, which stores at least one piece of execution code, the execution code being used to operate on the at least one data set;
an execution engine module, configured with at least one execution back-end engine, which calls an execution back-end engine according to a received command and runs a piece of execution code to perform an operation on at least one data set in the data management module;
a system core module, which, on receiving a data processing request submitted by a user, processes the data set in the data management module, creates the data workflow of the data set, and records the data version information formed, wherein the data workflow represents the sequential relationship between at least two data sets, and the sequential relationship includes the execution history and versions of a data set.
2. The data version management system according to claim 1, characterized in that: the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and the metadata of the data set are associated through a data ID.
3. The data version management system according to claim 2, characterized in that: when the data processing request received by the system core module is a request to submit a new data set, the system core module extracts the data of the new data set and stores it in the first storage unit, extracts the metadata of the new data set and stores it in the second storage unit, generates a new data ID that associates the data and the metadata of the new data set, creates the data workflow of the data set, and records the resulting data version information.
4. The data version management system according to claim 2, characterized in that: when the data processing request received by the system core module is a request to modify a data set stored in the data management module, the system core module copies an execution code into the execution engine module according to the data processing request and sends an execution command to the execution engine module so that it runs the execution code to produce a new data set; the system core module extracts the data of the new data set and stores it in the first storage unit, extracts the metadata of the new data set and stores it in the second storage unit, generates a new data ID that associates the data and the metadata of the new data set, generates a code ID that associates the execution code with the new data set, creates the data workflow of the data set, and records the resulting data version information.
5. The data version management system according to claim 4, characterized in that: the execution code that the system core module copies into the execution engine module according to the data processing request is either new execution code submitted by the user or execution code retrieved from the code management module.
6. The data version management system according to claim 2, 3 or 4, characterized in that: the first storage unit is deployed on the Hadoop Distributed File System, and the second storage unit is deployed in a NoSQL database.
7. The data version management system according to claim 1, 2, 3 or 4, characterized in that: it further comprises a user interface module configured with multiple user interfaces (UIs) for respectively receiving requests submitted by different users or returning request information to different users.
8. The data version management system according to claim 1, 2, 3 or 4, characterized in that: the data version information includes at least one of a data set name, a data ID, a code ID, a creation time, and a run log.
9. The data version management system according to claim 1, 2, 3 or 4, characterized in that: the execution back-end engine includes a single-machine engine and a distributed engine.
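As an illustration of claims 2, 3 and 8, the following minimal Python sketch shows one way a data ID could tie a data set's payload to its metadata, and how a version record might be written when a new data set is submitted. It is not the patented implementation: the names `DataManager`, `VersionRecord` and `submit` are hypothetical, and plain in-memory dictionaries stand in for the first and second storage units that claim 6 places on the Hadoop Distributed File System and in a NoSQL database.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class VersionRecord:
    """Data version information with the fields listed in claim 8 (names are illustrative)."""
    dataset_name: str
    data_id: str
    code_id: Optional[str]  # None when the data set was submitted directly by a user
    created_at: str
    run_log: str = ""


class DataManager:
    """Hypothetical stand-in for the data management module."""

    def __init__(self) -> None:
        # Assumption: dicts play the role of the first (data) and second (metadata) storage units.
        self.data_store: Dict[str, bytes] = {}
        self.meta_store: Dict[str, dict] = {}
        self.versions: List[VersionRecord] = []

    def submit(self, name: str, payload: bytes, metadata: dict,
               code_id: Optional[str] = None, run_log: str = "") -> str:
        """Store data and metadata separately, linked only by a freshly generated data ID."""
        data_id = uuid.uuid4().hex
        self.data_store[data_id] = payload
        self.meta_store[data_id] = dict(metadata, name=name)
        self.versions.append(VersionRecord(
            dataset_name=name,
            data_id=data_id,
            code_id=code_id,
            created_at=datetime.now(timezone.utc).isoformat(),
            run_log=run_log,
        ))
        return data_id


manager = DataManager()
data_id = manager.submit("sales", b"a,b\n1,2\n", {"format": "csv"})
```

Because both stores are keyed by the same data ID, either side can be looked up from the other without the payload having to carry its own metadata.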
10. A data version management method, characterized in that the method comprises the following steps:
pre-storing at least one data set and at least one execution code for operating on the at least one data set, and configuring at least one execution back-end engine for running the execution code; and
upon receiving a data processing request submitted by a user, calling an execution back-end engine to run the at least one execution code so as to process the at least one data set, creating a data workflow of the data set, and recording the resulting data version information, wherein the data workflow represents the secondary relationship between at least two data sets, and the secondary relationship includes the execution history and the version of a data set.
11. The data version management method according to claim 10, characterized in that: the data of the data set is stored in a first storage unit, the metadata of the data set is stored in a second storage unit, and the data and the metadata of the data set are associated through a data ID.
12. The data version management method according to claim 11, characterized in that: when the received data processing request submitted by the user is a request to submit a new data set, the data of the new data set is extracted and stored in the first storage unit, the metadata of the new data set is extracted and stored in the second storage unit, a new data ID is generated to associate the data and the metadata of the new data set, and the resulting data version information is recorded.
13. The data version management method according to claim 11, characterized in that: when the received data processing request submitted by the user is a request to modify a stored data set, an execution code is copied into the corresponding execution back-end engine according to the data processing request, and an execution command is sent to the execution back-end engine so that it runs the execution code to produce a new data set; the data of the new data set is extracted and stored in the first storage unit, the metadata of the new data set is extracted and stored in the second storage unit, a new data ID is generated to associate the data and the metadata of the new data set, a code ID is generated to associate the execution code with the new data set, and the resulting data version information is recorded.
14. The data version management method according to claim 13, characterized in that: the execution code copied into the execution back-end engine according to the data processing request is either new execution code submitted by the user or pre-stored execution code that is retrieved.
15. The data version management method according to claim 11, 12 or 13, characterized in that: the first storage unit is deployed on the Hadoop Distributed File System, and the second storage unit is deployed in a NoSQL database.
16. The data version management method according to claim 11, 12, 13 or 14, characterized in that: it further comprises a step of configuring multiple user interfaces (UIs) for respectively receiving requests submitted by different users or returning request information to different users.
17. The data version management method according to claim 11, 12, 13 or 14, characterized in that: the data version information includes at least one of a data set name, a data ID, a code ID, a creation time, and a run log.
18. The data version management method according to claim 11, 12, 13 or 14, characterized in that: the execution back-end engine includes a single-machine engine and a distributed engine.
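The modification path of claims 10, 13 and 14 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the method as actually implemented: `LocalEngine`, `VersionManager`, `modify` and `history` are hypothetical names, execution code is modelled as a plain Python callable from bytes to bytes, and an in-memory list of edges stands in for the recorded relationship between data sets.

```python
import uuid
from typing import Callable, Dict, List, Tuple

Transform = Callable[[bytes], bytes]  # assumption: execution code is a bytes -> bytes callable


class LocalEngine:
    """Hypothetical single-machine execution back-end engine."""

    def run(self, code: Transform, data: bytes) -> Tuple[bytes, str]:
        result = code(data)
        return result, f"ok: {len(data)} bytes in, {len(result)} bytes out"


class VersionManager:
    """Hypothetical core that derives new data sets and records their lineage."""

    def __init__(self, engine: LocalEngine) -> None:
        self.engine = engine
        self.data: Dict[str, bytes] = {}     # data ID -> payload
        self.names: Dict[str, str] = {}      # data ID -> data set name (metadata)
        self.code: Dict[str, Transform] = {} # code ID -> execution code
        self.workflow: List[dict] = []       # one edge per derivation

    def submit_dataset(self, name: str, payload: bytes) -> str:
        data_id = uuid.uuid4().hex
        self.data[data_id] = payload
        self.names[data_id] = name
        return data_id

    def modify(self, parent_id: str, code: Transform, name: str) -> str:
        """Run code on a stored data set; the parent is kept and a new data set is created."""
        code_id = uuid.uuid4().hex
        self.code[code_id] = code
        result, log = self.engine.run(code, self.data[parent_id])
        child_id = self.submit_dataset(name, result)
        self.workflow.append(
            {"parent": parent_id, "child": child_id, "code_id": code_id, "log": log}
        )
        return child_id

    def history(self, data_id: str) -> List[dict]:
        """Walk derivation edges from a data set back to its original ancestor."""
        chain: List[dict] = []
        current = data_id
        while True:
            edge = next((e for e in self.workflow if e["child"] == current), None)
            if edge is None:
                return chain
            chain.append(edge)
            current = edge["parent"]


vm = VersionManager(LocalEngine())
raw_id = vm.submit_dataset("raw", b"hello")
upper_id = vm.modify(raw_id, bytes.upper, "upper")
reversed_id = vm.modify(upper_id, lambda b: b[::-1], "reversed")
lineage = vm.history(reversed_id)
```

Each call to `modify` leaves the parent data set untouched and adds one edge carrying the code ID and run log, so the execution history of any data set can be recovered by following edges back toward the original submission.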
CN201610282533.2A 2016-04-29 2016-04-29 Data version management system and method Active CN105956087B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610282533.2A CN105956087B (en) 2016-04-29 2016-04-29 Data version management system and method
CN201910359068.1A CN110119393B (en) 2016-04-29 2016-04-29 Code version management system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610282533.2A CN105956087B (en) 2016-04-29 2016-04-29 Data version management system and method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910359068.1A Division CN110119393B (en) 2016-04-29 2016-04-29 Code version management system and method

Publications (2)

Publication Number Publication Date
CN105956087A CN105956087A (en) 2016-09-21
CN105956087B true CN105956087B (en) 2019-08-30

Family

ID=56914515

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201610282533.2A Active CN105956087B (en) 2016-04-29 2016-04-29 Data version management system and method
CN201910359068.1A Active CN110119393B (en) 2016-04-29 2016-04-29 Code version management system and method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910359068.1A Active CN110119393B (en) 2016-04-29 2016-04-29 Code version management system and method

Country Status (1)

Country Link
CN (2) CN105956087B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268275B (en) * 2017-06-12 2021-05-04 平安普惠企业管理有限公司 Software version control method and software version control equipment
CN108170756B (en) * 2017-12-22 2021-12-03 南京邮电大学 Implementation method of multidimensional, multi-view and visual annotation system based on Git warehouse
CN108228231B (en) * 2018-01-08 2021-07-27 南京邮电大学 Visualization drifting method of Git warehouse file annotation system
CN109032592A (en) * 2018-08-23 2018-12-18 常熟市盛铭信息技术有限公司 A kind of method that software code is shared mutually
CN109302448B (en) * 2018-08-27 2020-10-09 华为技术有限公司 Data processing method and device
CN110059096A (en) * 2019-03-16 2019-07-26 平安城市建设科技(深圳)有限公司 Data version management method, apparatus, equipment and storage medium
CN111221566B (en) * 2019-12-28 2021-10-22 华为技术有限公司 Method and device for combining multiple and changeable versions of software code
CN111198711B (en) * 2020-01-13 2023-02-28 陕西心像信息科技有限公司 Collection version control method and system based on MongoDB
CN111506779B (en) * 2020-04-20 2021-03-16 东云睿连(武汉)计算技术有限公司 Object version and associated information management method and system facing data processing
CN112698866B (en) * 2021-01-06 2022-06-17 中国科学院软件研究所 Code line life cycle tracing method based on Git and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1592291A (en) * 2003-08-28 2005-03-09 国际商业机器公司 Method and system for managing service state data
CN101076793A (en) * 2004-08-31 2007-11-21 国际商业机器公司 System structure for enterprise data integrated system
CN101770608A (en) * 2008-12-26 2010-07-07 新奥特(北京)视频技术有限公司 Management method and device of engineering versions

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078205A1 (en) * 2000-11-17 2002-06-20 Lloyd Nolan Resource control facility
US20030005408A1 (en) * 2001-07-02 2003-01-02 Pradeep Tumati System and method for creating software modifiable without halting its execution
US20070282927A1 (en) * 2006-05-31 2007-12-06 Igor Polouetkov Method and apparatus to handle changes in file ownership and editing authority in a document management system
CN101246420A (en) * 2007-12-29 2008-08-20 中国建设银行股份有限公司 Method and system for multi-language system implementing unified development
CN101276279B (en) * 2008-05-21 2010-12-08 天柏宽带网络科技(北京)有限公司 Unified development system and method
CN103049268B (en) * 2012-12-25 2016-08-03 中国科学院深圳先进技术研究院 A kind of application and development based on Naplet management system
CN103729195B (en) * 2014-01-15 2017-04-05 北京奇虎科技有限公司 A kind of control method and system of software version
CN103970579B (en) * 2014-05-29 2017-05-03 中国银行股份有限公司 Application version deploying method and application version deploying device
CN105094851A (en) * 2015-09-06 2015-11-25 浪潮软件股份有限公司 Method for momentarily issuing codes based on Git

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
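Research on a Product Data Management System Based on Colored Timed Workflow Nets; Li Tao; China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology; 20061115; 69-87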

Also Published As

Publication number Publication date
CN110119393A (en) 2019-08-13
CN105956087A (en) 2016-09-21
CN110119393B (en) 2021-04-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190603

Address after: 710077 Room 101, Block B, Yunhui Valley, 156 Tiangu Eighth Road, Software New Town, Yuhua Street, Xi'an High-tech Zone, Shaanxi Province

Applicant after: Cross Information Core Technology Research Institute (Xi'an) Co., Ltd.

Address before: 100084 Qinghua Garden, Haidian District, Beijing

Applicant before: Tsinghua University

GR01 Patent grant