CN114911768A

CN114911768A - Method, device, equipment and storage medium for managing data set version based on Git

Info

Publication number: CN114911768A
Application number: CN202210568625.2A
Authority: CN
Inventors: 杜松显; 卢江涛; 唐伟; 王家奇; 吕标彪
Original assignee: Hangzhou Yele Technology Co ltd
Current assignee: Hangzhou Yele Technology Co ltd
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2022-08-16

Abstract

The application relates to the field of deep learning and provides a Git-based method, device, equipment and storage medium for managing data set versions. The method comprises the following steps: determining a picture label category according to the model training task requirement corresponding to a target deep learning model; searching the version data sets of a data set list file for the version data set corresponding to the picture label category, and taking it as the latest version data set; and training and generating the target deep learning model according to the latest version data set. Being based on the Git system, the invention reduces errors caused by referencing multiple versions of a data set, lets a user backtrack by accessing historical versions, solves the technical problem of unified version management for unstructured data such as pictures, and improves the user's experience with the data set.

Description

Method, device, equipment and storage medium for managing data set version based on Git

Technical Field

The invention relates to the technical field of deep learning, in particular to a method, a device and equipment for managing a data set version based on Git and a computer readable storage medium.

Background

With the continuous development of deep learning technology, machines can acquire human-like analysis and learning abilities and can recognize data such as text, images and sound. Building a deep learning model depends on three elements: algorithms, computing power and data, and the model must be optimized by continuously adding data with feature labels. Currently, constructing a deep learning model for a given scene requires sample data from that scene, so different scenes produce multiple versions of a data set; this is especially true of unstructured data sets such as pictures. When multiple versions of the same data set are referenced in the same project, errors can occur during generation, and the user cannot access historical versions to backtrack. Therefore, how to realize unified version management of unstructured data such as pictures on the basis of Git version management, and thereby improve the user's experience with the data set, has become a technical problem to be solved urgently.

Disclosure of Invention

The invention mainly aims to provide a method, a device and equipment for managing the version of a dataset based on Git and a computer readable storage medium, and aims to solve the technical problem of unified version management of unstructured data such as pictures.

In order to achieve the above object, the present invention provides a Git-based data set version management method, including: determining a picture label category according to the model training task requirement corresponding to a target deep learning model; searching the version data sets of a data set list file for the version data set corresponding to the picture label category, and taking it as the latest version data set; and training and generating the target deep learning model according to the latest version data set.

Further, before the step of determining the picture label category according to the model training task requirement corresponding to the target deep learning model, the method further includes:

extracting pictures from an original video file according to frames to obtain an original picture set;

acquiring picture names and related attributes in the original picture set, and generating a data set information file;

the training and generating the target deep learning model according to the latest version data set further comprises:

generating sample data according to the latest version data set and the data set information file;

and training and generating the target deep learning model according to the sample data.

Further, the determining the picture label category according to the model training task requirement corresponding to the target deep learning model includes:

and acquiring a model training task requirement corresponding to the target deep learning model, determining a target label group, and taking the target label group as the picture label category, wherein the picture label category comprises at least product, task, camera and project labels.

Further, the searching for the version data set corresponding to the picture tag category in each version data set of the data set list file as a latest version data set includes:

comparing the picture label category with category information in the data set list file;

and when the category information same as the picture label category exists in the data set list, determining a version data set corresponding to the category information same as the picture label category as the latest version data set.

Further, after comparing the picture tag category with the category information in the data set list file, the method further includes:

and when the category information which is the same as the picture label category does not exist in the data set list, generating the latest version data set according to the picture label category and the original picture set.

Further, the training to generate the target deep learning model according to the latest version data set comprises:

the latest version data set is transmitted into a labeling system, and the latest version data set is labeled based on the picture label category to generate a label file set;

storing the label file set in an object storage system based on the picture label category as a sample data set;

and transmitting the sample data set into a training server, performing model training, and generating the target deep learning model.

Further, the transferring the latest version data set into an annotation system further comprises:

transmitting the latest version data set into a labeling system, and generating a checksum;

comparing a check value of the checksum with a picture quantity value in the latest version data set;

if the check value is the same as the picture quantity value, the latest version data set is transmitted without errors;

and if the check value is different from the picture quantity value, feeding back error information of the latest version data set transmission to a sender.

In addition, to achieve the above object, the present invention also provides a Git-based data set version management apparatus, the apparatus including:

the label category determining module is used for determining the picture label category according to the model training task requirement corresponding to the target deep learning model;

the searching module is used for searching the version data set corresponding to the picture label category in each version data set of the data set list file to serve as the latest version data set;

and the training module is used for training and generating the target deep learning model according to the latest version data set.

In addition, to achieve the above object, the present invention further provides a Git-based data set version management device, which includes a processor, a memory, and a Git-based data set version management program stored on the memory and executable by the processor, wherein the Git-based data set version management program, when executed by the processor, implements the steps of the Git-based data set version management method as described above.

In addition, to achieve the above object, the present invention also provides a computer readable storage medium, on which a Git-based data set version management program is stored, wherein when the Git-based data set version management program is executed by a processor, the steps of the Git-based data set version management method are implemented.

The invention provides a Git-based data set version management method, which determines the picture label category according to the model training task requirement corresponding to a target deep learning model; searches the version data sets of the data set list file for the version data set corresponding to the picture label category, and takes it as the latest version data set; and trains and generates the target deep learning model according to the latest version data set. On the basis of Git version management, errors caused by referencing multiple versions of a data set are thereby reduced, the user can conveniently backtrack by accessing historical versions, the technical problem of unified version management for unstructured data such as pictures is solved, and the user's experience with the data set is improved.

Drawings

FIG. 1 is a diagram illustrating a hardware structure of a Git-based data set version management device according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a first embodiment of a Git-based data set version management method according to the present invention;

FIG. 3 is a functional module diagram of a first embodiment of the Git-based data set version management apparatus according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The Git-based data set version management method is mainly applied to a Git-based data set version management device, which may be a device with display and processing functions, such as a PC (personal computer), a portable computer or a mobile terminal.

Referring to FIG. 1, FIG. 1 is a schematic diagram of a hardware structure of a Git-based data set version management device according to an embodiment of the present invention. In an embodiment of the present invention, the Git-based data set version management device may include a processor 1001 (e.g., a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used for implementing connection and communication among the components; the user interface 1003 may include a display screen and an input unit such as a keyboard; the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface); the memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory), and optionally, the memory 1005 may be a storage device independent of the processor 1001.

Those skilled in the art will appreciate that the hardware configuration shown in FIG. 1 does not constitute a limitation of the Git-based dataset version management device, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.

With continued reference to FIG. 1, the memory 1005 of FIG. 1, which is one type of computer-readable storage medium, may include an operating system, a network communication module, and a Git-based dataset version management program.

In fig. 1, the network communication module is mainly used for connecting to a server and performing data communication with the server; and the processor 1001 may call the Git-based dataset version management program stored in the memory 1005 and execute the Git-based dataset version management method provided by the embodiment of the present invention.

The embodiment of the invention provides a data set version management method based on Git.

Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a Git-based data set version management method according to the present invention.

In this embodiment, the method for managing the version of the dataset based on Git includes the following steps:

step S10, determining the picture label category according to the model training task requirement corresponding to the target deep learning model;

In this embodiment, before step S10, pictures are extracted from an original video file frame by frame to obtain an original picture set, and the picture names and related attributes in the original picture set are acquired to generate a data set information file. Deep learning learns the intrinsic patterns and representation hierarchy of sample data, and the information obtained during learning greatly helps in interpreting data such as text, images and sound. Its ultimate goal is to give machines human-like analysis and learning abilities so that they can recognize such data. Building a deep learning model depends on sample data, and the sample data required differs from scene to scene. Preprocessing massive data according to the training requirements of the deep model reduces data redundancy. The main work of machine learning is to extract useful features, determine classification labels, and screen and classify data sets, which speeds up subsequent data processing.

It can be understood that, according to the training requirements of the model task, features can be extracted from the mass data and the label category determined. Because information such as pictures, video and audio is unstructured data, storing and retrieving it is inconvenient. The information described by a label, however, is universal: a label can be applied to any data structure. Attaching labels to unstructured data such as pictures converts it into structured data, which facilitates subsequent storage, search and management.

Step S20, searching the version data sets of the data set list file for the version data set corresponding to the picture label category, and taking it as the latest version data set;

In an embodiment, the picture label category is compared with the category information in the data set list file. When category information identical to the picture label category exists in the data set list, the version data set corresponding to that category information is determined to be the latest version data set; when no category information identical to the picture label category exists in the data set list, the latest version data set is generated according to the picture label category and the original picture set.

Specifically, in response to a deep learning model training task, an original video file is obtained from remote central storage and an original picture set is generated. When training on pictures in deep learning, a list usually has to be built from the picture information, and this list is stored as the data set list file. Based on a shuffle mechanism, a training task data set list is generated from the original picture set; the data set list file of the original picture set is then traversed, and the existing data set feature vectors are separated from the picture label category columns. Typically, each line of the data set list file may have the following format: { "product": "product", "camera": "camera", "task": "task", "project": "project", "labels": ["label01", "label02", "label03"], "version": "label version", "batches": total number of batches, "count": total data amount }.
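As a non-limiting illustration, one line of the data set list file can be read back into its fields as sketched below. The sketch assumes the file stores one JSON object per line, which the description does not state; the field values, the constant `CATEGORY_FIELDS` and the helper names `parse_entry` and `category_key` are hypothetical and are introduced only for this example.

```python
import json

# Field names follow the per-line format described above; the values are made up.
ENTRY_LINE = json.dumps({
    "product": "demo-product",
    "camera": "cam-01",
    "task": "detection",
    "project": "demo-project",
    "labels": ["label01", "label02", "label03"],
    "version": "v1",
    "batches": 4,
    "count": 400,
})

# Assumed set of label-category columns (product, task, camera, project).
CATEGORY_FIELDS = ("product", "task", "camera", "project")


def parse_entry(line: str) -> dict:
    """Parse one line of the data set list file into a dictionary."""
    return json.loads(line)


def category_key(entry: dict) -> tuple:
    """Return the label-category columns of an entry as a comparable tuple."""
    return tuple(entry[field] for field in CATEGORY_FIELDS)


entry = parse_entry(ENTRY_LINE)
print(category_key(entry))
```

Keeping the category columns as a tuple makes two entries directly comparable with `==`, which is all the comparison step of S20 requires in this sketch.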

In addition, because machine learning assumes that the data is independent and identically distributed (i.i.d.), the occurrence of any sample needs to be random. In machine learning and deep learning, shuffle means randomly reordering the data set used to train the model. Even when the samples are balanced, the original picture data may be arranged in a certain order, for example with the first half belonging to one category and the second half to another. After shuffling, the arrangement is random, so that when the data is read sequentially the next sample is equally likely to belong to any category.
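The effect of the shuffle mechanism can be illustrated with a minimal sketch using Python's standard `random` module. The file names and the helper `shuffled` are hypothetical, and the fixed seed is an assumption made only so that the result is repeatable.

```python
import random
from typing import List


def shuffled(paths: List[str], seed: int) -> List[str]:
    """Return a shuffled copy of paths; the input list is left untouched."""
    rng = random.Random(seed)  # private generator so the result is repeatable
    result = list(paths)
    rng.shuffle(result)
    return result


# Ordered input: the first half is one category, the second half another.
ordered = [f"cat_{i}.jpg" for i in range(5)] + [f"dog_{i}.jpg" for i in range(5)]
mixed = shuffled(ordered, seed=0)
print(mixed)
```

The shuffled list contains exactly the same pictures as the input; only their order changes.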

It can be understood that, in data set management, data from the same data source (such as the original video file in this embodiment) that was labeled at different times can be distinguished by version, which makes it convenient to select the appropriate data set version in subsequent model building and development. After data labeling is completed, the current state of the data set can be released to generate a new data set version. A newly created data set that has not yet been released has no version information; the release operation must be performed before a version becomes available and can be applied to model development or training. Each version data set is recorded in the data set list file, and the version data set corresponding to the picture label category can be found by traversing that file, so that all version data sets are conveniently managed.

Step S30, training and generating the target deep learning model according to the latest version data set.

In this embodiment, sample data is generated according to the latest version data set and the data set information file; and training and generating the target deep learning model according to the sample data.

Specifically, during the development of a machine learning model, the trained model is expected to perform well on new, unseen data. To simulate such data, the latest version data set is split into two parts (sometimes referred to as a train-test split). The first part is a larger subset used as the training set (e.g., 80% of the original data), and the second part is typically a smaller subset used as the test set (the remaining 20%). A prediction model is built using the training set, and the trained model is then applied to the test set (i.e., as new, unseen data) for prediction. The best model is selected according to its performance on the test set, and hyper-parameter optimization can be carried out to obtain it. The training process is called in a loop, and each round comprises three steps: forward computation, evaluation of the loss function (the optimization target), and back propagation. The target deep learning model obtained by training is then saved. In this way the sample data is used to the fullest, the soundness of the model is ensured, and the target deep learning model is trained without version confusion.
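The 80/20 split described above can be sketched as follows. This is a minimal illustration and not the claimed implementation; the function name `train_test_split`, the default seed and the sample file names are assumptions made for the example.

```python
import random
from typing import List, Tuple


def train_test_split(samples: List[str], train_ratio: float = 0.8,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of samples and cut it into (train, test) subsets."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    items = list(samples)
    random.Random(seed).shuffle(items)  # shuffle first so the cut is random
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]


samples = [f"img_{i:03d}.jpg" for i in range(100)]
train, test = train_test_split(samples)
print(len(train), len(test))  # 80 20
```

Shuffling before cutting keeps the test set from inheriting any ordering present in the original picture set.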

Based on the foregoing embodiment shown in fig. 2, in this embodiment, before the step S10, the method further includes:

and acquiring the picture name and the related attribute in the original picture set, and generating a data set information file.

In this embodiment, model training data is often collected by recording a video and then extracting frames to obtain a plurality of pictures, which are placed in the same folder to form an original picture set. After the original picture set is generated, the picture names and related attributes are collected, such as a user's renaming of a picture, or its size or opening mode. Because this information is user-specific and of little value as training data, it does not need to be added to the cache queue for version management; it is instead written into a data set information file. A new file named .gitignore is created in the Git working area and the data set information files to be ignored are listed in it; Git then automatically ignores these files and does not upload them to the Git repository.
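By way of example only, listing the data set information file in .gitignore can be sketched with the Python standard library. The helper name `add_ignore_pattern` and the file name `dataset_info.jsonl` are hypothetical; the sketch only edits the text file and does not invoke Git itself.

```python
import tempfile
from pathlib import Path


def add_ignore_pattern(repo_dir: Path, pattern: str) -> bool:
    """Append pattern to repo_dir/.gitignore unless it is already listed.

    Returns True when the file was changed.
    """
    ignore_file = repo_dir / ".gitignore"
    if ignore_file.exists():
        existing = ignore_file.read_text(encoding="utf-8").splitlines()
    else:
        existing = []
    if pattern in existing:
        return False
    existing.append(pattern)
    ignore_file.write_text("\n".join(existing) + "\n", encoding="utf-8")
    return True


# Demonstration in a throwaway directory standing in for the working area.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    print(add_ignore_pattern(repo, "dataset_info.jsonl"))  # True: pattern added
    print(add_ignore_pattern(repo, "dataset_info.jsonl"))  # False: already listed
```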

Specifically, video frames are generally extracted with an FFmpeg command. After the video data has been extracted into pictures, the similarity between key frames is still high, so the data set to be trained contains many redundant pictures; an image similarity measurement technique is therefore added for further screening. This technique currently has many application scenarios, such as searching for images by image and removing near-duplicate images. Image similarity measurement methods include the histogram algorithm, cosine similarity, structural similarity, the average hash algorithm and the difference hash algorithm. Meanwhile, the data set information file is listed in .gitignore so that it is excluded from version management, which reduces storage pressure. Each line of the data set information file has the following format: { "name": "picture storage path", "extension": "picture file format suffix", "width": picture pixel width, "height": picture pixel height, "batch": "data set batch", "version": "label version", "product": "product", "camera": "camera", "task": "task", "project": "project", "checksum": "check value", "attributes": ["attribute01", "attribute02"] }. Similarly, other files can also be ignored: files automatically generated by the operating system, such as thumbnails; intermediate files and executable files generated by compilation, that is, any file generated automatically from another file, such as the .class files produced by Java compilation, need not be put into the version repository; and personal configuration files containing sensitive information, such as a configuration file that stores passwords. Preprocessing the picture information in this way facilitates version management and speeds up acquisition of the required sample data.
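Of the similarity measures listed above, the average hash algorithm is simple enough to sketch in a few lines. The sketch below assumes each frame has already been converted to grayscale and downscaled to a small grid (practical implementations commonly use 8x8); the 2x2 frames and the helper names are hypothetical and serve only to make the arithmetic easy to follow.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale values, already downscaled


def average_hash(pixels: Gray) -> int:
    """Average hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


frame_a = [[10, 200], [220, 30]]
frame_b = [[12, 198], [225, 28]]   # nearly the same picture as frame_a
frame_c = [[200, 10], [30, 220]]   # bright and dark regions swapped

print(hamming_distance(average_hash(frame_a), average_hash(frame_b)))  # 0
print(hamming_distance(average_hash(frame_a), average_hash(frame_c)))  # 4
```

A small Hamming distance marks two frames as near-duplicates, so one of them can be dropped; the threshold to use is a tuning choice that the description does not specify.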

Further, the step S10 further includes:

In this embodiment, the label category is determined based on the task training target; one task requirement corresponds to one group of label categories, and subsequent work is performed around that group. At present, image classification is an important foundation for object detection and semantic segmentation; it aims to assign different images to different categories with minimum classification error. In this application the labels are simplified to illustrate the classification problem more intuitively and clearly: the picture label category includes at least product, task, camera and project labels. In practice, however, as the number of labels grows into the hundreds, it becomes very difficult for a user to find a particular label. The labels therefore need to be organized hierarchically from the initial stage of construction, much like organizing folders on a computer; clearly classified labels are easier to query and use.

Based on the foregoing embodiment shown in fig. 2, in this embodiment, the step S20 further includes:

In this embodiment, the data set list file is compared with the picture label category. Based on a shuffle mechanism, a training task data set list is generated from the original picture set; the data set list file of the original picture set is traversed, and the existing data set feature vectors are separated from the picture label category columns. It is then judged whether category information identical to the picture label category exists in the data set list. If it does, the current version data set is used as the latest version data set, and the data set to be stored is compressed to reduce storage pressure.

Further, the step S20 further includes:

and when the category information identical to the picture label category does not exist in the data set list, generating the latest version data set according to the picture label category and the original picture set.

In this embodiment, if traversing the data set list file of the original picture set finds no category information identical to the picture label category, the data set version is recorded in the system, and a new data set is generated from the original picture set based on the current picture label category and used as the latest version data set. This greatly increases the flexibility of data set versions and avoids confusion among multi-version data sets.
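The two branches of step S20 (reuse a matching version data set, or generate a new one) can be sketched as follows. This is an illustration under stated assumptions: the list entries are held in memory as dictionaries, the first matching entry is returned because the description does not say how to choose among several matches, and the version name `"v1"` and the helper names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed label-category columns (product, task, camera, project).
CATEGORY_FIELDS = ("product", "task", "camera", "project")


def find_version_dataset(entries: List[Dict], category: Dict) -> Optional[Dict]:
    """Return the first list entry whose category columns equal `category`."""
    for entry in entries:
        if all(entry.get(f) == category.get(f) for f in CATEGORY_FIELDS):
            return entry
    return None


def find_or_create(entries: List[Dict], category: Dict,
                   original_pictures: List[str]) -> Dict:
    """Reuse a matching version data set, else record a new one in the list."""
    found = find_version_dataset(entries, category)
    if found is not None:
        return found
    # No match: build a new entry from the original picture set.
    new_entry = dict(category, version="v1", count=len(original_pictures))
    entries.append(new_entry)
    return new_entry


entries = [{"product": "p1", "task": "detect", "camera": "c1",
            "project": "x", "version": "v3", "count": 120}]
wanted = {"product": "p1", "task": "detect", "camera": "c1", "project": "x"}
other = {"product": "p2", "task": "detect", "camera": "c1", "project": "x"}

print(find_or_create(entries, wanted, [])["version"])                # existing entry
print(find_or_create(entries, other, ["a.jpg", "b.jpg"])["count"])   # new entry
```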

In addition, the present application further provides another embodiment, in which step S30 further specifically includes:

In this embodiment, in Git, a file directory that needs to be version-controlled is called a repository. Each repository can be understood simply as a directory in which all files are version-managed by Git, and Git can track and record every update that occurs in the directory. The user acquires the latest version data set from the Git repository, and the labeling system reads the unlabeled part of the data from it. The labeled data set then enters the quality evaluation module and, after passing quality evaluation, is transmitted to the object storage system as the labeled data set, which is synchronized to the user for use. Picture labeling can be treated as a multi-class classification problem in which one picture may carry several labels, but it differs considerably from an ordinary multi-class problem. In an ordinary multi-class problem the categories are generally uniformly distributed, that is, each category has roughly the same number of pictures; in picture labeling the label distribution is generally not uniform, and a given label may apply to many pictures or to only a few.
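The uneven label distribution mentioned above can be made visible by counting how many pictures carry each label. The following sketch uses hypothetical annotations and a hypothetical helper `label_counts`; it is meant only to illustrate the point and is not part of the described system.

```python
from collections import Counter
from typing import Dict, List


def label_counts(annotations: Dict[str, List[str]]) -> Counter:
    """Count how many pictures carry each label (a picture may have several)."""
    counts: Counter = Counter()
    for labels in annotations.values():
        counts.update(set(labels))  # set() so a repeated label counts once per picture
    return counts


# Made-up annotations: picture name -> labels attached to it.
annotations = {
    "img_001.jpg": ["person", "helmet"],
    "img_002.jpg": ["person"],
    "img_003.jpg": ["person", "vehicle"],
}
print(label_counts(annotations).most_common())
```

Here one label appears on every picture while the others appear once each, which is the kind of imbalance the paragraph describes.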

Specifically, an SDK program is used together with a labeling platform to label the latest version data set; the quality of the labeled data set has a major influence on the prediction performance of the model. The data labeling platform can upload, download and manage data sets and integrates multiple labeling tools. It is simple and clear, has a low learning cost, can issue labeling jobs to multiple users, supports both manual and intelligent labeling of the data set, and previews the progress and results of labeling jobs in real time. The labeling system also provides an open, real-time communication system and a reasonable, transparent permission system for labeling jobs and requirement documents. The completed data labeling task is transmitted to the storage system, sample data is generated, and the model is trained on a training platform. This improves the labeling precision of the latest version data set, allows target sample data to be generated for different scenes, and enables semi-automation of the cloud model training process.

In this embodiment, the generated latest version data set is introduced into the annotation system, and the check value is compared with the picture quantity value in the latest version data set, so as to ensure the integrity and accuracy of the data.

Specifically, the checksum is an accumulation of the number of bits transmitted; when the transmission finishes, the receiver can determine from this value whether all the data has been received. Because the rules of the checksum algorithm are simple, it occupies few system resources when running and computes very quickly. The system completes data transmission in place of manual work, which avoids data errors caused by manual copying and improves the reliability of the data set.
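The comparison of the check value with the picture quantity value can be sketched as follows. The sketch assumes the check value is simply the number of pictures that arrived at the receiver, which is one reading of the steps above; the function name `verify_transfer` and the message texts are hypothetical.

```python
from typing import Iterable, Tuple


def verify_transfer(received: Iterable[bytes], expected_count: int) -> Tuple[bool, str]:
    """Count the received pictures and compare with the sender's picture count."""
    check_value = sum(1 for _ in received)
    if check_value == expected_count:
        return True, "transfer complete"
    # Mismatch: build the error information to feed back to the sender.
    return False, (f"transfer error: expected {expected_count} pictures, "
                   f"received {check_value}")


print(verify_transfer([b"img-a", b"img-b"], expected_count=2))
print(verify_transfer([b"img-a"], expected_count=2))
```

A count-based check of this kind detects missing or extra pictures but not corrupted content; a content hash per picture would be needed for that, which goes beyond what the description specifies.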

In addition, the embodiment of the invention also provides a data set version management device based on Git.

Referring to fig. 3, fig. 3 is a functional module diagram of a Git-based data set version management device according to a first embodiment of the present invention.

In this embodiment, the Git-based data set version management apparatus includes:

the label category determining module 10 is used for determining a picture label category according to a model training task requirement corresponding to the target deep learning model;

the searching module 20 is configured to search, in each version data set of the data set list file, a version data set corresponding to the picture tag category as a latest version data set;

and the training module 30 is configured to train and generate the target deep learning model according to the latest version data set.

Further, the label category determining module of the Git-based data set version management apparatus further includes:

and the tag content unit is used for acquiring model training task requirements corresponding to the target deep learning model, determining a target tag group and taking the target tag group as the picture tag category, wherein the picture tag category at least comprises a product, a task, a camera and a project tag.

Further, the Git-based data set version management apparatus further includes:

the picture set generating module is used for extracting pictures from an original video file frame by frame to obtain an original picture set;

and the information collection module is used for acquiring the picture names and the related attributes in the original picture set and generating a data set information file.

Further, the searching module of the Git-based data set version management apparatus further includes:

the information comparison unit is used for comparing the picture label type with the type information in the data set list file; when the category information which is the same as the picture label category exists in the data set list, determining a version data set which corresponds to the category information which is the same as the picture label category and is used as the latest version data set; and when the category information which is the same as the picture label category does not exist in the data set list, generating the latest version data set according to the picture label category and the original picture set.

Further, the training module specifically further includes:

the labeling unit is used for transmitting the latest version data set into a labeling system, labeling the latest version data set based on the picture label category and generating a label file set;

the storage unit is used for storing the label file set in an object storage system based on the picture label category as a sample data set;

the sample data generating unit is used for generating sample data according to the latest version data set and the data set information file, and for training and generating the target deep learning model according to the sample data;

and the training unit is used for transmitting the sample data set into a training server for model training to generate the target deep learning model.

Each module in the Git-based data set version management apparatus corresponds to each step in the Git-based data set version management method embodiment, and the functions and implementation processes thereof are not described in detail herein.

In addition, the embodiment of the invention also provides a computer readable storage medium.

The computer readable storage medium of the present invention stores a Git-based data set version management program, wherein the Git-based data set version management program, when executed by a processor, implements the steps of the Git-based data set version management method as described above.

For the method implemented when the Git-based data set version management program is executed, reference may be made to the embodiments of the Git-based data set version management method of the present invention, which are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent process transformations made by using the contents of the present specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims

1. A Git-based dataset version management method, characterized by comprising the following steps:

determining a picture label category according to the model training task requirement corresponding to the target deep learning model;

searching, among the version data sets of the data set list file, for the version data set corresponding to the picture label category to serve as a latest version data set;

and training and generating the target deep learning model according to the latest version data set.

2. The Git-based dataset version management method as claimed in claim 1, wherein the step of determining the picture label category according to the model training task requirement corresponding to the target deep learning model further comprises:

3. The Git-based dataset version management method as claimed in claim 1, wherein the determining the picture label category according to the model training task requirement corresponding to the target deep learning model comprises:

4. The Git-based dataset version management method as claimed in claim 1, wherein the searching for the version data set corresponding to the picture label category in each version data set of the data set list file as the latest version data set comprises:

5. The Git-based dataset version management method as claimed in claim 4, further comprising, after comparing the picture label category with the category information in the data set list file:

6. The Git-based dataset version management method of any of claims 1-5, wherein the training to generate the target deep learning model from the latest version dataset comprises:

7. The Git-based dataset version management method as claimed in claim 6, wherein said passing the latest version dataset into an annotation system further comprises:

8. A Git-based dataset version management apparatus, comprising:

9. A Git-based dataset version management device comprising a processor, a memory, and a Git-based dataset version management program stored on the memory and executable by the processor, wherein the Git-based dataset version management program, when executed by the processor, implements the steps of the Git-based dataset version management method as claimed in any one of claims 1 to 7.

10. A computer readable storage medium having stored thereon a Git-based dataset version management program, wherein the Git-based dataset version management program, when executed by a processor, performs the steps of the Git-based dataset version management method as recited in any one of claims 1 to 7.