CN111027507A - Training data set generation method and device based on video data identification - Google Patents

Training data set generation method and device based on video data identification

Info

Publication number
CN111027507A
Authority
CN
China
Prior art keywords
video frame
frame image
feature information
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911324125.9A
Other languages
Chinese (zh)
Inventor
黄阳
郑邦东
乔迟
李兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp and CCB Finetech Co Ltd
Priority to CN201911324125.9A
Publication of CN111027507A
Legal status: Pending


Classifications

    All classifications fall under G (Physics), G06 (Computing; Calculating or Counting), within G06F (Electric digital data processing) and G06V (Image or video recognition or understanding):
    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection
    • G06F 18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training data set generation method and device based on video data identification. At least one type of first feature information is obtained through at least one feature information recognition model, the video frame images are then automatically screened according to the first feature information, and a training data set is generated in combination with manual labeling results; this avoids much unnecessary workload in the subsequent picture labeling work, and the resulting training data set can be used to train deep learning models on other video data.

Description

Training data set generation method and device based on video data identification
Technical Field
The present application relates to the field of video identification, and more particularly, to a training data set generation method and apparatus based on video data identification.
Background
Training data for video recognition algorithms (models) is uneven in quality and difficult to collect. At present, most training pictures for video recognition algorithms come from public data sets or generic collection efforts, so the real business scenes in which an algorithm is actually used receive little attention, and the algorithm's effect in production is mediocre. The invention aims to solve the difficulty of acquiring such training data by collecting training pictures that match the video recognition business scene, so that the algorithm can be optimized and reach higher recognition accuracy for a specific business scene. At present, training picture data related to a business scene is mostly collected automatically by capturing it directly from a camera; operations such as cropping, rotating, and zooming are then applied to generate a large number of training pictures resembling the scene. This existing approach is crude: pictures are acquired simply by capturing video frames at fixed intervals, and pictures that are actually useful for training cannot be identified and stored intelligently. The result is heavy redundancy: many nearly identical pictures, and pictures lacking the required training scenes, behaviors, or targets, are stored, wasting a large amount of disk space and adding a large amount of unnecessary work to the subsequent picture labeling stage.
Disclosure of Invention
In order to solve at least one of the above problems, an embodiment of one aspect of the present application provides a training data set generation method based on video data recognition, including:
acquiring at least one video frame image from video data;
recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
screening the video frame images using the first feature information to obtain a video frame image set;
acquiring a manual labeling result for each video frame image in the video frame image set; and
taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
In some embodiments, the feature information recognition model is built based on deep learning, and the training data set generation method further includes:
acquiring a manual screening result for the video frame image set, and selecting video frame images according to the manual screening result to form a screening training set;
acquiring the screening training set after an operator has corrected the first feature information; and
training the at least one feature information recognition model with the corrected screening training set.
In some embodiments, recognizing each video frame image through at least one feature information recognition model includes at least one of the following:
obtaining face feature information in the video frame image through a face feature recognition model;
obtaining type feature information and position feature information of different targets in the video frame image through a target detection model; and
obtaining motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
In certain embodiments, the method further includes establishing the at least one feature information recognition model.
In some embodiments, the manual labeling result is a correction of the first feature information; taking each video frame image in the video frame image set and the corresponding manual labeling result as a group of training data then includes:
taking each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
In some embodiments, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
In some embodiments, the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
An embodiment of another aspect of the present invention provides a training data set generating apparatus based on video data recognition, including:
a video frame image acquisition module, configured to acquire at least one video frame image from video data;
a feature information recognition module, configured to recognize each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
a video frame image set screening module, configured to screen the video frame images using the first feature information to obtain a video frame image set;
a manual labeling result acquisition module, configured to acquire a manual labeling result for each video frame image in the video frame image set; and
a training data set generating module, configured to generate a training data set by taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data.
In some embodiments, the feature information recognition model is built based on deep learning, and the training data set generating apparatus further includes:
a manual screening result acquisition module, configured to acquire a manual screening result for the video frame image set and select video frame images according to the manual screening result to form a screening training set;
a corrected screening training set acquisition module, configured to acquire the screening training set after an operator has corrected the first feature information; and
a feature information recognition model training module, configured to train the at least one feature information recognition model with the corrected screening training set.
In certain embodiments, the feature information recognition module includes at least one of:
a face recognition unit, which obtains face feature information in the video frame image through a face feature recognition model;
a target detection unit, which obtains type feature information and position feature information of different targets in the video frame image through a target detection model; and
a motion recognition unit, which obtains motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
In certain embodiments, the apparatus further includes a model establishing module, configured to establish the at least one feature information recognition model.
In some embodiments, the manual labeling result is a correction of the first feature information; the training data set generating module takes each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
In some embodiments, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
In some embodiments, the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
A further embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method described above.
A further embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method described above.
The beneficial effects of this application are as follows:
The application provides a training data set generation method and device based on video data recognition: at least one type of first feature information is obtained through at least one feature information recognition model, the video frame images are automatically screened according to the first feature information, and a training data set is generated in combination with the manual labeling results, avoiding much unnecessary workload in subsequent picture labeling.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described here are only some embodiments of the present application; other drawings can be derived from them by those skilled in the art without creative effort.
Fig. 1 shows a flowchart of a training data set generation method based on video data recognition in an embodiment of the present application.
Fig. 2 shows a schematic structural diagram of a training data set generation device based on video data recognition in an embodiment of the present application.
FIG. 3 illustrates a schematic block diagram of a computer device suitable for use in implementing embodiments of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. The described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
The invention first provides a training data set generation method based on video data recognition, as shown in Fig. 1, including the following steps:
S1: acquiring at least one video frame image from video data;
S2: recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
S3: screening the video frame images using the first feature information to obtain a video frame image set;
S4: acquiring a manual labeling result for each video frame image in the video frame image set;
S5: taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
In this method, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are then screened automatically according to that first feature information, which avoids much of the unnecessary workload in subsequent picture labeling. A training data set is then generated in combination with the manual labeling results; this training data set can be used to train deep learning models on other video data.
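As an illustration only, the five steps compose as in the following Python sketch; the injected callables stand in for the concrete models, screening rules, and labeling tool, none of which the text ties to a specific implementation:

    def generate_training_set(frames, recognize, passes_screening, get_label):
        """End-to-end sketch of steps S1-S5. The callables are injected because
        the text defines the steps, not any concrete implementation."""
        dataset = []
        for frame in frames:                        # S1: frames taken from video data
            features = recognize(frame)             # S2: feature information recognition models
            if passes_screening(features):          # S3: screen by first feature information
                label = get_label(frame, features)  # S4: manual labeling result
                dataset.append((frame, label))      # S5: one group of training data
        return dataset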
Specifically, in step S1, video frame images are captured with a video image capture tool. A video frame image is the image corresponding to one frame of the video; for example, at a frame rate of 60 frames per minute, 60 video frame images can be captured in one minute.
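By way of illustration, a minimal frame-capture sketch with OpenCV follows; the library choice and the sampling parameter are assumptions, since the text does not name a specific capture tool:

    import cv2

    def extract_frames(video_path, every_n=1):
        """Yield every n-th frame of the video as a BGR image (assumed sampling scheme)."""
        cap = cv2.VideoCapture(video_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # end of stream
            if idx % every_n == 0:
                yield frame
            idx += 1
        cap.release()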
In step S2, each feature information recognition model is a neural network model based on deep learning. The first feature information may include at least one of type feature information, position feature information, motion feature information, and face feature information of each target; correspondingly, the feature information recognition model may be a face feature recognition model, a target detection model, a motion recognition model, or the like.
In the invention, the feature information recognition models may adopt existing neural network models.
For example, the face feature recognition model may be a Multi-Task Cascaded Convolutional Neural Network (MTCNN), which performs face region detection and face key-point detection together; its overall framework is a cascade, divided into three network stages: P-Net, R-Net, and O-Net.
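A face-detection pass with an off-the-shelf MTCNN implementation might look like the sketch below; the facenet_pytorch package is one possible choice and is not named by the text:

    from PIL import Image
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(keep_all=True)  # runs the P-Net -> R-Net -> O-Net cascade internally

    img = Image.open("frame_0001.jpg")  # hypothetical frame file
    boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
    # boxes: face bounding boxes; landmarks: five facial key points per detected face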
The target detection model can adopt either of two classes of algorithm. The first is the region-proposal-based R-CNN family (R-CNN, Fast R-CNN, etc.), which is two-stage: the algorithm first generates target candidate boxes, i.e. target positions, and then classifies and regresses the candidate boxes. The second is one-stage algorithms such as YOLO and SSD, which use a single convolutional neural network (CNN) to predict the classes and positions of different targets directly. The first class of methods is more accurate but slower; the second class is faster but less accurate.
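As a hedged example of the two-stage family, the sketch below runs a pretrained Faster R-CNN from torchvision; the specific implementation is an assumption, since the text names only the algorithm families:

    import torch
    import torchvision
    from PIL import Image
    from torchvision.transforms.functional import to_tensor

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    img = to_tensor(Image.open("frame_0001.jpg"))  # PIL image -> CHW float tensor
    with torch.no_grad():
        pred = model([img])[0]
    # pred["boxes"] gives position feature information; pred["labels"] gives type feature information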
The behavior recognition model can be built based on iDT (a traditional method), on the Two-Stream idea, or on the C3D algorithm. iDT is the best-performing method outside deep learning and derives from the DT (Dense Trajectories) algorithm. The basic idea of DT is to use an optical flow field to obtain trajectories in a video sequence and then extract four kinds of features along the trajectories: trajectory shape, HOG, HOF, and MBH. HOG is computed from the grayscale images, while the others are computed from dense optical flow. Finally, the features are encoded, and an SVM classifier is trained on the encoded result. iDT improves on DT by matching SURF key points and optical flow between consecutive frames, thereby eliminating or reducing the influence of camera motion.
The Two-Stream idea trains one model on the video frame images (the spatial stream) and another on the optical flow field images extracted from those frames (the temporal stream), then fuses (Fusion) the outputs of the two networks. The two models capture static appearance information and short-range temporal information respectively, which effectively addresses the fact that many behavior categories can already be recognized from a single image. TSN is currently the most accurate algorithm of this type.
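A minimal late-fusion sketch for the two streams, assuming weighted averaging of softmax scores (one common fusion choice; the weights here are illustrative):

    import torch

    def late_fusion(spatial_logits, temporal_logits, w=0.5):
        """Average the class scores of the spatial and temporal streams."""
        p_spatial = torch.softmax(spatial_logits, dim=-1)
        p_temporal = torch.softmax(temporal_logits, dim=-1)
        return w * p_spatial + (1 - w) * p_temporal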
The C3D algorithm extracts temporal and spatial features of video data with 3D convolution kernels. These 3D feature extractors operate in both the spatial and temporal dimensions and can therefore capture motion information in the video stream. A 3D convolutional neural network is then constructed from these 3D convolutional feature extractors. This architecture generates multiple channels of information from successive video frames, performs convolution and downsampling separately on each channel, and finally combines the information from all channels to obtain the final feature description.
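The sketch below shows a single 3D-convolution block of the kind C3D stacks; the dimensions are illustrative assumptions, not the exact C3D architecture:

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=(1, 2, 2)),  # downsample space, keep the time dimension
    )

    clip = torch.randn(1, 3, 16, 112, 112)  # batch x channels x frames x height x width
    features = block(clip)                  # spatio-temporal feature maps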
The feature information recognition model may be established online or offline; that is, the step of establishing the feature information recognition model may be included among the steps of the method of the present invention, or an existing model may be used directly. The present invention is not limited in this respect.
It is apparent to those skilled in the art from the above description that step S2 can be summarized as including at least one of the following (a sketch of dense optical-flow extraction follows this list):
obtaining face feature information in the video frame image through a face feature recognition model;
obtaining type feature information and position feature information of different targets in the video frame image through a target detection model; and
obtaining motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
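A dense-optical-flow sketch (Farneback is an assumed algorithm choice; the text only speaks of extracting an optical flow field image):

    import cv2

    def dense_flow(prev_frame, next_frame):
        """Per-pixel optical flow between two consecutive BGR frames."""
        g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        return flow  # H x W x 2 array of per-pixel (dx, dy) displacements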
Then, in step S3, images are screened according to the recognized first feature information. Screening may follow screening rules; taking face feature information as an example, a rule might be: delete any image in which no face feature information can be recognized.
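A sketch of this face-based rule, reusing an MTCNN detector (the detector choice is again an assumption for illustration):

    import cv2
    from PIL import Image
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(keep_all=True)

    def keep_frame(frame_bgr):
        """Retain a frame only when at least one face is detected."""
        rgb = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        boxes, _ = mtcnn.detect(rgb)
        return boxes is not None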
The other kinds of first feature information can be screened by similar rules, or a rule may act on combinations of the recognized feature information. For example, several models each recognize a kind of first feature information, and the content of the image is then judged by an existing judgment method: if the target is a dog, the action is running, and the scene is a road, the image content is judged, based on the recognized feature information, as "a dog runs on the road". If the model to be trained later is a human gender recognition model, this image can be deleted, because its content contains no human.
The image content may be judged from a combination of keywords, or further from semantics. Continuing the example above, where the recognized target is a dog, the action is running, and the scene is a road, the keywords "dog", "running", and "road" are generated from the recognized feature information, and the sentence "a dog runs on the road" is composed from the three keywords according to semantics.
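A keyword-based content rule might be sketched as follows; the field names and the required-keyword set are illustrative assumptions:

    REQUIRED_KEYWORDS = {"person"}  # assumed requirement for a gender recognition task

    def content_keywords(first_features):
        """Map recognized first feature information to content keywords."""
        return {v for k, v in first_features.items() if k in ("target", "action", "scene")}

    def passes_content_rule(first_features):
        """Keep the frame only if its judged content covers the required keywords."""
        return REQUIRED_KEYWORDS <= content_keywords(first_features)

    # The example from the text: "a dog runs on the road" contains no person, so the frame is dropped.
    print(passes_content_rule({"target": "dog", "action": "running", "scene": "road"}))  # False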
The video frame images are screened by such screening rules, and the video frame image set consists of the images that pass the screening.
In this way, intelligent screening is achieved at image-acquisition time, the collection of large numbers of invalid training images is avoided, and manpower and material resources are saved.
Further, the automatically screened video frame images may undergo a second, manual round of screening. In this embodiment, the method further includes:
S01: acquiring a manual screening result for the video frame image set, and selecting video frame images according to the manual screening result to form a screening training set;
S02: acquiring the screening training set after an operator has corrected the first feature information;
S03: training the at least one feature information recognition model with the corrected screening training set.
In this embodiment, the automatically screened video frame images are screened manually to obtain a more accurate result and are then manually corrected; the manually corrected first feature information can serve as training data for the at least one feature information recognition model.
Further, the screened video frame images are manually labeled. In some embodiments, the manual labeling result may be a correction of the first feature information, or new, higher-level second feature information derived from the first feature information.
For example, in the embodiment where the manual labeling result is a correction of the first feature information, the training data set formed after correction may serve as training data for the feature information recognition model itself, or for other deep-learning-based neural network models; the invention is not limited in this respect.
Alternatively, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
As an example, consider face feature information with gender feature information for the face in an image labeled manually. The first feature information recognized automatically can be called a "low-order feature", and the second feature information labeled manually a "high-order feature"; the dimension of a high-order feature is necessarily higher than that of a low-order feature. (Face feature information generally consists of feature points and the distances between them, such as the interpupillary distance, whereas gender is not characterized directly by the interpupillary distance alone: it is a comprehensive judgment over multi-dimensional feature information, for example clothing, actions, and the face together.)
It is to be understood that the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
From the above description, the present application provides a training data set generation method based on video data recognition: at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, which avoids much unnecessary workload in subsequent picture labeling. A training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
Based on the same inventive concept, an embodiment of the present invention further provides a training data set generating apparatus based on video data recognition, as shown in Fig. 2, including:
a video frame image acquisition module 1, configured to acquire at least one video frame image from video data;
a feature information recognition module 2, configured to recognize each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
a video frame image set screening module 3, configured to screen the video frame images using the first feature information to obtain a video frame image set;
a manual labeling result acquisition module 4, configured to acquire a manual labeling result for each video frame image in the video frame image set; and
a training data set generating module 5, configured to generate a training data set by taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data.
In a preferred embodiment, the feature information recognition model is built based on deep learning, and the training data set generating apparatus further includes:
a manual screening result acquisition module, configured to acquire a manual screening result for the video frame image set and select video frame images according to the manual screening result to form a screening training set;
a corrected screening training set acquisition module, configured to acquire the screening training set after an operator has corrected the first feature information; and
a feature information recognition model training module, configured to train the at least one feature information recognition model with the corrected screening training set.
In a preferred embodiment, the feature information recognition module includes at least one of:
a face recognition unit, which obtains face feature information in the video frame image through a face feature recognition model;
a target detection unit, which obtains type feature information and position feature information of different targets in the video frame image through a target detection model; and
a motion recognition unit, which obtains motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
In a preferred embodiment, the apparatus further includes a model establishing module, configured to establish the at least one feature information recognition model.
In a preferred embodiment, the manual labeling result is a correction of the first feature information; the training data set generating module takes each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
In a preferred embodiment, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
In a preferred embodiment, the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
With this training data set generating apparatus based on video data recognition, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, avoiding much unnecessary workload in subsequent picture labeling; a training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
An embodiment of the present application further provides a specific implementation of an electronic device capable of implementing all the steps of the training data set generation method based on video data recognition in the above embodiments. Referring to Fig. 3, the electronic device specifically includes:
a processor 601, a memory 602, a communications interface 603, and a communication bus 604;
the processor 601, the memory 602, and the communications interface 603 communicate with one another through the communication bus 604, and the communications interface 603 is used for image transmission between the training data set generating apparatus based on video data recognition and related equipment such as a user terminal;
the processor 601 is configured to call a computer program in the memory 602 and, when executing the computer program, implements all the steps of the training data set generation method based on video data recognition in the above embodiments, for example the following steps:
S1: acquiring at least one video frame image from video data;
S2: recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
S3: screening the video frame images using the first feature information to obtain a video frame image set;
S4: acquiring a manual labeling result for each video frame image in the video frame image set;
S5: taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
As can be seen from the above, with the electronic device provided in this embodiment of the present application, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, which avoids much unnecessary workload in subsequent picture labeling; a training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
Embodiments of the present application further provide a computer-readable storage medium capable of implementing all the steps of the training data set generation method based on video data recognition in the above embodiments. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method, for example the following steps:
S1: acquiring at least one video frame image from video data;
S2: recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
S3: screening the video frame images using the first feature information to obtain a video frame image set;
S4: acquiring a manual labeling result for each video frame image in the video frame image set;
S5: taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
As can be seen from the above, with the computer-readable storage medium provided in this embodiment of the present application, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, which avoids much unnecessary workload in subsequent picture labeling; a training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the hardware + program class embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Although the present application provides method steps as described in an embodiment or flowchart, additional or fewer steps may be included based on conventional or non-inventive efforts. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
The apparatuses, modules or units illustrated in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by an article with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a vehicle-mounted human-computer interaction device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Although embodiments of the present description provide method steps as described in embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or end product executes, it may execute sequentially or in parallel (e.g., parallel processors or multi-threaded environments, or even distributed data processing environments) according to the method shown in the embodiment or the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the embodiments of the present description, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of multiple sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of an embodiment of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
The above description is only an example of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure. Various modifications and variations to the embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present specification should be included in the scope of the claims of the embodiments of the present specification.

Claims (16)

1. A training data set generation method based on video data recognition, comprising the following steps:
acquiring at least one video frame image from video data;
recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
screening the video frame images using the first feature information to obtain a video frame image set;
acquiring a manual labeling result for each video frame image in the video frame image set; and
taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
2. The method of claim 1, wherein the feature information recognition model is built based on deep learning, the method further comprising:
acquiring a manual screening result for the video frame image set, and selecting video frame images according to the manual screening result to form a screening training set;
acquiring the screening training set after an operator has corrected the first feature information; and
training the at least one feature information recognition model with the corrected screening training set.
3. The method of claim 2, wherein recognizing each video frame image through at least one feature information recognition model comprises at least one of:
obtaining face feature information in the video frame image through a face feature recognition model;
obtaining type feature information and position feature information of different targets in the video frame image through a target detection model; and
obtaining motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
4. The method of claim 1, further comprising: establishing the at least one feature information recognition model.
5. The method of claim 1, wherein the manual labeling result is a correction of the first feature information, and taking each video frame image in the video frame image set and the corresponding manual labeling result as a group of training data comprises:
taking each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
6. The method of claim 1, wherein the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
7. The method of claim 6, wherein the first feature information comprises at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information comprises gender feature information and/or expression feature information.
8. A training data set generating apparatus based on video data recognition, comprising:
a video frame image acquisition module, configured to acquire at least one video frame image from video data;
a feature information recognition module, configured to recognize each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
a video frame image set screening module, configured to screen the video frame images using the first feature information to obtain a video frame image set;
a manual labeling result acquisition module, configured to acquire a manual labeling result for each video frame image in the video frame image set; and
a training data set generating module, configured to generate a training data set by taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data.
9. The apparatus of claim 8, wherein the feature information recognition model is built based on deep learning, the apparatus further comprising:
a manual screening result acquisition module, configured to acquire a manual screening result for the video frame image set and select video frame images according to the manual screening result to form a screening training set;
a corrected screening training set acquisition module, configured to acquire the screening training set after an operator has corrected the first feature information; and
a feature information recognition model training module, configured to train the at least one feature information recognition model with the corrected screening training set.
10. The apparatus of claim 9, wherein the feature information recognition module comprises at least one of:
a face recognition unit, which obtains face feature information in the video frame image through a face feature recognition model;
a target detection unit, which obtains type feature information and position feature information of different targets in the video frame image through a target detection model; and
a motion recognition unit, which obtains motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
11. The apparatus of claim 9, further comprising: a model establishing module, configured to establish the at least one feature information recognition model.
12. The apparatus of claim 8, wherein the manual labeling result is a correction of the first feature information, and the training data set generating module takes each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
13. The apparatus of claim 8, wherein the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
14. The apparatus of claim 13, wherein the first feature information comprises at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information comprises gender feature information and/or expression feature information.
15. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1 to 7.
16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN201911324125.9A 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification Pending CN111027507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911324125.9A CN111027507A (en) 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911324125.9A CN111027507A (en) 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification

Publications (1)

Publication Number Publication Date
CN111027507A 2020-04-17

Family

ID=70211089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911324125.9A Pending CN111027507A (en) 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification

Country Status (1)

Country Link
CN (1) CN111027507A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121943A (en) * 2016-11-30 2018-06-05 阿里巴巴集团控股有限公司 Picture-based discrimination method, device and computing device
CN108197668A (en) * 2018-01-31 2018-06-22 达闼科技(北京)有限公司 Method for establishing a model data set, and cloud system
CN109241903A (en) * 2018-08-30 2019-01-18 平安科技(深圳)有限公司 Sample data cleaning method, device, computer equipment and storage medium
CN110059654A (en) * 2019-04-25 2019-07-26 台州智必安科技有限责任公司 Automatic dish settlement and healthy diet management method based on fine-grained recognition
CN110119703A (en) * 2019-05-07 2019-08-13 福州大学 Human motion recognition method fusing an attention mechanism and spatio-temporal graph convolutional neural networks in security scenarios
CN110443141A (en) * 2019-07-08 2019-11-12 深圳中兴网信科技有限公司 Data set processing method, data set processing device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Lei: "Image Analysis", in Introduction to Digital Media Technology (《数字媒体技术概论》), pages 26-27 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582235A (en) * 2020-05-26 2020-08-25 瑞纳智能设备股份有限公司 Alarm method, system and equipment for monitoring abnormal events in station in real time
CN112040273A (en) * 2020-09-11 2020-12-04 腾讯科技(深圳)有限公司 Video synthesis method and device
CN112311092A (en) * 2020-10-26 2021-02-02 杭州市电力设计院有限公司余杭分公司 Method and system for identifying monitoring information of power system
CN113177603A (en) * 2021-05-12 2021-07-27 中移智行网络科技有限公司 Training method of classification model, video classification method and related equipment
CN113177603B (en) * 2021-05-12 2022-05-06 中移智行网络科技有限公司 Training method of classification model, video classification method and related equipment
WO2022237065A1 (en) * 2021-05-12 2022-11-17 中移智行网络科技有限公司 Classification model training method, video classification method, and related device
CN113837114A (en) * 2021-09-27 2021-12-24 浙江力石科技股份有限公司 Method and system for acquiring face video clips in scenic spot
CN117437505A (en) * 2023-12-18 2024-01-23 杭州任性智能科技有限公司 Training data set generation method and system based on video

Similar Documents

Publication Publication Date Title
CN111027507A (en) Training data set generation method and device based on video data identification
CN111010590B (en) Video clipping method and device
CN108764133B (en) Image recognition method, device and system
CN107358157B Face liveness detection method and device, and electronic equipment
CN109710780B (en) Archiving method and device
CN109522450B (en) Video classification method and server
Tanberk et al. A hybrid deep model using deep learning and dense optical flow approaches for human activity recognition
CN113128368B (en) Method, device and system for detecting character interaction relationship
CN111311634A (en) Face image detection method, device and equipment
CN107944381B (en) Face tracking method, face tracking device, terminal and storage medium
CN114238904B (en) Identity recognition method, and training method and device of dual-channel hyper-resolution model
Yang et al. Binary descriptor based nonparametric background modeling for foreground extraction by using detection theory
CN110688897A (en) Pedestrian re-identification method and device based on joint judgment and generation learning
Kong et al. Automatic analysis of complex athlete techniques in broadcast taekwondo video
Dai et al. Tan: Temporal aggregation network for dense multi-label action recognition
Elharrouss et al. FSC-set: counting, localization of football supporters crowd in the stadiums
CN112818955A (en) Image segmentation method and device, computer equipment and storage medium
CN111291695B (en) Training method and recognition method for recognition model of personnel illegal behaviors and computer equipment
Kwon et al. Toward an online continual learning architecture for intrusion detection of video surveillance
Patil et al. An approach of understanding human activity recognition and detection for video surveillance using HOG descriptor and SVM classifier
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN112965602A (en) Gesture-based human-computer interaction method and device
Abhishek et al. Human verification over activity analysis via deep data mining
Martínez Carrillo et al. A compact and recursive Riemannian motion descriptor for untrimmed activity recognition
Pham et al. Unsupervised workflow extraction from first-person video of mechanical assembly

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220907

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.