CN111027507A - Training data set generation method and device based on video data identification - Google Patents

Training data set generation method and device based on video data identification

Info

Publication number
CN111027507A
Authority
CN
China
Prior art keywords
video frame
frame image
feature information
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911324125.9A
Other languages
Chinese (zh)
Inventor
黄阳
郑邦东
乔迟
李兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp and CCB Finetech Co Ltd
Priority to CN201911324125.9A
Publication of CN111027507A
Legal status: Pending


Classifications

    All classifications fall under G (Physics), G06 (Computing; Calculating or Counting), within G06F (Electric digital data processing) and G06V (Image or video recognition or understanding):
    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection
    • G06F 18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training data set generation method and device based on video data identification. At least one type of first feature information is obtained through at least one feature information recognition model, the video frame images are then automatically screened according to the first feature information, and a training data set is generated in combination with manual labeling results; this avoids much unnecessary workload in the subsequent picture labeling work, and the resulting training data set can be used to train deep learning models on other video data.

Description

Training data set generation method and device based on video data identification
Technical Field
The present application relates to the field of video identification, and more particularly, to a training data set generation method and apparatus based on video data identification.
Background
Training data for video recognition algorithms (models) is uneven in quality and difficult to collect. At present, most training pictures for video recognition algorithms come from public data sets or generic collection efforts, so the real business scenes in which an algorithm is actually used receive little attention, and the algorithm's effect in production is mediocre. The invention aims to solve the difficulty of acquiring such training data by collecting training pictures that match the video recognition business scene, so that the algorithm can be optimized and reach higher recognition accuracy for a specific business scene. At present, training picture data related to a business scene is mostly collected automatically by capturing it directly from a camera; operations such as cropping, rotating, and zooming are then applied to generate a large number of training pictures resembling the scene. This existing approach is crude: pictures are acquired simply by capturing video frames at fixed intervals, and pictures that are actually useful for training cannot be identified and stored intelligently. The result is heavy redundancy: many nearly identical pictures, and pictures lacking the required training scenes, behaviors, or targets, are stored, wasting a large amount of disk space and adding a large amount of unnecessary work to the subsequent picture labeling stage.
Disclosure of Invention
In order to solve at least one of the above problems, an embodiment of one aspect of the present application provides a training data set generation method based on video data recognition, including:
acquiring at least one video frame image from video data;
recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
screening the video frame images using the first feature information to obtain a video frame image set;
acquiring a manual labeling result for each video frame image in the video frame image set; and
taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
In some embodiments, the feature information recognition model is built based on deep learning, and the training data set generation method further includes:
acquiring a manual screening result for the video frame image set, and selecting video frame images according to the manual screening result to form a screening training set;
acquiring the screening training set after an operator has corrected the first feature information; and
training the at least one feature information recognition model with the corrected screening training set.
In some embodiments, recognizing each video frame image through at least one feature information recognition model includes at least one of the following:
obtaining face feature information in the video frame image through a face feature recognition model;
obtaining type feature information and position feature information of different targets in the video frame image through a target detection model; and
obtaining motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
In certain embodiments, the method further includes establishing the at least one feature information recognition model.
In some embodiments, the manual labeling result is a correction of the first feature information; taking each video frame image in the video frame image set and the corresponding manual labeling result as a group of training data then includes:
taking each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
In some embodiments, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
In some embodiments, the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
An embodiment of another aspect of the present invention provides a training data set generating apparatus based on video data recognition, including:
a video frame image acquisition module, configured to acquire at least one video frame image from video data;
a feature information recognition module, configured to recognize each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
a video frame image set screening module, configured to screen the video frame images using the first feature information to obtain a video frame image set;
a manual labeling result acquisition module, configured to acquire a manual labeling result for each video frame image in the video frame image set; and
a training data set generating module, configured to generate a training data set by taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data.
In some embodiments, the feature information recognition model is built based on deep learning, and the training data set generating apparatus further includes:
a manual screening result acquisition module, configured to acquire a manual screening result for the video frame image set and select video frame images according to the manual screening result to form a screening training set;
a corrected screening training set acquisition module, configured to acquire the screening training set after an operator has corrected the first feature information; and
a feature information recognition model training module, configured to train the at least one feature information recognition model with the corrected screening training set.
In certain embodiments, the feature information recognition module includes at least one of:
a face recognition unit, which obtains face feature information in the video frame image through a face feature recognition model;
a target detection unit, which obtains type feature information and position feature information of different targets in the video frame image through a target detection model; and
a motion recognition unit, which obtains motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
In certain embodiments, the apparatus further includes a model establishing module, configured to establish the at least one feature information recognition model.
In some embodiments, the manual labeling result is a correction of the first feature information; the training data set generating module takes each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
In some embodiments, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
In some embodiments, the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
A further embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method described above.
A further embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method described above.
The beneficial effects of this application are as follows:
The application provides a training data set generation method and device based on video data recognition: at least one type of first feature information is obtained through at least one feature information recognition model, the video frame images are automatically screened according to the first feature information, and a training data set is generated in combination with the manual labeling results, avoiding much unnecessary workload in subsequent picture labeling.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described here are only some embodiments of the present application; other drawings can be derived from them by those skilled in the art without creative effort.
Fig. 1 shows a flowchart of a training data set generation method based on video data recognition in an embodiment of the present application.
Fig. 2 shows a schematic structural diagram of a training data set generation device based on video data recognition in an embodiment of the present application.
FIG. 3 illustrates a schematic block diagram of a computer device suitable for use in implementing embodiments of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. The described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
The invention first provides a training data set generation method based on video data recognition, as shown in Fig. 1, including the following steps:
S1: acquiring at least one video frame image from video data;
S2: recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
S3: screening the video frame images using the first feature information to obtain a video frame image set;
S4: acquiring a manual labeling result for each video frame image in the video frame image set;
S5: taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
In this method, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are then screened automatically according to that first feature information, which avoids much of the unnecessary workload in subsequent picture labeling. A training data set is then generated in combination with the manual labeling results; this training data set can be used to train deep learning models on other video data.
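As an illustration only, the five steps compose as in the following Python sketch; the injected callables stand in for the concrete models, screening rules, and labeling tool, none of which the text ties to a specific implementation:

    def generate_training_set(frames, recognize, passes_screening, get_label):
        """End-to-end sketch of steps S1-S5. The callables are injected because
        the text defines the steps, not any concrete implementation."""
        dataset = []
        for frame in frames:                        # S1: frames taken from video data
            features = recognize(frame)             # S2: feature information recognition models
            if passes_screening(features):          # S3: screen by first feature information
                label = get_label(frame, features)  # S4: manual labeling result
                dataset.append((frame, label))      # S5: one group of training data
        return dataset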
Specifically, in step S1, video frame images are captured with a video image capture tool. A video frame image is the image corresponding to one frame of the video; for example, at a frame rate of 60 frames per minute, 60 video frame images can be captured in one minute.
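By way of illustration, a minimal frame-capture sketch with OpenCV follows; the library choice and the sampling parameter are assumptions, since the text does not name a specific capture tool:

    import cv2

    def extract_frames(video_path, every_n=1):
        """Yield every n-th frame of the video as a BGR image (assumed sampling scheme)."""
        cap = cv2.VideoCapture(video_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # end of stream
            if idx % every_n == 0:
                yield frame
            idx += 1
        cap.release()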
In step S2, each feature information recognition model is a neural network model based on deep learning. The first feature information may include at least one of type feature information, position feature information, motion feature information, and face feature information of each target; correspondingly, the feature information recognition model may be a face feature recognition model, a target detection model, a motion recognition model, or the like.
In the invention, the feature information recognition models may adopt existing neural network models.
For example, the face feature recognition model may be a Multi-Task Cascaded Convolutional Neural Network (MTCNN), which performs face region detection and face key-point detection together; its overall framework is a cascade, divided into three network stages: P-Net, R-Net, and O-Net.
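A face-detection pass with an off-the-shelf MTCNN implementation might look like the sketch below; the facenet_pytorch package is one possible choice and is not named by the text:

    from PIL import Image
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(keep_all=True)  # runs the P-Net -> R-Net -> O-Net cascade internally

    img = Image.open("frame_0001.jpg")  # hypothetical frame file
    boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
    # boxes: face bounding boxes; landmarks: five facial key points per detected face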
The target detection model can adopt either of two classes of algorithm. The first is the region-proposal-based R-CNN family (R-CNN, Fast R-CNN, etc.), which is two-stage: the algorithm first generates target candidate boxes, i.e. target positions, and then classifies and regresses the candidate boxes. The second is one-stage algorithms such as YOLO and SSD, which use a single convolutional neural network (CNN) to predict the classes and positions of different targets directly. The first class of methods is more accurate but slower; the second class is faster but less accurate.
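As a hedged example of the two-stage family, the sketch below runs a pretrained Faster R-CNN from torchvision; the specific implementation is an assumption, since the text names only the algorithm families:

    import torch
    import torchvision
    from PIL import Image
    from torchvision.transforms.functional import to_tensor

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    img = to_tensor(Image.open("frame_0001.jpg"))  # PIL image -> CHW float tensor
    with torch.no_grad():
        pred = model([img])[0]
    # pred["boxes"] gives position feature information; pred["labels"] gives type feature information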
The behavior recognition model can be built based on iDT (a traditional method), on the Two-Stream idea, or on the C3D algorithm. iDT is the best-performing method outside deep learning and derives from the DT (Dense Trajectories) algorithm. The basic idea of DT is to use an optical flow field to obtain trajectories in a video sequence and then extract four kinds of features along the trajectories: trajectory shape, HOG, HOF, and MBH. HOG is computed from the grayscale images, while the others are computed from dense optical flow. Finally, the features are encoded, and an SVM classifier is trained on the encoded result. iDT improves on DT by matching SURF key points and optical flow between consecutive frames, thereby eliminating or reducing the influence of camera motion.
The Two-Stream idea trains one model on the video frame images (the spatial stream) and another on the optical flow field images extracted from those frames (the temporal stream), then fuses (Fusion) the outputs of the two networks. The two models capture static appearance information and short-range temporal information respectively, which effectively addresses the fact that many behavior categories can already be recognized from a single image. TSN is currently the most accurate algorithm of this type.
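A minimal late-fusion sketch for the two streams, assuming weighted averaging of softmax scores (one common fusion choice; the weights here are illustrative):

    import torch

    def late_fusion(spatial_logits, temporal_logits, w=0.5):
        """Average the class scores of the spatial and temporal streams."""
        p_spatial = torch.softmax(spatial_logits, dim=-1)
        p_temporal = torch.softmax(temporal_logits, dim=-1)
        return w * p_spatial + (1 - w) * p_temporal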
The C3D algorithm extracts temporal and spatial features of video data with 3D convolution kernels. These 3D feature extractors operate in both the spatial and temporal dimensions and can therefore capture motion information in the video stream. A 3D convolutional neural network is then constructed from these 3D convolutional feature extractors. This architecture generates multiple channels of information from successive video frames, performs convolution and downsampling separately on each channel, and finally combines the information from all channels to obtain the final feature description.
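The sketch below shows a single 3D-convolution block of the kind C3D stacks; the dimensions are illustrative assumptions, not the exact C3D architecture:

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=(1, 2, 2)),  # downsample space, keep the time dimension
    )

    clip = torch.randn(1, 3, 16, 112, 112)  # batch x channels x frames x height x width
    features = block(clip)                  # spatio-temporal feature maps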
The feature information recognition model may be established online or offline; that is, the step of establishing the feature information recognition model may be included among the steps of the method of the present invention, or an existing model may be used directly. The present invention is not limited in this respect.
It is apparent to those skilled in the art from the above description that step S2 can be summarized as including at least one of the following (a sketch of dense optical-flow extraction follows this list):
obtaining face feature information in the video frame image through a face feature recognition model;
obtaining type feature information and position feature information of different targets in the video frame image through a target detection model; and
obtaining motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
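A dense-optical-flow sketch (Farneback is an assumed algorithm choice; the text only speaks of extracting an optical flow field image):

    import cv2

    def dense_flow(prev_frame, next_frame):
        """Per-pixel optical flow between two consecutive BGR frames."""
        g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        return flow  # H x W x 2 array of per-pixel (dx, dy) displacements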
Then, in step S3, images are screened according to the recognized first feature information. Screening may follow screening rules; taking face feature information as an example, a rule might be: delete any image in which no face feature information can be recognized.
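A sketch of this face-based rule, reusing an MTCNN detector (the detector choice is again an assumption for illustration):

    import cv2
    from PIL import Image
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(keep_all=True)

    def keep_frame(frame_bgr):
        """Retain a frame only when at least one face is detected."""
        rgb = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        boxes, _ = mtcnn.detect(rgb)
        return boxes is not None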
The other kinds of first feature information can be screened by similar rules, or a rule may act on combinations of the recognized feature information. For example, several models each recognize a kind of first feature information, and the content of the image is then judged by an existing judgment method: if the target is a dog, the action is running, and the scene is a road, the image content is judged, based on the recognized feature information, as "a dog runs on the road". If the model to be trained later is a human gender recognition model, this image can be deleted, because its content contains no human.
The image content may be judged from a combination of keywords, or further from semantics. Continuing the example above, where the recognized target is a dog, the action is running, and the scene is a road, the keywords "dog", "running", and "road" are generated from the recognized feature information, and the sentence "a dog runs on the road" is composed from the three keywords according to semantics.
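A keyword-based content rule might be sketched as follows; the field names and the required-keyword set are illustrative assumptions:

    REQUIRED_KEYWORDS = {"person"}  # assumed requirement for a gender recognition task

    def content_keywords(first_features):
        """Map recognized first feature information to content keywords."""
        return {v for k, v in first_features.items() if k in ("target", "action", "scene")}

    def passes_content_rule(first_features):
        """Keep the frame only if its judged content covers the required keywords."""
        return REQUIRED_KEYWORDS <= content_keywords(first_features)

    # The example from the text: "a dog runs on the road" contains no person, so the frame is dropped.
    print(passes_content_rule({"target": "dog", "action": "running", "scene": "road"}))  # False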
The video frame images are screened by such screening rules, and the video frame image set consists of the images that pass the screening.
In this way, intelligent screening is achieved at image-acquisition time, the collection of large numbers of invalid training images is avoided, and manpower and material resources are saved.
Further, the automatically screened video frame images may undergo a second, manual round of screening. In this embodiment, the method further includes:
S01: acquiring a manual screening result for the video frame image set, and selecting video frame images according to the manual screening result to form a screening training set;
S02: acquiring the screening training set after an operator has corrected the first feature information;
S03: training the at least one feature information recognition model with the corrected screening training set.
In this embodiment, the automatically screened video frame images are screened manually to obtain a more accurate result and are then manually corrected; the manually corrected first feature information can serve as training data for the at least one feature information recognition model.
Further, the screened video frame images are manually labeled. In some embodiments, the manual labeling result may be a correction of the first feature information, or new, higher-level second feature information derived from the first feature information.
For example, in the embodiment where the manual labeling result is a correction of the first feature information, the training data set formed after correction may serve as training data for the feature information recognition model itself, or for other deep-learning-based neural network models; the invention is not limited in this respect.
Alternatively, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
As an example, consider face feature information with gender feature information for the face in an image labeled manually. The first feature information recognized automatically can be called a "low-order feature", and the second feature information labeled manually a "high-order feature"; the dimension of a high-order feature is necessarily higher than that of a low-order feature. (Face feature information generally consists of feature points and the distances between them, such as the interpupillary distance, whereas gender is not characterized directly by the interpupillary distance alone: it is a comprehensive judgment over multi-dimensional feature information, for example clothing, actions, and the face together.)
It is to be understood that the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
From the above description, the present application provides a training data set generation method based on video data recognition: at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, which avoids much unnecessary workload in subsequent picture labeling. A training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
Based on the same inventive concept, an embodiment of the present invention further provides a training data set generating apparatus based on video data recognition, as shown in Fig. 2, including:
a video frame image acquisition module 1, configured to acquire at least one video frame image from video data;
a feature information recognition module 2, configured to recognize each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
a video frame image set screening module 3, configured to screen the video frame images using the first feature information to obtain a video frame image set;
a manual labeling result acquisition module 4, configured to acquire a manual labeling result for each video frame image in the video frame image set; and
a training data set generating module 5, configured to generate a training data set by taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data.
In a preferred embodiment, the feature information recognition model is built based on deep learning, and the training data set generating apparatus further includes:
a manual screening result acquisition module, configured to acquire a manual screening result for the video frame image set and select video frame images according to the manual screening result to form a screening training set;
a corrected screening training set acquisition module, configured to acquire the screening training set after an operator has corrected the first feature information; and
a feature information recognition model training module, configured to train the at least one feature information recognition model with the corrected screening training set.
In a preferred embodiment, the feature information recognition module includes at least one of:
a face recognition unit, which obtains face feature information in the video frame image through a face feature recognition model;
a target detection unit, which obtains type feature information and position feature information of different targets in the video frame image through a target detection model; and
a motion recognition unit, which obtains motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
In a preferred embodiment, the apparatus further includes a model establishing module, configured to establish the at least one feature information recognition model.
In a preferred embodiment, the manual labeling result is a correction of the first feature information; the training data set generating module takes each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
In a preferred embodiment, the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
In a preferred embodiment, the first feature information includes at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information includes gender feature information and/or expression feature information.
With this training data set generating apparatus based on video data recognition, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, avoiding much unnecessary workload in subsequent picture labeling; a training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
An embodiment of the present application further provides a specific implementation of an electronic device capable of implementing all the steps of the training data set generation method based on video data recognition in the above embodiments. Referring to Fig. 3, the electronic device specifically includes:
a processor 601, a memory 602, a communications interface 603, and a communication bus 604;
the processor 601, the memory 602, and the communications interface 603 communicate with one another through the communication bus 604, and the communications interface 603 is used for image transmission between the training data set generating apparatus based on video data recognition and related equipment such as a user terminal;
the processor 601 is configured to call a computer program in the memory 602 and, when executing the computer program, implements all the steps of the training data set generation method based on video data recognition in the above embodiments, for example the following steps:
S1: acquiring at least one video frame image from video data;
S2: recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
S3: screening the video frame images using the first feature information to obtain a video frame image set;
S4: acquiring a manual labeling result for each video frame image in the video frame image set;
S5: taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
As can be seen from the above, with the electronic device provided in this embodiment of the present application, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, which avoids much unnecessary workload in subsequent picture labeling; a training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
Embodiments of the present application further provide a computer-readable storage medium capable of implementing all the steps of the training data set generation method based on video data recognition in the above embodiments. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method, for example the following steps:
S1: acquiring at least one video frame image from video data;
S2: recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
S3: screening the video frame images using the first feature information to obtain a video frame image set;
S4: acquiring a manual labeling result for each video frame image in the video frame image set;
S5: taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
As can be seen from the above, with the computer-readable storage medium provided in this embodiment of the present application, at least one type of first feature information is obtained through at least one feature information recognition model, and the video frame images are automatically screened according to the first feature information, which avoids much unnecessary workload in subsequent picture labeling; a training data set is then generated in combination with the manual labeling results and can be used to train deep learning models on other video data.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the hardware + program class embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Although the present application provides method steps as described in an embodiment or flowchart, additional or fewer steps may be included based on conventional or non-inventive efforts. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
The apparatuses, modules or units illustrated in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by an article with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a vehicle-mounted human-computer interaction device, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Although embodiments of the present description provide method steps as described in embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or end product executes, it may execute sequentially or in parallel (e.g., parallel processors or multi-threaded environments, or even distributed data processing environments) according to the method shown in the embodiment or the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the embodiments of the present description, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of multiple sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of an embodiment of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
The above description is only an example of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure. Various modifications and variations to the embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present specification should be included in the scope of the claims of the embodiments of the present specification.

Claims (16)

1. A training data set generation method based on video data recognition, comprising the following steps:
acquiring at least one video frame image from video data;
recognizing each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
screening the video frame images using the first feature information to obtain a video frame image set;
acquiring a manual labeling result for each video frame image in the video frame image set; and
taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data to generate a training data set.
2. The method of claim 1, wherein the feature information recognition model is built based on deep learning, the method further comprising:
acquiring a manual screening result for the video frame image set, and selecting video frame images according to the manual screening result to form a screening training set;
acquiring the screening training set after an operator has corrected the first feature information; and
training the at least one feature information recognition model with the corrected screening training set.
3. The method of claim 2, wherein recognizing each video frame image through at least one feature information recognition model comprises at least one of:
obtaining face feature information in the video frame image through a face feature recognition model;
obtaining type feature information and position feature information of different targets in the video frame image through a target detection model; and
obtaining motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
4. The method of claim 1, further comprising: establishing the at least one feature information recognition model.
5. The method of claim 1, wherein the manual labeling result is a correction of the first feature information, and taking each video frame image in the video frame image set and the corresponding manual labeling result as a group of training data comprises:
taking each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
6. The method of claim 1, wherein the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
7. The method of claim 6, wherein the first feature information comprises at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information comprises gender feature information and/or expression feature information.
8. A training data set generating apparatus based on video data recognition, comprising:
a video frame image acquisition module, configured to acquire at least one video frame image from video data;
a feature information recognition module, configured to recognize each video frame image through at least one feature information recognition model to obtain at least one piece of first feature information;
a video frame image set screening module, configured to screen the video frame images using the first feature information to obtain a video frame image set;
a manual labeling result acquisition module, configured to acquire a manual labeling result for each video frame image in the video frame image set; and
a training data set generating module, configured to generate a training data set by taking each video frame image in the video frame image set together with its corresponding manual labeling result as a group of training data.
9. The apparatus of claim 8, wherein the feature information recognition model is built based on deep learning, the apparatus further comprising:
a manual screening result acquisition module, configured to acquire a manual screening result for the video frame image set and select video frame images according to the manual screening result to form a screening training set;
a corrected screening training set acquisition module, configured to acquire the screening training set after an operator has corrected the first feature information; and
a feature information recognition model training module, configured to train the at least one feature information recognition model with the corrected screening training set.
10. The apparatus of claim 9, wherein the feature information recognition module comprises at least one of:
a face recognition unit, which obtains face feature information in the video frame image through a face feature recognition model;
a target detection unit, which obtains type feature information and position feature information of different targets in the video frame image through a target detection model; and
a motion recognition unit, which obtains motion feature information of each target in the video frame image through a motion recognition model, based on the video frame image and the optical flow field image extracted from it.
11. The apparatus of claim 9, further comprising: a model establishing module, configured to establish the at least one feature information recognition model.
12. The apparatus of claim 8, wherein the manual labeling result is a correction of the first feature information, and the training data set generating module takes each video frame image in the video frame image set whose first feature information has been manually corrected, together with the corresponding corrected first feature information, as a group of training data.
13. The apparatus of claim 8, wherein the manual labeling result is second feature information labeled by an operator in combination with at least one piece of first feature information in the video frame image set, the second feature information being of higher dimension than the first feature information.
14. The apparatus of claim 13, wherein the first feature information comprises at least one of type feature information, position feature information, motion feature information, and face feature information of each target, and the second feature information comprises gender feature information and/or expression feature information.
15. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1 to 7.
16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN201911324125.9A 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification Pending CN111027507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911324125.9A CN111027507A (en) 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911324125.9A CN111027507A (en) 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification

Publications (1)

Publication Number Publication Date
CN111027507A 2020-04-17

Family

ID=70211089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911324125.9A Pending CN111027507A (en) 2019-12-20 2019-12-20 Training data set generation method and device based on video data identification

Country Status (1)

Country Link
CN (1) CN111027507A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121943A (en) * 2016-11-30 2018-06-05 阿里巴巴集团控股有限公司 Picture-based discrimination method, device and computing device
CN108197668A (en) * 2018-01-31 2018-06-22 达闼科技(北京)有限公司 Method for establishing a model data set, and cloud system
CN109241903A (en) * 2018-08-30 2019-01-18 平安科技(深圳)有限公司 Sample data cleaning method, device, computer equipment and storage medium
CN110059654A (en) * 2019-04-25 2019-07-26 台州智必安科技有限责任公司 Automatic dish settlement and healthy diet management method based on fine-grained recognition
CN110119703A (en) * 2019-05-07 2019-08-13 福州大学 Human motion recognition method fusing an attention mechanism and spatio-temporal graph convolutional neural networks in security scenarios
CN110443141A (en) * 2019-07-08 2019-11-12 深圳中兴网信科技有限公司 Data set processing method, data set processing device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Lei: "Image Analysis", in Introduction to Digital Media Technology (《数字媒体技术概论》), pages 26-27 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582235A (en) * 2020-05-26 2020-08-25 瑞纳智能设备股份有限公司 Alarm method, system and equipment for monitoring abnormal events in station in real time
CN112040273A (en) * 2020-09-11 2020-12-04 腾讯科技(深圳)有限公司 Video synthesis method and device
CN112311092A (en) * 2020-10-26 2021-02-02 杭州市电力设计院有限公司余杭分公司 Method and system for identifying monitoring information of power system
CN113177603A (en) * 2021-05-12 2021-07-27 中移智行网络科技有限公司 Training method of classification model, video classification method and related equipment
CN113177603B (en) * 2021-05-12 2022-05-06 中移智行网络科技有限公司 Training method of classification model, video classification method and related equipment
WO2022237065A1 (en) * 2021-05-12 2022-11-17 中移智行网络科技有限公司 Classification model training method, video classification method, and related device
CN113837114A (en) * 2021-09-27 2021-12-24 浙江力石科技股份有限公司 Method and system for acquiring face video clips in scenic spot
CN117437505A (en) * 2023-12-18 2024-01-23 杭州任性智能科技有限公司 Training data set generation method and system based on video

Similar Documents

Publication Publication Date Title
CN111027507A (en) Training data set generation method and device based on video data identification
CN111010590B (en) Video clipping method and device
CN108764133B (en) Image recognition method, device and system
CN107358157B Face liveness detection method and device, and electronic equipment
CN109710780B (en) Archiving method and device
CN109522450B (en) Video classification method and server
Tanberk et al. A hybrid deep model using deep learning and dense optical flow approaches for human activity recognition
CN113128368B (en) Method, device and system for detecting character interaction relationship
CN111311634A (en) Face image detection method, device and equipment
CN107944381B (en) Face tracking method, face tracking device, terminal and storage medium
CN114238904B (en) Identity recognition method, and training method and device of dual-channel hyper-resolution model
Yang et al. Binary descriptor based nonparametric background modeling for foreground extraction by using detection theory
CN110688897A (en) Pedestrian re-identification method and device based on joint judgment and generation learning
Kong et al. Automatic analysis of complex athlete techniques in broadcast taekwondo video
Dai et al. Tan: Temporal aggregation network for dense multi-label action recognition
Elharrouss et al. FSC-set: counting, localization of football supporters crowd in the stadiums
CN112818955A (en) Image segmentation method and device, computer equipment and storage medium
CN111291695B (en) Training method and recognition method for recognition model of personnel illegal behaviors and computer equipment
Kwon et al. Toward an online continual learning architecture for intrusion detection of video surveillance
Patil et al. An approach of understanding human activity recognition and detection for video surveillance using HOG descriptor and SVM classifier
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN112965602A (en) Gesture-based human-computer interaction method and device
Abhishek et al. Human verification over activity analysis via deep data mining
Martínez Carrillo et al. A compact and recursive Riemannian motion descriptor for untrimmed activity recognition
Pham et al. Unsupervised workflow extraction from first-person video of mechanical assembly

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220907

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.