WO2022237157A1 - Video data set labeling method and apparatus - Google Patents

Video data set labeling method and apparatus

Info

Publication number
WO2022237157A1
WO2022237157A1 (application PCT/CN2021/137579, CN2021137579W)
Authority
WO
WIPO (PCT)
Prior art keywords
video
labeling
action
label
data set
Prior art date
Application number
PCT/CN2021/137579
Other languages
French (fr)
Chinese (zh)
Inventor
马筱
乔宇
王利民
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2022237157A1 publication Critical patent/WO2022237157A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the technical field of computer vision, and more specifically, to a video data set labeling method and device.
  • video understanding has been widely used in video content analysis, intelligent surveillance, human-computer interaction and other fields.
  • In deep-learning-based video behavior understanding, there are two particularly important tasks.
  • One is video behavior classification, which classifies trimmed videos according to the human behavior in them.
  • The other is video behavior detection, which aims to locate the start time and end time of an action in a long video.
  • Video action detection, as an important part of video understanding, has been extensively studied in the computer vision community.
  • the existing video labeling schemes mainly have the following defects:
  • Existing video annotation tools are mainly aimed at object detection rather than at marking the start and end times of behavior segments in an untrimmed video.
  • Existing video annotation tools also have relatively simple functions and crude interfaces. For large amounts of untrimmed data there is no convenient, full-featured labeling tool, and the labor cost is high. Moreover, because of the complexity of real-world videos, most existing tools can only annotate one type of label per viewing of a video, whereas in real scenes multiple behaviors often occur simultaneously, so multi-label annotation over a period of time must also be considered.
  • In addition, existing labeling tools often cannot clearly display the time segments that have already been labeled, which easily leads to missed, repeated, or wrong labels; they also fail to reflect the start-end association of the same behavior segment and are inconvenient for secondary quality inspection. The display of labeled behavior segments is therefore also important.
  • the purpose of the present invention is to overcome the above-mentioned defects in the prior art, and provide a video data set labeling method and device.
  • A video data set labeling method includes the following steps:
  • Step S1: determine the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.
  • Step S2: filter out matching videos to be labeled according to the data set labels.
  • Step S3: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • Step S4: perform sampled visual quality inspection on the labeling results and identify background samples and behavior-segment samples with a behavior recognition model; checking the labeling quality in this way greatly saves labor cost while improving accuracy.
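The four steps S1-S4 above can be sketched as a minimal pipeline skeleton. This is an illustrative outline only, not the patent's implementation: the helper functions `rule`, `retrieve`, `annotate`, and `quality_check` are hypothetical placeholders standing in for the selection rules, video retrieval, labeling platform, and behavior recognition model.

```python
def label_videos(candidate_labels, video_library, rule, retrieve, annotate, quality_check):
    """Sketch of the S1-S4 labeling pipeline with placeholder components."""
    # S1: keep only labels that satisfy the action-category selection rules
    labels = [lab for lab in candidate_labels if rule(lab)]
    # S2: retrieve matching videos to be labeled for each selected label
    videos = [v for lab in labels for v in retrieve(video_library, lab)]
    # S3: annotate action type plus start/end frame on the labeling platform
    annotations = [annotate(v) for v in videos]
    # S4: sampled quality inspection with a behavior recognition model
    passed = [a for a in annotations if quality_check(a)]
    return labels, annotations, passed
```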
  • A video data set tagging device includes:
  • Label selection module: determines the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.
  • Video retrieval module: filters out matching videos to be labeled according to the data set labels.
  • Dataset labeling module: uploads the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • Compared with the prior art, the present invention provides a labeling scheme for data sets used in deep-learning temporal action detection, and first performs a duplicate check on the collected videos.
  • The labels of existing video data sets are screened according to certain rules, structured by splitting and merging, and excluded according to certain rules, so that the start and end boundaries of different behaviors are defined more precisely.
  • A video temporal labeling tool is also designed to select the start and end frames of different labels in an input video, so that the time series of multiple labels is better reflected and the labeled behavior segments are displayed more intuitively.
  • Fig. 1 is a flowchart of a video dataset labeling method according to an embodiment of the present invention.
  • Fig. 2 is a schematic diagram of the overall process of a video dataset labeling method according to an embodiment of the present invention.
  • Fig. 3 is a schematic diagram of labeling with a video tagging tool according to an embodiment of the present invention.
  • Fig. 4 is a schematic flowchart of a video tagging tool according to an embodiment of the present invention.
  • The provided video dataset labeling method includes the following steps:
  • Step S110: select data set labels according to set rules.
  • The data set labels are selected according to certain rules, for example: select labels common in general scenarios rather than labels for specific groups of people in specific scenes; exclude labels with overly broad action definitions; exclude labels distinguished mainly by the interactive object rather than by distinct, categorizable changes in human pose; exclude basic body-state labels common to every action; and split actions that can be broken down into atomic actions.
  • The labels are mainly divided into two categories, short-duration instantaneous actions and cyclic actions, to delimit the process cycle of an action.
  • The selected data can be filtered with existing behavior recognition methods.
  • In this way the labels are more reasonable, and coarse-grained actions can be divided into finer actions.
  • Step S120: retrieve and filter out the videos to be labeled according to the selected labels.
  • The duplicate-check process includes: for a video to be processed, perform a nearest-neighbor search in the video library and filter out candidate videos similar to it, obtaining a candidate video set; calculate the similarity between each candidate video and the video to be processed, obtaining a similarity result; and determine, from the similarity result, whether the video to be processed passes the duplicate check.
  • the similarity can be calculated by the Hamming distance of the hash values of the first frame and the last frame of the video.
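The duplicate check above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's exact implementation: frames are assumed to be small grayscale arrays, each video is fingerprinted by a simple average hash of its first and last frames, and similarity is derived from the Hamming distance between those hashes. The 0.9 threshold is an arbitrary example value.

```python
def average_hash(frame):
    """Binary average hash of a grayscale frame given as a 2D list of ints."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p >= mean else 0 for p in pixels]

def hamming(h1, h2):
    """Number of differing bits between two equal-length binary hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def video_similarity(video_a, video_b):
    """Similarity in [0, 1] from first- and last-frame hash distances."""
    dist, total_bits = 0, 0
    for fa, fb in ((video_a[0], video_b[0]), (video_a[-1], video_b[-1])):
        ha, hb = average_hash(fa), average_hash(fb)
        dist += hamming(ha, hb)
        total_bits += len(ha)
    return 1.0 - dist / total_bits

def is_duplicate(video_a, video_b, threshold=0.9):
    """Flag two videos as duplicates when their similarity exceeds a threshold."""
    return video_similarity(video_a, video_b) >= threshold
```

In a real system the hash would be computed on downscaled frames (e.g. 8x8) so that the fingerprint is robust to re-encoding, and the nearest-neighbor search would prune the library before pairwise comparison.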
  • Step S130: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • The video is input to the labeling tool platform for labeling, where the upper left is the selection area for actually marking start and end frames; the upper right is the video selection area, which supports batch selection (labeled videos display their labeling results); below the video selection is the label selection menu, whose delete, label, and save buttons respectively delete a wrong action box, label an action box, and save the existing results.
  • In this way, multiple labels can be annotated at the same time, which is of greater practical significance.
  • Below the menu selection is the timeline display area for the actual labeling results, which makes it convenient for labelers to check for wrong and missed labels and to perform secondary quality inspection; the lower left is the video playback area, where the video can be browsed quickly via the slide bar.
  • Keyboard shortcuts can be set during actual labeling so that labelers can work more quickly.
  • The designed labeling tool platform has richer functions, more convenient operation, and a more intuitive interface; labeling with the above process helps determine clearer boundaries and allows a period of time to be labeled with multiple labels, reflecting the start-end association of the same behavior segment.
  • The labeling rules and processes are more precise, reducing labeling bias and the boundary uncertainty of temporal action localization.
  • Step S140: perform quality inspection on the labeling results of the dataset.
  • After the results labeled by the video labeling tool are obtained, they can be sampled and visualized for quality inspection, and background samples and behavior-segment samples can be identified by a model. Checking the labeling quality through this identification greatly saves labor cost while improving precision. For example, identification can be performed with the TSN (Temporal Segment Networks) method.
  • The method mainly consists of a spatial-stream convolutional network and a temporal-stream convolutional network. Unlike the two-stream approach, which uses a single frame or a single stack of frames, TSN uses a series of short snippets sparsely sampled from the entire video; each snippet gives its own preliminary prediction of the behavior category, and the video-level prediction is obtained from the "consensus" of these snippets. During learning, the loss is computed on the video-level prediction. The results are then tallied, and the quality inspection results determine whether the labeled data set meets expectations.
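The TSN-style sparse sampling and consensus described above can be sketched minimally as follows. This is an assumption-laden illustration, not the TSN implementation: the video is split into `k` equal segments, one snippet index is sampled from each, and per-snippet class scores are averaged into a video-level prediction. The `score_fn` callable is a hypothetical stand-in for the real two-stream convolutional networks.

```python
import random

def sample_snippets(num_frames, k):
    """Pick one random frame index from each of k equal segments."""
    bounds = [num_frames * i // k for i in range(k + 1)]
    return [random.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
            for i in range(k)]

def video_level_prediction(frames, score_fn, k=3):
    """Average per-snippet class scores into a video-level score vector."""
    idx = sample_snippets(len(frames), k)
    snippet_scores = [score_fn(frames[i]) for i in idx]
    num_classes = len(snippet_scores[0])
    # "consensus": average the k snippet score vectors class by class
    return [sum(s[c] for s in snippet_scores) / k for c in range(num_classes)]
```

In training, the loss would be applied to this averaged video-level vector rather than to individual snippet predictions, which is what makes the sparse sampling scheme learn video-level behavior.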
  • The boundary matching network (BMN) adopts a new temporal-proposal confidence evaluation mechanism, the boundary matching mechanism, and builds the network on it.
  • The BMN network can simultaneously generate a one-dimensional boundary probability sequence and a two-dimensional BM confidence map to densely evaluate the confidence scores of all possible temporal proposals.
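The dense evaluation idea behind a BM-style confidence map can be sketched as follows. This is a hypothetical simplification, not BMN itself: given one-dimensional start and end boundary probability sequences over `t` temporal positions, every candidate proposal (start `s`, duration `d`) is scored into a 2D map; the real BMN learns this map with a dedicated network, whereas here the score is simply `start_prob[s] * end_prob[s + d]`.

```python
def bm_confidence_map(start_prob, end_prob, max_duration):
    """Densely score all (start, duration) proposals into a 2D confidence map.

    conf[d - 1][s] holds the confidence of the proposal starting at position s
    and ending at position s + d.
    """
    t = len(start_prob)
    conf = [[0.0] * t for _ in range(max_duration)]
    for d in range(1, max_duration + 1):
        for s in range(t - d):  # end index s + d must stay inside the sequence
            conf[d - 1][s] = start_prob[s] * end_prob[s + d]
    return conf
```

The highest-scoring cell of the map then identifies the most confident temporal proposal, which is the quantity the boundary matching mechanism evaluates densely.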
  • The dense boundary action generator network estimates dense boundary confidence maps for all action proposals through a fast, end-to-end dense boundary action generator.
  • The sub-graph localization model for temporal action detection transforms the temporal action detection problem into a sub-graph localization problem by adaptively fusing multi-level semantic context information.
  • The evaluation metric is mainly AR@AN: performance is judged by the relationship between average recall (AR) and the average number of proposals (AN). The area under the AR-AN curve (AUC) is also calculated as a further metric on the ActivityNet-1.3 dataset, where AN ranges from 0 to 100. As can be seen from Table 1, because the data set constructed with the label screening method of the present invention has more accurate labeling results, existing video temporal detection methods perform much lower on it than on other data sets.
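The AR-AN AUC described above can be computed as a simple trapezoidal integral. This sketch assumes the curve is given as sampled (AN, AR) pairs and normalizes by the AN span, reporting the result in percent as is conventional for ActivityNet-1.3; the exact sampling grid used by any particular benchmark may differ.

```python
def ar_an_auc(an_values, ar_values):
    """Area under the AR@AN curve, normalized to the AN span, in percent.

    an_values: increasing average-number-of-proposals samples (e.g. 0..100)
    ar_values: average recall at each corresponding AN sample
    """
    auc = 0.0
    for i in range(1, len(an_values)):
        width = an_values[i] - an_values[i - 1]
        auc += width * (ar_values[i] + ar_values[i - 1]) / 2.0  # trapezoid
    return 100.0 * auc / (an_values[-1] - an_values[0])
```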
  • the present invention also provides a video data set labeling device, which is used to realize one or more aspects of the above method.
  • The device includes: a label selection module, used to determine the data set labels according to set action-category selection rules, the labels representing short-duration instantaneous actions and cyclic actions; a video retrieval module, used to filter out matching videos to be labeled according to the data set labels; and a dataset labeling module, which uploads the videos to be labeled to the labeling tool platform for action detection and labeling so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • The present invention can be applied to video data set labeling in various fields, for example video-assisted refereeing. Because the present invention is not sensitive to the speed or duration of movements in a video, it can be widely used in various sports scenes, such as yoga with slow movements and gymnastics with rapid movements; through a more precise labeling method, the boundary between different actions can be judged more clearly. It can likewise be used for intelligent video review, where abnormal-action identification can be completed on a mobile terminal and whether an abnormal action occurs can be judged from the action's start boundary; it can also be applied to other recognition tasks, such as anomaly detection on production pipelines.
  • As another example, applied to smart security it can perform action recognition directly on smart terminals with limited computing resources, such as smart glasses, drones, and smart cameras, feeding back abnormal behaviors directly and improving the immediacy and accuracy of patrols.
  • Video temporal detection has wide application value in both academia and industry.
  • However, labeling existing video datasets is rather costly.
  • Some action labels are defined relatively coarsely, which is not suitable for defining accurate action boundaries.
  • Different granularities of human actions also make detection difficult.
  • The more precise labeling granularity of the present invention draws on the labeling characteristics of existing behavior-related data sets, and performs exclusion, screening, and splitting according to the criteria for selecting action categories.
  • Existing video temporal annotation tools have few functions and relatively simple interfaces.
  • The present invention therefore designs a video temporal detection tool.
  • A video browsing area is provided to help the annotator quickly preview the entire video, with functions such as fast forward;
  • the start/end frame selection area uses different operations to annotate the start and end of an action, representing start and end frames with different markings;
  • the label selection area selects the label category for different segments, with categories organized into menu options to facilitate multi-label annotation;
  • the operation menu area is used to add, delete, and modify labels, and the label display area helps labelers view the results and avoid missed, repeated, and wrong labels.
  • A labeling guide was also designed, using Wikipedia and related sports guides to clearly define the boundaries of each label through text and pictures.
  • Compared with purely manual work, this improves both the efficiency and the quality of labeling.
  • For quality inspection, the prior art mainly relies on manual checking.
  • The present invention classifies background-segment and behavior-segment samples of the labeling results with an existing behavior recognition model; compared with existing datasets, the boundary definition is more precise.
  • the present invention can be a system, method and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present invention.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions recorded thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in each computing/processing device.
  • Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, or Python, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In the latter case, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet service provider).
  • Electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be customized by utilizing the state information of computer-readable program instructions, and such circuits can execute the computer-readable program instructions to implement various aspects of the present invention.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium, causing computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions constitutes an article of manufacture comprising instructions that implement various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • The functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a video data set labeling method and apparatus. The method comprises: determining a data set label according to a set action category selection rule, the data set label representing short-time instantaneous action and cyclic action types; choosing, according to the data set label, a matched video to be labeled; and uploading the video to be labeled to a labeling tool platform for action behavior detection and labeling, so as to determine an action behavior type label and corresponding start frame and end frame positions. According to the present invention, the boundary definition of the action behavior is more accurate, and both the labeling efficiency and the labeling quality are remarkably improved.

Description

A video dataset labeling method and device

Technical field

The present invention relates to the technical field of computer vision, and more specifically, to a video data set labeling method and device.

Background art

In recent years, video understanding has been widely used in video content analysis, intelligent surveillance, human-computer interaction, and other fields. In deep-learning-based video behavior understanding there are two particularly important tasks. One is video behavior classification, which classifies trimmed videos according to the human behavior in them. The other is video behavior detection, which aims to locate the start time and end time of an action in a long video. Video action detection, as an important part of video understanding, has been extensively studied in the computer vision community.

Compared with behavior classification, behavior detection is more difficult. Existing behavior detection methods usually first generate segment proposals that may contain actions and then classify them. However, because boundaries are defined ambiguously, and multiple actions may proceed simultaneously in the same video, accurately detecting actions is highly challenging. Unlike behavior recognition, behavior detection requires precise detection of action segments; for actions occurring in real scenes, the boundaries are often not very certain, especially the end of an action, and judging the completeness of an action is also relatively difficult. Because the boundaries in video itself are unclear and existing temporal-detection labeling tools are relatively crude, most existing video temporal detection datasets are only weakly calibrated, which also explains the low average precision of current behavior detection.
Analysis shows that existing video labeling schemes mainly have the following defects:

1) The label definitions of existing video temporal detection datasets are relatively coarse-grained; the temporal durations of different labels vary greatly, and the boundaries of different labels are not clearly defined, so the start and end boundaries cannot be clarified intuitively.

2) Existing video annotation tools are mainly aimed at object detection rather than at marking the start and end times of behavior segments in an untrimmed video. In addition, existing tools have relatively simple functions and crude interfaces; for large amounts of untrimmed data there is no convenient, full-featured labeling tool, and the labor cost is high. Because of the complexity of real-world videos, most existing tools can only annotate one type of label per viewing of a video, whereas in real scenes multiple behaviors often occur simultaneously, so multi-label annotation over a period of time must also be considered. Moreover, existing tools often cannot clearly display the time segments already labeled, which easily leads to missed, repeated, or wrong labels; they also fail to reflect the start-end association of the same behavior segment and are inconvenient for secondary quality inspection, so the display of labeled behavior segments is also important.
Technical problem

The purpose of the present invention is to overcome the above defects of the prior art and provide a video data set labeling method and device.
Technical solution

According to a first aspect of the present invention, a video data set labeling method is provided. The method includes the following steps:

Step S1: determine the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.

Step S2: filter out matching videos to be labeled according to the data set labels.

Step S3: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.

Step S4: perform sampled visual quality inspection on the labeling results and identify background samples and behavior-segment samples with a behavior recognition model; checking the labeling quality in this way greatly saves labor cost while improving accuracy.

According to a second aspect of the present invention, a video data set labeling device is provided. The device includes:

Label selection module: determines the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.

Video retrieval module: filters out matching videos to be labeled according to the data set labels.

Dataset labeling module: uploads the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
有益效果Beneficial effect
与现有技术相比,本发明的优点在于,提供了一种用于深度学习行为时序检测的数据集的标注技术方案,首先对收集来的视频进行视频查重。为了更好地统一边界定义指标,针对现有的视频数据集的标签根据一定规则进行筛选,通过拆解,合并等方式将标签结构化,并按照一定规则进行了排除,更加精确不同行为的起止边界。此外,还设计了一个视频时序标注的工具,针对一段输入视频进行不同标签起止帧的选取,使得多种标签的时间序列更好地体现,并且对已标注行为段都可以更加直观的展现。Compared with the prior art, the present invention has the advantage that it provides a technical solution for tagging data sets for deep learning behavior timing detection, and first performs video plagiarism check on collected videos. In order to better unify the boundary definition indicators, the tags of the existing video data sets are screened according to certain rules, and the tags are structured by dismantling and merging, and are excluded according to certain rules, so as to more accurately start and stop different behaviors boundary. In addition, a video time sequence labeling tool is also designed to select the start and end frames of different labels for an input video, so that the time series of multiple labels can be better reflected, and the marked behavior segments can be displayed more intuitively.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a video data set labeling method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the overall process of a video data set labeling method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of labeling with the video labeling tool according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the video labeling tool according to an embodiment of the present invention.
Embodiments of the present invention
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the specification.
In all examples shown and discussed herein, any specific value should be interpreted as merely exemplary rather than limiting; other instances of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item is defined in one figure, it need not be discussed further in subsequent figures.
As shown in Fig. 1 and Fig. 2, the provided video data set labeling method includes the following steps.
Step S110: select data set labels according to set rules.
In this step, the data set labels are defined by a set of rules, for example: select common labels in general scenarios rather than labels tied to specific scenarios or specific groups of people; exclude labels whose action definition is too broad; exclude labels that are distinguished mainly by differences in the interacted object rather than by differences in human posture; exclude basic body-state labels that are common to every action; and split actions that can be decomposed into atomic actions.
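The selection rules above can be pictured as a simple rule-based filter over candidate labels. The rule sets and label names below are illustrative assumptions for this sketch, not categories from the disclosure:

```python
# Hypothetical sketch of the label-selection rules in step S110.
# All label names and rule sets here are invented for illustration.

SCENE_SPECIFIC = {"referee signaling"}      # specific-scenario, specific-group labels
BROAD = {"doing sports"}                    # action defined too broadly
OBJECT_DRIVEN = {"opening a bottle"}        # distinguished by object, not by posture
BASE_STATE = {"standing", "sitting"}        # body states common to every action
SPLITTABLE = {"triple jump": ["hop", "step", "jump"]}  # decomposable into atoms

def select_labels(candidates):
    """Apply the exclusion rules, then split compound actions."""
    kept = []
    for label in candidates:
        if (label in SCENE_SPECIFIC or label in BROAD
                or label in OBJECT_DRIVEN or label in BASE_STATE):
            continue                        # excluded by one of the rules
        kept.extend(SPLITTABLE.get(label, [label]))
    return kept

labels = select_labels(["jumping", "doing sports", "standing", "triple jump"])
```

Here the coarse-grained "triple jump" is split into three atomic actions, while broad and body-state labels are dropped.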
In one embodiment, the labels are divided into two main categories, short-duration instantaneous actions and cyclic actions, to partition the process cycle of an action. The selected data can be screened with existing action recognition methods.
By defining the "action" in each action category and then screening and partitioning, the labels become more reasonable, and coarse-grained actions can be divided into finer-grained ones.
Step S120: retrieve and filter the videos to be labeled according to the selected labels.
Relevant videos are collected for the labels and then checked for duplicates and screened. In one embodiment, the duplicate check includes: for a video to be processed, performing a nearest-neighbor search in the video library to select candidate videos similar to it, yielding a candidate set; computing the similarity between each candidate video and the video to be processed; and determining, from the similarity results, whether the video passes the duplicate check. The similarity can be computed as the Hamming distance between the hash values of the first and last frames of the videos.
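A minimal sketch of this duplicate check follows, assuming an average hash over already-decoded 8x8 grayscale frames; in practice the video frames would first be decoded and downsampled, and the threshold value is an assumption, not a figure from the disclosure:

```python
# Duplicate check via Hamming distance between frame hashes, as described
# above. Frames are plain 8x8 grayscale arrays (lists of lists of ints);
# decoding/downsampling real video frames is outside this sketch.

def average_hash(frame):
    """64-bit average hash: each bit marks whether a pixel exceeds the mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def is_duplicate(video_a, video_b, threshold=5):
    """video_* = (first_frame, last_frame); small distances at both ends
    suggest the two videos are the same clip."""
    d_first = hamming(average_hash(video_a[0]), average_hash(video_b[0]))
    d_last = hamming(average_hash(video_a[1]), average_hash(video_b[1]))
    return d_first <= threshold and d_last <= threshold
```

Two identical clips give distance 0 at both ends; unrelated clips give large distances and are kept as distinct videos.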
Step S130: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action-type label and the corresponding start-frame and end-frame positions.
Specifically, as shown in Fig. 3, a video is input to the labeling tool platform for annotation. The upper left is the selection area for marking the actual start and end frames; the upper right is the video selection area, where videos can be selected in batches (already-labeled videos display their results); below the video selection is the label selection menu, whose delete, label, and save buttons respectively remove a mistaken annotation box, create an annotation box, and save the current results.
The labeling tool platform of Fig. 3 supports annotating multiple labels simultaneously, which is of practical significance. Below the menu is a timeline display of the actual labeling results, which helps annotators check for wrong labels and missing labels and perform secondary quality inspection; the lower left is the video playback area, where the video can be browsed quickly with a slider. Keyboard shortcuts can also be configured so that annotators can label more quickly.
Referring to Fig. 4, the annotator first inputs the video to be labeled and clicks play, or browses the whole video with the slider. In the area that displays the 24 frames starting from the frame given in the frame-number box, the annotator selects the start and end frames of the action to be labeled (for example, the two boxes show the marked start frame and the marked end frame, respectively); a label is added through the menu bar and its start and end are selected; the labeled action segments are then checked on the timeline, which helps annotators find wrong labels and missing labels and perform secondary quality inspection; the labeling results are confirmed through the save and delete menus, and labeling ends.
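The multi-label result such a tool could emit can be pictured as one record per labeled segment, with a start and end frame for each label. The record format, field names, and labels below are assumptions for illustration:

```python
# Hypothetical annotation record for one video, as a tool like the one in
# Figs. 3-4 could save it. Field names and labels are invented here.

annotation = {
    "video": "clip_0001.mp4",
    "fps": 24,
    "segments": [
        {"label": "jump", "start_frame": 12, "end_frame": 40},
        {"label": "wave", "start_frame": 30, "end_frame": 55},  # labels may overlap
    ],
}

def timeline(record, num_frames):
    """Per-frame list of active labels, as a timeline view would render it."""
    frames = [[] for _ in range(num_frames)]
    for seg in record["segments"]:
        for f in range(seg["start_frame"], seg["end_frame"] + 1):
            frames[f].append(seg["label"])
    return frames

tl = timeline(annotation, 60)
```

Rendering the per-frame labels this way makes overlapping segments and gaps visible at a glance, which is exactly what the timeline check relies on.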
In summary, compared with existing labeling tools, the designed platform is richer in functionality, easier to operate, and more intuitive. Labeling with the above process helps determine clearer boundaries and supports multi-label annotation over a time span, reflecting the start-end association within the same action segment. Moreover, the labeling rules and workflow are more precise, reducing annotation bias and the boundary uncertainty of temporal action localization.
Step S140: perform quality inspection on the data set labeling results.
After the results labeled with the video labeling tool are obtained, a sampled visual quality inspection can be performed, and a model can identify the background samples and action-segment samples; checking labeling quality through such recognition greatly reduces labor cost while improving accuracy. For example, recognition can be performed with the TSN (Temporal Segment Networks) method, which consists mainly of a spatial-stream convolutional network and a temporal-stream convolutional network. Unlike two-stream methods that use a single frame or a single stack of frames, TSN sparsely samples a series of short snippets from the whole video; each snippet gives its own preliminary prediction of the action category, and the video-level prediction is obtained from the consensus of these snippets. During training, the model parameters are updated iteratively to optimize the loss of the video-level prediction. The results are aggregated, and the quality-inspection results determine whether the labeled data set meets expectations.
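The TSN-style sparse sampling and consensus described above can be sketched as follows. The per-snippet class scores are stand-ins for the outputs of a trained two-stream network, and the averaging consensus is one common choice, not the only one:

```python
# Sketch of TSN-style sparse sampling and consensus: split the video into
# K segments, draw one snippet per segment, score each snippet, and average
# the scores into a video-level prediction. The classifier itself is assumed.

import random

def sparse_sample(num_frames, k=3, seed=0):
    """Pick one frame index from each of k equal-length segments."""
    rng = random.Random(seed)
    seg_len = num_frames // k
    return [i * seg_len + rng.randrange(seg_len) for i in range(k)]

def consensus(snippet_scores):
    """Average per-snippet class scores into a video-level score vector."""
    k = len(snippet_scores)
    n_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / k for c in range(n_classes)]

# Made-up per-snippet scores for a 2-class example (action vs. background).
scores = consensus([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
predicted = scores.index(max(scores))
```

Applied to quality inspection, segments labeled as background but scored as action (or vice versa) can be flagged for manual review.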
To verify the effect of the present invention, different methods were compared on the data set; the results are shown in Table 1 below.
Table 1: Comparison of different methods on the data set
(Table 1 is provided as an image in the original publication.)
In Table 1, three state-of-the-art temporal detection methods, BMN, DBG, and G-TAD, were evaluated on the data set using RGB, optical-flow, and RGB+optical-flow features extracted with a two-stream 3D convolution (I3D) model. The Boundary-Matching Network (BMN) introduces a new proposal confidence evaluation mechanism, the boundary-matching mechanism, together with a network built on it; BMN simultaneously generates a one-dimensional boundary probability sequence and a two-dimensional BM confidence map to densely evaluate the confidence scores of all possible temporal proposals. The Dense Boundary Generator (DBG) estimates dense boundary confidence maps for all action proposals with a fast, end-to-end generator. The sub-graph localization model for temporal action detection (G-TAD) adaptively fuses multi-level semantic context and casts temporal action detection as a sub-graph localization problem. The evaluation metric is mainly AR@AN, i.e., the average recall (AR) measured as a function of the average number of proposals (AN); the area under the AR-AN curve (AUC) is also computed as a further metric, as on the ActivityNet-1.3 data set, with AN ranging from 0 to 100. As Table 1 shows, because the labeling of the data set constructed with the label-screening method of the present invention is more precise, existing temporal detection methods perform much worse on it than on other data sets.
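The AUC summary mentioned above can be sketched with a trapezoidal rule over the AN range. The AR values below are made-up numbers for illustration, not results from Table 1:

```python
# Area under the AR-vs-AN curve, normalized by the AN span, as a single
# scalar summary of proposal quality. Sample points are invented here.

def auc_ar_an(an_values, ar_values):
    """Trapezoidal area under the AR@AN curve, divided by the AN range."""
    area = 0.0
    for i in range(1, len(an_values)):
        width = an_values[i] - an_values[i - 1]
        area += width * (ar_values[i] + ar_values[i - 1]) / 2
    return area / (an_values[-1] - an_values[0])

an = [1, 10, 50, 100]          # average number of proposals
ar = [0.20, 0.45, 0.60, 0.68]  # illustrative average recall at each AN
score = auc_ar_an(an, ar)
```

A harder, more precisely labeled data set depresses AR at every AN and therefore the AUC, which is the effect Table 1 reports.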
Correspondingly, the present invention also provides a video data set labeling apparatus for implementing one or more aspects of the above method. For example, the apparatus includes: a label selection module, which determines data set labels according to set action-category selection rules, the labels characterizing short-duration instantaneous actions and cyclic action types; a video retrieval module, which filters out matching videos to be labeled according to the data set labels; and a data set labeling module, which uploads the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action-type label and the corresponding start-frame and end-frame positions.
In the field of computer vision, beyond public data sets, many application scenarios require dedicated data sets for transfer learning or end-to-end training, which demands large amounts of training data. The present invention can be applied to video data set labeling in many fields, for example video-assisted refereeing. Because it is insensitive to the speed of video actions, it generalizes to many sports scenarios, such as slow-paced yoga and rapidly changing gymnastics; the more precise labeling makes boundary judgments between different actions clearer. It can also be used for intelligent video review, where abnormal-action recognition and judgment are completed on a mobile terminal and the action-start boundary indicates whether an abnormal behavior is about to occur; it can likewise be applied to other recognition tasks, such as pipeline anomaly detection. As another example, in intelligent security, action recognition can run directly on compute-constrained smart terminals such as smart glasses, drones, and smart cameras, feeding back abnormal behaviors immediately and improving the timeliness and accuracy of patrols.
In summary, compared with the prior art, the advantages of the present invention are mainly as follows:
1) Temporal video detection has broad application value in both academia and industry, but because action boundaries lack clear definitions and manual annotation is costly, existing video data sets are characterized by high cost and weak labeling; some action-label definitions are relatively coarse and unsuitable for defining accurate action boundaries. Moreover, different actions have different process cycles, and human actions at different granularities complicate detection. The present invention adopts a more precise labeling granularity, drawing on the labeling characteristics of existing action-related data sets and performing exclusion, screening, and splitting according to the criteria for action-category selection.
2) Existing video temporal annotation tools have limited functionality and relatively crude interfaces. To help annotators label action video segments effectively and consistently, the present invention designs a temporal video annotation tool. It provides a video browsing area that helps the annotator quickly preview the whole video, with fast-forward and similar functions; a start/end frame selection area, where the start and end of an action are annotated through different operations and marked distinctly; a label selection area, where label categories are chosen for different segments, with categories organized into menu options to ease multi-label annotation; and an operation menu area for adding, deleting, and modifying labels, together with a label display area that helps annotators review results and avoid missing, duplicated, or wrong labels. In addition, to minimize bias from subjective judgment and ensure data consistency, an annotation guide was designed that clearly defines the boundary of each label with text and pictures drawn from Wikipedia and related sports guides. Compared with purely manual work, this improves both annotation efficiency and annotation quality.
3) The prior art relies mainly on manual inspection. On top of a manual secondary quality inspection, the present invention uses existing action recognition models to classify the background segments and action segments of the labeling results. Compared with existing data sets, the boundary definitions are more precise.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. It may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, or semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the above. As used herein, a computer-readable storage medium is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical-fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented languages such as Smalltalk, C++, and Python, and conventional procedural languages such as the "C" language or similar languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized with state information of the computer-readable program instructions and may execute those instructions to implement aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed via the processor, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. The instructions may also be stored in a computer-readable storage medium and direct a computer, programmable data processing apparatus, and/or other device to function in a particular manner, such that the medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the specified functions/acts.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device to cause a series of operational steps to be performed thereon, producing a computer-implemented process such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block, and combinations of blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over existing technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

  1. A video data set labeling method, comprising the following steps:
    Step S1: determining data set labels according to set action-category selection rules, the labels characterizing short-duration instantaneous actions and cyclic action types;
    Step S2: filtering out matching videos to be labeled according to the data set labels;
    Step S3: uploading the videos to be labeled to a labeling tool platform for action detection and labeling, so as to determine the action-type label and the corresponding start-frame and end-frame positions.
  2. The method according to claim 1, wherein determining the data set labels according to the set action-category selection rules comprises:
    selecting common labels in general scenarios and excluding labels tied to specific scenarios or specific groups of people;
    excluding labels whose action definition is broad;
    excluding labels classified by differences in the interacted object rather than by differences in human posture;
    excluding basic body-state labels common to every action;
    splitting decomposable actions to obtain fine-grained labels.
  3. The method according to claim 1, wherein step S2 comprises:
    collecting relevant videos according to the data set labels and performing duplicate checking and screening, wherein the duplicate check computes similarity as the Hamming distance between the hash values of the first and last frames of the videos;
    determining, according to the similarity result, whether the video to be processed passes the duplicate check.
  4. The method according to claim 1, wherein the labeling tool platform is provided with a start-frame selection area, a video selection area, a label selection area, a result display area, and a video playback area, wherein the start-frame selection area is used by the user to mark the start and end frames; the video selection area is used to select one or more videos to be labeled; the label selection area is used by the user to mark action labels; the result display area displays the labeling start time to the user; and the video playback area displays consecutive frames of the video to be labeled so that the start frame of an action can be marked.
  5. The method according to claim 4, wherein step S3 comprises:
    inputting the video to be labeled and clicking the play button, or browsing the video with the slider;
    displaying 24 consecutive frames in the video selection area for the user to select the start and end frames of the action to be labeled;
    adding a label through the menu bar of the label selection area and selecting the start and end of the label;
    checking the labeled action segments on the timeline of the result display area, so that the user can check for wrong labels and missing labels and perform secondary quality inspection;
    selecting the labeling results through the save and delete menus provided in the label selection area.
  6. The method according to claim 1, further comprising:
    Step S4: performing sampled visual quality inspection on the labeling results and identifying background samples and action-segment samples with a behavior recognition model.
  7. The method according to claim 6, wherein step S4 comprises:
    recognizing background video segments and action video segments using a TSN action recognition model, and obtaining the model's predicted class scores for the action categories, thereby verifying the quality of the labeling results.
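The score-based check in claim 7 can be sketched generically: run a recognition model over each labeled segment and flag those whose score contradicts their label. The sketch below injects the model as a `predict_fn` callable so it stays self-contained; in the patent that role is played by a TSN model, and the function name, threshold value, and `"background"` label convention are assumptions.

```python
def quality_check(segments, predict_fn, score_threshold=0.5):
    """Flag labeled segments whose model scores contradict their labels.

    segments: list of (video_id, start, end, label) tuples.
    predict_fn(video_id, start, end) -> dict mapping class name to score.
    Returns a list of (video_id, start, end, label, score) for suspect segments.
    """
    suspicious = []
    for video_id, start, end, label in segments:
        scores = predict_fn(video_id, start, end)
        if label == "background":
            # Background segments should score low on every action class.
            top = max(scores.values(), default=0.0)
            if top >= score_threshold:
                suspicious.append((video_id, start, end, label, top))
        else:
            # Action segments should score high on their assigned label.
            s = scores.get(label, 0.0)
            if s < score_threshold:
                suspicious.append((video_id, start, end, label, s))
    return suspicious


# Toy stand-in for the recognition model, for illustration only.
def fake_predict(video_id, start, end):
    return {"jump": 0.9} if video_id == "good" else {"jump": 0.2}


flagged = quality_check(
    [("good", 0, 10, "jump"), ("bad", 0, 10, "jump"), ("bg", 0, 10, "background")],
    fake_predict,
)
print(flagged)  # [('bad', 0, 10, 'jump', 0.2)]
```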
  8. A video data set labeling apparatus, comprising:
    a label selection module, configured to determine data set labels according to set action-category selection rules, the data set labels representing short instantaneous actions and cyclic action types;
    a video retrieval module, configured to filter out matching videos to be labeled according to the data set labels; and
    a data set labeling module, configured to upload the videos to be labeled to a labeling tool platform for action detection and labeling, so as to determine action-type labels and the corresponding start-frame and end-frame positions.
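The three modules of the apparatus can be wired into one pipeline, sketched below under stated assumptions: the class name, the rule/index data shapes, and the stubbed "upload" step are all illustrative, not the patent's implementation (the real labeling module hands videos to the labeling tool platform of claim 4).

```python
class VideoDatasetLabeler:
    """Sketch of the claimed apparatus: label selection, video retrieval,
    and dataset labeling modules composed into one pipeline."""

    def __init__(self, category_rules, video_index):
        self.category_rules = category_rules  # action-category selection rules
        self.video_index = video_index        # label -> candidate video ids

    def select_labels(self):
        """Label selection module: keep short instantaneous and cyclic actions."""
        return [c["name"] for c in self.category_rules
                if c["type"] in ("instantaneous", "cyclic")]

    def retrieve_videos(self, labels):
        """Video retrieval module: filter candidate videos by dataset label."""
        return {lb: self.video_index.get(lb, []) for lb in labels}

    def upload_for_labeling(self, videos_by_label):
        """Dataset labeling module: hand candidates to the labeling platform
        (stubbed here as a flat list of upload jobs)."""
        return [(lb, vid) for lb, vids in videos_by_label.items() for vid in vids]


labeler = VideoDatasetLabeler(
    [{"name": "jump", "type": "instantaneous"},
     {"name": "wave", "type": "cyclic"},
     {"name": "cook", "type": "long"}],
    {"jump": ["v1"], "wave": ["v2", "v3"]},
)
labels = labeler.select_labels()          # ['jump', 'wave'] — 'cook' filtered out
jobs = labeler.upload_for_labeling(labeler.retrieve_videos(labels))
print(jobs)  # [('jump', 'v1'), ('wave', 'v2'), ('wave', 'v3')]
```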
  9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/137579 2021-05-10 2021-12-13 Video data set labeling method and apparatus WO2022237157A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110505869.1 2021-05-10
CN202110505869.1A CN113139096B (en) 2021-05-10 2021-05-10 Video dataset labeling method and device

Publications (1)

Publication Number Publication Date
WO2022237157A1 true WO2022237157A1 (en) 2022-11-17

Family

ID=76818024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137579 WO2022237157A1 (en) 2021-05-10 2021-12-13 Video data set labeling method and apparatus

Country Status (2)

Country Link
CN (1) CN113139096B (en)
WO (1) WO2022237157A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139096B (en) * 2021-05-10 2024-04-23 中国科学院深圳先进技术研究院 Video dataset labeling method and device
CN114373075A (en) * 2021-12-31 2022-04-19 西安电子科技大学广州研究院 Target component detection data set construction method, detection method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457494A (en) * 2019-08-01 2019-11-15 新华智云科技有限公司 Data mask method, device, electronic equipment and storage medium
CN110996138A (en) * 2019-12-17 2020-04-10 腾讯科技(深圳)有限公司 Video annotation method, device and storage medium
CN112101297A (en) * 2020-10-14 2020-12-18 杭州海康威视数字技术股份有限公司 Training data set determination method, behavior analysis method, device, system and medium
CN113139096A (en) * 2021-05-10 2021-07-20 中国科学院深圳先进技术研究院 Video data set labeling method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163122B (en) * 2020-10-30 2024-02-06 腾讯科技(深圳)有限公司 Method, device, computing equipment and storage medium for determining label of target video


Also Published As

Publication number Publication date
CN113139096B (en) 2024-04-23
CN113139096A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
US20220075806A1 (en) Natural language image search
US11836996B2 (en) Method and apparatus for recognizing text
Zhou et al. Salient region detection using diffusion process on a two-layer sparse graph
CN108052577B (en) Universal text content mining method, device, server and storage medium
CN108846126B (en) Generation of associated problem aggregation model, question-answer type aggregation method, device and equipment
US20210165817A1 (en) User interface for context labeling of multimedia items
CN108334627B (en) Method and device for searching new media content and computer equipment
Cooper et al. It takes two to tango: Combining visual and textual information for detecting duplicate video-based bug reports
US20160005171A1 (en) Image Analysis Device, Image Analysis System, and Image Analysis Method
CN104573130B (en) The entity resolution method and device calculated based on colony
WO2022237157A1 (en) Video data set labeling method and apparatus
WO2015061046A2 (en) Method and apparatus for performing topic-relevance highlighting of electronic text
CN102436483A (en) Video advertisement detecting method based on explicit type sharing subspace
WO2016014373A1 (en) Identifying presentation styles of educational videos
CN110851712A (en) Book information recommendation method and device and computer readable medium
Zhao et al. Effective local and global search for fast long-term tracking
Zang et al. Multimodal icon annotation for mobile applications
Sun et al. Ui components recognition system based on image understanding
Jeong et al. Automatic detection of slide transitions in lecture videos
Huang et al. Visual attention learning and antiocclusion-based correlation filter for visual object tracking
Bergh et al. A curated set of labeled code tutorial images for deep learning
Jin Seeing the Unseen: Errors and Bias in Visual Datasets
Bhanbhro et al. Symbol Detection in a Multi-class Dataset Based on Single Line Diagrams using Deep Learning Models
Sindel et al. SliTraNet: Automatic Detection of Slide Transitions in Lecture Videos using Convolutional Neural Networks
US11947590B1 (en) Systems and methods for contextualized visual search

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941719

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21941719

Country of ref document: EP

Kind code of ref document: A1