WO2022237157A1 - Video data set labeling method and apparatus - Google Patents

Video data set labeling method and apparatus

Info

Publication number
WO2022237157A1
WO2022237157A1 (application PCT/CN2021/137579, CN2021137579W)
Authority
WO
WIPO (PCT)
Prior art keywords
video
labeling
action
label
data set
Prior art date
Application number
PCT/CN2021/137579
Other languages
French (fr)
Chinese (zh)
Inventor
马筱
乔宇
王利民
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2022237157A1 publication Critical patent/WO2022237157A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the technical field of computer vision, and more specifically, to a video data set labeling method and device.
  • video understanding has been widely used in video content analysis, intelligent surveillance, human-computer interaction and other fields.
  • In deep-learning-based video behavior understanding, there are two particularly important tasks.
  • One is video behavior classification, which classifies trimmed videos according to the human behavior in them.
  • The other is video behavior detection, which aims to locate the start time and end time of an action in a long video.
  • Video action detection, as an important part of video understanding, has been extensively studied in the computer vision community.
  • the existing video labeling schemes mainly have the following defects:
  • Existing video annotation tools are mainly aimed at object detection rather than at marking the start and end times of behavior segments in an untrimmed video.
  • Existing video annotation tools also have relatively simple functions and crude interfaces. For large amounts of untrimmed data there is no convenient, full-featured labeling tool, and the labor cost is high. Moreover, because of the complexity of real-world videos, most existing tools can only annotate one type of label per viewing of a video, whereas in real scenes multiple behaviors often occur simultaneously, so multi-label annotation over a period of time must also be considered.
  • In addition, existing labeling tools often cannot clearly display the time segments that have already been labeled, which easily leads to missed, repeated, or wrong labels; they also fail to reflect the start-end association of the same behavior segment and are inconvenient for secondary quality inspection. The display of labeled behavior segments is therefore also important.
  • the purpose of the present invention is to overcome the above-mentioned defects in the prior art, and provide a video data set labeling method and device.
  • A video data set labeling method includes the following steps:
  • Step S1: determine the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.
  • Step S2: filter out matching videos to be labeled according to the data set labels.
  • Step S3: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • Step S4: perform sampled visual quality inspection on the labeling results and identify background samples and behavior-segment samples with a behavior recognition model; checking the labeling quality in this way greatly saves labor cost while improving accuracy.
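The four steps S1-S4 above can be sketched as a minimal pipeline skeleton. This is an illustrative outline only, not the patent's implementation: the helper functions `rule`, `retrieve`, `annotate`, and `quality_check` are hypothetical placeholders standing in for the selection rules, video retrieval, labeling platform, and behavior recognition model.

```python
def label_videos(candidate_labels, video_library, rule, retrieve, annotate, quality_check):
    """Sketch of the S1-S4 labeling pipeline with placeholder components."""
    # S1: keep only labels that satisfy the action-category selection rules
    labels = [lab for lab in candidate_labels if rule(lab)]
    # S2: retrieve matching videos to be labeled for each selected label
    videos = [v for lab in labels for v in retrieve(video_library, lab)]
    # S3: annotate action type plus start/end frame on the labeling platform
    annotations = [annotate(v) for v in videos]
    # S4: sampled quality inspection with a behavior recognition model
    passed = [a for a in annotations if quality_check(a)]
    return labels, annotations, passed
```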
  • A video data set tagging device includes:
  • Label selection module: determines the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.
  • Video retrieval module: filters out matching videos to be labeled according to the data set labels.
  • Dataset labeling module: uploads the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • Compared with the prior art, the present invention provides a labeling scheme for data sets used in deep-learning temporal action detection, and first performs a duplicate check on the collected videos.
  • The labels of existing video data sets are screened according to certain rules, structured by splitting and merging, and excluded according to certain rules, so that the start and end boundaries of different behaviors are defined more precisely.
  • A video temporal labeling tool is also designed to select the start and end frames of different labels in an input video, so that the time series of multiple labels is better reflected and the labeled behavior segments are displayed more intuitively.
  • Fig. 1 is a flowchart of a video dataset labeling method according to an embodiment of the present invention.
  • Fig. 2 is a schematic diagram of the overall process of a video dataset labeling method according to an embodiment of the present invention.
  • Fig. 3 is a schematic diagram of labeling with a video tagging tool according to an embodiment of the present invention.
  • Fig. 4 is a schematic flowchart of a video tagging tool according to an embodiment of the present invention.
  • The provided video dataset labeling method includes the following steps:
  • Step S110: select data set labels according to set rules.
  • The data set labels are selected according to certain rules, for example: select labels common in general scenarios rather than labels for specific groups of people in specific scenes; exclude labels with overly broad action definitions; exclude labels distinguished mainly by the interactive object rather than by distinct, categorizable changes in human pose; exclude basic body-state labels common to every action; and split actions that can be broken down into atomic actions.
  • The labels are mainly divided into two categories, short-duration instantaneous actions and cyclic actions, to delimit the process cycle of an action.
  • The selected data can be filtered with existing behavior recognition methods.
  • In this way the labels are more reasonable, and coarse-grained actions can be divided into finer actions.
  • Step S120: retrieve and filter out the videos to be labeled according to the selected labels.
  • The duplicate-check process includes: for a video to be processed, perform a nearest-neighbor search in the video library and filter out candidate videos similar to it, obtaining a candidate video set; calculate the similarity between each candidate video and the video to be processed, obtaining a similarity result; and determine, from the similarity result, whether the video to be processed passes the duplicate check.
  • the similarity can be calculated by the Hamming distance of the hash values of the first frame and the last frame of the video.
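The duplicate check above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's exact implementation: frames are assumed to be small grayscale arrays, each video is fingerprinted by a simple average hash of its first and last frames, and similarity is derived from the Hamming distance between those hashes. The 0.9 threshold is an arbitrary example value.

```python
def average_hash(frame):
    """Binary average hash of a grayscale frame given as a 2D list of ints."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p >= mean else 0 for p in pixels]

def hamming(h1, h2):
    """Number of differing bits between two equal-length binary hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def video_similarity(video_a, video_b):
    """Similarity in [0, 1] from first- and last-frame hash distances."""
    dist, total_bits = 0, 0
    for fa, fb in ((video_a[0], video_b[0]), (video_a[-1], video_b[-1])):
        ha, hb = average_hash(fa), average_hash(fb)
        dist += hamming(ha, hb)
        total_bits += len(ha)
    return 1.0 - dist / total_bits

def is_duplicate(video_a, video_b, threshold=0.9):
    """Flag two videos as duplicates when their similarity exceeds a threshold."""
    return video_similarity(video_a, video_b) >= threshold
```

In a real system the hash would be computed on downscaled frames (e.g. 8x8) so that the fingerprint is robust to re-encoding, and the nearest-neighbor search would prune the library before pairwise comparison.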
  • Step S130: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • The video is input to the labeling tool platform for labeling, where the upper left is the selection area for actually marking start and end frames; the upper right is the video selection area, which supports batch selection (labeled videos display their labeling results); below the video selection is the label selection menu, whose delete, label, and save buttons respectively delete a wrong action box, label an action box, and save the existing results.
  • In this way, multiple labels can be annotated at the same time, which is of greater practical significance.
  • Below the menu selection is the timeline display area for the actual labeling results, which makes it convenient for labelers to check for wrong and missed labels and to perform secondary quality inspection; the lower left is the video playback area, where the video can be browsed quickly via the slide bar.
  • Keyboard shortcuts can be set during actual labeling so that labelers can work more quickly.
  • The designed labeling tool platform has richer functions, more convenient operation, and a more intuitive interface; labeling with the above process helps determine clearer boundaries and allows a period of time to be labeled with multiple labels, reflecting the start-end association of the same behavior segment.
  • The labeling rules and processes are more precise, reducing labeling bias and the boundary uncertainty of temporal action localization.
  • Step S140: perform quality inspection on the labeling results of the dataset.
  • After the results labeled by the video labeling tool are obtained, they can be sampled and visualized for quality inspection, and background samples and behavior-segment samples can be identified by a model. Checking the labeling quality through this identification greatly saves labor cost while improving precision. For example, identification can be performed with the TSN (Temporal Segment Networks) method.
  • The method mainly consists of a spatial-stream convolutional network and a temporal-stream convolutional network. Unlike the two-stream approach, which uses a single frame or a single stack of frames, TSN uses a series of short snippets sparsely sampled from the entire video; each snippet gives its own preliminary prediction of the behavior category, and the video-level prediction is obtained from the "consensus" of these snippets. During learning, the loss is computed on the video-level prediction. The results are then tallied, and the quality inspection results determine whether the labeled data set meets expectations.
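The TSN-style sparse sampling and consensus described above can be sketched minimally as follows. This is an assumption-laden illustration, not the TSN implementation: the video is split into `k` equal segments, one snippet index is sampled from each, and per-snippet class scores are averaged into a video-level prediction. The `score_fn` callable is a hypothetical stand-in for the real two-stream convolutional networks.

```python
import random

def sample_snippets(num_frames, k):
    """Pick one random frame index from each of k equal segments."""
    bounds = [num_frames * i // k for i in range(k + 1)]
    return [random.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
            for i in range(k)]

def video_level_prediction(frames, score_fn, k=3):
    """Average per-snippet class scores into a video-level score vector."""
    idx = sample_snippets(len(frames), k)
    snippet_scores = [score_fn(frames[i]) for i in idx]
    num_classes = len(snippet_scores[0])
    # "consensus": average the k snippet score vectors class by class
    return [sum(s[c] for s in snippet_scores) / k for c in range(num_classes)]
```

In training, the loss would be applied to this averaged video-level vector rather than to individual snippet predictions, which is what makes the sparse sampling scheme learn video-level behavior.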
  • The boundary matching network (BMN) adopts a new temporal-proposal confidence evaluation mechanism, the boundary matching mechanism, and builds the network on it.
  • The BMN network can simultaneously generate a one-dimensional boundary probability sequence and a two-dimensional BM confidence map to densely evaluate the confidence scores of all possible temporal proposals.
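The dense evaluation idea behind a BM-style confidence map can be sketched as follows. This is a hypothetical simplification, not BMN itself: given one-dimensional start and end boundary probability sequences over `t` temporal positions, every candidate proposal (start `s`, duration `d`) is scored into a 2D map; the real BMN learns this map with a dedicated network, whereas here the score is simply `start_prob[s] * end_prob[s + d]`.

```python
def bm_confidence_map(start_prob, end_prob, max_duration):
    """Densely score all (start, duration) proposals into a 2D confidence map.

    conf[d - 1][s] holds the confidence of the proposal starting at position s
    and ending at position s + d.
    """
    t = len(start_prob)
    conf = [[0.0] * t for _ in range(max_duration)]
    for d in range(1, max_duration + 1):
        for s in range(t - d):  # end index s + d must stay inside the sequence
            conf[d - 1][s] = start_prob[s] * end_prob[s + d]
    return conf
```

The highest-scoring cell of the map then identifies the most confident temporal proposal, which is the quantity the boundary matching mechanism evaluates densely.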
  • The dense boundary action generator network estimates dense boundary confidence maps for all action proposals through a fast, end-to-end dense boundary action generator.
  • The sub-graph localization model for temporal action detection transforms the temporal action detection problem into a sub-graph localization problem by adaptively fusing multi-level semantic context information.
  • The evaluation metric is mainly AR@AN: performance is judged by the relationship between average recall (AR) and the average number of proposals (AN). The area under the AR-AN curve (AUC) is also calculated as a further metric on the ActivityNet-1.3 dataset, where AN ranges from 0 to 100. As can be seen from Table 1, because the data set constructed with the label screening method of the present invention has more accurate labeling results, existing video temporal detection methods perform much lower on it than on other data sets.
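The AR-AN AUC described above can be computed as a simple trapezoidal integral. This sketch assumes the curve is given as sampled (AN, AR) pairs and normalizes by the AN span, reporting the result in percent as is conventional for ActivityNet-1.3; the exact sampling grid used by any particular benchmark may differ.

```python
def ar_an_auc(an_values, ar_values):
    """Area under the AR@AN curve, normalized to the AN span, in percent.

    an_values: increasing average-number-of-proposals samples (e.g. 0..100)
    ar_values: average recall at each corresponding AN sample
    """
    auc = 0.0
    for i in range(1, len(an_values)):
        width = an_values[i] - an_values[i - 1]
        auc += width * (ar_values[i] + ar_values[i - 1]) / 2.0  # trapezoid
    return 100.0 * auc / (an_values[-1] - an_values[0])
```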
  • the present invention also provides a video data set labeling device, which is used to realize one or more aspects of the above method.
  • The device includes: a label selection module, used to determine the data set labels according to set action-category selection rules, the labels representing short-duration instantaneous actions and cyclic actions; a video retrieval module, used to filter out matching videos to be labeled according to the data set labels; and a dataset labeling module, which uploads the videos to be labeled to the labeling tool platform for action detection and labeling so as to determine the action type label and the corresponding start-frame and end-frame positions.
  • The present invention can be applied to video data set labeling in various fields, for example video-assisted refereeing. Because the present invention is not sensitive to the speed or duration of movements in a video, it can be widely used in various sports scenes, such as yoga with slow movements and gymnastics with rapid movements; through a more precise labeling method, the boundary between different actions can be judged more clearly. It can likewise be used for intelligent video review, where abnormal-action identification can be completed on a mobile terminal and whether an abnormal action occurs can be judged from the action's start boundary; it can also be applied to other recognition tasks, such as anomaly detection on production pipelines.
  • As another example, applied to smart security it can perform action recognition directly on smart terminals with limited computing resources, such as smart glasses, drones, and smart cameras, feeding back abnormal behaviors directly and improving the immediacy and accuracy of patrols.
  • Video temporal detection has wide application value in both academia and industry.
  • However, labeling existing video datasets is rather costly.
  • Some action labels are defined relatively coarsely, which is not suitable for defining accurate action boundaries.
  • Different granularities of human actions also make detection difficult.
  • The more precise labeling granularity of the present invention draws on the labeling characteristics of existing behavior-related data sets, and performs exclusion, screening, and splitting according to the criteria for selecting action categories.
  • Existing video temporal annotation tools have few functions and relatively simple interfaces.
  • The present invention therefore designs a video temporal detection tool.
  • A video browsing area is provided to help the annotator quickly preview the entire video, with functions such as fast forward;
  • the start/end frame selection area uses different operations to annotate the start and end of an action, representing start and end frames with different markings;
  • the label selection area selects the label category for different segments, with categories organized into menu options to facilitate multi-label annotation;
  • the operation menu area is used to add, delete, and modify labels, and the label display area helps labelers view the results and avoid missed, repeated, and wrong labels.
  • A labeling guide was also designed, using Wikipedia and related sports guides to clearly define the boundaries of each label through text and pictures.
  • Compared with purely manual work, this improves both the efficiency and the quality of labeling.
  • For quality inspection, the prior art mainly relies on manual checking.
  • The present invention classifies background-segment and behavior-segment samples of the labeling results with an existing behavior recognition model; compared with existing datasets, the boundary definition is more precise.
  • the present invention can be a system, method and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present invention.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions recorded thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in each computing/processing device.
  • Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, or Python, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In the latter case, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet service provider).
  • Electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be customized by utilizing the state information of computer-readable program instructions, and such circuits can execute the computer-readable program instructions to implement various aspects of the present invention.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium, causing computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions constitutes an article of manufacture comprising instructions that implement various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • The functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a video data set labeling method and apparatus. The method comprises: determining a data set label according to a set action category selection rule, the data set label representing short-time instantaneous action and cyclic action types; choosing, according to the data set label, a matched video to be labeled; and uploading the video to be labeled to a labeling tool platform for action behavior detection and labeling, so as to determine an action behavior type label and corresponding start frame and end frame positions. According to the present invention, the boundary definition of the action behavior is more accurate, and both the labeling efficiency and the labeling quality are remarkably improved.

Description

A video dataset labeling method and device

Technical field

The present invention relates to the technical field of computer vision, and more specifically, to a video data set labeling method and device.

Background art

In recent years, video understanding has been widely used in video content analysis, intelligent surveillance, human-computer interaction, and other fields. In deep-learning-based video behavior understanding there are two particularly important tasks. One is video behavior classification, which classifies trimmed videos according to the human behavior in them. The other is video behavior detection, which aims to locate the start time and end time of an action in a long video. Video action detection, as an important part of video understanding, has been extensively studied in the computer vision community.

Compared with behavior classification, behavior detection is more difficult. Existing behavior detection methods usually first generate segment proposals that may contain actions and then classify them. However, because boundaries are defined ambiguously, and multiple actions may proceed simultaneously in the same video, accurately detecting actions is highly challenging. Unlike behavior recognition, behavior detection requires precise detection of action segments; for actions occurring in real scenes, the boundaries are often not very certain, especially the end of an action, and judging the completeness of an action is also relatively difficult. Because the boundaries in video itself are unclear and existing temporal-detection labeling tools are relatively crude, most existing video temporal detection datasets are only weakly calibrated, which also explains the low average precision of current behavior detection.
Analysis shows that existing video labeling schemes mainly have the following defects:

1) The label definitions of existing video temporal detection datasets are relatively coarse-grained; the temporal durations of different labels vary greatly, and the boundaries of different labels are not clearly defined, so the start and end boundaries cannot be clarified intuitively.

2) Existing video annotation tools are mainly aimed at object detection rather than at marking the start and end times of behavior segments in an untrimmed video. In addition, existing tools have relatively simple functions and crude interfaces; for large amounts of untrimmed data there is no convenient, full-featured labeling tool, and the labor cost is high. Because of the complexity of real-world videos, most existing tools can only annotate one type of label per viewing of a video, whereas in real scenes multiple behaviors often occur simultaneously, so multi-label annotation over a period of time must also be considered. Moreover, existing tools often cannot clearly display the time segments already labeled, which easily leads to missed, repeated, or wrong labels; they also fail to reflect the start-end association of the same behavior segment and are inconvenient for secondary quality inspection, so the display of labeled behavior segments is also important.
Technical problem

The purpose of the present invention is to overcome the above defects of the prior art and provide a video data set labeling method and device.
Technical solution

According to a first aspect of the present invention, a video data set labeling method is provided. The method includes the following steps:

Step S1: determine the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.

Step S2: filter out matching videos to be labeled according to the data set labels.

Step S3: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.

Step S4: perform sampled visual quality inspection on the labeling results and identify background samples and behavior-segment samples with a behavior recognition model; checking the labeling quality in this way greatly saves labor cost while improving accuracy.

According to a second aspect of the present invention, a video data set labeling device is provided. The device includes:

Label selection module: determines the data set labels according to set action-category selection rules; the labels represent short-duration instantaneous actions and cyclic actions.

Video retrieval module: filters out matching videos to be labeled according to the data set labels.

Dataset labeling module: uploads the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action type label and the corresponding start-frame and end-frame positions.
有益效果Beneficial effect
与现有技术相比,本发明的优点在于,提供了一种用于深度学习行为时序检测的数据集的标注技术方案,首先对收集来的视频进行视频查重。为了更好地统一边界定义指标,针对现有的视频数据集的标签根据一定规则进行筛选,通过拆解,合并等方式将标签结构化,并按照一定规则进行了排除,更加精确不同行为的起止边界。此外,还设计了一个视频时序标注的工具,针对一段输入视频进行不同标签起止帧的选取,使得多种标签的时间序列更好地体现,并且对已标注行为段都可以更加直观的展现。Compared with the prior art, the present invention has the advantage that it provides a technical solution for tagging data sets for deep learning behavior timing detection, and first performs video plagiarism check on collected videos. In order to better unify the boundary definition indicators, the tags of the existing video data sets are screened according to certain rules, and the tags are structured by dismantling and merging, and are excluded according to certain rules, so as to more accurately start and stop different behaviors boundary. In addition, a video time sequence labeling tool is also designed to select the start and end frames of different labels for an input video, so that the time series of multiple labels can be better reflected, and the marked behavior segments can be displayed more intuitively.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a video data set labeling method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the overall process of a video data set labeling method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of labeling with the video labeling tool according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the video labeling tool according to an embodiment of the present invention.
Embodiments of the present invention
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the specification.
In all examples shown and discussed herein, any specific value should be interpreted as merely exemplary rather than limiting; other instances of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item is defined in one figure, it need not be discussed further in subsequent figures.
As shown in Fig. 1 and Fig. 2, the provided video data set labeling method includes the following steps.
Step S110: select data set labels according to set rules.
In this step, the data set labels are defined by a set of rules, for example: select common labels in general scenarios rather than labels tied to specific scenarios or specific groups of people; exclude labels whose action definition is too broad; exclude labels that are distinguished mainly by differences in the interacted object rather than by differences in human posture; exclude basic body-state labels that are common to every action; and split actions that can be decomposed into atomic actions.
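The selection rules above can be pictured as a simple rule-based filter over candidate labels. The rule sets and label names below are illustrative assumptions for this sketch, not categories from the disclosure:

```python
# Hypothetical sketch of the label-selection rules in step S110.
# All label names and rule sets here are invented for illustration.

SCENE_SPECIFIC = {"referee signaling"}      # specific-scenario, specific-group labels
BROAD = {"doing sports"}                    # action defined too broadly
OBJECT_DRIVEN = {"opening a bottle"}        # distinguished by object, not by posture
BASE_STATE = {"standing", "sitting"}        # body states common to every action
SPLITTABLE = {"triple jump": ["hop", "step", "jump"]}  # decomposable into atoms

def select_labels(candidates):
    """Apply the exclusion rules, then split compound actions."""
    kept = []
    for label in candidates:
        if (label in SCENE_SPECIFIC or label in BROAD
                or label in OBJECT_DRIVEN or label in BASE_STATE):
            continue                        # excluded by one of the rules
        kept.extend(SPLITTABLE.get(label, [label]))
    return kept

labels = select_labels(["jumping", "doing sports", "standing", "triple jump"])
```

Here the coarse-grained "triple jump" is split into three atomic actions, while broad and body-state labels are dropped.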
In one embodiment, the labels are divided into two main categories, short-duration instantaneous actions and cyclic actions, to partition the process cycle of an action. The selected data can be screened with existing action recognition methods.
By defining the "action" in each action category and then screening and partitioning, the labels become more reasonable, and coarse-grained actions can be divided into finer-grained ones.
Step S120: retrieve and filter the videos to be labeled according to the selected labels.
Relevant videos are collected for the labels and then checked for duplicates and screened. In one embodiment, the duplicate check includes: for a video to be processed, performing a nearest-neighbor search in the video library to select candidate videos similar to it, yielding a candidate set; computing the similarity between each candidate video and the video to be processed; and determining, from the similarity results, whether the video passes the duplicate check. The similarity can be computed as the Hamming distance between the hash values of the first and last frames of the videos.
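A minimal sketch of this duplicate check follows, assuming an average hash over already-decoded 8x8 grayscale frames; in practice the video frames would first be decoded and downsampled, and the threshold value is an assumption, not a figure from the disclosure:

```python
# Duplicate check via Hamming distance between frame hashes, as described
# above. Frames are plain 8x8 grayscale arrays (lists of lists of ints);
# decoding/downsampling real video frames is outside this sketch.

def average_hash(frame):
    """64-bit average hash: each bit marks whether a pixel exceeds the mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def is_duplicate(video_a, video_b, threshold=5):
    """video_* = (first_frame, last_frame); small distances at both ends
    suggest the two videos are the same clip."""
    d_first = hamming(average_hash(video_a[0]), average_hash(video_b[0]))
    d_last = hamming(average_hash(video_a[1]), average_hash(video_b[1]))
    return d_first <= threshold and d_last <= threshold
```

Two identical clips give distance 0 at both ends; unrelated clips give large distances and are kept as distinct videos.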
Step S130: upload the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action-type label and the corresponding start-frame and end-frame positions.
Specifically, as shown in Fig. 3, a video is input to the labeling tool platform for annotation. The upper left is the selection area for marking the actual start and end frames; the upper right is the video selection area, where videos can be selected in batches (already-labeled videos display their results); below the video selection is the label selection menu, whose delete, label, and save buttons respectively remove a mistaken annotation box, create an annotation box, and save the current results.
The labeling tool platform of Fig. 3 supports annotating multiple labels simultaneously, which is of practical significance. Below the menu is a timeline display of the actual labeling results, which helps annotators check for wrong labels and missing labels and perform secondary quality inspection; the lower left is the video playback area, where the video can be browsed quickly with a slider. Keyboard shortcuts can also be configured so that annotators can label more quickly.
Referring to Fig. 4, the annotator first inputs the video to be labeled and clicks play, or browses the whole video with the slider. In the area that displays the 24 frames starting from the frame given in the frame-number box, the annotator selects the start and end frames of the action to be labeled (for example, the two boxes show the marked start frame and the marked end frame, respectively); a label is added through the menu bar and its start and end are selected; the labeled action segments are then checked on the timeline, which helps annotators find wrong labels and missing labels and perform secondary quality inspection; the labeling results are confirmed through the save and delete menus, and labeling ends.
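The multi-label result such a tool could emit can be pictured as one record per labeled segment, with a start and end frame for each label. The record format, field names, and labels below are assumptions for illustration:

```python
# Hypothetical annotation record for one video, as a tool like the one in
# Figs. 3-4 could save it. Field names and labels are invented here.

annotation = {
    "video": "clip_0001.mp4",
    "fps": 24,
    "segments": [
        {"label": "jump", "start_frame": 12, "end_frame": 40},
        {"label": "wave", "start_frame": 30, "end_frame": 55},  # labels may overlap
    ],
}

def timeline(record, num_frames):
    """Per-frame list of active labels, as a timeline view would render it."""
    frames = [[] for _ in range(num_frames)]
    for seg in record["segments"]:
        for f in range(seg["start_frame"], seg["end_frame"] + 1):
            frames[f].append(seg["label"])
    return frames

tl = timeline(annotation, 60)
```

Rendering the per-frame labels this way makes overlapping segments and gaps visible at a glance, which is exactly what the timeline check relies on.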
In summary, compared with existing labeling tools, the designed platform is richer in functionality, easier to operate, and more intuitive. Labeling with the above process helps determine clearer boundaries and supports multi-label annotation over a time span, reflecting the start-end association within the same action segment. Moreover, the labeling rules and workflow are more precise, reducing annotation bias and the boundary uncertainty of temporal action localization.
Step S140: perform quality inspection on the data set labeling results.
After the results labeled with the video labeling tool are obtained, a sampled visual quality inspection can be performed, and a model can identify the background samples and action-segment samples; checking labeling quality through such recognition greatly reduces labor cost while improving accuracy. For example, recognition can be performed with the TSN (Temporal Segment Networks) method, which consists mainly of a spatial-stream convolutional network and a temporal-stream convolutional network. Unlike two-stream methods that use a single frame or a single stack of frames, TSN sparsely samples a series of short snippets from the whole video; each snippet gives its own preliminary prediction of the action category, and the video-level prediction is obtained from the consensus of these snippets. During training, the model parameters are updated iteratively to optimize the loss of the video-level prediction. The results are aggregated, and the quality-inspection results determine whether the labeled data set meets expectations.
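The TSN-style sparse sampling and consensus described above can be sketched as follows. The per-snippet class scores are stand-ins for the outputs of a trained two-stream network, and the averaging consensus is one common choice, not the only one:

```python
# Sketch of TSN-style sparse sampling and consensus: split the video into
# K segments, draw one snippet per segment, score each snippet, and average
# the scores into a video-level prediction. The classifier itself is assumed.

import random

def sparse_sample(num_frames, k=3, seed=0):
    """Pick one frame index from each of k equal-length segments."""
    rng = random.Random(seed)
    seg_len = num_frames // k
    return [i * seg_len + rng.randrange(seg_len) for i in range(k)]

def consensus(snippet_scores):
    """Average per-snippet class scores into a video-level score vector."""
    k = len(snippet_scores)
    n_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / k for c in range(n_classes)]

# Made-up per-snippet scores for a 2-class example (action vs. background).
scores = consensus([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
predicted = scores.index(max(scores))
```

Applied to quality inspection, segments labeled as background but scored as action (or vice versa) can be flagged for manual review.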
To verify the effect of the present invention, different methods were compared on the data set; the results are shown in Table 1 below.
Table 1: Comparison of different methods on the data set
(Table 1 is provided as an image in the original publication.)
In Table 1, three state-of-the-art temporal detection methods, BMN, DBG, and G-TAD, were evaluated on the data set using RGB, optical-flow, and RGB+optical-flow features extracted with a two-stream 3D convolution (I3D) model. The Boundary-Matching Network (BMN) introduces a new proposal confidence evaluation mechanism, the boundary-matching mechanism, together with a network built on it; BMN simultaneously generates a one-dimensional boundary probability sequence and a two-dimensional BM confidence map to densely evaluate the confidence scores of all possible temporal proposals. The Dense Boundary Generator (DBG) estimates dense boundary confidence maps for all action proposals with a fast, end-to-end generator. The sub-graph localization model for temporal action detection (G-TAD) adaptively fuses multi-level semantic context and casts temporal action detection as a sub-graph localization problem. The evaluation metric is mainly AR@AN, i.e., the average recall (AR) measured as a function of the average number of proposals (AN); the area under the AR-AN curve (AUC) is also computed as a further metric, as on the ActivityNet-1.3 data set, with AN ranging from 0 to 100. As Table 1 shows, because the labeling of the data set constructed with the label-screening method of the present invention is more precise, existing temporal detection methods perform much worse on it than on other data sets.
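The AUC summary mentioned above can be sketched with a trapezoidal rule over the AN range. The AR values below are made-up numbers for illustration, not results from Table 1:

```python
# Area under the AR-vs-AN curve, normalized by the AN span, as a single
# scalar summary of proposal quality. Sample points are invented here.

def auc_ar_an(an_values, ar_values):
    """Trapezoidal area under the AR@AN curve, divided by the AN range."""
    area = 0.0
    for i in range(1, len(an_values)):
        width = an_values[i] - an_values[i - 1]
        area += width * (ar_values[i] + ar_values[i - 1]) / 2
    return area / (an_values[-1] - an_values[0])

an = [1, 10, 50, 100]          # average number of proposals
ar = [0.20, 0.45, 0.60, 0.68]  # illustrative average recall at each AN
score = auc_ar_an(an, ar)
```

A harder, more precisely labeled data set depresses AR at every AN and therefore the AUC, which is the effect Table 1 reports.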
Correspondingly, the present invention also provides a video data set labeling apparatus for implementing one or more aspects of the above method. For example, the apparatus includes: a label selection module, which determines data set labels according to set action-category selection rules, the labels characterizing short-duration instantaneous actions and cyclic action types; a video retrieval module, which filters out matching videos to be labeled according to the data set labels; and a data set labeling module, which uploads the videos to be labeled to the labeling tool platform for action detection and labeling, so as to determine the action-type label and the corresponding start-frame and end-frame positions.
In the field of computer vision, beyond public data sets, many application scenarios require dedicated data sets for transfer learning or end-to-end training, which demands large amounts of training data. The present invention can be applied to video data set labeling in many fields, for example video-assisted refereeing. Because it is insensitive to the speed of video actions, it generalizes to many sports scenarios, such as slow-paced yoga and rapidly changing gymnastics; the more precise labeling makes boundary judgments between different actions clearer. It can also be used for intelligent video review, where abnormal-action recognition and judgment are completed on a mobile terminal and the action-start boundary indicates whether an abnormal behavior is about to occur; it can likewise be applied to other recognition tasks, such as pipeline anomaly detection. As another example, in intelligent security, action recognition can run directly on compute-constrained smart terminals such as smart glasses, drones, and smart cameras, feeding back abnormal behaviors immediately and improving the timeliness and accuracy of patrols.
In summary, compared with the prior art, the advantages of the present invention are mainly as follows:
1) Temporal video detection has broad application value in both academia and industry, but because action boundaries lack clear definitions and manual annotation is costly, existing video data sets are characterized by high cost and weak labeling; some action-label definitions are relatively coarse and unsuitable for defining accurate action boundaries. Moreover, different actions have different process cycles, and human actions at different granularities complicate detection. The present invention adopts a more precise labeling granularity, drawing on the labeling characteristics of existing action-related data sets and performing exclusion, screening, and splitting according to the criteria for action-category selection.
2) Existing video temporal annotation tools have limited functionality and relatively crude interfaces. To help annotators label action video segments effectively and consistently, the present invention designs a temporal video annotation tool. It provides a video browsing area that helps the annotator quickly preview the whole video, with fast-forward and similar functions; a start/end frame selection area, where the start and end of an action are annotated through different operations and marked distinctly; a label selection area, where label categories are chosen for different segments, with categories organized into menu options to ease multi-label annotation; and an operation menu area for adding, deleting, and modifying labels, together with a label display area that helps annotators review results and avoid missing, duplicated, or wrong labels. In addition, to minimize bias from subjective judgment and ensure data consistency, an annotation guide was designed that clearly defines the boundary of each label with text and pictures drawn from Wikipedia and related sports guides. Compared with purely manual work, this improves both annotation efficiency and annotation quality.
3) The prior art relies mainly on manual inspection. On top of a manual secondary quality inspection, the present invention uses existing action recognition models to classify the background segments and action segments of the labeling results. Compared with existing data sets, the boundary definitions are more precise.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. It may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, or semiconductor storage device, or any suitable combination thereof. More specific examples (a non-exhaustive list) include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the above. As used herein, a computer-readable storage medium is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical-fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented languages such as Smalltalk, C++, and Python, and conventional procedural languages such as the "C" language or similar languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized with state information of the computer-readable program instructions and may execute those instructions to implement aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed via the processor, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. The instructions may also be stored in a computer-readable storage medium and direct a computer, programmable data processing apparatus, and/or other device to function in a particular manner, such that the medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the specified functions/acts.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device to cause a series of operational steps to be performed thereon, producing a computer-implemented process such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block, and combinations of blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over existing technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

  1. A video data set labeling method, comprising the following steps:
    Step S1: determining data set labels according to set action-category selection rules, the labels characterizing short-duration instantaneous actions and cyclic action types;
    Step S2: filtering out matching videos to be labeled according to the data set labels;
    Step S3: uploading the videos to be labeled to a labeling tool platform for action detection and labeling, so as to determine the action-type label and the corresponding start-frame and end-frame positions.
  2. The method according to claim 1, wherein determining the data set labels according to the set action-category selection rules comprises:
    selecting common labels in general scenarios and excluding labels tied to specific scenarios or specific groups of people;
    excluding labels whose action definition is broad;
    excluding labels classified by differences in the interacted object rather than by differences in human posture;
    excluding basic body-state labels common to every action;
    splitting decomposable actions to obtain fine-grained labels.
  3. The method according to claim 1, wherein step S2 comprises:
    collecting relevant videos according to the data set labels and performing duplicate checking and screening, wherein the duplicate check computes similarity as the Hamming distance between the hash values of the first and last frames of the videos;
    determining, according to the similarity result, whether the video to be processed passes the duplicate check.
  4. The method according to claim 1, wherein the labeling tool platform is provided with a start-frame selection area, a video selection area, a label selection area, a result display area, and a video playback area, wherein the start-frame selection area is used by the user to mark the start and end frames; the video selection area is used to select one or more videos to be labeled; the label selection area is used by the user to mark action labels; the result display area displays the labeling start time to the user; and the video playback area displays consecutive frames of the video to be labeled so that the start frame of an action can be marked.
  5. The method according to claim 4, wherein step S3 comprises:
    inputting the video to be labeled and clicking the play button, or browsing the video with the slider;
    displaying 24 consecutive frames in the video selection area for the user to select the start and end frames of the action to be labeled;
    adding a label through the menu bar of the label selection area and selecting the start and end of the label;
    checking the labeled action segments on the timeline of the result display area, so that the user can check for wrong labels and missing labels and perform secondary quality inspection;
    selecting the labeling results through the save and delete menus provided in the label selection area.
  6. The method according to claim 1, further comprising:
    Step S4: performing sampled visual quality inspection on the labeling results and identifying background samples and action-segment samples with a behavior recognition model.
  7. The method according to claim 6, wherein step S4 comprises:
    recognizing background video segments and action video segments using a TSN action recognition model, and obtaining the model's predicted class scores for the action categories, thereby verifying the quality of the labeling results.
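The score-based check in claim 7 can be sketched generically: run a recognition model over each labeled segment and flag those whose score contradicts their label. The sketch below injects the model as a `predict_fn` callable so it stays self-contained; in the patent that role is played by a TSN model, and the function name, threshold value, and `"background"` label convention are assumptions.

```python
def quality_check(segments, predict_fn, score_threshold=0.5):
    """Flag labeled segments whose model scores contradict their labels.

    segments: list of (video_id, start, end, label) tuples.
    predict_fn(video_id, start, end) -> dict mapping class name to score.
    Returns a list of (video_id, start, end, label, score) for suspect segments.
    """
    suspicious = []
    for video_id, start, end, label in segments:
        scores = predict_fn(video_id, start, end)
        if label == "background":
            # Background segments should score low on every action class.
            top = max(scores.values(), default=0.0)
            if top >= score_threshold:
                suspicious.append((video_id, start, end, label, top))
        else:
            # Action segments should score high on their assigned label.
            s = scores.get(label, 0.0)
            if s < score_threshold:
                suspicious.append((video_id, start, end, label, s))
    return suspicious


# Toy stand-in for the recognition model, for illustration only.
def fake_predict(video_id, start, end):
    return {"jump": 0.9} if video_id == "good" else {"jump": 0.2}


flagged = quality_check(
    [("good", 0, 10, "jump"), ("bad", 0, 10, "jump"), ("bg", 0, 10, "background")],
    fake_predict,
)
print(flagged)  # [('bad', 0, 10, 'jump', 0.2)]
```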
  8. A video data set labeling apparatus, comprising:
    a label selection module, configured to determine data set labels according to set action-category selection rules, the data set labels representing short instantaneous actions and cyclic action types;
    a video retrieval module, configured to filter out matching videos to be labeled according to the data set labels; and
    a data set labeling module, configured to upload the videos to be labeled to a labeling tool platform for action detection and labeling, so as to determine action-type labels and the corresponding start-frame and end-frame positions.
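The three modules of the apparatus can be wired into one pipeline, sketched below under stated assumptions: the class name, the rule/index data shapes, and the stubbed "upload" step are all illustrative, not the patent's implementation (the real labeling module hands videos to the labeling tool platform of claim 4).

```python
class VideoDatasetLabeler:
    """Sketch of the claimed apparatus: label selection, video retrieval,
    and dataset labeling modules composed into one pipeline."""

    def __init__(self, category_rules, video_index):
        self.category_rules = category_rules  # action-category selection rules
        self.video_index = video_index        # label -> candidate video ids

    def select_labels(self):
        """Label selection module: keep short instantaneous and cyclic actions."""
        return [c["name"] for c in self.category_rules
                if c["type"] in ("instantaneous", "cyclic")]

    def retrieve_videos(self, labels):
        """Video retrieval module: filter candidate videos by dataset label."""
        return {lb: self.video_index.get(lb, []) for lb in labels}

    def upload_for_labeling(self, videos_by_label):
        """Dataset labeling module: hand candidates to the labeling platform
        (stubbed here as a flat list of upload jobs)."""
        return [(lb, vid) for lb, vids in videos_by_label.items() for vid in vids]


labeler = VideoDatasetLabeler(
    [{"name": "jump", "type": "instantaneous"},
     {"name": "wave", "type": "cyclic"},
     {"name": "cook", "type": "long"}],
    {"jump": ["v1"], "wave": ["v2", "v3"]},
)
labels = labeler.select_labels()          # ['jump', 'wave'] — 'cook' filtered out
jobs = labeler.upload_for_labeling(labeler.retrieve_videos(labels))
print(jobs)  # [('jump', 'v1'), ('wave', 'v2'), ('wave', 'v3')]
```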
  9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/137579 2021-05-10 2021-12-13 Video data set labeling method and apparatus WO2022237157A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110505869.1 2021-05-10
CN202110505869.1A CN113139096B (en) 2021-05-10 2021-05-10 Video dataset labeling method and device

Publications (1)

Publication Number Publication Date
WO2022237157A1 true WO2022237157A1 (en) 2022-11-17

Family

ID=76818024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137579 WO2022237157A1 (en) 2021-05-10 2021-12-13 Video data set labeling method and apparatus

Country Status (2)

Country Link
CN (1) CN113139096B (en)
WO (1) WO2022237157A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139096B (en) * 2021-05-10 2024-04-23 中国科学院深圳先进技术研究院 Video dataset labeling method and device
CN114373075A (en) * 2021-12-31 2022-04-19 西安电子科技大学广州研究院 Target component detection data set construction method, detection method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457494A (en) * 2019-08-01 2019-11-15 新华智云科技有限公司 Data mask method, device, electronic equipment and storage medium
CN110996138A (en) * 2019-12-17 2020-04-10 腾讯科技(深圳)有限公司 Video annotation method, device and storage medium
CN112101297A (en) * 2020-10-14 2020-12-18 杭州海康威视数字技术股份有限公司 Training data set determination method, behavior analysis method, device, system and medium
CN113139096A (en) * 2021-05-10 2021-07-20 中国科学院深圳先进技术研究院 Video data set labeling method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163122B (en) * 2020-10-30 2024-02-06 腾讯科技(深圳)有限公司 Method, device, computing equipment and storage medium for determining label of target video


Also Published As

Publication number Publication date
CN113139096B (en) 2024-04-23
CN113139096A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
US20220075806A1 (en) Natural language image search
US11836996B2 (en) Method and apparatus for recognizing text
Zhou et al. Salient region detection using diffusion process on a two-layer sparse graph
CN108052577B (en) Universal text content mining method, device, server and storage medium
CN108846126B (en) Generation of associated problem aggregation model, question-answer type aggregation method, device and equipment
US20210165817A1 (en) User interface for context labeling of multimedia items
CN108334627B (en) Method and device for searching new media content and computer equipment
Cooper et al. It takes two to tango: Combining visual and textual information for detecting duplicate video-based bug reports
US20160005171A1 (en) Image Analysis Device, Image Analysis System, and Image Analysis Method
CN104573130B (en) The entity resolution method and device calculated based on colony
WO2022237157A1 (en) Video data set labeling method and apparatus
WO2015061046A2 (en) Method and apparatus for performing topic-relevance highlighting of electronic text
CN102436483A (en) Video advertisement detecting method based on explicit type sharing subspace
WO2016014373A1 (en) Identifying presentation styles of educational videos
CN110851712A (en) Book information recommendation method and device and computer readable medium
Zhao et al. Effective local and global search for fast long-term tracking
Zang et al. Multimodal icon annotation for mobile applications
Sun et al. Ui components recognition system based on image understanding
Jeong et al. Automatic detection of slide transitions in lecture videos
Huang et al. Visual attention learning and antiocclusion-based correlation filter for visual object tracking
Bergh et al. A curated set of labeled code tutorial images for deep learning
Jin Seeing the Unseen: Errors and Bias in Visual Datasets
Bhanbhro et al. Symbol Detection in a Multi-class Dataset Based on Single Line Diagrams using Deep Learning Models
Sindel et al. SliTraNet: Automatic Detection of Slide Transitions in Lecture Videos using Convolutional Neural Networks
US11947590B1 (en) Systems and methods for contextualized visual search

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941719

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21941719

Country of ref document: EP

Kind code of ref document: A1