CN117156221B - Short video content understanding and labeling method - Google Patents

Short video content understanding and labeling method

Info

Publication number
CN117156221B
CN117156221B
Authority
CN
China
Prior art keywords
video
labeling
annotation
event
window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311421767.7A
Other languages
Chinese (zh)
Other versions
CN117156221A (en)
Inventor
张瑾
文静
袁泉
郝文涛
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Toutiaoyi Technology Co ltd
Original Assignee
Beijing Toutiaoyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Toutiaoyi Technology Co ltd filed Critical Beijing Toutiaoyi Technology Co ltd
Priority to CN202311421767.7A
Publication of CN117156221A
Application granted
Publication of CN117156221B
Active legal status
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 - Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The invention discloses a short video content understanding and labeling method, which relates to the technical field of video content labeling and comprises the following steps. Step S1, reading a video stream to be labeled and parsing the video stream using set parameters, where the set parameters comprise the size of a sliding window and the sliding stride. Step S2, applying the sliding window to the video frame sequence, sliding the window step by step from the start of the video and acquiring the video frames of each window. Step S3, presenting the video frames in each window in time order; an annotator labels the content of the video frames of each window in turn according to the different annotation types, generating annotation events on an annotation time axis. The labeling method improves the annotator's labeling efficiency, shortens the time spent on labeling, and improves the precision of the labeled content.

Description

Short video content understanding and labeling method
Technical Field
The invention belongs to the field of video content annotation and relates in particular to a short video content understanding and annotation method.
Background
Video content understanding annotation is a key technology for automatic content analysis and retrieval: objects, emotions, events and the like are marked in the video. It combines computer vision, natural language processing, and machine learning to help machine systems understand and interpret video content, providing a more accurate and personalized video experience for the user. Video content understanding annotation is central to intelligent video processing and has broad application prospects in industry and research. In the early stages, content understanding annotation must be performed manually in order to build up training data and train deep neural networks.
For example, the Chinese patent with publication number CN112040256B discloses a method and system for labeling video in a live-broadcast experimental teaching process: video and audio are acquired from a media stream, the experimental equipment is identified in each, the recognition results are matched against time, the matching information is stored, and the video is labeled. However, that patent supports only a single form of labeling and cannot label deeper video event behaviors.
Disclosure of Invention
In view of these technical problems, the invention discloses a short video content understanding and labeling method, which comprises the following steps.
Step S1, reading the video stream to be labeled and parsing it using set parameters, where the set parameters comprise the size of the sliding window and the sliding stride.
Step S2, applying the sliding window from step S1 to the video frame sequence, sliding the window step by step from the start of the video and acquiring the video frames of each window.
Step S3, presenting the video frames in each window in time order; an annotator labels the content of the video frames of each window in turn according to the different annotation types, generating annotation events on an annotation time axis.
Step S4, performing weighted integration of the labeled video frames, adjusting the weights through the values of the set parameters, and outputting a video stream whose annotation time axis carries the complete annotation events.
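The patent does not specify an implementation for the parsing and window extraction of steps S1 and S2. The following is a minimal sketch under stated assumptions: OpenCV (cv2) is used only as an illustrative decoder, and window_size and stride are illustrative names for the set parameters.

```python
# Minimal sketch of steps S1-S2 (an assumption, not the patented implementation):
# decode a video stream and yield the video frames of each sliding window.
import cv2  # OpenCV is assumed here purely for decoding; the patent names no library


def sliding_windows(video_path: str, window_size: int, stride: int):
    """Yield (start_frame_index, frames) for each window of `window_size` frames,
    advancing the window by `stride` frames (the "sliding stride" set parameter)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    start = 0
    while start + window_size <= len(frames):
        yield start, frames[start:start + window_size]
        start += stride
```

Each yielded window would then be presented to the annotator in time order, as described in step S3.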
Further, the annotation types comprise key frame annotation, object tracking annotation, emotion and sentiment annotation, audio annotation, and event annotation. Key frame annotation selects a key frame at a specific time point and labels it; object tracking annotation labels the trajectory or motion of trackable objects in the video frames; emotion and sentiment annotation labels the emotional changes occurring at different time points; event annotation labels the content events of the video frames; audio annotation labels the audio, identifying the speech content, sound effects, and music in it.
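To make the annotation types and the annotation events of step S3 concrete, one possible representation is sketched below; the class and field names are illustrative and not prescribed by the patent.

```python
from dataclasses import dataclass
from enum import Enum


class AnnotationType(Enum):
    KEY_FRAME = "key_frame"              # key frame selected at a specific time point
    OBJECT_TRACKING = "object_tracking"  # trajectory/motion of a trackable object
    EMOTION = "emotion"                  # emotion changes at different time points
    AUDIO = "audio"                      # speech content, sound effects, music
    EVENT = "event"                      # content events of the video frames


@dataclass
class AnnotationEvent:
    """One annotation event placed on the annotation time axis (step S3)."""
    annotation_type: AnnotationType
    start_time: float   # seconds on the annotation time axis
    end_time: float
    payload: dict       # type-specific content, e.g. a content summary for EVENT
```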
Further, the labeling steps for the event annotation type comprise the following.
Step S101, setting the annotation type to event annotation, performing speech-to-text recognition on the video frames in the window, cleaning the text data, generating a content summary using a text analysis and generation technique, and displaying the content summary on the annotation event for the annotator to review.
Step S102, the annotator reviews and judges the content summary; the judgment criteria include whether the description is accurate, whether homophone errors exist, and whether manual correction is needed. When manual correction is needed, the annotator edits a copy of the content summary, and the modified copy is applied to the annotation events in time order.
Step S103, the events and plot of each window's video frames are weighted and integrated according to the set parameters; the integrated video frames carry event nodes that appear at different time points.
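The patent names the operations of steps S101 and S102 but not the tools. The sketch below treats the speech-recognition, text-cleaning, and summary-generation components as injected callables (transcribe, clean_text, and summarize are hypothetical names), and reduces the annotator's manual review to a placeholder.

```python
def annotator_review(summary: str) -> str:
    """Placeholder for the manual review of step S102; in practice this is the
    annotation interface where the annotator edits a copy of the content summary."""
    return summary  # this stub applies no correction


def label_event_window(audio_segment, transcribe, clean_text, summarize):
    """Steps S101-S102 for one window, with the speech-to-text, cleaning and
    summarization components injected as callables (their choice is not
    prescribed by the patent)."""
    raw_text = transcribe(audio_segment)   # S101: speech-to-text recognition
    cleaned = clean_text(raw_text)         # S101: clean the recognized text
    summary = summarize(cleaned)           # S101: generate the content summary
    corrected = annotator_review(summary)  # S102: annotator reviews/edits a copy
    return {"summary": summary, "corrected_summary": corrected}
```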
Further, each event node contains the annotated event information, which comprises the speech-to-text correction content, the video event content description, the content summary, and the degree to which the summary matches the video.
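One possible container for this event information, with illustrative field names (the patent lists the fields but does not define a schema), is:

```python
from dataclasses import dataclass


@dataclass
class EventNode:
    """Annotated event information attached to one event node."""
    time_point: float          # position on the annotation time axis
    stt_correction: str        # speech-to-text correction content
    event_description: str     # video event content description
    content_summary: str       # content summary shown to the annotator
    video_match_score: float   # degree to which the summary matches the video
```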
Compared with the prior art, the invention has the following beneficial effect: (1) the method sets different annotation types, understands and annotates the content in the video stream, and generates video frames carrying complete annotation events, thereby providing training data for training a deep neural network to understand video content automatically.
Drawings
FIG. 1 is an exemplary flow chart of the labeling method of the present invention.
FIG. 2 is an exemplary flow chart of event annotation in accordance with the present invention.
Detailed Description
Embodiment: a short video content understanding and labeling method, as shown in FIG. 1, comprises the following steps.
Step S1, reading the video stream to be labeled and parsing it using set parameters, where the set parameters comprise the size of the sliding window and the sliding stride.
Step S2, applying the sliding window from step S1 to the video frame sequence, sliding the window step by step from the start of the video and acquiring the video frames of each window.
Step S3, presenting the video frames in each window in time order; an annotator labels the content of the video frames of each window in turn according to the different annotation types, generating annotation events on an annotation time axis.
Step S4, performing weighted integration of the labeled video frames, adjusting the weights through the values of the set parameters, and outputting a video stream whose annotation time axis carries the complete annotation events.
In the above steps, the size of the sliding window is the number of video frames in each window, and the sliding stride is the number of frames by which the window moves along the video frame sequence at each step.
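As an illustration (the numbers are an assumption, not taken from the patent): for a 300-frame video with a window size of 30 frames and a stride of 15 frames, the windows start at frames 0, 15, 30, ..., 270, giving floor((300 - 30) / 15) + 1 = 19 windows, with adjacent windows overlapping by 15 frames.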
The annotation types comprise key frame annotation, object tracking annotation, emotion and sentiment annotation, audio annotation, and event annotation. Key frame annotation selects a key frame at a specific time point and labels it; object tracking annotation labels the trajectory or motion of trackable objects in the video frames; emotion and sentiment annotation labels the emotional changes occurring at different time points; event annotation labels the content events of the video frames; audio annotation labels the audio, identifying the speech content, sound effects, and music in it.
As shown in FIG. 2, exemplary steps of the present embodiment for labeling the event annotation type include the following.
Step S101, setting the annotation type to event annotation, performing speech-to-text recognition on the video frames in the window, cleaning the text data, generating a content summary using a text analysis and generation technique, and displaying the content summary on the annotation event for the annotator to review. The annotator views the content summary through the annotation interface.
Step S102, the annotator reviews and judges the content summary; the judgment criteria include whether the description is accurate, whether homophone errors exist, and whether manual correction is needed. When manual correction is needed, the annotator edits a copy of the content summary, and the modified copy is applied to the annotation events in time order.
Step S103, the events and plot of each window's video frames are weighted and integrated according to the set parameters; the integrated video frames carry event nodes that appear at different time points.
In the above steps, each event node contains the annotated event information, which comprises the speech-to-text correction content, the video event content description, the content summary, and the degree to which the summary matches the video.
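The patent leaves the weighting scheme of steps S4 and S103 unspecified. The sketch below is one plausible reading, in which each window's contribution to an event node is weighted by the window overlap implied by the set parameters; all names and the weighting formula are assumptions, not the patented method.

```python
def integrate_window_annotations(window_events, window_size, stride):
    """Merge annotation events from overlapping windows into event nodes on the
    global annotation time axis (a sketch of one reading of steps S4/S103).

    window_events: iterable of (window_start_time, events), where each event is
    a dict with an "offset" (time within the window) and "info" (event content).
    """
    overlap = max(window_size - stride, 0)
    # Assumed weighting: the more adjacent windows overlap, the less each
    # individual window contributes, so repeated sightings are not over-counted.
    weight = (1.0 - overlap / window_size) if window_size else 1.0

    nodes = {}  # time point -> accumulated event node
    for window_start, events in window_events:
        for ev in events:
            t = round(window_start + ev["offset"], 2)  # absolute time of the event
            node = nodes.setdefault(t, {"score": 0.0, "info": []})
            node["score"] += weight                    # confidence accumulated across windows
            node["info"].append(ev["info"])            # e.g. corrected content summary
    return nodes
```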

Claims (1)

1. A short video content understanding and labeling method, characterized in that it comprises the following steps:
step S1, reading a video stream to be labeled and parsing it using set parameters, wherein the set parameters comprise the size of a sliding window and the sliding stride;
step S2, applying the sliding window from step S1 to the video frame sequence, sliding the window step by step from the start of the video and acquiring the video frames of each window;
step S3, presenting the video frames in each window in time order, an annotator labeling the content of the video frames of each window in turn according to the different annotation types, and generating annotation events on an annotation time axis;
step S4, performing weighted integration of the labeled video frames, adjusting the weights through the values of the set parameters, and outputting a video stream whose annotation time axis carries the complete annotation events; the annotation types comprising key frame annotation, object tracking annotation, emotion and sentiment annotation, audio annotation, and event annotation, wherein key frame annotation selects a key frame at a specific time point and labels it; object tracking annotation labels the trajectory or motion of trackable objects in the video frames; emotion and sentiment annotation labels the emotional changes occurring at different time points; event annotation labels the content events of the video frames; audio annotation labels the audio, identifying the speech content, sound effects, and music in it; the labeling steps for the event annotation type comprising:
step S101, setting the annotation type to event annotation, performing speech-to-text recognition on the video frames in the window, cleaning the text data, generating a content summary using a text analysis and generation technique, and displaying the content summary on the annotation event for the annotator to review;
step S102, the annotator reviewing and judging the content summary, wherein the judgment criteria include whether the description is accurate, whether homophone errors exist, and whether manual correction is needed; when manual correction is needed, the annotator editing a copy of the content summary, and the modified copy being applied to the annotation events in time order;
step S103, weighting and integrating the events and plot of each window's video frames according to the set parameters, the integrated video frames carrying event nodes that appear at different time points; each event node containing the annotated event information, which comprises the speech-to-text correction content, the video event content description, the content summary, and the degree to which the summary matches the video.
CN202311421767.7A 2023-10-31 2023-10-31 Short video content understanding and labeling method Active CN117156221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311421767.7A CN117156221B (en) 2023-10-31 2023-10-31 Short video content understanding and labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311421767.7A CN117156221B (en) 2023-10-31 2023-10-31 Short video content understanding and labeling method

Publications (2)

Publication Number Publication Date
CN117156221A (en) 2023-12-01
CN117156221B (en) 2024-02-06

Family

ID=88899151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311421767.7A Active CN117156221B (en) 2023-10-31 2023-10-31 Short video content understanding and labeling method

Country Status (1)

Country Link
CN (1) CN117156221B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011145951A (en) * 2010-01-15 2011-07-28 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for automatically classifying content
CN110147699A (en) * 2018-04-12 2019-08-20 北京大学 A kind of image-recognizing method, device and relevant device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110255844A1 (en) * 2007-10-29 2011-10-20 France Telecom System and method for parsing a video sequence
TWI666595B (en) * 2018-02-26 2019-07-21 財團法人工業技術研究院 System and method for object labeling

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011145951A (en) * 2010-01-15 2011-07-28 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for automatically classifying content
CN110147699A (en) * 2018-04-12 2019-08-20 北京大学 A kind of image-recognizing method, device and relevant device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video quality classification based home video segmentation; Si Wu et al.; IEEE International Conference on Multimedia and Expo (ICME); Abstract *

Also Published As

Publication number Publication date
CN117156221A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Johansson The approach of the Text Encoding Initiative to the encoding of spoken discourse
CN111027584A (en) Classroom behavior identification method and device
CN109949799B (en) Semantic parsing method and system
Baur et al. eXplainable cooperative machine learning with NOVA
JP2007087397A (en) Morphological analysis program, correction program, morphological analyzer, correcting device, morphological analysis method, and correcting method
Doumbouya et al. Using radio archives for low-resource speech recognition: towards an intelligent virtual assistant for illiterate users
CN115618022B (en) Low-resource relation extraction method based on data synthesis and two-stage self-training
CN111180025A (en) Method and device for representing medical record text vector and inquiry system
Wagner et al. Applying cooperative machine learning to speed up the annotation of social signals in large multi-modal corpora
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
CN117156221B (en) Short video content understanding and labeling method
CN109213970B (en) Method and device for generating notes
CN110929015B (en) Multi-text analysis method and device
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
CN110472032A (en) More classification intelligent answer search methods of medical custom entities word part of speech label
CN116129868A (en) Method and system for generating structured photo
CN116092472A (en) Speech synthesis method and synthesis system
CN114281948A (en) Summary determination method and related equipment thereof
Angrave et al. Creating tiktoks, memes, accessible content, and books from engineering videos? first solve the scene detection problem
CN114661900A (en) Text annotation recommendation method, device, equipment and storage medium
CN113963306A (en) Courseware title making method and device based on artificial intelligence
CN114328902A (en) Text labeling model construction method and device
CN113691382A (en) Conference recording method, conference recording device, computer equipment and medium
CN114078470A (en) Model processing method and device, and voice recognition method and device
JPWO2020054822A1 (en) Sound analyzer and its processing method, program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant