CN117156221A - Short video content understanding and labeling method and device
- Publication number
- CN117156221A CN117156221A CN202311421767.7A CN202311421767A CN117156221A CN 117156221 A CN117156221 A CN 117156221A CN 202311421767 A CN202311421767 A CN 202311421767A CN 117156221 A CN117156221 A CN 117156221A
- Authority
- CN
- China
- Prior art keywords
- video
- labeling
- annotation
- content
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H04N 21/83 — Generation or processing of protective or descriptive data associated with content; content structuring
- H04N 21/23418 — Processing of video elementary streams, involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N 21/44008 — Processing of video elementary streams, involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
- H04N 21/8456 — Structuring of content by decomposing the content in the time domain, e.g. in time segments

(All within H04N 21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD].)
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a short video content understanding and annotation method and device, relating to the technical field of video content annotation. In step S1, a video stream to be annotated is read and parsed using set parameters, which include the sliding window size and the sliding step size. In step S2, the sliding window is applied to the video frame sequence, sliding stepwise from the start of the video to acquire the frames of each window. In step S3, the frames in each window are presented in chronological order, and an annotator labels the content of each window's frames in turn according to the different annotation types, generating annotation events on an annotation timeline. In step S4, the annotated frames are integrated with weights adjusted through the set parameters, and a video stream carrying a complete annotated-event timeline is output. By providing the annotation device, the invention improves annotator efficiency, shortens the time consumed by annotation, and improves the precision of the annotated content.
Description
Technical Field
The invention belongs to the field of video content annotation, and particularly relates to a short video content understanding annotation method and device.
Background
Video content understanding annotation is a key technology for automatic content analysis and retrieval: objects, emotions, events, and the like are marked in the video. It combines computer vision, natural language processing, and machine learning to help machine systems understand and interpret video content, providing users with a more accurate and personalized video experience. Content understanding annotation is key to intelligent video processing and has broad application prospects in industry and research. In the early stages, however, content understanding annotation must be performed manually in order to build up training data for training deep neural networks.
For example, Chinese patent CN112040256B discloses a method and system for annotating video in live experimental teaching: video and audio are acquired from a media stream, recognized against the experimental equipment, the recognition results are matched to time, the matching information is stored, and the video is annotated accordingly. However, that patent supports only a single annotation form and cannot annotate deeper video event behaviors.
Disclosure of Invention
To address these technical problems, the invention discloses a short video content understanding and annotation method comprising the following steps.
Step S1: read the video stream to be annotated and parse it using set parameters, which include the sliding window size and the sliding step size.
Step S2: apply the sliding window from step S1 to the video frame sequence, sliding the window stepwise from the start of the video and acquiring the frames of each window.
Step S3: present the frames in each window in chronological order; an annotator labels the content of each window's frames in turn according to the different annotation types, generating annotation events on an annotation timeline.
Step S4: perform weighted integration on the annotated frames, adjusting the weights through the values of the set parameters, and output a video stream whose annotation timeline carries the complete annotated events.
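Steps S1 and S2 amount to a windowed pass over the decoded frame sequence. The following minimal sketch illustrates this, assuming OpenCV for decoding; the function and parameter names (`sliding_windows`, `window_size`, `step`) are illustrative, not terms from the patent:

```python
import cv2

def sliding_windows(video_path, window_size, step):
    """Yield (start_index, frames) for each sliding window (steps S1-S2)."""
    cap = cv2.VideoCapture(video_path)  # S1: read the video stream to be annotated
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    # S2: slide the window from the start of the video; each window holds
    # `window_size` frames and advances by `step` frames.
    for start in range(0, max(len(frames) - window_size + 1, 1), step):
        yield start, frames[start:start + window_size]
```

Each yielded window would then be presented to the annotator in chronological order (step S3).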
Further, the annotation types include key-frame annotation, object-tracking annotation, emotion and sentiment annotation, audio annotation, and event annotation. Key-frame annotation selects and labels a key frame at a specific time point; object-tracking annotation labels the trajectory or motion of trackable objects in the video frames; emotion and sentiment annotation labels the emotional changes occurring at different time points; event annotation labels the content events of the video frames; and audio annotation labels the audio, identifying speech content, sound effects, and music.
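One way to encode these five annotation types as events on the annotation timeline is sketched below; the enum and field names are assumptions for illustration, not the patent's terms:

```python
from dataclasses import dataclass, field
from enum import Enum

class AnnotationType(Enum):
    KEY_FRAME = "key_frame"              # key frame selected at a specific time point
    OBJECT_TRACKING = "object_tracking"  # trajectory/motion of a trackable object
    EMOTION = "emotion"                  # emotion/sentiment change at a time point
    AUDIO = "audio"                      # speech content, sound effects, music
    EVENT = "event"                      # content event of the video frames

@dataclass
class AnnotationEvent:
    type: AnnotationType
    start_time: float                    # seconds on the annotation timeline
    end_time: float
    payload: dict = field(default_factory=dict)  # type-specific annotation content
```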
Further, the annotation procedure for the event annotation type comprises the following steps.
Step S101: set the annotation type to event annotation, perform speech-to-text recognition on the frames in the window, clean the resulting text data, generate a content summary using a text-analysis generation technique, and display the summary on the annotation event for the annotator to check.
Step S102: the annotator reviews and judges the content summary; the judging criteria include whether the description is accurate, whether homophone errors are present, and whether manual correction is needed. When manual correction is needed, the annotator modifies a copy of the content summary, and the modified copy is applied to the annotation events in chronological order.
Step S103: the events and plot points of each window's frames are weighted and integrated according to the set parameters; the integrated video frames carry event nodes appearing at different time points.
Further, the event node contains the annotated event information, which includes the speech-to-text correction content, a description of the video event content, the content summary, and the degree of match between the summary and the video.
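As a sketch, the event information listed above maps naturally onto a small record type; the field names below are illustrative, not the patent's:

```python
from dataclasses import dataclass

@dataclass
class EventNode:
    """Annotated event node carried by the integrated video frames."""
    time_point: float          # where the node appears on the timeline, in seconds
    corrected_transcript: str  # speech-to-text output after manual correction
    event_description: str     # description of the video event content
    summary: str               # auto-generated content summary
    matching_degree: float     # match between summary and video, e.g. 0..1
```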
The short video content understanding and annotation device comprises a video reading unit, a format conversion unit, an annotation conversion unit, and a multimodal fusion unit. The video reading unit reads the video stream to be annotated; the format conversion unit converts the format of the video stream so that video formats are unified during annotation; the annotation conversion unit converts the annotated content into annotation nodes on a timeline; and the multimodal fusion unit performs multimodal fusion on the annotated frame content of each sliding window to form the final annotated content to be output.
Further, the video reading unit is configured with reading parameters, which include the resolution of the video stream. The video reading unit converts the video stream at a specific resolution and presents it to the annotation interface, through which the annotator performs the annotation operations.
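A minimal sketch of how the four units might be composed, treating each unit as a callable stage; the class and attribute names are assumptions, not taken from the patent:

```python
class AnnotationDevice:
    """Pipeline of the four units of the annotation device (names assumed)."""

    def __init__(self, reader, converter, annotator, fuser):
        self.reader = reader        # video reading unit: reads the stream at the configured resolution
        self.converter = converter  # format conversion unit: unifies video formats
        self.annotator = annotator  # annotation conversion unit: content -> timeline nodes
        self.fuser = fuser          # multimodal fusion unit: fuses per-window annotations

    def run(self, video_path):
        stream = self.converter(self.reader(video_path))
        nodes = self.annotator(stream)  # annotation nodes on the timeline
        return self.fuser(nodes)        # final annotated content to output
```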
Compared with the prior art, the invention has the following beneficial effects: (1) the method sets different annotation types, understands and annotates the content of the video stream, and generates video frames carrying complete annotation events, providing training data for training deep neural networks to understand video content automatically; (2) by providing the annotation device, the invention improves annotator efficiency, shortens the time consumed by annotation, and improves the precision of the annotated content.
Drawings
FIG. 1 is an exemplary flowchart of the annotation method of the present invention.
FIG. 2 is an exemplary flowchart of event annotation in the present invention.
FIG. 3 is a schematic block diagram of the units of the annotation device of the present invention.
Reference numerals: 201 - video reading unit; 202 - format conversion unit; 203 - annotation conversion unit; 204 - multimodal fusion unit.
Detailed Description
Embodiment: as shown in FIG. 1, a short video content understanding and annotation method comprises the following steps.
Step S1: read the video stream to be annotated and parse it using set parameters, which include the sliding window size and the sliding step size.
Step S2: apply the sliding window from step S1 to the video frame sequence, sliding the window stepwise from the start of the video and acquiring the frames of each window.
Step S3: present the frames in each window in chronological order; an annotator labels the content of each window's frames in turn according to the different annotation types, generating annotation events on an annotation timeline.
Step S4: perform weighted integration on the annotated frames, adjusting the weights through the values of the set parameters, and output a video stream whose annotation timeline carries the complete annotated events.
In the above steps, the sliding window size is the number of video frames contained in each window, and the sliding step size is the number of frames the window advances along the video; for example, a window size of 30 frames with a step size of 15 frames gives consecutive windows that overlap by half.
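The patent does not spell out the weighting scheme of step S4. One plausible reading, sketched below under that assumption, is that annotation events from overlapping windows vote for positions on the timeline, with each window's vote scaled by a weight derived from the set parameters (this scheme, and the reuse of the `AnnotationEvent` sketch above, are illustrative):

```python
from collections import defaultdict

def integrate_annotations(window_events, weights, threshold=1.0):
    """Merge per-window annotation events into one timeline (a reading of step S4).

    window_events: list of (window_index, [AnnotationEvent, ...]) pairs
    weights: per-window weights, adjustable through the set parameters
    """
    scores = defaultdict(float)
    for idx, events in window_events:
        for ev in events:
            # Events of the same type near the same time point accumulate
            # weight from every overlapping window that reports them.
            key = (ev.type, round(ev.start_time, 1))
            scores[key] += weights[idx]
    # Keep the (type, time) positions whose accumulated weight clears the threshold.
    kept = [key for key, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda k: k[1])  # ordered by time point
```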
The annotation types include key-frame annotation, object-tracking annotation, emotion and sentiment annotation, audio annotation, and event annotation. Key-frame annotation selects and labels a key frame at a specific time point; object-tracking annotation labels the trajectory or motion of trackable objects in the video frames; emotion and sentiment annotation labels the emotional changes occurring at different time points; event annotation labels the content events of the video frames; and audio annotation labels the audio, identifying speech content, sound effects, and music.
As shown in FIG. 2, the exemplary steps of this embodiment for the event annotation type include the following.
Step S101: set the annotation type to event annotation, perform speech-to-text recognition on the frames in the window, clean the resulting text data, generate a content summary using a text-analysis generation technique, and display the summary on the annotation event; the annotator views the content summary through the annotation interface.
Step S102: the annotator reviews and judges the content summary; the judging criteria include whether the description is accurate, whether homophone errors are present, and whether manual correction is needed. When manual correction is needed, the annotator modifies a copy of the content summary, and the modified copy is applied to the annotation events in chronological order.
Step S103: the events and plot points of each window's frames are weighted and integrated according to the set parameters; the integrated video frames carry event nodes appearing at different time points.
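A compact sketch of the S101 pipeline follows. The `transcribe` and `summarize` functions stand in for whatever speech-recognition and text-analysis engines the device actually uses; they are hypothetical placeholders, not APIs named by the patent:

```python
import re

def transcribe(window_frames):
    # Hypothetical stand-in for a real speech-to-text engine.
    return "example  recognized   speech text. more dialogue follows."

def summarize(text, max_words=12):
    # Hypothetical stand-in for the text-analysis summary generator;
    # here it simply truncates, where a real system would generate an abstract.
    return " ".join(text.split()[:max_words])

def annotate_event_window(window_frames):
    """Step S101: speech-to-text, clean the text data, generate a content summary."""
    raw_text = transcribe(window_frames)
    cleaned = re.sub(r"\s+", " ", raw_text).strip()  # cleaning of the text data
    summary = summarize(cleaned)
    # The summary is attached to the annotation event for the annotator to
    # check and, where needed, manually correct (step S102).
    return summary
```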
In the above steps, the event node contains the annotated event information, which includes the speech-to-text correction content, a description of the video event content, the content summary, and the degree of match between the summary and the video.
As shown in FIG. 3, the short video content understanding and annotation device includes a video reading unit 201, a format conversion unit 202, an annotation conversion unit 203, and a multimodal fusion unit 204. The video reading unit 201 reads the video stream to be annotated; the format conversion unit 202 converts the format of the video stream so that video formats are unified during annotation; the annotation conversion unit 203 converts the annotated content into annotation nodes on the timeline; and the multimodal fusion unit 204 performs multimodal fusion on the annotated frame content of each sliding window to form the final annotated content to be output.
The video reading unit 201 is configured with reading parameters, which include the resolution of the video stream; it converts the video stream at a specific resolution and presents it to the annotation interface, through which the annotator performs annotation operations. In this embodiment, the annotation interface may be implemented as an HTML front end of a website, where the HTML page displays the video stream processed by the video reading unit 201 and the format conversion unit 202 through an API of the background it accesses; it may also be implemented as terminal software that annotates the video content locally.
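For the HTML variant, a minimal Flask sketch of the background API that the annotation page might call is shown below; the route, payload, and URLs are assumptions for illustration only:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/windows/<int:index>")
def get_window(index):
    # Would return the window produced by the video reading unit 201 and the
    # format conversion unit 202, for the HTML annotation page to render.
    return jsonify({"window": index, "frames_url": f"/frames/{index}.mp4"})

if __name__ == "__main__":
    app.run()  # the terminal-software variant would annotate locally instead
```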
Claims (6)
1. A short video content understanding and annotation method, characterized by comprising:
step S1, reading a video stream to be annotated and parsing it using set parameters, the set parameters comprising a sliding window size and a sliding step size;
step S2, applying the sliding window of step S1 to the video frame sequence, sliding the window stepwise from the start of the video and acquiring the frames of each window;
step S3, presenting the frames in each window in chronological order, an annotator labeling the content of each window's frames in turn according to different annotation types to generate annotation events on an annotation timeline;
step S4, performing weighted integration on the annotated frames, adjusting the weights through the values of the set parameters, and outputting a video stream whose annotation timeline carries the complete annotated events.
2. The short video content understanding and annotation method of claim 1, wherein the annotation types comprise key-frame annotation, object-tracking annotation, emotion and sentiment annotation, audio annotation, and event annotation; the key-frame annotation selects and labels a key frame at a specific time point; the object-tracking annotation labels the trajectory or motion of trackable objects in the video frames; the emotion and sentiment annotation labels the emotional changes occurring at different time points; the event annotation labels the content events of the video frames; and the audio annotation labels the audio, identifying speech content, sound effects, and music.
3. The short video content understanding and annotation method of claim 1, wherein the annotation procedure for the event annotation type comprises:
step S101, setting the annotation type to event annotation, performing speech-to-text recognition on the frames in the window, cleaning the text data, generating a content summary using a text-analysis generation technique, and displaying the summary on the annotation event for the annotator to check;
step S102, the annotator reviewing and judging the content summary, the judging criteria comprising whether the description is accurate, whether homophone errors are present, and whether manual correction is needed; when manual correction is needed, the annotator modifying a copy of the content summary, the modified copy being applied to the annotation events in chronological order;
step S103, weighting and integrating the events and plot points of each window's frames according to the set parameters, the integrated video frames carrying event nodes appearing at different time points.
4. The short video content understanding and annotation method of claim 3, wherein the event node contains annotated event information comprising the speech-to-text correction content, a video event content description, the content summary, and the degree of match between the summary and the video.
5. A short video content understanding and annotation device, characterized by comprising a video reading unit (201), a format conversion unit (202), an annotation conversion unit (203), and a multimodal fusion unit (204), wherein the video reading unit (201) reads the video stream to be annotated; the format conversion unit (202) converts the format of the video stream so that video formats are unified during annotation; the annotation conversion unit (203) converts the annotated content into annotation nodes on a timeline; and the multimodal fusion unit (204) performs multimodal fusion on the annotated frame content of each sliding window to form the final annotated content to be output.
6. The short video content understanding and annotation device of claim 5, wherein the video reading unit (201) is configured with reading parameters comprising the resolution of the video stream; the video reading unit (201) converts the video stream at a specific resolution and presents it to an annotation interface, through which an annotator performs annotation operations.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311421767.7A (granted as CN117156221B) | 2023-10-31 | 2023-10-31 | Short video content understanding and labeling method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN117156221A | 2023-12-01 |
| CN117156221B | 2024-02-06 |
Family
ID=88899151
Family Applications (1)

| Application Number | Priority Date | Filing Date | Status |
|---|---|---|---|
| CN202311421767.7A (CN117156221B) | 2023-10-31 | 2023-10-31 | Active |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN117156221B |
Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2011145951A * | 2010-01-15 | 2011-07-28 | Nippon Telegraph & Telephone Corp (NTT) | Apparatus, method and program for automatically classifying content |
| US20110255844A1 * | 2007-10-29 | 2011-10-20 | France Telecom | System and method for parsing a video sequence |
| CN110147699A * | 2018-04-12 | 2019-08-20 | Peking University | Image recognition method, device and related equipment |
| US20190266439A1 * | 2018-02-26 | 2019-08-29 | Industrial Technology Research Institute | System and method for object labeling |
Non-Patent Citations (1)

| Title |
|---|
| SI WU et al., "Video quality classification based home video segmentation", IEEE International Conference on Multimedia and Expo (ICME) * |
Also Published As

| Publication number | Publication date |
|---|---|
| CN117156221B | 2024-02-06 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |