CN115795096A - Video metadata labeling method for movie and television materials - Google Patents

Video metadata labeling method for movie and television materials

Info

Publication number
CN115795096A
CN115795096A (Application CN202211513362.1A)
Authority
CN
China
Prior art keywords
video
semantic
metadata
frame
data items
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211513362.1A
Other languages
Chinese (zh)
Inventor
孟昭旭
韩菲琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING FILM ACADEMY
Original Assignee
BEIJING FILM ACADEMY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING FILM ACADEMY filed Critical BEIJING FILM ACADEMY
Priority to CN202211513362.1A priority Critical patent/CN115795096A/en
Publication of CN115795096A publication Critical patent/CN115795096A/en
Pending legal-status Critical Current

Landscapes

  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video metadata labeling method for movie and television material, belonging to the technical field of digital movie production. The method comprises: reading and preprocessing a video to obtain a video frame sequence, determining whether a clapperboard is present, reading the frame sequence frame by frame, and identifying tag data items, thereby automatically extracting semantic information; mapping the semantic information to semantic tag data items, storing them in semantic tag fields, cleaning and eliminating redundant data items, and performing associated storage, thereby automatically labeling video semantic tags; and reading the video metadata, retaining the necessary video metadata tag fields, and adding semantic tag fields to construct the video metadata of the material. By analyzing and processing the video material, the method automatically extracts frame-picture semantic information, automatically labels semantic tags related to the video content, and stores them in the video metadata of the material, thereby improving the efficiency of retrieving and managing movie and television material.

Description

Video metadata labeling method for film and television material
Technical Field
The invention relates to a video metadata labeling method for a film and television material, and belongs to the technical field of digital movie production.
Background
In the digital movie production process, the movie and television material shot by digital cameras exists in storage media in the form of files. The metadata of the material describes basic shooting-related information such as focal length, aperture, sensitivity, resolution and frame rate; however, the metadata written automatically by the camera is relatively limited, so a script supervisor is present on the film set, who manually fills in continuity sheets and records the content information of each shot, thereby supplementing the metadata of the movie and television material.
With the large-scale, industrialized and efficient production of commercial films, multi-camera shooting, rapid scene changes and coordination among multiple production departments occur frequently, and manually recording this basic information is inefficient. With the emergence of virtual production and other new electronic cinematography techniques, shooting modes have become more complex and varied, and manual recording increasingly depends on the script supervisor's practical experience and understanding of the shot content, which raises the probability of omissions, mis-recording and misreading.
At present, post-production in the film and television industry supports metadata-based classified management of movie and television material, which facilitates efficient retrieval and management of the material during post-production. For example, patent CN111680189A discloses a method and apparatus for retrieving film and television play content, which generates summary information from the video entity information of a work and constructs a knowledge graph of the work to improve retrieval; patent CN106484774A, an association method and system for multi-source video metadata, establishes association relationships between video metadata through deduplication, field splitting, format normalization and similar operations, so as to extract the video entity information of a work, such as title, release date and regional classification; and CN109670080A, a method, apparatus, device and storage medium for determining movie tags, uses the video entity information of films and television plays together with audience description keywords as associated tag words of the works, serving a film recommendation algorithm. However, these video content retrieval technologies are mainly based on video entity information such as the title, keywords, abstract and principal creators; such content mostly describes the work itself and cannot meet the retrieval requirements of large-scale movie and television material.
The prior art lacks a video metadata labeling method that, to a certain extent, replaces traditional manual recording, automatically extracts the content information presented in the picture, constructs video semantic tags, and stores them in the video metadata so as to serve the retrieval and management of movie and television material.
Disclosure of Invention
The invention aims to provide a video metadata labeling method for movie and television material which, by analyzing and processing the material, extracting frame-picture semantic information, automatically labeling semantic tags related to the video content and storing them in the video metadata of the material, improves the efficiency of retrieving and managing video material.
The purpose of the invention is realized by the following technical scheme:
the invention discloses a video metadata labeling method of a video material, which comprises the following steps:
step one, extracting semantic information according to a frame picture of a movie material, and specifically comprising the following substeps:
step 1.1, reading a video and performing video preprocessing to obtain a video frame sequence, and determining whether a clapperboard is present in the frame picture by an object detection method;
the video preprocessing comprises: transcoding the video, reducing its resolution and sampling frames, while retaining the semantic information in the video frame pictures, and outputting a video frame sequence;
the clapperboard (slate) is used for writing and displaying the scene serial number, shot serial number and number of takes of the content currently being shot, hereinafter referred to as the scene number, shot number and take number;
step 1.2, reading the video frame sequence frame by frame, locating the clapperboard region in frames containing a clapperboard by a clapperboard detection and recognition method, recognizing the scene number, shot number and take number written on the clapperboard, and recording them as tag data items of that frame;
step 1.3, reading the video frame sequence frame by frame, extracting the content information appearing in the frame picture by the object detection and recognition methods in the semantic information extraction module, and storing it as semantic tag data items;
preferably, the data items of the semantic tag fields include: actor, pose, shot scale, interior/exterior, scene, day/night, camera movement mode, object, and text description;
preferably, the object detection and recognition methods include: a face detection and recognition model, a face key point detection model, a human skeleton key point detection model, a scene recognition model, a camera movement recognition model, an object detection and recognition model, and an image description generation model;
step two, labeling video semantic tags according to the semantic information, specifically comprising the following substeps:
step 2.1, determining semantic tag fields to be extracted of the video, mapping semantic information extracted from frame pictures to semantic tag data items, and storing the semantic information to the semantic tag fields corresponding to each frame picture;
step 2.2, integrating all semantic tag fields extracted from the frame sequence, screening and analyzing semantic tag data items, and cleaning and eliminating redundant data items according to interframe continuity to generate effective video semantic tag data items;
step 2.3, the extracted video semantic tag data items are stored in association with corresponding semantic tag fields;
step three, constructing the video metadata of the movie and television material according to the semantic tags, specifically comprising the following substeps:
step 3.1, acquiring a film and television material, reading video metadata, performing splitting, deleting, modifying and format normalization operation on the read video metadata, and reserving necessary video metadata tag fields;
the movie and television material includes video shot by digital cameras, as well as material video existing on internet video websites, with video content providers and in media asset storage systems, used to produce films, TV series, variety shows, online live streams and the like;
step 3.2, adding semantic tag fields to the read video metadata, supplementing and adding semantic tag fields related to video contents on the basis of the existing metadata tag fields of the video materials, and writing the labeled video semantic tags into the video metadata according to the original data format of the metadata of the current video materials;
in the semantic tag fields, the semantic tag categories that must be set include the scene number, shot number, take number, actor, pose, human body key points, scene category, interior/exterior, day/night, camera movement mode, object category and text description;
if a certain field fails to extract a corresponding data item, recording the data item as N/A;
step 3.3, with each material as a separate unit, integrating the initial metadata read in step 3.1 with the semantic tags and corresponding data items obtained in step two, and storing them in a text file format;
and step 3.4, converting the metadata format as required: for the specified non-linear editing software or digital intermediate (DI) software, rewriting the metadata into a data structure or tag fields that the software can read and write, and storing it in a video metadata file of the specific format.
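For illustration only, the three steps above can be viewed as one pipeline. The following Python sketch is not part of the claimed method; the four callables are hypothetical stand-ins for the semantic extraction, tag consolidation and metadata read/write operations described in steps one to three.

```python
def label_video_metadata(video_path, extract_frame_tags, consolidate_tags,
                         read_metadata, write_metadata):
    """High-level flow of steps one to three for a single piece of material."""
    frame_tags = extract_frame_tags(video_path)    # step one: per-frame semantic info
    video_tags = consolidate_tags(frame_tags)      # step two: clean and merge into video-level tags
    metadata = read_metadata(video_path)           # step 3.1: existing metadata tag fields
    metadata.update(video_tags)                    # step 3.2: add the semantic tag fields
    write_metadata(video_path, metadata)           # steps 3.3-3.4: store and convert as required
    return metadata
```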
Beneficial effects:
The video metadata labeling method for movie and television material of the present invention analyzes the content of each video frame picture, extracts the data items of the semantic tag fields, analyzes and screens the data items used to construct the video semantic tag fields, and, combined with a metadata editing method for the material, realizes the reading, writing and generation of video metadata.
Drawings
FIG. 1 is a schematic structural diagram of a labeling system corresponding to a video metadata labeling method for video material according to the present invention;
FIG. 2 is a general flowchart of a method for tagging video metadata of video material according to the present invention;
FIG. 3 is a schematic flow chart of a method for extracting semantic information according to an embodiment;
FIG. 4 is a flowchart illustrating a method for labeling semantic tags of a video according to an embodiment;
FIG. 5 is a flowchart illustrating a method for constructing video metadata of a video material according to an embodiment;
FIG. 6 is a diagram illustrating file structures of two types of video metadata according to an embodiment;
fig. 7 is a diagram illustrating an example of a video metadata parsing result of a movie material according to an embodiment.
Detailed Description
For better illustrating the objects and advantages of the present invention, the following description is provided in conjunction with the accompanying drawings and examples.
Example 1:
as shown in fig. 1, a labeling system applying the video metadata labeling method for movie and television material according to an embodiment of the present invention includes: a semantic information extraction module M10, a video semantic tag labeling module M20, and a video metadata construction module M30;
the semantic information extraction module M10 is used for extracting semantic information from the frame pictures of the movie and television material;
in this embodiment, the models required by the semantic information extraction module M10 are obtained by collecting a semantic information database related to film and television content and training on it, specifically including:
(1) A clapperboard detection and recognition model is constructed by training on a clapperboard image data set; image processing is applied to various types of clapperboard images to obtain clapperboard images with standard brightness and without perspective or distortion, the positions corresponding to the slate information in the clapperboard images are annotated, and a clapperboard image segmentation model is trained;
(2) Training and constructing a handwritten character recognition model through a handwritten character data set;
(3) Training and constructing a face detection recognition model and a face key point detection model through an actor face data set;
(4) Training and constructing a human skeleton key point detection model through an actor action data set;
(5) Training and constructing a shot-scale recognition model through a shot-scale classification data set of movie and television material;
(6) Training and constructing a scene recognition model through a scene data set;
(7) Training and constructing a camera movement recognition model through a camera-movement classification data set of movie and television material;
(8) Training and constructing an object detection and identification model through the props and the object data set;
(9) Training and constructing an image description generation model through a video material description data set;
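For illustration, the correspondence between semantic tag fields and the models (1)-(9) above can be kept in a configuration table so that models can be added or removed per user requirements; the Python sketch below uses hypothetical model names purely as placeholders.

```python
# Illustrative registry: which (hypothetical) models produce which semantic tag fields.
SEMANTIC_MODEL_REGISTRY = {
    "scene_no/shot_no/take_no": ["slate_detector", "slate_segmenter", "handwriting_ocr"],
    "actors":                   ["face_detector", "face_landmarks", "face_recognizer"],
    "pose":                     ["face_landmarks"],
    "shot_scale":               ["face_detector", "body_keypoint_detector"],
    "scene/interior_exterior":  ["scene_classifier"],
    "day_night":                ["brightness_analysis"],
    "camera_movement":          ["optical_flow_classifier"],
    "objects":                  ["object_detector"],
    "description":              ["image_caption_generator"],
}
```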
the video semantic tag labeling module M20 is configured to label video semantic tags according to the semantic information;
the video metadata construction module M30 is used for constructing video metadata of the video material according to the semantic tags;
the construction module is used for constructing video metadata according to the labeled semantic tag information of the video and the required data format and storing the semantic tag information in a video metadata file with a specified format.
In the post-production process, the film and television materials are imported into post-production software in batches, and data items of initial metadata can be read and obtained, wherein the data items comprise basic information for describing shooting, such as frame rate, resolution, duration, audio information and the like;
in the post-production process, the video metadata labeling method for movie and television material is applied to analyze and process material uploaded in batches, generate the content-related semantic tags and data items required by the user, and store them in the video metadata, so that post-production software can read them and the material can be classified, retrieved and managed according to the metadata information. As shown in fig. 2, the video metadata labeling method comprises the following steps:
step one, extracting semantic information according to the frame pictures of the movie and television material, as shown in fig. 3, including: a video preprocessing step 101, a clapperboard detection and recognition step 102, and a frame-picture semantic information extraction step 103;
a video preprocessing step 101, which reads a video and performs video preprocessing, includes the following substeps:
step 101.1, acquiring a video file of a movie material, and sorting the movie material to be processed according to a file name, a file type and file generation time;
step 101.2, reading the video material to be processed, acquiring video information, and decoding and converting the video information;
the decoding and conversion specifically include: decoding the video data by adopting a corresponding video decoding algorithm, performing down-sampling on the decoded video data, converting the down-sampled video data into specific bit depth, pixel format, resolution, frame rate and color space, and converting the audio data of an original video material file into specific bit depth, specific sampling frequency and specifically coded audio data;
decoding and conversion are performed to facilitate subsequent analysis and processing;
step 101.3, acquiring the decoded and converted video file, and extracting a frame sequence at a certain sampling frequency;
the frame rate of movie and television material is usually 24 fps, 30 fps or 60 fps, and only a certain number of video frames need to be extracted to support analysis and understanding of the video content;
the obtained video frame sequence is sent to a semantic information extracting module M10;
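For illustration, steps 101.2-101.3 can be sketched with OpenCV as follows; the target width and sampling rate are illustrative assumptions, not values prescribed by the method.

```python
# A minimal preprocessing sketch (decode, downsample, sample frames), assuming OpenCV.
import cv2


def extract_frame_sequence(video_path, target_width=640, samples_per_second=2):
    """Decode a video, downsample each frame and return a sparse frame sequence."""
    capture = cv2.VideoCapture(video_path)
    if not capture.isOpened():
        raise IOError(f"cannot open video: {video_path}")

    fps = capture.get(cv2.CAP_PROP_FPS) or 24.0          # fall back to 24 fps if unknown
    step = max(1, int(round(fps / samples_per_second)))  # keep every `step`-th frame

    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            height = int(frame.shape[0] * target_width / frame.shape[1])
            small = cv2.resize(frame, (target_width, height))  # reduce resolution
            rgb = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)       # convert color space
            frames.append((index, rgb))                        # keep original frame index
        index += 1
    capture.release()
    return frames
```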
step 101.4, a semantic information database is constructed;
since movie and television materials and user requirements differ, the semantic tags to be detected can be supplemented by constructing additional semantic information data sets, so that the functions and steps of the semantic information extraction module M10 can be adjusted according to user requirements;
step 101.5, training semantic information extraction models on the semantic information database constructed in step 101.4, sorting out the types of semantic information to be recognized, and adjusting the specific recognition and detection models to be used by the semantic information extraction module M10;
using deep learning and computer vision techniques, the semantic information extraction module M10 analyzes and understands the content of the frames of the movie and television material produced by steps 101.2 and 101.3;
step 102, extracting the slate information of the frame picture by detecting and recognizing the clapperboard, comprising the following substeps:
step 102.1, reading a frame picture of the movie and television material, and detecting whether a clapperboard is present in the current picture;
if yes, continue to step 102.2; if not, go to step 103.1;
step 102.2, determining the bounding box of the clapperboard by a clapperboard localization method;
step 102.3, cropping the image according to the clapperboard bounding box, segmenting the clapperboard region image obtained in step 102.2 with an image segmentation method, and passing the different segmented regions to the next step;
step 102.4, inputting the segmented clapperboard region images into the handwritten character recognition model;
step 102.5, obtaining the text information recorded on the clapperboard, such as the scene number, shot number and take number;
the value range of the scene number, shot number and take number data items is natural numbers (integers), and the text information is a character string;
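For illustration, the flow of steps 102.1-102.5 can be sketched as below; the three callables are hypothetical stand-ins for the trained clapperboard detection, segmentation and handwritten character recognition models described above.

```python
# Sketch of the slate-reading flow; detect_slate, segment_fields and recognize_text
# are assumed model wrappers, not part of a real library.
def read_slate_fields(frame, detect_slate, segment_fields, recognize_text):
    """Return {'scene_no', 'shot_no', 'take_no'} read from a clapperboard, or None."""
    box = detect_slate(frame)                 # step 102.2: bounding box or None
    if box is None:
        return None                           # no clapperboard: caller moves on to step 103.1
    x0, y0, x1, y1 = box
    slate_image = frame[y0:y1, x0:x1]         # step 102.3: crop the clapperboard region
    regions = segment_fields(slate_image)     # step 102.3: per-field sub-images (assumed dict)
    labels = {}
    for name in ("scene_no", "shot_no", "take_no"):
        text = recognize_text(regions[name])  # steps 102.4-102.5: handwriting recognition
        labels[name] = int(text) if text.isdigit() else text   # numbers kept as integers
    return labels
```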
step 103, extracting semantic information of the frame picture by the semantic information extracting module M10, comprising the following substeps:
step 103.1, reading a frame picture of a film and television material, detecting whether a person exists in the current picture by combining a face detection and identification model and a face key point detection model, and if so, carrying out face detection and identification to obtain an actor information data item;
sending an input frame into a trained human face key point detection model, performing human face detection and human face boundary box regression, screening all human face boundary boxes, outputting a human face boundary box and a plurality of human face key feature points, and judging whether a human face exists:
if no face exists, information of the actor is not extracted, and the step 103.4 is continuously executed;
if the face exists, aligning the detected face region according to a plurality of face key feature points, inputting the aligned face image into a face detection and recognition model, comparing the face image with actor face data one by one to obtain corresponding actor information and outputting the actor information;
the actor information comprises the position of a human face bounding box, corresponding actor names and the like, a plurality of human faces are allowed to exist in one frame of picture, the actor information is arranged from left to right and from top to bottom in the picture, and steps 103.2 and 103.3 are executed;
step 103.2, estimating the face pose from the face key feature points detected in step 103.1 using a pose estimation method, and determining the face orientation and shooting angle of the current actor;
the pose data items include: frontal, side, three-quarter, tilt and pitch;
step 103.3, determining the proportion of the face bounding box height to the frame height from the face bounding box detected in step 103.1 and the size of the frame picture, in combination with the shot-scale recognition model; corresponding thresholds are set for different shot scales according to conventional cinematography experience, and the shot-scale data item of the frame is obtained by comparison against these thresholds;
the shot-scale data items include: full shot, medium shot, close-up and extreme close-up;
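For illustration, the threshold comparison of step 103.3 might look like the sketch below; the ratio thresholds are illustrative assumptions and would in practice be tuned from conventional cinematography experience.

```python
# A minimal shot-scale heuristic based on the face-height / frame-height ratio.
def classify_shot_scale(face_box_height, frame_height):
    """Map the face-height-to-frame-height ratio to a shot-scale label."""
    ratio = face_box_height / float(frame_height)
    if ratio > 0.60:
        return "extreme close-up"
    if ratio > 0.30:
        return "close-up"
    if ratio > 0.12:
        return "medium shot"
    return "full shot"
```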
step 103.4, reading a frame picture of the film and television material, detecting whether a person exists in the current picture, and if so, acquiring human action information by combining a human skeleton key point detection model;
for some shots, when actors are filmed from behind or their faces are unclear, the human skeleton key point detection model is used to determine the proportion of the body bounding box height to the frame height; corresponding thresholds are set for different shot scales according to conventional cinematography experience, and the shot-scale data item of the frame is obtained by comparison against these thresholds;
if no person is present, the shot-scale data item of the frame is determined from the object bounding boxes output by the object detection and recognition model in step 103.8;
step 103.5, reading a frame picture of the movie and television material, classifying the scene in the current frame through the scene recognition model, looking up and comparing scene categories against a scene catalogue, and obtaining the interior/exterior classification of the scene;
step 103.6, reading a frame picture of the movie and television material, and, combining the scene category and the interior/exterior classification result with values such as the average brightness, color temperature and chroma of the R, G and B channels of the frame, analyzing to obtain the day/night classification of the frame;
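For illustration, a simple brightness-based day/night decision for step 103.6 could look like the sketch below; the luminance weights follow the common Rec. 601 convention, while the threshold and the interior/exterior adjustment are illustrative assumptions rather than the trained classifier.

```python
# Day/night heuristic from average frame brightness, assuming an RGB NumPy array.
import numpy as np


def classify_day_night(rgb_frame, is_exterior, threshold=0.35):
    """Label a frame 'day' or 'night' from its average brightness."""
    luminance = rgb_frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114])
    mean_brightness = float(luminance.mean()) / 255.0
    if is_exterior:
        return "day" if mean_brightness > threshold else "night"
    # interiors are usually lit, so require a darker frame before calling it night
    return "day" if mean_brightness > threshold * 0.7 else "night"
```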
step 103.7, reading frame pictures of the movie and television material, and recognizing the camera movement in the current frame through the camera movement recognition model;
the current frame and several preceding and following frames are input as a frame sequence; a sparse optical-flow map is computed for each frame using an optical-flow method to obtain optical-flow motion trajectories over time, and the camera movement mode is classified by combining the optical-flow trajectory characteristics of different camera movements;
the camera movement mode data item values include: static (fixed camera), push in, pull out, pan/tilt, tracking (dolly) and handheld;
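For illustration, step 103.7 can be sketched with OpenCV's sparse Lucas-Kanade optical flow as below; the rule that maps the average flow and its spread to a movement label is an illustrative simplification (push/pull detection would additionally require analysing flow divergence), not the trained model.

```python
# Rough camera-movement labeling from sparse optical flow between two grayscale frames.
import cv2
import numpy as np


def classify_camera_movement(prev_gray, next_gray, still_thresh=0.5):
    """Return a rough camera-movement label from sparse optical flow."""
    points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                     qualityLevel=0.01, minDistance=8)
    if points is None:
        return "static"
    moved, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, points, None)
    mask = status.flatten() == 1
    good_old = points[mask].reshape(-1, 2)
    good_new = moved[mask].reshape(-1, 2)
    if len(good_old) == 0:
        return "static"
    flow = good_new - good_old                 # per-point displacement vectors
    mean_flow = flow.mean(axis=0)
    spread = flow.std(axis=0).mean()           # irregular flow suggests handheld shooting
    magnitude = float(np.hypot(*mean_flow))
    if magnitude < still_thresh and spread < still_thresh:
        return "static"
    if spread > 3.0 * max(magnitude, 1e-6):
        return "handheld"
    return "pan" if abs(mean_flow[0]) >= abs(mean_flow[1]) else "tilt"
```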
step 103.8, reading a frame picture of the film and television material, detecting objects and props existing in the current picture through an object detection and identification model, and obtaining bounding boxes and categories of objects with different sizes;
the object category data item values are the prop types and related objects annotated in advance, including items that easily appear in shot by mistake, such as microphone stands, fill lights and tripods;
step 103.9, reading a frame picture of the movie and television material, and generating a text description of the current frame by combining the image description generation model with the semantic information obtained in the preceding steps;
in this embodiment, the semantic information includes the scene number, shot number, take number, actor, pose, human body key points, shot scale, interior/exterior, scene, day/night, camera movement mode, object category and text description, as shown in Table 1; the extracted semantic information is stored and recorded in the semantic tag fields corresponding to the frame picture;
TABLE 1

Frame no. | Scene no. | Shot no. | Take no. | Actor | Pose | Shot scale | Int./Ext. | Scene | Day/Night | Camera movement | Object | Text description
000001 | 3 | 5 | 1 | Xiaoming | Side | Full shot | Interior | Living room | Day | Static (fixed camera) | Football | Xiaoming playing football in the living room
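For illustration, the per-frame record of Table 1 can be represented by a simple data structure; the field names below are illustrative assumptions, and "N/A" marks fields that could not be extracted.

```python
# Per-frame semantic tag record mirroring the columns of Table 1.
from dataclasses import dataclass, field


@dataclass
class FrameSemanticTags:
    frame_number: str = "N/A"
    scene_no: str = "N/A"
    shot_no: str = "N/A"
    take_no: str = "N/A"
    actors: list = field(default_factory=list)
    pose: str = "N/A"
    shot_scale: str = "N/A"
    interior_exterior: str = "N/A"
    scene: str = "N/A"
    day_night: str = "N/A"
    camera_movement: str = "N/A"
    objects: list = field(default_factory=list)
    description: str = "N/A"


# e.g. the row shown in Table 1
example = FrameSemanticTags(
    frame_number="000001", scene_no="3", shot_no="5", take_no="1",
    actors=["Xiaoming"], pose="side", shot_scale="full shot",
    interior_exterior="interior", scene="living room", day_night="day",
    camera_movement="static", objects=["football"],
    description="Xiaoming playing football in the living room")
```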
Step two, labeling video semantic tags according to the semantic information:
the semantic information of the video frame pictures is obtained through step one; taking the video as a whole, all semantic information of the frame sequence is analyzed and screened, repeated, redundant and erroneous data items are eliminated according to the semantic tag fields required by the user, the data items valuable for video retrieval and management are retained, and the video semantic tags are constructed. As shown in fig. 4, this comprises the following substeps:
step 201, determining semantic tag fields to be extracted according to user requirements, acquiring semantic information obtained in the step one, and mapping data items of all categories of the extracted semantic information to semantic tag fields corresponding to a frame according to a frame index;
step 202, acquiring all semantic tag fields and data items constructed in step 201, integrating in turn all semantic tags and data items of the whole frame sequence of the material, performing statistical analysis, cleaning and eliminating redundant data items according to inter-frame continuity, screening the semantic tags required by the user, and refining and retaining the effective video semantic tag data items through multi-frame statistics (see the consolidation sketch below).
Step 203, constructing semantic label fields of the video, and performing associated storage on the data items screened in the step 202;
since user requirements differ, a supplementary model can be obtained by constructing a corresponding sample data set and training on it, so that the semantic tag fields can be expanded or removed, and semantic tag fields related to other video content can be added;
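For illustration, the multi-frame statistics of step 202 can be sketched as a majority vote over the per-frame tags, discarding values that are not stable across enough frames; the minimum-support threshold is an illustrative assumption.

```python
# Consolidate per-frame tag dictionaries into one video-level value per field.
from collections import Counter


def consolidate_tags(frame_tags, field_names, min_support=0.1):
    """Reduce a list of per-frame tag dicts to a single video-level tag dict."""
    video_tags = {}
    for name in field_names:
        values = [t[name] for t in frame_tags if t.get(name) not in (None, "N/A")]
        if not values:
            video_tags[name] = "N/A"                 # nothing reliable was extracted
            continue
        value, count = Counter(values).most_common(1)[0]
        # keep the majority value only if it is stable across enough frames
        video_tags[name] = value if count >= min_support * len(frame_tags) else "N/A"
    return video_tags
```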
Step three, constructing the video metadata of the movie and television material according to the semantic tags:
based on the video semantic tags obtained in step two, semantic tag fields are added to the video metadata according to the data format of the original material's video metadata and stored in association with the corresponding data items; depending on the post-production software used, the video metadata is format-converted in combination with the metadata data structure that the software can parse, yielding video metadata that serves the retrieval and management of the material. As shown in fig. 5, this comprises the following substeps:
step 301, because different camera manufacturers store metadata in different formats, the video metadata files of the movie and television material are read using the video metadata parsing methods provided by the shooting-equipment manufacturers, or with the help of post-production software, and the video information tags and data items of the current material are extracted;
step 302, on the basis of the video information label of the current video material, a semantic label field required by a user is added, namely the video semantic label constructed in the step two;
and step 303, storing the labels in step 301 and step two and the corresponding data items thereof in a text file format by taking each material as a separation unit.
step 304, converting the format of the text file stored in step 303 into a metadata standard format that post-production or digital intermediate (DI) software can parse, for ease of use, and storing it in a specific file;
as shown in fig. 6, this embodiment takes two mainstream movie post-production software packages as examples:
the metadata file format supported by the editing software Avid Media Composer is the .ale file; the file data structure is shown in fig. 6 (a); the header fields record the basic information and category information of the movie material, and the remaining part records the semantic tag information of the material, one material per line, with fields separated by tabs;
the metadata file format supported by the grading software DaVinci Resolve is the .csv file; the file data structure is shown in fig. 6 (b); each field in the first line is a tag category, the following lines record the semantic tag information of the material, one material per line, with fields separated by commas;
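For illustration, the two export formats can be sketched as below; the ALE header fields and column handling shown are a common minimal layout, and the exact fields expected by Avid Media Composer or DaVinci Resolve should be verified against the target software.

```python
# Minimal export sketches for the two metadata formats discussed above.
import csv


def write_ale(path, columns, rows, fps="24"):
    """Write clips as a tab-delimited .ale file (one clip per line under 'Data')."""
    with open(path, "w", newline="") as f:
        f.write("Heading\nFIELD_DELIM\tTABS\nFPS\t%s\n\n" % fps)
        f.write("Column\n" + "\t".join(columns) + "\n\n")
        f.write("Data\n")
        for row in rows:
            f.write("\t".join(str(row.get(c, "N/A")) for c in columns) + "\n")


def write_csv(path, columns, rows):
    """Write clips as a comma-separated .csv file (first line holds the tag categories)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="N/A",
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```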
the video metadata labeling method for movie and television material supports users uploading material in batches for analysis and processing, supports user-defined semantic information tags, stores them in the metadata file format selected by the user, and exports the corresponding metadata file as required;
in the post-production process, the metadata file exported by this method is imported into post-production software; as shown in fig. 7, the post-production software can parse the metadata file and each data item exported by the method of the present invention, thereby reading the data items of the initial metadata and the semantic data items supplemented by the method; through the tags in the metadata, the movie and television material can be classified, retrieved and managed according to the metadata information, improving the management efficiency of the material in post-production.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A video metadata labeling method for video material, characterized by comprising the following steps:
step one, extracting semantic information according to a frame picture of a movie material, and specifically comprising the following substeps:
step 1.1, reading a video and performing video preprocessing to obtain a video frame sequence, and determining whether a clapperboard is present in the frame picture by an object detection method;
step 1.2, reading the video frame sequence frame by frame, locating the clapperboard region in frames containing a clapperboard by a clapperboard detection and recognition method, recognizing the scene number, shot number and take number written on the clapperboard, and recording them as tag data items of that frame;
step 1.3, reading the video frame sequence frame by frame, extracting the content information appearing in the frame picture by the object detection and recognition methods in the semantic information extraction module, and storing it as semantic tag data items;
step two, labeling video semantic tags according to the semantic information, specifically comprising the following substeps:
step 2.1, determining semantic tag fields to be extracted of the video, mapping semantic information extracted from frame pictures to semantic tag data items, and storing the semantic tag data items in the semantic tag fields corresponding to each frame picture;
step 2.2, integrating all semantic tag fields extracted from the frame sequence, screening and analyzing semantic tag data items, and cleaning and eliminating redundant data items according to interframe continuity to generate effective video semantic tag data items;
step 2.3, the extracted video semantic tag data items and corresponding semantic tag fields are stored in an associated mode;
step three, building video metadata of the movie and television material according to the semantic tags, and specifically comprising the following substeps:
step 3.1, acquiring video materials, reading video metadata, performing splitting, deleting, modifying and format normalization operation on the read video metadata, and reserving necessary video metadata tag fields;
step 3.2, adding semantic tag fields to the read video metadata, supplementing and adding semantic tag fields related to video content on the basis of the existing metadata tag fields of the video materials, and writing the labeled video semantic tags into the video metadata according to the original data format of the metadata of the current video materials;
step 3.3, taking each material as a separation unit, integrating the initial metadata read in the step 3.1 with the semantic information labels and the corresponding data items obtained in the step two, and storing the semantic information labels and the corresponding data items in a text file format;
and step 3.4, converting the metadata format as required: for the specified non-linear editing software or digital intermediate (DI) software, rewriting the metadata into a data structure or tag fields that the software can read and write, and storing it in a video metadata file of the specific format.
2. The video metadata labeling method for video material according to claim 1, characterized in that: the video preprocessing in step 1.1 comprises: transcoding the video, reducing its resolution and sampling frames, while retaining the semantic information in the video frame pictures, and outputting a video frame sequence;
the object detection method in step 1.1 is realized by a clapperboard detection and recognition model;
the clapperboard in step 1.1 is a device used during film and television shooting for writing and displaying the scene serial number, shot serial number and number of takes of the content currently being shot;
the scene serial number, shot serial number and number of takes are referred to as the scene number, shot number and take number, respectively.
3. The video metadata labeling method for video material according to claim 1, characterized in that: the clapperboard detection and recognition method in step 1.2 is realized by a handwritten character recognition model.
4. The video metadata labeling method for video material according to claim 1, characterized in that: the data items of the semantic tag fields in step 1.3 include: actor, pose, shot scale, interior/exterior, scene, day/night, camera movement mode, object, and text description;
the object detection and recognition methods in step 1.3 are realized respectively by a face detection and recognition model, a face key point detection model, a human skeleton key point detection model, a scene recognition model, a camera movement recognition model, an object detection and recognition model and an image description generation model.
5. The video metadata labeling method for video material according to claim 1, characterized in that: the movie and television material in step 3.1 includes video shot by digital cameras, as well as material video from internet video websites, video content providers and media asset storage systems, used to produce films, TV series, variety shows, online live streams and the like.
6. The video metadata labeling method for video material according to claim 1, characterized in that: in the semantic tag fields in step 3.2, the semantic tag categories that must be set include the scene number, shot number, take number, actor, pose, human body key points, scene category, interior/exterior, scene, day/night, camera movement mode, object category and text description;
if a field fails to extract the corresponding data item, it is recorded as N/A.
CN202211513362.1A 2022-11-28 2022-11-28 Video metadata labeling method for movie and television materials Pending CN115795096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211513362.1A CN115795096A (en) 2022-11-28 2022-11-28 Video metadata labeling method for movie and television materials

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211513362.1A CN115795096A (en) 2022-11-28 2022-11-28 Video metadata labeling method for movie and television materials

Publications (1)

Publication Number Publication Date
CN115795096A true CN115795096A (en) 2023-03-14

Family

ID=85443222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211513362.1A Pending CN115795096A (en) 2022-11-28 2022-11-28 Video metadata labeling method for movie and television materials

Country Status (1)

Country Link
CN (1) CN115795096A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116017054A (en) * 2023-03-24 2023-04-25 北京天图万境科技有限公司 Method and device for multi-compound interaction processing
CN116017054B (en) * 2023-03-24 2023-06-16 北京天图万境科技有限公司 Method and device for multi-compound interaction processing
CN116843643A (en) * 2023-07-03 2023-10-03 北京语言大学 Video aesthetic quality evaluation data set construction method
CN116843643B (en) * 2023-07-03 2024-01-16 北京语言大学 Video aesthetic quality evaluation data set construction method
CN116628257A (en) * 2023-07-25 2023-08-22 北京欣博电子科技有限公司 Video retrieval method, device, computer equipment and storage medium
CN116628257B (en) * 2023-07-25 2023-12-01 北京欣博电子科技有限公司 Video retrieval method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115795096A (en) Video metadata labeling method for movie and television materials
CN109756751B (en) Multimedia data processing method and device, electronic equipment and storage medium
Assfalg et al. Semantic annotation of sports videos
US8280158B2 (en) Systems and methods for indexing presentation videos
You et al. A multiple visual models based perceptive analysis framework for multilevel video summarization
US8068678B2 (en) Electronic apparatus and image processing method
CN103530652A (en) Face clustering based video categorization method and retrieval method as well as systems thereof
CN109408672B (en) Article generation method, article generation device, server and storage medium
Choudary et al. Summarization of visual content in instructional videos
CN113010703A (en) Information recommendation method and device, electronic equipment and storage medium
CN113255628B (en) Scene identification recognition method for news scene
US20150356353A1 (en) Method for identifying objects in an audiovisual document and corresponding device
CN113435438B (en) Image and subtitle fused video screen plate extraction and video segmentation method
TWI244005B (en) Book producing system and method and computer readable recording medium thereof
CN104025465A (en) Logging events in media files including frame matching
CN114845149A (en) Editing method of video clip, video recommendation method, device, equipment and medium
CN101714158A (en) Image searching method based on MPEG7 standards
JP4270118B2 (en) Semantic label assigning method, apparatus and program for video scene
EP3252770A1 (en) Automated identification and processing of audiovisual data
CN114385859A (en) Multi-modal retrieval method for video content
Yang A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions
Zhu et al. Automatic scene detection for advanced story retrieval
Rozsa et al. TV News Database Indexing System with Video Structure Analysis, Representative Images Extractions and OCR for News Titles
KR101911613B1 (en) Method and apparatus for person indexing based on the overlay text of the news interview video
Helm et al. HistShot: A Shot Type Dataset based on Historical Documentation during WWII.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination