CN117315524A - Video processing method, video processing device, electronic equipment and computer storage medium - Google Patents

Video processing method, video processing device, electronic equipment and computer storage medium

Info

Publication number
CN117315524A
CN117315524A
Authority
CN
China
Prior art keywords
video
audio
features
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311052500.5A
Other languages
Chinese (zh)
Inventor
黄少飞
李瀚�
王钰晴
朱宏吉
刘偲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taobao China Software Co Ltd
Original Assignee
Taobao China Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taobao China Software Co Ltd filed Critical Taobao China Software Co Ltd
Priority to CN202311052500.5A priority Critical patent/CN117315524A/en
Publication of CN117315524A publication Critical patent/CN117315524A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a video processing method, a video processing device, electronic equipment and a computer storage medium. In the video processing method, the multi-size video features of a video frame are first obtained, and at the same time the audio feature corresponding to the target video frame is obtained, so that the video mask feature which is related to the audio and aimed at the target video frame can be obtained according to the multi-size video features and the audio feature. Meanwhile, based on preset audio query information for each video frame, the target features of the sound-producing objects corresponding to the audio query information in a plurality of video frames can be determined. Finally, based on the target features and the video mask feature, a segmentation mask of the target sound-producing object for the target video frame is obtained. The sound-producing objects in the video can thereby be accurately segmented, and the method is also applicable to complex scenes in which a plurality of sound-producing objects exist in the video at the same time.

Description

Video processing method, video processing device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method, a video processing apparatus, an electronic device, and a computer storage medium.
Background
As video becomes increasingly popular in people's daily lives and work, video processing technology is becoming more and more important. In the video processing process, the localization and segmentation of sound-producing objects in a video builds the association between the audio and the video, and has wide application in real scenes. For example, in a live video scene, the recognition and segmentation of the sound-producing object can highlight the talking anchor in the video, providing a better viewing experience for the audience; in a video conference involving multiple participants, the figure of the speaker can be enlarged in the video window through recognition and segmentation of the sound-producing object, attracting the attention of the other listeners; in the field of short video editing, recognition and segmentation of sound-producing objects can quickly realize foreground-background separation and content editing.
However, existing video processing methods can only roughly identify and segment the sound-producing object in a video, so the segmentation result of the sound-producing object is inaccurate. Therefore, how to accurately segment the sound-producing object in a video is a technical problem that currently needs to be solved.
Disclosure of Invention
The application provides a video processing method for accurately segmenting sound-producing objects in a video, as well as a video processing device, an electronic device and a computer storage medium.
The application provides a video processing method, which comprises the following steps:
carrying out framing processing on a video to be processed to obtain a plurality of video frames corresponding to the video to be processed;
extracting video features of any target video frame in a plurality of video frames under a plurality of sizes to obtain multi-size video features of the target video frame;
according to the audio corresponding to the video to be processed, obtaining the audio characteristics corresponding to the target video frame;
obtaining a video mask feature related to the audio for the target video frame based on the multi-size video feature and the audio feature;
determining target characteristics of sound production objects corresponding to the audio inquiry information in the plurality of video frames in the processing video characteristics based on preset audio inquiry information for each video frame; the processing video features are video features after feature fusion and time sequence interaction are carried out on the multi-size video features and the audio features;
obtaining a segmentation mask of a target sounding object for the target video frame according to the target feature and the video mask feature; the segmentation mask is used to represent a target sound object of the target video frame.
Optionally, the obtaining the video mask feature related to the audio for the target video frame according to the multi-size video feature and the audio feature includes:
and performing feature fusion and time sequence interaction on the multi-size video features and the audio features to obtain video mask features which are related to the audio and aim at the target video frames.
Optionally, the performing feature fusion and time sequence interaction on the multi-size video feature and the audio feature to obtain a video mask feature related to the audio for the target video frame includes:
performing feature fusion on the multi-size video features and the audio features by adopting a first attention mechanism to obtain fusion video features fused with the audio features;
processing the multi-size video feature and the fusion video feature by adopting a second attention mechanism to obtain an aggregation video feature of the multi-size video feature under different sizes of the same pixel;
performing time sequence interaction processing on the audio features and the aggregated video features by adopting a third attention mechanism to obtain processed video features after the time sequence interaction processing;
Based on the processed video features, video mask features associated with the audio for the target video frame are obtained.
Optionally, the performing time sequence interaction processing on the audio feature and the aggregated video feature by using a third attention mechanism to obtain a processed video feature after the time sequence interaction processing includes:
for the audio features, determining, in the aggregated video features, the initial features of the audio features in the plurality of video frames;
adopting self-attention to perform feature enhancement on the initial features, and determining features after time sequence interaction between different target video frames;
and mapping the features after the time sequence interaction to obtain the processed video features after the time sequence interaction.
Optionally, the determining, based on preset audio query information for each video frame, target features of sound objects corresponding to the audio query information in the plurality of video frames in processing video features includes:
determining a sounding object corresponding to audio query information based on preset audio query information for each video frame;
obtaining video features of the processed video features under a specified size according to the processed video features;
And aiming at each piece of audio query information, taking the video characteristics under the specified size and the audio query information as input information of an audio query encoder, and obtaining target characteristics of the sound generating object in the plurality of video frames.
Optionally, the audio query encoder includes: a multi-head cross attention module, a multi-head self attention module and a forward network module;
the obtaining, for each piece of audio query information, the target feature of the sound object in the plurality of video frames by using the video feature of the specified size and the audio query information as input information of an audio query encoder, includes:
aiming at each piece of audio query information, taking the video characteristics under the specified size and the audio query information as input information of the multi-head cross attention module to obtain output result information of the multi-head cross attention module;
the output result information of the multi-head cross attention module is used as the input information of the multi-head self attention module, and the output result information of the multi-head self attention module is obtained;
the output result information of the multi-head self-attention module is used as the input information of the forward network module, and the output result information of the forward network module is obtained;
And obtaining target characteristics of the sounding object in the video frames according to the output result information of the forward network module.
Optionally, the audio query encoder includes a multi-layer attention mechanism, and each layer of attention mechanism is provided with a multi-head cross attention module, a multi-head self attention module and a forward network module;
the audio query information input into the first-layer attention mechanism is the audio feature of the audio corresponding to each video frame; the audio query information input into the second-layer attention mechanism, or into any attention mechanism above the second layer, is the output result information of the previous-layer attention mechanism; and the output of the last-layer attention mechanism is the target feature of the sound object in the plurality of video frames.
Optionally, the obtaining a segmentation mask of the target sound object for the target video frame according to the target feature and the video mask feature includes:
and performing matrix multiplication operation and preset function operation on the target features and the video mask features to obtain a segmentation mask of the target sounding object aiming at the target video frame.
The application provides a video processing apparatus, comprising:
The frame dividing processing unit is used for carrying out frame dividing processing on the video to be processed to obtain a plurality of video frames corresponding to the video to be processed;
the video feature extraction unit is used for extracting video features of any target video frame in a plurality of sizes of the target video frame aiming at any target video frame in a plurality of video frames to obtain multi-size video features of the target video frame;
an audio feature obtaining unit, configured to obtain an audio feature corresponding to the target video frame according to audio corresponding to the video to be processed;
a video mask feature obtaining unit configured to obtain a video mask feature related to the audio for the target video frame based on the multi-size video feature and the audio feature;
a target feature determining unit, configured to determine, in processing video features, target features of sound emission objects corresponding to audio query information in the plurality of video frames based on preset audio query information for each video frame; the processing video features are video features after feature fusion and time sequence interaction are carried out on the multi-size video features and the audio features;
a segmentation mask obtaining unit, configured to obtain a segmentation mask of a target sound object for the target video frame according to the target feature and the video mask feature; the segmentation mask is used to represent a target sound object of the target video frame.
The application provides an electronic device, comprising:
a processor;
and a memory for storing a computer program that is executed by the processor to perform the video processing method described above.
The present application provides a computer storage medium storing a computer program to be executed by a processor to perform the above-described video processing method.
Compared with the prior art, the embodiment of the application has the following advantages:
the application provides a video processing method, which comprises the following steps: carrying out framing processing on the video to be processed to obtain a plurality of video frames corresponding to the video to be processed; for any target video frame among the plurality of video frames, extracting video features of the target video frame at a plurality of sizes to obtain multi-size video features of the target video frame; obtaining the audio feature corresponding to the target video frame according to the audio corresponding to the video to be processed; obtaining a video mask feature related to the audio for the target video frame according to the multi-size video features and the audio feature; determining, in the processed video features, target features of the sound-producing objects corresponding to the audio query information in the plurality of video frames based on preset audio query information for each video frame, where the processed video features are video features obtained after feature fusion and temporal interaction of the multi-size video features and the audio features; and obtaining a segmentation mask of the target sound-producing object for the target video frame according to the target features and the video mask feature, where the segmentation mask is used to represent the target sound-producing object of the target video frame. In the video processing method, the multi-size video features of the video frames are first obtained, and at the same time the audio feature corresponding to the target video frame is obtained, so that the video mask feature which is related to the audio and aimed at the target video frame can be obtained according to the multi-size video features and the audio feature. Meanwhile, based on preset audio query information for each video frame, the target features of the sound-producing objects corresponding to the audio query information in the plurality of video frames can be determined. Finally, based on the target features and the video mask feature, a segmentation mask of the target sound-producing object for the target video frame is obtained. The sound-producing objects in the video can thereby be accurately segmented, and the method is also applicable to complex scenes in which a plurality of sound-producing objects exist in the video at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained from these drawings by a person of ordinary skill in the art.
Fig. 1 is a flowchart of a video processing method according to a first embodiment of the present application;
fig. 2 is a detailed process schematic diagram of a video processing method according to a first embodiment of the present application;
fig. 3 is a schematic diagram of a video processing apparatus according to a second embodiment of the present application;
fig. 4 is a schematic diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is, however, susceptible of embodiment in many other ways than those herein described and similar generalizations can be made by those skilled in the art without departing from the spirit of the application and, therefore, the application is not limited to the specific embodiments disclosed below.
The application provides a video processing method, a video processing device, an electronic apparatus and a computer storage medium. The following describes a video processing method, a video processing apparatus, an electronic device, and a computer storage medium, respectively, by specific embodiments. In order to more clearly show the video processing method provided by the embodiment of the present application, an application scenario of the video processing method provided by the embodiment of the present application is first introduced.
The video processing method can be applied to scenes in which the sound-producing object in a video is automatically segmented. For example, suppose a video lasts two minutes, user A is speaking during the first minute, and user B is speaking during the second minute. Using the video processing method, the shape contour corresponding to user A and the shape contour corresponding to user B can be segmented in the video: in the first minute, the shape contour corresponding to user A is segmented, mainly because user A is speaking during the first minute; in the second minute, the shape contour corresponding to user B is segmented, mainly because user B is speaking during the second minute. Of course, it will be appreciated that if user A and user B speak simultaneously in the video, the shape contours corresponding to user A and user B may be segmented in the video simultaneously.
The foregoing describes an application scenario of the video processing method of the present application. This application scenario is merely one embodiment, provided for convenience in understanding the video processing method, and is not intended to limit the video processing method of the present application. Other application scenarios of the video processing method in the embodiments of the present application are not described in detail here.
First embodiment
A first embodiment of the present application provides a video processing method, and in particular, please refer to fig. 1, which is a flowchart of the video processing method provided in the first embodiment of the present application.
The video processing method of the embodiment of the application comprises the following steps:
step S101: and carrying out framing treatment on the video to be treated to obtain a plurality of video frames corresponding to the video to be treated.
In the video processing method, in order to facilitate the segmentation of the sound object in the video, the frame processing can be performed on the video to be processed, and then the segmentation of the sound object of the video to be processed is converted into the segmentation of the sound object in the video frame.
The sound generating object is an object which generates sound in the video to be processed, and the number of the sound generating objects in the video to be processed can be one or a plurality of sound generating objects; of course, after the video to be processed is subjected to framing processing, the number of the sound generating objects in each video frame can be one or more.
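For ease of understanding only, the following is a minimal sketch of the framing processing of step S101 using OpenCV; the function name, the fixed sampling rate and the use of OpenCV are illustrative assumptions and are not prescribed by the method of this embodiment.

```python
import cv2

def split_into_frames(video_path: str, sample_fps: float = 1.0):
    """Split a video into frames at a fixed sampling rate (illustrative choice)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or sample_fps
    step = max(int(round(native_fps / sample_fps)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # one video frame as an (H, W, 3) BGR array
        index += 1
    cap.release()
    return frames
```

Each returned frame can then serve as a target video frame in the subsequent steps.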
Step S102: and extracting video features of the target video frames under a plurality of sizes aiming at any one target video frame in the plurality of video frames to obtain multi-size video features of the target video frames.
After framing the video to be processed, each video frame after framing may be taken as a target video frame. For each target video frame, the target video frame can be input into a video encoder (such as ResNet and ViT network), the multi-size video features of the target video frame are extracted, and the video encoder can downsample the features of the target video frame to obtain video features with different sizes, namely: multi-size video features. For example, the video features of the target video frame (original image) may be extracted, the video features of half the image of the target video frame may be extracted, the video features of one-fourth the image of the target video frame may be extracted, and the video features of one-eighth the image of the target video frame may be extracted. The video feature of the present embodiment may refer to a visual feature.
In order to facilitate understanding of the multi-size video features, please refer to fig. 2, which is a detailed process diagram of the video processing method according to the first embodiment of the present application. In fig. 2, after the video to be processed is subjected to framing processing, there are three video frames: the multi-size video feature of the first video frame is V_1; the multi-size video feature of the second video frame is V_2; and the multi-size video feature of the third video frame is V_3. In fig. 2, the multi-size video features are video features under four size conditions.
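As an illustrative sketch of the multi-size video feature extraction of step S102, the following uses a torchvision ResNet-50 backbone to obtain feature maps at four sizes (strides 4, 8, 16 and 32); the choice of backbone, the layer names and the input resolution are assumptions made for illustration and are not prescribed by this embodiment.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Expose four intermediate feature maps of a ResNet-50 as multi-size video features.
backbone = create_feature_extractor(
    resnet50(weights=None),  # older torchvision versions use pretrained=False instead
    return_nodes={"layer1": "s4", "layer2": "s8", "layer3": "s16", "layer4": "s32"},
)

frame = torch.randn(1, 3, 224, 224)      # one target video frame (batch of 1)
multi_size_features = backbone(frame)    # dict: four feature maps at decreasing resolution
for name, feat in multi_size_features.items():
    print(name, tuple(feat.shape))
```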
Step S103: and obtaining the audio characteristics corresponding to the target video frame according to the audio corresponding to the video to be processed.
In the video processing method of the present application, it is also necessary to extract the audio features of the video to be processed. Specifically, the audio of the video to be processed may be extracted; the audio is then input to an audio encoder, which extracts the audio feature of each target video frame. Referring to fig. 2, the audio features obtained by the audio encoder (e.g., VGGish) are A: the audio feature of the first video frame is A_1; the audio feature of the second video frame is A_2; and the audio feature of the third video frame is A_3.
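The following is a hedged sketch of how the per-frame audio features of step S103 might be obtained: the audio track is split into segments aligned with the video frames and each segment is passed through an audio encoder. The equal-length segmentation and the `audio_encoder` callable (e.g., a VGGish-style network) are assumptions for illustration.

```python
import torch
import torchaudio

def per_frame_audio_features(audio_path, num_frames, audio_encoder):
    """Encode one audio segment per video frame; audio_encoder is an assumed callable."""
    waveform, sample_rate = torchaudio.load(audio_path)   # (channels, samples)
    waveform = waveform.mean(dim=0)                       # mix down to mono
    segment_len = waveform.shape[0] // num_frames
    feats = []
    for t in range(num_frames):
        segment = waveform[t * segment_len:(t + 1) * segment_len]
        feats.append(audio_encoder(segment, sample_rate))  # audio feature A_t of frame t
    return torch.stack(feats)                              # (T, d_audio)
```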
Step S104: an audio-related video mask feature for the target video frame is obtained from the multi-size video feature and the audio feature.
After the multi-size video features and the audio features are obtained, the multi-size video features and the audio features of the plurality of video frames are subjected to cross-modal feature fusion and temporal feature interaction through a pixel encoder, obtaining, for the target video frame, video mask features that are related to the audio and temporally enhanced through ABTI. Because the audio features and the video features belong to different modalities, features of different modalities need to be fused in a cross-modal feature fusion manner. The temporal feature corresponds to the multi-size video features of the plurality of video frames arranged in time order; it is a sequence feature, and the data of the plurality of frames ordered in time constitutes the time sequence.
In this embodiment, as a way to obtain the video mask feature related to audio for the target video frame from the multi-size video feature and the audio feature, it may be referred to as: and performing feature fusion and time sequence interaction on the multi-size video features and the audio features to obtain video mask features which are related to the audio and aim at the target video frames.
Specifically, performing feature fusion and time sequence interaction on the multi-size video features and the audio features to obtain video mask features related to the audio for the target video frame, which may be referred to as:
Firstly, carrying out feature fusion on multi-size video features and audio features by adopting a first attention mechanism to obtain fusion video features fused with the audio features; then, processing the multi-size video features and the fusion video features by adopting a second attention mechanism to obtain aggregated video features of the multi-size video features under different sizes of the same pixel; then, performing time sequence interaction processing on the audio features and the aggregated video features by adopting a third attention mechanism to obtain processed video features after the time sequence interaction processing; finally, audio-related video mask features for the target video frame are obtained based on the processed video features.
The performing time sequence interaction processing on the audio feature and the aggregate video feature by using the third attention mechanism to obtain the processed video feature after the time sequence interaction processing may be: first, for audio features, determining initial features in a plurality of video frames with the audio features in an aggregate video feature; then, adopting self-attention to perform feature enhancement on the initial features, and determining the features after time sequence interaction between different target video frames; and then, mapping the features after the time sequence interaction to obtain the processed video features after the time sequence interaction.
Specifically, referring to FIG. 2, the pixel encoder is composed of three parts, namely cross-modal attention (an example of the first attention mechanism), multi-scale deformable attention (an example of the second attention mechanism in FIG. 2), and an ABTI module (Audio-Bridged Temporal Interaction, an audio-bridged temporal interaction module, an example of the third attention mechanism). For the plurality of video frames, the cross-modal attention performs feature fusion on the multi-size video feature V_t of the t-th frame and the audio feature A_t of the t-th frame, obtaining the fused video feature M_t into which the audio feature has been fused. In the corresponding formula, f_q, f_k, f_v and f_w denote fully connected transform layers, ⊗ denotes matrix multiplication, T denotes matrix transposition, softmax is an activation function that normalizes a vector of values into a probability distribution vector whose probabilities sum to 1, and the softmax output serves as an intermediate calculation result.
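The fusion formula itself is not reproduced in this text. As one plausible reading consistent with the symbols listed above (f_q, f_k, f_v, f_w, matrix multiplication, transposition and softmax), the following PyTorch sketch fuses V_t and A_t into M_t; the residual connection, the absence of attention scaling and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of the cross-modal attention fusing video and audio features (assumed form)."""

    def __init__(self, video_dim, audio_dim, hidden_dim):
        super().__init__()
        self.f_q = nn.Linear(video_dim, hidden_dim)   # query from the video pixels
        self.f_k = nn.Linear(audio_dim, hidden_dim)   # key from the audio feature
        self.f_v = nn.Linear(audio_dim, hidden_dim)   # value from the audio feature
        self.f_w = nn.Linear(hidden_dim, video_dim)   # map the attended audio back to video space

    def forward(self, v_t, a_t):
        # v_t: (N_pixels, video_dim) flattened multi-size features of frame t
        # a_t: (M, audio_dim) audio token(s) of frame t (M may be 1)
        s_t = torch.softmax(self.f_q(v_t) @ self.f_k(a_t).T, dim=-1)  # intermediate attention map
        m_t = v_t + self.f_w(s_t @ self.f_v(a_t))                     # fused video feature M_t
        return m_t
```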
The multi-scale deformable attention takes the multi-size video feature V_t and the fused video feature M_t as input, and aggregates, for the same video frame, the video features at different sizes for each pixel location. Multi-scale deformable attention is an existing operation aimed at feature fusion between feature maps of different sizes.
The ABTI module is a time sequence interaction module for audio bridging, and in the process of using the ABTI module, the following operations are executed:
First, with the audio as the query in the cross-modal attention, a cross-modal attention calculation is performed on the audio features A_p of the multiple video frames and the aggregated video features F_q of the multiple video frames, obtaining the initial feature O_pq corresponding to the audio feature of each of the multiple video frames. In this calculation, f_a, f_m and f_n denote fully connected transform layers, ⊗ denotes matrix multiplication, T denotes matrix transposition, softmax is an activation function that normalizes a numerical vector into a probability distribution vector whose probabilities sum to 1, and S_pq is an intermediate calculation result. O_pq is the shallow visual feature, within the video feature of the q-th frame, that corresponds to the audio feature of the p-th frame, and is only an intermediate calculation result; the features subsequently aggregated by the audio query encoder are the deep visual features, within the visual features of all video frames, that correspond to each audio query, and are directly correlated with the final prediction result.
In the process involved in this step S104, there are a plurality of audio features, the number of which is the same as the number of video frames, and the audio features may be determined based on the audio information corresponding to each video frame, for example: if the sound object in the first video frame is user a, the sound object corresponding to the audio feature of the first video frame is user a, and actually, the video features of user a of a plurality of video frames are obtained.
Then, the features corresponding to the initial features of each audio feature in the plurality of video frames (the features corresponding to the initial features can be a group of vectors output by a convolutional neural network) are enhanced by means of self-attention, so that the temporal interaction, between different video frames, of the sound-producing object corresponding to the audio feature is realized, yielding the interacted features. The interacted features are then mapped back onto the original video features to obtain the temporally enhanced video features (i.e., the processed video features after the time sequence interaction processing). In this mapping, f_o denotes a fully connected transform layer, ⊗ denotes matrix multiplication, Σ denotes accumulation, and the T above the accumulation symbol denotes the total number of video frames.
Finally, the first three scales of the temporally enhanced video features (as in fig. 2) are used as the input of the subsequent audio query encoder, and the last scale is used as the video mask feature, which is later used to obtain the segmentation mask. The video mask feature is the largest-sized video feature among the multi-size video features.
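The ABTI formulas are likewise not reproduced in this text. The following PyTorch sketch shows one assumed structure consistent with the description above: the per-frame audio features bridge to the pixel-level aggregated video features, self-attention across frames performs the temporal interaction, and the interacted features are mapped back onto the video features. The layer names, the shared feature dimension, the shapes and the residual connection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ABTI(nn.Module):
    """Sketch of an audio-bridged temporal interaction block (structure assumed)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.f_a = nn.Linear(dim, dim)   # transform of the per-frame audio features (queries)
        self.f_m = nn.Linear(dim, dim)   # transform of the aggregated video features (keys)
        self.f_n = nn.Linear(dim, dim)   # transform of the aggregated video features (values)
        self.f_o = nn.Linear(dim, dim)   # maps interacted features back onto the video features
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, video):
        # audio: (T, dim), one audio feature per frame; video: (T, N, dim) aggregated video features
        T, N, d = video.shape
        flat = video.reshape(T * N, d)
        # 1) with the audio as query, bridge to the pixel-level video features (S_pq, O_pq)
        s = torch.softmax(self.f_a(audio) @ self.f_m(flat).T, dim=-1)      # (T, T*N)
        o = s @ self.f_n(flat)                                             # (T, dim) initial features
        # 2) self-attention across frames realises the temporal interaction
        o_hat, _ = self.temporal_attn(o[None], o[None], o[None])           # (1, T, dim)
        # 3) map the interacted features back onto the original video features
        w = s.reshape(T, T, N)                  # attention of frame p's audio over frame q's pixels
        enhanced = video + self.f_o(torch.einsum("pqn,pd->qnd", w, o_hat[0]))
        return enhanced                         # temporally enhanced video features
```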
Step S105: based on preset audio query information for each video frame, target characteristics of sound production objects corresponding to the audio query information in a plurality of video frames are determined in the processing video characteristics.
In this embodiment, the video feature is processed by performing feature fusion and time sequence interaction on the multi-size video feature and the audio feature.
In this embodiment, as one implementation of determining, in processing video features, target features of sound-producing objects corresponding to audio query information in a plurality of video frames based on preset audio query information for each video frame: firstly, determining a sounding object corresponding to audio query information based on preset audio query information for each video frame; then, according to the processed video features, obtaining the video features of the processed video features under the specified size; and then, aiming at each piece of audio query information, taking the video characteristics and the audio query information with the specified size as input information of an audio query encoder to obtain target characteristics of the sounding object in a plurality of video frames. In the process of step S105, the preset audio query information for each video frame is actually updated continuously, that is, the audio query information is updated continuously on the basis of the initial audio query information, and the audio query information is updated continuously by the audio query encoder.
The audio query encoder includes: a multi-head cross attention module, a multi-head self attention module and a forward network module.
Taking, for each piece of audio query information, the video features at the specified size and the audio query information as the input information of the audio query encoder to obtain the target features of the sound-producing object in the plurality of video frames may mean: firstly, for each piece of audio query information, taking the video features at the specified size and the audio query information as input information of the multi-head cross attention module to obtain output result information of the multi-head cross attention module; then, using the output result information of the multi-head cross attention module as the input information of the multi-head self attention module to obtain the output result information of the multi-head self attention module; then, using the output result information of the multi-head self attention module as the input information of the forward network module to obtain the output result information of the forward network module; and finally, obtaining the target features of the sound-producing object in the plurality of video frames according to the output result information of the forward network module.
As can be seen from fig. 2: the audio query encoder includes multiple layers of attention mechanisms, each layer of attention mechanism having a multi-headed cross attention module, a multi-headed self attention module, and a forward network module.
The audio query information (initial audio query information) input into the first-layer attention mechanism is the audio feature of the audio corresponding to each video frame; the audio query information input into the second-layer attention mechanism, or into any attention mechanism above the second layer, is the output result information of the previous-layer attention mechanism; and the output of the last-layer attention mechanism is the target features of the sound-producing object in the plurality of video frames.
Specifically, referring to fig. 2, in the audio query encoder, the same number of audio queries as the number of video frames is first preset, where each audio query represents the sound-producing object(s) (there may be one or more) in the corresponding video frame. In the audio query encoder, the target features of the sound-producing objects in all video frames are gradually aggregated through a multi-layer attention mechanism, where a target feature may in fact be aggregated from the video features of all pixels, across the video frames, belonging to the same sound-producing object. The ×N in fig. 2 indicates that the three modules MHCA, MHSA and FFN are repeated N times.
Taking the l-th layer of the multi-layer attention mechanism as an example, the audio query feature A_{l-1} output by layer l-1 and the video features output by the pixel encoder are taken as input; after passing through the multi-head cross attention module (MHCA, Multi-Head Cross Attention), the multi-head self attention module (MHSA, Multi-Head Self-Attention), the forward network module (FFN, Feed Forward Network) and Layer Normalization (LN), the audio query feature A_l output by this layer is obtained.
For example, to obtain the audio query feature A_2 of the second layer, the audio query feature A_1 of the first layer is input into the MHCA of the second layer; the output result X_2 of the second-layer MHCA is then input into the MHSA of the second layer, and the output result of the second-layer MHSA is input into the FFN of the second layer, obtaining the audio query feature A_2 of the second layer. This A_2 is in fact an example of the updated audio query information.
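For illustration, one layer of the audio query encoder (MHCA, then MHSA, then FFN, each followed by layer normalization) could be sketched in PyTorch as follows; the post-norm ordering, the residual connections and the hidden dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AudioQueryEncoderLayer(nn.Module):
    """One layer of the audio query encoder (MHCA -> MHSA -> FFN), assumed details."""

    def __init__(self, dim, num_heads=8, ffn_dim=2048):
        super().__init__()
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, video_feats):
        # queries: (B, T, dim) audio queries A_{l-1}; video_feats: (B, N, dim) pixel-encoder output
        x, _ = self.mhca(queries, video_feats, video_feats)   # cross attention to the video features
        x = self.norm1(queries + x)
        y, _ = self.mhsa(x, x, x)                             # self attention among the audio queries
        y = self.norm2(x + y)
        out = self.norm3(y + self.ffn(y))                     # feed-forward network
        return out                                            # audio query feature A_l
```

Stacking N such layers and feeding the initial audio queries (the per-frame audio features) would yield the target features output by the last layer.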
Step S106: obtaining a segmentation mask of a target sounding object for a target video frame according to the target feature and the video mask feature; the segmentation mask is used to represent a target sound object of the target video frame.
After obtaining the target features and video mask features of the plurality of video frames, obtaining a segmentation mask for the target sound object of the target video frame based on the target features and video mask features may refer to:
and performing matrix multiplication operation and preset function operation on the target features and the video mask features to obtain a segmentation mask of the target sound object aiming at the target video frame.
After repeating the above process L times, the target feature A_L corresponding to each audio query is obtained. The target feature of the t-th frame is matrix-multiplied with the video mask feature of the t-th frame, and the segmentation mask of the target sound-producing object corresponding to each target feature is obtained through sigmoid function activation. Here, σ is the sigmoid function, ⊗ denotes matrix multiplication, T denotes matrix transposition, L denotes the total number of layers, l denotes the l-th layer, and the value of l ranges from 1 to L.
The segmentation mask may refer to setting the pixel value to 1 in the foreground (i.e., the region where the sound object is located) of the image and setting the pixel value to 0 in the background region.
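To illustrate the mask computation described above (matrix multiplication of the target features with the video mask feature, a sigmoid activation, and binarization into foreground and background), the following sketch assumes the target features of frame t as a (Q, d) matrix and the video mask feature as a (d, H, W) tensor; the 0.5 threshold is an assumption.

```python
import torch

def segmentation_masks(target_features, mask_features, threshold=0.5):
    """Combine query target features with the video mask feature into per-query masks.

    target_features: (Q, d) target feature A_L for each audio query of frame t
    mask_features:   (d, H, W) video mask feature of frame t
    """
    d, H, W = mask_features.shape
    logits = target_features @ mask_features.reshape(d, H * W)   # matrix multiplication
    probs = torch.sigmoid(logits).reshape(-1, H, W)              # preset function: sigmoid
    return (probs > threshold).float()                           # 1 = sound-producing object, 0 = background
```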
After the segmentation mask of the target sound object corresponding to each target feature is obtained, the shape outline of the sound object in each video frame in fig. 2 can be identified, and it can be seen from fig. 2 that the sound object of the first frame of video frame is a left user, the sound object of the second frame of video frame is a right user, and the sound object of the third frame of video frame is a guitar.
According to the method, the sound-producing objects in each video frame are represented by the audio query information, and the video features corresponding to the sound-producing objects are extracted by using the audio query information, thereby constructing the association between the audio features, the video features and the sound-producing objects. Compared with existing methods that only make the audio features and the video features interact at the pixel level, this is more conducive to quickly and accurately identifying the sound-producing objects in the video as a whole. Meanwhile, the temporal interaction is completed by the audio-bridged temporal interaction module, and the audio bridging can filter out the video features in the video frames that are irrelevant to the audio, making the subsequent data processing more efficient.
The application provides a video processing method, which comprises the following steps: carrying out framing processing on the video to be processed to obtain a plurality of video frames corresponding to the video to be processed; for any target video frame among the plurality of video frames, extracting video features of the target video frame at a plurality of sizes to obtain multi-size video features of the target video frame; obtaining the audio feature corresponding to the target video frame according to the audio corresponding to the video to be processed; obtaining a video mask feature related to the audio for the target video frame according to the multi-size video features and the audio feature; determining, in the processed video features, target features of the sound-producing objects corresponding to the audio query information in the plurality of video frames based on preset audio query information for each video frame, where the processed video features are video features obtained after feature fusion and temporal interaction of the multi-size video features and the audio features; and obtaining a segmentation mask of the target sound-producing object for the target video frame according to the target features and the video mask feature, where the segmentation mask is used to represent the target sound-producing object of the target video frame. In the video processing method, the multi-size video features of the video frames are first obtained, and at the same time the audio feature corresponding to the target video frame is obtained, so that the video mask feature which is related to the audio and aimed at the target video frame can be obtained according to the multi-size video features and the audio feature. Meanwhile, based on preset audio query information for each video frame, the target features of the sound-producing objects corresponding to the audio query information in the plurality of video frames can be determined. Finally, based on the target features and the video mask feature, a segmentation mask of the target sound-producing object for the target video frame is obtained. The sound-producing objects in the video can thereby be accurately segmented, and the method is also applicable to complex scenes in which a plurality of sound-producing objects exist in the video at the same time.
Second embodiment
The second embodiment of the present application also provides a video processing apparatus corresponding to the video processing method provided in the first embodiment of the present application. Since the device embodiment is substantially similar to the first embodiment, the description is relatively simple, and reference is made to the partial description of the first embodiment for relevant points. The device embodiments described below are merely illustrative.
Fig. 3 is a schematic diagram of a video processing apparatus according to a second embodiment of the present application.
The video processing apparatus 300, said apparatus comprising:
the framing processing unit 301 is configured to perform framing processing on a video to be processed, so as to obtain a plurality of video frames corresponding to the video to be processed;
a video feature extraction unit 302, configured to extract, for any one target video frame of a plurality of video frames, video features of the target video frame under a plurality of sizes, and obtain a multi-size video feature of the target video frame;
an audio feature obtaining unit 303, configured to obtain an audio feature corresponding to the target video frame according to the audio corresponding to the video to be processed;
a video mask feature obtaining unit 304, configured to obtain a video mask feature related to the audio for the target video frame according to the multi-size video feature and the audio feature;
A target feature determining unit 305 for determining, in processing video features, target features of sound-producing objects corresponding to audio query information in the plurality of video frames based on preset audio query information for each video frame; the processing video features are video features after feature fusion and time sequence interaction are carried out on the multi-size video features and the audio features;
a segmentation mask obtaining unit 306, configured to obtain a segmentation mask of a target sound object for the target video frame according to the target feature and the video mask feature; the segmentation mask is used to represent a target sound object of the target video frame.
Optionally, the video mask feature obtaining unit is specifically configured to:
and performing feature fusion and time sequence interaction on the multi-size video features and the audio features to obtain video mask features which are related to the audio and aim at the target video frames.
Optionally, the video mask feature obtaining unit is specifically configured to:
performing feature fusion on the multi-size video features and the audio features by adopting a first attention mechanism to obtain fusion video features fused with the audio features;
Processing the multi-size video feature and the fusion video feature by adopting a second attention mechanism to obtain an aggregation video feature of the multi-size video feature under different sizes of the same pixel;
performing time sequence interaction processing on the audio features and the aggregated video features by adopting a third attention mechanism to obtain processed video features after the time sequence interaction processing;
based on the processed video features, video mask features associated with the audio for the target video frame are obtained.
Optionally, the video mask feature obtaining unit is specifically configured to:
for the audio features, determining, in the aggregated video features, the initial features of the audio features in the plurality of video frames;
adopting self-attention to perform feature enhancement on the initial features, and determining features after time sequence interaction between different target video frames;
and mapping the features after the time sequence interaction to obtain the processed video features after the time sequence interaction.
Optionally, the target feature determining unit is specifically configured to:
determining a sounding object corresponding to audio query information based on preset audio query information for each video frame;
Obtaining video features of the processed video features under a specified size according to the processed video features;
and aiming at each piece of audio query information, taking the video characteristics under the specified size and the audio query information as input information of an audio query encoder, and obtaining target characteristics of the sound generating object in the plurality of video frames.
Optionally, the audio query encoder includes: a multi-head cross attention module, a multi-head self attention module and a forward network module;
the target feature determining unit is specifically configured to:
aiming at each piece of audio query information, taking the video characteristics under the specified size and the audio query information as input information of the multi-head cross attention module to obtain output result information of the multi-head cross attention module;
the output result information of the multi-head cross attention module is used as the input information of the multi-head self attention module, and the output result information of the multi-head self attention module is obtained;
the output result information of the multi-head self-attention module is used as the input information of the forward network module, and the output result information of the forward network module is obtained;
And obtaining target characteristics of the sounding object in the video frames according to the output result information of the forward network module.
Optionally, the audio query encoder includes a multi-layer attention mechanism, and each layer of attention mechanism is provided with a multi-head cross attention module, a multi-head self attention module and a forward network module;
the audio query information input into the first-layer attention mechanism is the audio feature of the audio corresponding to each video frame; the audio query information input into the second-layer attention mechanism, or into any attention mechanism above the second layer, is the output result information of the previous-layer attention mechanism; and the output of the last-layer attention mechanism is the target feature of the sound object in the plurality of video frames.
Optionally, the segmentation mask obtaining unit is specifically configured to:
and performing matrix multiplication operation and preset function operation on the target features and the video mask features to obtain a segmentation mask of the target sounding object aiming at the target video frame.
Third embodiment
The third embodiment of the present application also provides an electronic device corresponding to the method of the first embodiment of the present application.
As shown in fig. 4, fig. 4 is a schematic diagram of an electronic device according to a third embodiment of the present application.
In this embodiment, an optional hardware structure of the electronic device 400 may be as shown in fig. 4, including: at least one processor 401, at least one memory 402, and at least one communication bus 405; the memory 402 contains a program 403 and data 404.
Bus 405 may be a communication device that transfers data between components within the electronic device 400, such as an internal bus (e.g., a bus between the CPU (central processing unit) and the memory), an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port), and so forth.
In addition, the electronic device further includes: at least one network interface 406 and at least one peripheral interface 407. The network interface 406 is used to provide wired or wireless communication with an external network 408 (e.g., the Internet, an intranet, a local area network, a mobile communication network, etc.). In some embodiments, the network interface 406 may include any number and combination of network interface controllers (NICs), radio frequency (RF) modules, transponders, transceivers, modems, routers, gateways, wired network adapters, wireless network adapters, Bluetooth adapters, infrared adapters, near field communication (NFC) adapters, cellular network chips, and the like.
The peripheral interface 407 is used to connect with peripherals, such as peripheral 1 (409 in fig. 4), peripheral 2 (410 in fig. 4), and peripheral 3 (411 in fig. 4). Peripherals, i.e., peripheral devices, may include, but are not limited to, cursor control devices (e.g., mice, touchpads, or touchscreens), keyboards, displays (e.g., cathode ray tube displays, liquid crystal displays, or light-emitting diode displays), video input devices (e.g., a video camera or an input interface communicatively coupled to a video archive), etc.
The processor 401 may be a CPU or a specific integrated circuit ASIC (Application Specific Integrated Circuit) or one or more integrated circuits configured to implement embodiments of the present application.
The memory 402 may comprise a high-speed RAM (Random Access Memory), and may further comprise a non-volatile memory, such as at least one disk memory.
The processor 401 invokes programs and data stored in the memory 402 to execute the method of the first embodiment of the present application.
Fourth embodiment
The fourth embodiment of the present application also provides a computer storage medium storing a computer program that is executed by a processor to perform the method of the first embodiment of the present application, corresponding to the method of the first embodiment of the present application.
While the preferred embodiment has been described, it is not intended to limit the invention thereto, and any person skilled in the art may make variations and modifications without departing from the spirit and scope of the present invention, so that the scope of the present invention shall be defined by the claims of the present application.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse.

Claims (11)

1. A video processing method, comprising:
performing framing processing on a video to be processed to obtain a plurality of video frames corresponding to the video to be processed;
extracting, for any target video frame in the plurality of video frames, video features of the target video frame at a plurality of sizes to obtain multi-size video features of the target video frame;
obtaining, according to the audio corresponding to the video to be processed, the audio features corresponding to the target video frame;
obtaining a video mask feature related to the audio for the target video frame based on the multi-size video feature and the audio feature;
determining, in processed video features, target features of sounding objects corresponding to the audio query information in the plurality of video frames based on preset audio query information for each video frame; wherein the processed video features are video features obtained after feature fusion and time sequence interaction are performed on the multi-size video features and the audio features;
obtaining a segmentation mask of a target sounding object for the target video frame according to the target feature and the video mask feature; the segmentation mask is used to represent the target sounding object of the target video frame.
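By way of illustration and not limitation, the following sketch shows one way the steps of claim 1 could be wired together end to end; the backbone, dimensions, fusion step, and every module name (AudioVisualSegmenter, visual_proj, mask_head, etc.) are assumptions for exposition, not the patented implementation.

```python
# Minimal sketch of the pipeline of claim 1 (illustrative assumptions throughout).
import torch
import torch.nn as nn

class AudioVisualSegmenter(nn.Module):
    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        # Per-scale projections standing in for a multi-size visual backbone.
        self.visual_proj = nn.ModuleList([nn.Conv2d(3, dim, 3, stride=s, padding=1)
                                          for s in (4, 8, 16)])
        self.audio_proj = nn.Linear(128, dim)               # assumes 128-d audio embeddings
        self.query_embed = nn.Embedding(num_queries, dim)   # preset audio query information
        self.mask_head = nn.Conv2d(dim, dim, 1)

    def forward(self, frames, audio):
        # frames: (T, 3, H, W) video frames; audio: (T, 128) per-frame audio features
        multi_scale = [proj(frames) for proj in self.visual_proj]  # multi-size video features
        audio_feat = self.audio_proj(audio)                        # (T, dim)
        # Placeholder fusion: broadcast the audio feature over the finest scale.
        fused = multi_scale[0] + audio_feat[:, :, None, None]
        mask_feat = self.mask_head(fused)                          # audio-related video mask features
        # Target features of the sounding objects, one per query, pooled over frames.
        target = self.query_embed.weight + audio_feat.mean(0)      # (num_queries, dim)
        # Segmentation mask: matrix product of target features and mask features, then sigmoid.
        masks = torch.einsum('qc,tchw->qthw', target, mask_feat).sigmoid()
        return masks

# Usage on dummy data:
model = AudioVisualSegmenter()
masks = model(torch.randn(4, 3, 64, 64), torch.randn(4, 128))
print(masks.shape)   # torch.Size([8, 4, 16, 16])
```

The essential flow mirrors the claim: per-frame multi-size visual features, per-frame audio features, audio-related mask features, per-query target features, and a mask obtained from their matrix product.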
2. The method of claim 1, wherein the obtaining the audio-related video mask feature for the target video frame from the multi-size video feature and the audio feature comprises:
performing feature fusion and time sequence interaction on the multi-size video features and the audio features to obtain the video mask features that are related to the audio and are for the target video frame.
3. The method of claim 2, wherein the performing feature fusion and time sequence interaction on the multi-size video features and the audio features to obtain the video mask features related to the audio for the target video frame comprises:
performing feature fusion on the multi-size video features and the audio features by adopting a first attention mechanism to obtain fusion video features fused with the audio features;
processing the multi-size video features and the fusion video features by adopting a second attention mechanism to obtain aggregated video features of the multi-size video features at different sizes for the same pixel;
performing time sequence interaction processing on the audio features and the aggregated video features by adopting a third attention mechanism to obtain processed video features after the time sequence interaction processing;
obtaining, based on the processed video features, the video mask features related to the audio for the target video frame.
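As a non-limiting sketch of the three attention stages in claim 3, the following assumes the multi-size visual features have been flattened into token sequences of shape (T, N_s, C) and the audio into (T, 1, C); the use of standard multi-head attention modules, the residual connection, and all dimensions are assumptions rather than claim requirements.

```python
# Illustrative sketch of the three attention stages of claim 3.
import torch
import torch.nn as nn

class FusionAndTemporal(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_av = nn.MultiheadAttention(dim, heads, batch_first=True)     # 1st: fuse audio into video
        self.cross_scale = nn.MultiheadAttention(dim, heads, batch_first=True)  # 2nd: aggregate across sizes
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)     # 3rd: time sequence interaction
        self.out = nn.Linear(dim, dim)

    def forward(self, scales, audio):
        # scales: list of (T, N_s, C) visual tokens per size; audio: (T, 1, C)
        fused = [self.cross_av(x, audio, audio)[0] + x for x in scales]  # fusion video features
        tokens = torch.cat(fused, dim=1)                                 # (T, sum N_s, C)
        agg, _ = self.cross_scale(fused[0], tokens, tokens)              # aggregated video features
        # Temporal interaction: treat the frame axis as the sequence for each spatial token.
        t = agg.transpose(0, 1)                                          # (N_0, T, C)
        t, _ = self.temporal(t, t, t)
        return self.out(t.transpose(0, 1))                               # processed video features

module = FusionAndTemporal()
scales = [torch.randn(4, n, 256) for n in (256, 64, 16)]
out = module(scales, torch.randn(4, 1, 256))
print(out.shape)   # torch.Size([4, 256, 256])
```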
4. The method of claim 3, wherein the performing time sequence interaction processing on the audio features and the aggregated video features by adopting the third attention mechanism to obtain the processed video features after the time sequence interaction processing comprises:
determining, for the audio features, initial features of the audio features in the plurality of video frames from the aggregated video features;
adopting self-attention to perform feature enhancement on the initial features, and determining features after time sequence interaction between different target video frames;
and mapping the features after the time sequence interaction to obtain the processed video features after the time sequence interaction.
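A minimal sketch of the temporal step of claim 4, assuming the initial features drawn from the aggregated video features are arranged as (locations, frames, channels) so that self-attention runs along the frame axis; the single attention layer and the dimensions are illustrative assumptions.

```python
# Illustrative sketch of claim 4: self-attention over frames, then a mapping layer.
import torch
import torch.nn as nn

N, T, C = 256, 4, 256                                   # spatial locations, frames, channels
initial = torch.randn(N, T, C)                          # initial features across the video frames
attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
enhanced, _ = attn(initial, initial, initial)           # time sequence interaction between target frames
processed = nn.Linear(C, C)(enhanced)                   # mapping to the processed video features
print(processed.shape)                                  # torch.Size([256, 4, 256])
```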
5. The method of claim 1, wherein the determining, in the processed video features, target features of sounding objects corresponding to the audio query information in the plurality of video frames based on preset audio query information for each video frame comprises:
determining a sounding object corresponding to the audio query information based on the preset audio query information for each video frame;
obtaining, from the processed video features, video features at a specified size;
and, for each piece of audio query information, taking the video features at the specified size and the audio query information as input information of an audio query encoder to obtain target features of the sounding object in the plurality of video frames.
6. The method of claim 5, wherein the audio query encoder comprises: a multi-head cross attention module, a multi-head self attention module and a forward network module;
the obtaining, for each piece of audio query information, the target features of the sounding object in the plurality of video frames by taking the video features at the specified size and the audio query information as input information of the audio query encoder comprises:
for each piece of audio query information, taking the video features at the specified size and the audio query information as input information of the multi-head cross attention module to obtain output result information of the multi-head cross attention module;
taking the output result information of the multi-head cross attention module as input information of the multi-head self attention module to obtain output result information of the multi-head self attention module;
taking the output result information of the multi-head self attention module as input information of the forward network module to obtain output result information of the forward network module;
and obtaining the target features of the sounding object in the plurality of video frames according to the output result information of the forward network module.
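The following is an illustrative sketch of a single audio query encoder layer as laid out in claim 6 (multi-head cross attention, then multi-head self attention, then a forward network); the residual connections, layer normalization, and dimensions are assumptions not fixed by the claim.

```python
# Illustrative sketch of one audio query encoder layer (claim 6).
import torch
import torch.nn as nn

class AudioQueryEncoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # multi-head cross attention
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # multi-head self attention
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))                      # forward network
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries, video_tokens):
        # queries: (B, Q, C) audio query information; video_tokens: (B, N, C) features at the specified size
        q = self.norm1(queries + self.cross_attn(queries, video_tokens, video_tokens)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))   # target features of the sounding object

layer = AudioQueryEncoderLayer()
out = layer(torch.randn(4, 8, 256), torch.randn(4, 400, 256))
print(out.shape)   # torch.Size([4, 8, 256])
```

Several such layers can be stacked, with the first layer's queries taken from the per-frame audio features and each later layer consuming the previous layer's output, which is the arrangement the next claim describes.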
7. The method of claim 6, wherein the audio query encoder comprises a multi-layer attention mechanism, each layer of attention mechanism having a multi-head cross attention module, a multi-head self attention module, and a forward network module;
the audio query information input into the first layer of attention mechanism is the audio features of the audio corresponding to each video frame; the audio query information input into the second layer of attention mechanism, or into any layer above the second layer, is the output result information of the previous layer of attention mechanism; and the output of the last layer of attention mechanism is the target features of the sounding object in the plurality of video frames.
8. The method of claim 1, wherein the obtaining a segmentation mask of the target sounding object for the target video frame according to the target feature and the video mask feature comprises:
and performing matrix multiplication operation and preset function operation on the target features and the video mask features to obtain a segmentation mask of the target sounding object aiming at the target video frame.
9. A video processing apparatus, comprising:
a framing processing unit, configured to perform framing processing on a video to be processed to obtain a plurality of video frames corresponding to the video to be processed;
a video feature extraction unit, configured to extract, for any target video frame in the plurality of video frames, video features of the target video frame at a plurality of sizes to obtain multi-size video features of the target video frame;
an audio feature obtaining unit, configured to obtain an audio feature corresponding to the target video frame according to audio corresponding to the video to be processed;
a video mask feature obtaining unit configured to obtain a video mask feature related to the audio for the target video frame based on the multi-size video feature and the audio feature;
a target feature determining unit, configured to determine, in processed video features, target features of sounding objects corresponding to the audio query information in the plurality of video frames based on preset audio query information for each video frame; wherein the processed video features are video features obtained after feature fusion and time sequence interaction are performed on the multi-size video features and the audio features;
a segmentation mask obtaining unit, configured to obtain a segmentation mask of a target sounding object for the target video frame according to the target feature and the video mask feature; the segmentation mask is used to represent the target sounding object of the target video frame.
10. An electronic device, comprising:
a processor;
a memory for storing a computer program to be run by the processor, the computer program being used for performing the method of any one of claims 1-8.
11. A computer storage medium, characterized in that the computer storage medium stores a computer program which, when executed by a processor, performs the method of any one of claims 1-8.
CN202311052500.5A 2023-08-19 2023-08-19 Video processing method, video processing device, electronic equipment and computer storage medium Pending CN117315524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311052500.5A CN117315524A (en) 2023-08-19 2023-08-19 Video processing method, video processing device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311052500.5A CN117315524A (en) 2023-08-19 2023-08-19 Video processing method, video processing device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN117315524A true CN117315524A (en) 2023-12-29

Family

ID=89245266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311052500.5A Pending CN117315524A (en) 2023-08-19 2023-08-19 Video processing method, video processing device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN117315524A (en)

Similar Documents

Publication Publication Date Title
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN111209970B (en) Video classification method, device, storage medium and server
CN109766925B (en) Feature fusion method and device, electronic equipment and storage medium
WO2018153294A1 (en) Face tracking method, storage medium, and terminal device
CN110728294A (en) Cross-domain image classification model construction method and device based on transfer learning
EP3620982B1 (en) Sample processing method and device
WO2020082382A1 (en) Method and system of neural network object recognition for image processing
CN110211195B (en) Method, device, electronic equipment and computer-readable storage medium for generating image set
CN110675385A (en) Image processing method and device, computer equipment and storage medium
CN109993026B (en) Training method and device for relative recognition network model
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN113887615A (en) Image processing method, apparatus, device and medium
TWI803243B (en) Method for expanding images, computer device and storage medium
CN111353514A (en) Model training method, image recognition method, device and terminal equipment
CN111815748B (en) Animation processing method and device, storage medium and electronic equipment
WO2024041108A1 (en) Image correction model training method and apparatus, image correction method and apparatus, and computer device
WO2020244076A1 (en) Face recognition method and apparatus, and electronic device and storage medium
CN114926322B (en) Image generation method, device, electronic equipment and storage medium
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN117315524A (en) Video processing method, video processing device, electronic equipment and computer storage medium
KR101321840B1 (en) Image normalization method and apparatus by using fuzzy-based retinex
CN113344200B (en) Method for training separable convolutional network, road side equipment and cloud control platform
CN114882308A (en) Biological feature extraction model training method and image segmentation method
US20240161382A1 (en) Texture completion
CN109614854B (en) Video data processing method and device, computer device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination