CN115022710A - Video processing method and device and readable storage medium - Google Patents

Video processing method and device and readable storage medium

Info

Publication number
CN115022710A
Authority
CN
China
Prior art keywords
video
segment
spatial position
frame
audio
Prior art date
Legal status
Granted
Application number
CN202210604049.2A
Other languages
Chinese (zh)
Other versions
CN115022710B (en)
Inventor
曹汝帅
李琳
李伯龙
周效军
毕蕾
Current Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd
Priority to CN202210604049.2A
Publication of CN115022710A
Application granted
Publication of CN115022710B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/442 Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N 21/44213 Monitoring of end-user related data
    • H04N 21/44218 Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video processing method, a video processing device, and a readable storage medium, and relates to the technical field of video processing. The method comprises the following steps: for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object whose audio role is a speaker, and a spatial position relationship between the first object and the second object, wherein the first video segment set is any one of a plurality of video segment sets, the video segment sets are divided on the basis of objects in a video to be processed, and each video segment set comprises at least one video sub-segment; determining a sound source direction vector according to the spatial position of the first object and the spatial position relationship between the first object and the second object; and rendering the audio data of the first video sub-segment according to the sound source direction vector. The embodiment of the invention enables the user to have the same hearing experience as the first object in the video to be processed.

Description

Video processing method and device and readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method, a video processing device, and a readable storage medium.
Background
At present, in order to enhance a user's sense of presence when watching a video, a three-dimensional sound effect can be generated according to the positional relationship between the viewing user outside the video and a sound source inside the video, and this three-dimensional sound effect is then used to create an immersive auditory experience for the viewing user. However, a three-dimensional sound effect generated in this way is detached from the video content itself and involves no interaction with the user, which degrades the user's auditory experience when watching the video.
Disclosure of Invention
The embodiment of the invention provides a video processing method, a video processing device, and a readable storage medium, which are used to improve the auditory experience of a user when watching a video.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object of which an audio role is a speaker, and a spatial position relationship between the first object and the second object, wherein the first video segment set is any one of a plurality of video segment sets, the video segment sets are divided on the basis of objects in a video to be processed, and the video segment sets comprise at least one video sub-segment;
determining a sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
and rendering the audio data of the first video sub-segment according to the sound source direction vector.
Optionally, before the determining, for a first video sub-segment in the first set of video segments, a spatial position of a first object, a spatial position of a second object whose audio character is a speaker, and a spatial positional relationship between the first object and the second object, the method further includes:
according to the preset interval frame number, carrying out object identification on the frame image of the video to be processed to obtain at least one frame image of each object;
the plurality of sets of video segments are formed using at least one frame of image of a different object.
Optionally, the at least one video sub-segment is obtained as follows:
sequencing at least one frame of image of different objects according to the frame number, and comparing the frame numbers of two adjacent frames of images;
determining that the two adjacent frame images belong to two video sub-segments under the condition that the frame number difference value of the two adjacent frame images is larger than a preset threshold value; or under the condition that the frame number difference value of the two adjacent frame images is smaller than or equal to the preset threshold, dividing the video sub-segments according to the scene information of the two adjacent frame images.
Optionally, the dividing the video sub-segment according to the scene information of the two adjacent frames of images includes:
if the two adjacent frames of images do not belong to the same scene, determining that the two adjacent frames of images belong to two video sub-segments; or if the two adjacent frames of images belong to the same scene, determining that the two adjacent frames of images belong to the same video sub-segment.
Optionally, the determining, for a first video sub-segment in a first set of video segments, a spatial position of the first object, a spatial position of a second object whose audio character is a speaker, and a spatial positional relationship between the first object and the second object includes:
carrying out object identification on each frame of image of the first video sub-segment to obtain a detection frame of the first object and a detection frame of the second object;
obtaining the spatial position of the first object and the spatial position of the second object according to the detection frame of the first object and the detection frame of the second object;
determining a spatial position relationship between the first object and the second object based on a spatial position of the first object and a spatial position of the second object in a case where the first object and the second object are located in the same frame image; or, in the case that the first object and the second object are not located in the same frame image, determining a spatial position relationship between the first object and the second object according to a spatial position of a static reference object in each frame image.
Optionally, the determining a sound source direction vector according to the spatial position of the first object and the spatial position relationship between the first object and the second object includes:
performing audio recognition and face recognition on the first video sub-segment to obtain audio roles of each object, wherein the audio roles comprise a sound producer and a receiver;
acquiring the first object input by a user;
under the condition that the audio role of the first object is the listener, determining the sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
wherein a start point of the sound source direction vector is a spatial position of a second object whose audio character is a speaker, and an end point is a spatial position of the first object.
Optionally, the performing audio recognition and face recognition on the first video sub-segment to obtain an audio role of each object includes:
performing tone recognition on the audio data of the first video sub-segment to obtain tone information;
comparing the tone information with a tone feature library to obtain an audio recognition result;
performing key point analysis on the target part of each object in the multi-frame images of the first video sub-segment to obtain a face recognition result;
and acquiring the audio roles of the objects according to the audio recognition result and the face recognition result.
Optionally, the method further comprises:
carrying out scene identification on the first video sub-segment to obtain a scene label;
the rendering the audio data of the first video sub-segment according to the sound source direction vector comprises:
determining parameters in a three-dimensional sound effect processing model according to the sound source direction vector and the scene label;
and rendering the audio data of the first video sub-segment by adopting the three-dimensional sound effect processing model after the parameters are determined.
In a second aspect, an embodiment of the present invention further provides a video processing apparatus, including: a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor is configured to read the program in the memory to implement the steps in the video processing method according to the first aspect.
In a third aspect, an embodiment of the present invention further provides a video processing apparatus, including:
the video processing device comprises a first determining module, a second determining module and a processing module, wherein the first determining module is used for determining the spatial position of a first object, the spatial position of a second object of which an audio role is a speaker and the spatial position relation between the first object and the second object for a first video sub-segment in a first video segment set, the first video segment set is any one of a plurality of video segment sets, the video segment sets are divided on the basis of objects in a video to be processed, and the video segment sets comprise at least one video sub-segment;
the second determining module is used for determining a sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
and the rendering module is used for rendering the audio data of the first video sub-segment according to the sound source direction vector.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the video processing method according to the first aspect.
In the embodiment of the invention, for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object of which an audio role is a speaker, and a spatial position relationship between the first object and the second object, wherein the first video segment set is any one of a plurality of video segment sets, the video segment sets are divided on the basis of objects in a video to be processed, and the video segment sets comprise at least one video sub-segment; determining a sound source direction vector according to the spatial position of the first object and the spatial position relationship between the first object and the second object; rendering the audio data of the first video sub-segment according to the sound source direction vector; the spatial position of the object is determined, and spatial sense is introduced into the two-dimensional video, so that the user and the first object in the video to be processed have the same auditory experience, the effect of three-dimensional sound effect is realized, and the auditory experience of the user when watching the video is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flow chart of a video processing method provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a partitioned video sub-segment provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a frame image provided by an embodiment of the invention;
FIG. 4 is a diagram illustrating the determination of audio roles for objects provided by an embodiment of the present invention;
fig. 5 is a block diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a video processing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a video processing method according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
step 101, for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object whose audio role is a speaker, and a spatial position relationship between the first object and the second object, wherein the first video segment set is any one of a plurality of video segment sets, the video segment sets are divided based on objects in a video to be processed, and the video segment sets include at least one video sub-segment;
the object may be, for example, a person, a car, or the like. Taking a person as an example, in order to enable a user to have the same auditory experience as the person in the video to be processed, dividing is performed on the basis of the person in the video to be processed to obtain a plurality of video segment sets. The multiple video segment sets may be composed of image frames including the same person or image frames including multiple persons. Each video segment set may in turn be composed of one or more video sub-segments, and a video sub-segment may comprise one or more image frames.
Illustratively, the video V to be processed includes three lead characters U1, U2, and U3, and each lead character corresponds to one video segment set. The video segment set of the lead character U1 is VU1 = [v1t1, v1t2, …, v1ti, …, v1tn], which denotes the n video sub-segments in which the lead character U1 appears in the video V to be processed. Each video sub-segment can be represented by a pair of frame numbers, i.e., v1ti = [start, end], where v1ti denotes a video sub-segment, start denotes the frame number of its starting frame image, i.e., the start time, and end denotes the frame number of its ending frame image, i.e., the end time.
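As a minimal illustrative sketch of this bookkeeping (the variable names and frame numbers below are hypothetical and not part of the invention), the per-character video segment sets can be held as lists of [start, end] frame-number pairs:

# Hypothetical sketch: one list of [start_frame, end_frame] pairs per lead character.
video_segment_sets = {
    "U1": [[120, 480], [900, 1320]],   # the n sub-segments in which U1 appears
    "U2": [[130, 600]],
    "U3": [[700, 1320]],
}

# Each pair is one video sub-segment: its start frame number (start time)
# and its end frame number (end time).
first_sub_segment_of_U1 = video_segment_sets["U1"][0]   # [start, end] = [120, 480]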
In this step, the first video sub-segment is any video sub-segment in the first video segment set. Based on this video sub-segment, the spatial position of the first object and the spatial position of the second object, i.e., the coordinates of the first object and the second object in the frame image, are determined. The spatial position relationship between the first object and the second object, i.e., a direction vector from one object to the other, is then determined using these coordinates. In this way the spatial positions of the objects are estimated and a sense of space is introduced into the two-dimensional video.
Step 102, determining a sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
In this step, the user may select the first object from an object list, where the object list includes the persons and other objects appearing in the video to be processed. If the first object appears in the video frames of the first video sub-segment, the sound source direction vector is determined according to the spatial position of the first object and the spatial position relationship between the first object and the second object (the second object being the character or sounding object whose audio role is a speaker). Here, an object may be a character or a sounding object such as a car; for example, the user can hear a car whistle from the environment in which the video takes place.
It should be noted that, if the first object is not among the identified objects, the video processing method may end the flow.
And 103, rendering the audio data of the first video sub-segment according to the sound source direction vector.
In this step, the audio data of the first video sub-segment is rendered according to the sound source direction vector, the sound is modified accordingly, and the rendered audio data is output, so that the user can perceive a three-dimensional auditory experience. This achieves an immersive auditory experience and improves the audio-visual effect of the video.
In this embodiment, for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object, and a spatial position relationship between the first object and the second object, where the first video segment set is any one of a plurality of video segment sets, the video segment set is divided based on an object in a video to be processed, and the video segment set includes at least one video sub-segment; determining a sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object; rendering the audio data of the first video sub-segment according to the sound source direction vector; the spatial position of the object is determined, and spatial sense is introduced into the two-dimensional video, so that the user and the first object in the video to be processed have the same auditory experience, the effect of three-dimensional sound effect is realized, and the auditory experience of the user when watching the video is improved.
In a specific embodiment, before step 101, the method further includes:
according to the preset interval frame number, carrying out object identification on the frame image of the video to be processed to obtain at least one frame image of each object;
the plurality of sets of video segments are formed using at least one frame of image of a different object.
In order to increase the speed of object recognition on the frame images of the video to be processed, frame images are selected for object recognition according to a preset interval frame number, that is, in a frame-skipping manner. The preset interval frame number is related to the frame rate FPS and is typically selected from the interval [7,10]. For example, if the preset interval frame number is 7 and the frame number of the frame image currently undergoing object recognition is Fi, the frame number of the next frame image to undergo object recognition is Fi + 7.
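A minimal sketch of this frame-skipping selection, assuming OpenCV-style sequential decoding (the function name and the downstream object-recognition call are placeholders, not part of the invention):

import cv2

def sample_frames_for_recognition(video_path, interval=7):
    # Decode the video sequentially but keep only every `interval`-th frame
    # (frame skipping), e.g. frame Fi, then Fi + 7, Fi + 14, ...
    cap = cv2.VideoCapture(video_path)
    frame_number, sampled = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_number % interval == 0:
            sampled.append((frame_number, frame))   # hand these to object recognition
        frame_number += 1
    cap.release()
    return sampled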
Taking an object as a person as an example, the object recognition here may be face recognition, that is, performing face recognition on a frame image of a video to be processed, so as to obtain at least one frame image in which each person appears.
It should be further noted that face recognition algorithms such as MTCNN (Multi-task Cascaded Convolutional Networks) or PFLD (Practical Facial Landmark Detector) may be adopted to perform face detection and recognition on the frame images, so as to obtain at least one face key point and at least one facial feature. The at least one facial feature is then searched and matched in a facial feature library, which may include, for each person in the object list, information such as a real name/stage name and media such as portrait images/photos. If a person is matched, the frame image contains that person, and each frame image may contain one or more persons. The frame images are divided on the basis of the different persons to obtain at least one frame image of each person.
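A minimal sketch of the library lookup step, assuming a face embedding has already been extracted for a detected face (the cosine-similarity threshold and data layout are illustrative assumptions):

import numpy as np

def match_face(face_feature, facial_feature_library, threshold=0.6):
    # facial_feature_library: {person_name: reference embedding vector}
    best_name, best_score = None, -1.0
    for name, reference in facial_feature_library.items():
        score = float(np.dot(face_feature, reference) /
                      (np.linalg.norm(face_feature) * np.linalg.norm(reference)))
        if score > best_score:
            best_name, best_score = name, score
    # The frame image is considered to contain the person only if the best match
    # is similar enough; otherwise no person is matched.
    return best_name if best_score >= threshold else None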
The face recognition algorithm may also output the position information of a person in the frame image, indicated by a detection box (bbox); for example, the position information of the lead character U1 in frame FN is denoted U1_FN_bbox. In this way, the at least one frame image in which each person appears can be determined; for example, the frame images in which the lead character U1 appears are denoted F_U1_P = [Fi, Fj, Fk, …, Fo, Fp, Fq], where the subscripts i, j, k, o, p, q are frame numbers. The plurality of video segment sets are thereby formed.
In an embodiment, the at least one video sub-segment is obtained as follows:
sequencing at least one frame of image of different objects according to the frame number, and comparing the frame numbers of two adjacent frames of images;
determining that the two adjacent frame images belong to two video sub-segments under the condition that the frame number difference value of the two adjacent frame images is larger than a preset threshold value; or under the condition that the frame number difference value of the two adjacent frame images is smaller than or equal to the preset threshold, dividing the video sub-segments according to the scene information of the two adjacent frame images.
It should be noted that the preset threshold is related to the frame rate and is generally 120 times the frame rate. That is, if the frame-number difference between two adjacent frame images in the frame image list F_U1_P of the lead character U1 is greater than the preset threshold, i.e., the lead character U1 is not recognized for more than 2 minutes, the two adjacent frame images are determined to belong to two video sub-segments and are split directly. If the frame-number difference between two adjacent frame images in F_U1_P is less than or equal to the preset threshold, i.e., the lead character U1 is recognized again within 2 minutes, the video sub-segments need to be further divided according to the scene information of the two adjacent frame images.
In an embodiment, the dividing the video sub-segment according to the scene information of the two adjacent frames of images includes:
if the two adjacent frames of images do not belong to the same scene, determining that the two adjacent frames of images belong to two video sub-segments; or if the two adjacent frames of images belong to the same scene, determining that the two adjacent frames of images belong to the same video sub-segment.
As shown in fig. 2, a scene recognition and shot segmentation algorithm is generally used to determine whether two adjacent frame images belong to the same scene. If they belong to the same scene, the next frame image is examined, and so on, until a frame image that does not belong to the same scene is found, at which point the segmentation is completed. For example, if the video sub-segment 1 determined by the adjacent frame images Fi and Fj in the frame image list F_U1_P of the lead character U1 does not belong to the same scene as the video sub-segment 2 determined by the adjacent frame images Fj and Fk, the two video sub-segments are not merged and are divided into two video sub-segments.
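A minimal sketch combining the two splitting rules above, assuming the frame numbers for one character are already sorted and that a same_scene() predicate wraps the scene recognition and shot segmentation step (both names are hypothetical):

def split_into_sub_segments(frame_numbers, fps, same_scene):
    # frame_numbers: sorted frame numbers in which the character was recognized.
    # same_scene(a, b): True if frames a and b belong to the same scene.
    threshold = 120 * fps                    # a gap of more than 2 minutes forces a split
    sub_segments, start = [], frame_numbers[0]
    for prev, cur in zip(frame_numbers, frame_numbers[1:]):
        if cur - prev > threshold or not same_scene(prev, cur):
            sub_segments.append([start, prev])   # close the current sub-segment
            start = cur
    sub_segments.append([start, frame_numbers[-1]])
    return sub_segments                       # list of [start, end] frame-number pairs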
It should be noted that when the scene recognition and shot segmentation algorithm is adopted, smoothing should be applied at the start and end nodes of a video segment, so as to obtain at least one accurate video sub-segment.
In one embodiment, step 101 includes:
carrying out object identification on each frame of image of the first video sub-segment to obtain a detection frame of the first object and a detection frame of the second object;
obtaining the spatial position of the first object and the spatial position of the second object according to the detection frame of the first object and the detection frame of the second object;
determining a spatial position relationship between the first object and the second object based on a spatial position of the first object and a spatial position of the second object in a case where the first object and the second object are located in the same frame image; or, in the case that the first object and the second object are not located in the same frame image, determining a spatial position relationship between the first object and the second object according to a spatial position of a static reference object in each frame image.
Face recognition is performed on each frame image to obtain a detection frame for each person. On this basis, static reference objects in each frame image can also be recognized to obtain corresponding detection frames, so that the spatial position of each object is determined.
Further, when the objects are located in the same frame image, it indicates that the objects whose audio roles are speaker and listener appear in one frame image at the same time; such a frame image is a panoramic shot rather than a close-up.
As shown in fig. 3, assume that the resolution of the frame image is 1080 × 720 and that objects U1, U2, and U3 appear at the same time, with the detection frame U1_bbox of object U1 being [110,100,60,60], the detection frame U2_bbox of object U2 being [840,90,60,60], and the detection frame U3_bbox of object U3 being [610,590,60,70]. Here, a detection box bbox represents a box in the format [x, y, w, h], where x/y are the coordinates of the upper-left corner of the box and w/h are its width and height. The spatial position of each object, i.e., the center coordinates of its detection frame, is therefore (x + w/2, y + h/2); for example, the center coordinates of object U1 are (140,130), those of object U2 are (870,120), and those of object U3 are (640,625).
Furthermore, the spatial position relationship between objects is determined using the center coordinates of the detection frames of the objects and is generally expressed as a vector. This vector initially has no definite direction, i.e., it is bidirectional; once the object taken as the center, i.e., the object regarded as the speaker, is determined, the vector acquires a specific direction.
For example, the spatial position relationship between object U3 and object U1 is represented by the vector (140-640, 130-625) = (-500, -495).
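A minimal sketch of this computation using the illustrative detection frames above (the helper names are not part of the invention):

def bbox_center(bbox):
    # bbox = [x, y, w, h]; the center is (x + w/2, y + h/2).
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)

u1_center = bbox_center([110, 100, 60, 60])   # (140.0, 130.0)
u3_center = bbox_center([610, 590, 60, 70])   # (640.0, 625.0)

# Spatial position relationship between U3 and U1, written as a vector
# (end point minus start point once U3 is taken as the speaker).
u3_to_u1 = (u1_center[0] - u3_center[0],
            u1_center[1] - u3_center[1])       # (-500.0, -495.0)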
When the objects are not located in the same frame image, this indicates that the objects whose audio roles are speaker and listener do not appear in one frame image at the same time; such frame images are close-up shots rather than panoramas. In this case, all frame images in the first video sub-segment are traversed to search for a frame image in which both objects do appear together; if one exists, the spatial positions of the objects in that frame image are used. If none exists, static reference objects are introduced.
Generally, a target detection algorithm is used to determine whether a static reference object exists in the frame images and, if so, to mark its position. Static reference objects may be stationary objects such as televisions, cars, or trees. Assume that object U1 and object U3 are persons, object U2 is a static reference object, and object U1 and object U2 appear together in one frame image of the first video sub-segment; the spatial position relationship between object U1 and object U2, i.e., the vector U2_to_U1, can then be obtained. Object U2 and object U3 appear together in another frame image of the first video sub-segment, from which the spatial position relationship between object U2 and object U3, i.e., the vector U2_to_U3, can be derived. Finally, by the triangle rule of vector addition, the positional relationship between object U1 and object U3, i.e., the vector U3_to_U1, can be obtained.
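A minimal sketch of this triangle-rule composition, reusing the illustrative center coordinates from the frame-image example above (here U2 is treated as the static reference object; all names and values are illustrative):

def add_vectors(a, b):
    return (a[0] + b[0], a[1] + b[1])

def negate(v):
    return (-v[0], -v[1])

# From the frame image containing U1 and U2: vector U2 -> U1.
u2_to_u1 = (140 - 870, 130 - 120)        # (-730, 10)
# From the frame image containing U2 and U3: vector U2 -> U3.
u2_to_u3 = (640 - 870, 625 - 120)        # (-230, 505)

# Triangle rule: U3 -> U1 = (U3 -> U2) + (U2 -> U1) = -(U2 -> U3) + (U2 -> U1).
u3_to_u1 = add_vectors(negate(u2_to_u3), u2_to_u1)   # (-500, -495)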
In addition, if neither of the above two cases holds, the positional relationship is left undetermined and is set to a null value, so as to ensure the accuracy of the positional relationships between objects.
In one embodiment, step 102 includes:
performing audio recognition and face recognition on the first video sub-segment to obtain audio roles of each object, wherein the audio roles comprise a sound producer and a receiver;
acquiring the first object input by a user;
under the condition that the audio role of the first object is the listener, determining the sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
wherein a start point of the sound source direction vector is a spatial position of a second object whose audio character is a speaker, and an end point is a spatial position of the first object.
That is, the vector between the objects is known, but it has not yet been determined which of the two objects is the speaker and which is the listener, so the direction of the vector cannot be fixed. It is therefore necessary to determine the audio role of each object first, and then construct a speaker-to-listener sound source direction vector based on the vector between the first object, whose audio role is the listener, and the second object, whose audio role is the speaker.
The first object input by the user is used as the listener, so that the user has the same auditory experience as the first object when watching the video, achieving a role-playing effect.
After the audio roles of the objects are distinguished and the spatial position relationship between the objects is obtained, it is known which object is the speaker and which is the listener, and hence the spatial position relationship between the speaker and the listener is known. The spatial position of the listener can then be fixed: taking the spatial position of the listener as the origin of coordinates, the sound source direction vector is determined from the vector between the speaker and the listener.
In an embodiment, the performing audio recognition and face recognition on the first video sub-segment to obtain an audio role of each object includes:
performing tone recognition on the audio data of the first video sub-segment to obtain tone information;
comparing the tone information with a tone feature library to obtain an audio recognition result;
performing key point analysis on the target part of each object in the multi-frame images of the first video sub-segment to obtain a face recognition result;
and acquiring the audio roles of the objects according to the audio recognition result and the face recognition result.
As shown in fig. 4, the first video sub-segment needs to be identified from two aspects in order to determine the audio role of each object. First, tone information (or other sound information) is extracted from the audio data corresponding to the first video sub-segment based on basic audio knowledge and is compared with a pre-built tone feature library of the video to be processed (if other sound information is extracted, a corresponding sound feature library is built), so as to obtain the audio recognition result, i.e., at least one candidate object corresponding to the speaker. Second, key point analysis is performed on the first object, generally a character and in particular a lead character, in consecutive multi-frame images of the first video sub-segment; the target part is the sounding part, which for a character is the mouth. Whether the spatial position of the lead character's mouth changes across different frame images is judged: if it changes, the character is determined to be a speaker, otherwise a listener, giving the face recognition result and the object corresponding to the speaker. Finally, the recognition results of the two aspects are combined: the object whose audio role is the speaker is determined, and the audio roles of the other objects are determined to be listeners. Combining audio recognition and face recognition in this way improves the accuracy of the audio role judgment.
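A minimal sketch of the mouth-movement check on the face recognition side, assuming mouth key points have already been extracted per frame by a facial landmark detector (the movement threshold is an illustrative assumption):

import numpy as np

def is_speaker(mouth_keypoints_per_frame, movement_threshold=2.0):
    # mouth_keypoints_per_frame: one (N, 2) array of mouth key point coordinates
    # per consecutive frame image of the first video sub-segment.
    pts = np.asarray(mouth_keypoints_per_frame, dtype=float)
    # Mean displacement of the mouth key points between consecutive frames.
    displacement = np.linalg.norm(pts[1:] - pts[:-1], axis=-1).mean()
    # If the mouth position changes across frames, judge the object a speaker;
    # otherwise judge it a listener.
    return displacement > movement_threshold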
In one embodiment, the starting point of the sound source direction vector is a spatial position of a second object whose audio character is a speaker, and the end point is a spatial position of the first object.
In an embodiment, the method further comprises:
performing scene recognition on the first video sub-segment to obtain a scene label;
step 103, comprising:
determining parameters in a three-dimensional sound effect processing model according to the sound source direction vector and the scene label;
and rendering the audio data of the first video sub-segment by adopting the three-dimensional sound effect processing model after the parameters are determined.
It should be noted that a scene recognition model is trained using a scene recognition algorithm on a data set constructed from a scene list, where the scene list includes, but is not limited to, the following scenes: classroom, bedroom, living room, office building, jungle, and the like. Scene recognition is performed on the first video sub-segment based on the trained scene recognition model to obtain the scene label of the first video sub-segment. If the scene probability given by the scene recognition model is higher than a preset probability, the scene label is stored; otherwise, the scene label is judged to be null and is not stored. The preset probability is determined empirically or experimentally and is typically 75%.
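A minimal sketch of the label-thresholding step (the scene_model object and its predict() interface are hypothetical placeholders for the trained scene recognition model):

SCENE_LIST = ["classroom", "bedroom", "living room", "office building", "jungle"]

def get_scene_label(scene_model, sub_segment_frames, preset_probability=0.75):
    # scene_model.predict is assumed to return (label, probability) for the sub-segment.
    label, probability = scene_model.predict(sub_segment_frames)
    # Store the label only when the model's probability exceeds the preset probability;
    # otherwise the scene label is judged to be null and is not stored.
    return label if probability > preset_probability else None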
The three-dimensional sound effect processing model involves the following parameters: the speaker (sound source) position, the listener position, and the external environment. With the listener position fixed, different speaker positions produce different auditory effects, so this embodiment uses the sound source direction vector to represent the spatial position relationship between the speaker and the listener, and uses the scene label to determine the parameters of the external environment in which the speaker and the listener are located. Different scene labels correspond to different external environment parameters, which are calibrated empirically or experimentally. Taking a large auditorium scene label as an example, the external environment parameters are set in code as follows:
def __init__(self):
    # default environment parameter values
    self.density = 1.0
    self.diffusion = 1.0
    self.gain = 0.32
    self.gain_hf = 0.89                 # high-frequency gain
    self.decay_time = 1.49
    self.hf_ratio = 0.83                # high-frequency ratio
    self.reflections_gain = 0.05        # reflection gain
    self.reflections_delay = 0.007      # reflection delay
    self.late_reverb_gain = 1.26
    self.late_reverb_delay = 0.011
    self.air_absorption_gain = 0.994
    self.room_rolloff_factor = 0.0
    self.decay_hf_limit = True          # high-frequency decay limit
That is to say, a three-dimensional sound effect processing model is built according to the sound source direction vector and the scene label, supporting three-dimensional stereo sound processing. This model is used to render the audio data of the first video sub-segment, and the rendered audio data is output, achieving an immersive three-dimensional auditory experience. This improves the user's auditory experience when watching the video, offers strong interest and a sense of substitution, and delivers an immersive auditory impact.
In this solution, the spatial position relationship between the first object and the second object in a video segment is acquired, introducing a sense of space into the two-dimensional video. The object selected by the user is used as the listener, the sound source direction vector is determined, and the parameters of the three-dimensional sound effect processing model are determined, so that the audio data of the video segment is rendered. A user can simply wear earphones while watching the video to perceive the three-dimensional sound effect, creating the immersive auditory experience of a player character: the user feels present in the scene, as if the other characters were speaking to the user. This broadens the audience and can be applied to home three-dimensional cinema.
As shown in fig. 5, an embodiment of the present invention further provides a video processing apparatus, where the apparatus 500 includes:
a first determining module 501, configured to determine, for a first video sub-segment in a first video segment set, a spatial position of a first object, a spatial position of a second object whose audio role is a speaker, and a spatial position relationship between the first object and the second object, where the first video segment set is any one of a plurality of video segment sets, the video segment set is divided based on an object in a video to be processed, and the video segment set includes at least one video sub-segment;
a second determining module 502, configured to determine a sound source direction vector according to the spatial position of the first object and a spatial position relationship between the first object and the second object;
a rendering module 503, configured to render the audio data of the first video sub-segment according to the sound source direction vector.
Optionally, the apparatus further comprises:
the acquisition module is used for carrying out object identification on the frame images of the video to be processed according to the preset interval frame number to acquire at least one frame image of each object;
a generating module, configured to form the multiple video clip sets by using at least one frame of image of different objects.
Optionally, the apparatus 500 further comprises:
the comparison module is used for sequencing at least one frame of image of different objects according to the frame number and comparing the frame numbers of two adjacent frames of images;
the dividing module is used for determining that the two adjacent frames of images belong to two video sub-segments under the condition that the frame number difference value of the two adjacent frames of images is greater than a preset threshold value; or under the condition that the frame number difference value of the two adjacent frame images is smaller than or equal to the preset threshold, dividing the video sub-segments according to the scene information of the two adjacent frame images.
Optionally, the dividing module is specifically configured to:
if the two adjacent frames of images do not belong to the same scene, determining that the two adjacent frames of images belong to two video sub-segments; or if the two adjacent frames of images belong to the same scene, determining that the two adjacent frames of images belong to the same video sub-segment.
Optionally, the first determining module 501 is specifically configured to:
carrying out object identification on each frame of image of the first video sub-segment to obtain a detection frame of the first object and a detection frame of the second object;
obtaining the spatial position of the first object and the spatial position of the second object according to the detection frame of the first object and the detection frame of the second object;
determining a spatial position relationship between the first object and the second object based on a spatial position of the first object and a spatial position of the second object in a case where the first object and the second object are located in the same frame image; or, in the case that the first object and the second object are not located in the same frame image, determining the position relationship between the first object and the second object according to the spatial position of the static reference object in each frame image.
Optionally, the second determining module 502 includes:
the obtaining unit is used for carrying out audio recognition and face recognition on the first video sub-segment to obtain audio roles of each object, wherein the audio roles comprise a sound producer and a sound receiver;
an acquisition unit configured to acquire the first object input by a user;
a first determining unit configured to determine the sound source direction vector according to a spatial position of the first object and a spatial positional relationship between the first object and the second object when the audio character of the first object is the listener;
wherein a start point of the sound source direction vector is a spatial position of a second object whose audio character is a speaker, and an end point is a spatial position of the first object.
Optionally, the obtaining unit is specifically configured to:
performing tone identification on the audio data corresponding to the first video sub-segment to obtain tone information;
comparing the tone information with a tone feature library to obtain an audio recognition result;
performing key point analysis on the target part of each object in the multi-frame images of the first video sub-segment to obtain a face recognition result;
and acquiring the audio roles of the objects according to the audio recognition result and the face recognition result.
Optionally, the apparatus further comprises:
the identification module is used for carrying out scene identification on the first video sub-segment to obtain a scene label;
a rendering module 503, comprising:
the second determining unit is used for determining parameters in the three-dimensional sound effect processing model according to the sound source direction vector and the scene label;
and the rendering unit is used for rendering the audio data of the first video sub-segment by adopting the three-dimensional sound effect processing model with the determined parameters.
The apparatus provided in the embodiment of the present invention may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
As shown in fig. 6, the video processing apparatus according to the embodiment of the present invention includes: a processor 600; and a memory 620 connected to the processor 600 through a bus interface, wherein the memory 620 is used for storing programs and data used by the processor 600 in executing operations, and the processor 600 calls and executes the programs and data stored in the memory 620.
The processor 600 is used to read the program in the memory 620 and execute the following processes:
for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object of which an audio role is a speaker, and a spatial position relationship between the first object and the second object, wherein the first video segment set is any one of a plurality of video segment sets, the video segment sets are divided on the basis of objects in a video to be processed, and the video segment sets comprise at least one video sub-segment;
determining a sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
and rendering the audio data of the first video sub-segment according to the sound source direction vector.
A transceiver 610 for receiving and transmitting data at the controller of the processor 600.
In fig. 6, the bus architecture may include any number of interconnected buses and bridges, with various circuits being linked together, particularly one or more processors represented by processor 600 and memory represented by memory 620. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 610 may be a plurality of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. For different user devices, the user interface 630 may also be an interface capable of connecting externally to a desired device, including but not limited to a keypad, display, speaker, microphone, joystick, etc.
The processor 600 is responsible for managing the bus architecture and general processing, and the memory 620 may store data used by the processor 600 in performing operations.
Optionally, the processor 600 is further configured to read the computer program and execute the following steps:
according to the preset interval frame number, carrying out object identification on the frame image of the video to be processed to obtain at least one frame image of each object;
the plurality of sets of video segments are formed using at least one frame of image of a different object.
Optionally, the processor 600 is further configured to read the computer program and execute the following steps:
sequencing at least one frame of image of different objects according to the frame number, and comparing the frame numbers of two adjacent frames of images;
determining that the two adjacent frame images belong to two video sub-segments under the condition that the frame number difference value of the two adjacent frame images is larger than a preset threshold value; or under the condition that the frame number difference value of the two adjacent frame images is smaller than or equal to the preset threshold, dividing the video sub-segments according to the scene information of the two adjacent frame images.
Optionally, the processor 600 is further configured to read the computer program and execute the following steps:
if the two adjacent frames of images do not belong to the same scene, determining that the two adjacent frames of images belong to two video sub-segments; or if the two adjacent frames of images belong to the same scene, determining that the two adjacent frames of images belong to the same video sub-segment.
Optionally, the processor 600 is further configured to read the computer program and execute the following steps:
carrying out object identification on each frame of image of the first video sub-segment to obtain a detection frame of the first object and a detection frame of the second object;
obtaining the spatial position of the first object and the spatial position of the second object according to the detection frame of the first object and the detection frame of the second object;
determining a spatial positional relationship between the first object and the second object based on a spatial position of the first object and a spatial position of the second object in a case where the first and second objects are located in the same frame image; or, in the case that the first object and the second object are not located in the same frame image, determining a spatial position relationship between the first object and the second object according to a spatial position of a static reference object in each frame image.
Optionally, the processor 600 is further configured to read the computer program and execute the following steps:
performing audio recognition and face recognition on the first video sub-segment to obtain audio roles of each object, wherein the audio roles comprise a sound producer and a receiver;
acquiring the first object input by a user;
under the condition that the audio role of the first object is the listener, determining the sound source direction vector according to the spatial position of the first object and the spatial position relation between the first object and the second object;
wherein a start point of the sound source direction vector is a spatial position of a second object whose audio character is a speaker, and an end point is a spatial position of the first object.
Optionally, the processor 600 is further configured to read the computer program, and execute the following steps:
performing tone recognition on the audio data of the first video sub-segment to obtain tone information;
comparing the tone information with a tone feature library to obtain an audio recognition result;
performing key point analysis on the target part of each object in the multi-frame images of the first video sub-segment to obtain a face recognition result;
and acquiring the audio roles of the objects according to the audio recognition result and the face recognition result.
Optionally, the processor 600 is further configured to read the computer program, and execute the following steps:
performing scene recognition on the first video sub-segment to obtain a scene label;
determining parameters in a three-dimensional sound effect processing model according to the sound source direction vector and the scene label;
and rendering the audio data of the first video sub-segment by adopting the three-dimensional sound effect processing model after the parameters are determined.
The video processing device provided in the embodiment of the present invention may implement the above method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.
In addition, the specific embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the video processing method and can achieve the same technical effects, which are not described again here to avoid repetition.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the transceiving method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A video processing method, comprising:
for a first video sub-segment in a first video segment set, determining a spatial position of a first object, a spatial position of a second object whose audio role is a speaker, and a spatial position relationship between the first object and the second object, wherein the first video segment set is any one of a plurality of video segment sets, the plurality of video segment sets are obtained by dividing a video to be processed on the basis of the objects in the video, and the first video segment set comprises at least one video sub-segment;
determining a sound source direction vector according to the spatial position of the first object and the spatial position relationship between the first object and the second object;
and rendering the audio data of the first video sub-segment according to the sound source direction vector.
2. The video processing method of claim 1, wherein before the determining, for a first video sub-segment in the first video segment set, a spatial position of a first object, a spatial position of a second object whose audio role is a speaker, and a spatial position relationship between the first object and the second object, the method further comprises:
performing object recognition on frame images of the video to be processed at a preset frame interval to obtain at least one frame image of each object;
forming the plurality of video segment sets using the at least one frame image of different objects.
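A rough sketch of forming per-object segment sets as in claim 2, under the assumption that `detect_objects(image)` is a hypothetical detector returning the identifiers of the objects visible in a frame image:

```python
# Rough sketch: sample the video every `interval` frames and collect, per
# detected object, the frames in which that object appears.
from collections import defaultdict

def build_segment_sets(frames, detect_objects, interval=5):
    """frames: list of images; one segment set is collected per object."""
    segment_sets = defaultdict(list)   # object id -> [(frame_number, image), ...]
    for frame_number in range(0, len(frames), interval):
        image = frames[frame_number]
        for object_id in detect_objects(image):
            segment_sets[object_id].append((frame_number, image))
    return dict(segment_sets)
```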
3. The video processing method according to claim 1 or 2, wherein the at least one video sub-segment is obtained as follows:
sorting the at least one frame image of different objects by frame number, and comparing the frame numbers of every two adjacent frame images;
determining that the two adjacent frame images belong to two video sub-segments when the frame number difference between the two adjacent frame images is greater than a preset threshold; or dividing the video sub-segments according to the scene information of the two adjacent frame images when the frame number difference is less than or equal to the preset threshold.
4. The video processing method according to claim 3, wherein the dividing the video sub-segments according to the scene information of the two adjacent frame images comprises:
if the two adjacent frame images do not belong to the same scene, determining that the two adjacent frame images belong to two video sub-segments; or if the two adjacent frame images belong to the same scene, determining that the two adjacent frame images belong to the same video sub-segment.
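A sketch of the sub-segment splitting described in claims 3 and 4, assuming a hypothetical `scene_of(image)` classifier and an illustrative gap threshold; this is not the patented implementation.

```python
# Sketch: sort one object's frames by frame number, start a new sub-segment
# when the frame-number gap exceeds the threshold, otherwise when the scene changes.
def split_into_sub_segments(frames, scene_of, gap_threshold=30):
    """frames: list of (frame_number, image) pairs for one object."""
    if not frames:
        return []
    frames = sorted(frames, key=lambda f: f[0])
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        gap_too_large = cur[0] - prev[0] > gap_threshold
        scene_changed = scene_of(cur[1]) != scene_of(prev[1])
        if gap_too_large or (not gap_too_large and scene_changed):
            segments.append(current)   # the two adjacent frames belong to two sub-segments
            current = [cur]
        else:
            current.append(cur)
    segments.append(current)
    return segments
```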
5. The video processing method of claim 1, wherein the determining, for a first video sub-segment in the first video segment set, a spatial position of a first object, a spatial position of a second object whose audio role is a speaker, and a spatial position relationship between the first object and the second object comprises:
performing object recognition on each frame image of the first video sub-segment to obtain detection boxes of the first object and the second object;
obtaining the spatial position of the first object and the spatial position of the second object according to the detection box of the first object and the detection box of the second object;
when the first object and the second object are located in the same frame image, determining the spatial position relationship between the first object and the second object based on the spatial position of the first object and the spatial position of the second object; or when the first object and the second object are not located in the same frame image, determining the spatial position relationship between the first object and the second object according to the spatial position of a static reference object in each frame image.
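A sketch of claim 5 under stated assumptions: a detection box is `(x1, y1, x2, y2)` in image coordinates, a spatial position is the box centre, and when the two objects never share a frame image the relationship is chained through a static reference object detected in both frame images.

```python
# Sketch: derive spatial positions from detection boxes and compute the
# first-to-second spatial relationship, directly or via a static reference.
def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def offset(from_pos, to_pos):
    return (to_pos[0] - from_pos[0], to_pos[1] - from_pos[1])

def spatial_relation(first_box, second_box, same_frame=True,
                     ref_box_in_first_frame=None, ref_box_in_second_frame=None):
    first_pos, second_pos = box_center(first_box), box_center(second_box)
    if same_frame:
        return offset(first_pos, second_pos)
    # Different frames: express both objects relative to the static reference,
    # then combine the two offsets.
    first_rel = offset(box_center(ref_box_in_first_frame), first_pos)
    second_rel = offset(box_center(ref_box_in_second_frame), second_pos)
    return (second_rel[0] - first_rel[0], second_rel[1] - first_rel[1])
```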
6. The video processing method according to claim 1, wherein the determining a sound source direction vector according to the spatial position of the first object and the spatial position relationship between the first object and the second object comprises:
performing audio recognition and face recognition on the first video sub-segment to obtain the audio role of each object, wherein the audio roles comprise a speaker and a listener;
acquiring the first object input by a user;
when the audio role of the first object is the listener, determining the sound source direction vector according to the spatial position of the first object and the spatial position relationship between the first object and the second object;
wherein the start point of the sound source direction vector is the spatial position of the second object whose audio role is the speaker, and the end point is the spatial position of the first object.
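As a small illustration of the vector defined in claim 6 (speaker position as start point, first object's position as end point), under the assumption that positions are plain coordinate tuples:

```python
# The sound source direction vector points from the speaker to the first object.
def sound_source_direction(speaker_pos, listener_pos):
    return tuple(l - s for l, s in zip(listener_pos, speaker_pos))

# e.g. sound_source_direction((100, 40), (260, 80)) -> (160, 40)
```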
7. The video processing method of claim 6, wherein the performing audio recognition and face recognition on the first video sub-segment to obtain the audio role of each object comprises:
performing timbre recognition on the audio data of the first video sub-segment to obtain timbre information;
comparing the timbre information with a timbre feature library to obtain an audio recognition result;
performing key point analysis on the target part of each object in the multi-frame images of the first video sub-segment to obtain a face recognition result;
and obtaining the audio role of each object according to the audio recognition result and the face recognition result.
8. The video processing method of claim 1, further comprising:
performing scene recognition on the first video sub-segment to obtain a scene label;
wherein the rendering the audio data of the first video sub-segment according to the sound source direction vector comprises:
determining parameters of a three-dimensional sound effect processing model according to the sound source direction vector and the scene label;
and rendering the audio data of the first video sub-segment using the three-dimensional sound effect processing model with the determined parameters.
9. A video processing apparatus, comprising: a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor; wherein the processor is configured to read the computer program in the memory to implement the steps of the video processing method according to any one of claims 1 to 8.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps in the video processing method according to any one of claims 1 to 8.
CN202210604049.2A 2022-05-30 2022-05-30 Video processing method, device and readable storage medium Active CN115022710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210604049.2A CN115022710B (en) 2022-05-30 2022-05-30 Video processing method, device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210604049.2A CN115022710B (en) 2022-05-30 2022-05-30 Video processing method, device and readable storage medium

Publications (2)

Publication Number Publication Date
CN115022710A (en) 2022-09-06
CN115022710B CN115022710B (en) 2023-09-19

Family

ID=83070704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210604049.2A Active CN115022710B (en) 2022-05-30 2022-05-30 Video processing method, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN115022710B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017037032A1 (en) * 2015-09-04 2017-03-09 Koninklijke Philips N.V. Method and apparatus for processing an audio signal associated with a video image
CN109286841A (en) * 2018-10-17 2019-01-29 Oppo广东移动通信有限公司 Film sound effect treatment method and Related product
CN109413563A (en) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 The sound effect treatment method and Related product of video
WO2019233262A1 (en) * 2018-06-08 2019-12-12 Oppo广东移动通信有限公司 Video processing method, electronic device, and computer readable storage medium
CN110753238A (en) * 2019-10-29 2020-02-04 北京字节跳动网络技术有限公司 Video processing method, device, terminal and storage medium
EP3706443A1 (en) * 2019-03-08 2020-09-09 LG Electronics Inc. Method and apparatus for sound object following
WO2021082941A1 (en) * 2019-10-28 2021-05-06 Oppo广东移动通信有限公司 Video figure recognition method and apparatus, and storage medium and electronic device
EP3849202A1 (en) * 2020-01-10 2021-07-14 Nokia Technologies Oy Audio and video processing
CN113467603A (en) * 2020-03-31 2021-10-01 北京字节跳动网络技术有限公司 Audio processing method and device, readable medium and electronic equipment
CN113572976A (en) * 2021-02-05 2021-10-29 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and readable storage medium
CN114040318A (en) * 2021-11-02 2022-02-11 海信视像科技股份有限公司 Method and equipment for playing spatial audio
CN114286275A (en) * 2021-12-20 2022-04-05 Oppo广东移动通信有限公司 Audio processing method and device and storage medium

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017037032A1 (en) * 2015-09-04 2017-03-09 Koninklijke Philips N.V. Method and apparatus for processing an audio signal associated with a video image
WO2019233262A1 (en) * 2018-06-08 2019-12-12 Oppo广东移动通信有限公司 Video processing method, electronic device, and computer readable storage medium
CN109286841A (en) * 2018-10-17 2019-01-29 Oppo广东移动通信有限公司 Film sound effect treatment method and Related product
CN109413563A (en) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 The sound effect treatment method and Related product of video
EP3706443A1 (en) * 2019-03-08 2020-09-09 LG Electronics Inc. Method and apparatus for sound object following
WO2021082941A1 (en) * 2019-10-28 2021-05-06 Oppo广东移动通信有限公司 Video figure recognition method and apparatus, and storage medium and electronic device
CN110753238A (en) * 2019-10-29 2020-02-04 北京字节跳动网络技术有限公司 Video processing method, device, terminal and storage medium
EP3849202A1 (en) * 2020-01-10 2021-07-14 Nokia Technologies Oy Audio and video processing
CN113467603A (en) * 2020-03-31 2021-10-01 北京字节跳动网络技术有限公司 Audio processing method and device, readable medium and electronic equipment
WO2021197020A1 (en) * 2020-03-31 2021-10-07 北京字节跳动网络技术有限公司 Audio processing method and apparatus, readable medium, and electronic device
CN113572976A (en) * 2021-02-05 2021-10-29 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and readable storage medium
CN114040318A (en) * 2021-11-02 2022-02-11 海信视像科技股份有限公司 Method and equipment for playing spatial audio
CN114286275A (en) * 2021-12-20 2022-04-05 Oppo广东移动通信有限公司 Audio processing method and device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗福元, 王行仁, 彭晓源: "Sound rendering technology and its application in virtual environments" (声音渲染技术及其在虚拟环境中的应用), Journal of System Simulation (系统仿真学报), no. 05 *

Also Published As

Publication number Publication date
CN115022710B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
US10952009B2 (en) Audio parallax for virtual reality, augmented reality, and mixed reality
KR102650850B1 (en) Video sound processing device, video sound processing method , and computer readable recording medium storing program
Chen et al. What comprises a good talking-head video generation?: A survey and benchmark
WO2022105519A1 (en) Sound effect adjusting method and apparatus, device, storage medium, and computer program product
CN106303555A (en) A kind of live broadcasting method based on mixed reality, device and system
WO2022017083A1 (en) Data processing method and apparatus, device, and readable storage medium
CN109819316B (en) Method and device for processing face sticker in video, storage medium and electronic equipment
WO2016029806A1 (en) Sound image playing method and device
US10728689B2 (en) Soundfield modeling for efficient encoding and/or retrieval
CN112598780B (en) Instance object model construction method and device, readable medium and electronic equipment
KR101244789B1 (en) Digital cartoon contents generating method using 3D reconstruction
CN117041664A (en) Digital human video generation method and device, electronic equipment and storage medium
CN113316078B (en) Data processing method and device, computer equipment and storage medium
JPH099202A (en) Index generation method, index generator, indexing device, indexing method, video minute generation method, frame editing method and frame editing device
CN115022710A (en) Video processing method and device and readable storage medium
Horiuchi et al. Interactive music video application for smartphones based on free-viewpoint video and audio rendering
KR102210097B1 (en) Apparatus or Method for Enhancing Video Metadata
CN111986301B (en) Method and device for processing data in live broadcast, electronic equipment and storage medium
CN114128312B (en) Audio rendering for low frequency effects
WO2024105870A1 (en) Control system, control method, and recording medium
US10147213B2 (en) Apparatus for generating motion effects and computer readable medium for the same
CN115984114A (en) Image processing method, device, equipment and storage medium
Xu et al. Motion structure parsing and motion editing in 3D video
CN117762374A (en) Audio playing method and device, virtual reality equipment and storage medium
Xu et al. Motion composition of 3D video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant