WO2022017006A1 - Video processing method and apparatus, and terminal device and computer-readable storage medium - Google Patents


Info

Publication number
WO2022017006A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/097743
Other languages
French (fr)
Chinese (zh)
Inventor
崔志佳
范泽华
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2022017006A1 publication Critical patent/WO2022017006A1/en

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams

Definitions

  • the present application belongs to the technical field of video processing, and in particular, relates to a video processing method, apparatus, terminal device, and computer-readable storage medium.
  • the user may adjust some video frames by zooming the image, adjusting the focus object, etc., to highlight certain content.
  • Embodiments of the present application provide a video processing method, an apparatus, a terminal device, and a computer-readable storage medium.
  • an embodiment of the present application provides a video processing method, including:
  • the initial audio corresponding to the current video frame is processed to obtain target audio, wherein any target audio component is the audio component corresponding to a target object in the initial audio.
  • an embodiment of the present application provides a video processing apparatus, including:
  • an acquisition module configured to acquire the target object contained in the current video frame if the specified editing information of the current video frame is detected
  • a processing module configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one of the target audio components is the audio component corresponding to one of the target objects in the initial audio;
  • an association module configured to associate the current video frame with the target audio to obtain a target video.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, a display, and a computer program stored in the memory and running on the processor, wherein, when the processor executes the computer program
  • the video processing method as described above in the first aspect is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements the video processing method described in the first aspect.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the video processing method described above in the first aspect.
  • FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of step S102 provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another video processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another video processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of still another video processing method provided by an embodiment of the present application.
  • FIG. 6 is an exemplary schematic diagram of obtaining target audio provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.
  • the terms “comprising”, “including”, “having” and their variants mean “including but not limited to” unless specifically emphasized otherwise.
  • the audio corresponding to the image still uses the original audio, so that the visual presentation effect of the image and the sound presentation effect of the audio do not match, resulting in a poor presentation effect of the obtained video.
  • an embodiment of the present application provides a video processing method, which includes:
  • the initial audio corresponding to the current video frame is processed to obtain target audio, wherein any target audio component is the audio component corresponding to a target object in the initial audio.
  • the beneficial effect of the embodiments of the present application is: if the specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be considered the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame and the target audio are associated to obtain the target video. Since any of the target audio components is an audio component corresponding to the target object in the initial audio, the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information, so that the sound effect of the resulting target audio better suits the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method further includes:
  • the target object contained in the current video frame is obtained, including:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, wherein the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method includes:
  • target frequency bands corresponding to each target object are identified in the initial audio, and the target frequency bands are used as target audio components of the corresponding target objects.
  • acquiring the target object contained in the current video frame includes:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • processing the initial audio corresponding to the current video frame to obtain the target audio includes:
  • the initial audio corresponding to the current video frame is processed to obtain the target audio.
  • processing the initial audio corresponding to the current video frame to obtain the target audio includes:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • adjusting the loudness of the target audio component in the initial audio includes:
  • the loudness of the target audio component in the initial audio is determined according to the predetermined correspondence between the image zoom factor and the loudness adjustment factor.
  • the target audio component is a sub-audio identified and extracted from the initial audio in advance;
  • the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
  • the embodiment of the present application also provides a video processing apparatus, which includes:
  • an acquisition module configured to acquire the target object contained in the current video frame if the specified editing information of the current video frame is detected
  • a processing module configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one of the target audio components is the audio component corresponding to one of the target objects in the initial audio;
  • an association module configured to associate the current video frame with the target audio to obtain a target video.
  • the video processing apparatus further includes:
  • the second processing module is configured to input the initial audio into the trained first neural network, and obtain an output result of the trained first neural network, where the output result includes the identified audio objects and each audio object The corresponding audio components;
  • the comparison module is configured to compare the audio object with the target object, and if there is at least one audio object that is the same as the target object, determine that the initial audio corresponding to the current video frame contains at least one target audio component corresponding to the target object.
  • the obtaining module is used for:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, wherein the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap.
  • the video processing apparatus further includes:
  • the third processing module is configured to identify target frequency bands corresponding to each target object in the initial audio according to a preset object-frequency band mapping table, and use the target frequency band as a target audio component of the corresponding target object.
  • the obtaining module is used for:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • the video processing apparatus further includes:
  • a first obtaining unit configured to obtain the original video frame corresponding to the current video frame according to the specified editing information
  • a detection unit configured to detect the first object contained in the original video frame
  • a comparison unit for comparing the target object contained in the current video frame and the first object contained in the original video frame to obtain a comparison result
  • the second processing unit is configured to process the initial audio corresponding to the current video frame according to the comparison result to obtain the target audio.
  • the processing module is used for:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • the processing module is used for:
  • the loudness of the target audio component in the initial audio is determined according to the predetermined correspondence between the image zoom factor and the loudness adjustment factor.
  • the target audio component is a sub-audio identified and extracted from the initial audio in advance; or, the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
  • Embodiments of the present application further provide a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the video processing method described above when executing the computer program.
  • the embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, wherein the computer program implements the video processing method described in the above-mentioned embodiments of the application when the computer program is executed by the processor.
  • the video processing method provided by the embodiments of the present application can be applied to terminal devices such as servers, desktop computers, mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of the present application do not place any restriction on the specific type of the terminal device.
  • FIG. 1 shows a flowchart of a video processing method provided by an embodiment of the present application, and the video processing method can be applied to a terminal device.
  • the user may adjust some video frames to highlight a scene or an object.
  • the image zoom factor of the video frame may be adjusted to change the field of view of the video frame, or the focus object may be adjusted, etc.
  • the audio corresponding to the video frame often still uses the original audio. The prior art does not recognize that, after the video frame is adjusted, the spatial relationship between the scene and the objects perceived by the user from the video frame may change, while the sound presented in the original audio is still the sound captured under the spatial relationship presented before the video frame was scaled. Therefore, at present, after the video frame is adjusted, the visual presentation effect of the image and the sound presentation effect of the audio may not match, resulting in a poor presentation effect of the obtained video.
  • the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information.
  • the obtained target audio can follow the changes of the current video frame under the specified editing operation, so that the sound effect of the resulting target audio better suits the visual effect presented in the current video frame, improving the presentation effect of the target video and providing users with a more immersive and real viewing experience.
  • the video processing method may include:
  • Step S101 if the specified editing information of the current video frame is detected, acquire the target object contained in the current video frame.
  • the current video frame may be the current video frame collected in real time during the video collection process, or may be a video frame extracted from the video to be edited when the video is edited.
  • alternatively, the current video frame may be obtained through other acquisition methods.
  • target detection can be performed on the current video frame by a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster-RCNN, Single Shot MultiBox Detector (SSD), Retina-Net or a multi-scale detection method, etc., to obtain the target object contained in the current video frame.
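The detectors named above require trained models, so they are not reproduced here; the following minimal sketch only illustrates the surrounding step of turning a detector's output into the target objects contained in the current video frame by filtering on confidence. The detection record format, function name, and threshold are illustrative assumptions, not from the patent.

```python
def targets_from_detections(detections: list, score_threshold: float = 0.5) -> list:
    """Keep the labels of detected objects whose confidence meets the threshold;
    these labels serve as the target objects of the current video frame."""
    return [d["label"] for d in detections if d["score"] >= score_threshold]

# Hypothetical detector output for one video frame.
frame_detections = [
    {"label": "person", "score": 0.92},
    {"label": "dog", "score": 0.31},
]
```

For example, `targets_from_detections(frame_detections)` keeps only `"person"` with the default threshold.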
  • the specified editing information may be information associated with a specified editing operation on the current video frame.
  • the specified editing operation may be used to realize a specified adjustment of the current video frame performed by the user or automatically by the terminal device, for example, zooming the current video frame or adjusting the focus object in the current video frame, etc.
  • the specified editing operation may include an image scaling operation that does not meet a preset condition (for example, the current image scaling factor corresponding to the image scaling operation does not meet the preset condition); and/or, the specified editing operation may include an image focus operation, where after the image focus operation is performed, the first focus object in the current video frame and the second focus object in the video frame preceding the current video frame are different.
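The two triggers described above can be sketched as simple predicates: an image zoom factor falling outside a preset interval, or a change of focus object relative to the preceding frame. This is a hedged illustration only; the function names, the preset interval, and its default value are assumptions, not from the patent.

```python
def zoom_triggers_edit(current_zoom: float,
                       preset_interval: tuple = (1.0, 1.0)) -> bool:
    """True if the current image zoom factor does not meet the preset condition,
    i.e. falls outside the preset interval."""
    low, high = preset_interval
    return not (low <= current_zoom <= high)

def focus_triggers_edit(first_focus: str, second_focus: str) -> bool:
    """True if the focus object of the current frame differs from that of
    the preceding video frame."""
    return first_focus != second_focus

def specified_editing_detected(current_zoom: float,
                               first_focus: str, second_focus: str) -> bool:
    """Either trigger counts as specified editing information being detected."""
    return zoom_triggers_edit(current_zoom) or focus_triggers_edit(first_focus, second_focus)
```

With the default interval, a zoom factor of exactly 1 (no zoom) does not trigger, while any other factor does.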
  • acquiring the target object contained in the current video frame includes:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • when the current image zoom factor does not meet the preset condition, it may be considered that the user has performed image zoom processing on the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame captured by the camera of the terminal device.
  • the field of view of the current video frame has also changed compared to the original video frame. Therefore, the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame.
  • the specified object may be a focused object in the current video frame, an object selected by a user, one or more target objects detected by target detection, and the like.
  • the first focus object in the current video frame can be used as at least part of the target object.
  • Step S102, according to the specified editing information and at least one target audio component, process the initial audio corresponding to the current video frame to obtain target audio, wherein any of the target audio components is the audio component corresponding to a target object in the initial audio.
  • the specific form of the target audio component may be set according to an actual scene.
  • the target audio component may be a sub-audio identified and extracted from the initial audio in advance; or, a specific frequency band in the initial audio may be pre-identified as the target audio component.
  • the initial audio can be processed by algorithms such as the Mel cepstrum algorithm, trained recurrent neural networks and/or convolutional neural networks, etc., so that the audio component of a specific audio object can be identified and separated from the initial audio; if the specific audio object is any of the target objects, the corresponding audio component can be used as the target audio component.
  • target frequency bands corresponding to each target object may be determined and identified as the target audio components.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method further includes:
  • the first neural network can identify audio feature information such as waveform features and frequency features of different audio objects from audio, so as to extract and separate audio components of the audio objects.
  • the first neural network may be a recurrent neural network such as a bidirectional recurrent neural network (Bidirectional RNN, Bi-RNN), a long short-term memory network (Long Short-Term Memory networks, LSTM).
  • the first neural network may be pre-trained according to the first training data set corresponding to the first neural network.
  • the first training data set may include multiple training audios and labels corresponding to each training audio, and the labels may include the audio objects.
  • acquiring the target object contained in the current video frame includes:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, wherein the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap.
  • the second neural network may adopt a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster-RCNN, Single Shot MultiBox Detector (SSD), Retina-Net, or a multi-scale detection method, etc.
  • the second training data set corresponding to the second neural network includes a plurality of training images and labels corresponding to each training image.
  • the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap, so that the target object identified in the current video frame by the second neural network can be associated with the audio object identified in the initial audio by the first neural network. Therefore, through the first neural network and the second neural network, the association between objects, images, and audio is realized, so that when the image in the video changes its field of view due to zooming, the corresponding audio can be adjusted appropriately according to the change of the field of view, thereby avoiding the problem in the prior art that the visual presentation effect after image adjustment does not match the sound presentation effect of the audio.
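The label-overlap requirement can be illustrated with plain set operations: only labels present in both training data sets let a visually detected target object be matched to an audio component. The label sets and component names below are invented for illustration and are not from the patent.

```python
# Labels of the first (audio) and second (image) training data sets — assumed.
audio_labels = {"person", "dog", "car", "wind"}       # first training data set
image_labels = {"person", "dog", "car", "turtle"}     # second training data set

# Overlapping labels are the ones usable for image-to-audio association.
shared_labels = audio_labels & image_labels

def audio_component_for(target_object: str, components: dict):
    """Return the audio component matched to a visually detected target object,
    or None when the label has no counterpart in the audio network's label set."""
    if target_object in shared_labels:
        return components.get(target_object)
    return None
```

Labels like "wind" (audio only) or "turtle" (image only) fall outside the overlap, matching the observation below that the two label sets may differ in part.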
  • Table 1 shows an exemplary association setting manner of the labels in the first training data set and the second training data set.
  • a plurality of training audios in the first training data set, i.e., training audio a, training audio b, training audio c, and training audio d, and a plurality of training images in the second training data set, i.e., training image A, training image B, training image C, and training image D, are stored.
  • the label corresponding to the training audio a is the same as the label corresponding to the training image A
  • the label corresponding to the training audio b is the same as the label corresponding to the training image B
  • the label corresponding to the training audio c is the same as the label corresponding to the training image C
  • the label corresponding to the training audio d is the same as the label corresponding to the training image D.
  • some labels may only have audio but no images. For example, wind has only corresponding training audio but no corresponding training images, and objects such as turtles that generally do not make sounds may have only corresponding training images but no corresponding training audio. Therefore, there may be some differences between the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method includes:
  • target frequency bands corresponding to each target object are identified in the initial audio, and the target frequency bands are used as target audio components of the corresponding target objects.
  • the preset object-frequency band mapping table may pre-store the mapping relationship between each object and its frequency band. Therefore, by querying the object-frequency band mapping table, the target frequency band corresponding to the target object can be determined and identified in the initial audio.
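A minimal sketch of this idea: a mapping table from object to frequency band, plus an FFT mask that isolates the mapped band from the initial audio as the target audio component. The table entries, band edges, and function names are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Hypothetical preset object-frequency band mapping table (Hz).
OBJECT_BAND_HZ = {
    "person": (85.0, 255.0),    # rough speech-fundamental range — assumed
    "dog":    (400.0, 2000.0),  # illustrative
}

def extract_band(audio: np.ndarray, sr: int, obj: str) -> np.ndarray:
    """Keep only the frequency band mapped to `obj` in the initial audio,
    zeroing every other frequency bin; the result acts as the target
    audio component for that object."""
    lo, hi = OBJECT_BAND_HZ[obj]
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    mask = (freqs >= lo) & (freqs <= hi)
    return np.fft.irfft(spectrum * mask, n=len(audio))
```

For instance, applied to a mixture of a 150 Hz and a 1000 Hz tone, the "person" band keeps the 150 Hz content and suppresses the 1000 Hz content.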
  • processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio includes:
  • Step S201, obtain the original video frame corresponding to the current video frame according to the specified editing information;
  • Step S202, detect the first object contained in the original video frame;
  • Step S203, compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
  • Step S204, process the initial audio corresponding to the current video frame according to the comparison result to obtain the target audio.
  • the original video frame corresponding to the current video frame may be acquired, so as to determine the processing of the initial audio according to the content change between the original video frame and the current video frame.
  • the original video frame may be a video frame collected and displayed by a terminal device through a camera, and the current video frame is obtained after editing the original video frame according to the specified editing information. Therefore, by comparing the target object included in the current video frame with the first object included in the original video frame, the content that the user wants to present prominently can be better specified with reference to the original video frame, thereby processing the initial audio in a more targeted manner.
  • for example, if the target object of the current video frame includes a person, and the first object in the original video frame includes a person and a dog, the audio intensity of the target audio component identified as the person in the initial audio can be increased according to the specified editing information, and the audio intensity of the audio component identified as the dog can be reduced, to better match the image display effect in the current video frame.
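The person/dog example can be sketched as a per-component gain adjustment driven by the comparison result: components of objects kept in view are boosted and components of objects that left the frame are attenuated. The function name and the gain values are assumptions for illustration.

```python
import numpy as np

def adjust_components(components: dict, kept: set, dropped: set,
                      boost: float = 1.5, cut: float = 0.5) -> dict:
    """Scale each named audio component according to the comparison result:
    boost components of objects still in the current video frame, attenuate
    components of objects no longer in it, leave the rest unchanged."""
    out = {}
    for name, samples in components.items():
        if name in kept:
            out[name] = samples * boost
        elif name in dropped:
            out[name] = samples * cut
        else:
            out[name] = samples.copy()
    return out
```

With the example above, `adjust_components(comps, kept={"person"}, dropped={"dog"})` raises the person component and lowers the dog component.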
  • the initial audio corresponding to the current video frame is processed to obtain the target audio, including:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • the adjustment range for the loudness of the target audio component in the initial audio may be determined according to the specified editing information.
  • the specified editing information includes the current image zoom factor
  • the loudness of the target audio component in the initial audio can be determined from the current image zoom factor according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
  • in this way, the spatial relationship among the scene and objects that the user perceives from the current video frame can be made to match the spatial relationship perceived from the target audio, so that the user has a better experience.
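The predetermined correspondence between the image zoom factor and the loudness adjustment factor mentioned above can be sketched as a lookup table with linear interpolation between breakpoints. The breakpoint values here are assumptions chosen purely for illustration:

```python
# Illustrative zoom-factor -> loudness-factor correspondence table.
import bisect

ZOOM_BREAKPOINTS = [1.0, 1.5, 2.0, 3.0]   # current image zoom factor
LOUDNESS_FACTORS = [1.0, 1.4, 1.8, 2.5]   # gain applied to the target component

def loudness_factor(zoom):
    """Map a zoom factor to a loudness adjustment factor (clamped at the ends)."""
    if zoom <= ZOOM_BREAKPOINTS[0]:
        return LOUDNESS_FACTORS[0]
    if zoom >= ZOOM_BREAKPOINTS[-1]:
        return LOUDNESS_FACTORS[-1]
    i = bisect.bisect_right(ZOOM_BREAKPOINTS, zoom) - 1
    z0, z1 = ZOOM_BREAKPOINTS[i], ZOOM_BREAKPOINTS[i + 1]
    f0, f1 = LOUDNESS_FACTORS[i], LOUDNESS_FACTORS[i + 1]
    # linear interpolation between the two nearest breakpoints
    return f0 + (f1 - f0) * (zoom - z0) / (z1 - z0)
```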
  • Step S103 associate the current video frame with the target audio to obtain a target video.
  • the current video frame and the target audio may be associated by means of a time stamp, so that the current video frame and the target audio can be synchronized during playback. And, the current video frame and the target audio can be merged into a file of a specific video format, that is, the target video can be obtained.
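Associating the current video frame with the target audio by time stamp, as described above, might look like the following sketch. The data structures are illustrative assumptions standing in for a real container or muxing format:

```python
# Hedged sketch: pair each video frame with the audio chunk closest in time.
from dataclasses import dataclass

@dataclass
class Frame:
    pts: float          # presentation timestamp in seconds
    pixels: bytes

@dataclass
class AudioChunk:
    pts: float
    samples: bytes

def associate(frames, chunks, tolerance=0.02):
    """Return (frame, chunk) pairs; chunk is None if nothing is close enough."""
    paired = []
    for f in frames:
        best = min(chunks, key=lambda c: abs(c.pts - f.pts))
        paired.append((f, best if abs(best.pts - f.pts) <= tolerance else None))
    return paired
```

A real implementation would then hand the synchronized pairs to a muxer to produce a file in a specific video format.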
  • the video processing method may include:
  • Step S301 if it is detected that the current image zoom factor of the current video frame does not meet the preset condition, then the specified object contained in the current video frame is used as at least part of the target object;
  • Step S302 process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any of the target audio components is an audio component corresponding to a target object in the initial audio;
  • Step S303 associate the current video frame with the target audio to obtain a target video.
  • the preset condition may refer to that the current image zoom factor belongs to a preset interval or is equal to a preset multiple value (eg, equal to 1), or the like.
  • if the current image zoom factor does not meet a preset condition, it may be considered that the user has performed image zoom processing on the current video frame, so that the current video frame is downscaled or upscaled relative to the original video frame captured by the camera of the terminal device.
  • the field of view of the current video frame has also changed compared to the original video frame. Therefore, the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame.
  • the specified object may be the focused object in the current video frame, an object selected by the user, one or more objects detected by target detection, or the like.
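Step S301 above — checking the preset condition and collecting the specified objects — can be sketched as follows, assuming (as one possibility the text allows) that the preset condition is "zoom factor equal to 1 within a small tolerance". The interval bounds and argument names are illustrative:

```python
# Hedged sketch of Step S301: if the zoom was changed, the specified objects
# in the current frame become (at least part of) the target objects.
def zoom_meets_preset(zoom, lo=0.99, hi=1.01):
    """Preset condition assumed here: zoom factor effectively equal to 1."""
    return lo <= zoom <= hi

def select_target_objects(zoom, focused, user_selected, detected):
    if zoom_meets_preset(zoom):
        return set()
    # the specified object may be the focused object, a user-selected object,
    # or objects found by target detection
    return set(focused) | set(user_selected) | set(detected)
```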
  • the video processing method may include:
  • Step S401 if it is detected that the first focus object in the current video frame is different from the second focus object in the previous video frame of the current video frame, then the first focus object in the current video frame is used as at least part of the target object;
  • Step S402 process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any of the target audio components is an audio component corresponding to a target object in the initial audio;
  • Step S403 associate the current video frame with the target audio to obtain a target video.
  • it may be determined that the focus object of the current video frame has changed relative to the previous video frame.
  • the first focus object in the current video frame is the content that the user wants to highlight in the current video frame. Therefore, the first focus object in the current video frame can be used as at least part of the target object.
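The focus-change test of Step S401 can be sketched as a small helper; the argument names are illustrative assumptions:

```python
# Hedged sketch of Step S401: when the focus object differs from the one in
# the previous frame, take the new focus object as (part of) the target object.
def focus_targets(current_focus, previous_focus):
    if current_focus is not None and current_focus != previous_focus:
        return {current_focus}
    return set()
```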
  • the video processing method may include:
  • Step S501 if it is detected that the current image zoom factor of the current video frame does not meet the preset condition, and that the first focus object in the current video frame is different from the second focus object in the previous video frame of the current video frame, then the specified object contained in the current video frame and the first focus object in the current video frame are used as at least part of the target object;
  • Step S502 process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any of the target audio components is an audio component corresponding to a target object in the initial audio;
  • Step S503 associate the current video frame with the target audio to obtain a target video.
  • the target object may be determined according to both the current image zoom factor of the current video frame and the change of the first focus object in the current video frame relative to the second focus object of the previous video frame.
  • the target object determined from the current image zoom factor and the target object determined from the focus-object change may be the same object. Therefore, the processing of the target audio component corresponding to the target object can be determined jointly by combining the current image zoom factor and the first focus object, so that the sound effect achieved by the corresponding target audio is more suitable for the visual effect presented in the current video frame.
  • FIG. 6(a) shows the original video frame corresponding to the current video frame.
  • the first objects included in the original video frame include cows, cats and dogs.
  • the output of the first neural network includes a first audio component corresponding to the cow, a second audio component corresponding to the cat, and a third audio component corresponding to the dog.
  • the volume level of the first audio component is a
  • the volume level of the second audio component is b
  • the volume level of the third audio component is c.
  • the user can zoom in on the original video frame through a specific screen operation gesture.
  • the current image zoom factor is greater than 1, it can be considered that the current image zoom factor of the current video frame does not meet the preset condition.
  • the field of view of the current video frame obtained after the enlargement process will be smaller than the field of view of the original video frame.
  • if the zoom factor of the current image is 1.5 and the target object in the current video frame does not include the cow, the image parts of the cat and the dog occupy a larger proportion of the current video frame than of the original video frame. Accordingly, the volume levels of the second audio component b and the third audio component c can be increased, and the volume level of the first audio component a can be decreased, according to the current image zoom factor, so as to obtain the target audio.
  • the zoom factor of the current image is further increased to 2
  • the target object in the current video frame does not include a cow
  • the proportion of the image occupied by the cat and dog parts is larger still; therefore, the volume levels of the second audio component b and the third audio component c can be further increased, and the volume level of the first audio component a can be further reduced, according to the current image zoom factor, to obtain the target audio.
  • as the image of a target object is enlarged, the corresponding target audio component can be adjusted accordingly, so that it is also highlighted and enhanced as the image grows. This better simulates the user gradually approaching the target object, making it easier for the user to feel immersed in the scene and thereby improving the user experience.
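The numeric example above (FIG. 6, volume levels a, b, c) can be walked through in code. The gain curve below — attenuating the cow's component by 1/zoom and boosting the cat's and dog's components by the zoom factor — is an assumed mapping for illustration, not the one the application prescribes:

```python
# Hedged walk-through of the FIG. 6 example: the cow leaves the field of view
# as the zoom factor grows, so its component (a) is attenuated while the cat
# and dog components (b, c) are boosted.
def zoom_gains(a, b, c, zoom):
    """Return adjusted volume levels (cow, cat, dog) for a given zoom factor."""
    if zoom <= 1.0:                 # preset condition met: leave audio alone
        return a, b, c
    boost = zoom                    # assumed: gain grows with the enlargement
    cut = 1.0 / zoom
    return a * cut, b * boost, c * boost

a1, b1, c1 = zoom_gains(1.0, 1.0, 1.0, 1.5)  # zoom 1.5: cow quieter, cat/dog louder
a2, b2, c2 = zoom_gains(1.0, 1.0, 1.0, 2.0)  # zoom 2.0: further attenuated / boosted
```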
  • in the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame may be acquired, where the target object may be regarded as the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame and the target audio are associated to obtain the target video. Here, any target audio component is an audio component corresponding to a target object in the initial audio. Therefore, according to the specified editing information, the target audio components of the target objects in the initial audio, among other parts, can be adjusted accordingly, so that the sound effect achieved by the target audio better fits the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • FIG. 7 shows a structural block diagram of a video processing apparatus provided by the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown.
  • the video processing device 7 includes:
  • an acquisition module 701 configured to acquire the target object contained in the current video frame if the specified editing information of the current video frame is detected
  • a processing module 702, configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one of the target audio components is an audio component corresponding to a target object in the initial audio;
  • An association module 703, configured to associate the current video frame with the target audio to obtain a target video.
  • the video processing device 7 further includes:
  • a second processing module, configured to input the initial audio into the trained first neural network and obtain an output result of the trained first neural network, where the output result includes the identified audio objects and the audio component corresponding to each audio object;
  • a comparison module, configured to compare the audio objects with the target objects, and if at least one audio object is the same as a target object, determine that the initial audio corresponding to the current video frame contains at least one target audio component corresponding to a target object.
  • the obtaining module 701 is specifically used for:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, where the labels in the first training data set corresponding to the first neural network at least partially overlap with the labels in the second training data set corresponding to the second neural network.
  • the video processing device 7 further includes:
  • the third processing module is configured to identify target frequency bands corresponding to each target object in the initial audio according to a preset object-frequency band mapping table, and use the target frequency band as a target audio component of the corresponding target object.
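The preset object-frequency band mapping table used by the third processing module can be sketched as a dictionary of assumed band edges, with the target audio component extracted by an FFT band mask. The band edges and labels are illustrative assumptions, and the mask is a simplification of whatever filtering a real implementation would use:

```python
# Hedged sketch: map object labels to assumed frequency bands and extract the
# corresponding target audio component with an FFT band mask.
import numpy as np

OBJECT_BANDS_HZ = {"dog": (500.0, 2000.0), "cow": (80.0, 300.0)}

def extract_band(signal, sample_rate, obj):
    """Keep only the frequency band mapped to `obj` in the mono signal."""
    lo, hi = OBJECT_BANDS_HZ[obj]
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= lo) & (freqs <= hi)
    return np.fft.irfft(spectrum * mask, n=len(signal))
```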
  • the obtaining module 701 is specifically used for:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • processing module 702 specifically includes:
  • a first obtaining unit configured to obtain the original video frame corresponding to the current video frame according to the specified editing information
  • a detection unit configured to detect the first object contained in the original video frame
  • a comparison unit for comparing the target object contained in the current video frame and the first object contained in the original video frame to obtain a comparison result
  • the second processing unit is configured to process the initial audio corresponding to the current video frame according to the comparison result to obtain the target audio.
  • processing module 702 is specifically used for:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • in the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame may be acquired, where the target object may be regarded as the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame and the target audio are associated to obtain the target video. Here, any target audio component is an audio component corresponding to a target object in the initial audio. Therefore, according to the specified editing information, the target audio components of the target objects in the initial audio, among other parts, can be adjusted accordingly, so that the sound effect achieved by the target audio better fits the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 8 of this embodiment includes: at least one processor 80 (only one is shown in FIG. 8), a memory 81, and a computer program 82 stored in the memory 81 and runnable on the at least one processor 80; when the processor 80 executes the computer program 82, the steps in any of the above video processing method embodiments are implemented.
  • the above-mentioned terminal device 8 may be a computing device such as a server, a mobile phone, a wearable device, an augmented reality (AR)/virtual reality (VR) device, a desktop computer, a notebook computer, or a handheld computer.
  • the terminal device may include, but is not limited to, a processor 80 and a memory 81 .
  • FIG. 8 is only an example of the terminal device 8 and does not constitute a limitation on the terminal device 8; the device may include more or fewer components than shown, combine certain components, or use different components. For example, it may also include input devices, output devices, network access devices, and so on.
  • the above-mentioned input devices may include keyboards, touchpads, fingerprint collection sensors (for collecting user's fingerprint information and fingerprint direction information), microphones, cameras, etc.
  • output devices may include displays, speakers, and the like.
  • the above-mentioned processor 80 may be a central processing unit (Central Processing Unit, CPU), and the processor 80 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the above-mentioned memory 81 may be an internal storage unit of the above-mentioned terminal device 8 , such as a hard disk or a memory of the terminal device 8 .
  • in other embodiments, the above-mentioned memory 81 may also be an external storage device of the above-mentioned terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart media card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), etc.
  • the above-mentioned memory 81 may also include both an internal storage unit of the above-mentioned terminal device 8 and an external storage device.
  • the above-mentioned memory 81 is used to store an operating system, an application program, a boot loader (Boot Loader), data, and other programs, such as program codes of the above-mentioned computer programs, and the like.
  • the above-mentioned memory 81 can also be used to temporarily store data that has been output or is to be output.
  • the above-mentioned terminal device 8 may also include a network connection module, such as a Bluetooth module, a Wi-Fi module, a cellular network module, and the like, which will not be repeated here.
  • when the above-mentioned processor 80 executes the above-mentioned computer program 82 to implement the steps in any of the above video processing method embodiments: if the specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be regarded as the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video; here, any target audio component is an audio component corresponding to a target object in the initial audio. Therefore, according to the specified editing information, the target audio components of the target objects in the initial audio can be adjusted accordingly, so that the sound effect of the target audio better fits
  • the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • Embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the foregoing method embodiments.
  • if the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may be completed by instructing the relevant hardware through a computer program.
  • the above-mentioned computer program may be stored in a computer-readable storage medium, and when executed by a processor, the steps of the foregoing method embodiments may be implemented.
  • the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code form, executable file or some intermediate form.
  • the above-mentioned computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.
  • in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
  • the disclosed apparatus/device and method may be implemented in other manners.
  • the apparatus/equipment embodiments described above are only illustrative.
  • the division of the above modules or units is only a logical function division; in actual implementation, there may be other division manners.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be implemented through some interfaces, or through indirect coupling or communication connection of devices or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

Abstract

Provided are a video processing method and apparatus, and a terminal device and a computer-readable storage medium. The method comprises: if specified editing information of the current video frame is detected, acquiring a target object included in the current video frame; according to the specified editing information and at least one target audio component, processing initial audio that corresponds to the current video frame, so as to obtain target audio; and associating the current video frame with the target audio to obtain a target video.

Description

Video processing method, apparatus, terminal device and computer-readable storage medium
This application claims priority to the Chinese patent application No. 202010710645.X, filed with the Chinese Patent Office on July 22, 2020 and entitled "Video processing method, video processing apparatus and terminal device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of video processing, and in particular relates to a video processing method, apparatus, terminal device, and computer-readable storage medium.
Background
In the process of video recording or video editing, a user may adjust some video frames by zooming the image, changing the focus object, and so on, to highlight certain content.
Summary of the Invention
Embodiments of the present application provide a video processing method, an apparatus, a terminal device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present application provides a video processing method, including:
if specified editing information of the current video frame is detected, acquiring a target object contained in the current video frame;
processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, where any target audio component is an audio component corresponding to a target object in the initial audio;
associating the current video frame with the target audio to obtain a target video.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including:
an acquisition module, configured to acquire a target object contained in the current video frame if specified editing information of the current video frame is detected;
a processing module, configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, where any target audio component is an audio component corresponding to a target object in the initial audio;
an association module, configured to associate the current video frame with the target audio to obtain a target video.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, a display, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the video processing method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video processing method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the video processing method of the first aspect.
Description of Drawings
FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of step S102 provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another video processing method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of yet another video processing method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of still another video processing method provided by an embodiment of the present application;
FIG. 6 is an exemplary schematic diagram of obtaining target audio provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed Description
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are provided for a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that, as used in this specification and the appended claims, the term "and/or" refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined" or "in response to determining" or "once the [described condition or event] is detected" or "in response to detecting the [described condition or event]".
References in this specification to "one embodiment", "some embodiments" and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in further embodiments" and the like appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
基于上述技术问题,本申请实施例提供一种视频处理方法,其中,包括:Based on the above technical problems, an embodiment of the present application provides a video processing method, which includes:
若检测到当前视频帧的指定编辑信息,则获取所述当前视频帧中包含的目标对象;If the specified editing information of the current video frame is detected, then the target object contained in the current video frame is obtained;
根据所述指定编辑信息和至少一个目标音频分量,对所述当前视频帧所对应的初始音频进行处理,获得目标音频,其中,任一所述目标音频分量为一个所述目标对象在所述初始音频中所对应的音频分量;According to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain target audio, wherein any target audio component is a target object in the initial audio The corresponding audio component in the audio;
将所述当前视频帧和所述目标音频进行关联,获得目标视频。Associating the current video frame with the target audio to obtain a target video.
本申请实施例与现有技术相比存在的有益效果是:本申请实施例中,可以若检测到当前视频帧的指定编辑信息,则获取所述当前视频帧中包含的目标对象,其中,所述目标对象可以认为是所想要突出显示的内容;然后,根据所述指定编辑信息和至少一个目标音频分量,对所述当前视频帧所对应的初始音频进行处理,获得目标音频,并将所述当前视频帧和所述目标音频进行关联,获得目标视频;此时,任一所述目标音频分量为一个所述目标对象在所述初始音频中所对应的音频分量,因此可以根据所述指定编辑信息,对所述初始视频中关于所述目标对象的目标音频分量等部分进行相应的调整,从而使得相应的目标音频的所达到的声音效果更贴合所述当前视频帧中所呈现的视觉效果,以提升所述目标视频的呈现效果,改善用户的体验。Compared with the prior art, the beneficial effect of the embodiment of the present application is: in the embodiment of the present application, if the specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, wherein the The target object can be considered as the content you want to highlight; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the The current video frame and the target audio are associated to obtain the target video; at this time, any of the target audio components is an audio component corresponding to the target object in the initial audio, so it can be specified according to the specified Editing information to make corresponding adjustments to the target audio component and other parts of the target object in the initial video, so that the achieved sound effect of the corresponding target audio is more suitable for the visual presented in the current video frame. effect, so as to enhance the presentation effect of the target video and improve the user experience.
In an optional embodiment of the present application, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method further includes:
inputting the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including identified audio objects and the audio component corresponding to each audio object;
comparing the audio objects with the target objects, and if at least one audio object is identical to a target object, determining that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
In an optional embodiment of the present application, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
performing target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein labels in a first training data set corresponding to the first neural network at least partially overlap with labels in a second training data set corresponding to the second neural network.
In an optional embodiment of the present application, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method includes:
identifying, in the initial audio according to a preset object-frequency band mapping table, the target frequency band corresponding to each target object, and using each target frequency band as the target audio component of the corresponding target object.
In an optional embodiment of the present application, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, using a specified object contained in the current video frame as at least part of the target object;
and/or,
if it is detected that a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame, using the first focus object in the current video frame as at least part of the target object.
In an optional embodiment of the present application, processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio includes:
acquiring, according to the specified editing information, the original video frame corresponding to the current video frame;
detecting a first object contained in the original video frame;
comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
processing, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
In an optional embodiment of the present application, processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio includes:
adjusting, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
In an optional embodiment of the present application, adjusting the loudness of the target audio component in the initial audio according to the specified editing information includes:
if the specified editing information includes a current image zoom factor, determining the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
In an optional embodiment of the present application, the target audio component is a sub-audio identified and extracted from the initial audio in advance; or
the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
An embodiment of the present application further provides a video processing apparatus, including:
an acquisition module, configured to acquire the target object contained in the current video frame if specified editing information of the current video frame is detected;
a processing module, configured to process, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain target audio, wherein each target audio component is the audio component corresponding to one target object in the initial audio;
an association module, configured to associate the current video frame with the target audio to obtain a target video.
In an optional embodiment of the present application, the video processing apparatus further includes:
a second processing module, configured to input the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including identified audio objects and the audio component corresponding to each audio object;
a comparison module, configured to compare the audio objects with the target objects, and if at least one audio object is identical to a target object, determine that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
In an optional embodiment of the present application, the acquisition module is configured to:
perform target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein labels in a first training data set corresponding to the first neural network at least partially overlap with labels in a second training data set corresponding to the second neural network.
In an optional embodiment of the present application, the video processing apparatus further includes:
a third processing module, configured to identify, in the initial audio according to a preset object-frequency band mapping table, the target frequency band corresponding to each target object, and use each target frequency band as the target audio component of the corresponding target object.
In an optional embodiment of the present application, the acquisition module is configured to:
if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, use a specified object contained in the current video frame as at least part of the target object;
and/or,
if it is detected that a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame, use the first focus object in the current video frame as at least part of the target object.
In an optional embodiment of the present application, the video processing apparatus further includes:
a first acquisition unit, configured to acquire, according to the specified editing information, the original video frame corresponding to the current video frame;
a detection unit, configured to detect a first object contained in the original video frame;
a comparison unit, configured to compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
a second processing unit, configured to process, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
In an optional embodiment of the present application, the processing module is configured to:
adjust, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
In an optional embodiment of the present application, the processing module is configured to:
if the specified editing information includes a current image zoom factor, determine the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
In an optional embodiment of the present application, the target audio component is a sub-audio identified and extracted from the initial audio in advance; or the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video processing method described in the above embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video processing method described in the above embodiments of the present application.
The video processing method provided by the embodiments of the present application can be applied to terminal devices such as servers, desktop computers, mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of the present application place no restriction on the specific type of terminal device.
FIG. 1 shows a flowchart of a video processing method provided by an embodiment of the present application; the video processing method can be applied to a terminal device.
At present, during video recording or video editing, a user may adjust some video frames to highlight a certain scene or object; for example, the user may adjust the image zoom factor of a video frame, thereby changing its field of view, or may change the focus object. In the prior art, however, after a video frame is adjusted, the audio corresponding to the video frame usually remains the original audio. The prior art thus fails to recognize that, after a video frame is adjusted, the spatial relationship between the scene and the objects that the user perceives from the video frame may have changed, whereas the sound presented in the original audio is still the sound captured under the spatial relationship presented before the frame was scaled. As a result, after a video frame is adjusted, the visual presentation of the image and the sound presentation of the audio may no longer match, and the resulting video is poorly presented.
Through the embodiments of the present application, in contrast, the target audio component and other parts of the initial audio relating to the target object can be adjusted according to the specified editing information, so that the obtained target audio changes with the specified editing operation on the current video frame. The sound effect achieved by the corresponding target audio therefore better matches the visual effect presented in the current video frame, improving the presentation of the target video and providing the user with a more immersive viewing experience.
Specifically, as shown in FIG. 1, the video processing method may include:
Step S101: if specified editing information of the current video frame is detected, acquire the target object contained in the current video frame.
In the embodiments of the present application, the current video frame may be a video frame captured in real time during video capture, or a video frame extracted from the video to be edited during video editing. Of course, in other application scenarios the current video frame may be obtained in other ways.
In the embodiments of the present application, the target object in the current video frame may be identified in a variety of ways. For example, target detection may be performed on the current video frame using a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster R-CNN, Single Shot MultiBox Detector (SSD), RetinaNet, or a multi-scale detection method, to obtain the target object contained in the current video frame.
In the embodiments of the present application, the specified editing information may be information associated with a specified editing operation on the current video frame. The specified editing operation may be a specified adjustment of the current video frame performed by the user or performed automatically by the terminal device, for example, zooming the current video frame or changing the focus object in the current video frame. For example, the specified editing operation may include an image zoom operation that does not satisfy a preset condition (for example, the current image zoom factor corresponding to the image zoom operation does not satisfy the preset condition); and/or the specified editing operation may include an image focus operation after which a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame.
In some embodiments, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, using a specified object contained in the current video frame as at least part of the target object;
and/or,
if it is detected that a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame, using the first focus object in the current video frame as at least part of the target object.
In the embodiments of the present application, in some cases, when the current image zoom factor does not satisfy the preset condition, it can be considered that the user has zoomed the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame captured by the camera of the terminal device. The field of view of the current video frame has then changed relative to the original video frame, so the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame. The specified object may be the focus object in the current video frame, an object selected by the user, one or more target objects found by target detection, or the like. If, on the other hand, it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, it can be determined that the focus object of the current video frame has changed relative to the preceding frame. In that case, the first focus object in the current video frame can be considered the content the user wants to highlight, and can therefore be used as at least part of the target object.
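The two trigger conditions above can be illustrated with a small selection routine. This is an illustrative sketch only, not part of the claimed method; the zoom range used as the preset condition, the function name, and the data shapes are all assumptions.

```python
def select_target_objects(zoom_factor, focus_object, prev_focus_object,
                          specified_objects, zoom_range=(0.95, 1.05)):
    """Collect target objects per the two triggers (hypothetical interface).

    zoom_range plays the role of the preset condition: zoom factors inside it
    are treated as 'no meaningful zoom' and do not trigger selection.
    """
    targets = []
    # Trigger 1: the zoom factor falls outside the preset condition, so the
    # specified objects (focus object, user-selected objects, or detected
    # objects) become at least part of the target objects.
    low, high = zoom_range
    if not (low <= zoom_factor <= high):
        targets.extend(specified_objects)
    # Trigger 2: the focus object changed relative to the preceding frame, so
    # the new focus object becomes at least part of the target objects.
    if focus_object is not None and focus_object != prev_focus_object:
        targets.append(focus_object)
    # Deduplicate while preserving order.
    seen = set()
    return [t for t in targets if not (t in seen or seen.add(t))]
```

Both triggers may fire on the same frame, which is why the routine merges and deduplicates rather than choosing one branch.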
Step S102: process, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain target audio, wherein each target audio component is the audio component corresponding to one target object in the initial audio.
In the embodiments of the present application, the specific form of the target audio component may be set according to the actual scenario. For example, the target audio component may be a sub-audio identified and extracted from the initial audio in advance; alternatively, a specific frequency band in the initial audio may be pre-identified as the target audio component.
The target audio component may likewise be identified in various ways, which are not limited here. For example, the initial audio may be processed with a Mel cepstrum algorithm, a trained recurrent neural network, and/or a convolutional neural network, so that the audio component of an identified audio object is separated from the initial audio and marked; if that audio object is one of the target objects, the corresponding audio component can serve as the target audio component. Alternatively, according to a pre-stored mapping between frequency bands and objects, the target frequency band corresponding to each target object may be determined and marked in the initial audio to serve as the target audio component.
In the embodiments of the present application, after the target object contained in the current video frame is acquired, the target audio component and other parts of the initial audio relating to the target object can be adjusted accordingly, so that the obtained target audio changes with the current video frame.
In some embodiments, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method further includes:
inputting the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including identified audio objects and the audio component corresponding to each audio object;
comparing the audio objects with the target objects, and if at least one audio object is identical to a target object, determining that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
In the embodiments of the present application, through pre-training, the first neural network can identify audio feature information of different audio objects, such as waveform features and frequency features, so as to extract and separate the audio component of each audio object. For example, the first neural network may be a recurrent neural network such as a bidirectional recurrent neural network (Bi-RNN) or a long short-term memory network (LSTM).
In the embodiments of the present application, the first neural network may be pre-trained on the first training data set corresponding to the first neural network. The first training data set may include multiple training audio clips and the label corresponding to each training audio clip, where the labels may include the audio objects.
In some embodiments, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
performing target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein labels in the first training data set corresponding to the first neural network at least partially overlap with labels in a second training data set corresponding to the second neural network.
In the embodiments of the present application, for example, the second neural network may be based on a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster R-CNN, Single Shot MultiBox Detector (SSD), RetinaNet, or a multi-scale detection method.
In the embodiments of the present application, the second training data set corresponding to the second neural network includes multiple training images and the label corresponding to each training image. The labels in the first training data set corresponding to the first neural network at least partially overlap with the labels in the second training data set corresponding to the second neural network, so that the target object identified by the second neural network in the current video frame can be associated with the audio object identified by the first neural network in the initial audio. The first neural network and the second neural network thus establish an association among objects, images, and audio, so that when zooming changes the field of view of the images in a video, the corresponding audio can be adjusted accordingly, avoiding the prior-art mismatch between the visual presentation of the adjusted image and the sound presentation of the audio.
Table 1 shows an exemplary way of associating the labels of the first training data set and the second training data set.
Table 1:

Label      Training image    Training audio
person     A                 a
cat        B                 b
dog        C                 c
vehicle    D                 d
…          …                 …
A specific database stores the multiple training audio clips of the first training data set, namely training audio a, training audio b, training audio c, training audio d, and so on, as well as the multiple training images of the second training data set, namely training image A, training image B, training image C, training image D, and so on. The label of training audio a is the same as the label of training image A, the label of training audio b is the same as the label of training image B, the label of training audio c is the same as the label of training image C, and the label of training audio d is the same as the label of training image D. In some embodiments, some labels may have only audio and no image; for example, wind has corresponding training audio but no corresponding training image, while an object that generally makes no sound, such as a turtle, may have corresponding training images but no corresponding training audio. The labels in the first training data set corresponding to the first neural network may therefore differ somewhat from the labels in the second training data set corresponding to the second neural network.
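The label correspondence described above can be sketched as a minimal lookup: a visual label found by the second network selects the audio component carrying the same label, and labels present in only one modality (wind, turtle) simply produce no match. The dictionary contents and names below are illustrative assumptions, not data from the application.

```python
# Hypothetical label-keyed outputs of the two networks, mirroring Table 1.
AUDIO_COMPONENTS = {"person": "a", "cat": "b", "dog": "c",
                    "vehicle": "d", "wind": "w"}  # 'wind' is audio-only

def match_target_components(detected_objects, audio_components):
    """Keep only the audio components whose label was also detected visually;
    image-only labels (e.g. 'turtle') yield no match."""
    return {label: audio_components[label]
            for label in detected_objects if label in audio_components}
```

Because matching is done purely by shared label strings, the two networks can be trained and updated independently as long as their label vocabularies stay aligned.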
In some embodiments, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method includes:
identifying, in the initial audio according to a preset object-frequency band mapping table, the target frequency band corresponding to each target object, and using each target frequency band as the target audio component of the corresponding target object.
In the embodiments of the present application, the preset object-frequency band mapping table may pre-store the mapping between objects and frequency bands; by querying the object-frequency band mapping table, the target frequency band corresponding to each target object can therefore be determined and marked in the initial audio.
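A minimal sketch of the mapping-table query follows. The band limits are invented for illustration and are not taken from the application; a real table would be tuned per device and scene.

```python
# Assumed preset object-frequency band mapping table (Hz); values are
# illustrative placeholders only.
OBJECT_BAND_TABLE = {
    "person":  (85, 3400),   # roughly the range of human speech
    "dog":     (250, 4500),
    "vehicle": (30, 1500),
}

def identify_target_bands(target_objects, table=OBJECT_BAND_TABLE):
    """Look up and mark the target frequency band of each target object;
    objects absent from the table yield no target audio component."""
    return {obj: table[obj] for obj in target_objects if obj in table}
```

The returned bands would then be used to mark (e.g. band-pass select) the corresponding regions of the initial audio as target audio components.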
In some embodiments, processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio includes:
Step S201: acquiring, according to the specified editing information, the original video frame corresponding to the current video frame;
Step S202: detecting a first object contained in the original video frame;
Step S203: comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
Step S204: processing, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
In this embodiment of the present application, the original video frame corresponding to the current video frame may be acquired, so that the processing of the initial audio can be determined according to the content change between the original video frame and the current video frame.

Here, the original video frame may be a video frame captured by a camera of the terminal device and displayed, while the current video frame is obtained by editing the original video frame according to the specified editing information. Therefore, by comparing the target object contained in the current video frame with the first object contained in the original video frame, the content that the user wants to highlight can be determined more precisely with reference to the original video frame, so that the initial audio can be processed in a more targeted manner.

For example, in some cases the target object of the current video frame includes a person, while the first objects in the original video frame include a person and a dog. By comparing the target object with the first objects, it can be inferred that the user wants to highlight the person in the current video frame. Therefore, when the corresponding initial audio is processed, the audio intensity of the target audio component identified as the person may be increased according to the specified editing information, while the audio intensity of the audio component identified as the dog is decreased, so as to better match the image displayed in the current video frame.
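The comparison logic of steps S201–S204 can be sketched as set operations over the detected object labels; the gain values and function names below are illustrative assumptions, not specified by the application:

```python
# Illustrative sketch of steps S201-S204: objects still present in the edited
# (current) frame are boosted, objects that appear only in the original frame
# are attenuated. The boost/cut factors are assumptions for illustration.

def adjust_gains(target_objects, first_objects, boost=1.5, cut=0.5):
    """Return a per-object gain from the comparison result: boost objects kept
    in the current frame, cut objects that were edited out of it."""
    kept = set(target_objects) & set(first_objects)
    dropped = set(first_objects) - set(target_objects)
    gains = {obj: boost for obj in kept}
    gains.update({obj: cut for obj in dropped})
    return gains

# The person/dog example from the text: the dog was edited out of the frame.
gains = adjust_gains(target_objects=["person"], first_objects=["person", "dog"])
```

Each returned gain would then be applied to the matching object's audio component when producing the target audio.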
Processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio includes:

adjusting, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.

In this embodiment of the present application, the adjustment applied to the loudness of the target audio component in the initial audio may be determined according to the specified editing information. For example, if the specified editing information includes a current image zoom factor, the loudness of the target audio component in the initial audio may be determined from the current image zoom factor according to a predetermined correspondence between image zoom factors and loudness adjustment factors. If, instead, the specified editing information includes information on a focus-object switch, the loudness of the target audio component in the initial audio may be determined according to information such as the distance and the relative sizes between the second image region of the second focus object before the switch and the first image region of the first focus object after the switch.

By adjusting the loudness of the target audio component in the initial audio, the spatial relationship among the scene and objects that the user perceives from the current video frame can be matched with the spatial relationship that the user perceives from the target audio, giving the user a better experience.
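One plausible form of the "predetermined correspondence" between zoom factor and loudness is a fixed decibel change per doubling of zoom. The application does not fix the rule, so the mapping below is only an assumed example:

```python
import math

# The application only states that a predetermined correspondence between image
# zoom factor and loudness adjustment exists; the fixed-dB-per-doubling rule
# below is an assumed instance of such a correspondence.

def zoom_to_gain(zoom_factor, db_per_doubling=6.0):
    """Each doubling of the zoom raises the target component by a fixed number
    of decibels; a zoom factor of 1.0 maps to unit gain."""
    db = db_per_doubling * math.log2(zoom_factor)
    return 10.0 ** (db / 20.0)

def apply_gain(samples, gain):
    """Scale the target audio component's samples by the computed gain."""
    return [s * gain for s in samples]
```

With these assumptions, zooming from 1x to 2x raises the highlighted component by 6 dB (roughly doubling its amplitude), which tracks the object appearing about twice as large on screen.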
Step S103: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, the current video frame and the target audio may be associated by means of timestamps or the like, so that the current video frame and the target audio remain synchronized during playback. Furthermore, the current video frame and the target audio may be merged into a file in a specific video format, thereby obtaining the target video.
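A minimal sketch of the timestamp association (muxing into an actual video container is out of scope here; the type and function names are hypothetical):

```python
from dataclasses import dataclass

# Sketch of step S103: pair each processed video frame with its target audio
# through a shared timestamp so that playback stays synchronized.

@dataclass
class AVPair:
    timestamp_ms: int
    frame: bytes
    audio: bytes

def associate(frames, audios):
    """frames/audios are lists of (timestamp_ms, payload); entries sharing a
    timestamp are paired, which is the 'association' the method describes."""
    audio_by_ts = dict(audios)
    return [AVPair(ts, f, audio_by_ts[ts])
            for ts, f in frames if ts in audio_by_ts]

pairs = associate([(0, b"f0"), (33, b"f1")], [(0, b"a0"), (33, b"a1")])
```

In a real implementation the paired stream would then be written to a container format, which is what the text means by merging into a file of a specific video format.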
On the basis of the foregoing embodiments, as an optional embodiment of the present application, referring to FIG. 3, the video processing method may include:

Step S301: if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, using a specified object contained in the current video frame as at least part of the target object;

Step S302: processing, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

Step S303: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, for example, the preset condition may be that the current image zoom factor falls within a preset interval or equals a preset value (for example, 1). In some cases, when the current image zoom factor does not satisfy the preset condition, it can be assumed that the user has applied image scaling to the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame captured by the camera of the terminal device. At this point, the field of view of the current video frame has also changed relative to the original video frame. Therefore, the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame. The specified object may be the focus object of the current video frame, an object selected by the user, one or more target objects detected by object detection, or the like.
On the basis of the foregoing embodiments, as an optional embodiment of the present application, referring to FIG. 4, the video processing method may include:

Step S401: if it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, using the first focus object in the current video frame as at least part of the target object;

Step S402: processing, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

Step S403: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, if it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, it can be determined that the focus object of the current video frame has changed relative to the preceding frame. In this case, the first focus object in the current video frame can be regarded as the content that the user wants to highlight, and may therefore be used as at least part of the target object.
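The trigger condition of step S401 can be sketched as follows; the naming follows the application ("first focus object" belongs to the current frame, "second" to the preceding frame), while the function itself is an illustrative assumption:

```python
# Sketch of the step S401 trigger: when the focus object changes between
# consecutive frames, the new focus object joins the target-object set.

def targets_on_focus_change(first_focus, second_focus, current_targets=()):
    """first_focus: focus object of the current frame; second_focus: focus
    object of the preceding frame. Returns the updated target-object list."""
    targets = list(current_targets)
    if first_focus != second_focus and first_focus not in targets:
        targets.append(first_focus)
    return targets
```

If the focus has not moved, the target set is left unchanged, so no extra audio adjustment is triggered by this path.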
On the basis of the foregoing embodiments, as an optional embodiment of the present application, referring to FIG. 5, the video processing method may include:

Step S501: if it is detected that the current image zoom factor of the current video frame does not satisfy the preset condition, and that the first focus object in the current video frame differs from the second focus object in the preceding video frame, using the specified object contained in the current video frame and the first focus object in the current video frame as at least part of the target object;

Step S502: processing, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

Step S503: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, the target object may be determined according to both the current image zoom factor of the current video frame and the change of the first focus object in the current video frame relative to the second focus object of the preceding video frame. The target object determined from the current image zoom factor and the target object determined from the focus change may be the same object. Therefore, the processing of the target audio component corresponding to such a target object may be determined jointly from the current image zoom factor and the first focus object, so that the sound effect achieved by the resulting target audio better matches the visual effect presented in the current video frame.
A specific implementation of the embodiments of the present application is described below with a concrete example.

For example, FIG. 6(a) shows the original video frame corresponding to the current video frame. By performing object detection on the original video frame with the second neural network, it can be detected that the first objects contained in the original video frame include a cow, a cat, and a dog.

As shown in FIG. 6(b), when the initial audio is fed into the first neural network, the output of the first neural network includes a first audio component corresponding to the cow, a second audio component corresponding to the cat, and a third audio component corresponding to the dog. The volume of the first audio component is a, the volume of the second audio component is b, and the volume of the third audio component is c.

The user may enlarge the original video frame through a specific on-screen gesture. In this case the current image zoom factor is greater than 1, so the current image zoom factor of the current video frame can be considered not to satisfy the preset condition. The field of view of the current video frame obtained after the enlargement is smaller than that of the original video frame.

As shown in FIG. 6(c), if the current image zoom factor is 1.5 and the target objects of the current video frame no longer include the cow, the image regions of the cat and the dog in the current video frame appear larger than in the original video frame. Therefore, according to the current image zoom factor, the second audio component b and the third audio component c may be increased, and the first audio component a may be decreased, to obtain the target audio.

As shown in FIG. 6(d), if the current image zoom factor further increases to 2 and the target objects of the current video frame still do not include the cow, the cat and the dog occupy an even larger proportion of the image. Therefore, according to the current image zoom factor, the second audio component b and the third audio component c may be further increased, and the first audio component a further decreased, to obtain the target audio.

It can be seen that, in this example, as the current image zoom factor increases and the images corresponding to the target objects gradually become larger, the target audio components can be adjusted accordingly, so that they are strengthened along with the highlighted images. This better simulates the experience of gradually approaching the target objects, making it easier for the user to feel immersed and thereby improving the user experience.

It should be noted that the above example is merely an illustrative description of the embodiments of the present application and does not limit them.
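Under an assumed linear gain rule (the application does not fix one), the FIG. 6 walk-through can be reproduced numerically; all values and names below are illustrative:

```python
# Worked sketch of the FIG. 6 example: as the zoom factor grows, the cat and
# dog components are boosted and the cow component, whose object has left the
# frame, is cut. The proportional gain rule is an illustrative assumption.

def rescale_components(volumes, in_frame, zoom):
    """volumes: dict object -> volume. Objects still in the frame scale up
    with the zoom factor; objects outside it scale down by the same factor."""
    return {obj: (v * zoom if obj in in_frame else v / zoom)
            for obj, v in volumes.items()}

v0 = {"cow": 1.0, "cat": 1.0, "dog": 1.0}            # a, b, c at zoom 1
v15 = rescale_components(v0, in_frame={"cat", "dog"}, zoom=1.5)  # FIG. 6(c)
v20 = rescale_components(v0, in_frame={"cat", "dog"}, zoom=2.0)  # FIG. 6(d)
```

Monotonically increasing the in-frame gains with the zoom factor is what makes the audio track the gradual "approach" described in the example.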
In the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be regarded as the content to be highlighted. Then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video. Here, each target audio component is the audio component corresponding to one target object in the initial audio. Therefore, the parts of the initial audio related to the target object, such as its target audio component, can be adjusted according to the specified editing information, so that the sound effect achieved by the resulting target audio better matches the visual effect presented in the current video frame, enhancing the presentation of the target video and improving the user experience.

It should be understood that the numbering of the steps in the above embodiments does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the video processing method described in the above embodiments, FIG. 7 shows a structural block diagram of a video processing apparatus provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 7, the video processing apparatus 7 includes:

an acquisition module 701, configured to acquire, if specified editing information of the current video frame is detected, the target object contained in the current video frame;

a processing module 702, configured to process, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

an association module 703, configured to associate the current video frame with the target audio to obtain a target video.
Optionally, the video processing apparatus 7 further includes:

a second processing module, configured to feed the initial audio into a trained first neural network and obtain an output of the trained first neural network, the output including the recognized audio objects and the audio component corresponding to each audio object;

a comparison module, configured to compare the audio objects with the target object; if at least one audio object is identical to the target object, it is determined that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.

Optionally, the acquisition module 701 is specifically configured to:

perform target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, where the labels of the first training data set corresponding to the first neural network at least partially overlap with the labels of the second training data set corresponding to the second neural network.
Optionally, the video processing apparatus 7 further includes:

a third processing module, configured to identify, according to the preset object-to-frequency-band mapping table, the target frequency band corresponding to each target object in the initial audio, and use each target frequency band as the target audio component of the corresponding target object.

Optionally, the acquisition module 701 is specifically configured to:

if it is detected that the current image zoom factor of the current video frame does not satisfy the preset condition, use the specified object contained in the current video frame as at least part of the target object;

and/or,

if it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, use the first focus object in the current video frame as at least part of the target object.
Optionally, the processing module 702 specifically includes:

a first acquisition unit, configured to acquire, according to the specified editing information, the original video frame corresponding to the current video frame;

a detection unit, configured to detect the first object contained in the original video frame;

a comparison unit, configured to compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;

a second processing unit, configured to process, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.

Optionally, the processing module 702 is specifically configured to:

adjust, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
In the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be regarded as the content to be highlighted. Then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video. Here, each target audio component is the audio component corresponding to one target object in the initial audio, so the parts of the initial audio related to the target object can be adjusted according to the specified editing information, making the sound effect of the target audio better match the visual effect presented in the current video frame, enhancing the presentation of the target video and improving the user experience.

It should be noted that, since the information exchange and execution processes among the above apparatuses/units are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the above division of functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to accomplish all or some of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of mutual distinction and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 8, the terminal device 8 of this embodiment includes: at least one processor 80 (only one is shown in FIG. 8), a memory 81, and a computer program 82 stored in the memory 81 and executable on the at least one processor 80. When the processor 80 executes the computer program 82, the steps in any of the above video processing method embodiments are implemented.

The terminal device 8 may be a computing device such as a server, a mobile phone, a wearable device, an augmented reality (AR)/virtual reality (VR) device, a desktop computer, a laptop, or a palmtop computer. The terminal device may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art will understand that FIG. 8 is merely an example of the terminal device 8 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, it may further include input devices, output devices, and network access devices. The input devices may include a keyboard, a touchpad, a fingerprint sensor (for collecting the user's fingerprint information and fingerprint orientation information), a microphone, a camera, and the like; the output devices may include a display, a speaker, and the like.

The processor 80 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.

In some embodiments, the memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8. In other embodiments, the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used to store the operating system, application programs, a boot loader, data, and other programs, such as the program code of the above computer program. The memory 81 may also be used to temporarily store data that has been output or is to be output.

In addition, although not shown, the terminal device 8 may further include network connection modules, such as a Bluetooth module, a Wi-Fi module, and a cellular network module, which are not described in detail here.

In this embodiment of the present application, when the processor 80 executes the computer program 82 to implement the steps in any of the above video processing method embodiments, the target object contained in the current video frame can be acquired if specified editing information of the current video frame is detected, where the target object can be regarded as the content to be highlighted. Then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video. Each target audio component is the audio component corresponding to one target object in the initial audio, so the parts of the initial audio related to the target object can be adjusted according to the specified editing information, making the sound effect of the target audio better match the visual effect presented in the current video frame, enhancing the presentation of the target video and improving the user experience.
本申请实施例还提供了一种计算机可读存储介质,上述计算机可读存储介质存储有计算机程序,上述计算机程序被处理器执行时实现可实现上述各个方法实施例中的步骤。Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
本申请实施例提供了一种计算机程序产品,当计算机程序产品在终端设备上运行时,使得终端设备执行时实现可实现上述各个方法实施例中的步骤。The embodiments of the present application provide a computer program product, when the computer program product runs on a terminal device, so that the terminal device can implement the steps in the foregoing method embodiments when executed.
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关的硬件来完成,上述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,上述计算机程序包括计算机程序代码,上述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。上述计算机可读介质至少可以包括:能够将计算机程序代码携带到拍照装置/终端设备的任何实体或装置、记录介质、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。在某些司法管辖区,根据立法和专利实践,计算机可读介质不可以是电载波信号和电信信号。If the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present application realizes all or part of the processes in the methods of the above-mentioned embodiments, which can be completed by instructing the relevant hardware through a computer program. The above-mentioned computer program can be stored in a computer-readable storage medium, and the computer program is in When executed by the processor, the steps of the foregoing method embodiments can be implemented. Wherein, the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code form, executable file or some intermediate form. The above-mentioned computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory ( RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media. For example, U disk, mobile hard disk, disk or CD, etc. In some jurisdictions, under legislation and patent practice, computer readable media may not be electrical carrier signals and telecommunications signals.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/device and method may be implemented in other manners. For example, the apparatus/device embodiments described above are merely illustrative; for instance, the division into the above modules or units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The foregoing embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A video processing method, comprising:
    if specified editing information of a current video frame is detected, acquiring a target object contained in the current video frame;
    processing initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein each target audio component is an audio component corresponding to one target object in the initial audio; and
    associating the current video frame with the target audio to obtain a target video.
  2. The video processing method according to claim 1, wherein, before processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio, the method further comprises:
    inputting the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including recognized audio objects and the audio component corresponding to each audio object; and
    comparing the audio objects with the target object, and, if at least one audio object is the same as the target object, determining that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
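The separation-and-matching step of claim 2 can be sketched as follows. This is an illustrative sketch only, not the patented implementation: `separate` stands in for the trained first neural network, and the object names and sample values are invented for illustration.

```python
def separate(initial_audio):
    # Stand-in for the trained first neural network: in this sketch it
    # simply returns pre-labelled audio objects and their components.
    return {"person": [0.2, 0.5], "dog": [0.1, -0.3]}

def find_target_components(initial_audio, target_objects):
    """Keep only the components whose audio object matches a target
    object detected in the current video frame."""
    components = separate(initial_audio)
    return {obj: comp for obj, comp in components.items()
            if obj in target_objects}

# "dog" appears both as a recognized audio object and as a target object,
# so the initial audio contains at least one target audio component.
matches = find_target_components(initial_audio=None,
                                 target_objects={"dog", "car"})
```

A non-empty `matches` corresponds to the claim's determination that a target audio component exists in the initial audio.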
  3. The video processing method according to claim 2, wherein, if the specified editing information of the current video frame is detected, acquiring the target object contained in the current video frame comprises:
    performing target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein the labels in a first training data set corresponding to the first neural network at least partially overlap the labels in a second training data set corresponding to the second neural network.
  4. The video processing method according to claim 1, wherein, before processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio, the method comprises:
    identifying, in the initial audio according to a preset object-to-frequency-band mapping table, the target frequency band corresponding to each target object, and taking each target frequency band as the target audio component of the corresponding target object.
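A minimal sketch of the object-to-frequency-band lookup in claim 4. The table entries (object names and band edges in Hz) are invented for illustration; in practice they would come from the preset mapping table.

```python
# Hypothetical preset object-to-frequency-band mapping table (Hz).
OBJECT_BANDS = {
    "person": (85, 1100),
    "dog": (450, 1800),
}

def label_target_bands(target_objects):
    """Identify, for each target object, the frequency band that is
    labelled in the initial audio as its target audio component."""
    return {obj: OBJECT_BANDS[obj]
            for obj in target_objects if obj in OBJECT_BANDS}
```

Objects absent from the table simply yield no target audio component.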
  5. The video processing method according to claim 1, wherein, if the specified editing information of the current video frame is detected, acquiring the target object contained in the current video frame comprises:
    if it is detected that a current image zoom factor of the current video frame does not satisfy a preset condition, taking a specified object contained in the current video frame as at least part of the target object;
    and/or
    if it is detected that a first focus object in the current video frame differs from a second focus object in a video frame preceding the current video frame, taking the first focus object in the current video frame as at least part of the target object.
  6. The video processing method according to claim 1, wherein processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio comprises:
    acquiring, according to the specified editing information, an original video frame corresponding to the current video frame;
    detecting a first object contained in the original video frame;
    comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result; and
    processing, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
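The comparison in claim 6 amounts to a set comparison between the objects of the original frame and those of the edited current frame. The sketch below assumes detection has already produced the two object sets; the function name is illustrative, not from the patent.

```python
def diff_objects(current_objects, original_objects):
    """Return (removed, kept): objects that the edit cropped out of the
    current frame, and objects still visible in it."""
    removed = set(original_objects) - set(current_objects)
    kept = set(original_objects) & set(current_objects)
    return removed, kept
```

The audio components of `removed` objects could then, for example, be attenuated or muted when producing the target audio.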
  7. The video processing method according to any one of claims 1 to 6, wherein processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio comprises:
    adjusting, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
  8. The video processing method according to claim 7, wherein adjusting the loudness of the target audio component in the initial audio according to the specified editing information comprises:
    if the specified editing information includes a current image zoom factor, determining the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
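Claim 8's predetermined correspondence between image zoom factor and loudness adjustment factor can be sketched as a lookup table. The gain values below are invented for illustration; a real device would calibrate the correspondence.

```python
# Hypothetical zoom-factor-to-gain table; the values are illustrative only.
ZOOM_TO_GAIN = {1.0: 1.0, 2.0: 1.5, 4.0: 2.0}

def adjust_loudness(component, zoom_factor):
    """Scale the target audio component by the gain mapped to the current
    image zoom factor, falling back to unity gain for unlisted factors."""
    gain = ZOOM_TO_GAIN.get(zoom_factor, 1.0)
    return [sample * gain for sample in component]
```

Zooming in on an object thus makes its audio component louder while the rest of the mix is left unchanged.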
  9. The video processing method according to any one of claims 1 to 6, wherein the target audio component is a sub-audio recognized and extracted from the initial audio in advance; or
    the target audio component is obtained by pre-labelling a specific frequency band in the initial audio.
  10. A video processing apparatus, comprising:
    an acquisition module, configured to acquire, if specified editing information of a current video frame is detected, a target object contained in the current video frame;
    a processing module, configured to process initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein each target audio component is an audio component corresponding to one target object in the initial audio; and
    an association module, configured to associate the current video frame with the target audio to obtain a target video.
  11. The video processing apparatus according to claim 10, further comprising:
    a second processing module, configured to input the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including recognized audio objects and the audio component corresponding to each audio object; and
    a comparison module, configured to compare the audio objects with the target object, and, if at least one audio object is the same as the target object, determine that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
  12. The video processing apparatus according to claim 11, wherein the acquisition module is configured to:
    perform target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein the labels in a first training data set corresponding to the first neural network at least partially overlap the labels in a second training data set corresponding to the second neural network.
  13. The video processing apparatus according to claim 10, further comprising:
    a third processing module, configured to identify, in the initial audio according to a preset object-to-frequency-band mapping table, the target frequency band corresponding to each target object, and take each target frequency band as the target audio component of the corresponding target object.
  14. The video processing apparatus according to claim 10, wherein the acquisition module is configured to:
    if it is detected that a current image zoom factor of the current video frame does not satisfy a preset condition, take a specified object contained in the current video frame as at least part of the target object;
    and/or
    if it is detected that a first focus object in the current video frame differs from a second focus object in a video frame preceding the current video frame, take the first focus object in the current video frame as at least part of the target object.
  15. The video processing apparatus according to claim 10, further comprising:
    a first acquisition unit, configured to acquire, according to the specified editing information, an original video frame corresponding to the current video frame;
    a detection unit, configured to detect a first object contained in the original video frame;
    a comparison unit, configured to compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result; and
    a second processing unit, configured to process, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
  16. The video processing apparatus according to any one of claims 10 to 15, wherein the processing module is configured to:
    adjust, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
  17. The video processing apparatus according to claim 16, wherein the processing module is configured to:
    if the specified editing information includes a current image zoom factor, determine the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
  18. The video processing apparatus according to any one of claims 10 to 15, wherein the target audio component is a sub-audio recognized and extracted from the initial audio in advance, or the target audio component is obtained by pre-labelling a specific frequency band in the initial audio.
  19. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video processing method according to any one of claims 1 to 9.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video processing method according to any one of claims 1 to 9.
PCT/CN2021/097743 2020-07-22 2021-06-01 Video processing method and apparatus, and terminal device and computer-readable storage medium WO2022017006A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010710645.XA CN111818385B (en) 2020-07-22 2020-07-22 Video processing method, video processing device and terminal equipment
CN202010710645.X 2020-07-22

Publications (1)

Publication Number Publication Date
WO2022017006A1 true WO2022017006A1 (en) 2022-01-27

Family

ID=72861861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097743 WO2022017006A1 (en) 2020-07-22 2021-06-01 Video processing method and apparatus, and terminal device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111818385B (en)
WO (1) WO2022017006A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111818385B (en) * 2020-07-22 2022-08-09 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment
CN112188260A (en) * 2020-10-26 2021-01-05 咪咕文化科技有限公司 Video sharing method, electronic device and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009200959A (en) * 2008-02-22 2009-09-03 Sony Corp Data editing apparatus, data editing method, program and storage medium
CN105611404A (en) * 2015-12-31 2016-05-25 北京东方云图科技有限公司 Method and device for automatically adjusting audio volume according to video application scenes
CN105657538A (en) * 2015-12-31 2016-06-08 北京东方云图科技有限公司 Method and device for synthesizing video file by mobile terminal
CN107241646A (en) * 2017-07-12 2017-10-10 北京奇虎科技有限公司 The edit methods and device of multimedia video
US9794632B1 (en) * 2016-04-07 2017-10-17 Gopro, Inc. Systems and methods for synchronization based on audio track changes in video editing
CN107967706A (en) * 2017-11-27 2018-04-27 腾讯音乐娱乐科技(深圳)有限公司 Processing method, device and the computer-readable recording medium of multi-medium data
CN108307127A (en) * 2018-01-12 2018-07-20 广州市百果园信息技术有限公司 Method for processing video frequency and computer storage media, terminal
CN111818385A (en) * 2020-07-22 2020-10-23 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100542129B1 (en) * 2002-10-28 2006-01-11 한국전자통신연구원 Object-based three dimensional audio system and control method
JP2011234139A (en) * 2010-04-28 2011-11-17 Sharp Corp Three-dimensional audio signal generating device
US20140152530A1 (en) * 2012-12-03 2014-06-05 Honeywell International Inc. Multimedia near to eye display system
CN105512348B (en) * 2016-01-28 2019-03-26 北京旷视科技有限公司 For handling the method and apparatus and search method and device of video and related audio
CN107135419A (en) * 2017-06-14 2017-09-05 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN107493442A (en) * 2017-07-21 2017-12-19 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN109040641B (en) * 2018-08-30 2020-10-16 维沃移动通信有限公司 Video data synthesis method and device
CN109857905B (en) * 2018-11-29 2022-03-15 维沃移动通信有限公司 Video editing method and terminal equipment
CN109815844A (en) * 2018-12-29 2019-05-28 西安天和防务技术股份有限公司 Object detection method and device, electronic equipment and storage medium
CN109761118A (en) * 2019-01-15 2019-05-17 福建天眼视讯网络科技有限公司 Wisdom ladder networking control method and system based on machine vision
CN110544270A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 method and device for predicting human face tracking track in real time by combining voice recognition


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449354A (en) * 2022-02-07 2022-05-06 上海幻电信息科技有限公司 Video editing method and system
CN114449354B (en) * 2022-02-07 2023-12-08 上海幻电信息科技有限公司 Video editing method and system

Also Published As

Publication number Publication date
CN111818385B (en) 2022-08-09
CN111818385A (en) 2020-10-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21846810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21846810

Country of ref document: EP

Kind code of ref document: A1