CN111818385B - Video processing method, video processing device and terminal equipment - Google Patents

Video processing method, video processing device and terminal equipment

Info

Publication number
CN111818385B
CN111818385B (application CN202010710645.XA)
Authority
CN
China
Prior art keywords
video frame
target
audio
current video
editing information
Prior art date
Legal status
Active
Application number
CN202010710645.XA
Other languages
Chinese (zh)
Other versions
CN111818385A
Inventor
崔志佳
范泽华
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010710645.XA
Publication of CN111818385A
Priority to PCT/CN2021/097743
Application granted
Publication of CN111818385B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams

Abstract

The application provides a video processing method, which comprises the following steps: if specified editing information of a current video frame is detected, acquiring a target object contained in the current video frame; processing initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one target audio component is an audio component corresponding to one target object in the initial audio; and associating the current video frame with the target audio to obtain a target video. This method addresses the problem that, after a video frame is adjusted to highlight certain content, the visual presentation effect of the image no longer matches the sound presentation effect of the audio, which degrades the presentation effect of the obtained video.

Description

Video processing method, video processing device and terminal equipment
Technical Field
The present application belongs to the field of video processing technologies, and in particular, to a video processing method, a video processing apparatus, a terminal device, and a computer-readable storage medium.
Background
During video recording or video editing, a user may adjust some video frames by zooming images, adjusting focusing objects, and the like to highlight certain contents.
However, at present, after a video frame is adjusted to highlight certain content, the adjusted frame often still uses the original audio, so that the visual presentation effect of the image does not match the sound presentation effect of the audio, resulting in a poor presentation effect of the obtained video.
Disclosure of Invention
The embodiments of the application provide a video processing method, a video processing apparatus, a terminal device, and a computer-readable storage medium, which can solve the problem that, after video frames are adjusted to highlight certain content, the visual presentation effect of the images does not match the sound presentation effect of the audio, resulting in a poor presentation effect of the obtained videos.
In a first aspect, an embodiment of the present application provides a video processing method, including:
if the specified editing information of the current video frame is detected, acquiring a target object contained in the current video frame;
processing initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one target audio component is an audio component corresponding to one target object in the initial audio;
and associating the current video frame with the target audio to obtain a target video.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including:
the acquisition module is used for acquiring a target object contained in the current video frame if the specified editing information of the current video frame is detected;
a processing module, configured to process an initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio, where any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio;
and the association module is used for associating the current video frame with the target audio to obtain a target video.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, a display, and a computer program stored in the memory and executable on the processor, where the processor implements the video processing method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the video processing method according to the first aspect.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the video processing method described in the first aspect.
Compared with the prior art, the embodiments of the application have the following advantages: in the embodiment of the present application, if the specified editing information of the current video frame is detected, a target object included in the current video frame may be acquired, where the target object may be regarded as the content that is desired to be highlighted; then, the initial audio corresponding to the current video frame is processed according to the specified editing information and at least one target audio component to obtain a target audio, and the current video frame is associated with the target audio to obtain a target video. Since any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio, the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information, so that the sound effect achieved by the corresponding target audio better fits the visual effect presented in the current video frame, which improves the presentation effect of the target video and the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those of ordinary skill in the art from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a video processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of step S102 according to an embodiment of the present application;
fig. 3 is a schematic flow chart of another video processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another video processing method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another video processing method according to an embodiment of the present application;
FIG. 6 is an exemplary diagram of obtaining target audio provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The video processing method provided by the embodiments of the application can be applied to terminal devices such as a server, a desktop computer, a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a Personal Digital Assistant (PDA); the embodiments of the application do not limit the specific types of the terminal devices.
Fig. 1 shows a flowchart of a video processing method provided in an embodiment of the present application, where the video processing method can be applied to a terminal device.
Currently, during video recording or video editing, a user may adjust some video frames to highlight a certain scene or a certain object, for example, by adjusting the image zoom factor of the video frames to change their field of view, or by adjusting the focused object. However, in the prior art, after the video frame is adjusted, the audio corresponding to the video frame still usually follows the original audio. The prior art does not take into account that, after the video frame is adjusted, the spatial relationship between scenes and objects that the user perceives from the video frame may have changed, while the sound presented in the original audio is still the sound collected based on the spatial relationship presented before the video frame was scaled. Therefore, at present, after the video frame is adjusted, the visual presentation effect of the image and the sound presentation effect of the audio may be mismatched, resulting in a poor presentation effect of the obtained video.
Through the embodiments of the application, parts of the initial audio such as the target audio component of the target object can be adjusted correspondingly according to the specified editing information. The obtained target audio then changes along with the specified editing operation on the current video frame, so that the sound effect achieved by the corresponding target audio better fits the visual effect presented in the current video frame, improving the presentation effect of the target video and providing the user with a more immersive viewing experience.
Specifically, as shown in fig. 1, the video processing method may include:
step S101, if the specified editing information of the current video frame is detected, the target object contained in the current video frame is obtained.
In the embodiment of the application, the current video frame may be a video frame acquired in real time during video capture, or a video frame extracted from a video to be edited during video editing. Of course, in other application scenarios, the current video frame may also be obtained in other ways.
In the embodiment of the present application, there may be a plurality of methods for identifying the target object in the current video frame. For example, target detection may be performed on the current video frame by a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Fast-RCNN, Single Shot MultiBox Detector (SSD), RetinaNet, or a multi-scale detection method, to obtain the target object contained in the current video frame.
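For illustration only, the following minimal sketch shows how such target detection may be carried out with a pretrained detector; the use of the torchvision library, the particular Faster R-CNN model, and the score threshold are assumptions made for this example and are not required by the embodiments of the present application.

```python
# Illustrative sketch only: detect the objects contained in the current video
# frame with a pretrained detector. torchvision's Faster R-CNN is assumed here;
# SSD, SPPNet or another detector mentioned above could be used instead.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_target_objects(frame_rgb, score_threshold=0.5):
    """Return (label_id, box, score) for each object detected in the frame."""
    with torch.no_grad():
        prediction = detector([to_tensor(frame_rgb)])[0]
    results = []
    for box, label, score in zip(prediction["boxes"],
                                 prediction["labels"],
                                 prediction["scores"]):
        if float(score) >= score_threshold:
            results.append((int(label), box.tolist(), float(score)))
    return results
```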
In this embodiment of the present application, the specified editing information may be information associated with a specified editing operation on the current video frame. The specified editing operation may be used to implement a specified adjustment of the current video frame, performed by a user or automatically by the terminal device, for example, zooming the current video frame or adjusting the focusing object in the current video frame. For example, the specified editing operation may include an image scaling operation that does not satisfy a preset condition (for example, the current image zoom factor corresponding to the image scaling operation does not satisfy the preset condition); and/or the specified editing operation may include an image focusing operation after which a first focusing object in the current video frame and a second focusing object in the previous video frame of the current video frame are different.
In some embodiments, the obtaining the target object included in the current video frame if the specified editing information of the current video frame is detected includes:
if the zoom factor of the current image of the current video frame is detected not to meet the preset condition, using a designated object contained in the current video frame as at least part of the target object;
and/or the presence of a gas in the gas,
and if the first focusing object in the current video frame is detected to be different from the second focusing object in the previous video frame of the current video frame, taking the first focusing object in the current video frame as at least part of the target object.
In this embodiment of the application, in some cases, when the current image zoom factor does not satisfy the preset condition, it may be considered that the user has performed image zoom processing on the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame acquired by the camera of the terminal device. At this time, the field of view of the current video frame is also changed compared with the original video frame, and therefore, the content desired to be highlighted in the current video frame can be determined by acquiring the designated object contained in the current video frame. The designated object may be a focusing object in the current video frame, a user-selected object, one or more target objects detected by target detection, and the like. In addition, if it is detected that the first focusing object in the current video frame differs from the second focusing object in the previous video frame of the current video frame, it may be determined that the focusing object of the current video frame has changed relative to the previous video frame. At this time, the first focusing object in the current video frame may be considered the content that the user wants to highlight, and therefore the first focusing object in the current video frame may be at least part of the target object.
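For illustration only, the following sketch expresses this selection logic; the field names of the editing information and the preset condition used here (a zoom factor equal to 1) are assumptions made for the example.

```python
# Illustrative sketch only: select the target objects from the specified
# editing information. The field names of `editing_info` and the preset
# condition (zoom factor equal to 1) are assumptions made for this example.
def select_target_objects(editing_info, designated_objects):
    """designated_objects: e.g. the focusing object, a user-selected object,
    or the objects detected in the current video frame."""
    targets = []
    zoom = editing_info.get("zoom_factor", 1.0)
    if zoom != 1.0:
        # current image zoom factor does not satisfy the preset condition
        targets.extend(designated_objects)
    first_focus = editing_info.get("focus_object")
    second_focus = editing_info.get("previous_focus_object")
    if first_focus is not None and first_focus != second_focus:
        # the focusing object changed relative to the previous video frame
        if first_focus not in targets:
            targets.append(first_focus)
    return targets
```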
Step S102, according to the specified editing information and at least one target audio frequency component, processing the initial audio frequency corresponding to the current video frame to obtain the target audio frequency, wherein any one target audio frequency component is the audio frequency component corresponding to one target object in the initial audio frequency.
In the embodiment of the present application, the specific form of the target audio component may be set according to an actual scene. For example, the target audio component may be a sub audio previously identified and extracted from the initial audio; alternatively, a specific frequency band in the initial audio may be identified in advance as the target audio component.
There may be various ways to identify the target audio component from the initial audio corresponding to the current video frame, which is not limited here. For example, the initial audio may be processed through a mel-frequency cepstrum algorithm, a trained recurrent neural network (RNN), and/or a convolutional neural network, so as to separate and identify the audio component of a specific audio object from the initial audio; if the specific audio object is any one of the target objects, the corresponding audio component may be used as the target audio component. In addition, according to a mapping relationship between pre-stored frequency bands and objects, a target frequency band corresponding to each target object may be determined and identified in the initial audio as the target audio component.
In this embodiment of the application, after the target object contained in the current video frame is acquired, the target audio component of the target object and other parts of the initial audio may be adjusted correspondingly, so that the obtained target audio changes along with the change of the current video frame.
In some embodiments, before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio, the method further includes:
inputting the initial audio into the trained first neural network, and obtaining an output result of the trained first neural network, wherein the output result comprises the recognized audio objects and audio components corresponding to the audio objects respectively;
and comparing the audio object with the target object, and if at least one audio object is identical to the target object, determining that a target audio component corresponding to at least one target object exists in the initial audio corresponding to the current video frame.
In the embodiment of the application, through pre-training, the first neural network can identify audio characteristic information, such as waveform characteristics and frequency characteristics, of different audio objects from audio, so as to extract and separate the audio components of the audio objects. Illustratively, the first neural network may be a Bidirectional Recurrent Neural Network (Bi-RNN), a Long Short-Term Memory network (LSTM), or another recurrent neural network.
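For illustration only, a minimal sketch of one possible form of such a first neural network is given below; the spectrogram dimensions, the number of audio objects, and the mask-based separation head are assumptions made for the example and are not mandated by the embodiments of the present application.

```python
# Illustrative sketch only: a possible first neural network, here a
# bidirectional LSTM that predicts one mask per known audio object over the
# frames of a magnitude spectrogram.
import torch
import torch.nn as nn

class AudioObjectSeparator(nn.Module):
    def __init__(self, n_freq_bins=257, hidden_size=256, n_audio_objects=4):
        super().__init__()
        self.n_freq_bins = n_freq_bins
        self.n_audio_objects = n_audio_objects
        self.rnn = nn.LSTM(input_size=n_freq_bins, hidden_size=hidden_size,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden_size, n_freq_bins * n_audio_objects)

    def forward(self, spectrogram):            # (batch, frames, freq_bins)
        features, _ = self.rnn(spectrogram)
        masks = torch.sigmoid(self.mask_head(features))
        masks = masks.view(spectrogram.size(0), spectrogram.size(1),
                           self.n_audio_objects, self.n_freq_bins)
        # audio component of each audio object: mask applied to the mixture
        return masks * spectrogram.unsqueeze(2)  # (batch, frames, objects, freq_bins)
```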
In this embodiment, the first neural network may be pre-trained according to a first training data set corresponding to the first neural network. The first training data set may include a plurality of training audios and labels corresponding to the training audios, where the labels may include the audio objects.
In some embodiments, the obtaining the target object included in the current video frame if the specified editing information of the current video frame is detected includes:
and performing target identification on the current video frame through a trained second neural network to obtain a target object contained in the current video frame, wherein at least part of a label in a first training data set corresponding to the first neural network is overlapped with a label in a second training data set corresponding to the second neural network.
In the embodiment of the present application, the second neural network may, for example, use an object detection method such as Spatial Pyramid Pooling Networks (SPPNet), Fast-RCNN, Single Shot MultiBox Detector (SSD), RetinaNet, or a multi-scale detection method.
In this embodiment of the application, the second training data set corresponding to the second neural network includes a plurality of training images and labels respectively corresponding to the training images. Tags in a first training data set corresponding to the first neural network at least partially coincide with tags in a second training data set corresponding to the second neural network, so as to associate a target object identified by the second neural network in the current video frame with an audio object identified by the first neural network in the initial audio. Therefore, through the first neural network and the second neural network, the association among the object, the image and the audio is realized, so that when the visual field of the image in the video changes due to zooming, the corresponding audio can be correspondingly adjusted according to the change of the visual field, and the problem that the visual presentation effect after the image is adjusted and the sound presentation effect of the audio in the prior art are not matched is avoided.
As shown in table 1, an exemplary association setting manner of the labels in the first training data set and the second training data set is provided.
Table 1:
(Table 1 is reproduced as images in the original publication; it lists the labels shared between the training audios of the first training data set and the training images of the second training data set.)
Specifically, the database stores a plurality of training audios in the first training data set, i.e., training audio a, training audio b, training audio c, training audio d, and the like, and a plurality of training images in the second training data set, i.e., training image A, training image B, training image C, training image D, and the like. The label corresponding to training audio a is the same as the label corresponding to training image A, the label corresponding to training audio b is the same as the label corresponding to training image B, the label corresponding to training audio c is the same as the label corresponding to training image C, and the label corresponding to training audio d is the same as the label corresponding to training image D. In some embodiments, some labels may only have audio and no images; for example, wind may only have corresponding training audio and no corresponding training images, while objects that normally make no sound, such as turtles, may only have corresponding training images and no corresponding training audio. Thus, there may also be some differences between the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network.
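For illustration only, the following sketch shows how the coinciding labels may be used to relate the two data sets; the concrete label names follow the examples mentioned above (cat, dog, cow, wind, turtle) and are otherwise hypothetical.

```python
# Illustrative sketch only: relate the labels of the two training data sets and
# keep the audio components whose label matches a detected target object.
image_labels = {"cat", "dog", "cow", "turtle"}   # second training data set (images)
audio_labels = {"cat", "dog", "cow", "wind"}     # first training data set (audios)
shared_labels = image_labels & audio_labels      # labels that coincide

def target_audio_components(target_objects, audio_components):
    """audio_components: {label: waveform of the component recognised in the
    initial audio}; returns the components corresponding to target objects."""
    return {label: component
            for label, component in audio_components.items()
            if label in shared_labels and label in target_objects}
```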
In some embodiments, before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio, the method includes:
and according to a preset object-frequency band mapping table, identifying target frequency bands respectively corresponding to all target objects in the initial audio, and taking the target frequency bands as target audio components of the corresponding target objects.
In the embodiment of the present application, the preset object-frequency band mapping table may pre-store mapping relationships between objects and frequency bands, so that by querying the object-frequency band mapping table, a target frequency band corresponding to the target object may be determined, and the target frequency band may be identified in the initial audio.
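For illustration only, a minimal sketch of this identification is given below; the band limits in the mapping table are hypothetical placeholder values, and the use of a band-pass filter from the scipy library is an assumption made for the example.

```python
# Illustrative sketch only: identify the target frequency band of a target
# object from a preset object-frequency band mapping table and isolate it with
# a band-pass filter.
from scipy.signal import butter, sosfilt

OBJECT_BAND_TABLE = {        # preset object-frequency band mapping table (Hz)
    "dog": (400.0, 4000.0),
    "cat": (300.0, 8000.0),
}

def extract_target_audio_component(initial_audio, sample_rate, target_object):
    low, high = OBJECT_BAND_TABLE[target_object]
    sos = butter(4, [low, high], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, initial_audio)   # target audio component of the object
```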
In some embodiments, processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio includes:
step S201, obtaining an original video frame corresponding to the current video frame according to the specified editing information;
step S202, detecting a first object contained in the original video frame;
step S203, comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
step S204, according to the comparison result, processing the initial audio corresponding to the current video frame to obtain the target audio.
In this embodiment of the present application, an original video frame corresponding to the current video frame may be obtained, so as to determine processing of the initial audio according to a content change condition between the original video frame and the current video frame.
The original video frame may be a video frame that is acquired and displayed by the terminal device through a camera, and the current video frame is obtained after the original video frame is edited according to the specified editing information. Therefore, by comparing the target object contained in the current video frame with the first object contained in the original video frame, that is, by referring to the original video frame, the content that the user wants to highlight can be better determined, so that the initial audio is processed in a more targeted manner.
For example, in some examples, the target object of the current video frame includes a person, and the first object in the original video frame includes a person and a dog, and the comparison between the target object and the first object may determine that the user wants to highlight the person in the current video frame, so when processing the corresponding initial audio, the audio intensity of the target audio component identified as the person in the initial audio may be increased, and the audio intensity of the audio component identified as the dog may be decreased according to the specified editing information, so as to better match the image display effect in the current video frame.
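For illustration only, the following sketch expresses such comparison-based processing; the dictionary structure of the audio components and the gain values are assumptions made for the example, and in practice the gains would follow from the specified editing information.

```python
# Illustrative sketch only: process the initial audio according to the
# comparison result between the target objects of the current video frame and
# the first objects of the original video frame.
import numpy as np

def process_initial_audio(audio_components, target_objects, first_objects,
                          boost=1.5, attenuate=0.5):
    """audio_components: {object_label: waveform (np.ndarray) of that component}."""
    target_audio = np.zeros_like(next(iter(audio_components.values())))
    for label, component in audio_components.items():
        if label in target_objects:
            gain = boost        # content that the user wants to highlight
        elif label in first_objects:
            gain = attenuate    # present in the original frame but not highlighted
        else:
            gain = 1.0
        target_audio = target_audio + gain * component
    return target_audio
```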
The processing of the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio includes:
and adjusting the loudness of a target audio component in the initial audio according to the specified editing information to obtain the target audio.
In this embodiment of the present application, the adjustment amplitude of the loudness of the target audio component in the initial audio may be determined according to the specified editing information. For example, if the specified editing information includes a current image scaling factor, the loudness of the target audio component in the initial audio may be determined according to the current image scaling factor and the predetermined correspondence between the image scaling factor and the loudness adjustment factor. If the specified editing information includes information of switching of the focusing object, the loudness of the target audio component in the initial audio may be determined according to information such as a distance and a size between the second image region of the second focusing object before switching and the first image region of the first focusing object after switching.
By adjusting the loudness of the target audio component in the initial audio, the spatial relationship between the scene and the object perceived by the user from the current video frame can be matched with the spatial relationship between the scene and the object perceived by the user from the target audio, so that the user can have better experience.
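For illustration only, a minimal sketch of such a loudness adjustment is given below; the correspondence table between image zoom factors and loudness adjustment factors contains hypothetical values chosen for the example.

```python
# Illustrative sketch only: adjust the loudness of a target audio component
# according to the current image zoom factor, using a predetermined
# correspondence between zoom factors and loudness adjustment factors.
ZOOM_TO_LOUDNESS = [(0.5, 0.6), (1.0, 1.0), (1.5, 1.4), (2.0, 1.8)]

def loudness_adjustment_factor(current_zoom):
    """Pick the loudness adjustment factor of the nearest tabulated zoom factor."""
    return min(ZOOM_TO_LOUDNESS, key=lambda pair: abs(pair[0] - current_zoom))[1]

def adjust_loudness(target_component, current_zoom):
    return loudness_adjustment_factor(current_zoom) * target_component
```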
And step S103, associating the current video frame with the target audio to obtain a target video.
In the embodiment of the application, the current video frame and the target audio can be associated through a timestamp or the like, so that the current video frame and the target audio are synchronized during playing. In addition, the current video frame and the target audio may be merged into a file of a specific video format, thereby obtaining the target video.
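For illustration only, a minimal sketch of such an association is given below; the data structure is an assumption made for the example, and in practice the associated pairs would be merged into a file of a specific video format.

```python
# Illustrative sketch only: associate the current video frame with the target
# audio through a timestamp so that they stay synchronised during playback.
from dataclasses import dataclass
import numpy as np

@dataclass
class TargetVideoSegment:
    timestamp: float      # presentation time of the current video frame (seconds)
    frame: np.ndarray     # current video frame
    audio: np.ndarray     # target audio samples covering this frame interval

def associate(frame, target_audio, timestamp):
    return TargetVideoSegment(timestamp=timestamp, frame=frame, audio=target_audio)
```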
On the basis of the above embodiment, as an alternative embodiment of the present application, referring to fig. 3, the video processing method may include:
step S301, if it is detected that the current image zoom factor of the current video frame does not meet a preset condition, taking a specified object contained in the current video frame as at least part of the target object;
step S302, according to the specified editing information and at least one target audio component, processing an initial audio corresponding to the current video frame to obtain a target audio, wherein any one target audio component is an audio component corresponding to one target object in the initial audio;
step S303, associating the current video frame with the target audio to obtain a target video.
In the embodiment of the present application, the preset condition may be, for example, that the current image zoom factor belongs to a preset interval or is equal to a preset value (e.g., equal to 1). In some cases, when the current image zoom factor does not satisfy the preset condition, it may be considered that the user has performed image zoom processing on the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame acquired by the camera of the terminal device. At this time, the field of view of the current video frame is also changed compared with the original video frame, and therefore, the content desired to be highlighted in the current video frame can be determined by acquiring the designated object contained in the current video frame. The designated object may be a focusing object in the current video frame, a user-selected object, one or more target objects detected by target detection, and the like.
On the basis of the above embodiment, as an alternative embodiment of the present application, referring to fig. 4, the video processing method may include:
step S401, if it is detected that a first focusing object in the current video frame is different from a second focusing object in a previous video frame of the current video frame, taking the first focusing object in the current video frame as at least part of the target object;
step S402, according to the specified editing information and at least one target audio component, processing an initial audio corresponding to the current video frame to obtain a target audio, wherein any one target audio component is an audio component corresponding to one target object in the initial audio;
and S403, associating the current video frame with the target audio to obtain a target video.
In this embodiment of the present application, if it is detected that a first focusing object in the current video frame is different from a second focusing object in a previous video frame of the current video frame, it may be determined that a focusing object of the current video frame is changed relative to the previous video frame. At this time, the first focus object in the current video frame may be considered as content that the user wants to highlight in the current video frame, and therefore, the first focus object in the current video frame may be at least part of the target object.
On the basis of the above embodiment, as an alternative embodiment of the present application, referring to fig. 5, the video processing method may include:
step S501, if it is detected that the current image zoom multiple of the current video frame does not satisfy a preset condition, and if it is detected that a first focus object in the current video frame is different from a second focus object in a previous video frame of the current video frame, taking a designated object contained in the current video frame and the first focus object in the current video frame as at least part of the target objects;
step S502, according to the appointed editing information and at least one target audio frequency component, processing the initial audio frequency corresponding to the current video frame to obtain the target audio frequency, wherein any one target audio frequency component is the audio frequency component corresponding to one target object in the initial audio frequency;
step S503, the current video frame is associated with the target audio to obtain a target video.
In this embodiment of the application, the target object may be determined according to the current image zoom factor of the current video frame and according to the change of the first focusing object in the current video frame relative to the second focusing object in the previous video frame. The target object determined from the current image zoom factor and the target object determined from the change of the focusing object may be the same object; therefore, the processing of the target audio component corresponding to that target object may be determined jointly by the current image zoom factor and the first focusing object, so that the initial audio is processed and the sound effect achieved by the corresponding target audio better fits the visual effect presented in the current video frame.
A specific implementation of the embodiment of the present application is described below as a specific example.
Illustratively, as shown in fig. 6(a), the current video frame corresponds to an original video frame. Target detection is performed on the original video frame through the second neural network, and the first objects contained in the original video frame are detected to comprise a cow, a cat and a dog.
As shown in fig. 6(b), the initial audio is input into the first neural network, and the output result of the first neural network includes a first audio component corresponding to the cow, a second audio component corresponding to the cat, and a third audio component corresponding to the dog. The volume of the first audio component is a, the volume of the second audio component is b, and the volume of the third audio component is c.
The user can perform amplification processing on the original video frame through a specific screen operation gesture, at this time, the current image zoom factor is greater than 1, and it can be considered that the current image zoom factor of the current video frame does not satisfy a preset condition. The field of view of the current video frame obtained after the enlargement process will be smaller than the field of view of the original video frame.
As shown in fig. 6(c), if the current image zoom factor is 1.5 and the target objects in the current video frame do not include the cow, the cat and the dog occupy a larger proportion of the current video frame than of the original video frame, so the volumes of the second audio component (b) and the third audio component (c) can be increased and the volume of the first audio component (a) can be decreased according to the current image zoom factor to obtain the target audio.
As shown in fig. 6(d), if the current image zoom factor is further increased to 2, and the target objects in the current video frame still do not include the cow, the cat and the dog occupy an even larger proportion of the current video frame, so the volumes of the second audio component (b) and the third audio component (c) can be further increased, and the volume of the first audio component (a) can be further decreased, according to the current image zoom factor to obtain the target audio.
As can be seen, in this example, as the current image zoom factor increases and the image corresponding to the target object gradually grows, the target audio component can be adjusted accordingly, so that the corresponding target audio component is also enhanced along with the highlighting of the image. This better simulates the state in which the user gradually approaches the target object, making it easier for the user to feel immersed in the scene and thereby improving the user experience.
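For illustration only, the following numeric sketch reproduces the trend of this example; the particular gain formula is a hypothetical choice made for the sketch and is not part of the embodiments of the present application.

```python
# Illustrative sketch only: as the current image zoom factor grows, the cat and
# dog components are raised and the cow component is lowered.
def adjusted_volumes(a, b, c, zoom_factor):
    """a: cow component volume, b: cat component volume, c: dog component volume."""
    if zoom_factor <= 1.0:          # preset condition satisfied: keep the mix
        return a, b, c
    gain = zoom_factor              # grows with the current image zoom factor
    return a / gain, b * gain, c * gain

print(adjusted_volumes(1.0, 1.0, 1.0, 1.5))   # zoomed in: cat and dog louder
print(adjusted_volumes(1.0, 1.0, 1.0, 2.0))   # further zoom: trend continues
```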
It should be noted that the above-mentioned example is only an illustrative illustration of the embodiment of the present application, and is not a limitation of the embodiment of the present application.
In the embodiment of the present application, if the specified editing information of the current video frame is detected, a target object included in the current video frame may be acquired, where the target object may be regarded as the content that is desired to be highlighted; then, the initial audio corresponding to the current video frame is processed according to the specified editing information and at least one target audio component to obtain a target audio, and the current video frame is associated with the target audio to obtain a target video. Since any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio, the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information, so that the sound effect achieved by the corresponding target audio better fits the visual effect presented in the current video frame, which improves the presentation effect of the target video and the user experience.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 shows a block diagram of a video processing apparatus according to an embodiment of the present application, which corresponds to the video processing method described in the foregoing embodiments; for convenience of description, only the portions related to the embodiment of the present application are shown.
Referring to fig. 7, the video processing apparatus 7 includes:
an obtaining module 701, configured to obtain a target object included in a current video frame if specified editing information of the current video frame is detected;
a processing module 702, configured to process an initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component, to obtain a target audio, where any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio;
the associating module 703 is configured to associate the current video frame with the target audio to obtain a target video.
Optionally, the video processing apparatus 7 further includes:
the second processing module is used for inputting the initial audio into the trained first neural network to obtain an output result of the trained first neural network, wherein the output result comprises the recognized audio objects and audio components corresponding to the audio objects respectively;
and the comparison module is used for comparing the audio object with the target object, and if at least one audio object is the same as the target object, determining that a target audio component corresponding to at least one target object exists in the initial audio corresponding to the current video frame.
Optionally, the obtaining module 701 is specifically configured to:
and performing target identification on the current video frame through a trained second neural network to obtain a target object contained in the current video frame, wherein at least part of a label in a first training data set corresponding to the first neural network is overlapped with a label in a second training data set corresponding to the second neural network.
Optionally, the video processing apparatus 7 further includes:
and the third processing module is used for identifying target frequency bands corresponding to the target objects in the initial audio according to a preset object-frequency band mapping table, and taking the target frequency bands as target audio components of the corresponding target objects.
Optionally, the obtaining module 701 is specifically configured to:
if the zoom factor of the current image of the current video frame is detected not to meet the preset condition, using a designated object contained in the current video frame as at least part of the target object;
and/or the presence of a gas in the gas,
and if the first focusing object in the current video frame is detected to be different from the second focusing object in the previous video frame of the current video frame, taking the first focusing object in the current video frame as at least part of the target object.
Optionally, the processing module 702 specifically includes:
a first obtaining unit, configured to obtain, according to the specified editing information, an original video frame corresponding to the current video frame;
a detection unit configured to detect a first object included in the original video frame;
a comparison unit, configured to compare a target object included in the current video frame with a first object included in the original video frame, and obtain a comparison result;
and the second processing unit is used for processing the initial audio corresponding to the current video frame according to the comparison result to obtain a target audio.
Optionally, the processing module 702 is specifically configured to:
and adjusting the loudness of a target audio component in the initial audio according to the specified editing information to obtain the target audio.
In the embodiment of the present application, if the specified editing information of the current video frame is detected, a target object included in the current video frame may be acquired, where the target object may be regarded as the content that is desired to be highlighted; then, the initial audio corresponding to the current video frame is processed according to the specified editing information and at least one target audio component to obtain a target audio, and the current video frame is associated with the target audio to obtain a target video. Since any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio, the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information, so that the sound effect achieved by the corresponding target audio better fits the visual effect presented in the current video frame, which improves the presentation effect of the target video and the user experience.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 8, the terminal device 8 of this embodiment includes: at least one processor 80 (only one of which is shown in fig. 8), a memory 81, and a computer program 82 stored in the memory 81 and executable on the at least one processor 80, wherein the steps of any of the various video processing method embodiments described above are implemented when the processor 80 executes the computer program 82.
The terminal device 8 may be a server, a mobile phone, a wearable device, an Augmented Reality (AR)/Virtual Reality (VR) device, a desktop computer, a notebook computer, a palmtop computer, or another computing device. The terminal device may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art will appreciate that fig. 8 is merely an example of the terminal device 8 and does not constitute a limitation of the terminal device 8, which may include more or fewer components than those shown, a combination of some of the components, or different components; for example, it may also include input devices, output devices, network access devices, etc. The input devices may include a keyboard, a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, a camera, and the like, and the output devices may include a display, a speaker, and the like.
The Processor 80 may be a Central Processing Unit (CPU), and the Processor 80 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or a memory of the terminal device 8. In other embodiments, the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing an operating system, an application program, a Boot Loader, data, and other programs, such as the program code of the computer program. The memory 81 can also be used to temporarily store data that has been output or is to be output.
In addition, although not shown, the terminal device 8 may further include a network connection module, such as a Bluetooth module, a Wi-Fi module, a cellular network module, and the like, which is not described herein again.
In this embodiment, when the processor 80 executes the computer program 82 to implement the steps in any of the video processing method embodiments, if the specified editing information of the current video frame is detected, a target object included in the current video frame may be obtained, where the target object may be regarded as the content that is desired to be highlighted; then, the initial audio corresponding to the current video frame is processed according to the specified editing information and at least one target audio component to obtain a target audio, and the current video frame is associated with the target audio to obtain a target video. Since any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio, the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information, so that the sound effect achieved by the corresponding target audio better fits the visual effect presented in the current video frame, which improves the presentation effect of the target video and the user experience.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program includes computer program code, which may be in a source code form, an object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing apparatus/terminal apparatus, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/device and method may be implemented in other ways. For example, the above-described apparatus/device embodiments are merely illustrative, and for example, the division of the above modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not cause the corresponding technical solutions to depart in substance from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (9)

1. A video processing method, comprising:
if the specified editing information of the current video frame is detected, acquiring a target object contained in the current video frame; the specified editing information is information related to a specified editing operation on the current video frame;
processing initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one target audio component is an audio component corresponding to one target object in the initial audio;
associating the current video frame with the target audio to obtain a target video;
the processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio includes:
detecting a first object contained in an original video frame;
comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
and determining a highlighted target object according to the comparison result, and processing the initial audio corresponding to the current video frame to obtain a target audio.
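Purely as a non-normative illustration of the comparison step recited in claim 1, one simple realization is to intersect the objects detected in the original frame with those detected in the current (edited) frame; the object detector and the exact matching rule used in practice are abstracted away in this sketch.

    from typing import List, Set


    def highlighted_targets(current_objects: List[str],
                            original_objects: List[str]) -> Set[str]:
        """Objects detected both in the original frame and in the current
        (edited) frame are treated as the highlighted target objects."""
        return set(current_objects) & set(original_objects)


    # Example: a zoom-in keeps "person" and "guitar" but crops out "crowd".
    # highlighted_targets(["person", "guitar"], ["person", "guitar", "crowd"])
    # -> {"person", "guitar"}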
2. The video processing method according to claim 1, wherein before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio, the method further comprises:
inputting the initial audio into a trained first neural network, and obtaining an output result of the trained first neural network, wherein the output result comprises recognized audio objects and audio components respectively corresponding to the audio objects;
and comparing the audio object with the target object, and if at least one audio object is identical to the target object, determining that a target audio component corresponding to at least one target object exists in the initial audio corresponding to the current video frame.
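As an illustration only of the matching described in claim 2, the sketch below stands in for the trained first neural network with a stub that labels the whole mix as a single audio object; a real implementation would substitute a trained source-separation model, which this application does not specify.

    from typing import Dict, List

    import numpy as np


    def recognize_audio_objects(initial_audio: np.ndarray) -> Dict[str, np.ndarray]:
        """Stub for the trained first neural network: returns recognized audio
        objects and their corresponding audio components. Here the whole mix is
        labelled as a single 'person' object so the matching below can be shown."""
        return {"person": initial_audio}


    def matching_components(initial_audio: np.ndarray,
                            target_objects: List[str]) -> Dict[str, np.ndarray]:
        """Keep only the components whose audio object also appears among the
        target objects detected in the current video frame."""
        recognized = recognize_audio_objects(initial_audio)
        return {obj: comp for obj, comp in recognized.items() if obj in target_objects}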
3. The video processing method of claim 2, wherein the obtaining the target object included in the current video frame if the specified editing information of the current video frame is detected comprises:
and performing target identification on the current video frame through a trained second neural network to obtain a target object contained in the current video frame, wherein at least some of the labels in a first training data set corresponding to the first neural network overlap with the labels in a second training data set corresponding to the second neural network.
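The label overlap required by claim 3 simply ensures that the vocabulary of the audio (first) network and that of the visual (second) network are comparable. An illustrative check, using made-up label sets that are not taken from this application, might look like:

    # Illustrative only: invented label vocabularies for the two training data sets.
    first_labels = {"person", "dog", "car"}    # audio (first) network
    second_labels = {"person", "dog", "tree"}  # visual (second) network

    # At least part of the labels coincide, so recognized audio objects and
    # detected target objects can be compared directly by name.
    assert first_labels & second_labels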
4. The video processing method of claim 1, wherein before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio, the method comprises:
and according to a preset object-frequency band mapping table, identifying target frequency bands respectively corresponding to all target objects in the initial audio, and taking the target frequency bands as target audio components of the corresponding target objects.
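For claim 4, a minimal sketch of the preset object-frequency-band mapping is given below; the band values are invented for illustration, and the FFT masking is only one possible way of taking a frequency band as the object's target audio component.

    import numpy as np

    # Illustrative preset object -> frequency band table (Hz); the actual table
    # would be chosen by the implementer.
    OBJECT_BAND_TABLE = {
        "person": (85.0, 3400.0),
        "bird": (2000.0, 8000.0),
    }


    def band_component(audio: np.ndarray, sample_rate: int,
                       band: tuple) -> np.ndarray:
        """Extract the part of the initial audio lying in the given band and
        use it as the target audio component of the corresponding object."""
        spectrum = np.fft.rfft(audio)
        freqs = np.fft.rfftfreq(audio.shape[0], d=1.0 / sample_rate)
        mask = (freqs >= band[0]) & (freqs <= band[1])
        return np.fft.irfft(spectrum * mask, n=audio.shape[0])


    def components_from_table(audio: np.ndarray, sample_rate: int,
                              target_objects: list) -> dict:
        return {obj: band_component(audio, sample_rate, OBJECT_BAND_TABLE[obj])
                for obj in target_objects if obj in OBJECT_BAND_TABLE}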
5. The video processing method according to claim 1, wherein said obtaining the target object included in the current video frame if the specified editing information of the current video frame is detected comprises:
if the zoom factor of the current image of the current video frame is detected not to meet a preset condition, using a designated object contained in the current video frame as at least part of the target object;
and/or,
and if the first focusing object in the current video frame is detected to be different from the second focusing object in the previous video frame of the current video frame, taking the first focusing object in the current video frame as at least part of the target object.
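The two triggers recited in claim 5 can be illustrated as follows. In this sketch the zoom threshold, the default designated object name, and the reading of "not meeting the preset condition" as exceeding a zoom threshold are all assumptions, not requirements of the claim.

    from typing import List, Optional


    def select_target_objects(current_objects: List[str],
                              zoom_factor: float,
                              focused_object: Optional[str],
                              previous_focused_object: Optional[str],
                              designated_object: str = "person",
                              zoom_threshold: float = 1.0) -> List[str]:
        """Treat the designated object as a target when the zoom factor falls
        outside the preset condition, and treat a newly focused object as a
        target when the focus has changed since the previous video frame."""
        targets: List[str] = []
        if zoom_factor > zoom_threshold and designated_object in current_objects:
            targets.append(designated_object)
        if (focused_object is not None
                and focused_object != previous_focused_object
                and focused_object in current_objects
                and focused_object not in targets):
            targets.append(focused_object)
        return targets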
6. The video processing method according to any one of claims 1 to 5, wherein said processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio comprises:
and adjusting the loudness of a target audio component in the initial audio according to the specified editing information to obtain the target audio.
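Finally, the loudness adjustment of claim 6 can be pictured as scaling the target audio component within the initial audio; the gain value and the peak normalization below are assumptions of this sketch, not requirements of the claim.

    import numpy as np


    def adjust_loudness(initial_audio: np.ndarray,
                        component: np.ndarray,
                        gain: float) -> np.ndarray:
        """Scale one target audio component by `gain` within the initial audio
        (gain > 1 emphasizes the highlighted object) to obtain the target audio."""
        target_audio = initial_audio + (gain - 1.0) * component
        peak = float(np.max(np.abs(target_audio)))
        return target_audio / peak if peak > 1.0 else target_audio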
7. A video processing apparatus, comprising:
the acquisition module is used for acquiring a target object contained in the current video frame if the specified editing information of the current video frame is detected; the specified editing information is information related to a specified editing operation on the current video frame;
a processing module, configured to process an initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio, where any one of the target audio components is an audio component corresponding to one of the target objects in the initial audio; the processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain a target audio includes: detecting a first object contained in an original video frame; comparing a target object contained in the current video frame with a first object contained in the original video frame to obtain a comparison result; according to the comparison result, determining a highlighted target object, and processing the initial audio corresponding to the current video frame to obtain a target audio;
and the association module is used for associating the current video frame with the target audio to obtain a target video.
8. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the video processing method according to any of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the video processing method according to any one of claims 1 to 6.
CN202010710645.XA 2020-07-22 2020-07-22 Video processing method, video processing device and terminal equipment Active CN111818385B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010710645.XA CN111818385B (en) 2020-07-22 2020-07-22 Video processing method, video processing device and terminal equipment
PCT/CN2021/097743 WO2022017006A1 (en) 2020-07-22 2021-06-01 Video processing method and apparatus, and terminal device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710645.XA CN111818385B (en) 2020-07-22 2020-07-22 Video processing method, video processing device and terminal equipment

Publications (2)

Publication Number Publication Date
CN111818385A CN111818385A (en) 2020-10-23
CN111818385B true CN111818385B (en) 2022-08-09

Family

ID=72861861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710645.XA Active CN111818385B (en) 2020-07-22 2020-07-22 Video processing method, video processing device and terminal equipment

Country Status (2)

Country Link
CN (1) CN111818385B (en)
WO (1) WO2022017006A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111818385B (en) * 2020-07-22 2022-08-09 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment
CN112188260A (en) * 2020-10-26 2021-01-05 咪咕文化科技有限公司 Video sharing method, electronic device and readable storage medium
CN114449354B (en) * 2022-02-07 2023-12-08 上海幻电信息科技有限公司 Video editing method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107135419A (en) * 2017-06-14 2017-09-05 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN107493442A (en) * 2017-07-21 2017-12-19 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN109040641A (en) * 2018-08-30 2018-12-18 维沃移动通信有限公司 A kind of video data synthetic method and device
CN109761118A (en) * 2019-01-15 2019-05-17 福建天眼视讯网络科技有限公司 Wisdom ladder networking control method and system based on machine vision
CN109815844A (en) * 2018-12-29 2019-05-28 西安天和防务技术股份有限公司 Object detection method and device, electronic equipment and storage medium
CN109857905A (en) * 2018-11-29 2019-06-07 维沃移动通信有限公司 A kind of video editing method and terminal device
CN110544270A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 method and device for predicting human face tracking track in real time by combining voice recognition

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100542129B1 (en) * 2002-10-28 2006-01-11 한국전자통신연구원 Object-based three dimensional audio system and control method
JP2009200959A (en) * 2008-02-22 2009-09-03 Sony Corp Data editing apparatus, data editing method, program and storage medium
JP2011234139A (en) * 2010-04-28 2011-11-17 Sharp Corp Three-dimensional audio signal generating device
US20140152530A1 (en) * 2012-12-03 2014-06-05 Honeywell International Inc. Multimedia near to eye display system
CN105611404B (en) * 2015-12-31 2019-01-08 杭州雅乐互动科技有限公司 A kind of method and device automatically adjusting audio volume according to Video Applications scene
CN105657538B (en) * 2015-12-31 2019-01-08 杭州雅乐互动科技有限公司 One kind carrying out synthetic method and device to video file by mobile terminal
CN105512348B (en) * 2016-01-28 2019-03-26 北京旷视科技有限公司 For handling the method and apparatus and search method and device of video and related audio
US9794632B1 (en) * 2016-04-07 2017-10-17 Gopro, Inc. Systems and methods for synchronization based on audio track changes in video editing
CN107241646B (en) * 2017-07-12 2020-08-14 北京奇虎科技有限公司 Multimedia video editing method and device
CN107967706B (en) * 2017-11-27 2021-06-11 腾讯音乐娱乐科技(深圳)有限公司 Multimedia data processing method and device and computer readable storage medium
CN108307127A (en) * 2018-01-12 2018-07-20 广州市百果园信息技术有限公司 Method for processing video frequency and computer storage media, terminal
CN111818385B (en) * 2020-07-22 2022-08-09 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment

Also Published As

Publication number Publication date
WO2022017006A1 (en) 2022-01-27
CN111818385A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111818385B (en) Video processing method, video processing device and terminal equipment
CN109189879B (en) Electronic book display method and device
CN108961157B (en) Picture processing method, picture processing device and terminal equipment
WO2021213067A1 (en) Object display method and apparatus, device and storage medium
CN109345553B (en) Palm and key point detection method and device thereof, and terminal equipment
US20210224592A1 (en) Method and device for training image recognition model, and storage medium
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
CN108961267B (en) Picture processing method, picture processing device and terminal equipment
CN108898082B (en) Picture processing method, picture processing device and terminal equipment
CN110119733B (en) Page identification method and device, terminal equipment and computer readable storage medium
CN111489290B (en) Face image super-resolution reconstruction method and device and terminal equipment
CN111290684B (en) Image display method, image display device and terminal equipment
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN109359582B (en) Information searching method, information searching device and mobile terminal
CN110930984A (en) Voice processing method and device and electronic equipment
CN112561269A (en) Advisor recommendation method and device
CN107133361B (en) Gesture recognition method and device and terminal equipment
CN105139848A (en) Data conversion method and apparatus
CN107085823A (en) Face image processing process and device
CN105426904A (en) Photo processing method, apparatus and device
CN112381091A (en) Video content identification method and device, electronic equipment and storage medium
CN112887615A (en) Shooting method and device
CN111797746A (en) Face recognition method and device and computer readable storage medium
WO2020244076A1 (en) Face recognition method and apparatus, and electronic device and storage medium
CN108932704B (en) Picture processing method, picture processing device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant