WO2022017006A1 - Video processing method and apparatus, and terminal device and computer-readable storage medium - Google Patents


Info

Publication number
WO2022017006A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/097743
Other languages
French (fr)
Chinese (zh)
Inventor
崔志佳
范泽华
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2022017006A1 publication Critical patent/WO2022017006A1/en

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams

Definitions

  • the present application belongs to the technical field of video processing, and in particular, relates to a video processing method, apparatus, terminal device, and computer-readable storage medium.
  • the user may adjust some video frames by zooming the image, adjusting the focus object, etc., to highlight certain content.
  • Embodiments of the present application provide a video processing method, an apparatus, a terminal device, and a computer-readable storage medium.
  • an embodiment of the present application provides a video processing method, including:
  • the initial audio corresponding to the current video frame is processed to obtain target audio, wherein any target audio component is the audio component corresponding to a target object in the initial audio.
  • an embodiment of the present application provides a video processing apparatus, including:
  • an acquisition module configured to acquire the target object contained in the current video frame if the specified editing information of the current video frame is detected
  • a processing module configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one of the target audio components is the audio component corresponding to one of the target objects in the initial audio;
  • an association module configured to associate the current video frame with the target audio to obtain a target video.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, a display, and a computer program stored in the memory and running on the processor, wherein, when the processor executes the computer program
  • the video processing method as described above in the first aspect is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements the video processing method described in the first aspect.
  • an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the video processing method described above in the first aspect.
  • FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of step S102 provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another video processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another video processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of still another video processing method provided by an embodiment of the present application.
  • FIG. 6 is an exemplary schematic diagram of obtaining target audio provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.
  • the terms “comprising”, “including”, “having” and their variants mean “including but not limited to” unless specifically emphasized otherwise.
  • the audio corresponding to the image still uses the original audio, so that the visual presentation effect of the image and the sound presentation effect of the audio do not match, resulting in a poor presentation effect of the obtained video.
  • an embodiment of the present application provides a video processing method, which includes:
  • the initial audio corresponding to the current video frame is processed to obtain target audio, wherein any target audio component is the audio component corresponding to a target object in the initial audio.
  • the beneficial effect of the embodiments of the present application is: if the specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be considered the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame and the target audio are associated to obtain the target video. Since any of the target audio components is an audio component corresponding to the target object in the initial audio, the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information, so that the sound effect of the resulting target audio better suits the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method further includes:
  • the target object contained in the current video frame is obtained, including:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, wherein the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method includes:
  • target frequency bands corresponding to each target object are identified in the initial audio, and the target frequency bands are used as target audio components of the corresponding target objects.
  • acquiring the target object contained in the current video frame includes:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • processing the initial audio corresponding to the current video frame to obtain the target audio includes:
  • the initial audio corresponding to the current video frame is processed to obtain the target audio.
  • processing the initial audio corresponding to the current video frame to obtain the target audio includes:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • adjusting the loudness of the target audio component in the initial audio includes:
  • the loudness of the target audio component in the initial audio is determined according to the predetermined correspondence between the image zoom factor and the loudness adjustment factor.
  • the target audio component is a sub-audio identified and extracted from the initial audio in advance;
  • the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
  • the embodiment of the present application also provides a video processing apparatus, which includes:
  • an acquisition module configured to acquire the target object contained in the current video frame if the specified editing information of the current video frame is detected
  • a processing module configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one of the target audio components is the audio component corresponding to one of the target objects in the initial audio;
  • an association module configured to associate the current video frame with the target audio to obtain a target video.
  • the video processing apparatus further includes:
  • the second processing module is configured to input the initial audio into the trained first neural network, and obtain an output result of the trained first neural network, where the output result includes the identified audio objects and each audio object The corresponding audio components;
  • the comparison module is configured to compare the audio object with the target object, and if there is at least one audio object that is the same as the target object, determine that the initial audio corresponding to the current video frame contains at least one target audio component corresponding to the target object.
  • the obtaining module is used for:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, wherein the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap.
  • the video processing apparatus further includes:
  • the third processing module is configured to identify target frequency bands corresponding to each target object in the initial audio according to a preset object-frequency band mapping table, and use the target frequency band as a target audio component of the corresponding target object.
  • the obtaining module is used for:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • the video processing apparatus further includes:
  • a first obtaining unit configured to obtain the original video frame corresponding to the current video frame according to the specified editing information
  • a detection unit configured to detect the first object contained in the original video frame
  • a comparison unit for comparing the target object contained in the current video frame and the first object contained in the original video frame to obtain a comparison result
  • the second processing unit is configured to process the initial audio corresponding to the current video frame according to the comparison result to obtain the target audio.
  • the processing module is used for:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • the processing module is used for:
  • the loudness of the target audio component in the initial audio is determined according to the predetermined correspondence between the image zoom factor and the loudness adjustment factor.
  • the target audio component is a sub-audio identified and extracted from the initial audio in advance; or, the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
  • Embodiments of the present application further provide a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the video processing method described above when executing the computer program.
  • the embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, wherein the computer program implements the video processing method described in the above-mentioned embodiments of the application when the computer program is executed by the processor.
  • the video processing method provided by the embodiments of the present application can be applied to terminal devices such as servers, desktop computers, mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of the present application do not place any restriction on the specific type of the terminal device.
  • FIG. 1 shows a flowchart of a video processing method provided by an embodiment of the present application, and the video processing method can be applied to a terminal device.
  • the user may adjust some video frames to highlight a scene or an object.
  • the image zoom factor of the video frame may be adjusted to change the field of view of the video frame, or the focus object may be adjusted, etc.
  • the audio corresponding to the video frame often still uses the original audio. The prior art does not recognize that, after the video frame is adjusted, the spatial relationship between the scene and the objects perceived by the user from the video frame may change, while the sound presented in the original audio is still the sound captured under the spatial relationship presented before the video frame was scaled. Therefore, at present, after the video frame is adjusted, the visual presentation effect of the image and the sound presentation effect of the audio may not match, resulting in a poor presentation effect of the obtained video.
  • the target audio component of the target object and the other parts of the initial audio can be adjusted correspondingly according to the specified editing information.
  • the obtained target audio can follow the changes of the current video frame under the specified editing operation, so that the sound effect of the resulting target audio better suits the visual effect presented in the current video frame, improving the presentation effect of the target video and providing users with a more immersive and real viewing experience.
  • the video processing method may include:
  • Step S101 if the specified editing information of the current video frame is detected, acquire the target object contained in the current video frame.
  • the current video frame may be the current video frame collected in real time during the video collection process, or may be a video frame extracted from the video to be edited when the video is edited.
  • alternatively, the current video frame may be obtained through other acquisition methods.
  • target detection can be performed on the current video frame by a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster-RCNN, Single Shot MultiBox Detector (SSD), Retina-Net or a multi-scale detection method, etc., to obtain the target object contained in the current video frame.
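The detectors named above require trained models, so they are not reproduced here; the following minimal sketch only illustrates the surrounding step of turning a detector's output into the target objects contained in the current video frame by filtering on confidence. The detection record format, function name, and threshold are illustrative assumptions, not from the patent.

```python
def targets_from_detections(detections: list, score_threshold: float = 0.5) -> list:
    """Keep the labels of detected objects whose confidence meets the threshold;
    these labels serve as the target objects of the current video frame."""
    return [d["label"] for d in detections if d["score"] >= score_threshold]

# Hypothetical detector output for one video frame.
frame_detections = [
    {"label": "person", "score": 0.92},
    {"label": "dog", "score": 0.31},
]
```

For example, `targets_from_detections(frame_detections)` keeps only `"person"` with the default threshold.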
  • the specified editing information may be information associated with a specified editing operation on the current video frame.
  • the specified editing operation may be used to realize a specified adjustment of the current video frame performed by the user or automatically by the terminal device, for example, zooming the current video frame or adjusting the focus object in the current video frame, etc.
  • the specified editing operation may include an image scaling operation that does not meet a preset condition (for example, the current image scaling factor corresponding to the image scaling operation does not meet the preset condition); and/or, the specified editing operation may include an image focus operation, where after the image focus operation is performed, the first focus object in the current video frame and the second focus object in the video frame preceding the current video frame are different.
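The two triggers described above can be sketched as simple predicates: an image zoom factor falling outside a preset interval, or a change of focus object relative to the preceding frame. This is a hedged illustration only; the function names, the preset interval, and its default value are assumptions, not from the patent.

```python
def zoom_triggers_edit(current_zoom: float,
                       preset_interval: tuple = (1.0, 1.0)) -> bool:
    """True if the current image zoom factor does not meet the preset condition,
    i.e. falls outside the preset interval."""
    low, high = preset_interval
    return not (low <= current_zoom <= high)

def focus_triggers_edit(first_focus: str, second_focus: str) -> bool:
    """True if the focus object of the current frame differs from that of
    the preceding video frame."""
    return first_focus != second_focus

def specified_editing_detected(current_zoom: float,
                               first_focus: str, second_focus: str) -> bool:
    """Either trigger counts as specified editing information being detected."""
    return zoom_triggers_edit(current_zoom) or focus_triggers_edit(first_focus, second_focus)
```

With the default interval, a zoom factor of exactly 1 (no zoom) does not trigger, while any other factor does.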
  • acquiring the target object contained in the current video frame includes:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • when the current image zoom factor does not meet the preset condition, it may be considered that the user has performed image zoom processing on the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame captured by the camera of the terminal device.
  • the field of view of the current video frame has also changed compared to the original video frame. Therefore, the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame.
  • the specified object may be a focused object in the current video frame, an object selected by a user, one or more target objects detected by target detection, and the like.
  • the first focus object in the current video frame can be used as at least part of the target object.
  • Step S102, according to the specified editing information and at least one target audio component, process the initial audio corresponding to the current video frame to obtain target audio, wherein any of the target audio components is the audio component corresponding to a target object in the initial audio.
  • the specific form of the target audio component may be set according to an actual scene.
  • the target audio component may be a sub-audio identified and extracted from the initial audio in advance; or, a specific frequency band in the initial audio may be pre-identified as the target audio component.
  • the initial audio can be processed by algorithms such as the Mel cepstrum algorithm, trained recurrent neural networks and/or convolutional neural networks, etc., so that the audio component of a specific audio object can be identified and separated from the initial audio; if the specific audio object is any of the target objects, the corresponding audio component can be used as the target audio component.
  • target frequency bands corresponding to each target object may be determined and identified as the target audio components.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method further includes:
  • the first neural network can identify audio feature information such as waveform features and frequency features of different audio objects from audio, so as to extract and separate audio components of the audio objects.
  • the first neural network may be a recurrent neural network such as a bidirectional recurrent neural network (Bidirectional RNN, Bi-RNN), a long short-term memory network (Long Short-Term Memory networks, LSTM).
  • the first neural network may be pre-trained according to the first training data set corresponding to the first neural network.
  • the first training data set may include multiple training audios and labels corresponding to each training audio, and the labels may include the audio objects.
  • acquiring the target object contained in the current video frame includes:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, wherein the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap.
  • the second neural network may adopt a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster-RCNN, Single Shot MultiBox Detector (SSD), Retina-Net, or a multi-scale detection method, etc.
  • the second training data set corresponding to the second neural network includes a plurality of training images and labels corresponding to each training image.
  • the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network at least partially overlap, so that the target object identified in the current video frame by the second neural network can be associated with the audio object identified in the initial audio by the first neural network. Therefore, through the first neural network and the second neural network, the association between objects, images, and audio is realized, so that when the image in the video changes its field of view due to zooming, the corresponding audio can be adjusted appropriately according to the change of the field of view, thereby avoiding the problem in the prior art that the visual presentation effect after image adjustment does not match the sound presentation effect of the audio.
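The label-overlap requirement can be illustrated with plain set operations: only labels present in both training data sets let a visually detected target object be matched to an audio component. The label sets and component names below are invented for illustration and are not from the patent.

```python
# Labels of the first (audio) and second (image) training data sets — assumed.
audio_labels = {"person", "dog", "car", "wind"}       # first training data set
image_labels = {"person", "dog", "car", "turtle"}     # second training data set

# Overlapping labels are the ones usable for image-to-audio association.
shared_labels = audio_labels & image_labels

def audio_component_for(target_object: str, components: dict):
    """Return the audio component matched to a visually detected target object,
    or None when the label has no counterpart in the audio network's label set."""
    if target_object in shared_labels:
        return components.get(target_object)
    return None
```

Labels like "wind" (audio only) or "turtle" (image only) fall outside the overlap, matching the observation below that the two label sets may differ in part.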
  • Table 1 shows an exemplary association setting manner of the labels in the first training data set and the second training data set.
  • a plurality of training audios in the first training data set, i.e., training audio a, training audio b, training audio c, and training audio d, and a plurality of training images in the second training data set, i.e., training image A, training image B, training image C, and training image D, are stored.
  • the label corresponding to the training audio a is the same as the label corresponding to the training image A
  • the label corresponding to the training audio b is the same as the label corresponding to the training image B
  • the label corresponding to the training audio c is the same as the label corresponding to the training image C
  • the label corresponding to the training audio d is the same as the label corresponding to the training image D.
  • some labels may only have audio but no images. For example, wind has only corresponding training audio but no corresponding training images, and objects such as turtles that generally do not make sounds may have only corresponding training images but no corresponding training audio. Therefore, there may be some differences between the labels in the first training data set corresponding to the first neural network and the labels in the second training data set corresponding to the second neural network.
  • before processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio, the method includes:
  • target frequency bands corresponding to each target object are identified in the initial audio, and the target frequency bands are used as target audio components of the corresponding target objects.
  • the preset object-frequency band mapping table may pre-store the mapping relationship between each object and its frequency band. Therefore, by querying the object-frequency band mapping table, the target frequency band corresponding to the target object can be determined and identified in the initial audio.
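A minimal sketch of this idea: a mapping table from object to frequency band, plus an FFT mask that isolates the mapped band from the initial audio as the target audio component. The table entries, band edges, and function names are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Hypothetical preset object-frequency band mapping table (Hz).
OBJECT_BAND_HZ = {
    "person": (85.0, 255.0),    # rough speech-fundamental range — assumed
    "dog":    (400.0, 2000.0),  # illustrative
}

def extract_band(audio: np.ndarray, sr: int, obj: str) -> np.ndarray:
    """Keep only the frequency band mapped to `obj` in the initial audio,
    zeroing every other frequency bin; the result acts as the target
    audio component for that object."""
    lo, hi = OBJECT_BAND_HZ[obj]
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    mask = (freqs >= lo) & (freqs <= hi)
    return np.fft.irfft(spectrum * mask, n=len(audio))
```

For instance, applied to a mixture of a 150 Hz and a 1000 Hz tone, the "person" band keeps the 150 Hz content and suppresses the 1000 Hz content.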
  • processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio includes:
  • Step S201, obtain the original video frame corresponding to the current video frame according to the specified editing information;
  • Step S202, detect the first object contained in the original video frame;
  • Step S203, compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
  • Step S204, process the initial audio corresponding to the current video frame according to the comparison result to obtain the target audio.
  • the original video frame corresponding to the current video frame may be acquired, so as to determine the processing of the initial audio according to the content change between the original video frame and the current video frame.
  • the original video frame may be a video frame collected and displayed by a terminal device through a camera, and the current video frame is obtained after editing the original video frame according to the specified editing information. Therefore, by comparing the target object included in the current video frame with the first object included in the original video frame, the content that the user wants to present prominently can be better specified with reference to the original video frame, thereby processing the initial audio in a more targeted manner.
  • for example, if the target object of the current video frame includes a person, and the first object in the original video frame includes a person and a dog, the audio intensity of the target audio component identified as the person in the initial audio can be increased according to the specified editing information, and the audio intensity of the audio component identified as the dog can be reduced, to better match the image display effect in the current video frame.
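The person/dog example can be sketched as a per-component gain adjustment driven by the comparison result: components of objects kept in view are boosted and components of objects that left the frame are attenuated. The function name and the gain values are assumptions for illustration.

```python
import numpy as np

def adjust_components(components: dict, kept: set, dropped: set,
                      boost: float = 1.5, cut: float = 0.5) -> dict:
    """Scale each named audio component according to the comparison result:
    boost components of objects still in the current video frame, attenuate
    components of objects no longer in it, leave the rest unchanged."""
    out = {}
    for name, samples in components.items():
        if name in kept:
            out[name] = samples * boost
        elif name in dropped:
            out[name] = samples * cut
        else:
            out[name] = samples.copy()
    return out
```

With the example above, `adjust_components(comps, kept={"person"}, dropped={"dog"})` raises the person component and lowers the dog component.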
  • the initial audio corresponding to the current video frame is processed to obtain the target audio, including:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • the adjustment range for the loudness of the target audio component in the initial audio may be determined according to the specified editing information.
  • the specified editing information includes the current image zoom factor
  • the loudness of the target audio component in the initial audio can be determined from the current image zoom factor according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
  • in this way, the spatial relationship among the scene and objects that the user perceives from the current video frame can be made to match the spatial relationship perceived from the target audio, so that the user has a better experience.
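The predetermined correspondence between the image zoom factor and the loudness adjustment factor mentioned above can be sketched as a lookup table with linear interpolation between breakpoints. The breakpoint values here are assumptions chosen purely for illustration:

```python
# Illustrative zoom-factor -> loudness-factor correspondence table.
import bisect

ZOOM_BREAKPOINTS = [1.0, 1.5, 2.0, 3.0]   # current image zoom factor
LOUDNESS_FACTORS = [1.0, 1.4, 1.8, 2.5]   # gain applied to the target component

def loudness_factor(zoom):
    """Map a zoom factor to a loudness adjustment factor (clamped at the ends)."""
    if zoom <= ZOOM_BREAKPOINTS[0]:
        return LOUDNESS_FACTORS[0]
    if zoom >= ZOOM_BREAKPOINTS[-1]:
        return LOUDNESS_FACTORS[-1]
    i = bisect.bisect_right(ZOOM_BREAKPOINTS, zoom) - 1
    z0, z1 = ZOOM_BREAKPOINTS[i], ZOOM_BREAKPOINTS[i + 1]
    f0, f1 = LOUDNESS_FACTORS[i], LOUDNESS_FACTORS[i + 1]
    # linear interpolation between the two nearest breakpoints
    return f0 + (f1 - f0) * (zoom - z0) / (z1 - z0)
```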
  • Step S103 associate the current video frame with the target audio to obtain a target video.
  • the current video frame and the target audio may be associated by means of a time stamp, so that the current video frame and the target audio can be synchronized during playback. And, the current video frame and the target audio can be merged into a file of a specific video format, that is, the target video can be obtained.
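Associating the current video frame with the target audio by time stamp, as described above, might look like the following sketch. The data structures are illustrative assumptions standing in for a real container or muxing format:

```python
# Hedged sketch: pair each video frame with the audio chunk closest in time.
from dataclasses import dataclass

@dataclass
class Frame:
    pts: float          # presentation timestamp in seconds
    pixels: bytes

@dataclass
class AudioChunk:
    pts: float
    samples: bytes

def associate(frames, chunks, tolerance=0.02):
    """Return (frame, chunk) pairs; chunk is None if nothing is close enough."""
    paired = []
    for f in frames:
        best = min(chunks, key=lambda c: abs(c.pts - f.pts))
        paired.append((f, best if abs(best.pts - f.pts) <= tolerance else None))
    return paired
```

A real implementation would then hand the synchronized pairs to a muxer to produce a file in a specific video format.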
  • the video processing method may include:
  • Step S301 if it is detected that the current image zoom factor of the current video frame does not meet the preset condition, then the specified object contained in the current video frame is used as at least part of the target object;
  • Step S302 process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any of the target audio components is an audio component corresponding to a target object in the initial audio;
  • Step S303 associate the current video frame with the target audio to obtain a target video.
  • the preset condition may refer to that the current image zoom factor belongs to a preset interval or is equal to a preset multiple value (eg, equal to 1), or the like.
  • if the current image zoom factor does not meet a preset condition, it may be considered that the user has performed image zoom processing on the current video frame, so that the current video frame is downscaled or upscaled relative to the original video frame captured by the camera of the terminal device.
  • the field of view of the current video frame has also changed compared to the original video frame. Therefore, the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame.
  • the specified object may be the focused object in the current video frame, an object selected by the user, one or more objects detected by target detection, or the like.
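Step S301 above — checking the preset condition and collecting the specified objects — can be sketched as follows, assuming (as one possibility the text allows) that the preset condition is "zoom factor equal to 1 within a small tolerance". The interval bounds and argument names are illustrative:

```python
# Hedged sketch of Step S301: if the zoom was changed, the specified objects
# in the current frame become (at least part of) the target objects.
def zoom_meets_preset(zoom, lo=0.99, hi=1.01):
    """Preset condition assumed here: zoom factor effectively equal to 1."""
    return lo <= zoom <= hi

def select_target_objects(zoom, focused, user_selected, detected):
    if zoom_meets_preset(zoom):
        return set()
    # the specified object may be the focused object, a user-selected object,
    # or objects found by target detection
    return set(focused) | set(user_selected) | set(detected)
```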
  • the video processing method may include:
  • Step S401 if it is detected that the first focus object in the current video frame is different from the second focus object in the previous video frame of the current video frame, then the first focus object in the current video frame is used as at least part of the target object;
  • Step S402 process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any of the target audio components is an audio component corresponding to a target object in the initial audio;
  • Step S403 associate the current video frame with the target audio to obtain a target video.
  • it may be determined that the focus object of the current video frame has changed relative to the previous video frame.
  • the first focus object in the current video frame is the content that the user wants to highlight in the current video frame. Therefore, the first focus object in the current video frame can be used as at least part of the target object.
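The focus-change test of Step S401 can be sketched as a small helper; the argument names are illustrative assumptions:

```python
# Hedged sketch of Step S401: when the focus object differs from the one in
# the previous frame, take the new focus object as (part of) the target object.
def focus_targets(current_focus, previous_focus):
    if current_focus is not None and current_focus != previous_focus:
        return {current_focus}
    return set()
```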
  • the video processing method may include:
  • Step S501 if it is detected that the current image zoom factor of the current video frame does not meet the preset condition, and that the first focus object in the current video frame is different from the second focus object in the previous video frame of the current video frame, then the specified object contained in the current video frame and the first focus object in the current video frame are used as at least part of the target object;
  • Step S502 process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any of the target audio components is an audio component corresponding to a target object in the initial audio;
  • Step S503 associate the current video frame with the target audio to obtain a target video.
  • the target object may be determined according to both the current image zoom factor of the current video frame and the change of the first focus object in the current video frame relative to the second focus object of the previous video frame.
  • the target object determined from the current image zoom factor and the target object determined from the focus-object change may be the same object. Therefore, the processing of the target audio component corresponding to the target object can be determined jointly by combining the current image zoom factor and the first focus object, so that the sound effect achieved by the corresponding target audio is more suitable for the visual effect presented in the current video frame.
  • FIG. 6(a) shows the original video frame corresponding to the current video frame.
  • the first objects included in the original video frame include cows, cats and dogs.
  • the output of the first neural network includes a first audio component corresponding to the cow, a second audio component corresponding to the cat, and a third audio component corresponding to the dog.
  • the volume level of the first audio component is a
  • the volume level of the second audio component is b
  • the volume level of the third audio component is c.
  • the user can zoom in on the original video frame through a specific screen operation gesture.
  • the current image zoom factor is greater than 1, it can be considered that the current image zoom factor of the current video frame does not meet the preset condition.
  • the field of view of the current video frame obtained after the enlargement process will be smaller than the field of view of the original video frame.
  • if the zoom factor of the current image is 1.5 and the target object in the current video frame does not include the cow, the image parts of the cat and the dog occupy a larger proportion of the current video frame than of the original video frame. Accordingly, the volume levels of the second audio component b and the third audio component c can be increased, and the volume level of the first audio component a can be decreased, according to the current image zoom factor, so as to obtain the target audio.
  • the zoom factor of the current image is further increased to 2
  • the target object in the current video frame does not include a cow
  • the proportion of the image occupied by the cat and dog parts is larger still; therefore, the volume levels of the second audio component b and the third audio component c can be further increased, and the volume level of the first audio component a can be further reduced, according to the current image zoom factor, to obtain the target audio.
  • as the image of a target object is enlarged, the corresponding target audio component can be adjusted accordingly, so that it is also highlighted and enhanced as the image grows. This better simulates the user gradually approaching the target object, making it easier for the user to feel immersed in the scene and thereby improving the user experience.
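The numeric example above (FIG. 6, volume levels a, b, c) can be walked through in code. The gain curve below — attenuating the cow's component by 1/zoom and boosting the cat's and dog's components by the zoom factor — is an assumed mapping for illustration, not the one the application prescribes:

```python
# Hedged walk-through of the FIG. 6 example: the cow leaves the field of view
# as the zoom factor grows, so its component (a) is attenuated while the cat
# and dog components (b, c) are boosted.
def zoom_gains(a, b, c, zoom):
    """Return adjusted volume levels (cow, cat, dog) for a given zoom factor."""
    if zoom <= 1.0:                 # preset condition met: leave audio alone
        return a, b, c
    boost = zoom                    # assumed: gain grows with the enlargement
    cut = 1.0 / zoom
    return a * cut, b * boost, c * boost

a1, b1, c1 = zoom_gains(1.0, 1.0, 1.0, 1.5)  # zoom 1.5: cow quieter, cat/dog louder
a2, b2, c2 = zoom_gains(1.0, 1.0, 1.0, 2.0)  # zoom 2.0: further attenuated / boosted
```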
  • in the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame may be acquired, where the target object may be regarded as the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame and the target audio are associated to obtain the target video. Here, any target audio component is an audio component corresponding to a target object in the initial audio. Therefore, according to the specified editing information, the target audio components of the target objects in the initial audio, among other parts, can be adjusted accordingly, so that the sound effect achieved by the target audio better fits the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • FIG. 7 shows a structural block diagram of a video processing apparatus provided by the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown.
  • the video processing device 7 includes:
  • an acquisition module 701 configured to acquire the target object contained in the current video frame if the specified editing information of the current video frame is detected
  • a processing module 702, configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein any one of the target audio components is an audio component corresponding to a target object in the initial audio;
  • An association module 703, configured to associate the current video frame with the target audio to obtain a target video.
  • the video processing device 7 further includes:
  • a second processing module, configured to input the initial audio into the trained first neural network and obtain an output result of the trained first neural network, where the output result includes the identified audio objects and the audio component corresponding to each audio object;
  • a comparison module, configured to compare the audio objects with the target objects, and if at least one audio object is the same as a target object, determine that the initial audio corresponding to the current video frame contains at least one target audio component corresponding to a target object.
  • the obtaining module 701 is specifically used for:
  • Target recognition is performed on the current video frame through the trained second neural network to obtain the target object contained in the current video frame, where the labels in the first training data set corresponding to the first neural network at least partially overlap with the labels in the second training data set corresponding to the second neural network.
  • the video processing device 7 further includes:
  • the third processing module is configured to identify target frequency bands corresponding to each target object in the initial audio according to a preset object-frequency band mapping table, and use the target frequency band as a target audio component of the corresponding target object.
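The preset object-frequency band mapping table used by the third processing module can be sketched as a dictionary of assumed band edges, with the target audio component extracted by an FFT band mask. The band edges and labels are illustrative assumptions, and the mask is a simplification of whatever filtering a real implementation would use:

```python
# Hedged sketch: map object labels to assumed frequency bands and extract the
# corresponding target audio component with an FFT band mask.
import numpy as np

OBJECT_BANDS_HZ = {"dog": (500.0, 2000.0), "cow": (80.0, 300.0)}

def extract_band(signal, sample_rate, obj):
    """Keep only the frequency band mapped to `obj` in the mono signal."""
    lo, hi = OBJECT_BANDS_HZ[obj]
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= lo) & (freqs <= hi)
    return np.fft.irfft(spectrum * mask, n=len(signal))
```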
  • the obtaining module 701 is specifically used for:
  • the specified object contained in the current video frame is used as at least part of the target object
  • the first focus object in the current video frame is used as at least part of the target object.
  • processing module 702 specifically includes:
  • a first obtaining unit configured to obtain the original video frame corresponding to the current video frame according to the specified editing information
  • a detection unit configured to detect the first object contained in the original video frame
  • a comparison unit for comparing the target object contained in the current video frame and the first object contained in the original video frame to obtain a comparison result
  • the second processing unit is configured to process the initial audio corresponding to the current video frame according to the comparison result to obtain the target audio.
  • processing module 702 is specifically used for:
  • the loudness of the target audio component in the initial audio is adjusted to obtain the target audio.
  • in the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame may be acquired, where the target object may be regarded as the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame and the target audio are associated to obtain the target video. Here, any target audio component is an audio component corresponding to a target object in the initial audio. Therefore, according to the specified editing information, the target audio components of the target objects in the initial audio, among other parts, can be adjusted accordingly, so that the sound effect achieved by the target audio better fits the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 8 of this embodiment includes: at least one processor 80 (only one is shown in FIG. 8), a memory 81, and a computer program 82 stored in the memory 81 and runnable on the at least one processor 80; when the processor 80 executes the computer program 82, the steps in any of the above video processing method embodiments are implemented.
  • the above-mentioned terminal device 8 may be a computing device such as a server, a mobile phone, a wearable device, an augmented reality (AR)/virtual reality (VR) device, a desktop computer, a notebook computer, or a handheld computer.
  • the terminal device may include, but is not limited to, a processor 80 and a memory 81 .
  • FIG. 8 is only an example of the terminal device 8 and does not constitute a limitation on the terminal device 8; the device may include more or fewer components than shown, combine certain components, or use different components. For example, it may also include input devices, output devices, network access devices, and so on.
  • the above-mentioned input devices may include keyboards, touchpads, fingerprint collection sensors (for collecting user's fingerprint information and fingerprint direction information), microphones, cameras, etc.
  • output devices may include displays, speakers, and the like.
  • the above-mentioned processor 80 may be a central processing unit (Central Processing Unit, CPU), and the processor 80 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the above-mentioned memory 81 may be an internal storage unit of the above-mentioned terminal device 8 , such as a hard disk or a memory of the terminal device 8 .
  • in other embodiments, the above-mentioned memory 81 may also be an external storage device of the above-mentioned terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart media card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), etc.
  • the above-mentioned memory 81 may also include both an internal storage unit of the above-mentioned terminal device 8 and an external storage device.
  • the above-mentioned memory 81 is used to store an operating system, an application program, a boot loader (Boot Loader), data, and other programs, such as program codes of the above-mentioned computer programs, and the like.
  • the above-mentioned memory 81 can also be used to temporarily store data that has been output or is to be output.
  • the above-mentioned terminal device 8 may also include a network connection module, such as a Bluetooth module, a Wi-Fi module, a cellular network module, and the like, which will not be repeated here.
  • when the above-mentioned processor 80 executes the above-mentioned computer program 82 to implement the steps in any of the above video processing method embodiments: if the specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be regarded as the content to be highlighted; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video; here, any target audio component is an audio component corresponding to a target object in the initial audio. Therefore, according to the specified editing information, the target audio components of the target objects in the initial audio can be adjusted accordingly, so that the sound effect of the target audio better fits
  • the visual effect presented in the current video frame, thereby enhancing the presentation effect of the target video and improving the user experience.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • Embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the foregoing method embodiments.
  • if the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may be completed by instructing the relevant hardware through a computer program.
  • the above-mentioned computer program may be stored in a computer-readable storage medium, and when executed by a processor, the steps of the foregoing method embodiments may be implemented.
  • the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code form, executable file or some intermediate form.
  • the above-mentioned computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.
  • in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
  • the disclosed apparatus/device and method may be implemented in other manners.
  • the apparatus/equipment embodiments described above are only illustrative.
  • the division of the above modules or units is only a logical function division; in actual implementation, there may be other division manners.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be implemented through some interfaces, or through indirect coupling or communication connection of devices or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

Abstract

Provided are a video processing method and apparatus, and a terminal device and a computer-readable storage medium. The method comprises: if specified editing information of the current video frame is detected, acquiring a target object included in the current video frame; according to the specified editing information and at least one target audio component, processing initial audio that corresponds to the current video frame, so as to obtain target audio; and associating the current video frame with the target audio to obtain a target video.

Description

Video processing method, apparatus, terminal device and computer-readable storage medium
This application claims priority to the Chinese patent application No. 202010710645.X, filed with the Chinese Patent Office on July 22, 2020 and entitled "Video processing method, video processing apparatus and terminal device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of video processing, and in particular relates to a video processing method, apparatus, terminal device, and computer-readable storage medium.
Background
In the process of video recording or video editing, a user may adjust some video frames by zooming the image, changing the focus object, and so on, to highlight certain content.
Summary of the Invention
Embodiments of the present application provide a video processing method, an apparatus, a terminal device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present application provides a video processing method, including:
if specified editing information of the current video frame is detected, acquiring a target object contained in the current video frame;
processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, where any target audio component is an audio component corresponding to a target object in the initial audio;
associating the current video frame with the target audio to obtain a target video.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including:
an acquisition module, configured to acquire a target object contained in the current video frame if specified editing information of the current video frame is detected;
a processing module, configured to process the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, where any target audio component is an audio component corresponding to a target object in the initial audio;
an association module, configured to associate the current video frame with the target audio to obtain a target video.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, a display, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the video processing method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video processing method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the video processing method of the first aspect.
Description of Drawings
FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of step S102 provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another video processing method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of yet another video processing method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of still another video processing method provided by an embodiment of the present application;
FIG. 6 is an exemplary schematic diagram of obtaining target audio provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed Description
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are provided for a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that, as used in this specification and the appended claims, the term "and/or" refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined" or "in response to determining" or "once the [described condition or event] is detected" or "in response to detecting the [described condition or event]".
References in this specification to "one embodiment", "some embodiments" and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in further embodiments" and the like appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
基于上述技术问题,本申请实施例提供一种视频处理方法,其中,包括:Based on the above technical problems, an embodiment of the present application provides a video processing method, which includes:
若检测到当前视频帧的指定编辑信息,则获取所述当前视频帧中包含的目标对象;If the specified editing information of the current video frame is detected, then the target object contained in the current video frame is obtained;
根据所述指定编辑信息和至少一个目标音频分量,对所述当前视频帧所对应的初始音频进行处理,获得目标音频,其中,任一所述目标音频分量为一个所述目标对象在所述初始音频中所对应的音频分量;According to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain target audio, wherein any target audio component is a target object in the initial audio The corresponding audio component in the audio;
将所述当前视频帧和所述目标音频进行关联,获得目标视频。Associating the current video frame with the target audio to obtain a target video.
本申请实施例与现有技术相比存在的有益效果是:本申请实施例中,可以若检测到当前视频帧的指定编辑信息,则获取所述当前视频帧中包含的目标对象,其中,所述目标对象可以认为是所想要突出显示的内容;然后,根据所述指定编辑信息和至少一个目标音频分量,对所述当前视频帧所对应的初始音频进行处理,获得目标音频,并将所述当前视频帧和所述目标音频进行关联,获得目标视频;此时,任一所述目标音频分量为一个所述目标对象在所述初始音频中所对应的音频分量,因此可以根据所述指定编辑信息,对所述初始视频中关于所述目标对象的目标音频分量等部分进行相应的调整,从而使得相应的目标音频的所达到的声音效果更贴合所述当前视频帧中所呈现的视觉效果,以提升所述目标视频的呈现效果,改善用户的体验。Compared with the prior art, the beneficial effect of the embodiment of the present application is: in the embodiment of the present application, if the specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, wherein the The target object can be considered as the content you want to highlight; then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the The current video frame and the target audio are associated to obtain the target video; at this time, any of the target audio components is an audio component corresponding to the target object in the initial audio, so it can be specified according to the specified Editing information to make corresponding adjustments to the target audio component and other parts of the target object in the initial video, so that the achieved sound effect of the corresponding target audio is more suitable for the visual presented in the current video frame. effect, so as to enhance the presentation effect of the target video and improve the user experience.
In an optional embodiment of the present application, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method further includes:
inputting the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including identified audio objects and the audio component corresponding to each audio object;
comparing the audio objects with the target objects, and if at least one audio object is identical to a target object, determining that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
In an optional embodiment of the present application, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
performing target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein labels in a first training data set corresponding to the first neural network at least partially overlap with labels in a second training data set corresponding to the second neural network.
In an optional embodiment of the present application, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method includes:
identifying, in the initial audio according to a preset object-frequency band mapping table, the target frequency band corresponding to each target object, and using each target frequency band as the target audio component of the corresponding target object.
In an optional embodiment of the present application, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, using a specified object contained in the current video frame as at least part of the target object;
and/or,
if it is detected that a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame, using the first focus object in the current video frame as at least part of the target object.
In an optional embodiment of the present application, processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio includes:
acquiring, according to the specified editing information, the original video frame corresponding to the current video frame;
detecting a first object contained in the original video frame;
comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
processing, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
In an optional embodiment of the present application, processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio includes:
adjusting, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
In an optional embodiment of the present application, adjusting the loudness of the target audio component in the initial audio according to the specified editing information includes:
if the specified editing information includes a current image zoom factor, determining the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
In an optional embodiment of the present application, the target audio component is a sub-audio identified and extracted from the initial audio in advance; or
the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
An embodiment of the present application further provides a video processing apparatus, including:
an acquisition module, configured to acquire the target object contained in the current video frame if specified editing information of the current video frame is detected;
a processing module, configured to process, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain target audio, wherein each target audio component is the audio component corresponding to one target object in the initial audio;
an association module, configured to associate the current video frame with the target audio to obtain a target video.
In an optional embodiment of the present application, the video processing apparatus further includes:
a second processing module, configured to input the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including identified audio objects and the audio component corresponding to each audio object;
a comparison module, configured to compare the audio objects with the target objects, and if at least one audio object is identical to a target object, determine that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
In an optional embodiment of the present application, the acquisition module is configured to:
perform target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein labels in a first training data set corresponding to the first neural network at least partially overlap with labels in a second training data set corresponding to the second neural network.
In an optional embodiment of the present application, the video processing apparatus further includes:
a third processing module, configured to identify, in the initial audio according to a preset object-frequency band mapping table, the target frequency band corresponding to each target object, and use each target frequency band as the target audio component of the corresponding target object.
In an optional embodiment of the present application, the acquisition module is configured to:
if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, use a specified object contained in the current video frame as at least part of the target object;
and/or,
if it is detected that a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame, use the first focus object in the current video frame as at least part of the target object.
In an optional embodiment of the present application, the video processing apparatus further includes:
a first acquisition unit, configured to acquire, according to the specified editing information, the original video frame corresponding to the current video frame;
a detection unit, configured to detect a first object contained in the original video frame;
a comparison unit, configured to compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
a second processing unit, configured to process, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
In an optional embodiment of the present application, the processing module is configured to:
adjust, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
In an optional embodiment of the present application, the processing module is configured to:
if the specified editing information includes a current image zoom factor, determine the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
In an optional embodiment of the present application, the target audio component is a sub-audio identified and extracted from the initial audio in advance; or the target audio component is obtained by pre-identifying a specific frequency band in the initial audio.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video processing method described in the above embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video processing method described in the above embodiments of the present application.
The video processing method provided by the embodiments of the present application can be applied to terminal devices such as servers, desktop computers, mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of the present application place no restriction on the specific type of terminal device.
FIG. 1 shows a flowchart of a video processing method provided by an embodiment of the present application; the video processing method can be applied to a terminal device.
At present, during video recording or video editing, a user may adjust some video frames to highlight a certain scene or object; for example, the user may adjust the image zoom factor of a video frame, thereby changing its field of view, or may change the focus object. In the prior art, however, after a video frame is adjusted, the audio corresponding to the video frame usually remains the original audio. The prior art thus fails to recognize that, after a video frame is adjusted, the spatial relationship between the scene and the objects that the user perceives from the video frame may have changed, whereas the sound presented in the original audio is still the sound captured under the spatial relationship presented before the frame was scaled. As a result, after a video frame is adjusted, the visual presentation of the image and the sound presentation of the audio may no longer match, and the resulting video is poorly presented.
Through the embodiments of the present application, in contrast, the target audio component and other parts of the initial audio relating to the target object can be adjusted according to the specified editing information, so that the obtained target audio changes with the specified editing operation on the current video frame. The sound effect achieved by the corresponding target audio therefore better matches the visual effect presented in the current video frame, improving the presentation of the target video and providing the user with a more immersive viewing experience.
Specifically, as shown in FIG. 1, the video processing method may include:
Step S101: if specified editing information of the current video frame is detected, acquire the target object contained in the current video frame.
In the embodiments of the present application, the current video frame may be a video frame captured in real time during video capture, or a video frame extracted from the video to be edited during video editing. Of course, in other application scenarios the current video frame may be obtained in other ways.
In the embodiments of the present application, the target object in the current video frame may be identified in a variety of ways. For example, target detection may be performed on the current video frame using a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster R-CNN, Single Shot MultiBox Detector (SSD), RetinaNet, or a multi-scale detection method, to obtain the target object contained in the current video frame.
In the embodiments of the present application, the specified editing information may be information associated with a specified editing operation on the current video frame. The specified editing operation may be a specified adjustment of the current video frame performed by the user or performed automatically by the terminal device, for example, zooming the current video frame or changing the focus object in the current video frame. For example, the specified editing operation may include an image zoom operation that does not satisfy a preset condition (for example, the current image zoom factor corresponding to the image zoom operation does not satisfy the preset condition); and/or the specified editing operation may include an image focus operation after which a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame.
In some embodiments, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, using a specified object contained in the current video frame as at least part of the target object;
and/or,
if it is detected that a first focus object in the current video frame differs from a second focus object in the video frame preceding the current video frame, using the first focus object in the current video frame as at least part of the target object.
In the embodiments of the present application, in some cases, when the current image zoom factor does not satisfy the preset condition, it can be considered that the user has zoomed the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame captured by the camera of the terminal device. The field of view of the current video frame has then changed relative to the original video frame, so the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame. The specified object may be the focus object in the current video frame, an object selected by the user, one or more target objects found by target detection, or the like. If, on the other hand, it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, it can be determined that the focus object of the current video frame has changed relative to the preceding frame. In that case, the first focus object in the current video frame can be considered the content the user wants to highlight, and can therefore be used as at least part of the target object.
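The two trigger conditions above can be illustrated with a small selection routine. This is an illustrative sketch only, not part of the claimed method; the zoom range used as the preset condition, the function name, and the data shapes are all assumptions.

```python
def select_target_objects(zoom_factor, focus_object, prev_focus_object,
                          specified_objects, zoom_range=(0.95, 1.05)):
    """Collect target objects per the two triggers (hypothetical interface).

    zoom_range plays the role of the preset condition: zoom factors inside it
    are treated as 'no meaningful zoom' and do not trigger selection.
    """
    targets = []
    # Trigger 1: the zoom factor falls outside the preset condition, so the
    # specified objects (focus object, user-selected objects, or detected
    # objects) become at least part of the target objects.
    low, high = zoom_range
    if not (low <= zoom_factor <= high):
        targets.extend(specified_objects)
    # Trigger 2: the focus object changed relative to the preceding frame, so
    # the new focus object becomes at least part of the target objects.
    if focus_object is not None and focus_object != prev_focus_object:
        targets.append(focus_object)
    # Deduplicate while preserving order.
    seen = set()
    return [t for t in targets if not (t in seen or seen.add(t))]
```

Both triggers may fire on the same frame, which is why the routine merges and deduplicates rather than choosing one branch.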
Step S102: process, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain target audio, wherein each target audio component is the audio component corresponding to one target object in the initial audio.
In the embodiments of the present application, the specific form of the target audio component may be set according to the actual scenario. For example, the target audio component may be a sub-audio identified and extracted from the initial audio in advance; alternatively, a specific frequency band in the initial audio may be pre-identified as the target audio component.
The target audio component may likewise be identified in various ways, which are not limited here. For example, the initial audio may be processed with a Mel cepstrum algorithm, a trained recurrent neural network, and/or a convolutional neural network, so that the audio component of an identified audio object is separated from the initial audio and marked; if that audio object is one of the target objects, the corresponding audio component can serve as the target audio component. Alternatively, according to a pre-stored mapping between frequency bands and objects, the target frequency band corresponding to each target object may be determined and marked in the initial audio to serve as the target audio component.
In the embodiments of the present application, after the target object contained in the current video frame is acquired, the target audio component and other parts of the initial audio relating to the target object can be adjusted accordingly, so that the obtained target audio changes with the current video frame.
In some embodiments, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method further includes:
inputting the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including identified audio objects and the audio component corresponding to each audio object;
comparing the audio objects with the target objects, and if at least one audio object is identical to a target object, determining that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
In the embodiments of the present application, through pre-training, the first neural network can identify audio feature information of different audio objects, such as waveform features and frequency features, so as to extract and separate the audio component of each audio object. For example, the first neural network may be a recurrent neural network such as a bidirectional recurrent neural network (Bi-RNN) or a long short-term memory network (LSTM).
In the embodiments of the present application, the first neural network may be pre-trained on the first training data set corresponding to the first neural network. The first training data set may include multiple training audio clips and the label corresponding to each training audio clip, where the labels may include the audio objects.
In some embodiments, acquiring the target object contained in the current video frame if the specified editing information of the current video frame is detected includes:
performing target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein labels in the first training data set corresponding to the first neural network at least partially overlap with labels in a second training data set corresponding to the second neural network.
In the embodiments of the present application, for example, the second neural network may be based on a target detection method such as Spatial Pyramid Pooling Networks (SPPNet), Faster R-CNN, Single Shot MultiBox Detector (SSD), RetinaNet, or a multi-scale detection method.
In the embodiments of the present application, the second training data set corresponding to the second neural network includes multiple training images and the label corresponding to each training image. The labels in the first training data set corresponding to the first neural network at least partially overlap with the labels in the second training data set corresponding to the second neural network, so that the target object identified by the second neural network in the current video frame can be associated with the audio object identified by the first neural network in the initial audio. The first neural network and the second neural network thus establish an association among objects, images, and audio, so that when zooming changes the field of view of the images in a video, the corresponding audio can be adjusted accordingly, avoiding the prior-art mismatch between the visual presentation of the adjusted image and the sound presentation of the audio.
Table 1 shows an exemplary way of associating the labels of the first training data set and the second training data set.
Table 1:

Label      Training image    Training audio
person     A                 a
cat        B                 b
dog        C                 c
vehicle    D                 d
…          …                 …
A specific database stores the multiple training audio clips of the first training data set, namely training audio a, training audio b, training audio c, training audio d, and so on, as well as the multiple training images of the second training data set, namely training image A, training image B, training image C, training image D, and so on. The label of training audio a is the same as the label of training image A, the label of training audio b is the same as the label of training image B, the label of training audio c is the same as the label of training image C, and the label of training audio d is the same as the label of training image D. In some embodiments, some labels may have only audio and no image; for example, wind has corresponding training audio but no corresponding training image, while an object that generally makes no sound, such as a turtle, may have corresponding training images but no corresponding training audio. The labels in the first training data set corresponding to the first neural network may therefore differ somewhat from the labels in the second training data set corresponding to the second neural network.
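The label correspondence described above can be sketched as a minimal lookup: a visual label found by the second network selects the audio component carrying the same label, and labels present in only one modality (wind, turtle) simply produce no match. The dictionary contents and names below are illustrative assumptions, not data from the application.

```python
# Hypothetical label-keyed outputs of the two networks, mirroring Table 1.
AUDIO_COMPONENTS = {"person": "a", "cat": "b", "dog": "c",
                    "vehicle": "d", "wind": "w"}  # 'wind' is audio-only

def match_target_components(detected_objects, audio_components):
    """Keep only the audio components whose label was also detected visually;
    image-only labels (e.g. 'turtle') yield no match."""
    return {label: audio_components[label]
            for label in detected_objects if label in audio_components}
```

Because matching is done purely by shared label strings, the two networks can be trained and updated independently as long as their label vocabularies stay aligned.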
In some embodiments, before the initial audio corresponding to the current video frame is processed according to the specified editing information and the at least one target audio component to obtain the target audio, the method includes:
identifying, in the initial audio according to a preset object-frequency band mapping table, the target frequency band corresponding to each target object, and using each target frequency band as the target audio component of the corresponding target object.
In the embodiments of the present application, the preset object-frequency band mapping table may pre-store the mapping between objects and frequency bands; by querying the object-frequency band mapping table, the target frequency band corresponding to each target object can therefore be determined and marked in the initial audio.
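A minimal sketch of the mapping-table query follows. The band limits are invented for illustration and are not taken from the application; a real table would be tuned per device and scene.

```python
# Assumed preset object-frequency band mapping table (Hz); values are
# illustrative placeholders only.
OBJECT_BAND_TABLE = {
    "person":  (85, 3400),   # roughly the range of human speech
    "dog":     (250, 4500),
    "vehicle": (30, 1500),
}

def identify_target_bands(target_objects, table=OBJECT_BAND_TABLE):
    """Look up and mark the target frequency band of each target object;
    objects absent from the table yield no target audio component."""
    return {obj: table[obj] for obj in target_objects if obj in table}
```

The returned bands would then be used to mark (e.g. band-pass select) the corresponding regions of the initial audio as target audio components.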
In some embodiments, processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio includes:
Step S201: acquiring, according to the specified editing information, the original video frame corresponding to the current video frame;
Step S202: detecting a first object contained in the original video frame;
Step S203: comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;
Step S204: processing, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
In this embodiment of the present application, the original video frame corresponding to the current video frame may be acquired, so that the processing of the initial audio can be determined according to the content change between the original video frame and the current video frame.

Here, the original video frame may be a video frame captured by a camera of the terminal device and displayed, while the current video frame is obtained by editing the original video frame according to the specified editing information. Therefore, by comparing the target object contained in the current video frame with the first object contained in the original video frame, the content that the user wants to highlight can be determined more precisely with reference to the original video frame, so that the initial audio can be processed in a more targeted manner.

For example, in some cases the target object of the current video frame includes a person, while the first objects in the original video frame include a person and a dog. By comparing the target object with the first objects, it can be inferred that the user wants to highlight the person in the current video frame. Therefore, when the corresponding initial audio is processed, the audio intensity of the target audio component identified as the person may be increased according to the specified editing information, while the audio intensity of the audio component identified as the dog is decreased, so as to better match the image displayed in the current video frame.
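The comparison logic of steps S201–S204 can be sketched as set operations over the detected object labels; the gain values and function names below are illustrative assumptions, not specified by the application:

```python
# Illustrative sketch of steps S201-S204: objects still present in the edited
# (current) frame are boosted, objects that appear only in the original frame
# are attenuated. The boost/cut factors are assumptions for illustration.

def adjust_gains(target_objects, first_objects, boost=1.5, cut=0.5):
    """Return a per-object gain from the comparison result: boost objects kept
    in the current frame, cut objects that were edited out of it."""
    kept = set(target_objects) & set(first_objects)
    dropped = set(first_objects) - set(target_objects)
    gains = {obj: boost for obj in kept}
    gains.update({obj: cut for obj in dropped})
    return gains

# The person/dog example from the text: the dog was edited out of the frame.
gains = adjust_gains(target_objects=["person"], first_objects=["person", "dog"])
```

Each returned gain would then be applied to the matching object's audio component when producing the target audio.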
Processing the initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain the target audio includes:

adjusting, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.

In this embodiment of the present application, the adjustment applied to the loudness of the target audio component in the initial audio may be determined according to the specified editing information. For example, if the specified editing information includes a current image zoom factor, the loudness of the target audio component in the initial audio may be determined from the current image zoom factor according to a predetermined correspondence between image zoom factors and loudness adjustment factors. If, instead, the specified editing information includes information on a focus-object switch, the loudness of the target audio component in the initial audio may be determined according to information such as the distance and the relative sizes between the second image region of the second focus object before the switch and the first image region of the first focus object after the switch.

By adjusting the loudness of the target audio component in the initial audio, the spatial relationship among the scene and objects that the user perceives from the current video frame can be matched with the spatial relationship that the user perceives from the target audio, giving the user a better experience.
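One plausible form of the "predetermined correspondence" between zoom factor and loudness is a fixed decibel change per doubling of zoom. The application does not fix the rule, so the mapping below is only an assumed example:

```python
import math

# The application only states that a predetermined correspondence between image
# zoom factor and loudness adjustment exists; the fixed-dB-per-doubling rule
# below is an assumed instance of such a correspondence.

def zoom_to_gain(zoom_factor, db_per_doubling=6.0):
    """Each doubling of the zoom raises the target component by a fixed number
    of decibels; a zoom factor of 1.0 maps to unit gain."""
    db = db_per_doubling * math.log2(zoom_factor)
    return 10.0 ** (db / 20.0)

def apply_gain(samples, gain):
    """Scale the target audio component's samples by the computed gain."""
    return [s * gain for s in samples]
```

With these assumptions, zooming from 1x to 2x raises the highlighted component by 6 dB (roughly doubling its amplitude), which tracks the object appearing about twice as large on screen.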
Step S103: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, the current video frame and the target audio may be associated by means of timestamps or the like, so that the current video frame and the target audio remain synchronized during playback. Furthermore, the current video frame and the target audio may be merged into a file in a specific video format, thereby obtaining the target video.
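A minimal sketch of the timestamp association (muxing into an actual video container is out of scope here; the type and function names are hypothetical):

```python
from dataclasses import dataclass

# Sketch of step S103: pair each processed video frame with its target audio
# through a shared timestamp so that playback stays synchronized.

@dataclass
class AVPair:
    timestamp_ms: int
    frame: bytes
    audio: bytes

def associate(frames, audios):
    """frames/audios are lists of (timestamp_ms, payload); entries sharing a
    timestamp are paired, which is the 'association' the method describes."""
    audio_by_ts = dict(audios)
    return [AVPair(ts, f, audio_by_ts[ts])
            for ts, f in frames if ts in audio_by_ts]

pairs = associate([(0, b"f0"), (33, b"f1")], [(0, b"a0"), (33, b"a1")])
```

In a real implementation the paired stream would then be written to a container format, which is what the text means by merging into a file of a specific video format.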
On the basis of the foregoing embodiments, as an optional embodiment of the present application, referring to FIG. 3, the video processing method may include:

Step S301: if it is detected that the current image zoom factor of the current video frame does not satisfy a preset condition, using a specified object contained in the current video frame as at least part of the target object;

Step S302: processing, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

Step S303: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, for example, the preset condition may be that the current image zoom factor falls within a preset interval or equals a preset value (for example, 1). In some cases, when the current image zoom factor does not satisfy the preset condition, it can be assumed that the user has applied image scaling to the current video frame, so that the current video frame is reduced or enlarged relative to the original video frame captured by the camera of the terminal device. At this point, the field of view of the current video frame has also changed relative to the original video frame. Therefore, the content to be highlighted in the current video frame can be determined by acquiring the specified object contained in the current video frame. The specified object may be the focus object of the current video frame, an object selected by the user, one or more target objects detected by object detection, or the like.
On the basis of the foregoing embodiments, as an optional embodiment of the present application, referring to FIG. 4, the video processing method may include:

Step S401: if it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, using the first focus object in the current video frame as at least part of the target object;

Step S402: processing, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

Step S403: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, if it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, it can be determined that the focus object of the current video frame has changed relative to the preceding frame. In this case, the first focus object in the current video frame can be regarded as the content that the user wants to highlight, and may therefore be used as at least part of the target object.
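The trigger condition of step S401 can be sketched as follows; the naming follows the application ("first focus object" belongs to the current frame, "second" to the preceding frame), while the function itself is an illustrative assumption:

```python
# Sketch of the step S401 trigger: when the focus object changes between
# consecutive frames, the new focus object joins the target-object set.

def targets_on_focus_change(first_focus, second_focus, current_targets=()):
    """first_focus: focus object of the current frame; second_focus: focus
    object of the preceding frame. Returns the updated target-object list."""
    targets = list(current_targets)
    if first_focus != second_focus and first_focus not in targets:
        targets.append(first_focus)
    return targets
```

If the focus has not moved, the target set is left unchanged, so no extra audio adjustment is triggered by this path.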
On the basis of the foregoing embodiments, as an optional embodiment of the present application, referring to FIG. 5, the video processing method may include:

Step S501: if it is detected that the current image zoom factor of the current video frame does not satisfy the preset condition, and that the first focus object in the current video frame differs from the second focus object in the preceding video frame, using the specified object contained in the current video frame and the first focus object in the current video frame as at least part of the target object;

Step S502: processing, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

Step S503: associating the current video frame with the target audio to obtain a target video.

In this embodiment of the present application, the target object may be determined according to both the current image zoom factor of the current video frame and the change of the first focus object in the current video frame relative to the second focus object of the preceding video frame. The target object determined from the current image zoom factor and the target object determined from the focus change may be the same object. Therefore, the processing of the target audio component corresponding to such a target object may be determined jointly from the current image zoom factor and the first focus object, so that the sound effect achieved by the resulting target audio better matches the visual effect presented in the current video frame.
A specific implementation of the embodiments of the present application is described below with a concrete example.

For example, FIG. 6(a) shows the original video frame corresponding to the current video frame. By performing object detection on the original video frame with the second neural network, it can be detected that the first objects contained in the original video frame include a cow, a cat, and a dog.

As shown in FIG. 6(b), when the initial audio is fed into the first neural network, the output of the first neural network includes a first audio component corresponding to the cow, a second audio component corresponding to the cat, and a third audio component corresponding to the dog. The volume of the first audio component is a, the volume of the second audio component is b, and the volume of the third audio component is c.

The user may enlarge the original video frame through a specific on-screen gesture. In this case the current image zoom factor is greater than 1, so the current image zoom factor of the current video frame can be considered not to satisfy the preset condition. The field of view of the current video frame obtained after the enlargement is smaller than that of the original video frame.

As shown in FIG. 6(c), if the current image zoom factor is 1.5 and the target objects of the current video frame no longer include the cow, the image regions of the cat and the dog in the current video frame appear larger than in the original video frame. Therefore, according to the current image zoom factor, the second audio component b and the third audio component c may be increased, and the first audio component a may be decreased, to obtain the target audio.

As shown in FIG. 6(d), if the current image zoom factor further increases to 2 and the target objects of the current video frame still do not include the cow, the cat and the dog occupy an even larger proportion of the image. Therefore, according to the current image zoom factor, the second audio component b and the third audio component c may be further increased, and the first audio component a further decreased, to obtain the target audio.

It can be seen that, in this example, as the current image zoom factor increases and the images corresponding to the target objects gradually become larger, the target audio components can be adjusted accordingly, so that they are strengthened along with the highlighted images. This better simulates the experience of gradually approaching the target objects, making it easier for the user to feel immersed and thereby improving the user experience.

It should be noted that the above example is merely an illustrative description of the embodiments of the present application and does not limit them.
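Under an assumed linear gain rule (the application does not fix one), the FIG. 6 walk-through can be reproduced numerically; all values and names below are illustrative:

```python
# Worked sketch of the FIG. 6 example: as the zoom factor grows, the cat and
# dog components are boosted and the cow component, whose object has left the
# frame, is cut. The proportional gain rule is an illustrative assumption.

def rescale_components(volumes, in_frame, zoom):
    """volumes: dict object -> volume. Objects still in the frame scale up
    with the zoom factor; objects outside it scale down by the same factor."""
    return {obj: (v * zoom if obj in in_frame else v / zoom)
            for obj, v in volumes.items()}

v0 = {"cow": 1.0, "cat": 1.0, "dog": 1.0}            # a, b, c at zoom 1
v15 = rescale_components(v0, in_frame={"cat", "dog"}, zoom=1.5)  # FIG. 6(c)
v20 = rescale_components(v0, in_frame={"cat", "dog"}, zoom=2.0)  # FIG. 6(d)
```

Monotonically increasing the in-frame gains with the zoom factor is what makes the audio track the gradual "approach" described in the example.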
In the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be regarded as the content to be highlighted. Then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video. Here, each target audio component is the audio component corresponding to one target object in the initial audio. Therefore, the parts of the initial audio related to the target object, such as its target audio component, can be adjusted according to the specified editing information, so that the sound effect achieved by the resulting target audio better matches the visual effect presented in the current video frame, enhancing the presentation of the target video and improving the user experience.

It should be understood that the numbering of the steps in the above embodiments does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the video processing method described in the above embodiments, FIG. 7 shows a structural block diagram of a video processing apparatus provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 7, the video processing apparatus 7 includes:

an acquisition module 701, configured to acquire, if specified editing information of the current video frame is detected, the target object contained in the current video frame;

a processing module 702, configured to process, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame to obtain the target audio, where each target audio component is the audio component corresponding to one target object in the initial audio;

an association module 703, configured to associate the current video frame with the target audio to obtain a target video.
Optionally, the video processing apparatus 7 further includes:

a second processing module, configured to feed the initial audio into a trained first neural network and obtain an output of the trained first neural network, the output including the recognized audio objects and the audio component corresponding to each audio object;

a comparison module, configured to compare the audio objects with the target object; if at least one audio object is identical to the target object, it is determined that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.

Optionally, the acquisition module 701 is specifically configured to:

perform target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, where the labels of the first training data set corresponding to the first neural network at least partially overlap with the labels of the second training data set corresponding to the second neural network.
Optionally, the video processing apparatus 7 further includes:

a third processing module, configured to identify, according to the preset object-to-frequency-band mapping table, the target frequency band corresponding to each target object in the initial audio, and use each target frequency band as the target audio component of the corresponding target object.

Optionally, the acquisition module 701 is specifically configured to:

if it is detected that the current image zoom factor of the current video frame does not satisfy the preset condition, use the specified object contained in the current video frame as at least part of the target object;

and/or,

if it is detected that the first focus object in the current video frame differs from the second focus object in the preceding video frame, use the first focus object in the current video frame as at least part of the target object.
Optionally, the processing module 702 specifically includes:

a first acquisition unit, configured to acquire, according to the specified editing information, the original video frame corresponding to the current video frame;

a detection unit, configured to detect the first object contained in the original video frame;

a comparison unit, configured to compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result;

a second processing unit, configured to process, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.

Optionally, the processing module 702 is specifically configured to:

adjust, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
In the embodiments of the present application, if specified editing information of the current video frame is detected, the target object contained in the current video frame can be acquired, where the target object can be regarded as the content to be highlighted. Then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video. Here, each target audio component is the audio component corresponding to one target object in the initial audio, so the parts of the initial audio related to the target object can be adjusted according to the specified editing information, making the sound effect of the target audio better match the visual effect presented in the current video frame, enhancing the presentation of the target video and improving the user experience.

It should be noted that, since the information exchange and execution processes among the above apparatuses/units are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the above division of functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to accomplish all or some of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of mutual distinction and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
FIG. 8 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 8, the terminal device 8 of this embodiment includes: at least one processor 80 (only one is shown in FIG. 8), a memory 81, and a computer program 82 stored in the memory 81 and executable on the at least one processor 80. When the processor 80 executes the computer program 82, the steps in any of the above video processing method embodiments are implemented.

The terminal device 8 may be a computing device such as a server, a mobile phone, a wearable device, an augmented reality (AR)/virtual reality (VR) device, a desktop computer, a laptop, or a palmtop computer. The terminal device may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art will understand that FIG. 8 is merely an example of the terminal device 8 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, it may further include input devices, output devices, and network access devices. The input devices may include a keyboard, a touchpad, a fingerprint sensor (for collecting the user's fingerprint information and fingerprint orientation information), a microphone, a camera, and the like; the output devices may include a display, a speaker, and the like.

The processor 80 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.

In some embodiments, the memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8. In other embodiments, the memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used to store the operating system, application programs, a boot loader, data, and other programs, such as the program code of the above computer program. The memory 81 may also be used to temporarily store data that has been output or is to be output.

In addition, although not shown, the terminal device 8 may further include network connection modules, such as a Bluetooth module, a Wi-Fi module, and a cellular network module, which are not described in detail here.

In this embodiment of the present application, when the processor 80 executes the computer program 82 to implement the steps in any of the above video processing method embodiments, the target object contained in the current video frame can be acquired if specified editing information of the current video frame is detected, where the target object can be regarded as the content to be highlighted. Then, according to the specified editing information and at least one target audio component, the initial audio corresponding to the current video frame is processed to obtain the target audio, and the current video frame is associated with the target audio to obtain the target video. Each target audio component is the audio component corresponding to one target object in the initial audio, so the parts of the initial audio related to the target object can be adjusted according to the specified editing information, making the sound effect of the target audio better match the visual effect presented in the current video frame, enhancing the presentation of the target video and improving the user experience.
本申请实施例还提供了一种计算机可读存储介质,上述计算机可读存储介质存储有计算机程序,上述计算机程序被处理器执行时实现可实现上述各个方法实施例中的步骤。Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
本申请实施例提供了一种计算机程序产品,当计算机程序产品在终端设备上运行时,使得终端设备执行时实现可实现上述各个方法实施例中的步骤。The embodiments of the present application provide a computer program product, when the computer program product runs on a terminal device, so that the terminal device can implement the steps in the foregoing method embodiments when executed.
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关的硬件来完成,上述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,上述计算机程序包括计算机程序代码,上述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。上述计算机可读介质至少可以包括:能够将计算机程序代码携带到拍照装置/终端设备的任何实体或装置、记录介质、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。在某些司法管辖区,根据立法和专利实践,计算机可读介质不可以是电载波信号和电信信号。If the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present application realizes all or part of the processes in the methods of the above-mentioned embodiments, which can be completed by instructing the relevant hardware through a computer program. The above-mentioned computer program can be stored in a computer-readable storage medium, and the computer program is in When executed by the processor, the steps of the foregoing method embodiments can be implemented. Wherein, the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code form, executable file or some intermediate form. The above-mentioned computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory ( RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media. For example, U disk, mobile hard disk, disk or CD, etc. In some jurisdictions, under legislation and patent practice, computer readable media may not be electrical carrier signals and telecommunications signals.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/device and method may be implemented in other manners. For example, the apparatus/device embodiments described above are merely illustrative; for instance, the division into the above modules or units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The foregoing embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A video processing method, comprising:
    if specified editing information of a current video frame is detected, acquiring a target object contained in the current video frame;
    processing initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein each target audio component is an audio component corresponding to one target object in the initial audio; and
    associating the current video frame with the target audio to obtain a target video.
  2. The video processing method according to claim 1, wherein, before processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio, the method further comprises:
    inputting the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including recognized audio objects and the audio component corresponding to each audio object; and
    comparing the audio objects with the target object, and, if at least one audio object is the same as the target object, determining that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
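The separation-and-matching step of claim 2 can be sketched as follows. This is an illustrative sketch only, not the patented implementation: `separate` stands in for the trained first neural network, and the object names and sample values are invented for illustration.

```python
def separate(initial_audio):
    # Stand-in for the trained first neural network: in this sketch it
    # simply returns pre-labelled audio objects and their components.
    return {"person": [0.2, 0.5], "dog": [0.1, -0.3]}

def find_target_components(initial_audio, target_objects):
    """Keep only the components whose audio object matches a target
    object detected in the current video frame."""
    components = separate(initial_audio)
    return {obj: comp for obj, comp in components.items()
            if obj in target_objects}

# "dog" appears both as a recognized audio object and as a target object,
# so the initial audio contains at least one target audio component.
matches = find_target_components(initial_audio=None,
                                 target_objects={"dog", "car"})
```

A non-empty `matches` corresponds to the claim's determination that a target audio component exists in the initial audio.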
  3. The video processing method according to claim 2, wherein, if the specified editing information of the current video frame is detected, acquiring the target object contained in the current video frame comprises:
    performing target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein the labels in a first training data set corresponding to the first neural network at least partially overlap the labels in a second training data set corresponding to the second neural network.
  4. The video processing method according to claim 1, wherein, before processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio, the method comprises:
    identifying, in the initial audio according to a preset object-to-frequency-band mapping table, the target frequency band corresponding to each target object, and taking each target frequency band as the target audio component of the corresponding target object.
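A minimal sketch of the object-to-frequency-band lookup in claim 4. The table entries (object names and band edges in Hz) are invented for illustration; in practice they would come from the preset mapping table.

```python
# Hypothetical preset object-to-frequency-band mapping table (Hz).
OBJECT_BANDS = {
    "person": (85, 1100),
    "dog": (450, 1800),
}

def label_target_bands(target_objects):
    """Identify, for each target object, the frequency band that is
    labelled in the initial audio as its target audio component."""
    return {obj: OBJECT_BANDS[obj]
            for obj in target_objects if obj in OBJECT_BANDS}
```

Objects absent from the table simply yield no target audio component.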
  5. The video processing method according to claim 1, wherein, if the specified editing information of the current video frame is detected, acquiring the target object contained in the current video frame comprises:
    if it is detected that a current image zoom factor of the current video frame does not satisfy a preset condition, taking a specified object contained in the current video frame as at least part of the target object;
    and/or
    if it is detected that a first focus object in the current video frame differs from a second focus object in a video frame preceding the current video frame, taking the first focus object in the current video frame as at least part of the target object.
  6. The video processing method according to claim 1, wherein processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio comprises:
    acquiring, according to the specified editing information, an original video frame corresponding to the current video frame;
    detecting a first object contained in the original video frame;
    comparing the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result; and
    processing, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
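The comparison in claim 6 amounts to a set comparison between the objects of the original frame and those of the edited current frame. The sketch below assumes detection has already produced the two object sets; the function name is illustrative, not from the patent.

```python
def diff_objects(current_objects, original_objects):
    """Return (removed, kept): objects that the edit cropped out of the
    current frame, and objects still visible in it."""
    removed = set(original_objects) - set(current_objects)
    kept = set(original_objects) & set(current_objects)
    return removed, kept
```

The audio components of `removed` objects could then, for example, be attenuated or muted when producing the target audio.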
  7. The video processing method according to any one of claims 1 to 6, wherein processing the initial audio corresponding to the current video frame according to the specified editing information and the at least one target audio component to obtain the target audio comprises:
    adjusting, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
  8. The video processing method according to claim 7, wherein adjusting the loudness of the target audio component in the initial audio according to the specified editing information comprises:
    if the specified editing information includes a current image zoom factor, determining the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
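Claim 8's predetermined correspondence between image zoom factor and loudness adjustment factor can be sketched as a lookup table. The gain values below are invented for illustration; a real device would calibrate the correspondence.

```python
# Hypothetical zoom-factor-to-gain table; the values are illustrative only.
ZOOM_TO_GAIN = {1.0: 1.0, 2.0: 1.5, 4.0: 2.0}

def adjust_loudness(component, zoom_factor):
    """Scale the target audio component by the gain mapped to the current
    image zoom factor, falling back to unity gain for unlisted factors."""
    gain = ZOOM_TO_GAIN.get(zoom_factor, 1.0)
    return [sample * gain for sample in component]
```

Zooming in on an object thus makes its audio component louder while the rest of the mix is left unchanged.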
  9. The video processing method according to any one of claims 1 to 6, wherein the target audio component is a sub-audio recognized and extracted from the initial audio in advance; or
    the target audio component is obtained by pre-labelling a specific frequency band in the initial audio.
  10. A video processing apparatus, comprising:
    an acquisition module, configured to acquire, if specified editing information of a current video frame is detected, a target object contained in the current video frame;
    a processing module, configured to process initial audio corresponding to the current video frame according to the specified editing information and at least one target audio component to obtain target audio, wherein each target audio component is an audio component corresponding to one target object in the initial audio; and
    an association module, configured to associate the current video frame with the target audio to obtain a target video.
  11. The video processing apparatus according to claim 10, further comprising:
    a second processing module, configured to input the initial audio into a trained first neural network to obtain an output result of the trained first neural network, the output result including recognized audio objects and the audio component corresponding to each audio object; and
    a comparison module, configured to compare the audio objects with the target object, and, if at least one audio object is the same as the target object, determine that the initial audio corresponding to the current video frame contains the target audio component corresponding to at least one target object.
  12. The video processing apparatus according to claim 11, wherein the acquisition module is configured to:
    perform target recognition on the current video frame through a trained second neural network to obtain the target object contained in the current video frame, wherein the labels in a first training data set corresponding to the first neural network at least partially overlap the labels in a second training data set corresponding to the second neural network.
  13. The video processing apparatus according to claim 10, further comprising:
    a third processing module, configured to identify, in the initial audio according to a preset object-to-frequency-band mapping table, the target frequency band corresponding to each target object, and take each target frequency band as the target audio component of the corresponding target object.
  14. The video processing apparatus according to claim 10, wherein the acquisition module is configured to:
    if it is detected that a current image zoom factor of the current video frame does not satisfy a preset condition, take a specified object contained in the current video frame as at least part of the target object;
    and/or
    if it is detected that a first focus object in the current video frame differs from a second focus object in a video frame preceding the current video frame, take the first focus object in the current video frame as at least part of the target object.
  15. The video processing apparatus according to claim 10, further comprising:
    a first acquisition unit, configured to acquire, according to the specified editing information, an original video frame corresponding to the current video frame;
    a detection unit, configured to detect a first object contained in the original video frame;
    a comparison unit, configured to compare the target object contained in the current video frame with the first object contained in the original video frame to obtain a comparison result; and
    a second processing unit, configured to process, according to the comparison result, the initial audio corresponding to the current video frame to obtain the target audio.
  16. The video processing apparatus according to any one of claims 10 to 15, wherein the processing module is configured to:
    adjust, according to the specified editing information, the loudness of the target audio component in the initial audio to obtain the target audio.
  17. The video processing apparatus according to claim 16, wherein the processing module is configured to:
    if the specified editing information includes a current image zoom factor, determine the loudness of the target audio component in the initial audio according to a predetermined correspondence between image zoom factors and loudness adjustment factors.
  18. The video processing apparatus according to any one of claims 10 to 15, wherein the target audio component is a sub-audio recognized and extracted from the initial audio in advance, or the target audio component is obtained by pre-labelling a specific frequency band in the initial audio.
  19. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video processing method according to any one of claims 1 to 9.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video processing method according to any one of claims 1 to 9.
PCT/CN2021/097743 2020-07-22 2021-06-01 Video processing method and apparatus, and terminal device and computer-readable storage medium WO2022017006A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010710645.XA CN111818385B (en) 2020-07-22 2020-07-22 Video processing method, video processing device and terminal equipment
CN202010710645.X 2020-07-22

Publications (1)

Publication Number Publication Date
WO2022017006A1 true WO2022017006A1 (en) 2022-01-27

Family

ID=72861861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097743 WO2022017006A1 (en) 2020-07-22 2021-06-01 Video processing method and apparatus, and terminal device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111818385B (en)
WO (1) WO2022017006A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111818385B (en) * 2020-07-22 2022-08-09 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment
CN112188260A (en) * 2020-10-26 2021-01-05 咪咕文化科技有限公司 Video sharing method, electronic device and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009200959A (en) * 2008-02-22 2009-09-03 Sony Corp Data editing apparatus, data editing method, program and storage medium
CN105611404A (en) * 2015-12-31 2016-05-25 北京东方云图科技有限公司 Method and device for automatically adjusting audio volume according to video application scenes
CN105657538A (en) * 2015-12-31 2016-06-08 北京东方云图科技有限公司 Method and device for synthesizing video file by mobile terminal
CN107241646A (en) * 2017-07-12 2017-10-10 北京奇虎科技有限公司 The edit methods and device of multimedia video
US9794632B1 (en) * 2016-04-07 2017-10-17 Gopro, Inc. Systems and methods for synchronization based on audio track changes in video editing
CN107967706A (en) * 2017-11-27 2018-04-27 腾讯音乐娱乐科技(深圳)有限公司 Processing method, device and the computer-readable recording medium of multi-medium data
CN108307127A (en) * 2018-01-12 2018-07-20 广州市百果园信息技术有限公司 Method for processing video frequency and computer storage media, terminal
CN111818385A (en) * 2020-07-22 2020-10-23 Oppo广东移动通信有限公司 Video processing method, video processing device and terminal equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100542129B1 (en) * 2002-10-28 2006-01-11 한국전자통신연구원 Object-based three dimensional audio system and control method
JP2011234139A (en) * 2010-04-28 2011-11-17 Sharp Corp Three-dimensional audio signal generating device
US20140152530A1 (en) * 2012-12-03 2014-06-05 Honeywell International Inc. Multimedia near to eye display system
CN105512348B (en) * 2016-01-28 2019-03-26 北京旷视科技有限公司 For handling the method and apparatus and search method and device of video and related audio
CN107135419A (en) * 2017-06-14 2017-09-05 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN107493442A (en) * 2017-07-21 2017-12-19 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN109040641B (en) * 2018-08-30 2020-10-16 维沃移动通信有限公司 Video data synthesis method and device
CN109857905B (en) * 2018-11-29 2022-03-15 维沃移动通信有限公司 Video editing method and terminal equipment
CN109815844A (en) * 2018-12-29 2019-05-28 西安天和防务技术股份有限公司 Object detection method and device, electronic equipment and storage medium
CN109761118A (en) * 2019-01-15 2019-05-17 福建天眼视讯网络科技有限公司 Wisdom ladder networking control method and system based on machine vision
CN110544270A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 method and device for predicting human face tracking track in real time by combining voice recognition


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449354A (en) * 2022-02-07 2022-05-06 上海幻电信息科技有限公司 Video editing method and system
CN114449354B (en) * 2022-02-07 2023-12-08 上海幻电信息科技有限公司 Video editing method and system

Also Published As

Publication number Publication date
CN111818385B (en) 2022-08-09
CN111818385A (en) 2020-10-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21846810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21846810

Country of ref document: EP

Kind code of ref document: A1