WO2020082902A1 - Sound effect processing method for video, and related products - Google Patents

Sound effect processing method for video, and related products

Info

Publication number
WO2020082902A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame data
audio
image
video
image frames
Prior art date
Application number
PCT/CN2019/104044
Other languages
French (fr)
Chinese (zh)
Inventor
朱克智
严锋贵
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Publication of WO2020082902A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S1/00 - Two-channel systems
    • H04S1/002 - Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; face representation
    • G06V40/171 - Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; image sequence

Definitions

  • This application relates to the field of audio technology, and in particular to a video sound effect processing method and related products.
  • The embodiments of the present application provide a video sound effect processing method and related products, which can process the audio of a video according to the position of its sound source, thereby improving the user experience.
  • An embodiment of the present application provides a video sound effect processing method, which includes the following steps:
  • The first group of image frame data is analyzed to determine the sound source position of the audio, and 3D sound effect processing is performed on the audio frame data according to the sound source position to obtain processed audio frame data.
  • A video sound effect processing device includes:
  • An obtaining unit, configured to obtain a captured first video and extract image frame data and audio frame data from the first video.
  • A processing unit, configured to obtain the audio time interval of the audio frame data and extract, from the image frame data, the first group of image frame data corresponding to the audio time interval; analyze the first group of image frame data to determine the sound source position of the audio; and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  • An embodiment of the present application provides an electronic device including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps of the first aspect of the embodiments of the present application.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to execute part or all of the steps described in the first aspect.
  • An embodiment of the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a video sound effect processing method disclosed in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another video sound effect processing method disclosed in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a video sound effect processing device disclosed in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
  • The electronic devices involved in the embodiments of the present application may include handheld devices with wireless communication functions (such as smartphones), in-vehicle devices, virtual reality (VR) / augmented reality (AR) devices, wearable devices, computing devices, or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, R&D / test platforms, servers, and so on.
  • The electronic device may filter the audio data (the sound emitted by the sound source) with an HRTF (head-related transfer function) filter to obtain virtual surround sound, also called surround sound or panoramic sound, and thereby achieve a three-dimensional stereo effect.
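As an illustrative sketch of this filtering step (the function name, array shapes, and use of NumPy are assumptions, not part of the disclosure), a mono source can be rendered binaurally by convolving it with a per-ear head-related impulse response:

```python
import numpy as np

def apply_hrtf(mono, hrir_left, hrir_right):
    """Render a mono source binaurally by convolving it with a
    head-related impulse response (HRIR) for each ear.

    mono, hrir_left, hrir_right: 1-D NumPy arrays of samples.
    Returns an (n_samples, 2) stereo array.
    """
    left = np.convolve(mono, hrir_left)    # left-ear filtering
    right = np.convolve(mono, hrir_right)  # right-ear filtering
    n = max(len(left), len(right))
    out = np.zeros((n, 2))
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out
```

In practice the HRIR pair would be chosen from a measured database according to the estimated source direction; here the impulse responses are simply passed in.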
  • The time-domain counterpart of the HRTF is the HRIR (head-related impulse response).
  • A binaural room impulse response (BRIR) consists of three parts: direct sound, early reflections, and reverberation.
  • Performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
  • If the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel.
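The channel-volume adjustment can be sketched as follows; the gain factor and function name are illustrative assumptions rather than values taken from the disclosure:

```python
def pan_by_source_position(left_ch, right_ch, source_on_left, gain=1.25):
    """Adjust per-channel volume according to the sound source position.

    left_ch, right_ch: lists of samples for each stereo channel.
    source_on_left: True if the source sits in the left half of the frame.
    gain: boost factor for the channel nearest the source (an assumption).
    """
    if source_on_left:
        left_ch = [s * gain for s in left_ch]    # boost the left channel
    else:
        right_ch = [s * gain for s in right_ch]  # boost the right channel
    return left_ch, right_ch
```

Equivalently, the far channel could be attenuated by 1/gain instead of boosting the near one, matching the "increase ... or decrease" wording.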
  • the method further includes:
  • If the first video is indoor, an indoor 3D sound effect strategy is applied to the audio frame data.
  • The indoor 3D sound effect strategy includes decreasing the volume or increasing the echo.
  • the method of determining that the first video is indoor specifically includes:
  • the analysis of the first group of image frame data to determine the location of the audio sound source specifically includes:
  • Extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames;
  • When mouth movement is identified in the mouth areas of the x image frames, the position of the mouth area within the x image frames is determined to be the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • Identifying the mouth areas of the x image frames to determine that the x image frames have mouth movement specifically includes:
  • Acquiring the audio time interval of the audio frame data specifically includes:
  • Filter the audio frame data to obtain filtered first audio frame data, acquire the time interval corresponding to the first audio frame data, and determine that time interval to be the audio time interval.
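A minimal sketch of this step, assuming a simple per-frame energy threshold stands in for the unspecified filter (the frame length, threshold, and function name are all assumptions):

```python
def audio_time_interval(samples, sample_rate, frame_len=1024, threshold=0.01):
    """Return the (start, end) time in seconds of the span whose frame
    energy exceeds a threshold, i.e. where audible audio remains after
    filtering out near-silence. Returns None if no frame is loud enough.
    """
    active = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy > threshold:          # frame survives the filter
            active.append(i)
    if not active:
        return None
    start = active[0] / sample_rate
    end = min(active[-1] + frame_len, len(samples)) / sample_rate
    return (start, end)
```

The image frames whose timestamps fall inside the returned interval would then form the first group of image frame data.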
  • The processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel or decrease the volume of the left channel.
  • The processing unit is further configured to apply an indoor 3D sound effect strategy to the audio frame data if the first video is shot indoors.
  • The indoor 3D sound effect strategy includes decreasing the volume or increasing the echo.
  • The processing unit is specifically configured to randomly extract n frames of image data from the image frame data, pass the n frames to a trained classifier for classification, and determine the n scenes corresponding to the n frames. If the n scenes are all indoor, the first video is determined to be indoor; otherwise, the first video is determined to be non-indoor; n is an integer greater than or equal to 2.
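This sampling-and-classification step can be sketched as follows, with the trained classifier stood in by an arbitrary callable (the scene labels, sample size, and function name are assumptions):

```python
import random

def video_is_indoor(image_frames, classifier, n=5):
    """Randomly sample n frames and classify each; the video is treated
    as indoor only if every sampled frame is classified as indoor.

    classifier: any callable frame -> scene label ("indoor"/"outdoor"),
    a stand-in for the trained classifier the text assumes.
    """
    n = min(n, len(image_frames))
    sampled = random.sample(image_frames, n)      # random n-frame subset
    scenes = [classifier(frame) for frame in sampled]
    return all(scene == "indoor" for scene in scenes)
```

Requiring every sampled scene to be indoor mirrors the "all indoors" condition; a single non-indoor sample makes the video non-indoor.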
  • The processing unit is specifically configured to extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, extract x temporally consecutive image frames from the w image frames, and, when mouth movement is identified in the mouth areas of the x image frames, determine the position of the mouth area within the x image frames to be the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with non-lip RGB values to obtain x counts, and calculate the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, it is determined that the x image frames have mouth movement; if the difference is less than the difference threshold, it is determined that the x image frames have no mouth movement.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with tooth RGB values to obtain x counts, and calculate the number y of the x counts that exceed a count threshold. If y / x is greater than a ratio threshold, it is determined that the x image frames have mouth movement.
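A minimal sketch of the second detection method, again with the tooth-colour test supplied as a predicate (the thresholds and function name are assumptions):

```python
def has_mouth_movement_by_teeth(mouth_regions, is_tooth_rgb,
                                count_threshold, ratio_threshold):
    """Method 2: count tooth-coloured pixels in each of the x mouth
    regions; if teeth are visible (count above count_threshold) in a
    large enough fraction y/x of the frames, the mouth is moving.

    is_tooth_rgb: predicate for white-to-yellow tooth RGB values
    (an assumption; the text derives the range from big-data statistics).
    """
    counts = [sum(1 for px in region if is_tooth_rgb(px))
              for region in mouth_regions]         # the x counts
    y = sum(1 for c in counts if c > count_threshold)
    return y / len(mouth_regions) > ratio_threshold
```

Teeth flash in and out of view while speaking, so a high proportion of tooth-visible frames is taken as evidence of mouth movement.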
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device includes a control circuit and an input-output circuit, and the input-output circuit is connected to the control circuit.
  • the control circuit may include a storage and processing circuit.
  • The storage circuit in the storage and processing circuit may be a memory, such as a hard disk drive memory, a non-volatile memory (such as flash memory or other electronically programmable read-only memory used to form a solid-state drive), or a volatile memory (such as static or dynamic random access memory); the embodiments of the present application are not limited in this regard.
  • the processing circuit in the storage and processing circuit can be used to control the operation of the electronic device.
  • the processing circuit may be implemented based on one or more microprocessors, microcontrollers, digital signal processors, baseband processors, power management units, audio codec chips, application specific integrated circuits, display driver integrated circuits, and the like.
  • The storage and processing circuits can be used to run software in the electronic device, such as an incoming call alert ringtone application, a short message alert ringtone application, an alarm alert ringtone application, a media file playback application, a voice over Internet protocol (VoIP) phone call application, operating system functions, and so on.
  • This software can be used to perform control operations such as playing an incoming call alert ringtone, playing a short message alert ringtone, playing an alarm alert ringtone, playing media files, making voice phone calls, and other functions in the electronic device; the embodiments of the present application are not limited in this regard.
  • The input-output circuit can be used to enable the electronic device to input and output data, that is, to allow the electronic device to receive data from external devices and to output data to external devices.
  • the input-output circuit may further include a sensor.
  • The sensor may include an ambient light sensor, an infrared proximity sensor based on light and capacitance, an ultrasonic sensor, a touch sensor (for example, a light-based touch sensor and/or a capacitive touch sensor, where the touch sensor may be part of a touch display screen or used independently as a touch sensor structure), an acceleration sensor, a gravity sensor, and other sensors.
  • the input-output circuit may further include an audio component, and the audio component may be used to provide audio input and output functions for the electronic device. Audio components can also include tone generators and other components for generating and detecting sound.
  • the input-output circuit may also include one or more display screens.
  • The display screen may include a liquid crystal display, an organic light-emitting diode display, an electronic ink display, a plasma display, or a display using another display technology, or a combination of several of these.
  • the display screen may include a touch sensor array (ie, the display screen may be a touch display screen).
  • The touch sensor may be a capacitive touch sensor formed by an array of transparent touch sensor electrodes (such as indium tin oxide (ITO) electrodes), or may be a touch sensor using another touch technology, such as acoustic touch, pressure-sensitive touch, resistive touch, or optical touch; the embodiments of the present application are not limited in this regard.
  • the input-output circuit may further include a communication circuit that can be used to provide an electronic device with the ability to communicate with an external device.
  • The communication circuit may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals.
  • the wireless communication circuit in the communication circuit may include a radio frequency transceiver circuit, a power amplifier circuit, a low noise amplifier, a switch, a filter, and an antenna.
  • the wireless communication circuit in the communication circuit may include a circuit for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals.
  • the communication circuit may include a near field communication antenna and a near field communication transceiver.
  • the communication circuit may also include a cellular phone transceiver and antenna, a wireless local area network transceiver circuit and antenna, and so on.
  • the input-output circuit may further include other input-output units.
  • the input-output unit may include buttons, joysticks, click wheels, scroll wheels, touch pad, keypad, keyboard, camera, light emitting diodes, and other status indicators.
  • the electronic device may further include a battery (not shown), and the battery is used to provide electrical energy to the electronic device.
  • Video generally refers to various technologies that capture, record, process, store, transmit and reproduce a series of still images in the form of electrical signals.
  • When continuous images change at a rate above 24 frames per second, according to the principle of persistence of vision, the human eye cannot distinguish the individual static pictures, and the sequence appears as a smooth, continuous visual effect.
  • Such a continuous sequence of pictures is called video.
  • Video technology was first developed for television systems, but it has since evolved into various formats that let consumers record video. The development of network technology has also allowed recorded video clips to exist on the Internet as streaming media that can be received and played by computers.
  • Video and film are different technologies: the latter uses photography to capture dynamic images as a series of still photos.
  • the video in this application is a video shot by an electronic device, and does not include video shot by professional equipment (such as movies, TV series, etc.).
  • Existing video shooting includes images and audio.
  • However, existing electronic devices generally only record the audio data collected during video shooting and do not process it, for example by processing the audio data according to the sound source in the shot video. This results in a poor scene restoration effect and degrades the user experience.
  • FIG. 2 is a schematic flowchart of a video audio processing method disclosed in an embodiment of the present application. The method is applied to the electronic device described in FIG. 1.
  • the video audio processing method includes the following steps:
  • Step S201: Acquire the captured first video, and extract the image frame data and audio frame data from the first video.
  • Step S202: Acquire the audio time interval of the audio frame data, and extract the first group of image frame data corresponding to the audio time interval from the image frame data.
  • Acquiring the audio time interval of the audio frame data may specifically include:
  • Filter the audio frame data to obtain the filtered first audio frame data, acquire the time interval corresponding to the first audio frame data, and determine that time interval to be the audio time interval.
  • Step S203: Analyze the first group of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain the processed audio frame data.
  • the step S203 of performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data may specifically include:
  • If the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel.
  • An indoor 3D sound effect strategy may also be applied to the audio frame data.
  • The indoor 3D sound effect strategy includes, but is not limited to, reducing the volume, increasing the echo, and so on.
  • In the technical solution provided by the present application, when the captured first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio data and the user experience.
  • the above method for determining that the first video is indoor may specifically include:
  • The above classifiers include, but are not limited to, machine learning models, neural network models, deep learning models, and other algorithmic models with classification capability.
  • Extracting n frames of image data as described above reduces the amount of computation. Compared with running the classifier over all image frame data of the first video, this greatly reduces the workload without reducing accuracy. According to the applicant's statistics over large amounts of shot video, video shooting time is generally short: most videos are under 5 minutes, or even under 2 minutes, and are commonly called micro videos. Unlike movies, whose scenes switch frequently, a micro video is short and is generally produced in a single shot without subsequent editing or stitching, so the shooting scene generally does not change. The statistics also show that most video shooting scenes are fixed; for example, a shot that starts indoors stays indoors, and a shot that starts outdoors stays outdoors. Therefore, directly extracting n image frames of the first video is enough to determine whether it is indoor or outdoor.
  • the analysis of the first group of image frame data in step S203 to determine the location of the sound source of the audio may specifically include:
  • Extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames.
  • The continuous time period contains image frames with consecutive shooting times, for example the m image frames within the period 1 s-10 s; it may of course be another time period.
  • The present application does not limit the specific length of this time period.
  • The above face recognition processing may use a general face recognition algorithm, for example the Baidu face recognition algorithm, Google face recognition, and so on.
  • Identifying the mouth areas of the x image frames to determine that the x image frames have mouth movement may specifically include:
  • The principle of this method is that a speaking person must move their mouth.
  • The movement of the mouth is therefore analyzed.
  • The mouth area is divided into two parts. The first part is the lip area (for Asians, the lips are pink, and the range of lip RGB values can be looked up from RGB tables). The second part is the non-lip area (which may contain the RGB values of teeth or the dark RGB values of the mouth interior). Statistics over large amounts of data show that when the mouth moves, the area of the second part changes constantly; for example, while a person speaks a passage, the difference between the maximum and minimum extent of the second part is large. Since the shooting distance is relatively fixed, this is reflected in the image frames as a relatively large change in the number of pixels belonging to the second part. The applicant identifies mouth movement based on this principle.
  • Alternatively, identifying the mouth areas of the x image frames to determine that the x image frames have mouth movement may specifically include:
  • The principle of this method is likewise that a speaking person must move their mouth.
  • The movement of the mouth is analyzed.
  • The mouth area is divided into two parts. The first part is the lip area (for Asians, the lips are pink, and the range of lip RGB values can be looked up from RGB tables). The second part is the non-lip area (such as pixels with tooth RGB values). Statistics over large amounts of data show that when the mouth moves, the second part changes constantly and teeth appear from time to time, so counting how often teeth appear can determine whether there is mouth movement.
  • Asian teeth are generally white to yellow, which differs greatly from the RGB values of the lips, so selecting the tooth RGB values also reduces errors and improves the accuracy of mouth movement recognition.
  • FIG. 3 is a schematic flowchart of a video sound effect processing method disclosed in an embodiment of the present application, applied to the electronic device described in FIG. 1 above.
  • The video sound effect processing method includes the following steps:
  • Step S301: Acquire the captured first video, and extract the image frame data and audio frame data from the first video.
  • Step S302: Acquire the audio time interval of the audio frame data, and extract the first group of image frame data corresponding to the audio time interval from the image frame data.
  • Step S303: Extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames; when mouth movement is identified in the mouth areas of the x image frames, the position of the mouth area within the x image frames is the sound source position of the audio.
  • Step S304: If the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel.
  • In the technical solution provided by the present application, when the captured first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio data and the user experience.
  • FIG. 4 provides a video audio processing device.
  • the video audio processing device includes:
  • the obtaining unit 401 is configured to obtain the first video captured and extract image frame data and audio frame data in the first video;
  • The processing unit 402 is configured to obtain the audio time interval of the audio frame data and extract, from the image frame data, the first group of image frame data corresponding to the audio time interval; analyze the first group of image frame data to determine the sound source position of the audio; and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  • In the technical solution provided by the present application, when the captured first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio data and the user experience.
  • The processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel or decrease the volume of the left channel.
  • The processing unit is further configured to apply an indoor 3D sound effect strategy to the audio frame data if the first video is indoor.
  • The processing unit is specifically configured to randomly extract n frames of image data from the image frame data, pass the n frames to a trained classifier for classification, and determine the n scenes corresponding to the n frames; if the n scenes are all indoor, the first video is determined to be indoor; otherwise, the first video is determined to be non-indoor; n is an integer greater than or equal to 2.
  • The processing unit is specifically configured to extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, extract x temporally consecutive image frames from the w image frames, and, when mouth movement is identified in the mouth areas of the x image frames, determine the position of the mouth area within the x image frames to be the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with non-lip RGB values to obtain x counts, and calculate the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, it is determined that the x image frames have mouth movement; if the difference is less than the difference threshold, it is determined that the x image frames have no mouth movement.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with tooth RGB values to obtain x counts, and calculate the number y of the x counts that exceed a count threshold. If y / x is greater than a ratio threshold, it is determined that the x image frames have mouth movement.
  • FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
  • the electronic device includes a processor, a memory, a communication interface, and one or more programs.
  • One or more programs are stored in the above-mentioned memory and are configured to be executed by the above-mentioned processor.
  • the above-mentioned program includes instructions for performing the following steps:
  • The first group of image frame data is analyzed to determine the sound source position of the audio, and 3D sound effect processing is performed on the audio frame data according to the sound source position to obtain the processed audio frame data.
  • performing the 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
  • if the sound source position is on the left, increasing the volume of the left channel in the audio frame data or decreasing the volume of the right channel in the audio frame data; if the sound source position is on the right, increasing the volume of the right channel in the audio frame data or decreasing the volume of the left channel in the audio frame data.
  • the method further includes:
  • if the first video is indoor, an indoor 3D sound effect strategy is applied when playing the audio frame data.
  • the method for determining that the first video is indoor specifically includes:
  • analyzing the first set of image frame data to determine the sound source position of the audio specifically includes:
  • extracting m temporally consecutive image frames from the first set of image frame data, performing face recognition on the m image frames to obtain w image frames containing a human face, and extracting x temporally consecutive image frames from the w image frames; when mouth-region recognition on the x image frames determines that the mouth is moving, determining that the position of the mouth region within the x image frames is the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • identifying the mouth regions of the x image frames to determine that the x image frames contain mouth movement specifically includes:
  • the electronic device includes a hardware structure and / or a software module corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
  • the embodiments of the present application may divide the electronic device into functional units according to the above method example; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit.
  • the above integrated unit can be implemented in the form of hardware or of a software functional unit. It should be noted that the division into units in the embodiments of the present application is schematic and is merely a division of logical functions; there may be other division manners in actual implementation.
  • each unit may be, for example, an application-specific integrated circuit (ASIC), a single circuit, a processor (shared, dedicated, or part of a chipset) and memory that execute one or more software or firmware programs, combinational logic circuits, and/or other suitable components that provide the functions described above.
  • An embodiment of the present application further provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps of any video sound effect processing method described in the foregoing method embodiments.
  • An embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps of any video sound effect processing method described in the foregoing method embodiments.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of software program modules.
  • If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
  • With this understanding, the technical solution of the present application, in essence the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media that can store program code.
  • the program may be stored in a computer-readable memory, and the memory may include: a flash disk, a ROM, a RAM, a magnetic disk, an optical disk, and so on.
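The sound-source localization and panning steps enumerated above reduce to a simple decision once the mouth region has been found. The sketch below is only an illustration of that decision; the one-third split of the frame width is an assumed threshold, since the embodiments say only "left" and "right":

```python
def sound_source_side(mouth_center_x: float, frame_width: float) -> str:
    """Classify a mouth-region centre into a coarse sound-source side.

    The thresholds (outer thirds of the frame) are an illustrative
    assumption; the embodiments themselves specify only 'left' and 'right'.
    """
    if mouth_center_x < frame_width / 3:
        return "left"
    if mouth_center_x > 2 * frame_width / 3:
        return "right"
    return "center"
```

A mouth centred in the left third of a 1920-pixel-wide frame maps to "left", which the later steps use to raise the left channel (or lower the right one).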


Abstract

Disclosed are a sound effect processing method for a video, and related products. The method comprises the following steps: obtaining a captured first video, and extracting image frame data and audio frame data from the first video; obtaining an audio time interval of the audio frame data, and extracting, from the image frame data, a first set of image frame data corresponding to the audio time interval; and analyzing the first set of image frame data to determine the sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data. The technical solution provided by the present application improves the user experience.

Description

Sound effect processing method for video, and related products

Technical Field
This application relates to the field of audio technology, and in particular to a sound effect processing method for video and related products.
Background
With the widespread adoption of electronic devices (such as mobile phones and tablet computers), the applications that electronic devices can support are ever more numerous and their functions ever more powerful. Electronic devices are developing in a diversified, personalized direction and have become indispensable electronic products in users' lives, and video applications are among the most frequently used applications on electronic devices.
Summary
Embodiments of the present application provide a sound effect processing method for video and related products, which can process the audio of a video according to the position of the sound source, thereby improving the user experience.
In a first aspect, an embodiment of the present application provides a sound effect processing method for video, the method including the following steps:
obtaining a captured first video, and extracting image frame data and audio frame data from the first video;
obtaining an audio time interval of the audio frame data, and extracting, from the image frame data, a first set of image frame data corresponding to the audio time interval;
analyzing the first set of image frame data to determine the sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In a second aspect, a movie sound effect processing apparatus is provided, the apparatus including:
an obtaining unit, configured to obtain a captured first video and to extract image frame data and audio frame data from the first video;
a processing unit, configured to obtain an audio time interval of the audio frame data, extract from the image frame data a first set of image frame data corresponding to the audio time interval, analyze the first set of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps of the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
Brief Description of the Drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a movie sound effect processing method disclosed in an embodiment of the present application;

FIG. 3 is a schematic flowchart of another movie sound effect processing method disclosed in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a movie sound effect processing apparatus disclosed in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The electronic devices involved in the embodiments of the present application may include various handheld devices with wireless communication functions (such as smartphones), in-vehicle devices, virtual reality (VR) / augmented reality (AR) devices, wearable devices, computing devices, or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, R&D / test platforms, servers, and so on. For convenience of description, the devices mentioned above are collectively referred to as electronic devices.
In a specific implementation, in the embodiments of the present application, the electronic device may filter the audio data (the sound emitted by the sound source) with an HRTF (Head Related Transfer Function) filter to obtain virtual surround sound, also called surround sound or panoramic sound, achieving a three-dimensional sound effect. The time-domain counterpart of the HRTF is the HRIR (Head Related Impulse Response). Alternatively, the audio data may be convolved with a Binaural Room Impulse Response (BRIR); a binaural room impulse response consists of three parts: direct sound, early reflections, and reverberation.
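Applying an HRIR (or a BRIR) is, concretely, a convolution of the mono source with a per-ear impulse response. The sketch below uses tiny made-up impulse responses purely to show the mechanics; real HRIRs/BRIRs are measured responses, and production code would use FFT-based convolution rather than this direct form:

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution (what applying an HRIR/BRIR amounts to)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binauralize(mono, hrir_left, hrir_right):
    """Filter one mono source with a left/right HRIR pair -> (left, right)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy impulse responses: the right ear hears the source delayed and
# attenuated, as if it were on the listener's left. These values are
# placeholders, not measured HRIRs.
left_ir = [1.0, 0.0]
right_ir = [0.0, 0.5]
left_out, right_out = binauralize([1.0, 0.0, 0.0], left_ir, right_ir)
```

With a unit impulse as input, the outputs are just the impulse responses themselves, which makes the delay-and-attenuate effect easy to see.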
During video shooting, the position of the audio source is not taken into account; that is, sound sources on the left, right, and so on receive no corresponding processing. This results in poor scene restoration in the video and affects the user experience.
In an optional solution of the video sound effect processing method, performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
if the sound source position is on the left, increasing the volume of the left channel in the audio frame data or decreasing the volume of the right channel in the audio frame data; if the sound source position is on the right, increasing the volume of the right channel in the audio frame data or decreasing the volume of the left channel in the audio frame data.
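A minimal sketch of this left/right adjustment, assuming interleaved (left, right) sample pairs and an illustrative gain factor of 1.5 (the embodiments do not specify by how much the volume is raised or lowered):

```python
def pan_stereo(samples, source_side, gain=1.5):
    """Boost the channel on the sound-source side.

    The embodiments equivalently allow attenuating the opposite channel
    instead. `samples` is a list of (left, right) pairs; `gain` is an
    assumed factor, not a value from the patent.
    """
    out = []
    for left, right in samples:
        if source_side == "left":
            left *= gain
        elif source_side == "right":
            right *= gain
        out.append((left, right))
    return out
```

In a real pipeline the samples would also be clipped or normalized after the gain is applied to avoid overflow.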
In an optional solution of the video sound effect processing method, the method further includes:
if the first video is indoor, applying an indoor 3D sound effect strategy when playing the audio frame data.
In an optional solution of the video sound effect processing method, the indoor 3D sound effect strategy includes: decreasing the volume or adding echo.
In an optional solution of the video sound effect processing method, the method of determining that the first video is indoor specifically includes:
randomly extracting n frames of image data from the image frame data, and passing the n frames of image data to a trained classifier that executes a classification algorithm to determine the n scenes corresponding to the n frames of image data; if all n scenes are indoor, determining that the first video is indoor; otherwise, determining that the first video is not indoor; n is an integer greater than or equal to 2.
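The decision rule sitting on top of the classifier is simply a unanimity check over the n sampled frames. In the sketch below the scene classifier itself is assumed to exist upstream; only its per-frame labels are consumed:

```python
def video_is_indoor(frame_scenes):
    """Apply the embodiment's rule: the video counts as indoor only when
    every sampled frame is classified as indoor.

    `frame_scenes` stands in for the output of a trained scene classifier
    over n >= 2 randomly sampled frames; the classifier is not implemented
    here.
    """
    return len(frame_scenes) >= 2 and all(s == "indoor" for s in frame_scenes)
```

A single disagreeing frame (or fewer than two samples) is enough to fall back to the non-indoor path.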
In an optional solution of the video sound effect processing method, analyzing the first set of image frame data to determine the sound source position of the audio specifically includes:
extracting m temporally consecutive image frames from the first set of image frame data, performing face recognition on the m image frames to obtain w image frames containing a human face, and extracting x temporally consecutive image frames from the w image frames; when mouth-region recognition on the x image frames determines that the mouth is moving, determining that the position of the mouth region within the x image frames is the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
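One way to realize the m → w → x selection is to filter the m frames down to the w face frames and then take the longest temporally consecutive run of them as the x frames. The helper below is a hypothetical illustration of that selection, with face detection itself stubbed out as a boolean per frame:

```python
def consecutive_face_run(frames_with_face_flag):
    """From m (index, has_face) frames, keep the w face frames and return
    the longest temporally consecutive run of indices: the x frames whose
    mouth region would then be inspected.

    Hypothetical helper; the patent does not prescribe how the consecutive
    run is chosen.
    """
    face_indices = [i for i, has_face in frames_with_face_flag if has_face]
    best, current = [], []
    for idx in face_indices:
        if current and idx == current[-1] + 1:
            current.append(idx)
        else:
            current = [idx]
        if len(current) > len(best):
            best = list(current)
    return best
```

For example, with faces in frames 0, 1, 3, 4, and 5, the run 3..5 is returned because it is the longest consecutive stretch.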
In an optional solution of the video sound effect processing method, identifying the mouth regions of the x image frames to determine that the x image frames contain mouth movement specifically includes:
determining the x mouth regions of the x image frames, identifying the RGB values of all pixels in the x mouth regions, counting among all the RGB values the pixels whose RGB values are not lip-colored to obtain x counts, and computing the difference between the maximum and the minimum of the x counts; if the difference is greater than a difference threshold, determining that the x image frames contain mouth movement; if the difference is less than the difference threshold, determining that the x image frames contain no mouth movement.
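A sketch of this first mouth-movement test. The intuition is that an opening mouth exposes more non-lip pixels (teeth, mouth interior), so the per-frame non-lip counts swing widely across the x frames. The RGB test for "lip-colored" pixels and the thresholds are assumptions made for illustration; the patent leaves both unspecified:

```python
def is_lip_color(rgb):
    """Crude lip-color test: reddish pixels. An assumed heuristic; the
    patent does not define which RGB values count as 'lip'."""
    r, g, b = rgb
    return r > 120 and g < 110 and b < 110

def mouth_moves_by_opening(frames_pixels, diff_threshold):
    """Per frame, count the non-lip pixels inside the mouth region, then
    declare movement when the largest and smallest counts differ by more
    than diff_threshold."""
    counts = [sum(1 for p in pixels if not is_lip_color(p))
              for pixels in frames_pixels]
    return max(counts) - min(counts) > diff_threshold
```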
In an optional solution of the video sound effect processing method, identifying the mouth regions of the x image frames to determine that the x image frames contain mouth movement specifically includes:
determining the x mouth regions of the x image frames, identifying the RGB values of all pixels in the x mouth regions, counting among all the RGB values the pixels whose RGB values are tooth-colored to obtain x counts, and computing the number of times y that the x counts exceed a count threshold; if y / x is greater than a ratio threshold, determining that the x image frames contain mouth movement.
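A sketch of the second mouth-movement test along the same lines: frames in which many pixels look tooth-colored are counted, and movement is declared when the fraction y/x of such frames is large. The tooth-color test and both thresholds are assumptions for illustration:

```python
def is_tooth_color(rgb):
    """Bright, low-saturation pixels taken as teeth. An assumed heuristic;
    the patent does not define the tooth RGB values."""
    r, g, b = rgb
    return min(r, g, b) > 170 and max(r, g, b) - min(r, g, b) < 40

def mouth_moves_by_teeth(frames_pixels, count_threshold, ratio_threshold):
    """A frame 'shows teeth' when its tooth-pixel count exceeds
    count_threshold; movement is declared when the fraction y/x of such
    frames exceeds ratio_threshold."""
    counts = [sum(1 for p in pixels if is_tooth_color(p))
              for pixels in frames_pixels]
    y = sum(1 for c in counts if c > count_threshold)
    return y / len(counts) > ratio_threshold
```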
In an optional solution of the video sound effect processing method, obtaining the audio time interval of the audio frame data specifically includes:
filtering the audio frame data to obtain filtered first audio frame data, obtaining the time interval corresponding to the first audio frame data, and determining that this time interval is the audio time interval.
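A minimal sketch of deriving the audio time interval, assuming the filtering step keeps frames whose energy exceeds a threshold (the embodiments say only that the audio frame data is filtered, without naming the criterion):

```python
def audio_time_interval(frames, energy_threshold):
    """Filter out low-energy audio frames, then take the time span from the
    first to the last surviving frame as the audio time interval.

    Each frame is (timestamp_seconds, energy); both the energy measure and
    the threshold are assumptions for illustration.
    """
    kept = [t for t, energy in frames if energy >= energy_threshold]
    if not kept:
        return None
    return (kept[0], kept[-1])
```

The resulting (start, end) pair is what the method then uses to pull out the matching first set of image frames.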
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel in the audio frame data; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel in the audio frame data.
In an optional solution of the video sound effect processing apparatus, the processing unit is further configured to apply an indoor 3D sound effect strategy when playing the audio frame data if the first video was shot indoors.
In an optional solution of the video sound effect processing apparatus, the indoor 3D sound effect strategy includes: decreasing the volume or adding echo.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to randomly extract n frames of image data from the image frame data, and pass the n frames of image data to a trained classifier that executes a classification algorithm to determine the n scenes corresponding to the n frames of image data; if all n scenes are indoor, determine that the first video is indoor; otherwise, determine that the first video is not indoor; n is an integer greater than or equal to 2.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to extract m temporally consecutive image frames from the first set of image frame data, perform face recognition on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames; when mouth-region recognition on the x image frames determines that the mouth is moving, determine that the position of the mouth region within the x image frames is the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count among all the RGB values the pixels whose RGB values are not lip-colored to obtain x counts, and compute the difference between the maximum and the minimum of the x counts; if the difference is greater than a difference threshold, determine that the x image frames contain mouth movement; if the difference is less than the difference threshold, determine that the x image frames contain no mouth movement.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count among all the RGB values the pixels whose RGB values are tooth-colored to obtain x counts, and compute the number of times y that the x counts exceed a count threshold; if y / x is greater than a ratio threshold, determine that the x image frames contain mouth movement.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a control circuit and an input-output circuit, and the input-output circuit is connected to the control circuit.
The control circuit may include a storage and processing circuit. The storage circuit in the storage and processing circuit may be a memory, such as hard disk drive memory, non-volatile memory (for example, flash memory or other electronically programmable read-only memory used to form a solid-state drive), or volatile memory (for example, static or dynamic random access memory); the embodiments of the present application are not limited in this respect. The processing circuit in the storage and processing circuit may be used to control the operation of the electronic device. The processing circuit may be implemented based on one or more microprocessors, microcontrollers, digital signal processors, baseband processors, power management units, audio codec chips, application-specific integrated circuits, display driver integrated circuits, and the like.
The storage and processing circuit may be used to run software in the electronic device, such as an incoming-call alert ringing application, a short-message alert ringing application, an alarm alert ringing application, a media file playback application, a voice over Internet protocol (VOIP) phone call application, operating system functions, and the like. This software may be used to perform control operations such as playing an incoming-call alert ring, playing a short-message alert ring, playing an alarm alert ring, playing media files, making voice phone calls, and other functions of the electronic device; the embodiments of the present application are not limited in this respect.
The input-output circuit may be used to enable the electronic device to input and output data, that is, to allow the electronic device to receive data from an external device and to allow the electronic device to output data to an external device.
The input-output circuit may further include sensors. The sensors may include an ambient light sensor, an infrared proximity sensor based on light and capacitance, an ultrasonic sensor, a touch sensor (for example, a light-based touch sensor and/or a capacitive touch sensor, where the touch sensor may be part of a touch display screen or may be used independently as a touch sensor structure), an acceleration sensor, a gravity sensor, and other sensors. The input-output circuit may further include an audio component, which may be used to provide audio input and output functions for the electronic device. The audio component may also include a tone generator and other components for generating and detecting sound.
The input-output circuit may also include one or more display screens. The display screen may include one or a combination of a liquid crystal display, an organic light-emitting diode display, an electronic ink display, a plasma display, and displays using other display technologies. The display screen may include a touch sensor array (that is, the display screen may be a touch display screen). The touch sensor may be a capacitive touch sensor formed by an array of transparent touch sensor electrodes (such as indium tin oxide (ITO) electrodes), or may be a touch sensor formed using other touch technologies, such as acoustic wave touch, pressure-sensitive touch, resistive touch, or optical touch; the embodiments of the present application are not limited in this respect.
The input-output circuit may further include a communication circuit that can be used to provide the electronic device with the ability to communicate with external devices. The communication circuit may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuit in the communication circuit may include a radio frequency transceiver circuit, a power amplifier circuit, a low-noise amplifier, switches, filters, and antennas. For example, the wireless communication circuit in the communication circuit may include a circuit for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals; for instance, the communication circuit may include a near field communication antenna and a near field communication transceiver. The communication circuit may also include a cellular phone transceiver and antenna, a wireless local area network transceiver circuit and antenna, and so on.
The input-output circuit may further include other input-output units, such as buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light-emitting diodes, and other status indicators.
The electronic device may further include a battery (not shown) that supplies electrical power to the electronic device.
Video broadly refers to the various technologies that capture, record, process, store, transmit, and reproduce a series of still images in the form of electrical signals. When the image changes at more than 24 frames per second, the human eye, by the principle of persistence of vision, can no longer distinguish the individual still pictures; the result appears as a smooth, continuous visual effect, and such a continuous sequence of pictures is called video. Video technology was first developed for television systems, but it has since evolved into many formats that allow consumers to record video themselves. Advances in network technology have also enabled recorded video clips to be published on the Internet as streaming media and received and played back by computers. Video is a different technology from film, which uses photography to capture moving scenes as a series of still photographs.
With the adoption of cameras in electronic devices, and especially since cameras were combined with smartphones, users shoot video more and more often; the recent rapid growth of short-video applications has made video use even more frequent. Unless otherwise specified, the videos discussed in this application are videos shot by an electronic device, not videos produced with professional equipment (for example, films, television series, and other cinematic works). An existing shot video contains both images and audio. For the audio data in a video, existing electronic devices generally only record the audio captured during shooting and do not process it further, for example, by adjusting the audio data according to the position of the sound source in the shot video. As a result, the scene is poorly reproduced, which degrades the user experience.
The embodiments of the present application are described in detail below.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a sound effect processing method for video disclosed in an embodiment of the present application, applied to the electronic device described above with reference to FIG. 1. The method includes the following steps:
Step S201: acquire a shot first video, and extract image frame data and audio frame data from the first video.
Step S202: acquire an audio time interval of the audio frame data, and extract from the image frame data a first group of image frame data corresponding to the audio time interval.
Acquiring the audio time interval of the audio frame data may specifically include:
filtering the audio frame data to obtain filtered first audio frame data, acquiring the time interval corresponding to the first audio frame data, and taking that time interval as the audio time interval.
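As one possible reading of this step, the filtering can be a per-frame energy gate, with the audio time interval taken as the span covered by the surviving frames. The embodiment does not specify which filter is used, so the RMS threshold, frame length, and function name below are all assumptions for illustration:

```python
import math

def audio_time_interval(frames, frame_ms=20, energy_threshold=0.01):
    """Return (start_ms, end_ms) covered by the frames that survive a
    simple RMS-energy filter, or None if every frame is filtered out.
    The energy gate is an assumed stand-in for the unspecified filter."""
    kept = [i for i, frame in enumerate(frames)
            if math.sqrt(sum(s * s for s in frame) / len(frame)) >= energy_threshold]
    if not kept:
        return None
    # The interval runs from the start of the first kept frame to the
    # end of the last kept frame.
    return kept[0] * frame_ms, (kept[-1] + 1) * frame_ms
```

With 20 ms frames, a clip whose second and third frames carry audible energy yields the interval (20, 60).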
Step S203: analyze the first group of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In step S203, performing 3D sound effect processing on the audio frame data according to the sound source position may specifically include:
if the sound source position is on the left, increasing the volume of the left channel of the audio frame data or decreasing the volume of the right channel; if the sound source position is on the right, increasing the volume of the right channel of the audio frame data or decreasing the volume of the left channel.
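A minimal sketch of this channel-volume adjustment, assuming the audio is available as per-channel sample lists; the fixed gain in decibels is a hypothetical parameter, since the embodiment only states that one channel is raised (or the other lowered):

```python
def pan_to_source(left, right, source_side, gain_db=6.0):
    """Raise the volume of the channel on the sound-source side.
    The 6 dB default is an assumed value; source_side is "left" or
    "right" as determined from the image frames in step S203."""
    gain = 10 ** (gain_db / 20.0)  # convert dB to a linear factor
    if source_side == "left":
        left = [s * gain for s in left]
    elif source_side == "right":
        right = [s * gain for s in right]
    return left, right
```

The symmetric option of attenuating the opposite channel would divide by the same factor instead.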
Optionally, if the first video was shot indoors, an indoor 3D sound effect playback strategy may also be applied to the audio frame data. Such a strategy includes, but is not limited to, reducing the volume, adding echo, and so on.
In the technical solution provided by the present application, when a shot first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio and enhances the user experience.
Optionally, the method for determining that the first video is indoor may specifically include:
randomly extracting n frames of image data from the image frame data and passing the n frames to a trained classifier, which runs a classification algorithm to determine the n scenes corresponding to the n frames; if all n scenes are indoor, the first video is determined to be indoor; otherwise, it is determined to be non-indoor.
The classifier includes, but is not limited to, machine learning models, neural network models, deep learning models, and other algorithmic models with a classification function.
Sampling n frames reduces the amount of computation: compared with running the classifier on every image frame of the first video, it greatly reduces the workload without lowering accuracy. The applicant's statistics over large volumes of shot video show that shot videos are generally short; most are under 5 minutes, and many under 2 minutes, i.e., what is commonly called micro video. Unlike a film, whose scenes change frequently, a micro video is short and is usually produced in a single take, without subsequent editing or splicing, so its shooting scene generally does not change. These statistics show that the scene of the vast majority of shot videos is fixed; for example, an indoor shoot stays indoors and an outdoor shoot stays outdoors. Therefore, extracting and judging only n image frames of the first video is sufficient to determine whether it is indoor or outdoor.
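The sampling-based indoor check can be sketched as follows, with the trained classifier abstracted as a caller-supplied `classify_scene` function, since the embodiment does not fix a particular model:

```python
import random

def video_is_indoor(image_frames, classify_scene, n=5, seed=None):
    """Sample n frames at random and call the first video indoor only
    when every sampled frame is classified as indoor. classify_scene
    stands in for the trained classifier of the embodiment."""
    rng = random.Random(seed)
    sample = rng.sample(image_frames, min(n, len(image_frames)))
    return all(classify_scene(frame) == "indoor" for frame in sample)
```

Because only n frames reach the classifier, the cost is independent of the video length, which is the saving the paragraph above describes.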
In step S203, analyzing the first group of image frame data to determine the sound source position of the audio may specifically include:
extracting m image frames covering a continuous time period from the first group of image frame data; performing face recognition on the m image frames to obtain w image frames that contain a human face; extracting x temporally consecutive image frames from the w image frames; and, when mouth-region recognition on the x image frames determines that they contain mouth movement, taking the position of the mouth region within the x images as the sound source position of the audio.
The continuous time period may be a set of image frames whose shooting times are consecutive, for example, the m image frames in the period from 1 s to 10 s; other time periods may of course be used, and this application does not limit the specific time of the period.
The face recognition processing may use a general-purpose face recognition algorithm, for example, the Baidu face recognition algorithm, Google face recognition, and so on.
Performing mouth-region recognition on the x image frames to determine that they contain mouth movement may specifically include:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, in each region, the number of pixels whose RGB values are non-lip values, giving x counts; and computing the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, the x images are determined to contain mouth movement; if the difference is less than the difference threshold, the x images are determined not to contain mouth movement.
The principle of this method is that a speaking person necessarily moves the mouth. Analysis of mouth movement shows that when a person speaks, the mouth region divides into two parts: the first part is the lip region (taking Asians as an example, the lips are pink, and the range of lip RGB values can be looked up), and the second part is the non-lip region (which may contain tooth RGB values or the dark RGB values of the unlit mouth interior). Statistics over large volumes of data show that during mouth movement the area of the second part changes continuously; when a person speaks a passage, the gap between the largest and smallest extent of the second part is large. Since the shooting distance is relatively fixed, this is reflected in the image frames as a large change in the number of pixels belonging to the second part. The applicant identifies mouth movement based on this principle.
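The non-lip pixel-count test can be sketched as below. The lip-color test is abstracted as a caller-supplied predicate, since the embodiment only says the lip RGB range "can be looked up", and the threshold is a hypothetical parameter:

```python
def mouth_moves_by_area(mouth_regions, is_lip_rgb, diff_threshold):
    """mouth_regions: one list of (R, G, B) pixels per image frame.
    Count the non-lip pixels in each frame's mouth region; report
    movement when the spread between the largest and smallest count
    exceeds the threshold."""
    counts = [sum(1 for px in region if not is_lip_rgb(px))
              for region in mouth_regions]
    return max(counts) - min(counts) > diff_threshold
```

A closed mouth yields near-zero non-lip counts in every frame, so the spread stays below the threshold and no movement is reported.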
Alternatively, performing mouth-region recognition on the x image frames to determine that they contain mouth movement may specifically include:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, in each region, the number of pixels whose RGB values are tooth values, giving x counts; and computing the number of times y that a count among the x counts exceeds a count threshold. If y/x is greater than a ratio threshold, the x image frames are determined to contain mouth movement.
The principle of this method is likewise that a speaking person necessarily moves the mouth: the mouth region divides into the lip region (pink, with an RGB range that can be looked up) and the non-lip region (for example, tooth RGB values). Statistics over large volumes of data show that during mouth movement the area of the second part changes continuously, and teeth appear intermittently as it changes, so counting how often teeth appear determines whether there is mouth movement. In addition, Asian teeth are generally white with a yellowish tint, which differs greatly from the RGB values of the lips, so choosing the tooth RGB values also reduces errors and improves the accuracy of mouth-movement recognition.
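The tooth-count variant can be sketched similarly; again the tooth-color test is a caller-supplied predicate, and the two thresholds are hypothetical parameters:

```python
def mouth_moves_by_teeth(mouth_regions, is_tooth_rgb,
                         count_threshold, ratio_threshold):
    """Count, per frame, the pixels matching a tooth colour; y is the
    number of frames whose tooth-pixel count exceeds count_threshold,
    and movement is reported when y / x exceeds ratio_threshold."""
    x = len(mouth_regions)
    y = sum(1 for region in mouth_regions
            if sum(1 for px in region if is_tooth_rgb(px)) > count_threshold)
    return y / x > ratio_threshold
```

Intermittent tooth visibility across the x frames, as the paragraph above describes, is exactly what drives y/x above the ratio threshold.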
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a video sound effect processing method disclosed in an embodiment of the present application, applied to the electronic device described above with reference to FIG. 1. The method includes the following steps:
Step S301: acquire a shot first video, and extract image frame data and audio frame data from the first video.
Step S302: acquire an audio time interval of the audio frame data, and extract from the image frame data a first group of image frame data corresponding to the audio time interval.
Step S303: extract m image frames covering a continuous time period from the first group of image frame data; perform face recognition on the m image frames to obtain w image frames that contain a human face; extract x temporally consecutive image frames from the w image frames; and, when mouth-region recognition on the x image frames determines that they contain mouth movement, take the position of the mouth region within the x images as the sound source position of the audio.
Step S304: if the sound source position is on the left, increase the volume of the left channel of the audio frame data or decrease the volume of the right channel.
In the technical solution provided by the present application, when a shot first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio and enhances the user experience.
Referring to FIG. 4, FIG. 4 provides a sound effect processing apparatus for video. The apparatus includes:
an acquiring unit 401, configured to acquire a shot first video and extract image frame data and audio frame data from the first video; and
a processing unit 402, configured to acquire an audio time interval of the audio frame data, extract from the image frame data a first group of image frame data corresponding to the audio time interval, analyze the first group of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In the technical solution provided by the present application, when a shot first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio and enhances the user experience.
Optionally, the processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel of the audio frame data or decrease the volume of the right channel; and if the sound source position is on the right, increase the volume of the right channel of the audio frame data or decrease the volume of the left channel.
Optionally, the processing unit is further configured to apply an indoor 3D sound effect playback strategy to the audio frame data if the first video is indoor.
Optionally, the processing unit is specifically configured to randomly extract n frames of image data from the image frame data and pass the n frames to a trained classifier, which runs a classification algorithm to determine the n scenes corresponding to the n frames; if all n scenes are indoor, the first video is determined to be indoor; otherwise, it is determined to be non-indoor, where n is an integer greater than or equal to 2.
Optionally, the processing unit is specifically configured to extract m image frames covering a continuous time period from the first group of image frame data, perform face recognition on the m image frames to obtain w image frames that contain a human face, extract x temporally consecutive image frames from the w image frames, and, when mouth-region recognition on the x image frames determines that they contain mouth movement, take the position of the mouth region within the x images as the sound source position of the audio, where m ≥ w ≥ x and m, w, and x are all integers greater than or equal to 2.
Optionally, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count in each region the number of pixels whose RGB values are non-lip values to obtain x counts, and compute the difference between the maximum and minimum of the x counts; if the difference is greater than a difference threshold, the x images are determined to contain mouth movement, and if the difference is less than the difference threshold, the x images are determined not to contain mouth movement.
Optionally, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count in each region the number of pixels whose RGB values are tooth values to obtain x counts, and compute the number of times y that a count exceeds a count threshold; if y/x is greater than a ratio threshold, the x image frames are determined to contain mouth movement.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application. As shown in the figure, the electronic device includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the following steps:
acquiring a shot first video, and extracting image frame data and audio frame data from the first video;
acquiring an audio time interval of the audio frame data, and extracting from the image frame data a first group of image frame data corresponding to the audio time interval; and
analyzing the first group of image frame data to determine the sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In an optional solution, performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
if the sound source position is on the left, increasing the volume of the left channel of the audio frame data or decreasing the volume of the right channel; and if the sound source position is on the right, increasing the volume of the right channel of the audio frame data or decreasing the volume of the left channel.
In an optional solution, the method further includes:
if the first video is indoor, applying an indoor 3D sound effect playback strategy to the audio frame data.
In an optional solution, the method for determining that the first video is indoor specifically includes:
randomly extracting n frames of image data from the image frame data and passing the n frames to a trained classifier, which runs a classification algorithm to determine the n scenes corresponding to the n frames; if all n scenes are indoor, the first video is determined to be indoor; otherwise, it is determined to be non-indoor, where n is an integer greater than or equal to 2.
In an optional solution, analyzing the first group of image frame data to determine the sound source position of the audio specifically includes:
extracting m image frames covering a continuous time period from the first group of image frame data; performing face recognition on the m image frames to obtain w image frames that contain a human face; extracting x temporally consecutive image frames from the w image frames; and, when mouth-region recognition on the x image frames determines that they contain mouth movement, taking the position of the mouth region within the x images as the sound source position of the audio, where m ≥ w ≥ x and m, w, and x are all integers greater than or equal to 2.
In an optional solution, performing mouth-region recognition on the x image frames to determine that they contain mouth movement specifically includes:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting in each region the number of pixels whose RGB values are non-lip values to obtain x counts; and computing the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, the x images are determined to contain mouth movement; if the difference is less than the difference threshold, the x images are determined not to contain mouth movement.
In another optional solution, performing mouth-region recognition on the x image frames to determine that they contain mouth movement specifically includes:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting in each region the number of pixels whose RGB values are tooth values to obtain x counts; and computing the number of times y that a count exceeds a count threshold. If y/x is greater than a ratio threshold, the x image frames are determined to contain mouth movement.
The foregoing mainly describes the solutions of the embodiments of the present application from the perspective of the method-side execution process. It can be understood that, to implement the above functions, the electronic device includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments provided herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
In the embodiments of the present application, the electronic device may be divided into functional units according to the foregoing method examples; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is merely a division of logical functions; other division manners are possible in actual implementation.
It should be noted that the electronic device described in the embodiments of the present application is presented in the form of functional units. The term "unit" as used herein should be understood in the broadest possible sense; the object implementing the function described for each "unit" may be, for example, an ASIC, a single circuit, a processor (shared, dedicated, or part of a chipset) and memory executing one or more software or firmware programs, combinational logic circuitry, and/or other suitable components providing the described function.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to perform some or all of the steps of any sound effect processing method for video described in the foregoing method embodiments.
An embodiment of the present application further provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps of any sound effect processing method for video described in the foregoing method embodiments.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of combined actions; however, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely schematic; the division of the units is merely a division of logical functions, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or part of the steps in the various methods of the foregoing embodiments can be completed by a program instructing related hardware; the program can be stored in a computer-readable memory, and the memory can include a flash drive, ROM, RAM, a magnetic disk, an optical disc, or the like.
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。The embodiments of the present application have been described in detail above, and specific examples have been used in this article to explain the principles and implementation of the present application. The descriptions of the above embodiments are only used to help understand the method and the core idea of the present application; Those of ordinary skill in the art, based on the ideas of the present application, may have changes in specific implementations and application scopes. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A sound effect processing method for a video, wherein the method comprises the following steps:
    obtaining a captured first video, and extracting image frame data and audio frame data from the first video;
    obtaining an audio time interval of the audio frame data, and extracting, from the image frame data, a first group of image frame data corresponding to the audio time interval;
    analyzing the first group of image frame data to determine a sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  2. The method according to claim 1, wherein performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically comprises:
    if the sound source position is on the left, increasing the volume of the left channel in the audio frame data or decreasing the volume of the right channel in the audio frame data; if the sound source position is on the right, increasing the volume of the right channel in the audio frame data or decreasing the volume of the left channel in the audio frame data.
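The channel adjustment of claim 2 can be sketched as follows. The gain factor and the sample representation are illustrative assumptions; the claim only requires raising one channel's volume relative to the other.

```python
def pan_stereo(left, right, source_side, gain=1.5):
    # Boost the channel on the same side as the sound source; the 1.5x
    # gain factor is an illustrative assumption, not taken from the claim.
    if source_side == 'left':
        left = [s * gain for s in left]
    elif source_side == 'right':
        right = [s * gain for s in right]
    return left, right
```

Equivalently, the claim allows attenuating the opposite channel instead (a gain below 1.0 applied to the far-side list).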
  3. The method according to claim 1, wherein the method further comprises:
    if the first video is indoor, playing the audio frame data with an indoor 3D sound effect strategy.
  4. The method according to claim 3, wherein
    the indoor 3D sound effect strategy playback comprises: decreasing the volume or increasing the echo.
  5. The method according to claim 3, wherein determining that the first video is indoor specifically comprises:
    randomly extracting n frames of image data from the image frame data, transmitting the n frames of image data to a trained classifier, and executing classification algorithm processing to determine n scenes corresponding to the n frames of image data; if all of the n scenes are indoor, determining that the first video is indoor; otherwise, determining that the first video is non-indoor; wherein n is an integer greater than or equal to 2.
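The sampling-and-voting rule of claim 5 can be sketched as below. `classify_scene` stands in for the trained classifier, which the claim does not specify; the all-frames-must-agree rule is taken directly from the claim.

```python
import random

def is_indoor(image_frames, classify_scene, n=5):
    # Draw n frames at random; the video counts as indoor only if the
    # classifier labels every sampled frame 'indoor'.
    samples = random.sample(image_frames, n)
    return all(classify_scene(frame) == 'indoor' for frame in samples)
```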
  6. The method according to claim 1, wherein analyzing the first group of image frame data to determine the sound source position of the audio specifically comprises:
    extracting m image frames of a continuous time period from the first group of image frame data; performing face recognition processing on the m image frames to obtain w image frames containing a human face; extracting x temporally continuous image frames from the w image frames; and when mouth-region recognition on the x image frames determines that the x image frames contain a mouth movement, determining the position of the mouth region in the x image frames as the sound source position of the audio; wherein m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
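The m → w → x frame-selection pipeline of claim 6 can be sketched as follows. `detect_face` and `locate_mouth` are assumed helpers standing in for the face-recognition and mouth-region steps, whose implementations the claim leaves unspecified; taking the longest consecutive run as the x frames is one reasonable reading of "temporally continuous".

```python
def locate_speaker(frames, detect_face, locate_mouth):
    # Keep the w frames in which a face is detected, remembering positions.
    faces = [(i, f) for i, f in enumerate(frames) if detect_face(f)]
    # Split into runs of temporally consecutive frames and keep the longest
    # run: these are the x frames examined for mouth movement.
    runs, run = [], []
    for i, f in faces:
        if run and i != run[-1][0] + 1:
            runs.append(run)
            run = []
        run.append((i, f))
    if run:
        runs.append(run)
    best = max(runs, key=len) if runs else []
    # The mouth-region positions in these frames give the source location.
    return [locate_mouth(f) for _, f in best]
```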
  7. The method according to claim 6, wherein the mouth-region recognition on the x image frames to determine that the x image frames contain a mouth movement specifically comprises:
    determining x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, among all the RGB values, the number of pixels having non-lip RGB values to obtain x counts; and calculating the difference between the maximum and minimum of the x counts; if the difference is greater than a difference threshold, determining that the x image frames contain a mouth movement; if the difference is less than the difference threshold, determining that the x image frames do not contain a mouth movement.
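A minimal sketch of the claim 7 test: when the mouth opens and closes, the number of non-lip pixels inside the mouth region swings between frames, so the max-minus-min spread of those counts exceeds a threshold. `is_lip_rgb` (the lip-colour test) and the threshold value are assumptions; the claim does not define them.

```python
def mouth_moving_by_nonlip_count(mouth_regions, is_lip_rgb, diff_threshold):
    # For each of the x mouth regions, count the pixels whose RGB value is
    # not a lip colour; a large spread between frames suggests the mouth
    # opened and closed during the x frames.
    counts = [sum(1 for px in region if not is_lip_rgb(px))
              for region in mouth_regions]
    return max(counts) - min(counts) > diff_threshold
```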
  8. The method according to claim 6, wherein the mouth-region recognition on the x image frames to determine that the x image frames contain a mouth movement specifically comprises:
    determining x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, among all the RGB values, the number of pixels having tooth RGB values to obtain x counts; and calculating the number of times y that the x counts exceed a count threshold; if y/x is greater than a ratio threshold, determining that the x image frames contain a mouth movement.
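The alternative test of claim 8 counts tooth-coloured pixels instead: teeth are only visible when the mouth is open, so if the tooth-pixel count exceeds a threshold in a large enough fraction y/x of the frames, the mouth is moving. `is_tooth_rgb` and both thresholds are illustrative assumptions.

```python
def mouth_moving_by_teeth(mouth_regions, is_tooth_rgb,
                          count_threshold, ratio_threshold):
    # y = number of the x frames whose tooth-coloured pixel count exceeds
    # count_threshold; motion is declared when y / x beats ratio_threshold.
    x = len(mouth_regions)
    y = sum(1 for region in mouth_regions
            if sum(1 for px in region if is_tooth_rgb(px)) > count_threshold)
    return y / x > ratio_threshold
```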
  9. The method according to claim 1, wherein obtaining the audio time interval of the audio frame data specifically comprises:
    filtering the audio frame data to obtain filtered first audio frame data, obtaining a time interval corresponding to the first audio frame data, and determining the time interval as the audio time interval.
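A minimal sketch of claim 9: filter out silent frames and take the time span of what remains as the audio time interval. The energy-based silence filter is an assumption; the claim only says the audio frame data is filtered.

```python
def audio_time_interval(audio_frames, frame_duration, energy_threshold):
    # audio_frames: list of (timestamp_ms, energy) pairs.  Frames at or
    # below the energy threshold are filtered out as silence; the span of
    # the remaining frames is taken as the audio time interval.
    voiced = [t for t, e in audio_frames if e > energy_threshold]
    if not voiced:
        return None
    return (min(voiced), max(voiced) + frame_duration)
```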
  10. A movie sound effect processing apparatus, wherein the movie sound effect processing apparatus comprises:
    an obtaining unit, configured to obtain a captured first video, and extract image frame data and audio frame data from the first video;
    a processing unit, configured to obtain an audio time interval of the audio frame data, extract, from the image frame data, a first group of image frame data corresponding to the audio time interval, analyze the first group of image frame data to determine a sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  11. The apparatus according to claim 10, wherein
    the processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel in the audio frame data; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel in the audio frame data.
  12. The apparatus according to claim 10, wherein
    the processing unit is further configured to: if the first video is indoor, play the audio frame data with an indoor 3D sound effect strategy.
  13. The apparatus according to claim 12, wherein
    the indoor 3D sound effect strategy playback comprises: decreasing the volume or increasing the echo.
  14. The apparatus according to claim 12, wherein
    the processing unit is specifically configured to: randomly extract n frames of image data from the image frame data, transmit the n frames of image data to a trained classifier, and execute classification algorithm processing to determine n scenes corresponding to the n frames of image data; if all of the n scenes are indoor, determine that the first video is indoor; otherwise, determine that the first video is non-indoor; wherein n is an integer greater than or equal to 2.
  15. The apparatus according to claim 10, wherein
    the processing unit is specifically configured to: extract m image frames of a continuous time period from the first group of image frame data; perform face recognition processing on the m image frames to obtain w image frames containing a human face; extract x temporally continuous image frames from the w image frames; and when mouth-region recognition on the x image frames determines that the x image frames contain a mouth movement, determine the position of the mouth region in the x image frames as the sound source position of the audio; wherein m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  16. The apparatus according to claim 15, wherein
    the processing unit is specifically configured to: determine x mouth regions of the x image frames; identify the RGB values of all pixels in the x mouth regions; count, among all the RGB values, the number of pixels having non-lip RGB values to obtain x counts; and calculate the difference between the maximum and minimum of the x counts; if the difference is greater than a difference threshold, determine that the x image frames contain a mouth movement; if the difference is less than the difference threshold, determine that the x image frames do not contain a mouth movement.
  17. The apparatus according to claim 15, wherein
    the processing unit is specifically configured to: determine x mouth regions of the x image frames; identify the RGB values of all pixels in the x mouth regions; count, among all the RGB values, the number of pixels having tooth RGB values to obtain x counts; and calculate the number of times y that the x counts exceed a count threshold; if y/x is greater than a ratio threshold, determine that the x image frames contain a mouth movement.
  18. An electronic device, comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs comprise instructions for performing the steps of the method according to any one of claims 1-9.
  19. A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method according to any one of claims 1-9.
  20. A computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method according to any one of claims 1-9.
PCT/CN2019/104044 2018-10-25 2019-09-02 Sound effect processing method for video, and related products WO2020082902A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811253072.1A CN109413563B (en) 2018-10-25 2018-10-25 Video sound effect processing method and related product
CN201811253072.1 2018-10-25

Publications (1)

Publication Number Publication Date
WO2020082902A1 true WO2020082902A1 (en) 2020-04-30

Family

ID=65469699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/104044 WO2020082902A1 (en) 2018-10-25 2019-09-02 Sound effect processing method for video, and related products

Country Status (2)

Country Link
CN (1) CN109413563B (en)
WO (1) WO2020082902A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4184927A4 (en) * 2020-11-18 2024-01-17 Tencent Tech Shenzhen Co Ltd Sound effect adjusting method and apparatus, device, storage medium, and computer program product

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413563B (en) * 2018-10-25 2020-07-10 Oppo广东移动通信有限公司 Video sound effect processing method and related product
KR20200107757A (en) * 2019-03-08 2020-09-16 엘지전자 주식회사 Method and apparatus for sound object following
CN110312032B (en) * 2019-06-17 2021-04-02 Oppo广东移动通信有限公司 Audio playing method and device, electronic equipment and computer readable storage medium
CN110753238B (en) * 2019-10-29 2022-05-06 北京字节跳动网络技术有限公司 Video processing method, device, terminal and storage medium
CN113747047B (en) * 2020-05-30 2023-10-13 华为技术有限公司 Video playing method and device
CN116158091A (en) * 2020-06-29 2023-05-23 海信视像科技股份有限公司 Display device and screen sounding method
CN112135226B (en) * 2020-08-11 2022-06-10 广东声音科技有限公司 Y-axis audio reproduction method and Y-axis audio reproduction system
CN113556501A (en) * 2020-08-26 2021-10-26 华为技术有限公司 Audio processing method and electronic equipment
CN112380396B (en) * 2020-11-11 2024-04-26 网易(杭州)网络有限公司 Video processing method and device, computer readable storage medium and electronic equipment
CN113050915B (en) * 2021-03-31 2023-12-26 联想(北京)有限公司 Electronic equipment and processing method
CN115022710B (en) * 2022-05-30 2023-09-19 咪咕文化科技有限公司 Video processing method, device and readable storage medium
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN115696172B (en) * 2022-08-15 2023-10-20 荣耀终端有限公司 Sound image calibration method and device
CN117793607A (en) * 2022-09-28 2024-03-29 华为技术有限公司 Playing control method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1984310A (en) * 2005-11-08 2007-06-20 Tcl通讯科技控股有限公司 Method and communication apparatus for reproducing a moving picture, and use in a videoconference system
CN1997161A (en) * 2006-12-30 2007-07-11 华为技术有限公司 A video terminal and audio code stream processing method
US20140314391A1 (en) * 2013-03-18 2014-10-23 Samsung Electronics Co., Ltd. Method for displaying image combined with playing audio in an electronic device
CN106162447A (en) * 2016-06-24 2016-11-23 维沃移动通信有限公司 The method of a kind of audio frequency broadcasting and terminal
CN109413563A (en) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 The sound effect treatment method and Related product of video

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829018B2 (en) * 2001-09-17 2004-12-07 Koninklijke Philips Electronics N.V. Three-dimensional sound creation assisted by visual information
WO2009078454A1 (en) * 2007-12-18 2009-06-25 Sony Corporation Data processing apparatus, data processing method, and storage medium
CN109040636B (en) * 2010-03-23 2021-07-06 杜比实验室特许公司 Audio reproducing method and sound reproducing system
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
WO2014010920A1 (en) * 2012-07-09 2014-01-16 엘지전자 주식회사 Enhanced 3d audio/video processing apparatus and method
US9674453B1 (en) * 2016-10-26 2017-06-06 Cisco Technology, Inc. Using local talker position to pan sound relative to video frames at a remote location



Also Published As

Publication number Publication date
CN109413563B (en) 2020-07-10
CN109413563A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
WO2020082902A1 (en) Sound effect processing method for video, and related products
CN106651955B (en) Method and device for positioning target object in picture
TWI755833B (en) An image processing method, an electronic device and a storage medium
CN110100251B (en) Apparatus, method, and computer-readable storage medium for processing document
JP2016531362A (en) Skin color adjustment method, skin color adjustment device, program, and recording medium
JP6336206B2 (en) Method, apparatus, program and recording medium for processing moving picture file identifier
TW201901527A (en) Video conference and video conference management method
CN105704369B (en) A kind of information processing method and device, electronic equipment
CN110312032B (en) Audio playing method and device, electronic equipment and computer readable storage medium
CN106303156B (en) To the method, device and mobile terminal of video denoising
CN108234879B (en) Method and device for acquiring sliding zoom video
CN109639896A (en) Block object detecting method, device, storage medium and mobile terminal
CN108200421B (en) White balance processing method, terminal and computer readable storage medium
WO2022151686A1 (en) Scene image display method and apparatus, device, storage medium, program and product
US9921796B2 (en) Sharing of input information superimposed on images
CN106982327A (en) Image processing method and device
WO2020088068A1 (en) Display screen mipi working frequency regulation method and related product
CN105608469B (en) The determination method and device of image resolution ratio
CN109218620B (en) Photographing method and device based on ambient brightness, storage medium and mobile terminal
WO2022151687A1 (en) Group photo image generation method and apparatus, device, storage medium, computer program, and product
CN109286841B (en) Movie sound effect processing method and related product
CN106488168A (en) The angle changing method of picture of collection and device in electric terminal
CN105045510B (en) Realize that video checks the method and device of operation
CN108540726B (en) Method and device for processing continuous shooting image, storage medium and terminal
CN106023114B (en) Image processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19877233; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19877233; Country of ref document: EP; Kind code of ref document: A1)