WO2020082902A1 - Sound effect processing method for video, and related products - Google Patents

Sound effect processing method for video, and related products

Info

Publication number
WO2020082902A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame data
audio
image
video
image frames
Prior art date
Application number
PCT/CN2019/104044
Other languages
French (fr)
Chinese (zh)
Inventor
朱克智
严锋贵
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Publication of WO2020082902A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S1/00 - Two-channel systems
    • H04S1/002 - Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; face representation
    • G06V40/171 - Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; image sequence

Definitions

  • This application relates to the field of audio technology, and in particular to a video sound effect processing method and related products.
  • The embodiments of the present application provide a video sound effect processing method and related products, which can process the audio of a video according to the position of its sound source, thereby improving the user experience.
  • An embodiment of the present application provides a video sound effect processing method, which includes the following steps:
  • The first group of image frame data is analyzed to determine the sound source position of the audio, and 3D sound effect processing is performed on the audio frame data according to the sound source position to obtain processed audio frame data.
  • A video sound effect processing device includes:
  • An obtaining unit, configured to obtain a captured first video and extract image frame data and audio frame data from the first video.
  • A processing unit, configured to obtain the audio time interval of the audio frame data and extract, from the image frame data, the first group of image frame data corresponding to the audio time interval; analyze the first group of image frame data to determine the sound source position of the audio; and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  • An embodiment of the present application provides an electronic device including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps of the first aspect of the embodiments of the present application.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to execute part or all of the steps described in the first aspect.
  • An embodiment of the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a video sound effect processing method disclosed in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another video sound effect processing method disclosed in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a video sound effect processing device disclosed in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
  • The electronic devices involved in the embodiments of the present application may include handheld devices with wireless communication functions (such as smartphones), in-vehicle devices, virtual reality (VR) / augmented reality (AR) devices, wearable devices, computing devices, or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, R&D / test platforms, servers, and so on.
  • The electronic device may filter the audio data (the sound emitted by the sound source) with an HRTF (head-related transfer function) filter to obtain virtual surround sound, also called surround sound or panoramic sound, and thereby achieve a three-dimensional stereo effect.
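As an illustrative sketch of this filtering step (the function name, array shapes, and use of NumPy are assumptions, not part of the disclosure), a mono source can be rendered binaurally by convolving it with a per-ear head-related impulse response:

```python
import numpy as np

def apply_hrtf(mono, hrir_left, hrir_right):
    """Render a mono source binaurally by convolving it with a
    head-related impulse response (HRIR) for each ear.

    mono, hrir_left, hrir_right: 1-D NumPy arrays of samples.
    Returns an (n_samples, 2) stereo array.
    """
    left = np.convolve(mono, hrir_left)    # left-ear filtering
    right = np.convolve(mono, hrir_right)  # right-ear filtering
    n = max(len(left), len(right))
    out = np.zeros((n, 2))
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out
```

In practice the HRIR pair would be chosen from a measured database according to the estimated source direction; here the impulse responses are simply passed in.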
  • The time-domain counterpart of the HRTF is the HRIR (head-related impulse response).
  • A binaural room impulse response (BRIR) consists of three parts: direct sound, early reflections, and reverberation.
  • Performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
  • If the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel.
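The channel-volume adjustment can be sketched as follows; the gain factor and function name are illustrative assumptions rather than values taken from the disclosure:

```python
def pan_by_source_position(left_ch, right_ch, source_on_left, gain=1.25):
    """Adjust per-channel volume according to the sound source position.

    left_ch, right_ch: lists of samples for each stereo channel.
    source_on_left: True if the source sits in the left half of the frame.
    gain: boost factor for the channel nearest the source (an assumption).
    """
    if source_on_left:
        left_ch = [s * gain for s in left_ch]    # boost the left channel
    else:
        right_ch = [s * gain for s in right_ch]  # boost the right channel
    return left_ch, right_ch
```

Equivalently, the far channel could be attenuated by 1/gain instead of boosting the near one, matching the "increase ... or decrease" wording.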
  • the method further includes:
  • If the first video is indoor, an indoor 3D sound effect strategy is applied to the audio frame data.
  • The indoor 3D sound effect strategy includes decreasing the volume or increasing the echo.
  • the method of determining that the first video is indoor specifically includes:
  • the analysis of the first group of image frame data to determine the location of the audio sound source specifically includes:
  • Extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames;
  • When mouth movement is identified in the mouth areas of the x image frames, the position of the mouth area within the x image frames is determined to be the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • Identifying the mouth areas of the x image frames to determine that the x image frames have mouth movement specifically includes:
  • Acquiring the audio time interval of the audio frame data specifically includes:
  • Filter the audio frame data to obtain filtered first audio frame data, acquire the time interval corresponding to the first audio frame data, and determine that time interval to be the audio time interval.
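A minimal sketch of this step, assuming a simple per-frame energy threshold stands in for the unspecified filter (the frame length, threshold, and function name are all assumptions):

```python
def audio_time_interval(samples, sample_rate, frame_len=1024, threshold=0.01):
    """Return the (start, end) time in seconds of the span whose frame
    energy exceeds a threshold, i.e. where audible audio remains after
    filtering out near-silence. Returns None if no frame is loud enough.
    """
    active = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy > threshold:          # frame survives the filter
            active.append(i)
    if not active:
        return None
    start = active[0] / sample_rate
    end = min(active[-1] + frame_len, len(samples)) / sample_rate
    return (start, end)
```

The image frames whose timestamps fall inside the returned interval would then form the first group of image frame data.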
  • The processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel or decrease the volume of the left channel.
  • The processing unit is further configured to apply an indoor 3D sound effect strategy to the audio frame data if the first video is shot indoors.
  • The indoor 3D sound effect strategy includes decreasing the volume or increasing the echo.
  • The processing unit is specifically configured to randomly extract n frames of image data from the image frame data, pass the n frames to a trained classifier for classification, and determine the n scenes corresponding to the n frames. If the n scenes are all indoor, the first video is determined to be indoor; otherwise, the first video is determined to be non-indoor; n is an integer greater than or equal to 2.
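This sampling-and-classification step can be sketched as follows, with the trained classifier stood in by an arbitrary callable (the scene labels, sample size, and function name are assumptions):

```python
import random

def video_is_indoor(image_frames, classifier, n=5):
    """Randomly sample n frames and classify each; the video is treated
    as indoor only if every sampled frame is classified as indoor.

    classifier: any callable frame -> scene label ("indoor"/"outdoor"),
    a stand-in for the trained classifier the text assumes.
    """
    n = min(n, len(image_frames))
    sampled = random.sample(image_frames, n)      # random n-frame subset
    scenes = [classifier(frame) for frame in sampled]
    return all(scene == "indoor" for scene in scenes)
```

Requiring every sampled scene to be indoor mirrors the "all indoors" condition; a single non-indoor sample makes the video non-indoor.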
  • The processing unit is specifically configured to extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, extract x temporally consecutive image frames from the w image frames, and, when mouth movement is identified in the mouth areas of the x image frames, determine the position of the mouth area within the x image frames to be the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with non-lip RGB values to obtain x counts, and calculate the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, it is determined that the x image frames have mouth movement; if the difference is less than the difference threshold, it is determined that the x image frames have no mouth movement.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with tooth RGB values to obtain x counts, and calculate the number y of the x counts that exceed a count threshold. If y / x is greater than a ratio threshold, it is determined that the x image frames have mouth movement.
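A minimal sketch of the second detection method, again with the tooth-colour test supplied as a predicate (the thresholds and function name are assumptions):

```python
def has_mouth_movement_by_teeth(mouth_regions, is_tooth_rgb,
                                count_threshold, ratio_threshold):
    """Method 2: count tooth-coloured pixels in each of the x mouth
    regions; if teeth are visible (count above count_threshold) in a
    large enough fraction y/x of the frames, the mouth is moving.

    is_tooth_rgb: predicate for white-to-yellow tooth RGB values
    (an assumption; the text derives the range from big-data statistics).
    """
    counts = [sum(1 for px in region if is_tooth_rgb(px))
              for region in mouth_regions]         # the x counts
    y = sum(1 for c in counts if c > count_threshold)
    return y / len(mouth_regions) > ratio_threshold
```

Teeth flash in and out of view while speaking, so a high proportion of tooth-visible frames is taken as evidence of mouth movement.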
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device includes a control circuit and an input-output circuit, and the input-output circuit is connected to the control circuit.
  • the control circuit may include a storage and processing circuit.
  • The storage circuit in the storage and processing circuit may be a memory, such as a hard disk drive memory, a non-volatile memory (such as flash memory or other electronically programmable read-only memory used to form a solid-state drive), or a volatile memory (such as static or dynamic random access memory); the embodiments of the present application are not limited in this regard.
  • the processing circuit in the storage and processing circuit can be used to control the operation of the electronic device.
  • the processing circuit may be implemented based on one or more microprocessors, microcontrollers, digital signal processors, baseband processors, power management units, audio codec chips, application specific integrated circuits, display driver integrated circuits, and the like.
  • The storage and processing circuits can be used to run software in the electronic device, such as an incoming call alert ringtone application, a short message alert ringtone application, an alarm alert ringtone application, a media file playback application, a voice over Internet protocol (VoIP) phone call application, operating system functions, and so on.
  • This software can be used to perform control operations such as playing an incoming call alert ringtone, playing a short message alert ringtone, playing an alarm alert ringtone, playing media files, making voice phone calls, and other functions in the electronic device; the embodiments of the present application are not limited in this regard.
  • The input-output circuit can be used to enable the electronic device to input and output data, that is, to allow the electronic device to receive data from external devices and to output data to external devices.
  • the input-output circuit may further include a sensor.
  • The sensor may include an ambient light sensor, an infrared proximity sensor based on light and capacitance, an ultrasonic sensor, a touch sensor (for example, a light-based touch sensor and/or a capacitive touch sensor, where the touch sensor may be part of a touch display screen or used independently as a touch sensor structure), an acceleration sensor, a gravity sensor, and other sensors.
  • the input-output circuit may further include an audio component, and the audio component may be used to provide audio input and output functions for the electronic device. Audio components can also include tone generators and other components for generating and detecting sound.
  • the input-output circuit may also include one or more display screens.
  • The display screen may include a liquid crystal display, an organic light-emitting diode display, an electronic ink display, a plasma display, or a display using another display technology, or a combination of several of these.
  • the display screen may include a touch sensor array (ie, the display screen may be a touch display screen).
  • The touch sensor may be a capacitive touch sensor formed by an array of transparent touch sensor electrodes (such as indium tin oxide (ITO) electrodes), or may be a touch sensor using another touch technology, such as acoustic touch, pressure-sensitive touch, resistive touch, or optical touch; the embodiments of the present application are not limited in this regard.
  • the input-output circuit may further include a communication circuit that can be used to provide an electronic device with the ability to communicate with an external device.
  • The communication circuit may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals.
  • the wireless communication circuit in the communication circuit may include a radio frequency transceiver circuit, a power amplifier circuit, a low noise amplifier, a switch, a filter, and an antenna.
  • the wireless communication circuit in the communication circuit may include a circuit for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals.
  • the communication circuit may include a near field communication antenna and a near field communication transceiver.
  • the communication circuit may also include a cellular phone transceiver and antenna, a wireless local area network transceiver circuit and antenna, and so on.
  • the input-output circuit may further include other input-output units.
  • the input-output unit may include buttons, joysticks, click wheels, scroll wheels, touch pad, keypad, keyboard, camera, light emitting diodes, and other status indicators.
  • the electronic device may further include a battery (not shown), and the battery is used to provide electrical energy to the electronic device.
  • Video generally refers to various technologies that capture, record, process, store, transmit and reproduce a series of still images in the form of electrical signals.
  • When continuous images change at a rate above 24 frames per second, according to the principle of persistence of vision, the human eye cannot distinguish the individual static pictures, and the sequence appears as a smooth, continuous visual effect.
  • Such a continuous sequence of pictures is called video.
  • Video technology was first developed for television systems, but it has since evolved into various formats that let consumers record video. The development of network technology has also allowed recorded video clips to exist on the Internet as streaming media that can be received and played by computers.
  • Video and film are different technologies: the latter uses photography to capture dynamic images as a series of still photos.
  • the video in this application is a video shot by an electronic device, and does not include video shot by professional equipment (such as movies, TV series, etc.).
  • Existing video shooting includes images and audio.
  • However, existing electronic devices generally only record the audio data collected during video shooting and do not process it, for example by processing the audio data according to the sound source in the shot video. This results in a poor scene restoration effect and degrades the user experience.
  • FIG. 2 is a schematic flowchart of a video audio processing method disclosed in an embodiment of the present application. The method is applied to the electronic device described in FIG. 1.
  • the video audio processing method includes the following steps:
  • Step S201: Acquire the captured first video, and extract the image frame data and audio frame data from the first video.
  • Step S202: Acquire the audio time interval of the audio frame data, and extract the first group of image frame data corresponding to the audio time interval from the image frame data.
  • Acquiring the audio time interval of the audio frame data may specifically include:
  • Filter the audio frame data to obtain the filtered first audio frame data, acquire the time interval corresponding to the first audio frame data, and determine that time interval to be the audio time interval.
  • Step S203: Analyze the first group of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain the processed audio frame data.
  • the step S203 of performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data may specifically include:
  • If the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel.
  • An indoor 3D sound effect strategy may also be applied to the audio frame data.
  • The indoor 3D sound effect strategy includes, but is not limited to, reducing the volume, increasing the echo, and so on.
  • In the technical solution provided by the present application, when the captured first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio data and the user experience.
  • the above method for determining that the first video is indoor may specifically include:
  • The above classifiers include, but are not limited to, machine learning models, neural network models, deep learning models, and other algorithmic models with classification capability.
  • Extracting n frames of image data as described above reduces the amount of computation. Compared with running the classifier over all image frame data of the first video, this greatly reduces the workload without reducing accuracy. According to the applicant's statistics over large amounts of shot video, video shooting time is generally short: most videos are under 5 minutes, or even under 2 minutes, and are commonly called micro videos. Unlike movies, whose scenes switch frequently, a micro video is short and is generally produced in a single shot without subsequent editing or stitching, so the shooting scene generally does not change. The statistics also show that most video shooting scenes are fixed; for example, a shot that starts indoors stays indoors, and a shot that starts outdoors stays outdoors. Therefore, directly extracting n image frames of the first video is enough to determine whether it is indoor or outdoor.
  • the analysis of the first group of image frame data in step S203 to determine the location of the sound source of the audio may specifically include:
  • Extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames.
  • The continuous time period contains image frames with consecutive shooting times, for example the m image frames within the period 1 s-10 s; it may of course be another time period.
  • The present application does not limit the specific length of this time period.
  • The above face recognition processing may use a general face recognition algorithm, for example the Baidu face recognition algorithm, Google face recognition, and so on.
  • Identifying the mouth areas of the x image frames to determine that the x image frames have mouth movement may specifically include:
  • The principle of this method is that a speaking person must move their mouth.
  • The movement of the mouth is therefore analyzed.
  • The mouth area is divided into two parts. The first part is the lip area (for Asians, the lips are pink, and the range of lip RGB values can be looked up from RGB tables). The second part is the non-lip area (which may contain the RGB values of teeth or the dark RGB values of the mouth interior). Statistics over large amounts of data show that when the mouth moves, the area of the second part changes constantly; for example, while a person speaks a passage, the difference between the maximum and minimum extent of the second part is large. Since the shooting distance is relatively fixed, this is reflected in the image frames as a relatively large change in the number of pixels belonging to the second part. The applicant identifies mouth movement based on this principle.
  • Alternatively, identifying the mouth areas of the x image frames to determine that the x image frames have mouth movement may specifically include:
  • The principle of this method is likewise that a speaking person must move their mouth.
  • The movement of the mouth is analyzed.
  • The mouth area is divided into two parts. The first part is the lip area (for Asians, the lips are pink, and the range of lip RGB values can be looked up from RGB tables). The second part is the non-lip area (such as pixels with tooth RGB values). Statistics over large amounts of data show that when the mouth moves, the second part changes constantly and teeth appear from time to time, so counting how often teeth appear can determine whether there is mouth movement.
  • Asian teeth are generally white to yellow, which differs greatly from the RGB values of the lips, so selecting the tooth RGB values also reduces errors and improves the accuracy of mouth movement recognition.
  • FIG. 3 is a schematic flowchart of a video sound effect processing method disclosed in an embodiment of the present application, applied to the electronic device described in FIG. 1 above.
  • The video sound effect processing method includes the following steps:
  • Step S301: Acquire the captured first video, and extract the image frame data and audio frame data from the first video.
  • Step S302: Acquire the audio time interval of the audio frame data, and extract the first group of image frame data corresponding to the audio time interval from the image frame data.
  • Step S303: Extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames; when mouth movement is identified in the mouth areas of the x image frames, the position of the mouth area within the x image frames is the sound source position of the audio.
  • Step S304: If the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel.
  • In the technical solution provided by the present application, when the captured first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio data and the user experience.
  • FIG. 4 provides a video audio processing device.
  • the video audio processing device includes:
  • the obtaining unit 401 is configured to obtain the first video captured and extract image frame data and audio frame data in the first video;
  • The processing unit 402 is configured to obtain the audio time interval of the audio frame data and extract, from the image frame data, the first group of image frame data corresponding to the audio time interval; analyze the first group of image frame data to determine the sound source position of the audio; and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  • In the technical solution provided by the present application, when the captured first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio data and the user experience.
  • The processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel; if the sound source position is on the right, increase the volume of the right channel or decrease the volume of the left channel.
  • The processing unit is further configured to apply an indoor 3D sound effect strategy to the audio frame data if the first video is indoor.
  • The processing unit is specifically configured to randomly extract n frames of image data from the image frame data, pass the n frames to a trained classifier for classification, and determine the n scenes corresponding to the n frames; if the n scenes are all indoor, the first video is determined to be indoor; otherwise, the first video is determined to be non-indoor; n is an integer greater than or equal to 2.
  • The processing unit is specifically configured to extract m image frames of the first group of image frame data over a continuous period of time, perform face recognition processing on the m image frames to obtain w image frames containing a human face, extract x temporally consecutive image frames from the w image frames, and, when mouth movement is identified in the mouth areas of the x image frames, determine the position of the mouth area within the x image frames to be the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with non-lip RGB values to obtain x counts, and calculate the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, it is determined that the x image frames have mouth movement; if the difference is less than the difference threshold, it is determined that the x image frames have no mouth movement.
  • The processing unit is specifically configured to determine the x mouth areas of the x image frames, identify the RGB values of all pixels in the x mouth areas, count the pixels with tooth RGB values to obtain x counts, and calculate the number y of the x counts that exceed a count threshold. If y / x is greater than a ratio threshold, it is determined that the x image frames have mouth movement.
  • FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
  • the electronic device includes a processor, a memory, a communication interface, and one or more programs.
  • One or more programs are stored in the above-mentioned memory and are configured to be executed by the above-mentioned processor.
  • the above-mentioned program includes instructions for performing the following steps:
  • The first group of image frame data is analyzed to determine the sound source position of the audio, and 3D sound effect processing is performed on the audio frame data according to the sound source position to obtain the processed audio frame data.
  • performing the 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
  • if the sound source position is on the left, increasing the volume of the left channel in the audio frame data or decreasing the volume of the right channel in the audio frame data; if the sound source position is on the right, increasing the volume of the right channel in the audio frame data or decreasing the volume of the left channel in the audio frame data.
  • the method further includes:
  • if the first video is indoor, an indoor 3D sound effect strategy is applied when playing the audio frame data.
  • the method for determining that the first video is indoor specifically includes:
  • analyzing the first set of image frame data to determine the sound source position of the audio specifically includes:
  • extracting m temporally consecutive image frames from the first set of image frame data, performing face recognition on the m image frames to obtain w image frames containing a human face, and extracting x temporally consecutive image frames from the w image frames; when mouth-region recognition on the x image frames determines that the mouth is moving, determining that the position of the mouth region within the x image frames is the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  • identifying the mouth regions of the x image frames to determine that the x image frames contain mouth movement specifically includes:
  • the electronic device includes a hardware structure and / or a software module corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
  • the embodiments of the present application may divide the electronic device into functional units according to the above method example; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit.
  • the above integrated unit can be implemented in the form of hardware or of a software functional unit. It should be noted that the division into units in the embodiments of the present application is schematic and is merely a division of logical functions; there may be other division manners in actual implementation.
  • each unit may be, for example, an application-specific integrated circuit (ASIC), a single circuit, a processor (shared, dedicated, or part of a chipset) and memory that execute one or more software or firmware programs, combinational logic circuits, and/or other suitable components that provide the functions described above.
  • An embodiment of the present application further provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps of any video sound effect processing method described in the foregoing method embodiments.
  • An embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps of any video sound effect processing method described in the foregoing method embodiments.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of software program modules.
  • If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
  • With this understanding, the technical solution of the present application, in essence the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media that can store program code.
  • the program may be stored in a computer-readable memory, and the memory may include: a flash disk, a ROM, a RAM, a magnetic disk, an optical disk, and so on.
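The sound-source localization and panning steps enumerated above reduce to a simple decision once the mouth region has been found. The sketch below is only an illustration of that decision; the one-third split of the frame width is an assumed threshold, since the embodiments say only "left" and "right":

```python
def sound_source_side(mouth_center_x: float, frame_width: float) -> str:
    """Classify a mouth-region centre into a coarse sound-source side.

    The thresholds (outer thirds of the frame) are an illustrative
    assumption; the embodiments themselves specify only 'left' and 'right'.
    """
    if mouth_center_x < frame_width / 3:
        return "left"
    if mouth_center_x > 2 * frame_width / 3:
        return "right"
    return "center"
```

A mouth centred in the left third of a 1920-pixel-wide frame maps to "left", which the later steps use to raise the left channel (or lower the right one).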


Abstract

Disclosed are a sound effect processing method for a video, and related products. The method comprises the following steps: obtaining a captured first video, and extracting image frame data and audio frame data from the first video; obtaining an audio time interval of the audio frame data, and extracting, from the image frame data, a first set of image frame data corresponding to the audio time interval; and analyzing the first set of image frame data to determine the sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data. The technical solution provided by the present application improves the user experience.

Description

Sound effect processing method for video, and related products

Technical Field
This application relates to the field of audio technology, and in particular to a sound effect processing method for video and related products.
Background
With the widespread adoption of electronic devices (such as mobile phones and tablet computers), the applications that electronic devices can support are ever more numerous and their functions ever more powerful. Electronic devices are developing in a diversified, personalized direction and have become indispensable electronic products in users' lives, and video applications are among the most frequently used applications on electronic devices.
Summary
Embodiments of the present application provide a sound effect processing method for video and related products, which can process the audio of a video according to the position of the sound source, thereby improving the user experience.
In a first aspect, an embodiment of the present application provides a sound effect processing method for video, the method including the following steps:
obtaining a captured first video, and extracting image frame data and audio frame data from the first video;
obtaining an audio time interval of the audio frame data, and extracting, from the image frame data, a first set of image frame data corresponding to the audio time interval;
analyzing the first set of image frame data to determine the sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In a second aspect, a movie sound effect processing apparatus is provided, the apparatus including:
an obtaining unit, configured to obtain a captured first video and to extract image frame data and audio frame data from the first video;
a processing unit, configured to obtain an audio time interval of the audio frame data, extract from the image frame data a first set of image frame data corresponding to the audio time interval, analyze the first set of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps of the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
Brief Description of the Drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a movie sound effect processing method disclosed in an embodiment of the present application;

FIG. 3 is a schematic flowchart of another movie sound effect processing method disclosed in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a movie sound effect processing apparatus disclosed in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The electronic devices involved in the embodiments of the present application may include various handheld devices with wireless communication functions (such as smartphones), in-vehicle devices, virtual reality (VR) / augmented reality (AR) devices, wearable devices, computing devices, or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, R&D / test platforms, servers, and so on. For convenience of description, the devices mentioned above are collectively referred to as electronic devices.
In a specific implementation, in the embodiments of the present application, the electronic device may filter the audio data (the sound emitted by the sound source) with an HRTF (Head Related Transfer Function) filter to obtain virtual surround sound, also called surround sound or panoramic sound, achieving a three-dimensional sound effect. The time-domain counterpart of the HRTF is the HRIR (Head Related Impulse Response). Alternatively, the audio data may be convolved with a Binaural Room Impulse Response (BRIR); a binaural room impulse response consists of three parts: direct sound, early reflections, and reverberation.
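Applying an HRIR (or a BRIR) is, concretely, a convolution of the mono source with a per-ear impulse response. The sketch below uses tiny made-up impulse responses purely to show the mechanics; real HRIRs/BRIRs are measured responses, and production code would use FFT-based convolution rather than this direct form:

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution (what applying an HRIR/BRIR amounts to)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binauralize(mono, hrir_left, hrir_right):
    """Filter one mono source with a left/right HRIR pair -> (left, right)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy impulse responses: the right ear hears the source delayed and
# attenuated, as if it were on the listener's left. These values are
# placeholders, not measured HRIRs.
left_ir = [1.0, 0.0]
right_ir = [0.0, 0.5]
left_out, right_out = binauralize([1.0, 0.0, 0.0], left_ir, right_ir)
```

With a unit impulse as input, the outputs are just the impulse responses themselves, which makes the delay-and-attenuate effect easy to see.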
During video shooting, the position of the audio source is not taken into account; that is, sound sources on the left, right, and so on receive no corresponding processing. This results in poor scene restoration in the video and affects the user experience.
In an optional solution of the video sound effect processing method, performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
if the sound source position is on the left, increasing the volume of the left channel in the audio frame data or decreasing the volume of the right channel in the audio frame data; if the sound source position is on the right, increasing the volume of the right channel in the audio frame data or decreasing the volume of the left channel in the audio frame data.
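A minimal sketch of this left/right adjustment, assuming interleaved (left, right) sample pairs and an illustrative gain factor of 1.5 (the embodiments do not specify by how much the volume is raised or lowered):

```python
def pan_stereo(samples, source_side, gain=1.5):
    """Boost the channel on the sound-source side.

    The embodiments equivalently allow attenuating the opposite channel
    instead. `samples` is a list of (left, right) pairs; `gain` is an
    assumed factor, not a value from the patent.
    """
    out = []
    for left, right in samples:
        if source_side == "left":
            left *= gain
        elif source_side == "right":
            right *= gain
        out.append((left, right))
    return out
```

In a real pipeline the samples would also be clipped or normalized after the gain is applied to avoid overflow.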
In an optional solution of the video sound effect processing method, the method further includes:
if the first video is indoor, applying an indoor 3D sound effect strategy when playing the audio frame data.
In an optional solution of the video sound effect processing method, the indoor 3D sound effect strategy includes: decreasing the volume or adding echo.
In an optional solution of the video sound effect processing method, the method of determining that the first video is indoor specifically includes:
randomly extracting n frames of image data from the image frame data, and passing the n frames of image data to a trained classifier that executes a classification algorithm to determine the n scenes corresponding to the n frames of image data; if all n scenes are indoor, determining that the first video is indoor; otherwise, determining that the first video is not indoor; n is an integer greater than or equal to 2.
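The decision rule sitting on top of the classifier is simply a unanimity check over the n sampled frames. In the sketch below the scene classifier itself is assumed to exist upstream; only its per-frame labels are consumed:

```python
def video_is_indoor(frame_scenes):
    """Apply the embodiment's rule: the video counts as indoor only when
    every sampled frame is classified as indoor.

    `frame_scenes` stands in for the output of a trained scene classifier
    over n >= 2 randomly sampled frames; the classifier is not implemented
    here.
    """
    return len(frame_scenes) >= 2 and all(s == "indoor" for s in frame_scenes)
```

A single disagreeing frame (or fewer than two samples) is enough to fall back to the non-indoor path.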
In an optional solution of the video sound effect processing method, analyzing the first set of image frame data to determine the sound source position of the audio specifically includes:
extracting m temporally consecutive image frames from the first set of image frame data, performing face recognition on the m image frames to obtain w image frames containing a human face, and extracting x temporally consecutive image frames from the w image frames; when mouth-region recognition on the x image frames determines that the mouth is moving, determining that the position of the mouth region within the x image frames is the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
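One way to realize the m → w → x selection is to filter the m frames down to the w face frames and then take the longest temporally consecutive run of them as the x frames. The helper below is a hypothetical illustration of that selection, with face detection itself stubbed out as a boolean per frame:

```python
def consecutive_face_run(frames_with_face_flag):
    """From m (index, has_face) frames, keep the w face frames and return
    the longest temporally consecutive run of indices: the x frames whose
    mouth region would then be inspected.

    Hypothetical helper; the patent does not prescribe how the consecutive
    run is chosen.
    """
    face_indices = [i for i, has_face in frames_with_face_flag if has_face]
    best, current = [], []
    for idx in face_indices:
        if current and idx == current[-1] + 1:
            current.append(idx)
        else:
            current = [idx]
        if len(current) > len(best):
            best = list(current)
    return best
```

For example, with faces in frames 0, 1, 3, 4, and 5, the run 3..5 is returned because it is the longest consecutive stretch.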
In an optional solution of the video sound effect processing method, identifying the mouth regions of the x image frames to determine that the x image frames contain mouth movement specifically includes:
determining the x mouth regions of the x image frames, identifying the RGB values of all pixels in the x mouth regions, counting among all the RGB values the pixels whose RGB values are not lip-colored to obtain x counts, and computing the difference between the maximum and the minimum of the x counts; if the difference is greater than a difference threshold, determining that the x image frames contain mouth movement; if the difference is less than the difference threshold, determining that the x image frames contain no mouth movement.
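A sketch of this first mouth-movement test. The intuition is that an opening mouth exposes more non-lip pixels (teeth, mouth interior), so the per-frame non-lip counts swing widely across the x frames. The RGB test for "lip-colored" pixels and the thresholds are assumptions made for illustration; the patent leaves both unspecified:

```python
def is_lip_color(rgb):
    """Crude lip-color test: reddish pixels. An assumed heuristic; the
    patent does not define which RGB values count as 'lip'."""
    r, g, b = rgb
    return r > 120 and g < 110 and b < 110

def mouth_moves_by_opening(frames_pixels, diff_threshold):
    """Per frame, count the non-lip pixels inside the mouth region, then
    declare movement when the largest and smallest counts differ by more
    than diff_threshold."""
    counts = [sum(1 for p in pixels if not is_lip_color(p))
              for pixels in frames_pixels]
    return max(counts) - min(counts) > diff_threshold
```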
In an optional solution of the video sound effect processing method, identifying the mouth regions of the x image frames to determine that the x image frames contain mouth movement specifically includes:
determining the x mouth regions of the x image frames, identifying the RGB values of all pixels in the x mouth regions, counting among all the RGB values the pixels whose RGB values are tooth-colored to obtain x counts, and computing the number of times y that the x counts exceed a count threshold; if y / x is greater than a ratio threshold, determining that the x image frames contain mouth movement.
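A sketch of the second mouth-movement test along the same lines: frames in which many pixels look tooth-colored are counted, and movement is declared when the fraction y/x of such frames is large. The tooth-color test and both thresholds are assumptions for illustration:

```python
def is_tooth_color(rgb):
    """Bright, low-saturation pixels taken as teeth. An assumed heuristic;
    the patent does not define the tooth RGB values."""
    r, g, b = rgb
    return min(r, g, b) > 170 and max(r, g, b) - min(r, g, b) < 40

def mouth_moves_by_teeth(frames_pixels, count_threshold, ratio_threshold):
    """A frame 'shows teeth' when its tooth-pixel count exceeds
    count_threshold; movement is declared when the fraction y/x of such
    frames exceeds ratio_threshold."""
    counts = [sum(1 for p in pixels if is_tooth_color(p))
              for pixels in frames_pixels]
    y = sum(1 for c in counts if c > count_threshold)
    return y / len(counts) > ratio_threshold
```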
In an optional solution of the video sound effect processing method, obtaining the audio time interval of the audio frame data specifically includes:
filtering the audio frame data to obtain filtered first audio frame data, obtaining the time interval corresponding to the first audio frame data, and determining that this time interval is the audio time interval.
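A minimal sketch of deriving the audio time interval, assuming the filtering step keeps frames whose energy exceeds a threshold (the embodiments say only that the audio frame data is filtered, without naming the criterion):

```python
def audio_time_interval(frames, energy_threshold):
    """Filter out low-energy audio frames, then take the time span from the
    first to the last surviving frame as the audio time interval.

    Each frame is (timestamp_seconds, energy); both the energy measure and
    the threshold are assumptions for illustration.
    """
    kept = [t for t, energy in frames if energy >= energy_threshold]
    if not kept:
        return None
    return (kept[0], kept[-1])
```

The resulting (start, end) pair is what the method then uses to pull out the matching first set of image frames.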
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel in the audio frame data; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel in the audio frame data.
In an optional solution of the video sound effect processing apparatus, the processing unit is further configured to apply an indoor 3D sound effect strategy when playing the audio frame data if the first video was shot indoors.
In an optional solution of the video sound effect processing apparatus, the indoor 3D sound effect strategy includes: decreasing the volume or adding echo.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to randomly extract n frames of image data from the image frame data, and pass the n frames of image data to a trained classifier that executes a classification algorithm to determine the n scenes corresponding to the n frames of image data; if all n scenes are indoor, determine that the first video is indoor; otherwise, determine that the first video is not indoor; n is an integer greater than or equal to 2.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to extract m temporally consecutive image frames from the first set of image frame data, perform face recognition on the m image frames to obtain w image frames containing a human face, and extract x temporally consecutive image frames from the w image frames; when mouth-region recognition on the x image frames determines that the mouth is moving, determine that the position of the mouth region within the x image frames is the sound source position of the audio, where m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count among all the RGB values the pixels whose RGB values are not lip-colored to obtain x counts, and compute the difference between the maximum and the minimum of the x counts; if the difference is greater than a difference threshold, determine that the x image frames contain mouth movement; if the difference is less than the difference threshold, determine that the x image frames contain no mouth movement.
In an optional solution of the video sound effect processing apparatus, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count among all the RGB values the pixels whose RGB values are tooth-colored to obtain x counts, and compute the number of times y that the x counts exceed a count threshold; if y / x is greater than a ratio threshold, determine that the x image frames contain mouth movement.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes a control circuit and an input-output circuit, and the input-output circuit is connected to the control circuit.
The control circuit may include a storage and processing circuit. The storage circuit in the storage and processing circuit may be a memory, such as hard disk drive memory, non-volatile memory (for example, flash memory or other electronically programmable read-only memory used to form a solid-state drive), or volatile memory (for example, static or dynamic random access memory); the embodiments of the present application are not limited in this respect. The processing circuit in the storage and processing circuit may be used to control the operation of the electronic device. The processing circuit may be implemented based on one or more microprocessors, microcontrollers, digital signal processors, baseband processors, power management units, audio codec chips, application-specific integrated circuits, display driver integrated circuits, and the like.
The storage and processing circuit may be used to run software in the electronic device, such as an incoming-call alert ringing application, a short-message alert ringing application, an alarm alert ringing application, a media file playback application, a voice over Internet protocol (VOIP) phone call application, operating system functions, and the like. This software may be used to perform control operations such as playing an incoming-call alert ring, playing a short-message alert ring, playing an alarm alert ring, playing media files, making voice phone calls, and other functions of the electronic device; the embodiments of the present application are not limited in this respect.
The input-output circuit may be used to enable the electronic device to input and output data, that is, to allow the electronic device to receive data from an external device and to allow the electronic device to output data to an external device.
The input-output circuit may further include sensors. The sensors may include an ambient light sensor, an infrared proximity sensor based on light and capacitance, an ultrasonic sensor, a touch sensor (for example, a light-based touch sensor and/or a capacitive touch sensor, where the touch sensor may be part of a touch display screen or may be used independently as a touch sensor structure), an acceleration sensor, a gravity sensor, and other sensors. The input-output circuit may further include an audio component, which may be used to provide audio input and output functions for the electronic device. The audio component may also include a tone generator and other components for generating and detecting sound.
The input-output circuit may also include one or more display screens. The display screen may include one or a combination of a liquid crystal display, an organic light-emitting diode display, an electronic ink display, a plasma display, and displays using other display technologies. The display screen may include a touch sensor array (that is, the display screen may be a touch display screen). The touch sensor may be a capacitive touch sensor formed by an array of transparent touch sensor electrodes (such as indium tin oxide (ITO) electrodes), or may be a touch sensor formed using other touch technologies, such as acoustic wave touch, pressure-sensitive touch, resistive touch, or optical touch; the embodiments of the present application are not limited in this respect.
The input-output circuit may further include a communication circuit that can be used to provide the electronic device with the ability to communicate with external devices. The communication circuit may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuit in the communication circuit may include a radio frequency transceiver circuit, a power amplifier circuit, a low-noise amplifier, switches, filters, and antennas. For example, the wireless communication circuit in the communication circuit may include a circuit for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals; for instance, the communication circuit may include a near field communication antenna and a near field communication transceiver. The communication circuit may also include a cellular phone transceiver and antenna, a wireless local area network transceiver circuit and antenna, and so on.
The input-output circuit may further include other input-output units, such as buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light-emitting diodes, and other status indicators.
The electronic device may further include a battery (not shown) that supplies electrical power to the electronic device.
Video broadly refers to the various technologies that capture, record, process, store, transmit, and reproduce a series of still images in the form of electrical signals. When the image changes at more than 24 frames per second, the human eye, by the principle of persistence of vision, can no longer distinguish the individual still pictures; the result appears as a smooth, continuous visual effect, and such a continuous sequence of pictures is called video. Video technology was first developed for television systems, but it has since evolved into many formats that allow consumers to record video themselves. Advances in network technology have also enabled recorded video clips to be published on the Internet as streaming media and received and played back by computers. Video is a different technology from film, which uses photography to capture moving scenes as a series of still photographs.
With the adoption of cameras in electronic devices, and especially since cameras were combined with smartphones, users shoot video more and more often; the recent rapid growth of short-video applications has made video use even more frequent. Unless otherwise specified, the videos discussed in this application are videos shot by an electronic device, not videos produced with professional equipment (for example, films, television series, and other cinematic works). An existing shot video contains both images and audio. For the audio data in a video, existing electronic devices generally only record the audio captured during shooting and do not process it further, for example, by adjusting the audio data according to the position of the sound source in the shot video. As a result, the scene is poorly reproduced, which degrades the user experience.
The embodiments of the present application are described in detail below.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a sound effect processing method for video disclosed in an embodiment of the present application, applied to the electronic device described above with reference to FIG. 1. The method includes the following steps:
Step S201: acquire a shot first video, and extract image frame data and audio frame data from the first video.
Step S202: acquire an audio time interval of the audio frame data, and extract from the image frame data a first group of image frame data corresponding to the audio time interval.
Acquiring the audio time interval of the audio frame data may specifically include:
filtering the audio frame data to obtain filtered first audio frame data, acquiring the time interval corresponding to the first audio frame data, and taking that time interval as the audio time interval.
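As one possible reading of this step, the filtering can be a per-frame energy gate, with the audio time interval taken as the span covered by the surviving frames. The embodiment does not specify which filter is used, so the RMS threshold, frame length, and function name below are all assumptions for illustration:

```python
import math

def audio_time_interval(frames, frame_ms=20, energy_threshold=0.01):
    """Return (start_ms, end_ms) covered by the frames that survive a
    simple RMS-energy filter, or None if every frame is filtered out.
    The energy gate is an assumed stand-in for the unspecified filter."""
    kept = [i for i, frame in enumerate(frames)
            if math.sqrt(sum(s * s for s in frame) / len(frame)) >= energy_threshold]
    if not kept:
        return None
    # The interval runs from the start of the first kept frame to the
    # end of the last kept frame.
    return kept[0] * frame_ms, (kept[-1] + 1) * frame_ms
```

With 20 ms frames, a clip whose second and third frames carry audible energy yields the interval (20, 60).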
Step S203: analyze the first group of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In step S203, performing 3D sound effect processing on the audio frame data according to the sound source position may specifically include:
if the sound source position is on the left, increasing the volume of the left channel of the audio frame data or decreasing the volume of the right channel; if the sound source position is on the right, increasing the volume of the right channel of the audio frame data or decreasing the volume of the left channel.
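A minimal sketch of this channel-volume adjustment, assuming the audio is available as per-channel sample lists; the fixed gain in decibels is a hypothetical parameter, since the embodiment only states that one channel is raised (or the other lowered):

```python
def pan_to_source(left, right, source_side, gain_db=6.0):
    """Raise the volume of the channel on the sound-source side.
    The 6 dB default is an assumed value; source_side is "left" or
    "right" as determined from the image frames in step S203."""
    gain = 10 ** (gain_db / 20.0)  # convert dB to a linear factor
    if source_side == "left":
        left = [s * gain for s in left]
    elif source_side == "right":
        right = [s * gain for s in right]
    return left, right
```

The symmetric option of attenuating the opposite channel would divide by the same factor instead.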
Optionally, if the first video was shot indoors, an indoor 3D sound effect playback strategy may also be applied to the audio frame data. Such a strategy includes, but is not limited to, reducing the volume, adding echo, and so on.
In the technical solution provided by the present application, when a shot first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio and enhances the user experience.
Optionally, the method for determining that the first video is indoor may specifically include:
randomly extracting n frames of image data from the image frame data and passing the n frames to a trained classifier, which runs a classification algorithm to determine the n scenes corresponding to the n frames; if all n scenes are indoor, the first video is determined to be indoor; otherwise, it is determined to be non-indoor.
The classifier includes, but is not limited to, machine learning models, neural network models, deep learning models, and other algorithmic models with a classification function.
Sampling n frames reduces the amount of computation: compared with running the classifier on every image frame of the first video, it greatly reduces the workload without lowering accuracy. The applicant's statistics over large volumes of shot video show that shot videos are generally short; most are under 5 minutes, and many under 2 minutes, i.e., what is commonly called micro video. Unlike a film, whose scenes change frequently, a micro video is short and is usually produced in a single take, without subsequent editing or splicing, so its shooting scene generally does not change. These statistics show that the scene of the vast majority of shot videos is fixed; for example, an indoor shoot stays indoors and an outdoor shoot stays outdoors. Therefore, extracting and judging only n image frames of the first video is sufficient to determine whether it is indoor or outdoor.
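The sampling-based indoor check can be sketched as follows, with the trained classifier abstracted as a caller-supplied `classify_scene` function, since the embodiment does not fix a particular model:

```python
import random

def video_is_indoor(image_frames, classify_scene, n=5, seed=None):
    """Sample n frames at random and call the first video indoor only
    when every sampled frame is classified as indoor. classify_scene
    stands in for the trained classifier of the embodiment."""
    rng = random.Random(seed)
    sample = rng.sample(image_frames, min(n, len(image_frames)))
    return all(classify_scene(frame) == "indoor" for frame in sample)
```

Because only n frames reach the classifier, the cost is independent of the video length, which is the saving the paragraph above describes.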
In step S203, analyzing the first group of image frame data to determine the sound source position of the audio may specifically include:
extracting m image frames covering a continuous time period from the first group of image frame data; performing face recognition on the m image frames to obtain w image frames that contain a human face; extracting x temporally consecutive image frames from the w image frames; and, when mouth-region recognition on the x image frames determines that they contain mouth movement, taking the position of the mouth region within the x images as the sound source position of the audio.
The continuous time period may be a set of image frames whose shooting times are consecutive, for example, the m image frames in the period from 1 s to 10 s; other time periods may of course be used, and this application does not limit the specific time of the period.
The face recognition processing may use a general-purpose face recognition algorithm, for example, the Baidu face recognition algorithm, Google face recognition, and so on.
Performing mouth-region recognition on the x image frames to determine that they contain mouth movement may specifically include:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, in each region, the number of pixels whose RGB values are non-lip values, giving x counts; and computing the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, the x images are determined to contain mouth movement; if the difference is less than the difference threshold, the x images are determined not to contain mouth movement.
The principle of this method is that a speaking person necessarily moves the mouth. Analysis of mouth movement shows that when a person speaks, the mouth region divides into two parts: the first part is the lip region (taking Asians as an example, the lips are pink, and the range of lip RGB values can be looked up), and the second part is the non-lip region (which may contain tooth RGB values or the dark RGB values of the unlit mouth interior). Statistics over large volumes of data show that during mouth movement the area of the second part changes continuously; when a person speaks a passage, the gap between the largest and smallest extent of the second part is large. Since the shooting distance is relatively fixed, this is reflected in the image frames as a large change in the number of pixels belonging to the second part. The applicant identifies mouth movement based on this principle.
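The non-lip pixel-count test can be sketched as below. The lip-color test is abstracted as a caller-supplied predicate, since the embodiment only says the lip RGB range "can be looked up", and the threshold is a hypothetical parameter:

```python
def mouth_moves_by_area(mouth_regions, is_lip_rgb, diff_threshold):
    """mouth_regions: one list of (R, G, B) pixels per image frame.
    Count the non-lip pixels in each frame's mouth region; report
    movement when the spread between the largest and smallest count
    exceeds the threshold."""
    counts = [sum(1 for px in region if not is_lip_rgb(px))
              for region in mouth_regions]
    return max(counts) - min(counts) > diff_threshold
```

A closed mouth yields near-zero non-lip counts in every frame, so the spread stays below the threshold and no movement is reported.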
Alternatively, performing mouth-region recognition on the x image frames to determine that they contain mouth movement may specifically include:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, in each region, the number of pixels whose RGB values are tooth values, giving x counts; and computing the number of times y that a count among the x counts exceeds a count threshold. If y/x is greater than a ratio threshold, the x image frames are determined to contain mouth movement.
The principle of this method is likewise that a speaking person necessarily moves the mouth: the mouth region divides into the lip region (pink, with an RGB range that can be looked up) and the non-lip region (for example, tooth RGB values). Statistics over large volumes of data show that during mouth movement the area of the second part changes continuously, and teeth appear intermittently as it changes, so counting how often teeth appear determines whether there is mouth movement. In addition, Asian teeth are generally white with a yellowish tint, which differs greatly from the RGB values of the lips, so choosing the tooth RGB values also reduces errors and improves the accuracy of mouth-movement recognition.
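The tooth-count variant can be sketched similarly; again the tooth-color test is a caller-supplied predicate, and the two thresholds are hypothetical parameters:

```python
def mouth_moves_by_teeth(mouth_regions, is_tooth_rgb,
                         count_threshold, ratio_threshold):
    """Count, per frame, the pixels matching a tooth colour; y is the
    number of frames whose tooth-pixel count exceeds count_threshold,
    and movement is reported when y / x exceeds ratio_threshold."""
    x = len(mouth_regions)
    y = sum(1 for region in mouth_regions
            if sum(1 for px in region if is_tooth_rgb(px)) > count_threshold)
    return y / x > ratio_threshold
```

Intermittent tooth visibility across the x frames, as the paragraph above describes, is exactly what drives y/x above the ratio threshold.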
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a video sound effect processing method disclosed in an embodiment of the present application, applied to the electronic device described above with reference to FIG. 1. The method includes the following steps:
Step S301: acquire a shot first video, and extract image frame data and audio frame data from the first video.
Step S302: acquire an audio time interval of the audio frame data, and extract from the image frame data a first group of image frame data corresponding to the audio time interval.
Step S303: extract m image frames covering a continuous time period from the first group of image frame data; perform face recognition on the m image frames to obtain w image frames that contain a human face; extract x temporally consecutive image frames from the w image frames; and, when mouth-region recognition on the x image frames determines that they contain mouth movement, take the position of the mouth region within the x images as the sound source position of the audio.
Step S304: if the sound source position is on the left, increase the volume of the left channel of the audio frame data or decrease the volume of the right channel.
In the technical solution provided by the present application, when a shot first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio and enhances the user experience.
Referring to FIG. 4, FIG. 4 provides a sound effect processing apparatus for video. The apparatus includes:
an acquiring unit 401, configured to acquire a shot first video and extract image frame data and audio frame data from the first video; and
a processing unit 402, configured to acquire an audio time interval of the audio frame data, extract from the image frame data a first group of image frame data corresponding to the audio time interval, analyze the first group of image frame data to determine the sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In the technical solution provided by the present application, when a shot first video is acquired, its image frame data and audio frame data are extracted; the audio time interval corresponding to the audio frame data is then acquired, the sound source position is determined from the image frame data corresponding to that interval, and the audio data is adjusted according to the sound source position. The sound source is thus reflected in the audio data, which improves the scene restoration effect of the audio and enhances the user experience.
Optionally, the processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel of the audio frame data or decrease the volume of the right channel; and if the sound source position is on the right, increase the volume of the right channel of the audio frame data or decrease the volume of the left channel.
Optionally, the processing unit is further configured to apply an indoor 3D sound effect playback strategy to the audio frame data if the first video is indoor.
Optionally, the processing unit is specifically configured to randomly extract n frames of image data from the image frame data and pass the n frames to a trained classifier, which runs a classification algorithm to determine the n scenes corresponding to the n frames; if all n scenes are indoor, the first video is determined to be indoor; otherwise, it is determined to be non-indoor, where n is an integer greater than or equal to 2.
Optionally, the processing unit is specifically configured to extract m image frames covering a continuous time period from the first group of image frame data, perform face recognition on the m image frames to obtain w image frames that contain a human face, extract x temporally consecutive image frames from the w image frames, and, when mouth-region recognition on the x image frames determines that they contain mouth movement, take the position of the mouth region within the x images as the sound source position of the audio, where m ≥ w ≥ x and m, w, and x are all integers greater than or equal to 2.
Optionally, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count in each region the number of pixels whose RGB values are non-lip values to obtain x counts, and compute the difference between the maximum and minimum of the x counts; if the difference is greater than a difference threshold, the x images are determined to contain mouth movement, and if the difference is less than the difference threshold, the x images are determined not to contain mouth movement.
Optionally, the processing unit is specifically configured to determine the x mouth regions of the x image frames, identify the RGB values of all pixels in the x mouth regions, count in each region the number of pixels whose RGB values are tooth values to obtain x counts, and compute the number of times y that a count exceeds a count threshold; if y/x is greater than a ratio threshold, the x image frames are determined to contain mouth movement.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application. As shown in the figure, the electronic device includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the following steps:
acquiring a shot first video, and extracting image frame data and audio frame data from the first video;
acquiring an audio time interval of the audio frame data, and extracting from the image frame data a first group of image frame data corresponding to the audio time interval; and
analyzing the first group of image frame data to determine the sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
In an optional solution, performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically includes:
if the sound source position is on the left, increasing the volume of the left channel of the audio frame data or decreasing the volume of the right channel; and if the sound source position is on the right, increasing the volume of the right channel of the audio frame data or decreasing the volume of the left channel.
In an optional solution, the method further includes:
if the first video is indoor, applying an indoor 3D sound effect playback strategy to the audio frame data.
In an optional solution, the method for determining that the first video is indoor specifically includes:
randomly extracting n frames of image data from the image frame data and passing the n frames to a trained classifier, which runs a classification algorithm to determine the n scenes corresponding to the n frames; if all n scenes are indoor, the first video is determined to be indoor; otherwise, it is determined to be non-indoor, where n is an integer greater than or equal to 2.
In an optional solution, analyzing the first group of image frame data to determine the sound source position of the audio specifically includes:
extracting m image frames covering a continuous time period from the first group of image frame data; performing face recognition on the m image frames to obtain w image frames that contain a human face; extracting x temporally consecutive image frames from the w image frames; and, when mouth-region recognition on the x image frames determines that they contain mouth movement, taking the position of the mouth region within the x images as the sound source position of the audio, where m ≥ w ≥ x and m, w, and x are all integers greater than or equal to 2.
In an optional solution, performing mouth-region recognition on the x image frames to determine that they contain mouth movement specifically includes:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting in each region the number of pixels whose RGB values are non-lip values to obtain x counts; and computing the difference between the maximum and minimum of the x counts. If the difference is greater than a difference threshold, the x images are determined to contain mouth movement; if the difference is less than the difference threshold, the x images are determined not to contain mouth movement.
In another optional solution, performing mouth-region recognition on the x image frames to determine that they contain mouth movement specifically includes:
determining the x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting in each region the number of pixels whose RGB values are tooth values to obtain x counts; and computing the number of times y that a count exceeds a count threshold. If y/x is greater than a ratio threshold, the x image frames are determined to contain mouth movement.
The foregoing mainly describes the solutions of the embodiments of the present application from the perspective of the method-side execution process. It can be understood that, to implement the above functions, the electronic device includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments provided herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
In the embodiments of the present application, the electronic device may be divided into functional units according to the foregoing method examples; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is merely a division of logical functions; other division manners are possible in actual implementation.
It should be noted that the electronic device described in the embodiments of the present application is presented in the form of functional units. The term "unit" as used herein should be understood in the broadest possible sense; the object implementing the function described for each "unit" may be, for example, an ASIC, a single circuit, a processor (shared, dedicated, or part of a chipset) and memory executing one or more software or firmware programs, combinational logic circuitry, and/or other suitable components providing the described function.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to perform some or all of the steps of any sound effect processing method for video described in the foregoing method embodiments.
An embodiment of the present application further provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps of any sound effect processing method for video described in the foregoing method embodiments.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of combined actions; however, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely schematic; the division of the units is merely a division of logical functions, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or part of the steps in the various methods of the foregoing embodiments can be completed by a program instructing related hardware; the program can be stored in a computer-readable memory, and the memory can include a flash drive, ROM, RAM, a magnetic disk, an optical disc, or the like.
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。The embodiments of the present application have been described in detail above, and specific examples have been used in this article to explain the principles and implementation of the present application. The descriptions of the above embodiments are only used to help understand the method and the core idea of the present application; Those of ordinary skill in the art, based on the ideas of the present application, may have changes in specific implementations and application scopes. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A sound effect processing method for a video, wherein the method comprises the following steps:
    obtaining a captured first video, and extracting image frame data and audio frame data from the first video;
    obtaining an audio time interval of the audio frame data, and extracting, from the image frame data, a first group of image frame data corresponding to the audio time interval;
    analyzing the first group of image frame data to determine a sound source position of the audio, and performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  2. The method according to claim 1, wherein performing 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data specifically comprises:
    if the sound source position is on the left, increasing the volume of the left channel in the audio frame data or decreasing the volume of the right channel in the audio frame data; if the sound source position is on the right, increasing the volume of the right channel in the audio frame data or decreasing the volume of the left channel in the audio frame data.
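The channel adjustment of claim 2 can be sketched as follows. The gain factor and the sample representation are illustrative assumptions; the claim only requires raising one channel's volume relative to the other.

```python
def pan_stereo(left, right, source_side, gain=1.5):
    # Boost the channel on the same side as the sound source; the 1.5x
    # gain factor is an illustrative assumption, not taken from the claim.
    if source_side == 'left':
        left = [s * gain for s in left]
    elif source_side == 'right':
        right = [s * gain for s in right]
    return left, right
```

Equivalently, the claim allows attenuating the opposite channel instead (a gain below 1.0 applied to the far-side list).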
  3. The method according to claim 1, wherein the method further comprises:
    if the first video is indoor, playing the audio frame data with an indoor 3D sound effect strategy.
  4. The method according to claim 3, wherein
    the indoor 3D sound effect strategy playback comprises: decreasing the volume or increasing the echo.
  5. The method according to claim 3, wherein determining that the first video is indoor specifically comprises:
    randomly extracting n frames of image data from the image frame data, transmitting the n frames of image data to a trained classifier, and executing classification algorithm processing to determine n scenes corresponding to the n frames of image data; if all of the n scenes are indoor, determining that the first video is indoor; otherwise, determining that the first video is non-indoor; wherein n is an integer greater than or equal to 2.
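The sampling-and-voting rule of claim 5 can be sketched as below. `classify_scene` stands in for the trained classifier, which the claim does not specify; the all-frames-must-agree rule is taken directly from the claim.

```python
import random

def is_indoor(image_frames, classify_scene, n=5):
    # Draw n frames at random; the video counts as indoor only if the
    # classifier labels every sampled frame 'indoor'.
    samples = random.sample(image_frames, n)
    return all(classify_scene(frame) == 'indoor' for frame in samples)
```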
  6. The method according to claim 1, wherein analyzing the first group of image frame data to determine the sound source position of the audio specifically comprises:
    extracting m image frames of a continuous time period from the first group of image frame data; performing face recognition processing on the m image frames to obtain w image frames containing a human face; extracting x temporally continuous image frames from the w image frames; and when mouth-region recognition on the x image frames determines that the x image frames contain a mouth movement, determining the position of the mouth region in the x image frames as the sound source position of the audio; wherein m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
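The m → w → x frame-selection pipeline of claim 6 can be sketched as follows. `detect_face` and `locate_mouth` are assumed helpers standing in for the face-recognition and mouth-region steps, whose implementations the claim leaves unspecified; taking the longest consecutive run as the x frames is one reasonable reading of "temporally continuous".

```python
def locate_speaker(frames, detect_face, locate_mouth):
    # Keep the w frames in which a face is detected, remembering positions.
    faces = [(i, f) for i, f in enumerate(frames) if detect_face(f)]
    # Split into runs of temporally consecutive frames and keep the longest
    # run: these are the x frames examined for mouth movement.
    runs, run = [], []
    for i, f in faces:
        if run and i != run[-1][0] + 1:
            runs.append(run)
            run = []
        run.append((i, f))
    if run:
        runs.append(run)
    best = max(runs, key=len) if runs else []
    # The mouth-region positions in these frames give the source location.
    return [locate_mouth(f) for _, f in best]
```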
  7. The method according to claim 6, wherein the mouth-region recognition on the x image frames to determine that the x image frames contain a mouth movement specifically comprises:
    determining x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, among all the RGB values, the number of pixels having non-lip RGB values to obtain x counts; and calculating the difference between the maximum and minimum of the x counts; if the difference is greater than a difference threshold, determining that the x image frames contain a mouth movement; if the difference is less than the difference threshold, determining that the x image frames do not contain a mouth movement.
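A minimal sketch of the claim 7 test: when the mouth opens and closes, the number of non-lip pixels inside the mouth region swings between frames, so the max-minus-min spread of those counts exceeds a threshold. `is_lip_rgb` (the lip-colour test) and the threshold value are assumptions; the claim does not define them.

```python
def mouth_moving_by_nonlip_count(mouth_regions, is_lip_rgb, diff_threshold):
    # For each of the x mouth regions, count the pixels whose RGB value is
    # not a lip colour; a large spread between frames suggests the mouth
    # opened and closed during the x frames.
    counts = [sum(1 for px in region if not is_lip_rgb(px))
              for region in mouth_regions]
    return max(counts) - min(counts) > diff_threshold
```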
  8. The method according to claim 6, wherein the mouth-region recognition on the x image frames to determine that the x image frames contain a mouth movement specifically comprises:
    determining x mouth regions of the x image frames; identifying the RGB values of all pixels in the x mouth regions; counting, among all the RGB values, the number of pixels having tooth RGB values to obtain x counts; and calculating the number of times y that the x counts exceed a count threshold; if y/x is greater than a ratio threshold, determining that the x image frames contain a mouth movement.
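The alternative test of claim 8 counts tooth-coloured pixels instead: teeth are only visible when the mouth is open, so if the tooth-pixel count exceeds a threshold in a large enough fraction y/x of the frames, the mouth is moving. `is_tooth_rgb` and both thresholds are illustrative assumptions.

```python
def mouth_moving_by_teeth(mouth_regions, is_tooth_rgb,
                          count_threshold, ratio_threshold):
    # y = number of the x frames whose tooth-coloured pixel count exceeds
    # count_threshold; motion is declared when y / x beats ratio_threshold.
    x = len(mouth_regions)
    y = sum(1 for region in mouth_regions
            if sum(1 for px in region if is_tooth_rgb(px)) > count_threshold)
    return y / x > ratio_threshold
```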
  9. The method according to claim 1, wherein obtaining the audio time interval of the audio frame data specifically comprises:
    filtering the audio frame data to obtain filtered first audio frame data, obtaining a time interval corresponding to the first audio frame data, and determining the time interval as the audio time interval.
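A minimal sketch of claim 9: filter out silent frames and take the time span of what remains as the audio time interval. The energy-based silence filter is an assumption; the claim only says the audio frame data is filtered.

```python
def audio_time_interval(audio_frames, frame_duration, energy_threshold):
    # audio_frames: list of (timestamp_ms, energy) pairs.  Frames at or
    # below the energy threshold are filtered out as silence; the span of
    # the remaining frames is taken as the audio time interval.
    voiced = [t for t, e in audio_frames if e > energy_threshold]
    if not voiced:
        return None
    return (min(voiced), max(voiced) + frame_duration)
```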
  10. A movie sound effect processing apparatus, wherein the movie sound effect processing apparatus comprises:
    an obtaining unit, configured to obtain a captured first video, and extract image frame data and audio frame data from the first video;
    a processing unit, configured to obtain an audio time interval of the audio frame data, extract, from the image frame data, a first group of image frame data corresponding to the audio time interval, analyze the first group of image frame data to determine a sound source position of the audio, and perform 3D sound effect processing on the audio frame data according to the sound source position to obtain processed audio frame data.
  11. The apparatus according to claim 10, wherein
    the processing unit is specifically configured to: if the sound source position is on the left, increase the volume of the left channel in the audio frame data or decrease the volume of the right channel in the audio frame data; if the sound source position is on the right, increase the volume of the right channel in the audio frame data or decrease the volume of the left channel in the audio frame data.
  12. The apparatus according to claim 10, wherein
    the processing unit is further configured to: if the first video is indoor, play the audio frame data with an indoor 3D sound effect strategy.
  13. The apparatus according to claim 12, wherein
    the indoor 3D sound effect strategy playback comprises: decreasing the volume or increasing the echo.
  14. The apparatus according to claim 12, wherein
    the processing unit is specifically configured to: randomly extract n frames of image data from the image frame data, transmit the n frames of image data to a trained classifier, and execute classification algorithm processing to determine n scenes corresponding to the n frames of image data; if all of the n scenes are indoor, determine that the first video is indoor; otherwise, determine that the first video is non-indoor; wherein n is an integer greater than or equal to 2.
  15. The apparatus according to claim 10, wherein
    the processing unit is specifically configured to: extract m image frames of a continuous time period from the first group of image frame data; perform face recognition processing on the m image frames to obtain w image frames containing a human face; extract x temporally continuous image frames from the w image frames; and when mouth-region recognition on the x image frames determines that the x image frames contain a mouth movement, determine the position of the mouth region in the x image frames as the sound source position of the audio; wherein m ≥ w ≥ x, and m, w, and x are all integers greater than or equal to 2.
  16. The apparatus according to claim 15, wherein
    the processing unit is specifically configured to: determine x mouth regions of the x image frames; identify the RGB values of all pixels in the x mouth regions; count, among all the RGB values, the number of pixels having non-lip RGB values to obtain x counts; and calculate the difference between the maximum and minimum of the x counts; if the difference is greater than a difference threshold, determine that the x image frames contain a mouth movement; if the difference is less than the difference threshold, determine that the x image frames do not contain a mouth movement.
  17. The apparatus according to claim 15, wherein
    the processing unit is specifically configured to: determine x mouth regions of the x image frames; identify the RGB values of all pixels in the x mouth regions; count, among all the RGB values, the number of pixels having tooth RGB values to obtain x counts; and calculate the number of times y that the x counts exceed a count threshold; if y/x is greater than a ratio threshold, determine that the x image frames contain a mouth movement.
  18. An electronic device, comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs comprise instructions for performing the steps of the method according to any one of claims 1-9.
  19. A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method according to any one of claims 1-9.
  20. A computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method according to any one of claims 1-9.
PCT/CN2019/104044 2018-10-25 2019-09-02 Sound effect processing method for video, and related products WO2020082902A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811253072.1A CN109413563B (en) 2018-10-25 2018-10-25 Video sound effect processing method and related product
CN201811253072.1 2018-10-25

Publications (1)

Publication Number Publication Date
WO2020082902A1 true WO2020082902A1 (en) 2020-04-30

Family

ID=65469699

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/104044 WO2020082902A1 (en) 2018-10-25 2019-09-02 Sound effect processing method for video, and related products

Country Status (2)

Country Link
CN (1) CN109413563B (en)
WO (1) WO2020082902A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4184927A4 (en) * 2020-11-18 2024-01-17 Tencent Tech Shenzhen Co Ltd Sound effect adjusting method and apparatus, device, storage medium, and computer program product

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413563B (en) * 2018-10-25 2020-07-10 Oppo广东移动通信有限公司 Video sound effect processing method and related product
KR20200107757A (en) * 2019-03-08 2020-09-16 엘지전자 주식회사 Method and apparatus for sound object following
CN110312032B (en) * 2019-06-17 2021-04-02 Oppo广东移动通信有限公司 Audio playing method and device, electronic equipment and computer readable storage medium
CN110753238B (en) * 2019-10-29 2022-05-06 北京字节跳动网络技术有限公司 Video processing method, device, terminal and storage medium
CN113747047B (en) * 2020-05-30 2023-10-13 华为技术有限公司 Video playing method and device
CN116158091A (en) * 2020-06-29 2023-05-23 海信视像科技股份有限公司 Display device and screen sounding method
CN112135226B (en) * 2020-08-11 2022-06-10 广东声音科技有限公司 Y-axis audio reproduction method and Y-axis audio reproduction system
CN113556501A (en) * 2020-08-26 2021-10-26 华为技术有限公司 Audio processing method and electronic equipment
CN112380396B (en) * 2020-11-11 2024-04-26 网易(杭州)网络有限公司 Video processing method and device, computer readable storage medium and electronic equipment
CN113050915B (en) * 2021-03-31 2023-12-26 联想(北京)有限公司 Electronic equipment and processing method
CN115022710B (en) * 2022-05-30 2023-09-19 咪咕文化科技有限公司 Video processing method, device and readable storage medium
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN115696172B (en) * 2022-08-15 2023-10-20 荣耀终端有限公司 Sound image calibration method and device
CN117793607A (en) * 2022-09-28 2024-03-29 华为技术有限公司 Playing control method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1984310A (en) * 2005-11-08 2007-06-20 Tcl通讯科技控股有限公司 Method and communication apparatus for reproducing a moving picture, and use in a videoconference system
CN1997161A (en) * 2006-12-30 2007-07-11 华为技术有限公司 A video terminal and audio code stream processing method
US20140314391A1 (en) * 2013-03-18 2014-10-23 Samsung Electronics Co., Ltd. Method for displaying image combined with playing audio in an electronic device
CN106162447A (en) * 2016-06-24 2016-11-23 维沃移动通信有限公司 The method of a kind of audio frequency broadcasting and terminal
CN109413563A (en) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 The sound effect treatment method and Related product of video

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829018B2 (en) * 2001-09-17 2004-12-07 Koninklijke Philips Electronics N.V. Three-dimensional sound creation assisted by visual information
WO2009078454A1 (en) * 2007-12-18 2009-06-25 Sony Corporation Data processing apparatus, data processing method, and storage medium
CN109040636B (en) * 2010-03-23 2021-07-06 杜比实验室特许公司 Audio reproducing method and sound reproducing system
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
WO2014010920A1 (en) * 2012-07-09 2014-01-16 엘지전자 주식회사 Enhanced 3d audio/video processing apparatus and method
US9674453B1 (en) * 2016-10-26 2017-06-06 Cisco Technology, Inc. Using local talker position to pan sound relative to video frames at a remote location



Also Published As

Publication number Publication date
CN109413563B (en) 2020-07-10
CN109413563A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
WO2020082902A1 (en) Sound effect processing method for video, and related products
CN106651955B (en) Method and device for positioning target object in picture
TWI755833B (en) An image processing method, an electronic device and a storage medium
CN110100251B (en) Apparatus, method, and computer-readable storage medium for processing document
JP2016531362A (en) Skin color adjustment method, skin color adjustment device, program, and recording medium
JP6336206B2 (en) Method, apparatus, program and recording medium for processing moving picture file identifier
TW201901527A (en) Video conference and video conference management method
CN105704369B (en) A kind of information processing method and device, electronic equipment
CN110312032B (en) Audio playing method and device, electronic equipment and computer readable storage medium
CN106303156B (en) To the method, device and mobile terminal of video denoising
CN108234879B (en) Method and device for acquiring sliding zoom video
CN109639896A (en) Block object detecting method, device, storage medium and mobile terminal
CN108200421B (en) White balance processing method, terminal and computer readable storage medium
WO2022151686A1 (en) Scene image display method and apparatus, device, storage medium, program and product
US9921796B2 (en) Sharing of input information superimposed on images
CN106982327A (en) Image processing method and device
WO2020088068A1 (en) Display screen mipi working frequency regulation method and related product
CN105608469B (en) The determination method and device of image resolution ratio
CN109218620B (en) Photographing method and device based on ambient brightness, storage medium and mobile terminal
WO2022151687A1 (en) Group photo image generation method and apparatus, device, storage medium, computer program, and product
CN109286841B (en) Movie sound effect processing method and related product
CN106488168A (en) The angle changing method of picture of collection and device in electric terminal
CN105045510B (en) Realize that video checks the method and device of operation
CN108540726B (en) Method and device for processing continuous shooting image, storage medium and terminal
CN106023114B (en) Image processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19877233; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19877233; Country of ref document: EP; Kind code of ref document: A1)