WO2022042168A1 - Audio processing method and electronic device - Google Patents

Audio processing method and electronic device

Info

Publication number
WO2022042168A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sound pickup
pickup range
video
image
Prior art date
Application number
PCT/CN2021/108458
Other languages
English (en)
French (fr)
Inventor
卞超 (Bian Chao)
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to JP2023513516A (published as JP2023540908A)
Priority to EP21860008.8A (published as EP4192004A4)
Priority to US18/042,753 (published as US20230328429A1)
Publication of WO2022042168A1

Classifications

    • H04N 23/62: Control of camera parameters via user interfaces
    • H04R 3/005: Microphone circuits for combining the signals of two or more microphones
    • H04N 7/147: Videophone communication arrangements
    • H04N 21/4788: Supplemental services communicating with other users, e.g. chatting
    • H04N 21/8106: Monomedia components involving special audio data, e.g. different tracks for different languages
    • H04N 23/611: Control of cameras based on recognised objects, where the objects include parts of the human body
    • H04N 23/632: Graphical user interfaces [GUI] for displaying or modifying preview images prior to image capturing
    • H04N 23/635: Region indicators; field of view indicators
    • H04N 23/675: Focus control comprising setting of focusing regions
    • H04N 23/90: Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
    • H04N 5/76: Television signal recording
    • H04N 7/142: Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04R 1/028: Casings, cabinets or mountings associated with devices performing functions other than acoustics, e.g. electric candles
    • H04R 1/406: Directional characteristics obtained by combining a number of identical microphones
    • H04S 7/302: Electronic adaptation of stereophonic sound to listener position or orientation
    • H04S 7/40: Visual indication of the stereophonic sound image
    • H04R 2201/025: Transducer mountings or cabinet supports enabling variable orientation of transducer or cabinet
    • H04R 2201/405: Non-uniform arrays of transducers, or a plurality of uniform arrays with different transducer spacing
    • H04R 2430/01: Aspects of volume control, not necessarily automatic, in sound systems
    • H04R 2499/11: Transducers incorporated in or for use with hand-held devices, e.g. mobile phones, PDAs, cameras
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • The present application relates to the field of electronic technology, and in particular, to an audio processing method and an electronic device.
  • In the related art, a voice enhancement method is also proposed: the audio collected by the electronic device is processed by an audio algorithm to remove noise.
  • This places high demands on the processing capability of the audio algorithm, and the complex audio processing also increases the hardware performance requirements of the electronic device.
  • The audio processing method and electronic device provided by the present application can achieve directional speech enhancement by determining the position of the face or mouth of the person making the sound in the video picture, and determining the range that needs to be picked up according to that position. This not only simplifies the audio processing algorithm, but also improves the audio quality.
  • In a first aspect, the present application provides an audio processing method, applied to an electronic device. The method may include: detecting a first operation of opening a camera application; in response to the first operation, displaying a shooting preview interface; detecting a second operation to start recording; in response to the second operation, collecting the video picture and the first audio, and displaying a shooting interface that includes a preview interface of the video picture; and identifying a target image in the video picture, where the target image is a first face image and/or a first human mouth image. The first face image is the face image of the sounding object in the video picture, and the first human mouth image is the mouth image of the sounding object in the video picture.
  • According to the target image, the first sound pickup range corresponding to the sounding object is determined.
  • According to the first sound pickup range and the first audio, the second audio corresponding to the video picture is obtained.
  • The audio volume within the first sound pickup range in the second audio is greater than the audio volume outside the first sound pickup range.
  • The method in the embodiments of the present application may be applied to a scenario in which a user instruction is received to directly start the camera application, and also to a scenario in which the user opens another third-party application (such as a short video, live broadcast, or video calling application) that invokes and starts the camera.
  • the first operation or the second operation includes, for example, a touch operation, a key operation, an air gesture operation, a voice operation, and the like.
  • In some embodiments, the method further includes: detecting a sixth operation of starting the voice enhancement mode.
  • In response to the sixth operation, the voice enhancement mode is activated.
  • For example, the user is first asked whether to enable the voice enhancement mode, and the mode is activated after the user confirms. Alternatively, the voice enhancement mode is activated automatically after a switch to the video recording function is detected. In still other embodiments, after the switch to the video recording function is detected, the video recording preview interface is displayed first; after the operation by which the user instructs shooting is detected, the voice enhancement mode is activated according to the user's instruction, or activated automatically.
  • After starting the voice enhancement mode, the electronic device needs to process the collected first audio, identify the audio of the sounding object, and enhance that part of the audio to obtain a better recording effect.
  • The first audio is, for example, the collected initial audio signal.
  • The second audio is the audio obtained after voice enhancement processing.
  • In some embodiments, the first face image or the first human mouth image is identified through a face image recognition algorithm. For example, in the process of recording a video picture, a face image recognition algorithm determines whether the collected video picture contains a face image. If a face image is included, it is identified, and whether it is vocalizing is determined according to changes in its facial feature data, such as facial landmark data and facial contour data, within a preset time period. The criterion for judging that the face image is vocalizing includes judging that it is currently vocalizing, or determining that it vocalizes again within a preset time period after it is first determined to be vocalizing.
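As a concrete illustration of judging vocalization from changes in facial feature data over a preset time period, the following Python sketch tracks the variation of mouth landmarks across frames. The landmark format, the mouth-openness proxy, and the threshold are assumptions for illustration, not values from the application.

```python
import numpy as np

def is_vocalizing(mouth_landmark_history, threshold=0.02):
    """Judge whether a face image is vocalizing from the variation of its
    mouth landmarks over a preset time window. mouth_landmark_history is a
    list of (N, 2) arrays of normalized mouth landmark coordinates, one per
    video frame (a hypothetical output of the face recognition step)."""
    if len(mouth_landmark_history) < 2:
        return False
    frames = np.stack(mouth_landmark_history)            # (T, N, 2)
    # Mouth-opening proxy: vertical spread of the mouth landmarks per frame.
    openness = frames[:, :, 1].max(axis=1) - frames[:, :, 1].min(axis=1)
    # A speaking mouth opens and closes, so its openness varies over time;
    # a still mouth yields near-zero variation.
    return float(np.std(openness)) > threshold

# A mouth whose landmarks barely move over 10 frames is judged silent.
still = [np.array([[0.40, 0.50], [0.60, 0.52]])] * 10
print(is_vocalizing(still))   # False
```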
  • the human vocal organ is the human mouth.
  • When vocal human-mouth data can be obtained, the data of the first human mouth image is preferentially determined, and the first sound pickup range is then determined based on the data of the first human mouth image.
  • If the electronic device fails to recognize that a person is speaking, the image corresponding to that person is not the target image. That is, the target image is the image corresponding to the recognized voice-producing face and/or voice-producing mouth.
  • Based on the target image, the first sound pickup range that needs enhanced sound pickup is determined. Further, based on the collected initial audio signal and the first sound pickup range, the second audio is obtained. In the second audio, the audio volume within the first sound pickup range is greater than the audio volume outside it; that is, the volume of the person speaking is boosted, thereby improving the audio recording.
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the target image includes: obtaining a first feature value according to the target image.
  • The first feature value includes one or more of the front and rear attribute parameters, the area ratio, and the position information.
  • The front and rear attribute parameters indicate whether the video picture is captured by the front camera or by the rear camera.
  • The area ratio represents the ratio of the area of the target image to the area of the video picture.
  • The position information indicates the position of the target image in the video picture. Then, according to the first feature value, the first sound pickup range corresponding to the sounding object is determined.
  • The first feature value describes the relative positional relationship between the electronic device and the face of the real person corresponding to the first face image, or between the electronic device and the mouth of the real person corresponding to the first human mouth image.
  • Therefore, the electronic device can determine the first sound pickup range according to the first feature value. For example, if the real person corresponding to the first face image is located directly in front of the electronic device, that is, the first face image is located in the center of the captured video picture, the first sound pickup range is the sound pickup range directly in front of the electronic device. Subsequently, after the electronic device obtains the initial audio signal including audio signals from various directions, it may obtain the audio corresponding to the first face image based on the initial audio signal and the first sound pickup range.
  • The first feature value may change during the video recording process, and the first sound pickup range will then change accordingly. The audio recorded by the electronic device therefore includes at least a first-duration audio and a second-duration audio.
  • The first-duration audio is the audio corresponding to the first sound pickup range.
  • The second-duration audio is the audio corresponding to the changed sound pickup range. That is to say, the electronic device can dynamically determine the sound pickup range based on changes of the voicing face or voicing mouth in the video picture, and then record the audio according to that range.
  • The audio of the resulting video picture may include, in time sequence, multiple audio segments of different or equal durations recorded based on the changing sound pickup range.
  • In this way, the electronic device can always focus on improving the audio recording quality of the part that needs enhancement as the sound pickup range changes, thereby ensuring the audio recording effect.
  • When the user plays the video file, the user can thus be presented with a dynamically changing playing experience, such as a sound range that matches the change of the video content.
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: when the video picture is a front video picture, determining that the first sound pickup range is the sound pickup range on the front camera side; when the video picture is a rear video picture, determining that the first sound pickup range is the sound pickup range on the rear camera side.
  • For example, the sound pickup range of the electronic device includes a 180-degree sound pickup range at the front and a 180-degree sound pickup range at the rear. When the video picture is determined to be the front video picture, the front 180-degree range is used as the first sound pickup range; when it is determined to be the rear video picture, the rear 180-degree range is used. Further, during video recording, in response to the user's operation of switching between the front and rear cameras, the first sound pickup range is also switched from front to rear, ensuring that the first sound pickup range is the sound pickup range corresponding to the sounding object in the video picture.
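A minimal sketch of this front/rear selection, assuming pickup ranges are expressed as azimuth intervals in degrees (the interval convention is illustrative):

```python
def hemisphere_pickup_range(is_front_video):
    """Map the front/rear attribute parameter to a 180-degree pickup range.
    Azimuth 0 is taken to point out of the display; the exact interval
    convention is an assumption for illustration."""
    return (-90.0, 90.0) if is_front_video else (90.0, 270.0)

print(hemisphere_pickup_range(True))    # front camera -> (-90.0, 90.0)
print(hemisphere_pickup_range(False))   # rear camera  -> (90.0, 270.0)
```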
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the first sound pickup range according to the area ratio and the sound pickup range of the first audio.
  • The sound pickup range of the first audio is, for example, the sound pickup range of panoramic audio.
  • the microphones are used to collect the initial audio signals in all directions, that is, the initial audio signals within the sound pickup range of the panoramic audio are obtained.
  • Usually, the person the user focuses on is placed at the center of the video picture; that is, the first face image or the first human mouth image is located at the center of the viewfinder frame.
  • First face images or first human mouth images of different sizes correspond to different sound pickup ranges, and the area ratio can be used to describe the size of the first sound pickup range, such as its radius, diameter, or area.
  • X represents the area of the first face image or the first human mouth image.
  • Y represents the area of the video picture displayed in the viewfinder frame.
  • N represents the sound pickup range corresponding to the viewing range.
  • The area ratio is X/Y.
  • The first sound pickup range is N*X/Y. That is to say, the ratio of the first sound pickup range to the panoramic sound pickup range is proportional to the area ratio.
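In code, the N*X/Y relation might look like the following sketch, which treats the pickup range size as an angle in degrees; the units and example numbers are illustrative.

```python
def pickup_range_size(target_area, frame_area, viewing_range=180.0):
    """First sound pickup range size, proportional to the area ratio:
    N * X / Y, where N is the pickup range corresponding to the viewing
    range, X the target-image area, and Y the video-picture area."""
    return viewing_range * (target_area / frame_area)

# A voicing face covering 1/8 of the frame within a 180-degree viewing
# range maps to a 22.5-degree pickup range.
print(pickup_range_size(target_area=1.0, frame_area=8.0))   # 22.5
```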
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the position of the first sound pickup range within the sound pickup range of the first audio according to the position information.
  • If the sounding object is not located at the center of the video picture, the position of the image corresponding to the sounding object (that is, the target image) in the video picture can be obtained from the position information. It can be understood that there is a correspondence between the position of the target image in the video picture and the position of the first sound pickup range in the panoramic sound pickup range.
  • the position information includes a first offset of the center point of the target image relative to a first reference point, where the first reference point is the center point or the focus of the video image.
  • In this case, determining the position of the first sound pickup range within the sound pickup range of the first audio includes: determining, according to the first offset, a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, where the second offset is proportional to the first offset; and then, according to the second offset, determining the position of the first sound pickup range within the sound pickup range of the first audio.
  • An offset includes, for example, an offset direction, and/or an offset angle, and/or an offset distance.
  • The offset direction means that the center point of the first face image or the first human mouth image is offset leftward, rightward, upward, downward, toward the upper left, the upper right, the lower left, or the lower right relative to the first reference point.
  • The offset angle is the angle of the offset toward the upper left, upper right, lower left, or lower right.
  • The offset distance is the distance of the leftward, rightward, upward, or downward offset, or the offset distance at a certain offset angle.
  • For example, a coordinate system is constructed with the first reference point as the origin, the x-axis parallel to the bottom edge of the mobile phone (or the bottom edge of the current viewfinder frame), and the y-axis perpendicular to the x-axis; the coordinate system lies parallel to the display screen of the mobile phone.
  • The constructed coordinate system is used to define the offset direction, offset angle, and offset distance of the center point of the first face image or the first human mouth image relative to the first reference point. For example, if the position of the target image is to the lower left of the center point of the viewfinder frame, then within the panoramic sound pickup range the center point of the first sound pickup range is to the lower left of the center point of the panoramic sound pickup range.
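The two offsets could be computed as in this sketch, with the proportionality constant between the first and second offsets treated as an assumed calibration factor:

```python
import math

def first_offset(target_center, reference_point):
    """Offset of the target image's center relative to the first reference
    point (viewfinder center or focus), in the screen-parallel coordinate
    system described above: direction (dx, dy), distance, and angle."""
    dx = target_center[0] - reference_point[0]
    dy = target_center[1] - reference_point[1]
    return dx, dy, math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

def second_offset(dx, dy, scale=0.25):
    """Offset of the pickup-range center relative to the panoramic
    pickup-range center. The application only requires proportionality to
    the first offset; the scale value here is an assumption."""
    return scale * dx, scale * dy

# Target center offset (-140, -60) from the reference point scales to a
# (-35.0, -15.0) offset of the pickup-range center.
dx, dy, dist, angle = first_offset((400, 900), (540, 960))
print(second_offset(dx, dy))
```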
  • the center point of the video picture is the center point of the viewfinder frame, or the center point of the video picture is the center point of the display screen.
  • For example, the center point of the viewfinder frame is used as the first reference point; that is, the center point of the viewfinder frame represents the center point of the video picture.
  • In other embodiments, the first reference point may also be represented in other forms.
  • For example, the center point of the entire display screen of the mobile phone represents the center point of the video picture, that is, serves as the first reference point.
  • In a possible implementation, obtaining the second audio corresponding to the video picture includes: enhancing the audio signals within the first sound pickup range in the first audio, and/or attenuating the audio signals in the first audio outside the first sound pickup range, to obtain the second audio.
  • The first audio includes audio signals from various directions. After the first sound pickup range corresponding to the sounding object is determined, the audio signals within the first sound pickup range are enhanced to improve the audio quality of the recorded video. Optionally, the audio signals outside the sound pickup range are further attenuated to reduce the interference of external noise and to highlight the sound emitted by the sounding object.
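The enhance/attenuate step can be pictured with the following sketch, which applies per-frame gains based on an already-estimated direction of arrival. Estimating the direction of arrival from the microphone array is outside the sketch, and the gain values are illustrative.

```python
import numpy as np

def reweight_audio(frames, doa_degrees, pickup_range, boost=2.0, cut=0.5):
    """Obtain the second audio from the first: boost frames whose direction
    of arrival falls inside the first sound pickup range, attenuate the
    rest. frames and doa_degrees are aligned 1-D arrays."""
    lo, hi = pickup_range
    in_range = (doa_degrees >= lo) & (doa_degrees <= hi)
    return frames * np.where(in_range, boost, cut)

# Three frames arriving from -40, 10, and 130 degrees with a pickup range
# of [-20, 20]: only the middle frame is boosted.
print(reweight_audio(np.array([0.1, 0.1, 0.1]),
                     np.array([-40.0, 10.0, 130.0]), (-20.0, 20.0)))
```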
  • the electronic device includes one or more microphones, and the one or more microphones are used to collect the first audio.
  • Obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when part or all of the first sound pickup range is included in the sound pickup range of a first microphone among the one or more microphones, performing at least one of the following operations to obtain the second audio: enhancing the audio signal within the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signal outside the first sound pickup range in the sound pickup range of the first microphone; and attenuating the audio signals of the microphones other than the first microphone among the one or more microphones.
  • For example, the mobile phone is configured with a microphone 1 and a microphone 2.
  • If the first sound pickup range is within the sound pickup range of microphone 1, then after obtaining the initial audio signal with microphone 1 and microphone 2, the mobile phone can enhance the portion of the initial audio signal collected by microphone 1 within the first sound pickup range, attenuate the portion collected by microphone 1 outside the first sound pickup range, and attenuate the audio signal collected by microphone 2, so as to obtain the audio corresponding to the first face image or the first human mouth image.
  • As another example, the mobile phone is configured with a microphone 1 and a microphone 2.
  • The first sound pickup range includes a sound pickup range 1 within the range of microphone 1 and a sound pickup range 2 within the range of microphone 2; that is, the first sound pickup range is the union of sound pickup range 1 and sound pickup range 2.
  • Then, after obtaining the initial audio signal with microphone 1 and microphone 2, the mobile phone can enhance the audio signals within sound pickup range 1 of microphone 1 and sound pickup range 2 of microphone 2, and attenuate the remaining audio signals in the initial audio signal, so as to obtain the audio corresponding to the first face image or the first human mouth image. It can be understood that sound pickup range 1 and sound pickup range 2 may overlap in whole or in part.
  • the electronic device includes at least two microphones, and the at least two microphones are used to collect the first audio.
  • Obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when the sound pickup range of a second microphone among the at least two microphones does not include the first sound pickup range, turning off the second microphone; the audio collected by the microphones other than the second microphone among the at least two microphones is then the second audio.
  • the mobile phone is configured with a microphone 1 and a microphone 2 .
  • the first sound pickup range is within the sound pickup range of the microphone 1 and outside the sound pickup range of the microphone 2 .
  • In this case, the mobile phone turns off microphone 2 and processes the audio signal collected by microphone 1 as the audio corresponding to the video picture; that is, the audio corresponding to the first face image or the first human mouth image is the audio collected by microphone 1.
  • In a possible implementation, when the second microphone is turned off, the method further includes: enhancing the audio signals within the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones, and/or attenuating the audio signals outside the first sound pickup range in the sound pickup ranges of those other microphones.
  • the mobile phone is configured with a microphone 1 and a microphone 2 .
  • the first sound pickup range is within the sound pickup range of the microphone 1 and outside the sound pickup range of the microphone 2 .
  • In this case, the mobile phone turns off microphone 2, enhances the audio signal within the first sound pickup range in the audio signal collected by microphone 1, and attenuates the audio signal outside the first sound pickup range, so as to obtain the audio corresponding to the first face image or the first human mouth image.
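The microphone-handling rules in the examples above could be combined as in this sketch. The signal representation is deliberately simplified to scalar in-range and out-of-range magnitudes; a real implementation would operate on beamformed audio streams.

```python
def mix_microphones(mic_signals, mic_ranges, first_range, boost=2.0, cut=0.5):
    """Combine per-microphone contributions following the rules above: a
    microphone whose pickup range overlaps the first sound pickup range has
    its in-range signal boosted and its out-of-range signal attenuated; a
    microphone whose range does not cover the first range at all is turned
    off (its contribution is dropped)."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    mixed = 0.0
    for signal, rng in zip(mic_signals, mic_ranges):
        if not overlaps(rng, first_range):
            continue                         # e.g. microphone 2 turned off
        mixed += boost * signal["in_range"] + cut * signal["out_of_range"]
    return mixed

# Microphone 1 covers the front hemisphere, microphone 2 the rear; a first
# pickup range of (-30, 30) keeps only microphone 1's contribution.
mics = [{"in_range": 1.0, "out_of_range": 0.2},
        {"in_range": 0.0, "out_of_range": 0.8}]
print(mix_microphones(mics, [(-90, 90), (90, 270)], (-30, 30)))   # 2.1
```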
  • In a possible implementation, the number of first face images is one or more, and the number of first human mouth images is one or more.
  • It can be understood that if some persons in the currently captured video picture are speaking but the mobile phone fails to recognize that they are speaking, the face images or mouth images of those unrecognized speakers are not classified as the first face image or the first human mouth image.
  • If the target image includes multiple first face images or multiple first human mouth images, the first feature value needs to be determined based on all of them. For example, in determining the area ratio, the ratio of the total area of the multiple first face images to the area of the video picture is used as the area ratio of the target image. As another example, in determining the position information, the offset of the center point of the placeholder frame where the multiple first face images are located, relative to the center point of the video picture, is used as the position information of the target image. The placeholder frame where the multiple first face images are located is the smallest selection frame containing the multiple face images.
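For the multi-face case, the placeholder frame and the combined area ratio might be computed as in this sketch; the box format is an assumption for illustration.

```python
def placeholder_frame(face_boxes):
    """Smallest selection frame containing all first face images; boxes
    are (left, top, right, bottom) in pixels."""
    lefts, tops, rights, bottoms = zip(*face_boxes)
    return min(lefts), min(tops), max(rights), max(bottoms)

def combined_area_ratio(face_boxes, frame_w, frame_h):
    """Area ratio when several faces are vocalizing: total face area over
    the video-picture area (overlap between boxes is ignored here)."""
    total = sum((r - l) * (b - t) for l, t, r, b in face_boxes)
    return total / (frame_w * frame_h)

boxes = [(100, 200, 300, 420), (600, 180, 780, 400)]
print(placeholder_frame(boxes))                  # (100, 180, 780, 420)
print(round(combined_area_ratio(boxes, 1920, 1080), 4))
```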
  • In a possible implementation, the method further includes: detecting a third operation of stopping shooting; in response to the third operation, stopping recording and generating a recorded video, where the recorded video includes the video picture and the second audio; detecting a fourth operation of playing the recorded video; and in response to the fourth operation, displaying a video playing interface and playing the video picture and the second audio.
  • That is, the electronic device determines the first sound pickup range according to the voicing face image or the voicing mouth image, and then records audio according to the first sound pickup range. Subsequently, the recorded audio is saved, and the user can play the video picture and audio of the saved video.
  • If the scene of recording the video picture is a real-time communication scene such as a live broadcast or a video call, the method of recording audio during recording can refer to the method above.
  • However, when the operation by which the user instructs to stop the communication is detected, the communication is stopped directly without generating a recorded video. It is understandable that, in some real-time communication scenarios, the user may also choose to save the recorded video.
  • For example, the electronic device determines whether to save the recorded video in the real-time communication scene in response to the user's operation.
  • the recorded video further includes third audio
  • the third audio is audio determined according to the second sound pickup range
  • the second sound pickup range is determined according to the first sound pickup range, and is different from the first sound pickup range.
  • the video playback interface includes a first control and a second control, the first control corresponds to the second audio, and the second control corresponds to the third audio.
  • For example, the electronic device may determine one or more reference first sound pickup ranges in the vicinity of the first sound pickup range. The electronic device obtains one channel of audio according to the first sound pickup range and at least one channel of audio according to the reference first sound pickup ranges, and may also use the panoramic audio as one channel of audio. The electronic device can thereby obtain multi-channel audio corresponding to the first face image or the first human mouth image based on the first sound pickup range. Here, one channel of audio can be understood as one audio file.
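One way to derive reference ranges in the vicinity of the first sound pickup range, each yielding one extra audio channel, is sketched below; the plus/minus 15-degree shifts are assumed, not taken from the application.

```python
def reference_ranges(first_range, shifts=(-15.0, 15.0)):
    """Reference first sound pickup ranges near the first range; each
    produces one additional channel of audio alongside the first-range
    channel and, optionally, the panoramic channel."""
    lo, hi = first_range
    return [(lo + s, hi + s) for s in shifts]

print(reference_ranges((-30.0, 30.0)))   # [(-45.0, 15.0), (-15.0, 45.0)]
```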
  • the recording function may include a single-channel recording function and a multi-channel recording function.
  • The single-channel video recording function means that the electronic device displays one viewfinder frame during shooting, used for recording one channel of video picture.
  • The multi-channel video recording function means that the electronic device displays at least two viewfinder frames during shooting, each used for one channel of video picture.
  • The collection of each channel of video picture and its corresponding audio can refer to the implementation of the single-channel recording function.
  • the electronic device can switch and play audios corresponding to different sound pickup ranges, provide the user with a variety of audio playback options, realize the adjustability of the audio, and improve the user's audio playback experience.
  • the method further includes: in response to the fourth operation, playing the video picture and the second audio.
  • the fourth operation includes an operation of operating a playback control or an operation of operating the first control.
  • a fifth operation of operating the second control is detected.
  • the video picture and the third audio are played.
  • In other embodiments, the electronic device may first display the video playback interface without playing audio. After detecting the user's instruction operation, the electronic device plays the audio indicated by the user.
  • the method further includes: deleting the second audio or the third audio in response to the operation of deleting the second audio or the third audio.
  • the audio that the user does not want to save can be deleted according to the user's requirements, thereby improving the user experience.
  • In a second aspect, the present application provides an electronic device, including: a processor, a memory, a microphone, a camera, and a display screen, where the memory, the microphone, the camera, and the display screen are coupled to the processor, and the memory is used for storing computer program code.
  • The computer program code includes computer instructions that, when read by the processor from the memory, cause the electronic device to perform the following operations: detecting a first operation of opening the camera application.
  • a shooting preview interface is displayed.
  • a second operation to start recording is detected.
  • the video picture and the first audio are collected, and a shooting interface is displayed, and the shooting interface includes a preview interface of the video picture.
  • Identifying a target image in the video picture, where the target image is the first face image and/or the first human mouth image; the first face image is the face image of the sounding object in the video picture, and the first human mouth image is the mouth image of the sounding object in the video picture.
  • the first sound pickup range corresponding to the sounding object is determined.
  • According to the first sound pickup range and the first audio, the second audio corresponding to the video picture is obtained, and the audio volume within the first sound pickup range in the second audio is greater than the audio volume outside the first sound pickup range.
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the target image includes: obtaining a first feature value according to the target image, where the first feature value includes one or more of the front and rear attribute parameters, the area ratio, and the position information. The front and rear attribute parameters indicate whether the video picture is captured by the front camera or by the rear camera; the area ratio indicates the ratio of the area of the target image to the area of the video picture; and the position information indicates the position of the target image in the video picture. According to the first feature value, the first sound pickup range corresponding to the sounding object is determined.
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: when the video picture is a front video picture, determining that the first sound pickup range is the sound pickup range on the front camera side; when the video picture is a rear video picture, determining that the first sound pickup range is the sound pickup range on the rear camera side.
  • In a possible implementation, determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the first sound pickup range according to the area ratio and the sound pickup range of the first audio.
  • determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the position of the first sound pickup range in the sound pickup range of the first audio according to the position information.
  • the position information includes a first offset of the center point of the target image relative to a first reference point, where the first reference point is the center point or the focus of the video image.
  • In this case, determining the position of the first sound pickup range within the sound pickup range of the first audio includes: determining, according to the first offset, a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, where the second offset is proportional to the first offset; and, according to the second offset, determining the position of the first sound pickup range within the sound pickup range of the first audio.
  • the center point of the video picture is the center point of the viewfinder frame, or the center point of the video picture is the center point of the display screen.
  • In a possible implementation, obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: enhancing the audio signals within the first sound pickup range in the first audio, and/or attenuating the audio signals in the first audio outside the first sound pickup range, to obtain the second audio.
  • the electronic device includes one or more microphones, and the one or more microphones are used to collect the first audio.
  • Obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when part or all of the first sound pickup range is included in the sound pickup range of a first microphone among the one or more microphones, performing at least one of the following operations to obtain the second audio: enhancing the audio signal within the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signal outside the first sound pickup range in the sound pickup range of the first microphone; and attenuating the audio signals of the microphones other than the first microphone among the one or more microphones.
  • the electronic device includes at least two microphones, and the at least two microphones are used to collect the first audio.
  • Obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when the sound pickup range of a second microphone among the at least two microphones does not include the first sound pickup range, turning off the second microphone; the audio collected by the microphones other than the second microphone among the at least two microphones is then the second audio.
  • In a possible implementation, when the second microphone is turned off and the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operation: enhancing the audio signals within the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones, and/or attenuating the audio signals outside the first sound pickup range in the sound pickup ranges of those other microphones.
  • In a possible implementation, the number of first face images is one or more, and the number of first human mouth images is one or more.
  • In a possible implementation, when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operations: detecting a third operation of stopping shooting; in response to the third operation, stopping recording and generating a recorded video, where the recorded video includes the video picture and the second audio; detecting a fourth operation of playing the recorded video; and in response to the fourth operation, displaying a video playing interface and playing the video picture and the second audio.
  • the recorded video further includes third audio
  • the third audio is audio determined according to the second sound pickup range
  • the second sound pickup range is determined according to the first sound pickup range, and is different from the first sound pickup range.
  • the video playback interface includes a first control and a second control, the first control corresponds to the second audio, and the second control corresponds to the third audio.
  • In a possible implementation, when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operations: in response to the fourth operation, playing the video picture and the second audio, where the fourth operation includes an operation of operating a playback control or an operation of operating the first control.
  • A fifth operation of operating the second control is detected.
  • In response to the fifth operation, the video picture and the third audio are played.
  • In a possible implementation, when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operation: in response to an operation of deleting the second audio or the third audio, deleting the second audio or the third audio.
  • In a possible implementation, when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operations: detecting a sixth operation of starting the voice enhancement mode, and in response to the sixth operation, activating the voice enhancement mode.
  • In a third aspect, the present application provides an electronic device having the function of implementing the audio processing method described in the first aspect and any of its possible implementations.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more modules corresponding to the above function.
  • In a fourth aspect, the present application provides a computer-readable storage medium, including computer instructions which, when run on an electronic device, cause the electronic device to perform the audio processing method described in the first aspect and any of its possible implementations.
  • In a fifth aspect, the present application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the audio processing method described in the first aspect and any of its possible implementations.
  • In a sixth aspect, the present application provides a circuit system, including a processing circuit configured to perform the audio processing method described in the first aspect and any of its possible implementations.
  • In a seventh aspect, an embodiment of the present application provides a chip system, including at least one processor and at least one interface circuit, where the at least one interface circuit is configured to perform a transceiving function and send instructions to the at least one processor.
  • When the instructions are executed by the at least one processor, the at least one processor performs the audio processing method described in the first aspect and any of its possible implementations.
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 2A is a schematic layout diagram of a camera provided by an embodiment of the present application.
  • FIG. 2B is a schematic layout diagram of a microphone according to an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a software structure of an electronic device provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram 1 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram 1 of a pickup range provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart 1 of an audio processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram 1 of an interface provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram 2 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram 2 of a pickup range provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram 3 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram 4 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram 5 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a coordinate system provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of an offset angle provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an offset distance provided by an embodiment of the present application.
  • FIG. 16A is a schematic diagram 1 of a first sound pickup range provided by an embodiment of the present application.
  • FIG. 16B is a schematic diagram 2 of a first sound pickup range provided by an embodiment of the present application.
  • FIG. 16C is a schematic diagram 3 of a first sound pickup range provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram 2 of an interface provided by an embodiment of the present application.
  • FIG. 18 is a schematic diagram 6 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 19 is a schematic diagram 7 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 20 is a schematic diagram 8 of a group of interfaces provided by an embodiment of the present application.
  • FIG. 21 is a schematic flowchart 2 of an audio processing method provided by an embodiment of the present application.
  • The electronic device may specifically be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an artificial intelligence device, or a special camera (for example, a single-lens reflex camera or a compact camera), etc.
  • FIG. 1 shows a schematic structural diagram of an electronic device 100 .
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
  • The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100 .
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
  • the processor 110 performs image recognition on multiple frames of images collected in a video picture, and obtains face image and/or human mouth image data contained in each frame of image.
  • Based on this, the position, proportion, and other information of the voicing face and/or voicing mouth in each frame of image are determined.
  • The sound pickup range to be enhanced is then determined according to information such as the position and proportion of the voicer's face and/or mouth in the video picture; that is, the position area of the voicer's voice in the panoramic audio is determined.
  • The audio quality in the recorded video is improved by enhancing the audio signal within the pickup range.
  • Optionally, audio signals outside the pickup range are further attenuated to reduce the interference of external noise.
  • the charging management module 140 is used to receive charging input from the charger.
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the display screen 194, the camera 193, and the like.
  • the wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100 .
  • The wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) network) and Bluetooth (BT) solutions.
  • the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the electronic device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • the display screen 194 can display a shooting preview interface, a video preview interface and a shooting interface in the video recording mode, and can also display a video playing interface and the like during video playback.
  • the electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 193 .
  • when shooting, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the photosensitive element transmits the electrical signal to the ISP for processing, converting it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193 .
  • the ISP may control the photosensitive element to perform exposure and photographing according to the photographing parameters.
  • Camera 193 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the electronic device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • the camera 193 may be located in the edge area of the electronic device, may be an under-screen camera, or may be a camera that can be raised and lowered.
  • the camera 193 may include a rear camera, and may also include a front camera. The embodiment of the present application does not limit the specific position and shape of the camera 193 .
  • the layout of the camera on the electronic device 100 can be referred to FIG. 2A , where the front surface of the electronic device 100 is the plane where the display screen 194 is located.
  • the camera 1931 is located on the front of the electronic device 100 , and the camera is a front-facing camera.
  • the camera 1932 is located on the back of the electronic device 100 , and the camera is a rear camera.
  • the solutions of the embodiments of the present application may be applied to the electronic device 100 having a folding screen with multiple display screens (that is, the display screen 194 can be folded).
  • take the folding-screen electronic device 100 shown in (c) of FIG. 2A as an example: the display screen can be folded inward (or outward) along the folding edge, so that the display screen forms at least two screens (e.g., an A screen and a B screen).
  • in one form, the camera on the C-screen is on the back of the electronic device 100 and can be regarded as a rear camera; in another form, the camera on the C-screen comes to the front of the electronic device 100 and can be regarded as a front-facing camera. That is to say, the front camera and the rear camera in this application do not limit the nature of the cameras themselves, but only describe a positional relationship.
  • the electronic device 100 can determine whether the camera is a front-facing camera or a rear-facing camera according to the position of the used camera on the electronic device 100, and then determine the direction of sound collection. For example, if the electronic device 100 currently collects images through a rear camera located on the back of the electronic device 100 , the electronic device 100 needs to focus on capturing the sound on the back of the electronic device 100 . For another example, the current electronic device 100 collects images through a front-facing camera located on the front of the electronic device 100 , and the electronic device 100 needs to focus on collecting sounds from the front of the electronic device 100 . This ensures that the captured sound matches the captured image.
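  • A minimal sketch of the camera-position-to-pickup-direction mapping described above (a hypothetical helper; the text only fixes the front/rear correspondence, and the position labels are assumptions):

```python
def pickup_focus(camera_position):
    """Choose the side whose sound should be emphasized from the
    position of the camera currently in use; per the text, what matters
    is where the camera sits on the device, not which camera it
    nominally is (relevant for folding screens)."""
    if camera_position == "rear":
        return "rear-180-degrees"   # subject is behind the device
    return "front-180-degrees"      # front camera films the user's side
```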
  • a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy and so on.
  • Video codecs are used to compress or decompress digital video.
  • the electronic device 100 may support one or more video codecs.
  • the electronic device 100 can play or record videos of various encoding formats, such as: Moving Picture Experts Group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the NPU uses an image recognition technology to recognize whether the image captured by the camera 193 includes a face image and/or a human mouth image. Further, the NPU can also confirm the voice-producing face or the voice-producing mouth according to the data of the face image and/or the mouth image, so as to confirm the sound pickup range that needs to perform directional recording.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100 .
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, to save files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
  • the audio module 170 is used for converting digital audio data into analog audio electrical signal output, and also for converting analog audio electrical signal input into digital audio data.
  • the audio module 170 may include an analog/digital converter and a digital/analog converter.
  • the audio module 170 is used to convert the analog audio electrical signal output by the microphone 170C into digital audio data.
  • Audio module 170 may also be used to encode and decode audio data.
  • the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert analog audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • the receiver 170B also referred to as the "earpiece" is used to convert the analog audio electrical signal into a sound signal.
  • when the electronic device 100 answers a call or receives a voice message, the voice can be heard by placing the receiver 170B close to the human ear.
  • the microphone 170C, also called a "mike" or a "mic", is used to convert sound signals into analog audio electrical signals.
  • the user can make a sound with the mouth close to the microphone 170C, inputting the sound signal into the microphone 170C.
  • the microphone 170C may be a built-in component of the electronic device 100 or an external accessory of the electronic device 100 .
  • the electronic device 100 may include one or more microphones 170C, where each microphone, or multiple microphones cooperating, collects sound signals from various directions and converts the collected sound signals into analog audio electrical signals. The microphones can also implement noise reduction, sound source identification, or directional recording functions.
  • FIG. 2B exemplarily shows schematic diagrams of the layout of multiple microphones on two types of electronic devices 100 and the sound pickup range corresponding to each microphone.
  • the front of the electronic device 100 is the plane where the display screen 194 is located; the microphone 21 is located on the top of the electronic device 100 (usually the side where the earpiece and the camera are located), the microphone 22 is located on the right side of the electronic device 100, and the microphone 23 is located at the bottom of the electronic device 100 (the bottom is not visible at the current angle shown in (a) of FIG. 2B, so the position of the microphone 23 is schematically represented by a dotted line).
  • the sound pickup range corresponding to the microphone 21 includes the front upper sound pickup range and the rear upper sound pickup range
  • the sound pickup range corresponding to the microphone 22 includes the front middle pickup range
  • the pickup range corresponding to the microphone 23 includes the front lower pickup range and the rear lower pickup range.
  • the combination of the microphones 21 - 23 can collect sound signals from all directions around the electronic device 100 .
  • the front camera may correspond to the front sound pickup range
  • the rear camera may correspond to the rear sound pickup range. Then, when the electronic device 100 uses the front camera to record a video, the sound pickup range is determined to be the front sound pickup range.
  • the sound pickup range is more accurately determined to be a certain range included in the front sound pickup range. The specific method is described in detail below.
  • the electronic device 100 may further include a larger number of microphones, as shown in (c) of FIG. 2B , the electronic device 100 includes 6 microphones.
  • the microphone 24 is located on the top of the electronic device 100
  • the microphone 25 is located on the left side of the electronic device 100
  • the microphone 26 is located at the bottom of the electronic device 100
  • the microphones 27 - 29 are located on the right side of the electronic device 100 .
  • the left part of the electronic device 100 shown in (c) of FIG. 2B is not visible at the current angle, and the positions of the microphone 25 and the microphone 26 are schematically indicated by dotted lines. As shown in (d) of FIG. 2B:
  • the sound pickup range corresponding to the microphone 24 includes the sound pickup range above the front
  • the sound pickup range corresponding to the microphone 25 includes the front middle pickup range
  • the sound pickup range corresponding to the microphone 26 includes the front lower sound pickup range
  • the sound pickup range corresponding to the microphone 27 includes the rear upper sound pickup range
  • the sound pickup range corresponding to the microphone 28 includes the rear middle pickup range
  • the sound pickup range corresponding to the microphone 29 includes the rear lower sound pickup range.
  • the combination of microphones 24 - 29 can collect sound signals from all directions around the electronic device 100 .
  • the pickup ranges of the audio signals collected by the microphones of the electronic device 100 partially overlap, that is, the shaded parts in (b) and (d) of FIG. 2B .
  • in the overlapping range, the sound quality of the sound signal collected by one microphone may be better (for example, a higher signal-to-noise ratio and relatively less spike noise and burr noise), while the sound quality of the sound signal collected by another microphone may be poorer.
  • in this case, the audio data with better sound quality in the corresponding direction can be selected for fusion processing, and audio with a better recording effect is generated from the processed audio data.
  • the audio data collected by the multiple microphones can be fused to obtain the audio corresponding to the uttering face or the uttering mouth.
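  • A minimal sketch of such fusion, assuming each microphone's capture is available as an equal-length sample array; the SNR-based weighting below is an assumption, since the text does not specify how "better sound quality" is scored:

```python
import numpy as np

def fuse_overlapping_mics(signals, frame=160):
    """Fuse captures from microphones with overlapping pickup ranges,
    weighting each by a crude SNR estimate so the cleaner capture
    dominates the result."""
    def snr_estimate(x):
        x = x[: len(x) // frame * frame]           # drop the ragged tail
        levels = np.abs(x).reshape(-1, frame).mean(axis=1)
        noise = np.percentile(levels, 10) + 1e-9   # quietest frames ~ noise
        return levels.mean() / noise

    weights = np.array([snr_estimate(s) for s in signals])
    weights /= weights.sum()
    return sum(w * s for w, s in zip(weights, signals))
```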
  • the microphone 170C can be a directional microphone, which can collect sound signals in a specific direction.
  • the microphone 170C can also be a non-directional microphone, which can collect sound signals from various directions, or can collect sound signals within a certain range according to its position on the electronic device 100.
  • the microphone 170C is rotatable, and the electronic device 100 can adjust the sound pickup direction by rotating the microphone.
  • for example, the electronic device 100 can be configured with one rotatable microphone 170C; by rotating the microphone 170C, sound can be picked up in all directions.
  • the audio signals within the corresponding sound pickup range can be picked up by the combination of different microphones 170C.
  • some of the microphones 170C can be used to pick up sound without using all the microphones 170C of the electronic device 100 .
  • the audio signals collected by some microphones 170C are enhanced, and the audio signals collected by some microphones 170C are attenuated.
  • This embodiment of the present application does not specifically limit the number of microphones 170C.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light. Sensor 180L, bone conduction sensor 180M, etc.
  • the distance sensor 180F is used to measure the distance.
  • the electronic device 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the electronic device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the electronic device 100 may detect the operation of the user instructing to start and/or stop recording through the touch sensor 180K.
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or less components than shown, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software or a combination of software and hardware.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100.
  • FIG. 3 is a block diagram of a software structure of an electronic device 100 according to an embodiment of the present invention.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the operating system such as the Android system
  • the operating system of the electronic device is divided into four layers, which are, from bottom to top, a kernel layer, a hardware abstraction layer (HAL), an application framework layer, and an application layer.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least camera drivers, audio drivers, display drivers, and sensor drivers.
  • the touch sensor 180K transmits the received touch operation to the upper-layer camera application through the sensor driver of the kernel layer.
  • the camera application recognizes that the touch operation is an operation to start recording video
  • the camera application invokes the camera 193 through the camera driver to record video images, and invokes the microphone 170C through the audio driver to record audio.
  • when a touch operation is received, a corresponding hardware interrupt is sent to the kernel layer, and the kernel layer processes the touch operation into an original input event (including, for example, touch coordinates and the time stamp of the touch operation).
  • Raw input events are stored at the kernel layer.
  • the hardware abstraction layer is located between the kernel layer and the application framework layer, and is used to define the interface through which applications drive the hardware, converting the values produced by the hardware drivers into a software programming language. For example, it reads the values of the camera driver, converts them into a software programming language, and uploads them to the application framework layer, which then calls the camera service system.
  • the HAL can upload the video images collected by the camera 193 and the raw data after face image recognition to the application framework layer for further processing.
  • the original data after face image recognition may include, for example, face image data and/or human mouth image data, and the like.
  • the face image data may include the number of voiced face images, the position information of the voiced face images in the video picture, and the like; the human mouth image data may include the number of voiced mouth images, the position information of the voiced mouth images in the video picture, and the like.
  • the priority order of the face image data and the human mouth image data is preset.
  • since the human vocal organ is the mouth, the sound pickup range can be determined more accurately from voiced mouth data; therefore, the human mouth image data is set to a higher priority than the face image data.
  • HAL can determine the voiced face image data and voiced human mouth image data according to the collected video images, and upload the voiced human mouth data as raw data according to the priority order.
  • the subsequent audio processing system determines the sound pickup range corresponding to the uttering mouth image according to the corresponding relationship between the video picture and the panoramic audio based on the uttering mouth image data.
  • the HAL only determines the sounding face image data in the collected video images, and uploads the sounding face image data as raw data to determine the sound pickup range corresponding to the sounding face image.
  • the HAL only determines the uttering mouth image data according to the video picture, and uploads the uttering mouth image data as raw data to determine the sound pickup range corresponding to the uttering mouth image.
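  • A minimal sketch of this priority rule, assuming the HAL's recognition results arrive as simple lists (field names are illustrative, not from this application):

```python
def select_raw_data(voiced_face_data, voiced_mouth_data):
    """Pick the recognition result to upload: voiced mouth data first,
    because it localizes the sound source more precisely; voiced face
    data as the fallback; None if nothing is voicing."""
    if voiced_mouth_data:
        return {"type": "mouth", "items": voiced_mouth_data}
    if voiced_face_data:
        return {"type": "face", "items": voiced_face_data}
    return None
```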
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer obtains the original input event from the kernel layer via the HAL, and identifies the control corresponding to the input event.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a camera service system, an audio processing system, a view system, a telephony manager, a resource manager, a notification manager, a window manager, and the like.
  • the camera service system serves the camera application and is used to call the camera application to collect images based on the raw events input from the kernel layer.
  • the audio processing system is used to manage the audio data and process the audio data with different audio algorithms. For example, in cooperation with the camera service system, the collected audio signals are processed during the recording process. For example, based on the face image data, the sound pickup range is determined, the audio signals within the sound pickup range are enhanced, and the audio signals outside the sound pickup range are weakened.
  • the camera application invokes the camera service system of the application framework layer to start the camera application; the camera driver is then started through the kernel layer, and video pictures are captured through the camera 193. At the same time, the audio processing system is invoked, the audio driver is started through the kernel layer, sound signals are collected through the microphone 170C to generate analog audio electrical signals, the audio module 170 converts the analog audio electrical signals into digital audio data, and audio is generated from the digital audio data.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication function of the electronic device 100 .
  • for example, management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, video, call, WLAN, music, short message, Bluetooth, map, calendar, gallery, navigation, etc.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • the audio processing method provided by the embodiment of the present application will be described below by taking the electronic device as a mobile phone having the structure shown in FIG. 1 and FIG. 3 as an example.
  • the methods of the embodiments of the present application may be applied to a scenario in which a user instruction is received to directly start a camera application (hereinafter may also be referred to as a camera for short). It can also be applied to the scene where the user opens other third-party applications (such as short video applications, live broadcast applications, video calling applications, etc.) and invokes the startup of the camera.
  • the user may instruct the mobile phone to start the camera and display the shooting preview interface through a touch operation, a key operation, an air gesture operation, or a voice operation.
  • in response to the user clicking the camera icon 41, the mobile phone starts the camera and displays the shooting preview interface 402 shown in (b) of FIG. 4.
  • the mobile phone starts the camera in response to the user's voice instruction operation to turn on the camera, and displays the shooting preview interface 402 shown in (b) of FIG. 4 .
  • the control 421 is used to set the shooting function of the mobile phone, such as time-lapse shooting.
  • Control 422 is used to turn the filter function on or off.
  • Control 423 is used to turn the flash function on or off.
  • the camera can switch between different functions in response to the user's operation of clicking different function controls.
  • the controls 431-434 are used to switch the functions that can be realized by the camera. If the control 432 is currently selected, the photographing function is activated. For another example, in response to the user clicking the control 431, switching to the portrait shooting function. Alternatively, in response to the user's operation of clicking on the control 433, the recording function is switched. Alternatively, in response to the user's operation of clicking the control 434, more switchable functions of the camera, such as panorama shooting, are displayed.
  • for example, after the camera is started, the photographing function is enabled by default and the shooting preview interface 402 shown in (b) of FIG. 4 is displayed; after the video recording function is activated, the video recording preview interface is displayed.
  • if the mobile phone detects the operation of the user clicking the control 433, the video recording function is activated, and the video recording preview interface 403 shown in (c) of FIG. 4 is displayed.
  • the mobile phone can also turn on the video recording function by default after starting the camera; for example, after the mobile phone starts the camera, the video recording preview interface 403 shown in (c) of FIG. 4 is displayed directly.
  • after the mobile phone detects the user's operation of opening the camera application, the video recording function can be started.
  • the mobile phone activates the video recording function by detecting an air gesture, or detecting a voice instruction operation. For example, when the mobile phone receives the user's voice command "open camera recording", it directly starts the recording function of the camera, and displays the recording preview interface.
  • the mobile phone starts the camera, it defaults to a function that was last applied before the camera was turned off last time, such as a portrait shooting function. After that, by detecting the operation of enabling the video recording function, the video recording function of the camera is activated, and the video recording preview interface is displayed.
  • in some embodiments, after detecting switching to the video recording function, the mobile phone first asks the user whether to enable the voice enhancement mode, and activates the voice enhancement mode after the user confirms. Or, the mobile phone automatically activates the voice enhancement mode after detecting that it has switched to the video recording function. In still other embodiments, after detecting switching to the video recording function, the mobile phone first displays the video recording preview interface, and then, after detecting the user's instruction to shoot, activates the voice enhancement mode according to the user's instruction or automatically.
  • in response to the operation of the user clicking the recording control 433, the mobile phone displays the video recording preview interface 403 shown in (c) of FIG. 4, and displays a prompt box 44 in the video recording preview interface 403, which prompts the user whether to activate the voice enhancement mode. If it is detected that the user clicks "Yes", the voice enhancement mode is activated and the shooting interface 404 shown in (d) of FIG. 4 is displayed. Alternatively, after the mobile phone switches from the shooting preview interface 402 to the video recording function, it directly activates the voice enhancement mode and displays the shooting interface 404 shown in (d) of FIG. 4.
  • the mobile phone enables or disables the voice enhancement mode after detecting the user's operation of enabling or disabling the voice enhancement mode in the video recording preview interface or during the recording of the video screen.
  • the operation of starting the voice enhancement mode may include, for example, an operation of clicking a preset control, a voice operation, and the like.
  • the mobile phone can enable or disable the voice enhancement mode by detecting the user's operation on the control 46 .
  • for example, the current display state of the control 46 indicates that the voice enhancement mode is not activated on the current mobile phone; after the user clicks the control 46, the voice enhancement mode is activated.
  • the mobile phone can enable or disable the voice enhancement mode by detecting the user's operation on the control 46 before or during the shooting.
  • the mobile phone After the voice enhancement mode is turned on, the mobile phone starts recording video images after detecting the operation instructed by the user to shoot, and can perform video encoding and other processing on the captured video images to generate and save video files.
  • the mobile phone in response to the operation of the user clicking the shooting control 45, displays the shooting interface 404 shown in FIG. 4(d), and starts to perform the video screen. record.
  • the voice enhancement mode is used to enhance the collection of the audio of some specific objects in the video shot of the video, thereby improving the audio recording effect. For example, if a user uses a camera to record video during an interview, it is necessary to focus on collecting the voice of the person being interviewed.
  • the operation of the user instructing to shoot may include, for example, an operation of clicking a shooting control, an operation of voice instruction, and other operation methods.
  • the large circle 501 is used to represent the maximum range that all the microphones of the mobile phone can pick up (which can also be described as the panoramic sound pickup range), and the small circle 502 is used to represent the sound pickup range corresponding to the person the user pays attention to (usually the person who is voicing).
  • the sound pickup range of the person the user pays attention to (i.e., sound pickup range 1), which needs enhanced recording, can be determined according to the position information of that person's image in the recorded video picture; that is, the audio recording effect within sound pickup range 1 shown in (b) of FIG. 5 is enhanced. In this way, the interference of other noises in the panoramic audio with the voice of the person the user pays attention to in the recorded audio is reduced.
  • the voice-producing face image identified by the mobile phone may be described as the first face image, and the voice-producing mouth image may be described as the first human mouth image; they may also be described as a voiced face image or a voiced mouth image.
  • the number of first face images is one or more
  • the number of first human mouth images is one or more. It is understandable that if some characters are speaking in the currently captured video but the mobile phone fails to recognize that they are speaking, the face images or mouth images of those unrecognized speakers are not classified as the above-mentioned first face image or first human mouth image.
  • after the mobile phone starts the voice enhancement mode and starts recording video pictures, it needs to recognize the first face image or the first human mouth image and, according to it, determine the first sound pickup range whose recording effect needs to be enhanced, so as to obtain a better recording effect.
  • the mobile phone calls the microphone corresponding to the first sound pickup range to enhance the audio signal within the first sound pickup range.
  • the mobile phone includes one or more microphones for capturing the first audio (i.e., the initial audio signal).
  • if the first sound pickup range is included in the sound pickup range of a first microphone among the one or more microphones, the mobile phone enhances the audio signal within the first sound pickup range in the first microphone's pickup range, and/or attenuates the audio signal outside the first sound pickup range in the first microphone's pickup range, and/or attenuates the audio signals of the microphones other than the first microphone, to obtain the second audio (that is, the audio corresponding to the first face image or the first human mouth image).
  • the mobile phone includes at least two microphones, and the at least two microphones are used to collect the first audio.
  • for example, the second microphone is turned off, and the audio collected by the microphones other than the second microphone among the at least two microphones is used as the second audio; or, with the second microphone turned off, the audio signals within the first sound pickup range in the pickup ranges of the other microphones are enhanced, and/or the audio signals outside the first sound pickup range in those pickup ranges are attenuated.
  • the mobile phone is configured with a microphone 1 and a microphone 2 .
  • if the first sound pickup range is within the sound pickup range of the microphone 1, then after using the microphone 1 and the microphone 2 to obtain the initial audio signal, the mobile phone can enhance the audio signal within the first sound pickup range collected by the microphone 1, attenuate the audio signal outside the first sound pickup range collected by the microphone 1, and attenuate the audio signal collected by the microphone 2, to obtain the audio corresponding to the first face image or the first human mouth image.
  • alternatively, the mobile phone turns off the microphone 2, enhances the audio signal within the first sound pickup range in the audio signal collected by the microphone 1, attenuates the audio signal outside the first sound pickup range, and then obtains the audio corresponding to the first face image or the first human mouth image.
  • the mobile phone is configured with a microphone 1 and a microphone 2 .
  • the first sound pickup range includes a sound pickup range 1 within the sound pickup range of the microphone 1 , and a sound pickup range 2 within the sound pickup range of the microphone 2 . That is to say, the first sound pickup range is the union of the sound pickup range 1 and the sound pickup range 2 .
  • after the mobile phone uses the microphone 1 and the microphone 2 to obtain the initial audio signal, it can enhance the audio signals within sound pickup range 1 of the microphone 1 and sound pickup range 2 of the microphone 2, and attenuate the remaining audio signals in the initial audio signal, to obtain the audio corresponding to the first face image or the first human mouth image. It can be understood that sound pickup range 1 and sound pickup range 2 may overlap in whole or in part.
  • the shooting interface 404 includes a viewfinder frame 48 for displaying a video image.
  • the sound pickup range corresponding to the viewfinder frame 48 is the maximum sound pickup range of the currently recorded video picture.
  • the mobile phone recognizes the first face image 47 , and assuming that the first face image is located in the center of the viewfinder frame 48 , the mobile phone determines that the first sound pickup range is the center of the maximum sound pickup range.
  • the phone boosts the audio signal in the first pickup range.
  • a prompt box 49 is displayed on the shooting interface 404 for prompting the user that the recording effect of the middle position has been enhanced.
  • the prompt box 49 can be continuously displayed during the shooting process, the displayed content changes with the change of the first sound pickup range, and is automatically hidden after the shooting is stopped. Alternatively, it is only displayed within a preset time period, and automatically disappears after the preset time period, so as to avoid blocking the displayed video picture of the viewfinder frame 48 .
  • in this way, by enhancing the audio signal within the first sound pickup range, the mobile phone can obtain the audio corresponding to the voiced face or voiced mouth, enhancing the recording effect of that voice and reducing the interference of ambient noise.
  • the audio signal outside the first sound pickup range can also be weakened to obtain a better recording effect.
  • only the audio signals outside the first pickup range are attenuated to reduce the interference of external noises.
  • FIG. 6 is a schematic flowchart of an audio processing method provided by an embodiment of the present application.
  • taking the process described above through (a)-(d) of FIG. 4 as an example, how the mobile phone identifies the first face image or the first human mouth image, determines the first sound pickup range that needs voice enhancement, and obtains the audio corresponding to the first sound pickup range is described in detail below.
  • the mobile phone recognizes the first face image or the first human mouth image.
  • the mobile phone may recognize the first face image or the first human mouth image through a face image recognition algorithm. For example, in the process of recording a video picture, the mobile phone determines through the face image recognition algorithm whether the captured video picture contains a face image. If a face image is included, the included face image is identified, and whether it is voicing is determined according to changes in the facial feature data of the face image (such as five-sense-organ feature data and facial contour data) within a preset time period. The criterion for judging that the face image is voicing may include: after judging for the first time that the face image voices, the mobile phone determines again within a preset time period that the face image is voicing; the face image is then determined to be a voicing face image.
  • the human vocal organ is the human mouth.
  • if voiced mouth data can be obtained, the data of the first human mouth image is preferentially determined, and the first sound pickup range is then determined based on the data of the first human mouth image.
  • for example, the mobile phone collects the face image 71, and recognizes the facial feature key points corresponding to the face image 71 (such as the circles displayed on the face image 71) through the face image recognition algorithm, using these feature points to determine whether it is vocalizing; face data and/or mouth data can thereby be obtained.
  • the facial feature points include upper lip feature points and lower lip feature points, and the distance between the upper and lower lips can be obtained in real time from these feature points; a distance threshold between the upper lip and the lower lip of the face image is preset. If, within a preset time period, the mobile phone detects that the number of times the distance between the upper lip and the lower lip exceeds the distance threshold is greater than a preset number of times, it determines that the current face image is voicing, as in the sketch below.
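  • A minimal sketch of this lip-distance test, assuming lip-gap samples over the preset time period are available; the threshold values in the example are illustrative:

```python
def is_vocalizing(lip_gaps, gap_threshold, min_count):
    """Return True if the upper/lower-lip distance exceeded the preset
    threshold more than min_count times within the sampled window.

    lip_gaps: distances between upper- and lower-lip feature points
    sampled over the preset time period.
    """
    openings = sum(1 for gap in lip_gaps if gap > gap_threshold)
    return openings > min_count

# Example: six samples over the window; four openings exceed 0.7,
# which is more than min_count=3, so the face counts as voicing.
print(is_vocalizing([0.1, 0.9, 0.2, 1.1, 0.8, 1.0],
                    gap_threshold=0.7, min_count=3))
```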
  • the facial feature points may also include facial contour feature points
  • the mobile phone can obtain data such as jaw changes and facial muscle changes according to the facial contour feature points, and then determine whether the face image is vocalizing. For example, if within a preset time period the number of times the change data generated by the up-and-down movement of the chin exceeds a preset threshold is greater than a preset number of times, it is determined that the current face image is voicing.
  • the mobile phone can also determine the voice-producing face or the voice-producing mouth according to changes in other data corresponding to the human mouth, such as the Adam's apple change data.
  • the mobile phone can also combine the above-mentioned face data and mouth data to recognize the first face image or the first human mouth image more accurately.
  • the number of the first face images is one or more. In a scene where multiple face images voice simultaneously, or multiple face images voice successively within the first preset time period, the mobile phone can exclude some of them: a face image with a small area, or one at the edge of the video picture, is not regarded as the first face image, because the face image the user pays attention to is usually a face image with a large area, or a face image displayed in or near the middle of the video picture.
  • the first preset time period may be a pre-configured short time range.
  • for example, the mobile phone determines that user A is speaking, starts timing at the moment user A stops speaking, and detects that user B starts speaking within the first preset time period; further, within the first preset time period after user B stops speaking, it detects that user A starts speaking again. That is to say, during recording, if user B speaks right after user A, or user A and user B speak alternately, the face images corresponding to user A and user B can both be confirmed as first face images. In this way, frequently re-confirming the sound pickup range corresponding to the first face image within a short time range can be avoided, reducing the data processing amount and improving efficiency.
  • in some other embodiments, the mobile phone confirms the face image with the largest area, or the face image closest to the center of the video picture, as the first face image, together with any voiced face image whose area difference from that face image is less than a preset threshold, or any voiced face image within a preset range near that face image, so as to determine the first sound pickup range according to the first face image (see the sketch after this item).
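  • A minimal sketch of this selection, assuming voiced faces are given as (x, y, w, h) boxes; the "comparable area" threshold is an illustrative stand-in for the preset value mentioned above:

```python
def pick_first_faces(voiced_boxes, area_diff_ratio=0.5):
    """Keep the largest voiced face plus any voiced face of comparable
    size; faces much smaller than the largest are excluded."""
    if not voiced_boxes:
        return []
    areas = [w * h for (_, _, w, h) in voiced_boxes]
    largest = max(areas)
    return [box for box, area in zip(voiced_boxes, areas)
            if largest - area < area_diff_ratio * largest]
```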
  • the scenario in which the mobile phone determines the multiple first human mouth images is the same as the scenario in which the multiple first human face images are determined, and will not be described again.
  • the center point of the video image includes, for example, the center point of the viewfinder frame, the center point of the display screen of the mobile phone, and the like.
  • the mobile phone acquires the first feature value corresponding to the first face image or the first human mouth image.
  • the mobile phone determines the first sound pickup range according to the first feature value.
  • the first feature value is used to describe the relative positional relationship between the face of the real person corresponding to the first face image and the mobile phone, or between the mouth of the real person corresponding to the first human mouth image and the mobile phone. Therefore, the mobile phone can determine the first sound pickup range according to the first feature value. For example, if the real person corresponding to the first face image is located directly in front of the mobile phone, that is, the first face image is located in the center of the captured video picture, the first sound pickup range is the sound pickup range directly in front of the mobile phone.
  • the first feature value includes one or more of a front/rear attribute parameter, an area ratio, and position information.
  • the front and rear attribute parameters, the area ratio and the position information are parameters determined by the mobile phone according to the first face image or the first human mouth image, and their meanings are described in the following description.
  • the following describes a specific method for the mobile phone to determine the first sound pickup range when the first feature value includes different parameters.
  • the first feature value includes a front-to-back attribute parameter of the first face image, or the first feature value includes a front-to-back attribute parameter corresponding to the first human mouth image.
  • the "front and rear attribute parameter" is used to indicate that the video picture containing the first face image or the first human mouth image is the video picture captured by the front camera (for the convenience of description, this paper is also referred to as the front video picture), It is also a video image captured by the rear camera (for convenience of description, it is also referred to as a rear video image in this document).
  • the front/rear attribute parameter can be used to determine whether the first sound pickup range is within the 180-degree range in front of the mobile phone or within the 180-degree range behind it. Exemplarily, as shown in (b) of FIG. 2B, the sound pickup range corresponding to the front video picture includes the ranges represented by ellipse 204, ellipse 205 and ellipse 206, and the sound pickup range corresponding to the rear video picture includes the ranges represented by ellipse 201, ellipse 202 and ellipse 203.
  • the video images displayed in the viewfinder of the mobile phone can be switched between the images captured by the front and rear cameras.
  • as shown in the shooting interface 801 in (a) of FIG. 8, the mobile phone is in the voice enhancement mode and has confirmed that there is a voice-producing face image 81.
  • the mobile phone confirms that the video picture where the voice-producing face image 81 is located is a video picture captured by the front camera, that is, confirms that the first feature value is the front attribute parameter; it then confirms that the first sound pickup range is within the front 180-degree range and displays a prompt box 82 prompting the user that the front recording effect has been enhanced.
  • the shooting interface 801 further includes a front-rear switching control 83 for switching between the front and rear cameras.
  • the mobile phone can switch the front camera to the rear camera in response to the user's operation of clicking the front and rear switch control 83 .
  • after the switch, the video picture displayed by the mobile phone changes from the video picture captured by the front camera, shown on the shooting interface 801 in (a) of FIG. 8, to the video picture captured by the rear camera.
  • when the mobile phone recognizes the voice-producing face image 84 in the current video picture, it determines that the first feature value is the rear attribute parameter and that the first sound pickup range is within the 180-degree range behind the mobile phone; the mobile phone then displays a prompt box 85 prompting the user that the rear recording effect has been enhanced.
  • exemplarily, the sound pickup range corresponding to the rear video picture is the range represented by ellipse 201, ellipse 202 and ellipse 203, and the sound pickup range corresponding to the front video picture is the range represented by ellipse 204, ellipse 205 and ellipse 206; when the mobile phone confirms according to the first feature value that the first face image corresponds to the rear video picture, it confirms that the first sound pickup range is the range represented by ellipse 201, ellipse 202 and ellipse 203.
  • alternatively, the mobile phone confirms according to the first feature value that the first face image corresponds to the rear video picture, and then confirms that the first sound pickup range is the sound pickup range corresponding to the microphone 27, the microphone 28 and the microphone 29.
  • the first feature value includes the area ratio corresponding to the first face image, or the first feature value includes the area ratio corresponding to the first human mouth image.
  • the "area ratio” is used to represent the ratio of the area of the first face image or the area of the first mouth image to the area of the video screen. This area ratio is used to measure the radius (or diameter) of the audio collected by the microphone.
  • the person concerned by the user is usually placed at the center of the video image, that is, the first face image or the first human mouth image is located at the center of the viewfinder frame.
  • the sound pickup ranges corresponding to different areas of the first face image or the first mouth image are different.
  • the mobile phone determines two first face images in different time periods, which are the first face image 1 and the first face image 2 respectively. The areas of the two face images are different, and the area of the first face image 1 is larger than the area of the first face image 2 .
  • the determined sound pickup range is the sound pickup range 1 .
  • the determined sound pickup range is the sound pickup range 2 .
  • Pickup range 1 is greater than pickup range 2.
  • X is used to represent the first face image area or the first human mouth image area.
  • Y is used to indicate the area of the video frame displayed by the viewfinder.
  • N represents the sound pickup range corresponding to the viewing range.
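  • Read together, X, Y and N suggest a proportional sizing rule for the first sound pickup range; a plausible reconstruction (an assumption, as the exact expression is not preserved in this text) is:

```latex
% The first pickup range scales the viewfinder's full pickup range N
% by the face-to-frame area ratio X/Y (hedged reconstruction).
\[
  R_{\text{first}} = \frac{X}{Y} \cdot N
\]
```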
  • the area ratio is used to represent the ratio of the area of the first face image to the area of the video picture displayed by the viewfinder frame.
  • the number of the first face images may be one or more, and then the area of the first face image is the area of one face image or the sum of the areas of multiple face images.
  • the sum of the areas of the multiple face images can be represented by the area of the occupancy frame where the multiple face images are located, that is, the area of the smallest selection frame containing the multiple face images.
  • for example, the number of the first face image is 1; during face image recognition, the mobile phone determines the dotted frame 101 framing the face area of the first face image 11 according to the position of the feature point at the top of the forehead, the position of the feature point at the bottom of the chin, and the positions of the outermost feature points of the left and right sides of the face (excluding the ears) among the facial feature points, and the image area within the frame selection range is the first face image area. That is, in the process of confirming the first face area, only the face area itself is counted, excluding the influence of ears, hats, accessories, the neck, etc.
  • the area of the video picture displayed in the viewfinder frame is the image area within the frame selection range of the dotted frame 102. The mobile phone can then determine the area ratio from the areas corresponding to the identified dotted frame 101 and dotted frame 102. Subsequent determinations of the area of the first face image may refer to this method and are not repeated below.
  • in the interface 1002 in (b) of FIG. 10, two face images are displayed, both of which are recognized by the mobile phone as voicing first face images.
  • the area of the face image 12 on the right side is the image area within the frame selection range of the dotted frame 103
  • the area of the face image 13 on the left side is the image area within the frame selection range of the dotted frame 104
  • the area of the first face image is the image area within the frame selection range of the dotted frame 105, that is, the area of the smallest frame including all the face images (for example, the total frame selection area is determined according to the edge limit values of all the face image selection frames).
  • the dotted frame 105 is used to represent the placeholder frame where the face image 12 and the face image 13 are located.
  • the finally determined first face image area simultaneously includes the image areas corresponding to the two face images.
  • the area of the video image displayed in the viewfinder frame is the image area within the frame selection range of the dotted line frame 106 . Then, the mobile phone can determine the area ratio according to the area ratio corresponding to the identified dotted frame 105 and the dotted frame 106 .
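  • A minimal sketch of the area-ratio computation for one or more face boxes, assuming axis-aligned (x, y, w, h) boxes; the helper names are hypothetical:

```python
def union_box_area(boxes):
    """Area of the smallest box containing every face box, used as the
    combined first-face-image area when several faces are voicing."""
    x0 = min(x for x, _, _, _ in boxes)
    y0 = min(y for _, y, _, _ in boxes)
    x1 = max(x + w for x, _, w, _ in boxes)
    y1 = max(y + h for _, y, _, h in boxes)
    return (x1 - x0) * (y1 - y0)

def area_ratio(face_boxes, frame_width, frame_height):
    """Ratio of the (combined) first face image area to the area of the
    video picture displayed by the viewfinder frame."""
    return union_box_area(face_boxes) / (frame_width * frame_height)
```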
  • the mobile phone determines that the area of the right face image 14 is the largest.
  • the mobile phone can exclude some voice-producing face images that users do not pay attention to through a preset threshold.
  • for example, the preset threshold is 20% of the maximum face image area: voiced face images smaller than this are excluded.
  • the mobile phone can exclude the left face image 15 that is smaller than 20% of the area of the right face image 14 .
  • the first face image includes the face image 14 on the right.
  • the preset threshold is that the distance from the face image with the largest area exceeds 35% of the length or width of the video picture displayed by the viewfinder.
  • the mobile phone can exclude the left face image 15 whose distance from the right face image 14 exceeds 35% of the length of the video frame displayed in the viewfinder. Then, the first face image includes the right face image 14 .
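  • A minimal sketch combining the two exclusion rules quoted above (20% of the largest area; 35% of the frame's length or width), assuming face centers and areas are available as plain dicts:

```python
def filter_attended_faces(faces, frame_width, frame_height):
    """Drop voiced faces the user is unlikely to care about: those
    smaller than 20% of the largest face's area, or farther from the
    largest face than 35% of the frame's length or width.

    faces: list of dicts with 'cx', 'cy' (center) and 'area'.
    """
    biggest = max(faces, key=lambda f: f["area"])
    kept = []
    for f in faces:
        if f["area"] < 0.2 * biggest["area"]:
            continue
        if (abs(f["cx"] - biggest["cx"]) > 0.35 * frame_width or
                abs(f["cy"] - biggest["cy"]) > 0.35 * frame_height):
            continue
        kept.append(f)
    return kept
```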
  • the area ratio is used to represent the ratio of the area of the first human mouth image to the area of the video picture displayed by the viewfinder frame.
  • the number of the first human mouth images may be one or more, then the area of the first human mouth image is the area of one human mouth image or the sum of the areas corresponding to the multiple human mouth images.
  • the area sum of the multiple human mouth images can be represented by the area of the occupancy frame where the multiple human mouth images are located, that is, by the area of the smallest box containing the multiple human mouth images.
  • for example, the number of the first human mouth image is 1; during face image recognition, the mobile phone determines the dotted frame 111 framing the area of the first human mouth image 16 according to the topmost, bottommost, leftmost and rightmost feature point positions of the mouth among the facial feature points, and the image area within the frame selection range is the area of the first human mouth image.
  • the area of the video image displayed in the viewfinder frame is the image area within the frame selected by the dotted frame 112 .
  • the mobile phone can determine the area ratio according to the area ratio corresponding to the identified dotted frame 111 and the dotted frame 112 .
  • the interface 1102 displays two human mouth images, both of which are recognized by the mobile phone as vocalized human mouth images.
  • the area of the first mouth image 17 on the right is the image area within the frame selection range of the dotted frame 113
  • the area of the first mouth image 18 on the left is the image area within the frame selection range of the dotted frame 114
  • the area of the first human mouth image is the image area within the frame selection range of the dotted frame 115, that is, the area of the smallest frame including all the mouth images (for example, the total frame selection area is determined according to the edge limit values of all the mouth image selection frames).
  • the dotted frame 115 is used to represent the placeholder frame where the first human mouth image 17 and the first human mouth image 18 are located.
  • the finally determined first mouth image area simultaneously includes the image areas corresponding to the two human mouth images.
  • the area of the video image displayed by the viewfinder frame is the image area within the frame selection range of the dotted frame 116 . Then, the mobile phone can determine the area ratio according to the area ratio corresponding to the identified dotted frame 115 and the dotted frame 116 .
  • the mobile phone determines that the area of the right mouth image is the largest.
  • the mobile phone can use a preset threshold to exclude voice-producing mouth images that the user is not paying attention to.
  • for example, the preset threshold excludes mouth images whose area is less than 20% of the largest mouth image area.
  • alternatively, the preset threshold excludes mouth images whose distance from the mouth image with the largest area exceeds 35% of the length or width of the video picture displayed in the viewfinder frame.
  • the mouth image on the left is then excluded, and the first mouth image includes only the first mouth image on the right; the radius of the first sound pickup range is determined according to the area of the right first mouth image.
  • the sound pickup range determined by the mobile phone according to the first feature value of the first face image as shown in (a) of FIG. 10 may be the sound pickup range 2 shown in FIG. 9.
  • the sound pickup range determined by the mobile phone according to the first feature value of the first face image as shown in (b) of FIG. 10 may be the sound pickup range 1 shown in FIG. 9.
  • in the above examples, a rectangular area is used to represent the area of the first face image or the area of the first mouth image. It can be understood that an irregular geometric figure may also be used to correspond to the first face image or the first human mouth image, so as to determine the corresponding area more accurately; the rectangle in this embodiment of the present application is only an exemplary illustration and does not constitute a specific limitation on this embodiment.
  • the area of the viewfinder frame is used as the area of the video picture. It can be understood that, when the mobile phone has a full screen, the display area of the mobile phone can be used as the video picture area; alternatively, other regions or regions of other shapes may be used as the video picture area.
  • the viewfinder frame area in this embodiment of the present application is only an exemplary description and is not specifically limited in this embodiment of the present application.
  • the first feature value includes position information corresponding to the first face image, or the first feature value includes position information corresponding to the first human mouth image.
  • the "position information" is used to indicate the position of the first face image or the first face image in the video picture.
  • the position information includes the offset of the center point of the first face image relative to the first reference point, such as the offset direction, and/or the offset angle, and/or the offset distance, and the like.
  • the position information includes the offset of the center point of the first human mouth image relative to the first reference point.
  • the first reference point is the center point of the video image or the focal point of focus.
  • the offset direction means that the center point of the first face image or the first mouth image is offset to the left, right, up, down, upper left, upper right, lower left, or lower right relative to the first reference point.
  • the offset angle is the angle of an offset toward the upper left, upper right, lower left, or lower right.
  • the offset distance is the distance of an offset to the left, right, up, or down, or the offset distance at a certain offset angle.
  • the coordinates of the center point of the first face image may be determined according to the extreme positions of the feature points in various directions of the first face image. As described above, in the process of determining the area of the first face image, the coordinates of the center point of the first face image are determined according to the facial feature points of the first face image: the position of the feature point at the top of the forehead, the position of the feature point at the bottom of the chin, and the positions of the left and right face feature points excluding the ears. Similarly, the coordinates of the center point of the first human mouth image are determined according to the positions of the uppermost, lowermost, leftmost, and rightmost feature points of the mouth among the facial feature points (a small sketch of this center-point computation follows).
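  • a minimal sketch of this center-point computation, assuming the feature points are given as (x, y) tuples; the helper name is illustrative:

```python
def center_from_features(points):
    """Center of the box spanned by the extreme feature points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
```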
  • the preset first reference point may include, for example, the center point of the video image displayed by the viewfinder frame (which may also be described as the center point of the viewfinder), the focus within the viewfinder range, and the like.
  • a coordinate system is constructed with the x-axis parallel to the bottom edge of the mobile phone (or the bottom edge of the current viewfinder frame) and the y-axis perpendicular to the x-axis; the plane of the coordinate system is parallel to the mobile phone display.
  • the constructed coordinate system is used to define the offset direction, offset angle, and offset distance of the center point of the first face image or the first human mouth image relative to the first reference point. Exemplarily, (a) of FIG. 13 shows the coordinate system when the mobile phone is displayed in portrait orientation, where the x-axis is parallel to the bottom edge (that is, the short side) of the mobile phone; in landscape display, as shown in (b) of FIG. 13, the x-axis is parallel to the side edge (that is, the long side) of the mobile phone.
  • the intersection of the x-axis and the y-axis is the origin, with coordinates (0, 0); the positive direction of the x-axis points right, and the positive direction of the y-axis points up.
  • when there is one first face image, the center point of the first face image is the position corresponding to the mark 121, and the center point of the video image displayed in the viewfinder frame is the position corresponding to the mark 122.
  • the position of the center point of the viewfinder frame is determined according to the extreme edge coordinates of the top, bottom, left, and right of the viewfinder frame.
  • the mobile phone determines the position information of the first face image according to the positional relationship between the mark 121 and the mark 122. For example, in the scene displayed on the interface 1201, the position information of the first face image is the lower left of the center point of the viewfinder frame.
  • alternatively, as shown in the interface 1202 in (b) of FIG. 12, there is one first human mouth image; the center point of the first human mouth image is the position corresponding to the mark 123, and the center point of the video image displayed in the viewfinder frame is the position corresponding to the mark 124.
  • the mobile phone determines the position information of the first human mouth image according to the positional relationship between the mark 123 and the mark 124. For example, in the scene displayed on the interface 1202, the position information of the first human mouth image is the lower left of the center point of the viewfinder frame.
  • when there are multiple first face images, the center point of the first face image is the center point of the image range composed of the multiple face images.
  • the center point of the first face image is the geometric center point of the range framed by the dotted frame 105.
  • the center point of the first human mouth image is the geometric center point of the range framed by the dotted frame 115.
  • the center point of the video picture displayed in the viewfinder frame is also the geometric center point of the viewfinder frame.
  • the center point of the rectangle is used as the center point of the first face image or the center point of the first human mouth image. It can be understood that irregular geometric figures may also be used to correspond to the first face image and the first human mouth image, so as to determine the corresponding center point more accurately; the rectangle in this embodiment of the present application is only an exemplary illustration and is not specifically limited.
  • the center point of the viewfinder frame is used as the first reference point; that is, the center point of the viewfinder frame represents the center point of the video picture.
  • the first reference point may also be represented in other forms.
  • the center point of the entire screen of the display screen of the mobile phone is used to represent the center point of the video image, that is, as the first reference point.
  • the user may not place the object of interest in the center of the viewing range during video recording, but select the object of interest by focusing.
  • the mobile phone can obtain the user's intention and determine the object that the user pays attention to.
  • the focus position for focusing may also be the focus position obtained by automatic focusing of the mobile phone. For example, the mobile phone automatically recognizes the portrait, and determines the corresponding focus position after auto-focusing.
  • the number of first face images is 2, and the center point of the first face image is the position corresponding to the mark 125.
  • the mobile phone detects the user's operation of clicking on the screen, obtains the focused focal position, and displays a dotted frame 126.
  • the range framed by the dotted frame 126 is the focus range determined by the mobile phone according to the user's intention.
  • the central focus within the focus range is the position corresponding to the mark 127.
  • the mobile phone determines the position information of the first face image according to the positional relationship between the mark 125 and the mark 127. For example, the position information of the first face image is the upper left of the focus center.
  • the mobile phone may determine the relative positional relationship between the first face image or the first mouth image and the first reference point according to the coordinates of the center point of the first face image or of the first mouth image and the coordinates of the first reference point, and then determine the offset direction of the first face image or the first mouth image in the video picture displayed in the viewfinder frame.
  • the coordinate system used here is the coordinate system shown in (a) or (b) of FIG. 13.
  • assume the coordinates of the center point of the first face image or of the first human mouth image are (X1, Y1) and the coordinates of the first reference point are (X2, Y2); for example, the first reference point may be set as the origin (0, 0) of the coordinate system. The relative positional relationship between the first face image or the first mouth image and the first reference point can then be as shown in Table 2 below.
  • if X1 < X2, the first face image or the first mouth image is located on the left side of the first reference point; that is, the offset direction is to the left.
  • the mobile phone may determine, according to the coordinates of the center point of the first face image or of the first mouth image and the coordinates of the first reference point, the offset angle of the first face image or the first mouth image in the video picture displayed in the viewfinder frame (as shown in FIG. 14, the angle θ between the x-axis and the line connecting the center point coordinates (X1, Y1) of the first face image or the first human mouth image with the first reference point (X2, Y2)).
  • the large circle 141 indicates the maximum sound pickup range corresponding to the viewfinder frame of the mobile phone, and the coordinates of the center point of the viewfinder frame are set to (0, 0); that is, the center point of the viewfinder frame is set as the first reference point.
  • the maximum pickup range is divided into four quadrants, such as the first quadrant 142 , the second quadrant 143 , the third quadrant 144 and the fourth quadrant 145 .
  • the mobile phone can determine the offset angle θ based on the angle between the x-axis and the line connecting (X1, Y1) and (X2, Y2) in each quadrant, so 0° < θ < 90° (a sketch of these offset computations follows).
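  • the offset computations above (the Table 2 direction rules and the angle θ of FIG. 14) might be sketched as follows; the function names and the folding of θ into a single quadrant are assumptions of this sketch:

```python
import math

def offset_direction(x1, y1, x2, y2):
    """Direction of the image center (X1, Y1) relative to the reference (X2, Y2)."""
    horiz = "left" if x1 < x2 else "right" if x1 > x2 else ""
    vert = "down" if y1 < y2 else "up" if y1 > y2 else ""  # positive y points up
    return (vert + "-" + horiz).strip("-") or "centered"

def offset_angle_and_distance(x1, y1, x2, y2):
    """Angle θ between the connecting line and the x-axis, folded into one
    quadrant (0° < θ < 90°), plus the offset distance."""
    dx, dy = x1 - x2, y1 - y2
    theta = math.degrees(math.atan2(abs(dy), abs(dx)))
    distance = math.hypot(dx, dy)
    return theta, distance
```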
  • the mobile phone may determine, according to the coordinates of the center point of the first face image or of the first mouth image and the coordinates of the first reference point, the offset distance of the first face image in the video picture displayed in the viewfinder frame. According to the offset distance and the radius of the sound pickup range corresponding to the first face image, the mobile phone can determine whether the sound pickup range corresponding to the first face image exceeds the sound pickup range corresponding to the viewing range, and then determine the first sound pickup range.
  • the large circle 151 is the maximum sound pickup range corresponding to the viewfinder frame, and the radius is R.
  • the first reference point is the center point of the video screen displayed in the viewfinder, that is, the center point of the maximum sound pickup range, with coordinates (X2, Y2), and the coordinates of the center point of the first face image are (X1, Y1).
  • P is the ratio of the area of the first face image to the area of the video image displayed in the viewfinder frame, that is, the area ratio parameter.
  • if r ≥ 1.5S, the radius of the first sound pickup range is equal to the distance between the center point of the first face image and the edge of the maximum sound pickup range. If r < 1.5S, the radius of the first sound pickup range is equal to the product of the radius of the panorama sound pickup range and the area ratio parameter; in this case, the mobile phone will not pick up sounds beyond the maximum sound pickup range. It can be understood that, in the case of r > S, determining the radius of the first sound pickup range by comparing the magnitudes of r and 1.5S is only an exemplary illustration; other methods may also be used to determine the first sound pickup range, so as to ensure that the mobile phone can pick up the audio data corresponding to the first face image. For example, the radius of the first sound pickup range may be determined by comparing the magnitudes of r and 2S (one plausible reading of this rule is sketched below).
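  • a minimal sketch of the radius rule above, under the stated reading: r = R·P is the nominal radius and S the distance from the face center to the edge of the maximum pickup range; the names and the clamping choice are assumptions of this sketch:

```python
import math

def first_pickup_radius(R, P, face_center, ref_center, threshold=1.5):
    """R: radius of the panorama (maximum) pickup range; P: area ratio.
    Assumes the face center lies inside the maximum range."""
    r = R * P  # nominal radius from the area ratio parameter
    d = math.hypot(face_center[0] - ref_center[0],
                   face_center[1] - ref_center[1])
    S = R - d  # distance from the face center to the range edge
    if r > S and r >= threshold * S:
        return S  # clamp so pickup stays inside the maximum range
    return r
```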
  • in the above example, the first face image or the first mouth image is represented by a rectangle, and the geometric center point of the rectangle is used as the corresponding center point. Irregular geometric figures may also be used to correspond to the first face image and the first mouth image, so as to determine the corresponding center point position more accurately; the rectangle in this embodiment of the present application is only an exemplary illustration and is not specifically limited.
  • the mobile phone may determine the first sound pickup range by using any one of the above-mentioned solutions 1 to 3. Alternatively, the mobile phone may determine the first sound pickup range after combining multiple solutions in the above-mentioned solutions 1 to 3. Alternatively, the mobile phone may determine the first sound pickup range by combining one or more parameters in the above solutions 1 to 3 with other parameters. Alternatively, the mobile phone may use other parameters to determine the first sound pickup range.
  • the following introduces a method in which the mobile phone determines the first sound pickup range by combining the above solutions 1 to 3.
  • first, the mobile phone determines the first sound pickup range according to the front/rear attribute parameter of the video picture corresponding to the first face image. For example, if the corresponding video picture is the rear video picture, the first sound pickup range is within the 180-degree range behind the mobile phone, that is, the range represented by the ellipse 161, the ellipse 162, and the ellipse 163.
  • the mobile phone can further determine the first sound pickup range according to the position information corresponding to the first face image.
  • the first face image is the face image on the left, and the center point 164 of the first face image is located at the upper left of the center point 165 of the viewfinder frame.
  • the mobile phone determines that the offset direction is the upper left, and the center point of the first pickup range is located at the upper left of the center point of the rear pickup range.
  • the first sound pickup range can be seen in (b) of FIG. 16B; the ellipse 161 and the ellipse 162 represent the left side of the range.
  • the large circle 166 is the maximum sound pickup range corresponding to the rear video screen, and the corresponding left and right pickup range can be confirmed by dividing the sound pickup range left and right along the center dotted line.
  • the first sound pickup range at the upper left of the rear can refer to the range represented by the left half ellipse 1611 and the left half ellipse 1621 shown in (c) of FIG. 16B .
  • the position information also includes an offset angle and an offset distance. For example, the offset angle is greater than 45 degrees and the offset distance is greater than 1/2 of the radius of the video image displayed in the viewfinder frame; that is, the first face image is located above the center position of the video image displayed in the viewfinder frame and far away from the center position. As shown in (a) of FIG. 16C, the first face image is the left face image, and the offset distance between the center point 166 of the first face image and the center point 167 of the viewfinder frame is relatively large. The middle sound pickup range then contributes little to the audio corresponding to the first face image, and the first sound pickup range can refer to the range represented by the ellipse 161 shown in (b) of FIG. 16C. Further, the first sound pickup range may be the range represented by the left half ellipse 1611 shown in (c) of FIG. 16B.
  • in the above example, the mobile phone determines the sound pickup range based on the front/rear attribute parameter of the video picture corresponding to the first face image and the position information corresponding to the first face image; similarly, the mobile phone may determine the sound pickup range according to the front/rear attribute parameter of the video picture corresponding to the first mouth image and the position information corresponding to the first mouth image.
  • the mobile phone can determine the final first sound pickup range according to the area ratio corresponding to the first face image.
  • the mobile phone can determine the radius of the first sound pickup range corresponding to the first face image through the area ratio and the sound pickup range corresponding to the viewing range.
  • the circle 152 as shown in (a) of FIG. 15 delineates the first sound pickup range.
  • the radius of the circle 152 may be used to correspond to the radius range representing the first sound pickup range.
  • the first sound pickup range can be represented by the range represented by the left half ellipse 1611 shown in (c) of FIG. 16B .
  • the radius of the first sound pickup range is finally determined as the distance between the center point of the first face image and the edge of the maximum sound pickup range.
  • the first sound pickup range can be represented by the range represented by the left half ellipse 1611 and the left half ellipse 1612 shown in (c) of FIG. 16B .
  • in the process in which the mobile phone determines the first sound pickup range by combining the solutions in solutions 1 to 3, there is no restriction on the order in which the parameters are determined, and the mobile phone may determine the parameters in an order different from that in the above examples; for example, all parameters may be determined at the same time.
  • in this way, the first sound pickup range corresponding to the first face image or the first mouth image can be determined, and the first sound pickup range can then be used to acquire audio, thereby improving the audio quality.
  • the mobile phone acquires audio according to the first sound pickup range.
  • the mobile phone may use a single microphone or multiple microphones to collect sound signals from various directions around, that is, to collect panoramic sound signals. After the mobile phone preprocesses the panoramic sound signals collected by the multiple microphones, initial audio data can be obtained, where the initial audio data includes sound information in various directions. Then, the mobile phone can record the audio corresponding to the first face image according to the initial audio data and the first sound pickup range.
  • the mobile phone can enhance the sound within the first sound pickup range in the initial audio data and suppress (or attenuate) the sounds outside the first sound pickup range, and then record the processed audio data to obtain the audio corresponding to the first face image or the first human mouth image.
  • the audio corresponding to the first face image or the first mouth image records the sound within the first sound pickup range, and the first sound pickup range is determined based on the first feature value corresponding to the first face image or the first mouth image, so the sound within the first sound pickup range is the sound of the uttering face or uttering mouth that the user pays attention to. That is to say, the interference of noise in the recorded video picture with the voice made by the uttering face or uttering mouth is reduced.
  • directional voice enhancement can thus be performed in a complex shooting environment, and the audio algorithms only need to enhance part of the audio signals, which simplifies the audio processing algorithms, improves processing efficiency, and reduces the computing performance requirements on the mobile phone hardware (a toy sketch of the enhance/suppress step follows).
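  • a toy sketch of the enhance/suppress step, assuming the preprocessed initial audio is available as per-direction beams; the data layout and gain values are assumptions for illustration only:

```python
import numpy as np

def directional_mix(beams, first_range, boost=2.0, cut=0.25):
    """beams: dict mapping direction angle (deg) to a 1-D numpy sample array.
    first_range: (lo, hi) angle interval of the first pickup range."""
    lo, hi = first_range
    out = np.zeros_like(next(iter(beams.values())), dtype=np.float32)
    for angle, samples in beams.items():
        gain = boost if lo <= angle <= hi else cut  # enhance vs. suppress
        out += gain * samples.astype(np.float32)
    peak = np.max(np.abs(out)) or 1.0
    return out / peak  # normalize to avoid clipping before recording
```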
  • the mobile phone can determine one or more reference first sound pickup ranges in the vicinity of the first sound pickup range.
  • the mobile phone obtains one channel of audio according to the first sound pickup range, and obtains at least one channel of audio according to the reference first sound pickup range, and the mobile phone may also use panoramic audio as one channel of audio.
  • the mobile phone can obtain the multi-channel audio corresponding to the first face image or the first human mouth image based on the first sound pickup range.
  • one channel of audio can be understood as an audio file.
  • the mobile phone may determine the corresponding one or more reference first sound pickup ranges according to the area ratio corresponding to the first face image or the first mouth image; that is, the first sound pickup range and the reference first sound pickup range are determined according to the area ratio parameter information. For example, based on Table 1, the mobile phone can determine the first sound pickup range and the reference first sound pickup range according to the rules in Table 4 below, where the first sound pickup range is a recommended value and the reference first sound pickup range includes enhancement value 1, enhancement value 2, and enhancement value 3 (a hypothetical stand-in for such a lookup is sketched below).
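  • since the actual contents of Table 4 are not reproduced here, the following is a purely hypothetical stand-in showing how an area-ratio-driven lookup of a recommended value plus enhancement values might look; all scale factors below are invented placeholders, not values from the embodiment:

```python
def pickup_candidates(area_ratio, R):
    """R: radius of the panorama pickup range; area_ratio: X / Y."""
    base = R * area_ratio                     # "recommended value"
    return {
        "recommended": base,
        "enhancement_1": base * 0.8,          # slightly tighter range
        "enhancement_2": base * 1.2,          # slightly wider range
        "enhancement_3": min(base * 1.5, R),  # widest, capped at panorama
    }
```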
  • the mobile phone may determine the audio corresponding to the first sound pickup range and the reference first sound pickup range according to different audio processing methods. For example, based on the above process of determining the first sound pickup range, the audio corresponding to the first sound pickup range is the audio determined by the Dolby sound effect algorithm, and the audio corresponding to the reference first sound pickup range is the audio determined according to the Histen sound effect algorithm. As shown in Table 5 below, Algorithm 1-Algorithm 4 are different audio algorithms, and the audio corresponding to the first sound pickup range and the reference first sound pickup range is determined according to different audio algorithms.
  • in Table 5, the first sound pickup range is a recommended value, and the reference first sound pickup range includes enhancement value 1, enhancement value 2, and enhancement value 3.
  • the mobile phone can thus obtain the audio corresponding to the first sound pickup range and the reference first sound pickup range according to the area ratio parameter information corresponding to the first face image or the first mouth image and the audio algorithm, where the first sound pickup range is a recommended value and the reference first sound pickup range includes enhancement value 1, enhancement value 2, and enhancement value 3.
  • the mobile phone may also use other methods to determine the reference first sound pickup range, which is not specifically limited in this embodiment of the present application.
  • the mobile phone can process the initial audio data to enhance the sound within the reference first sound pickup range, suppress the sound outside the reference first sound pickup range, and then record the processed audio data to obtain the first face image or One or more channels of audio corresponding to the first human mouth image.
  • according to the first sound pickup range and the reference first sound pickup range, the mobile phone can record multi-channel audio that matches the first feature value corresponding to the first face image or the first mouth image, for the user to choose to play later.
  • each channel of audio data corresponding to the first face image or the first mouth image may be saved as one audio file, and the first face image may correspond to multiple audio files.
  • the multi-channel audio provides the user with audio in different sound pickup ranges; the more audio channels there are, the greater the possibility of matching the sound corresponding to the first face image or the first mouth image that the user pays attention to, and the more choices the user has for audio playback.
  • the mobile phone may also record audio corresponding to the first face image or the first human mouth image according to the first sound pickup range selected by the user or with reference to the first sound pickup range.
  • if the mobile phone detects that the user clicks the recommended value selection control 171, then in the process of recording the video picture, the audio corresponding to the first face image or the first mouth image is recorded according to the first sound pickup range and the initial audio data.
  • if the mobile phone detects that the user clicks the enhancement value 1 selection control, then in the process of recording the video picture, the audio corresponding to the first face image or the first mouth image is recorded according to the reference first sound pickup range corresponding to enhancement value 1 and the initial audio data.
  • if the mobile phone detects that the user clicks the no-processing selection control 172, then in the process of recording the video picture, the audio signals in all directions are fused according to the initial audio data to obtain panoramic audio. That is, the audio corresponding to the no-processing selection control 172 is panoramic audio, which can also be understood as the audio obtained when the mobile phone is not in the voice enhancement mode.
  • the methods for determining the recommended value, the enhancement value 1, the enhancement value 2 and the enhancement value 3 in the interface 1701 can be referred to as shown in Tables 4 to 6 above, and will not be repeated here.
  • the user may experience the recording effects corresponding to different sound pickup ranges before formally recording the video picture, and then determine the sound pickup range to be selected in the final video picture recording process.
  • the mobile phone can save only the corresponding audio files according to the user's choice. While ensuring that the needs of users are met, the storage space of the mobile phone can be saved.
  • the first sound pickup range may be changed to the second sound pickup range during the process of recording video images by the mobile phone.
  • for example, while recording a video picture, the mobile phone detects an operation of the user instructing to switch between the front and rear cameras.
  • the sound pickup range before switching is the first sound pickup range
  • the sound pickup range after the switch is the second sound pickup range.
  • the audio recorded by the mobile phone includes at least audio of a first duration and audio of a second duration; the audio of the first duration is the audio corresponding to the first sound pickup range, and the audio of the second duration is the audio corresponding to the second sound pickup range.
  • the mobile phone can dynamically determine the sound pickup range based on the change of the voice-emitting face or the voice-emitting mouth in the video screen, and then record the audio according to the pickup range.
  • the audio of the formed video picture may include multiple audios of different durations or the same duration recorded based on the changed sound pickup range according to the time sequence.
  • the mobile phone can always focus on improving the audio recording quality of the part that needs to be enhanced according to the change of the pickup range, thereby ensuring the audio recording effect.
  • when the user plays the video file, the user can be presented with a dynamically changing playback experience, for example, a sound range that matches the change of the video content.
  • the first feature value corresponding to the first face image or the first human mouth image changes, resulting in a change in the sound pickup range.
  • the front and rear attribute parameters of the video picture change, resulting in the change of the first sound pickup range.
  • the interface 1801 displays the front video image.
  • the mobile phone detects that the user clicks the front and rear switch control 181, switches to the rear camera to shoot, and displays the interface 1802 shown in (b) of FIG. 18 .
  • the first feature value corresponding to the first face image or the first human mouth image changes; in the recorded audio, the audio within 00:00-00:15 is the audio corresponding to the first sound pickup range, and the audio after 00:15 is the audio corresponding to the second sound pickup range.
  • the position information corresponding to the first face image or the first human mouth image changes, resulting in a change in the first sound pickup range.
  • the picture range and picture size of the video picture in the viewfinder frame will change with the change of the zoom factor (ie, the Zoom value).
  • the zoom factor may be a preset zoom factor, the last zoom factor used before the camera was turned off, or a zoom factor pre-indicated by the user, and the like.
  • the zoom factor corresponding to the viewfinder frame can also be changed according to the user's instruction. In one scene, as the zoom factor changes, the viewing range changes; correspondingly, the area of the first face image or the area of the first mouth image changes, and therefore the area ratio corresponding to the first face image or the first mouth image changes. That is to say, a change of the zoom factor leads to a change of the sound pickup range. In this way, in the subsequent video playback process, the recorded audio changes dynamically with the change of the display area of the video content, improving the user's playback experience.
  • the mobile phone can determine the sound pickup range corresponding to the viewing range and the sound pickup range corresponding to the area ratio of the first face image or the area ratio of the first mouth image according to the zoom factor.
  • in Table 7, X represents the area of the first face image or the area of the first mouth image, and Y represents the area of the video picture displayed in the viewfinder frame.
  • in some cases, a change of the zoom factor does not require changing the sound pickup range.
  • the first face image does not change, indicating that the content of the user's attention has not changed.
  • user A interviews user B, and uses a mobile phone to photograph the interview process of user B.
  • the mobile phone determines that the first face image in the video picture is the face image of user B.
  • the mobile phone detects that the zoom factor has increased, but at this time, the first face image in the video screen is still the face image of user B. Then, the mobile phone does not need to acquire the first sound pickup range again, so as to reduce the amount of computation and save power consumption.
  • if the mobile phone detects the operation of changing the zoom factor multiple times within a preset time period, it also does not need to change the sound pickup range each time.
  • for example, the preset time period is 2s. After the mobile phone detects the operation of changing the zoom factor for the first time, it does not recalculate the sound pickup range immediately. If the mobile phone does not detect another operation of changing the zoom factor within 2s, it recalculates the sound pickup range. If, within 2s, the mobile phone detects the operation of changing the zoom factor again, it does not recalculate the sound pickup range; it takes the time at which that operation was detected as a new starting point and monitors whether the operation of changing the zoom factor is detected in the next 2s period (a sketch of this debounce logic follows).
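  • a small sketch of this 2s debounce rule; the class and method names are illustrative, not from the embodiment:

```python
import time

class ZoomDebouncer:
    def __init__(self, window_s=2.0):
        self.window_s = window_s
        self.last_change = None  # time of the most recent zoom change

    def on_zoom_changed(self, now=None):
        """Call on every zoom-factor change; restarts the quiet window."""
        self.last_change = now if now is not None else time.monotonic()

    def should_recalculate(self, now=None):
        """True once no zoom change has been seen for a full window."""
        if self.last_change is None:
            return False
        now = now if now is not None else time.monotonic()
        return (now - self.last_change) >= self.window_s
```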
  • the first face image or the first mouth image changes, and the first sound pickup range changes accordingly. The above-mentioned scene of switching between the front and rear cameras can also be understood as a change of the first face image or the first mouth image. For example, a change of the uttering face image or uttering mouth image causes the first face image or the first mouth image to change.
  • for example, the mobile phone first confirms that the first face image consists of the two face images included in the video picture.
  • subsequently, the mobile phone identifies the first face image as the face image 182 on the right side of the video picture. Or, if the shooting picture moves and the currently recorded video picture no longer contains the previously recognized first face image or first mouth image, the above method needs to be used to re-determine the sound pickup range.
  • the second sound pickup range is thereby determined.
  • for example, the mobile phone records video using the first sound pickup range corresponding to the recommended value before the 00:30 mark, and at 00:30 it detects the operation of the user clicking the enhancement value 2 selection control 183.
  • the mobile phone then determines the second sound pickup range as the sound pickup range corresponding to enhancement value 2, displays the interface 1804 shown in (d) of FIG. 18, and subsequently acquires audio according to the second sound pickup range.
  • the mobile phone before generating an audio file of each channel of audio, the mobile phone can perform multiple sound effects processing on each channel of audio, so that the recorded audio can obtain higher audio quality and better audio processing effect.
  • the sound effect processing may include: Dolby sound effects, Histen sound effects, sound retrieval system (SRS) sound effects, bass enhanced engine (BBE) sound effects, dynamic bass enhanced engine (DBEE) sound effects, and the like.
  • in order to prevent frequent changes of the first feature value caused by shaking of the mobile phone from causing frequent changes of the first sound pickup range, the mobile phone can set a preset time threshold, within which the first sound pickup range is not changed. For example, if the first feature value changes twice in a row within 1s, the mobile phone considers that the current change of the first feature value is caused by shaking of the mobile phone, and does not change the corresponding first sound pickup range.
  • the mobile phone may process the audio signal based on the first sound pickup range while collecting the audio signal, so as to obtain the audio corresponding to the first face image or the first human mouth image.
  • the mobile phone may first collect the audio signal, and after the video recording is completed, process the audio signal according to the first sound pickup range to obtain the audio corresponding to the first face image or the first human mouth image.
  • the mobile phone calls the corresponding microphone to collect the audio signal within the first sound pickup range, and obtains the audio corresponding to the first face image or the first mouth image after processing.
  • the recording function may include a single-channel recording function and a multi-channel recording function.
  • the single-channel recording function refers to displaying a viewfinder frame during the shooting process of the mobile phone, which is used for recording a video image of one channel.
  • the multi-channel recording function means that the mobile phone displays at least two viewfinder frames during the shooting process, and each viewfinder frame is used for recording one channel of video images.
  • each channel of video images and the corresponding audio collection method can refer to the implementation method of the single-channel recording function.
  • the description below uses an example in which the shooting interface includes one viewfinder frame.
  • the process corresponding to the multi-channel video recording function including two or more viewfinder frames is similar to this, and will not be described repeatedly.
  • the mobile phone determines the first sound pickup range according to the voice-emitting face image or the voice-emitting mouth image, and then records audio according to the first voice pickup range. Subsequently, the recorded audio needs to be saved, and the user can play the video image and audio of the saved video.
  • if the scene of recording the video picture is a real-time communication scene such as a live broadcast or a video call, the method of recording audio during the recording of the video picture can refer to the above method; however, when an operation of the user instructing to stop the communication is detected, the communication is stopped directly without generating a recorded video. It is understandable that, in some real-time communication scenarios, the user may also choose to save the recorded video.
  • the mobile phone determines whether to save the recorded video in the real-time communication scene.
  • the mobile phone stops recording video images and audio, and generates a video recording.
  • the operation of the user instructing to stop shooting may be the operation of the user clicking the displayed control 45 in the video preview interface 403 shown in (c) of FIG. 4, or another operation; the embodiments of the present application do not specifically limit this.
  • the mobile phone after detecting an operation instructed by the user to stop shooting, the mobile phone generates a video recording and returns to the video recording preview interface or the shooting preview interface.
  • the recorded video may include video images and audio.
  • for the thumbnail of the recorded video generated by the mobile phone, refer to the thumbnail 191 displayed in the interface 1901 shown in (a) of FIG. 19, or the thumbnail 192 displayed in the interface 1902 shown in (b) of FIG. 19.
  • the mobile phone may prompt the user that the recorded video has multiple audio channels.
  • the video thumbnail or the detailed information of the recorded video may include prompt information representing the multiple audio channels; for example, the prompt information may be the multi-speaker mark 193 displayed on the interface 1902 shown in (b) of FIG. 19, a mark in another form, or text information, etc.
  • each channel of audio may respectively correspond to the audio correspondingly collected in the first sound pickup range and the reference first sound pickup range.
  • in response to the operation of the user instructing to stop shooting, the mobile phone displays the interface 1903 shown in (c) of FIG. 19, which is used to prompt the user to save the audio of the desired video file.
  • the video file currently contains audios 194-197, which respectively correspond to audio files recorded in different pickup ranges, or correspond to audio files processed by different audio algorithms in the same pickup range.
  • audios 194-197 correspond to audios with recommended value, enhancement value 1, enhancement value 2, and enhancement value 3, respectively.
  • the mobile phone can play the video file and the corresponding audio.
  • if the mobile phone detects that the user instructs to play the audio 194, it plays the video file and the audio 194. After watching the video file, the user can select the audio with the better effect and save it; in response to the user's selection, the mobile phone determines the audio the user needs to save, improving the user experience and avoiding excessive storage space being occupied by saving too much audio.
  • for example, the user of the current video file chooses to save the audio 194 and the audio 197.
  • the mobile phone completes the saving of the video file, and displays the interface 1902 as shown in (b) of FIG. 19 .
  • the number of speakers in the speaker mark 193 may correspond to the number of audios contained in the current video file.
  • the mobile phone plays the video image and audio of the recorded video.
  • the operation of the user instructing to play the recorded video may be an operation of the user clicking the thumbnail 191 in the recording preview interface shown in (a) of FIG. 19.
  • alternatively, the operation of the user instructing to play the recorded video may be an operation of the user clicking the thumbnail 192 in the gallery shown in (b) of FIG. 19.
  • the mobile phone plays the recorded video according to the video picture and audio recorded in the above-mentioned recording process.
  • the mobile phone can display a video playback interface, and the video playback interface can include recorded video images.
  • the mobile phone can play the audio corresponding to the first sound pickup range by default, and then switch to play other audio according to the user's instruction.
  • the user has selected a specific sound pickup range, and the mobile phone automatically plays the audio corresponding to the sound pickup range selected by the user.
  • the video playback interface may include multiple audio switching controls, and each audio switching control corresponds to a channel of audio. After the mobile phone detects that the user clicks an operation of an audio switching control, the audio of the channel corresponding to the audio switching control is played.
  • the mobile phone may display the video playback interface 2001 shown in (a) of FIG. 20, and the video playback interface 2001 displays a video image. Audio switching controls 201-205 are also displayed on the video playback interface 2001. As shown in (a) of FIG. 20, the mobile phone currently selects the audio switching control 201, that is, the recommended value, by default, and plays the audio corresponding to the first sound pickup range. If the mobile phone detects that the user clicks the audio switching control 203, it can play the audio of the reference first sound pickup range corresponding to the audio switching control 203.
  • the mobile phone may delete part of the audio corresponding to the video file in response to the user's operation.
  • the mobile phone detects that the user has long pressed the audio switching control 205, and displays a deletion prompt box. If the user confirms the deletion, the audio corresponding to the audio switching control 205 is deleted, and the interface 2003 shown in (c) of FIG. 20 is displayed. In the interface 2003, the audio control 205 corresponding to the audio whose deletion has been confirmed by the user is no longer displayed. In this way, during the video playback process, the audio that the user does not want to save can be deleted according to the user's requirements, thereby improving the user experience.
  • the mobile phone may display a video playback interface without playing audio first. After detecting the user's instruction operation, the mobile phone plays the audio indicated by the user.
  • the mobile phone can play the audio corresponding to the first face image or the first human mouth image, so that the played audio can reduce the effect of noise on the sound of the uttering face or the uttering mouth. interference, and the played audio matches the face image that the user is concerned about in real time, improving the user's audio experience.
  • the mobile phone can switch and play the audio corresponding to different sound pickup ranges, providing the user with a variety of audio playback options, realizing the adjustability of the audio, and improving the user's audio playback experience.
  • the mobile phone can play the real-time changing first face image or the first mouth image and the audio corresponding to the first feature value, so that the audio is matched with the changed video image in real time, and the user's audio experience is improved.
  • FIG. 21 is a schematic flowchart of another audio processing method provided by an embodiment of the present application.
  • the audio processing method can be applied to the electronic device 100 shown in FIG. 1 .
  • the electronic device after the electronic device detects an operation instructing the user to turn on the camera, the electronic device starts the camera and displays a shooting preview interface. After that, after detecting an operation instructed by the user to shoot, the video image and the first audio (ie, the initial audio signal) are started to be collected.
  • the image captured by the camera of the electronic device is the initial video image, and after the initial video image is processed, a video image that can be displayed on the display screen is obtained.
  • the step of processing the initial video image is performed by the processor.
  • the video frame captured by the camera is only an exemplary illustration.
  • the electronic device starts the voice enhancement mode in response to the user's operation before or after detecting the operation instructing the user to shoot. Or, after detecting the operation instructing the user to shoot, the electronic device starts the voice enhancement mode.
  • the first audio is audio signals in various directions collected by one or more microphones of the electronic device. Subsequently, the voice-enhanced audio may be obtained based on the first audio.
  • the processor includes a GPU, an NPU, and an AP as an example for illustration. It can be understood that, the steps performed by the GPU, NPU, and AP here may also be performed by other processing units in the processor, which are not limited in this embodiment of the present application.
  • the NPU in the processor uses image recognition technology to recognize whether the video picture contains a face image and/or a human mouth image. Further, the NPU can also confirm the voice-producing face or the voice-producing mouth according to the data of the face image and/or the mouth image, so as to confirm the sound pickup range that needs to perform directional recording.
  • the target image can be used to determine the first feature value of the target image, and then the first sound pickup range can be determined according to the first feature value.
  • the first feature value includes one or more items of pre- and post-position attribute parameters, area ratio, and location information.
  • the front/rear attribute parameter indicates whether the video picture is shot by the front camera or the rear camera; the area ratio indicates the ratio of the area of the target image to the area of the video picture; the position information indicates the position of the target image in the video picture.
  • the first feature value includes pre- and post-position attribute parameters corresponding to the target image. That is to say, the AP in the processor determines whether the video picture where the current target image is located is the front video picture or the rear video picture. If it is a front video image, the first sound pickup range is the sound pickup range on the front camera side. If it is a rear video image, the first sound pickup range is the sound pickup range on the rear camera side.
  • the first feature value includes the area ratio corresponding to the target image.
  • the "area ratio” is used to indicate the ratio of the area of the first face image or the area of the first mouth image to the area of the video screen (for example, expressed by X/Y).
  • the electronic device determines the first feature value according to the ratio of the area of the first face image to the area of the viewfinder.
  • the first feature value includes position information corresponding to the target image.
  • the AP determines the position of the first sound pickup range corresponding to the target image within the sound pickup range of the first audio according to the position information of the target image in the video picture. Specifically, the AP determines a first offset of the center point of the target image relative to the first reference point, where the first reference point is the center point or the focus of the video picture. After that, the AP determines a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, where the second offset is proportional to the first offset, so as to obtain the first sound pickup range.
  • the first offset amount or the second offset amount includes an offset angle and/or an offset distance.
  • for example, the offset of the center of the target image relative to the first reference point includes an offset angle θ1 and an offset distance L1 (a sketch of this proportional mapping follows).
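  • the proportional mapping from the image-center offset to the pickup-range center might be sketched as follows, where the scale factor k relating picture coordinates to the audio pickup range is an assumed parameter of this sketch:

```python
import math

def pickup_center(ref_audio_center, theta1_deg, L1, k):
    """Place the first pickup range's center at angle θ1 and distance
    k * L1 from the center of the first audio's pickup range."""
    theta = math.radians(theta1_deg)
    cx, cy = ref_audio_center
    L2 = k * L1  # second offset proportional to the first
    return (cx + L2 * math.cos(theta), cy + L2 * math.sin(theta))
```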
  • the AP can determine the first sound pickup range by using one or any combination of front and rear attribute parameters, area ratio, and location information.
  • the AP in the processor uses the first audio collected by the one or more microphones to enhance the audio signal within the first sound pickup range and/or attenuate the audio signal outside the first sound pickup range, so as to obtain the audio corresponding to the first face image or the first human mouth image, that is, the second audio.
  • the AP may call the microphone corresponding to the first sound pickup range to enhance the audio signal within the first sound pickup range, so that the volume within the first sound pickup range is greater than the volume outside the first sound pickup range.
  • in a possible design, the electronic device includes one or more microphones, and the one or more microphones are used to collect the first audio; for example, the sound pickup range of a first microphone among the one or more microphones includes part or all of the first sound pickup range.
  • alternatively, the electronic device includes at least two microphones, and the at least two microphones are used to collect the first audio. If the sound pickup range of a second microphone among the at least two microphones does not cover the first sound pickup range, the second microphone is turned off, and the audio collected by the microphones other than the second microphone is the audio corresponding to the first face image or the first human mouth image. Or, the second microphone is turned off, the audio signal within the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone is enhanced, and/or the audio signal outside the first sound pickup range in those sound pickup ranges is attenuated (an illustrative gating sketch follows).
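  • an illustrative microphone-gating sketch for the multi-microphone case above, assuming each microphone advertises the angular interval it covers; the intervals and names are assumptions of this sketch:

```python
def select_microphones(mic_ranges, first_range):
    """mic_ranges: {mic_id: (lo, hi)} pickup intervals in degrees.
    Returns the mics kept on: those overlapping the first pickup range."""
    lo, hi = first_range
    keep = []
    for mic_id, (mlo, mhi) in mic_ranges.items():
        overlaps = not (mhi < lo or mlo > hi)
        if overlaps:
            keep.append(mic_id)  # keep; the others are turned off
    return keep

# Example: mics covering the front-left, front-right, and rear
print(select_microphones({"m0": (0, 90), "m1": (90, 180), "m2": (180, 360)},
                         first_range=(30, 80)))  # -> ["m0"]
```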
  • the AP in the processor generates the recorded video using the obtained video pictures; after detecting an operation instructing to stop shooting, it obtains a recorded video including the second audio and the video images.
  • the recorded video may contain multiple audio files, wherein each audio file contains a channel of audio.
  • the electronic device may determine one or more reference first sound pickup ranges in the vicinity of the first sound pickup range.
  • the electronic device obtains one channel of audio according to the first sound pickup range, and obtains at least one channel of audio according to the reference first sound pickup range, and the electronic device may also use panoramic audio as one channel of audio.
  • the electronic device can obtain the multi-channel audio corresponding to the first face image or the first human mouth image based on the first sound pickup range.
  • one channel of audio can be understood as an audio file.
  • the user can choose to delete part of the audio, and save the audio that he considers the best, so as to improve the user experience and reduce the storage pressure of the memory.
  • Embodiments of the present application further provide an electronic device, including one or more processors and one or more memories.
  • the one or more memories are coupled to the one or more processors and are used to store computer program code, the computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the above related method steps to implement the audio processing method in the above embodiments.
  • An embodiment of the present application further provides a chip system, including: a processor, where the processor is coupled with a memory, the memory is used to store a program or an instruction, and when the program or instruction is executed by the processor, the The chip system implements the method in any of the foregoing method embodiments.
  • the number of processors in the chip system may be one or more.
  • the processor can be implemented by hardware or by software.
  • the processor may be a logic circuit, an integrated circuit, or the like.
  • the processor may be a general-purpose processor implemented by reading software codes stored in memory.
  • the memory may be integrated with the processor, or may be provided separately from the processor, which is not limited in this application.
  • the memory may be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip or provided on different chips.
  • the manner in which the processor is provided is not particularly limited in this application.
  • the system-on-chip may be a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a micro controller unit (MCU), a programmable logic device (PLD), or another integrated chip.
  • each step in the above method embodiments may be implemented by a hardware integrated logic circuit in a processor or an instruction in the form of software.
  • the method steps disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • Embodiments of the present application further provide a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium.
  • when the computer instructions run on a terminal device, the terminal device executes the related method steps above to implement the audio processing method in the foregoing embodiments.
  • Embodiments of the present application further provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the above-mentioned relevant steps, so as to realize the audio processing method in the above-mentioned embodiment.
  • embodiments of the present application further provide an apparatus, which may specifically be a component or a module; the apparatus may include a processor and a memory that are connected, where the memory is used to store computer-executable instructions, and when the apparatus runs, the processor can execute the computer-executable instructions stored in the memory, so that the apparatus performs the audio processing methods in the foregoing method embodiments.
  • the terminal device, computer-readable storage medium, computer program product, or chip provided in the embodiments of the present application is used to execute the corresponding method provided above; therefore, for the beneficial effects it can achieve, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
  • the electronic device includes corresponding hardware and/or software modules for executing each function.
  • in combination with the algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
  • the electronic device can be divided into functional modules according to the above method examples.
  • each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and only a logical function division; there may be other division manners in actual implementation.
  • the disclosed method may be implemented in other manners.
  • the terminal device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, as indirect coupling or communication connections between modules or units, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes any medium that can store program instructions, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

Abstract

The present application provides an audio processing method and an electronic device, relating to the field of electronic technology. By determining the position of the face or mouth of a person speaking in a video picture, and determining from that position the range in which sound pickup needs to be enhanced, directional speech enhancement is achieved, which both simplifies the audio processing algorithm and improves audio quality. The method includes: in the process of capturing a video picture and first audio, recognizing a target image of a sounding object in the video picture; determining, according to the target image, a first sound pickup range corresponding to the sounding object; and determining second audio based on the first audio and the first sound pickup range, where the volume of audio within the first sound pickup range in the second audio is greater than the volume of audio outside the first sound pickup range.

Description

音频处理方法及电子设备
本申请要求于2020年08月26日提交国家知识产权局、申请号为202010868463.5、发明名称为“音频处理方法及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及电子技术领域,尤其涉及一种音频处理方法及电子设备。
背景技术
手机或平板电脑等电子设备广泛应用于视频拍摄领域,如短视频拍摄,网络直播等。在视频拍摄过程中,常常因为拍摄人物的走动或者外界杂音等原因导致收音效果不理想,造成语音质量下降。
为了提高收音效果,在利用电子设备收音的基础上,常常还需要增加外在收音设备,导致用户拍摄难度与成本的提升。此外,还提出一种语音增强方法,在视频拍摄过程中,利用音频算法对电子设备采集到的音频文件进行处理,以去除杂音。但是,由于拍摄环境较为复杂,对音频算法的处理能力要求较为苛刻。并且,复杂的音频处理过程,对电子设备硬件性能的要求也会提升。
发明内容
本申请提供的音频处理方法及电子设备,通过确定视频画面中发声的人的脸或嘴的位置,并根据发声人的脸或嘴的位置确定需要加强拾音的范围,从而实现定向语音增强,既简化音频处理算法,又提高音频质量。
为达到上述目的,本申请采用如下技术方案:
第一方面，本申请提供一种音频处理方法，该方法应用于电子设备，该方法可以包括：检测打开相机应用的第一操作。响应于第一操作，显示拍摄预览界面。检测开始录像的第二操作。响应于第二操作，采集视频画面和第一音频，并显示拍摄界面，拍摄界面包括视频画面的预览界面。识别视频画面中的目标图像，该目标图像为第一人脸图像和/或第一人嘴图像。其中，第一人脸图像为视频图像中的发声对象的人脸图像，第一人嘴图像为视频图像中的发声对象的人嘴图像。之后，根据目标图像，确定发声对象对应的第一拾音范围。根据第一拾音范围和第一音频，获得视频画面对应的第二音频。其中，第二音频中第一拾音范围内的音频音量大于第一拾音范围之外的音频音量。
其中,本申请实施例的方法可以应用于接收用户指示直接启动相机应用的场景。也可以应用于用户开启其他第三方应用(例如短视频应用、直播应用、视频通话应用等),调用启动相机的场景。第一操作或第二操作例如包括触摸操作、按键操作、隔空手势操作或语音操作等。
可选的,在响应于第一操作,显示拍摄预览界面之后,方法还包括:检测启动语音增强模式的第六操作。响应于第六操作,启动语音增强模式。
在一些实施例中,检测到切换至录像功能后,首先询问用户是否开启语音增 强模式。在用户确认开启语音增强模式后,启动语音增强模式。或者,检测到切换至录像功能后,自动启动语音增强模式。在又一些实施例中,检测到切换至录像功能后,先显示录像预览界面,之后检测到用户指示拍摄的操作后,再根据用户指示启动语音增强模式,或者自动启动语音增强模式。
在启动语音增强模式后,电子设备需要对采集到的第一音频进行处理,识别其中发声对象的音频,加强这部分音频,以获得更好的录音效果。其中,第一音频例如为采集到的初始音频信号,第二音频为经过语音增强处理后得到的音频。
可选的,通过人脸图像识别算法识别第一人脸图像或第一人嘴图像。比如,在录制视频画面的过程中,通过人脸图像识别算法确定采集到视频画面中是否包含人脸图像。若包含人脸图像,则识别出其中包含的人脸图像,并根据人脸图像的面部特征数据,如五官数据,面部轮廓数据等在预设时间段内的变化情况确定其是否正在发声。其中,人脸图像正在发声的判断标准包括判断人脸图像当前正在发声。或者,在判断人脸图像第一次发声之后的预设时间段内再次判断人脸图像发声,则确定人脸图像正在发声。可以理解的是,人的发声器官为人嘴,当可以获得发声的人嘴数据时,可以优先确定第一人嘴图像的数据,后续基于第一人嘴图像的数据确定第一拾音范围。需要说明的是,若视频画面中的人正在发声,但未能被识别,则该正在发声的人对应的图像不是目标图像。即目标图像为识别出的发声人脸和/或发声人嘴对应的图像。
如此,通过识别视频画面中发声的目标图像,确定需要增强拾音的第一拾音范围。进而基于采集到的初始音频信号以及第一拾音范围,获得第二音频。使得第二音频中,第一拾音范围内的音频音量大于第一拾音范围以外的音频音量。即增强发声的人的音量,从而提高音频录制效果。
在一种可能的实现方式中,根据目标图像,确定发声对象对应的第一拾音范围包括:根据目标图像,获得第一特征值。其中,第一特征值包括前后置属性参数,面积占比,位置信息中的一项或几项。其中,前后置属性参数,用于表示视频画面为前置摄像头拍摄的视频画面还是后置摄像头拍摄的视频画面。面积占比用于表示目标图像的面积与视频画面的面积的比值。位置信息,用于表示目标图像在视频画面中的位置。之后,根据第一特征值,确定发声对象对应的第一拾音范围。
其中,第一特征值用于描述第一人脸图像对应的真实人物的人脸与电子设备的相对位置关系,或者第一特征值用于描述第一人嘴图像对应的真实人物的人嘴与电子设备的相对位置关系。从而电子设备可以根据第一特征值,确定第一拾音范围。比如,第一人脸图像对应的真实人物位于电子设备的正前方,即第一人脸图像位于拍摄的视频画面的中心位置,则第一拾音范围为电子设备正前方的拾音范围。后续,电子设备获取包含各个方向音频信号的初始音频信号后,可以基于初始音频信号和第一拾音范围获得第一人脸图像对应的音频。
在一些实施例中,在视频画面录制过程中,第一特征值可能会发生变化。那么,第一拾音范围也会随之变化。那么,对于录制的视频中的音频来说,电子设备录制的音频至少包括第一时长音频和第二时长音频。其中,第一时长音频为第 一拾音范围对应的音频,第二时长音频为变化后的拾音范围对应的音频。也就是说,电子设备可以视频画面中发声人脸或发声人嘴的变化,动态确定拾音范围,进而根据拾音范围录制音频。最终检测到用户指示停止录制的操作后,形成的视频画面的音频中可以包含按照根据时间顺序,基于变化的拾音范围录制的不同时长或相同时长的多个音频。
如此,电子设备可以根据拾音范围的变化,始终对焦于提高需要进行语音增强的部分的音频录制质量,从而保证音频录制效果。并且,在用户播放视频文件时,可以向用户展示匹配视频内容变化的声音范围等动态变化的播放体验。
在一种可能的实现方式中,根据第一特征值,确定发声对象对应的第一拾音范围,包括:当视频画面为前置视频画面时,确定第一拾音范围为前置摄像头侧的拾音范围。当视频画面为后置视频画面时,确定第一拾音范围为后置摄像头侧的拾音范围。
示例性的,假设电子设备的拾音范围包括前置180度的拾音范围和后置180度的拾音范围。那么,在确定视频画面为前置视频画面时,则将前置180度的拾音范围作为第一拾音范围。在确定视频画面为后置视频画面时,则将后置180度的拾音范围作为第一拾音范围。进一步的,在视频画面录制过程中,响应于用户切换前后置摄像头的操作,第一拾音范围也会进行前后置切换,从而确保第一拾音范围为视频画面中发声对象对应的拾音范围。
在一种可能的实现方式中,根据第一特征值,确定发声对象对应的第一拾音范围,包括:根据面积占比以及第一音频的拾音范围,确定第一拾音范围。
其中,第一音频的拾音范围例如为全景音频的拾音范围。电子设备在录像过程中,利用麦克风采集各个方向的初始音频信号,即获得全景音频的拾音范围内的初始音频信号。
具体的,用户使用手机拍摄视频画面的过程中,通常会将用户关注的人物置于视频画面中心位置,也就是说,第一人脸图像或第一人嘴图像位于取景框中心位置。不同的第一人脸图像或第一人嘴图像的面积对应的拾音范围不同,可以利用面积占比描述第一拾音范围的大小。如半径,直径,面积等。
示例性的,假设X用于表示第一人脸图像面积或者第一人嘴图像面积。Y用于表示取景框显示的视频画面的面积。N表示取景范围对应的拾音范围。那么,面积占比为X/Y,第一拾音范围为N*X/Y。也就是说,第一拾音范围与全景拾音范围的比值与面积占比成正比。
在一种可能的实现方式中,根据第一特征值,确定发声对象对应的第一拾音范围,包括:根据位置信息,确定第一拾音范围在第一音频的拾音范围中的位置。
在一些场景中,发声对象并不位于视频画面中心位置,则可以根据位置信息,获得发声对象对应的图像(即目标图像)在视频画面中的位置。可以理解的是,目标图像在视频画面中的位置与第一拾音范围在全景拾音范围中的位置,两者具有对应关系。
在一种可能的实现方式中,位置信息包括目标图像的中心点相对于第一参考点的第一偏移量,第一参考点为视频画面的中心点或对焦的焦点。根据位置信息, 确定第一拾音范围在第一音频的拾音范围中的位置,包括:根据第一偏移量,确定第一拾音范围的中心点相对于第一音频的拾音范围的中心点的第二偏移量,第二偏移量与第一偏移量成正比。之后,根据第二偏移量,确定第一拾音范围在第一音频的拾音范围中的位置。
其中,偏移量例如包括偏移方向,和/或偏移角度,和/或偏移距离等。偏移方向是指第一人脸图像或第一人嘴图像的中心点相对于第一参考点向左偏移,向右偏移,向上偏移,向下偏移,向左上偏移,向右上偏移,向左下偏移或者向右下偏移等。偏移角度是指向左上偏移,向右上偏移,向左下偏移或者向右下偏移的角度。偏移距离是指向左偏移,向右偏移,向上偏移,向下偏移的距离,或者某个偏移角度上偏移的距离等。
示例性的,以第一参考点为原点,平行于手机底边(或当前取景框的底边)为x轴,垂直于x轴的方向为y构建坐标系,并且当前坐标系平行于手机显示屏。利用构建的坐标系定义第一人脸图像或第一人嘴图像的中心点相对于第一参考点的偏移方向,偏移角度和偏移距离。比如,目标图像的位置信息为取景框中心点左下方,则第一拾音范围在全景拾音范围中,且第一拾音范围的中心点在全景拾音范围中心点左下方。
在一种可能的实现方式中，视频画面的中心点为取景框的中心点，或者视频画面的中心点为显示屏的中心点。
其中,在有些场景中,将取景框的中心点作为第一参考点,即利用取景框中心点表示视频画面的中心点。可以理解的是,基于视频画面的显示形式,第一参考点也可以用其他形式表示。比如,将手机显示屏的全部屏幕的中心点用于表示视频画面的中心点,即作为第一参考点。
在一种可能的实现方式中,根据第一拾音范围和第一音频,获得视频画面对应的第二音频包括:增强第一音频中在第一拾音范围以内的音频信号,和/或削弱第一音频中在第一拾音范围以外的音频信号,获得第二音频。
示例性的,第一音频包括各个方向的音频信号,在确定发声对象对应的第一拾音范围之后,通过增强第一拾音范围内的音频信号,以提高录制的视频中音频质量。可选的,进一步削弱拾音范围外的音频信号,以减小外界杂音的干扰,并在音频中更加突出发声对象发出的声音。
在一种可能的实现方式中,电子设备包含一个或多个麦克风,一个或多个麦克风用于采集第一音频。根据第一拾音范围和第一音频,获得视频画面对应的第二音频,包括:当一个或多个麦克风中第一麦克风的拾音范围内包含第一拾音范围的部分或全部时,执行以下至少一个操作得到第二音频:增强第一麦克风的拾音范围中第一拾音范围内的音频信号;削弱第一麦克风的拾音范围中第一拾音范围外的音频信号;削弱一个或多个麦克风中除第一麦克风外的其他麦克风的音频信号。
示例性的,手机配置有麦克风1和麦克风2。第一拾音范围在麦克风1的拾音范围以内,则手机在利用麦克风1和麦克风2获取到初始音频信号后,可以增强该初始音频信号中麦克风1采集的第一拾音范围内的音频信号,同时削弱该初始 音频信号中麦克风1采集的第一拾音范围以外的音频信号,以及削弱麦克风2采集的音频信号,获取第一人脸图像或第一人嘴图像对应的音频。又比如,手机配置有麦克风1和麦克风2。第一拾音范围包括麦克风1的拾音范围以内的拾音范围1,以及麦克风2的拾音范围以内的拾音范围2。也就是说,第一拾音范围为拾音范围1和拾音范围2的并集。那么,手机在利用麦克风1和麦克风2获取到初始音频信号后,可以增强初始音频信号中麦克风1的拾音范围1以及麦克风2的拾音范围2以内的音频信号,削弱初始音频信号中剩余的音频信号,获取第一人脸图像或第一人嘴图像对应的音频。可以理解的是,拾音范围1和拾音范围2可以全部或部分重叠。
在一种可能的实现方式中,电子设备包含至少两个麦克风,至少两个麦克风用于采集第一音频。根据第一拾音范围和第一音频,获得视频画面对应的第二音频,包括:当至少两个麦克风中第二麦克风的拾音范围不包含第一拾音范围时,关闭第二麦克风,至少两个麦克风中除第二麦克风外的其他麦克风采集的音频为第二音频。
示例性的,手机配置有麦克风1和麦克风2。第一拾音范围在麦克风1的拾音范围以内,在麦克风2的拾音范围以外。那么,手机关闭麦克风2,将麦克风1采集的音频信号处理后作为视频画面对应的音频,即第一人脸图像或第一人嘴图像对应的音频为麦克风1采集的音频。
在一种可能的实现方式中,在关闭第二麦克风时,方法还包括:增强至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围内的音频信号,和/或削弱至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围外的音频信号。
示例性的,手机配置有麦克风1和麦克风2。第一拾音范围在麦克风1的拾音范围以内,在麦克风2的拾音范围以外。那么,手机关闭麦克风2,将麦克风1采集的音频信号中第一拾音范围内的音频信号增强,第一拾音范围以外的音频信号削弱后,获取第一人脸图像或第一人嘴图像对应的音频。
在一种可能的实现方式中,第一人脸图像的数量为一个或多个,第一人嘴的数量为一个或多个。
其中,视频画面中正在发声的人物可以为一个或多个,那么第一人脸图像的数量为一个或多个,第一人嘴图像的数量为一个或多个。可以理解的是,若当前拍摄的视频画面中,某些人物正在发声,但手机未能识别其正在发声,则未能识别的发声的人物的人脸图像或人嘴图像不划分为上述的第一人脸图像或第一人嘴图像。
在一些实施例中,若第一人脸图像或第一人嘴图像的数量为多个。那么,在确定第一特征值的过程中,需要基于多张第一人脸图像或多张第一人嘴图像确定第一特征值。比如,在确定面积占比的过程中,将多张第一人脸图像的面积和与视频画面的面积的比值,作为目标图像的面积占比。又比如,在确定位置信息的过程中,将多张第一人脸图像所在的占位框的中心点相对于视频画面的中心点的偏移量,作为目标图像的位置信息。其中,多张第一人脸图像所在的占位框用于 表示包含该多张人脸图像的最小选框。
在一种可能的实现方式中,在响应于第二操作,采集视频画面和第一音频,并显示拍摄界面之后,方法还包括:检测停止拍摄的第三操作。响应于第三操作,停止录制并生成录像视频;录像视频包括视频画面,以及第二音频。检测播放录像视频的第四操作。响应于第四操作,显示视频播放界面,播放视频画面,以及第二音频。
在一些实施例中,电子设备在录制视频画面的过程中,根据发声人脸图像或发声人嘴图像,确定第一拾音范围,进而根据第一拾音范围录制音频。后续,需要对录制的音频进行保存,用户可以播放已保存的录像的视频画面和音频。
需要说明的是,若录制视频画面的场景为直播,视频通话等实时通信场景,则其录制视频画面过程中,录制音频的方法可以参考上述方法,但是在检测到用户指示停止拍摄的操作即为停止通信的操作后,直接停止通信,不必生成录像视频。可以理解的是,某些实时通信场景中,用户也可以选择保存录像视频。电子设备响应于用户的操作,确定是否保存实时通信场景中的录像视频。
在一种可能的实现方式中,录像视频还包括第三音频,第三音频为根据第二拾音范围确定的音频,第二拾音范围为根据第一拾音范围确定,且与第一拾音范围不同的拾音范围;视频播放界面包括第一控件和第二控件,第一控件对应第二音频,第二控件对应第三音频。
在一些实施例中,由于电子设备根据第一特征值确定的第一拾音范围,与第一人脸图像或第一人嘴图像的显示范围可能存在一定的误差,因而电子设备可以在第一拾音范围附近确定一个或多个参考第一拾音范围。其中,电子设备根据第一拾音范围获得一路音频,根据参考第一拾音范围获得至少一路音频,电子设备还可以将全景音频作为一路音频。那么,电子设备基于第一拾音范围可以获得第一人脸图像或第一人嘴图像对应的多路音频。其中,一路音频可以理解为一个音频文件。
可选的,录像功能可以包括单路录像功能和多路录像功能。其中,单路录像功能是指在电子设备拍摄过程中显示一个取景框,用于录制的一路视频画面。多路录像功能是指电子设备在拍摄过程中显示至少两个取景框,每一取景框用于一路视频画面。其中,使用多路录像功能的过程中,每一路视频画面及对应的音频采集方式均可以参照单路录像功能的实现方式。
如此,电子设备可以切换播放不同拾音范围对应的音频,给用户以多种音频播放选择,实现了音频的可调节性,可以提高用户音频播放体验。
在一种可能的实现方式中,该方法还包括:响应于第四操作,播放视频画面和第二音频。第四操作包括操作播放控件的操作或操作第一控件的操作。检测操作第二控件的第五操作。响应于第五操作,播放视频画面和第三音频。
在另一种可能的实现方式中,在视频回放时,电子设备可以显示视频播放界面,且先不播放音频。电子设备在检测到用户的指示操作后,播放用户指示的音频。
在一种可能的实现方式中,该方法还包括:响应于删除第二音频或第三音频 的操作,删除第二音频或第三音频。
如此,能够实现在视频回放过程中,根据用户需求删除用户不想保存的音频,提高用户使用体验。
第二方面,本申请提供一种电子设备,该电子设备包括:处理器,存储器,麦克风,摄像头和显示屏,存储器、麦克风、摄像头、显示屏与处理器耦合,存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当处理器从存储器中读取计算机指令,使得电子设备执行如下操作:检测打开相机应用的第一操作。响应于第一操作,显示拍摄预览界面。检测开始录像的第二操作。响应于第二操作,采集视频画面和第一音频,并显示拍摄界面,拍摄界面包括视频画面的预览界面。识别视频画面中的目标图像,目标图像为第一人脸图像和/或第一人嘴图像;其中,第一人脸图像为视频图像中的发声对象的人脸图像,第一人嘴图像为视频图像中的发声对象的人嘴图像。根据目标图像,确定发声对象对应的第一拾音范围。根据第一拾音范围和第一音频,获得视频画面对应的第二音频,第二音频中第一拾音范围内的音频音量大于第一拾音范围之外的音频音量。
在一种可能的实现方式中,根据目标图像,确定发声对象对应的第一拾音范围;包括:根据目标图像,获得第一特征值;其中,第一特征值包括前后置属性参数,面积占比,位置信息中的一项或几项;其中,前后置属性参数,用于表示视频画面为前置摄像头拍摄的视频画面还是后置摄像头拍摄的视频画面;面积占比用于表示目标图像的面积与视频画面的面积的比值;位置信息,用于表示目标图像在视频画面中的位置。根据第一特征值,确定发声对象对应的第一拾音范围。
在一种可能的实现方式中,根据第一特征值,确定发声对象对应的第一拾音范围,包括:当视频画面为前置视频画面时,确定第一拾音范围为前置摄像头侧的拾音范围。当视频画面为后置视频画面时,确定第一拾音范围为后置摄像头侧的拾音范围。
在一种可能的实现方式中,根据第一特征值,确定发声对象对应的第一拾音范围,包括:根据面积占比以及第一音频的拾音范围,确定第一拾音范围。
在一种可能的实现方式中,根据第一特征值,确定发声对象对应的第一拾音范围,包括:根据位置信息,确定第一拾音范围在第一音频的拾音范围中的位置。
在一种可能的实现方式中,位置信息包括目标图像的中心点相对于第一参考点的第一偏移量,第一参考点为视频画面的中心点或对焦的焦点。根据位置信息,确定第一拾音范围在第一音频的拾音范围中的位置,包括:根据第一偏移量,确定第一拾音范围的中心点相对于第一音频的拾音范围的中心点的第二偏移量,第二偏移量与第一偏移量成正比。根据第二偏移量,确定第一拾音范围在第一音频的拾音范围中的位置。
在一种可能的实现方式中，视频画面的中心点为取景框的中心点，或者视频画面的中心点为显示屏的中心点。
在一种可能的实现方式中,根据第一拾音范围和第一音频,获得视频画面对应的第二音频;包括:增强第一音频中在第一拾音范围以内的音频信号,和/或削弱第一音频中在第一拾音范围以外的音频信号,获得第二音频。
在一种可能的实现方式中,电子设备包含一个或多个麦克风,一个或多个麦克 风用于采集第一音频。根据第一拾音范围和第一音频,获得视频画面对应的第二音频,包括:当一个或多个麦克风中第一麦克风的拾音范围内包含第一拾音范围的部分或全部时,执行以下至少一个操作得到第二音频:增强第一麦克风的拾音范围中第一拾音范围内的音频信号;削弱第一麦克风的拾音范围中第一拾音范围外的音频信号;削弱一个或多个麦克风中除第一麦克风外的其他麦克风的音频信号。
在一种可能的实现方式中,电子设备包含至少两个麦克风,至少两个麦克风用于采集第一音频。根据第一拾音范围和第一音频,获得视频画面对应的第二音频,包括:当至少两个麦克风中第二麦克风的拾音范围不包含第一拾音范围时,关闭第二麦克风,至少两个麦克风中除第二麦克风外的其他麦克风采集的音频为第二音频。
在一种可能的实现方式中,在关闭第二麦克风时,当处理器从存储器中读取计算机指令,还使得电子设备执行如下操作:增强至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围内的音频信号,和/或削弱至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围外的音频信号。
在一种可能的实现方式中,第一人脸图像的数量为一个或多个,第一人嘴的数量为一个或多个。
在一种可能的实现方式中,当处理器从存储器中读取计算机指令,还使得电子设备执行如下操作:检测停止拍摄的第三操作。响应于第三操作,停止录制并生成录像视频;录像视频包括视频画面,以及第二音频。检测播放录像视频的第四操作。响应于第四操作,显示视频播放界面,播放视频画面,以及第二音频。
在一种可能的实现方式中,录像视频还包括第三音频,第三音频为根据第二拾音范围确定的音频,第二拾音范围为根据第一拾音范围确定,且与第一拾音范围不同的拾音范围;视频播放界面包括第一控件和第二控件,第一控件对应第二音频,第二控件对应第三音频。
在一种可能的实现方式中,当处理器从存储器中读取计算机指令,还使得电子设备执行如下操作。响应于第四操作,播放视频画面和第二音频;第四操作包括操作播放控件的操作或操作第一控件的操作。检测操作第二控件的第五操作。响应于第五操作,播放视频画面和第三音频。
在一种可能的实现方式中,当处理器从存储器中读取计算机指令,还使得电子设备执行如下操作:响应于删除第二音频或第三音频的操作,删除第二音频或第三音频。
在一种可能的实现方式中,当处理器从存储器中读取计算机指令,还使得电子设备执行如下操作:检测启动语音增强模式的第六操作。响应于第六操作,启动语音增强模式。
此外,第二方面所述的电子设备的技术效果可以参考第一方面所述的音频处理方法的技术效果,此处不再赘述。
第三方面,本申请提供一种电子设备,该电子设备具有实现如上述第一方面及其中任一种可能的实现方式中所述的音频处理方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第四方面,本申请提供一种计算机可读存储介质,包括计算机指令,当计算机指令在电子设备上运行时,使得电子设备执行如第一方面及其中任一种可能的实现方式中任一项所述的音频处理方法。
第五方面,本申请提供一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备执行如第一方面及其中任一种可能的实现方式中任一项所述的音频处理方法。
第六方面,提供一种电路系统,电路系统包括处理电路,处理电路被配置为执行如上述第一方面及其中任一种可能的实现方式中所述的音频处理方法。
第七方面,本申请实施例提供一种芯片系统,包括至少一个处理器和至少一个接口电路,至少一个接口电路用于执行收发功能,并将指令发送给至少一个处理器,当至少一个处理器执行指令时,至少一个处理器执行如上述第一方面及其中任一种可能的实现方式中所述的音频处理方法。
附图说明
图1为本申请实施例提供的电子设备的结构示意图;
图2A为本申请实施例提供的摄像头的布局示意图;
图2B为本申请实施例提供的麦克风的布局示意图;
图3为本申请实施例提供的电子设备的软件结构框图示意图;
图4为本申请实施例提供的一组界面示意图一;
图5为本申请实施例提供的拾音范围示意图一;
图6为本申请实施例提供的音频处理方法流程示意图一;
图7为本申请实施例提供的界面示意图一;
图8为本申请实施例提供的一组界面示意图二;
图9为本申请实施例提供的拾音范围示意图二;
图10为本申请实施例提供的一组界面示意图三;
图11为本申请实施例提供的一组界面示意图四;
图12为本申请实施例提供的一组界面示意图五;
图13为本申请实施例提供的坐标系示意图;
图14为本申请实施例提供的偏移角度示意图;
图15为本申请实施例提供的偏移距离示意图;
图16A为本申请实施例提供的第一拾音范围示意图一;
图16B为本申请实施例提供的第一拾音范围示意图二;
图16C为本申请实施例提供的第一拾音范围示意图三;
图17为本申请实施例提供的界面示意图二;
图18为本申请实施例提供的一组界面示意图六;
图19为本申请实施例提供的一组界面示意图七;
图20为本申请实施例提供的一组界面示意图八;
图21为本申请实施例提供的音频处理方法流程示意图二。
具体实施方式
下面结合附图对本申请实施例提供的音频处理方法及电子设备进行详细地描 述。
本申请实施例提供的音频处理方法,可以应用于电子设备。例如,该电子设备具体可以是手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)、人工智能(artificial intelligence)设备或专门的照相机(例如单反相机、卡片式相机)等,本申请实施例对电子设备的具体类型不作任何限制。
示例性的,图1示出了电子设备100的一种结构示意图。电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在本申请的一些实施例中,处理器110对采集到视频画面中多帧图像进行图像识别,获得各帧图像中包含的人脸图像和/或人嘴图像数据。通过对比各帧图像中人脸图像数据和/或人嘴图像数据的变化,如上下嘴唇间距的变化,面部轮廓的变化等,确定出各帧图像中(即视频画面中)发声人脸和/或嘴的位置、占比等信息。进一步的,根据视频画面中发声人脸和/或嘴的位置、占比等信息确定待加强的拾音范围,即确定发声人的声音在全景音频中的位置区域。通过增强拾音范围内的音频信号,以提高录制的视频中音频质量。可选的,进一步削弱拾音范围外的音频信号,以减小外界杂音的干扰。
充电管理模块140用于从充电器接收充电输入。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,显示屏194, 摄像头193等供电。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT)等无线通信的解决方案。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
在一些实施例中,显示屏194可以显示录像模式下的拍摄预览界面、录像预览界面和拍摄界面,还可以在视频回放时显示视频播放界面等。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。例如,在本申请的实施例中,ISP可以根据拍摄参数控制感光元件进行曝光和拍照。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。
在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。其中,摄像头193可以位于电子设备的边缘区域,可以为屏下摄像头,也可以是可升降的摄像头。摄像头193可以包括后置摄像头,还可以包括前置摄像头。本申请实施例对摄像头193的具体位置和形态不予限定。
示例性的,电子设备100上摄像头的布局可以参见图2A,其中,电子设备100正面为显示屏194所在的平面。如图2A中(a)所示,摄像头1931位于电子设备100正面,则摄像头为前置摄像头。如图2A中(b)所示,摄像头1932位于电子设备100背面,则摄像头为后置摄像头。
可选的,本申请实施例的方案可以应用于具有多个显示屏的折叠屏(即显示屏194能够折叠)的电子设备100上。如图2A中(c)所示的折叠屏电子设备100。 响应于用户的操作,如图2A中(d)所示,沿折叠边向内折叠(或向外折叠)显示屏,使得显示屏形成至少两个屏(例如A屏和B屏)。如图2A中(e)所示,在折叠的外侧有显示屏(例如C屏)。若电子设备100在C屏所在表面设置有摄像头。那么,在如图2A中(c)所示的电子设备100未折叠场景中,C屏上的摄像头在电子设备100的背面,可以视为后置摄像头。在如图2A中(e)所示的电子设备100已折叠场景中,C屏上的摄像头变为在电子设备100的正面,可以视为前置摄像头。也就是说,本申请中前置摄像头和后置摄像头并不对摄像头本身的性质进行限制,仅为一种位置关系的说明。
由此,电子设备100可以根据使用的摄像头在电子设备100上的位置,确定摄像头为前置摄像头或后置摄像头,进而确定采集声音的方向。比如,当前电子设备100通过位于电子设备100背面的后置摄像头采集图像,则电子设备100需要重点采集电子设备100背面的声音。又比如,当前电子设备100通过位于电子设备100正面的前置摄像头采集图像,则电子设备100需要重点采集电子设备100正面的声音。如此,确保采集到的声音能够与采集到的图像相匹配。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
在一些实施例中,NPU利用图像识别技术,识别摄像头193采集到的图像中是否包含人脸图像和/或人嘴图像。进一步的,NPU还可以根据人脸图像和/或人嘴图像的数据,确认其中的发声人脸或发声人嘴,从而确认需要进行定向录音的拾音范围。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行电子设备100的各种功能应用以及数据处理。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频数据转换成模拟音频电信号输出,也用于将模拟音频电信号输入转换为数字音频数据,音频模块170可以包括模/数转换器和数 /模转换器。例如,音频模块170用于将麦克风170C输出的模拟音频电信号转换为数字音频数据。音频模块170还可以用于对音频数据进行编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将模拟音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将模拟音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为模拟音频电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。其中,该麦克风170C可以是电子设备100的内置部件,也可以是电子设备100的外接配件。
在一些实施例中,电子设备100可以包括一个或多个麦克风170C,其中每一麦克风或多个麦克风合作可以实现采集各个方向的声音信号,并将采集到的声音信号转换为模拟音频电信号的功能,还可以实现降噪,识别声音来源,或定向录音功能等。
示例性的,如图2B所示,示例性给出了两种电子设备100上多个麦克风的布局的示意以及各个麦克风对应的拾音范围。如图2B中(a)所示,当电子设备100如图中所示的位置放置时,电子设备100的正面为显示屏194所在的平面,麦克风21位于电子设备100顶部(通常为听筒、摄像头所在一侧),麦克风22位于电子设备100右侧,麦克风23位于电子设备100的底部(图2B中(a)所示电子设备100当前角度底部部分不可见,用虚线示意性表示麦克风23位置)。
需要说明的是,后续实施例中所描述的“上”,“下”,“左”和“右”均参考图2B所示的方位,后续不再赘述。
如图2B中(b)所示的拾音范围示意图,麦克风21对应的拾音范围包括前置上方拾音范围和后置上方拾音范围,麦克风22对应的拾音范围包括前置中间拾音范围和后置中间拾音范围,麦克风23对应的拾音范围包括前置下方拾音范围和后置下方拾音范围。麦克风21-23的组合可以采集电子设备100周围各个方向的声音信号。其中,可以根据前置摄像头对应前置拾音范围,后置摄像头对应后置拾音范围。那么,当电子设备100利用前置摄像头录制视频时,则确定拾音范围为前置拾音范围。进一步的,再根据发声人脸或发声人嘴在视频画面中的位置,更加精准的确定拾音范围为前置拾音范围中包含的某个范围。具体方法见下文详细描述。
可以理解的是,电子设备100还可以包括更多数量的麦克风,如图2B中(c)所示,电子设备100包括6个麦克风。其中,麦克风24位于电子设备100顶部,麦克风25位于电子设备100的左侧,麦克风26位于电子设备100的底部,麦克风27-29位于电子设备100右侧。图2B中(c)所示电子设备100当前角度左侧部分不可见,用虚线示意性表示麦克风25和麦克风26的位置。如图2B中(d)所示的拾音范围示意图,麦克风24对应的拾音范围包括前置上方拾音范围,麦克 风25对应的拾音范围包括前置中间拾音范围,麦克风26对应的拾音范围包括前置下方拾音范围,麦克风27对应的拾音范围包括后置上方拾音范围,麦克风28对应的拾音范围包括后置中间拾音范围,麦克风29对应的拾音范围包括后置下方拾音范围。麦克风24-29的组合可以采集电子设备100周围各个方向的声音信号。
其中,如图2B中(b)和(d)所示,电子设备100各个麦克风采集音频信号的拾音范围存在部分重叠,即图2B中(b)和(d)中的阴影部分。在音频录制过程中,需要对重叠部分的音频信号进行融合处理,对于同一方向来说,某个麦克风采集到的声音信号的音质可能较好(例如信噪比较高,尖峰噪声和毛刺噪声较少等),而另一个麦克风采集到的声音信号的音质可能较差。则选取对应的方向上音质较好的音频数据进行融合处理,根据处理后的音频数据录制生成效果较好的音频。进一步的,若发声人脸或发声人嘴对应的拾音范围位于多个麦克风的拾音范围以内,则可以融合多个麦克风采集的音频数据,获得发声人脸或发声人嘴对应的音频。
在一些实施例中,该麦克风170C可以是指向性麦克风,可以针对特定方向采集声音信号。该麦克风170C还可以是非向性麦克风,实现采集各个方向上的声音信号,或者可以根据其在电子设备100上的位置,采集一定范围内的声音信号。
在另一些实施例中,麦克风170C可旋转,电子设备100可以通过旋转麦克风来调整拾音方向,针对发声的人脸或人嘴对应的拾音范围,电子设备100可以配置一个麦克风170C,通过旋转该麦克风实现对各个方向进行拾音。在电子设备100配置多个麦克风170C的情况下,可以通过不同麦克风170C的组合来拾取相应拾音范围内的音频信号。比如,可以使用其中的部分麦克风170C进行拾音,而不需要使用电子设备100全部的麦克风170C。又比如,增强部分麦克风170C采集的音频信号,削弱部分麦克风170C采集的音频信号。
本申请实施例对麦克风170C的数量不做具体限制。
其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
其中，距离传感器180F用于测量距离。电子设备100可以通过红外或激光测量距离。在一些实施例中，在拍摄场景中，电子设备100可以利用距离传感器180F测距以实现快速对焦。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。
例如,在本申请的实施例中,电子设备100可以通过触摸传感器180K检测用户指示开始和/或停止录像的操作。
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以 以硬件,软件或软件和硬件的组合实现。
电子设备100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本发明实施例以分层架构的Android系统为例,示例性说明电子设备100的软件结构。
图3是本发明实施例的电子设备100的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将电子设备的操作系统(例如Android系统)分为四层,从下至上分别为内核层,硬件抽象层(hardware abstract layer,HAL),应用程序框架层,以及应用程序层。
内核层是硬件和软件之间的层。内核层至少包含摄像头驱动,音频驱动,显示驱动,传感器驱动。
在一些实施例中,如在录像应用场景中,触摸传感器180K将接收的触摸操作,通过内核层的传感器驱动传至上层的相机应用。由相机应用识别出该触摸操作为开始录制视频的操作后,相机应用通过摄像头驱动调用摄像头193录制视频画面,并通过音频驱动调用麦克风170C录制音频。在上述过程中,相应的硬件中断被发给内核层,并且内核层可以将对应的操作加工成原始输入事件(例如触摸操作包括触摸坐标,触摸操作的时间戳等信息)。原始输入事件被存储在内核层。
硬件抽象层(hardware abstract layer,HAL)位于内核层和应用程序框架层之间,用于定义驱动应用程序硬件实现的接口,将驱动硬件实现的值转化为软件实现程序语言。例如识别摄像头驱动的值,将其转化为软件程序语言上传至应用程序框架层,进而实现调用相机服务系统。
在一些实施例中,HAL可以将摄像头193采集到的视频画面,进行人脸图像识别后的原始数据上传至应用程序框架层进行进一步的处理。其中,人脸图像识别后的原始数据例如可以包括人脸图像数据和/或人嘴图像数据等。其中,人脸图像数据可以包括发声人脸图像的数量,发声人脸图像在视频画面中的位置信息等;人嘴图像数据可以包括发声人嘴图像的数量,发声人嘴图像在视频画面中的位置信息等。
示例性的,预设人脸图像数据和人嘴图像数据的优先级顺序。其中人的发声器官为人嘴,通过发声人嘴数据可以更加精准的确定拾音范围,因此设置人嘴图像数据的优先级顺序高于人脸图像数据的优先级顺序。比如,HAL根据采集到的视频画面,可以确定其中的发声人脸图像数据和发声人嘴图像数据,则根据优先级顺序,将发声人嘴数据作为原始数据上传。后续音频处理系统基于发声人嘴图像数据,根据视频画面与全景音频的对应关系,确定发声人嘴图像对应的拾音范围。又比如,HAL根据采集到的视频画面,只确定其中的发声人脸图像数据,则将发声人脸图像数据作为原始数据上传,用于确定发声人脸图像对应的拾音范围。再比如,HAL根据视频画面,只确定其中的发声人嘴图像数据,则将发声人嘴图像数据作为原始数据上传,用于确定发声人嘴图像对应的拾音范围。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层从内核层经由HAL获 取原始输入事件,识别该输入事件所对应的控件。应用程序框架层包括一些预先定义的函数。
如图3所示,应用程序框架层可以包括相机服务系统,音频处理系统,视图系统,电话管理器,资源管理器,通知管理器,窗口管理器等。
相机服务系统服务于相机应用,用于基于内核层输入的原始事件调用相机应用采集图像。
音频处理系统,用于管理音频数据,利用不同的音频算法处理音频数据。例如,配合相机服务系统,在录像过程中,对采集到的音频信号进行处理。例如,基于人脸图像数据,确定拾音范围,加强拾音范围以内的音频信号,削弱拾音范围以外的音频信号。
在一些实施例中，相机应用调用应用框架层的相机服务系统，启动相机应用。进而通过调用内核层启动摄像头驱动，通过摄像头193捕获视频。并调用音频处理系统，通过内核层启动音频驱动，通过麦克风170C采集声音信号，并生成模拟音频电信号，以及通过音频模块170将模拟音频电信号生成数字音频数据，并根据数字音频数据生成音频。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供电子设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
应用程序层可以包括一系列应用程序包。
如图3所示,应用程序包可以包括相机,视频,通话,WLAN,音乐,短信息,蓝牙,地图,日历,图库,导航等应用程序。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
以下将以电子设备为具有图1和图3所示结构的手机为例,对本申请实施例提供的音频处理方法进行阐述。
在一些实施例中,本申请实施例的方法可以应用于接收用户指示直接启动相 机应用(以下也可简称为相机)的场景。也可以应用于用户开启其他第三方应用(例如短视频应用、直播应用、视频通话应用等),调用启动相机的场景。
以下以直接启动相机的场景为例进行示例性说明。
可选的,用户可以通过触摸操作、按键操作、隔空手势操作或语音操作等方式,指示手机启动相机,并显示拍摄预览界面。示例性的,如图4中(a)所示的主界面401,手机响应于用户点击相机图标41的操作,启动相机,并显示图4中(b)所示的拍摄预览界面402。或者,手机响应于到用户打开相机的语音指示操作,启动相机,并显示图4中(b)所示的拍摄预览界面402。其中,控件421用于对手机拍摄功能进行设置,如延时拍摄等。控件422用于开启或关闭滤镜功能。控件423用于开启或关闭闪光灯功能。
其中,在拍摄预览界面,相机能够响应于用户点击不同功能控件的操作,切换不同的功能。比如,图4中(b)所示,控件431-434用于切换相机可实现的功能。如当前已选中控件432,启动拍照功能。又如,响应于用户点击控件431,切换至人像拍摄功能。或者,响应于用户点击控件433的操作,切换至录像功能。又或者,响应于用户点击控件434的操作,显示相机可切换的更多功能,如全景拍摄等。
以下以手机启动录像功能,录制视频画面以及音频为例进行说明。
一般的,手机启动相机后默认打开拍照功能,在检测到切换功能的操作后,如检测到点击录像控件的操作,启动录像功能,并显示录像预览界面。示例性的,手机启动相机后默认显示如图4中(b)所示的拍摄预览界面402,手机检测到用户点击控件433的操作后,启动录像功能,并显示图4中(c)所示的录像预览界面403。或者,在另一些示例中,手机也可以启动相机后默认打开录像功能。比如,手机启动相机后直接显示图4中(c)所示的录像预览界面403。也即手机检测到用户打开相机应用的操作后,即可启动录像功能。又一些示例中,手机通过检测隔空手势,或检测语音指示操作等方式,启动录像功能。例如,手机接收到用户语音命令“打开相机录像”,则直接启动相机的录像功能,并显示录像预览界面。又一些示例中,在另一种可能的实现方式中,手机启动相机后,默认进入上次相机关闭之前最后应用的功能,如人像拍摄功能。之后,再通过检测启动录像功能的操作,启动相机的录像功能,并显示录像预览界面。
在一些实施例中,手机检测到切换至录像功能后,首先询问用户是否开启语音增强模式。在用户确认开启语音增强模式后,启动语音增强模式。或者,手机检测到切换至录像功能后,自动启动语音增强模式。在又一些实施例中,手机检测到切换至录像功能后,先显示录像预览界面,之后检测到用户指示拍摄的操作后,再根据用户指示启动语音增强模式,或者自动启动语音增强模式。
示例性的,如图4中(b)所示,响应于用户点击录像控件433的操作,手机显示如图4中(c)所示录像预览界面403,并在录像预览界面403中显示提示框44,用于提示用户是否启动语音增强模式。若检测到用户点击是的操作,则启动语音增强模式并显示如图4中(d)所示的拍摄界面404。或者,手机由拍摄预览界面402切换至录像功能后,直接启动语音增强模式并显示如图4中(d)所示的 拍摄界面404。
又示例性的,手机切换至录像功能后,只显示如图4中(c)所示的录像预览界面403。之后,响应于用户点击拍摄控件45的操作,再显示提示框44,根据用户选择确认是否启动语音增强模式。或者,手机在录像预览界面403检测到用户点击拍摄控件45的操作后,直接启动语音增强模式并显示如图4中(d)所示的拍摄界面404。
在又一些实施例中,手机在录像预览界面或者在录制视频画面的过程中,检测到用户启动或关闭语音增强模式的操作后,启动或关闭语音增强模式。其中,启动语音增强模式的操作例如可以包括点击预设控件的操作,语音操作等。
示例性的,如图4中(c)所示录像预览界面403,手机可以通过检测用户对控件46的操作,实现启动或者关闭语音增强模式。例如,当前控件46的显示状态,表示当前手机未启动语音增强模式,检测到用户点击控件46的操作后,启动语音增强模式。手机在拍摄开始之前或者拍摄过程中,通过检测用户对控件46的操作,可以实现启动或关闭语音增强模式。
在开启语音增强模式后,手机在检测到用户指示拍摄的操作后,开始录制视频画面,并可以对采集到的视频画面进行视频编码等处理,从而生成视频文件并保存。
示例性的,如图4中(c)所示的录像预览界面403,响应于用户点击拍摄控件45的操作,手机显示如图4中(d)所示的拍摄界面404,并开始进行视频画面录制。
其中,语音增强模式用于增强对视频拍摄视频画面中某些特定目标的音频的采集,从而提高音频录制效果。比如,用户在采访过程中利用相机进行录像,那么需要重点采集被采访的人物的语音。用户指示拍摄的操作例如可以包括点击拍摄控件的操作,语音指示操作等多种操作方式。
示例性的,如图5中(a)所示,大圆501用于表示手机当前所有麦克风能够拾音的最大范围(也可以描述为全景拾音范围),小圆502用于表示用户关注的人物(通常为正在发声的人物)对应的拾音范围。再如图5中(b)所示,用户关注的人物的拾音范围(即拾音范围1)在全景拾音范围以内。本申请实施例中可以根据用户关注的人物的图像在录制的视频画面中的位置信息,确定需要加强录音的拾音范围。也即增强图5中(b)所示拾音范围1内的音频录制效果。从而减小录制的音频中,全景音频中其他杂音对用户关注的人物发声的影响。
在一些实施例中,将手机识别出的正在发声的人脸图像可以描述为第一人脸图像,正在发声的人嘴图像可以描述为第一人嘴图像。或者也可以描述为发声人脸图像或发声人嘴图像。其中,视频画面中正在发声的人物可以为一个或多个,那么第一人脸图像的数量为一个或多个,第一人嘴图像的数量为一个或多个。可以理解的是,若当前拍摄的视频画面中,某些人物正在发声,但手机未能识别其正在发声,则未能识别的发声的人物的人脸图像或人嘴图像不划分为上述的第一人脸图像或第一人嘴图像。
那么,手机在启动语音增强模式开始录制视频画面后,需要识别第一人脸图 像或第一人嘴图像,根据第一人脸图像或第一人嘴图像,确定需要加强录音效果的第一拾音范围,从而获得更好的录音效果。
例如,手机在确认第一拾音范围后,调用第一拾音范围对应的麦克风,实现增强第一拾音范围内的音频信号。在一些场景中,手机包含一个或多个麦克风,一个或多个麦克风用于采集第一音频(即初始音频信号)。当一个或多个麦克风中第一麦克风的拾音范围内包含第一拾音范围的部分或全部时,增强第一麦克风的拾音范围中第一拾音范围内的音频信号;和/或削弱第一麦克风的拾音范围中第一拾音范围外的音频信号;和/或削弱一个或多个麦克风中除第一麦克风外的其他麦克风的音频信号,得到第二音频(即第一人脸图像或第一人嘴图像对应的音频)。在另一些场景中,手机包含至少两个麦克风,至少两个麦克风用于采集第一音频。当至少两个麦克风中第二麦克风的拾音范围不包含第一拾音范围时,关闭第二麦克风,至少两个麦克风中除第二麦克风外的其他麦克风采集的音频为第二音频。或者,在关闭第二麦克风时,增强至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围内的音频信号,和/或削弱至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围外的音频信号。
示例性的,手机配置有麦克风1和麦克风2。第一拾音范围在麦克风1的拾音范围以内,则手机在利用麦克风1和麦克风2获取到初始音频信号后,可以增强该初始音频信号中麦克风1采集的第一拾音范围内的音频信号,同时削弱该初始音频信号中麦克风1采集的第一拾音范围以外的音频信号,以及削弱麦克风2采集的音频信号,获取第一人脸图像或第一人嘴图像对应的音频。或者,手机关闭麦克风2,将麦克风1采集的音频信号中第一拾音范围内的音频信号增强,第一拾音范围以外的音频信号削弱后,获取第一人脸图像或第一人嘴图像对应的音频。又比如,手机配置有麦克风1和麦克风2。第一拾音范围包括麦克风1的拾音范围以内的拾音范围1,以及麦克风2的拾音范围以内的拾音范围2。也就是说,第一拾音范围为拾音范围1和拾音范围2的并集。那么,手机在利用麦克风1和麦克风2获取到初始音频信号后,可以增强初始音频信号中麦克风1的拾音范围1以及麦克风2的拾音范围2以内的音频信号,削弱初始音频信号中剩余的音频信号,获取第一人脸图像或第一人嘴图像对应的音频。可以理解的是,拾音范围1和拾音范围2可以全部或部分重叠。
示例性的,如图4中(d)所示拍摄界面404,拍摄界面404包含用于显示视频画面的取景框48。其中,取景框48对应的拾音范围为当前录制的视频画面的最大拾音范围。当前正在录制的视频画面中,手机识别出第一人脸图像47,假设第一人脸图像位于取景框48的中心位置,则手机确定第一拾音范围为最大拾音范围的中心位置。手机增强第一拾音范围内的音频信号。可选的,在拍摄界面404显示提示框49,用于提示用户当前已增强中间位置的录音效果。该提示框49可以在拍摄过程中持续显示,显示内容随着第一拾音范围的变化而变化,在停止拍摄后自动隐藏。或者,仅在预设时间段内显示,预设时间段后自动消失,避免遮挡取景框48的显示的视频画面。
可见,在录制音频的过程中,手机可以通过增强第一拾音范围内的音频信号, 获取发声人脸或发声人嘴对应的音频,实现增强对发声人脸或发声人嘴的收音效果,以减小外界杂音的干扰。进一步的,在增强第一拾音范围内的音频信号的基础上,还可以削弱第一拾音范围以外的音频信号,获得更好的录音效果。或者,仅削弱第一拾音范围以外的音频信号,以减小外界杂音的干扰。
图6为本申请实施例提供的一种音频处理方法流程示意图。以下通过如图6所示的步骤S601-步骤S604对上述通过图4中(a)-(d)介绍的手机识别第一人脸图像或第一人嘴图像,确定需要语音增强的第一拾音范围,以及获取第一拾音范围对应音频的过程进行详细介绍。
S601、手机识别第一人脸图像或第一人嘴图像。
可选的,手机可以通过人脸图像识别算法识别第一人脸图像或第一人嘴图像。比如,手机在录制视频画面的过程中,通过人脸图像识别算法确定采集到视频画面中是否包含人脸图像。若包含人脸图像,则识别出其中包含的人脸图像,并根据人脸图像的面部特征数据,如五官数据,面部轮廓数据等在预设时间段内的变化情况确定其是否正在发声。其中,人脸图像正在发声的判断标准包括手机判断人脸图像当前正在发声。或者,手机在判断人脸图像第一次发声之后的预设时间段内再次判断人脸图像发声,则确定人脸图像正在发声。可以理解的是,人的发声器官为人嘴,当可以获得发声的人嘴数据时,可以优先确定第一人嘴图像的数据,后续基于第一人嘴图像的数据确定第一拾音范围。
示例性的,如图7所示界面701,手机采集到人脸图像71,并通过人脸图像识别算法识别出人脸图像71对应的面部特征关键点(如人脸图像71上显示的圆形特征点,从而确定其是否正在发声)。并可以获得人脸数据和/或人嘴数据。比如,面部特征点包括上嘴唇特征点和下嘴唇特征点,根据上嘴唇特征点和下嘴唇特征点可以实时获得上下嘴唇之间的距离。那么预设人脸图像上嘴唇和下嘴唇之间的距离阈值。若在第一次检测到人脸图像的上嘴唇和下嘴唇之间的距离超过距离阈值之后的预设时间段之内,手机检测到人脸图像的上嘴唇和下嘴唇之间的距离超过距离阈值的次数超过预设次数,则确定当前人脸图像正在发声。
进一步的,面部特征点还可以包括人脸轮廓特征点,那么手机可以根据人脸轮廓特征点获得如下巴变化的数据,人脸肌肉变化的数据等,进而确定人脸图像是否正在发声。比如,预设时间段之内,下巴上下移动产生的变化数据超过预设阈值的次数超过预设次数,则确定当前人脸图像正在发声。当然,手机还可以根据人嘴对应的其他数据如喉结变化数据等的变化,确定发声人脸或发声人嘴。并且手机还可以结合上述各个人脸数据和人嘴数据,实现更加准确的识别第一人脸图像或第一人嘴图像。
需要说明的是,上述人脸图像识别算法可以参见现有技术中包含的人脸图像识别算法,本申请实施例不再对人脸识别算法及其计算过程进行详细阐述。
其中,第一人脸图像的数量为一个或多个。在第一人脸图像的数量为多个的场景中,即多张人脸图像同时发声或者多张人脸图像在第一预设时间段内先后发声的场景中,手机可以排除其中人脸图像面积较小或者位于视频画面边缘的人脸图像,不认为其为第一人脸图像。一般的,用户在录制视频画面的过程中,会将 摄像头对准其关注的人物,那么用户关注的人脸图像应该为面积较大的人脸图像,或者为显示在视频画面中间或中间附近的人脸图像。也就是说,用户关注的拾音范围通常是用户关注的画面范围内的声音,这一部分画面范围需要进行语音增强。其中,第一预设时间段可以为预配置的较短时间范围,如手机判断用户A发声,以用户A停止发声的时间点开始计时,在第一预设时间段内检测到用户B开始发声。进一步的,用户B停止发声后的第一预设时间段内检测到用户A又开始发声。也就是说,在录像过程中,用户A发声后用户B马上发声,或者,用户A和用户B交替发声,则可以将用户A和用户B对应的人脸图像确认为第一人脸图像。那么,可以避免在较短的时间范围内频繁确认第一人脸图像对应的拾音范围,减少数据处理量,同时提高效率。
那么,手机在识别出多张发声的人脸图像后,确认其中面积最大的人脸图像或者距离视频画面中心最近的人脸图像,将该人脸图像以及与该人脸图像面积差小于预设阈值的发声的人脸图像确认为第一人脸图像。或者,将该人脸图像以及该人脸图像附近预设范围内的发声的人脸图像确认为第一人脸图像,从而实现根据第一人脸图像确定第一拾音范围。类似的,手机确定多张第一人嘴图像的场景与确定多张第一人脸图像的场景相同,不再赘述。其中,视频画面的中心点例如包括取景框中心点,手机显示屏幕的中心点等。
S602、手机获取第一人脸图像或第一人嘴图像对应的第一特征值。
S603、手机根据第一特征值确定第一拾音范围。
其中,第一特征值用于描述第一人脸图像对应的真实人物的人脸与手机的相对位置关系,或者第一特征值用于描述第一人嘴图像对应的真实人物的人嘴与手机的相对位置关系。从而手机可以根据第一特征值,确定第一拾音范围。比如,第一人脸图像对应的真实人物位于手机的正前方,即第一人脸图像位于拍摄的视频画面的中心位置,则第一拾音范围为手机正前方的拾音范围。后续,手机获取包含各个方向音频信号的初始音频信号后,可以基于初始音频信号和第一拾音范围获得第一人脸图像对应的音频。第一特征值包括前后置属性参数,面积占比,位置信息中的一项或多项。其中,前后置属性参数,面积占比和位置信息为手机根据第一人脸图像或第一人嘴图像确定的参数,其含义详见下文描述。
以下针对第一特征值包含不同参数时手机确定第一拾音范围的具体方法进行说明。
方案一,第一特征值包括第一人脸图像的前后置属性参数,或者第一特征值包括第一人嘴图像对应的前后置属性参数。
其中,“前后置属性参数”用于表示包含第一人脸图像或第一人嘴图像的视频画面为前置摄像头拍摄的视频画面(为便于描述,本文中也称为前置视频画面),还是后置摄像头拍摄的视频画面(为便于描述,本文中也称为后置视频画面)。该前后置属性参数可以用于确定第一拾音范围在手机前置180度的范围内还是在后置180度的范围内。示例性的,如图2B中的(b)所示,前置视频画面对应的拾音范围包括椭圆204,椭圆205以及椭圆206表示的范围,后置视频画面对应的拾音范围可以包括椭圆201,椭圆202和椭圆203表示的范围。
示例性的,手机取景框内显示的视频画面可以进行前后置摄像头采集画面的切换。如图8中(a)所示的拍摄界面801,手机处于语音增强模式,确认存在发声人脸图像81。手机确认发声人脸图像81所在的视频画面为前置摄像头采集的视频画面,即确认第一特征值为前置属性参数,则确认第一拾音范围为前置180度范围内,显示提示框82,提示用户当前已增强前置录音效果。
进一步的，拍摄界面801还包括前后置切换控件83，用于进行前后置摄像头的切换。比如，手机响应于用户点击前后置切换控件83的操作，可以将前置摄像头切换为后置摄像头。相应的，手机显示的视频画面，由图8中(a)所示的拍摄界面801显示的前置摄像头采集的视频画面，切换为如图8中(b)所示的拍摄界面802显示的后置摄像头采集的视频画面。手机识别出当前视频画面中的发声人脸图像84，则确定第一特征值为后置属性参数，第一拾音范围为手机后置180度的范围内。手机显示提示框85，提示用户当前已增强后置录音效果。
其中,如图2B中(b)所示,后置视频画面对应的拾音范围为椭圆201,椭圆202和椭圆203表示的范围,前置视频画面对应的拾音范围为椭圆204,椭圆205和椭圆206表示的范围。比如,手机根据第一特征值确认第一人脸图像对应后置视频画面,则确认第一拾音范围为椭圆201,椭圆202和椭圆203表示的范围。或者,参见图2B中(d)所示,手机根据第一特征值确认第一人脸图像对应后置视频画面,则确认第一拾音范围为麦克风27,麦克风28和麦克风29对应的拾音范围。
方案二,第一特征值包括第一人脸图像对应的面积占比,或者,第一特征值包括第一人嘴图像对应的面积占比。
其中,“面积占比”用于表示第一人脸图像面积或第一人嘴图像面积与视频画面的面积的比值。该面积占比用于衡量麦克风采集音频的半径范围(或直径范围)。
具体的,用户使用手机拍摄视频画面的过程中,通常会将用户关注的人物置于视频画面中心位置,也就是说,第一人脸图像或第一人嘴图像位于取景框中心位置。不同的第一人脸图像或第一人嘴图像的面积对应的拾音范围不同。示例性的,如图9所示,假设手机在不同时间段确定两张第一人脸图像,分别为第一人脸图像1和第一人脸图像2。两张人脸图像的面积不同,第一人脸图像1的面积大于第一人脸图像2的面积。那么,如图9所示,根据第一人脸图像1,确定的拾音范围为拾音范围1。根据第一人脸图像2,确定的拾音范围为拾音范围2。拾音范围1大于拾音范围2。
在一些实施例中,如下表1所示,其中,X用于表示第一人脸图像面积或者第一人嘴图像面积。Y用于表示取景框显示的视频画面的面积。N表示取景范围对应的拾音范围。
表1
面积占比：X/Y
第一拾音范围：N*X/Y
一些实施例中,面积占比用于表示第一人脸图像面积与取景框显示的视频画面的面积的比值。其中,第一人脸图像的数量可以为一个或多个,那么第一人脸图像面积为一张人脸图像的面积或者多张人脸图像的面积和。其中,多张人脸图像的面积和可以用多张人脸图像所在的占位框的面积,即包含该多张人脸图像的最小选框的面积表示。
示例性的,如图10中(a)所示界面1001,第一人脸图像数量为1,手机在进行人脸图像识别过程中,根据人脸图像11的面部特征点中额头最上方的特征点位置,下巴最下方的特征点位置,以及左右脸最边沿不包含耳朵的特征点位置,确定框选第一人脸图像11的人脸面积的虚线框101,框选范围内的图像面积为第一人脸图像面积。即确认第一人脸面积的过程中,仅计算其中的人脸面积,排除耳朵,帽子,饰品,脖子等的影响。取景框显示的视频画面的面积为虚线框102框选范围内的图像面积。那么手机可以根据识别出的虚线框101和虚线框102对应的面积比,确定面积占比。后续,第一人脸图像面积的确定方法均可以参见当前第一人脸图像面积的确定方法,之后不再赘述。
又示例性的,如图10中的(b)所示界面1002,界面1002中显示有两张人脸图像,这两张人脸图像均被手机识别为发声的第一人脸图像。右侧的人脸图像12的面积为虚线框103框选范围内的图像面积,左侧的人脸图像13的面积为虚线框104框选范围内的图像面积,那么第一人脸图像面积为虚线框105框选范围内的图像面积,即包括所有人脸图像的最小的选框的面积(例如是根据所有人脸图像面积选框的边沿极限值确定总的框选面积)。其中,虚线框105即用于表示人脸图像12和人脸图像13所在的占位框。最终确定的第一人脸图像面积同时包含两张人脸图像对应的图像面积。取景框显示的视频画面的面积为虚线框106框选范围内的图像面积。那么手机可以根据识别出的虚线框105和虚线框106对应的面积比,确定面积占比。
其中,多人脸发声场景中,如图10中的(c)所示界面1003,若视频画面中的两个人均在发声,手机确定右侧人脸图像14的面积最大。手机可以通过预设阈值排除部分用户不关注的发声人脸图像。比如,预设阈值为小于最大人脸图像面积的20%。示例性的,在界面1003中,手机可以排除小于右侧人脸图像14的面积的20%的左侧人脸图像15。那么,第一人脸图像包括右侧的人脸图像14。又比如,预设阈值为距离最大面积的人脸图像的距离超过取景框显示的视频画面的长度或宽度的35%。示例性的,在界面1003中,手机可以排除距离右侧人脸图像14的距离超过取景框显示的视频画面的长度的35%的左侧人脸图像15。那么,第一人脸图像包括右侧人脸图像14。
又一些实施例中,面积占比用于表示第一人嘴图像的面积与取景框显示的视频画面的面积的比值。其中,第一人嘴图像的数量可以为一个或多个,那么第一人嘴图像的面积为一张人嘴图像的面积或者多张人嘴图像对应的面积和。其中,多张人嘴图像的面积和可以用多张人嘴图像所在的占位框的面积表示,即用包含该多张人嘴图像的最小选框的面积表示。
示例性的,如图11中(a)所示界面1101,第一人嘴图像数量为1,手机在进行人脸图像识别过程中,根据面部特征点中人嘴图像的特征点中最上方,左下方,最左侧和最右侧的特征点位置,确定框选第一人嘴图像16的面积的虚线框111,框选范围内的图像面积为第一人嘴图像面积。取景框显示的视频画面的面积为虚线框112框选范围内的图像面积。那么手机可以根据识别出的虚线框111和虚线框112对应的面积比,确定面积占比。后续,第一人嘴图像面积的确定方法均可以参见当前第一人脸图像面积的确定方法,之后不再赘述。
又示例性的,如图11中的(b)所示界面1102,界面1102中显示有两张人嘴图像,这两张人嘴图像均被手机识别为发声的发声人嘴图像。右侧的第一人嘴图像17的面积为虚线框113框选范围内的图像面积,左侧的第一人嘴图像18的面积为虚线框114框选范围内的图像面积,那么第一人嘴图像面积为虚线框115框选范围内的图像面积,即包括所有人嘴图像的最小的选框的面积(例如是根据所有人嘴图像面积选框的边沿极限值确定总的框选面积)。其中,虚线框115即用于表示第一人嘴图像17和第一人嘴图像18所在的占位框。最终确定的第一人嘴图像面积同时包含两张人嘴图像对应的图像面积。取景框显示的视频画面的面积为虚线框116框选范围内的图像面积。那么手机可以根据识别出的虚线框115和虚线框116对应的面积比,确定面积占比。
同样的,多人脸发声场景中,如图11中的(c)所示界面1103所示,若当前视频画面中的两个人均在发声,手机确定右侧人嘴图像面积最大。手机可以通过预设阈值排除部分用户不关注的发声人嘴图像。比如,预设阈值为小于最大人嘴图像面积的20%。又比如,预设阈值为距离最大面积的人嘴图像的距离超过取景框显示的视频画面的长度或宽度的35%。如图11中的(c)所示界面1103所示,排除左侧正在发声的人嘴图像,第一人嘴图像仅包括右侧正在发声的第一人嘴图像,根据右侧第一人嘴的面积确定第一拾音范围的半径。
示例性的,假设上述确定第一人脸图像面积的场景中,手机均采用后置摄像头采集视频画面。手机根据如图10中(a)所示的第一人脸图像的第一特征值确定的拾音范围,可以为如图9所示的拾音范围2。手机根据如图10中(b)所示的第一人脸图像的第一特征值确定的拾音范围,可以为如图9所示的拾音范围1。
需要说明的是,上述确定第一人脸图像面积和第一人嘴图像的面积的过程中,均将第一人脸图像和第一人嘴图像转化为矩形后,将矩形面积作为对应的第一人脸图像面积或第一人嘴图像的面积。可以理解的是,也可以利用不规则的几何图形对应第一人脸图像和第一人嘴图像,从而更加精确的确定对应的面积,本申请实施例中的矩形仅为一种示例性说明,对此本申请实施例不做具体限定。
需要说明的是,上述确定第一人脸图像面积占比和第一人嘴图像的面积占比 的过程中,均将取景框面积作为视频画面的面积。可以理解的是,在手机为全屏手机的情况下,可以将手机显示屏面积作为视频画面面积。或者,也可以用其他面积,以及其他形状的面积作为视频画面面积,本申请实施例中的取景框面积仅为一种示例性说明,对此本申请实施例不做具体限定。
方案三,第一特征值包括第一人脸图像对应的位置信息,或者第一特征值包括第一人嘴图像对应的位置信息。
其中,“位置信息”用于表示第一人脸图像或第一人嘴图像在视频画面中的位置。位置信息包含第一人脸图像的中心点相对于第一参考点的偏移量,如偏移方向,和/或偏移角度,和/或偏移距离等。或位置信息包含第一人嘴图像的中心点相对于第一参考点的偏移量。其中,第一参考点为视频画面的中心点或对焦的焦点。偏移方向是指第一人脸图像或第一人嘴图像的中心点相对于第一参考点向左偏移,向右偏移,向上偏移,向下偏移,向左上偏移,向右上偏移,向左下偏移或者向右下偏移等。偏移角度是指向左上偏移,向右上偏移,向左下偏移或者向右下偏移的角度。偏移距离是指向左偏移,向右偏移,向上偏移,向下偏移的距离,或者某个偏移角度上偏移的距离等。
在一些实施例中,可以根据第一人脸图像各个方向上的特征点的极限位置,确定第一人脸图像的中心点坐标。如上述第一人脸图像面积的确定过程,根据第一人脸图像的面部特征点中额头最上方的特征点位置,下巴最下方的特征点位置,以及左右脸最边沿不包含耳朵的特征点位置,确定第一人脸图像中心点坐标。同样的,根据人脸图像的面部特征点中人嘴图像的特征点中最上方,左下方,最左侧和最右侧的特征点位置,确定第一人嘴图像的中心点坐标。
之后,预设第一参考点例如可以包括取景框显示的视频画面的中心点(也可以描述为取景的中心点),取景范围内对焦的焦点等。以第一参考点为原点,平行于手机底边(或当前取景框的底边)为x轴,垂直于x轴的方向为y构建坐标系,并且当前坐标系平行于手机显示屏。利用构建的坐标系定义第一人脸图像或第一人嘴图像的中心点相对于第一参考点的偏移方向,偏移角度和偏移距离。示例性的,如图13中(a)所示,为手机竖屏显示的情况下,坐标系的情况,其中,x轴平行于手机底边(即短边)。如图13中(b)所示,为手机横屏显示的情况下,坐标系的情况其中,x轴平行于手机侧边(即长边)。其中,x轴与y轴的交点,即原点坐标为(0,0),x轴正方向为右,y轴正方向为上。可以看出,当手机切换竖屏显示和横屏显示之后,坐标系x轴和y轴的方向发生改变,第一人脸图像或第一人嘴图像的中心点相对于第一参考点的偏移方向,偏移角度和偏移距离会随之变化。
示例性的,如图12中(a)所示的界面1201,第一人脸图像的数量为1,第一人脸图像的中心点为标识121对应的位置,取景框显示的视频画面的中心点为标识122对应的位置。其中,取景框中心点位置为根据取景框上下左右的边沿极限坐标确定。手机根据标识121和标识122的位置关系,确定第一人脸图像的位置信息。比如,界面1201显示的场景中,第一人脸图像的位置信息为取景框中心点左下方。或者,如图12中(b)所示的界面1202,第一人脸图像的数量为1, 第一人嘴图像的中心点为标识123对应的位置,取景框显示的视频画面的中心点为标识124对应的位置。手机根据标识123和标识124的位置关系,确定第一人脸图像的位置信息。比如,界面1202显示的场景中,第一人嘴图像的位置信息为取景框中心点左下方。
在一些实施例中,若第一人脸图像的数量为多个,那么第一人脸图像的中心点为多张人脸图像组成的图像范围内的中心点。比如,如图10中(b)所示的场景,第一人脸图像的中心点为虚线框105框选范围的几何中心点。又比如,如图11中(b)所示的场景,第一人嘴图像的中心点为虚线框115框选范围的几何中心点。同样的,取景框显示的视频画面的中心点也为取景框的几何中心点。
需要说明的是,上述确定第一人脸图像中心点和第一人嘴图像的中心点的过程中,均将第一人脸图像和第一人嘴图像转化为矩形后,将矩形中心点作为对应的第一人脸图像中心点或第一人嘴图像的中心点。可以理解的是,也可以利用不规则的几何图形对应第一人脸图像和第一人嘴图像,从而更加精确的确定对应的中心点,本申请实施例中的矩形仅为一种示例性说明,对此本申请实施例不做具体限定。
并且,上述确定第一人脸图像或第一人嘴图像对应的位置信息的过程中,在有些场景中,将取景框的中心点作为第一参考点,即利用取景框中心点表示视频画面的中心点。可以理解的是,基于视频画面的显示形式,第一参考点也可以用其他形式表示。比如,将手机显示屏的全部屏幕的中心点用于表示视频画面的中心点,即作为第一参考点。本申请实施例中的以取景框中心点作为第一参考点仅为一种示例性说明,对此本申请实施例不做具体限定。
在有些场景中,用户在录制视频画面的过程中,可能并不会将关注的物体置于取景范围内的中心位置,而是会通过对焦的方式,选择较为关注的物体。手机通过检测对焦的焦点位置,可以获得用户意图,确定用户关注的物体。其中,对焦的焦点位置也可以为手机自动对焦获得的焦点位置。例如,手机自动识别人像,自动对焦后确定对应的焦点位置。
示例性的,如图12中(c)所示界面1203,当前场景中,第一人脸图像的数量为2,第一人脸图像的中心点为标识125对应的位置。手机检测到用户点击屏幕的操作,获得对焦的焦点位置,并显示虚线框126。虚线框126框选的范围为手机根据用户的意图确定的对焦范围。那么,对焦范围内的中心焦点为标识127对应的位置。手机根据标识125和标识127的位置关系,确定第一人脸图像的位置信息。如第一人脸图像的位置信息为焦点中心左上方。
在一种可能的实现方式中,手机可以根据第一人脸图像的中心点坐标或第一人嘴图像的中心点坐标和第一参考点坐标,确定第一人脸图像或第一人嘴图像与第一参考点的相对位置关系,进而确定第一人脸图像或第一人嘴图像在取景框显示的视频画面中的偏移方向。
示例性的，参考如图13中(a)或(b)所示的坐标系。假设第一人脸图像的中心点坐标或第一人嘴图像的中心点坐标为(X1,Y1)，第一参考点坐标为(X2,Y2)，将第一参考点设置为坐标系原点(0,0)。其中，第一人脸图像或第一人嘴图像与第一参考点的相对位置关系可以参考下表2所示。比如，X1<X2，则表示第一人脸图像或第一人嘴图像位于第一参考点左侧，即偏移方向为向左。又比如，X1=X2，同时Y1=Y2，则表示第一人脸图像或第一人嘴图像的中心点与第一参考点的左右偏移量和上下偏移量均为零，即第一人脸图像中心点与第一参考点中心重合，偏移方向为未偏移。
表2
坐标关系 偏移方向
X1<X2 向左
X1>X2 向右
X1=X2 左右未偏移
Y1<Y2 向下
Y1>Y2 向上
Y1=Y2 上下未偏移
在另一种可能的实现方式中,手机可以根据第一人脸图像的中心点坐标或第一人嘴图像的中心点坐标和第一参考点坐标,确定第一人脸图像在取景框显示的视频画面中的偏移角度(如图14中所示的第一人脸图像的中心点坐标或第一人嘴图像的中心点坐标(X1,Y1)与第一参考点(X2,Y2)的连线,与X轴的夹角θ)。示例性的,如图14所示,大圆141用于表示手机取景框对应的最大拾音范围,将取景框中心点坐标设置为(0,0),即将取景框中心点设置为第一参考点。将最大拾音范围划分为4个象限,如第一象限142,第二象限143,第三象限144以及第四象限145。假设偏移角度为θ,手机可以基于每一象限中(X1,Y1)和(X2,Y2)连线与x轴夹角大小,确定偏移角度θ,则0<θ<90°。或者,手机基于全象限确定偏移角度θ,则0<θ<360°。比如,在图14中,第一人脸图像显示于取景框的第二象限143,tanθ=|Y2-Y1|/|X2-X1|,从而手机可以获得第一人脸图像在取景框显示的视频画面中的偏移角度θ。
在又一种可能的实现方式中,手机可以根据第一人脸图像的中心点坐标或第一人嘴图像的中心点坐标和第一参考点坐标,确定第一人脸图像在取景框显示的视频画面中的偏移距离。手机根据偏移距离,以及第一人脸对应的拾音范围的半径,可以确定第一人脸图像对应的拾音范围是否超出取景范围对应的拾音范围,进而确定第一拾音范围。
示例性的,如图15中(a)所示,大圆151为取景框对应的最大拾音范围,半径为R。第一参考点为取景框显示的视频画面的中心点,即最大拾音范围的中心点,坐标为(X2,Y2),第一人脸图像中心点坐标为(X1,Y1),手机根据面积参数比信息确定的小圆152半径为r。手机根据勾股定理,可以获得偏移距离
L=√((X2-X1)²+(Y2-Y1)²)。
那么,第一人脸图像中心点距离最大拾音范围的边缘的距离S=R-L。若第一人脸图像对应的拾音范围未超出最大拾音范围,即r≤S,那么第一拾音范围的半径r=R*P。其中,P为第一人脸图像与取景框显示视频画面的面积的比值,即面积占比参数。若第一人脸图像对应的拾音范围部分超出最大拾音范围,即r>S。如图15中(b)所示,超出手机最大拾音范围的部分无法拾音,那么第一人脸图像对应的拾音范围对应发生改变,保证手机能够获取声音。比如,若1.5S>r>S,则第一拾音范围的半径等于第一人脸图像中心点距离最大拾音范围的边缘的距离。若r≥1.5S,则第一拾音范围的半径等于全景拾音范围的半径与面积占比参数的乘积,在此情况下,手机不会对超出最大拾音范围的部分进行拾音。可以理解的是,在r>S的情况下,通过比较r与1.5S的大小确定第一拾音范围的半径的方法仅为一种示例性说明,还可以通过其他方法确定第一拾音范围的半径,保证手机可以对第一人脸图像对应的音频数据进行拾音。比如,通过比较r与2S的大小确定第一拾音范围的半径。
需要说明的是,上述确定第一人脸图像或第一人嘴图像的中心点的确认过程中,均将第一人脸图像和第一人嘴图像转化为矩形后,将矩形的几何中心点作为对应的第一人脸图像或第一人嘴图像的中心点。可以理解的是,也可以利用不规则的几何图形对应第一人脸图像和第一人嘴图像,从而更加精确的确定对应的中心点位置,本申请实施例中的矩形仅为一种示例性说明,对此本申请实施例不做具体限定。
在一些实施例中,手机可以利用上述方案一至方案三中任一方案确定第一拾音范围。或者,手机可以将上述方案一至方案三中的多个方案相结合后,确定第一拾音范围。又或者,手机可以利用上述方案一至方案三中的一个或多个参数与其它参数结合后,确定第一拾音范围。再或者,手机可以利用其它参数,确定第一拾音范围。
比如,如下介绍一种手机将上述方案一至方案三相结合后,确认第一拾音范围的方法。
示例性的,假设当前用户选择利用后置摄像头录制视频画面,如图16A中(a)所示,那么手机根据第一人脸图像对应的视频画面的前后置属性参数,确定第一人脸图像对应的视频画面为后置视频画面。如图16A中(b)所示,第一拾音范围在手机后置180度的范围内。即椭圆161、椭圆162和椭圆163表示的范围。
之后,手机可以根据第一人脸图像对应的位置信息,进一步确定第一拾音范围。比如,如图16B中的(a)所示,第一人脸图像为左侧的人脸图像,第一人脸图像中心点164位于取景框中心点165左上方。手机根据位置信息,确定偏移方向为左上方,第一拾音范围的中心点位于后置拾音范围中心点的左上方,比如第一拾音范围可以参见图16B中的(b)所示的椭圆161和椭圆162表示的范围中的左侧。如图16B中(c)所示,大圆166为后置视频画面对应的最大拾音范围,将拾音范围沿中心虚线左右分割,即可确认对应的左右拾音范围。比如,后置左上方的第一拾音范围可以参见图16B中(c)所示的左半个椭圆1611和左半个椭圆 1621表示的范围。
在此基础上,假设位置信息还包括偏移角度和偏移距离。如偏移角度大于45度,偏移距离大于取景框显示的视频画面的半径的1/2。也即第一人脸图像位于取景框中显示视频画面中心位置的上方,并且与中心位置距离较远。如图16C中的(a)所示,第一人脸图像为左侧的人脸图像,第一人脸图像中心点166与取景框中心点167之间的偏移距离较大。那么,中间拾音范围对第一人脸图像对应的音频产生的辅助作用较小,第一拾音范围可以参见图16C中的(b)所示的椭圆161表示的范围。进一步的,第一人脸图像可以为图16B中(c)所示的左半个椭圆1611表示的范围。
示例性的,如下表3所示,示例性的说明图2B中(d)所示的多麦克风场景中,手机根据第一人脸图像对应的视频画面的前后置属性参数,以及第一人脸图像对应的位置信息,确定的拾音范围。或者手机根据第一人嘴图像对应的视频画面的前后置属性参数,以及第一人嘴图像对应的位置信息,确定的拾音范围。
表3
前置视频画面：位置偏上时对应麦克风24的拾音范围，位置居中时对应麦克风25的拾音范围，位置偏下时对应麦克风26的拾音范围
后置视频画面：位置偏上时对应麦克风27的拾音范围，位置居中时对应麦克风28的拾音范围，位置偏下时对应麦克风29的拾音范围
最后,手机可以根据第一人脸图像对应的面积占比,确定最终的第一拾音范围。手机通过面积占比,取景范围对应的拾音范围,可以确定第一人脸图像对应的第一拾音范围的半径。
示例性的,通过上述结合方案一至方案三中的方法确定第一拾音范围的过程中,比如,如图15中的(a)所示的圆152圈定第一拾音范围。其中,圆152的半径可以用于对应表示第一拾音范围的半径范围。那么,可以利用图16B中的(c)所示的左半个椭圆1611表示的范围表示第一拾音范围。又比如,如图15中的(b)所示的场景,最后确定第一拾音范围的半径为第一人脸图像中心点距离最大拾音范围的边缘的距离。那么,可以利用图16B中的(c)所示的左半个椭圆1611和左半个椭圆1612表示的范围表示第一拾音范围。
需要说明的是,在手机结合上述方案一至方案三中的多个方案,确定第一拾音范围的过程中,对于确定各个参数的先后顺序不做限制,手机可以采用不同于上述示例中的其他顺序确定各个参数。如同时确定各个参数等。
通过上述方案可以确定第一人脸图像或第一人嘴图像对应的第一拾音范围,进而后续可利用第一拾音范围获取音频,从而提高音频质量。
S604、手机根据第一拾音范围,获取音频。
其中,手机可以采用单个麦克风,或者多个麦克风采集周围各个方向的声音信号,即采集全景声音信号。手机将多个麦克风采集到的全景声音信号进行预处理后,可以获得初始音频数据,该初始音频数据包含各个方向的声音信息。而后,手机可以根据初始音频数据和第一拾音范围,录制第一人脸图像对应的音频。
可选的,手机确定第一人脸图像或第一人嘴图像对应的第一拾音范围后,可以对初始音频数据中第一拾音范围内的声音进行增强,对第一拾音范围外的声音进行抑制(或称减弱),进而对处理后的音频数据进行录制,获得第一人脸图像或第一人嘴图像对应的音频。
如此,第一人脸图像或第一人嘴图像对应的音频录制的是第一拾音范围内的声音,而第一拾音范围是根据第一人脸图像或第一人嘴图像对应的第一特征值确定的拾音范围,因而第一拾音范围内的声音为用户关注的发声人脸或发声人嘴的对应的声音。也就是说,减小了录制视频画面中杂音对发声人脸或发声人嘴发出的声音的干扰。
进一步的,基于第一拾音范围,定向进行语音增强,能够在复杂的拍摄环境中,仅利用音频算法对部分音频信号加强处理,能够简化音频处理算法,提高处理效率,降低对手机硬件计算性能的要求。
在另一些场景中,由于手机根据第一特征值确定的第一拾音范围,与第一人脸图像或第一人嘴图像的显示范围可能存在一定的误差,因而手机可以在第一拾音范围附近确定一个或多个参考第一拾音范围。其中,手机根据第一拾音范围获得一路音频,根据参考第一拾音范围获得至少一路音频,手机还可以将全景音频作为一路音频。那么,手机基于第一拾音范围可以获得第一人脸图像或第一人嘴图像对应的多路音频。其中,一路音频可以理解为一个音频文件。
在一种可能的实现方式中，手机可以根据第一人脸图像或第一人嘴图像对应的面积占比，确定对应的一个或多个参考第一拾音范围。假设根据该面积参数占比信息，确定第一拾音范围和参考第一拾音范围。比如，基于表1，如下表4所示，手机可以根据下表4中的规则确定第一拾音范围和参考第一拾音范围。下表4中，第一拾音范围为推荐值，参考第一拾音范围包括增强值1、增强值2和增强值3。
表4
推荐值 增强值1 增强值2 增强值3
N*X/Y 1.1*N*X/Y 0.95*N*X/Y 1.05*N*X/Y
在另一种可能的实现方式中,手机可以根据不同的音频处理方法确定第一拾音范围和参考第一拾音范围对应的音频。比如,基于上述确定第一拾音范围的流程,第一拾音范围对应的音频为利用杜比音效算法确定的音频,参考第一拾音范围对应的音频为根据Histen音效算法确定的音频。如下表5所示,算法1-算法4为不同的音频算法,根据不同的音频算法确定第一拾音范围和参考第一拾音范围 对应的音频。其中第一拾音范围为推荐值,参考第一拾音范围包括增强值1、增强值2和增强值3。
表5
推荐值 增强值1 增强值2 增强值3
算法1 算法2 算法3 算法4
在又一种可能的实现方式中,手机可以结合第一人脸图像或第一人嘴图像对应的面积参数占比信息和音频算法,获取第一拾音范围和参考第一拾音范围对应的音频。如下表6所示,其中第一拾音范围为推荐值,参考第一拾音范围包括增强值1、增强值2和增强值3。
表6
推荐值：N*X/Y，算法1
增强值1：1.1*N*X/Y，算法2
增强值2：0.95*N*X/Y，算法3
增强值3：1.05*N*X/Y，算法4
可以理解的是,手机还可以利用其他方法确定参考第一拾音范围,本申请实施例不做具体限定。
并且,手机可以对初始音频数据进行处理,以增强参考第一拾音范围内的声音,抑制参考第一拾音范围外的声音,进而对处理后的音频数据进行录制获得第一人脸图像或第一人嘴图像对应的一路或多路音频。
如此,手机可以根据第一拾音范围和参考第一拾音范围,录制获得与第一人脸图像或第一人嘴图像对应的第一特征值以及第一人脸图像或第一人嘴图像的画面相匹配的多路音频,以供用户后续选择播放。其中,第一人脸图像或第一人嘴图像对应的每路音频数据可以保存为一个音频文件,第一人脸图像可以对应多个音频文件。
在手机根据第一拾音范围和参考第一拾音范围,录制第一人脸图像或第一人嘴图像对应的多路音频的情况下,该多路音频为用户提供的不同拾音范围内的音频数量更多,与用户关注的第一人脸图像或第一人嘴图像对应的声音匹配的可能性更大,用户音频播放的选择性也更大。
在一些实施例中,手机还可以根据用户的选择的第一拾音范围或者参考第一拾音范围,录制第一人脸图像或第一人嘴图像对应的音频。示例性的,如图17所示界面1701,手机检测到用户点击推荐值选择控件171的操作,则在录制视频画面的过程中,根据第一拾音范围和初始音频数据,录制第一人脸图像或第一人嘴图像对应的音频。同样的,若手机检测到用户点击增强值1选择控件的操作,则在录制视频画面的过程中,根据增强值1对应的参考第一拾音范围和初始音频数据,录制第一人脸图像或第一人嘴图像对应的音频。其中,若手机检测到用户点击无处理选择控件172的操作,则在录制视频画面的过程中,根据初始音频数据,融合各个方向上的音频信号,获得全景音频。即无处理选择控件172对应的音频为全景音频,也可以理解为手机处于非语音增强模式时,获取的音频。其中,界面1701中推荐值,增强值1,增强值2和增强值3确定的方法,可以参见上述表 4-表6所示,在此不再进行赘述。
在一些实施例中,用户可以在正式录制视频画面之前,体验不同拾音范围对应的录制效果,进而确定最终录制视频画面过程中,选用的拾音范围。手机可以根据用户的选择,仅保存对应的音频文件。保证满足用户需求的同时,可以节约手机存储空间。
在另一些场景中,手机在录制视频画面的过程中,第一拾音范围可能会变化为第二拾音范围。比如,由于手机在录制视频画面的过程中,检测到用户指示切换前后置摄像头的操作。切换前的拾音范围为第一拾音范围,切换后的拾音范围为第二拾音范围。那么,对于录制的视频中的音频来说,手机录制的音频至少包括第一时长音频和第二时长音频。其中,第一时长音频为第一拾音范围对应的音频,第二时长音频为第二拾音范围对应的音频。也就是说,手机可以视频画面中发声人脸或发声人嘴的变化,动态确定拾音范围,进而根据拾音范围录制音频。最终检测到用户指示停止录制的操作后,形成的视频画面的音频中可以包含按照根据时间顺序,基于变化的拾音范围录制的不同时长或相同时长的多个音频。
如此,手机可以根据拾音范围的变化,始终对焦于提高需要进行语音增强的部分的音频录制质量,从而保证音频录制效果。并且,在用户播放视频文件时,可以向用户展示匹配视频内容变化的声音范围等动态变化的播放体验。
在一种可能的实现方式中,手机在录制视频画面的过程中,第一人脸图像或第一人嘴图像对应的第一特征值变化,导致拾音范围的变化。示例性的,假设视频画面的前后置属性参数变化,导致第一拾音范围变化。如图18中(a)所示界面1801,显示前置视频画面。手机在录制到00:15时长时,检测到用户点击前后置切换控件181的操作,切换至后置摄像头拍摄,并显示如图18中(b)所示界面1802。那么,在00:15时长前后,第一人脸图像或第一人嘴图像对应的第一特征值发生变化,录制的音频中00:00-00:15时长内的音频为第一拾音范围对应的音频,00:15时长之后的音频为第二拾音范围对应的音频。或者,检测到用户选择对焦的焦点位置变化,那么第一人脸图像或第一人嘴图像对应的位置信息变化,导致第一拾音范围变化。
又或者,取景框内视频画面的画面范围和画面大小,会随着变焦倍数(即Zoom值)的变化而变化。该变焦倍数可以是预设的变焦倍数,上一次在相机关闭前使用的变焦倍数,或用户预先指示的变焦倍数等。并且,取景框对应的变焦倍数还可以根据用户的指示而变化。那么,在一种场景中,随着变焦倍数的变化,取景范围发生变化。相应的,第一人脸图像面积或第一人嘴图像面积,进而第一人脸图像面积或第一人嘴图像对应的面积占比发生改变。也就是说,变焦倍数变化,会导致拾音范围的改变。如此,在后续视频播放过程中,录制的音频可以随着视频内容显示面积等的变化而动态变化,提升用户播放体验。
比如,在其他参数相同的情况下,若变焦倍数增大为原来的2倍,则拾音范围可能缩小为原来的1/3;若变焦倍数增大为原来的3倍,则拾音范围可能缩小为原来的1/6。因而,手机可以根据变焦倍数确定取景范围对应的拾音范围,以及第一人脸图像面积占比或第一人嘴图像面积占比对应的拾音范围。如下表7所示, 其中X用于表示第一人脸图像面积或者第一人嘴图像面积。Y用于表示取景框显示的视频画面的面积。在Zoom值发生变化时,X和Y的值也会发生改变。相应的拾音范围也会发生改变。
表7
变焦倍数为1倍时：面积占比为X/Y，拾音范围为N*X/Y
变焦倍数增大为2倍时：拾音范围约缩小为原来的1/3
变焦倍数增大为3倍时：拾音范围约缩小为原来的1/6
需要说明的是,变焦倍数的变化,也可以不改变拾音范围。比如,在录制过程中,变焦倍数变化后,第一人脸图像未改变,说明用户关注的内容并未发生改变。比如,用户A采访用户B,并利用手机拍摄用户B的采访过程。手机确定视频画面中的第一人脸图像为用户B的人脸图像。手机检测到变焦倍数变大,但此时,视频画面中的第一人脸图像仍为用户B的人脸图像。那么,手机可以不必重新获取第一拾音范围,以降低运算量,节约功耗。或者,在预设时间范围内,手机检测到多次改变变焦倍数的操作,则可以不必改变拾音范围。比如,预设时间段为2s,手机第一次检测到改变变焦倍数的操作后,先不必重新计算拾音范围。若2s之内,手机未检测到改变变焦倍数的操作,则重新计算拾音范围。若2s之内,手机再次检测到改变变焦倍数的操作,则不必重新计算拾音范围。并以此次检测到改变变焦倍数的操作的时间节点为起点,监控下一个2s时间段内,是否会检测到改变变焦倍数的操作。
在一种可能的实现方式中,手机在录制视频画面的过程中,第一人脸图像或第一人嘴图像发生变化,则第一拾音范围改变。比如,上述前后置摄像头切换场景,也可以理解为第一人脸图像第一人嘴图像发生了变化。或者,发声的人脸图像或人嘴图像变化,造成第一人脸图像或第一人嘴图像变化。比如,如图18中的(b)所示界面1802,假设在00:16-00:20时长内,手机确认第一人脸图像为视频画面中包含的两张人脸图像。在00:21-00:30时长内,手机识别第一人脸图像为视频画面中的右侧的人脸图像182。又或者,拍摄画面移动,当前录制的视频画面不包含之前识别的第一人脸图像或第一人嘴图像,则需要利用上述方法重新识别第一拾音范围。
在一种可能的实现方式中,响应于用户改变选择第一拾音范围或者参考第一拾音范围的操作,确定第二拾音范围。示例性的,如图18中(c)所示界面1803,手机在00:30时长之前采用推荐值对应的第一拾音范围录制视频画面,在00:30时检测到用户点击增强值2选择控件183的操作。响应于该操作,手机将第二拾音范围确定为增强值2对应的拾音范围,并显示如图18中(d)所示界面1804,在00:30时长之后,采用增强值2对应的拾音范围获取音频。
在本申请的一些实施例中,手机在生成各路音频的音频文件之前,可以对每路音频进行多种音效处理,以使得录制的音频获得更高的音频质量和更好的音频处理效果。例如,该音效处理可以包括:杜比音效,Histen音效,声音恢复系统(sound retrieval system,SRS)音效,低音增强引擎(bass enhanced engine, BBE)音效,或动态低音增强引擎(dynamic bass enhanced engine,DBEE)音效等。
需要说明的是,为了防止手机抖动造成的第一特征值的频繁变化,导致第一拾音范围的频繁变化,手机可以设置预设时间阈值,在预设时间阈值以内的变化,手机不会改变第一拾音范围。比如,设置在1s内,第一特征值连续两次发生变化,则手机认为当前第一特征值的变化为手机抖动导致,不会改变对应的第一拾音范围。
可选的,手机在通过上述方法利用麦克风采集音频信号的过程中,可以边采集音频信号,边基于第一拾音范围处理音频信号,获得第一人脸图像或第一人嘴图像对应的音频。最终在视频录制结束后,直接生成最终的音频。或者,手机也可以先采集音频信号,在视频录制完成后,再根据第一拾音范围处理音频信号,获得第一人脸图像或第一人嘴图像对应的音频。又或者,手机根据第一拾音范围,调用对应的麦克风采集第一拾音范围内的音频信号,处理后获得第一人脸图像或第一人嘴图像对应的音频。
可选的,录像功能可以包括单路录像功能和多路录像功能。其中,单路录像功能是指在手机拍摄过程中显示一个取景框,用于录制的一路视频画面。多路录像功能是指手机在拍摄过程中显示至少两个取景框,每一取景框用于一路视频画面。其中,使用多路录像功能的过程中,每一路视频画面及对应的音频采集方式均可以参照单路录像功能的实现方式。上述确定根据第一人脸图像以及第一人嘴图像确定第一拾音范围,并根据第一拾音范围录音的方法中,均以拍摄界面包括一个取景框为例进行说明。此外,包含两个或两个以上取景框的多路录像功能对应的过程与此类似,不再进行赘述。
手机在录制视频画面的过程中,根据发声人脸图像或发声人嘴图像,确定第一拾音范围,进而根据第一拾音范围录制音频。后续,需要对录制的音频进行保存,用户可以播放已保存的录像的视频画面和音频。需要说明的是,若录制视频画面的场景为直播,视频通话等实时通信场景,则其录制视频画面过程中,录制音频的方法可以参考上述方法,但是在检测到用户指示停止拍摄的操作即为停止通信的操作后,直接停止通信,不必生成录像视频。可以理解的是,某些实时通信场景中,用户也可以选择保存录像视频。手机响应于用户的操作,确定是否保存实时通信场景中的录像视频。
下面对手机保存录像视频以及播放已经保存的录像视频的场景进行介绍。
可选的,手机检测到用户指示停止拍摄的操作后,停止录制视频画面和音频,并生成录像视频。其中,用户指示停止拍摄的操作可以为用户点击图4中(c)所示录像预览界面403中,显示的控件45的操作,用户语音指示停止拍摄的操作,或隔空手势操作等其他操作,本申请实施例不做具体限定。
可选的,手机检测到用户指示停止拍摄的操作后,生成录像视频并返回录像预览界面或拍摄预览界面。其中,录像视频可以包括视频画面和音频。示例性的,手机生成的录像视频的缩略图可以参见图19中的(a)所示界面1901中显示的缩略图191,或图19中的(b)所示界面1902中显示的缩略图192。
在一种可能的实现方式中,手机可以提示用户该录像视频具有多路音频。示例性的,录像视频缩略图上或录像视频的详细信息可以包括用于表示多路音频的提示信息,例如该提示信息可以是图19中的(b)所示界面1902上显示的多个喇叭的标记193,其他形式的标记,或文字信息等。其中,每一路音频可以分别对应于在第一拾音范围和参考第一拾音范围对应采集的音频。
在一种可能的实现方式中,响应于用户指示停止操作拍摄的操作,手机显示如图19中的(c)所示界面1903,用于提示用户保存需要的视频文件的音频。其中视频文件当前包含音频194-197,分别对应于不同的拾音范围录制的音频文件,或者对应于相同拾音范围不同音频算法处理后获得的音频文件。比如,对应于上表4-6所示的方法,音频194-197分别对应于推荐值,增强值1,增强值2和增强值3的音频。响应于用户指示播放的操作,手机可以播放视频文件和对应的音频。比如,手机检测到用户指示播放音频194的操作,则播放视频文件和音频194。用户在观看视频文件后,可以选择其认为音频效果较好的音频进行保存。响应于用户选择,确定用户需要保存的音频,提高用户的使用体验,并且避免保存过多的音频导致存储空间占用过多的问题。如图19中的(c)所示界面1903,当前视频文件用户选择保存音频194和音频197。手机响应于用户点击保存控件198的操作,完成视频文件的保存,并显示如图19中的(b)所示的界面1902。其中,喇叭的标记193中喇叭的数量可以对应于当前视频文件包含的音频数量。
可选的,手机检测到用户指示播放录像视频的操作后,播放录像视频的视频画面和音频。其中,用户指示播放录像视频的操作,可以为用户点击图19中(a)所示的录像预览界面中的缩略图191的操作。或者,用户指示播放录像视频的操作,可以为用户点击图19中(b)所示的图库中的缩略图192的操作。
在一种可能的实现方式中,手机检测到用户指示播放录像视频的操作后,根据上述录像过程中录制的视频画面和音频播放该录像视频。其中,在视频回放时,手机可以显示视频播放界面,该视频播放界面可以包括录制的视频画面,同时手机可以默认播放第一拾音范围对应的音频,而后可以根据用户的指示切换播放其他的音频。或者,录制过程中,用户已经选择了特定的拾音范围,那么手机自动播放用户选择的拾音范围对应的音频。
例如,在视频回放时,视频播放界面上可以包括多个音频切换控件,每个音频切换控件对应的一路音频。手机检测到用户点击某个音频切换控件的操作后,播放该音频切换控件对应的该路音频。
示例性的,在视频回放时,手机可以显示如图20中的(a)所示的视频播放界面2001,视频播放界面2001显示有视频画面。视频播放界面2001上还显示有音频切换控件201-205。如图20中的(a)所示,手机当前选中了的音频切换控件201,或者默认选择了推荐值,则播放第一拾音范围对应的音频。若手机检测到用户点击音频切换控件203的操作后,可以播放音频切换控件203对应的参考第一拾音范围对应的音频。
又示例性的,手机可以响应于用户的操作,删除视频文件对应的部分音频。如图20中的(b)所示界面2002,手机检测到用户长按音频切换控件205的操作, 显示删除提示框。若用户确认删除,则删除音频切换控件205对应的音频,并显示如图20中的(c)所示界面2003。在界面2003中,不再显示用户已经确认删除的音频对应的音频控件205。如此,能够实现在视频回放过程中,根据用户需求删除用户不想保存的音频,提高用户使用体验。
在另一种可能的实现方式中,在视频回放时,手机可以显示视频播放界面,且先不播放音频。手机在检测到用户的指示操作后,播放用户指示的音频。
在上述实施例描述的方案中,在视频回放,手机可以播放第一人脸图像或第一人嘴图像对应的音频,使得播放的音频中降低杂音对发声人脸或发声人嘴发出的声音的干扰,并且播放的音频与用户关注的人脸图像实时匹配,提高用户音频体验。
并且,手机可以切换播放不同拾音范围对应的音频,给用户以多种音频播放选择,实现了音频的可调节性,可以提高用户音频播放体验。
而且,手机可以播放实时变化的第一人脸图像或第一人嘴图像以及第一特征值对应的音频,使得音频与变化的视频画面实时匹配,提高用户音频体验。
图21为本申请实施例提供的又一种音频处理方法流程示意图。该音频处理方法可应用于如图1所示的电子设备100。
在一些实施例中,电子设备检测到用户指示打开相机的操作后,启动相机,并显示拍摄预览界面。之后,在检测到用户指示拍摄的操作后,开始采集视频图像和第一音频(即初始音频信号)。
需要说明的是,电子设备摄像头采集到的图像为初始视频图像,将初始视频图像处理后,获得在显示屏上可显示的视频画面。其中,处理初始视频图像的步骤由处理器执行。图21中,摄像头采集视频画面仅为一种示例性说明。
其中,在检测到用户指示拍摄的操作之前或之后,响应于用户的操作,电子设备启动语音增强模式。或者,在检测到用户指示拍摄的操作后,电子设备启动语音增强模式。
在一些实施例中,第一音频为电子设备的一个或多个麦克风采集的各个方向上的音频信号。后续,可以基于该第一音频,获得语音增强后的音频。
示例性的,以处理器包括GPU、NPU和AP为例进行举例说明。可以理解的是,这里GPU、NPU和AP执行的步骤也可以为处理器中其他处理单元执行,本申请实施例对此不做限定。
在一些实施例中,处理器中的NPU利用图像识别技术,识别视频画面中是否包含人脸图像和/或人嘴图像。进一步的,NPU还可以根据人脸图像和/或人嘴图像的数据,确认其中的发声人脸或发声人嘴,从而确认需要进行定向录音的拾音范围。
其中,可以利用目标图像,确定目标图像的第一特征值,进而根据第一特征值确定第一拾音范围。第一特征值包括前后置属性参数,面积占比,位置信息中的一项或几项。其中,前后置属性参数,用于表示视频画面为前置摄像头拍摄的视频画面还是后置摄像头拍摄的视频画面;面积占比,用于表示目标图像的面积与视频画面的面积的比值;位置信息,用于表示目标图像在视频画面中的位置。
在一些场景中,第一特征值包括目标图像对应的前后置属性参数。也就是说,处理器中的AP确定当前目标图像所在的视频画面为前置视频画面还是后置视频画面。若为前置视频画面,则第一拾音范围为前置摄像头侧的拾音范围。若为后置视频画面,则第一拾音范围为后置摄像头侧的拾音范围。
在另一些场景中,第一特征值包括目标图像对应的面积占比。其中,“面积占比”用于表示第一人脸图像面积或第一人嘴图像面积与视频画面的面积的比值(例如用X/Y表示)。例如,电子设备根据第一人脸图像面积与取景框面积的比值,确定第一特征值。
具体的,面积占比用于衡量第一人脸图像或第一人嘴图像对应的第一拾音范围的大小,如第一拾音范围的半径范围或直径范围。因此,AP可以根据第一人脸图像的面积占比,确定第一拾音范围的半径范围;或者,AP可以根据第一人嘴图像的面积占比,确定第一拾音范围的半径范围。进而AP可以根据面积占比和第一音频的拾音范围,确定第一拾音范围(例如用N*X/Y表示)。例如,目标图像的面积/视频画面的面积=第一拾音范围/第一音频的拾音范围。
在又一些实施例中,第一特征值包括目标图像对应的位置信息。AP根据目标图像在视频画面中的位置信息,确定目标图像对应的第一拾音范围在第一音频的拾音范围内的位置。具体的,AP确定目标图像的中心点相对于第一参考点的第一偏移量,第一参考点为视频画面的中心点或对焦的焦点。之后,AP确定第一拾音范围的中心点相对于第一音频的拾音范围的中心点的第二偏移量,第二偏移量与第一偏移量成正比,得到第一拾音范围。
其中,第一偏移量或第二偏移量包括偏移角度和/或偏移距离。比如,以第一参考点为原点,平行于电子设备底边(或当前取景框的底边)为x轴,垂直于x轴的方向为y构建坐标系,将第一参考点作为该坐标系的坐标原点,并且该坐标系平行于电子设备的显示屏。如第一偏移量为左上方45度,则第二偏移量为左上方45度。那么,第一拾音范围在第一音频的拾音范围内,且第一拾音范围中心点在第一音频的拾音范围的中心点的左上方45度。
示例性的,目标图像中心相对于参考点的偏移包括偏移角度θ1,偏移距离L1。第一拾音范围相对于第一音频拾音范围位置的偏移包括偏移角度θ2,偏移距离L2。那么,θ1=θ2,L1/L2=常数。
可以理解的是,AP可以利用前后置属性参数,面积占比,位置信息中的一项或任意组合,确定第一拾音范围。
在一些实施例中,处理器中的AP在确定第一拾音范围之后,利用一个或多个麦克风采集到的第一音频,增强第一拾音范围内的音频信号,和/或削弱第一拾音范围以外的音频信号,获得第一人脸图像或第一人嘴图像对应的音频,即获得第二音频。
在一些实施例中,AP可以调用第一拾音范围对应的麦克风,以增强第一拾音范围内的音频信号,使得第一拾音范围内的音量大于第一拾音范围以外的音量。
示例性的,电子设备包含一个或多个麦克风,一个或多个麦克风用于采集第一音频。当一个或多个麦克风中第一麦克风的拾音范围内包含第一拾音范围的部 分或全部时,执行以下至少一个操作得到第二音频:增强第一麦克风的拾音范围中第一拾音范围内的音频信号;削弱第一麦克风的拾音范围中第一拾音范围外的音频信号;削弱一个或多个麦克风中除第一麦克风外的其他麦克风的音频信号。
又示例性的,电子设备包含至少两个麦克风,至少两个麦克风用于采集第一音频。当至少两个麦克风中的第二麦克风的拾音范围不包含第一拾音范围时,关闭第二麦克风,将至少两个麦克风中除第二麦克风外的其他麦克风采集的音频为第一人脸图像或第一人嘴图像对应的音频。或者,在关闭第二麦克风时,增强至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围内的音频信号,和/或削弱至少两个麦克风中除第二麦克风外的其他麦克风的拾音范围中第一拾音范围外的音频信号。
在一些实施例中,处理器中的AP获得第二音频后,在利用获得的视频画面,获得录像视频。在检测到指示停止拍摄的操作后,获得包括第二音频和视频画面的录像视频。
在一些实施例中,录像视频可以包含多个音频文件,其中,每一音频文件包含一路音频。比如,由于电子设备根据第一特征值确定的第一拾音范围,与第一人脸图像或第一人嘴图像的显示范围可能存在一定的误差,因而电子设备可以在第一拾音范围附近确定一个或多个参考第一拾音范围。其中,电子设备根据第一拾音范围获得一路音频,根据参考第一拾音范围获得至少一路音频,电子设备还可以将全景音频作为一路音频。那么,电子设备基于第一拾音范围可以获得第一人脸图像或第一人嘴图像对应的多路音频。其中,一路音频可以理解为一个音频文件。
如此,可以为用户提供多种音频体验。并且,用户根据个人视听体验,可以选择删除其中的部分音频,保存其认为最佳的音频,提高用户使用体验,并且降低存储器的存储压力。
本申请实施例还提供一种电子设备,包括一个或多个处理器以及一个或多个存储器。该一个或多个存储器与一个或多个处理器耦合,一个或多个存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当一个或多个处理器执行计算机指令时,使得电子设备执行上述相关方法步骤实现上述实施例中的音频处理方法。
本申请实施例还提供一种芯片系统,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得该芯片系统实现上述任一方法实施例中的方法。
可选地,该芯片系统中的处理器可以为一个或多个。该处理器可以通过硬件实现也可以通过软件实现。当通过硬件实现时,该处理器可以是逻辑电路、集成电路等。当通过软件实现时,该处理器可以是一个通用处理器,通过读取存储器中存储的软件代码来实现。
可选地,该芯片系统中的存储器也可以为一个或多个。该存储器可以与处理器集成在一起,也可以和处理器分离设置,本申请并不限定。示例性的,存储器可以是非瞬时性处理器,例如只读存储器ROM,其可以与处理器集成在同一块芯片上,也可以分别设置在不同的芯片上,本申请对存储器的类型,以及存储器与处理器的设置方式不作具体限定。
示例性的,该芯片系统可以是现场可编程门阵列(field programmable gate array,FPGA),可以是专用集成芯片(application specific integrated circuit,ASIC),还 可以是系统芯片(system on chip,SoC),还可以是中央处理器(central processor unit,CPU),还可以是网络处理器(network processor,NP),还可以是数字信号处理电路(digital signal processor,DSP),还可以是微控制器(micro controller unit,MCU),还可以是可编程控制器(programmable logic device,PLD)或其他集成芯片。
应理解,上述方法实施例中的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。结合本申请实施例所公开的方法步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在终端设备上运行时,使得终端设备执行上述相关方法步骤实现上述实施例中的音频处理方法。
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述实施例中的音频处理方法。
另外,本申请的实施例还提供一种装置,该装置具体可以是组件或模块,该装置可包括相连的处理器和存储器;其中,存储器用于存储计算机执行指令,当装置运行时,处理器可执行存储器存储的计算机执行指令,以使装置执行上述各方法实施例中的音频处理方法。
其中,本申请实施例提供的终端设备、计算机可读存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
可以理解的是,为了实现上述功能,电子设备包含了执行各个功能相应的硬件和/或软件模块。结合本文中所公开的实施例描述的各示例的算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以结合实施例对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本实施例可以根据上述方法示例对电子设备进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块可以采用硬件的形式实现。需要说明的是,本实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的方法,可以通过其它的方式实现。例如,以上所描述的终端设备实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,模块或单元的间接 耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序指令的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种音频处理方法,其特征在于,所述方法应用于电子设备,所述方法包括:
    检测打开相机应用的第一操作;
    响应于所述第一操作,显示拍摄预览界面;
    检测开始录像的第二操作;
    响应于所述第二操作,采集视频画面和第一音频,并显示拍摄界面,所述拍摄界面包括所述视频画面的预览界面;
    识别所述视频画面中的目标图像,所述目标图像为第一人脸图像和/或第一人嘴图像;其中,所述第一人脸图像为所述视频图像中的发声对象的人脸图像,所述第一人嘴图像为所述视频图像中的发声对象的人嘴图像;
    根据所述目标图像,确定所述发声对象对应的第一拾音范围;
    根据所述第一拾音范围和所述第一音频,获得所述视频画面对应的第二音频,所述第二音频中所述第一拾音范围内的音频音量大于所述第一拾音范围之外的音频音量。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述目标图像,确定所述发声对象对应的第一拾音范围;包括:
    根据所述目标图像,获得第一特征值;其中,所述第一特征值包括前后置属性参数,面积占比,位置信息中的一项或几项;其中,所述前后置属性参数,用于表示所述视频画面为前置摄像头拍摄的视频画面还是后置摄像头拍摄的视频画面;所述面积占比,用于表示所述目标图像的面积与所述视频画面的面积的比值;所述位置信息,用于表示所述目标图像在所述视频画面中的位置;
    根据所述第一特征值,确定所述发声对象对应的所述第一拾音范围。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述第一特征值,确定所述发声对象对应的所述第一拾音范围,包括:
    当所述视频画面为前置视频画面时,确定所述第一拾音范围为前置摄像头侧的拾音范围;
    当所述视频画面为后置视频画面时,确定所述第一拾音范围为后置摄像头侧的拾音范围。
  4. 根据权利要求2所述的方法,其特征在于,所述根据所述第一特征值,确定所述发声对象对应的所述第一拾音范围,包括:
    根据所述面积占比以及所述第一音频的拾音范围,确定所述第一拾音范围。
  5. 根据权利要求2所述的方法,其特征在于,所述根据所述第一特征值,确定所述发声对象对应的所述第一拾音范围,包括:
    根据所述位置信息,确定所述第一拾音范围在所述第一音频的拾音范围中的位置。
  6. 根据权利要求5所述的方法,其特征在于,所述位置信息包括所述目标图像的中心点相对于第一参考点的第一偏移量,所述第一参考点为所述视频画面的中心点或对焦的焦点;
    所述根据所述位置信息,确定所述第一拾音范围在所述第一音频的拾音范围中的位置,包括:
    根据所述第一偏移量,确定所述第一拾音范围的中心点相对于所述第一音频的拾 音范围的中心点的第二偏移量,所述第二偏移量与所述第一偏移量成正比;
    根据所述第二偏移量,确定所述第一拾音范围在所述第一音频的拾音范围中的位置。
  7. 根据权利要求5或6所述的方法，其特征在于，所述视频画面的中心点为取景框的中心点，或者所述视频画面的中心点为显示屏的中心点。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述根据所述第一拾音范围和所述第一音频,获得所述视频画面对应的第二音频;包括:
    增强所述第一音频中在所述第一拾音范围以内的音频信号,和/或削弱所述第一音频中在所述第一拾音范围以外的音频信号,获得所述第二音频。
  9. 根据权利要求8所述的方法,其特征在于,所述电子设备包含一个或多个麦克风,所述一个或多个麦克风用于采集所述第一音频;
    所述根据所述第一拾音范围和所述第一音频,获得所述视频画面对应的第二音频,包括:
    当所述一个或多个麦克风中第一麦克风的拾音范围内包含所述第一拾音范围的部分或全部时,执行以下至少一个操作得到所述第二音频:增强所述第一麦克风的拾音范围中所述第一拾音范围内的音频信号;削弱所述第一麦克风的拾音范围中所述第一拾音范围外的音频信号;削弱所述一个或多个麦克风中除所述第一麦克风外的其他麦克风的音频信号。
  10. 根据权利要求8所述的方法,其特征在于,所述电子设备包含至少两个麦克风,所述至少两个麦克风用于采集所述第一音频;
    所述根据所述第一拾音范围和所述第一音频,获得所述视频画面对应的第二音频,包括:
    当所述至少两个麦克风中第二麦克风的拾音范围不包含所述第一拾音范围时,关闭所述第二麦克风,所述至少两个麦克风中除所述第二麦克风外的其他麦克风采集的音频为所述第二音频。
  11. 根据权利要求10所述的方法,其特征在于,在关闭所述第二麦克风时,所述方法还包括:
    增强所述至少两个麦克风中除所述第二麦克风外的其他麦克风的拾音范围中所述第一拾音范围内的音频信号,和/或削弱至少两个麦克风中除所述第二麦克风外的其他麦克风的拾音范围中所述第一拾音范围外的音频信号。
  12. 根据权利要求2-11任一项所述的方法,其特征在于,所述第一人脸图像的数量为一个或多个,所述第一人嘴的数量为一个或多个。
  13. 根据权利要求1-12任一项所述的方法,其特征在于,在所述响应于所述第二操作,采集视频画面和第一音频,并显示拍摄界面之后,所述方法还包括:
    检测停止拍摄的第三操作;
    响应于所述第三操作,停止录制并生成录像视频;所述录像视频包括所述视频画面,以及所述第二音频;
    检测播放所述录像视频的第四操作;
    响应于所述第四操作,显示视频播放界面,播放所述视频画面,以及所述第二音 频。
  14. 根据权利要求13所述的方法,其特征在于,所述录像视频还包括第三音频,所述第三音频为根据第二拾音范围确定的音频,所述第二拾音范围为根据所述第一拾音范围确定,且与所述第一拾音范围不同的拾音范围;所述视频播放界面包括第一控件和第二控件,所述第一控件对应所述第二音频,所述第二控件对应第三音频。
  15. 根据权利要求14所述的方法,其特征在于,所述方法还包括:
    响应于所述第四操作,播放所述视频画面和所述第二音频;所述第四操作包括操作播放控件的操作或操作所述第一控件的操作;
    检测操作所述第二控件的第五操作;
    响应于所述第五操作,播放所述视频画面和所述第三音频。
  16. 根据权利要求14或15所述的方法,其特征在于,所述方法还包括:
    响应于删除所述第二音频或所述第三音频的操作,删除所述第二音频或所述第三音频。
  17. 根据权利要求1-16任一项所述的方法,其特征在于,在所述响应于所述第一操作,显示拍摄预览界面之后,所述方法还包括:
    检测启动语音增强模式的第六操作;
    响应于所述第六操作,启动语音增强模式。
  18. 一种电子设备,其特征在于,包括:处理器,存储器,麦克风,摄像头和显示屏,所述存储器、所述麦克风、所述摄像头、所述显示屏与所述处理器耦合,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述处理器从所述存储器中读取所述计算机指令,使得所述电子设备执行如权利要求1-17任一项所述的音频处理方法。
  19. 一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,其特征在于,当所述指令在电子设备上运行时,使得所述电子设备执行如权利要求1-17中任一项所述的音频处理方法。
  20. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在电子设备上运行时,使得所述电子设备执行如权利要求1-17中任一项所述的音频处理方法。
PCT/CN2021/108458 2020-08-26 2021-07-26 音频处理方法及电子设备 WO2022042168A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023513516A JP2023540908A (ja) 2020-08-26 2021-07-26 オーディオ処理方法および電子デバイス
EP21860008.8A EP4192004A4 (en) 2020-08-26 2021-07-26 AUDIO PROCESSING METHOD AND ELECTRONIC DEVICE
US18/042,753 US20230328429A1 (en) 2020-08-26 2021-07-26 Audio processing method and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010868463.5 2020-08-26
CN202010868463.5A CN113556501A (zh) 2020-08-26 Audio processing method and electronic device

Publications (1)

Publication Number Publication Date
WO2022042168A1 (zh)

Family

ID=78101621

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108458 WO2022042168A1 (zh) 2020-08-26 2021-07-26 音频处理方法及电子设备

Country Status (5)

Country Link
US (1) US20230328429A1 (zh)
EP (1) EP4192004A4 (zh)
JP (1) JP2023540908A (zh)
CN (1) CN113556501A (zh)
WO (1) WO2022042168A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231686A1 (zh) * 2022-05-30 2023-12-07 荣耀终端有限公司 Video processing method and terminal

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
US9854156B1 (en) 2016-06-12 2017-12-26 Apple Inc. User interface for camera effects
US11112964B2 (en) 2018-02-09 2021-09-07 Apple Inc. Media capture lock affordance for graphical user interface
US11212449B1 (en) * 2020-09-25 2021-12-28 Apple Inc. User interfaces for media capture and management
JP2022155135A (ja) * 2021-03-30 2022-10-13 キヤノン株式会社 Electronic device, control method therefor, program, and recording medium
TWI831175B * 2022-04-08 2024-02-01 驊訊電子企業股份有限公司 Virtual reality providing device and audio processing method
CN116962564A (zh) * 2022-04-19 2023-10-27 华为技术有限公司 Directional sound pickup method and device
CN114679647B (zh) * 2022-05-30 2022-08-30 杭州艾力特数字科技有限公司 Method, apparatus, and device for determining a wireless microphone sound pickup distance, and readable storage medium
CN116048448A (zh) * 2022-07-26 2023-05-02 荣耀终端有限公司 Audio playing method and electronic device

Citations (7)

Publication number Priority date Publication date Assignee Title
CN101101752A (zh) * 2007-07-19 2008-01-09 华中科技大学 Monosyllabic language lip-reading recognition system based on visual features
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading
US9728203B2 (en) * 2011-05-02 2017-08-08 Microsoft Technology Licensing, Llc Photo-realistic synthesis of image sequences with lip movements synchronized with speech
CN108711430A (zh) * 2018-04-28 2018-10-26 广东美的制冷设备有限公司 Speech recognition method, smart device, and storage medium
CN109145853A (zh) * 2018-08-31 2019-01-04 百度在线网络技术(北京)有限公司 Method and apparatus for recognizing noise
CN109413563A (zh) * 2018-10-25 2019-03-01 Oppo广东移动通信有限公司 Sound effect processing method for video and related products
CN110310668A (zh) * 2019-05-21 2019-10-08 深圳壹账通智能科技有限公司 Silence detection method, system, device, and computer-readable storage medium

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
JP5214394B2 (ja) * 2008-10-09 2013-06-19 オリンパスイメージング株式会社 Camera
US9094645B2 (en) * 2009-07-17 2015-07-28 Lg Electronics Inc. Method for processing sound source in terminal and terminal using the same
KR20110038313A (ko) * 2009-10-08 2011-04-14 삼성전자주식회사 Image photographing apparatus and control method thereof
US9258644B2 (en) * 2012-07-27 2016-02-09 Nokia Technologies Oy Method and apparatus for microphone beamforming
US20150022636A1 (en) * 2013-07-19 2015-01-22 Nvidia Corporation Method and system for voice capture using face detection in noisy environments
US9596437B2 (en) * 2013-08-21 2017-03-14 Microsoft Technology Licensing, Llc Audio focusing via multiple microphones
CN104699445A (zh) * 2013-12-06 2015-06-10 华为技术有限公司 Audio information processing method and apparatus
WO2015168901A1 (en) * 2014-05-08 2015-11-12 Intel Corporation Audio signal beam forming
CN106486147A (zh) * 2015-08-26 2017-03-08 华为终端(东莞)有限公司 Directional recording method, apparatus, and recording device
CN107402739A (zh) * 2017-07-26 2017-11-28 北京小米移动软件有限公司 Sound pickup method and apparatus
CN111050269B (zh) * 2018-10-15 2021-11-19 华为技术有限公司 Audio processing method and electronic device
CN110366065A (zh) * 2019-07-24 2019-10-22 长沙世邦通信技术有限公司 Method, apparatus, and system for directional sound pickup following a face position, and storage medium

Non-Patent Citations (1)

Title
See also references of EP4192004A4


Also Published As

Publication number Publication date
CN113556501A (zh) 2021-10-26
EP4192004A1 (en) 2023-06-07
JP2023540908A (ja) 2023-09-27
EP4192004A4 (en) 2024-02-21
US20230328429A1 (en) 2023-10-12

Similar Documents

Publication Publication Date Title
WO2022042168A1 (zh) Audio processing method and electronic device
EP4099688A1 (en) Audio processing method and device
EP4054177B1 (en) Audio processing method and device
EP3944063A1 (en) Screen capture method and electronic device
JP7355941B2 (ja) Photographing method in a long-focus scenario and terminal
US20230276014A1 (en) Photographing method and electronic device
WO2021078001A1 (zh) Image enhancement method and apparatus
EP3893495A1 (en) Method for selecting images based on continuous shooting and electronic device
EP3873084A1 (en) Method for photographing long-exposure image and electronic device
WO2022001806A1 (zh) Image transformation method and apparatus
US20230298498A1 (en) Full-Screen Display Method and Apparatus, and Electronic Device
EP4325877A1 (en) Photographing method and related device
CN115689963A (zh) Image processing method and electronic device
US11870941B2 (en) Audio processing method and electronic device
WO2022262416A1 (zh) Audio processing method and electronic device
CN115484380A (zh) Shooting method, graphical user interface, and electronic device
CN113593567B (zh) Method for converting video sound into text and related device
WO2022062985A1 (zh) Method and apparatus for adding video special effects, and terminal device
WO2022033344A1 (zh) Video stabilization method, terminal device, and computer-readable storage medium
WO2022228010A1 (zh) Method for generating a cover and electronic device
WO2023143171A1 (zh) Audio collection method and electronic device
WO2024078275A1 (zh) Image processing method and apparatus, electronic device, and storage medium
CN117221707A (zh) Video processing method and terminal
CN116414329A (zh) Screen projection display method, system, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21860008

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023513516

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2021860008

Country of ref document: EP

Effective date: 20230302

NENP Non-entry into the national phase

Ref country code: DE