WO2022218271A1 - Video recording method and electronic device - Google Patents

Video recording method and electronic device

Info

Publication number
WO2022218271A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
user
electronic device
image
voice
Prior art date
Application number
PCT/CN2022/086166
Other languages
English (en)
French (fr)
Inventor
陶凯
尹明婕
庞立臣
常青
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP22787497.1A (EP4297398A1)
Publication of WO2022218271A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • H04N5/91 Television signal processing therefor
    • H04N5/92 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
    • H04N5/926 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback by pulse code modulation
    • H04N5/9265 Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback by pulse code modulation with processing of the sound signal
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00 Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80 Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802 Systems for determining direction or deviation from predetermined direction
    • G01S3/803 Systems for determining direction or deviation from predetermined direction using amplitude comparison of signals derived from receiving transducers or transducer systems having differently-oriented directivity characteristics
    • G01S3/8034 Systems for determining direction or deviation from predetermined direction using amplitude comparison of signals derived from receiving transducers or transducer systems having differently-oriented directivity characteristics wherein the signals are derived simultaneously
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223 Cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/433 Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/433 Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N21/4334 Recording operations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • H04N5/765 Interface circuits between an apparatus for recording and another apparatus
    • H04N5/77 Interface circuits between an apparatus for recording and another apparatus between a recording apparatus and a television camera
    • H04N5/772 Interface circuits between an apparatus for recording and another apparatus between a recording apparatus and a television camera the recording apparatus and the television camera being placed in the same enclosure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/60 Substation equipment, e.g. for use by subscribers including speech amplifiers
    • H04M1/6033 Substation equipment, e.g. for use by subscribers including speech amplifiers for providing handsfree use or a loudspeaker mode in telephone sets
    • H04M1/6041 Portable telephones adapted for handsfree use
    • H04M1/6058 Portable telephones adapted for handsfree use involving the use of a headset accessory device connected to the portable telephone
    • H04M1/6066 Portable telephones adapted for handsfree use involving the use of a headset accessory device connected to the portable telephone including a wireless connection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/52 Details of telephonic subscriber devices including functional features of a camera
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • H04N5/91 Television signal processing therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 Signal processing covered by H04R, not provided for in its groups
    • H04R2430/20 Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H04R2430/23 Direction finding using a sum-delay beam-former
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2460/00 Details of hearing devices, i.e. of ear- or headphones covered by H04R1/10 or H04R5/033 but not provided for in any of their subgroups, or of hearing aids covered by H04R25/00 but not provided for in any of its subgroups
    • H04R2460/13 Hearing devices using bone conduction transducers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R2499/10 General applications
    • H04R2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • The present application relates to the technical field of terminals and communications, and in particular to a video recording method and an electronic device.
  • With the popularization of true wireless stereo (TWS) headsets, a user who is far from the electronic device can wear a TWS headset while a video is recorded. The TWS headset largely avoids other sound signals in the environment and captures the user's voice relatively clearly.
  • The headset converts the user's voice signal into an audio signal and then transmits the audio signal to the electronic device over a wireless network.
  • The electronic device can process the received audio signal and the recorded image to obtain a video. In this video, the user's voice is clear.
  • However, the audio signal recorded and processed by the TWS headset is not stereo. As a result, the user's voice in the obtained video has no stereo sense, and the audio-visual orientation of the user's voice (the direction from which the voice appears to come) cannot match the user's visual orientation.
  • The present application provides a video recording method and an electronic device.
  • When the electronic device records a video that includes the user, the audio-visual orientation of the user's voice in the recorded video matches the user's visual orientation, and the user's voice also remains clear.
  • In a first aspect, the present application provides a video recording method, comprising: during video recording by a first electronic device, the first electronic device collects an image and a first audio signal; the first electronic device determines the audio-visual orientation of the user's voice in the image according to the image and the first audio signal; the first electronic device generates a stereo audio signal according to the audio-visual orientation of the user's voice and a user audio signal, where the user audio signal is acquired by a second electronic device and sent to the first electronic device; and the first electronic device generates a video according to the image and the stereo audio signal.
  • The first electronic device calculates the audio-visual orientation of the user's voice and uses it to perform stereo processing on the user audio signal, so that in the generated video the audio-visual orientation of the user's voice restored from the stereo audio signal matches the user's visual orientation.
  • The user audio signal is collected by the second electronic device, which can be an earphone worn by the user. The collected user audio signal is therefore clear, so the stereo user audio signal generated by the first electronic device can also be clear.
  • In the generated video, the audio-visual orientation of the user's voice matches the user's visual orientation, and the user's voice is clear.
  • The stereo audio signal generated according to the audio-visual orientation of the user's voice and the user audio signal mainly contains the user's voice information and is not mixed with other sound information in the environment.
  • The method in this embodiment can therefore highlight the user's voice.
  • In one implementation, the first electronic device determining the audio-visual orientation of the user's voice in the image according to the image and the first audio signal specifically includes: the first electronic device performs face recognition on the image to obtain the pixel position of a face; the first electronic device determines the user's sound source orientation according to the first audio signal; the first electronic device determines the pixel position of the user in the image according to the user's sound source orientation and the pixel position of the face; and the first electronic device determines the audio-visual orientation of the user's voice in the image according to the pixel position of the user in the image.
  • The first electronic device can calculate the audio-visual orientation of the user's voice from the collected first audio signal and the image. This orientation is used as a parameter of the algorithm that generates the stereo audio signal, so that the audio-visual orientation of the user's voice restored from the stereo audio signal matches the user's visual orientation in the image.
  • In another implementation, the first electronic device determining the audio-visual orientation of the user's voice in the image according to the image and the first audio signal specifically includes: the first electronic device performs face recognition on the image to obtain the pixel position of a face; the first electronic device determines the user's sound source orientation by using a bone conduction audio signal in combination with the first audio signal, where the bone conduction audio signal is acquired by the second electronic device and sent to the first electronic device; the first electronic device determines the pixel position of the user in the image according to the user's sound source orientation and the pixel position of the face; and the first electronic device determines the audio-visual orientation of the user's voice in the image according to the pixel position of the user in the image.
  • When the audio-visual orientation of the user's voice is calculated, the bone conduction audio signal can be used to screen out the part of the first audio signal that is strongly related to the user's voice information, which improves the accuracy of the calculated audio-visual orientation of the user's voice.
  • In one implementation, the first electronic device generating a stereo audio signal according to the audio-visual orientation of the user's voice and the user audio signal specifically includes: the first electronic device generates an ambient stereo audio signal; and the first electronic device generates the stereo audio signal according to the audio-visual orientation of the user's voice, the user audio signal, and the ambient stereo audio signal.
  • Both the user audio signal and the ambient audio signal are used when the stereo audio signal is generated, so the generated stereo audio signal includes not only the user's voice information but also other sound information in the environment.
  • In one implementation, the first electronic device generating an ambient stereo audio signal specifically includes: the first electronic device performs adaptive blocking filtering on the first audio signal according to the user audio signal to filter out the user's voice information in the first audio signal; and the first electronic device generates the ambient stereo audio signal according to the filtered first audio signal.
  • The first audio signal collected by the first electronic device may contain relatively clear information about other sounds in the environment.
  • By filtering the user's voice information out of the collected first audio signal, the first electronic device can obtain the other sound information of the real environment.
  • In another implementation, the first electronic device generating an ambient stereo audio signal specifically includes: the first electronic device generates a stereo first audio signal by using the first audio signal; and the first electronic device performs adaptive blocking filtering on the stereo first audio signal according to the user audio signal to filter out the user's voice information in the stereo first audio signal and obtain the ambient stereo audio signal.
  • The first audio signal collected by the first electronic device may contain relatively clear information about other sounds in the environment.
  • By filtering the user's voice information out of the collected first audio signal, the first electronic device can obtain the other sound information of the real environment.
  • In one implementation, the first electronic device generating the stereo audio signal according to the audio-visual orientation of the user's voice, the user audio signal, and the ambient stereo audio signal specifically includes: the first electronic device generates a user stereo audio signal according to the audio-visual orientation of the user's voice and the user audio signal; the first electronic device enhances the user stereo audio signal without changing the ambient stereo audio signal; and the first electronic device generates the stereo audio signal according to the enhanced user stereo audio signal and the ambient stereo audio signal.
  • When the stereo audio signal includes both the user stereo audio signal and the ambient stereo audio signal, audio zooming can be performed: when the user appears closer to the first electronic device in the image, the user's voice can become louder while the ambient sound does not change.
  • When the user appears farther from the first electronic device in the image, the user's voice may become quieter while the ambient sound does not change.
  • In another implementation, the first electronic device generating the stereo audio signal according to the audio-visual orientation of the user's voice, the user audio signal, and the ambient stereo audio signal specifically includes: the first electronic device generates a user stereo audio signal according to the audio-visual orientation of the user's voice and the user audio signal; the first electronic device enhances the user stereo audio signal and simultaneously suppresses the ambient stereo audio signal; and the first electronic device generates the stereo audio signal according to the enhanced user stereo audio signal and the suppressed ambient stereo audio signal.
  • When the stereo audio signal includes both the user stereo audio signal and the ambient stereo audio signal, audio zooming can be performed: when the user appears closer to the first electronic device in the image, the user's voice can become louder while the ambient sound becomes quieter.
  • The user's voice may become quieter, while the ambient sound becomes quieter.
  • In one implementation, the user audio signal is obtained by the second electronic device performing joint noise reduction processing on a second audio signal according to the bone conduction audio signal, which removes other sound information of the second electronic device's surrounding environment from the second audio signal.
  • The bone conduction audio signal is used to filter the second audio signal, so that the obtained user audio signal consists essentially of the user's voice information.
  • When the first electronic device uses this user audio signal in its calculations, other sound information in the environment has little influence on the result, so the calculation result is more accurate.
  • In another implementation, the user audio signal is obtained by the second electronic device performing noise reduction processing on the second audio signal, which removes other sound information of the second electronic device's surrounding environment from the second audio signal.
  • The second audio signal is filtered to remove part of the sound information in the environment, so that the user's voice information is preserved in the obtained user audio signal.
  • When the first electronic device uses this user audio signal in its calculations, other sound information in the environment has little influence on the result, so the calculation result is accurate.
  • In a second aspect, the present application provides an electronic device comprising one or more processors and a memory. The memory is coupled to the one or more processors and is used for storing computer program code. The computer program code includes computer instructions, which are invoked by the one or more processors to cause the electronic device to perform: during video recording, collecting an image and a first audio signal; determining the audio-visual orientation of the user's voice in the image according to the image and the first audio signal; generating a stereo audio signal according to the audio-visual orientation of the user's voice and a user audio signal, where the user audio signal is acquired by a second electronic device and sent to the electronic device; and generating a video according to the image and the stereo audio signal.
  • The electronic device calculates the audio-visual orientation of the user's voice and uses it to perform stereo processing on the user audio signal, so that in the generated video the audio-visual orientation of the user's voice restored from the stereo audio signal matches the user's visual orientation.
  • The user audio signal is collected by the second electronic device, which can be an earphone worn by the user. The collected user audio signal is therefore clear, so the stereo user audio signal generated by the electronic device can also be clear.
  • In the generated video, the audio-visual orientation of the user's voice matches the user's visual orientation, and the user's voice is clear.
  • The stereo audio signal generated according to the audio-visual orientation of the user's voice and the user audio signal mainly contains the user's voice information and is not mixed with other sound information in the environment.
  • The method in this embodiment can therefore highlight the user's voice.
  • In one implementation, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: performing face recognition on the image to obtain the pixel position of a face; determining the user's sound source orientation according to the first audio signal; determining the pixel position of the user in the image according to the user's sound source orientation and the pixel position of the face; and determining the audio-visual orientation of the user's voice in the image according to the pixel position of the user in the image.
  • The electronic device can calculate the audio-visual orientation of the user's voice from the collected first audio signal and the image. This orientation is used as a parameter of the algorithm that generates the stereo audio signal, so that the audio-visual orientation of the user's voice restored from the stereo audio signal matches the user's visual orientation in the image.
  • In another implementation, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: performing face recognition on the image to obtain the pixel position of a face; determining the user's sound source orientation by using a bone conduction audio signal in combination with the first audio signal, where the bone conduction audio signal is acquired by the second electronic device and sent to the electronic device; determining the pixel position of the user in the image according to the user's sound source orientation and the pixel position of the face; and determining the audio-visual orientation of the user's voice in the image according to the pixel position of the user in the image.
  • When the audio-visual orientation of the user's voice is calculated, the bone conduction audio signal can be used to screen out the part of the first audio signal that is strongly related to the user's voice information, which improves the accuracy of the calculated audio-visual orientation of the user's voice.
  • In one implementation, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: generating an ambient stereo audio signal; and generating the stereo audio signal according to the audio-visual orientation of the user's voice, the user audio signal, and the ambient stereo audio signal.
  • Both the user audio signal and the ambient audio signal are used when the stereo audio signal is generated, so the generated stereo audio signal includes not only the user's voice information but also other sound information in the environment.
  • In one implementation, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: performing adaptive blocking filtering on the first audio signal according to the user audio signal to filter out the user's voice information in the first audio signal; and generating the ambient stereo audio signal according to the filtered first audio signal.
  • The first audio signal collected by the electronic device may contain relatively clear information about other sounds in the environment.
  • By filtering the user's voice information out of the collected first audio signal, the electronic device can obtain the other sound information of the real environment.
  • In another implementation, the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform: generating a stereo first audio signal by using the first audio signal; and performing adaptive blocking filtering on the stereo first audio signal according to the user audio signal to filter out the user's voice information in the stereo first audio signal and obtain the ambient stereo audio signal.
  • The first audio signal collected by the electronic device may contain relatively clear information about other sounds in the environment.
  • By filtering the user's voice information out of the collected first audio signal, the electronic device can obtain the other sound information of the real environment.
  • the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to execute: according to the audio-visual orientation of the user's voice and the user's audio signal to generate a user stereo audio signal; enhance the user stereo audio signal without changing the ambient stereo audio signal; and generate the stereo audio signal according to the enhanced user stereo audio signal and the ambient stereo audio signal.
  • the stereo audio signal includes both the user stereo audio signal and the ambient stereo audio signal.
  • audio zooming can be performed: when the user is closer to the first electronic device in the image, the user's voice can become louder, while the ambient sound does not change.
  • when the user is farther from the first electronic device in the image, the user's voice may become quieter, while the ambient sound does not change.
  • the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to execute: generating a user stereo audio signal according to the audio-visual orientation of the user's voice and the user audio signal; enhancing the user stereo audio signal while suppressing the ambient stereo audio signal; and generating the stereo audio signal according to the enhanced user stereo audio signal and the suppressed ambient stereo audio signal.
  • the stereo audio signal includes both the user stereo audio signal and the ambient stereo audio signal.
  • audio zooming can be performed: when the user is closer to the first electronic device in the image, the user's voice can become louder, while the ambient sound becomes quieter.
  • the user's voice may also become quieter, while the ambient sound becomes quieter.
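The two mixing variants described above (enhancing the user's voice while leaving the ambient sound unchanged, or enhancing the user's voice while suppressing the ambient sound) can be sketched as a per-sample weighted sum. This is a minimal illustration under assumptions: the function name `audio_zoom_mix` and the gain values are hypothetical, and the text does not state how the gains are derived from the image.

```python
def audio_zoom_mix(user_stereo, ambient_stereo, user_gain, ambient_gain=1.0):
    """Mix per-sample (left, right) pairs after scaling each component.

    user_gain > 1 enhances the user's voice; ambient_gain == 1 leaves the
    ambient sound unchanged, and ambient_gain < 1 suppresses it.
    """
    mixed = []
    for (ul, ur), (al, ar) in zip(user_stereo, ambient_stereo):
        mixed.append((user_gain * ul + ambient_gain * al,
                      user_gain * ur + ambient_gain * ar))
    return mixed


# Hypothetical gains: the user appears closer, so the voice is boosted.
keep_ambient = audio_zoom_mix([(1.0, 0.5)], [(0.2, 0.2)], user_gain=2.0)
duck_ambient = audio_zoom_mix([(1.0, 0.5)], [(0.2, 0.2)], user_gain=2.0,
                              ambient_gain=0.5)
```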
  • In a third aspect, an embodiment of the present application provides a chip system. The chip system is applied to an electronic device and includes one or more processors, and the processors are configured to invoke computer instructions to cause the electronic device to perform the method described in any one of the implementations of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer program product containing instructions. When the computer program product runs on an electronic device, the electronic device is caused to execute the method described in any one of the implementations of the first aspect.
  • In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium including instructions. When the instructions are executed on an electronic device, the electronic device is caused to execute the method described in any one of the implementations of the first aspect.
  • the chip system provided in the third aspect, the computer program product provided in the fourth aspect, and the computer storage medium provided in the fifth aspect are all used to execute the methods provided by the embodiments of the present application. Therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects in the corresponding method, which will not be repeated here.
  • FIG. 1a is a schematic structural diagram of a world coordinate system, a camera coordinate system, and an image plane coordinate system provided by an embodiment of the present application;
  • FIG. 1b and FIG. 1c are schematic diagrams of an image pixel coordinate system provided by an embodiment of the present application.
  • FIGS. 2a and 2b are schematic diagrams of video recording in a solution provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a communication system 100 provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a first electronic device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a second electronic device provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a signaling interaction of a video recording method in an embodiment of the present application.
  • FIG. 7 is a flowchart of determining the audio-visual orientation of the user's voice corresponding to the user's audio signal by the first electronic device in the embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or as implicitly indicating the number of indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more.
  • the visual orientation of the user refers to the position of the user relative to the center of the camera in the real world determined from the image when the electronic device acquires an image including the user.
  • the reference coordinate system for this position may be the camera coordinate system.
  • the determination of the visual orientation involves the world coordinate system, the camera coordinate system, the image plane coordinate system, and the image pixel coordinate system.
  • FIG. 1a shows an example of a world coordinate system, a camera coordinate system, and an image plane coordinate system provided by an embodiment of the present application.
  • the world coordinate system is represented by O3-Xw-Yw-Zw. Through the world coordinate system, the coordinates of the user in the real world can be obtained.
  • the camera coordinate system is represented by O1-Xc-Yc-Zc.
  • O1 is the optical center of the camera and the origin of the camera coordinate system.
  • the Xc axis, Yc axis, and Zc axis are the coordinate axes of the camera coordinate system, and the Zc axis is the principal optical axis.
  • the coordinates of a point in the world coordinate system can be transformed into the camera coordinate system through rigid body transformation.
  • the camera captures the light reflected by the user, and presents the light on the imaging plane to obtain an optical image of the user.
  • An image plane coordinate system can be established on this imaging plane.
  • the image plane coordinate system is represented by O2-X-Y, O2 is the center of the optical image and the origin of the image plane coordinate system, and the X and Y axes are parallel to the Xc and Yc axes.
  • the coordinates of a point in the camera coordinate system can be transformed into the image plane coordinate system through perspective projection.
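The two transformations described above can be sketched with the standard pinhole camera model. This is an illustrative sketch under assumptions: the rotation matrix `R`, translation vector `t`, and focal length `f` are hypothetical parameters that the text does not name, and the function names are made up for this example.

```python
def world_to_camera(p_world, R, t):
    """Rigid body transformation: p_camera = R * p_world + t.

    R is a 3x3 rotation matrix (list of rows) and t a 3-vector; both are
    assumed to come from a prior calibration of the camera pose.
    """
    return tuple(
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)
    )


def camera_to_image_plane(p_camera, f):
    """Perspective projection onto the image plane O2-X-Y.

    f is the focal length (distance from the optical center O1 to the image
    plane along the Zc axis), in the same length unit as the returned X and Y.
    """
    xc, yc, zc = p_camera
    if zc <= 0:
        raise ValueError("point must lie in front of the camera (Zc > 0)")
    return (f * xc / zc, f * yc / zc)


# Usage with an identity pose: the world and camera coordinate systems coincide.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = world_to_camera((1.0, 2.0, 4.0), identity, (0.0, 0.0, 0.0))
x, y = camera_to_image_plane(p_cam, f=2.0)
```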
  • FIG. 1b and FIG. 1c are schematic diagrams of image pixel coordinate systems provided by embodiments of the present application.
  • the electronic device can process the optical image in the image plane to obtain an image that can be displayed on the display screen.
  • an image plane coordinate system is established in the image plane, denoted by O2-X-Y.
  • the electronic device may display the image corresponding to the optical image directly on the display screen, without cropping the optical image in the image plane.
  • An image pixel coordinate system is established on the image. The image pixel coordinate system is represented by O-U-V, and its unit is the pixel.
  • O is a vertex of the image, and the U and V axes are parallel to the X and Y axes.
  • an image plane coordinate system is established in the image plane, denoted by O2-X-Y.
  • the electronic device can crop the image in the image plane, and display the image corresponding to the cropped image on the display screen.
  • An image pixel coordinate system is established on the image, and the image pixel coordinate system is represented by O-U-V.
  • O is a vertex of the image corresponding to the cropped image, and the U and V axes are parallel to the X and Y axes.
  • the electronic device may also perform other processing, such as zooming, on the image in the image plane to obtain an image that can be displayed on the display screen. The image pixel coordinate system can then be established with a vertex of that image as the coordinate origin.
  • the electronic device can use a certain point on the user to determine the position of the user relative to the center of the camera. When the electronic device has determined the position of the point relative to the center of the camera, it can convert that position into the pixel coordinates, in the image pixel coordinate system, of the pixel corresponding to the point. Similarly, when the electronic device obtains the pixel coordinates of a pixel in the image pixel coordinate system, it can use the pixel coordinates to determine the position, relative to the center of the camera, of the point corresponding to the pixel. In this way, the electronic device can obtain the pixel coordinates of the user in the image from the position of the user relative to the center of the camera, and can also obtain the position of the user relative to the center of the camera from the pixel coordinates of the user in the image. The specific process is described below and is not repeated here.
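The conversion between image plane coordinates and pixel coordinates, and back from a pixel to a direction relative to the camera center, can be sketched as follows. This is a minimal illustration and not the specific process of this application: the pixel sizes `dx` and `dy`, the pixel coordinates `(u0, v0)` of the image plane origin O2, and the focal length `f` are assumed calibration values, and the function names are hypothetical. Note that a single pixel fixes only a direction; recovering a full position would additionally require a depth value.

```python
import math


def image_plane_to_pixel(x, y, dx, dy, u0, v0):
    """Convert image plane coordinates (length units) to pixel coordinates.

    dx, dy -- physical width and height of one pixel
    u0, v0 -- pixel coordinates of the image plane origin O2; they shift
              when the displayed image is a cropped part of the optical image
    """
    return (x / dx + u0, y / dy + v0)


def pixel_to_ray(u, v, f, dx, dy, u0, v0):
    """Return the unit direction, in the camera coordinate system, of the ray
    from the optical center O1 through pixel (u, v)."""
    x = (u - u0) * dx
    y = (v - v0) * dy
    norm = math.sqrt(x * x + y * y + f * f)
    return (x / norm, y / norm, f / norm)


# A pixel at the image plane origin looks straight along the Zc axis.
ray = pixel_to_ray(960.0, 540.0, f=4.0, dx=0.002, dy=0.002, u0=960.0, v0=540.0)
```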
  • the visual orientation of a user can be represented in the camera coordinate system in various ways. One way is an orientation angle relative to the center of the camera coordinate system, which can include an azimuth angle and an elevation angle. For example, the elevation angle, that is, the angle from the image of the object to the Zc axis, is m°, and the azimuth angle, that is, the angle from the image of the object to the Yc axis, is n°.
  • the visual orientation of the user can be recorded as (m°, n°).
  • Alternatively, the user's distances relative to the Xc axis, Yc axis, and Zc axis of the camera are a, b, and c, respectively.
  • the visual orientation of the user can be recorded as (a, b, c).
  • the visual orientation of a user relative to the camera coordinate system may also be defined by other representations, which is not limited in this embodiment of the present application.
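The angle representation and the (a, b, c) representation can be converted into each other. The sketch below shows one possible convention, which is an assumption and may differ from the angle definitions used in this application: azimuth is measured in the Xc-Zc plane from the Zc axis, and elevation is measured from that plane toward the Yc axis. The function name is hypothetical.

```python
import math


def camera_point_to_angles(a, b, c):
    """Return (azimuth_deg, elevation_deg) for a point (a, b, c) given along
    the Xc, Yc, and Zc axes of the camera coordinate system.

    Assumed convention: azimuth 0 and elevation 0 point along the Zc axis.
    The sign of the elevation depends on whether Yc points up or down.
    """
    azimuth = math.degrees(math.atan2(a, c))
    elevation = math.degrees(math.atan2(b, math.hypot(a, c)))
    return azimuth, elevation


# A point one unit to the side and one unit ahead lies at 45 degrees azimuth.
m, n = camera_point_to_angles(1.0, 0.0, 1.0)
```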
  • one solution is to collect the sound signal through the microphone of the electronic device and then convert the sound signal into an electrical signal, that is, an audio signal. The electronic device can then focus the audio signal into a desired area (e.g., the area where the user is speaking) and reproduce the focused audio signal as sound. In this way, the audio-visual orientation of the user's voice in the recorded video matches the user's visual orientation.
  • sound is transmitted to the electronic device as a sound signal. After collecting the sound signal, the electronic device converts it into an audio signal and then reproduces the sound corresponding to the audio signal. The audio-visual orientation refers to the position of that reproduced sound.
  • To solve the problem in the above solution that the user's voice is not clear in the recorded video when the user is far away from the electronic device, another solution is that the user wears a TWS earphone, and the TWS earphone collects the user's sound signal. Since the microphone of the TWS earphone is very close to the user, the TWS earphone can collect the user's sound signal at close range while isolating some of the other sound signals in the environment, so that fewer other sound signals are collected together with the user's sound signal. For the other sound signals that are still collected, the TWS earphone can perform noise reduction processing to remove them and retain the user's sound signal, and then transmit the sound signal to the electronic device through a wireless network. The electronic device can process the received sound signal and the recorded image to obtain a video. In this video, the user's voice is clear.
  • since the microphone of the TWS earphone always collects the user's sound signal at the user's ear, the direction of the collected sound signal does not change from the microphone's point of view: it is always the orientation of the user's voice-producing part relative to the center of the microphone.
  • although the direction of the user's sound signal does not change, the direction of the user in the captured image may change.
  • the direction of the user relative to the center of the microphone of the TWS earphone basically does not change at two moments before and after, and the direction of the user's voice signal collected by the microphone basically does not change. Then, in the recorded video, the audio-visual orientation of the user's voice does not change.
  • the audio-visual orientation of the user's voice does not change, but in the video, the user's visual orientation changes. In this way, in the obtained video, the audio-visual orientation of the user's voice does not match the user's visual orientation.
  • the audio-visual azimuth of the user's voice matches the user's visual azimuth, and in the video, the user's voice is clear.
  • the user wears a TWS headset and records a video including himself when he is far away from the mobile phone.
  • for this scenario, reference may be made to FIG. 2a and FIG. 2b.
  • the electronic device can obtain the audio-visual azimuth of the user's voice corresponding to the sound signal by using the sound signal and the user's visual azimuth in the image.
  • the azimuth matches the user's visual azimuth, and when the sound signal corresponding to the user's voice is reproduced by using the audio-visual azimuth, the user's voice can be matched with the user's visual azimuth.
  • in FIG. 2a, the visual orientation of the user is to the left of the center of the camera, so at this time the audio-visual orientation of the user's voice is also to the left relative to the camera.
  • in FIG. 2b, the visual orientation of the user is to the right of the center of the camera, so at this time the audio-visual orientation of the user's voice is also to the right relative to the camera.
  • the audio-visual orientation of the user's voice matches the user's visual orientation, and when the user's voice signal is collected by the TWS headset, the user's voice is clear in the video.
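To illustrate how a mono user audio signal could be reproduced so that it is heard on the same side as the user appears in the image, the sketch below uses constant-power amplitude panning. This is a simple assumed stand-in and not the stereo rendering method of this application. The function name, the sign convention (negative azimuth means left of the camera center), and the `max_azimuth_deg` limit are hypothetical.

```python
import math


def pan_mono_to_stereo(samples, azimuth_deg, max_azimuth_deg=90.0):
    """Return (left, right) sample pairs panned toward `azimuth_deg`.

    The azimuth is clamped to [-max_azimuth_deg, max_azimuth_deg] and mapped
    to a pan angle in [0, pi/2], so that left**2 + right**2 stays constant.
    """
    az = max(-max_azimuth_deg, min(max_azimuth_deg, azimuth_deg))
    theta = (az + max_azimuth_deg) / (2.0 * max_azimuth_deg) * (math.pi / 2.0)
    left_gain, right_gain = math.cos(theta), math.sin(theta)
    return [(left_gain * s, right_gain * s) for s in samples]


# User on the left of the frame (as in FIG. 2a): the left channel dominates.
left_side = pan_mono_to_stereo([1.0, 0.5], azimuth_deg=-30.0)
```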
  • FIG. 3 is a schematic structural diagram of a communication system 100 provided by an embodiment of the present application.
  • the communication system 100 includes: a plurality of electronic devices, such as a first electronic device, a second electronic device, and a third electronic device.
  • the first electronic device in the embodiment of the present application may be a terminal device equipped with Android, Huawei HarmonyOS, iOS, Microsoft or other operating systems, such as a smart screen, mobile phone, tablet computer, notebook computer, personal computer, etc.
  • the second electronic device and the third electronic device can collect sound signals, convert the sound signals into electrical signals, and then transmit them to the first electronic device.
  • the second electronic device and the third electronic device may be TWS earphones, Bluetooth earphones, and the like.
  • the wireless network is used to provide various services, such as communication services, connection services, transmission services, etc., to the electronic devices involved in the embodiments of the present application.
  • Wireless networks include: Bluetooth (BT), wireless local area network (WLAN) technology, wireless wide area network (WWAN) technology, etc.
  • the first electronic device may establish a connection with the second electronic device and the third electronic device through a wireless network, and then perform data transmission.
  • the first electronic device can search for the second electronic device. When the second electronic device is found, the first electronic device can send it a request to establish a connection, and after receiving the request, the second electronic device can establish a connection with the first electronic device.
  • the second electronic device can convert the collected sound signal into an electrical signal, that is, an audio signal, and transmit it to the first electronic device through a wireless network.
  • the first electronic device involved in the communication system 100 is described below.
  • FIG. 4 is a schematic structural diagram of a first electronic device provided by an embodiment of the present application.
  • the first electronic device may have more or fewer components than shown in the figures, may combine two or more components, or may have different component configurations.
  • the various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
  • the first electronic device may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the PCM interface can also be used for audio communication, in which analog signals are sampled, quantized, and encoded.
  • the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface.
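As a small illustration of the sampling, quantizing, and encoding steps named above, the sketch below produces 16-bit linear PCM from a sine tone. It illustrates pulse code modulation in general and not the behavior of the PCM interface hardware; the sample rate, bit depth, and function names are assumptions chosen for the example.

```python
import math
import struct


def sample_sine(freq_hz, sample_rate_hz, count):
    """Sampling: evaluate a sine tone at `count` evenly spaced instants."""
    return [math.sin(2.0 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(count)]


def quantize_pcm16(value):
    """Quantizing: map a value in [-1.0, 1.0] to a signed 16-bit integer."""
    clipped = max(-1.0, min(1.0, value))
    return int(round(clipped * 32767))


def encode_pcm16le(samples):
    """Encoding: pack the quantized samples as little-endian 16-bit words."""
    codes = [quantize_pcm16(s) for s in samples]
    return struct.pack("<%dh" % len(codes), *codes)


# Eight samples of a 1 kHz tone at an assumed 8 kHz sample rate.
pcm_bytes = encode_pcm16le(sample_sine(1000.0, 8000.0, 8))
```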
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the SIM interface can be used to communicate with the SIM card interface 195 to realize the function of transferring data to the SIM card or reading data in the SIM card.
  • the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the first electronic device, and can also be used to transmit data between the first electronic device and a peripheral device. It can also be used to connect headphones to play audio through the headphones.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the first electronic device.
  • the first electronic device may also adopt an interface connection manner different from those in the foregoing embodiment, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110 , the internal memory 121 , the external memory, the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the wireless communication function of the first electronic device may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the first electronic device may be used to cover one or more communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G etc. applied on the first electronic device.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal.
  • the wireless communication module 160 can provide wireless communication solutions applied on the first electronic device, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the antenna 1 of the first electronic device is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the first electronic device can communicate with the network and other devices through wireless communication technology.
  • the first electronic device implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the first electronic device can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used to process the data fed back by the camera 193 .
  • when the shutter is opened, light is transmitted through the lens to the photosensitive element of the camera, where the optical signal is converted into an electrical signal. The photosensitive element transmits the electrical signal to the ISP, which processes it and converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • an optical image of the object is generated through the lens and projected onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the first electronic device may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • a digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals.
  • Video codecs are used to compress or decompress digital video.
  • the first electronic device may support one or more video codecs.
  • the first electronic device can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications involving intelligent cognition of the first electronic device, such as image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the first electronic device.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function. For example, files such as music and videos are saved on the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes various functional applications and data processing of the first electronic device by executing the instructions stored in the internal memory 121 .
  • the internal memory 121 may include a storage program area and a storage data area.
  • the first electronic device may implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • music or a hands-free call can be listened to through the speaker 170A of the first electronic device.
  • the receiver 170B also referred to as "earpiece" is used to convert audio electrical signals into sound signals.
  • the voice can be received by placing the receiver 170B close to the human ear.
  • the microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.
  • the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C.
  • the first electronic device may be provided with at least one microphone 170C.
  • the first electronic device can be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals.
  • the first electronic device may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone jack 170D is used to connect wired earphones.
  • the earphone interface 170D may be the USB interface 130, or may be a 3.5mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • the pressure sensor 180A may be provided on the display screen 194 .
  • the capacitive pressure sensor may be comprised of at least two parallel plates of conductive material.
  • the gyro sensor 180B may be used to determine the motion attitude of the first electronic device. In some embodiments, the angular velocity of the first electronic device about three axes (ie, the x, y, and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B can be used for image stabilization.
  • the air pressure sensor 180C is used to measure air pressure.
  • the first electronic device calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
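The text does not say how the altitude is computed from the air pressure value. One commonly used approximation is the international barometric formula, sketched below as an assumption. The function name and the default sea-level pressure of 1013.25 hPa are illustrative, and the result is only meaningful when the true sea-level pressure is close to the value supplied.

```python
def pressure_to_altitude_m(pressure_hpa, sea_level_hpa=1013.25):
    """Approximate altitude in meters from air pressure in hPa, using the
    international barometric formula (valid for the lower atmosphere)."""
    if pressure_hpa <= 0 or sea_level_hpa <= 0:
        raise ValueError("pressures must be positive")
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))


# A reading below the sea-level reference gives a positive altitude.
altitude = pressure_to_altitude_m(898.75)
```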
  • the magnetic sensor 180D includes a Hall sensor.
  • the first electronic device can detect the opening and closing of a flip leather case using the magnetic sensor 180D.
  • the first electronic device can detect the opening and closing of a flip cover according to the magnetic sensor 180D. Further, features such as automatic unlocking when the flip cover is opened can be set according to the detected opening and closing state of the leather case or of the flip cover.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the first electronic device in various directions (generally three axes).
  • the acceleration sensor 180E can detect the magnitude and direction of gravity when the first electronic device is stationary. It can also be used to identify the posture of the electronic device, in applications such as landscape/portrait screen switching and pedometers.
  • the first electronic device can measure distance by infrared or laser. In some embodiments, when shooting a scene, the first electronic device can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diodes may be infrared light emitting diodes.
  • the first electronic device emits infrared light outward through the light emitting diode.
  • the first electronic device detects infrared reflected light from nearby objects using a photodiode.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the first electronic device can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the first electronic device is in the pocket to prevent accidental touch.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the first electronic device can use the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint-triggered photographing, fingerprint-based call answering, and the like.
  • the temperature sensor 180J is used to detect the temperature.
  • the first electronic device uses the temperature detected by the temperature sensor 180J to execute the temperature processing strategy.
  • the touch sensor 180K is also called a “touch panel”.
  • the touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touchscreen, also called a “touch-control screen”.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the first electronic device can receive key input and generate key signal input related to user settings and function control of the first electronic device.
  • Motor 191 can generate vibrating cues.
  • the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact with and separation from the first electronic device.
  • the first electronic device may further include: a laser sensor (not shown).
  • Laser sensors are used to sense the vibrations of objects, which can be converted into electrical signals.
  • the laser sensor can detect the vibration of the throat, obtain the Doppler frequency shift signal produced when the throat vibrates, and convert the Doppler frequency shift signal into an electrical signal corresponding to the vibration frequency of the throat.
  • the first electronic device may further include other devices for detecting the vibration of the object, such as a vibration sensor (not shown), an ultrasonic sensor (not shown), and the like.
  • the processor 110 may call the computer instructions stored in the internal memory 121, so that the first electronic device executes the video recording method in the embodiment of the present application.
  • the second electronic device involved in the communication system 100 is described below.
  • FIG. 5 is a schematic structural diagram of a second electronic device provided by an embodiment of the present application.
  • the second electronic device may have more or fewer components than shown in the figures, may combine two or more components, or may have different component configurations.
  • the various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
  • the second electronic device may include: a processor 151 , a microphone 152 , a bone conduction sensor 153 , and a wireless communication processing module 154 .
  • the processor 151 may be used to parse the signal received by the wireless communication processing module 154 .
  • the signal includes: a request for establishing a connection sent by the first electronic device.
  • the processor 151 may also be configured to generate a signal sent by the wireless communication processing module 154 to the outside, where the signal includes a request for transmitting an audio signal to the first electronic device, and the like.
  • a memory may also be provided in the processor 151 for storing instructions.
  • the instructions may include instructions to enhance noise reduction, instructions to send a signal, and the like.
  • the microphone 152 can collect the user's sound signal, as well as part of the other sound signals in the surrounding environment, by relying on the conduction of sound through air, and then convert the sound signal into an electrical signal to obtain an audio signal.
  • the microphone 152 is also called a “mic” or a “sound transmitter”.
  • the second electronic device may include 1 or N microphones 152 , where N is a positive integer greater than 1.
  • the bone conduction sensor 153 can collect sound signals by utilizing the conduction of sound through bone. Most of the sound signals in the surrounding environment are conducted through the air, whereas the bone conduction sensor collects only sound signals conducted through direct contact with bone, such as the user's sound signal, and then converts the sound signal into an electrical signal to obtain a bone conduction audio signal.
  • the wireless communication processing module 154 may include one or more of a Bluetooth (bluetooth, BT) communication processing module 154A and a WLAN communication processing module 154B, for providing services such as establishing a connection with the first electronic device and performing data transmission.
  • FIG. 6 is a schematic diagram of signaling interaction of a video recording method in an embodiment of the present application.
  • the first electronic device and the second electronic device have established a connection, and data transmission can be performed.
  • the first electronic device can continuously collect multiple frames of images and at the same time continuously collect audio information in the shooting environment.
  • the second electronic device also records the audio signal in the shooting environment.
  • Steps S101 to S109 describe how, during video recording, the current frame image and the audio signals corresponding to it (including the first audio signal, the second audio signal, the bone conduction audio signal, the user audio signal, and other audio signals) are processed.
  • It can be understood that the video involved in the embodiments of the present application can be obtained by processing each frame of image and its corresponding audio signals as described in steps S101 to S109.
  • S101. The second electronic device collects the second audio signal
  • the second electronic device may continuously collect a second audio signal over a period of time (the time corresponding to playing one frame of image). The second audio signal may include the user's sound information and other sound information in the environment around the second electronic device.
  • the length of this period may differ in different cases, for example, 1/24 of a second or 1/12 of a second.
  • S102. The second electronic device collects the bone conduction audio signal
  • the second electronic device may further use the bone conduction sensor to continuously collect the user's bone conduction audio signal for a period of time (the time corresponding to playing one frame of image).
  • S103. The second electronic device processes the second audio signal to obtain a user audio signal
  • the second electronic device may perform sampling, noise reduction, and the like on the second audio signal to remove other sound information in the environment, thereby enhancing the user's sound information in the second audio signal and obtaining the user audio signal.
  • the user audio signal obtained in the above manner includes not only the user's sound information but also a small part of the other sound information in the environment around the second electronic device. Therefore, optionally, in some other embodiments, so that the user audio signal obtained after processing the second audio signal includes only the user's sound information, the second electronic device may perform joint noise reduction processing by using the bone conduction audio signal and the second audio signal, removing from the second audio signal the other sound information in the environment around the second electronic device to obtain the user audio signal.
  • the implementation manner of joint noise reduction is the same as the implementation manner of performing joint noise reduction processing on multi-channel audio signals by the same electronic device in the prior art.
  • This embodiment provides one method of joint noise reduction: a differential calculation is performed between the bone conduction audio signal and the second audio signal so that the noise in the two signals cancels out, thereby achieving joint noise reduction. Note that, in the differential calculation, the two audio signals need to be weighted according to their sound wave intensities so that the weighted noise intensities are substantially the same, which maximizes the noise reduction. If the wanted (non-noise) audio signal is weakened by the differential calculation, the differential audio signal can be amplified to obtain the user audio signal.
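  • as a non-limiting illustration of the weighted differential calculation described above, the following minimal sketch assumes that the noise reaches both sensors as the same waveform at different levels and that noise-only excerpts of each channel are available (for example, from pauses in speech). The function names, the toy signals, and the `makeup_gain` parameter are hypothetical and do not come from the application.

```python
import math

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def joint_denoise(bone, mic, bone_noise, mic_noise, makeup_gain=1.0):
    """Weighted difference of the bone conduction and microphone signals.

    bone_noise / mic_noise are noise-only excerpts of each channel
    (an assumption of this sketch, not something the source specifies).
    """
    # Weight the microphone channel so both channels carry the same noise level.
    g = rms(bone_noise) / rms(mic_noise)
    # Subtract to cancel the noise, then amplify the (weakened) voice.
    return [makeup_gain * (b - g * m) for b, m in zip(bone, mic)]

# Toy signals: the same voice and noise reach both sensors with different gains.
voice = [math.sin(2 * math.pi * 200 * k / 48000) for k in range(480)]
noise = [math.sin(2 * math.pi * 1300 * k / 48000 + 0.7) for k in range(480)]
mic = [0.5 * s + 1.0 * n for s, n in zip(voice, noise)]    # noisy air microphone
bone = [1.0 * s + 0.1 * n for s, n in zip(voice, noise)]   # mostly voice

# The difference leaves 0.95 * voice, so a make-up gain of 1 / 0.95 restores it.
user_audio = joint_denoise(bone, mic,
                           bone_noise=[0.1 * n for n in noise],
                           mic_noise=noise,
                           makeup_gain=1 / 0.95)
```

  • in this toy case the weighted noise terms are equal in both channels, so the subtraction removes the noise and the make-up gain compensates for the voice that was subtracted along with it.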
  • S104. The first electronic device collects an image
  • the camera of the first electronic device may capture an image, and the image may include a portrait, and the portrait may be a portrait of a user.
  • S105. The first electronic device collects the first audio signal
  • the microphone of the first electronic device starts to continuously collect the first audio signal for a period of time, and the first audio signal may include the user's sound information and other sound information in the surrounding environment of the first electronic device.
  • S106. The second electronic device sends the user audio signal to the first electronic device
  • the second electronic device may send the user audio signal to the first electronic device through the wireless network.
  • S107. The second electronic device sends the bone conduction audio signal to the first electronic device
  • the second electronic device may send a bone conduction audio signal to the first electronic device through a wireless network.
  • S108. The first electronic device determines the audio-visual orientation of the user's voice corresponding to the user audio signal
  • Fig. 7 shows a flowchart of the first electronic device determining the audio-visual orientation of the user's voice corresponding to the user's audio signal.
  • the first electronic device may first determine the sound source azimuth of the user by using the first audio signal. In other embodiments, to improve the accuracy of the obtained sound source azimuth of the user, the first electronic device may also combine the first audio signal with the bone conduction audio signal to obtain the sound source azimuth of the user. For a detailed description of this process, refer to step S201.
  • the first electronic device can then obtain the pixel position of each face in the image and use these pixel positions, combined with the user's sound source azimuth, to obtain the audio-visual orientation of the user's voice. For a detailed description of this process, refer to steps S202 to S204.
  • S201. The first electronic device uses the first audio signal to determine the sound source azimuth of the user
  • the sound source azimuth of the user may be the azimuth of the user's sound source relative to the center of the microphone of the first electronic device, and may include at least one of a horizontal angle and a pitch angle.
  • the horizontal angle is denoted as θ
  • the pitch angle is denoted as φ.
  • the horizontal angle θ and the pitch angle φ can be obtained from the first audio signal; for a specific implementation, refer to the following algorithms:
  • the first electronic device may determine the horizontal angle θ and the pitch angle φ from the first audio signal based on a high-resolution spatial spectrum estimation algorithm.
  • the first electronic device may determine the horizontal angle θ and the pitch angle φ based on a maximum-output-power beamforming algorithm, according to the beamforming of the N microphones and the first audio signal.
  • the first electronic device may also determine the horizontal angle θ and the pitch angle φ in other manners. This embodiment of the present application does not limit this.
  • with the maximum-output-power beamforming algorithm, the first electronic device can determine the beam direction with the maximum power as the target sound source azimuth, which is the user's sound source azimuth.
  • the formula to obtain the target sound source azimuth θ can be expressed as:
  • t represents the time frame, that is, the processing frame of the audio signal.
  • i represents the i-th microphone
  • H_i(f, θ) represents the beam weight of the i-th microphone in beamforming
  • Y_i(f, t) represents the time-frequency-domain audio signal obtained from the sound information collected by the i-th microphone.
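  • the formula itself is not reproduced in this text. As a non-limiting illustration, and assuming the standard maximum-output-power (steered response power) form, which is consistent with the symbols defined above, the search over candidate azimuths θ at a single frequency f may be written as:

```latex
\hat{\theta} = \arg\max_{\theta} \sum_{t} \left| \sum_{i=1}^{N} H_i(f,\theta)\, Y_i(f,t) \right|^{2}
```

  • here the inner sum is the beamformer output for frame t, and the outer sum accumulates its power over the frames being processed.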
  • the beamforming refers to the response of the N microphones to the narrowband sound signal. Since the response is different in different azimuths, beamforming is correlated with sound source azimuth. Therefore, beamforming can locate the sound source in real time and suppress the interference of background noise.
  • Beamforming can be represented as a 1×N matrix, denoted as H(f, θ), where N is the number of corresponding microphones.
  • the value of the i-th element in beamforming can be expressed as H_i(f, θ), which is related to the arrangement position of the i-th microphone among the N microphones.
  • the beamforming can be obtained by using a power spectrum, which can be a Capon spectrum, a Bartlett spectrum, or the like.
  • the i-th element in beamforming obtained by the first electronic device using the Bartlett spectrum can be expressed in terms of the imaginary unit j, the phase compensation value of the beamformer for the microphone, and τ_i, the delay difference with which the same sound information reaches the i-th microphone.
  • the time delay difference is related to the position of the sound source and the position of the ith microphone, and reference may be made to the description below.
  • the center of the first microphone that can receive sound information among the N microphones is selected as the origin, and a three-dimensional space coordinate system is established.
  • the relationship between τ_i and the sound source azimuth and the position of the i-th microphone can be expressed by the following formula:
  • the first audio signal includes an audio signal obtained from sound information collected by N microphones, where N is a positive integer greater than 1.
  • the sound information collected by the i-th microphone can be converted into an audio signal in the time-frequency domain, Y_i(f, t), which is expressed in terms of s_o(f, t), the time-frequency-domain audio signal, varying with time t, converted from the sound information collected by the microphone at the origin.
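  • the beamforming search described above can be sketched as follows. The expressions for the beam weight, the delay difference, and Y_i(f, t) are not reproduced in this text, so the sketch assumes a far-field plane-wave model with the first microphone at the origin, a delay-and-sum (Bartlett) weight of the form e^{j2πfτ_i}, and a scan over the horizontal angle only. The function names, the sign conventions, the speed of sound, and the toy array geometry are all hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def delays(mics, theta, phi):
    """Delay of each microphone relative to the origin for a far-field source.

    mics: list of (x, y, z) positions in metres; theta/phi: horizontal and
    pitch angle in radians. A microphone closer to the source hears it
    earlier, hence the negative sign (a convention assumed by this sketch).
    """
    u = (math.cos(phi) * math.cos(theta),
         math.cos(phi) * math.sin(theta),
         math.sin(phi))
    return [-(x * u[0] + y * u[1] + z * u[2]) / SPEED_OF_SOUND
            for x, y, z in mics]

def bartlett_weights(mics, f, theta, phi):
    """Phase-compensating weights H_i(f, theta) that undo each delay."""
    return [cmath.exp(2j * math.pi * f * tau) for tau in delays(mics, theta, phi)]

def steered_power(Y, mics, f, theta, phi):
    """Beamformer output power summed over frames; Y[t][i] is Y_i(f, t)."""
    H = bartlett_weights(mics, f, theta, phi)
    return sum(abs(sum(h * y for h, y in zip(H, frame))) ** 2 for frame in Y)

def estimate_horizontal_angle(Y, mics, f, step_deg=5):
    """Return the candidate angle (degrees) with maximum output power."""
    grid = range(-180, 180, step_deg)
    return max(grid, key=lambda d: steered_power(Y, mics, f, math.radians(d), 0.0))

# Toy example: four microphones on a 5 cm square, source at 40 degrees.
mics = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.05, 0.05, 0.0)]
f = 1000.0
true_tau = delays(mics, math.radians(40), 0.0)
s_o = [cmath.exp(0.3j * t) for t in range(8)]  # signal at the origin microphone
Y = [[s * cmath.exp(-2j * math.pi * f * tau) for tau in true_tau] for s in s_o]
est = estimate_horizontal_angle(Y, mics, f)
```

  • in the toy example the weights cancel the inter-microphone phase differences only when the candidate angle equals the true one, so the output power peaks there; a finer grid or a joint scan over the pitch angle would follow the same pattern.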
  • when the first audio signal is wideband information, in order to improve the processing accuracy, the first audio signal may be divided in the frequency domain by a discrete Fourier transform (DFT) into several narrowband audio signals, and the positioning result of the wideband audio signal is obtained by combining the processing results of the narrowband audio signals at the individual frequency points.
  • for example, a wideband audio signal with a sampling rate of 48 kHz is divided into 2049 narrowband audio signals by a 4096-point DFT.
  • the target sound source orientation can be determined by processing each narrowband audio signal or a plurality of narrowband audio signals by using the above algorithm.
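  • the figure of 2049 narrowband signals follows from the fact that a real-valued signal has n/2 + 1 non-redundant frequency points in an n-point DFT. The following minimal sketch, whose function name is hypothetical, computes the number of frequency points and their centre frequencies for the example above.

```python
def narrowband_layout(sample_rate_hz, n_fft):
    """Number of non-redundant DFT bins of a real signal and their frequencies."""
    n_bins = n_fft // 2 + 1              # bins from 0 Hz up to the Nyquist frequency
    spacing_hz = sample_rate_hz / n_fft  # distance between adjacent frequency points
    return n_bins, [k * spacing_hz for k in range(n_bins)]

n_bins, centers_hz = narrowband_layout(48000, 4096)
# n_bins is 2049; the bins are 11.71875 Hz apart and the last one sits at 24000 Hz.
```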
  • the formula to obtain the target sound source azimuth θ can be expressed as:
  • f represents the frequency value in the frequency domain.
  • in addition to the user's sound information, the first audio signal also includes some other sound information. To prevent this other sound information from affecting the determination of the user's sound source azimuth, the first electronic device can use the bone conduction audio signal to filter out the other sound information in the first audio signal and enhance the user's sound information in the first audio signal, so that the obtained sound source azimuth of the user is more accurate.
  • a correlation analysis can be performed on the first audio signal in combination with the bone conduction audio signal: a larger weight is set for the audio information at time-frequency points in the first audio signal that are strongly correlated with the bone conduction audio signal, and a smaller weight is set for the audio information at weakly correlated time-frequency points.
  • a weight matrix w(f, t) is thus obtained, and an element in the weight matrix can be denoted as w_mn, which represents the weight of the audio signal with frequency n at the m-th time. Then, using the weight matrix w(f, t) together with the above algorithm, the formula for obtaining the target sound source azimuth θ from the first audio signal combined with the bone conduction audio signal can be expressed as:
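  • the wideband and weighted formulas are not reproduced in this text. As a non-limiting illustration, and assuming they extend the single-frequency maximum-output-power search by summing over the frequency points f and applying the weight w(f, t) to each time-frequency point, they may be written as:

```latex
% wideband search: accumulate power over frequency points f and frames t
\hat{\theta} = \arg\max_{\theta} \sum_{f} \sum_{t}
  \left| \sum_{i=1}^{N} H_i(f,\theta)\, Y_i(f,t) \right|^{2}

% weighted search: w(f,t) emphasises time-frequency points that are
% strongly correlated with the bone conduction audio signal
\hat{\theta} = \arg\max_{\theta} \sum_{f} \sum_{t} w(f,t)
  \left| \sum_{i=1}^{N} H_i(f,\theta)\, Y_i(f,t) \right|^{2}
```

  • with all weights equal to 1, the weighted form reduces to the wideband form.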
  • S202. The first electronic device obtains the pixel position of the face in the image according to the image
  • the human face refers to all the human faces in the image that can be recognized by the first electronic device, and the pixel position of the human face in the image can be represented by the pixel coordinates of the human face in the image pixel coordinate system.
  • a pixel point may be selected from the human face, and the pixel coordinates of the point may be used to represent the pixel position of the human face.
  • the pixel coordinates of the center point of the human mouth can be used as the pixel position of the human face
  • the pixel coordinates of the center point of the human face can also be used as the pixel position of the human face.
  • the first electronic device may sample the images captured over a period of time to obtain N frames of images, and determine the pixel position of the face in one of these frames.
  • the first electronic device may perform face recognition on the image to obtain pixel coordinates of the face, where the pixel coordinates are the pixel positions of the face in the image during this period.
  • the pixel position of the face can be represented by a matrix, denoted as H.
  • S203. The first electronic device obtains the visual orientation of the user in the image according to the sound source azimuth and the pixel position of the human face
  • the first electronic device can obtain the correlation between the pixel position of each face and the sound source azimuth. The face whose pixel position is most strongly correlated with the sound source azimuth is taken to be the user's face, and the first electronic device determines the visual orientation of that face as the visual orientation of the user.
  • the first electronic device can obtain the approximate pixel coordinates of the user in the image through the sound source orientation. Then, from the pixel positions of the face, a pixel position of the face closest to the approximate pixel coordinates is determined. The pixel position of the face is taken as the pixel coordinates of the user in the image. Then, using the pixel coordinates of the user in the image, the visual orientation of the user is obtained.
  • for an algorithm that obtains the user's visual orientation by using the sound source azimuth, refer to the following description.
  • the sound source azimuth is obtained relative to the center of the microphone. Since the distance between the microphone center and the camera center in the first electronic device is much smaller than the distance between the first electronic device and the user, the sound source azimuth can be considered to be obtained relative to the center of the camera.
  • the horizontal angle θ and the pitch angle φ corresponding to the sound source azimuth are taken as the approximate horizontal angle and pitch angle of the user relative to the camera, and the user's visual orientation can be obtained by using a correlation algorithm that includes the camera parameters.
  • (u_0, v_0) are the pixel coordinates, in the image pixel coordinate system, of the origin of the image plane coordinate system
  • dx is the length of one pixel in the U-axis direction in the image
  • dy is the length of one pixel in the V-axis direction in the image.
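  • the mapping from the sound source azimuth to approximate pixel coordinates, followed by the selection of the nearest face, can be sketched as follows. The formula used by the application is not reproduced in this text, so the sketch assumes an ideal pinhole camera, angles measured from the optical axis (positive to the right and upward), a V axis pointing downward, and a focal length given in the same unit as dx and dy. All function names and numeric values are hypothetical.

```python
import math

def azimuth_to_pixel(theta, phi, focal_len, dx, dy, u0, v0):
    """Project a direction (theta, phi) onto the image under a pinhole model."""
    # Image-plane coordinates of the ray, in the same unit as focal_len.
    x = focal_len * math.tan(theta)
    y = focal_len * math.tan(phi) / math.cos(theta)
    # Convert to pixel coordinates; v grows downward in this sketch.
    return u0 + x / dx, v0 - y / dy

def nearest_face(approx_uv, face_pixels):
    """Pick the face pixel position closest to the approximate coordinates."""
    return min(face_pixels,
               key=lambda p: math.hypot(p[0] - approx_uv[0], p[1] - approx_uv[1]))

# Toy example: 4 mm focal length, 2 micrometre pixels, 1920x1080 image centre.
approx = azimuth_to_pixel(math.atan(0.25), 0.0,
                          focal_len=4.0, dx=0.002, dy=0.002, u0=960, v0=540)
faces = [(400, 500), (1450, 560)]   # hypothetical face pixel positions
user_pixel = nearest_face(approx, faces)
```

  • in the toy example the source lies to the right of the optical axis, so the approximate coordinates fall in the right half of the image and the right-hand face is selected as the user's.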
  • the specific implementation of step S203 is not limited in this embodiment of the present application.
  • the correlation between each face and the sound source azimuth obtained in the above steps S201 to S203 may be used as the first determining factor.
  • the first electronic device may perform first feature extraction on the user's voice signal by using the bone conduction audio signal to obtain the first feature in the user's voice signal. The first electronic device also performs first feature extraction on the faces in the image to obtain the first feature of each face in the image.
  • the first electronic device may perform the first feature extraction on the human face by using the image, and use the extracted feature as the first feature of the human face.
  • the first feature may include a voice activity detection (VAD) feature, a phoneme feature, and the like.
  • the first electronic device may instead perform the first feature extraction by using the ultrasonic echo signal or the laser echo signal generated by the vibration of the throat when the person whose face is in the image speaks; in this case, the first feature may be a pitch feature or the like.
  • the first electronic device may use this feature as the first feature of the human face.
  • the ultrasonic echo signal generated by the vibration of the throat when the person speaks can be collected by the vibration sensor or the ultrasonic sensor of the first electronic device.
  • the laser echo signal can be collected by the laser sensor of the first electronic device.
  • the first electronic device may perform a correlation analysis between the first feature in the user's voice signal and the first feature of each face in the image, and use the resulting correlation between the first feature of each face and the first feature in the user's bone conduction audio signal as a second determining factor.
  • the first electronic device can determine the user's visual orientation as the audio-visual orientation of the user's voice.
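  • the correlation analysis behind the second determining factor can be sketched as follows. The sketch assumes that the first feature is a per-frame voice activity flag (1 for active, 0 for silent) and uses the Pearson correlation coefficient as the correlation measure; both choices, as well as the function name and the toy sequences, are hypothetical, and how the two determining factors are combined is left open.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Voice activity derived from the bone conduction audio signal (toy data).
voice_vad = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
# Voice activity estimated for each face in the image (toy data).
face_vad = {
    "face_a": [0, 1, 1, 0, 0, 1, 1, 1, 0, 0],  # moves in step with the voice
    "face_b": [1, 0, 0, 1, 1, 0, 0, 0, 1, 1],  # active only when the voice is silent
    "face_c": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never speaks
}
scores = {name: pearson(voice_vad, vad) for name, vad in face_vad.items()}
best_face = max(scores, key=scores.get)
```

  • the face with the highest score is the one whose activity best tracks the user's voice.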
  • S109. The first electronic device obtains a stereo audio signal by using the audio-visual orientation of the user's voice and the user audio signal.
  • the stereo audio signal only includes the user stereo audio signal but does not include the ambient stereo audio signal.
  • the user stereo audio signal is a stereo audio signal.
  • the user's stereo audio signal includes the user's voice information.
  • the user stereo audio signal can be used for reproducing the user's voice, and in the user's voice reproduced by the user stereo audio signal, the audio-visual orientation of the user's voice matches the user's visual orientation.
  • the user stereo audio signal refers to a dual-channel user audio signal.
  • the aforementioned user audio signal is a single-channel audio signal, and the restored user's voice is not stereo.
  • the single-channel audio signal can be converted into a dual-channel user audio signal.
  • the user's voice restored from the dual-channel user audio signal is stereo.
  • the first electronic device may use the audio-visual orientation of the user's voice and combine the user audio signal to obtain a user stereo audio signal corresponding to the audio-visual orientation of the user's voice.
  • the first electronic device convolves the user audio signal with the head-related impulse response (HRIR) corresponding to the audio-visual orientation of the user's voice to recover the inter-aural level difference (ILD), the inter-aural time difference (ITD), and the spectral cues, so that the single-channel user audio signal is turned into a dual-channel user audio signal, which can include a left channel and a right channel.
  • the ILD, the ITD, and the spectral cues allow the audio-visual orientation of the user's voice to be determined from the dual-channel user audio signal.
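  • the HRIR convolution described above can be sketched as follows. Real HRIRs are measured per direction and are typically hundreds of samples long; the three-sample responses below are toy values chosen only to make the level and time differences visible, and the function names are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y

def binauralize(mono, hrir_left, hrir_right):
    """Turn a single-channel signal into (left, right) channels via HRIRs."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy HRIR pair for a source on the listener's left: the right ear receives
# the sound two samples later (ITD) and attenuated to 0.6 (ILD).
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.6]

mono_user_audio = [1.0, 0.5, -0.25]
left, right = binauralize(mono_user_audio, hrir_left, hrir_right)
```

  • with these toy responses the left channel is the unchanged signal and the right channel is a delayed, quieter copy, which is what makes the restored voice appear to come from the left.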
  • the first electronic device can also obtain a dual-channel user audio signal by using other algorithms, for example, by using a binaural room impulse response (BRIR).
  • the stereo audio information in the recorded video may also include an ambient stereo audio signal.
  • the ambient stereo audio signal can be used to reproduce sounds other than the user's voice in the shooting environment.
  • the ambient stereo audio signal refers to a dual-channel ambient audio signal, and the ambient audio signal refers to an electrical signal converted from other sound signals in the environment.
  • since the second electronic device filters out most of the other sound information in the environment when collecting the sound signal, the other sound information in the environment is clearer in the first audio signal than in the second audio signal.
  • the first electronic device may therefore use the first audio signal to acquire the audio signals of the other sounds in the environment, which are the ambient audio signal. Then, using the ambient audio signal, an ambient stereo audio signal is obtained.
  • the first electronic device may perform adaptive blocking filtering by using the first audio signal and the user audio signal to filter out the user's voice information in the first audio signal to obtain an ambient audio signal. Then, a two-channel ambient audio signal is obtained through beamforming of the first electronic device, and the two-channel ambient audio signal may be in the X/Y format, the M/S format, or the A/B format.
  • the first electronic device may also obtain the ambient stereo audio signal in other ways, for example, first obtain the stereo first audio signal by using the first audio signal.
  • the stereo first audio signal is a two-channel first audio signal. Then, adaptive blocking filtering is performed using the stereo first audio signal and the user audio signal to filter out the user's voice information in the stereo first audio signal to obtain an ambient stereo audio signal, which is not limited in this embodiment of the present application.
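  • the adaptive blocking filtering mentioned above is not specified further. As a non-limiting illustration, the following sketch assumes one common realization, a normalized least-mean-squares (NLMS) adaptive filter that uses the user audio signal as a reference, estimates the user's voice contained in the first audio signal, and subtracts it so that the residual approximates the ambient audio signal. The function name, the filter length, the step size, and the toy signals are hypothetical.

```python
import math

def nlms_remove_user_voice(first_audio, user_audio, taps=8, mu=0.5, eps=1e-8):
    """Subtract the part of first_audio that is predictable from user_audio.

    Returns the residual, i.e. the estimate of the ambient audio signal.
    """
    w = [0.0] * taps                       # adaptive FIR coefficients
    ambient = []
    for k in range(len(first_audio)):
        # Most recent `taps` reference samples, newest first (zero-padded).
        x = [user_audio[k - j] if k - j >= 0 else 0.0 for j in range(taps)]
        voice_estimate = sum(wj * xj for wj, xj in zip(w, x))
        e = first_audio[k] - voice_estimate    # residual = ambient estimate
        norm = eps + sum(xj * xj for xj in x)
        # Normalized LMS update of the coefficients.
        w = [wj + mu * e * xj / norm for wj, xj in zip(w, x)]
        ambient.append(e)
    return ambient

# Toy signals: the phone microphone hears a filtered copy of the user's
# voice plus a slowly varying ambient sound.
user = [math.sin(0.3 * k) + 0.5 * math.sin(1.1 * k) for k in range(2000)]
ambient_true = [0.2 * math.sin(0.05 * k + 1.0) for k in range(2000)]
first = [0.8 * user[k] + (0.3 * user[k - 1] if k else 0.0) + ambient_true[k]
         for k in range(2000)]
ambient_est = nlms_remove_user_voice(first, user, taps=2)
```

  • when the first audio signal contains only the filtered user voice, the residual decays toward zero as the filter adapts; with ambient sound present, the residual keeps the ambient component, subject to the usual adaptation error of NLMS.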
  • the ambient stereo audio signal and the user stereo audio signal are mixed to obtain a stereo audio signal that includes both the user's voice information and other sound information in the environment.
  • the first electronic device can also perform audio zooming on the user stereo audio signal, so that the sound image size of the user's voice matches the distance between the user and the first electronic device when the user makes the sound.
  • the sound image size of the user's voice refers to the volume level of the user's voice in the video.
  • the first electronic device may determine the volume of the user's stereo audio signal in the video according to the focus information given by the user.
  • when the first electronic device records a video, in response to the user's operation of increasing the shooting focal length, the first electronic device determines that the user is getting closer to the first electronic device, and the first electronic device may then enhance the user stereo audio signal.
  • the ambient stereo audio signal is left unchanged or suppressed. The user stereo audio signal is then mixed with the ambient stereo audio signal to obtain a stereo audio signal.
  • the volume of the user's voice in the stereo audio signal becomes higher, and the volume of the other sounds in the environment becomes relatively lower.
  • suppressing the ambient stereo audio signal makes the other sounds around the first electronic device that are restored from the ambient stereo audio signal quieter.
  • when the first electronic device records a video, in response to the user's operation of reducing the shooting focal length, the first electronic device determines that the user is moving away from the first electronic device, and the first electronic device may then suppress the user stereo audio signal.
  • the ambient stereo audio signal is left unchanged or enhanced.
  • the user stereo audio signal and the ambient stereo audio signal are then mixed to obtain a stereo audio signal.
  • the volume of the user's voice in the stereo audio signal will be reduced, and the volume of other sounds in the environment will be relatively high.
  • in addition to using the shooting focal length set by the user, the first electronic device can also mix the ambient stereo audio signal and the user stereo audio signal according to a volume ratio determined in other ways, for example, a default mixing ratio, which is not limited in this embodiment of the present application.
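  • the audio zooming and mixing described above can be sketched as follows. How the gains depend on the focal length is not specified, so the sketch assumes, purely for illustration, that the user gain equals the zoom ratio (1.0 meaning no zoom) and that the ambient gain is either left at 1.0 or set to the reciprocal of the zoom ratio. The function name and the sample values are hypothetical.

```python
def audio_zoom_mix(user_stereo, ambient_stereo, zoom_ratio, adjust_ambient=False):
    """Mix user and ambient stereo signals with zoom-dependent gains.

    Each signal is a list of (left, right) sample pairs of equal length.
    zoom_ratio > 1 enhances the user's voice, zoom_ratio < 1 suppresses it.
    """
    user_gain = zoom_ratio
    # Optionally move the ambient level in the opposite direction.
    ambient_gain = 1.0 / zoom_ratio if adjust_ambient else 1.0
    return [(user_gain * ul + ambient_gain * al,
             user_gain * ur + ambient_gain * ar)
            for (ul, ur), (al, ar) in zip(user_stereo, ambient_stereo)]

user_stereo = [(0.2, 0.1)]
ambient_stereo = [(0.05, 0.05)]
zoomed_in = audio_zoom_mix(user_stereo, ambient_stereo, zoom_ratio=2.0)
zoomed_in_quiet_ambient = audio_zoom_mix(user_stereo, ambient_stereo,
                                         zoom_ratio=2.0, adjust_ambient=True)
```

  • a default mixing ratio corresponds to calling the function with a fixed `zoom_ratio` that does not depend on the focal length.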
  • steps S101 to S105 need not be performed in any particular order, as long as step S103 is performed after step S101.
  • step S106 and step S107 can also be performed simultaneously, that is, in some embodiments, the second electronic device can encode the user audio signal and the bone conduction audio signal together, and send them to the first electronic device at the same time.
  • the first electronic device can perform steps S101 to S109 multiple times to obtain the stereo audio signals corresponding to multiple frames of images, and encode the stereo audio signals corresponding to the multiple frames of images to obtain an audio stream. At the same time, the multiple frames of images are encoded to obtain a video stream. The audio stream is then mixed with the video stream to obtain the recorded video. The electronic device can process multiple frames of images in the video to obtain multiple frames of images.
  • the first electronic device When the first electronic device plays the video, it can play a frame of image at a certain moment, and the image will stay on the display screen for a period of time, and within a period of time from this moment, the first electronic device can play the frame of image
  • the corresponding stereo audio signal, the stereo audio signal can restore the user's voice, and at this time, the audio-visual orientation of the user's voice matches the user's visual orientation.
  • the next frame of image can be played, and at the same time the stereo audio signal corresponding to the next frame of image is played until the video is played.
  • the first electronic device may play the first frame of image, and at the same time start to play the stereo audio signal corresponding to the first frame of image, and the stereo audio signal of the frame may restore the user's voice.
  • the visual orientation of the user is to the left
  • the audio-visual orientation of the user's voice restored by the stereo audio signal corresponding to the frame of image is also to the left.
  • the first frame of image may stay on the display screen of the electronic device for a period of time, and within this period of time, the first electronic device may continue to play the stereo audio signal corresponding to the frame of image.
  • the first electronic device can play the second frame of image and simultaneously start playing the stereo audio signal corresponding to the second frame of image, and the stereo audio signal corresponding to the second frame of image can restore the user's voice.
  • the visual orientation of the user is to the right, and the audio-visual orientation of the user's voice restored by the stereo audio signal corresponding to the second frame of image is also to the right.
  • the audio-visual orientation of the user's voice in the video matches the visual orientation of the user.
  • the user audio signal in the stereo audio signal in the video is collected by the second electronic device, and the stereo user audio signal can restore the user's audio signal. voice, and the user's voice is clear.
  • As used in the above embodiments, the term "when" may, depending on the context, be interpreted to mean "if", "after", "in response to determining", or "in response to detecting".
  • Similarly, depending on the context, the phrases "when determining" or "if (the stated condition or event) is detected" may be interpreted to mean "if determining", "in response to determining", "on detecting (the stated condition or event)", or "in response to detecting (the stated condition or event)".
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line) or a wireless manner (e.g., infrared, radio, microwave).
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
  • The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), semiconductor media (e.g., solid-state drives), and the like.
  • A person of ordinary skill in the art will understand that all or part of the processes of the method embodiments above can be completed by a computer program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium.
  • When the program is executed, it may include the processes of the foregoing method embodiments.
  • The aforementioned storage medium includes any medium that can store program code, such as a ROM, a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Studio Devices (AREA)
  • Stereophonic System (AREA)

Abstract

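A video recording method. In this method, when a first electronic device records a video that includes a user, it may capture multiple frames of images and encode them to obtain a video stream. The first electronic device may also use a bone conduction audio signal and a user audio signal sent by a second electronic device to obtain a stereo audio signal that matches the user's sound image orientation, encode the stereo audio signals corresponding to the multiple frames of images to obtain an audio stream, and mux the audio stream with the video stream to obtain the video. In the video, the sound image orientation of the user's voice matches the user's visual orientation. With the technical solution provided in this application, the user's voice in the recorded video remains clear even while its sound image orientation matches the user's visual orientation.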

Description

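A video recording method and electronic device
This application claims priority to Chinese Patent Application No. 202110415047.4, filed with the Chinese Patent Office on April 17, 2021 and entitled "一种视频录制方法和电子设备" (A video recording method and electronic device), which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of terminal and communication technologies, and in particular to a video recording method and an electronic device.
Background
As the shooting capabilities of mobile phones improve, more and more users like to record videos that include themselves while standing far from the electronic device. In such videos, users often want their own voice to be clear and to carry a sense of direction.
With the spread of true wireless stereo (TWS) headsets, when a user wearing a TWS headset records a video far from the electronic device, the headset can largely avoid other sound signals in the environment and collect a relatively clear sound signal of the user. It converts the sound signal into an audio signal and transmits it to the electronic device over a wireless network, and the electronic device can process the received audio signal together with the recorded images to obtain a video. In that video, the user's voice is clear.
However, in a video recorded this way, although the sound the user produces when speaking is stereophonic, it is no longer stereophonic after being recorded and processed by the TWS headset. As a result, the user's voice in the video has no stereo effect, and the sound image orientation of the user's voice cannot match the user's visual orientation.
Summary
This application provides a video recording method and an electronic device. When the user is far from the electronic device and the electronic device records a video that includes the user, the user's voice in the recorded video remains clear even while its sound image orientation matches the user's visual orientation.
According to a first aspect, this application provides a video recording method, including: during video recording by a first electronic device, the first electronic device captures an image and a first audio signal; the first electronic device determines, based on the image and the first audio signal, the sound image orientation of the user's voice in the image; the first electronic device generates a stereo audio signal based on the sound image orientation of the user's voice and a user audio signal, where the user audio signal is obtained by a second electronic device and sent to the first electronic device; and the first electronic device generates a video based on the image and the stereo audio signal.
In the foregoing embodiment, the first electronic device calculates the sound image orientation of the user's voice and uses it for stereo processing of the user audio signal, so that the sound image orientation of the user's voice reproduced from the stereo audio signal in the generated video matches the user's visual orientation. In addition, the user audio signal is collected by the second electronic device, which may be a headset worn by the user, so the collected user audio signal is clear and the stereo user audio signal generated by the first electronic device can also be clear. In the generated video, therefore, the sound image orientation of the user's voice matches the user's visual orientation and the voice is clear. Moreover, a stereo audio signal generated from the sound image orientation of the user's voice and the user audio signal mainly contains the user's voice information, without other sound information from the environment mixed in. In scenarios where the user's voice should stand out in the video, the method in this embodiment achieves that.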
结合第一方面的一些实施例,在一些实施例中,该第一电子设备根据该图像和该第一音频信号,确定图像中用户的声音的声像方位,具体包括:该第一电子设备对该图像进行人脸 识别,得到人脸的像素位置;该第一电子设备根据该第一音频信号,确定用户的声源方位;该第一电子设备根据该用户的声源方位以及该人脸的像素位置,确定该用户在图像中的像素位置;该第一电子设备根据该用户在图像中的像素位置,确定该图像中该用户的声音的声像方位。
在上述实施例中,第一电子设备通过采集的第一音频信号与图像会计算出该用户的声音的声像方位,该用户的声音的声像方位作为生成立体声音频信号时的算法的参数,使得该立体声音频信号还原的用户的声音的声像方位是与图像中用户的视像方位匹配的。
结合第一方面的一些实施例,在一些实施例中,该第一电子设备根据该图像和该第一音频信号,确定图像中用户的声音的声像方位,具体包括:该第一电子设备对该图像进行人脸识别,得到人脸的像素位置;该第一电子设备利用骨传导音频信号,结合该第一音频信号,确定用户的声源方位;该骨传导音频信号由该第二电子设备获取并发送给该第一电子设备;该第一电子设备根据该用户的声源方位以及该人脸的像素位置,确定该用户在图像中的像素位置;该第一电子设备根据该用户在图像中的像素位置,确定该图像中该用户的声音的声像方位。
在上述实施例中,在计算用户的声音的声像方位的过程中,可以利用骨传导音频信号筛选出第一音频信号中与用户的声音信息强相关的那部分音频信号,提高计算出用户的声音的声像方位的准确性。
结合第一方面的一些实施例,在一些实施例中,该第一电子设备根据该用户的声音的声像方位和用户音频信号,生成立体声音频信号,具体包括:该第一电子设备生成环境立体声音频信号;该第一电子设备根据该用户的声音的声像方位和该用户音频信号,以及该环境立体声音频信号,生成该立体声音频信号。
在上述实施例中,在生成立体声音频信号的过程中,不仅会利用用户音频信号,也会利用环境音频信号,使得生成的立体声音频信号中不仅有用户的声音信息还包括环境中的其他声音信息。
结合第一方面的一些实施例,在一些实施例中,该第一电子设备生成环境立体声音频信号,具体包括:该第一电子设备根据该用户音频信号对该第一音频信号进行自适应阻塞滤波,滤除该第一音频信号中用户的声音信息;第一电子设备根据滤波后的第一音频信号生成该环境立体声音频信号。
在上述实施例中,第一电子设备采集的第一音频信号中,可以包含更清晰的环境中的其他声音信息。第一电子设备利用采集的第一音频信号,滤除其中的用户的声音信息。可以得到真实的环境中的其他声音信息。
结合第一方面的一些实施例,在一些实施例中,该第一电子设备生成环境立体声音频信号,具体包括:该第一电子设备利用该第一音频信号,生成立体声第一音频信号;该第一电子设备根据该用户音频信号对该立体声第一音频信号进行自适应阻塞滤波,滤除该立体声第一音频信号中用户的声音信息,得到环境立体声音频信号。
在上述实施例中,第一电子设备采集的第一音频信号中,可以包含更清晰的环境中的其他声音信息。第一电子设备利用采集的第一音频信号,滤除其中的用户的声音信息。可以得到真实的环境中的其他声音信息。
结合第一方面的一些实施例,在一些实施例中,该第一电子设备根据该用户的声音的声像方位和该用户音频信号,以及该环境立体声音频信号,生成该立体声音频信号,具体包括:该第一电子设备根据该用户的声音的声像方位和该用户音频信号,生成用户立体声音频信号; 该第一电子设备将该用户立体声音频信号进行增强,同时不改变该环境立体声音频信号;该第一电子设备根据增强后的该用户立体声音频信号与该环境立体声音频信号,生成该立体声音频信号。
在上述实施例中,用户立体声音频信号中既包括用户立体声音频信号,也包括环境立体声音频信号时,可以进行音频变焦,当图像中用户距离第一电子设备更近时,该用户的声音可以变得更大,同时环境声音不改变。当图像中用户距离第一电子设备更远时,该用户的声音可以变得更小,同时环境声音不改变。
结合第一方面的一些实施例,在一些实施例中,该第一电子设备根据该用户的声音的声像方位和该用户音频信号,以及该环境立体声音频信号,生成该立体声音频信号,具体包括:该第一电子设备根据该用户的声音的声像方位和该用户音频信号,生成用户立体声音频信号;该第一电子设备将该用户立体声音频信号进行增强,同时将该环境立体声音频信号进行抑制;该第一电子设备根据增强后的该用户立体声音频信号与抑制后的该环境立体声音频信号,生成该立体声音频信号。
在上述实施例中,用户立体声音频信号中既包括用户立体声音频信号,也包括环境立体声音频信号时,可以进行音频变焦,当图像中用户距离第一电子设备更近时,该用户的声音可以变得更大,同时环境声音变小。当图像中用户距离第一电子设备更远时,该用户的声音可以变得更小,同时环境声音变小。
结合第一方面的一些实施例,在一些实施例中,该用户音频信号为该第二电子设备根据该骨传导音频信号对该第二音频信号进行联合降噪处理,除去该第二音频信号中的该第二电子设备周围环境中的其他声音信息所得。
在上述实施例中,利用骨传导音频信号对第二音频信号进行滤波处理,使得得到的用户音频信号中,基本是用户的声音的信息,当第一电子设备利用该用户音频信号计算用户的声音的声像方位时,使得计算结果环境中其他声音信息的影响较小,使得计算结果更加准确。
结合第一方面的一些实施例,在一些实施例中,该用户音频信号为该第二电子设备根据第二音频信号进行降噪处理,除去该第二音频信号中的该第二电子设备周围环境中的其他声音信息所得。
在上述实施例中,对第二音频信号进行滤波处理,除去部分环境中的声音信息,使得得到的用户音频信号中,用户的声音的信息被保存下来,当第一电子设备利用该用户音频信号计算用户的声音的声像方位时,使得计算结果环境中其他声音信息的影响较小,使得计算结果准确。
第二方面,本申请提供了一种电子设备,该电子设备包括:一个或多个处理器和存储器;该存储器与该一个或多个处理器耦合,该存储器用于存储计算机程序代码,该计算机程序代码包括计算机指令,该一个或多个处理器调用该计算机指令以使得该电子设备执行:在录制视频的过程中,采集图像和第一音频信号;根据该图像和该第一音频信号,确定图像中用户的声音的声像方位;根据该用户的声音的声像方位和用户音频信号,生成立体声音频信号;该用户音频信号由该第二电子设备获取并发送给该电子设备;根据该图像和该立体声音频信号生成视频。
在上述实施例中,第一电子设备会计算出用户的声音的声像方位,用于对用户音频信号做立体声处理,使得生成的视频中的立体声音频信号还原出的用户的声音的声像方位与用户的视像方位是匹配的。且用户音频信号是第二电子设备采集的,第二电子设备可以是用户佩 戴的耳机,这样,采集的用户音频信号是清晰的,使得第一电子设备生成的立体声用户音频信号也可以是清晰的。则在生成的视频中,用户的声音的声像方位与用户的视像方位是匹配的,且该用户的声音是清晰的。且根据该用户的声音的声像方位和用户音频信号生成的立体声音频信号中,主要包括用户的声音信息,而没有掺杂环境中其他的声音信息,在一些需要使得视频中突出的是用户的声音信息的场景下,使用该实施例中的方法可以达到突出用户的声音的目的。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:对该图像进行人脸识别,得到人脸的像素位置;根据该第一音频信号,确定用户的声源方位;根据该用户的声源方位以及该人脸的像素位置,确定该用户在图像中的像素位置;根据该用户在图像中的像素位置,确定该图像中该用户的声音的声像方位。
在上述实施例中,第一电子设备通过采集的第一音频信号与图像会计算出该用户的声音的声像方位,该用户的声音的声像方位作为生成立体声音频信号时的算法的参数,使得该立体声音频信号还原的用户的声音的声像方位是与图像中用户的视像方位匹配的。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:对该图像进行人脸识别,得到人脸的像素位置;利用骨传导音频信号,结合该第一音频信号,确定用户的声源方位;该骨传导音频信号由该第二电子设备获取并发送给该电子设备;根据该用户的声源方位以及该人脸的像素位置,确定该用户在图像中的像素位置;根据该用户在图像中的像素位置,确定该图像中该用户的声音的声像方位。
在上述实施例中,在计算用户的声音的声像方位的过程中,可以利用骨传导音频信号筛选出第一音频信号中与用户的声音信息强相关的那部分音频信号,提高计算出用户的声音的声像方位的准确性。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:生成环境立体声音频信号;根据该用户的声音的声像方位和该用户音频信号,以及该环境立体声音频信号,生成该立体声音频信号。
在上述实施例中,在生成立体声音频信号的过程中,不仅会利用用户音频信号,也会利用环境音频信号,使得生成的立体声音频信号中不仅有用户的声音信息还包括环境中的其他声音信息。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:根据该用户音频信号对该第一音频信号进行自适应阻塞滤波,滤除该第一音频信号中用户的声音信息;根据滤波后的第一音频信号生成该环境立体声音频信号。
在上述实施例中,第一电子设备采集的第一音频信号中,可以包含更清晰的环境中的其他声音信息。第一电子设备利用采集的第一音频信号,滤除其中的用户的声音信息。可以得到真实的环境中的其他声音信息。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:根据该用户音频信号对该立体声第一音频信号进行自适应阻塞滤波,滤除该立体声第一音频信号中用户的声音信息,得到环境立体声音频信号。
在上述实施例中,第一电子设备采集的第一音频信号中,可以包含更清晰的环境中的其他声音信息。第一电子设备利用采集的第一音频信号,滤除其中的用户的声音信息。可以得 到真实的环境中的其他声音信息。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:根据该用户的声音的声像方位和该用户音频信号,生成用户立体声音频信号;将该用户立体声音频信号进行增强,同时不改变该环境立体声音频信号;根据增强后的该用户立体声音频信号与该环境立体声音频信号,生成该立体声音频信号。
在上述实施例中,用户立体声音频信号中既包括用户立体声音频信号,也包括环境立体声音频信号时,可以进行音频变焦,当图像中用户距离第一电子设备更近时,该用户的声音可以变得更大,同时环境声音不改变。当图像中用户距离第一电子设备更远时,该用户的声音可以变得更小,同时环境声音不改变。
结合第二方面的一些实施例,在一些实施例中,该一个或多个处理器具体用于调用该计算机指令以使得该电子设备执行:根据该用户的声音的声像方位和该用户音频信号,生成用户立体声音频信号;将该用户立体声音频信号进行增强,同时将该环境立体声音频信号进行抑制;根据增强后的该用户立体声音频信号与抑制后的该环境立体声音频信号,生成该立体声音频信号。
在上述实施例中,用户立体声音频信号中既包括用户立体声音频信号,也包括环境立体声音频信号时,可以进行音频变焦,当图像中用户距离第一电子设备更近时,该用户的声音可以变得更大,同时环境声音变小。当图像中用户距离第一电子设备更远时,该用户的声音可以变得更小,同时环境声音变小。
第三方面,本申请实施例提供了一种芯片系统,该芯片系统应用于电子设备,该芯片系统包括一个或多个处理器,该处理器用于调用计算机指令以使得该电子设备执行如第一方面的任意一种实施方式所描述的方法。
第四方面,本申请实施例提供了一种包含指令的计算机程序产品,其特征在于,当该计算机程序产品在电子设备上运行时,使得该电子设备执行如第一方面的任意一种实施方式所描述的方法。
第五方面,本申请实施例提供了一种计算机可读存储介质,包括指令,其特征在于,当该指令在电子设备上运行时,使得该电子设备执行如第一方面的任意一种实施方式所描述的方法。
可以理解地,第三方面提供的芯片系统、第四方面提供的计算机程序产品和第五方面提供的计算机存储介质均用于执行本申请实施例所提供的方法。因此,其所能达到的有益效果可参考对应方法中的有益效果,此处不再赘述。
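Brief Description of the Drawings
FIG. 1a is a schematic diagram of the world coordinate system, the camera coordinate system, and the image plane coordinate system provided in an embodiment of this application;
FIG. 1b and FIG. 1c are schematic diagrams of the image pixel coordinate system provided in an embodiment of this application;
FIG. 2a and FIG. 2b are schematic diagrams of video recording in one solution provided in an embodiment of this application;
FIG. 3 is a schematic structural diagram of the communication system 100 provided in an embodiment of this application;
FIG. 4 is a schematic structural diagram of the first electronic device provided in an embodiment of this application;
FIG. 5 is a schematic structural diagram of the second electronic device provided in an embodiment of this application;
FIG. 6 is a schematic diagram of signaling interaction in the video recording method in an embodiment of this application;
FIG. 7 is a flowchart of the first electronic device determining the sound image orientation of the user's voice corresponding to the user audio signal in an embodiment of this application.
Detailed Description
The terms used in the following embodiments of this application are intended only to describe particular embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", "the", "the foregoing", and "this" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this application refers to and encompasses any or all possible combinations of one or more of the listed items.
In the following, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. In the description of the embodiments of this application, unless otherwise stated, "a plurality of" means two or more.
For ease of understanding, the related terms and concepts involved in the embodiments of this application are introduced first.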
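(1) Visual orientation:
In the embodiments of this application, a user's visual orientation is the user's position in the real world relative to the camera center, as judged from an image that the electronic device captures and that includes the user. The reference coordinate system for this position may be the camera coordinate system.
It can be understood that, in some cases, the user's real-world position relative to the camera center has not changed, yet the image gives the impression that it has. Examples are an image obtained after the electronic device changes a camera parameter (such as the focal length), or an image obtained by cropping an image that includes the user.
Determining the visual orientation involves the world coordinate system, the camera coordinate system, the image plane coordinate system, and the image pixel coordinate system.
FIG. 1a shows an example of the world coordinate system, camera coordinate system, and image plane coordinate system provided in an embodiment of this application.
The world coordinate system is denoted O3-Xw-Yw-Zw. It gives the user's coordinates in the real world.
The camera coordinate system is denoted O1-Xc-Yc-Zc. O1 is the optical center of the camera and the origin of the camera coordinate system; the Xc, Yc, and Zc axes are its coordinate axes, and the Zc axis is the principal optical axis.
The coordinates of a point in the world coordinate system can be converted into the camera coordinate system by a rigid transformation.
The camera collects the light reflected by the user and presents it on the imaging plane to obtain an optical image of the user. An image plane coordinate system can be established on this imaging plane.
The image plane coordinate system is denoted O2-X-Y. O2 is the center of the optical image and the origin of the image plane coordinate system; the X and Y axes are parallel to the Xc and Yc axes.
The coordinates of a point in the camera coordinate system can be converted into the image plane coordinate system by perspective projection.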
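The two conversions named above (rigid transformation, then perspective projection) can be sketched as follows. This is a minimal illustration under the standard pinhole camera model, which the text does not spell out; the rotation matrix `R`, translation `t`, and the function names are assumptions introduced here, not part of the application.

```python
from typing import Sequence, Tuple


def world_to_camera(p_world: Sequence[float],
                    R: Sequence[Sequence[float]],
                    t: Sequence[float]) -> Tuple[float, float, float]:
    """Rigid transformation p_c = R * p_w + t (R: 3x3 rotation, t: translation)."""
    xc, yc, zc = (sum(R[r][k] * p_world[k] for k in range(3)) + t[r]
                  for r in range(3))
    return (xc, yc, zc)


def camera_to_image_plane(p_camera: Sequence[float],
                          focal_length: float) -> Tuple[float, float]:
    """Perspective projection onto the image plane (pinhole model, assumed)."""
    xc, yc, zc = p_camera
    if zc <= 0:
        raise ValueError("point must lie in front of the camera (Zc > 0)")
    return (focal_length * xc / zc, focal_length * yc / zc)
```

For instance, with an identity rotation and zero translation, a point at (1, 2, 4) in front of a camera with focal length 2 lands at (0.5, 1.0) on the image plane.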
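FIG. 1b and FIG. 1c are schematic diagrams of the image pixel coordinate system provided in an embodiment of this application.
The electronic device may process the optical image in the image plane to obtain an image that can be shown on the display.
As shown in FIG. 1b, in some embodiments, an image plane coordinate system, denoted O2-X-Y, is established in the image plane. The electronic device may leave the image in the image plane uncropped and directly show the corresponding image on the display. An image pixel coordinate system, denoted O-U-V and measured in pixels, is established on this image. O is a vertex of the image, and the U and V axes are parallel to the X and Y axes.
As shown in FIG. 1c, in some embodiments, an image plane coordinate system, denoted O2-X-Y, is established in the image plane. The electronic device may crop the image in the image plane and show the image corresponding to the cropped result on the display. An image pixel coordinate system, denoted O-U-V, is established on this image. O is a vertex of the image corresponding to the cropped result, and the U and V axes are parallel to the X and Y axes.
In other embodiments, the electronic device may also apply other processing, such as zooming, to the image in the image plane to obtain an image that can be shown on the display. An image pixel coordinate system can then be established with a vertex of that image as the origin.
The electronic device may use a point on the user's body to determine the user's position relative to the camera center. Once the electronic device has determined the position of that point relative to the camera center, it can convert the position into the pixel coordinates, in the image pixel coordinate system, of the pixel corresponding to that point. Likewise, once the electronic device has the pixel coordinates of a pixel in the image pixel coordinate system, it can use them to determine the position, relative to the camera center, of the point corresponding to that pixel. In this way, the electronic device can obtain the user's pixel coordinates in the image from the user's position relative to the camera center, and the user's position relative to the camera center from the user's pixel coordinates in the image. The specific process is described below.
In the embodiments of this application, a user's visual orientation can be expressed in the camera coordinate system in several ways. One is as direction angles relative to the center of the camera coordinate system, which may include an azimuth angle and a pitch angle: for example, if the pitch angle, measured from the object's image to the Zc axis, is m° and the azimuth angle, measured from the object's image to the Yc axis, is n°, the user's visual orientation can be written as (m°, n°). Another is as the distances a, b, and c from the camera's Xc, Yc, and Zc axes, in which case the user's visual orientation can be written as (a, b, c). Other representations of a user's visual orientation relative to the camera coordinate system may also be used, which is not limited in the embodiments of this application.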
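To make the sound image orientation of the user's voice match the user's visual orientation in a video recorded by an electronic device, one solution is to collect the sound signal with the microphone of the electronic device and convert it into an audio signal in electrical form. The electronic device can then focus the audio signal on a desired region (for example, the region where the user is speaking) and reproduce the focused audio signal as sound, so that the sound image orientation of the user's voice in the recorded video matches the user's visual orientation.
Here, the sound image orientation is the direction of a sound as reproduced: the sound propagates to the electronic device as a sound signal, the electronic device collects the sound signal and converts it into an audio signal, and then reproduces the sound corresponding to that audio signal; the direction of that reproduced sound is its sound image orientation.
With this solution, however, when the user is far from the electronic device, the electronic device also picks up other sound signals in the environment while collecting the user's sound signal, and the user's sound signal is attenuated with distance as it travels through the air. As a result, the user's voice in the recorded video is often unclear.
To address the unclear voice that results when the user is far from the electronic device, another solution is for the user to wear a TWS headset and use it to collect the user's sound signal. Because the microphone of the TWS headset is very close to the user, the headset can collect the user's sound signal at close range and isolate part of the other sound signals in the environment, so that fewer of them are picked up. For the other environmental sound signals that are still picked up, the TWS headset can apply noise reduction to remove them and keep the user's sound signal. It then transmits the sound signal to the electronic device over a wireless network, and the electronic device can process the received sound signal together with the recorded images to obtain a video. In that video, the user's voice is clear.
With this solution, however, the microphone of the TWS headset always collects the user's sound signal at the user's ear. From the microphone's point of view, the direction of the collected sound signal does not change: it is always the direction of the user's vocal organs relative to the microphone center. For the camera of the electronic device, by contrast, the direction of the captured image of the user can change even though the direction of the user's sound signal does not.
As shown in FIG. 2a, at one moment the user, while speaking, is to the left of the camera center of the electronic device. Assume that in the resulting image the user's visual orientation is to the left of the camera center. As shown in FIG. 2b, at another moment the user's position has clearly changed, and the user is now to the right of the camera center. Assume that in the resulting image the user's visual orientation is to the right of the camera center. Between the two moments, however, the user's direction relative to the microphone center of the TWS headset is essentially unchanged, so the direction of the sound signal collected by that microphone is essentially unchanged, and the sound image orientation of the user's voice in the recorded video does not change.
Therefore, when the sound corresponding to the sound signal collected by the TWS headset is reproduced in the video, the sound image orientation of the user's voice does not change, whereas the user's visual orientation in the video does. In the resulting video, the sound image orientation of the user's voice does not match the user's visual orientation.
With the video recording method provided in the embodiments of this application, when the user is far from the electronic device and records a video with it, the sound image orientation of the user's voice in the recorded video matches the user's visual orientation, and the user's voice in the video is clear.
In the embodiments of this application, the user wears a TWS headset and records a video that includes themselves while far from the mobile phone; for the scenario, see FIG. 2a and FIG. 2b above. The difference is that, with the method in the embodiments of this application, although the direction of the user's sound signal collected by the microphone remains essentially unchanged as the user moves, after the TWS headset transmits the sound signal to the electronic device, the electronic device can use the sound signal and the user's visual orientation in the image to obtain the sound image orientation of the user's voice corresponding to that sound signal. This sound image orientation matches the user's visual orientation, so when it is used to reproduce the user's voice from the sound signal, the voice matches the user's visual orientation. For example, the user's visual orientation in FIG. 2a is to the left of the camera center, so the sound image orientation of the user's voice is also to the left relative to the camera. The user's visual orientation in FIG. 2b is to the right of the camera center, so the sound image orientation of the user's voice is also to the right relative to the camera.
In this way, in the recorded video, the sound image orientation of the user's voice matches the user's visual orientation, and because the TWS headset is used to collect the user's sound signal, the user's voice in the video is clear.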
下面首先介绍本申请实施例应用的通信系统100。
图3是本申请实施例提供的通信系统100的结构示意图。
如图3所示,该通信系统100包括:多个电子设备,例如第一电子设备、第二电子设备、第三电子设备。
本申请实施例中的第一电子设备可以是搭载Android、华为鸿蒙系统(HuaweiHarmonyOS)、iOS、Microsoft或者其它操作系统的终端设备,例如智慧屏、手机、平板电脑、笔记本电脑、个人计算机等。
第二电子设备、第三电子设备可以采集声音信号并且将该声音信号转换为电信号,然后传输到第一电子设备。例如,第二电子设备、第三电子设备可以是TWS耳机、蓝牙耳机等。
无线网络用于为本申请实施例涉及的电子设备提供各项服务,例如通信服务、连接服务、传输服务等。
无线网络包括:蓝牙(bluetooth,BT),无线局域网(wireless local area network,WLAN)技术、无线广域网(wirelesswide area network,WWAN)技术等。
第一电子设备可以和第二电子设备、第三电子设备通过无线网络建立连接,然后进行数据传输。
例如,第一电子设备可以对第二电子设备进行查找,当查找到第二电子设备时,第一电子设备可以向第二电子设备发送建立连接的请求,第二电子设备接收到该请求后,可以与第一电子设备建立连接。此时,第二电子设备可以将采集到的声音信号,转换为电信号,即音频信号,通过无线网络,传输到第一电子设备。
下面介绍通信系统100中涉及的第一电子设备。
图4是本申请实施例提供的第一电子设备的结构示意图。
下面以第一电子设备为例对实施例进行具体说明。应该理解的是,第一电子设备可以具有比图中所示的更多的或者更少的部件,可以组合两个或多个的部件,或者可以具有不同的部件配置。图中所示出的各种部件可以在包括一个或多个信号处理和/或专用集成电路在内的硬件、软件、或硬件和软件的组合中实现。
第一电子设备可以包括:处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
I2C接口是一种双向同步串行总线,包括一根串行数据线(serial data line,SDA)和一根串行时钟线(derail clock line,SCL)。
I2S接口可以用于音频通信。在一些实施例中,处理器110可以包含多组I2S总线。处理器110可以通过I2S总线与音频模块170耦合,实现处理器110与音频模块170之间的通信。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在一些实施例中,音频模块170与无线通信模块160可以通过PCM总线接口耦合。
UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。
MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。
SIM接口可以被用于与SIM卡接口195通信,实现传送数据到SIM卡或读取SIM卡中数据的功能。
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为第一电子设备充电,也可以用于第一电子设备与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他电子设备,例如AR设备等。
可以理解的是,本发明实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对第一电子设备的结构限定。在本申请另一些实施例中,第一电子设备也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,外部存储器,显示屏194,摄像头193,和无线通信模块160等供电。
第一电子设备的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。第一电子设备中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在第一电子设备上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。
无线通信模块160可以提供应用在第一电子设备上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。
在一些实施例中,第一电子设备的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得第一电子设备可以通过无线通信技术与网络以及其他设备通信。所
第一电子设备通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲 染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。
第一电子设备可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,第一电子设备可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。
视频编解码器用于对数字视频压缩或解压缩。第一电子设备可以支持一种或多种视频编解码器。这样,第一电子设备可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现第一电子设备的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展第一电子设备的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行第一电子设备的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。
第一电子设备可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。第一电子设备可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当第一电子设备接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。第一电子设备可以设置至少一个麦克风170C。在另一些实施例中,第一电子设备可以设置两 个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,第一电子设备还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。
陀螺仪传感器180B可以用于确定第一电子设备的运动姿态。在一些实施例中,可以通过陀螺仪传感器180B确定第一电子设备围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器180B可以用于拍摄防抖。
气压传感器180C用于测量气压。在一些实施例中,第一电子设备通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。
磁传感器180D包括霍尔传感器。第一电子设备可以利用磁传感器180D检测翻盖皮套的开合。在一些实施例中,当第一电子设备是翻盖机时,第一电子设备可以根据磁传感器180D检测翻盖的开合。进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。
加速度传感器180E可检测第一电子设备在各个方向上(一般为三轴)加速度的大小。当第一电子设备静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。
距离传感器180F,用于测量距离。第一电子设备可以通过红外或激光测量距离。在一些实施例中,拍摄场景,第一电子设备可以利用距离传感器180F测距以实现快速对焦。
接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。第一电子设备通过发光二极管向外发射红外光。第一电子设备使用光电二极管检测来自附近物体的红外反射光。
环境光传感器180L用于感知环境光亮度。第一电子设备可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测第一电子设备是否在口袋里,以防误触。
指纹传感器180H用于采集指纹。第一电子设备可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
温度传感器180J用于检测温度。在一些实施例中,第一电子设备利用温度传感器180J检测的温度,执行温度处理策略。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。第一电子设备可以接收按键输入,产生与第一电子设备的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反 馈。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和第一电子设备的接触和分离。
本申请实施例中,第一电子设备还可以包括:激光传感器(未示出)。
激光传感器用于感受物体的振动,可以将该振动转换成电信号。在一些实施例中,激光传感器可以对喉部振动进行探测,获取喉部振动时的多普勒频移信号,将该多普勒频移信号转换成与喉部的振动频率相对应的电信号。
可以理解的是,第一电子设备还可以包括其他对物体的振动进行探测的装置,例如振动传感器(未示出)、超声波传感器(未示出)等。
本申请实施例中,该处理器110可以调用内部存储器121中存储的计算机指令,以使得第一电子设备执行本申请实施例中的视频录制方法。
下面介绍通信系统100中涉及的第二电子设备。
图5是本申请实施例提供的第二电子设备的结构示意图。
下面以第二电子设备为例对实施例进行具体说明。应该理解的是,第二电子设备可以具有比图中所示的更多的或者更少的部件,可以组合两个或多个的部件,或者可以具有不同的部件配置。图中所示出的各种部件可以在包括一个或多个信号处理和/或专用集成电路在内的硬件、软件、或硬件和软件的组合中实现。
在本申请实施例中,第二电子设备可以包括:处理器151、麦克风152、骨传导传感器153、无线通信处理模块154。
处理器151可以用于解析无线通信处理模块154接收到的信号。该信号包括:第一电子设备发送的建立连接的请求。处理器151还可以用于生成无线通信处理模块154向外发送的信号,该信号包括:将音频信号传输到第一电子设备的请求等。
处理器151中还可以设置存储器,用于存储指令。在一些实施例中,该指令可以包括:增强降噪的指令、发送信号的指令等。
麦克风152可以利用空气对声音的传导性,来采集用户的声音信号,以及,一部分周围环境中的其他声音信号。然后,将声音信号转换为电信号,得到音频信号。麦克风152,也称“听筒”、“传声器”。
第二电子设备可以包括1个或N个麦克风152,N为大于1的正整数。当麦克风有2个及其以上时,可以采取不同的排列方式,得到不同的麦克风阵列,该麦克风阵列可以用于提高采集的声音信号的质量。
骨传导传感器153可以利用骨骼对声音的传导性进行声音信号的采集。周围环境中的声音信号大部分都是通过空气传导的,而骨传导传感器可以只采集与骨骼直接接触传导的声音信号,例如用户的声音信号,然后将该声音信号转换为电信号,得到骨导音频信号。
无线通信处理模块154可以包括蓝牙(bluetooth,BT)通信处理模块154A、WLAN通信处理模块154B中的一项或多项,用于提供和第一电子设备建立连接,进行数据传输等服务。
下面结合上述示例性第一电子设备、第二电子设备的硬件结构示意图,对本申请实施例 中的视频录制方法进行具体描述:
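FIG. 6 is a schematic diagram of signaling interaction in the video recording method in an embodiment of this application.
Assume that the first electronic device and the second electronic device have already established a connection and can transfer data.
When the user is far from the electronic device and records a video with the first electronic device, the first electronic device may continuously capture multiple frames of images and, at the same time, continuously collect audio information in the shooting environment. Meanwhile, the second electronic device also records audio signals in the shooting environment.
Steps S101 to S109 describe how, during video recording, the current frame of image and the audio signals corresponding to it (including the first audio signal, the second audio signal, the bone conduction audio signal, and the user audio signal) are processed. It can be understood that processing every frame of image and its corresponding audio signals as described in steps S101 to S109 yields the video referred to in the embodiments of this application.
S101. The second electronic device collects a second audio signal;
During video recording, the second electronic device may continuously collect the second audio signal over a period of time (the time for which one frame of image is played). The second audio signal may include the user's voice information and other sound information from the environment around the second electronic device.
It can be understood that the length of this period may differ from case to case, for example 1/24 second or 1/12 second.
S102. The second electronic device collects a bone conduction audio signal;
Optionally, in some embodiments, during video recording the second electronic device may also use its bone conduction sensor to continuously collect the user's bone conduction audio signal over a period of time (the time for which one frame of image is played).
S103. The second electronic device processes the second audio signal to obtain a user audio signal;
In some embodiments, the second electronic device may apply processing such as sampling and noise reduction to the second audio signal to remove other sound information from the environment, thereby enhancing the user's voice information in the second audio signal and obtaining the user audio signal.
The user audio signal obtained in this way contains, besides the sound information of the user speaking, a small amount of other sound information from the environment around the second electronic device. Optionally, therefore, in other embodiments, so that the user audio signal obtained from the second audio signal contains only the user's voice information, the second electronic device may perform joint noise reduction on the second audio signal using the bone conduction audio signal, removing the other sound information from the environment around the second electronic device in the second audio signal to obtain the user audio signal.
Specifically, joint noise reduction is implemented in the same way as joint noise reduction of multiple audio signals by a single electronic device in the prior art. This embodiment gives one such approach: a difference is computed between the bone conduction audio signal and the second audio signal so that the noise in the two signals cancels, which achieves the joint noise reduction. Note that, in the difference computation, the two audio signals need to be weighted according to their sound wave intensities so that the weighted noise intensities are essentially equal, which maximizes the noise reduction. In addition, if the difference computation weakens the normal (non-noise) audio signal, the difference signal may be amplified to obtain the user audio signal.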
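The weighted-difference idea described above can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes both signals are time-aligned lists of samples, that a noise-only stretch (`noise_slice`, a hypothetical parameter) is available for estimating the noise level in each signal, and that the noise is coherent between the two signals. The function names and the `gain` parameter are likewise illustrative.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def joint_denoise(bone: Sequence[float], mic: Sequence[float],
                  noise_slice: slice, gain: float = 1.0) -> List[float]:
    """Weighted difference of a bone conduction signal and a microphone signal.

    The microphone signal is weighted so that its noise level matches the
    noise level of the bone conduction signal; subtracting it then cancels
    the (assumed coherent) noise. `gain` re-amplifies the result if the
    difference has weakened the speech.
    """
    noise_bone = rms(bone[noise_slice])
    noise_mic = rms(mic[noise_slice])
    weight = noise_bone / noise_mic if noise_mic > 0 else 0.0
    return [gain * (b - weight * m) for b, m in zip(bone, mic)]
```

In a real device the weight would more likely be estimated per frequency band and updated over time; a single broadband weight is used here only to keep the sketch short.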
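S104. The first electronic device captures an image;
During video recording, the camera of the first electronic device may capture an image. The image may include a portrait, which may be a portrait of the user.
S105. The first electronic device collects a first audio signal;
During video recording, the microphone of the first electronic device continuously collects the first audio signal over a period of time. The first audio signal may include the user's voice information and other sound information from the environment around the first electronic device.
S106. The second electronic device sends the user audio signal to the first electronic device;
The second electronic device may send the user audio signal to the first electronic device over the wireless network.
S107. The second electronic device sends the bone conduction audio signal to the first electronic device;
Optionally, the second electronic device may send the bone conduction audio signal to the first electronic device over the wireless network.
S108. The first electronic device determines the sound image orientation of the user's voice corresponding to the user audio signal;
FIG. 7 shows a flowchart of the first electronic device determining the sound image orientation of the user's voice corresponding to the user audio signal.
In some embodiments, the first electronic device may first use the first audio signal to determine the user's sound source direction. In other embodiments, to improve the accuracy of the resulting sound source direction, the first electronic device may also combine the bone conduction audio signal with the first audio signal to obtain the user's sound source direction. See step S201 for details.
Then, the first electronic device may obtain the pixel positions of the faces in the image and, using the pixel position of each face together with the user's sound source direction, obtain the sound image orientation of the user's voice. See steps S202 to S204 for details.
S201. The first electronic device uses the first audio signal to determine the user's sound source direction;
In some embodiments, the user's sound source direction may be the direction angle of the user's sound source relative to the microphone center of the first electronic device. The direction angle may include at least one of a horizontal (azimuth) angle and a pitch angle. The horizontal angle is denoted α and the pitch angle β.
In other embodiments, the user's sound source direction may be the azimuth angle of the user's sound source relative to the microphone center of the first electronic device.
It can be understood that the user's sound source direction may also be expressed in other ways, which is not limited in the embodiments of this application.
Assume here that the user's sound source direction is expressed as the horizontal angle and pitch angle of the sound source relative to the microphone of the first electronic device, written θ=[α,β].
The horizontal angle α and pitch angle β can be obtained from the first audio signal; for specific implementations, see the algorithms described below:
In some embodiments, the first electronic device may determine the horizontal angle α and pitch angle β from the first audio signal using a high-resolution spatial spectrum estimation algorithm.
In other embodiments, the first electronic device may determine the horizontal angle α and pitch angle β from the beamforming of its N microphones and the first audio signal, using a maximum-output-power beamforming algorithm.
It can be understood that the first electronic device may also determine the horizontal angle α and pitch angle β in other ways, which is not limited in the embodiments of this application.
The following takes the maximum-output-power beamforming algorithm as an example and describes one possible implementation in detail with a concrete algorithm. It can be understood that this algorithm does not limit this application.
By comparing the output power of the first audio signal in each direction, the first electronic device can take the beam direction with the maximum power as the target sound source direction, which is the user's sound source direction. The formula for the target sound source direction θ can be expressed as: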
Figure PCTCN2022086166-appb-000001
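where t denotes the time frame, that is, the processing frame of the audio signal; i denotes the i-th microphone; H_i(f,θ) denotes the beam weight of the i-th microphone in the beamforming; and Y_i(f,t) denotes the time-frequency-domain audio signal obtained from the sound information collected by the i-th microphone.
Here, beamforming refers to the response of the N microphones to a narrowband sound signal. Because this response differs from direction to direction, beamforming is tied to the sound source direction. Beamforming can therefore locate a sound source in real time and suppress interference from background noise.
Beamforming can be expressed as a 1×N matrix, written H(f,θ), where N is the number of microphones. The value of its i-th element is H_i(f,θ), which depends on the position of the i-th microphone within the arrangement of the N microphones. The beamforming can be obtained from a power spectrum, such as the Capon spectrum or the Bartlett spectrum.
Taking the Bartlett spectrum as an example, the i-th element of the beamforming obtained by the first electronic device can be expressed as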
Figure PCTCN2022086166-appb-000002
where j is the imaginary unit,
Figure PCTCN2022086166-appb-000003
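is the phase compensation value that the beamformer applies to that microphone, and τ_i denotes the delay difference with which the same sound information arrives at the i-th microphone. This delay difference depends on the sound source direction and on the position of the i-th microphone, as described below.
A three-dimensional coordinate system is established with its origin at the center of the microphone, among the N microphones, that receives the sound information first. In this coordinate system, the position of the i-th microphone can be written P_i=[x_i,y_i,z_i]. The relationship between τ_i, the sound source direction, and the position of the i-th microphone can then be expressed by the following formula: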
Figure PCTCN2022086166-appb-000004
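where c is the propagation speed of the sound signal.
The first audio signal includes the audio signals obtained from the sound information collected by the N microphones, where N is a positive integer greater than 1.
The sound information collected by the i-th microphone can be converted into a time-frequency-domain audio signal, expressed as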
Figure PCTCN2022086166-appb-000005
where s_o(f,t) is the time-frequency-domain audio signal converted from the sound information that the microphone at the origin collects over time t.
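The maximum-output-power search described above can be sketched for a single frequency bin as follows. Because the formula images are not reproduced in this text, several details are assumptions made for illustration: a far-field plane-wave delay model, Bartlett (delay-and-sum) weights H_i = exp(+j·2π·f·τ_i), an axis convention with z pointing out of the device, x to the right and y up, a fixed origin at the first listed microphone, c = 343 m/s, and a coarse 5° search grid. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple

SPEED_OF_SOUND = 343.0  # m/s; assumed value for c


def direction_vector(alpha_deg: float, beta_deg: float) -> Tuple[float, float, float]:
    """Unit vector toward the source for horizontal angle alpha and pitch angle beta."""
    a, b = math.radians(alpha_deg), math.radians(beta_deg)
    return (math.sin(a) * math.cos(b), math.sin(b), math.cos(a) * math.cos(b))


def delays(mics: Sequence[Sequence[float]], alpha_deg: float, beta_deg: float) -> List[float]:
    """Far-field arrival delay tau_i of each microphone relative to the origin."""
    ux, uy, uz = direction_vector(alpha_deg, beta_deg)
    return [-(x * ux + y * uy + z * uz) / SPEED_OF_SOUND for x, y, z in mics]


def steered_power(snapshot: Sequence[complex], mics: Sequence[Sequence[float]],
                  freq_hz: float, alpha_deg: float, beta_deg: float) -> float:
    """|sum_i H_i(f, theta) * Y_i(f, t)|^2 with H_i = exp(+j*2*pi*f*tau_i)."""
    taus = delays(mics, alpha_deg, beta_deg)
    total = sum(cmath.exp(2j * math.pi * freq_hz * tau) * y
                for tau, y in zip(taus, snapshot))
    return abs(total) ** 2


def estimate_direction(snapshot: Sequence[complex], mics: Sequence[Sequence[float]],
                       freq_hz: float, step_deg: int = 5) -> Tuple[int, int]:
    """Grid search over the front hemisphere for the direction of maximum power."""
    best_angles, best_power = (0, 0), -1.0
    for alpha in range(-85, 86, step_deg):
        for beta in range(-85, 86, step_deg):
            power = steered_power(snapshot, mics, freq_hz, alpha, beta)
            if power > best_power:
                best_angles, best_power = (alpha, beta), power
    return best_angles


def simulate_snapshot(mics: Sequence[Sequence[float]], freq_hz: float,
                      alpha_deg: float, beta_deg: float) -> List[complex]:
    """Noise-free narrowband snapshot Y_i(f, t) of a unit plane wave (for testing)."""
    return [cmath.exp(-2j * math.pi * freq_hz * tau)
            for tau in delays(mics, alpha_deg, beta_deg)]
```

As a usage example, three microphones at (0, 0, 0), (0.1, 0, 0) and (0, 0.1, 0) metres receiving a simulated 1 kHz plane wave from α = -30°, β = 10° let `estimate_direction` recover that grid point. A wideband variant would sum `steered_power` over several frequency bins before taking the maximum.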
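In some embodiments, when the first audio signal is wideband, processing accuracy can be improved by using the discrete Fourier transform (DFT) to split the first audio signal in the frequency domain into a number of narrowband audio signals, and combining the results at the individual frequency bins to obtain the localization result for the wideband audio signal. For example, a wideband audio signal sampled at 48 kHz is split into 2049 narrowband audio signals by a 4096-point DFT. The algorithm above is then applied to every narrowband audio signal, or to a selection of them, to determine the target sound source direction. The formula for the target sound source direction θ can be expressed as: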
Figure PCTCN2022086166-appb-000006
where f denotes the frequency bin value in the frequency domain.
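The figure of 2049 narrowband signals for a 4096-point DFT is consistent with keeping only the non-redundant bins of a real-valued input (N/2 + 1 of them). The short sketch below works through that arithmetic; the function names are illustrative, and the real-input assumption is ours rather than something the text states.

```python
from typing import List


def narrowband_bin_count(n_fft: int) -> int:
    """Number of non-redundant DFT bins for a real-valued input of length n_fft."""
    return n_fft // 2 + 1


def bin_frequencies(sample_rate_hz: float, n_fft: int) -> List[float]:
    """Center frequency in Hz of each non-redundant bin."""
    return [k * sample_rate_hz / n_fft for k in range(narrowband_bin_count(n_fft))]
```

With a 48 kHz sample rate and a 4096-point DFT, the bins are spaced 48000 / 4096 = 11.71875 Hz apart and run from 0 Hz to 24 kHz.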
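Optionally, in other embodiments, because the first audio signal contains other sound information in addition to the user's voice information, the first electronic device may, to keep that other sound information from affecting the determination of the user's sound source direction, use the bone conduction audio signal to filter the other sound signals out of the first audio signal and enhance the user's voice information in it, so that the resulting sound source direction is more accurate.
For example, a correlation analysis of the first audio signal can be carried out against the bone conduction audio signal: audio information at time-frequency points of the first audio signal that correlate strongly with the bone conduction audio signal is given a larger weight, and audio information at weakly correlated time-frequency points a smaller weight. This yields a weight matrix w(f,t), whose element w_mn is the weight of the audio signal at the m-th time instant and frequency n. Using this weight matrix together with the algorithm above, the formula for the target sound source direction θ obtained from the bone conduction audio signal combined with the first audio signal can be expressed as: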
Figure PCTCN2022086166-appb-000007
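S202. The first electronic device obtains, from the image, the pixel positions of the faces in the image;
The faces are all the faces in the image that the first electronic device can recognize. The pixel position of a face in the image can be expressed by the face's pixel coordinates in the image pixel coordinate system.
In some embodiments, one pixel can be selected from the face and its pixel coordinates used to represent the face's pixel position. For example, the pixel coordinates of the center of the mouth, or of the center of the face, can be used.
The first electronic device may sample the N frames of images obtained over a continuous period of time and determine the pixel positions of the faces in one of the frames.
In some embodiments, the first electronic device may perform face recognition on the image to obtain the pixel coordinates of the faces; these pixel coordinates are the pixel positions of the faces in the images of this period.
The pixel positions of the faces can be represented by a matrix, written H. Its i-th element is the pixel position of the i-th face in the image and can be written H_i=[u_i,v_i].
S203. The first electronic device obtains the user's visual orientation in the image from the sound source direction and the pixel positions of the faces;
From the sound source direction and the pixel positions of the faces, the first electronic device can obtain the correlation between each face's pixel position and the sound source direction. It takes the visual orientation of the face whose pixel position correlates most strongly with the sound source direction as the user's visual orientation.
Specifically, the first electronic device can use the sound source direction to obtain the user's approximate pixel coordinates in the image. It then finds, among the face pixel positions, the one closest to these approximate pixel coordinates, takes it as the user's pixel coordinates in the image, and uses those coordinates to obtain the user's visual orientation.
One implementation with a specific algorithm is given below. It can be understood that this algorithm does not limit the embodiments of this application.
Assume here that the user's sound source direction is expressed as the horizontal angle and pitch angle of the sound source relative to the microphone of the first electronic device, written θ=[α,β]. One algorithm for obtaining the user's visual orientation from this sound source direction is described below.
The sound source direction is obtained relative to the microphone center. Because the distance between the microphone center and the camera center in the first electronic device is much smaller than the distance between the first electronic device and the user, the sound source direction can be regarded as relative to the camera center. The horizontal angle α and pitch angle β of the sound source direction are taken as the user's approximate horizontal angle and pitch angle relative to the camera, and the user's visual orientation can be obtained with an algorithm that involves the camera parameters.
Specifically, the horizontal angle α and pitch angle β are first used to obtain the user's coordinates g=[x,y] in the image plane coordinate system. The x component is given by x = f tan α, where f is the focal length of the camera. The y component is given by y = f cos α tan β.
Then the coordinates g=[x,y] in the image plane coordinate system are converted into pixel coordinates h=[u,v] in the image, which are the user's approximate pixel coordinates in the image. The conversion formula can be expressed as: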
Figure PCTCN2022086166-appb-000008
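where u_0 and v_0 are the pixel coordinates (u_0,v_0), in the image pixel coordinate system, of the origin of the image plane coordinate system; dx is the length of one pixel along the U axis of the image; and dy is the length of one pixel along the V axis.
The approximate pixel coordinates h=[u,v] are then matched against the face pixel positions H_i=[u_i,v_i] to obtain the correlation between each face and h=[u,v]. For example, the smaller the distance between a face's pixel position and the approximate pixel coordinates, the stronger the correlation, and hence the stronger the correlation between that face's pixel position and the sound source direction. The face pixel position among H_i=[u_i,v_i] that correlates most strongly with h=[u,v] is taken as the user's pixel coordinates in the image, written h′=[u′,v′].
Finally, the user's pixel coordinates in the image are converted into a horizontal angle α′ and a pitch angle β′ relative to the camera, which serve as the user's visual orientation θ′=[α′,β′]. This conversion is the reverse of the process above that uses the horizontal angle α and pitch angle β to obtain the user's approximate pixel coordinates h=[u,v], and is not described again here.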
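The chain described above (angles to image plane, image plane to pixels, nearest-face matching, and back to angles) can be sketched as follows. The angle-to-plane formulas follow the text as stated (x = f tan α, y = f cos α tan β). The plane-to-pixel step assumes the usual form u = x/dx + u_0, v = y/dy + v_0, since the formula image is not reproduced here, and "correlation" is reduced to Euclidean pixel distance as in the example the text gives. Function names and the sample camera parameters are hypothetical.

```python
import math
from typing import Sequence, Tuple

Pixel = Tuple[float, float]


def angles_to_image_plane(alpha: float, beta: float, f: float) -> Tuple[float, float]:
    """Image-plane coordinates g = [x, y] from angles in radians, as stated in the text."""
    return (f * math.tan(alpha), f * math.cos(alpha) * math.tan(beta))


def image_plane_to_pixel(x: float, y: float, u0: float, v0: float,
                         dx: float, dy: float) -> Pixel:
    """Assumed conversion: u = x / dx + u0, v = y / dy + v0."""
    return (x / dx + u0, y / dy + v0)


def pick_user_face(approx: Pixel, faces: Sequence[Pixel]) -> Pixel:
    """Face pixel position H_i closest to the approximate pixel coordinates h."""
    return min(faces, key=lambda h: math.hypot(h[0] - approx[0], h[1] - approx[1]))


def pixel_to_angles(u: float, v: float, f: float, u0: float, v0: float,
                    dx: float, dy: float) -> Tuple[float, float]:
    """Inverse of the two conversions above: pixel coordinates back to (alpha', beta')."""
    x, y = (u - u0) * dx, (v - v0) * dy
    alpha = math.atan2(x, f)
    beta = math.atan2(y, f * math.cos(alpha))
    return (alpha, beta)
```

For example, with a 4 mm focal length, 2 µm pixels and a principal point at (960, 540), a sound source direction of α = -20°, β = 5° maps to roughly pixel (232, 704); among faces detected at (250, 700) and (1500, 500), the first is selected as the user.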
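It can be understood that the conversion may differ when the sound source direction is expressed differently. The algorithm in step S203 is not limited in the embodiments of this application.
In some embodiments, to further improve the accuracy with which the first electronic device obtains the user's visual orientation, the correlation between each face and the sound source direction obtained in steps S201 to S203 can be used as a first determining factor.
A second determining factor is then added and, together with the first, used to decide which face is the user. The second determining factor is the correlation between a first feature of each face and the first feature in the user's bone conduction audio signal.
Specifically, the first electronic device may use the bone conduction audio signal to extract the first feature from the user's sound signal, and extract the first feature for each face in the image.
The first feature of a face can be obtained in different ways.
For example, in some embodiments, the first electronic device may extract the first feature of a face from the image. In this case, the first feature may include a voice activity detection (VAD) feature, a phoneme feature, and the like.
In other embodiments, the first electronic device may extract the first feature from the ultrasonic echo signal or laser echo signal produced by throat vibration when the person speaks. In this case, the first feature may be a pitch feature or the like, and the first electronic device may take it as the first feature of the face.
The ultrasonic echo signal produced by throat vibration when a person speaks can be collected by the vibration sensor or ultrasonic sensor of the first electronic device. The laser echo signal can be collected by the laser sensor of the first electronic device.
The first electronic device may carry out a correlation analysis between the first feature in the user's sound signal and the first features of the faces in the image, obtaining the correlation between each face's first feature and the first feature in the user's bone conduction audio signal, and use this correlation as the second determining factor.
Because the distance between the microphone center and the camera center is much smaller than the distance between the first electronic device and the user, the first electronic device can take the user's visual orientation as the sound image orientation of the user's voice.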
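S109. The first electronic device obtains a stereo audio signal from the sound image orientation of the user's voice and the user audio signal.
In some embodiments, the stereo audio signal includes only a user stereo audio signal and no ambient stereo audio signal, in which case the user stereo audio signal is the stereo audio signal. The user stereo audio signal contains the user's voice information and can be used to reproduce the user's voice; in the reproduced voice, the sound image orientation of the user's voice matches the user's visual orientation.
The user stereo audio signal is a two-channel user audio signal.
The user audio signal described above is a single-channel audio signal, and the user's voice reproduced from it is not stereophonic. To obtain the user stereo audio signal and reproduce the user's voice more realistically, the single-channel audio signal can be converted into a two-channel user audio signal, from which the reproduced voice is stereophonic.
The first electronic device can combine the sound image orientation of the user's voice with the user audio signal to obtain the user stereo audio signal corresponding to that sound image orientation.
Specifically, the first electronic device convolves the user audio signal with the head-related impulse response (HRIR) corresponding to the sound image orientation of the user's voice to restore the inter-aural level difference (ILD), the inter-aural time difference (ITD), and spectral cues, so that the mono user audio signal becomes a two-channel user audio signal with a left channel and a right channel. The ILD, ITD, and spectral cues allow the sound image orientation of the user's voice to be determined from the two-channel user audio signal.
It can be understood that, besides the HRIR approach, the first electronic device may also use other algorithms to obtain the two-channel user audio signal, such as the binaural room impulse response (BRIR).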
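The HRIR step above amounts to two convolutions, one per ear. The sketch below shows only that core operation with a direct-form convolution; it assumes the left and right HRIRs for the desired sound image orientation have already been looked up from some measured set, which is not shown. The function names are illustrative, and the toy impulse responses in the usage note are made up rather than measured.

```python
from typing import List, Sequence, Tuple


def convolve(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(kernel):
            out[n + k] += s * h
    return out


def binauralize(mono: Sequence[float], hrir_left: Sequence[float],
                hrir_right: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Turn a mono signal into (left, right) channels using a pair of HRIRs."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```

As a toy example for a source on the left, a left-ear response of [1.0, 0.0, 0.0] and a right-ear response of [0.0, 0.0, 0.5] delay the right channel by two samples (an inter-aural time difference) and halve its level (an inter-aural level difference). Real HRIRs are hundreds of taps long, so a production implementation would typically use FFT-based convolution instead.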
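In some embodiments, the stereo audio information in the recorded video may include an ambient stereo audio signal in addition to the user stereo audio signal. The ambient stereo audio signal can be used to reproduce the sounds in the shooting environment other than the user's voice. It is a two-channel ambient audio signal, where the ambient audio signal is the electrical signal converted from the other sound signals in the environment.
Generally, because the second electronic device filters out most of the other sound information in the environment when it collects sound signals, the other environmental sound information contained in the first audio signal is clearer than that in the second audio signal. The first electronic device can therefore use the first audio signal to obtain the audio signal of the other sounds in the environment, which is the ambient audio signal, and then use it to obtain the ambient stereo audio signal.
Specifically, in some embodiments, the first electronic device may first perform adaptive blocking filtering on the first audio signal using the user audio signal, filtering out the user's voice information in the first audio signal to obtain the ambient audio signal. It then obtains a two-channel ambient audio signal through its beamforming; this signal may be in X/Y, M/S, or A/B format.
In other embodiments, the first electronic device may obtain the ambient stereo audio signal in another way. For example, it first uses the first audio signal to obtain a stereo first audio signal, that is, a two-channel first audio signal. It then performs adaptive blocking filtering on the stereo first audio signal using the user audio signal, filtering out the user's voice information to obtain the ambient stereo audio signal. This is not limited in the embodiments of this application.
The ambient stereo audio signal is then mixed with the user stereo audio signal to obtain a stereo audio signal that contains both the user's voice information and the other sound information in the environment.
While mixing the ambient stereo audio signal with the user stereo audio signal, the first electronic device may also apply audio zoom to the user stereo audio signal so that the size of the user's sound image matches the user's distance from the first electronic device when speaking. The size of the user's sound image is the volume of the user's voice in the video.
The first electronic device may determine the volume of the user stereo audio signal in the video from focus information given by the user.
Specifically, in some embodiments, when the first electronic device is recording a video and the user performs an operation that increases the shooting focal length, the first electronic device determines that the user has moved closer to it and may then enhance the user stereo audio signal, while leaving the ambient stereo audio signal unchanged or suppressing it. The user stereo audio signal and the ambient stereo audio signal are then mixed to obtain the stereo audio signal. In the video, the volume of the user's voice in the stereo audio signal then increases, and the other sounds in the environment are relatively quieter.
Suppressing the ambient stereo audio signal makes the other sounds around the first electronic device, as reproduced from that signal, quieter.
In some embodiments, when the first electronic device is recording a video and the user performs an operation that reduces the shooting focal length, the first electronic device determines that the user has moved farther away from it and may then suppress the user stereo audio signal, while leaving the ambient stereo audio signal unchanged or enhancing it. The user stereo audio signal and the ambient stereo audio signal are then mixed to obtain the stereo audio signal. In the video, the volume of the user's voice in the stereo audio signal then decreases, and the other sounds in the environment are relatively louder.
In other embodiments, besides using the shooting focal length set by the user, the first electronic device may mix the ambient stereo audio signal and the user stereo audio signal according to a volume ratio determined in some other way. For example, a default mixing ratio may be set. This is not limited in the embodiments of this application.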
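The audio zoom and mixing behaviour described above can be sketched as a pair of gains applied before summing the two stereo signals. The specific gain values (1.5 for enhancement, 0.5 for suppression), the choice to leave the ambient signal unchanged, and the function names are assumptions for illustration; the text only says that the user signal is enhanced or suppressed and that the ambient signal may be left unchanged, suppressed, or enhanced.

```python
from typing import List, Sequence, Tuple

Stereo = Tuple[Sequence[float], Sequence[float]]


def zoom_gains(previous_focal: float, new_focal: float,
               boost: float = 1.5, cut: float = 0.5) -> Tuple[float, float]:
    """Return (user_gain, ambient_gain) for a change in shooting focal length."""
    if new_focal > previous_focal:   # zoom in: user treated as closer
        return (boost, 1.0)
    if new_focal < previous_focal:   # zoom out: user treated as farther away
        return (cut, 1.0)
    return (1.0, 1.0)


def mix_stereo(user: Stereo, ambient: Stereo, user_gain: float,
               ambient_gain: float) -> Tuple[List[float], List[float]]:
    """Sample-wise weighted sum of the user and ambient stereo signals."""
    left = [user_gain * u + ambient_gain * a for u, a in zip(user[0], ambient[0])]
    right = [user_gain * u + ambient_gain * a for u, a in zip(user[1], ambient[1])]
    return left, right
```

A default mixing ratio, as mentioned in the text, corresponds to calling `mix_stereo` with fixed gains instead of the output of `zoom_gains`. A real mixer would also guard against clipping after the sum.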
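It should be understood that steps S101 to S105 have no fixed order, provided that step S103 follows step S101. Steps S106 and S107 likewise have no fixed order and may be performed simultaneously; that is, in some embodiments, the second electronic device may encode the user audio signal and the bone conduction audio signal together and send them to the first electronic device at the same time.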
可以理解的是,用户使用第二电子设备时,利用第一电子设备可以多次根据步骤S101-步骤S109,得到多帧图像对应的用户立体声音频信号,将该多帧图像对应的立体声音频信号进行编码,得到音频流。同时,将多帧图像进行编码,得到视频流。然后,将该音频流与视频流进行混流,即可得到录制的视频。电子设备可以将该视频中的多帧图像经过处理,得到多帧图像。
第一电子设备在播放该视频时,在某一个时刻可以播放一帧图像,且该图像会在显示屏中停留一段时间,从该时刻开始的一段时间内,第一电子设备可以播放该帧图像对应的立体声音频信号,立体声音频信号可以还原出用户的声音,此时,用户的声音的声像方位与用户的视像方位是匹配的。当前帧图像播放完之后,就可以播放下一帧图像,同时播放下一帧图像对应的立体声音频信号,直到该视频播放完毕。
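For example, at a first moment, the first electronic device may display a first frame of image and start playing the stereo audio signal corresponding to the first frame; this stereo audio signal can reproduce the user's voice. In the first frame, the user's visual image direction is toward the left, so the sound image direction of the user's voice reproduced by the corresponding stereo audio signal is also toward the left. The first frame may stay on the display of the electronic device for a period of time, during which the first electronic device continues to play the corresponding stereo audio signal. Then, at a second moment, the first electronic device may display a second frame of image and start playing the stereo audio signal corresponding to the second frame, which can likewise reproduce the user's voice. In the second frame, the user's visual image direction is toward the right, so the sound image direction of the user's voice reproduced by the corresponding stereo audio signal is also toward the right.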
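The left/right correspondence in this example can be made concrete with a simple geometric sketch that converts the user's horizontal pixel position into an azimuth. The pinhole camera model, the field-of-view parameter, and the function name are assumptions introduced here; this section does not state how the mapping is actually computed.

```python
import math

def pixel_to_azimuth_deg(x_pixel, image_width, horizontal_fov_deg):
    """Map a horizontal pixel position to an azimuth in degrees.

    Assumes an ideal pinhole camera whose optical axis passes through
    the image centre. Negative values are to the left of centre,
    positive values to the right.
    """
    centre_x = image_width / 2.0
    # Focal length expressed in pixels, derived from the horizontal field of view.
    focal_px = centre_x / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return math.degrees(math.atan((x_pixel - centre_x) / focal_px))

# A face in the left quarter of a 1920-pixel-wide frame versus the right quarter.
left_azimuth = pixel_to_azimuth_deg(480, 1920, 90.0)
right_azimuth = pixel_to_azimuth_deg(1440, 1920, 90.0)
```

An azimuth obtained this way could be used to choose the HRIR pair for that frame, so that a user shown on the left is also heard on the left.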
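In the video, the sound image direction of the user's voice matches the visual image direction of the user. The user audio signal in the stereo audio signal of the video is collected by the second electronic device, so the user stereo audio signal can reproduce the user's voice, and that voice is clear.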
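In conclusion, the foregoing embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.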
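As used in the foregoing embodiments, depending on the context, the term "when ..." may be interpreted as meaning "if ...", "after ...", "in response to determining ...", or "in response to detecting ...". Similarly, depending on the context, the phrase "when it is determined that ..." or "if (a stated condition or event) is detected" may be interpreted as meaning "if it is determined that ...", "in response to determining ...", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".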
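All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over a coaxial cable, an optical fiber, or a digital subscriber line) or in a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state drive), or the like.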
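A person of ordinary skill in the art may understand that all or some of the procedures of the methods in the foregoing embodiments may be completed by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the procedures of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a random access memory (RAM), a magnetic disk, or an optical disc.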

Claims (21)

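  1. A video recording method, characterized by comprising:
    in a process in which a first electronic device records a video, collecting, by the first electronic device, an image and a first audio signal;
    determining, by the first electronic device based on the image and the first audio signal, a sound image direction of a user's voice in the image;
    generating, by the first electronic device, a stereo audio signal based on the sound image direction of the user's voice and a user audio signal, wherein the user audio signal is obtained by the second electronic device and sent to the first electronic device; and
    generating, by the first electronic device, a video based on the image and the stereo audio signal.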
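  2. The method according to claim 1, characterized in that the determining, by the first electronic device based on the image and the first audio signal, a sound image direction of a user's voice in the image specifically comprises:
    performing, by the first electronic device, face recognition on the image to obtain a pixel position of a face;
    determining, by the first electronic device, a sound source direction of the user based on the first audio signal;
    determining, by the first electronic device, a pixel position of the user in the image based on the sound source direction of the user and the pixel position of the face; and
    determining, by the first electronic device, the sound image direction of the user's voice in the image based on the pixel position of the user in the image.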
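  3. The method according to claim 1, characterized in that the determining, by the first electronic device based on the image and the first audio signal, a sound image direction of a user's voice in the image specifically comprises:
    performing, by the first electronic device, face recognition on the image to obtain a pixel position of a face;
    determining, by the first electronic device, a sound source direction of the user by using a bone conduction audio signal in combination with the first audio signal, wherein the bone conduction audio signal is obtained by the second electronic device and sent to the first electronic device;
    determining, by the first electronic device, a pixel position of the user in the image based on the sound source direction of the user and the pixel position of the face; and
    determining, by the first electronic device, the sound image direction of the user's voice in the image based on the pixel position of the user in the image.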
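  4. The method according to claims 1-3, characterized in that the generating, by the first electronic device, a stereo audio signal based on the sound image direction of the user's voice and a user audio signal specifically comprises:
    generating, by the first electronic device, an ambient stereo audio signal; and
    generating, by the first electronic device, the stereo audio signal based on the sound image direction of the user's voice, the user audio signal, and the ambient stereo audio signal.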
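  5. The method according to claim 4, characterized in that the generating, by the first electronic device, an ambient stereo audio signal specifically comprises:
    performing, by the first electronic device, adaptive blocking filtering on the first audio signal based on the user audio signal, to filter out the user's voice information from the first audio signal; and
    generating, by the first electronic device, the ambient stereo audio signal based on the filtered first audio signal.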
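  6. The method according to claim 4, characterized in that the generating, by the first electronic device, an ambient stereo audio signal specifically comprises:
    generating, by the first electronic device, a stereo first audio signal by using the first audio signal; and
    performing, by the first electronic device, adaptive blocking filtering on the stereo first audio signal based on the user audio signal, to filter out the user's voice information from the stereo first audio signal and obtain the ambient stereo audio signal.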
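  7. The method according to claim 4, characterized in that the generating, by the first electronic device, the stereo audio signal based on the sound image direction of the user's voice, the user audio signal, and the ambient stereo audio signal specifically comprises:
    generating, by the first electronic device, a user stereo audio signal based on the sound image direction of the user's voice and the user audio signal;
    enhancing, by the first electronic device, the user stereo audio signal while leaving the ambient stereo audio signal unchanged; and
    generating, by the first electronic device, the stereo audio signal based on the enhanced user stereo audio signal and the ambient stereo audio signal.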
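  8. The method according to claim 4, characterized in that the generating, by the first electronic device, the stereo audio signal based on the sound image direction of the user's voice, the user audio signal, and the ambient stereo audio signal specifically comprises:
    generating, by the first electronic device, a user stereo audio signal based on the sound image direction of the user's voice and the user audio signal;
    enhancing, by the first electronic device, the user stereo audio signal while suppressing the ambient stereo audio signal; and
    generating, by the first electronic device, the stereo audio signal based on the enhanced user stereo audio signal and the suppressed ambient stereo audio signal.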
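  9. The method according to claims 1-8, characterized in that the user audio signal is obtained by the second electronic device performing joint noise reduction on the second audio signal based on the bone conduction audio signal, thereby removing from the second audio signal the other sound information in the environment around the second electronic device.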
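  10. The method according to claims 1-8, characterized in that the user audio signal is obtained by the second electronic device performing noise reduction on a second audio signal, thereby removing from the second audio signal the other sound information in the environment around the second electronic device.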
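  11. An electronic device, characterized in that the electronic device comprises one or more processors and a memory;
    the memory is coupled to the one or more processors and is configured to store computer program code, the computer program code comprises computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to perform:
    in a process of recording a video, collecting an image and a first audio signal;
    determining, based on the image and the first audio signal, a sound image direction of a user's voice in the image;
    generating a stereo audio signal based on the sound image direction of the user's voice and a user audio signal, wherein the user audio signal is obtained by the second electronic device and sent to the electronic device; and
    generating a video based on the image and the stereo audio signal.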
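  12. The electronic device according to claim 11, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    performing face recognition on the image to obtain a pixel position of a face;
    determining a sound source direction of the user based on the first audio signal;
    determining a pixel position of the user in the image based on the sound source direction of the user and the pixel position of the face; and
    determining the sound image direction of the user's voice in the image based on the pixel position of the user in the image.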
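  13. The electronic device according to claim 11, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    performing face recognition on the image to obtain a pixel position of a face;
    determining a sound source direction of the user by using a bone conduction audio signal in combination with the first audio signal, wherein the bone conduction audio signal is obtained by the second electronic device and sent to the electronic device;
    determining a pixel position of the user in the image based on the sound source direction of the user and the pixel position of the face; and
    determining the sound image direction of the user's voice in the image based on the pixel position of the user in the image.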
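  14. The electronic device according to claims 11-13, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    generating an ambient stereo audio signal; and
    generating the stereo audio signal based on the sound image direction of the user's voice, the user audio signal, and the ambient stereo audio signal.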
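  15. The electronic device according to claim 14, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    performing adaptive blocking filtering on the first audio signal based on the user audio signal, to filter out the user's voice information from the first audio signal; and
    generating the ambient stereo audio signal based on the filtered first audio signal.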
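  16. The electronic device according to claim 14, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    performing adaptive blocking filtering on the stereo first audio signal based on the user audio signal, to filter out the user's voice information from the stereo first audio signal and obtain the ambient stereo audio signal.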
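  17. The electronic device according to claim 14, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    generating a user stereo audio signal based on the sound image direction of the user's voice and the user audio signal;
    enhancing the user stereo audio signal while leaving the ambient stereo audio signal unchanged; and
    generating the stereo audio signal based on the enhanced user stereo audio signal and the ambient stereo audio signal.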
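  18. The electronic device according to claim 14, characterized in that the one or more processors are specifically configured to invoke the computer instructions to cause the electronic device to perform:
    generating a user stereo audio signal based on the sound image direction of the user's voice and the user audio signal;
    enhancing the user stereo audio signal while suppressing the ambient stereo audio signal; and
    generating the stereo audio signal based on the enhanced user stereo audio signal and the suppressed ambient stereo audio signal.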
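  19. A chip system, applied to an electronic device, wherein the chip system comprises one or more processors, and the processors are configured to invoke computer instructions to cause the electronic device to perform the method according to any one of claims 1-10.
  20. A computer program product comprising instructions, characterized in that, when the computer program product runs on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-10.
  21. A computer-readable storage medium comprising instructions, characterized in that, when the instructions are run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-10.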
PCT/CN2022/086166 2021-04-17 2022-04-11 Video recording method and electronic device WO2022218271A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22787497.1A EP4297398A1 (en) 2021-04-17 2022-04-11 Video recording method and electronic devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110415047.4 2021-04-17
CN202110415047.4A CN115225840A (zh) 2021-04-17 2021-04-17 Video recording method and electronic device

Publications (1)

Publication Number Publication Date
WO2022218271A1 (zh) 2022-10-20

Family

ID=83604962

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/086166 WO2022218271A1 (zh) 2021-04-17 2022-04-11 Video recording method and electronic device

Country Status (3)

Country Link
EP (1) EP4297398A1 (zh)
CN (1) CN115225840A (zh)
WO (1) WO2022218271A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
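CN1845582A (zh) * 2005-04-06 2006-10-11 Sony Corporation Imaging apparatus, sound recording apparatus, and sound recording method
CN104581512A (zh) * 2014-11-21 2015-04-29 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Stereo recording method and apparatus
US20180115744A1 (en) * 2016-10-20 2018-04-26 Plantronics, Inc. Combining Audio and Video Streams for a Video Headset
CN110740259A (zh) * 2019-10-21 2020-01-31 Vivo Mobile Communication Co., Ltd. Video processing method and electronic device
CN111402915A (zh) * 2020-03-23 2020-07-10 Lenovo (Beijing) Co., Ltd. Signal processing method, apparatus, and system
CN107004426B (zh) * 2014-11-28 2020-09-11 Huawei Technologies Co., Ltd. Method and mobile terminal for recording the sound of a video-recorded object
CN111970625A (zh) * 2020-08-28 2020-11-20 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Sound recording method and apparatus, terminal, and storage medium
CN112165590A (zh) * 2020-09-30 2021-01-01 Lenovo (Beijing) Co., Ltd. Video recording implementation method and apparatus, and electronic device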

Also Published As

Publication number Publication date
CN115225840A (zh) 2022-10-21
EP4297398A1 (en) 2023-12-27

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application Ref document number: 22787497; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase Ref document number: 2022787497; Country of ref document: EP
ENP Entry into the national phase Ref document number: 2022787497; Country of ref document: EP; Effective date: 20230919
NENP Non-entry into the national phase Ref country code: DE