CN117636928A - Pickup device and related audio enhancement method - Google Patents


Info

Publication number
CN117636928A
Authority
CN
China
Prior art keywords
processing unit
target object
information
audio
video information
Prior art date
Legal status (assumed by Google Patents; not a legal conclusion)
Pending
Application number
CN202210980664.3A
Other languages
Chinese (zh)
Inventor
杨子路
鄢展鹏
刘军康
柯波
王海明
唐玺
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202210980664.3A
Publication of CN117636928A

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

Embodiments of the present application disclose a sound pickup apparatus and a related audio enhancement method. The sound pickup apparatus includes a visual processing unit and an audio processing unit connected by a bus. The visual processing unit is configured to: determine position information of a target object relative to the sound pickup apparatus according to image data of the target object and depth data including distance information between the target object and the sound pickup apparatus; and transmit the position information to the audio processing unit through the bus. The audio processing unit is configured to determine a first audio signal of the target object based on the position information of the target object. The embodiments of the present application can improve sound pickup performance and quality, thereby improving user experience.

Description

Pickup device and related audio enhancement method
Technical Field
The present disclosure relates to the field of audio electronics, and more particularly, to a sound pickup apparatus and an associated audio enhancement method.
Background
As electronic devices become more popular, they are increasingly used in noisy environments such as airports, outdoor streets, restaurants, and traffic settings. When an electronic device picks up sound, noise must be suppressed to obtain a clear target sound. For example, in video shooting, if clear speech is required, the electronic device needs to locate the position of the sound source so as to weaken or eliminate noise, i.e., to separate the speech from the noise.
Currently, acoustic localization techniques are often used to determine the position of a target sound, for example by implementing an audio zoom system based on two microphones. An audio zoom application captures and enhances sound from the direction of the target while attenuating interference sources from all other directions. In this process, the target sound is typically determined and enhanced by beamforming, which can be either fixed or adaptive. Fixed beamforming (e.g., delay-and-sum and super-directive beamforming) mainly exploits the differences in arrival delay of the sound wave at each sensor; adaptive beamforming additionally requires real-time estimation of the target azimuth and the ambient noise. In practice, however, if the noise and target azimuth are estimated inaccurately, adaptive beamforming damages the target sound. Furthermore, due to the size constraints of electronic devices such as cell phones, the microphones are typically spaced less than 0.2 m apart, making it difficult to achieve the desired beamforming effect. In summary, because the positional accuracy achievable with acoustic localization is low, the sound pickup quality of the electronic device is reduced and the resulting sound is unclear.
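To make the spacing constraint concrete: the largest time-difference-of-arrival (TDOA) two microphones can ever observe is their spacing divided by the speed of sound. The short sketch below illustrates this; the 48 kHz sample rate is an assumption, not a value from the patent.

```python
SPEED_OF_SOUND = 343.0  # m/s, in air at roughly 20 degrees C

def max_tdoa(mic_spacing_m: float) -> float:
    """Largest possible time-difference-of-arrival between two
    microphones: sound arriving along the axis joining them."""
    return mic_spacing_m / SPEED_OF_SOUND

# For a phone-sized 0.2 m baseline, the entire range of observable
# delays spans under 0.6 ms -- fewer than 30 samples at 48 kHz --
# which is why purely acoustic localization on a phone is coarse.
delay_s = max_tdoa(0.2)
delay_samples = delay_s * 48_000
```

At this resolution, a small error in delay estimation translates into a large angular error, consistent with the accuracy problem described above.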
Therefore, how to provide a sound pickup apparatus and related audio enhancement method to improve the sound pickup performance and quality is a problem to be solved.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is how to provide a pickup device and a related audio enhancement method, so as to improve pickup performance and quality.
In a first aspect, embodiments of the present application provide a sound pickup apparatus, which is characterized in that the sound pickup apparatus includes a visual processing unit and an audio processing unit, and the visual processing unit is connected to the audio processing unit through a bus, where the visual processing unit is configured to: determining position information of a target object relative to the sound pickup apparatus according to image data of the target object and depth data including distance information between the target object and the sound pickup apparatus; transmitting the position information to the audio processing unit; the audio processing unit is used for: a first audio signal of the target object is determined based on the location information of the target object.
In the embodiment of the invention, the visual processing unit acquires the image data and depth data of the target object to determine its three-dimensional spatial position, and sends that position to the audio processing unit over the bus between the two units; the audio processing unit can then determine the audio signal of the target object based on the three-dimensional spatial position. This avoids a problem of one prior approach, in which the spatial position determined from the audio signal alone deviates substantially from the actual position of the target object and the audio rendering effect is therefore poor, and so improves the pickup performance and quality of the pickup device. In another prior approach, a pickup device must first acquire video data of the target object together with the corresponding audio signal, after which a processor analyzes the video data to determine the position information of the target object and renders the audio signal based on that position. However, because the video data and audio signals must be collected in advance and processed together by one processor, a large data volume slows audio rendering; in scenarios with strict real-time requirements the processor cannot render audio quickly enough, which reduces the pickup performance of the pickup device and degrades the user experience.
In summary, in the present application, the error between the three-dimensional spatial position determined by the visual processing unit from the image and depth information and the actual position of the target object is small, and that position can be delivered to the audio processing unit in a timely manner over the bus between the two units. The audio processing unit can therefore pick up sound based on the target object's three-dimensional spatial position. This avoids the prior-art problem of having to acquire video data and audio signals in advance and process them together in a single processor, where a large data volume slows audio rendering, and thus improves the pickup performance and quality of the pickup device.
In a possible implementation, the sound pickup apparatus further includes N microphones, N being an integer greater than 1, and the audio processing unit is further configured to: collect N original audio signals through the N microphones; determine a phase difference between the N original audio signals based on the position information; and process the N original audio signals based on the phase difference to generate the first audio signal.
In the embodiment of the present invention, the pickup device may include a plurality of microphones, so that the audio processing unit may acquire the original audio signal of one target object based on each microphone, and then determine the phase difference between each path of the original audio signal based on the position information (i.e., the three-dimensional spatial position) of the target object. Further, the plurality of original audio signals can be processed based on the phase difference to obtain the audio signal of the target object. Because in the application, the error between the three-dimensional space position determined by the visual processing unit based on the image information and the depth information of the target object and the actual position of the target object is smaller, the problem that the audio rendering effect is poor due to the fact that the error between the space position determined only based on the audio signal and the actual position of the target object is larger in the prior art is avoided, and therefore the pickup performance and quality of the pickup device are improved.
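As an illustration of how an audio processing unit might turn a three-dimensional position into per-microphone delays and combine the channels, here is a minimal delay-and-sum sketch. It is a simplified model of the technique named in the patent, not the claimed implementation: the sample rate, microphone geometry, and integer-sample alignment are all assumptions.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 48_000     # Hz, assumed

def steering_delays(mic_positions, target_pos):
    """Propagation delay (s) from the target to each microphone,
    relative to the closest one -- the inter-channel phase
    difference implied by the 3-D position."""
    dists = [math.dist(m, target_pos) for m in mic_positions]
    nearest = min(dists)
    return [(d - nearest) / SPEED_OF_SOUND for d in dists]

def delay_and_sum(signals, mic_positions, target_pos):
    """Shift each channel by its steering delay (rounded to whole
    samples) and average: sound from target_pos adds coherently,
    sound from other directions partially cancels."""
    delays = steering_delays(mic_positions, target_pos)
    shifts = [round(t * SAMPLE_RATE) for t in delays]
    n = len(signals[0])
    out = []
    for i in range(n):
        acc = 0.0
        for sig, s in zip(signals, shifts):
            j = i + s
            if 0 <= j < n:
                acc += sig[j]
        out.append(acc / len(signals))
    return out
```

For a target directly in front of a symmetric two-microphone array, the steering delays are zero and the output is simply the channel average; off-axis sources are averaged out of phase and attenuated.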
In a possible implementation, the visual processing unit is further configured to: acquiring video information, and determining the target object based on content information of the video information, wherein the content information comprises one or more of object information and scene information in the video information; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
In the embodiment of the invention, the video information can be understood as one or more frames of images acquired by a camera in a preset time period; the content information may include, but is not limited to, object information and scene information within the video information, such as in a video capture scene, the object information may include characters, animals, plants, etc., and the scene information may include parks, highways, indoors, etc. The vision processing unit may automatically select one of the plurality of objects as a target object based on content information corresponding to the video information, for example, a person speaking in the plurality of objects may be determined as the target object. Further, the visual processing unit may acquire image data of the target object according to the video information, that is, the image data may include, but is not limited to, coordinate information of the target object in the video information, and then the visual processing unit may accurately determine a three-dimensional spatial position of the target object based on the coordinate information and depth information of the target object, so that the subsequent audio processing unit may pick up sound (for example, enhance sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving pickup performance and quality of the pickup device.
In a possible implementation, the visual processing unit is further configured to: processing the video information based on a preset algorithm to generate the content information; the preset algorithm comprises one or more of a motion detection algorithm, a face detection algorithm and a lip movement detection algorithm.
In the embodiment of the invention, a vision-processing AI engine may be built into the vision processing unit. After the vision processing unit acquires the video information, the AI engine analyzes the video information based on a preset algorithm to generate its content information, where the preset algorithm includes, but is not limited to, motion detection (background modeling), face detection, and lip-movement detection. The vision processing unit may then determine the target object based on the generated content information; and if the visual processing unit sends the content information to the audio processing unit, the audio processing unit can amplify the sound emitted by the target object based on the content information, thereby improving the pickup performance and quality of the pickup device.
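The selection step can be sketched as a simple policy over the detections the AI engine produces. Everything below is illustrative: the field names (`is_face`, `lip_moving`, `area`) are hypothetical placeholders for whatever content information the vision processing unit actually emits.

```python
def select_target(objects):
    """Pick the target object from detected objects: prefer a face
    whose lips are moving (someone speaking); otherwise fall back to
    the largest detected face; return None if no face is present."""
    speaking = [o for o in objects
                if o.get("is_face") and o.get("lip_moving")]
    candidates = speaking or [o for o in objects if o.get("is_face")]
    return max(candidates, key=lambda o: o["area"]) if candidates else None
```

This mirrors the example in the text: among several detected objects, the person who is speaking is chosen as the target.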
In a possible implementation manner, the audio processing unit is further configured to: receiving the content information sent by the visual processing unit through the bus, and determining the sounding frequency range of the target object based on the content information; and enhancing the audio segment of each original audio signal in the sounding frequency range.
In the embodiment of the invention, after the audio processing unit receives the content information sent by the visual processing unit, what kind of the target object is can be identified based on the content information, and then the sounding frequency range of the kind under the general condition can be determined. After the audio processing unit collects the original audio signals of the target object through the microphone, the audio frequency in the sounding frequency range of the target object can be enhanced, so that the effect of highlighting the sound size of the target object is achieved, and the pickup performance and quality of the pickup device are improved.
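The frequency-range enhancement described above can be sketched as a spectral gain applied to the bins inside the target object's assumed sounding range. This is a minimal one-shot illustration, not the claimed implementation: the band limits, the gain, and the use of a single whole-signal FFT are all simplifying assumptions (a real system would process frame by frame with smooth gain transitions).

```python
import numpy as np

def enhance_band(signal, sample_rate, f_lo, f_hi, gain=2.0):
    """Scale the spectral bins of `signal` whose frequencies fall in
    [f_lo, f_hi] Hz by `gain`, then transform back to the time
    domain, boosting the target object's sounding range."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spec[(freqs >= f_lo) & (freqs <= f_hi)] *= gain
    return np.fft.irfft(spec, n=len(signal))
```

For example, if the content information identifies the target as a person, a conventional speech band could be boosted while bins outside it are left untouched.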
In a possible implementation, the visual processing unit is further configured to: acquiring video information, responding to target operation of a target user on the video information, and determining an object corresponding to the target operation as the target object; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
In the embodiment of the invention, the video information can be understood as one or more frames of images acquired by the camera in a preset time period. And the visual processing unit responds to the target operation after detecting the target operation of the target user on the video information, and determines an object corresponding to the target operation as a target object. Further, the visual processing unit may acquire image data of the target object according to the video information, that is, the image data may include, but is not limited to, coordinate information of the target object in the video information, and then the visual processing unit may accurately determine a three-dimensional spatial position of the target object based on the coordinate information and depth information of the target object, so that the subsequent audio processing unit may pick up sound (for example, enhance sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving pickup performance and quality of the pickup device.
In a possible implementation, the visual processing unit is further configured to: the depth data is acquired through a sensor, wherein the sensor is one or more of a monocular camera, a binocular camera and a depth sensor.
In the embodiment of the invention, a depth sensor can be further arranged on the pickup device so as to acquire the depth data of the target object relative to the pickup device through the depth sensor; or a binocular camera is arranged on the pickup device to determine depth data of the target object relative to the pickup device based on parallax information obtained by the binocular camera. Further, the visual processing unit can accurately determine the three-dimensional space position of the target object based on the image data and the depth data of the target object, so that the subsequent audio processing unit can pick up sound (such as enhancing the sound of the target object) based on the three-dimensional space position of the target object, thereby improving the pick-up performance and quality of the pick-up device.
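For the binocular-camera case mentioned above, depth can in principle be recovered from parallax with the standard stereo relation Z = fB/d. The sketch below assumes rectified views and known calibration; the numeric values in the usage are illustrative, not from the patent.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Classic pinhole stereo: a target that shifts by `disparity_px`
    pixels between the two views of a binocular camera with focal
    length `focal_px` (pixels) and baseline `baseline_m` (metres)
    lies Z = f * B / d metres from the camera."""
    if disparity_px <= 0:
        raise ValueError("target must be visible in both views")
    return focal_px * baseline_m / disparity_px
```

For instance, with an assumed 1000 px focal length and a 5 cm baseline, a 25 px disparity corresponds to a target 2 m away; halving the disparity doubles the estimated distance.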
In a second aspect, embodiments of the present application provide an audio enhancement method, which is applied to a sound pickup apparatus, the sound pickup apparatus including a visual processing unit and an audio processing unit, and the visual processing unit and the audio processing unit being connected by a bus, the method including: determining, by the vision processing unit, positional information of a target object with respect to the sound pickup apparatus according to image data of the target object and depth data including distance information between the target object and the sound pickup apparatus; transmitting, by the vision processing unit, the location information to the audio processing unit based on the bus; and determining, by the audio processing unit, a first audio signal of the target object according to the position information of the target object.
In one possible implementation, the sound pickup apparatus further includes N microphones, N being an integer greater than 1, and the method further includes: collecting, by the audio processing unit, N original audio signals through the N microphones; determining a phase difference between the N original audio signals based on the position information; and processing the N original audio signals based on the phase difference to generate the first audio signal.
In one possible implementation, the method further includes: acquiring video information through the vision processing unit, and determining the target object based on content information of the video information, wherein the content information comprises one or more of object information and scene information in the video information; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
In one possible implementation, the method further includes: processing the video information based on a preset algorithm through the vision processing unit to generate the content information; the preset algorithm comprises one or more of a motion detection algorithm, a face detection algorithm and a lip movement detection algorithm.
In one possible implementation, the method further includes: receiving, by the audio processing unit, the content information transmitted by the visual processing unit based on the bus, and determining a sounding frequency range of the target object based on the content information; and enhancing the audio segment of each original audio signal in the sounding frequency range.
In one possible implementation, the method further includes: acquiring video information through the vision processing unit, responding to target operation of a target user on the video information, and determining an object corresponding to the target operation as the target object; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
In one possible implementation, the method further includes: and acquiring the depth data through a sensor through the vision processing unit, wherein the sensor is one or more of a monocular camera, a binocular camera and a depth sensor.
In a third aspect, the present application provides a computer storage medium, wherein the computer storage medium stores a computer program which, when executed by a processor, implements the method according to any one of the second aspects.
In a fourth aspect, the present application provides a chip system comprising a processor for supporting an electronic device to implement the functions involved in the second aspect above, for example, to generate or process information involved in the audio enhancement method above. In one possible design, the chip system further includes a memory to hold the necessary program instructions and data for the electronic device. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
In a fifth aspect, the present application provides a computer program comprising instructions which, when executed by a computer, cause the computer to perform the method of any of the second aspects above.
Drawings
Fig. 1 is a schematic diagram of a pickup system according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a sound pickup apparatus according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a positioning target sound source according to an embodiment of the present invention.
Fig. 4 is an interface schematic diagram of video recording of a smart phone according to an embodiment of the present invention.
Fig. 5 is an interface schematic diagram of another video recording of a smart phone according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an audio signal according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of acquiring an audio signal of a target object according to an embodiment of the present invention.
Fig. 8 is a schematic structural view of another sound pickup apparatus according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of blind source separation according to an embodiment of the present invention.
Fig. 10 is a flowchart of an audio enhancement method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Based on the foregoing, an embodiment of the present invention provides a pickup system. Referring to fig. 1, fig. 1 is a schematic diagram of a sound pickup system according to an embodiment of the present invention, where the sound pickup system 100 may include a sound pickup apparatus 101, a target sound source 102, and an interference source 103. The sound pickup apparatus 101 includes, but is not limited to, various devices such as a smart phone, a recording pen, a learning machine, a conference terminal, a large screen, a car machine, a smart wearable device (e.g., a smart watch), a tablet pc, and the like. The sound pickup system 100 may be used to capture and enhance target sounds (e.g., sounds emitted by the target sound source 102) while attenuating interfering sources 103 from all other directions, including ambient noise and interfering noise from other directions (e.g., interfering human voices). Specifically:
The sound pickup apparatus 101 may include a processor, an audio module, a speaker, a receiver, a microphone, a camera, and the like. The sound pickup apparatus 101 can realize audio functions such as music playing and recording through the audio module, speaker, receiver, microphone, application processor, and the like. The audio module converts digital audio information into an analog audio signal for output, converts analog audio input into a digital audio signal, and may also encode and decode audio signals. In some embodiments, the audio module may be disposed in a system on a chip (e.g., a mobile phone processor chip), or a portion of its functional modules may be disposed there. In some embodiments, the audio module may be a stand-alone audio chip, which is not limited herein. A speaker (also referred to as a "loudspeaker") converts an audio electrical signal into a sound signal; the sound pickup apparatus 101 can play music or conduct a hands-free call through the speaker. A receiver (also referred to as an "earpiece") likewise converts an audio electrical signal into a sound signal; when the sound pickup apparatus 101 (e.g., a smart phone) answers a call or plays voice information, the voice can be heard through the receiver. A microphone (also referred to as a "mic") converts a sound signal into an electrical signal; when making a call or sending voice information, the user can speak close to the microphone to input a sound signal. The sound pickup apparatus 101 may be provided with at least one microphone. In other embodiments, the sound pickup apparatus 101 may be provided with two microphones, which can implement a noise reduction function in addition to collecting sound signals.
In other embodiments, the pickup device 101 may be provided with three, four, or more microphones to collect sound signals, reduce noise, identify a source of sound, implement a directional recording function, and the like. In some embodiments, due to the size limitation of the sound pickup apparatus 101, too many microphones cannot be provided on the sound pickup apparatus 101, so that image data of the target sound source 102 can be acquired by means of the camera in the sound pickup apparatus 101, then the target sound source 102 can be accurately positioned by the audio enhancement method provided by the embodiment of the present invention, and noise generated by the interference source 103 can be attenuated, and the target sound generated by the target sound source 102 can be enhanced, thereby improving the sound pickup performance and quality of the sound pickup apparatus 101.
It will be appreciated that the pickup system 100 of fig. 1 is merely some exemplary implementations provided by embodiments of the present invention, including but not limited to the above implementations.
To facilitate understanding of the embodiments of the present invention, the following describes example scenarios in which the sound pickup apparatus and related audio enhancement method of the present application are applied to a sound pickup system. It should be understood that when the audio enhancement method of the present application is applied in different scenarios, the sound pickup apparatus may correspond to different types of devices; two scenarios are illustrated below.
Scene one, video shooting scene:
With the rapid development of the Internet, more and more users like to record the little moments of their lives by shooting video, uploading it to the network and sharing it with their friends and followers. This has driven continual progress in camera shooting: people are slowly putting down heavy cameras and picking up their mobile phones to record video material anytime, anywhere. Producing a complete, coherent video involves writing a script in the early stage, shooting the video material, and editing in the later stage. In video production it is inevitable that, to present the target object more vividly, the picture of the target object is generally highlighted and its sound emphasized as much as possible; that is, the goal is to capture a highlight dynamic picture (or video) while focusing the sound effect. However, when the target object is in a noisy environment, it is difficult in the prior art to directly pick up and amplify the sound of the target object. With the pickup device and related audio enhancement method provided in the present application, the position of the target object can be accurately determined in a noisy environment, so that interference sources and environmental noise are suppressed and the sound emitted by the target object is amplified, thereby improving the video shooting effect.
Scene two, video conference scene:
With the progress of technology, more and more enterprises are paying attention to efficient work, and video conferencing is developing rapidly. A video conference is one in which people at two or more sites hold a face-to-face conversation over a network via communication devices. A participant can join the conference from any environment through a smart device (e.g., a smart phone or notebook). If a speaker is in a complex and noisy environment, the surrounding noise will degrade the speech, so that other participants cannot hear the content clearly, reducing the user experience. With the pickup device and related audio enhancement method provided in the present application, the speaker's sound source position can be accurately located in a noisy environment, so that interference sources and environmental noise are suppressed (for example, filtering out the voice of a person beside the speaker in the video conference) and the speaker's voice is amplified, giving the other participants an immersive experience.
It can be appreciated that the above two application scenarios are only exemplary implementations of the embodiments of the present invention, and the application scenarios in the embodiments of the present invention include, but are not limited to, the above application scenarios.
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a sound pickup apparatus according to an embodiment of the present invention, and the sound pickup apparatus according to the embodiment of the present invention will be described in detail with reference to fig. 2. As shown in fig. 2, the pickup device 200 may include, but is not limited to, a visual processing unit 201 and an audio processing unit 202, and the visual processing unit 201 and the audio processing unit 202 are connected through a bus; they may be connected through a dedicated bus or through a system bus, so that data transmission can be performed directly between the visual processing unit 201 and the audio processing unit 202. In addition, the visual processing unit 201 and the audio processing unit 202 may be integrated in an audio chip, or may be integrated in a system on chip, which is not limited herein. It should be noted that the sound pickup apparatus 200 provided in the present application may include all or part of the functions of the sound pickup apparatus 101 in fig. 1, and the audio processing unit 202 may include all or part of the functions of the audio module of the sound pickup apparatus 101 in fig. 1, as described in detail below.
The vision processing unit 201 is configured to determine positional information of a target object with respect to the sound pickup apparatus 200 based on image data of the target object and depth data including distance information between the target object and the sound pickup apparatus 200.
Specifically, the target object may be understood as the target sound source in a noisy scene; for example, in a video shooting scene the target object may be the subject to be emphasized, and in a video conference scene the target object may be the conference speaker. The image data of the target object may be understood as the coordinate information of the target object relative to the sound pickup apparatus 200 acquired by the camera; the depth data may be understood as the distance between the target object and the sound pickup apparatus 200; and the positional information may be understood as the three-dimensional position of the target object relative to the sound pickup apparatus 200. That is, the coordinate information of the target object acquired by the camera may be combined with the depth information of the target object to obtain the three-dimensional position of the target object relative to the sound pickup apparatus 200, such as its three-dimensional coordinates.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a target sound source positioning provided in an embodiment of the present invention, where the sound pickup apparatus 200 may be a smart phone, and the visual processing unit 201 and the audio processing unit 202 may be built in a system on a chip of the smart phone or may be built in an independent audio processing chip, which is not limited herein. Assuming that the smart phone is horizontally placed on a desktop to perform shooting, a spatial three-dimensional coordinate system (the spatial three-dimensional coordinate system may be a cartesian coordinate system or a spherical coordinate system, which is not limited herein) may be first established with respect to the smart phone, and then the vision processing unit 201 may acquire coordinate information of the target object with respect to the pickup device 200 through the camera, that is, may determine the position of the target object on the X-axis and the Z-axis. Further, if the depth sensor is integrated on the smart phone, the visual processing unit 201 may obtain the depth information of the target object relative to the pickup device 200 through the depth sensor, so as to determine the position of the target object on the Y axis based on the depth information, and further the visual processing unit 201 may accurately determine the spatial three-dimensional position of the target object, so that the subsequent audio processing unit 202 may pick up sound (e.g. enhance the sound of the target object) based on the spatial three-dimensional position of the target object, thereby improving the pickup performance and quality of the pickup device 200.
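As an illustrative sketch (not from the patent), combining a pixel coordinate from the camera with a depth reading under a simple pinhole-camera model yields the spatial three-dimensional position described above. The focal length and principal point used here are hypothetical values, not parameters of any device in this application:

```python
import numpy as np

def pixel_depth_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with a measured depth into
    camera-frame 3D coordinates using a pinhole model.
    Returns (X, Y, Z) in metres, Z pointing away from the camera."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Hypothetical intrinsics for a 1920x1080 phone camera.
fx = fy = 1000.0        # focal length in pixels
cx, cy = 960.0, 540.0   # principal point

# A target detected at the image centre, 1.5 m away,
# back-projects onto the optical axis.
p = pixel_depth_to_3d(960, 540, 1.5, fx, fy, cx, cy)
# A target 500 px right of centre at 2 m is 1 m off-axis.
q = pixel_depth_to_3d(1460, 540, 2.0, fx, fy, cx, cy)
```

The axis naming differs from fig. 3 (here Z is the depth axis, whereas the figure places depth on the Y axis); only the combination of image coordinates with depth matters.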
In a possible implementation, the vision processing unit 201 is further configured to: acquiring video information, and determining the target object based on content information of the video information, wherein the content information comprises one or more of object information and scene information in the video information; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
Specifically, the video information may be understood as one or more frames of images acquired by the camera in a preset time period; the content information may include, but is not limited to, object information and scene information within the video information, such as in a video capture scene, the object information may include characters, animals, plants, etc., and the scene information may include parks, highways, indoors, etc. The vision processing unit 201 may automatically select one from a plurality of objects as a target object based on content information corresponding to video information, for example, a person speaking in the plurality of objects may be determined as a target object. Further, the visual processing unit 201 may acquire image data of the target object according to the video information, that is, the image data may include, but is not limited to, coordinate information of the target object in the video information, and then the visual processing unit 201 may accurately determine the three-dimensional spatial position of the target object based on the coordinate information and the depth information of the target object, so that the subsequent audio processing unit 202 may pick up sound (e.g., enhance sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving the pick-up performance and quality of the pick-up device 200.
For example, taking the sound pickup apparatus 200 as a smart phone as an example, as shown in fig. 4, fig. 4 is a schematic diagram of an interface for video recording of a smart phone according to an embodiment of the present invention, in the figure, after a user opens the video recording, a visual processing unit 201 in the smart phone obtains video information through a camera in a preset time period, and then performs semantic analysis on the video information to generate content information. Further, the vision processing unit 201 may select one from a plurality of objects as a target object based on the content information, e.g., the vision processing unit 201 may determine a person at the center position of the screen as the target object.
In a possible implementation, the vision processing unit 201 is further configured to: processing the video information based on a preset algorithm to generate the content information; the preset algorithm comprises one or more of a motion detection algorithm, a face detection algorithm and a lip movement detection algorithm.
Specifically, the vision processing unit 201 may have a vision processing AI engine built therein, and after the vision processing unit 201 obtains the video information, the vision processing AI engine may analyze and process the video information based on a preset algorithm to produce content information of the video information, where the preset algorithm includes, but is not limited to, a motion detection (background modeling) technology, a face detection technology, a lip movement detection technology, and the like. Further, the vision processing unit 201 may determine a target object based on the generated content information; if the visual processing unit 201 sends content information to the audio processing unit 202, the audio processing unit 202 may amplify sound emitted from the target object based on the content information, thereby improving the sound pickup performance and quality of the sound pickup apparatus 200.
For example, the content information of the video information may be obtained from objects detectable in the image (specific semantic cues produced by moving parts of the scene, such as a wagon passing), from human behavior (such as speaking or knocking), or from features of the scene itself (focusing intention, scene recognition, an expression of a quiet or loud intention, a specific artistic atmosphere, etc.). The vision processing unit 201 may then determine the target object based on the generated content information; and if the visual processing unit 201 sends the content information to the audio processing unit 202, the audio processing unit 202 may amplify the sound emitted by the target object based on it, thereby improving the sound pickup performance and quality of the sound pickup apparatus 200.
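As a minimal sketch of the motion-detection (background modeling) technique named among the preset algorithms, the following keeps a running-average background and flags pixels that deviate from it; the frame contents and thresholds are illustrative, not the patent's algorithm:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background model: slowly blend the new frame in."""
    return (1 - alpha) * bg + alpha * frame.astype(float)

def motion_mask(bg, frame, thresh=25.0):
    """Pixels whose absolute deviation from the background exceeds
    the threshold are flagged as moving."""
    return np.abs(frame.astype(float) - bg) > thresh

# Static dark background; the new frame contains one bright 2x2 block,
# standing in for a moving object entering the scene.
bg = np.zeros((8, 8))
frame = bg.copy()
frame[2:4, 2:4] = 200.0

mask = motion_mask(bg, frame)     # True only over the moving block
bg = update_background(bg, frame) # background slowly absorbs the change
```

A real system would run this per video frame and feed the connected regions of `mask` to the downstream target-object selection.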
In a possible implementation, the vision processing unit 201 is further configured to: acquiring video information, responding to target operation of a target user on the video information, and determining an object corresponding to the target operation as the target object; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
Specifically, the video information may be understood as one or more frames of images acquired by the camera during a preset period of time. The vision processing unit 201, after detecting the target operation of the target user on the video information, responds to the target operation and determines the object corresponding to the target operation as the target object. Further, the visual processing unit 201 may acquire image data of the target object according to the video information, that is, the image data may include, but is not limited to, coordinate information of the target object in the video information, and then the visual processing unit 201 may accurately determine the three-dimensional spatial position of the target object based on the coordinate information and the depth information of the target object, so that the subsequent audio processing unit 202 may pick up sound (e.g., enhance sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving the pick-up performance and quality of the pick-up device 200.
For example, taking the sound pickup apparatus 200 as a smart phone as an example, as shown in fig. 5, fig. 5 is a schematic diagram of an interface for video recording of another smart phone according to an embodiment of the present invention, in which after a user opens the video recording, the vision processing unit 201 in the smart phone obtains video information through a camera in a preset period of time, when the smart phone detects a target operation of the user on the video information, for example, detects a touch operation of the user on a current display screen, an object corresponding to the target operation is determined as a target object, for example, a person at a central position of a click screen of the user, and then the vision processing unit 201 determines the person as the target object.
In a possible implementation, the vision processing unit 201 is further configured to: the depth data is acquired through a sensor, wherein the sensor is one or more of a monocular camera, a binocular camera and a depth sensor.
Specifically, a depth sensor, such as a Time of flight (TOF) depth sensor, may also be provided on the sound pickup apparatus 200 to acquire depth data of the target object with respect to the sound pickup apparatus 200; or a binocular camera is provided on the sound pickup apparatus 200 to determine depth data of the target object with respect to the sound pickup apparatus 200 based on parallax information obtained by the binocular camera. Further, the visual processing unit 201 can accurately determine the three-dimensional spatial position of the target object based on the image data and the depth data of the target object, so that the subsequent audio processing unit 202 can pick up sound (e.g. enhance the sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving the pick-up performance and quality of the pick-up device 200.
Alternatively, the vision processing unit 201 may use a phase pixel pattern (PP), analogous to the left/right disparity of human binocular vision: after the phase pixels are extracted, depth is obtained by cross-correlation, and the distance information of the target object is then generated through a correction matrix.
Alternatively, the vision processing unit 201 may also generate the distance information of the target object using the time series of the monocular cameras and the movement of the object.
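For the binocular-camera case, depth follows from disparity via the standard rectified-stereo relation Z = f·B/d. A sketch with hypothetical focal length and baseline values (not taken from this application):

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Rectified-stereo depth: Z = f * B / d, where f is the focal
    length in pixels, B the camera baseline in metres, and d the
    disparity in pixels between the left and right views."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * baseline_m / disparity_px

# Hypothetical numbers: 700 px focal length, 12 mm baseline,
# 8 px disparity.
z = depth_from_disparity(700.0, 0.012, 8.0)
```

Note the inverse relationship: nearer objects produce larger disparities, so depth resolution degrades with distance.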
The visual processing unit 201 is configured to send the location information to the audio processing unit 202 via a bus between the visual processing unit 201 and the audio processing unit 202.
Specifically, the vision processing unit 201 may send the position information (i.e., the three-dimensional spatial information) of the target object to the audio processing unit 202 through the bus between them, so that the audio processing unit 202 can subsequently pick up sound based on the three-dimensional spatial position of the target object, for example enhancing the sound of the target object. Because the position information is determined by the vision processing unit 201 from the image data and depth information of the target object, its accuracy is higher than that of a position determined directly from audio information, thereby improving the sound pickup performance and quality of the sound pickup apparatus 200.
Optionally, the visual processing unit 201 and the audio processing unit 202 are connected by a dedicated bus. After the vision processing unit 201 obtains the three-dimensional spatial position of the target object, the three-dimensional spatial position may be sent to the audio processing unit 202 through a dedicated bus between the vision processing unit 201 and the audio processing unit 202, so that the subsequent audio processing unit 202 may pick up sound (e.g., enhance sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving the pick-up performance and quality of the pick-up device 200.
Optionally, the visual processing unit 201 and the audio processing unit 202 are connected through a system bus, such as an I2C bus, an I2S bus, or a high definition audio (High Definition Audio, HDA) bus. The I2C interface is a bidirectional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). The vision processing unit 201 may be coupled to the audio processing unit 202, the camera module, etc., through respective I2C bus interfaces. For example, the vision processing unit 201 may be coupled to the audio processing unit 202 through an I2C interface, so that the two units communicate over the I2C bus. After the vision processing unit 201 obtains the three-dimensional spatial position of the target object, it may send that position to the audio processing unit 202 through the I2C bus or the HDA bus, so that the audio processing unit 202 can subsequently pick up sound and enhance the volume of the target object based on the three-dimensional spatial position, thereby improving the pick-up performance and quality of the pickup device 200.
The audio processing unit 202 is configured to receive, via the bus, the location information sent by the visual processing unit 201.
Specifically, after the vision processing unit 201 precisely determines the three-dimensional spatial position (i.e., position information) of the target object based on the coordinate information and the depth information of the target object, the audio processing unit 202 may receive the position information transmitted by the vision processing unit 201 through the bus. Further, the audio processing unit 202 may pick up sound (e.g., enhance the sound of the target object) based on the three-dimensional spatial position of the target object, thereby improving the pick-up performance and quality of the pick-up device 200.
The audio processing unit 202 is configured to determine a first audio signal of the target object based on the position information of the target object.
Specifically, the positional information may be understood as three-dimensional positional information of the target object with respect to the sound pickup apparatus 200, for example, a spatial three-dimensional coordinate system is established with a microphone in the sound pickup apparatus 200 as a coordinate origin, and the positional information may be spatial three-dimensional coordinates. The first audio signal may be understood as an audio signal acquired by the audio processing unit 202 based on the position information. When the audio processing unit 202 receives the position information (i.e., the three-dimensional spatial position of the target object) sent by the visual processing unit 201 through the bus, the audio signal of the target object may be acquired based on the position information, so as to achieve the effect of enhancing the sound of the target object.
In some prior art, the pickup device needs to acquire video data of the target object together with the corresponding audio signal, after which a processor in the pickup device analyzes the video data to determine the position information of the target object. The processor then renders the audio signal based on the position information, i.e., increases the volume in the direction of that position while attenuating the volume in other directions. However, in such prior art the video data and audio signals must be acquired in advance and processed uniformly by one processor; if the amount of video and audio data is large, the processor's audio rendering is slow, and in scenes with high real-time requirements the processor cannot render audio quickly enough, which degrades the pickup performance of the pickup device and results in a poor user experience.
In summary, in the present application, the visual processing unit 201 performs three-dimensional spatial positioning on the target object, and then the visual processing unit 201 sends the three-dimensional spatial position of the target object to the audio processing unit 202 in time, so that the audio processing unit 202 can pick up sound based on the three-dimensional spatial position of the target object, for example, enhance the sound of the target object (i.e. perform audio rendering), thereby improving the pick-up performance and quality of the pick-up device 200.
In a possible implementation, the sound pickup apparatus 200 further includes N microphones, where N is an integer greater than 1, and the audio processing unit 202 is further configured to: collecting N original audio signals through the N microphones; determining a phase difference between the N original audio signals based on the position information; processing the N original audio signals based on the phase difference generates the first audio signal.
Specifically, the pickup device 200 may include a plurality of microphones, each of which can collect an audio signal of the target object. Because there are differences (such as phase delays) between the audio signals obtained when different microphones collect a sound source at a given position and frequency, the phase difference between the audio signals collected by the different microphones must be determined based on the spatial position of the target object. That is, after the audio processing unit 202 obtains an original audio signal from each microphone, it can determine the phase difference between the original audio signals based on the position information (i.e., the three-dimensional spatial position) of the target object, and can then process the original audio signals based on that phase difference to obtain the audio signal of the target object. As shown in fig. 6, fig. 6 is a schematic diagram of an audio signal provided in an embodiment of the present invention. Assume two microphones are disposed on the sound pickup apparatus 200, from which the audio processing unit 202 obtains original audio signal 1 and original audio signal 2, respectively. After receiving the position information of the target object sent by the visual processing unit 201 through the bus, the audio processing unit 202 may calculate, based on that position, the delay with which the audio signal reaches each of the two microphones, and from these the phase difference between original audio signal 1 and original audio signal 2. The audio processing unit 202 may then process the original audio signals, for example by aligning their phases, thereby amplifying the sound of the target object.
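The delay and phase-difference computation described above can be sketched from the geometry alone. The speed of sound, microphone layout, and tone frequency below are assumed example values:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value at roughly 20 degrees C

def tdoa(src, mic_a, mic_b):
    """Time-difference-of-arrival of a point source between two mics:
    positive when the wavefront reaches mic_a first."""
    src, mic_a, mic_b = map(np.asarray, (src, mic_a, mic_b))
    return (np.linalg.norm(src - mic_b)
            - np.linalg.norm(src - mic_a)) / SPEED_OF_SOUND

def phase_difference(delta_t, freq_hz):
    """Phase difference (radians) of a pure tone for a given delay."""
    return 2 * np.pi * freq_hz * delta_t

# Source 1 m in front of mic A; mic B sits 10 cm further along x,
# so the sound reaches mic A slightly earlier.
src = [0.0, 1.0, 0.0]
mic_a = [0.0, 0.0, 0.0]
mic_b = [0.1, 0.0, 0.0]
dt = tdoa(src, mic_a, mic_b)          # small positive delay
phi = phase_difference(dt, 1000.0)    # phase lag of a 1 kHz tone
```

Given the visually determined 3D position, these delays can be computed directly instead of being estimated from the audio itself, which is the core advantage the passage describes.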
Because the error between the three-dimensional spatial position determined by the visual processing unit from the image and depth information of the target object and the target object's actual position is small in this application, the prior-art problem of poor audio rendering caused by the large error between a spatial position determined from the audio signal alone and the target object's actual position is avoided, thereby improving the pickup performance and quality of the pickup device.
For example, as shown in fig. 7, fig. 7 is a schematic diagram of acquiring an audio signal of a target object according to an embodiment of the present invention, where the pickup device 200 may be a smart phone, and the vision processing unit 201 and the audio processing unit 202 may be built in a system on a chip of the smart phone or may be built in a separate audio processing chip, which is not limited herein. Assuming that the smart phone is horizontally placed on a desktop to perform shooting, a spatial three-dimensional coordinate system (the spatial three-dimensional coordinate system may be a cartesian coordinate system or a spherical coordinate system, which is not limited herein) may be first established with respect to the smart phone, and then the vision processing unit 201 may acquire coordinate information of the target object with respect to the pickup device 200 through the camera, that is, may determine the position of the target object on the X-axis and the Z-axis. Further, if a depth sensor is integrated on the smart phone, the vision processing unit 201 may acquire depth information of the target object relative to the pickup device 200 through the depth sensor, so as to determine a position of the target object on the Y-axis based on the depth information, and further the vision processing unit 201 may accurately determine a spatial three-dimensional position of the target object. Next, the vision processing unit 201 transmits the position information (i.e., the spatial three-dimensional position) of the target object to the audio processing unit 202 through the bus. 
Meanwhile, the audio processing unit 202 may acquire the original audio signals of the multiple paths of target objects through different microphones, determine phase differences between the different original audio signals based on the received position information of the target objects, and then process the multiple paths of original audio signals based on the phase differences to obtain the audio signals of the target objects, so that the pickup device 200 in the application may acquire the audio signals of the target objects timely and more specifically, and the pickup performance of the pickup device 200 is improved.
As another example, as shown in fig. 8, fig. 8 is a schematic structural diagram of another sound pickup apparatus according to an embodiment of the present invention, where the sound pickup apparatus 200 may include a visual processing unit 201, an audio processing unit 202, a camera 203, a microphone 204, and a depth sensor 205. The vision processing unit 201 may first acquire image data of a target object through the camera 203, acquire depth information of the target object through the depth sensor 205, and determine position information of the target object based on the image data and the depth information of the target object. Next, the vision processing unit 201 transmits the position information of the target object to the audio processing unit 202. Meanwhile, the audio processing unit 202 may acquire the original audio signals of the multiple paths of target objects through different microphones 204, determine phase differences between the different original audio signals based on the received position information of the target objects, and then process the multiple paths of original audio signals based on the phase differences to obtain the audio signals of the target objects, so that the pickup device 200 in the application may acquire the audio signals of the target objects timely and more specifically, and the pickup performance of the pickup device 200 is improved.
In a possible implementation, the visual processing unit 201 is further configured to send the content information to the audio processing unit 202.
Specifically, after the visual processing unit 201 generates the content information, the content information may be further sent to the audio processing unit 202, so that the audio processing unit 202 may perform audio rendering based on the content information after acquiring the audio signal of the target object, thereby improving the pickup performance and quality of the pickup device 200.
In a possible implementation, the audio processing unit 202 is further configured to: receiving the content information transmitted by the visual processing unit 201 through the bus, and determining a sound emission frequency range of the target object based on the content information; and enhancing the audio segment of each original audio signal in the sounding frequency range.
Specifically, when the audio processing unit 202 receives the content information sent by the visual processing unit 201 through the bus, it can identify the category of the target object based on the content information, and can thus determine the typical sound-emission frequency range of that category. After the audio processing unit collects the original audio signals of the target object through the microphones, the audio segments within the sound-emission frequency range of the target object can be enhanced, highlighting the target object's sound and improving the pickup performance and quality of the pickup device. It should be noted that the audio may be enhanced across all frequencies within the range, or only at specific frequencies; the present invention is not limited in this regard.
For example, the audio processing unit 202 determines from the content information that the target object is an adult, whose vocal fundamental frequencies may lie in the range of 80-800 Hz, so audio rendering can be performed in a targeted manner. After the audio processing unit 202 collects the original audio signal of the target object through the microphone, the audio within specific bands of the 80-800 Hz range can be enhanced, and some audio in other bands can also be enhanced relative to it, thereby highlighting the sound of the target object and improving the pickup performance and quality of the pickup device 200.
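One simple way to realize such a band enhancement is to scale FFT bins inside the target range and resynthesise. This is a crude illustrative stand-in (a real implementation would use overlapping frames and smoother gains), with the 80-800 Hz range taken from the example above:

```python
import numpy as np

def enhance_band(signal, fs, lo_hz=80.0, hi_hz=800.0, gain=2.0):
    """Boost the magnitude of FFT bins inside [lo_hz, hi_hz]
    and transform back to the time domain."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    spec[band] *= gain
    return np.fft.irfft(spec, n=len(signal))

fs = 8000
t = np.arange(fs) / fs
# A 200 Hz "voice" tone inside the band plus a 2 kHz tone outside it.
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 2000 * t)
y = enhance_band(x, fs)  # 200 Hz component doubled, 2 kHz untouched
```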
Optionally, after the audio processing unit 202 collects the original audio signal of the target object through the microphone, the volume of the audio signal corresponding to the direction of the target object in the original audio signal can be enhanced, and the volume of the audio signal corresponding to the other directions can be attenuated, so that the effect of highlighting the sound of the target object is achieved, and the pickup performance and quality of the pickup device 200 are improved.
Optionally, the signal-to-interference ratio and signal-to-noise ratio of the original audio signal are improved by a blind source separation technique. For example, as shown in fig. 9, fig. 9 is a schematic diagram of blind source separation provided in the embodiment of the present invention, in the drawing, audio signals No. 1 and No. 2 may be extracted by a blind source separation technology, and if No. 1 is a target object, the sound of the extracted audio signal No. 1 may be enhanced, and the sound size of the extracted audio signal No. 2 may be reduced, so as to improve the signal-to-interference ratio and signal-to-noise ratio of the audio signal.
Blind source separation analyzes unobserved original signals from a plurality of observed mixed signals. The observed mixed signals typically come from the outputs of multiple sensors, and the sensor output signals are independent (linearly uncorrelated). The primary goal of blind source separation is to recover each original single-source signal from the mixture. A familiar illustration is the cocktail party effect: a listener can focus on one person speaking even when many people are talking at once in the same space.
Optionally, the signal-to-interference ratio and signal-to-noise ratio of the original audio signal are improved by a beamforming technique. For example, the data such as the sound source 1, the sound source 2, and the noise can be extracted by the beam forming technology, and if the sound source 1 is a target object, the sound of the extracted audio signal of the sound source 1 can be enhanced, and the sound sizes of other extracted audio signals (i.e. all audio signals except the audio signal of the sound source 1) can be weakened, so that the signal-to-interference ratio and the signal-to-noise ratio of the audio signals can be improved.
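The simplest beamforming variant is delay-and-sum: each channel is advanced by its known propagation delay so the target's contributions add coherently while off-axis sounds partially cancel. A sketch with an assumed sample rate and integer-sample delays for simplicity (real beamformers use fractional delays):

```python
import numpy as np

def delay_and_sum(signals, delays_s, fs):
    """Delay-and-sum beamformer: advance each channel by its known
    propagation delay (rounded to whole samples here) and average."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays_s):
        shift = int(round(d * fs))
        out += np.roll(sig, -shift)
    return out / len(signals)

fs = 16000
n = 1600
t = np.arange(n) / fs
src = np.sin(2 * np.pi * 400 * t)  # target tone

# Two mics: channel 1 hears the source 5 samples later than channel 0.
delay_samples = 5
signals = np.vstack([src, np.roll(src, delay_samples)])

# With the delays known from the visually determined position,
# the channels align and sum coherently back to the source.
aligned = delay_and_sum(signals, [0.0, delay_samples / fs], fs)
```

Interferers arriving from other directions have mismatched delays and do not add coherently, which is what raises the signal-to-interference ratio.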
Optionally, the signal-to-noise ratio of the original audio signal is improved by a noise estimation technique, which estimates the noise and then completes audio rendering and noise reduction according to the estimate. Noise estimation is mainly based on certain characteristics or phenomena of noisy speech. Common noise estimation techniques include, but are not limited to, quantile noise estimation, histogram noise estimation, and the like.
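A minimal sketch of the quantile noise estimation mentioned above, paired with spectral subtraction; the frame data and quantile are illustrative. The idea is that even during speech, each frequency band is quiet some fraction of the time, so a low quantile of the per-band magnitudes over time approximates the noise floor:

```python
import numpy as np

def quantile_noise_estimate(frames_mag, q=0.1):
    """Per-frequency noise floor: the q-th quantile of the frame
    magnitudes over time."""
    return np.quantile(frames_mag, q, axis=0)

def spectral_subtract(frames_mag, noise_mag, floor=0.01):
    """Subtract the estimated noise magnitude, clamping to a small
    spectral floor to avoid negative magnitudes."""
    return np.maximum(frames_mag - noise_mag, floor * frames_mag)

# 100 frames x 64 bins: constant noise level 1.0 with 20 louder
# "speech" frames on top.
frames = np.ones((100, 64))
frames[40:60, :] += 5.0

noise = quantile_noise_estimate(frames)   # recovers the 1.0 floor
clean = spectral_subtract(frames, noise)  # speech kept, noise removed
```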
In the embodiment of the invention, the error between the three-dimensional spatial position determined by the visual processing unit from the image and depth information of the target object and the target object's actual position is small, and the visual processing unit can send this three-dimensional position to the audio processing unit in time over the bus between the two units, so that the audio processing unit can pick up sound based on it. This avoids the prior-art problem in which video data and audio signals of the target object must be acquired in advance and processed uniformly by one processor, making audio rendering slow when the amount of data is large, and thereby improves the pick-up performance and sound quality of the pickup device.
The foregoing details the pickup apparatus according to the embodiment of the present invention, and the related audio enhancement method according to the embodiment of the present invention is provided below.
Referring to fig. 10, fig. 10 is a flowchart of an audio enhancement method according to an embodiment of the present invention, which is applicable to a sound pickup apparatus and a device including the sound pickup apparatus described in fig. 2. The method may include the following step S301-step S303. The pickup device comprises a visual processing unit and an audio processing unit, and the visual processing unit is connected with the audio processing unit through a bus. The detailed description is as follows:
Step S301: determining, by the vision processing unit, position information of the target object relative to the sound pickup apparatus according to image data and depth data of the target object.
Specifically, the depth data includes distance information between the target object and the sound pickup apparatus.
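How a pixel position plus a depth measurement yields a position relative to the device can be sketched with the standard pinhole camera model. The function below is an illustration rather than the patent's method, and the intrinsics `fx, fy, cx, cy` are assumed calibration values:

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    # Pinhole back-projection: pixel (u, v) plus measured depth gives a
    # 3-D point in the camera frame. fx, fy (focal lengths in pixels)
    # and cx, cy (principal point) come from camera calibration.
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return np.array([x, y, depth_m])
```

A point imaged at the principal point maps straight onto the optical axis at the measured depth; points off-center are scaled by depth over focal length.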
Step S302: sending, by the vision processing unit, the position information to the audio processing unit through the bus.
Step S303: determining, by the audio processing unit, a first audio signal of the target object based on the position information of the target object.
In one possible implementation, the sound pickup apparatus further includes N microphones, N being an integer greater than 1, and the method further includes: collecting, by the audio processing unit, N original audio signals based on the N microphones; determining a phase difference between the N original audio signals based on the position information; and processing the N original audio signals based on the phase difference to generate the first audio signal.
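One conventional way to exploit such position-derived phase differences is delay-and-sum beamforming. The sketch below is an illustrative assumption, not the patent's specific algorithm: it compensates each microphone's propagation delay toward a known source position as a frequency-domain phase shift, then averages the channels:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_pos, fs, c=343.0):
    # Steer an N-microphone array toward a known 3-D source position.
    # signals: (N, n_samples); mic_positions: (N, 3) in meters;
    # source_pos: (3,); fs: sample rate; c: speed of sound (m/s).
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c          # relative delays, seconds
    spectra = np.fft.rfft(signals, axis=1)
    # Advance each channel by its delay as a phase shift, so the target
    # signal adds coherently while off-axis noise adds incoherently.
    steered = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
    return np.fft.irfft(steered.mean(axis=0), n=n)
```

When the source is equidistant from all microphones, the delays vanish and the output is simply the channel average.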
In one possible implementation, the method further includes: acquiring video information through the vision processing unit, and determining the target object based on content information of the video information, wherein the content information comprises one or more of object information and scene information in the video information; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
In one possible implementation, the method further includes: processing the video information based on a preset algorithm through the vision processing unit to generate the content information; the preset algorithm comprises one or more of a motion detection algorithm, a face detection algorithm and a lip movement detection algorithm.
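Of the preset algorithms named above, motion detection is the simplest to sketch. The frame-differencing function below is an illustrative stand-in, not the patent's implementation, and the `threshold` value is an assumption:

```python
import numpy as np

def detect_motion(prev_frame, curr_frame, threshold=25):
    # Minimal frame-differencing motion detector. Frames are grayscale
    # uint8 arrays of equal shape. Returns a boolean mask of pixels
    # whose intensity changed by more than `threshold`, plus a flag
    # indicating whether any motion was found.
    diff = np.abs(prev_frame.astype(np.int16) - curr_frame.astype(np.int16))
    mask = diff > threshold
    return mask, bool(mask.any())
```

Face and lip-movement detection would typically build on a trained detector rather than raw differencing, but feed the same kind of content information to the audio processing unit.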
In one possible implementation, the method further includes: receiving, by the audio processing unit, the content information sent by the visual processing unit over the bus, and determining a sounding frequency range of the target object based on the content information; and enhancing the segment of each original audio signal that falls within the sounding frequency range.
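Enhancing only the portion of a signal inside a sounding frequency range can be sketched as a frequency-domain gain. The function below is illustrative (a real system would use a proper filter bank), and the gain value is an assumption:

```python
import numpy as np

def enhance_band(signal, fs, f_lo, f_hi, gain=2.0):
    # Boost the frequency components of `signal` that fall inside the
    # target object's sounding range [f_lo, f_hi] in Hz by `gain`,
    # leaving all other frequencies untouched.
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[band] *= gain
    return np.fft.irfft(spectrum, n=n)
```

For a human speaker the range might be set to roughly 300–3400 Hz; for other target objects the visual content information would select a different band.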
In one possible implementation, the method further includes: acquiring video information through the vision processing unit, responding to target operation of a target user on the video information, and determining an object corresponding to the target operation as the target object; and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
In one possible implementation, the method further includes: and acquiring the depth data through a sensor through the vision processing unit, wherein the sensor is one or more of a monocular camera, a binocular camera and a depth sensor.
In the embodiment of the invention, the three-dimensional spatial position that the visual processing unit determines from the image information and depth information of the target object deviates only slightly from the target object's actual position, and the visual processing unit can send this position to the audio processing unit in a timely manner over the bus between the two units, so that the audio processing unit can pick up sound based on the three-dimensional spatial position of the target object. This avoids the prior-art problem that video data and audio signals of the target object must first be acquired and then processed together by a single processor, whose audio rendering slows down when the amount of video and audio data is large, thereby improving the pickup performance and sound quality of the pickup device.
The present application provides a computer storage medium storing a computer program which, when executed by a processor, implements any one of the above audio enhancement methods.
An embodiment of the present application provides an electronic device, including a processor configured to support the electronic device in implementing the corresponding functions in any of the above audio enhancement methods. The electronic device may further include a memory coupled to the processor, which stores the program instructions and data necessary for the electronic device. The electronic device may further include a communication interface through which the electronic device communicates with other devices or communication networks.
The present application provides a chip system, including a processor configured to support an electronic device in implementing the functions involved above, for example, generating or processing the information involved in the audio enhancement method described above. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the electronic device. The chip system may consist of a chip, or may include a chip and other discrete devices.
The present application provides a computer program comprising instructions which, when executed by a computer, cause the computer to perform the audio enhancement method described above.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as series of combined actions, but those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed between components may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or take other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as standalone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, and in particular may be a processor in the computer device) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
The above embodiments are merely intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features by equivalents, and that such modifications and replacements do not depart from the spirit and scope of the corresponding technical solutions.

Claims (15)

1. A sound pickup apparatus, characterized in that the sound pickup apparatus includes a visual processing unit and an audio processing unit, and the visual processing unit is connected with the audio processing unit through a bus, wherein the visual processing unit is configured to:
determining position information of a target object relative to the sound pickup apparatus according to image data of the target object and depth data including distance information between the target object and the sound pickup apparatus;
transmitting the location information to the audio processing unit through the bus;
the audio processing unit is used for:
a first audio signal of the target object is determined based on the location information of the target object.
2. The apparatus of claim 1, wherein the sound pickup apparatus further comprises N microphones, N being an integer greater than 1, the audio processing unit further configured to:
collecting N original audio signals through the N microphones;
determining a phase difference between the N original audio signals based on the position information;
and processing the N original audio signals based on the phase difference to generate the first audio signal.
3. The apparatus of claim 1 or 2, wherein the vision processing unit is further configured to:
acquiring video information, and determining the target object based on content information of the video information, wherein the content information comprises one or more of object information and scene information in the video information;
and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
4. The apparatus of claim 3, wherein the vision processing unit is further configured to:
processing the video information based on a preset algorithm to generate the content information; the preset algorithm comprises one or more of a motion detection algorithm, a face detection algorithm and a lip movement detection algorithm.
5. The apparatus of claim 3 or 4, wherein the audio processing unit is further configured to:
receiving the content information sent by the visual processing unit through the bus, and determining the sounding frequency range of the target object based on the content information;
and enhancing the audio segment of each original audio signal in the sounding frequency range.
6. The apparatus of claim 1 or 2, wherein the vision processing unit is further configured to:
acquiring video information, responding to target operation of a target user on the video information, and determining an object corresponding to the target operation as the target object;
and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
7. The apparatus of any of claims 1-6, wherein the vision processing unit is further configured to:
the depth data is acquired through a sensor, wherein the sensor is one or more of a monocular camera, a binocular camera and a depth sensor.
8. An audio enhancement method, characterized by being applied to a sound pickup apparatus including a visual processing unit and an audio processing unit, and the visual processing unit and the audio processing unit being connected by a bus, the method comprising:
determining, by the vision processing unit, positional information of a target object with respect to the sound pickup apparatus according to image data of the target object and depth data including distance information between the target object and the sound pickup apparatus;
transmitting, by the vision processing unit, the location information to the audio processing unit based on the bus;
and determining, by the audio processing unit, a first audio signal of the target object according to the position information of the target object.
9. The method of claim 8, wherein the pickup device further comprises N microphones, N being an integer greater than 1, the method further comprising:
collecting N original audio signals based on the N microphones through the audio processing unit;
determining a phase difference between the N original audio signals based on the position information;
and processing the N original audio signals based on the phase difference to generate the first audio signal.
10. The method of claim 8 or 9, wherein the method further comprises:
acquiring video information through the vision processing unit, and determining the target object based on content information of the video information, wherein the content information comprises one or more of object information and scene information in the video information;
and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
11. The method of claim 10, wherein the method further comprises:
processing the video information based on a preset algorithm through the vision processing unit to generate the content information; the preset algorithm comprises one or more of a motion detection algorithm, a face detection algorithm and a lip movement detection algorithm.
12. The method of claim 10 or 11, wherein the method further comprises:
receiving, by the audio processing unit, the content information transmitted by the visual processing unit based on the bus, and determining a sounding frequency range of the target object based on the content information;
and enhancing the audio segment of each original audio signal in the sounding frequency range.
13. The method of claim 8 or 9, wherein the method further comprises:
acquiring video information through the vision processing unit, responding to target operation of a target user on the video information, and determining an object corresponding to the target operation as the target object;
and acquiring the image data of the target object according to the video information, wherein the image data comprises coordinate information of the target object in the video information.
14. The method of any one of claims 8-13, wherein the method further comprises:
and acquiring the depth data based on a sensor through the vision processing unit, wherein the sensor is one or more of a monocular camera, a binocular camera and a depth sensor.
15. A computer storage medium, characterized in that the computer storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 8-14.
CN202210980664.3A 2022-08-16 2022-08-16 Pickup device and related audio enhancement method Pending CN117636928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210980664.3A CN117636928A (en) 2022-08-16 2022-08-16 Pickup device and related audio enhancement method

Publications (1)

Publication Number Publication Date
CN117636928A true CN117636928A (en) 2024-03-01

Family

ID=90020415



Legal Events

Date Code Title Description
PB01 Publication