CN112423191B - Video call device and audio gain method - Google Patents


Info

Publication number
CN112423191B
CN112423191B (application CN202011300121.XA)
Authority
CN
China
Prior art keywords
target
image
area
position information
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011300121.XA
Other languages
Chinese (zh)
Other versions
CN112423191A (en)
Inventor
董圣伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Hisense Commercial Display Co Ltd
Original Assignee
Qingdao Hisense Commercial Display Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Hisense Commercial Display Co Ltd filed Critical Qingdao Hisense Commercial Display Co Ltd
Priority to CN202011300121.XA priority Critical patent/CN112423191B/en
Publication of CN112423191A publication Critical patent/CN112423191A/en
Application granted granted Critical
Publication of CN112423191B publication Critical patent/CN112423191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/20: Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32: Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40: Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/406: Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers; microphones
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00: Circuits for transducers, loudspeakers or microphones

Abstract

The embodiment of the application provides a video call device and an audio gain method, relating to the field of electronics, that can perform better gain control on audio collected in a teleconference. The device includes a microphone array, an audio processor and a camera assembly. The microphone array consists of a plurality of microphones and acquires sound data in the pickup area corresponding to the video call device. The audio processor determines, from the sound data acquired by the microphone array, the target sub-area where the speaker producing the sound is located; the target sub-area is one of a plurality of sub-areas into which the pickup area is divided. The camera assembly acquires a target image of the target sub-area determined by the audio processor and determines the speaker's target position information from the target image. The audio processor is further configured to perform gain control on the sound data acquired by the microphone array according to the target position information determined by the camera assembly.

Description

Video call device and audio gain method
Technical Field
The invention relates to the technical field of electronics, in particular to video call equipment and an audio gain method.
Background
In a teleconference there may be many people in a conference room, each at a different distance from the pickup microphone (MIC), so the pickup quality varies from person to person. If no processing is performed during pickup, the captured audio may be neither accurate nor clear, and the sound played at the far end may be hard to hear. The collected audio (sound data) therefore needs a certain gain during pickup to ensure that it is clear enough when played at the far end. Conventional audio gain methods require many microphones (at least six) to pick up sound, and then compute the speaker's angle and distance so that the acquired audio can be gain-controlled accordingly, improving pickup accuracy and clarity. Because such methods continuously run gain calculations on the collected audio, the computation is complex; and because the speaker's position is computed from the collected audio alone, the result becomes inaccurate as soon as the audio contains much noise. As a result, existing audio gain methods introduce a large delay between capturing the audio and playing it, and cannot provide good gain control over audio collected in a teleconference.
Disclosure of Invention
Embodiments of the present invention provide a video call device and an audio gain method, which can better perform gain control on audio collected in a teleconference.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, a video call device is provided, which includes a microphone array, an audio processor and a camera assembly. The microphone array consists of a plurality of microphones and acquires sound data in the pickup area corresponding to the video call device. The audio processor determines, from the sound data acquired by the microphone array, the target sub-area where the speaker producing the sound is located; the target sub-area is one of a plurality of sub-areas into which the pickup area is divided. The camera assembly acquires a target image of the target sub-area determined by the audio processor and determines the speaker's target position information from the target image. The audio processor is further configured to perform gain control on the sound data acquired by the microphone array according to the target position information determined by the camera assembly.
In the technical solution provided in the above embodiment, the audio processor first uses the sound data acquired by the microphone array to estimate the approximate position of the speaker and determine the target sub-area of the pickup area in which the speaker is located. The camera assembly then determines the speaker's specific target position information from the image information of that target sub-area, and finally the audio processor performs gain control on the sound data acquired by the microphone array according to that target position information. Because sound data and image data are combined to determine the speaker's position, the scheme is more accurate than prior-art schemes that rely on sound data alone, and it avoids the influence of noise on the position estimate; the finally determined target position information is therefore more accurate, and the resulting gain control over the sound data is better.
In a second aspect, an audio gain method is provided, including: acquiring sound data of a pickup area; determining a target subregion where a speaker corresponding to the sound data is located according to the sound data; the target sub-area is one of a plurality of sub-areas included in the sound pickup area; acquiring a target image of a target subregion, and determining target position information of a speaker according to the target image; and performing gain control on the sound data according to the target position information.
In a third aspect, a video call device is provided, comprising a memory, a processor, a bus and a communication interface; the memory is used for storing computer execution instructions, and the processor is connected with the memory through a bus; the processor executes computer-executable instructions stored by the memory to cause the video call device to perform the audio gain method as provided by the second aspect when the video call device is operating.
In a fourth aspect, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed on a video call device, cause the video call device to perform the audio gain method as provided in the second aspect.
In a fifth aspect, a computer program product is provided, comprising instructions which, when executed by a processor of a video call device, cause the video call device to perform the audio gain method as provided in the second aspect.
It can be understood that the solutions provided in the second to fifth aspects contain the same technical features as the solution provided in the first aspect and achieve the same technical effects; for those effects, reference may be made to the relevant description of the first aspect, which is not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the prior-art solutions, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a scene schematic diagram of a teleconference according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a video call device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of another video call device according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating a method for calculating depth information in an image processing method according to an embodiment of the present disclosure;
fig. 5 is a schematic view illustrating a sub-area division of a sound pickup area according to an embodiment of the present disclosure;
fig. 6 is a first flowchart illustrating an audio gain method according to an embodiment of the present application;
fig. 7 is a second flowchart illustrating an audio gain method according to an embodiment of the present application;
fig. 8 is a third flowchart illustrating an audio gain method according to an embodiment of the present application;
fig. 9 is a fourth schematic flowchart of an audio gain method according to an embodiment of the present application;
fig. 10 is a fifth flowchart illustrating an audio gain method according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of another video call device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
It should be noted that in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion.
It should be noted that in the embodiments of the present application, "of", "relevant" and "corresponding" may sometimes be used interchangeably; where the distinction is not emphasized, the intended meaning is the same.
For the convenience of clearly describing the technical solutions of the embodiments of the present application, in the embodiments of the present invention, the words "first", "second", and the like are used for distinguishing the same items or similar items with basically the same functions and actions, and those skilled in the art can understand that the words "first", "second", and the like are not limited in number or execution order.
FIG. 1 illustrates an implementation environment according to an exemplary embodiment: the teleconference scenario shown in FIG. 1. The persons in area A hold a video conference with the persons in area B through the video call device 01-1 provided in area A and the video call device 01-2 provided in area B. When a person in area A speaks, area A serves as the pickup area in the embodiment of the present application; otherwise area B does. A number of fixed seats may be provided in the pickup area for the participants in the video conference. The video call device 01 (01-1 and 01-2) may specifically be a display device equipped with a microphone, a camera and data-processing capability, such as a smart television or a smart screen. The video call device can process the sound data collected by its microphone and the image data collected by its camera.
An execution main body of the audio gain method provided in the embodiment of the present application may be the video call device described above, or may be a part of the video call device, and the embodiment of the present application is not particularly limited.
In the existing audio gain scheme, the collected audio must continuously undergo gain calculation, so the computation is complex; and because the speaker's position is computed from the collected audio alone, the result becomes inaccurate once the audio contains much noise. Existing audio gain methods therefore introduce a large delay between capturing the audio and playing it, and cannot provide good gain control over audio collected in a teleconference.
In view of the above problem, referring to fig. 2, a video call device 01 provided in an embodiment of the present application may include: a microphone array 21, an audio processor 22 and a camera assembly 23.
The microphone array 21 is composed of a plurality of microphones and is used for acquiring sound data in a sound pickup area corresponding to the video call device. For example, the microphones in the microphone array may be omnidirectional microphones or directional microphones. The sensitivity of the omnidirectional microphone is substantially the same for sounds from different angles; directional microphones have different sensitivities to sound at different angles. The microphone array in the embodiment of the present application may be any planar array, a linear array, or a spatial stereo array, where the omnidirectional microphone and the directional microphone are arranged in the specific array according to actual requirements.
And the audio processor 22 is configured to determine, from the sound data acquired by the microphone array 21, the target sub-area where the speaker producing the sound is located. Because the microphones in the array are at different positions, the same sound reaches different microphones at different times; the approximate position of the speaker can therefore be derived from these arrival-time differences with a corresponding algorithm (for example, time-difference-of-arrival estimation). Of course, any other feasible manner may be adopted, and the present application does not specifically limit this. In the embodiment of the present application, the audio processor 22 may be a dedicated audio processing chip connected to the microphone array 21 that implements at least the following functions: locating the speaker, determining target audio gain information according to the target position information and the environment information, and performing gain control on the sound data.
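As a concrete illustration of how arrival-time differences can yield a coarse direction, the following sketch cross-correlates the channels of one microphone pair to find their relative delay and converts it to an angle with the far-field model. The patent does not specify an algorithm, so this particular cross-correlation approach, the function names and the parameters are assumptions for illustration only.

```python
import numpy as np

def estimate_direction_deg(left, right, fs, mic_spacing_m, c=343.0):
    """Coarse direction of arrival for one microphone pair.

    Cross-correlates the two channels to find the inter-channel sample
    delay (TDOA), then converts it to an angle off broadside using the
    far-field model sin(theta) = c * tdoa / mic_spacing_m.
    """
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)  # sample offset of best alignment
    tdoa = lag / fs                                # seconds
    s = np.clip(c * tdoa / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

An estimate this coarse is exactly what the scheme needs at this stage: it only has to select a sub-area, after which the camera assembly refines the position.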
And the camera assembly 23 is configured to acquire a target image of the target sub-area determined by the audio processor 22 and to determine the speaker's target position information from the target image. For example, referring to fig. 3, the camera assembly 23 may include a camera 231 and an image recognition module 232. The camera 231 photographs the scene within its viewing angle to obtain a corresponding image, and may specifically be a monocular camera or a multi-view (for example, binocular) camera, depending on actual requirements. The image recognition module 232 may be a dedicated image processing chip connected to the camera 231, which may be used to implement the following functions: recognizing the speaker, recognizing the environment information of the pickup area, and recognizing the position information of each person in the pickup area (specifically, the direction relative to, and distance from, the camera).
In the present application, the camera 231 may photograph the target sub-area determined by the audio processor 22 to obtain the target image. Specifically, the camera 231 may be aimed so that the center of its viewing angle falls on the center of the target sub-area, the photographed image then being the target image; or the camera 231 may photograph the entire pickup area, and an image of the corresponding area is then extracted from that photograph as the target image. The specific acquisition manner is not limited in the present application. The image recognition module 232 performs image recognition on the target image obtained by the camera 231 to determine the target position information of the speaker in the target image. The target image may be a single photograph or multiple photographs, depending on actual requirements.
Specifically, image recognition may be implemented using computer vision algorithms: mathematical models that help computers understand images. Their core idea is to learn statistical properties and patterns from large amounts of data in a data-driven way, which generally requires a large number of training samples. In particular, image features including texture, color, shape, spatial relationships and high-level semantics may be modeled. An initial model is trained on the training samples, and its parameters are adjusted until the image-recognition error converges, yielding a new model. After training, the image recognition module can predict image classes and class probabilities with the new model and thereby perform image recognition. In the embodiment of the application, the image recognition may comprise recognition of the speaker and recognition of the speaker's position information.
Illustratively, in the embodiments of the present application, the target location information may include a distance of the speaker from video call device 01 and an orientation (e.g., angle) of the speaker with respect to video call device 01. The process of identifying the target image and determining the target position information by the image identification module 232 may include the following two steps S1 and S2:
s1, the image recognition module 232 performs image recognition on all the persons in the target image to determine the speaker among them.
In the embodiment of the application, taking a target image comprising multiple photographs as an example, the image recognition module may determine each person's posture and facial movement by recognizing the face and the movement of each person across the photographs; when a person is determined to be in a standing posture and that person's mouth is moving, the person is considered to be the speaker. Of course, any other feasible determination method can be adopted, and the application is not particularly limited.
And S2, determining the target position information of the speaker according to the target image.
For the distance between the speaker and the video call device in the target position information: because the camera is built into the video call device, this distance can be approximated by the distance between the speaker and the camera. It can be determined in either of the following two optional ways:
(1) In the embodiment of the present application, the distance between the speaker and the camera may be determined as a distance gradient, for example: far, medium, or near. Specifically, the distance may be inferred from the speaker's apparent size in the target image, using a mapping between object size in the image and distance obtained through experiments in advance; the smaller the speaker appears in the target image, the farther the speaker is from the camera.
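A minimal sketch of such a size-to-gradient mapping follows. The pixel thresholds and the use of head height as the size measure are assumptions for illustration; real thresholds would come from the calibration experiments the text describes for the actual camera.

```python
# Hypothetical calibration mapping apparent head height (pixels) to the
# distance gradients named above (thresholds are placeholders).
GRADIENT_THRESHOLDS = [
    (120, "near"),    # head taller than 120 px -> near
    (60, "medium"),   # 60..120 px -> medium
]

def distance_gradient(head_height_px):
    """Smaller apparent size in the image means a larger distance."""
    for threshold, label in GRADIENT_THRESHOLDS:
        if head_height_px > threshold:
            return label
    return "far"
```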
(2) When the camera is a multi-view camera, the distance between the speaker and the camera can be determined from the imaging parallax (disparity) of the speaker in the different cameras. Taking a binocular camera as an example, referring to fig. 4, the left image plane and the right image plane correspond to the respective image planes of the two cameras. Assume both image planes have size L, and let O_R and O_T denote the two cameras (that is, the optical centers of their lenses), which lie in the same plane at a distance B from each other. As can be seen from fig. 4, the optical axes of the left and right cameras are parallel and bisect their respective imaging planes, and f denotes the focal length. P_1 and P_2 are the imaging points of a real-space object P in the left image plane (corresponding to the first image) and the right image plane (corresponding to the second image), respectively. The point P_1 lies at a distance X_R from the left boundary of the left image plane, and P_2 at a distance X_T from the left boundary of the right image plane.

By the similarity of triangles, B / Z = (B - (X_R - X_T)) / (Z - f); hence Z = (B * f) / (X_R - X_T) = (B * f) / d.

Here (X_R - X_T) is the difference between the positions of the same spatial object P in the two captured images, called the parallax (disparity) d. The formula above relates the depth Z of the object P to the parallax (X_R - X_T), the focal length f and the baseline B. Since B and f are constants, the distance Z of the object P can be determined from (X_R - X_T), which is obtained from the two pictures taken at the same moment by the two cameras of the binocular camera.
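The stereo depth relation above is direct to compute. The following sketch evaluates Z = (B * f) / (X_R - X_T); the function name, the unit choices (baseline in metres, focal length and image coordinates in pixels) and the error handling are assumptions for illustration.

```python
def depth_from_disparity(baseline_m, focal_px, x_r, x_t):
    """Z = (B * f) / (X_R - X_T): depth of the object P from the
    disparity between its imaging points in the two images.

    baseline_m is B (metres), focal_px is f (pixels), x_r and x_t are
    the X_R and X_T image coordinates (pixels); Z is returned in metres.
    """
    d = x_r - x_t
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return baseline_m * focal_px / d
```

Note the inverse relationship: doubling the disparity halves the computed depth, which is why nearby speakers are localized more precisely than distant ones.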
It should be understood that the above examples of determining the distance between the speaker and the camera are only used to explain the embodiments of the present application, and should not be construed as limiting. The distance between the speaker and the camera can also be measured in other ways, for example using structured light. The embodiment of the application does not limit the measuring mode of the distance between the speaker and the camera.
For the orientation of the speaker relative to the video call device in the target position information: because the camera is built into the video call device, this orientation can be approximated by the orientation of the speaker relative to the camera. Specifically, a three-dimensional coordinate system in the pickup area is established with the camera as the origin; the specific coordinates of the speaker are then acquired from the target image using image recognition (for convenience of calculation, a certain point on the speaker's face may stand in for the speaker); and the orientation of the speaker relative to the camera is determined from those coordinates.
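Converting such camera-frame coordinates into an orientation reduces to two arctangents. The sketch below assumes a particular axis convention (camera at the origin, x to the right, y up, z pointing into the pickup area), which the patent does not specify.

```python
import math

def speaker_orientation_deg(x, y, z):
    """Azimuth and elevation of a face point relative to the camera,
    with the camera at the origin, x to the right, y up and z pointing
    into the pickup area (axis conventions are an assumption here).
    """
    azimuth = math.degrees(math.atan2(x, z))
    elevation = math.degrees(math.atan2(y, math.hypot(x, z)))
    return azimuth, elevation
```

For gain lookup only the azimuth matters in a flat seating layout; the elevation is included to show the full orientation.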
And the audio processor 22 is configured to perform gain control on the sound data acquired by the microphone array 21 according to target audio gain information corresponding to the target position information determined by the camera module 23.
In the technical scheme provided by the embodiment of the application, the audio processor first estimates the approximate position of the speaker from the sound data acquired by the microphone array and determines the target sub-area of the pickup area in which the speaker is located; the camera assembly then determines the speaker's specific target position information from the image information of that target sub-area; and finally the audio processor performs gain control on the sound data acquired by the microphone array according to that target position information. Because sound data and image data jointly determine the speaker's position, the scheme is more accurate than prior-art schemes relying on sound data alone and avoids the influence of noise on the position estimate, so the finally determined target position information is more accurate and the resulting gain control over the sound data is better.
Furthermore, when determining the target position information, the analysis of the sound data only needs to identify an approximate region (the target sub-area), and the image analysis only needs to cover that one sub-area before the final gain control is performed according to the target position information. Compared with the prior-art scheme, which determines an accurate position from all the sound data while simultaneously denoising, then calculates the audio gain information for that position and applies it, the position-determination step here requires less computation, is more efficient, has lower delay, and gives a better user experience. Moreover, because the video call device uses the image data acquired by the camera assembly for precise localization while gain-controlling the speaker's voice data, a good localization effect can be achieved with fewer microphones than the microphone arrays of the prior art require; and where the video call device is already equipped with a camera, the technical scheme provided by the embodiment of the application can reduce the cost of producing the device.
Optionally, to reduce the computational demand on the video call device 01, each piece of audio gain information may be calculated in advance and stored in a gain database. The gain database may reside on the video call device or on a server able to communicate with the video call device 01, so that the audio processor 22 may specifically be configured to: search the gain database for the target audio gain information corresponding to the target position information determined by the camera assembly 23, and perform gain control on the sound data acquired by the microphone array 21 using that target audio gain information.
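A precomputed gain database of this kind can be as simple as a keyed lookup table. In the sketch below the key structure (sub-area number plus distance gradient), the gain values and the fallback default are all placeholders; real entries would be computed in advance from the environment information and seat positions as described.

```python
# Sketch of the precomputed gain database. Keys and dB values are
# illustrative placeholders, not values from the patent.
GAIN_DB = {
    # (sub_area_number, distance_gradient) -> gain in dB
    (1, "near"): 0.0,
    (1, "far"): 9.0,
    (2, "near"): 0.0,
    (2, "far"): 12.0,
}

def lookup_gain_db(sub_area, gradient, default_db=6.0):
    """Fetch the precomputed target audio gain; fall back to a default
    when the position was not covered by the calibration pass."""
    return GAIN_DB.get((sub_area, gradient), default_db)
```

Moving this table to a server trades a network round trip for on-device memory; either way the audio path at call time does a lookup rather than a fresh gain computation.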
For example, the target gain information may include a gain parameter and/or beamforming information corresponding to the target position information, where the beamforming information can be derived from the gain parameter through a beamforming algorithm. When the target gain information includes only the gain parameter, gain control is applied to the microphone array's sound data after the beamforming information has been calculated from that gain parameter. The beamforming information mainly specifies how to steer the microphones in the array, or how to process the sound data collected by each microphone, so that the sound data from a certain direction is strengthened while sound from other directions is weakened. This ensures that the speaker's sound data collected by the microphone array has higher intensity, so that the playback side can reproduce the speaker's voice more clearly and accurately.
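The simplest beamformer matching this description is delay-and-sum: each channel is advanced by a steering delay so that sound from the chosen direction adds coherently while other directions partially cancel. The sketch below is a minimal illustration with integer sample delays; the patent does not name a specific beamforming algorithm, so this choice is an assumption.

```python
import numpy as np

def delay_and_sum(channels, steering_delays):
    """Minimal delay-and-sum beamformer.

    Advances each channel by its integer steering delay (in samples)
    and averages, reinforcing sound from the steered direction.
    `channels` has shape (n_mics, n_samples).
    """
    n_mics = len(channels)
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, steering_delays):
        out += np.roll(ch, -d)  # wrap-around is acceptable for a sketch
    return out / n_mics
```

Real implementations use fractional delays and per-channel weights, but the principle is the same: the steering delays are the "beamforming information" that the gain database would supply per position.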
Further optionally, to better populate the gain database, before the microphone array 21 acquires sound data in the pickup area corresponding to the video call device, the camera assembly 23 is further configured to acquire an area image of the pickup area and to determine from it the environment information of the pickup area and the position information of each person in the pickup area; the position information of all the persons includes the target position information. The audio processor 22 is further configured to determine, from the environment information and the per-person position information acquired by the camera assembly 23, the audio gain information corresponding to each piece of position information, and to store it in the gain database; the audio gain information corresponding to all the position information includes the target audio gain information. For example, the environment information may include the size of the pickup area and the number of obstacles in it: in practice, the size of the environment in which a microphone array picks up sound, and the obstacles in that environment, also affect the parameters of the collected sound data, so this environment information must be considered when calculating the audio gain information for each position.
It should be noted that, because the persons present may change during a teleconference, the above process in which the camera assembly 23 and the audio processor 22 determine the audio gain information corresponding to each piece of position information may be executed at predetermined intervals (for example, every 5 minutes) after the video call device to which the audio processor 22 belongs is turned on, so as to keep the relevant data in the gain database up to date.
Optionally, the camera assembly 23 is further configured to divide the sound pickup area into a plurality of sub-areas according to the area image of the sound pickup area and to send the division result to the audio processor 22, making it convenient for the audio processor to determine the target sub-area where the speaker corresponding to the sound data is located. The division result may also be stored in the gain database, so that the audio processor 22 can look up the relevant data there when needed later.
In one possible approach, the division result may include, for each sub-area, its number and its angular range relative to the orientation of the camera assembly 23. When determining the target sub-area, the audio processor 22 generally obtains the angular range of the speaker's direction relative to the microphone array 21; however, because the camera assembly 23 and the microphone array are both located on the video call device 01 and are generally disposed on the same plane, this angular range can be approximately treated as the angular range of the speaker's direction relative to the camera assembly 23. After determining the angular range, the audio processor sends the corresponding number to the camera assembly 23 so that the camera assembly can determine the target sub-area and perform the subsequent operations. In another possible approach, the division result may instead give each sub-area's number and angular range relative to a certain point on the plane where the video call device 01 is located (for example, a point on the screen of the video call device shown in fig. 1); when determining the target sub-area, the audio processor 22 may then treat the angular range of the speaker's direction relative to the microphone array 21 as the angular range relative to that point. Of course, any other feasible form of division result may be used; the present application is not limited in this respect.
For example, taking the number and angular range of each sub-area relative to the orientation of the camera assembly as an example, the sub-areas may be divided according to the maximum angle that the camera assembly can capture. As shown in fig. 5, if the maximum capture angle of the camera assembly (not shown in the figure) is α, then α may be divided into several equal parts (five in the figure), and a number assigned to each part, yielding sub-area 1, sub-area 2, sub-area 3, sub-area 4, and sub-area 5 in fig. 5.
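The equal-angle division and the lookup from a measured direction to a numbered sub-area can be sketched as follows (the five-way split mirrors fig. 5; the concrete maximum angle is an assumption):

```python
def divide_subregions(max_angle_deg, num_parts=5):
    """Split the camera assembly's maximum capture angle into equal,
    numbered sub-areas: a list of (number, lower_deg, upper_deg)."""
    width = max_angle_deg / num_parts
    return [(i + 1, i * width, (i + 1) * width) for i in range(num_parts)]

def subregion_for_angle(angle_deg, regions):
    """Return the number of the sub-area whose angular range contains
    angle_deg (the speaker's direction estimated from the sound data)."""
    for number, lo, hi in regions:
        if lo <= angle_deg < hi:
            return number
    # The top edge of the last sub-area belongs to that sub-area.
    return regions[-1][0] if angle_deg == regions[-1][2] else None
```

The audio processor would call `subregion_for_angle` with the speaker's estimated direction and send the resulting number to the camera assembly.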
In this way, before a speaker starts to speak, the audio gain information corresponding to each person's position information can already be calculated from the distribution of persons in the sound pickup area. When the audio gain information for a given piece of position information is later needed, it can be taken directly from the gain database, eliminating the delay of computing the audio gain information in real time; the latency of the whole audio gain process is therefore lower and the user experience is improved. Furthermore, because the audio gain information can be calculated before the sound data is acquired, the computing-power requirement on the audio processor is reduced; a lower-power audio processor costs less, which reduces the production cost of the video call device to a certain extent.
Further optionally, when the camera assembly 23 includes the camera 231 and the image recognition module 232, before the microphone array 21 acquires sound data in a sound pickup area corresponding to the video call device 01, the camera 231 is further configured to photograph the sound pickup area to obtain an area image; the image recognition module 232 is further configured to perform image recognition on the area image obtained by the camera 231 to obtain environment information of the sound pickup area and position information of each person in the sound pickup area.
In the process of implementing audio gain, the video call device provided by the embodiment of the present application first uses the audio processor to judge the approximate position of the speaker according to the sound data acquired by the microphone array and to determine the target sub-area of the speaker within the sound pickup area; the camera assembly then determines the specific target position information of the speaker according to the image of the target sub-area; finally, the audio processor performs gain control on the sound data acquired by the microphone array according to the target position information. Because sound data and image data are combined to determine the speaker's position throughout this process, the result is more accurate than prior-art schemes that rely on sound data alone, and the influence of noise on the position judgment is avoided; the finally determined target position information is therefore more accurate, and the gain control ultimately performed on the sound data is more effective.
Based on the video call device provided in the foregoing embodiment, referring to fig. 6, an embodiment of the present application further provides an audio gain method, which can be applied to the video call device in the foregoing embodiment, where the method includes 601 to 604:
601. Acquire sound data of the sound pickup area.
The sound pickup area here is the sound pickup area corresponding to the video call device.
602. Determine, according to the sound data, the target sub-area where the speaker corresponding to the sound data is located.
Wherein, the target sub-area is one of a plurality of sub-areas included in the sound pickup area.
603. Acquire a target image of the target sub-area, and determine target position information of the speaker according to the target image.
Optionally, when the camera assembly in the video call device includes a camera and an image recognition module, referring to fig. 7, the step 603 may specifically include 6031 and 6032:
6031. Photograph the target sub-area to obtain a target image.
6032. Perform image recognition on the target image to determine the target position information of the speaker in the target image.
604. Perform gain control on the sound data according to the target position information.
Optionally, referring to fig. 8 in combination with fig. 7, 604 specifically includes 6041 and 6042:
6041. Search the gain database for target audio gain information corresponding to the target position information.
6042. Perform gain control on the sound data using the target audio gain information.
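Steps 601 through 604 can be sketched end to end as below. The helper callables (`locate_angle`, `capture_image`, `find_speaker_position`) are hypothetical stand-ins for the microphone array's direction estimate, the camera, and the image recognition module; the key format of the gain database is likewise assumed:

```python
def audio_gain_pipeline(sound_frames, regions, gain_database,
                        locate_angle, capture_image, find_speaker_position):
    """End-to-end sketch of steps 601-604: localize the speaker by sound,
    refine the position from an image of the target sub-area, then look
    up the precomputed gain and apply it to the sound data."""
    angle = locate_angle(sound_frames)                        # 601-602: coarse direction
    target = next(n for n, lo, hi in regions if lo <= angle < hi)
    image = capture_image(target)                             # 603: image of target sub-area
    position = find_speaker_position(image)                   # 603: target position info
    gain = gain_database.get(position, 1.0)                   # 6041: database lookup
    return [sample * gain for sample in sound_frames]         # 6042: gain control
```

The default gain of 1.0 when the position is absent from the database is a defensive choice for the sketch, not something the embodiment specifies.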
Further optionally, referring to fig. 9 in combination with fig. 8, before the step 601, 600A and 600B are further included:
600A. Acquire an area image of the sound pickup area, and determine environment information of the sound pickup area and position information of each person in the sound pickup area according to the area image.
600B. Determine audio gain information corresponding to each piece of position information according to the environment information and the position information of each person, and store the audio gain information in the gain database; the audio gain information corresponding to all position information includes the target audio gain information.
Further optionally, when the camera component in the video call device includes a camera and an image recognition module, referring to fig. 10 in conjunction with fig. 9, 600A includes 6001A and 6002A:
6001A. Photograph the sound pickup area to obtain the area image.
6002A. Perform image recognition on the area image to acquire environment information of the sound pickup area and position information of each person in the sound pickup area.
Since the audio gain method provided in the embodiment of the present application is based on the video call device provided in the foregoing embodiment and has the same technical features, the beneficial effects of the audio gain method can refer to the beneficial effects of the video call device in the foregoing embodiment, and are not described again here.
In the case of using an integrated module, referring to fig. 11, an embodiment of the present application further provides another video call device, which includes a memory 41, a processor 42, a bus 43, and a communication interface 44. The memory 41 is used to store computer-executable instructions, and the processor 42 is connected to the memory 41 through the bus 43. When the video call device operates, the processor 42 executes the computer-executable instructions stored in the memory 41 to cause the video call device to perform the audio gain method provided by the foregoing embodiments. The video call device further includes a camera 231 capable of taking pictures and a microphone array 45 capable of picking up sound; the camera 231 and the microphone array 45 are connected to a peripheral interface 46 via the bus 43, and the peripheral interface 46 is connected to the processor 42 and the memory 41 via the bus 43.
In a specific implementation, as one embodiment, the processor 42 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 11. As another example, the video call device may include a plurality of processors 42, such as processor 42-1 and processor 42-2 shown in fig. 11. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The processor 42 here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
The memory 41 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 41 may be self-contained and coupled to the processor 42 via the bus 43, or may be integrated with the processor 42.
In a specific implementation, the memory 41 is used for storing data in the present application and computer-executable instructions corresponding to software programs for executing the present application. Processor 42 may perform various functions of the video telephony device by running or executing software programs stored in memory 41 and invoking data stored in memory 41.
The communication interface 44 is any device, such as a transceiver, for communicating with other devices or communication networks, such as a control system, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), and the like. The communication interface 44 may include a receiving unit implementing a receiving function and a transmitting unit implementing a transmitting function.
The bus 43 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 43 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean there is only one bus or one type of bus.
Embodiments of the present application further provide a computer-readable storage medium, which includes computer-executable instructions, and when the computer-executable instructions are executed on a computer, the computer is enabled to execute the audio gain method provided in the foregoing embodiments.
Embodiments of the present application further provide a computer program product, and when the instructions in the computer program product are executed by a processor of a video call device, the video call device is caused to execute the audio gain method provided in the foregoing embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer-readable storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical function division, and there may be other division ways in actual implementation. For example, various elements or components may be combined or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. Units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the claims shall control.

Claims (6)

1. A video call device, comprising:
the audio processor is used for updating audio gain information corresponding to the position information of each person stored in the gain database every preset time according to the environment information of the pickup area and the position information of each person in the pickup area, which are acquired by the camera assembly, after the remote video conference is started;
the microphone array is composed of a plurality of microphones and is used for acquiring sound data in a sound pickup area corresponding to the video call equipment;
the audio processor is further configured to determine a target sub-area where a speaker corresponding to the sound data is located according to the sound data acquired by the microphone array; the target sub-area is one of a plurality of sub-areas included in the sound pickup area;
the camera assembly is further configured to acquire a target image of the target sub-area determined by the audio processor, and to determine target position information of the speaker according to the target image;
the audio processor is further configured to search the gain database for target audio gain information corresponding to the target position information determined by the camera assembly, and to perform gain control on the sound data acquired by the microphone array by using the target audio gain information.
2. The video call device of claim 1, wherein the camera assembly comprises a camera and an image recognition module;
the camera is used for photographing the target subarea determined by the audio processor to obtain the target image;
the image recognition module is used for carrying out image recognition on the target image obtained by the camera so as to determine the target position information of the speaker in the target image.
3. The video call device of claim 2, wherein before the microphone array acquires sound data in a pickup area corresponding to the video call device,
the camera is also used for photographing the pickup area to obtain the area image;
the image recognition module is further used for carrying out image recognition on the area image obtained by the camera so as to obtain the environment information of the sound pickup area and the position information of each person in the sound pickup area.
4. An audio gain method, comprising:
after a remote video conference is started, updating audio gain information corresponding to the position information of each person stored in a gain database every other preset time according to the acquired environment information of the sound pickup area and the position information of each person in the sound pickup area;
acquiring sound data of a pickup area;
determining a target subregion where a speaker corresponding to the sound data is located according to the sound data; the target sub-area is one of a plurality of sub-areas included in the sound pickup area;
acquiring a target image of the target subregion, and determining target position information of the speaker according to the target image;
searching target audio gain information corresponding to the target position information from the gain database;
gain control is performed on the sound data using the target audio gain information.
5. The audio gain method of claim 4, wherein the obtaining of the target image of the target subregion and determining the target position information of the speaker according to the target image comprises:
photographing the target sub-area to obtain the target image;
and performing image recognition on the target image to determine target position information of a speaker in the target image.
6. The audio gain method of claim 5, wherein the obtaining of the area image of the sound pickup area and determining the environment information of the sound pickup area and the position information of each person in the sound pickup area according to the area image comprises:
photographing the pickup area to obtain the area image;
and performing image recognition on the area image to acquire environment information of the sound pickup area and position information of each person in the sound pickup area.
CN202011300121.XA 2020-11-18 2020-11-18 Video call device and audio gain method Active CN112423191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011300121.XA CN112423191B (en) 2020-11-18 2020-11-18 Video call device and audio gain method


Publications (2)

Publication Number Publication Date
CN112423191A CN112423191A (en) 2021-02-26
CN112423191B true CN112423191B (en) 2022-12-27

Family

ID=74773468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011300121.XA Active CN112423191B (en) 2020-11-18 2020-11-18 Video call device and audio gain method

Country Status (1)

Country Link
CN (1) CN112423191B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676687A (en) * 2021-08-30 2021-11-19 联想(北京)有限公司 Information processing method and electronic equipment
CN114401350A (en) * 2022-01-24 2022-04-26 联想(北京)有限公司 Audio processing method and conference system
CN114911449A (en) * 2022-04-08 2022-08-16 南京地平线机器人技术有限公司 Volume control method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107919133A (en) * 2016-10-09 2018-04-17 赛谛听股份有限公司 For the speech-enhancement system and sound enhancement method of destination object
JP2018170717A (en) * 2017-03-30 2018-11-01 沖電気工業株式会社 Sound pickup device, program, and method
CN110933254A (en) * 2019-12-11 2020-03-27 杭州叙简科技股份有限公司 Sound filtering system based on image analysis and sound filtering method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9973848B2 (en) * 2011-06-21 2018-05-15 Amazon Technologies, Inc. Signal-enhancing beamforming in an augmented reality environment
CN111050269B (en) * 2018-10-15 2021-11-19 华为技术有限公司 Audio processing method and electronic equipment
CN110691196A (en) * 2019-10-30 2020-01-14 歌尔股份有限公司 Sound source positioning method of audio equipment and audio equipment


Also Published As

Publication number Publication date
CN112423191A (en) 2021-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant