CN115633290A

CN115633290A - Audio processing method, electronic device and storage medium

Info

Publication number: CN115633290A
Application number: CN202211250411.7A
Authority: CN
Inventors: 凌宏强; 张洪伟; 徐家喜
Original assignee: Suzhou Keda Technology Co Ltd
Current assignee: Suzhou Keda Technology Co Ltd
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2023-01-20

Abstract

The application relates to an audio processing method, electronic equipment and a medium, which belong to the technical field of audio processing, and the method comprises the following steps: obtaining first position information of a target sound source by using an image collected by an image collecting component; generating an acoustic image based on the audio acquisition assembly to obtain each relative position relation between each sound source and the audio acquisition assembly; determining a target relative positional relationship between the target sound source and the audio acquisition component in each relative positional relationship using the first positional information; determining a target beam and a volume gain of a target sound source based on the target relative positional relationship; the volume gain is used to adjust the audio volume of the audio data acquired in accordance with the target beam. Even if interference factors such as noise, echo, reverberation and the like exist in a meeting place, accurate position information of a target sound source can be obtained, and a target beam and a volume gain are set for the target sound source, so that the volume difference between the target sound sources is not too large, and the audio volume of each target sound source can be consistent.

Description

Audio processing method, electronic device and storage medium

[ technical field ] A method for producing a semiconductor device

The application relates to an audio processing method, electronic equipment and a storage medium, and belongs to the technical field of audio processing.

[ background of the invention ]

When a teleconference is carried out, conference terminals are required to be used for communicating different conference places, and the conference terminals are generally provided with sensors for collecting sound. Interference factors such as noise, echo and reverberation inevitably exist in the meeting place, the effect of sound transmitted into another meeting place is seriously influenced, and the normal running of a conference is interfered.

The conventional conference terminal is provided with a fixed beam facing a participant through a sensor, only collects sound in the beam, reduces interference of sound outside the beam, and further weakens the influence of interference factors in a conference room.

However, since the sound level of each participant is different and the distance between each participant and the sensor is different, a problem of a large sound difference between the participants is caused.

[ summary of the invention ]

The application provides an audio processing method, an electronic device and a storage medium, which can cause the problem of great sound difference between participants due to different sound sizes of each participant and different distances between each participant and a sensor. The application provides the following technical scheme:

in a first aspect, an audio processing method is provided, where the method includes:

carrying out image recognition on an image collected by an image collecting assembly in a conference scene to obtain first position information of a target sound source;

generating an acoustic image based on the audio signal acquired by the audio acquisition component, wherein the acoustic image is used for indicating each relative position relationship between each sound source in the conference scene and the audio acquisition component;

determining a target relative positional relationship between the target sound source and the audio acquisition component among the respective relative positional relationships using the first positional information;

determining a target beam and a volume gain of the target sound source based on the target relative positional relationship, wherein the volume gain is in a positive correlation with a distance indicated by the target relative positional relationship;

adjusting an audio volume of audio data acquired in accordance with the target beam using the volume gain.

Optionally, the first location information comprises a target position and a target distance, the target position comprising a horizontal angular position and a vertical angular position;

the image recognition of the image collected by the image collecting component in the conference scene is performed to obtain the first position information of the target sound source, and the method comprises the following steps:

carrying out face detection on the image to obtain face pixel height and face central point coordinates;

acquiring the screen pixel size, the screen center point coordinate, the horizontal field angle, the vertical field angle and the real height of a human face of the image acquisition assembly, wherein the screen pixel size comprises the horizontal size and the vertical size;

acquiring a horizontal distance and a vertical distance between the coordinates of the face center point and the coordinates of the screen center point;

obtaining a horizontal angle orientation between the face center point coordinates and the image acquisition assembly by using the horizontal size, the horizontal distance and the horizontal field angle;

obtaining a vertical angular orientation between the face center point coordinates and the image acquisition assembly using the vertical dimension, the vertical distance, and the vertical field angle;

obtaining a focal length using the vertical dimension and the vertical field angle;

and obtaining a target distance by using the focal length, the real height of the human face and the height of the human face pixel.

Optionally, the sound image comprises audio peaks of the respective sound sources, and positions of the audio peaks are used for indicating the respective relative positional relationships; the determining a target relative positional relationship between the target sound source and the audio acquisition component in respective relative positional relationships using the first positional information includes:

acquiring second position information of the first position information corresponding to the acoustic image;

determining false audio peaks of which the relative position relations are not matched with the second position information in the audio peaks of the sound image;

and deleting the false audio peak values in the audio peak values of the sound image to obtain the target relative position relation indicated by each deleted audio peak value.

Optionally, the second positional information comprises a first distance and a first orientation between the audio acquisition assembly and the target sound source, and the respective relative positional relationships comprise a second distance and a second orientation;

the determining false audio peaks of which the relative position relations are not matched with the second position information in the audio peaks of the sound image comprises:

obtaining a distance difference between the second distance and the first distance in each audio peak of the sound image,

determining the false audio peak for which the second orientation does not match the first orientation and/or for which the distance difference is greater than a preset distance threshold.

Optionally, the determining false audio peaks, in which the respective relative positional relationships are not matched with the second positional information, in the respective audio peaks of the acoustic image includes:

determining a false audio peak whose relative positional relationship does not match the second positional information among the respective audio peaks of the acoustic image in a case where the number of audio peaks in the acoustic image is greater than or equal to the number of target sound sources.

Optionally, the determining a volume gain of the target sound source based on the target relative position relationship includes:

determining an intermediate distance using the maximum distance and the minimum distance indicated by the relative positional relationship of the targets;

acquiring a standard volume gain corresponding to the intermediate distance;

determining a gain difference between a volume gain of each target sound source and the standard volume gain based on a difference between each target sound source and the intermediate distance;

determining a volume gain for each target sound source based on the standard volume gain and the gain difference.

determining a first gain value of the target sound source based on the target relative position relationship, wherein the first gain value is in positive correlation with the distance indicated by the target relative position relationship;

determining the volume difference between the audio volume obtained by adjusting the audio data according to the first gain value and a preset standard volume;

determining a second gain value of the target sound source using the volume difference;

determining a volume gain of the target sound source based on the first gain value and the second gain value.

Optionally, the method comprises:

determining whether the target sound source meets a preset volume adjustment condition; the volume adjusting condition comprises at least one of the following conditions: the target sound source is in a speaking state; the target sound source is a speaker; the target sound source is a participant interacting with the speaker;

and under the condition that the target sound source meets the volume adjusting condition, triggering and executing the step of adjusting the volume of the audio data collected according to the target wave beam by using the volume gain.

Optionally, the method further comprises:

and under the condition that the target sound source does not meet the volume adjusting condition, if the volume of the audio data is greater than or equal to a preset volume threshold value, inhibiting the audio volume of the audio data.

In a second aspect, an electronic device is provided, where the electronic device includes a processor and a memory connected to the processor, and the memory stores a program, and the processor executes the program to implement the audio processing method provided in the first aspect.

In a third aspect, a computer-readable storage medium is provided, in which a program is stored, which, when executed by a processor, is configured to implement the audio processing method provided in the first aspect.

The beneficial effects of this application include at least: the method comprises the steps that image recognition is carried out on images collected by an image collecting assembly in a meeting scene to obtain first position information of a target sound source; generating an acoustic image based on the audio signal acquired by the audio acquisition component, wherein the acoustic image is used for indicating each relative position relationship between each sound source in the conference scene and the audio acquisition component; determining a target relative positional relationship between the target sound source and the audio acquisition component in each relative positional relationship using the first positional information; determining a target beam and a volume gain of a target sound source based on the target relative positional relationship; adjusting an audio volume of the audio data collected according to the target beam using the volume gain; the problem that the sound of each participant is different in size and the distance between each participant and the sensor is different, so that the sound of the participants is greatly different can be solved; because the acoustic image is used for indicating the relative position relationship of each sound source in the meeting place, the relative position relationship of the sound source corresponding to interference factors such as noise, echo and reverberation may also be included, and the target sound source can be determined by using the image and the first position information can be acquired, so that the target relative position relationship is determined in each relative position relationship by using the first position information, and the relative position relationship of the interference factors can be eliminated under the condition that the interference factors such as noise, echo and reverberation exist in the acoustic image, so that the target relative position relationship corresponding to the target sound source is obtained, and the relative position information acquired in the acoustic image is more accurate, so that the accurate position information of the target sound source can be acquired, and the target beam and the volume gain are set for the target sound source, so that the volume difference between the target sound sources is not too large, the quality of audio data in the meeting can be improved, and the audio volumes of each target sound source can be consistent.

Meanwhile, the target relative position relation between the target sound source and the audio acquisition assembly is determined by combining the first position information and each relative position relation, so that the problem that the target wave beam deviates from the target sound source due to inaccurate first position information caused by image distortion can be avoided; the problem that the target wave beam points to noise due to the fact that noise can form an audio peak value in a sound image can be avoided; and determining a target relative position relation corresponding to the target sound source by combining the first position information and each relative position relation, thereby accurately setting a target beam pointing to the target sound source, and adjusting the audio volume of the audio data of the target sound source collected in the target beam through volume gain, so that the volume of each target sound source tends to accord with the listening rule of a user, and the quality of conference audio is improved.

Meanwhile, due to the fact that the volume gain is used for adjusting the audio gain of the audio data of the target sound source collected in the target wave beam, the problem that the sound volume of the participants is too large or too small to cause hearing failure can be avoided, the number of times of repeated speaking caused by the fact that other participants can not hear failure can be reduced, the conference time can be reduced, and the conference efficiency is improved.

In addition, deleting false audio peaks which are not matched with the first position information in the acoustic image to obtain a target relative position relation indicated by each deleted audio peak; the accuracy of the obtained relative position relation of the target can be improved, so that the accuracy of the target beam pointing to the target sound source is improved, the effectiveness of adjusting the volume of the target sound source by the volume gain is improved, and the interference of interference factors such as noise in a meeting place on a meeting is reduced.

In addition, when the number of the audio peaks is greater than or equal to the number of the target sound sources, a false audio peak with a relative position relation not matched with the first position information is determined in each audio peak of the sound image, so that the situation that the target relative position relation of the target sound source is obtained wrongly under the condition that an audio component is damaged or the target sound sources do not all enter a conference scene can be avoided, and the reliability of the electronic equipment can be improved.

In addition, the first position information of the target sound source of the target type is obtained, the false audio peak value of which the relative position relation is not matched with the first position and orientation information of the target sound source of the target type is determined, different target sound sources can be flexibly determined according to different meeting scenes, and the flexibility of the electronic equipment can be improved.

In addition, the volume gain of each target sound source is determined according to the standard volume gain corresponding to the middle distance, so that the volumes of the target sound sources tend to be consistent, the listening rules of users are met, meanwhile, the corresponding relation between the preset distance and the volume gain can be reduced, and the storage resources of the electronic equipment are saved.

In addition, the second gain value is determined according to the preset standard volume on the basis of the first gain value, so that when the volume of the target sound source changes, the corresponding volume gain can be adjusted in time, the volume change can be smoothed, and the accuracy of the electronic equipment in adjusting the volume is improved.

In addition, under the condition that the preset volume adjustment condition is met, the volume of the audio data collected according to the target wave beam is adjusted by using the volume gain, so that the volume of the target sound source which meets the volume adjustment condition can be ensured, and the intelligence of the electronic equipment for adjusting the volume is improved.

In addition, under the condition that the preset volume adjustment condition is not met, the audio volume of the audio data of the target sound source is suppressed, the situation that the volume of the target sound source which does not meet the volume adjustment condition is too large to interfere with the normal operation of the conference can be avoided, and therefore the conference efficiency can be improved.

The foregoing description is only an overview of the technical solutions of the present application, and in order to make the technical solutions of the present application more clear and clear, and to implement the technical solutions according to the content of the description, the following detailed description is made with reference to the preferred embodiments of the present application and the accompanying drawings.

[ description of the drawings ]

FIG. 1 is a schematic diagram of a system for an audio processing method according to an embodiment of the present application;

FIG. 2 is a flow chart of an audio processing method provided by an embodiment of the present application;

FIG. 3 is a schematic illustration of an image provided by one embodiment of the present application;

FIG. 4 is a schematic illustration of a target bearing calculation method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a target distance calculation method according to an embodiment of the present application;

FIG. 6 is a schematic illustration of an acoustic image provided by an embodiment of the present application;

FIG. 7 is a schematic diagram illustrating a target beam and a target sound source according to an embodiment of the present application;

FIG. 8 is a schematic diagram of the distribution of volume gain according to intermediate distance according to one embodiment of the present application;

FIG. 9 is a flowchart of specific steps provided by an embodiment of the present application;

FIG. 10 is a block diagram of an apparatus for audio processing provided by an embodiment of the present application;

FIG. 11 is a block diagram of an electronic device provided by an embodiment of the application.

[ detailed description ] embodiments

The following detailed description of the present application will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.

When a teleconference is carried out, because the speaking volume of each participant in the same conference scene is different, and the distance between each participant and the audio acquisition component is different, the volume of the audio data acquired by the audio acquisition component possibly differs greatly, thereby interfering the normal operation of the teleconference. Conventional solutions include, but are not limited to, at least one of the following:

first, at the end of collecting the speaking audio of the participant, the participant actively adjusts the speaking volume of the participant, or the participant actively adjusts the distance from the audio collecting component, so as to change the volume of the audio data.

But the conference participants cannot well grasp the volume and distance of the active adjustment, so that the volume is changed from too high to too low; or the volume changes from too low to too high, even when the cut-off causes distortion of the sound at the time of sound amplification.

And secondly, actively adjusting the output volume of the audio data at the end of outputting the speaking audio of the participants.

Noise in the audio data is amplified while the output volume of the audio data is increased; and frequent manual adjustment of the output volume is also troublesome.

Third, each participant wears a respective audio capture component, such as a gooseneck microphone.

The gooseneck microphone requires an interface to access the terminal. In general, the number of interfaces of the terminal is small, and all gooseneck microphones cannot be accessed, so that external expansion equipment is required. Therefore, access lines in the conference scene can be increased, the difficulty of line arrangement in the conference scene can be increased, and the cost can be increased.

Fourth, the level of the audio volume of the audio data is automatically adjusted using an Automatic Gain Control (AGC) technique.

The working principle of AGC for adjusting the volume is as follows: when a weak signal is input, the linear amplification circuit works to ensure the strength of an output signal; when the input signal reaches a certain intensity, the compression amplifying circuit is started to reduce the output amplitude. The greater the similarity of the input and output signals, the better the Echo cancellation performance of the Echo canceller (AEC). Therefore, the AGC may affect the echo cancellation effect of the AEC because the input-output signal difference is increased by the AGC.

Fifthly, a microphone array is adopted, and a beam is effectively formed in a desired direction through a beam forming technology, so that only signals in the beam are picked up, and the aim of improving the signal-to-noise ratio is fulfilled.

However, the positions of the participants are changed, and interference factors such as noise, echo, reverberation and the like exist in a conference scene, so that the problem that the microphone array cannot set the beam direction accurately and timely according to the positions of the participants is caused.

Therefore, currently, one or more virtual single-point microphones are formed by fixing the beam direction, and only respond to the audio frequency in the beam direction, so as to reduce the influence of the interference factors. However, the problem that the audio volume of the audio data is greatly different due to the fact that the speaking volume of each participant is different and the distances from the audio acquisition component are different still cannot be solved.

Therefore, the application provides an audio processing method, an electronic device and a storage medium for solving the problem that the audio volume of the collected audio data is greatly different due to the fact that the speaking volume of each participant is different and the distance between the participant and the audio collecting component is different.

Fig. 1 is a schematic diagram of a system of an audio processing method according to an embodiment of the present application. As can be seen from fig. 1, the system 10 at least comprises: an audio capture component 110, an image capture component 120, and an electronic device 130. In this embodiment, the electronic device 130 is taken as a remote conference terminal (such as a video conference terminal or an audio conference terminal) for example, in other embodiments, the electronic device 10 may also be a mobile phone, a notebook computer, a television, and the like, and the implementation manner of the electronic device is not limited in this embodiment.

The audio capture component 110 is configured to capture audio signals within a conference scene (hereinafter referred to as a conference room). The audio capture component 110 includes, but is not limited to: a microphone array, or a sound intensity probe, etc., and different audio capturing assemblies 110 are suitable for the same or different scenes, and the implementation manner of the audio capturing assemblies 110 is not limited in this embodiment.

In this embodiment, the audio acquisition component 110 has a beam forming function. The beam forming is to point to a designated direction to form a beam for collecting audio signals falling in the beam and suppressing audio signals outside the beam.

For example, the audio collecting component 110 forms a target beam toward the target sound source, so that the audio signal of the target sound source falls within the target beam, thereby collecting the audio signal of the target sound source and suppressing the audio signal outside the target beam. The target sound source is a sound source that a user wants to collect audio data, and the target beam is a beam directed to the target sound source.

Taking the working scene of the audio acquisition component 110 as a teleconference scene as an example, the target sound source includes both the participants who participate in the conference and the participants who participate in the conference through the communication devices placed in the conference hall. The embodiment does not limit the implementation manner of the target sound source.

The audio capture component 110 is communicatively coupled to the electronic device 130 and transmits captured audio signals to the electronic device 130.

Optionally, the audio capturing component 110 may be a device independent from the electronic device 130, or may be integrated in the electronic device 130, and the implementation manner of the audio capturing component 110 and the electronic device 130 is not limited in this embodiment.

The electronic device 130 is configured to acquire an audio signal and generate an acoustic image based on acoustic imaging principles using the audio signal. The acoustic map is used to display audio information of each sound source in the conference room. Wherein the audio information includes but is not limited to: the relative positional relationships between the respective sound sources and the audio acquisition assembly 110, the volume levels of the respective sound sources, and the like.

The acoustic image comprises audio peaks of the respective sound sources, which are visual indications of the volume maximum of the sound waves in the acoustic image. Wherein the position of the audio peak is used to indicate the relative position relationship between the sound source corresponding to the audio peak and the audio acquisition component 110.

The acoustic image is generated by a sound source localization technique based on time difference of arrival. Namely, the time difference between the arrival of the audio signal of the sound source at each two audio acquisition assemblies 110 is measured, so as to obtain an equation set of the position coordinates of the sound source, the relative position relationship between the sound source and the audio acquisition assemblies 110 is obtained after the equation set is solved, the amplitude of the sound source is measured, and then the distribution of the sound source in the space is identified in an image mode, so as to obtain a sound image. The sound sources of the respective utterances are represented in the acoustic image by audio peaks, and therefore the positions of the audio peaks can indicate the relative positional relationship between the respective sound sources and the audio acquisition assembly. In other embodiments, the acoustic image may also be generated based on a beam forming technique, that is, a beam is effectively formed in a desired direction, and only the audio signal in the beam is picked up, so that the out-of-beam noise can be suppressed while the in-beam audio signal is extracted.

The audio peaks can be represented in the sound image map using image identifiers with brightness and color distinctions, where different colors or brightness are used to identify different volumes; or, the audio peak value is represented in a contour line mode, sound sources with the same volume are located on the same contour line, and sound sources with different volumes are located on different contour lines, so that the strength of the volume is identified. The present embodiment does not limit the representation manner of the audio peak.

Because there may be noise in the meeting place that the user does not want to collect, for example: the sound emitted by the air conditioner during operation, the sound emitted by the projector during operation, etc., and these noises may also have corresponding audio peaks, hereinafter false audio peaks, in the sound image. Based on this, it is necessary to eliminate spurious audio peaks in the sonogram, thereby improving the signal-to-noise ratio of the sonogram. The signal-to-noise ratio refers to a ratio of a normal sound signal to noise, and if the signal-to-noise ratio is higher, the noise is smaller, and the quality of the sound image is higher.

Based on this, the system of the audio processing method provided in this embodiment further includes an image collecting component 120, where the image collecting component 120 can collect an image based on an optical imaging principle, and thus, the electronic device may combine the image to screen an audio peak in the acoustic image, so as to improve accuracy of a sound source position obtained through the acoustic image.

The collection range of the image collection component 120 includes all sound sources corresponding to the audio peaks in the sound image. In an example, if the working scene of the image capturing component 120 is a teleconference scene, since each sound source reflected by the sound image is usually located in a central area of the teleconference scene, the image capturing component 120 is installed at an edge of the conference scene, which is directly opposite to the central area, and at this time, the image captured by the image capturing component 120 is a panoramic image covering all the sound sources in the whole teleconference scene.

In this embodiment, the image capturing component 120 is configured to capture an image within a meeting place, and the electronic device 10 obtains first position information of a target sound source. Image acquisition component 120 includes, but is not limited to: a panoramic camera, a scanner, etc., and the present embodiment does not limit the implementation manner of the image capturing assembly 120.

Wherein the first position information comprises a target position and a target distance of the target sound source relative to the image acquisition assembly 120.

The image capture component 120 is communicatively coupled to the electronic device 130 and transmits the captured panoramic image to the electronic device 130.

Optionally, the image capturing component 120 may be a device independent from the electronic device 130, or may be integrated in the electronic device 130, and the implementation of the image capturing component 120 and the electronic device 130 is not limited in this embodiment.

Since the image acquisition assembly 120 may have distortion during image acquisition due to different orientations and sizes of each target sound source in the conference room, determining the distance of the target sound source relative to the image acquisition assembly 120 by using only the image acquired by the image acquisition assembly 120 may not be accurate

In view of the foregoing problem, in this embodiment, the electronic device is configured to: carrying out image recognition on an image collected by an image collecting assembly in a meeting place to obtain first position information of a target sound source; generating an acoustic image based on the audio signal acquired by the audio acquisition component, wherein the acoustic image is used for indicating each relative position relationship between each sound source in the conference scene and the audio acquisition component; determining a target relative positional relationship between the target sound source and the audio acquisition component in each relative positional relationship using the first positional information; determining a target beam and a volume gain of a target sound source based on the target relative positional relationship; the volume gain is used to adjust the audio volume of the audio data acquired in accordance with the target beam.

In general, the farther the distance from the audio acquisition component 110, the smaller the volume of the audio data of the acquired target sound source, and therefore, the volume gain is in positive correlation with the distance indicated by the relative positional relationship of the target. That is, the farther the target audio source is from the audio acquisition component 110, the greater the volume gain.

In this embodiment, the relative positional relationship corresponding to the target sound source is screened from each relative positional relationship by using the first positional information acquired by the image acquisition component. Therefore, even if interference factors such as noise, echo and reverberation exist in the conference room, the audio acquisition component can also acquire accurate position information of the target sound source, and set a target beam and a volume gain aiming at the target sound source, so that the volume between the target sound sources does not differ too much, the quality of audio data in the conference can be improved, and the audio volume of each target sound source can be consistent.

In addition, the target relative position relation between the target sound source and the audio acquisition assembly is determined by combining the first position information and each relative position relation, so that the problem that the target beam deviates from the target sound source due to inaccurate first position information caused by image distortion can be avoided; the problem that the target wave beam points to noise due to the fact that noise can also form audio peaks in a sound image can be avoided; and determining a target relative position relation corresponding to the target sound source by combining the first position information and each relative position relation, thereby accurately setting a target beam pointing to the target sound source, and adjusting the audio volume of the audio data of the target sound source collected in the target beam through volume gain, so that the volume of each target sound source tends to accord with the listening rule of a user, and the quality of conference audio is improved.

The following describes the audio processing method provided by the present application in detail. The following embodiments are described by taking as an example that the method is used in the electronic device shown in fig. 1, and is specifically used in a processor in the electronic device, and in practical implementation, the method may also be used in other devices communicatively connected to the electronic device, for example: the method is applied to a user terminal, or a server, etc., wherein the user terminal includes but is not limited to: a mobile phone, a computer, a tablet computer, a wearable device, and the like, the implementation manner of the other device and the implementation manner of the user terminal are not limited in this embodiment.

The communication connection mode may be wired communication or wireless communication, the wireless communication mode may be short-range communication or wireless communication, and the like.

Fig. 2 is a flowchart of an audio processing method according to an embodiment of the present application. The method at least comprises the following steps:

step 201, performing image recognition on an image acquired by an image acquisition assembly in a conference scene to obtain first position information of a target sound source.

Image recognition refers to techniques for processing, analyzing, and understanding images captured by an image capture assembly.

The first location information includes: the target azimuth and the target distance between the image acquisition assembly and the target sound source; wherein the target orientations include horizontal angular orientations, and vertical angular orientations. In other embodiments, the target position may also include a vertical angle position and/or a horizontal angle position, and the embodiment does not limit the implementation manner of the first position information.

The method for identifying the image collected by the image collecting component in the meeting scene to obtain the first position information of the target sound source comprises the following steps: carrying out face detection on the image to obtain face pixel height and face central point coordinates; acquiring the screen pixel size, the screen center point coordinate, the horizontal field angle, the vertical field angle and the real height of a human face of an image acquisition assembly, wherein the screen pixel size comprises the horizontal size and the vertical size; acquiring a horizontal distance and a vertical distance between the coordinates of the center point of the face and the coordinates of the center point of the screen; obtaining a horizontal angle direction between the coordinates of the center point of the face and the image acquisition assembly by using the horizontal size, the horizontal distance and the horizontal field angle; obtaining a vertical angle direction between the coordinates of the center point of the face and the image acquisition assembly by using the vertical size, the vertical distance and the vertical field angle; obtaining a focal length using a vertical dimension and a vertical field angle; and obtaining the target distance by using the focal length, the real height of the human face and the height of the human face pixel.

The real face height is expressed by using an actual average face height, for example, the average face height may be 25 cm or 23 cm, and the present embodiment does not limit the real face height.

Performing face detection on the image, including but not limited to at least one of the following face detection algorithms: eigenfaces algorithm, fisherfaces algorithm, etc.

And determining a target sound source in the image according to a face detection result of the image by using a face detection algorithm.

Specifically, taking the image acquisition component as the center for example, the target image acquired by the image acquisition component is as shown in fig. 3, and the target sound source 30 can be identified in, for example, the air conditioner 32, the door 31, the projector 34, and the target sound source 30 by using a neural network model.

Illustratively, in some special conferences, the type of conference participants speaking in the conference hall is limited, and in this case, the identified target sound source usually needs to be screened once. Therefore, in the process of identifying the target sound source in the image by using the preset image identification algorithm, the target type of the target sound source can be identified. Wherein, the target type can be set by a user, such as: target types include, but are not limited to: a man, a person older than or equal to sixty years, a person standing, or a specific person, etc.

In the conventional target orientation acquisition method, the focal length is variable, so that the focal length needs to be calculated, however, errors may exist in the focal length calculated through a model, and the accuracy of the target orientation is affected. Therefore, the embodiment provides a method for calculating a target azimuth without using a focal length, which includes:

using the horizontal size, horizontal distance and horizontal field angle to obtain the horizontal angular orientation between the coordinates of the center point of the face and the image capturing assembly, as shown in fig. 4, the image capturing assembly images the actual face 41 through the lens 42 onto the imaging screen 43, resulting in the following formula:

based on a triangle similarity principle, half of a horizontal field angle is the same as an azimuth angle beta, and a horizontal angle direction is the same as a horizontal angle alpha, so that alpha represents the horizontal angle direction, and beta represents half of the horizontal field angle of the image acquisition assembly; fw is the horizontal distance between the coordinates of the center point of the face and the coordinates of the center point of the imaging screen 43; f is the focal length of the image acquisition assembly; w is the horizontal dimension.

Thus, using the ratio of equation (1) and equation (2), the following equation can be obtained:

since the focal length F is inaccurate, the horizontal angular orientation α calculated by the formula (1) alone is also inaccurate, but the horizontal field angle, the horizontal dimension, and the horizontal distance are accurate, so that the formula (3) is derived by combining the formula (1) and the formula (2), the use of an inaccurate focal length during the calculation process can be avoided, and the horizontal angular orientation α calculated by the formula (3) can accurately represent the horizontal angular orientation between the target sound source and the image acquisition assembly.

Accordingly, the horizontal angle of view, the horizontal dimension, and the horizontal distance are replaced with the vertical angle of view, the vertical dimension, and the vertical distance, so that the accurate vertical angle and the direction can be calculated based on the triangle similarity principle and the formula.

Using the focal length, the real face height, and the face pixel height to obtain the target distance, as shown in fig. 4, the following formula can be obtained:

based on the triangle similarity principle, half of the vertical field angle is the same as the azimuth angle gamma, so that gamma represents half of the vertical field angle; h is the vertical distance.

The calculation formula of the focal length F obtained by transforming the formula (1) is as follows:

as shown in fig. 5, the image capturing component images an actual face 41 on an imaging screen 43 through a lens 42, and based on the triangle similarity principle, a ratio of a real height of the face to a target distance is equal to a ratio of a pixel height of the face to a focal length, and the formula is as follows:

wherein h is the true height of the face; d is the target distance; hp is the face pixel height.

Combining equation (5) and equation (6) results in the following equation:

based on equation (7), the target distance can be calculated.

Step 202, generating an acoustic image based on the audio signal collected by the audio collection component.

Step 203, determining a target relative position relationship between the target sound source and the audio acquisition component in each relative position relationship using the first position information.

The acoustic image includes audio peaks corresponding to sound sources, each of which includes a target sound source desired by a user (e.g., a participant in a conference room) and may also include noise not desired by the user. Therefore, it is necessary to obtain a target peak corresponding to the target sound source among the audio peaks.

Since the image coordinate system used by the image acquisition component and the sound image coordinate system used by the audio acquisition component are different, the first position information needs to be mapped into the sound image to be able to determine the target relative position relationship between the target sound source and the audio acquisition component in each relative position relationship.

In this embodiment, determining the target relative positional relationship between the target sound source and the audio acquisition component in each relative positional relationship using the first positional information includes: acquiring second position information of the first position information corresponding to the sound image; determining audio peaks of which the relative position relations are not matched with the second position information in the audio peaks of the sound image as false audio peaks; and deleting the false audio peak values in the audio peak values of the sound image to obtain the target relative position relation indicated by each deleted audio peak value.

Wherein the second location information comprises a first position and a first distance of the audio acquisition component from the target sound source. The first direction may be a horizontal angular direction or a vertical angular direction, and this embodiment does not limit the implementation manner of the first direction.

The relative positional relationship of the target includes a second orientation and a second distance of the audio acquisition component to the target sound source, the second orientation being of the same type as the first orientation, such as: the first azimuth is a horizontal angular azimuth, the second azimuth is also a horizontal angular azimuth, and the first azimuth is a vertical angular azimuth, the second azimuth is also a vertical angular azimuth.

In one example, the relative positional relationship of the audio capture assembly and the image capture assembly in the venue is fixed, at which time the coordinate transformation relationship between the image coordinate system and the audio coordinate system is fixed. Acquiring second position information of the first position information corresponding to the sound image, wherein the second position information comprises the following steps: determining a relative position of the target sound source relative to the audio acquisition component in the image coordinate system using the first position information; and converting the relative position to an audio coordinate system by using a coordinate conversion relation between the image coordinate system and the audio coordinate system to obtain second position information.

Alternatively, a line connecting the installation position of the image capturing component and the installation position of the audio capturing component is perpendicular to the horizontal plane, and at this time, the relative position of the target sound source with respect to the image coordinate system is approximate to the relative position of the target sound source with respect to the audio capturing component, so that when the relative position of the target sound source with respect to the audio capturing component in the image coordinate system is determined using the first position information, the first position information can be directly used as the relative position.

Or the installation position of the image acquisition component relative to the audio acquisition component is fixed, when the relative position of the target sound source relative to the audio acquisition component in the image coordinate system is determined by using the first position information, the installation position relationship of the image acquisition component relative to the audio acquisition component is acquired, and the first position information is converted into the relative position of the target sound source relative to the audio acquisition component by using the installation position relationship.

In other embodiments, the capture range of the image capture component may include where the audio capture component is located. At this time, the electronic device may further determine the relative position of the audio capture component with respect to the image capture component based on the position of the audio capture component in the image, and the embodiment does not limit the manner of determining the relative position.

Illustratively, the image coordinate system is established based on an intermediate baseline of the image, such as: the middle point of the middle base line is used as the origin of the image coordinate system, the middle base line is used as the y-axis of the image coordinate system, and the direction perpendicular to the middle base line is used as the x-axis of the image coordinate system.

For example, using the intermediate baseline 35 of the image in fig. 3 as a mapping reference, an image coordinate system is established in which first location information for each participant 30 relative to the image acquisition assembly can be determined.

Since distortion occurs during image acquisition and since the true position and the true feature of each target sound source are different, but since the target direction calculated by the above formula (3) is accurate, the target distance in the acquired image is slightly different from the true distance information of the target sound source, and therefore, when the first position information is converted into the second position information in the acoustic image, the first distance converted from the target direction may be different from the true distance information. Based on this, in this embodiment, the first distance in the second position information may also be corrected using the relative positional relationship of the respective sound sources in the acoustic image.

In this embodiment, determining false audio peaks in which the relative position relationships of the false audio peaks in the acoustic image do not match the second position information at least includes the following steps:

step 1, in the acoustic image, obtaining a distance difference between the second distance and the first distance.

And 2, determining a false audio peak value of which the second direction is not matched with the first direction and/or the distance difference value is larger than a preset distance threshold.

The preset distance threshold may be pre-stored in the electronic device, or may be input by a user, and the implementation manner of the preset distance threshold is not limited in this embodiment.

For example, as shown in fig. 6, the second azimuth corresponding to the false audio peak 61 is not matched with the azimuth angles r1, r2, r3,0, 11, 12, and 13 corresponding to each target sound source, and the distance difference between the second distance corresponding to the false audio peak 61 and the first distance corresponding to each target sound source is also greater than the preset distance threshold; the spurious audio peaks 61 are replaced by spurious audio peaks 62 and the spurious audio peaks 62 can be determined using the same principles. The azimuth angle in this embodiment is calculated to the left and right respectively with the vertical angle being 0, and in other embodiments, the azimuth angle may also be calculated to the clockwise direction with the left horizontal line being 0, or calculated to the counterclockwise direction with the right horizontal line being 0, and this embodiment does not limit the implementation manner of the azimuth angle.

Although the second azimuth corresponding to the false peak 63 matches the azimuth 0, the distance difference between the second distance corresponding to the false audio peak 63 and each target sound source is greater than the preset distance threshold.

Alternatively, in the case where the audio collecting component is damaged or the target sound source has not all entered the meeting place, the number of audio peaks in the sound image collected in the meeting place may be smaller than the number of the target sound sources. Therefore, before determining a false audio peak whose relative positional relationship does not match the second positional information among the respective audio peaks of the acoustic image, it may be determined whether the number of audio peaks in the acoustic image is greater than or equal to the number of target sound sources.

In this embodiment, determining false audio peaks in the sound image whose relative position relationship does not match the second position information includes: determining a false audio peak value with the unmatched relative position relation with the second position information in each audio peak value of the sound image under the condition that the number of the audio peak values in the sound image is larger than or equal to that of the target sound sources;

and under the condition that the number of the audio peaks in the acoustic image is smaller than that of the target sound sources, re-acquiring the acoustic image after a preset time interval until the number of the audio peaks in the acoustic image is larger than or equal to that of the target sound sources or the number of times of acquiring the acoustic image reaches a preset number threshold. The preset time interval and the preset time threshold may be pre-stored in the electronic device, may be input by a user, or may be obtained from other devices.

Specifically, the preset time interval is 15s, and the preset time threshold is 3 times; or the preset time interval is 40s, and the preset time threshold is 2 times. Of course, the preset time interval and the preset time threshold may be other values, and the preset time interval and the preset time threshold are not limited in this embodiment.

Under the condition that the frequency of reacquiring the sound image reaches a preset frequency threshold, controlling the electronic equipment to output an audio component abnormal prompt, such as: text information, voice information, or the like. The present embodiment does not limit the implementation manner of the exception prompt.

In some special conferences, the types of participants speaking in the conference hall are limited, and at this time, the audio peak corresponding to a target sound source other than the target type should also be determined as a false audio peak. Thus, determining a target relative positional relationship between the target sound source and the audio acquisition component in the respective relative positional relationships using the first positional information comprises: and determining a false audio peak value of which the relative position relation is not matched with the first position information of the target sound source of the target type in each audio peak value of the sound image, and deleting the false audio peak value in the audio peak value of the sound image to obtain the target relative position relation indicated by each deleted audio peak value.

The above description is referred to a way of determining a false audio peak value whose relative position relationship does not match with the first position information of the target sound source of the target type, and the difference is that the target sound source is replaced by the target sound source of the target type, which is not described herein again in this embodiment.

In other embodiments, determining a target relative positional relationship between the target sound source and the audio acquisition component in respective relative positional relationships using the first positional information comprises: and determining a real audio peak value of which the relative position relation is matched with the second position information of the target sound source of the target type in each audio peak value of the sound image to obtain a target relative position relation indicated by the real audio peak value.

Wherein, the real audio peak refers to an audio peak formed in the sound image by the target sound source that the user desires to obtain.

Accordingly, distortion may occur during image acquisition, and since the actual position and the actual characteristic of each target sound source are different, the first position information in the acquired image and the actual position information of the target sound source have slight difference, and thus the second position information converted into the sound image may also have difference from the actual position information.

In this embodiment, determining, in each audio peak of the acoustic image, a true audio peak whose each relative position relationship matches the second position information at least includes the following steps:

and step 1, in the acoustic image, the distance difference between the second distance and the first distance.

And 2, determining the real audio peak value of which the second direction is matched with the first direction and the distance difference value is less than or equal to a preset distance threshold value.

And step 204, determining a target beam and a volume gain of the target sound source based on the target relative position relation.

And setting a target beam corresponding to the target sound source by using the second azimuth and the second distance in the relative position relationship of the target, wherein at the moment, the audio acquisition component only acquires the sound in the target beam, suppresses the sound outside the target beam and can reduce the interference of noise in the conference field.

Specifically, as shown in fig. 7, different target beams 71 are respectively set for target sound sources 30 with different directions and distances, the different target beams 71 point to different target sound sources 30, and the target beams 71 correspond to the target sound sources 30 one to one, and at this time, the audio acquisition component 110 only acquires audio signals of the target sound sources 30 in the direction of the target beams 71.

The volume gain is used for adjusting the volume of the audio data of each target sound source, so that the volume of the audio data of each target sound source tends to be consistent, and the listening law of a user is better met.

Illustratively, determining a volume gain of the target sound source based on the target relative positional relationship includes: determining an intermediate distance by using the maximum distance and the minimum distance indicated by the relative position relation of the targets; acquiring a standard volume gain corresponding to the intermediate distance; determining a gain difference between a volume gain of each target sound source and a standard volume gain based on a difference between each target sound source and the intermediate distance; the volume gain of each target sound source is determined based on the standard volume gain and the gain difference.

The maximum distance and the minimum distance are obtained by pairwise comparison among all distances indicated by the relative position relationship, and the obtaining manner of the maximum distance and the minimum distance is not limited in this embodiment.

Determining the intermediate distance using the maximum distance and the minimum distance includes: and determining a median value between the maximum distance and the minimum distance to obtain an intermediate distance.

The standard volume gain corresponding to the intermediate distance is stored in the electronic device in advance, and the standard volume gain may be input by the user or acquired from other devices, and the embodiment does not limit the manner of acquiring the standard volume gain.

The standard volume gain may be set to 12 db or 6 db, although the standard volume gain may be other values, and the present embodiment does not limit the preset value of the standard volume.

And acquiring the difference between the corresponding distance and the middle distance of other target sound sources according to the relative position relationship, and distributing the volume gains of other target sound sources according to the difference and the standard volume gain.

In one example, the difference between the distance corresponding to the other target sound source and the intermediate distance is obtained by the relative position relationship, and the volume gain of the other target sound source is allocated according to the difference and the standard volume gain by the following formula:

wherein D is _mid Represents the middle distance, gbase _ mid represents the standard toneThe quantity gain, n is the coding of the target beam, the coding order of the target beam is shown in fig. 8, and the target beam is coded as 1, 2.., 7 sequentially from the leftmost side, in other embodiments, the target beam may be coded in other coding orders, and the coding order of the target beam is not limited in this embodiment; gbase _ n indicates the volume gain of the nth entry marked beam; d _n And marking a second distance between the target sound source corresponding to the beam and the audio acquisition component for the nth item.

In another example, acquiring a difference between a distance corresponding to the other target sound source and the intermediate distance by using the relative position relationship, and allocating a volume gain of the other target sound source according to the difference and a standard volume gain, further includes: obtaining a difference value between the distance corresponding to other target sound sources and the middle distance, wherein the difference value has positive and negative values, when the difference value is a positive number, the distance corresponding to the target sound source is larger than the middle distance, and a volume gain larger than a standard volume gain is distributed; when the difference is positive, it is indicated that the distance corresponding to the target sound source is smaller than the intermediate distance, and a volume gain smaller than the standard volume gain is allocated; at this time, the standard volume gain is positively correlated with the difference. In practical implementation, the volume gains of other target sound sources may be allocated according to other differences and the standard volume gain, which is not limited in this embodiment.

Wherein, different differences correspond to different volume gains, and the corresponding relationship between the differences and the volume gains is pre-stored in the electronic equipment.

Because the volume of the sound is continuously changed when the target sound source produces the sound, although the volume gain is set to make the volume of the target sound source tend to be consistent, the problem that the speech of the target sound source is unclear because the adjusted volume is too large or too small still exists.

Optionally, determining a volume gain of the target sound source based on the target relative position relationship includes: determining a first gain value of a target sound source based on the relative position relation of the target; determining the volume difference between the audio volume obtained by adjusting the audio data according to the first gain value and the preset standard volume; determining a second gain value of the target sound source using the volume difference; the volume gain of the target sound source is determined based on the first gain value and the second gain value.

Wherein the first gain value is in positive correlation with the distance indicated by the target relative position relation.

Determining a volume difference between an audio volume obtained by adjusting the audio data according to the first gain value and a preset standard volume, including: and determining the difference value between the audio volume of the audio data of the target sound source adjusted according to the first gain and the preset standard volume to obtain the volume difference.

The preset standard volume is used for indicating the volume which can enable a user to clearly listen to the speech of the target sound source and can not be too harsh and accords with the listening rule of the user. The preset standard volume may be different according to the conference, for example, in a conference place where discussion is allowed, the sound discussed by the conference participants affects the volume that the conference participants can hear, and in this case, the preset standard volume may be 60 db or other values. The present embodiment does not limit the value of the preset standard volume. The preset standard volume may be set by a user, or may also be a default value stored in the electronic device, and the setting manner of the preset standard volume is not limited in this embodiment.

In practical implementation, the volume difference may also be obtained by using other manners, such as a ratio between the audio volume and the standard volume, and the implementation manner of the volume difference is not limited in this embodiment.

The second gain value of the target sound source is determined by using the volume difference, and the second gain value can be increased on the basis of the first gain value, so that the adjusted volume is not too large or too small, and the listening law of a user is better met. Optionally, since the target sound source is not fixed in the meeting place, the relative position relationship between the target sound source and the audio collecting component also changes, and at this time, the target beam cannot collect the audio signal of the target sound source, so after the target preset time interval, the electronic device is controlled to perform step 201 to step 204 again, so that the target beam changes in time along with the movement of the target sound source, and the audio collecting component can accurately and timely set and adjust the target beam.

Step 205, the volume gain is used to adjust the audio volume of the audio data collected according to the target beam.

Because the speaker and the ordinary participants exist in the meeting place, generally in the meeting, the speaker is responsible for explaining important contents of the meeting or promoting the meeting process to be carried out, so that when the speaker speaks, the sound of the speaker should be highlighted as much as possible, and the situation that the sound of the speaker is covered by the sound discussed by the participants to delay the meeting is avoided.

Alternatively, a conference participant who interacts with the speaker is present in the conference, and the speech of the conference participant is usually related to the speech of the speaker, that is, to important contents of the conference.

Or, in the conference, private discussion is performed among participants for some problems, and at this time, the sound for private discussion among the participants can be regarded as noise, and is usually small, but the sound gain affects other participants who are speaking.

Therefore, it is necessary to determine whether the target sound source satisfies a preset volume adjustment condition to determine whether to adjust the volume of the conference participant. Under the condition that the target sound source meets the volume adjusting condition, triggering and executing a step of adjusting the volume of the audio data acquired according to the target wave beam by using the volume gain; and under the condition that the target sound source does not meet the volume adjusting condition, if the volume of the audio data is greater than or equal to a preset volume threshold value, inhibiting the audio volume of the audio data.

Wherein, the volume adjusting condition comprises at least one of the following: the target sound source is in a speaking state; the target sound source is a speaker; the target sound source is a participant who interacts with the speaker.

Specifically, the manner of determining whether the target sound source is in the speaking state includes, but is not limited to: judging whether the target sound source is in a speaking state or not according to the opening and closing state of the lip of the target sound source; or, whether the target sound source is in the speaking state or not can be determined according to the face area of the target sound source; or, whether the target sound source is a speaker can be judged according to the identity information of the target sound source; or, whether the participant is interacting with the speaker may be determined according to the posture information, and whether the target sound source satisfies the preset volume adjustment condition may also be determined according to other manners in actual implementation, where the determination manner of the volume adjustment condition is not limited in this embodiment.

The identity information may be input by a user or acquired from other devices, where the identity information refers to information that can indicate whether a participant is a speaker, and the identity information includes but is not limited to: gender, name, special labeling, etc.

Optionally, suppressing audio volume includes, but is not limited to: reducing the volume gain to 0; alternatively, the audio volume of the audio data corresponding to the target sound source may be directly reduced in another manner.

Next, the audio processing method provided in this embodiment is described with a specific embodiment, in this example, the target type is a participant and a speaker, as shown in fig. 9.

Step 901, mapping acoustic imaging and optical imaging coordinates, and converting first position information determined based on an image acquired by an image acquisition assembly into second position information in an acoustic image.

Step 902, deleting false audio peaks in the acoustic image by combining the second position information to obtain a target relative position relation between a target sound source and the audio acquisition assembly; determining the intermediate distance between a target sound source and an audio acquisition component by using the relative position relation of the target, and setting the intermediate distance as the standard volume gain of a first gain value; and determining the gain difference by using the difference between the distance between each target sound source and the audio acquisition component and the intermediate distance, and obtaining the volume gain of each target sound source.

Step 903, determining whether the conference is finished, and if the conference is judged to be finished, ending the process; if the conference is not judged to be ended, judging whether the target sound source is in a speaking state or not; if the target sound source is determined to be in the speaking state, step 904 is executed, and if the target sound source is determined not to be in the speaking state, step 915 is executed;

step 904, determining the target type of the target sound source; in a case where it is determined that the target type of the target sound source is a speaker, step 905 is performed; in case that it is determined that the target type of the target sound source is the conference participant, performing step 907;

step 905, comparing the audio volume of the audio data of the speaker with a preset standard volume to obtain a volume difference; setting a second gain value based on the volume difference; adding the beam information of the target beam corresponding to the speaker into a beam protection pool;

the beam protection pool is used for storing beam information of a target beam needing volume adjustment. The beam information may be a number of a target beam or an account number of a target sound source, and the implementation of the beam information is not limited in this embodiment.

Step 906, weakening the audio volume of the unprotected audio data collected in the target beam outside the beam protection pool to highlight the audio volume of the protected audio data collected in the target beam inside the beam protection pool; step 903 is executed;

optionally, there is at least a 6 decibel difference between the audio volume of the non-protected audio data and the audio volume of the protected audio data. The first order volume is 3 db when audio is played, and the adjustment of the first order volume is not easy to be prominent in the meeting place and still submerged in the speech of the participants, so there is at least a difference of two orders volume, i.e. 6 db.

Step 907, adding the beam information of the target beam corresponding to the participant into a speaking statistics pool of the participant; step 908 is performed;

the conference participant speaking statistic pool is used for storing the beam information of the target beam corresponding to the conference participant in the speaking state. Since steps 905 and 907 are repulsive for the same target sound source, the beam information of the target beam corresponding to the speaker does not join the conference participant speaking statistics pool.

Step 908, comparing the audio volume of the audio data of the target sound source in the conference participant speaking statistic pool with a preset standard volume to obtain a volume difference, and setting a second gain value based on the volume difference; step 909 is executed;

step 909, judging whether the conference participants in the conference participant speaking statistic pool interact with the speaker; if the conference participants in the conference participant speech statistics pool are judged to be interactive with the speaker, executing step 910; executing step 911 under the condition that the conference participants in the conference participant speaking statistics pool are not interacted with the speaker;

step 910, adding the beam information of the target beam corresponding to the participant into a beam protection pool; step 906 is executed;

and target beams corresponding to the participants in the beam protection pool are moved from the participant speech statistics pool.

Step 911, determining whether the target beam corresponding to the participant is in the beam protection pool, and if so, clearing the target beam out of the beam protection pool; step 912 is executed;

step 912, determining whether the conference participant outside the beam protection pool is in a speaking state; in the case that the out-of-pool conferee is in the talk state, step 913 is performed; if the out-of-pool conferee is not in the speaking state, executing step 906;

step 913, determining whether the volume of the participants outside the beam protection pool reaches a preset volume threshold; executing step 914 when the volume of the participants outside the beam protection pool reaches a preset volume threshold; when the volume of the participant outside the beam protection pool does not reach the preset volume threshold, executing step 903;

step 914, the volume of the participants outside the beam protection pool is attenuated to be below a preset volume threshold; step 903 is executed;

step 915, judging the target type of the target sound source; in the case where it is determined that the target sound source is the speaker, step 916 is performed; in the case where it is determined that the target sound source is a participant, step 917 is performed;

step 916, the audio volume of the non-protection audio data collected in the target beam outside the beam protection pool is weakened, whether the beam information of the target beam corresponding to the speaker is in the beam protection pool is judged, if the beam information of the target beam corresponding to the speaker is in the beam protection pool, the beam information is cleared out of the beam protection pool, and step 903 is executed.

Step 917, firstly, determining whether the beam information of the target beam corresponding to the participant is in the participant speech statistics pool, if the beam information of the target beam corresponding to the participant is in the participant speech statistics pool, clearing the beam information out of the participant speech statistics pool, and then resetting the target beam corresponding to the participant and the volume gain, and executing step 903.

In summary, in the audio processing method provided in this embodiment, the first position information of the target sound source is obtained by performing image recognition on the image acquired by the image acquisition component in the conference scene; generating an acoustic image based on the audio signal acquired by the audio acquisition component, wherein the acoustic image is used for indicating each relative position relationship between each sound source in the conference scene and the audio acquisition component; determining a target relative positional relationship between the target sound source and the audio acquisition component in each relative positional relationship using the first positional information; determining a target beam and a volume gain of a target sound source based on the target relative positional relationship; adjusting an audio volume of the audio data collected according to the target beam using the volume gain; the problem that the sound difference between the participants is large due to the fact that the sound of each participant is different in size and the distance between each participant and the sensor is different can be solved; because the acoustic image is used for indicating the relative position relations of the sound sources in the conference, the relative position relations of the sound sources corresponding to interference factors such as noise, echo and reverberation may also be included, and the target sound source can be determined and the first position information can be acquired by using the image, so that the target relative position relation is determined in each relative position relation by using the first position information, the relative position relations of the interference factors can be eliminated under the condition that the interference factors such as noise, echo and reverberation exist in the acoustic image, so that the target relative position relation corresponding to the target sound source is acquired, and the relative position information acquired in the acoustic image is more accurate, so that the accurate position information of the target sound source can be acquired, and the target beam and the volume gain are set for the target sound source, so that the volumes between the target sound sources do not differ too much, the quality of audio data in the conference can be improved, and the audio volumes of the target sound sources can be consistent.

Meanwhile, the target relative position relation between the target sound source and the audio acquisition assembly is determined by combining the first position information and each relative position relation, so that the problem that the target wave beam deviates from the target sound source due to inaccurate first position information caused by image distortion can be avoided; the problem that the target wave beam points to noise due to the fact that noise can also form audio peaks in a sound image can be avoided; and determining a target relative position relation corresponding to the target sound source by combining the first position information and each relative position relation, thereby accurately setting a target beam pointing to the target sound source, adjusting the audio volume of the audio data of the target sound source collected in the target beam through the volume gain, enabling the volume of each target sound source to tend to accord with the listening rule of a user, and improving the quality of conference audio.

In addition, the first position information of the target sound source of the target type is obtained, the false audio peak value of which the relative position relation is not matched with the first position and azimuth information of the target sound source of the target type is determined, different target sound sources can be flexibly determined according to different meeting scenes, and the flexibility of the electronic equipment can be improved.

In addition, the volume gain of each target sound source is determined according to the standard volume gain corresponding to the intermediate distance, so that the volumes of the target sound sources tend to be consistent, the listening rule of a user is met, meanwhile, the corresponding relation between the preset distance and the volume gain can be reduced, and the storage resources of the electronic equipment are saved.

Fig. 10 is a block diagram of an apparatus of an audio processing method according to an embodiment of the present application. The device at least comprises the following modules: a first position module 1010, a sound image generation module 1020, a relative position module 1030, a volume gain module 1040, and a volume adjustment module 1050.

The first position module 1010 is configured to perform image recognition on an image acquired by the image acquisition component in a conference scene to obtain first position information of the target sound source.

And a sound image generating module 1020 for generating a sound image based on the audio signal collected by the audio collecting component.

The relative position module 1030 determines a target relative positional relationship between the target sound source and the audio acquisition component among the respective relative positional relationships using the first positional information.

The volume gain module 1040 determines a target beam and a volume gain of a target sound source based on the target relative positional relationship.

The volume adjustment module 1050 adjusts an audio volume of the audio data collected according to the target beam using a volume gain.

For relevant details reference is made to the above-described method embodiments.

It should be noted that: in the device of the audio processing method provided in the foregoing embodiment, when processing audio data, only the division of the functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device of the audio processing method is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the apparatus of the audio processing method provided in the foregoing embodiment and the audio processing method embodiment belong to the same concept, and specific implementation processes thereof are described in the method embodiment and are not described herein again.

Fig. 11 is an electronic device provided in an embodiment of the present application. The electronic device, which may be the electronic device described in fig. 1 or another device communicatively connected to the electronic device, includes at least a processor 1101 and a memory 1102.

Processor 1101 may include one or more processing cores such as: 4 core processors, 8 core processors, etc. The processor 1101 may be implemented in at least one of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement the audio processing methods provided by the method embodiments herein.

In some embodiments, the electronic device may further include: a peripheral interface and at least one peripheral. The processor 1101, memory 1102 and peripheral interface may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface via a bus, signal line, or circuit board. Illustratively, peripheral devices include, but are not limited to: radio frequency circuit, touch display screen, audio circuit, power supply, etc.

Of course, the electronic device may include fewer or more components, which is not limited by the embodiment.

Optionally, the present application further provides a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the audio processing method of the above-mentioned method embodiment.

Optionally, the present application further provides a computer product, which includes a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the audio processing method of the above-mentioned method embodiment.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims

1. A method of audio processing, the method comprising:

generating an acoustic image based on audio signals acquired by an audio acquisition component, wherein the acoustic image is used for indicating each relative position relationship between each sound source in the conference scene and the audio acquisition component;

determining a target relative positional relationship between the target sound source and the audio acquisition component in each relative positional relationship using the first positional information;

determining a target beam and a volume gain of the target sound source based on the target relative positional relationship, wherein the volume gain is in a positive correlation with the distance indicated by the target relative positional relationship;

2. The method of claim 1, wherein the first location information comprises a target position and a target distance, the target position comprising a horizontal angular position and a vertical angular position;

acquiring a screen pixel size, a screen central point coordinate, a horizontal field angle, a vertical field angle and a human face real height of the image acquisition assembly, wherein the screen pixel size comprises a horizontal size and a vertical size;

3. The method according to claim 1, wherein the acoustic image comprises audio peaks of respective sound sources, and positions of the audio peaks are used for indicating the respective relative positional relationships; the determining a target relative positional relationship between the target sound source and the audio acquisition component in respective relative positional relationships using the first positional information includes:

determining false audio peaks in the sound image, wherein the relative position relations of the false audio peaks are not matched with the second position information;

4. The method according to claim 3, wherein the second positional information includes a first distance and a first orientation between the audio capturing component and the target sound source, and the respective relative positional relationships include a second distance and a second orientation;

the determining false audio peaks of the sound image, of which the relative position relationships do not match with the second position information, includes:

5. The method of claim 3, wherein the determining false audio peaks in the sound image whose relative position relationship does not match the second position information comprises:

6. The method according to claim 1, wherein the determining the volume gain of the target sound source based on the target relative position relationship comprises:

acquiring a standard volume gain corresponding to the intermediate distance;

7. The method according to claim 1, wherein the determining a volume gain of the target sound source based on the target relative positional relationship comprises:

8. The method according to claim 1, characterized in that it comprises:

9. The method of claim 7, further comprising:

10. An electronic device, characterized in that the electronic device comprises a processor and a memory connected to the processor, in which memory a program is stored, which program, when executed by the processor, is adapted to carry out the audio processing method of any of claims 1 to 9.

11. A computer-readable storage medium, characterized in that the storage medium has stored therein a program which, when executed by a processor, is adapted to implement the audio processing method of any of claims 1 to 9.