CN115589566A - Audio focusing method and device, storage medium and electronic equipment - Google Patents

Audio focusing method and device, storage medium and electronic equipment

Info

Publication number
CN115589566A
Authority
CN
China
Prior art keywords
signal
target
electronic device
determining
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211180723.5A
Other languages
Chinese (zh)
Inventor
刘念
史润宇
余俊飞
吕柱良
刘晗宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202211180723.5A priority Critical patent/CN115589566A/en
Publication of CN115589566A publication Critical patent/CN115589566A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2430/00 Signal processing covered by H04R, not provided for in its groups
    • H04R 2430/20 Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure provides an audio focusing method and apparatus, a storage medium, and an electronic device. The audio focusing method includes: determining a target spatial orientation, relative to the electronic device, of a user who is using the electronic device for a call; determining a target array signal received by at least one microphone on the electronic device; and focusing, according to the target array signal, to obtain an audio signal from the target spatial orientation. In the audio focusing method provided by the embodiments of the present disclosure, the audio signal from the target spatial orientation is obtained by focusing the array signals received by a plurality of microphones, which can improve the accuracy of the focused audio signal.

Description

Audio focusing method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of signal processing technologies, and in particular, to an audio focusing method and apparatus, a storage medium, and an electronic device.
Background
With the diversification of multichannel audio applications, spatial audio has received a great deal of attention. Spatial audio combines sound-source direction information with spatial information to reconstruct a virtual sound field, creating a strong sense of space for the user. The ways in which spatial audio is acquired, processed, and reproduced are increasingly diverse; one effective way to analyze and reproduce spatial audio is directional audio coding, which mainly involves two processes: beamforming and spatial audio signal processing.
Existing audio focusing methods generally determine the orientation of an audio signal from the time difference with which microphones at different positions on an electronic device receive the same audio signal, and then perform audio focusing according to the acquired orientation information. In practice, only a small number of microphones are provided on an electronic device, which makes the acquired focused audio signal inaccurate.
Disclosure of Invention
In view of this, the disclosed embodiments provide an audio focusing method and apparatus, a storage medium, and an electronic device.
According to a first aspect of the present disclosure, there is provided an audio focusing method, the method comprising:
determining a target spatial orientation of a user using the electronic device for a call relative to the electronic device;
determining a target array signal received by at least one microphone on the electronic device;
and according to the target array signal, focusing to obtain an audio signal from the target space direction.
In combination with any implementation manner provided by the present disclosure, the determining a target spatial orientation of a user using the electronic device for a call with respect to the electronic device includes:
acquiring at least one frame of image including the user;
determining two-dimensional coordinates of the user in at least one frame of the image;
and determining the target space orientation according to the corresponding relation between the two-dimensional coordinates and the space orientation.
In combination with any embodiment provided by the present disclosure, the determining a target array signal received by at least one microphone on the electronic device includes:
determining a candidate array signal corresponding to all physical microphones based on the first signal values received by each physical microphone;
determining a second signal value corresponding to each virtual microphone based at least on the first signal values;
and inserting all the second signal values into the candidate array signal to obtain the target array signal.
In combination with any embodiment provided by the present disclosure, the determining, based on at least the first signal value, a second signal value corresponding to each virtual microphone includes:
determining a first steering vector corresponding to each of the physical microphones;
determining a second steering vector corresponding to a first virtual microphone according to the angle value indicated by the target spatial orientation and the position of the first virtual microphone on the electronic device;
determining a conversion vector based on at least one of the first steering vectors and the second steering vector; wherein the conversion vector is used for representing the conversion relation between each first steering vector and the second steering vector;
and obtaining the second signal value corresponding to the first virtual microphone based on the product of each first signal value and the conversion vector.
In connection with any of the embodiments provided by the present disclosure, the spacing between any two microphones is less than half the wavelength of the audio signal.
In combination with any one of the embodiments provided by the present disclosure, the focusing, according to the target array signal, to obtain an audio signal from the target spatial orientation includes:
determining a covariance matrix of the array signals;
respectively adding noise values at a plurality of designated positions in the covariance matrix to obtain a corrected covariance matrix;
focusing to obtain an audio signal from the target space direction according to an MVDR algorithm based on the corrected covariance matrix and the target guide vector; wherein the target steering vector is a steering vector corresponding to the plurality of microphones.
In combination with any one of the embodiments provided by the present disclosure, the adding noise values to a plurality of designated positions in the covariance matrix to obtain a modified covariance matrix includes:
determining a second noise matrix according to the product of a first noise matrix formed by a plurality of noise values and an identity matrix;
and calculating the sum of the covariance matrix and the second noise matrix to obtain the corrected covariance matrix.
In combination with any one of the embodiments provided by the disclosure, the specified position is a position on a main diagonal in the covariance matrix.
According to a second aspect of the present disclosure, there is provided an audio focusing apparatus, the apparatus comprising:
the target space orientation determining module is used for determining the target space orientation of a user using the electronic equipment for communication relative to the electronic equipment;
a target array signal determination module for determining a target array signal received by at least one microphone on the electronic device;
and the audio signal focusing module is used for focusing to obtain an audio signal from the target space direction according to the target array signal.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon machine readable instructions which, when invoked and executed by a processor, cause the processor to implement an audio focusing method of any embodiment of the present disclosure.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus comprising
A processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio focusing method of any of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the audio focusing method and device, the storage medium and the electronic equipment provided by the embodiment of the disclosure, the target space direction of a user who uses the electronic equipment for communication relative to the electronic equipment is determined; determining an array signal received by at least one microphone on the electronic equipment, and focusing to obtain an audio signal from the target space direction according to the array signal. The audio signals of the target space direction of the user relative to the electronic equipment are obtained through focusing of the array signals received by the microphones, the audio signals from the target space direction are effectively enhanced, and the purpose of improving the accuracy of the focused audio signals is achieved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a method of audio focusing according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart diagram illustrating another audio focusing method of the present disclosure in accordance with an exemplary embodiment;
FIG. 3 is a flow chart illustrating another audio focusing method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a user's left and right ears receiving audio signals according to an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating an audio focusing apparatus according to an exemplary embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device shown in accordance with an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "at the time of" or "when" or "in response to a determination," depending on the context.
An audio focusing method according to an embodiment of the present disclosure is described in detail below with reference to the drawings.
Fig. 1 is a flow chart illustrating an audio focusing method according to an exemplary embodiment of the present disclosure. The method of this exemplary embodiment may be applied to electronic devices including, but not limited to, smart watches, smart phones, televisions, computers, video recorders, and the like. As shown in FIG. 1, the method may include the following steps:
in step 100, a target spatial orientation of a user engaged in a call using the electronic device relative to the electronic device is determined.
Wherein the call comprises a video call and a voice call.
In an alternative example, an image containing the user may be captured by the current video call interface while the user is using the electronic device for a video call. In another optional example, when the user is using the electronic device to perform a voice call, an image containing the user may be automatically acquired by the electronic device through a camera of the electronic device.
After the image containing the user is acquired, the user in the image may be identified using a pre-trained image recognition neural network. Specifically, a designated part of the user in the image may be identified, including but not limited to the head or the facial features. The present disclosure is not limited thereto. After the user in the image is identified, the position information of the user in the image may be further determined; in particular, the two-dimensional coordinates of the user in the image may be determined. The target spatial orientation of the user relative to the electronic device is then obtained according to the determined two-dimensional coordinates.
In one example, a coordinate system may be established with the center of the lower edge of the acquired image as the origin, rightward as the positive direction of the X-axis, and upward as the positive direction of the Y-axis. Alternatively, a coordinate system may be established with the exact center of the acquired image as the origin, rightward as the positive direction of the X-axis, and upward as the positive direction of the Y-axis. The present disclosure is not limited thereto. The two-dimensional coordinates of the user in the image are then determined; assume they are (-20, 10).
Further, the target spatial orientation may be determined by a preset correspondence between the two-dimensional coordinates and the spatial orientation. In the present disclosure, the target spatial orientation may be represented by a spatial angle value of a straight line formed from a point where the user is located to a center point of the electronic device with respect to a plane where the electronic device is located. For example, the value of the spatial angle corresponding to the two-dimensional coordinate (-20, 10) is 150 degrees. The target spatial orientation is represented as 150 degrees.
It should be noted that, in the present example, the correspondence between the two-dimensional coordinates and the spatial orientation may be preset by the relevant staff based on the actual situation, and the disclosure does not limit this.
In step 102, a target array signal received by at least one microphone on the electronic device is determined.
Wherein the at least one microphone comprises: at least one physical microphone; and at least one virtual microphone.
The physical microphone refers to a microphone actually mounted on the electronic device. For a smart phone used in daily life, two physical microphones are generally installed. The positions and the number of the physical microphones installed on the mobile phones of different models may be different. Each of the physical microphones may receive a corresponding signal value.
The virtual microphone refers to a microphone which is virtually set by related workers based on the existing physical microphone on the smart phone according to a certain rule. The signal value received by each of the virtual microphones may be determined by the signal value received by each of the physical microphones.
An array signal refers to a collection of signals arranged according to a certain rule. In this example, the target array signal refers to a set of signals received by the physical microphone and the virtual microphone.
In an alternative example, after the array signal received by the physical microphone on the electronic device is acquired, audio noise reduction and dereverberation processing may be performed on the array signal, so as to improve the accuracy of the focused audio signal.
The audio noise reduction and dereverberation processing can be implemented based on the prior art, and may specifically refer to the related document, which is not described in detail in this disclosure.
In step 104, an audio signal from the target spatial orientation is focused according to the target array signal.
In this example, according to the obtained target array signal and the obtained target spatial orientation information, a Minimum Variance Distortionless Response (MVDR) algorithm is used to obtain weight parameters corresponding to signals received by the plurality of microphones, and then an audio signal from the target spatial orientation is calculated based on the target array signal and the obtained weight parameters.
According to the audio focusing method provided by the embodiment of the disclosure, a target space orientation of a user who uses the electronic device for communication relative to the electronic device is determined, an array signal received by at least one microphone on the electronic device is determined, and an audio signal from the target space orientation is obtained through focusing according to the array signal. The audio signals of the target space direction of the user relative to the electronic equipment are obtained through focusing of the array signals received by the microphones, the audio signals from the target space direction are effectively enhanced, and the purpose of improving the accuracy of the focused audio signals is achieved.
The following describes the steps of the audio focusing method in detail:
with respect to the step 100, in an optional example, when determining a target spatial orientation of a user using the electronic device for a call with respect to the electronic device, the method may specifically perform the following steps: acquiring at least one frame of image including the user, determining two-dimensional coordinates of the user in the at least one frame of image, and determining the target space orientation according to the corresponding relation between the two-dimensional coordinates and the space orientation.
Specifically, when the user uses the electronic device to perform a video call, an image including the user may be intercepted by a current video call interface. Or, when the user uses the electronic device to perform a voice call, the electronic device may automatically acquire an image including the user through a camera of the electronic device. And then identifying the user in the image through a pre-trained image identification neural network, and further determining the two-dimensional coordinates of the user in the image.
For example, the two-dimensional coordinates of the user may be determined based on a frame of acquired images containing the user. The two-dimensional coordinates of the user may also be determined based on the acquired plurality of frames of images containing the user. Specifically, after the two-dimensional coordinates of the user are determined from each frame of the image of the user, an average value of a plurality of two-dimensional coordinates may be calculated and used as the two-dimensional coordinates of the user in the image.
After the two-dimensional coordinates of the user in the image are determined, the target spatial orientation may be determined based on a correspondence between the two-dimensional coordinates and the spatial orientation set in advance.
For example, the preset correspondence between the two-dimensional coordinates and the spatial orientation may be as shown in table 1 below:
Two-dimensional coordinates    Spatial orientation
(20, 10)                       30 degrees
(20, 20)                       45 degrees
(-20, 10)                      150 degrees
Table 1
For example, when the determined two-dimensional coordinates of the user in the image are (20, 10), it may be determined that the target spatial orientation is 30 degrees based on the above correspondence. When the determined two-dimensional coordinates of the user in the image are (20, 20), it may be determined that the target spatial orientation is 45 degrees based on the above correspondence.
It should be noted that the information shown in table 1 is only for better understanding of the technical solutions of the embodiments of the present disclosure by those skilled in the art. In a specific implementation, the correspondence between the two-dimensional coordinates and the spatial orientation may be preset by a relevant worker based on an actual situation, which is not limited by the present disclosure.
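For illustration, the following is a minimal Python sketch of this lookup, combining the multi-frame averaging mentioned above with a nearest-entry match against Table 1. The dictionary contents, the averaging, and the nearest-neighbor matching are assumptions for demonstration; the disclosure only requires that a preset correspondence be consulted.

```python
import numpy as np

# Hypothetical in-memory version of Table 1; in a real product the mapping
# would be preset by the relevant staff based on the actual situation.
COORD_TO_ORIENTATION = {
    (20, 10): 30.0,
    (20, 20): 45.0,
    (-20, 10): 150.0,
}

def target_spatial_orientation(coords_per_frame):
    """Average the user's 2-D image coordinates over one or more frames,
    then return the spatial orientation (in degrees) of the nearest entry."""
    mean_xy = np.mean(np.asarray(coords_per_frame, dtype=float), axis=0)
    keys = np.array(list(COORD_TO_ORIENTATION), dtype=float)
    nearest = keys[np.argmin(np.linalg.norm(keys - mean_xy, axis=1))]
    return COORD_TO_ORIENTATION[tuple(int(v) for v in nearest)]

print(target_spatial_orientation([(-19, 9), (-21, 11)]))  # -> 150.0
```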
According to the audio focusing method provided by the embodiments of the present disclosure, the target spatial orientation of the user relative to the electronic device is determined from the acquired image containing the user. This avoids the problem in the prior art that, when the target spatial orientation is determined based on the time difference between the signals acquired by two microphones, an inaccurate time difference leads to inaccurate orientation information, and it can therefore improve the accuracy of the focused audio signal to a certain extent.
With respect to step 102 above, in an optional example, when determining a target array signal received by at least one microphone on the electronic device, the method may proceed based on the following steps:
based on the first signal values received by each physical microphone, alternative array signals corresponding to all physical microphones are determined.
A second signal value corresponding to each virtual microphone is determined based at least on the first signal value.
And inserting all the second signal values into the alternative matrix signal to obtain the target array signal.
As previously described, the at least one microphone includes at least one physical microphone, and at least one virtual microphone.
The signal value received by each physical microphone is the first signal value. For example, the signal values received by one physical microphone may form a numsample × 1 matrix, and the signal values received by three physical microphones can be represented as a numsample × 3 matrix. This numsample × 3 matrix is the candidate array signal corresponding to all the physical microphones. Here, numsample is the snapshot number of the acquired signal, i.e., the number of sampling points in the time domain.
The signal value received by each of the virtual microphones is the second signal value. The second signal value may be determined by a signal value received by the physical microphone.
Specifically, in an optional example, the determining process of the second signal value may be performed based on the following steps:
first, a first steering vector corresponding to each of the physical microphones is determined.
The first steering vector is used to characterize a physical relationship between the signal and the physical microphone. In particular, the signals received by the physical microphones at different positions are different for the same signal. The first steering vector is used to characterize the difference. The first steering vector relates only to the position of the physical microphone and the acquired target spatial orientation of the user relative to the electronic device.
In a specific application, the first steering vector corresponding to each physical microphone may be preset according to the position information of each physical microphone. Then, after the target spatial orientation of the user relative to the electronic device is obtained, the first steering vector corresponding to each physical microphone is calculated based on the angle information indicated by the obtained target spatial orientation.
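The disclosure does not give an explicit formula for the steering vectors, so the following sketch substitutes the standard far-field plane-wave model; the 2-D geometry, the assumed speed of sound, and the sign convention are all illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 340.0  # m/s, assumed nominal value

def steering_vector(mic_positions_m, theta_deg, freq_hz):
    """Far-field, narrowband steering vector for microphones at the given
    2-D positions (in meters) and a source at spatial orientation theta_deg.
    The plane-wave model is an assumption; the disclosure only states that
    the vector depends on the microphone positions and the target angle."""
    theta = np.deg2rad(theta_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit direction of arrival
    delays = np.asarray(mic_positions_m) @ direction / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)         # one complex entry per mic
```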
Next, a second steering vector corresponding to the first virtual microphone is determined according to the angle value indicated by the target spatial orientation and the position of the first virtual microphone on the electronic device.
The second steering vector, corresponding to the first steering vector described above, is used to characterize the physical relationship between the signal and the virtual microphone. The second steering vector is related only to the position of the virtual microphone and the acquired target spatial orientation of the user with respect to the electronic device. Therefore, for an electronic device, the position of the first virtual microphone may be preset by the relevant staff, and the form of the second steering vector corresponding to the first virtual microphone can be determined based on that position information. The second steering vector is then calculated based on the angle value indicated by the acquired target spatial orientation.
Then, a conversion vector is determined based on at least the first steering vectors and the second steering vector.
The conversion vector is used to represent the conversion relation between each first steering vector and the second steering vector.
Illustratively, the following scheme is described with three physical microphones, i.e., three first steering vectors.
First, the conversion vector may be determined based on the following formula (1).
b = a₁β₁ + a₂β₂ + a₃β₃ + ε   (1)
where b characterizes the second steering vector; a₁ characterizes the steering vector of the first physical microphone, and β₁ the conversion element between that steering vector and the second steering vector; a₂ and β₂ characterize the steering vector of the second physical microphone and its conversion element; a₃ and β₃ characterize the steering vector of the third physical microphone and its conversion element; and ε characterizes the error.
Then, the conversion vector β can be calculated with a least-squares estimator, which can be expressed as the following formula (2):
β = (aᵀa)⁻¹aᵀb   (2)
where β characterizes the vector formed by β₁, β₂, β₃; a characterizes the matrix formed by the steering vectors a₁, a₂, a₃; and aᵀ characterizes the transpose of the matrix a. Analyzed in connection with practice, in the case of 3 physical microphones the conversion vector β is a 3 × 1 vector.
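A minimal sketch of the least-squares estimate in formula (2), assuming each steering vector is sampled at K points (for example, K frequency bins) so that a forms a K × 3 matrix; the function name and shapes are illustrative.

```python
import numpy as np

def conversion_vector(a_phys, b_virt):
    """Formula (2): beta = (a^T a)^(-1) a^T b, computed as a least-squares
    solve. a_phys stacks the three physical microphones' steering vectors as
    columns (shape K x 3, where K is an assumed number of evaluation points);
    b_virt is the first virtual microphone's steering vector (shape (K,))."""
    beta, *_ = np.linalg.lstsq(a_phys, b_virt, rcond=None)
    return beta  # the 3 x 1 conversion vector (beta_1, beta_2, beta_3)
```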
Finally, the second signal value corresponding to the first virtual microphone may be obtained based on a product of each of the first signal values and the conversion vector.
Specifically, the second signal value corresponding to the first virtual microphone may be determined based on the following formula (3):
x₂ = x₁ · β   (3)
where x₂ characterizes the second signal value received by the first virtual microphone, x₁ characterizes the signal values received by the three physical microphones, and β characterizes the conversion vector.
As noted above, the signal values received by the three physical microphones can be collectively represented as a numsample × 3 matrix, and the conversion vector β is a 3 × 1 vector; by elementary matrix algebra, the computed second signal value is therefore a numsample × 1 matrix.
The above example is a process of determining a signal received by one virtual microphone based on the three physical microphones. When determining the signals received by the other virtual microphones, the determination may also be performed based on the above process, which is not described in detail in this disclosure.
Finally, all the second signal values may be inserted into the candidate array signal formed from the physical microphones' signals to obtain the target array signal.
As can be seen from the foregoing, when the number of physical microphones is 3, the candidate array signal is a numsample × 3 matrix.
For such an electronic device, when the number of virtual microphones is 1, the number of second signal values is also 1, specifically a numsample × 1 matrix. Inserting this second signal value into the candidate array signal yields the target array signal, specifically a numsample × 4 matrix.
Correspondingly, when the number of virtual microphones is 2, the number of second signal values is also 2, specifically two numsample × 1 matrices. Inserting the two second signal values into the candidate array signal yields the target array signal, specifically a numsample × 5 matrix.
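Putting the pieces together, a minimal sketch of assembling the target array signal; the function name, the variable layout, and the handling of several virtual microphones are illustrative assumptions.

```python
import numpy as np

def build_target_array_signal(x_phys, betas):
    """Apply formula (3) once per virtual microphone and append the results
    to the candidate array signal. x_phys is the numsample x 3 candidate
    array signal from the physical microphones; betas is a list holding one
    conversion vector per virtual microphone."""
    virtual_cols = [x_phys @ beta for beta in betas]  # each is numsample x 1
    return np.column_stack([x_phys] + virtual_cols)   # numsample x (3 + V)
```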
In an alternative example, the spacing between any two microphones is less than half the wavelength of the audio signal.
Here, the audio signal wavelength = speed of sound / frequency. Illustratively, the audio signal wavelength may be 340/18000 meters.
In practical applications, the distances between the physical microphones mounted on an electronic device are generally greater than half the wavelength of the audio signal, so relatively large side lobes (possibly even side lobes as high as the main lobe) and grating lobes appear in the directivity pattern, and the signal from the target spatial orientation cannot be accurately enhanced. Therefore, when the positions of the virtual microphones are set on the electronic device, the microphones should be arranged uniformly on the premise that the distance between any two microphones is less than half the wavelength of the audio signal. Here, the microphones include both the physical microphones and the virtual microphones.
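For example, a quick check of this half-wavelength constraint at an 18 kHz upper frequency, with the speed of sound assumed to be 340 m/s as in the wavelength example above:

```python
SPEED_OF_SOUND = 340.0  # m/s, assumed nominal value

def max_microphone_spacing(freq_hz):
    """Half-wavelength rule: spacing must stay below (c / f) / 2, evaluated
    at the highest audio frequency of interest so it holds across the band."""
    return SPEED_OF_SOUND / freq_hz / 2.0

print(max_microphone_spacing(18_000))  # ~0.0094 m, i.e. just under 1 cm
```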
According to the audio focusing method provided by the embodiment of the disclosure, when the position of the virtual microphone is set on the electronic device, based on the principle that the distance between any two microphones is smaller than half of the wavelength of the audio signal, the influence of the side lobe and the grating lobe on the signal for enhancing the target spatial direction can be reduced, and the accuracy of the focused audio signal is improved.
Fig. 2 is a flowchart illustrating another audio focusing method according to an exemplary embodiment of the present disclosure. In the description of this embodiment, the same steps as those in any of the foregoing embodiments will be briefly described, and detailed descriptions thereof will be omitted, specifically referring to any of the foregoing embodiments. In the description of the present embodiment, how to focus an audio signal from the spatial orientation of the target according to the acquired target array signal will be described in detail. As shown in FIG. 2, the exemplary embodiment method may include the steps of:
in step 200, a target spatial orientation of a user using the electronic device for a call is determined relative to the electronic device.
In step 202, a target array signal received by at least one microphone on the electronic device is determined.
For example, the acquired target array signal is a numsample × 3 matrix. For convenience of the following description, numsample may be taken as 3, i.e., the target array signal is a 3 × 3 matrix. Here, numsample is the snapshot number of the acquired signal, i.e., the number of sampling points in the time domain.
In step 204, a covariance matrix of the target array signal is determined.
Specifically, the covariance matrix of the target array signal may be obtained according to the following formula (4):
R = E{xxᴴ}   (4)
where R characterizes the covariance matrix, x characterizes the target array signal, and xᴴ characterizes the conjugate transpose of the target array signal.
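In practice the expectation in formula (4) is replaced by a sample average over the snapshots; a minimal sketch, with the averaging convention as an assumption:

```python
import numpy as np

def sample_covariance(x):
    """Sample estimate of formula (4), R = E{x x^H}: average the snapshot
    outer products of a numsample x M target array signal."""
    x = np.asarray(x)
    return x.conj().T @ x / x.shape[0]  # M x M Hermitian covariance matrix
```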
In step 206, noise values are added to a plurality of designated positions in the covariance matrix, respectively, to obtain a modified covariance matrix.
Wherein the specified position is a position on a main diagonal in the covariance matrix. The main diagonal line is an oblique line from the top left corner to the bottom right corner in the matrix, and in this example, a position on the oblique line is the designated position.
The MVDR algorithm in the prior art can be directly determined based on the above covariance matrix, as shown in the following equation (5):
ω = R⁻¹a / (aᵀR⁻¹a)   (5)
where R⁻¹ characterizes the inverse of the covariance matrix of the target array signal, a characterizes the matrix formed by the steering vectors a₁, a₂, a₃, and aᵀ characterizes the transpose of the matrix a.
In practice, the elements on the main diagonal of the covariance matrix represent the noise values in the acquired audio signal. So that the power of the signal from the target spatial orientation is compressed as little as possible during the subsequent power compression, the audio focusing method of this example improves on the prior-art MVDR algorithm: the covariance matrix is modified to increase the noise values in it. Specifically, noise values may be added on the main diagonal of the covariance matrix.
In an optional example, the process of obtaining the modified covariance matrix may specifically be:
and determining a second noise matrix according to the product of a first noise matrix formed by a plurality of noise values and an identity matrix.
Wherein the first noise matrix δ may be, for example, the diagonal matrix of noise values in the following formula (6):
δ = diag(δ₁, δ₂, δ₃)   (6)
The second noise matrix δI, i.e., the product of the first noise matrix and the identity matrix, may be determined according to the following formula (7):
δI = δ · I = diag(δ₁, δ₂, δ₃)   (7)
and calculating the sum of the covariance matrix and the second noise matrix to obtain the corrected covariance matrix.
The covariance matrix R can be expressed, for example, as the following formula (8):
R = [r₁₁ r₁₂ r₁₃; r₂₁ r₂₂ r₂₃; r₃₁ r₃₂ r₃₃]   (8)
The modified covariance matrix R' can then be obtained according to the following formula (9):
R' = R + δI = [r₁₁+δ₁ r₁₂ r₁₃; r₂₁ r₂₂+δ₂ r₂₃; r₃₁ r₃₂ r₃₃+δ₃]   (9)
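The diagonal correction of formulas (6) to (9) then reduces to a single line; here noise_values holds δ₁, δ₂, δ₃, whose magnitudes are a tuning choice the disclosure leaves open:

```python
import numpy as np

def modified_covariance(R, noise_values):
    """Formulas (6) to (9): place one noise value at each main-diagonal
    position, i.e. R' = R + delta * I with delta = diag(noise_values)."""
    return R + np.diag(noise_values)
```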
it should be noted that the above descriptions about the first noise matrix, the covariance matrix, etc. are only schematic, and are intended to enable those skilled in the art to better understand the technical solutions of the embodiments of the present disclosure, and in practical applications, the above matrices may also be in other forms, and the present disclosure does not limit the present disclosure.
In step 208, based on the modified covariance matrix and the target steering vector, an audio signal from the target spatial orientation is focused according to an MVDR algorithm.
Wherein the target steering vector is a steering vector corresponding to the plurality of microphones. When the number of the microphones is 3, the target steering vector is a steering vector corresponding to the three microphones.
Specifically, the process of obtaining the weight parameter according to the MVDR algorithm can be represented by the following formulas (10), (11):
ω₀ = argmin ωᴴRω   (10)
s.t. ωᴴa = 1   (11)
In formula (10), ωᴴRω characterizes the total power of the signals received by the microphones, which includes the power of the received signal from the target spatial orientation, the power of the signals from other spatial orientations, and the power of the received noise. The MVDR algorithm seeks to make the total power of the received signals minimal, i.e., to make ωᴴRω minimal. ω characterizes the weight parameters of the signals received by the respective microphones, ωᴴ characterizes the conjugate transpose of those weight parameters, and R characterizes the covariance matrix of the target array signal.
In formula (11), ωᴴa characterizes the signal from the target spatial orientation, and the constraint ωᴴa = 1 expresses that the power of the signal from the target spatial orientation is not to be compressed; only the power of the signals from other spatial orientations and the power of the received noise are compressed.
From the above formulas (10) and (11), the following formula (12) can be obtained by the Lagrange multiplier method:
ω₀' = (R + δI)⁻¹a / (aᵀ(R + δI)⁻¹a)   (12)
where δ characterizes the first noise matrix and I characterizes the identity matrix.
In this example, the obtained ω₀' is a 3 × 1 vector, which may be specifically expressed as the following expression (13):
ω₀' = [ω₁, ω₂, ω₃]ᵀ   (13)
Here, ω₁, ω₂, ω₃ are the weight parameters corresponding to the signals received by the three microphones of the electronic device.
The audio signal of the target spatial orientation = (signal received by the first microphone) × ω₁ + (signal received by the second microphone) × ω₂ + (signal received by the third microphone) × ω₃.
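Combining formula (12), expression (13), and the weighted sum above gives a compact sketch; solving the linear system instead of forming the explicit inverse is an implementation choice, not something the disclosure prescribes:

```python
import numpy as np

def mvdr_focused_signal(x, R_mod, a):
    """Formulas (12) and (13) plus the weighted sum above: diagonally loaded
    MVDR. x is the numsample x M target array signal, R_mod the modified
    covariance R', and a the target steering vector of shape (M,). The
    conjugation convention in the final sum is an assumption for
    complex-valued signals."""
    Ri_a = np.linalg.solve(R_mod, a)  # (R')^-1 a without an explicit inverse
    w = Ri_a / (a.conj() @ Ri_a)      # omega_0', the M x 1 weight vector
    return x @ w.conj()               # focused audio signal, numsample x 1
```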
According to the audio focusing method provided by the embodiments of the present disclosure, noise values are respectively added at the main-diagonal positions of the covariance matrix, and the weight parameters are then calculated according to the MVDR algorithm. This avoids the problem that the obtained covariance matrix carries too small a noise value when the snapshot number of the signal is insufficient, and improves the accuracy of the focused audio signal.
In an actual video call scenario, a video call involves at least two parties. Based on the audio focusing method above, the electronic devices used by both parties of the video call can each acquire the focused audio of their respective users and then upload the acquired focused audio to a server for subsequent processing. For a better understanding of the technical solution of the embodiments of the present disclosure, Fig. 3 shows a flowchart of another audio focusing method. In the description of this embodiment, steps that are the same as in any of the foregoing embodiments are described only briefly; for details, refer to any of the foregoing embodiments. This embodiment describes the complete process of acquiring the focused audio and spatially rendering it according to preset virtual dialog space information. As shown in FIG. 3, the method of this exemplary embodiment may include the following steps:
in step 300, at least one image including the user is acquired.
In step 302, a user in the image is identified by a pre-trained image recognition neural network.
In step 304, a target spatial orientation of the user relative to the electronic device is determined.
In step 306, a target array signal received by at least one microphone on the electronic device is determined.
In step 308, audio noise reduction and dereverberation processing are performed on the target array signal received by the at least one microphone.
In step 310, an audio signal from the target spatial orientation is focused according to the target array signal.
In this step, focusing processing may be performed based on the acquired target spatial orientation information and the target array signal.
In step 312, the server presets a virtual dialog space.
The virtual dialog space may include the size of the virtual space, location information, distance information, angle information, etc. of both video call parties.
In step 314, HRTF (Head-Related Transfer Function) coefficients for the left and right ears of both video call parties are generated based on the virtual dialog space information.
As shown in fig. 4, because of the distance and angle information, the audio signals received by the left and right ears of a user differ for an audio signal emitted from a given position. Therefore, HRTF coefficients for the left and right ears of both video call parties can be generated based on the above information.
In step 316, based on the obtained focused audio signals of the video call parties and the HRTF coefficients, audio signals for the left and right ears of the video call parties are generated.
The acquired focused audio signals of the video call parties may include, for example, the audio signal of user A and the audio signal of user B. The HRTF coefficients may include, for example, the HRTF coefficients for the left and right ears of user A and the HRTF coefficients for the left and right ears of user B.
Then, based on the acquired audio signal of user A and the HRTF coefficients for the left and right ears of user B, the audio signals for the left and right ears of user B may be generated. Likewise, based on the acquired audio signal of user B and the HRTF coefficients for the left and right ears of user A, the audio signals for the left and right ears of user A may be generated.
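As an illustration of step 316, a minimal sketch of the binaural rendering; the HRIR-convolution realization, the function name, and the use of SciPy are assumptions:

```python
from scipy.signal import fftconvolve

def render_for_listener(focused_audio, hrir_left, hrir_right):
    """Sketch of step 316: filter the other party's focused audio with the
    listener's left/right head-related impulse responses. Convolving with
    HRIRs (the time-domain counterparts of HRTF coefficients) is an assumed
    realization; the disclosure only names HRTF coefficients."""
    left_ear = fftconvolve(focused_audio, hrir_left)    # signal for the left ear
    right_ear = fftconvolve(focused_audio, hrir_right)  # signal for the right ear
    return left_ear, right_ear
```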
In step 318, noise and reverberation are added to the generated audio signals of the left ear and the right ear of the two video call parties to obtain a final spatial audio, and the final spatial audio is sent to the corresponding electronic device.
Specifically, the audio signals for the left and right ears of the user a may be transmitted to the electronic device of the user a. And sending the audio signals aiming at the left ear and the right ear of the user B to the electronic equipment of the user B.
It should be noted that the above method is not limited to a two-person video call scenario; it is also applicable to video calls among three or more persons.
According to the audio focusing method provided by the embodiments of the present disclosure, the acquired focused audio signals of both video call parties are rendered in the server according to the preset virtual dialog space information, and noise and reverberation are added to the audio signals, so that a real communication scene can be simulated to the greatest extent and the online conversation experience is improved.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently.
Further, those skilled in the art will appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules are not necessarily required for the disclosure.
Corresponding to the embodiment of the application function implementation method, the disclosure also provides an embodiment of an application function implementation device and a corresponding terminal.
Fig. 5 is a schematic structural diagram of an audio focusing apparatus shown in accordance with an exemplary embodiment of the present disclosure, as shown in fig. 5, the audio focusing apparatus may include:
a target spatial orientation determining module 51, configured to determine a target spatial orientation of a user using the electronic device for a call with respect to the electronic device.
A target array signal determination module 52 configured to determine a target array signal received by at least one microphone on the electronic device.
And the audio signal focusing module 53 is configured to focus an audio signal from the target spatial direction according to the target array signal.
Optionally, when the target spatial orientation determining module 51 is configured to determine a target spatial orientation of a user using the electronic device for a call with respect to the electronic device, the target spatial orientation determining module includes:
at least one frame of image including the user is acquired.
Determining two-dimensional coordinates of the user in at least one of the images.
And determining the target space orientation according to the corresponding relation between the two-dimensional coordinates and the space orientation.
Optionally, the target array signal determining module 52, when configured to determine a target array signal received by at least one microphone on the electronic device, includes:
based on the first signal values received by each physical microphone, alternative array signals corresponding to all physical microphones are determined.
A second signal value corresponding to each virtual microphone is determined based at least on the first signal value.
And inserting all the second signal values into the alternative matrix signal to obtain the target array signal.
Optionally, the target array signal determining module 52, when configured to determine the second signal value corresponding to each virtual microphone based on at least the first signal value, includes:
a first steering vector corresponding to each of the physical microphones is determined.
And a second steering vector corresponding to the first virtual microphone is determined according to the angle value indicated by the target spatial orientation and the position of the first virtual microphone on the electronic device.
A conversion vector is determined based on at least one of the first steering vectors and the second steering vector. Wherein the conversion vector is used for representing the conversion relation between each first steering vector and the second steering vector.
And obtaining the second signal value corresponding to the first virtual microphone based on the product of each first signal value and the conversion vector.
Optionally, the distance between any two microphones is less than half the wavelength of the audio signal.
Optionally, when the audio signal focusing module 53 is configured to focus an audio signal from the target spatial direction according to the target array signal, the audio signal focusing module includes:
determining a covariance matrix of the array signals.
And respectively adding noise values at a plurality of designated positions in the covariance matrix to obtain a corrected covariance matrix.
And focusing to obtain the audio signal from the target space direction according to an MVDR algorithm based on the corrected covariance matrix and the target guide vector. Wherein the target steering vector is a steering vector corresponding to the plurality of microphones.
Optionally, the audio signal focusing module 53, when configured to add noise values to a plurality of designated positions in the covariance matrix respectively to obtain a modified covariance matrix, includes:
a second noise matrix is determined based on a product of a first noise matrix formed of a plurality of the noise values and an identity matrix.
And calculating the sum of the covariance matrix and the second noise matrix to obtain the corrected covariance matrix.
Optionally, the specified position is a position on a main diagonal in the covariance matrix.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the disclosure. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, an embodiment of the present disclosure provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the audio focusing method of any of the embodiments of the present disclosure.
Fig. 6 is a schematic diagram illustrating a structure of an electronic device 600 according to an example embodiment. For example, the electronic device 600 may be a user device, which may be embodied as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a wearable device such as a smart watch, a smart bracelet, and the like.
Referring to fig. 6, electronic device 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, input/output (I/O) interface 612, sensor component 614, and communication component 616.
The processing component 602 generally controls overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation at the device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply component 606 provides power to the various components of electronic device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 600.
The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and a user as described above. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or slide action but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 600 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessment of various aspects of the electronic device 600. For example, the sensor component 614 may detect an open/closed state of the electronic device 600 and the relative positioning of components, such as the display and keypad of the electronic device 600. The sensor component 614 may also detect a change in position of the electronic device 600 or of a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and a change in temperature of the electronic device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices. The electronic device 600 may access a wireless network based on a communication standard, such as WiFi, 4G LTE, 5G NR, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as the memory 604 including instructions, which when executed by the processor 620 of the electronic device 600, enable the electronic device 600 to perform an audio focusing method, the method including:
determining a target spatial orientation, relative to the electronic device, of a user using the electronic device for a video call;
determining a target array signal received by a plurality of microphones on the electronic device; and
focusing, according to the target array signal, to obtain an audio signal from the target spatial orientation.
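For concreteness, the following is a minimal sketch of how these three steps could fit together for a single narrowband snapshot. The microphone positions, the face-tracking shortcut in step 1, and the plain delay-and-sum weighting are all illustrative assumptions, not the claimed processing (the claims below recite an MVDR-based focuser).

```python
import numpy as np

C = 343.0        # speed of sound, m/s
FREQ = 1000.0    # example narrowband frequency, Hz
MIC_X = np.array([0.00, 0.02, 0.04])  # assumed mic x-positions on the device, m

def steering(theta):
    """Far-field steering vector toward azimuth theta for the MIC_X array."""
    return np.exp(-2j * np.pi * FREQ * MIC_X * np.cos(theta) / C)

# Step 1: target spatial orientation of the user (assumed given here,
# e.g. obtained from face tracking during the video call).
theta_target = np.deg2rad(60.0)

# Step 2: target array signal -- one complex snapshot per microphone
# (simulated as the target direction plus a little sensor noise).
rng = np.random.default_rng(0)
noise = 0.05 * (rng.standard_normal(len(MIC_X)) + 1j * rng.standard_normal(len(MIC_X)))
snapshot = steering(theta_target) + noise

# Step 3: focus toward the target orientation (delay-and-sum for illustration).
a = steering(theta_target)
focused = np.vdot(a, snapshot) / len(a)   # a^H x / M
```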
The non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. An audio focusing method, applied to an electronic device, the method comprising:
determining a target spatial orientation, relative to the electronic device, of a user using the electronic device for a call;
determining a target array signal received by at least one microphone on the electronic device; and
focusing, according to the target array signal, to obtain an audio signal from the target spatial orientation.
2. The method of claim 1, wherein the determining a target spatial orientation, relative to the electronic device, of a user using the electronic device for a call comprises:
acquiring at least one image frame including the user;
determining two-dimensional coordinates of the user in the at least one image frame; and
determining the target spatial orientation according to a correspondence between two-dimensional coordinates and spatial orientations.
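A minimal sketch of the mapping recited in claim 2, assuming a linear pixel-to-angle correspondence derived from the camera's field of view; the field-of-view values and the function name are illustrative, since the claim only requires some stored correspondence between two-dimensional coordinates and spatial orientations.

```python
import numpy as np

def pixel_to_orientation(u, v, width, height, hfov_deg=70.0, vfov_deg=55.0):
    """Map the user's 2D image coordinates (u, v) to (azimuth, elevation)
    in radians, assuming angles vary linearly across the frame."""
    azimuth = np.deg2rad((u / width - 0.5) * hfov_deg)
    elevation = np.deg2rad((0.5 - v / height) * vfov_deg)
    return azimuth, elevation

# Example: a face detected at pixel (960, 400) in a 1920x1080 frame sits
# straight ahead horizontally and slightly above the optical axis.
az, el = pixel_to_orientation(960, 400, 1920, 1080)
```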
3. The method of claim 1, wherein the determining a target array signal received by at least one microphone on the electronic device comprises:
determining a candidate array signal corresponding to all physical microphones based on a first signal value received by each physical microphone;
determining a second signal value corresponding to each virtual microphone based at least on the first signal values; and
inserting all the second signal values into the candidate array signal to obtain the target array signal.
4. The method of claim 3, wherein the determining a second signal value corresponding to each virtual microphone based at least on the first signal values comprises:
determining a first steering vector corresponding to each of the physical microphones;
determining a second steering vector corresponding to a first virtual microphone according to the angle value indicated by the target spatial orientation and the position of the first virtual microphone on the electronic device;
determining a conversion vector based on each first steering vector and the second steering vector, wherein the conversion vector represents a conversion relationship between the first steering vectors and the second steering vector; and
obtaining the second signal value corresponding to the first virtual microphone based on the product of each first signal value and the conversion vector.
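The sketch below illustrates one way claims 3 and 4 could fit together: a minimum-norm conversion vector t is chosen so that applying t to the first steering vectors reproduces the second steering vector, and the same t applied to the first signal values synthesizes the virtual microphone's second signal value. The linear-array geometry and the minimum-norm choice of t are assumptions, not the patented construction.

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def steering(positions, theta, freq):
    """Far-field steering entries for mic x-positions (assumed linear array)."""
    return np.exp(-2j * np.pi * freq * positions * np.cos(theta) / C)

def virtual_mic_value(x_phys, pos_phys, pos_virt, theta, freq):
    """Second signal value for one virtual microphone at one frequency bin."""
    a1 = steering(pos_phys, theta, freq)                 # first steering vectors
    a2 = steering(np.array([pos_virt]), theta, freq)[0]  # second steering vector
    t = a1 * np.conj(a2) / np.vdot(a1, a1)               # conversion: t^H a1 = a2
    return np.vdot(t, x_phys)                            # product with 1st values

# Example: extend a 3-mic candidate array with a virtual mic at x = 0.06 m.
pos_phys = np.array([0.00, 0.02, 0.04])
theta, freq = np.deg2rad(60.0), 1000.0
x_phys = steering(pos_phys, theta, freq)                 # ideal snapshot
x_virt = virtual_mic_value(x_phys, pos_phys, 0.06, theta, freq)
x_target = np.append(x_phys, x_virt)                     # target array signal
```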
5. The method of claim 3 or 4, wherein the spacing between any two microphones is less than half the wavelength of the audio signal.
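Claim 5's spacing constraint is the usual anti-aliasing condition for array processing. A quick check, assuming speech content processed up to 8 kHz (the band edge is an assumption for illustration):

```python
def max_spacing(f_max_hz, c=343.0):
    """Half-wavelength rule: inter-mic spacing must stay below c / (2 * f_max)."""
    return c / (2.0 * f_max_hz)

d_max = max_spacing(8000.0)   # ~= 0.0214 m, i.e. roughly 21 mm
```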
6. The method of claim 1, wherein the focusing, according to the target array signal, to obtain an audio signal from the target spatial orientation comprises:
determining a covariance matrix of the target array signal;
adding noise values at a plurality of designated positions in the covariance matrix, respectively, to obtain a corrected covariance matrix; and
focusing, according to a minimum variance distortionless response (MVDR) algorithm, to obtain the audio signal from the target spatial orientation based on the corrected covariance matrix and a target steering vector, wherein the target steering vector is a steering vector corresponding to the plurality of microphones.
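A compact sketch of claim 6 for a single frequency bin, using the standard MVDR weight formula w = R⁻¹a / (aᴴR⁻¹a). The corrected covariance matrix R_corr is assumed to come from the step recited in claims 7 and 8 (a sketch of that correction follows claim 8); variable names are illustrative.

```python
import numpy as np

def covariance(X):
    """Sample covariance of the target array signal X (n_mics, n_frames)."""
    return (X @ X.conj().T) / X.shape[1]

def mvdr_focus(X, R_corr, a):
    """Focus the array snapshots X toward the target steering vector a,
    using the corrected covariance matrix R_corr."""
    w = np.linalg.solve(R_corr, a)       # R_corr^{-1} a
    w = w / (a.conj() @ w)               # enforce w^H a = 1 (distortionless)
    return w.conj() @ X                  # focused audio signal at this bin
```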
7. The method of claim 6, wherein the adding noise values at a plurality of designated positions in the covariance matrix to obtain a corrected covariance matrix comprises:
determining a second noise matrix according to the product of a first noise matrix, formed by a plurality of noise values, and an identity matrix; and
calculating the sum of the covariance matrix and the second noise matrix to obtain the corrected covariance matrix.
8. The method of claim 6 or 7, wherein the designated positions are positions on the main diagonal of the covariance matrix.
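In signal-processing terms, claims 7 and 8 describe diagonal loading: taking the first noise matrix as a diagonal matrix of noise values (an assumption consistent with claim 8), its product with the identity matrix is unchanged, so the sum only perturbs the main diagonal of the covariance matrix. A sketch, with the loading level as an assumed parameter:

```python
import numpy as np

def corrected_covariance(R, noise_values):
    """Add the noise values on the main diagonal (the designated positions)."""
    n1 = np.diag(noise_values)           # first noise matrix (assumed diagonal)
    n2 = n1 @ np.eye(len(noise_values))  # second noise matrix (equal to n1 here)
    return R + n2

# Typical use: a small, equal loading keeps the MVDR inverse well conditioned,
# e.g. noise_values = 1e-3 * np.trace(R).real / len(R) * np.ones(len(R)).
```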
9. An audio focusing apparatus, applied to an electronic device, the apparatus comprising:
the target space orientation determining module is used for determining the target space orientation of a user using the electronic equipment for communication relative to the electronic equipment;
a target array signal determination module to determine a target array signal received by at least one microphone on the electronic device;
and the audio signal focusing module is used for focusing to obtain the audio signal from the target space direction according to the target array signal.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method according to any one of claims 1 to 8.
CN202211180723.5A 2022-09-26 2022-09-26 Audio focusing method and device, storage medium and electronic equipment Pending CN115589566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211180723.5A CN115589566A (en) 2022-09-26 2022-09-26 Audio focusing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211180723.5A CN115589566A (en) 2022-09-26 2022-09-26 Audio focusing method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115589566A true CN115589566A (en) 2023-01-10

Family

ID=84777984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211180723.5A Pending CN115589566A (en) 2022-09-26 2022-09-26 Audio focusing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115589566A (en)

Similar Documents

Publication Publication Date Title
CN108510987B (en) Voice processing method and device
CN108766457B (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
Donley et al. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments
EP3091753A1 (en) Method and device of optimizing sound signal
CN109840939B (en) Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and storage medium
CN109360549B (en) Data processing method, wearable device and device for data processing
CN107330868A (en) image processing method and device
US20220130023A1 (en) Video denoising method and apparatus, terminal, and storage medium
CN111009257A (en) Audio signal processing method and device, terminal and storage medium
CN112581358A (en) Training method of image processing model, image processing method and device
CN106060707B (en) Reverberation processing method and device
CN107730443B (en) Image processing method and device and user equipment
WO2021190625A1 (en) Image capture method and device
CN113053406A (en) Sound signal identification method and device
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN113506582A (en) Sound signal identification method, device and system
CN110148424B (en) Voice processing method and device, electronic equipment and storage medium
CN110459236B (en) Noise estimation method, apparatus and storage medium for audio signal
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN114339582B (en) Dual-channel audio processing method, device and medium for generating direction sensing filter
CN115589566A (en) Audio focusing method and device, storage medium and electronic equipment
CN113077808B (en) Voice processing method and device for voice processing
CN110751223B (en) Image matching method and device, electronic equipment and storage medium
CN108108685B (en) Method and device for carrying out face recognition processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination