CN111768797A

CN111768797A - Speech enhancement processing method, speech enhancement processing device, computer equipment and storage medium

Info

Publication number: CN111768797A
Application number: CN202010903341.5A
Authority: CN
Inventors: 谢单辉; 张伟彬
Original assignee: Voiceai Technologies Co ltd
Current assignee: Voiceai Technologies Co ltd
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2020-10-13

Abstract

The application relates to a speech enhancement processing method, a speech enhancement processing device, computer equipment and a storage medium. The method comprises the following steps: acquiring monitoring information corresponding to each microphone in a microphone array, and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array; judging whether the corresponding microphone is shielded or not according to the monitoring information; when the shielded microphone exists in the microphone array, deleting the spatial position information of the shielded microphone in a position information list of the microphone array, and deleting the voice signal collected by the shielded microphone in a voice signal set; performing sound source localization on the voice signals in the voice signal set subjected to the deleting operation according to the spatial position information in the position information list subjected to the deleting operation; and performing voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning. The method can improve the quality of the processed voice signal.

Description

Speech enhancement processing method, speech enhancement processing device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing speech enhancement, a computer device, and a storage medium.

Background

With the development of computer technology, speech signals are widely used in daily life. A microphone array can form a directional pickup pattern, which improves the quality of target voice signal collection.

In the traditional method, the source direction of a voice signal is determined by comparing the phase and amplitude differences among the sound signals collected by different microphones in a microphone array, so that voice from the target direction is enhanced and voice interference from non-target directions is attenuated. However, the traditional method suffers serious interference when a microphone in the microphone array is occluded, and the voice enhancement effect is then poor.

Disclosure of Invention

In view of the above, it is necessary to provide a speech enhancement processing method, apparatus, computer device and storage medium capable of resisting occlusion interference.

A method of speech enhancement processing, the method comprising:

acquiring monitoring information corresponding to each microphone in a microphone array, and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array;

judging whether the corresponding microphone is shielded or not according to the monitoring information;

when an occluded microphone exists in the microphone array, deleting the spatial position information of the occluded microphone from a position information list of the microphone array, and deleting the voice signal collected by the occluded microphone from the voice signal set;

performing sound source localization on the voice signals in the voice signal set subjected to the deleting operation according to the spatial position information in the position information list subjected to the deleting operation;

and performing voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning.

In one embodiment, the determining whether the corresponding microphone is blocked according to the monitoring information includes:

obtaining at least one judgment result of whether the microphone corresponding to at least one monitoring information is blocked or not according to at least one monitoring information corresponding to each microphone in the microphone array; determining whether a microphone corresponding to the at least one monitoring information is occluded based on the at least one determination result.

In one embodiment, the determining whether the corresponding microphone is blocked according to the monitoring information includes:

when the monitoring information is a monitoring image of a microphone in the microphone array, detecting whether a sound inlet hole exists in the monitoring image so as to judge whether the corresponding microphone is blocked; and/or

when the monitoring information is the bearing pressure value of the microphone in the microphone array, detecting whether the bearing pressure value exceeds a preset value so as to judge whether the corresponding microphone is blocked; and/or

when the monitoring information is infrared information corresponding to the microphones in the microphone array, detecting whether a shielding object can be detected based on the infrared information so as to judge whether the corresponding microphones are blocked.

In one embodiment, the sound source localization of the voice signals in the voice signal set subjected to the deletion operation according to the spatial position information in the position information list subjected to the deletion operation includes:

performing echo cancellation operation on the voice signals in the voice signal set subjected to the deleting operation;

performing reverberation elimination operation on the voice signal subjected to echo elimination operation;

and carrying out sound source localization on the voice signal subjected to the reverberation elimination operation according to the spatial position information in the position information list subjected to the deletion operation.

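In one embodiment, the sound source localization of the voice signals in the voice signal set subjected to the deletion operation according to the spatial position information in the position information list subjected to the deletion operation includes:

selecting a reference voice signal serving as a reference datum from the voice signal set subjected to the deleting operation;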

respectively calculating the time delay information of the rest voice signals except the reference voice signal in the voice signal set subjected to the deleting operation relative to the reference voice signal;

and obtaining the sound source direction of each voice signal through geometric operation based on the time delay information and the spatial position information in the position information list subjected to the deleting operation.
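The application does not spell out the "geometric operation" at this point. One common way to do it, shown here only as a minimal sketch, assumes a far-field (plane wave) source and a planar array: each delay then satisfies tau_i = -(p_i - p_0) · u / c, where u is the unit vector pointing toward the source, and u can be recovered by least squares. The function name `source_direction_2d`, the speed of sound constant, and the example geometry are all illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value

Point = Tuple[float, float]


def source_direction_2d(
    ref_pos: Point, other_pos: Sequence[Point], delays: Sequence[float]
) -> float:
    """Return the azimuth (degrees) toward the source under a far-field model.

    delays[i] is the arrival time at other_pos[i] minus the arrival time at
    ref_pos, so a microphone closer to the source has a negative delay.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), tau in zip(other_pos, delays):
        dx, dy = x - ref_pos[0], y - ref_pos[1]
        rhs = -SPEED_OF_SOUND * tau
        # Accumulate the 2x2 normal equations of the least-squares problem.
        a11 += dx * dx
        a12 += dx * dy
        a22 += dy * dy
        b1 += dx * rhs
        b2 += dy * rhs
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("remaining microphones are collinear; direction is ambiguous")
    ux = (b1 * a22 - a12 * b2) / det
    uy = (a11 * b2 - a12 * b1) / det
    return math.degrees(math.atan2(uy, ux))


# Synthetic check: build the delays for a source at 60 degrees, then recover it.
angle = math.radians(60.0)
u = (math.cos(angle), math.sin(angle))
ref = (0.0, 0.0)
others = [(0.1, 0.0), (0.0, 0.1)]
delays = [-((x - ref[0]) * u[0] + (y - ref[1]) * u[1]) / SPEED_OF_SOUND for x, y in others]
azimuth = source_direction_2d(ref, others, delays)
```

The collinearity check matters after a deletion operation: removing an occluded microphone can leave a set of positions that no longer determines a unique direction.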

In one embodiment, the separately calculating the delay information of the remaining voice signals except the reference voice signal in the voice signal set subjected to the deletion operation with respect to the reference voice signal includes:

respectively calculating cross-correlation functions between the rest voice signals except the reference voice signal in the voice signal set and the reference voice signal;

obtaining coordinates of the cross-correlation function at a peak value;

and acquiring the time delay information according to the coordinates at the peak value.
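The three steps above (cross-correlation, peak coordinate, delay) can be sketched as follows. This is a plain time-domain cross-correlation written for readability rather than speed; the function names, the `max_lag` search window, and the sample rate are assumptions for illustration and are not taken from the application.

```python
from typing import List, Sequence


def cross_correlation(x: Sequence[float], ref: Sequence[float], max_lag: int) -> List[float]:
    """Return sum_n x[n + lag] * ref[n] for every lag in [-max_lag, max_lag]."""
    values = []
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(x):
                total += x[m] * r
        values.append(total)
    return values


def estimate_delay(
    x: Sequence[float], ref: Sequence[float], max_lag: int, sample_rate: float
) -> float:
    """Delay of x relative to ref in seconds (positive when x arrives later)."""
    corr = cross_correlation(x, ref, max_lag)
    peak_index = max(range(len(corr)), key=corr.__getitem__)  # coordinate of the peak
    lag = peak_index - max_lag                                 # convert index to a signed lag
    return lag / sample_rate


# The second signal is the reference delayed by three samples.
ref_signal = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
other_signal = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0]
delay = estimate_delay(other_signal, ref_signal, max_lag=5, sample_rate=16000.0)
```

Here `delay` corresponds to a lag of three samples at the assumed 16 kHz sample rate.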

In one embodiment, the performing speech enhancement processing on the speech signals in the speech signal set subjected to the deleting operation according to the sound source direction determined by the sound source location includes:

performing time delay compensation on each voice signal in the voice signal set subjected to the deleting operation according to the time delay information;

generating a voice response matrix of the voice signal set subjected to the deleting operation according to the sound source direction;

and calculating each voice signal subjected to time delay compensation through the voice response matrix to obtain an enhanced voice signal.
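The voice response matrix is not defined at this point in the text, so the following sketch does not attempt to reproduce it. It substitutes the simplest possible weighting, uniform weights, which gives a plain delay-and-sum beamformer: each signal is time-aligned using its integer-sample lag and the aligned signals are averaged. The names `shift` and `delay_and_sum` are hypothetical.

```python
from typing import List, Sequence


def shift(signal: Sequence[float], lag: int) -> List[float]:
    """Advance a signal by `lag` samples (lag > 0 removes a delay), zero-padding the gap."""
    n = len(signal)
    return [signal[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


def delay_and_sum(signals: Sequence[Sequence[float]], lags: Sequence[int]) -> List[float]:
    """Time-align every signal to the reference, then average with uniform weights."""
    aligned = [shift(s, lag) for s, lag in zip(signals, lags)]
    count = len(aligned)
    return [sum(column) / count for column in zip(*aligned)]


# Two microphones hear the same pulse; the second hears it two samples later.
reference = [0.0, 1.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 1.0, 0.0]
enhanced = delay_and_sum([reference, delayed], lags=[0, 2])
```

After compensation the pulses line up and add coherently, whereas uncorrelated noise on the two channels would partially average out.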

In one embodiment, after performing speech enhancement processing on the speech signals in the speech signal set subjected to the deleting operation according to the sound source direction determined by the sound source location, the method further includes:

acquiring updated monitoring information;

judging whether the shielded microphone is recovered to be normal or not according to the updated monitoring information;

if the shielded microphone returns to normal, updating the position information list and the voice signal set;

and performing voice enhancement processing based on the updated spatial position information list and the updated voice signal set.

A speech enhancement processing apparatus, the apparatus comprising:

the acquisition module is used for acquiring monitoring information corresponding to each microphone in a microphone array and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array;

the judging module is used for judging whether the corresponding microphone is shielded or not according to the monitoring information;

a deleting module, configured to delete, when an occluded microphone exists in the microphone array, spatial location information of the occluded microphone in a location information list of the microphone array, and delete, in the speech signal set, a speech signal acquired by the occluded microphone;

the sound source positioning module is used for carrying out sound source positioning on the voice signals in the voice signal set subjected to the deleting operation according to the spatial position information in the position information list subjected to the deleting operation;

and the enhancement processing module is used for carrying out voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning.

In one embodiment, the determining module is further configured to:

In one embodiment, the sound source localization module is further configured to:

obtaining coordinates of the cross-correlation function at a peak value;

In one embodiment, the enhancement processing module is further configured to:

In one embodiment, the apparatus further comprises:

the acquisition module is used for acquiring updated monitoring information;

the judging module is used for judging whether the shielded microphone returns to normal or not according to the updated monitoring information;

the updating module is used for updating the spatial position information list and the voice signal set if the shielded microphone returns to be normal;

and the enhancement processing module is used for carrying out voice enhancement processing based on the updated spatial position information list and the updated voice signal set.

A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the speech enhancement processing method when executing the computer program.

A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech enhancement processing method.

In the above embodiment, the computer device obtains the monitoring information of each microphone, determines from the monitoring information whether the microphone is blocked, and, when the microphone is blocked, deletes the spatial position information of the microphone and the voice signal collected by the microphone. Because the computer device uses neither the spatial position information of the blocked microphone nor the voice signal collected by it when performing the voice enhancement processing, the blocked microphone does not interfere with the voice enhancement processing algorithm, and the quality of the voice signal after the voice enhancement processing is improved.

Drawings

FIG. 1 is a diagram of an application environment of a speech enhancement processing method in one embodiment;

FIG. 2 is a flow diagram illustrating a method for speech enhancement processing according to one embodiment;

FIG. 3 is a schematic diagram of a speech enhancement processing method in one embodiment;

FIG. 4 is a flowchart illustrating a speech enhancement processing method according to another embodiment;

FIG. 5 is a block diagram of an apparatus for speech enhancement processing according to an embodiment;

FIG. 6 is a block diagram showing the structure of a speech enhancement processing apparatus according to another embodiment;

FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment;

FIG. 8 is a diagram illustrating an internal structure of a computer device in another embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The speech enhancement processing method provided by the application can be applied to the application environment shown in fig. 1. The computer device 102 receives a voice signal set formed by voice signals collected by the microphone array 106, and acquires monitoring information corresponding to each microphone in the microphone array. The computer device 102 then judges whether each corresponding microphone is occluded according to the monitoring information. When there is an occluded microphone in the microphone array, the computer device 102 deletes the spatial location information of the occluded microphone from the location information list and deletes the voice signal collected by the occluded microphone from the voice signal set. Finally, the computer device 102 performs speech enhancement processing on the speech signals in the speech signal set subjected to the deletion operation according to the spatial position information in the position information list subjected to the deletion operation, and outputs the speech signals subjected to the speech enhancement processing to the loudspeaker 104.

The computer device 102 may be a terminal or a server. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.

In one embodiment, as shown in fig. 2, a speech enhancement processing method is provided, which is described by taking the method as an example applied to the computer device in fig. 1, and comprises the following steps:

S202, acquiring monitoring information corresponding to each microphone in a microphone array, and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array.

The microphone array is an audio front-end acquisition device composed of a plurality of microphones and used for acquiring sounds from different spatial directions. A microphone array may have a regular shape, such as linear, annular, spherical or star-shaped, or an irregular arrangement, such as distributed or self-organized.

The monitoring information is information that is collected by an information collection device and that indicates whether a microphone is occluded. For example, the monitoring information may be physical state information, appearance characteristic information, and the like of the microphone. The physical state information may be the pressure value sensed at a preset position on the surface of the microphone, the light-sensing information at the preset position on the surface of the microphone, or the temperature information at the preset position on the surface of the microphone. The appearance characteristic information may be image information of the microphone, infrared information of the microphone, or point cloud data of the microphone.

The computer device may set both the number and the types of monitoring information corresponding to each microphone. For example, the computer device may be configured to acquire two types of monitoring information for each microphone, such as the pressure value at a preset position on the microphone surface and the infrared information of the microphone. As another example, the computer device may acquire three types of monitoring information for a microphone in a position that is easily occluded and only one type for a microphone in a position that is not easily occluded, so that the microphones in easily occluded positions are monitored more closely.

Wherein the speech signals are sound signals from different sound source directions collected by respective microphones in the microphone array. The voice signals collected by all the microphones in the microphone array form a voice signal set.

S204, the computer equipment judges whether the corresponding microphone is shielded or not according to the monitoring information.

After acquiring the monitoring information corresponding to the microphones, the computer device analyzes each kind of monitoring information corresponding to each microphone and obtains, for that kind of monitoring information, a judgment result indicating whether the microphone is occluded. The computer device may determine whether the microphone is occluded based on a single type of monitoring information corresponding to the microphone, or it may comprehensively analyze some or all of the types of monitoring information corresponding to the microphone.

S206, when the shielded microphone exists in the microphone array, deleting the spatial position information of the shielded microphone in the position information list of the microphone array, and deleting the voice signal collected by the shielded microphone in the voice signal set.

The spatial position information of all microphones in the microphone array is stored in the position information list.

Wherein the spatial position information is a position coordinate of the microphone in a three-dimensional space.

Since the voice signal collected by an occluded microphone can be severely attenuated or distorted, after the computer device determines that an occluded microphone exists in the microphone array, it deletes the spatial position information of the occluded microphone from the position information list of the microphone array and deletes the voice signal collected by the occluded microphone from the voice signal set. This prevents the voice signal collected by the occluded microphone from interfering with the voice enhancement processing algorithm and contaminating the other, normal voice signals.
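A minimal sketch of this deletion step follows. It assumes, purely for illustration, that the position information list and the voice signal set are both dictionaries keyed by a microphone identifier; the names `drop_occluded`, `positions` and `signals` and the sample values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Position = Tuple[float, float, float]  # coordinates in three-dimensional space


def drop_occluded(
    positions: Dict[str, Position],
    signals: Dict[str, List[float]],
    occluded: Set[str],
) -> Tuple[Dict[str, Position], Dict[str, List[float]]]:
    """Return copies of both containers without the occluded microphones."""
    kept_positions = {m: p for m, p in positions.items() if m not in occluded}
    kept_signals = {m: s for m, s in signals.items() if m not in occluded}
    return kept_positions, kept_signals


positions = {
    "mic0": (0.00, 0.0, 0.0),
    "mic1": (0.05, 0.0, 0.0),
    "mic2": (0.10, 0.0, 0.0),
}
signals = {"mic0": [0.1, 0.2], "mic1": [0.0, 0.0], "mic2": [0.1, 0.3]}
kept_positions, kept_signals = drop_occluded(positions, signals, {"mic1"})
```

Returning filtered copies rather than mutating the originals is a design choice of this sketch: it keeps the position list and the signal set in step with each other and leaves the full list available for when an occluded microphone returns to normal.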

S208, the computer equipment carries out sound source localization on the voice signals in the voice signal set subjected to the deleting operation according to the spatial position information in the position information list subjected to the deleting operation.

Wherein sound source localization is determining a direction of a sound source of a speech signal picked up by a microphone in the microphone array.

Because the microphones in the microphone array occupy different spatial positions, the voice signals they collect differ in signal characteristics such as phase, amplitude and time delay. Based on these differences and on the spatial position information of the microphones, the computer device can perform sound source localization on the collected voice signals to determine the sound source direction of each voice signal.

The computer equipment performs signal processing on the voice signals collected by the microphones and can acquire the voice signal characteristics of the voice signals, such as phase, amplitude, time delay and the like.

S210, the computer equipment performs voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning.

Wherein, the sound source direction is the signal source direction of the voice signal collected by the microphone.

In one embodiment, a computer device divides a sound source direction of a voice signal into a target sound source direction and a non-target sound source direction. The target sound source direction is the direction of the sound source of the useful speech signal that the microphone is expected to pick up. The non-target sound source direction is the direction of origin of the noise signal.

The voice enhancement processing is to reserve and enhance the voice signal from the target sound source direction according to the sound source direction of the voice signal, and attenuate the noise voice signal which is not from the target sound source direction to filter the noise and enhance the useful voice signal.

In the above embodiment, the computer device obtains the monitoring information of each microphone, determines from the monitoring information whether the microphone is blocked, and, when the microphone is blocked, deletes the spatial position information of the microphone and the voice signal collected by the microphone. Because the computer device uses neither the spatial position information of the blocked microphone nor the voice signal collected by it when performing the voice enhancement processing, the data of the blocked microphone does not interfere with the voice enhancement algorithm, and the quality of the voice signal after the voice enhancement processing is improved.

In one embodiment, the computer device determining whether the corresponding microphone is blocked according to the monitoring information includes: obtaining at least one judgment result of whether the microphone corresponding to at least one monitoring information is shielded or not according to at least one monitoring information corresponding to each microphone in the microphone array; and determining whether the microphone corresponding to the at least one monitoring information is blocked or not based on the at least one judgment result.

The computer device first determines at least one type of monitoring information corresponding to each microphone, that is, for each microphone, the computer device obtains at least one type of monitoring information. Based on each kind of monitoring information, the computer device analyzes and processes the monitoring information to judge whether the corresponding microphone is shielded or not based on the monitoring information, and a judgment result is obtained.

For example, when the computer device acquires one kind of monitoring information, it determines whether the corresponding microphone is occluded based on that kind of monitoring information alone. When the computer device acquires two or more kinds of monitoring information, it first determines whether the corresponding microphone is occluded according to each kind of monitoring information separately, and then comprehensively analyzes the resulting judgment results to determine whether the microphone is occluded. For example, if the monitoring information is a monitoring image, the computer device determines whether the corresponding microphone is occluded according to the monitoring image and obtains a judgment result. If the monitoring information is infrared information and a bearing pressure value, the computer device obtains one judgment result from the infrared information and one from the bearing pressure value, and then comprehensively analyzes the two judgment results to determine whether the microphone is occluded. If the monitoring information is a monitoring image, infrared information and a bearing pressure value, the computer device obtains a judgment result from each of the three and then comprehensively analyzes the three judgment results to determine whether the microphone is occluded.

In one embodiment, the computer device may set a judgment condition based on a single type of monitoring information corresponding to each microphone, and determine whether the microphone is occluded according to that judgment condition. The monitoring information may be any monitoring information corresponding to the microphone, for example any one of infrared information, image information and the pressure value at a preset position, or it may be a specific type of monitoring information selected according to a preset rule, such as a priority order of the monitoring information. For example, the computer device may set the judgment condition so that the microphone is determined to be occluded as soon as the judgment result obtained from any one type of monitoring information (for example, infrared information, image information, or the pressure value at a preset position) indicates that the microphone is occluded; no judgment results then need to be obtained from the remaining monitoring information. If the microphone is judged not to be occluded according to every type of monitoring information, the microphone is determined not to be occluded. In other embodiments, the computer device may set the judgment condition so that the types of monitoring information corresponding to the microphone are evaluated in descending order of priority, and determine whether the microphone is occluded according to the judgment results.

A user may set a priority for each type of monitoring information, and the computer device evaluates the monitoring information with higher priority first. For example, if the monitoring information includes infrared information, image information and the pressure value at a preset position, in descending order of priority, the computer device first obtains a judgment result from the infrared information. If that judgment result does not indicate that the microphone is occluded, the computer device then obtains a judgment result from the lower-priority image information or the pressure value at the preset position, and determines on that basis whether the microphone is occluded.
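The priority-ordered judgment can be sketched as a short-circuiting loop. The sketch assumes each type of monitoring information is wrapped in a zero-argument check function that returns True when it indicates occlusion; the function names and the stubbed results are hypothetical, and the `calls` list exists only to show which checks actually ran.

```python
from typing import Callable, List

Check = Callable[[], bool]  # returns True when this monitoring information indicates occlusion


def occluded_by_priority(checks: List[Check]) -> bool:
    """Evaluate checks from highest to lowest priority; stop at the first 'occluded' result."""
    for check in checks:
        if check():
            return True
    # Every type of monitoring information judged the microphone as not occluded.
    return False


calls: List[str] = []


def infrared_check() -> bool:
    calls.append("infrared")
    return False  # stub: infrared does not detect an obstruction


def image_check() -> bool:
    calls.append("image")
    return True  # stub: the sound inlet hole is not visible in the image


def pressure_check() -> bool:
    calls.append("pressure")
    return True


result = occluded_by_priority([infrared_check, image_check, pressure_check])
```

In this run the pressure check is never evaluated, because the image check already indicates occlusion.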

In one embodiment, the computer device may set a judgment condition that comprehensively considers at least two types of monitoring information corresponding to each microphone, and analyze the monitoring information according to that judgment condition to determine whether the microphone is occluded. For example, the judgment condition may be that the judgment results obtained from at least two types of monitoring information corresponding to the microphone indicate that the microphone is occluded. For instance, if the judgment result obtained from one type of monitoring information (for example, infrared information, image information, or the pressure value at a preset position) indicates that the microphone is occluded, and the judgment result of at least one of the remaining types of monitoring information also indicates that the microphone is occluded, the microphone is determined to be occluded.

In one embodiment, when comprehensively determining whether the microphone corresponding to the monitoring information is occluded, the computer device computes a weighted sum of the judgment results obtained from each type of monitoring information corresponding to the microphone, and determines that the microphone is occluded if the weighted sum is greater than a preset judgment threshold. For example, the computer device assigns the value 1 to a judgment result of occluded and the value 0 to a judgment result of not occluded, and sets a weight for the judgment result obtained from each type of monitoring information. The weights may differ according to the accuracy of each type of monitoring information as estimated from probability. For example, when the computer device comprehensively analyzes the judgment results obtained from the monitoring image and the infrared information, the weight of the judgment result obtained from the monitoring image is set to 0.6, the weight of the judgment result obtained from the infrared information is set to 0.4, and the judgment threshold is set to 0.5. If the judgment result based on the monitoring image is occluded (value 1) and the judgment result based on the infrared information is not occluded (value 0), the weighted sum is 1 × 0.6 + 0 × 0.4 = 0.6, which is greater than the judgment threshold 0.5, so the microphone corresponding to the monitoring information is determined to be occluded.
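The weighted calculation in this example can be reproduced in a few lines. The weights 0.6 and 0.4 and the threshold 0.5 come from the text; the function name and the dictionary layout are illustrative assumptions.

```python
from typing import Dict


def weighted_occlusion_score(results: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Sum the weights of the monitoring types whose judgment result is 'occluded'."""
    return sum(weights[name] * (1.0 if occluded else 0.0) for name, occluded in results.items())


weights = {"image": 0.6, "infrared": 0.4}
threshold = 0.5

# Image says occluded (1), infrared says not occluded (0): 1 * 0.6 + 0 * 0.4 = 0.6
score = weighted_occlusion_score({"image": True, "infrared": False}, weights)
is_occluded = score > threshold

# The opposite disagreement gives 0 * 0.6 + 1 * 0.4 = 0.4, which stays below the threshold.
reverse_score = weighted_occlusion_score({"image": False, "infrared": True}, weights)
reverse_is_occluded = reverse_score > threshold
```

With these weights the monitoring image alone can push the sum over the threshold, while the infrared information alone cannot.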

In one embodiment, the computer device determining whether the corresponding microphone is blocked according to the monitoring information includes: when the monitoring information is a monitoring image of a microphone in the microphone array, detecting whether a sound inlet hole exists in the monitoring image so as to judge whether the corresponding microphone is blocked; and/or when the monitoring information is the bearing pressure value of the microphone in the microphone array, detecting whether the bearing pressure value exceeds a preset value so as to judge whether the corresponding microphone is shielded; and/or when the monitoring information is infrared information corresponding to the microphones in the microphone array, detecting whether the shielding object can be detected based on the infrared information so as to judge whether the corresponding microphones are shielded.

The computer device may compare the real-time monitoring image of the microphone with a pre-stored normal image, or may use a method based on a convolutional neural network or an LSTM (long short-term memory) network, to detect whether the sound inlet hole is present in the monitoring image.

Since the computer device's analysis of any single kind of monitoring information is subject to error, a judgment result obtained from only one kind of monitoring information may be wrong. By comprehensively judging the multiple kinds of monitoring information corresponding to each microphone, the computer device can improve the judgment accuracy.

In one embodiment, the computer device performing sound source localization on the voice signals in the voice signal set subjected to the deletion operation, according to the spatial position information in the position information list subjected to the deletion operation, includes: performing an echo cancellation operation on the voice signals in the voice signal set subjected to the deletion operation; performing a reverberation cancellation operation on the voice signals subjected to the echo cancellation operation; and performing sound source localization on the voice signals subjected to the reverberation cancellation operation according to the spatial position information in the position information list subjected to the deletion operation.

The sound played by the loudspeaker is picked up by the microphones and superimposed on the voice signals they collect, forming an echo. Therefore, before performing sound source localization on the voice signals in the voice signal set subjected to the deletion operation, the computer device first uses an echo cancellation algorithm to remove the loudspeaker sound collected by the microphones. This prevents the loudspeaker sound from interfering with the sound source localization algorithm and prevents echo from appearing in the voice signal after speech enhancement processing, so that the sound quality of the microphone array is not affected.

The computer device may perform echo cancellation on the voice signals collected by the microphones by using an AEC (Acoustic Echo Cancellation) echo canceller, or by using an AFC (Adaptive Feedback Control) technique.
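As an illustration of how an adaptive echo canceller can work, the following sketch uses a normalized least-mean-squares (NLMS) filter. NLMS is chosen here only as a common textbook technique; the source does not state which adaptive algorithm is used. The function name and the parameters `taps`, `mu` and `eps` are assumptions made for this example.

```python
def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Remove the loudspeaker echo from one microphone signal (NLMS sketch).

    mic: samples collected by the microphone (speech plus echo).
    ref: samples played by the loudspeaker, same length as mic.
    Returns the echo-reduced signal (the filter's error signal).
    """
    weights = [0.0] * taps  # adaptive estimate of the echo path
    out = []
    for n in range(len(mic)):
        # Most recent `taps` loudspeaker samples, newest first.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(w * xk for w, xk in zip(weights, x))
        error = mic[n] - echo_estimate  # what remains after removing the echo
        norm = sum(xk * xk for xk in x) + eps  # input power, regularised
        # Normalised gradient step toward a better echo-path estimate.
        weights = [w + mu * error * xk / norm for w, xk in zip(weights, x)]
        out.append(error)
    return out
```

In this sketch the echo path is assumed to be shorter than `taps` samples; a real room would typically need a much longer filter and handling of simultaneous near-end speech.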

When a microphone collects a voice signal, it receives both the sound wave that arrives directly from the sound source and the sound waves that arrive after being reflected. The sound waves emitted by the sound source that arrive via reflection are called reverberation. Reverberation degrades the sound quality of the microphone array, so the computer device performs a reverberation cancellation operation on the received voice signal to avoid interference from reverberation.

The computer device may perform reverberation cancellation on the voice signals collected by the microphones using a beamforming-based reverberation cancellation algorithm, an inverse-filtering-based reverberation cancellation algorithm, or a complex cepstrum filtering reverberation cancellation algorithm.

In one embodiment, the computer device performing sound source localization on the voice signals in the voice signal set subjected to the deletion operation, according to the spatial position information in the position information list subjected to the deletion operation, includes: selecting a reference voice signal from the voice signal set subjected to the deletion operation; respectively calculating time delay information of the remaining voice signals, other than the reference voice signal, in the voice signal set subjected to the deletion operation relative to the reference voice signal; and obtaining the sound source direction of each voice signal through geometric operation based on the time delay information and the spatial position information in the position information list subjected to the deletion operation.

The time delay information is the difference in arrival time between the signals collected by different microphones. The computer device can infer the bearing of the sound source from the speed of sound, the geometry of the microphone array (i.e., the spatial positions of the microphones), and the estimated time delay information.

The geometric operation is a method of calculating the direction of the sound source from the geometric relationship between the microphones and the sound source, where that relationship is determined by the differences in the distances the sound travels from the sound source to the respective microphones.
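The link between a measured time delay and a distance difference can be written compactly. The symbols are introduced here for illustration only and do not appear in the source: $\mathbf{s}$ is the sound source position, $\mathbf{m}_0$ the position of the microphone that collects the reference voice signal, $\mathbf{m}_i$ the position of the $i$-th remaining microphone, $c$ the speed of sound, and $\tau_i$ the delay of microphone $i$ relative to the reference.

```latex
\Delta d_i \;=\; \lVert \mathbf{s} - \mathbf{m}_i \rVert \;-\; \lVert \mathbf{s} - \mathbf{m}_0 \rVert \;=\; c\,\tau_i
```

Each measured delay $\tau_i$ therefore constrains $\mathbf{s}$ to the set of points whose distances to $\mathbf{m}_i$ and $\mathbf{m}_0$ differ by $c\,\tau_i$. For $\tau_i = 0$ this set is the locus of points equidistant from the two microphones; for $\tau_i \neq 0$ it is one branch of a hyperbola with foci $\mathbf{m}_0$ and $\mathbf{m}_i$.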

In one embodiment, the time delay between the arrival of the sound at the microphone that collects the reference voice signal (one of the microphones in the array that has not been deleted, selected as the spatial reference) and its arrival at each of the other microphones is zero, so the distance differences can be presumed to be zero. That is, the sound source is equidistant from each microphone. By the definition of a circle, every point on the circumference is at the same distance from the center. The sound source is therefore located at the center of the circle on which the microphones of the microphone array lie.

In another embodiment, the difference between the distance from the sound source to the microphone that collects the reference voice signal and the distance from the sound source to each of the other microphones is the same and greater than zero. By the definition of a hyperbola, the difference between the distances from any point on the hyperbola to its two foci is a constant, so the sound source lies on a hyperbola whose foci are the two microphones. The computer device can determine two hyperbolas from two sets of distance differences, and the intersection of the two hyperbolas is the position of the sound source.

In another embodiment, the differences between the distance from the sound source to the microphone that collects the reference voice signal and the distances from the sound source to the other microphones are greater than zero and vary with the spatial positions of the microphones. The computer device derives the geometric relationship between the microphones in the microphone array and the sound source from the functional relationship among these distance differences, and can then obtain the position of the sound source from the spatial position information of the microphones.
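One simple way to turn a set of delays and microphone positions into a source position is a brute-force grid search, sketched below for the two-dimensional case. The grid search is an illustrative substitute and not a method stated in the source; the function name, the search range, the grid step and the speed of sound value (343 m/s) are all assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def locate_source(mic_positions, delays, ref_index=0,
                  search_range=3.0, step=0.1):
    """Find the 2-D point whose predicted delays best match the measured ones.

    mic_positions: list of (x, y) for the microphones that were not deleted.
    delays: delays[i] is the arrival time at microphone i minus the arrival
            time at the reference microphone, in seconds.
    Returns the best (x, y) on a square grid centred on the origin.
    """
    ref = mic_positions[ref_index]
    steps = int(round(2 * search_range / step))
    best_pos, best_err = None, float("inf")
    for ix in range(steps + 1):
        x = -search_range + ix * step
        for iy in range(steps + 1):
            y = -search_range + iy * step
            d_ref = math.dist((x, y), ref)
            err = 0.0
            for i, mic in enumerate(mic_positions):
                if i == ref_index:
                    continue
                # Delay this candidate point would produce at microphone i.
                predicted = (math.dist((x, y), mic) - d_ref) / SPEED_OF_SOUND
                err += (predicted - delays[i]) ** 2
            if err < best_err:
                best_pos, best_err = (x, y), err
    return best_pos
```

The resolution is limited by `step`, and with a small array the distance to a far source is poorly constrained, so in practice the bearing is usually more reliable than the range.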

After determining the position of the sound source, the computer device connects the point where the sound source is located and the point where a microphone is located with a straight line, so that the direction of the sound source relative to that microphone, namely the sound source direction of each voice signal, can be determined from the spatial position information of the microphone and the position of the sound source.
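For a planar layout, the direction of the line from a microphone to the sound source can be expressed as an angle. The sketch below assumes two-dimensional coordinates and measures the angle counter-clockwise from the positive x axis; both conventions and the function name are assumptions for illustration.

```python
import math


def source_direction_deg(mic_position, source_position):
    """Angle of the sound source as seen from a microphone, in [0, 360)."""
    dx = source_position[0] - mic_position[0]
    dy = source_position[1] - mic_position[1]
    # atan2 handles all four quadrants; the modulo maps negatives into [0, 360).
    return math.degrees(math.atan2(dy, dx)) % 360.0
```

Calling the function once per remaining microphone yields the sound source direction associated with each voice signal.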

In one embodiment, the computer device respectively calculating time delay information of the remaining voice signals, other than the reference voice signal, in the voice signal set subjected to the deletion operation relative to the reference voice signal includes: respectively calculating the cross-correlation function between the reference voice signal and each of the remaining voice signals in the voice signal set; acquiring the coordinate of each cross-correlation function at its peak; and acquiring the time delay information from the coordinate at the peak.

The voice signals collected by different microphones are correlated with one another. The computer device can therefore compute the cross-correlation function of two voice signals; the time coordinate of the peak of the cross-correlation function is the time difference between the two voice signals, namely the time delay information between the voice signals of the two microphones.
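The peak-picking step can be sketched with a direct time-domain cross-correlation. This is a minimal sketch assuming equal-length signals sampled at a common rate; the function name and the `max_lag` search window are hypothetical, and a positive result is taken to mean that `sig` arrives later than `ref`.

```python
def estimate_delay(sig, ref, sample_rate, max_lag):
    """Estimate the delay of `sig` relative to `ref`, in seconds.

    The cross-correlation is evaluated for every lag in
    [-max_lag, max_lag] samples and the lag at the peak is returned
    after conversion to seconds.
    """
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for n in range(len(ref)):
            m = n + lag
            if 0 <= m < len(sig):
                val += sig[m] * ref[n]  # correlation at this lag
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag / sample_rate


# A short pulse and the same pulse arriving 3 samples later.
ref = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
sig = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
delay = estimate_delay(sig, ref, sample_rate=16000, max_lag=5)
```

Here the correlation peaks at a lag of 3 samples, so `delay` equals 3 / 16000 seconds. For long signals an FFT-based correlation would be far faster than this double loop.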

In one embodiment, the computer device performing speech enhancement processing on the speech signals in the speech signal set subjected to the deletion operation, according to the sound source direction determined by the sound source localization, includes: performing time delay compensation on each voice signal in the voice signal set subjected to the deletion operation according to the time delay information; generating a voice response matrix for the voice signal set subjected to the deletion operation according to the sound source direction; and applying the voice response matrix to the delay-compensated voice signals to obtain an enhanced voice signal.

Time delay compensation applies a time shift to the voice signals so that they are aligned in the time domain.
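A minimal sketch of the alignment step is shown below, assuming the delays have already been rounded to whole samples and are expressed relative to the reference signal. The sign convention (a positive delay means the signal arrived late and is shifted earlier) and both function names are assumptions for illustration; vacated positions are filled with zeros.

```python
def compensate_delay(signal, delay_samples):
    """Shift a signal by its delay so that it lines up with the reference."""
    n = len(signal)
    # Clamp so that an oversized delay simply yields an all-zero signal.
    shift = max(-n, min(n, delay_samples))
    if shift >= 0:
        # Signal arrived late: drop the leading samples, pad the end.
        return list(signal[shift:]) + [0.0] * shift
    # Signal arrived early: pad the start, drop the trailing samples.
    return [0.0] * (-shift) + list(signal[:n + shift])


def align_signals(signals, delays_samples):
    """Apply delay compensation to every signal in the set."""
    return [compensate_delay(s, d) for s, d in zip(signals, delays_samples)]
```

Sub-sample delays would require interpolation or a frequency-domain phase shift, which this sketch does not attempt.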

The voice response matrix is a matrix used by the computer device to perform weighting calculation on the voice signals collected by each microphone in the microphone array. The element values of the elements in the voice response matrix correspond to the weight values for performing weighting calculation on the voice signals collected by the microphones. The computer device determines the element values of the elements in the voice response matrix according to the sound source direction corresponding to the voice signal.

In one embodiment, if a sound source direction of a voice signal is a target sound source direction, the computer device determines that an element value of an element in a voice response matrix corresponding to the voice signal is 1; if the sound source direction of the voice signal is the non-target sound source direction, the computer device determines that the element value of the element in the voice response matrix corresponding to the voice signal is 0.

In one embodiment, if an angle between a sound source direction of a voice signal and a target sound source direction is smaller than a preset angle value, the computer device determines that an element value of an element in a voice response matrix corresponding to the voice signal is 1; if the included angle between the sound source direction of the voice signal and the target sound source direction is greater than the preset angle value, the computer equipment determines that the element value of the element in the voice response matrix corresponding to the voice signal is 0.
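The angle-threshold rule and the subsequent weighting can be sketched as follows. The function names are hypothetical. Two details are assumptions not stated in the source: a signal whose angle exactly equals the preset value receives weight 0, and the weighted sum is divided by the total weight so that the output keeps the level of a single signal.

```python
def response_weights(source_dirs_deg, target_deg, max_angle_deg):
    """Weight 1 for signals near the target direction, 0 otherwise."""
    weights = []
    for direction in source_dirs_deg:
        # Smallest absolute angle between the two directions, in [0, 180].
        diff = abs((direction - target_deg + 180.0) % 360.0 - 180.0)
        weights.append(1.0 if diff < max_angle_deg else 0.0)
    return weights


def enhance(aligned_signals, weights):
    """Weighted average of time-aligned signals, sample by sample."""
    length = len(aligned_signals[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * length  # no signal points at the target
    return [
        sum(w * s[n] for w, s in zip(weights, aligned_signals)) / total
        for n in range(length)
    ]
```

For example, with a target direction of 0 degrees and a preset angle of 30 degrees, signals at 10 and 350 degrees are kept while a signal at 90 degrees is suppressed.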

In one embodiment, a schematic diagram of the computer device performing speech enhancement processing on the microphones in the microphone array is shown in fig. 3. When the first microphone is occluded, the computer device deletes the spatial position information of the first microphone and deletes the voice signal collected by the first microphone. The computer device then sequentially performs the echo cancellation operation, the reverberation cancellation operation, sound source localization and voice enhancement processing on the voice signals in the voice signal set subjected to the deletion operation, according to the spatial position information in the position information list subjected to the deletion operation, and finally outputs the enhanced signal to the loudspeaker.

In one embodiment, after performing speech enhancement processing on the speech signals in the speech signal set subjected to the deletion operation according to the sound source direction determined by the sound source localization, the computer device further performs the following: acquiring updated monitoring information; judging, according to the updated monitoring information, whether the shielded microphone has returned to normal; if the shielded microphone has returned to normal, updating the position information list and the voice signal set; and performing voice enhancement processing based on the updated position information list and the updated voice signal set.

The computer device dynamically monitors each microphone in real time. If a shielded microphone returns to normal, the spatial position information corresponding to that microphone is added back into the position information list, and the voice signals collected by that microphone are added back into the voice signal set.
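One way to handle both deletion and later restoration is to keep the full set of microphones and rebuild the active position list and signal set from the latest occlusion judgments. The sketch below assumes each microphone has a unique identifier used as a dictionary key; that data layout and the function name are assumptions, not something the source specifies.

```python
def refresh_active_set(all_positions, all_signals, occluded_ids):
    """Return the position list and signal set without occluded microphones.

    all_positions: dict mapping microphone id to its spatial position.
    all_signals:   dict mapping microphone id to its collected signal.
    occluded_ids:  set of ids currently judged to be occluded.
    """
    positions = {m: p for m, p in all_positions.items() if m not in occluded_ids}
    signals = {m: s for m, s in all_signals.items() if m not in occluded_ids}
    return positions, signals
```

Because the active set is recomputed from the complete data each time, a microphone that returns to normal reappears automatically once it drops out of `occluded_ids`.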

Based on the monitoring information, the computer device judges whether each microphone is shielded and promptly removes the data of the corresponding microphone from, or adds it back to, the input of the speech enhancement processing algorithm, so that the quality of the speech signal output by the algorithm is improved in real time.

In one embodiment, a flow of speech enhancement processing performed by a computer device on speech signals collected by microphones in a microphone array is shown in fig. 4, and includes the following steps:

S402, acquiring monitoring information corresponding to each microphone in the microphone array, and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array.

S404, obtaining a judgment result whether the microphone corresponding to each monitoring information is blocked or not according to at least one monitoring information corresponding to each microphone in the microphone array.

S406, comprehensively determining whether the microphone corresponding to the monitoring information is blocked or not based on the judgment result whether the microphone corresponding to each kind of monitoring information corresponding to each microphone is blocked or not.

S408, when the occluded microphone exists in the microphone array, deleting the spatial position information of the occluded microphone in the position information list of the microphone array, and deleting the voice signal collected by the occluded microphone in the voice signal set.

S410, performing echo cancellation operation on the voice signals in the voice signal set subjected to the deleting operation.

S412, a reverberation cancellation operation is performed on the voice signal subjected to the echo cancellation operation.

S414, carrying out sound source localization on the voice signal subjected to the reverberation elimination operation according to the spatial position information in the position information list subjected to the deletion operation.

S416, performing voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning.

The specific contents of the above S402 to S416 may refer to the specific implementation process described above.
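The order of steps S402 to S416 can be summarised as a small driver function. This is only a structural sketch: every callable passed in (`judge`, `cancel_echo`, `dereverberate`, `localize`, `enhance`) is a hypothetical placeholder for the corresponding processing described above, and the dictionary layout keyed by microphone id is an assumption.

```python
def run_enhancement_flow(monitoring, positions, signals, *,
                         judge, cancel_echo, dereverberate, localize, enhance):
    """Run the flow once on already-acquired data (S402 done by the caller)."""
    # S404-S406: decide per microphone whether it is occluded.
    occluded = {mic for mic, info in monitoring.items() if judge(info)}
    # S408: drop occluded microphones from the position list and signal set.
    positions = {m: p for m, p in positions.items() if m not in occluded}
    signals = {m: s for m, s in signals.items() if m not in occluded}
    # S410 then S412: echo cancellation followed by reverberation cancellation.
    signals = {m: dereverberate(cancel_echo(s)) for m, s in signals.items()}
    # S414: sound source localization on the remaining microphones.
    direction = localize(positions, signals)
    # S416: speech enhancement using the determined sound source direction.
    return enhance(signals, direction)
```

Passing the processing stages as arguments keeps the sketch independent of any particular echo cancellation, dereverberation or localization algorithm.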

It should be understood that although the steps in the flowcharts of figs. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 5, there is provided a speech enhancement processing apparatus including: an obtaining module 502, a judging module 504, a deleting module 506, a sound source positioning module 508 and an enhancement processing module 510, wherein:

an obtaining module 502, configured to obtain monitoring information corresponding to each microphone in the microphone array, and obtain a voice signal set formed by voice signals collected by each microphone in the microphone array;

a judging module 504, configured to judge whether a corresponding microphone is blocked according to the monitoring information;

a deleting module 506, configured to delete spatial location information of an occluded microphone from a location information list of the microphone array and delete a speech signal acquired by the occluded microphone from a speech signal set when the occluded microphone exists in the microphone array;

a sound source localization module 508, configured to perform sound source localization on the voice signals in the voice signal set subjected to the deletion operation according to the spatial location information in the location information list subjected to the deletion operation;

and an enhancement processing module 510, configured to perform speech enhancement processing on the speech signals in the speech signal set subjected to the deleting operation according to the sound source direction determined by the sound source location.

In the above embodiment, the computer device obtains the monitoring information of each microphone, and can determine whether the microphone is blocked according to the monitoring information, and delete the spatial position information of the microphone and the voice signal collected by the microphone when the microphone is blocked. Because the computer device does not utilize the spatial position information of the shielded microphone and the voice signal collected by the shielded microphone when performing the voice enhancement processing, the voice enhancement algorithm can be prevented from being interfered due to the shielded microphone, and the quality of the voice signal after the voice enhancement processing is improved.

In one embodiment, the determining module 504 is further configured to:

obtaining at least one judgment result of whether the microphone corresponding to at least one monitoring information is shielded or not according to at least one monitoring information corresponding to each microphone in the microphone array; and determining whether the microphone corresponding to the at least one monitoring information is blocked or not based on the at least one judgment result.

In one embodiment, the determining module 504 is further configured to:

when the monitoring information is infrared information corresponding to a microphone in the microphone array, detecting, based on the infrared information, whether a shielding object is present so as to judge whether the corresponding microphone is blocked.

In one embodiment, the sound source localization module 508 is further configured to:

respectively calculating time delay information of the rest voice signals except the reference voice signal in the voice signal set subjected to the deleting operation relative to the reference voice signal;

acquiring coordinates of the cross-correlation function at a peak value;

and acquiring time delay information according to the coordinates at the peak value.

In one embodiment, the enhancement processing module 510 is further configured to:

generating a voice response matrix of the voice signal set subjected to the deleting operation according to the direction of the sound source;

In one embodiment, as shown in fig. 6, the apparatus further comprises:

an obtaining module 502, configured to obtain updated monitoring information;

a judging module 504, configured to judge whether the shielded microphone returns to normal according to the updated monitoring information;

an updating module 512, configured to update the position information list and the speech signal set if the shielded microphone returns to normal;

and an enhancement processing module 510, configured to perform voice enhancement processing based on the updated spatial position information list and the updated voice signal set.

For specific limitations of the speech enhancement processing device, reference may be made to the limitations of the speech enhancement processing method above, which are not repeated here. The modules in the speech enhancement processing device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store speech enhancement processing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech enhancement processing method.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech enhancement processing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the configurations shown in figs. 7 and 8 are only block diagrams of the parts of the configurations relevant to the present disclosure, and do not constitute a limitation on the computer device to which the present disclosure may be applied; a particular computer device may include more or fewer components than those shown in the figures, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring monitoring information corresponding to each microphone in a microphone array, and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array; judging whether the corresponding microphone is shielded or not according to the monitoring information; when the shielded microphone exists in the microphone array, deleting the spatial position information of the shielded microphone in a position information list of the microphone array, and deleting the voice signal collected by the shielded microphone in a voice signal set; performing sound source localization on the voice signals in the voice signal set subjected to the deleting operation according to the spatial position information in the position information list subjected to the deleting operation; and performing voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning.

In one embodiment, the processor, when executing the computer program, further performs the steps of: obtaining at least one judgment result of whether the microphone corresponding to at least one monitoring information is shielded or not according to at least one monitoring information corresponding to each microphone in the microphone array; and determining whether the microphone corresponding to the at least one monitoring information is blocked or not based on the at least one judgment result.

In one embodiment, the processor, when executing the computer program, further performs the steps of: when the monitoring information is a monitoring image of a microphone in the microphone array, detecting whether a sound inlet hole exists in the monitoring image so as to judge whether the corresponding microphone is blocked; and/or when the monitoring information is the bearing pressure value of a microphone in the microphone array, detecting whether the bearing pressure value exceeds a preset value so as to judge whether the corresponding microphone is blocked; and/or when the monitoring information is infrared information corresponding to a microphone in the microphone array, detecting, based on the infrared information, whether a shielding object is present so as to judge whether the corresponding microphone is blocked.

In one embodiment, the processor, when executing the computer program, further performs the steps of: performing echo cancellation operation on the voice signals in the voice signal set subjected to the deleting operation; performing reverberation elimination operation on the voice signal subjected to echo elimination operation; and carrying out sound source localization on the voice signal subjected to the reverberation elimination operation according to the spatial position information in the position information list subjected to the deletion operation.

In one embodiment, the processor, when executing the computer program, further performs the steps of: selecting a reference voice signal serving as a reference datum from the voice signal set subjected to the deleting operation; respectively calculating time delay information of the rest voice signals except the reference voice signal in the voice signal set subjected to the deleting operation relative to the reference voice signal; and obtaining the sound source direction of each voice signal through geometric operation based on the time delay information and the spatial position information in the position information list subjected to the deleting operation.

In one embodiment, the processor, when executing the computer program, further performs the steps of: respectively calculating cross-correlation functions between the rest voice signals except the reference voice signal in the voice signal set and the reference voice signal; acquiring coordinates of the cross-correlation function at a peak value; and acquiring time delay information according to the coordinates at the peak value.

In one embodiment, the processor, when executing the computer program, further performs the steps of: performing time delay compensation on each voice signal in the voice signal set subjected to the deleting operation according to the time delay information; generating a voice response matrix of the voice signal set subjected to the deleting operation according to the direction of the sound source; and calculating each voice signal subjected to time delay compensation through the voice response matrix to obtain an enhanced voice signal.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring updated monitoring information; judging whether the shielded microphone is recovered to be normal or not according to the updated monitoring information; if the shielded microphone returns to normal, updating the position information list and the voice signal set; and performing voice enhancement processing based on the updated spatial position information list and the updated voice signal set.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring monitoring information corresponding to each microphone in a microphone array, and acquiring a voice signal set formed by voice signals acquired by each microphone in the microphone array; judging whether the corresponding microphone is shielded or not according to the monitoring information; when the shielded microphone exists in the microphone array, deleting the spatial position information of the shielded microphone in a position information list of the microphone array, and deleting the voice signal collected by the shielded microphone in a voice signal set; performing sound source localization on the voice signals in the voice signal set subjected to the deleting operation according to the spatial position information in the position information list subjected to the deleting operation; and performing voice enhancement processing on the voice signals in the voice signal set subjected to the deleting operation according to the sound source direction determined by the sound source positioning.

In one embodiment, the computer program when executed by the processor further performs the steps of: obtaining at least one judgment result of whether the microphone corresponding to at least one monitoring information is shielded or not according to at least one monitoring information corresponding to each microphone in the microphone array; and determining whether the microphone corresponding to the at least one monitoring information is blocked or not based on the at least one judgment result.

In one embodiment, the computer program when executed by the processor further performs the steps of: when the monitoring information is a monitoring image of a microphone in the microphone array, detecting whether a sound inlet hole exists in the monitoring image so as to judge whether the corresponding microphone is blocked; and/or when the monitoring information is the bearing pressure value of a microphone in the microphone array, detecting whether the bearing pressure value exceeds a preset value so as to judge whether the corresponding microphone is blocked; and/or when the monitoring information is infrared information corresponding to a microphone in the microphone array, detecting, based on the infrared information, whether a shielding object is present so as to judge whether the corresponding microphone is blocked.

In one embodiment, the computer program when executed by the processor further performs the steps of: performing echo cancellation operation on the voice signals in the voice signal set subjected to the deleting operation; performing reverberation elimination operation on the voice signal subjected to echo elimination operation; and carrying out sound source localization on the voice signal subjected to the reverberation elimination operation according to the spatial position information in the position information list subjected to the deletion operation.

In one embodiment, the computer program when executed by the processor further performs the steps of: selecting a reference voice signal serving as a reference datum from the voice signal set subjected to the deleting operation; respectively calculating time delay information of the rest voice signals except the reference voice signal in the voice signal set subjected to the deleting operation relative to the reference voice signal; and obtaining the sound source direction of each voice signal through geometric operation based on the time delay information and the spatial position information in the position information list subjected to the deleting operation.
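The geometric operation is not spelled out in this passage. One common way to realise it, shown here purely as an assumption, is the far-field (plane-wave) model for a single microphone pair: if a microphone at distance d from the reference microphone receives the sound with delay tau relative to the reference, the angle between the pair's baseline and the source direction satisfies cos(angle) = -c * tau / d, where c is the speed of sound. The function name `arrival_angle` and the constant `SPEED_OF_SOUND` are hypothetical.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # metres per second; assumed room-temperature value


def arrival_angle(
    ref_pos: Sequence[float],
    mic_pos: Sequence[float],
    delay_s: float,
    speed: float = SPEED_OF_SOUND,
) -> float:
    """Angle in degrees between the baseline (reference -> microphone) and the source.

    delay_s is the arrival time at the microphone minus the arrival time at the
    reference microphone, so a negative value means the microphone hears it first.
    """
    baseline = [m - r for m, r in zip(mic_pos, ref_pos)]
    distance = math.sqrt(sum(b * b for b in baseline))
    if distance == 0.0:
        raise ValueError("microphone and reference must be at different positions")
    # Far-field model: delay = -(baseline . unit_direction) / speed.
    cos_angle = -speed * delay_s / distance
    # Clamp to absorb measurement noise that pushes the ratio slightly out of range.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


broadside = arrival_angle((0.0, 0.0, 0.0), (0.1, 0.0, 0.0), 0.0)
endfire = arrival_angle((0.0, 0.0, 0.0), (0.1, 0.0, 0.0), -0.1 / SPEED_OF_SOUND)
```

A single pair only constrains the source to a cone around the baseline; combining several pairs from the position information list would be needed to resolve a unique direction.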

In one embodiment, the computer program when executed by the processor further performs the steps of: respectively calculating a cross-correlation function between the reference voice signal and each of the remaining voice signals in the voice signal set; acquiring the coordinates of each cross-correlation function at its peak value; and acquiring the time delay information according to the coordinates at the peak value.
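The cross-correlation steps above can be sketched in plain Python as follows. This is a minimal, unoptimised illustration that assumes discrete-time signals sampled at a common rate; the function name `estimate_delay` and its parameters are hypothetical. The lag at which the cross-correlation is largest plays the role of the coordinate at the peak value, and dividing it by the sample rate gives the delay in seconds.

```python
from typing import Sequence


def estimate_delay(
    signal: Sequence[float],
    reference: Sequence[float],
    sample_rate: float,
) -> float:
    """Delay of `signal` relative to `reference` in seconds (positive = signal is later)."""
    n, m = len(signal), len(reference)
    best_lag = 0
    best_value = float("-inf")
    # Evaluate r(lag) = sum_i signal[i + lag] * reference[i] for every overlapping lag.
    for lag in range(-(m - 1), n):
        value = 0.0
        for i in range(m):
            j = i + lag
            if 0 <= j < n:
                value += signal[j] * reference[i]
        # Keep the lag at which the cross-correlation reaches its peak.
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag / sample_rate


reference = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # the same pulse, two samples later
delay = estimate_delay(delayed, reference, sample_rate=16000.0)
```

The double loop costs O(n * m) operations; a practical implementation would more likely compute the same function with an FFT.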

In one embodiment, the computer program when executed by the processor further performs the steps of: performing time delay compensation on each voice signal in the voice signal set subjected to the deleting operation according to the time delay information; generating a voice response matrix of the voice signal set subjected to the deleting operation according to the direction of the sound source; and calculating each voice signal subjected to time delay compensation through the voice response matrix to obtain an enhanced voice signal.
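How the voice response matrix is constructed is not detailed in this passage, so the sketch below substitutes a plain delay-and-sum beamformer, which is a simpler technique and not necessarily the one intended by the application. It assumes the delays have already been rounded to whole samples and weights every channel equally; the name `delay_and_sum` is hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(
    signals: Sequence[Sequence[float]],
    delays_in_samples: Sequence[int],
) -> List[float]:
    """Align each channel by its delay, then average the aligned channels.

    delays_in_samples[k] is the delay of channel k relative to the reference
    channel (0 for the reference itself). Samples that fall outside a channel
    after shifting are treated as zero.
    """
    if len(signals) != len(delays_in_samples):
        raise ValueError("one delay is required per channel")
    length = min(len(s) for s in signals)
    output: List[float] = []
    for n in range(length):
        total = 0.0
        for channel, delay in zip(signals, delays_in_samples):
            j = n + delay  # time delay compensation: read the later sample
            if 0 <= j < len(channel):
                total += channel[j]
        # Equal weights 1/M stand in for the rows of a response matrix.
        output.append(total / len(signals))
    return output


reference = [0.0, 1.0, 0.0, 0.0]
late = [0.0, 0.0, 1.0, 0.0]  # the same pulse, one sample later
enhanced = delay_and_sum([reference, late], [0, 1])
```

After compensation the two pulses coincide and add coherently, whereas uncorrelated noise on the channels would be attenuated by the averaging.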

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring updated monitoring information; judging whether the shielded microphone is recovered to be normal or not according to the updated monitoring information; if the shielded microphone returns to normal, updating the position information list and the voice signal set; and performing voice enhancement processing based on the updated spatial position information list and the updated voice signal set.
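One way to make the recovery step straightforward, offered only as an assumed design, is to keep the complete position information list untouched and track the shielded microphones separately, so that a recovered microphone is restored simply by clearing its flag. The class `ArrayState` and its methods are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

Position = Tuple[float, float, float]  # assumed (x, y, z) coordinates in metres


class ArrayState:
    """Tracks which microphones of an array are currently usable."""

    def __init__(self, positions: Dict[int, Position]) -> None:
        self._all_positions = dict(positions)  # full list, never modified
        self._shielded: Set[int] = set()

    def update(self, mic_id: int, shielded: bool) -> None:
        """Apply the judgment derived from the updated monitoring information."""
        if shielded:
            self._shielded.add(mic_id)
        else:
            self._shielded.discard(mic_id)  # microphone has returned to normal

    def active_positions(self) -> Dict[int, Position]:
        """Position information list after the deleting operation."""
        return {m: p for m, p in self._all_positions.items() if m not in self._shielded}

    def active_signals(self, frame: Dict[int, List[float]]) -> Dict[int, List[float]]:
        """Voice signal set for one frame, restricted to usable microphones."""
        return {m: s for m, s in frame.items() if m not in self._shielded}


state = ArrayState({0: (0.0, 0.0, 0.0), 1: (0.05, 0.0, 0.0)})
state.update(1, shielded=True)
during = sorted(state.active_positions())
state.update(1, shielded=False)
after = sorted(state.active_positions())
```

Because the full list is retained, the updated position information list and the updated voice signal set are always derived from the same shielded set and cannot drift apart.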

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of the present specification as long as it contains no contradiction.

The above embodiments express only several implementations of the present application. Although their description is relatively specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for speech enhancement, the method comprising:

2. The method of claim 1, wherein determining whether the corresponding microphone is occluded according to the monitoring information comprises:

3. The method of claim 1, wherein determining whether the corresponding microphone is occluded according to the monitoring information comprises:

4. The method according to claim 1, wherein the performing sound source localization on the voice signals in the voice signal set subjected to the deletion operation according to the spatial position information in the position information list subjected to the deletion operation comprises:

5. The method according to claim 1, wherein the performing sound source localization on the voice signals in the voice signal set subjected to the deletion operation according to the spatial position information in the position information list subjected to the deletion operation comprises:

6. The method according to claim 5, wherein the separately calculating the delay information of the remaining voice signals except the reference voice signal in the voice signal set subjected to the deletion operation with respect to the reference voice signal comprises:

obtaining coordinates of the cross-correlation function at a peak value;

7. The method according to claim 5, wherein the performing speech enhancement processing on the speech signals in the speech signal set subjected to the deleting operation according to the sound source direction determined by the sound source location comprises:

8. The method according to claim 1, wherein after performing speech enhancement processing on the speech signals in the speech signal set subjected to the deleting operation according to the sound source direction determined by the sound source localization, the method further comprises:

acquiring updated monitoring information;

9. A speech enhancement processing apparatus, characterized in that the apparatus comprises:

10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.