CN110970049A

CN110970049A - Multi-person voice recognition method, device, equipment and readable storage medium

Info

Publication number: CN110970049A
Application number: CN201911248622.5A
Authority: CN
Inventors: 黄族良; 龙洪锋
Original assignee: Guangzhou Speakin Intelligent Technology Co ltd
Current assignee: Guangzhou Speakin Intelligent Technology Co ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-04-07

Abstract

The invention discloses a multi-person voice recognition method, a device, equipment and a readable storage medium, wherein the multi-person voice recognition method receives multi-person voice signals in a current multi-person scene by utilizing preset sound source positioning acquisition equipment, and obtains sound source position information of each speaker in the multi-person voice signals, so that the multi-person voice signals are initially separated into single audio signals; based on the sound source position information, single audio signals corresponding to all the speakers are sequentially enhanced, multi-person voice signals are further separated according to sound sources at different positions, and interference of voices of other people on the current single voice to be enhanced is weakened; and identifying each enhanced single-person audio signal to acquire the identity information of each speaker, so that the identity information of all speakers of the multi-person voice signal is finally acquired.

Description

Multi-person voice recognition method, device, equipment and readable storage medium

Technical Field

The invention relates to the technical field of electronic information, in particular to a multi-person voice recognition method, a multi-person voice recognition device, multi-person voice recognition equipment and a readable storage medium.

Background

With the development of science and technology, speech recognition is applied ever more widely across fields. However, current speech recognition technology performs well only when recognizing the audio of a single speaker and cannot effectively recognize the mixed audio of multiple speakers. For example, prior-art approaches that recognize multi-person voice signals based on a microphone array suffer from problems such as a fixed configuration, and the recognition results are often unsatisfactory. As a result, multi-person mixed voice signals remain difficult to recognize.

The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.

Disclosure of Invention

The invention mainly aims to provide a multi-person voice recognition method, and aims to solve the technical problem that a multi-person mixed voice signal is difficult to recognize.

In order to achieve the above object, the present invention provides a multi-person voice recognition method, which is applied to a multi-person voice recognition apparatus, the multi-person voice recognition method comprising the steps of:

receiving a multi-person voice signal in a current multi-person scene by using preset sound source positioning and collecting equipment, and acquiring sound source position information of each speaker in the multi-person voice signal;

sequentially enhancing the single-person audio signal corresponding to each speaker based on the sound source position information;

and identifying each enhanced single-person audio signal to acquire the identity information of each speaker.

Optionally, the step of receiving a multi-person voice signal in a current multi-person scene by using a preset sound source positioning and collecting device, and acquiring sound source position information of each speaker in the multi-person voice signal includes:

receiving the multi-person voice signals based on a microphone array in the preset sound source positioning and collecting equipment;

and acquiring the sound source position information based on the microphone array and an infrared sensor in the preset sound source positioning and collecting equipment.

Optionally, the step of acquiring the sound source position information based on the microphone array and the infrared sensor in the preset sound source positioning and collecting device includes:

acquiring first position information of each speaker based on a pyroelectric infrared sensing unit in the infrared sensor;

acquiring second position information of each speaker based on the microphone array;

and integrating the first position information and the second position information to generate the sound source position information.

Optionally, the step of acquiring first position information of each of the speakers based on a pyroelectric infrared sensing unit in the infrared sensor includes:

acquiring a position coordinate parameter and an induction included angle parameter of the pyroelectric infrared induction unit;

and generating the first position information based on a preset positioning algorithm, the position coordinate parameter and the induction included angle parameter.

Optionally, the step of acquiring second position information of each of the speakers based on the microphone array includes:

acquiring incidence angle information of the multi-person audio signal by a signal parameter estimation algorithm based on a rotation invariant technology;

and determining second position information of each speaker according to the incidence angle information.

Optionally, the step of sequentially enhancing the single-person audio signal corresponding to each of the speakers based on the sound source position information includes:

determining that each of the speakers corresponds to a single audio signal among the multi-person voice signals based on the sound source position information;

sequentially taking each single audio signal as a target audio signal, and taking the rest audio signals in the multi-person voice signals as interference audio signals;

carrying out self-adaptive adjustment on a first filtering weight of the target audio signal and a second filtering weight of the interference audio signal;

and performing adaptive space-domain filtering on the multi-person voice signal based on the adaptively adjusted first filtering weight and second filtering weight so as to enhance the target audio signal.

Optionally, the step of identifying each enhanced target audio to obtain identity information of each speaker includes:

preprocessing each enhanced target audio signal to generate a voice signal to be recognized;

and sequentially extracting the target voiceprint characteristics of the voice signal to be recognized, and matching the target voiceprint characteristics with the pre-stored voiceprint characteristics to acquire the identity information of each speaker.

In addition, to achieve the above object, the present invention also provides a multi-person voice recognition apparatus including:

the sound source positioning acquisition module is used for receiving a multi-person voice signal in a current multi-person scene by using preset sound source positioning acquisition equipment and acquiring sound source position information of each speaker in the multi-person voice signal;

the single audio enhancement module is used for sequentially enhancing the single audio signals corresponding to the speakers based on the sound source position information;

and the audio signal identification module is used for identifying each enhanced single audio signal so as to acquire the identity information of each speaker.

Further, to achieve the above object, the present invention also provides multi-person voice recognition equipment including: a memory, a processor and a multi-person voice recognition program stored on the memory and capable of running on the processor, wherein the multi-person voice recognition program, when executed by the processor, implements the steps of the multi-person voice recognition method as described above.

Further, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a multi-person voice recognition program which, when executed by a processor, implements the steps of the multi-person voice recognition method as described above.

The invention provides a multi-person voice recognition method, a multi-person voice recognition device, multi-person voice recognition equipment and a computer readable storage medium. In the multi-person voice recognition method, a preset sound source positioning and collecting device receives multi-person voice signals in a current multi-person scene, and sound source position information of each speaker in the multi-person voice signals is obtained; the single-person audio signal corresponding to each speaker is sequentially enhanced based on the sound source position information; and each enhanced single-person audio signal is identified to acquire the identity information of each speaker. In this way, the sound source position information of each speaker in the multi-person voice signals is acquired through the preset sound source positioning and collecting device, so that the multi-person voice signals can be initially separated into single-person audio signals; by sequentially enhancing the single-person audio signals, the multi-person voice signals are further separated according to the sound sources at different positions, and the interference of other speakers' voices with the single-person audio signal currently being enhanced is weakened; by recognizing each enhanced single-person audio signal in sequence, the identity information of all speakers in the multi-person voice signals can finally be acquired, which solves the technical problem that multi-person mixed voice signals are difficult to recognize.

Drawings

FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a multi-person voice recognition method according to a first embodiment of the present invention;

FIG. 3 is a flowchart illustrating a multi-person voice recognition method according to a second embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.

The terminal of the embodiment of the invention can be a PC, or a mobile terminal device with a display function, such as a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a portable computer, and the like.

As shown in fig. 1, the terminal may include: a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001 described above.

Optionally, the terminal may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and the like. The sensor may be, for example, a light sensor, a motion sensor, or another sensor. Specifically, the light sensor may include an ambient light sensor, which can adjust the brightness of the display screen according to the brightness of ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and can detect the magnitude and direction of gravity when the mobile terminal is stationary. It can be used for applications that recognize the attitude of the mobile terminal (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration recognition functions (such as a pedometer and tap detection). Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.

Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a multi-person voice recognition program.

In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to invoke the multi-person voice recognition program stored in the memory 1005 and perform the following operations:

Further, the processor 1001 may call the multi-person voice recognition program stored in the memory 1005, and also perform the following operations:

Based on the hardware structure, the invention provides various embodiments of the multi-person voice recognition method.

Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the multi-person voice recognition method.

A first embodiment of the present invention provides a multi-person voice recognition method, including:

In this embodiment, to solve the above problem, the present invention provides a multi-person voice recognition method. A preset sound source positioning and collecting device obtains sound source position information of each speaker in a multi-person voice signal, so that the multi-person voice signal is initially separated into single-person audio signals; by sequentially enhancing the single-person audio signals, the multi-person voice signal is further separated according to the sound sources at different positions, and the interference of other speakers' voices with the single-person audio signal currently being enhanced is weakened; by recognizing each enhanced single-person audio signal in sequence, the identity information of all speakers in the multi-person voice signal can finally be acquired, which solves the technical problem that multi-person mixed voice signals are difficult to recognize. The multi-person voice recognition method is applied to a terminal provided with a preset sound source positioning and collecting device.

Step S10, receiving a multi-person voice signal in a current multi-person scene by using a preset sound source positioning and collecting device, and acquiring sound source position information of each speaker in the multi-person voice signal;

The preset sound source positioning and collecting equipment can be a microphone array, a passive infrared detector, a radar detector, and the like.

In this embodiment, it should be noted that the current application scenario is one in which multiple persons speak simultaneously from different directions in a closed space. When multiple persons speak simultaneously in the current space, the terminal collects, through the preset sound source positioning and collecting equipment, a multi-person voice signal of a preset duration. The preset duration can be set flexibly according to the specific situation, which is not limited in this embodiment. The terminal detects the sound source position of each speaker in the multi-person voice signal through the preset sound source positioning and collecting equipment and obtains the sound source position information of each speaker. Specifically, suppose a sound source positioning and collecting device is placed at the center of the current scene, with the direction south of the device taken as the front. Four speakers in the current scene speak at the same time and are located five meters directly in front of the device, three meters away at forty-five degrees to the lower left, and four meters away at thirty degrees to the lower right. The terminal collects four-person mixed audio with a duration of three minutes through the preset sound source positioning and collecting equipment and, through calculation and analysis, obtains the direction information of the four speakers (directly in front, forty-five degrees to the lower left, thirty degrees to the lower right) and the distance information of five meters, three meters, four meters and four meters.

Step S20, sequentially enhancing the single-person audio signal corresponding to each of the speakers based on the sound source position information;

In this embodiment, the terminal sequentially separates the single-person audio signal corresponding to each speaker from the current multi-person voice signal according to the sound source position information of each speaker acquired by the sound source positioning and collecting device. While one of the single-person audio signals is enhanced, the remaining voice signals are weakened. Specifically, continuing with the setting of the example in step S10, the terminal first takes, as the target audio signal, the single-person audio signal in the four-person mixed voice signal that corresponds to the speaker located five meters directly in front of the sound source positioning and collecting equipment, and takes the mixed voice signal of the other three speakers as the interference voice signal to be weakened. The terminal adjusts the relevant parameters of the target audio signal and of the interference voice signal respectively to achieve enhancement and weakening, and obtains the enhanced single-person audio signal corresponding to the speaker five meters in front of the equipment. The terminal then takes, as the target audio signal, the single-person audio signal corresponding to the speaker located three meters away at forty-five degrees to the lower left of the equipment, takes the mixed voice signal of the other three speakers as the interference voice signal, and again adjusts the relevant parameters of the current target audio signal and interference voice signal to obtain the enhanced current voice signal. The terminal processes the single-person audio signals of the remaining two speakers in the same way, and finally obtains the independently enhanced single-person audio signals of the four speakers.

Step S30, identifying each enhanced single-person audio signal to acquire the identity information of each speaker.

The identity information may be name, gender, age, region, position, etc.

In this embodiment, the terminal may recognize each enhanced single-person audio signal through a preset speech recognition model and obtain, from the matching result, information such as the name, gender, age, region and position corresponding to each speaker. The preset speech recognition model can be a Gaussian mixture model (GMM), a convolutional neural network (CNN) model, and the like. The GMM is taken as an example. The terminal inputs the single-person audio signals into the preset GMM model one by one or all together. The model first removes non-speech and silent segments from each single-person audio signal; then performs framing and windowing on each single-person audio signal; then applies a fast Fourier transform to each time-domain frame to obtain the power spectrum of the signal; then passes the power spectrum through a bank of triangular band-pass filters to simulate the masking effect of the human ear; and finally applies a discrete cosine transform to the filtered output to remove the correlation between the dimensions and map the signal to a low-dimensional space. After this processing, the model can extract the Mel-frequency cepstral coefficients (MFCC) of each single-person audio signal, and model training is performed on the MFCC parameters to obtain the GMM voiceprint model of the speaker corresponding to each single-person audio signal. The terminal judges whether each single-person audio signal matches a voiceprint according to a matching function computed between the single-person audio signal and the corresponding GMM voiceprint model.
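The feature extraction chain described above (framing, windowing, FFT power spectrum, triangular filter bank, discrete cosine transform) can be illustrated with a minimal NumPy sketch. It is an illustration under assumptions, not the patent's implementation: the frame length, hop size, FFT size, filter count, the Hamming window, and the logarithm taken before the DCT are conventional choices that the text does not specify, and the silence-removal and GMM training steps are omitted. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(freq_hz):
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular band-pass filters spaced evenly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an (n_frames, n_ceps) array of MFCC features (assumed parameters)."""
    # Framing and windowing
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT of each frame, converted to a power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filter bank, then log compression (assumed, not stated in the text)
    energies = np.maximum(power @ mel_filterbank(n_filters, n_fft, sample_rate).T, 1e-10)
    log_energies = np.log(energies)
    # DCT-II decorrelates the filter outputs and keeps the first n_ceps dimensions
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T
```

Calling `mfcc(audio)` on one second of 16 kHz audio yields one 13-dimensional feature vector per 10 ms hop; these vectors would be the input to the GMM training and matching that the text describes.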

The invention provides a multi-person voice recognition method. In the multi-person voice recognition method, a preset sound source positioning and collecting device receives multi-person voice signals in a current multi-person scene, and sound source position information of each speaker in the multi-person voice signals is obtained; the single-person audio signal corresponding to each speaker is sequentially enhanced based on the sound source position information; and each enhanced single-person audio signal is identified to acquire the identity information of each speaker. In this way, the sound source position information of each speaker in the multi-person voice signals is acquired through the preset sound source positioning and collecting device, so that the multi-person voice signals can be initially separated into single-person audio signals; by sequentially enhancing the single-person audio signals, the multi-person voice signals are further separated according to the sound sources at different positions, and the interference of other speakers' voices with the single-person audio signal currently being enhanced is weakened; by recognizing each enhanced single-person audio signal in sequence, the identity information of all speakers in the multi-person voice signals can finally be acquired, which solves the technical problem that multi-person mixed voice signals are difficult to recognize.

Referring to fig. 3, fig. 3 is a flowchart illustrating a multi-person voice recognition method according to a second embodiment of the present invention.

Based on the first embodiment shown in fig. 2 described above, in the present embodiment, step S10 includes:

Step S11, receiving the multi-person voice signal based on the microphone array in the preset sound source positioning and collecting equipment;

The microphone array is a system consisting of a certain number of acoustic sensors and is used for sampling and processing the spatial characteristics of a sound field. The acoustic sensor here is typically a microphone. The number and arrangement of the acoustic sensors can be set flexibly according to actual requirements and are not limited in this embodiment.

In the present embodiment, the microphone array is a set of microphones arranged at predetermined spacings and in a predetermined layout. By exploiting the small differences in the arrival time of sound waves at each microphone, a microphone array can achieve better directivity than a single microphone. By combining all the microphone signals, the array behaves like a single highly directional microphone, forming a directional characteristic called a "beam". The beam of the microphone array can be steered by dedicated circuitry or by software algorithms so that it points toward a given sound source to improve audio capture. Directional beamforming processed by the array algorithm can form a narrow cone-shaped beam that receives only the sound of the target speaker while suppressing noise and interference in the environment. Information on the relative position between microphone array elements can be obtained by two methods: first, cross-correlating the signals synchronously acquired by a pair of microphones, searching for the maximum of the cross-correlation to obtain the time delay τ between the two signals, and multiplying τ by the sound propagation speed C0 to obtain the relative distance; second, measuring the phase difference Δφ of the signals synchronously acquired by a pair of microphones and obtaining the spacing of the pair of microphones from the frequency f and the sound propagation speed C0. Under algorithmic control, the microphone array can steer the beam toward the current speaker after locating the target speaker. The terminal acquires the voice signal and the position information of each speaker based on the microphone array. The terminal can obtain the position information of each current speaker according to a two-dimensional direction of arrival (DOA) estimation algorithm, which may be the Multiple Signal Classification (MUSIC) algorithm, the Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) algorithm, and the like.
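The two measurements named above, the cross-correlation delay between a pair of microphones and the phase difference at a known frequency, can be sketched as follows. This is a minimal illustration under assumptions: the value 343 m/s for the sound propagation speed C0, the relation Δφ = 2πfτ used to convert phase to distance, and all function names are chosen here and do not come from the source.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for C0 (air at about 20 degrees C)

def estimate_delay(sig_a, sig_b, sample_rate):
    """Delay of sig_b relative to sig_a in seconds (positive when sig_b lags)."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    # Index len(sig_a) - 1 of the full cross-correlation corresponds to zero lag
    lag_samples = int(np.argmax(corr)) - (len(sig_a) - 1)
    return lag_samples / sample_rate

def distance_from_delay(delay_s):
    """Path-length difference obtained by multiplying the delay by C0."""
    return delay_s * SPEED_OF_SOUND

def distance_from_phase(delta_phi, freq_hz):
    """Path-length difference from a phase difference, assuming delta_phi = 2*pi*f*tau."""
    return delta_phi * SPEED_OF_SOUND / (2.0 * np.pi * freq_hz)
```

For example, a five-sample lag at 16 kHz corresponds to a path difference of about 0.107 m. The phase-based form is only unambiguous while the true path difference is shorter than one wavelength.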

Step S12, acquiring the sound source position information based on the microphone array and the infrared sensor in the preset sound source positioning and collecting device.

The infrared sensor in this embodiment may be a pyroelectric infrared sensing unit, a Passive infrared detector (PIR), or the like.

In this embodiment, the terminal may perform operations such as matching and integrating the position information of the plurality of speakers acquired by the microphone array and the position information acquired by the infrared sensor, and generate sound source position information that may include the positions of the speakers.

Further, not shown in the figure, in the present embodiment, the step S12 includes:

Step a, acquiring first position information of each speaker based on a pyroelectric infrared sensing unit in the infrared sensor.

In this embodiment, the infrared sensor is a pyroelectric infrared sensing unit using a germanium window infrared optical lens as an optical modulation system. The number and the placement positions of the infrared sensing units can be flexibly set according to actual requirements, which is not limited in this embodiment.

The terminal sequentially acquires the position information of a plurality of current speakers through a plurality of pyroelectric infrared sensing units to serve as first position information.

Step b, acquiring second position information of each speaker based on the microphone array;

In this embodiment, the terminal calculates the current position information of each speaker based on the microphone array, using the MUSIC algorithm or the ESPRIT algorithm. The terminal uses this position information as the second position information.

Step c, integrating the first position information and the second position information to generate the sound source position information.

In this embodiment, the terminal compares, matches, and verifies the first position information of each speaker in the current scene acquired according to the pyroelectric infrared sensing units and the second position information of each speaker in the current scene acquired according to the microphone array to obtain the specific orientation and distance information of each speaker with higher accuracy, and uses the integrated position information as the sound source position information reflecting the specific position of each speaker.
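The source does not specify how the comparison, matching and verification are carried out. One simple possibility, given purely as a hedged sketch, is to pair each infrared-derived position with the nearest microphone-derived position and average the pair when the two agree within a tolerance. The function name `fuse_positions`, the tolerance `max_gap`, and the averaging rule are all assumptions introduced for illustration.

```python
import math

def fuse_positions(first, second, max_gap=0.5):
    """Pair (x, y) estimates from two sensors and average the pairs that agree.

    first:   positions from the infrared sensing units
    second:  positions from the microphone array
    max_gap: largest distance in meters at which two estimates are taken to
             describe the same speaker (hypothetical tolerance)
    """
    fused = []
    remaining = list(second)
    for p in first:
        if not remaining:
            break
        nearest = min(remaining, key=lambda q: math.dist(p, q))
        if math.dist(p, nearest) <= max_gap:   # the two sensors agree: verified
            remaining.remove(nearest)
            fused.append(((p[0] + nearest[0]) / 2.0, (p[1] + nearest[1]) / 2.0))
    return fused
```

Estimates that have no counterpart within the tolerance are dropped here, which is one way to read "verifies"; a real system might instead keep them with a lower confidence.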

Further, not shown in the figure, in the present embodiment, the step a includes:

Step d, acquiring a position coordinate parameter and an induction included angle parameter of the pyroelectric infrared induction unit;

In this embodiment, 12 pyroelectric infrared sensing units are placed in advance evenly around a circle at angular intervals of 30 degrees. The center of the circle is taken as the origin of coordinates, and the coordinates of the two infrared sensing units located equidistantly to the left and right of the center are set to A (-a, 0) and B (a, 0). Taking a speaker located five meters directly in front of the center as an example, the position coordinates of the speaker are set to P (x, y), with the angle PAB set to α and the corresponding supplementary angle set to β.

Step e, generating the first position information based on a preset positioning algorithm, the position coordinate parameter and the induction included angle parameter.

In this embodiment, the predetermined positioning algorithm is an algorithm formula for calculating x and y, and the specific formula is

The terminal substitutes the obtained position coordinate parameter a and the induction included angle parameters α and β into the formula to calculate the position coordinates of each speaker, and the position coordinates of each speaker are converted into the first position information.
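The positioning formula itself is not reproduced in this text. As a hedged illustration only, the sketch below shows a standard two-sensor triangulation under one possible reading of the geometry: the sensing units sit at A (-a, 0) and B (a, 0), α is the angle PAB, and β is the angle between the positive x-axis and the ray from B to P (that is, the supplementary angle of angle PBA). With that reading, tan α = y / (x + a) and tan β = y / (x - a). This is an assumption, not necessarily the formula used in the patent, and the function name is hypothetical.

```python
import math

def locate_speaker(a, alpha, beta):
    """Triangulate P = (x, y) from two bearings (angles in radians).

    Assumed geometry: sensors at A = (-a, 0) and B = (a, 0);
    alpha is angle PAB, beta is the supplementary angle of angle PBA.
    From x + a = y * cot(alpha) and x - a = y * cot(beta):
        y = 2a / (cot(alpha) - cot(beta)),  x = y * cot(alpha) - a
    """
    cot_alpha = math.cos(alpha) / math.sin(alpha)
    cot_beta = math.cos(beta) / math.sin(beta)
    y = 2.0 * a / (cot_alpha - cot_beta)
    x = y * cot_alpha - a
    return x, y
```

With a = 1 and a speaker five meters directly in front of the center, the bearings are α = atan2(5, 1) and β = atan2(5, -1), and the function recovers approximately (0, 5). The cotangent form is used so that a bearing of exactly 90 degrees does not cause a division by zero; the computation is still undefined when the speaker lies on the line through A and B.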

Further, not shown in the figure, in the present embodiment, step b includes:

Step f, acquiring the incident angle information of the multi-person audio signal by a signal parameter estimation algorithm based on a rotation invariant technology;

The signal parameter estimation algorithm based on the rotation invariant technology is the ESPRIT algorithm, an antenna-array-based algorithm among the two-dimensional DOA estimation algorithms.

In this embodiment, based on the signal parameter estimation algorithm of the rotation invariant technology, the terminal converts each single-person audio signal into a plurality of incoming wave signal matrices, generates a covariance matrix from the incoming wave signal matrices, performs eigendecomposition on the covariance matrix to obtain a rotation matrix, and finally obtains the incident angle information corresponding to each speaker from the rotation matrix.

Step g, determining second position information of each speaker according to the incidence angle information.

In this embodiment, the terminal converts the incident angle information of each speaker, which is obtained based on the ESPRIT algorithm, into the position information on the preset coordinate system, that is, the second position information.
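The covariance, eigendecomposition and rotation-matrix steps of step f can be sketched with a least-squares ESPRIT estimator. The sketch makes simplifying assumptions that the source does not state: a uniform linear array, narrowband complex snapshots, a known number of sources, and a single angle per source (one-dimensional rather than two-dimensional DOA). The function name and parameters are hypothetical.

```python
import numpy as np

def esprit_doa(snapshots, n_sources, spacing_over_wavelength=0.5):
    """Estimate incidence angles in radians (sorted) for a uniform linear array.

    snapshots: complex array of shape (n_sensors, n_snapshots), playing the
               role of the incoming wave signal matrix.
    """
    n_snapshots = snapshots.shape[1]
    # Sample covariance matrix of the array output
    covariance = snapshots @ snapshots.conj().T / n_snapshots
    # Eigendecomposition; eigh returns eigenvalues in ascending order
    _, eigvecs = np.linalg.eigh(covariance)
    signal_subspace = eigvecs[:, -n_sources:]
    # Two overlapping subarrays that differ by a shift of one sensor
    upper = signal_subspace[:-1]
    lower = signal_subspace[1:]
    # Least-squares rotation matrix mapping the first subarray onto the second
    rotation = np.linalg.pinv(upper) @ lower
    # Its eigenvalues are exp(j * 2*pi * (d / wavelength) * sin(theta))
    phases = np.angle(np.linalg.eigvals(rotation))
    sines = phases / (2.0 * np.pi * spacing_over_wavelength)
    return np.sort(np.arcsin(np.clip(sines, -1.0, 1.0)))
```

Speech is wideband, so a practical system would typically apply such an estimator per frequency bin or after band-pass filtering; that step, and the conversion of angles into coordinates in step g, are outside this sketch.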

The invention provides a multi-person voice recognition method. The method obtains the sound source position information of each speaker in the current scene through both the microphone array and the infrared sensor, which is more accurate than sound source position information obtained through the microphone array alone; acquiring the first position information through the pyroelectric infrared sensing units makes the first position information more accurate without requiring higher cost; acquiring the first position information from the position coordinate parameters and the induction included angle parameters makes the acquisition simpler and more convenient; and obtaining the second position information with the signal parameter estimation algorithm based on the rotation invariant technology requires less computation than other antenna array algorithms, which improves the computational efficiency of the terminal.

A third embodiment of the multi-person voice recognition method according to the present invention, not shown in the drawings, is proposed based on the first embodiment shown in fig. 2. In the present embodiment, step S20 includes:

Step h, determining a single audio signal of each speaker corresponding to the multi-person voice signal based on the sound source position information;

In this embodiment, the terminal collects the multi-person voice signal through the microphone array, measures the phase differences Δφ between the signals synchronously collected by the microphones, obtains from Δφ, the frequency f and the sound propagation speed C0 the time intervals at which the sound reaches the microphones, and converts the time intervals into position information of each speaker, thereby identifying the single audio signal corresponding to each speaker in the multi-person voice signal.
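One possible reading of this conversion, for a single pair of microphones and a far-field source, is sketched below. The relations used (time interval = Δφ / (2πf), path difference = C0 × time interval, sine of the incidence angle = path difference / microphone spacing) are standard, but the function names, the microphone spacing and the value chosen for C0 are assumptions of this sketch and are not given in the text.

```python
import math

SPEED_OF_SOUND = 343.0  # C0 in m/s; assumed value for air at room temperature

def phase_to_delay(delta_phi, frequency):
    """Time interval (s) between two microphones from their phase difference (rad)."""
    return delta_phi / (2 * math.pi * frequency)

def delay_to_angle(delay, mic_spacing, c0=SPEED_OF_SOUND):
    """Incidence angle (rad) of a far-field source seen by one microphone pair."""
    ratio = c0 * delay / mic_spacing  # path difference over microphone spacing
    return math.asin(max(-1.0, min(1.0, ratio)))

# Hypothetical measurement: a quarter-cycle phase lag at 1 kHz, microphones 10 cm apart.
delay = phase_to_delay(delta_phi=math.pi / 2, frequency=1000.0)
angle = delay_to_angle(delay, mic_spacing=0.1)
print(f"delay = {delay * 1e6:.0f} us, angle = {math.degrees(angle):.1f} deg")
```

A single pair only yields a direction; combining the directions from several pairs of the array is what would allow a position to be derived.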

Step i, sequentially taking each single audio signal as a target audio signal, and taking the remaining audio signals in the multi-person voice signal as interference audio signals;

In the present embodiment, it is assumed that four speakers A, B, C and D are speaking at the same time. The terminal takes the single audio signal corresponding to A in the four-person mixed audio signal as the target audio signal and takes the three-person mixed audio signal corresponding to B, C and D as the interference audio signal, and so on: when B is taken as the target audio signal, A, C and D are the interference audio signals; when C is taken as the target audio signal, A, B and D are the interference audio signals; and when D is taken as the target audio signal, A, B and C are the interference audio signals.
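The rotation of the target and interference roles in the four-speaker example can be written as a short loop. The helper name is hypothetical and the speakers are represented only by their labels.

```python
def target_interference_pairs(speakers):
    """Yield (target, interferers), taking each speaker as the target in turn."""
    for index, target in enumerate(speakers):
        yield target, speakers[:index] + speakers[index + 1:]

pairs = list(target_interference_pairs(["A", "B", "C", "D"]))
for target, interferers in pairs:
    print(target, interferers)
```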

Step j, performing adaptive adjustment on a first filtering weight of the target audio signal and a second filtering weight of the interference audio signal;

In this embodiment, the terminal may adjust the filtering weight corresponding to the target audio signal, that is, the first filtering weight, and the filtering weight corresponding to the interference audio signal, that is, the second filtering weight, based on an adaptive filtering algorithm such as the Recursive Least Squares (RLS) algorithm or the Least Mean Square (LMS) algorithm. Typically, the first filtering weight is increased and the second filtering weight is decreased.
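The LMS update rule can be illustrated with a two-weight linear combiner. The text does not specify the filter structure, so this sketch assumes, purely for illustration, that a separate target channel and interference channel are available and that a reference for the desired output exists during adaptation. The signals, the initial weights and the step size are hypothetical.

```python
import math

def lms_adapt(target_in, interference_in, desired, step=0.05):
    """Adapt two weights with the LMS rule w <- w + step * error * input."""
    w_target, w_interf = 0.5, 0.5  # hypothetical initial weights
    for t, i, d in zip(target_in, interference_in, desired):
        output = w_target * t + w_interf * i  # current filter output
        error = d - output                    # deviation from the desired signal
        w_target += step * error * t
        w_interf += step * error * i
    return w_target, w_interf

n = 2000
target = [math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
interference = [math.sin(2 * math.pi * 0.037 * k + 1.0) for k in range(n)]
w1, w2 = lms_adapt(target, interference, desired=target)
print(round(w1, 3), round(w2, 3))
```

Under these assumptions the first weight rises towards 1 and the second falls towards 0, which matches the behaviour stated above. An RLS variant would replace the fixed step size with a recursively updated inverse correlation matrix.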

Step k, performing adaptive space-domain filtering on the multi-person voice signal based on the first filtering weight and the second filtering weight after adaptive adjustment so as to enhance the target audio signal.

In this embodiment, after the terminal adjusts the first filtering weight and the second filtering weight based on the adaptive filtering algorithm, the terminal performs spatial filtering on the target audio signal and the interference audio signal with the adjusted weights, so that the target audio signal is enhanced and the interference audio signal is weakened.

Further, in the present embodiment, step S30 includes:

Step l, preprocessing each enhanced target audio signal to generate a voice signal to be recognized;

In this embodiment, the terminal preprocesses each enhanced target audio signal and uses the preprocessed target audio signal as a speech signal to be recognized. The preprocessing may include pre-emphasis, framing and windowing, short-time Fourier transform, voice activity detection, and the like.

The pre-emphasis operation is generally realized by a digital filter whose transfer function is that of a first-order FIR high-pass filter, and aims to emphasize the high-frequency part of the voice, remove the influence of lip radiation and increase the high-frequency resolution of the voice.

Framing and windowing are applied because the voice signal has short-time stationarity (it can be considered approximately unchanged within 10-30 ms), so that it can be divided into a plurality of short segments for processing. Framing is achieved by weighting the signal with a movable window of finite length. Generally, there are about 33 to 100 frames per second, and this embodiment does not limit the number of frames per second. A common framing method is overlapping segmentation, in which the overlapping part of a previous frame and a next frame is called the frame shift, and the ratio of the frame shift to the frame length is generally 0-0.5. After framing in this way, discontinuities appear at the beginning and the end of each frame, so that the more frames the signal is divided into, the larger the error relative to the original signal. The windowing operation makes the framed signal continuous again, so that each frame exhibits the characteristics of a periodic function. A Hamming window is typically applied in speech signal processing.

The short-time Fourier transform converts the time domain signal of each frame into a frequency domain signal; the frequency domain signals obtained by the fast Fourier transform of each frame are stacked in time to obtain the spectrogram, i.e., the voiceprint information, corresponding to the speech signal.

Voice activity detection detects valid speech segments in a continuous speech stream. It comprises detecting the front end point, which is the starting point of the valid speech, and detecting the rear end point, which is the end point of the valid speech. Applying endpoint detection to speech signals can reduce the amount of data stored or transmitted. There are three main implementations of endpoint detection. First, threshold-based voice activity detection: voice is distinguished from non-voice by extracting time domain or frequency domain features and setting a reasonable threshold. Second, classifier-based voice activity detection: voice detection is regarded as a voice/non-voice binary classification problem, and a classifier is trained by a machine learning method. Third, model-based voice activity detection: a complete acoustic model is used to distinguish speech segments from non-speech segments through global information on the basis of decoding.
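Part of this preprocessing chain (pre-emphasis, overlapping framing with a Hamming window, and threshold-based voice activity detection on short-time energy) can be sketched as follows. The pre-emphasis coefficient 0.97, the 25 ms frame length, the 50% overlap, the energy threshold and the synthetic test signal are assumptions chosen for illustration; the `overlap` parameter is the overlapping part of adjacent frames, which the text calls the frame shift. The short-time Fourier transform is omitted to keep the sketch short.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """First-order FIR high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len, overlap):
    """Split into overlapping frames and apply a Hamming window to each."""
    hop = frame_len - overlap  # distance between the starts of adjacent frames
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def energy_vad(frames, threshold):
    """Threshold-based voice activity detection on the mean energy per frame."""
    return [sum(s * s for s in frame) / len(frame) > threshold
            for frame in frames]

# Hypothetical input at 8 kHz: 0.2 s silence, 0.2 s of a 1 kHz tone, 0.1 s silence.
rate = 8000
signal = ([0.0] * 1600
          + [math.sin(2 * math.pi * 1000 * n / rate) for n in range(1600)]
          + [0.0] * 800)
frames = frame_and_window(pre_emphasis(signal), frame_len=200, overlap=100)
flags = energy_vad(frames, threshold=0.01)
print(len(frames), flags.index(True))
```

The first and last frames flagged as active give the front and rear end points of the valid speech segment.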

Step m, sequentially extracting target voiceprint characteristics of the voice signal to be recognized, and matching the target voiceprint characteristics with prestored voiceprint characteristics to acquire identity information of each speaker.

The target voiceprint characteristics can be a pitch period, a short-time zero-crossing rate, linear prediction cepstrum coefficients, Mel frequency cepstrum coefficients, an impulse response of the vocal tract, autocorrelation coefficients, a vocal tract area function, denoised cepstrum coefficients and the like.

In this embodiment, it can be understood that the recognition of the speech signal to be recognized may be performed by a preset voiceprint recognition model. The voiceprint recognition model sequentially extracts the target voiceprint characteristics of each speech signal to be recognized and matches them with the voiceprint characteristics prestored in the model. If the similarity between a target voiceprint characteristic and a prestored voiceprint characteristic reaches a preset threshold, the two are judged to match, and the speaker corresponding to the prestored voiceprint characteristic is the speaker of the target voiceprint characteristic. The terminal can then obtain identity information of the corresponding speaker, such as name and age.
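The threshold-based matching can be sketched with a simple similarity measure. The text does not name the similarity measure or the form of the voiceprint characteristics, so cosine similarity over fixed-length feature vectors, the threshold 0.8, the enrolled names and the vectors themselves are all assumptions of this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(target_feature, enrolled, threshold=0.8):
    """Return the best-matching enrolled identity, or None if below the threshold."""
    best_name, best_score = None, threshold
    for name, feature in enrolled.items():
        score = cosine_similarity(target_feature, feature)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical prestored voiceprint characteristics.
enrolled = {"speaker_a": [0.9, 0.1, 0.3], "speaker_b": [0.1, 0.8, 0.5]}
print(identify([0.85, 0.15, 0.35], enrolled))
print(identify([-0.5, 0.2, -0.7], enrolled))
```

A target whose similarity to every prestored entry stays below the threshold is treated as an unknown speaker rather than being forced onto the closest entry.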

The invention provides a multi-person voice recognition method. By separately adjusting the filtering weights corresponding to the target audio signal and the interference audio signal, the multi-person voice recognition method further enhances the target audio signal and weakens the interference audio signal, so that the target audio signal is separated from the multi-person voice; and the identity information of each speaker is obtained by recognizing the voiceprint characteristics of the voice signal to be recognized, thereby completing the recognition of the mixed voice of a plurality of people.

The invention also provides a multi-person voice recognition device.

The multi-person voice recognition apparatus includes:

The invention also provides multi-person voice recognition equipment.

The multi-person voice recognition equipment comprises a processor, a memory and a multi-person voice recognition program which is stored on the memory and can run on the processor, wherein when the multi-person voice recognition program is executed by the processor, the steps of the multi-person voice recognition method described above are realized.

For the method implemented when the multi-person voice recognition program is executed, reference may be made to the embodiments of the multi-person voice recognition method of the present invention, and details are not repeated herein.

The invention also provides a computer readable storage medium.

The computer-readable storage medium of the present invention has stored thereon a multi-person voice recognition program which, when executed by a processor, implements the steps of the multi-person voice recognition method as described above.

For the method implemented when the multi-person voice recognition program is executed, reference may be made to the embodiments of the multi-person voice recognition method of the present invention, and details are not repeated herein.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A multi-person voice recognition method, comprising:

2. The multi-person voice recognition method according to claim 1, wherein the step of receiving the multi-person voice signals in the current multi-person scene by using a preset sound source location collecting device and acquiring the sound source position information of each speaker in the multi-person voice signals comprises:

3. The multi-person voice recognition method of claim 2, wherein the step of acquiring the sound source position information based on the microphone array and the infrared sensor in the preset sound source localization collecting device comprises:

4. The multi-person voice recognition method according to claim 3, wherein the step of acquiring the first position information of each of the speakers based on a pyroelectric infrared sensing unit in the infrared sensor comprises:

5. The multi-person voice recognition method of claim 3, wherein the step of acquiring second location information of each of the speakers based on the microphone array comprises:

6. The multi-person voice recognition method according to claim 1, wherein the step of sequentially enhancing the single-person audio signal corresponding to each of the speakers based on the sound source position information comprises:

7. The multi-person voice recognition method of claim 1, wherein the step of recognizing each enhanced target audio to obtain identity information of each of the speakers comprises:

8. A multi-person voice recognition apparatus, characterized in that the multi-person voice recognition apparatus comprises:

9. A multi-person voice recognition apparatus, characterized in that the multi-person voice recognition apparatus comprises: memory, a processor and a multi-person voice recognition program stored on the memory and executable on the processor, the multi-person voice recognition program, when executed by the processor, implementing the steps of the multi-person voice recognition method as claimed in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a multi-person voice recognition program is stored on the computer-readable storage medium, which when executed by a processor implements the steps of the multi-person voice recognition method according to any one of claims 1 to 7.