CN110545396A - Voice recognition method and device based on positioning and denoising - Google Patents

Voice recognition method and device based on positioning and denoising

Info

Publication number
CN110545396A
Authority
CN
China
Prior art keywords: speaking, image frame, determining, position information, frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910817769.5A
Other languages
Chinese (zh)
Inventor
李索恒
汪俊
郑达
张志齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yitu Information Technology Co Ltd
Original Assignee
Shanghai Yitu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shanghai Yitu Information Technology Co Ltd filed Critical Shanghai Yitu Information Technology Co Ltd
Priority to CN201910817769.5A
Publication of CN110545396A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/441 - Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N 21/4415 - Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N 7/00 - Television systems
    • H04N 7/14 - Systems for two-way working
    • H04N 7/15 - Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of communication, and in particular to a voice recognition method and device based on positioning and denoising. The method comprises: acquiring an audio signal collected in a first period; determining, from the video signal collected in the first period, an image frame containing a speaking object, the speaking object being determined according to the lip movement characteristics of the same face in the video signal and the sound source position information in the audio signal; frame-aligning the image frame containing the speaking object with the audio signal collected in the first period; and inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model to determine the speech recognition result of the speaking object.

Description

Voice recognition method and device based on positioning and denoising
Technical Field
The invention relates to audio processing technology, and in particular to a voice recognition method and a voice recognition device based on positioning and denoising.
Background
In conference and monitoring scenarios, especially remote conference scenarios, existing systems can only display video: the speech recognition result of a speaker cannot be shown on the display interface, so conference efficiency is low. Similarly, in monitoring scenarios, because existing monitoring carries no speaker speech, the services built on the monitoring are limited and can hardly meet actual needs.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method and device based on positioning and denoising, which are used for improving the accuracy and effectiveness of voice recognition in a monitoring scene or a conference scene.
The embodiment of the invention provides a voice recognition method based on positioning and denoising, which comprises: acquiring an audio signal collected in a first period; determining, from the video signal collected in the first period, an image frame containing a speaking object, the speaking object being determined according to the lip movement characteristics of the same face in the video signal and the sound source position information in the audio signal; frame-aligning the image frame containing the speaking object with the audio signal collected in the first period; and inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model to determine the speech recognition result of the speaking object.
In the embodiment of the invention, the image frame containing the speaking object is determined by performing face recognition on the video signal collected in the first period and by using the sound source position information in the audio signal; the frame-aligned image frame containing the speaking object and the audio signal collected in the first period are then input into the speech recognition model. This increases the reliability of the speaking object recognized from the image frames, effectively reduces the interference of environmental noise or other speaking objects on voice recognition, and improves the accuracy of the voice recognition.
In one possible implementation, determining an image frame containing a speaking object from the video signal collected in the first period includes:
determining sound source position information in an audio frame at a first moment, lip movement probability of each object in an image frame at the first moment and position information of each object in the image frame, wherein the first moment is any moment in the first time period;
If the image frame at the first moment contains a first object, determining the image frame at the first moment as an image frame containing a speaking object; and the position information of the first object in the image frame is matched with the sound source position information, and the lip movement probability of the first object accords with the speaking characteristics.
In this technical scheme, lip movement detection is performed on the lip region in the image to determine the image frames of the speaking object, the result is matched against the sound source position information, and the lip movement probability of the first object is checked against the speaking characteristics; this improves the accuracy of speaking object recognition and thus effectively improves the accuracy of voice recognition.
In one possible implementation, determining the sound source position information in the audio frame at the first moment includes:
determining the positions of all sound sources in an audio frame at a first moment and the sound production probability of all the sound sources;
Determining that the position information of the first object in the image frame matches the sound source position information and that the lip movement probability of the first object meets a set condition includes:
Determining a second object according to the position information of each object in the image frame, wherein the position information of the second object is matched with the position of at least one sound source;
For each second object, determining the probability that the second object is speaking according to the sound production probability of the sound source position corresponding to the second object and the lip movement probability of the second object;
and determining a second object whose speaking probability meets the set condition as the first object.
In this technical scheme, lip movement detection is performed on the lip region in the image to determine the image frames of the speaking object, the result is matched against the sound source position information, and the lip movement probability of the first object is checked against the speaking characteristics; this improves the accuracy of speaking object recognition and thus effectively improves the accuracy of voice recognition.
In one possible implementation, frame-aligning the image frame containing the speaking object with the audio signal collected in the first period includes:
Carrying out frame alignment on the image frame containing the speaking object, the audio frame containing the sound source position information and the audio signal collected in the first period;
Inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model, and determining a speech recognition result of the speaking object, including:
And inputting the image frame containing the speaking object, the audio frame containing the sound source position information and the audio signal collected in the first period after frame alignment into a speech recognition model, and determining the speech recognition result of the speaking object.
According to this technical scheme, lip movement detection and sound source position information are jointly applied to the lip region in the image; detecting the speaking object that corresponds to the determined lip movement probability increases the reliability of the determined speaking object, and inputting the image frames corresponding to the speaking object into the voice recognition model as reference information improves the robustness of voice recognition.
In one possible implementation, the speech recognition model includes sub-models having different attributes;
Inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model, and determining a speech recognition result of the speaking object, including:
Determining identity information of the speaking object according to the image frame containing the speaking object;
Inputting the identity information of the speaking object, the image frame containing the speaking object and the audio signal collected in the first time period into the speech recognition model, and determining the speech recognition result of the speaking object; the identity information of the speaking object is used for determining a sub-model used when the speaking object is subjected to voice recognition.
According to this technical scheme, the identity information of the speaking object, the image frame containing the speaking object, and the audio signal collected in the first period are input into the voice recognition model, which improves the accuracy of voice recognition for a single speaking object among several speaking objects. By associating the identity information of the speaking object, the image frames corresponding to that object can be detected quickly: once the face image of the speaking object is determined, the current image frame to be recognized no longer needs to be compared with all face images in an image library, but only, preferentially, with the speaking object. This improves the recognition efficiency for the speaking object, further screens the audio signal of the audio frames, and improves the efficiency and accuracy of the voice recognition.
The embodiment of the invention provides a voice recognition device based on positioning and denoising, which comprises:
The transceiving unit, used for acquiring the audio signal collected in a first period;
the processing unit is used for determining an image frame containing a speaking object from the video signal acquired in the first period; the speaking object is determined according to the lip movement characteristics of the same face in the video signal and the sound source position information in the audio signal; performing frame alignment on the image frame containing the speaking object and the audio signal acquired in the first period; inputting the image frame containing the speaking object after frame alignment and the audio signal collected in the first time period into a speech recognition model, and determining the speech recognition result of the speaking object.
In a possible implementation manner, the processing unit is specifically configured to:
determining sound source position information in an audio frame at a first moment, lip movement probability of each object in an image frame at the first moment and position information of each object in the image frame, wherein the first moment is any moment in the first time period; if the image frame at the first moment contains a first object, determining the image frame at the first moment as an image frame containing a speaking object; and the position information of the first object in the image frame is matched with the sound source position information, and the lip movement probability of the first object accords with the speaking characteristics.
In a possible implementation manner, the processing unit is specifically configured to: determine the positions of all sound sources in the audio frame at a first moment and the sound production probability of each sound source; determine a second object according to the position information of each object in the image frame, wherein the position information of the second object matches the position of at least one sound source; for each second object, determine the probability that the second object is speaking according to the sound production probability of the corresponding sound source position and the lip movement probability of the second object; and determine a second object whose speaking probability meets the set condition as the first object.
An embodiment of the present invention provides a storage medium storing a program of a method for speech recognition, which, when executed by a processor, performs the method according to any one of the embodiments of the present invention.
An embodiment of the present invention provides a computer device, including one or more processors; and one or more computer-readable media having instructions stored thereon, which, when executed by the one or more processors, cause the apparatus to perform the method of any of the embodiments of the invention.
Drawings
FIG. 1 is a system architecture diagram according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a speech recognition method based on localization and denoising in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech recognition apparatus based on localization and denoising in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device in an embodiment of the present invention.
Detailed Description
In an actual use environment, a speech processing device extracts features from an input speech signal for recognition, but various interferences, such as reverberation, noise and signal distortion, exist in the environment. These interferences cause a large difference between the characteristics of the input speech signal and the characteristics of the speech recognition model, thereby reducing the recognition rate.
FIG. 1 illustrates a schematic diagram of a system architecture to which an embodiment of the present invention is applicable, which includes a monitoring device 101 and a server 102. The monitoring device 101 may collect a video stream in real time and send it to the server 102, which includes a positioning and denoising-based speech recognition device; the server 102 obtains image frames from the video stream and then determines the object to be recognized in the image frames and the corresponding speech recognition result. The monitoring device 101 is connected to the server 102 via a wireless network and is an electronic device with image and sound capture functions, such as a camera, a video recorder, or a microphone. The server 102 may be a single server, a server cluster composed of several servers, or a cloud computing center.
Based on the system architecture shown in FIG. 1, FIG. 2 exemplarily shows a flow diagram of the positioning and denoising-based speech recognition method provided by an embodiment of the present invention. The flow may be executed by a positioning and denoising-based speech recognition device, which may be the server 102 shown in FIG. 1. As shown in FIG. 2, the method specifically includes the following steps:
Step 201: an audio signal acquired during a first time period is acquired.
The first period may be, for example, 1 second; its specific length may be determined according to the characteristics of the audio signal or the needs of speech recognition, for example the required accuracy of online recognition, and is not limited herein.
Specifically, the audio signal may be at least one voice signal among the sound signals collected by at least one microphone; alternatively, at least two voice signal paths may be selected from the sound signals collected by the microphones and combined to obtain more voice information. In practical applications, the sound signal is transmitted frame by frame, so the positioning and denoising-based speech recognition device needs to detect sound signal frames continuously.
Further, the sound source position of the sound in the current audio frame can be determined by a microphone array. In a specific implementation, the probability that sound exists in each direction can be determined from information such as the intensity of the sound signal received by the microphone array in each direction, from which the position of the sound source and the probability of sound occurring at that position are determined.
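The patent does not give a concrete algorithm for this step. As an illustration only, the following Python sketch estimates a per-direction sound probability with a simple delay-and-sum (steered response power) scan over candidate angles; the linear array geometry, sampling rate, and normalization are assumptions, not values from the patent:

```python
import numpy as np

def direction_probabilities(frames, mic_x, fs, angles_deg, c=343.0):
    """Scan candidate directions with a delay-and-sum beamformer and turn
    the steered response power into a crude per-direction sound probability.

    frames: (n_mics, n_samples) array, one audio frame per microphone.
    mic_x:  (n_mics,) x-coordinates of a linear array, in meters.
    """
    powers = []
    for angle in np.deg2rad(angles_deg):
        delays = mic_x * np.cos(angle) / c            # far-field delay, seconds
        shifts = np.round(delays * fs).astype(int)    # delay in samples
        aligned = [np.roll(sig, -s) for sig, s in zip(frames, shifts)]
        beam = np.mean(aligned, axis=0)               # delay-and-sum output
        powers.append(float(np.sum(beam ** 2)))       # steered response power
    powers = np.asarray(powers)
    return powers / powers.sum()                      # normalize to probabilities

# Example: 4-mic array with 5 cm spacing at 16 kHz, scanning 0..180 degrees.
rng = np.random.default_rng(0)
probs = direction_probabilities(rng.standard_normal((4, 1024)),
                                np.array([0.00, 0.05, 0.10, 0.15]),
                                16000, np.arange(0, 181, 5))
```

The direction with the highest probability gives the estimated sound source position, and the probability itself serves as the sound production probability used later when matching objects to sources.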
Step 202: and carrying out face recognition on the video signals acquired in the first period, and determining an image frame containing a speaking object.
And the speaking object is determined according to the lip movement characteristics of the same face in the video signal and the sound source position information in the audio signal.
Specifically, the video signal collected in the first period may be the N image frames captured by the monitoring device in the first period. The monitoring device collects a video stream in real time; the video stream is composed of multiple image frames, which can be marked at time intervals in chronological order.
There are various ways to mark the image frames. One possible implementation is to mark the images in the video signal that need face object detection as detection frame images. For example, if a video signal includes 10 image frames, the first and fifth frames may be marked as image frames for face recognition, or all frames may be used as image frames for face recognition. The image frames may be marked according to factors such as whether a face or a voice signal is present, which is not limited herein.
Further, when an image frame is determined to be a face recognition image frame, the predicted image information corresponding to each face object in the image frame may also be determined. Specifically, the predicted image information corresponding to each face object in the image frame can be derived from the image information corresponding to each face object in already-recognized images; the recognized images may be images adjacent to the image frame in which the image information corresponding to the face object has been determined or predicted.
Alternatively, when the image frame is determined to be a face recognition image frame, face detection may be performed on it, thereby determining the detected image information corresponding to each face object in the image frame.
The N frames of the video to be processed collected by the monitoring device are thus divided into detection frame images and non-detection frame images. When an image frame is obtained, whether it is a face recognition image frame is judged: if so, face objects are detected in the image frame; otherwise, the face objects in the image frame are predicted from the face objects in other image frames. In this way, not every frame needs to be detected and recognized, which reduces the computation needed to determine the face objects in the video signal and improves efficiency, as sketched below.
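An illustrative sketch of this detection-frame/prediction-frame split; the marking interval and the carry-forward "prediction" are assumptions, and `detect_faces` is a hypothetical stand-in for any face detector:

```python
from typing import Callable, Dict, List

def track_faces(frames: List, detect_faces: Callable, interval: int = 5) -> Dict[int, list]:
    """Run the detector only on marked detection frames; frames in between
    reuse (a fuller system would motion-predict) the last detections."""
    results: Dict[int, list] = {}
    last: list = []
    for i, frame in enumerate(frames):
        if i % interval == 0:              # marked as a detection frame
            last = detect_faces(frame)     # expensive face detection
        results[i] = last                  # prediction for non-detection frames
    return results
```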
Further, object detection may first be performed on the image frame to determine the detection image region corresponding to each recognition object, and then the image information in that region, that is, the image information corresponding to each recognition object, may be determined. For example, the body information and face information of the object, and objects associated with it, may be determined. The image region may have a regular or an irregular shape.
In one possible implementation, performing face recognition on the video signal collected in the first period to determine an image frame containing a speaking object includes:
performing face recognition on any image frame in the video signals acquired in the first period, and determining each object contained in the image frame;
determining whether the image frame is an image frame containing a speaking object according to the lip movement probability of each object in the image frame; wherein the lip movement probability of each object is determined according to the lip movement characteristics of each object.
In this technical scheme, lip movement detection on the lip region in the image determines the image frames of the speaking object, which in turn enables frame alignment with the audio frames in the audio signal and improves the accuracy of voice recognition.
In one possible implementation, determining the lip movement probability of each object according to its lip movement features includes: performing face recognition on the N frames of images in the video signal collected in the first period to determine a first object;
Performing lip movement detection on a lip region of the first object in M frames of images including the first object, and determining lip movement characteristics of each frame of image in the M frames of images;
Determining the lip movement probability of each frame of image according to the lip movement characteristics of each frame of image in the M frames of images; N is greater than or equal to M.
Specifically, the confidence that lip movement exists can be determined from the lip movement characteristics of each frame of image in the M frames of images; if the lip movement confidence of K frames among the M frames is greater than a first preset threshold, the first object is determined to be the speaking object and the K frames are determined to be the image frames corresponding to the speaking object, where N is greater than or equal to M and M is greater than or equal to K. The lip movement characteristics of the first object in an image frame can be determined by a lip movement feature extraction model, and the confidence that lip movement exists is then determined from the lip movement characteristics of each frame. The confidence value lies in the interval [0, 1], as sketched below.
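A minimal sketch of this K-of-M decision rule; the per-frame confidences would come from the lip movement feature extraction model, which is not reproduced here, and the threshold and K are illustrative values, not taken from the patent:

```python
import numpy as np

def is_speaking(lip_confidences, threshold=0.6, k=3):
    """lip_confidences: per-frame lip movement confidence in [0, 1] for the
    M frames containing the object. The object is a speaking object if at
    least K frames exceed the first preset threshold; those frames are the
    image frames attributed to the speaking object."""
    hits = np.asarray(lip_confidences) > threshold
    return bool(hits.sum() >= k), np.nonzero(hits)[0]

speaking, frames = is_speaking([0.1, 0.7, 0.8, 0.65, 0.2, 0.9])
# speaking == True, frames == [1, 2, 3, 5]
```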
In this technical scheme, lip movement detection is performed on the lip region in the image, and the resulting lip movement probability is input into the voice recognition model as reference information, which improves the robustness that the image frames contribute to voice recognition.
In another implementation, whether lip movement exists can be determined by a classifier from the lip movement features of each frame: an output of 0 (no lip movement) excludes the image frame, while an output of 1 (lip movement) marks it as an image frame of the speaking object.
After the lip movement features are determined, the face feature image of the first speaking object can be derived from those features; all face feature images corresponding to the first speaking object, and all image frames containing them, are then determined. This makes it easy to associate the image frames later, avoids repeated face recognition, and improves recognition efficiency.
In this technical scheme, lip movement detection on the lip region determines the image frames of the speaking object and allows the speech recognition result in the audio signal to be associated online, so the result corresponding to the speaking object does not need to be searched for offline, improving the monitoring effect.
Further, in order to improve the accuracy and reliability of recognizing a speaking object, the position information of a sound source in an audio frame at a first time, the lip movement probability of each object in an image frame at the first time, and the position information of each object in the image frame at the first time may be determined, wherein the first time is any one time in the first time period; if the image frame at the first moment contains a first object, determining the image frame at the first moment as an image frame containing a speaking object; and the position information of the first object in the image frame is matched with the sound source position information, and the lip movement probability of the first object accords with the speaking characteristics.
In this technical scheme, lip movement detection is performed on the lip region in the image to determine the image frames of the speaking object, the result is matched against the sound source position information, and the lip movement probability of the first object is checked against the speaking characteristics; this improves the accuracy of speaking object recognition and thus effectively improves the accuracy of voice recognition.
Further, a second object can be determined according to the position information of each object in the image frame, where the position information of the second object matches the position of at least one sound source. For each second object, the probability that it is speaking is then determined from the sound production probability of the corresponding sound source position and the lip movement probability of the second object, and a second object whose speaking probability meets the set condition is determined as the first object (see the sketch below).
In this technical scheme, lip movement detection is performed on the lip region in the image to determine the image frames of the speaking object, the result is matched against the sound source position information, and the lip movement probability of the first object is checked against the speaking characteristics; this improves the accuracy of speaking object recognition and thus effectively improves the accuracy of voice recognition.
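The fusion itself can be sketched as follows, treating the sound production probability of the matched source position and the lip movement probability as independent evidence combined by product; the position-matching tolerance and the decision threshold are assumptions, not values from the patent:

```python
import numpy as np

def find_speaking_objects(objects, sources, max_dist=0.3, speak_threshold=0.5):
    """objects: [{'id', 'position', 'lip_prob'}] from the image frame at time t.
    sources:    [{'position', 'sound_prob'}] from the audio frame at time t.
    Returns ids of first objects, i.e. second objects whose fused speaking
    probability meets the set condition."""
    first_objects = []
    for obj in objects:
        # A "second object": image position matches at least one sound source.
        matched = [s for s in sources
                   if np.linalg.norm(np.subtract(obj['position'], s['position'])) < max_dist]
        if not matched:
            continue
        sound_prob = max(s['sound_prob'] for s in matched)
        # Fuse sound-production and lip-movement evidence into a speaking probability.
        speaking_prob = sound_prob * obj['lip_prob']
        if speaking_prob >= speak_threshold:
            first_objects.append(obj['id'])
    return first_objects

# Example: one face near a loud source, one face away from any source.
objs = [{'id': 'A', 'position': (1.0, 0.5), 'lip_prob': 0.9},
        {'id': 'B', 'position': (3.0, 0.5), 'lip_prob': 0.8}]
srcs = [{'position': (1.1, 0.5), 'sound_prob': 0.9}]
assert find_speaking_objects(objs, srcs) == ['A']
```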
step 203: performing frame alignment on the image frame containing the speaking object and the audio signal acquired in the first period;
For example, if the numbers of image frames and audio frames in the first period are the same, the frame number of the image frame in which the speaking object is determined can be associated directly with the same audio frame number: if the speaking object is determined in image frame 5, the speaking object is associated with audio frame 5 in the speech recognition result, and the remaining audio frames are associated with the corresponding image frames of the first object. If, say, frames 6 to 10 of the remaining audio signal correspond to the first speaking object, they are associated with image frames 6 to 10 of the first speaking object.
In another possible implementation, if the numbers of image frames and audio frames in the first period differ, the association can be made in proportion to the frame counts. For example, with 20 image frames and 30 audio frames in the first period, if the speaking object is determined in image frame 2, audio frame 3 in the speech recognition result is associated with the same speaking object.
Of course, the image frames and audio frames may instead be associated by time point, with the time points of image frames and audio frames in one-to-one correspondence; if at some time point the image frame of the speaking object can be associated with an audio frame in the audio signal, the correspondence between the speaking object and the audio frames in the audio signal is established from that time point, as sketched below.
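Both association strategies reduce to small index maps. The following sketch shows proportional association by frame count (1-based indices, matching the examples above) and nearest-time association; the exact rounding rule is an assumption:

```python
def align_by_index(n_image_frames: int, n_audio_frames: int):
    """Associate image frame numbers with audio frame numbers (1-based).
    Equal counts give the identity mapping; unequal counts are associated
    in proportion, e.g. image frame 2 of 20 -> audio frame 3 of 30."""
    ratio = n_audio_frames / n_image_frames
    return {i: min(n_audio_frames, round(i * ratio))
            for i in range(1, n_image_frames + 1)}

def align_by_time(image_times, audio_times):
    """Associate each image frame with the audio frame closest in time."""
    return {i: min(range(len(audio_times)),
                   key=lambda j: abs(audio_times[j] - t))
            for i, t in enumerate(image_times)}

assert align_by_index(20, 30)[2] == 3   # the proportional example above
```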
In another possible implementation manner, the image frame containing the speaking object, the audio frame containing the sound source position information, and the audio signal collected in the first period may be frame-aligned; furthermore, the frame-aligned image frame containing the speaking object, the audio frame containing the sound source position information and the audio signal collected in the first period can be input into a speech recognition model, and the speech recognition result of the speaking object is determined.
According to this technical scheme, lip movement detection and sound source position information are jointly applied to the lip region in the image; detecting the speaking object that corresponds to the determined lip movement probability increases the reliability of the determined speaking object, and inputting the image frames corresponding to the speaking object into the voice recognition model as reference information improves the robustness of voice recognition.
In yet another possible implementation, the image frame containing the speaking object, the audio frame containing the sound source position information, the lip language features of the speaking object in the image frame, and the audio signal collected in the first period may be frame-aligned; these frame-aligned inputs can then be input into a speech recognition model to determine the speech recognition result of the speaking object.
In this technical scheme, lip movement detection and lip language detection are performed on the lip region in the image to determine the image frames of the speaking object, the result is matched against the sound source position information, and the lip movement probability of the first object is checked against the speaking characteristics; this improves the accuracy of speaking object recognition and thus effectively improves the accuracy of voice recognition.
In a specific speech recognition process, speech in the audio signal can be recognized through a speech model to determine a speech recognition result.
Step 204: The frame-aligned image frame of the first speaking object and the audio signal collected in the first period are input into a speech recognition model, and the speech recognition result of the first speaking object is determined.
In the embodiment of the invention, the image frame containing the speaking object is determined by performing face recognition on the video signal collected in the first period and by using the sound source position information in the audio signal; after frame alignment, the image frame containing the speaking object and the audio signal collected in the first period are input into the speech recognition model simultaneously. The speech recognition model can then treat the audio signal in audio frames without an image frame as environmental noise relative to the voice signal in the current first period, and denoise the audio signal of audio frames that do have an image frame. This mitigates the low recognition accuracy caused by environmental interference, such as reverberation, noise and signal distortion, which makes the features of the input voice signal differ greatly from those expected by the speech recognition model. Through the speaking object identified from the image frames, the method reduces the interference of environmental noise or other speaking objects on voice recognition and improves its accuracy.
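The patent does not specify the denoising algorithm. As one concrete reading of this paragraph, the following sketch treats audio frames with no aligned speaking-object image frame as an environmental noise estimate and applies simple spectral subtraction to the remaining frames; the choice of spectral subtraction is an assumption:

```python
import numpy as np

def denoise_by_alignment(audio_frames, has_speaker_frame):
    """audio_frames: (n_frames, frame_len) time-domain audio frames.
    has_speaker_frame: boolean per frame, True when a frame-aligned image
    frame containing the speaking object exists for that audio frame.
    Frames without an aligned image frame are treated as environmental
    noise; their average magnitude spectrum is subtracted from all frames."""
    frame_len = audio_frames.shape[1]
    spectra = np.fft.rfft(audio_frames, axis=1)
    noise_mask = ~np.asarray(has_speaker_frame)
    if not noise_mask.any():
        return audio_frames                            # no noise estimate available
    noise_mag = np.abs(spectra[noise_mask]).mean(axis=0)
    mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)  # subtract noise floor
    cleaned = mag * np.exp(1j * np.angle(spectra))      # keep original phase
    return np.fft.irfft(cleaned, n=frame_len, axis=1)
```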
Taking the speech model as an example, when the speech model is established, the speech recognition device based on localization and denoising may perform the following operations:
First, the positioning and denoising-based speech recognition device extracts acoustic features of the sound signal on N set frequency bands; these serve as the acoustic features of the sound signal.
There are various ways to represent the acoustic characteristics of the sound signal in the frequency band, such as energy value, amplitude value, etc.
Then, the positioning and denoising-based speech recognition device takes the acoustic features on the N frequency bands as feature vectors, applies a Gaussian Mixture Model (GMM) to establish a corresponding speech model, and calculates the likelihood ratio of each acoustic feature based on that model.
Specifically, when calculating the likelihood ratio, the GMM can be used, based on the feature vectors, to obtain the characteristic parameters of the speech-class signal in each frequency band (e.g., its mean and variance) and the characteristic parameters of the interference-class signal in each frequency band (e.g., its mean and variance). The likelihood ratio of each acoustic feature is then computed from these parameters, and when the likelihood ratio of any acoustic feature reaches a set threshold, the existence probability of the desired sound source is set to a specified value indicating its presence, thereby determining that a voice signal exists.
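A sketch of this likelihood-ratio test using scikit-learn's GaussianMixture; the band count, component count, threshold, and the synthetic training data are all assumptions made so the example runs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Each row is the N-band acoustic feature vector of one audio frame; the
# speech/interference training sets here are synthetic stand-ins.
rng = np.random.default_rng(0)
speech_train = rng.normal(1.0, 0.5, size=(500, 8))
interference_train = rng.normal(0.0, 0.5, size=(500, 8))

gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_train)
gmm_interf = GaussianMixture(n_components=4, random_state=0).fit(interference_train)

def desired_source_present(features, threshold=1.0):
    """Per-frame log-likelihood ratio log p(x|speech) - log p(x|interference).
    If any frame's ratio reaches the threshold, the desired sound source is
    taken to exist and a voice signal is declared."""
    llr = gmm_speech.score_samples(features) - gmm_interf.score_samples(features)
    return bool(np.any(llr >= threshold)), llr

present, llr = desired_source_present(rng.normal(1.0, 0.5, size=(10, 8)))
```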
Of course, the GMM is only an example; in practical applications, other methods may be used to establish the speech model, for example a Support Vector Machine (SVM), a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), or a Recurrent Neural Network (RNN).
Furthermore, the identity information of the probable speaking object in an audio frame can be determined from the identity of the speaking object recognized in the image frame, after which the audio frames associated with that speaking object in other frames can be identified rapidly. This reduces the computation required for speech recognition in the first period and improves its effectiveness.
In one possible implementation, the speech recognition model includes sub-models having different attributes; the sub-models may be trained from the speech signals of different speaking objects. The identity information of the speaking object can therefore be determined from the image frame containing the speaking object; the sub-model corresponding to the speaking object is then selected according to that identity, and the image frames and audio frames are input into it to improve the voice recognition effect, as sketched below.
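A minimal sketch of this identity-based sub-model routing; the `recognize` interface and the fallback model for unknown speakers are assumptions:

```python
class SpeechRecognitionModel:
    """Speech recognition model holding per-speaker sub-models keyed by the
    identity information obtained from face recognition."""
    def __init__(self, submodels: dict, default):
        self.submodels = submodels   # e.g. {'speaker_42': model_42, ...}
        self.default = default       # generic model for unknown speakers

    def recognize(self, identity, image_frames, audio):
        # The identity selects the sub-model trained on that speaking
        # object's voice; unknown identities fall back to the generic model.
        model = self.submodels.get(identity, self.default)
        return model.recognize(image_frames, audio)
```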
In one possible implementation, identity information of the speaking object is determined according to the image frame containing the speaking object; the identity information of the speaking object, the image frame containing the speaking object, and the audio signal collected in the first period are input into the speech recognition model, and the speech recognition result of the speaking object is determined; the identity information of the speaking object is used to determine the sub-model used when performing voice recognition on the speaking object.
Furthermore, through the speech recognition model, the correspondence between the audio frames corresponding to the first speaking object and the image frames corresponding to the speaking object can be determined; the correspondence indicates that audio frames and image frames having it relate to the same object.
According to this technical scheme, the identity information of the speaking object, the image frame containing the speaking object, and the audio signal collected in the first period are input into the voice recognition model, which improves the accuracy of voice recognition for a single speaking object among several speaking objects. By associating the identity information of the speaking object, the image frames corresponding to that object can be detected quickly: once the face image of the speaking object is determined, the current image frame to be recognized no longer needs to be compared with all face images in an image library, but only, preferentially, with the speaking object. This improves the recognition efficiency for the speaking object, further screens the audio signal of the audio frames, and improves the efficiency and accuracy of the voice recognition.
In the embodiment of the invention, the image frame containing the speaking object is determined by performing face recognition on the video signal collected in the first period; the frame-aligned image frame containing the speaking object and the audio signal collected in the first period are then input into the voice recognition model. This increases the reliability of the speaking object recognized from the image frames, effectively reduces the interference of environmental noise or other speaking objects on voice recognition, and improves the accuracy of the voice recognition.
In one possible implementation, after the speech recognition result corresponding to the speaking object is determined, the method further includes:
Displaying the voice recognition result of the audio frame corresponding to the speaking object, together with the identity information of the speaking object, on the image frame corresponding to that audio frame in an object indication mode, where the object indication mode establishes an associated display relationship between the voice recognition result of the audio frame and the speaking object.
According to the technical scheme, after the voice recognition result corresponding to the speaking object is determined, the voice recognition result of the audio frame corresponding to the speaking object and the identity information of the speaking object are directly displayed on the image, so that the effect of real-time display is achieved, and the visualization of monitoring is improved.
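An illustration of the associated display, assuming OpenCV and a known face bounding box for the speaking object; the box format, font, and placement are arbitrary choices, not specified by the patent:

```python
import cv2

def annotate_frame(frame, box, identity, text):
    """Draw the speaking object's identity and its speech recognition result
    next to the object's face box, so the result is visibly associated with it."""
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(frame, f"{identity}: {text}", (x, max(y - 10, 20)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```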
In one possible implementation, the method further includes: and determining key points in the face of the speaking object in the image frame corresponding to the audio frame, and displaying the key points on the image frame.
In the technical scheme, the lip movement effect of the first speaking object can be visualized by determining the key points in the face of the speaking object in the image frame which has the corresponding relation with the audio frame, so that the visual monitoring effect is improved.
Based on the above embodiments, and referring to FIG. 3, an embodiment of the present invention provides a speech recognition device based on localization and denoising, including:
The transceiving unit 301 is configured to acquire an audio signal acquired in a first period;
The processing unit 302 is used for performing face recognition on the video signal acquired in the first period and determining an image frame containing a speaking object; the speaking object is determined according to lip movement characteristics of the same face in an image frame in the video signal and sound source position information in the audio signal; performing frame alignment on the image frame containing the speaking object and the audio signal acquired in the first period; inputting the image frame containing the speaking object after frame alignment and the audio signal collected in the first time period into a speech recognition model, and determining the speech recognition result of the speaking object.
In a possible implementation manner, the processing unit 302 is specifically configured to:
determining sound source position information in an audio frame at a first moment, lip movement probability of each object in an image frame at the first moment and position information of each object in the image frame, wherein the first moment is any moment in the first time period; if the image frame at the first moment contains a first object, determining the image frame at the first moment as an image frame containing a speaking object; and the position information of the first object in the image frame is matched with the sound source position information, and the lip movement probability of the first object accords with the speaking characteristics.
In a possible implementation manner, the processing unit 302 is specifically configured to: determine the positions of all sound sources in the audio frame at a first moment and the sound production probability of each sound source; determine a second object according to the position information of each object in the image frame, wherein the position information of the second object matches the position of at least one sound source; for each second object, determine the probability that the second object is speaking according to the sound production probability of the corresponding sound source position and the lip movement probability of the second object; and determine a second object whose speaking probability meets the set condition as the first object.
In one possible implementation, the speech recognition model includes sub-models having different attributes; the processing unit 302 is specifically configured to:
Determining identity information of the speaking object according to the image frame containing the speaking object; inputting the identity information of the speaking object, the image frame containing the speaking object and the audio signal collected in the first time period into the speech recognition model, and determining the speech recognition result of the speaking object; the identity information of the speaking object is used for determining a sub-model used when the speaking object is subjected to voice recognition.
An embodiment of the present invention provides a storage medium storing a program of a method for speech recognition, where the program is executed by a processor to perform the method according to any one of the above embodiments.
An embodiment of the present invention provides a computer device, including one or more processors; and one or more computer-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform the method of any of the above embodiments.
Based on the above embodiments, referring to FIG. 4, a schematic structural diagram of a computer device in an embodiment of the present invention is shown.
An embodiment of the present invention provides a computer device, where the computer device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and optionally, the user interface 1003 may include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in FIG. 4 does not limit the computer device, which may include more or fewer components than shown, combine some components, or arrange components differently.
The memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a voice recognition program. The operating system manages and controls the hardware and software resources of the parameter acquisition system for voice recognition, supports the recognition module, and also runs other programs and software.
The user interface 1003 is mainly used for connecting the servers to perform data communication with each server; the network interface 1004 is mainly used for connecting a background server and performing data communication with the background server; and the processor 1001 may be configured to invoke a speech recognition program stored in the memory 1005 and perform the following operations:
The processor 1001 is configured to perform face recognition on the video signal acquired in the first period, and determine an image frame including a speaking object; the speaking object is determined according to lip movement characteristics of the same face in an image frame in the video signal and sound source position information in the audio signal; performing frame alignment on the image frame containing the speaking object and the audio signal acquired in the first period; inputting the image frame containing the speaking object after frame alignment and the audio signal collected in the first time period into a speech recognition model, and determining the speech recognition result of the speaking object.
In one possible implementation, the processor 1001 is specifically configured to:
determining sound source position information in an audio frame at a first moment, lip movement probability of each object in an image frame at the first moment and position information of each object in the image frame, wherein the first moment is any moment in the first time period; if the image frame at the first moment contains a first object, determining the image frame at the first moment as an image frame containing a speaking object; and the position information of the first object in the image frame is matched with the sound source position information, and the lip movement probability of the first object accords with the speaking characteristics.
In one possible implementation, the processor 1001 is specifically configured to: determining the positions of all sound sources in an audio frame at a first moment and the sound production probability of all the sound sources; determining a second object according to the position information of each object in the image frame, wherein the position information of the second object is matched with the position of at least one sound source; aiming at each second object, determining the speaking probability of the second object with speaking characteristics according to the sound production probability of the sound source position corresponding to the second object and the lip movement probability of the second object; and determining a second object with speaking probability meeting set conditions as the first object.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A speech recognition method based on localization and denoising, characterized by comprising the following steps:
Acquiring an audio signal collected in a first period;
Determining an image frame containing a speaking object from the video signal acquired in the first period; the speaking object is determined according to the lip movement characteristics of the same face in the video signal and the sound source position information in the audio signal;
Performing frame alignment on the image frame containing the speaking object and the audio signal acquired in the first period;
Inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model, and determining a speech recognition result of the speaking object (a pipeline sketch follows the claims).
2. The method of claim 1, wherein determining an image frame containing a speaking object from the video signal acquired in the first period comprises:
Determining sound source position information in an audio frame at a first moment, the lip movement probability of each object in an image frame at the first moment, and the position information of each object in the image frame, wherein the first moment is any moment within the first period;
If the image frame at the first moment contains a first object, determining the image frame at the first moment as an image frame containing a speaking object; wherein the position information of the first object in the image frame matches the sound source position information, and the lip movement probability of the first object accords with the speaking characteristics (a position-matching sketch follows the claims).
3. The method of claim 2, wherein the determining sound source position information in the audio frame at the first moment comprises:
Determining the positions of all sound sources in an audio frame at a first moment and the sound production probability of all the sound sources;
and wherein determining that the position information of the first object in the image frame matches the sound source position information and that the lip movement probability of the first object accords with the speaking characteristics comprises:
Determining a second object according to the position information of each object in the image frame, wherein the position information of the second object matches the position of at least one sound source;
For each second object, determining the speaking probability that the second object exhibits speaking characteristics according to the sound production probability of the sound source position corresponding to the second object and the lip movement probability of the second object;
And determining a second object whose speaking probability meets a set condition as the first object.
4. The method of claim 2, wherein performing frame alignment on the image frame containing the speaking object and the audio signal collected in the first period comprises:
Performing frame alignment on the image frame containing the speaking object, the audio frame containing the sound source position information, and the audio signal collected in the first period;
Inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model, and determining a speech recognition result of the speaking object, including:
And inputting the frame-aligned image frame containing the speaking object, the audio frame containing the sound source position information, and the audio signal collected in the first period into the speech recognition model, and determining a speech recognition result of the speaking object (an alignment sketch follows the claims).
5. The method of any one of claims 1 to 4, wherein the speech recognition model comprises sub-models with different attributes;
Inputting the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model, and determining a speech recognition result of the speaking object, including:
Determining identity information of the speaking object according to the image frame containing the speaking object;
Inputting the identity information of the speaking object, the image frame containing the speaking object, and the audio signal collected in the first period into the speech recognition model, and determining the speech recognition result of the speaking object; the identity information of the speaking object is used for determining the sub-model used when performing speech recognition on the speaking object (a sub-model routing sketch follows the claims).
6. A speech recognition device based on localization and denoising, comprising:
A transceiver unit, configured to acquire an audio signal collected in a first period;
A processing unit, configured to: determine an image frame containing a speaking object from the video signal acquired in the first period, wherein the speaking object is determined according to the lip movement characteristics of the same face in the video signal and the sound source position information in the audio signal; perform frame alignment on the image frame containing the speaking object and the audio signal collected in the first period; and input the frame-aligned image frame containing the speaking object and the audio signal collected in the first period into a speech recognition model to determine a speech recognition result of the speaking object.
7. The apparatus as claimed in claim 6, wherein said processing unit is specifically configured to:
Determine sound source position information in an audio frame at a first moment, the lip movement probability of each object in an image frame at the first moment, and the position information of each object in the image frame, wherein the first moment is any moment within the first period; and, if the image frame at the first moment contains a first object, determine the image frame at the first moment as an image frame containing a speaking object, wherein the position information of the first object in the image frame matches the sound source position information and the lip movement probability of the first object accords with the speaking characteristics.
8. The apparatus as claimed in claim 7, wherein said processing unit is specifically configured to: determine the position of each sound source in the audio frame at the first moment and the sound production probability of each sound source; determine a second object according to the position information of each object in the image frame, wherein the position information of the second object matches the position of at least one sound source; for each second object, determine the speaking probability that the second object exhibits speaking characteristics according to the sound production probability of the sound source position corresponding to the second object and the lip movement probability of the second object; and determine a second object whose speaking probability meets a set condition as the first object.
9. A storage medium, characterized in that the storage medium stores a speech recognition program which, when executed by a processor, implements the method according to any one of claims 1 to 5.
10. A computer device, comprising one or more processors; and
one or more computer-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the method of any one of claims 1 to 5.
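
As referenced above, the method of claim 1 can be pictured as the following minimal sketch; the injected callables stand in for the speaking-object detection and frame-alignment steps, and none of these interfaces is defined by the disclosure:

```python
def recognize_speech(audio_signal, video_signal, model,
                     detect_speaking_frames, frame_align):
    """Hypothetical end-to-end pipeline for the claimed method."""
    # Image frames containing a speaking object, determined from the lip
    # movement characteristics in the video and the sound source position
    # information in the audio.
    speaking_frames = detect_speaking_frames(video_signal, audio_signal)
    # Frame-align those image frames with the audio collected in the period.
    aligned_frames = frame_align(speaking_frames, audio_signal)
    # Feed the aligned image frames and the audio to the recognition model.
    return model.recognize(aligned_frames, audio_signal)
```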
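
For the position matching in claim 2, one plausible reading is to map a face bounding box to an azimuth and compare it with the sound source direction; the pinhole-camera mapping below is an assumption, not a coordinate model fixed by the disclosure:

```python
import math

def bbox_azimuth_deg(bbox, image_width, horizontal_fov_deg):
    """Map a face bounding box (x1, y1, x2, y2) to an azimuth in degrees,
    assuming a pinhole camera whose optical axis defines 0 degrees."""
    x1, _, x2, _ = bbox
    # Normalised horizontal offset of the box centre from the image centre.
    offset = (x1 + x2) / 2.0 / image_width - 0.5
    half_fov = math.radians(horizontal_fov_deg / 2.0)
    return math.degrees(math.atan(2.0 * offset * math.tan(half_fov)))

def position_matches(bbox, image_width, horizontal_fov_deg,
                     source_azimuth_deg, tol_deg=10.0):
    """True if the object's image position matches the sound source position."""
    face_azimuth = bbox_azimuth_deg(bbox, image_width, horizontal_fov_deg)
    return abs(face_azimuth - source_azimuth_deg) <= tol_deg
```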
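
Frame alignment as in claims 1 and 4 has to reconcile different frame rates; a nearest-timestamp pairing such as the sketch below is one straightforward reading (the 25 fps video and 10 ms audio frames are illustrative assumptions):

```python
def frame_align(image_frames, audio_frames,
                video_fps=25.0, audio_frames_per_sec=100.0):
    """Pair each audio frame with the temporally nearest image frame.

    Both inputs are indexable sequences ordered in time; the frame rates
    are assumptions used only to convert indices to timestamps.
    """
    if not image_frames:
        return []
    aligned = []
    for a_idx, audio_frame in enumerate(audio_frames):
        t = a_idx / audio_frames_per_sec          # audio frame time (seconds)
        v_idx = min(int(round(t * video_fps)), len(image_frames) - 1)
        aligned.append((image_frames[v_idx], audio_frame))
    return aligned
```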
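
The sub-model selection of claim 5 can be sketched as a lookup keyed by an identity attribute; the attribute vocabulary and the `recognize` interface below are illustrative assumptions:

```python
def recognize_with_submodel(identity, image_frames, audio_signal, submodels,
                            default_key="generic"):
    """Route recognition to the sub-model selected by the speaker's identity.

    `identity` is identity information determined from the image frames
    containing the speaking object (e.g. via face recognition), carrying an
    attribute such as an accent or age-group label; `submodels` maps each
    attribute value to a speech recognition sub-model.
    """
    key = identity.get("attribute", default_key)
    model = submodels.get(key, submodels[default_key])
    return model.recognize(image_frames, audio_signal)
```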
CN201910817769.5A 2019-08-30 2019-08-30 Voice recognition method and device based on positioning and denoising Pending CN110545396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910817769.5A CN110545396A (en) 2019-08-30 2019-08-30 Voice recognition method and device based on positioning and denoising

Publications (1)

Publication Number Publication Date
CN110545396A true CN110545396A (en) 2019-12-06

Family

ID=68711155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910817769.5A Pending CN110545396A (en) 2019-08-30 2019-08-30 Voice recognition method and device based on positioning and denoising

Country Status (1)

Country Link
CN (1) CN110545396A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1930575A (en) * 2004-03-30 2007-03-14 英特尔公司 Techniques for separating and evaluating audio and video source data
CN1941079A (en) * 2005-09-27 2007-04-04 通用汽车公司 Speech recognition method and system
CN103902963A (en) * 2012-12-28 2014-07-02 联想(北京)有限公司 Method and electronic equipment for recognizing orientation and identification
CN103794214A (en) * 2014-03-07 2014-05-14 联想(北京)有限公司 Information processing method, device and electronic equipment
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
CN106328156A (en) * 2016-08-22 2017-01-11 华南理工大学 Microphone array voice reinforcing system and microphone array voice reinforcing method with combination of audio information and video information
CN107767137A (en) * 2016-08-23 2018-03-06 中国移动通信有限公司研究院 A kind of information processing method, device and terminal
WO2019150234A1 (en) * 2018-01-31 2019-08-08 Iebm B.V. Speech recognition with image signal

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110544491A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Method and device for real-time association of speaker and voice recognition result thereof
CN111048113A (en) * 2019-12-18 2020-04-21 腾讯科技(深圳)有限公司 Sound direction positioning processing method, device and system, computer equipment and storage medium
WO2021169023A1 (en) * 2020-02-28 2021-09-02 科大讯飞股份有限公司 Voice recognition method, apparatus and device, and storage medium
CN111739534A (en) * 2020-06-04 2020-10-02 广东小天才科技有限公司 Processing method and device for assisting speech recognition, electronic equipment and storage medium
CN111739534B (en) * 2020-06-04 2022-12-27 广东小天才科技有限公司 Processing method and device for assisting speech recognition, electronic equipment and storage medium
CN112581981A (en) * 2020-11-04 2021-03-30 北京百度网讯科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN112581981B (en) * 2020-11-04 2023-11-03 北京百度网讯科技有限公司 Man-machine interaction method, device, computer equipment and storage medium
CN114664295A (en) * 2020-12-07 2022-06-24 北京小米移动软件有限公司 Robot and voice recognition method and device for same
CN112633208A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Lip language identification method, service equipment and storage medium
CN113486760A (en) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and device, electronic equipment and storage medium
CN113660537A (en) * 2021-09-28 2021-11-16 北京七维视觉科技有限公司 Subtitle generating method and device
CN116597829A (en) * 2023-07-18 2023-08-15 西兴(青岛)技术服务有限公司 Noise reduction processing method and system for improving voice recognition precision
CN116597829B (en) * 2023-07-18 2023-09-08 西兴(青岛)技术服务有限公司 Noise reduction processing method and system for improving voice recognition precision

Similar Documents

Publication Publication Date Title
CN110545396A (en) Voice recognition method and device based on positioning and denoising
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
CN110544479A (en) Denoising voice recognition method and device
CN110544491A (en) Method and device for real-time association of speaker and voice recognition result thereof
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US10043064B2 (en) Method and apparatus of detecting object using event-based sensor
US20170177946A1 (en) Method, device, and computer program for re-identification of objects in images obtained from a plurality of cameras
US10229503B2 (en) Methods and systems for splitting merged objects in detected blobs for video analytics
KR20150031896A (en) Speech recognition device and the operation method
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
CN110503957A (en) A kind of audio recognition method and device based on image denoising
CN110750152A (en) Human-computer interaction method and system based on lip action
CN108875506B (en) Face shape point tracking method, device and system and storage medium
CN114758271A (en) Video processing method, device, computer equipment and storage medium
Lopatka et al. Acceleration of decision making in sound event recognition employing supercomputing cluster
CN110674728A (en) Method, device, server and storage medium for playing mobile phone based on video image identification
CN106803937B (en) Double-camera video monitoring method, system and monitoring device with text log
CN110544270A (en) method and device for predicting human face tracking track in real time by combining voice recognition
CN113064118A (en) Sound source positioning method and device
CN111784750A (en) Method, device and equipment for tracking moving object in video image and storage medium
CN114819110B (en) Method and device for identifying speaker in video in real time
CN116453233A (en) Face multi-mode detection method and system integrating ultrasonic wave and image information
CN112687274A (en) Voice information processing method, device, equipment and medium
CN109800678A (en) The attribute determining method and device of object in a kind of video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191206