CN113611308B - Voice recognition method, device, system, server and storage medium

Info

Publication number
CN113611308B
CN113611308B (application CN202111048642.5A)
Authority
CN
China
Prior art keywords
voice
speaker
information
image
speaking
Prior art date
Legal status
Active
Application number
CN202111048642.5A
Other languages
Chinese (zh)
Other versions
CN113611308A
Inventor
齐昕
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202111048642.5A
Publication of CN113611308A
Application granted
Publication of CN113611308B


Classifications

    • G10L 15/25: Speech recognition using non-acoustical features; using position of the lips, movement of the lips or face analysis
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G10L 15/063: Speech recognition; training (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 2015/0631: Creating reference templates; clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the invention provide a voice recognition method, device, system, server and storage medium. The method includes: acquiring speaking images of a plurality of speakers in a conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes a voice signal generated by the plurality of speakers speaking simultaneously; recognizing the speaking images and determining azimuth information and lip movement information of each speaker; and, for each speaker, inputting the lip movement information, voiceprint information and azimuth information of the speaker together with the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker. The voice recognition model is trained based on multi-user voice samples, and each multi-user voice sample includes the lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously. Because the voice signal does not need to be separated, its integrity is preserved and the accuracy of voice recognition is improved.

Description

Voice recognition method, device, system, server and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, device, system, server, and storage medium.
Background
Currently, video conferencing has become a common mode of communication in people's work and life. To keep a record of conference content, the speech of each person in the conference needs to be collected and recognized to obtain the corresponding text information. In a conference, however, situations in which several users speak at the same time are unavoidable, and in that case it is necessary to recognize what each of the simultaneous speakers is saying.
In the current voice recognition approach, after a voice signal generated by several users speaking simultaneously is obtained, that signal is first separated to obtain the voice information corresponding to each user, and voice recognition is then performed on each user's voice information to obtain the content spoken by that user.
Because the spectrum of the voice signal is damaged during this separation, the accuracy of voice recognition is low.
Disclosure of Invention
Embodiments of the present invention aim to provide a voice recognition method, device, system, server and storage medium that improve the accuracy of voice recognition. The specific technical solutions are as follows:
in a first aspect, an embodiment of the present invention provides a method for voice recognition, where the method includes:
acquiring speaking images of a plurality of speakers in a conference, voice signals and voiceprint information of each speaker, wherein the voice signals comprise voice signals generated by simultaneous speaking of the plurality of speakers;
Identifying the speaking image and determining azimuth information and lip movement information of each speaker;
And inputting, for each speaker, the lip movement information, voiceprint information and azimuth information of the speaker together with the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is trained based on multi-user voice samples, and each multi-user voice sample includes the lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously.
Optionally, the voice signal is a voice signal collected by a microphone array, and the microphone array includes a plurality of array elements;
the step of inputting the lip movement information, the voiceprint information, the azimuth information and the voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker comprises the following steps:
Inputting lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts voice characteristics corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and phase characteristics among a plurality of array elements, and performs voice recognition by combining the voice characteristics with the lip movement information to obtain text information corresponding to the speaker.
Optionally, the speech recognition model includes: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
the step of the voice recognition model extracting the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements, and performing voice recognition on the voice feature combined with the lip movement information to obtain text information corresponding to the speaker, includes:
The residual error layer performs feature extraction on the lip movement information to obtain lip features, and inputs the lip features into the second splicing layer;
the first splicing layer splices the voice signal, the azimuth information and the voiceprint information, and inputs a spliced result to the convolution layer;
The convolution layer extracts voice characteristics corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements, and inputs the voice characteristics into the second splicing layer;
The second splicing layer splices the voice features and the lip features and inputs the spliced features into the recognition layer;
and the recognition layer performs voice recognition based on the spliced features to obtain corresponding text information of the speaker, and outputs the text information.
Optionally, before the step of acquiring the speaking images, the voice signal, and the voiceprint information of each speaker, the method further includes:
Acquiring a conference image in a conference, performing lip movement detection on the conference image, and determining a target speaker who is speaking;
Determining identity information of the target speaker based on a pre-established face library;
Acquiring a voice signal of the target speaker, and extracting voiceprint information of the voice signal;
and correspondingly recording the voiceprint information and the identity information.
Optionally, the step of identifying the speech image and determining the azimuth information of each speaker includes:
identifying the speaking image and determining the face pixel point of each speaker;
For each speaker, determining the angle information of the speaker relative to the voice acquisition device as the azimuth information of the speaker based on the position of the face pixel point of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for shooting the speaking image and the position of the voice acquisition device.
Optionally, the training mode of the speech recognition model includes:
Acquiring the multi-user voice sample and an initial model;
taking the text information corresponding to each user that is included in each multi-user voice sample as a sample label;
inputting each multi-user voice sample into the initial model to obtain predicted text information;
And adjusting model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
Optionally, the method further comprises:
and generating a conference record based on the text information corresponding to each speaker.
In a second aspect, an embodiment of the present invention provides a voice recognition apparatus, including:
The first acquisition module is used for acquiring speaking images of a plurality of speakers in a conference, a voice signal and voiceprint information of each speaker, where the voice signal includes a voice signal generated by the plurality of speakers speaking simultaneously;
The first determining module is used for identifying the speaking image and determining azimuth information and lip movement information of each speaker;
The recognition module is used for inputting, for each speaker, the lip movement information, voiceprint information and azimuth information of the speaker together with the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is trained based on multi-user voice samples, and each multi-user voice sample includes the lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously.
Optionally, the voice signal is a voice signal collected by a microphone array, and the microphone array includes a plurality of array elements;
The identification module comprises:
The first recognition unit is used for inputting lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts voice features corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and phase characteristics among the plurality of array elements, and performs voice recognition by combining the voice features with the lip movement information to obtain text information corresponding to the speaker.
Optionally, the speech recognition model includes: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
The first recognition unit includes:
The first extraction subunit is used for extracting the characteristics of the lip movement information by the residual error layer to obtain lip characteristics and inputting the lip characteristics into the second splicing layer;
the first splicing subunit is used for splicing the voice signal, the azimuth information and the voiceprint information by the first splicing layer and inputting the spliced result into the convolution layer;
The second extraction subunit is used for extracting, by the convolution layer, the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements, and inputting the voice feature into the second splicing layer;
The second splicing subunit is used for splicing the voice features and the lip features by the second splicing layer and inputting the spliced features into the recognition layer;
and the recognition subunit is used for carrying out voice recognition on the basis of the spliced characteristics by the recognition layer to obtain corresponding text information of the speaker and outputting the text information.
Optionally, the apparatus further includes:
The second acquisition module is used for acquiring a conference image in a conference, carrying out lip movement detection on the conference image and determining a target speaker who is speaking;
The second determining module is used for determining the identity information of the target speaker based on a pre-established face library;
The third acquisition module is used for acquiring the voice signal of the target speaker and extracting voiceprint information of the voice signal;
And the recording module is used for correspondingly recording the voiceprint information and the identity information.
Optionally, the first determining module includes:
the second recognition unit is used for recognizing the speaking image and determining the face pixel point of each speaker;
And the determining unit is used for determining the angle information of each speaker relative to the voice acquisition device as the azimuth information of the speaker based on the position of the face pixel point of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for shooting the speaking image and the position of the voice acquisition device.
Optionally, the speech recognition model is obtained by training in advance by a model training module, and the model training module includes:
the sample acquisition unit is used for acquiring the multi-user voice sample and the initial model;
The label determining unit is used for taking text information corresponding to each user included in each multi-user voice sample as a sample label;
the text prediction unit is used for inputting each multi-user voice sample into the initial model to obtain predicted text information;
And the parameter adjustment unit is used for adjusting the model parameters of the initial model based on the difference between the predictive text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
Optionally, the apparatus further includes:
and the generation module is used for generating a conference record based on the text information corresponding to each speaker.
In a third aspect, an embodiment of the present invention provides a speech recognition system, where the system includes a server and a terminal, and the terminal is provided with an image capturing device and a speech capturing device, where:
The image acquisition equipment is used for acquiring images in a conference;
the voice acquisition equipment is used for acquiring voice signals in a conference;
The terminal is used for sending the image and the voice signal to the server;
the server is configured to receive the image and the voice signal, and perform the method steps of any one of the first aspect above.
In a fourth aspect, an embodiment of the present invention provides a server, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
A memory for storing a computer program;
a processor for implementing the method steps of any of the above first aspects when executing a program stored on a memory.
In a fifth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements the method steps of any of the first aspects described above.
The embodiment of the invention has the beneficial effects that:
In the scheme provided by the embodiments of the present invention, a server can acquire speaking images of multiple speakers in a conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes a voice signal generated by the multiple speakers speaking simultaneously. The server recognizes the speaking images, determines azimuth information and lip movement information of each speaker, and, for each speaker, inputs the lip movement information, voiceprint information and azimuth information of the speaker together with the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker. The voice recognition model is trained based on multi-user voice samples, and each multi-user voice sample includes the lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously. With this scheme, the server can input the acquired speaking images and voice signal of the multiple speakers and the voiceprint information of each speaker into the voice recognition model without separating the voice signal of the multiple speakers according to the different speakers, so the spectral integrity of the voice signals of the different speakers is guaranteed and the accuracy of voice recognition is improved. Of course, it is not necessary for any product or method embodying the invention to achieve all of the advantages described above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings.
FIG. 1 is a schematic diagram of an implementation scenario in which a speech recognition method according to an embodiment of the present invention is applied;
FIG. 2 is a flowchart of a method for speech recognition according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of recognition by a speech recognition model according to an embodiment of the present invention;
FIG. 4 is a flowchart of another speech recognition method according to an embodiment of the present invention;
FIG. 5 is a flowchart of step S202 in the embodiment shown in FIG. 2;
FIG. 6 is a flow chart of speech recognition model training provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of another voice recognition device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another voice recognition device according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a speech recognition system according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
In order to improve the accuracy of speech recognition, embodiments of the present invention provide a speech recognition method, apparatus, system, server, computer readable storage medium, and computer program product. In order to facilitate understanding of a speech recognition method provided by the embodiments of the present invention, an implementation scenario to which the speech recognition method provided by the embodiments of the present invention may be applied is first described below.
Fig. 1 is a schematic diagram of an implementation scenario to which a speech recognition method according to an embodiment of the present invention can be applied. Multiple participants take part in the video conference; for example, they may include conference participants 1 through 7. The server 130 is communicatively connected to the terminal 140 for data transmission. The terminal 140 may be an electronic device with a display screen, for example a conference tablet or a touch all-in-one machine, and may also be provided with a voice acquisition device 110 and an image acquisition device 120, where the voice acquisition device 110 is used to collect the voice signals produced by participants when they speak during the conference, the image acquisition device 120 is used to collect images of the participants during the conference, and the display screen may display conference-related information.
The voice capture device 110 may be a microphone array, where the microphone array may be: a linear array, a triangular array, a T-shaped array, a uniform circular array, etc., as exemplified by the linear array in fig. 1. The image capturing device 120 may be a device capable of capturing an image, such as a camera, and is not particularly limited herein.
After the conference is finished, the terminal 140 may send the voice signal collected by the voice collecting device 110 and the image collected by the image collecting device 120 in the conference process to the server 130, and the server 130 may obtain the conference video including the speaking images and the voice signals of the multiple speakers in the conference process. The following describes a voice recognition method provided by the embodiment of the invention.
As shown in fig. 2, a method for speech recognition, the method comprising:
S201, acquiring speaking images, voice signals and voiceprint information of each speaker in a conference;
wherein the voice signal comprises a voice signal generated by the simultaneous speaking of the plurality of speakers.
S202, recognizing the speaking image, and determining azimuth information and lip movement information of each speaker;
s203, inputting lip movement information, voiceprint information, azimuth information and the voice signals of each speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker.
The voice recognition model is trained based on multi-user voice samples, wherein the multi-user voice samples comprise lip movement information, voiceprint information, azimuth information of each user and voice signals generated by simultaneous speaking of multiple users.
In the solution provided by the embodiment of the present invention, the server may obtain speaking images of multiple speakers in the conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes a voice signal generated by the multiple speakers speaking simultaneously; identify the speaking images and determine azimuth information and lip movement information of each speaker; and, for each speaker, input the lip movement information, voiceprint information and azimuth information of the speaker together with the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker. The voice recognition model is trained based on multi-user voice samples, and each multi-user voice sample includes the lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously. With this solution, the server can input the acquired speaking images and voice signal of the multiple speakers and the voiceprint information of each speaker into the voice recognition model without separating the voice signal according to the different speakers, so the spectral integrity of the voice signals of the different speakers is guaranteed and the accuracy of voice recognition is improved.
The conference video sent to the server by the terminal may include videos corresponding to multiple speaking conditions in the conference, where the speaking conditions may include: one speaker speaking or multiple speakers speaking both.
For the case where multiple speakers speak at the same time, the server may acquire the speaking images, the voice signals, and the voiceprint information of each speaker from the conference video. The speaking images of the multiple speakers may be multi-frame images which are collected by the image collecting device in the conference video and can represent lip actions of the speakers, and the multi-frame images may be images including all conference participants, or images respectively aiming at each conference participant, which is not limited specifically herein.
In one embodiment, the server may identify a conference image in the conference video, determine the lip image features of the speakers in it, and determine the number of speakers at the current time from the motion information of those lip image features. When there are multiple speakers, the server may take the conference image as a speaking image of the multiple speakers and acquire the voice signal collected at the moment corresponding to the speaking image together with the voiceprint information of each speaker.
If the number of the speakers is one, the server can acquire the voice signal acquired at the moment corresponding to the speaking image, wherein the voice signal is the voice signal sent by the speaker, and further, the server can perform voice recognition on the voice signal by adopting a voice recognition algorithm, so that corresponding text information can be obtained.
The voice signal of the plurality of speakers is the signal collected by the voice acquisition device during the period of the conference video in which the plurality of speakers speak at the same time; that is, a single voice signal in which the voice signals of the plurality of speakers are mixed together.
For example, while identifying the conference images, the server extracts the lip image features of the speakers in conference image 1 and determines from them that speaker A and speaker B are speaking simultaneously at the moment corresponding to conference image 1. The server continues to identify the conference images in sequence until, at conference image 20, it determines from the lip image features that only speaker A is still speaking. The span from the moment corresponding to conference image 1 to the moment corresponding to conference image 20 is then the period in which speaker A and speaker B speak simultaneously, and the voice signal corresponding to this period is the voice signal generated by speaker A and speaker B speaking at the same time. This voice signal can be recognized with the method provided by the embodiment of the present invention.
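To make this scan concrete, the following is a minimal Python sketch of locating the periods in which more than one speaker's lips are moving. The helper `active_speakers(frame)`, which would wrap the lip-image-feature analysis above and return the IDs of the speakers whose lips are moving in a frame, is a hypothetical name, not part of the patent.

```python
def find_overlap_segments(frames, fps, active_speakers):
    """Yield (start_seconds, end_seconds, speaker_ids) for every span of
    frames in which more than one speaker is talking at the same time."""
    start, current = None, frozenset()
    for i, frame in enumerate(frames):
        speakers = frozenset(active_speakers(frame))   # hypothetical lip-activity detector
        if len(speakers) > 1 and start is None:
            start, current = i, speakers               # overlap begins
        elif len(speakers) <= 1 and start is not None:
            yield start / fps, i / fps, current        # overlap ends
            start = None
    if start is not None:                              # overlap runs to the last frame
        yield start / fps, len(frames) / fps, current
```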
The voiceprint information is information that can represent the spectral characteristics of a speaker's voice. To make it easy to obtain, the server can extract and store the voiceprint information of each speaker the first time that speaker speaks alone during the conference. Later, when a voice signal generated by several speakers speaking simultaneously needs to be recognized, the server can determine the identity information of each speaker based on a pre-established face library and the speaking image, and obtain the speaker's voiceprint information based on that identity information.
After acquiring the speech images, the voice signals, and the voiceprint information of each speaker in the conference, the server may perform step S202 described above, that is, identify the speech images, and determine the azimuth information and the lip movement information of each speaker.
The server may identify the speaking image and extract the lip image features of each speaker. For each speaker, it may take any lip image feature as the speaker's position in the speaking image, or it may calculate the average of the speaker's lip image features and take the point corresponding to that average as the speaker's position in the speaking image; both are reasonable.
After determining the position of the speaker in the speaking image, the server can determine the actual position information of the speaker in the conference scene according to the external parameters and the internal parameters of the image acquisition device calibrated in advance, and then calculate the relative position relation between the speaker and the voice acquisition device according to the position of the voice acquisition device, so that the azimuth information of the speaker can be determined.
In one embodiment, the image acquisition device is a camera and the voice acquisition device is a microphone array. A coordinate system is established with the position of the microphone array in the conference scene as the origin of a three-dimensional coordinate system, where the X axis and Y axis span the horizontal plane. The server can extract the lip image features of each speaker in speaking image 1, take lip image feature A as the position of speaker A in that frame, and calculate, from the camera's intrinsic and extrinsic parameters, the three-dimensional coordinates (x, y, z) of speaker A in this coordinate system. The angle arctan(x/y) is then computed as the azimuth information of the speaker.
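A minimal sketch of this angle computation, assuming the speaker's three-dimensional position (x, y, z) has already been recovered from the calibrated camera parameters in the array-centred coordinate system described above. Using `atan2` rather than a bare tangent is a robustness choice of the sketch, avoiding division by zero when y = 0.

```python
import math

def azimuth_degrees(x: float, y: float) -> float:
    """Horizontal angle of the speaker relative to the microphone array,
    i.e. arctan(x / y) in the array-centred coordinate system."""
    return math.degrees(math.atan2(x, y))
```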
The server can identify the speaking image of a plurality of speakers simultaneously speaking, extract the lip image characteristic of each speaker from the speaking image, and take the change information of the lip image characteristic of the speaker in the multi-frame speaking image as the lip movement information of the speaker.
In one embodiment, the server may already have determined the lip image feature of each speaker in the speaking image when determining the number of speakers currently speaking simultaneously. In that case, the server may take the lip image features determined at that point as the lip movement information of the corresponding speakers, without identifying the speaking image again.
Next, the server may perform the step S203 described above, that is, for each speaker, inputs lip movement information, voiceprint information, azimuth information, and a voice signal of the speaker into a pre-trained voice recognition model, so as to obtain text information corresponding to the speaker.
The server inputs lip movement information, voiceprint information, azimuth information of the speaker and voice signals generated by the simultaneous speaking of the plurality of speakers into a pre-trained voice recognition model together, so as to obtain text information corresponding to the speaker, instead of separating the voice signals generated by the simultaneous speaking of the plurality of speakers into multiple voice signals.
The voice recognition model is trained based on multi-user voice samples in advance, wherein the multi-user voice samples can comprise lip movement information, voiceprint information and azimuth information of each user and voice signals generated by simultaneous speaking of a plurality of users, namely the voice recognition model is trained based on lip movement information, voiceprint information and azimuth information of each user and voice signals generated by simultaneous speaking of a plurality of users.
During training, the voice signal generated by multiple users speaking simultaneously is a voice signal in which the signals produced by the multiple users are mixed together, and it is not separated according to the different users. The voice recognition model can learn the correspondence between the lip movement information, voiceprint information and azimuth information of each user and the text information corresponding to that user. Then, when the voice recognition model is used, it can perform the corresponding processing on the lip movement information, voiceprint information and azimuth information of each speaker and the voice signal generated by the multiple speakers speaking simultaneously, so as to obtain the text information corresponding to the speaker.
Aiming at the condition that a plurality of speakers speak simultaneously, the server can perform voice recognition on the speakers one by one, namely, traverse the plurality of speakers speaking simultaneously, and input lip movement information, voiceprint information, azimuth information and voice signals corresponding to each speaker into a voice recognition model when traversing one speaker, so that text information corresponding to each speaker can be obtained respectively, and further voice recognition of the plurality of speakers speaking simultaneously is completed.
For example, the server determines that speaker A, speaker B, and speaker C speak simultaneously between 2 minutes 5 seconds and 5 minutes 10 seconds, and can obtain the lip movement information, voiceprint information and azimuth information of each of speaker A, speaker B, and speaker C, together with the voice signal a generated by their simultaneous speech. It then traverses each speaker.
Specifically, when traversing speaker A, the server may input the lip movement information, voiceprint information and azimuth information of speaker A together with the voice signal a into the voice recognition model and obtain the text information corresponding to speaker A. When traversing speaker B, it inputs the lip movement information, voiceprint information and azimuth information of speaker B together with the voice signal a into the voice recognition model and obtains the text information corresponding to speaker B. When traversing speaker C, it inputs the lip movement information, voiceprint information and azimuth information of speaker C together with the voice signal a into the voice recognition model and obtains the text information corresponding to speaker C.
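A minimal sketch of this traversal, assuming a callable `recognizer` standing in for the pre-trained voice recognition model and per-speaker dictionaries holding the three cues; all names are illustrative.

```python
def recognize_all(speakers, mixed_signal, recognizer):
    """Run the model once per speaker, always over the same mixed signal."""
    return {
        s["name"]: recognizer(s["lip_movement"], s["voiceprint"],
                              s["azimuth"], mixed_signal)
        for s in speakers
    }

# e.g. transcripts = recognize_all([speaker_a, speaker_b, speaker_c], signal_a, model)
```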
The voice recognition model is trained based on the lip movement information, voiceprint information and azimuth information of each user and the voice signal generated by multiple users speaking simultaneously, and during training the voice signal is not separated according to the different users. Consequently, when using the voice recognition model, the server inputs the lip movement information, voiceprint information and azimuth information of a speaker together with the voice signal generated by the multiple speakers speaking simultaneously into the pre-trained model, and the text information can be recognized without separating the voice signal according to the different speakers. This guarantees the spectral integrity of each speaker's voice signal and improves the accuracy of voice recognition.
As an implementation of the embodiment of the present invention, the voice signal may be a voice signal collected by a microphone array, where the microphone array includes a plurality of array elements. Because the array elements occupy different positions in the microphone array, the voice signals of the plurality of speakers arrive at each element at slightly different times; that is, the waveform of the voice signal received by each array element has different phase characteristics. This property makes it possible to accurately identify the voice features of different speakers from the phase characteristics of the waveforms, without separating the voice signal generated by the plurality of speakers speaking simultaneously.
In this case, the step of inputting the lip movement information, the voiceprint information, the azimuth information, and the voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker may include:
Inputting lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts voice characteristics corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and phase characteristics among a plurality of array elements, and performs voice recognition by combining the voice characteristics with the lip movement information to obtain text information corresponding to the speaker.
For each speaker, the server may input the lip movement information, voiceprint information and azimuth information of the speaker together with the voice signal into a pre-trained voice recognition model. Because the phase characteristics of the waveforms received by the array elements of the microphone array differ, the voice recognition model can extract the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements. The lip movement information represents the lip image features of the speaker while speaking, and combining the voice feature with the lip movement information for voice recognition improves accuracy when multiple speakers speak simultaneously, yielding the text information corresponding to the speaker.
It can be seen that, in this embodiment, the server may input lip movement information, voiceprint information, azimuth information and a voice signal of each speaker into a pre-trained voice recognition model, so that the voice recognition model extracts a voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and phase characteristics among a plurality of array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain text information corresponding to the speaker. Because the time delay exists in the voice signals received by a plurality of array elements in the microphone array at the same time, different phase characteristics can be generated, and the voice recognition model can accurately recognize the voice characteristics of different speakers under the condition that the voice signals generated by simultaneously speaking a plurality of speakers are not separated. Therefore, the complete frequency spectrum of the voice signal of each speaker is ensured, and the accuracy of voice recognition is improved.
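The phase cue the model relies on can be made explicit with a short NumPy sketch: for a multi-channel recording, the phase of each element's spectrum relative to a reference element varies with the direction of arrival. The STFT framing below (window length, hop, Hann window) is an illustrative assumption, not something the patent specifies.

```python
import numpy as np

def interchannel_phase(x, n_fft=512, hop=256):
    """x: (channels, samples) -> (channels - 1, frames, bins) of phase
    differences relative to array element 0."""
    spectra = []
    for channel in x:
        # Frame the channel and take a windowed real FFT per frame.
        frames = np.lib.stride_tricks.sliding_window_view(channel, n_fft)[::hop]
        spectra.append(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
    spec = np.stack(spectra)                      # (channels, frames, bins)
    return np.angle(spec[1:] * np.conj(spec[0]))  # phase relative to element 0
```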
As shown in fig. 3, the speech recognition model may include: residual layer 350, first splice layer 340, convolutional layer 330, second splice layer 320, and identification layer 310.
Correspondingly, the step of extracting the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements by the voice recognition model, and performing voice recognition by combining the voice feature with the lip movement information to obtain text information corresponding to the speaker may include:
The residual layer 350 performs feature extraction on the lip movement information 304 to obtain lip features and inputs them into the second splicing layer 320. The first splicing layer 340 splices the voice signal 301, the azimuth information 303 and the voiceprint information 302 and inputs the spliced result into the convolution layer 330. The convolution layer 330 extracts the voice feature corresponding to the speaker from the voice signal 301 based on the azimuth information 303, the voiceprint information 302 and the phase characteristics among the plurality of array elements, and inputs the voice feature into the second splicing layer 320. The second splicing layer 320 splices the voice feature and the lip features and inputs the spliced feature into the recognition layer 310, which performs voice recognition based on the spliced feature, obtains the text information corresponding to the speaker, and outputs it.
The convolutional layer may be a convolutional neural network (Convolutional Neural Networks, CNN), the residual layer may be a residual network, and the recognition layer may be an end-to-end automatic speech recognition (Automatic Speech Recognition, ASR) module, which is not specifically limited herein.
The lip features can represent the lip image features of the speaker while speaking. The spliced result output by the first splicing layer is the voice signal of the multiple simultaneous speakers spliced together with the azimuth information and voiceprint information of the speaker, and may include the azimuth feature of the speaker, the spectral feature of the speaker's voice, and the features of the mixed voice signal produced by the multiple users speaking together.
Because the voice signals of the multiple speakers speaking simultaneously are collected by the microphone array, and the positions of the array elements in the microphone array are different, the voice signals of the multiple speakers are received at the same time, namely, the phase characteristics of the waveforms of the voice signals received by each array element are different, so that the convolution layer 330 can extract the voice characteristics corresponding to the speaker from the voice signals 301 based on the azimuth information 303, the voiceprint information 302 and the phase characteristics among the multiple array elements, and the voice characteristics and the lip characteristics are spliced in the second splicing layer 320.
At this time, the voice feature and the lip feature are features corresponding to the speaker, and the features of the speaker are represented from two dimensions of the voice feature and the image feature, and then the voice feature and the lip feature of the speaker are spliced and then input into the recognition layer 310, and the recognition layer 310 can accurately recognize and obtain the corresponding text information of the speaker based on the fusion features of the two dimensions of the voice feature and the image feature and output the text information.
It can be seen that, in this embodiment, the residual layer in the speech recognition model performs feature extraction on the lip movement information to obtain a lip feature, and inputs the lip feature into the second splicing layer, the first splicing layer splices the speech signal, the azimuth information and the voiceprint information, and inputs the spliced result into the convolution layer, the convolution layer extracts the speech feature corresponding to the speaker from the speech signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements, and inputs the speech feature into the second splicing layer, the second splicing layer splices the speech feature and the lip feature, inputs the spliced feature into the recognition layer, and the recognition layer performs speech recognition based on the spliced feature to obtain the corresponding text information of the speaker, and outputs the text information.
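The data flow of FIG. 3 can be sketched in PyTorch as below. The layer widths, the two-linear-layer residual block, and the GRU-plus-linear recognition head are illustrative assumptions standing in for the residual network and end-to-end ASR modules named above; the patent fixes only the overall structure of splice, convolve, splice again and recognize.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    def __init__(self, audio_dim, lip_dim, vp_dim, az_dim, hidden=256, vocab=5000):
        super().__init__()
        self.lip_proj = nn.Linear(lip_dim, hidden)
        self.residual = nn.Sequential(                 # residual layer 350
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.conv = nn.Sequential(                     # convolution layer 330
            nn.Conv1d(audio_dim + vp_dim + az_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.recognize = nn.GRU(2 * hidden, hidden, batch_first=True)  # recognition layer 310
        self.out = nn.Linear(hidden, vocab)

    def forward(self, speech, azimuth, voiceprint, lips):
        # speech: (B, T, audio_dim), lips: (B, T, lip_dim),
        # azimuth: (B, az_dim), voiceprint: (B, vp_dim)
        T = speech.size(1)
        cues = torch.cat([azimuth, voiceprint], -1).unsqueeze(1).expand(-1, T, -1)
        mixed = torch.cat([speech, cues], -1)          # first splicing layer 340
        voice = self.conv(mixed.transpose(1, 2)).transpose(1, 2)
        lip = self.lip_proj(lips)
        lip = lip + self.residual(lip)                 # residual connection on lip features
        fused = torch.cat([voice, lip], -1)            # second splicing layer 320
        h, _ = self.recognize(fused)
        return self.out(h)                             # per-frame token logits
```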
As shown in fig. 4, before the step of obtaining the images, the voice signals, and the voiceprint information of each speaker, the method may further include:
s401, acquiring a conference image in a conference, and carrying out lip movement detection on the conference image to determine a target speaker who is speaking;
The server may acquire a conference image in the current conference, where the conference image may be an image in the current conference video, and the conference image corresponds to a certain time point in the current conference video, for example, the conference image a corresponds to a conference picture 1 minute 13 seconds in the conference video. For speech recognition, the server performs lip movement detection on the conference image, and may determine the target speaker that is speaking, where the target speaker may be one or more, and is not specifically limited herein.
S402, determining identity information of the target speaker based on a pre-established face library;
After determining the target speaker that is speaking, the server may determine the identity information of the target speaker based on the pre-established face library and the face image of the speaker. In order to facilitate determination of identity information of the target speaker, a face library may be pre-established, and the face library may store pre-acquired face model information and corresponding identity information of each person, for example, may be a correspondence between face features and names.
Before the meeting starts, the terminal can acquire a list of meeting participants, the list comprises identity information of the meeting participants, and according to the identity information in the list of the meeting participants, the terminal can extract face features of the meeting participants from the face library and record the face features of the meeting participants, so that registration of the meeting participants is completed. The terminal can store the face characteristics of the conference participants and the identity information of the conference participants locally, or record the face characteristics and the identity information of the conference participants and send the record to the server, which is reasonable.
S403, acquiring a voice signal of the target speaker, and extracting voiceprint information of the voice signal;
In one embodiment, when the target speaker speaks alone when speaking for the first time, the server may directly acquire the voice signal of the target speaker acquired by the voice acquisition device during the period of speaking alone by the target speaker, and extract voiceprint information of the voice signal.
In another embodiment, when the target speaker speaks for the first time, if a plurality of people including the target speaker speak at the same time, the server may acquire the voice signals of the plurality of speakers acquired by the voice acquisition device in a period of time when the plurality of people including the target speaker speak at the same time, extract the voice signals of the target speaker from the voice signals of the plurality of speakers according to the lip movement information and the azimuth information of the target speaker, and extract the voiceprint information of the voice signals.
In the above two embodiments, the voice acquisition device may be a microphone array, and the voice signals collected by the microphone array may be subjected to beamforming, in which the output of each array element is given a delay or phase compensation and an amplitude weighting so as to form a beam pointing in a specific direction. In this way the server can obtain the voice signal of the target speaker more accurately, so that the extracted voiceprint information is more accurate.
The above-mentioned voiceprint information may be extracted from the voice signal using techniques such as a time-delay neural network (Time Delay Neural Network, TDNN) and probabilistic linear discriminant analysis (Probabilistic Linear Discriminant Analysis, PLDA), and the above-mentioned beamforming may use a minimum variance distortionless response (Minimum Variance Distortionless Response, MVDR) beamformer, none of which is limited herein.
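As an illustration of the per-element delay compensation and amplitude weighting just described, here is a minimal frequency-domain delay-and-sum sketch for a uniform linear array. It is deliberately simpler than the MVDR beamformer the patent mentions, and the plane-wave geometry and uniform weights are assumptions of the sketch.

```python
import numpy as np

def delay_and_sum(x, element_positions, azimuth_rad, fs, c=343.0):
    """x: (channels, samples); element_positions: metres along the array axis.
    Steers a beam toward `azimuth_rad` by phase-compensating each element."""
    delays = element_positions * np.sin(azimuth_rad) / c      # plane-wave delays, seconds
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(x, axis=1)
    # Phase compensation per element, then uniform amplitude weighting (mean).
    aligned = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n)
```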
S404, recording the voiceprint information and the identity information correspondingly.
After determining the identity information of the target speaker and obtaining the target speaker's voiceprint information, the server can record the two in correspondence, obtaining the correspondence between the target speaker in the conference and the target speaker's voiceprint information. For example, if the target speaker is target speaker A, after extracting voiceprint information 1 of target speaker A the record may be made as "target speaker A - voiceprint information 1". The correspondence may be recorded in a table or in another form, which is not specifically limited herein. For example, it can be shown in the following table:
Sequence number    Speaker             Voiceprint information
1                  Target speaker A    Voiceprint information 1
2                  Target speaker B    Voiceprint information 2
3                  Target speaker C    Voiceprint information 3
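A minimal sketch of the correspondence record itself, assuming each voiceprint is stored as an embedding vector; the dictionary layout, names and example values are illustrative.

```python
voiceprint_registry: dict[str, list[float]] = {}

def register_voiceprint(identity: str, voiceprint: list[float]) -> None:
    """Record the extracted voiceprint against the speaker's identity (S404)."""
    voiceprint_registry[identity] = voiceprint

register_voiceprint("Target speaker A", [0.12, -0.53, 0.98])  # illustrative values
```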
It can be seen that, in this embodiment, the server may obtain a conference image in a conference, perform lip movement detection on it, determine the target speaker who is speaking, determine the identity information of the target speaker based on a pre-established face library, obtain the voice signal of the target speaker, extract the voiceprint information of that signal, and record the voiceprint information in correspondence with the identity information. In the related art, participants register their voiceprints before the conference starts, but the voiceprint information of the same participant can fluctuate considerably between registration and use, which lowers the voice recognition rate in practice. In this embodiment, participants do not need to register voiceprints before the conference starts; instead, voiceprint information is registered while the voice signals produced by the participants during the conference are being extracted. This avoids the problems of environmental changes between registration and the conference and of inaccurate pre-registered voiceprint information caused by voiceprint fluctuation, makes the voiceprint information of the target speaker more accurate, and improves the accuracy of subsequent voice recognition.
As shown in fig. 5, the step of identifying the speech image and determining the azimuth information of each speaker may include:
s501, recognizing the speaking image and determining the face pixel point of each speaker;
The server may identify the speaking image and determine the face pixel points of each speaker in it. It may select any one of the face pixel points as the position of the speaker's face in the image, or calculate the average of the face pixel points and take the point corresponding to that average as the position of the speaker's face in the image; this is not specifically limited herein.
S502, for each speaker, determining angle information of the speaker relative to the voice acquisition device as azimuth information of the speaker based on a position of the face pixel point of the speaker in the speaking image, a pre-calibrated parameter of an image acquisition device for capturing the speaking image, and a position of the voice acquisition device.
In one embodiment, after obtaining the position of a speaker's face pixel point in the speaking image, the server may calculate the position of the speaker in the conference scene based on that position and the pre-calibrated parameters of the image acquisition device that captured the speaking image. Combined with the relative positions of the voice acquisition device and the camera, the angle information of the speaker relative to the voice acquisition device can then be calculated and taken as the azimuth information of the speaker.
In one embodiment, the image acquisition device is a camera and the voice acquisition device is a microphone array. A three-dimensional coordinate system is established with the position of the camera in the conference scene as the origin, the X-axis and Y-axis forming the horizontal plane. The server may extract the face pixel points of speaker A from speaking image 1 and take the point corresponding to the average value of those face pixel points as the position of speaker A. Then, according to the internal parameters and external parameters of the camera, the server may calculate the three-dimensional coordinates (x1, y1, z1) of speaker A in this coordinate system, as well as the three-dimensional coordinates (x2, y2, z2) of the microphone array in the same coordinate system. The angle whose tangent is (|x1| + |x2|) / (|y1| + |y2|), that is, arctan((|x1| + |x2|) / (|y1| + |y2|)), may then be taken as the azimuth information of the speaker.
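As a minimal sketch of this angle computation (assuming the three-dimensional coordinates of the speaker and the microphone array have already been recovered from the calibrated camera parameters; the function names and numeric example are illustrative):

```python
import numpy as np

def face_position(face_pixels):
    # S501: use the mean of the detected face pixel points as the speaker's
    # position in the speaking image
    return np.asarray(face_pixels, dtype=float).mean(axis=0)

def azimuth_from_coordinates(speaker_xyz, mic_xyz):
    # S502: both positions are expressed in a coordinate system with the camera
    # at the origin and the X/Y axes spanning the horizontal plane; the azimuth
    # is the angle whose tangent is (|x1| + |x2|) / (|y1| + |y2|)
    x1, y1, _ = speaker_xyz
    x2, y2, _ = mic_xyz
    return np.degrees(np.arctan((abs(x1) + abs(x2)) / (abs(y1) + abs(y2))))

# Speaker A at (1.2, 2.0, 0.3) m, microphone array at (0.4, 0.5, -0.1) m
print(azimuth_from_coordinates((1.2, 2.0, 0.3), (0.4, 0.5, -0.1)))  # ~32.6 degrees
```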
It can be seen that, in this embodiment, the server may identify the speaking image and determine the face pixel points of each speaker, and for each speaker, determine the angle information of the speaker relative to the voice acquisition device as the azimuth information of the speaker, based on the position of the speaker's face pixel point in the speaking image, the pre-calibrated parameters of the image acquisition device that captured the speaking image, and the position of the voice acquisition device. The server can thus accurately determine the azimuth information of each speaker, which in turn helps ensure the accuracy of subsequent voice recognition.
As shown in fig. 6, the training method of the speech recognition model according to the embodiment of the present invention may include:
S601, acquiring the multi-user voice sample and an initial model;
The server may obtain multi-user voice samples and an initial model, where each multi-user voice sample includes lip movement information, voiceprint information, azimuth information, a voice signal and text information corresponding to each user. The structure of the initial model is the same as that of the speech recognition model and may include: a residual layer, a first splicing layer, a convolution layer, a second splicing layer and a recognition layer. The initial parameters of the initial model may be default values or may be randomly initialized, which is not specifically limited herein.
S602, taking the text information corresponding to each user included in each multi-user voice sample as a sample label;
The text information corresponding to each user included in each multi-user voice sample may be determined manually; alternatively, the text information may be determined in advance, and multiple users may then speak the corresponding text at the same time to produce the multi-user voice sample. The text information corresponding to each multi-user voice sample can then be used as the sample label of that multi-user voice sample.
S603, inputting each multi-user voice sample into the initial model to obtain predicted text information;
For each user included in the multi-user voice sample, the lip movement information of the user can be input into the residual layer of the initial model to perform feature extraction on the lip movement information, and the lip characteristics are obtained and then input into the second splicing layer. And inputting the voiceprint information of the user, the azimuth information of the user and the voice signals which are simultaneously spoken by a plurality of users into a first splicing layer for splicing, and inputting the spliced result into a convolution layer.
The convolution layer may extract a voice feature corresponding to the user from voice signals of a plurality of users speaking at the same time based on voiceprint information of the user, azimuth information of the user, and phase characteristics among a plurality of array elements included in the microphone array, and input the voice feature into the second concatenation layer. In order to ensure that the trained speech recognition model can accurately process speech signals, the microphone array may be identical to the microphone array described in the above embodiments.
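For intuition about these phase characteristics, a classical delay-and-sum beamformer steered toward a given azimuth illustrates the inter-element arrival-time cues involved. This is only a sketch under an assumed planar-array geometry; the convolution layer described here learns to exploit such cues from data rather than applying a fixed beamformer:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, azimuth_deg, fs, c=343.0):
    # signals: (num_mics, num_samples) time-domain recordings
    # mic_positions: (num_mics, 2) element positions in metres (horizontal plane)
    theta = np.radians(azimuth_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Arrival-time offset of each element relative to the array origin
    delays = mic_positions @ direction / c
    shifts = np.round(delays * fs).astype(int)
    out = np.zeros(signals.shape[1])
    for sig, shift in zip(signals, shifts):
        out += np.roll(sig, -shift)  # align each element to the steering direction
    return out / len(signals)
```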
Furthermore, the second splicing layer can splice the voice features and the lip features, the spliced features are input into the recognition layer, and the recognition layer can perform voice recognition based on the spliced features to obtain text information serving as predicted text information.
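A minimal sketch of this forward pass is given below. The tensor shapes and layer sizes are illustrative assumptions (the embodiment does not specify them), and the recognition layer here emits a single token distribution for brevity, whereas a real recognizer would emit a token sequence:

```python
import torch
import torch.nn as nn

class MultiModalASR(nn.Module):
    def __init__(self, num_tokens=5000, lip_dim=512, feat_dim=256):
        super().__init__()
        # Residual layer: a small residual block over lip-movement features
        self.residual = nn.Sequential(
            nn.Conv1d(lip_dim, lip_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(lip_dim, lip_dim, 3, padding=1),
        )
        # Convolution layer: extracts the target speaker's speech features from
        # the concatenated (waveform, azimuth, voiceprint) input
        self.conv = nn.Sequential(
            nn.Conv1d(3, feat_dim, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Recognition layer over the fused lip and speech features
        self.recognizer = nn.Linear(lip_dim + feat_dim, num_tokens)

    def forward(self, lips, voiceprint, azimuth, waveform):
        # lips: (B, lip_dim, T_l); voiceprint: (B, D); azimuth: (B,); waveform: (B, T)
        lip_feat = (self.residual(lips) + lips).mean(dim=2)  # residual layer
        T = waveform.shape[-1]
        # First splicing layer: stack waveform, azimuth and voiceprint as channels
        mixed = torch.stack([
            waveform,
            azimuth.unsqueeze(-1).expand(-1, T),
            voiceprint.mean(dim=1, keepdim=True).expand(-1, T),
        ], dim=1)
        speech_feat = self.conv(mixed).squeeze(-1)           # convolution layer
        fused = torch.cat([lip_feat, speech_feat], dim=1)    # second splicing layer
        return self.recognizer(fused)                        # recognition layer
```

The essential structure mirrors the layers described above: residual lip features and convolution-extracted speech features are spliced before recognition; time is collapsed by pooling only to keep the sketch short.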
S604, adjusting model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
Because the current initial model may not yet accurately recognize the voice signal, the model parameters of the initial model may be adjusted based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label, so that the parameters become increasingly suitable and the accuracy of voice recognition improves, until the initial model converges. The parameters of the initial model may be adjusted by a gradient descent algorithm, a stochastic gradient descent algorithm, or the like, which is not specifically limited herein.
In one embodiment, the function value of a loss function may be calculated based on the difference between the predicted text information and the sample label, and when the function value reaches a preset value, the current initial model is determined to have converged, yielding the speech recognition model. In another embodiment, the initial model may be considered to have converged after the number of training iterations over the multi-user voice samples reaches a preset number.
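The following sketch ties S601 to S604 together. It assumes the MultiModalASR sketch above and a `dataloader` (not defined here) yielding batched multi-user samples with one label token per sample; the single-token cross-entropy stands in for the sequence loss a real system would use, and the convergence settings are illustrative:

```python
import torch
import torch.nn as nn

model = MultiModalASR(num_tokens=5000)                    # S601: initial model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # or a stochastic variant
criterion = nn.CrossEntropyLoss()
max_epochs, loss_threshold = 50, 0.1                      # assumed preset values

for epoch in range(max_epochs):
    for batch in dataloader:                              # S603: feed samples
        optimizer.zero_grad()
        logits = model(batch["lips"], batch["voiceprint"],
                       batch["azimuth"], batch["waveform"])
        loss = criterion(logits, batch["label"])          # S602: sample label
        loss.backward()
        optimizer.step()                                  # S604: adjust parameters
    if loss.item() < loss_threshold:                      # convergence test
        break
```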
It can be seen that, in this embodiment, the server may obtain a multi-user voice sample and an initial model, take each multi-user voice sample including text information corresponding to each user as a sample tag, input each multi-user voice sample into the initial model to obtain predicted text information, and adjust model parameters of the initial model based on differences between the predicted text information corresponding to each multi-user voice sample and the sample tag until the initial model converges to obtain a voice recognition model. The training mode can train to obtain a model capable of accurately identifying the lip movement information, the voiceprint information, the azimuth information and the voice signals generated by simultaneous talking of multiple users, thereby ensuring the accuracy of subsequent voice identification.
As an implementation manner of the embodiment of the present invention, the method may further include:
and generating a conference record based on the text information corresponding to each speaker.
In a conference video, there may be periods in which a plurality of speakers speak simultaneously and periods in which a single speaker speaks. The server may record the text information corresponding to the speakers in the corresponding time order under these different conditions to generate a conference record.
For example, the text information corresponding to the voice signal generated by speaker A speaking at time a is "this conference covers last quarter's work reports", and at time b, after speaker A has spoken, the text information corresponding to the voice signal generated by speaker B speaking is "our department completed a project last quarter", while the text information corresponding to the voice signal generated by speaker C speaking is "I have a question I would like to resolve". The server may then generate the conference record: Time a: speaker A, this conference covers last quarter's work reports; Time b: speaker B, our department completed a project last quarter; speaker C, I have a question I would like to resolve.
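A minimal sketch of this assembly step, assuming recognized utterances arrive as (time, speaker, text) tuples (an illustrative format, not one specified by the embodiment):

```python
from collections import defaultdict

def generate_meeting_record(utterances):
    # Entries sharing a time stamp correspond to speakers talking simultaneously
    by_time = defaultdict(list)
    for time, speaker, text in utterances:
        by_time[time].append(f"{speaker}: {text}")
    return "\n".join(f"Time {t}: " + "; ".join(parts)
                     for t, parts in sorted(by_time.items()))

print(generate_meeting_record([
    ("a", "Speaker A", "This conference covers last quarter's work reports"),
    ("b", "Speaker B", "Our department completed a project last quarter"),
    ("b", "Speaker C", "I have a question I would like to resolve"),
]))
```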
In one embodiment, the meeting record may further include information such as meeting location, meeting name, etc., which is not specifically limited herein.
It can be seen that, in this embodiment, the server may generate the conference record based on the text information corresponding to each speaker. Because the server can record, in conference time order, both the periods in which multiple speakers speak and those in which a single speaker speaks, and can perform accurate voice recognition even when multiple speakers speak simultaneously, an accurate conference record is obtained without additional conference recording personnel, saving manpower and cost.
Correspondingly, with the voice recognition method, the embodiment of the invention also provides a voice recognition device, and the voice recognition device provided by the embodiment of the invention is introduced.
As shown in fig. 7, a voice recognition apparatus may include:
A first obtaining module 710, configured to obtain speaking images of a plurality of speakers in a conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes voice signals generated by the plurality of speakers speaking simultaneously;
A first determining module 720, configured to identify the speaking image and determine azimuth information and lip movement information of each speaker;
The recognition module 730 is configured to input, for each speaker, the lip movement information, voiceprint information, azimuth information of the speaker and the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is trained based on multi-user voice samples, and the multi-user voice samples include lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously.
In the solution provided in the embodiment of the present invention, the server may obtain speaking images of a plurality of speakers in the conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes voice signals generated by the plurality of speakers speaking simultaneously; identify the speaking image and determine azimuth information and lip movement information of each speaker; and input, for each speaker, the lip movement information, voiceprint information, azimuth information of the speaker and the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is trained based on multi-user voice samples, and the multi-user voice samples include lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously. Through this scheme, the server can input the obtained speaking images, the voice signal and the voiceprint information of each speaker into the voice recognition model without first separating the voice signal by speaker, so the spectrum of each speaker's voice signal remains intact and the accuracy of voice recognition is improved.
As an implementation manner of the embodiment of the present invention, the voice signal may be a voice signal collected by a microphone array, where the microphone array includes a plurality of array elements;
The identification module 730 may include:
The first recognition unit is used for inputting lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts voice features corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and phase characteristics among the plurality of array elements, and performs voice recognition by combining the voice features with the lip movement information to obtain text information corresponding to the speaker.
As an implementation manner of the embodiment of the present invention, the above-mentioned speech recognition model may include: a residual layer, a first splicing layer, a convolution layer, a second splicing layer and a recognition layer;
The first recognition unit may include:
The first extraction subunit is used for extracting the characteristics of the lip movement information by the residual layer to obtain lip characteristics and inputting the lip characteristics into the second splicing layer;
the first splicing subunit is used for splicing the voice signal, the azimuth information and the voiceprint information by the first splicing layer and inputting the spliced result into the convolution layer;
The second extraction subunit is configured to extract, from the speech signal, a speech feature corresponding to the speaker based on the azimuth information, the voiceprint information, and phase characteristics among the plurality of array elements, and input the speech feature into the second splicing layer;
The second splicing subunit is used for splicing the voice features and the lip features by the second splicing layer and inputting the spliced features into the recognition layer;
and the recognition subunit is used for carrying out voice recognition on the basis of the spliced characteristics by the recognition layer to obtain corresponding text information of the speaker and outputting the text information.
As an implementation manner of the embodiment of the present invention, as shown in fig. 8, the apparatus may further include:
A second obtaining module 740, configured to obtain a conference image in a conference, and perform lip movement detection on the conference image, to determine a target speaker that is speaking;
A second determining module 750, configured to determine identity information of the target speaker based on a pre-established face database;
a third obtaining module 760, configured to obtain a voice signal of the target speaker, and extract voiceprint information of the voice signal;
The recording module 770 is configured to record the voiceprint information in correspondence with the identity information.
As an implementation manner of the embodiment of the present invention, the first determining module 720 may include:
the second recognition unit is used for recognizing the speaking image and determining the face pixel point of each speaker;
And the determining unit is used for determining the angle information of each speaker relative to the voice acquisition device as the azimuth information of the speaker based on the position of the face pixel point of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for shooting the speaking image and the position of the voice acquisition device.
As an implementation manner of the embodiment of the present invention, the above-mentioned speech recognition model is obtained by training in advance by a model training module, where the model training module may include:
the sample acquisition unit is used for acquiring the multi-user voice sample and the initial model;
The label determining unit is used for taking text information corresponding to each user included in each multi-user voice sample as a sample label;
the text prediction unit is used for inputting each multi-user voice sample into the initial model to obtain predicted text information;
And the parameter adjustment unit is used for adjusting the model parameters of the initial model based on the difference between the predictive text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
As an implementation manner of the embodiment of the present invention, as shown in fig. 9, the apparatus may further include:
And the generating module 780 is configured to generate a conference record based on the text information corresponding to each speaker.
Correspondingly, with the voice recognition method, the embodiment of the invention also provides a voice recognition system, and the voice recognition system provided by the embodiment of the invention is introduced below.
As shown in fig. 10, a voice recognition system includes a server 1004 and a terminal 1003 provided with an image acquisition device 1001 and a voice acquisition device 1002, wherein:
the image acquisition device 1001 is used for acquiring images in a conference;
the voice acquisition device 1002 is configured to acquire a voice signal in a conference;
the terminal 1003 is configured to send the image and the voice signal to the server 1004;
The server 1004 is configured to receive the image and the voice signal, and to perform the steps of the voice recognition method according to any one of the above embodiments.
In the solution provided in the embodiment of the present invention, the image acquisition device may capture images in a conference, the voice acquisition device may capture voice signals in the conference, and the terminal may send the images and voice signals to the server. The server may obtain speaking images of a plurality of speakers in the conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes voice signals generated by the plurality of speakers speaking simultaneously; identify the speaking image and determine azimuth information and lip movement information of each speaker; and input, for each speaker, the lip movement information, voiceprint information, azimuth information of the speaker and the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is trained based on multi-user voice samples, and the multi-user voice samples include lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously. Through this scheme, the server can input the speaking images, the voice signal and the voiceprint information of each speaker into the voice recognition model without first separating the voice signal by speaker, so the spectrum of each speaker's voice signal remains intact and the accuracy of voice recognition is improved.
The embodiment of the present invention also provides a server, as shown in fig. 11, including a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102 and the memory 1103 communicate with each other through the communication bus 1104.
A memory 1103 for storing a computer program;
The processor 1101 is configured to implement the steps of the voice recognition method according to any one of the above embodiments when executing the program stored in the memory 1103.
In the solution provided in the embodiment of the present invention, the server may obtain speaking images of a plurality of speakers in the conference, a voice signal, and voiceprint information of each speaker, where the voice signal includes voice signals generated by the plurality of speakers speaking simultaneously; identify the speaking image and determine azimuth information and lip movement information of each speaker; and input, for each speaker, the lip movement information, voiceprint information, azimuth information of the speaker and the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is trained based on multi-user voice samples, and the multi-user voice samples include lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously. Through this scheme, the server can input the obtained speaking images, the voice signal and the voiceprint information of each speaker into the voice recognition model without first separating the voice signal by speaker, so the spectrum of each speaker's voice signal remains intact and the accuracy of voice recognition is improved.
The communication bus mentioned for the above server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the server and other devices.
The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In a further embodiment of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the speech recognition method of any of the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for a speech recognition method, apparatus, system, server, computer-readable storage medium and computer program product, the description is relatively simple, as it is substantially similar to the method embodiments, and relevant places are referred to in the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (13)

1. A method of speech recognition, the method comprising:
acquiring speaking images of a plurality of speakers in a conference, voice signals and voiceprint information of each speaker, wherein the voice signals comprise voice signals generated by simultaneous speaking of the plurality of speakers, the voice signals are collected by a microphone array, and the microphone array comprises a plurality of array elements;
Identifying the speaking image and determining azimuth information and lip movement information of each speaker;
Inputting, for each speaker, lip movement information, voiceprint information, azimuth information of the speaker and the voice signal into a pre-trained voice recognition model to obtain text information corresponding to the speaker, wherein the voice recognition model is trained based on multi-user voice samples, the multi-user voice samples comprise lip movement information, voiceprint information and azimuth information of each user and a voice signal generated by multiple users speaking simultaneously, and the voice recognition model comprises: a residual layer, a first splicing layer, a convolution layer, a second splicing layer and a recognition layer;
the step of inputting the lip movement information, the voiceprint information, the azimuth information and the voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker comprises the following steps:
Inputting lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model so that the residual layer performs feature extraction on the lip movement information to obtain lip features, and inputting the lip features into the second splicing layer;
the first splicing layer splices the voice signal, the azimuth information and the voiceprint information, and inputs a spliced result to the convolution layer;
The convolution layer extracts voice characteristics corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the plurality of array elements, and inputs the voice characteristics into the second splicing layer;
The second splicing layer splices the voice features and the lip features and inputs the spliced features into the recognition layer;
and the recognition layer performs voice recognition based on the spliced features to obtain corresponding text information of the speaker, and outputs the text information.
2. The method of claim 1, wherein before the step of acquiring speaking images of a plurality of speakers in a conference, the voice signal and voiceprint information of each speaker, the method further comprises:
Acquiring a conference image in a conference, performing lip movement detection on the conference image, and determining a target speaker who is speaking;
Determining identity information of the target speaker based on a pre-established face library;
Acquiring a voice signal of the target speaker, and extracting voiceprint information of the voice signal;
and correspondingly recording the voiceprint information and the identity information.
3. The method of claim 1, wherein the step of identifying the speaking image and determining the azimuth information of each speaker comprises:
identifying the speaking image and determining the face pixel point of each speaker;
For each speaker, determining the angle information of the speaker relative to the voice acquisition device as the azimuth information of the speaker based on the position of the face pixel point of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for shooting the speaking image and the position of the voice acquisition device.
4. A method according to any one of claims 1-3, wherein the training mode of the speech recognition model comprises:
Acquiring the multi-user voice sample and an initial model;
Each multi-user voice sample comprises text information corresponding to each user as a sample label;
inputting each multi-user voice sample into the initial model to obtain predicted text information;
And adjusting model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
5. A method according to any one of claims 1-3, wherein the method further comprises:
and generating a conference record based on the text information corresponding to each speaker.
6. A speech recognition device, the device comprising:
The system comprises a first acquisition module, a second acquisition module and a first processing module, wherein the first acquisition module is used for acquiring speaking images of a plurality of speakers in a conference, voice signals and voiceprint information of each speaker, the voice signals comprise voice signals generated by the simultaneous speaking of the plurality of speakers, the voice signals are collected by a microphone array, and the microphone array comprises a plurality of array elements;
The first determining module is used for identifying the speaking image and determining azimuth information and lip movement information of each speaker;
the recognition module is configured to input, for each speaker, lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is obtained based on multi-user voice sample training, the multi-user voice sample includes lip movement information, voiceprint information, azimuth information of each user and a voice signal generated by simultaneous speaking of multiple users, and the voice recognition model includes: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
The identification module comprises:
The first recognition unit is used for inputting lip movement information, voiceprint information, azimuth information and the voice signal of the speaker into a pre-trained voice recognition model so that the voice recognition model extracts voice characteristics corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and phase characteristics among a plurality of array elements, and performs voice recognition by combining the voice characteristics with the lip movement information to obtain text information corresponding to the speaker;
The first recognition unit includes:
The first extraction subunit is used for extracting the characteristics of the lip movement information by the residual layer to obtain lip characteristics and inputting the lip characteristics into the second splicing layer;
the first splicing subunit is used for splicing the voice signal, the azimuth information and the voiceprint information by the first splicing layer and inputting the spliced result into the convolution layer;
The second extraction subunit is configured to extract, from the speech signal, a speech feature corresponding to the speaker based on the azimuth information, the voiceprint information, and phase characteristics among the plurality of array elements, and input the speech feature into the second splicing layer;
The second splicing subunit is used for splicing the voice features and the lip features by the second splicing layer and inputting the spliced features into the recognition layer;
and the recognition subunit is used for carrying out voice recognition on the basis of the spliced characteristics by the recognition layer to obtain corresponding text information of the speaker and outputting the text information.
7. The apparatus of claim 6, wherein the apparatus further comprises:
The second acquisition module is used for acquiring a conference image in a conference, carrying out lip movement detection on the conference image and determining a target speaker who is speaking;
The second determining module is used for determining the identity information of the target speaker based on a pre-established face database;
The third acquisition module is used for acquiring the voice signal of the target speaker and extracting voiceprint information of the voice signal;
And the recording module is used for correspondingly recording the voiceprint information and the identity information.
8. The apparatus of claim 6, wherein the first determining module comprises:
the second recognition unit is used for recognizing the speaking image and determining the face pixel point of each speaker;
And the determining unit is used for determining the angle information of each speaker relative to the voice acquisition device as the azimuth information of the speaker based on the position of the face pixel point of the speaker in the speaking image, the pre-calibrated parameters of the image acquisition device for shooting the speaking image and the position of the voice acquisition device.
9. The apparatus according to any one of claims 6-8, wherein the speech recognition model is pre-trained by a model training module comprising:
the sample acquisition unit is used for acquiring the multi-user voice sample and the initial model;
The label determining unit is used for taking text information corresponding to each user included in each multi-user voice sample as a sample label;
the text prediction unit is used for inputting each multi-user voice sample into the initial model to obtain predicted text information;
And the parameter adjustment unit is used for adjusting the model parameters of the initial model based on the difference between the predictive text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
10. The apparatus according to any one of claims 6-8, further comprising:
and the generation module is used for generating a conference record based on the text information corresponding to each speaker.
11. A speech recognition system, characterized in that the system comprises a server and a terminal, the terminal being provided with an image acquisition device and a speech acquisition device, wherein:
The image acquisition equipment is used for acquiring images in a conference;
the voice acquisition equipment is used for acquiring voice signals in a conference;
The terminal is used for sending the image and the voice signal to the server;
The server being adapted to receive the image and the speech signal and to perform the method steps of any of claims 1-5.
12. A server, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
A memory for storing a computer program;
A processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.
CN202111048642.5A 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium Active CN113611308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048642.5A CN113611308B (en) 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium


Publications (2)

Publication Number Publication Date
CN113611308A 2021-11-05
CN113611308B 2024-05-07

Family

ID=78342837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048642.5A Active CN113611308B (en) 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium

Country Status (1)

Country Link
CN (1) CN113611308B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050354B (en) * 2022-08-10 2022-11-04 北京百度网讯科技有限公司 Digital human driving method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009194857A (en) * 2008-02-18 2009-08-27 Sharp Corp Communication conference system, communication apparatus, communication conference method, and computer program
JP2011065467A (en) * 2009-09-17 2011-03-31 Sharp Corp Conference relay device and computer program
KR20190008137A (en) * 2017-07-13 2019-01-23 한국전자통신연구원 Apparatus for deep learning based text-to-speech synthesis using multi-speaker data and method for the same
CN110223686A (en) * 2019-05-31 2019-09-10 联想(北京)有限公司 Audio recognition method, speech recognition equipment and electronic equipment
CN110232925A (en) * 2019-06-28 2019-09-13 百度在线网络技术(北京)有限公司 Generate the method, apparatus and conference terminal of minutes
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110767226A (en) * 2019-10-30 2020-02-07 山西见声科技有限公司 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal
CN110808048A (en) * 2019-11-13 2020-02-18 联想(北京)有限公司 Voice processing method, device, system and storage medium
CN111833899A (en) * 2020-07-27 2020-10-27 腾讯科技(深圳)有限公司 Voice detection method based on multiple sound zones, related device and storage medium
CN113129893A (en) * 2019-12-30 2021-07-16 Oppo(重庆)智能科技有限公司 Voice recognition method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lip-movement visual speech feature extraction based on SVD; Zhang Jianming et al.; Journal of Jiangsu University (Natural Science Edition); 2004-09-30 (No. 05); full text *

Also Published As

Publication number Publication date
CN113611308A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN106657865B (en) Conference summary generation method and device and video conference system
US8204759B2 (en) Social analysis in multi-participant meetings
US9064160B2 (en) Meeting room participant recogniser
US11019306B2 (en) Combining installed audio-visual sensors with ad-hoc mobile audio-visual sensors for smart meeting rooms
WO2019184650A1 (en) Subtitle generation method and terminal
CN106782545A (en) A kind of system and method that audio, video data is changed into writing record
CN107527623B (en) Screen transmission method and device, electronic equipment and computer readable storage medium
US20140340467A1 (en) Method and System for Facial Recognition for a Videoconference
CN112653902B (en) Speaker recognition method and device and electronic equipment
US20140118472A1 (en) Active Speaker Indicator for Conference Participants
US9165182B2 (en) Method and apparatus for using face detection information to improve speaker segmentation
CN113611308B (en) Voice recognition method, device, system, server and storage medium
US20220335949A1 (en) Conference Data Processing Method and Related Device
EP2503545A1 (en) Arrangement and method relating to audio recognition
WO2022160749A1 (en) Role separation method for speech processing device, and speech processing device
CN114240342A (en) Conference control method and device
US11184184B2 (en) Computer system, method for assisting in web conference speech, and program
CN113643708B (en) Method and device for identifying ginseng voiceprint, electronic equipment and storage medium
CN112634879B (en) Voice conference management method, device, equipment and medium
CN111798872B (en) Processing method and device for online interaction platform and electronic equipment
CN211788155U (en) Intelligent conference recording system
CN110996036B (en) Remote online conference management system based on AI intelligent technology
US8203593B2 (en) Audio visual tracking with established environmental regions
CN113542604A (en) Video focusing method and device
CN117392995A (en) Multi-mode-based speaker separation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant