CN113611308A - Voice recognition method, device, system, server and storage medium - Google Patents


Info

Publication number
CN113611308A
Authority
CN
China
Prior art keywords
voice
information
speaker
image
speaking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111048642.5A
Other languages
Chinese (zh)
Inventor
齐昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202111048642.5A priority Critical patent/CN113611308A/en
Publication of CN113611308A publication Critical patent/CN113611308A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The embodiment of the invention provides a voice recognition method, device, system, server and storage medium. The method comprises: acquiring speaking images, voice signals and voiceprint information of a plurality of speakers in a conference, wherein the voice signals comprise a voice signal generated by the plurality of speakers speaking at the same time; recognizing the speaking images and determining direction information and lip movement information of each speaker; and, for each speaker, inputting the lip movement information, the voiceprint information and the direction information of that speaker, together with the voice signal, into a pre-trained voice recognition model to obtain text information corresponding to that speaker. The voice recognition model is trained on multi-user voice samples, each of which comprises the lip movement information, the voiceprint information and the direction information of each user and a voice signal generated by the multiple users speaking at the same time. Because the voice signal does not need to be separated, its completeness is preserved and the accuracy of voice recognition is improved.

Description

Voice recognition method, device, system, server and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, system, server, and storage medium.
Background
At present, video conferencing has become a common way of communicating in people's work and life. In order to record the contents of a conference, the utterances of each participant need to be collected and recognized to obtain corresponding text information. In a conference, however, it is inevitable that several users speak at the same time, and in this case what each of those users says has to be recognized separately.
In the current voice recognition approach, after the voice signal generated by several users speaking simultaneously is acquired, voice separation is performed on it to obtain the voice information corresponding to each user, and voice recognition is then performed on each user's voice information to obtain the content spoken by that user.
Because the spectrum of the voice signal is damaged during voice separation, the accuracy of voice recognition is low.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a speech recognition method, apparatus, system, server and storage medium, so as to improve the accuracy of speech recognition. The specific technical solutions are as follows:
in a first aspect, an embodiment of the present invention provides a speech recognition method, where the method includes:
acquiring speaking images, voice signals and voiceprint information of each of a plurality of speakers in a conference, wherein the voice signals comprise voice signals generated by speaking of the plurality of speakers at the same time;
recognizing the speaking image, and determining the direction information and lip movement information of each speaker;
and inputting the lip movement information, the voiceprint information, the azimuth information and the voice signals of each speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker, wherein the voice recognition model is obtained based on multi-user voice sample training, and the multi-user voice sample comprises the lip movement information, the voiceprint information, the azimuth information of each user and the voice signals generated by multi-user simultaneous speaking.
Optionally, the speech signal is a speech signal collected by a microphone array, where the microphone array includes a plurality of array elements;
the step of inputting the lip movement information, the voiceprint information, the orientation information and the voice signal of the speaker into a pre-trained voice recognition model to obtain the text information corresponding to the speaker comprises the following steps:
inputting the lip movement information, the voiceprint information, the orientation information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the orientation information, the voiceprint information and the phase characteristics among the array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker.
Optionally, the speech recognition model includes: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
the step of the voice recognition model extracting the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and performing voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker includes:
the residual error layer performs feature extraction on the lip movement information to obtain lip features, and the lip features are input into the second splicing layer;
the first splicing layer splices the voice signal, the azimuth information and the voiceprint information and inputs a spliced result to the convolutional layer;
the convolutional layer extracts a voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and inputs the voice feature into the second splicing layer;
the second splicing layer splices the voice features and the lip features and inputs the spliced features into the recognition layer;
and the recognition layer performs voice recognition based on the spliced features to obtain corresponding text information of the speaker and outputs the text information.
Optionally, before the step of acquiring the images of the multiple speakers, the voice signals, and the voiceprint information of each speaker, the method further includes:
acquiring a conference image in a conference, carrying out lip movement detection on the conference image, and determining a target speaker who is speaking;
determining the identity information of the target speaker based on a pre-established face library;
acquiring a voice signal of the target speaker, and extracting voiceprint information of the voice signal;
and correspondingly recording the voiceprint information and the identity information.
Optionally, the step of recognizing the speaking image and determining the direction information of each speaker includes:
identifying the speaking image and determining facial pixel points of each speaker;
and for each speaker, determining angle information of the speaker relative to the voice acquisition equipment as azimuth information of the speaker based on the position of the facial pixel point of the speaker in the speech image, the pre-calibrated parameter of the image acquisition equipment for shooting the speech image and the position of the voice acquisition equipment.
Optionally, the training mode of the speech recognition model includes:
acquiring the multi-user voice sample and an initial model;
taking the text information corresponding to each user, included in each multi-user voice sample, as a sample label;
inputting each multi-user voice sample into the initial model to obtain predicted text information;
and adjusting the model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
Optionally, the method further includes:
and generating a conference record based on the text information corresponding to each speaker.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, where the apparatus includes:
the conference processing device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring speaking images, voice signals and voiceprint information of each of a plurality of speakers in a conference, and the voice signals comprise voice signals generated by the plurality of speakers speaking at the same time;
the first determining module is used for identifying the speaking image and determining the direction information and the lip movement information of each speaker;
and the recognition module is used for inputting the lip movement information, the voiceprint information and the direction information of each speaker and the voice signals into a pre-trained voice recognition model to obtain text information corresponding to the speaker, wherein the voice recognition model is obtained based on the training of a multi-user voice sample, and the multi-user voice sample comprises the lip movement information, the voiceprint information and the direction information of each user and the voice signals generated by the simultaneous speaking of multiple users.
Optionally, the speech signal is a speech signal collected by a microphone array, where the microphone array includes a plurality of array elements;
the identification module comprises:
and the first recognition unit is used for inputting the lip movement information, the voiceprint information, the azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker.
Optionally, the speech recognition model includes: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
the first recognition unit includes:
the first extraction subunit is used for performing feature extraction on the lip movement information by the residual error layer to obtain lip features, and inputting the lip features into the second splicing layer;
the first splicing subunit is used for splicing the voice signal, the azimuth information and the voiceprint information by the first splicing layer and inputting a spliced result to the convolutional layer;
a second extraction subunit, configured to, by the convolutional layer, extract a speech feature corresponding to the speaker from the speech signal based on the azimuth information, the voiceprint information, and phase characteristics among the plurality of array elements, and input the speech feature to the second concatenation layer;
the second splicing subunit is used for splicing the voice feature and the lip feature by the second splicing layer and inputting the spliced feature into the recognition layer;
and the recognition subunit is used for performing voice recognition on the recognition layer based on the spliced features to obtain corresponding text information of the speaker and outputting the text information.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring a conference image in a conference, performing lip movement detection on the conference image and determining a target speaker who is speaking;
the second determining module is used for determining the identity information of the target speaker based on a pre-established face library;
the third acquisition module is used for acquiring the voice signal of the target speaker and extracting the voiceprint information of the voice signal;
and the recording module is used for correspondingly recording the voiceprint information and the identity information.
Optionally, the first determining module includes:
the second identification unit is used for identifying the speaking image and determining facial pixel points of each speaker;
and the determining unit is used for determining the angle information of the speaker relative to the voice collecting equipment as the direction information of the speaker according to the position of the facial pixel point of the speaker in the speaking image, the pre-calibrated parameter of the image collecting equipment for shooting the speaking image and the position of the voice collecting equipment.
Optionally, the speech recognition model is obtained by pre-training through a model training module, where the model training module includes:
the sample acquisition unit is used for acquiring the multi-user voice sample and an initial model;
the label determining unit is used for taking the text information corresponding to each user in each multi-user voice sample as a sample label;
the text prediction unit is used for inputting each multi-user voice sample into the initial model to obtain predicted text information;
and the parameter adjusting unit is used for adjusting the model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
Optionally, the apparatus further comprises:
and the generating module is used for generating a conference record based on the text information corresponding to each speaker.
In a third aspect, an embodiment of the present invention provides a speech recognition system, where the system includes a server and a terminal, and the terminal is provided with an image acquisition device and a speech acquisition device, where:
the image acquisition equipment is used for acquiring images in a conference;
the voice acquisition equipment is used for acquiring voice signals in a conference;
the terminal is used for sending the image and the voice signal to the server;
the server is configured to receive the image and the voice signal, and perform the method steps of any of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a server, including a processor, a communication interface, a memory, and a communication bus, where the processor and the communication interface complete communication between the memory and the processor through the communication bus;
a memory for storing a computer program;
a processor adapted to perform the method steps of any of the above first aspects when executing a program stored in the memory.
In a fifth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the above first aspects.
The embodiment of the invention has the following beneficial effects:
in the scheme provided by the embodiment of the invention, a server can acquire speaking images, voice signals and voiceprint information of each of a plurality of speakers in a conference, the voice signals comprising a voice signal generated by the plurality of speakers speaking at the same time; recognize the speaking images and determine the direction information and lip movement information of each speaker; and, for each speaker, input the lip movement information, the voiceprint information and the direction information of that speaker, together with the voice signal, into a pre-trained voice recognition model to obtain text information corresponding to that speaker. The voice recognition model is trained on multi-user voice samples, each of which comprises the lip movement information, the voiceprint information and the direction information of each user and a voice signal generated by the multiple users speaking at the same time. With this scheme, the server can input the acquired speaking images and voice signals of the multiple speakers, together with the voiceprint information of each speaker, into the voice recognition model without separating the voice signal of the multiple speakers according to the different speakers, so the completeness of the spectrum of each speaker's voice signal is guaranteed and the accuracy of voice recognition is improved. Of course, not all of the advantages described above need to be achieved at the same time when practising any one product or method of the invention.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an implementation scenario in which a speech recognition method according to an embodiment of the present invention is applied;
FIG. 2 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a flow chart of recognition by the speech recognition model according to the embodiment of the present invention;
FIG. 4 is a flow chart of another speech recognition method according to an embodiment of the present invention;
FIG. 5 is a specific flowchart based on step S202 in the embodiment shown in FIG. 2;
FIG. 6 is a flow chart of speech recognition model training provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of another speech recognition apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another speech recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention are within the scope of the present invention.
In order to improve the accuracy of speech recognition, embodiments of the present invention provide a speech recognition method, apparatus, system, server, computer-readable storage medium, and computer program product. For convenience of understanding the speech recognition method provided in the embodiment of the present invention, an implementation scenario in which the speech recognition method provided in the embodiment of the present invention can be applied is first described below.
Fig. 1 is a schematic diagram of an implementation scenario in which a speech recognition method according to an embodiment of the present invention is applied. A plurality of conference participants join the video conference, the plurality of conference participants may include conference participant 1, conference participant 2, conference participant 3, conference participant 4, conference participant 5, conference participant 6, and conference participant 7, and the server 130 is in communication connection with the terminal 140 for data transmission. The terminal 140 may be an electronic device with a display screen, for example, a conference tablet, a touch all-in-one machine, and the like, the terminal 140 may further be provided with a voice collecting device 110 and an image collecting device 120, the voice collecting device 110 is configured to collect a voice signal sent by a participant when speaking in the conference process, the image collecting device 120 is configured to collect an image of the participant in the conference process, and the display screen may show conference related information.
The voice collecting device 110 may be a microphone array, and the microphone array may be: a linear array, a triangular array, a T-shaped array, or a uniform circular array, etc., and fig. 1 illustrates a linear array. The image capturing device 120 may be a camera or other device capable of capturing an image, and is not particularly limited herein.
After the conference ends, the terminal 140 may send the voice signals acquired by the voice acquisition device 110 and the images acquired by the image acquisition device 120 during the conference to the server 130, so that the server 130 obtains a conference video containing the speaking images of the multiple speakers and the voice signals of the conference. During a conference it may happen that multiple speakers speak at the same time, and the current way of recognizing such simultaneous speech is not accurate enough. For this situation, the server 130 may recognize the voice signal generated by the multiple speakers speaking at the same time by using the voice recognition method provided in the embodiment of the present invention, which is described below.
As shown in fig. 2, a speech recognition method, the method comprising:
s201, obtaining speaking images and voice signals of a plurality of speakers and voiceprint information of each speaker in a conference;
wherein the speech signal comprises a speech signal resulting from the plurality of speakers speaking simultaneously.
S202, recognizing the speaking image, and determining the direction information and lip movement information of each speaker;
and S203, inputting the lip movement information, the voiceprint information, the azimuth information and the voice signal of each speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker.
The voice recognition model is obtained based on training of a multi-user voice sample, and the multi-user voice sample comprises lip movement information, voiceprint information and azimuth information of each user and voice signals generated by simultaneous speaking of multiple users.
It can be seen that, in the scheme provided in the embodiment of the present invention, a server may obtain speaking images, voice signals and voiceprint information of each of multiple speakers in a conference, where the voice signals include a voice signal generated by the multiple speakers speaking at the same time; recognize the speaking images and determine the direction information and lip movement information of each speaker; and, for each speaker, input the lip movement information, voiceprint information and direction information of that speaker, together with the voice signal, into a pre-trained voice recognition model to obtain the text information corresponding to that speaker. The voice recognition model is trained on multi-user voice samples, each of which includes the lip movement information, voiceprint information and direction information of each user and a voice signal generated by the multiple users speaking at the same time. With this scheme, the server can input the acquired speaking images and voice signals of the multiple speakers, together with the voiceprint information of each speaker, into the voice recognition model without separating the voice signal of the multiple speakers according to the different speakers, so the completeness of the spectrum of each speaker's voice signal is guaranteed and the accuracy of voice recognition is improved.
The conference video sent by the terminal to the server may include videos corresponding to multiple speaking situations in the conference, where the speaking situations may include: one speaker speaks or multiple speakers speak simultaneously.
For a case where a plurality of speakers speak at the same time, the server may acquire, from the conference video, utterance images, voice signals, and voiceprint information of each of the speakers. The speaking images of the multiple speakers may be multi-frame images capable of representing lip movements of the speakers, which are collected by the image collecting device in the conference video, and may include images of all participants, or may be images of each participant, which is not specifically limited herein.
In an embodiment, the server may identify a conference image in the conference video, determine lip image features of speakers in the conference image, determine the number of speakers at the current time according to motion information of the lip image features of the speakers, and when the number of speakers is multiple, the server may use the conference image as a speech image of the multiple speakers, and acquire a voice signal acquired at a time corresponding to the speech image and voiceprint information of each speaker.
If the number of the speakers is one, the server can acquire the voice signal acquired at the moment corresponding to the speaking image, the voice signal is the voice signal sent by the speaker, and then the server can perform voice recognition on the voice signal by adopting a voice recognition algorithm so as to acquire corresponding text information.
The voice signals of the multiple speakers are the voice signals collected by the voice collecting device during the time period in which the multiple speakers speak simultaneously in the conference video; they form a single voice signal in which the voice signals uttered by the multiple speakers are mixed together.
For example, while recognizing the conference images, the server extracts the lip image features of the speakers in conference image 1 and determines from them that speaker A and speaker B are speaking at the same time at the time point corresponding to conference image 1. The server continues to recognize the conference images in sequence until conference image 20, where it determines from the lip image features that only speaker A is speaking. The time period from the time point of conference image 1 to the time point of conference image 20 is therefore a period in which speakers A and B speak simultaneously, and the voice signal corresponding to this period is the voice signal generated by A and B speaking at the same time. This voice signal can be recognized using the method provided by the embodiment of the present invention.
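As a rough illustration of this per-frame bookkeeping, the following Python sketch locates runs of frames in which more than one speaker's lips are moving. The helper `detect_moving_lips` and the frame representation are assumptions made for the sketch, not details given by the patent.

```python
# Minimal sketch (not from the patent text) of how the interval in which several
# speakers talk at once could be located from per-frame lip-movement results.
# `detect_moving_lips(frame)` is a hypothetical helper returning the set of
# speaker IDs whose lips are judged to be moving in that conference image.

def simultaneous_speech_intervals(frames, detect_moving_lips):
    """Yield (start_idx, end_idx, speaker_ids) for runs of frames with >1 active speaker."""
    start, active = None, frozenset()
    for idx, frame in enumerate(frames):
        speakers = frozenset(detect_moving_lips(frame))
        if len(speakers) > 1 and start is None:
            start, active = idx, speakers          # interval of overlapping speech begins
        elif len(speakers) <= 1 and start is not None:
            yield start, idx, active               # interval ends before this frame
            start, active = None, frozenset()
        elif start is not None:
            active = active | speakers             # keep collecting everyone seen in the run
    if start is not None:
        yield start, len(frames), active
```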
The voiceprint information is information capable of representing voice spectrum characteristics of the speakers, and in order to obtain the voiceprint information conveniently, the server can obtain and store the voiceprint information when each speaker independently speaks for the first time in a conference process, and further determine identity information of the speakers based on a pre-established face library and a speaking image and obtain the voiceprint information of the speakers based on the identity information of the speakers when voice signals generated by the simultaneous speaking of a plurality of speakers need to be identified.
After acquiring the speech images, the voice signals, and the voiceprint information of each of the multiple speakers in the conference, the server may perform step S202, i.e., recognizing the speech images, and determining the orientation information and the lip movement information of each of the speakers.
The server may recognize the speaking image and extract the lip image features of each speaker. For each speaker, it is reasonable either to take any one of the lip image feature points as the speaker's position in the speaking image, or to calculate the average of the speaker's lip image features and take the point corresponding to that average as the position of the speaker in the speaking image.
After the position of the speaker in the speech image is determined, the server can determine the actual position information of the speaker in the conference scene according to the external parameters and the internal parameters of the image acquisition equipment which are calibrated in advance, and then according to the position of the voice acquisition equipment, the relative position relationship between the speaker and the voice acquisition equipment can be calculated, so that the direction information of the speaker can be determined.
In one embodiment, the image capturing device is a camera and the voice capturing device is a microphone array. A coordinate system is established with the position of the microphone array in the conference scene as the origin, the X axis and the Y axis forming the horizontal plane. The server can extract the lip image features of each speaker in speaking image 1, take lip image feature a as the position of speaker A in that frame, calculate the three-dimensional coordinates (x, y, z) of that position in the coordinate system according to the internal and external parameters of the camera, and then calculate the angle corresponding to arctan(x/y) to obtain the direction information of speaker A.
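The following is a minimal sketch of that angle computation, assuming the lip feature position has already been mapped, via the pre-calibrated camera parameters, to coordinates (x, y, z) whose origin is the microphone array; the use of `atan2` and the example distances are choices made for the sketch only.

```python
import math

# Illustrative sketch of the angle computation described above, assuming the
# speaker's lip feature has already been converted (via the camera's calibrated
# intrinsic and extrinsic parameters) to 3-D coordinates (x, y, z) in a frame
# whose origin is the microphone array and whose X/Y axes span the horizontal plane.

def speaker_azimuth_deg(x: float, y: float) -> float:
    """Horizontal direction of the speaker relative to the microphone array."""
    # atan2 is used instead of a bare arctan(x / y) so that y == 0 and
    # rear-half-plane positions are handled without special cases.
    return math.degrees(math.atan2(x, y))

# Example: a speaker 1 m to the right of and 2 m in front of the array.
print(speaker_azimuth_deg(1.0, 2.0))   # ~26.6 degrees
```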
The server can identify a plurality of frames of speaking images in which a plurality of speakers speak simultaneously, extract lip image features of each speaker from the images, and take the change information of the lip image features of the speaker in the plurality of frames of speaking images as the lip movement information of the speaker.
As an embodiment, since the server may need to determine the lip image feature of each speaker in the utterance image when determining the number of speakers currently speaking simultaneously, in this case, the server may use the lip image feature determined when determining the number of speakers currently speaking simultaneously as the lip movement information of the corresponding speaker without recognizing the utterance image again.
Next, the server may execute step S203, that is, for each speaker, inputting the lip movement information, the voiceprint information, the orientation information, and the voice signal of the speaker into the pre-trained voice recognition model, so as to obtain the text information corresponding to the speaker.
The server inputs the lip movement information, the voiceprint information, the direction information and the voice signals generated by the simultaneous speaking of the plurality of speakers into a pre-trained voice recognition model together, and then obtains text information corresponding to the speakers, instead of separating the voice signals generated by the simultaneous speaking of the plurality of speakers into a plurality of paths of voice signals.
The voice recognition model is obtained by training based on a multi-user voice sample in advance, the multi-user voice sample can comprise lip movement information, voiceprint information, direction information and voice signals generated by a plurality of users speaking at the same time, namely the voice recognition model is trained based on the lip movement information, the voiceprint information, the direction information and the voice signals generated by the users speaking at the same time.
In the training process, the voice signal generated by the multiple users speaking at the same time is formed by mixing the voice signals uttered by those users, and it is not separated according to the different users. The voice recognition model can therefore learn the correspondence between each user's lip movement information, voiceprint information and direction information, the voice signal generated by the multiple users speaking at the same time, and the text information corresponding to that user. During use of the voice recognition model, it can then process the input lip movement information, voiceprint information and direction information of each speaker together with the voice signal generated by the multiple speakers speaking at the same time, and obtain the text information corresponding to that speaker.
For the case where multiple speakers speak at the same time, the server can perform voice recognition speaker by speaker: it traverses the speakers who are speaking simultaneously and, for each traversed speaker, inputs the lip movement information, voiceprint information, direction information and the voice signal corresponding to that speaker into the voice recognition model. The text information corresponding to each speaker is thus obtained in turn, completing the voice recognition of the simultaneously speaking speakers.
For example, the server determines that speaker A, speaker B and speaker C speak simultaneously from 2 minutes 5 seconds to 5 minutes 10 seconds. The server may then acquire the lip movement information, voiceprint information and direction information of speaker A, speaker B and speaker C respectively, together with the voice signal a generated by the three of them speaking at the same time, and then traverse each speaker.
Specifically, when traversing speaker A, the server may input the lip movement information, voiceprint information and direction information of speaker A, together with the voice signal a, into the voice recognition model to obtain the text information corresponding to speaker A. When traversing speaker B, it inputs the lip movement information, voiceprint information and direction information of speaker B, together with the voice signal a, into the voice recognition model to obtain the text information corresponding to speaker B. When traversing speaker C, it inputs the lip movement information, voiceprint information and direction information of speaker C, together with the voice signal a, into the voice recognition model to obtain the text information corresponding to speaker C.
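A compact sketch of this traversal is given below; the speaker records, the `speech_model` callable and its keyword arguments are placeholders for illustration, not interfaces defined by the patent.

```python
# Sketch of the per-speaker traversal described above. `speech_model`, the
# speaker records and `mixed_signal` are placeholders, not APIs defined by the patent.

def recognize_all(speech_model, speakers, mixed_signal):
    """Run the (shared) mixed signal through the model once per simultaneous speaker."""
    transcripts = {}
    for spk in speakers:  # e.g. speakers A, B and C speaking during the same interval
        transcripts[spk["id"]] = speech_model(
            lip_movement=spk["lip_movement"],
            voiceprint=spk["voiceprint"],
            direction=spk["direction"],
            speech=mixed_signal,          # the unseparated multi-speaker signal
        )
    return transcripts
```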
Since the voice recognition model is trained based on the lip movement information, voiceprint information and direction information of each user and a voice signal generated by multiple users speaking at the same time, and since that voice signal is not separated according to the different users during training, the server can likewise, when using the model, input the lip movement information, voiceprint information and direction information of a speaker together with the voice signal generated by the multiple speakers speaking at the same time into the pre-trained voice recognition model and obtain the recognized text information without separating that voice signal according to the different speakers. The spectrum of each speaker's voice signal therefore remains complete, and the accuracy of voice recognition is improved.
As an implementation manner of the embodiment of the present invention, the voice signal may be a voice signal collected by a microphone array, where the microphone array includes a plurality of array elements. Because the array elements are located at different positions in the microphone array, the voice signals of the multiple speakers arrive at the elements with different time delays; in other words, the phase characteristics of the voice-signal waveform received by each array element are different. The voice features of different speakers can therefore be identified accurately from these phase characteristics without separating the voice signal generated by the multiple speakers speaking at the same time.
In this case, the step of inputting the lip movement information, the voiceprint information, the orientation information, and the speech signal of the speaker into the pre-trained speech recognition model to obtain the text information corresponding to the speaker may include:
inputting the lip movement information, the voiceprint information, the orientation information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the orientation information, the voiceprint information and the phase characteristics among the array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker.
For each speaker, the server may input lip movement information, voiceprint information, orientation information, and a speech signal of the speaker into a pre-trained speech recognition model, and since the phase characteristics of the waveform of the speech signal received by each array element of the microphone array are different, the speech recognition model may extract a speech feature corresponding to the speaker from the speech signal based on the orientation information, the voiceprint information, and the phase characteristics between the plurality of array elements. The lip movement information can represent the characteristics of the lip image of the speaker during speaking, and the voice characteristics are combined with the lip movement characteristics for voice recognition, so that the accuracy of voice recognition can be improved when multiple speakers speak simultaneously, and the text information corresponding to the speakers can be obtained.
As can be seen, in this embodiment, the server may input the lip movement information, the voiceprint information, the orientation information, and the voice signal of each speaker into the pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the phase characteristics between the orientation information, the voiceprint information, and the multiple array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker. Because a plurality of array elements in the microphone array receive the voice signals at the same time, different phase characteristics can be generated, and the voice recognition model can accurately recognize the voice characteristics of different speakers by using the phase characteristics under the condition of not separating the voice signals generated by speaking a plurality of speakers at the same time. Therefore, the complete frequency spectrum of the voice signal of each speaker is ensured, and the accuracy of voice recognition is improved.
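As a toy illustration of the inter-element delays (and hence phase differences) referred to above, the following sketch computes the far-field arrival delays of a source at a given azimuth for a uniform linear array. The array spacing, the evaluation frequency and the far-field assumption are choices made for the example only, not values from the patent.

```python
import numpy as np

# Toy illustration (not from the patent) of the inter-element delays mentioned above:
# for a far-field source at azimuth theta, a uniform linear array with spacing d
# receives the wavefront at each element with a different delay, which shows up as a
# frequency-dependent phase shift between the element signals.

def element_delays(num_elements: int, spacing_m: float, azimuth_deg: float,
                   speed_of_sound: float = 343.0) -> np.ndarray:
    """Per-element arrival delays (seconds) relative to element 0, far-field assumption."""
    theta = np.deg2rad(azimuth_deg)
    positions = np.arange(num_elements) * spacing_m
    return positions * np.sin(theta) / speed_of_sound

delays = element_delays(num_elements=4, spacing_m=0.05, azimuth_deg=30.0)
phase_shift = 2 * np.pi * 1000.0 * delays   # phase differences at 1 kHz, in radians
```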
As an implementation manner of the embodiment of the present invention, as shown in fig. 3, the speech recognition model may include: a residual layer 350, a first splice layer 340, a convolutional layer 330, a second splice layer 320, and an identification layer 310.
Correspondingly, the step of extracting the speech feature corresponding to the speaker from the speech signal by the speech recognition model based on the azimuth information, the voiceprint information, and the phase characteristics among the array elements, and performing speech recognition by combining the speech feature with the lip movement information to obtain the text information corresponding to the speaker may include:
The residual layer 350 performs feature extraction on the lip movement information 304 to obtain lip features and inputs the lip features into the second splicing layer 320. The first splicing layer 340 splices the voice signal 301, the azimuth information 303 and the voiceprint information 302, and inputs the spliced result into the convolutional layer 330. The convolutional layer 330 extracts the voice feature corresponding to the speaker from the voice signal 301 based on the azimuth information 303, the voiceprint information 302 and the phase characteristics among the array elements, and inputs the voice feature into the second splicing layer 320. The second splicing layer 320 splices the voice feature and the lip feature and inputs the spliced feature into the recognition layer 310. The recognition layer 310 performs voice recognition based on the spliced feature, obtains the text information corresponding to the speaker, and outputs it.
The convolutional layer may employ a convolutional neural network (CNN), the residual layer may employ a residual network, and the recognition layer may employ an end-to-end automatic speech recognition (ASR) module; none of these is specifically limited herein.
The lip feature can represent the lip image characteristics of the speaker while speaking. The spliced result output by the first splicing layer is the concatenation of the voice signal uttered by the multiple speakers speaking at the same time, the azimuth information of the speaker and the voiceprint information of the speaker, and it can therefore carry the azimuth characteristics of the speaker, the spectral characteristics of the speaker's voice and the characteristics of the mixed voice signal produced by the multiple users speaking together.
Because the voice signals of multiple speakers speaking simultaneously are collected by the microphone array, and the positions of the array elements in the microphone array are different, there is a time delay when the voice signals of the multiple speakers are received at the same time, that is, the phase characteristics of the waveform of the voice signal received by each array element are different, so that the convolutional layer 330 can extract the voice feature corresponding to the speaker from the voice signal 301 based on the azimuth information 303, the voiceprint information 302 and the phase characteristics among the multiple array elements, and the voice feature and the lip feature are spliced in the second splicing layer 320.
At this time, the voice feature and the lip feature are both features corresponding to the speaker, the features of the speaker are represented respectively from two dimensions of the voice feature and the image feature, and then the voice feature and the lip feature of the speaker are spliced and input to the recognition layer 310, so that the recognition layer 310 can accurately recognize and obtain corresponding text information of the speaker based on the fusion feature of the two dimensions of the voice feature and the image feature, and output the text information.
It can be seen that, in this embodiment, within the speech recognition model the residual layer performs feature extraction on the lip movement information to obtain lip features and inputs them into the second splicing layer; the first splicing layer splices the voice signal, the azimuth information and the voiceprint information and inputs the spliced result into the convolutional layer; the convolutional layer extracts the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements and inputs it into the second splicing layer; the second splicing layer splices the voice feature with the lip feature and inputs the spliced feature into the recognition layer; and the recognition layer performs voice recognition based on the spliced feature, obtains the text information corresponding to the speaker and outputs it. A voice recognition model with this structure can perform voice recognition accurately, ensuring the accuracy of the obtained text information.
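For concreteness, the following is a minimal, illustrative PyTorch sketch of this layer arrangement (residual layer, first splicing layer, convolutional layer, second splicing layer, recognition layer). All tensor shapes, channel sizes, kernel sizes and the vocabulary size are assumptions made for the sketch; the patent does not specify them.

```python
import torch
import torch.nn as nn

# A minimal, illustrative PyTorch sketch of the layer arrangement in Fig. 3
# (residual layer -> first splicing layer -> convolutional layer -> second splicing
# layer -> recognition layer). All shapes and hyperparameters are assumptions.

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))   # skip connection

class MultiSpeakerASR(nn.Module):
    def __init__(self, lip_dim=128, audio_dim=257, cond_dim=256 + 2, hidden=256, vocab=4000):
        super().__init__()
        self.lip_proj = nn.Conv1d(lip_dim, hidden, kernel_size=1)
        self.residual = ResidualBlock(hidden)                        # "residual layer"
        self.speech_conv = nn.Sequential(                            # "convolutional layer"
            nn.Conv1d(audio_dim + cond_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.recognition = nn.Sequential(                            # "recognition layer"
            nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, vocab, kernel_size=1),                 # per-frame token logits
        )

    def forward(self, speech, direction, voiceprint, lip_movement):
        # speech:       (batch, audio_dim, T)   spectral features of the mixed signal
        # direction:    (batch, 2)              e.g. sin/cos of the speaker's azimuth
        # voiceprint:   (batch, 256)            speaker embedding
        # lip_movement: (batch, lip_dim, T)     per-frame lip features of the same speaker
        T = speech.shape[-1]
        cond = torch.cat([direction, voiceprint], dim=1).unsqueeze(-1).expand(-1, -1, T)
        spliced_1 = torch.cat([speech, cond], dim=1)                 # first splicing layer
        speech_feat = self.speech_conv(spliced_1)
        lip_feat = self.residual(self.lip_proj(lip_movement))
        spliced_2 = torch.cat([speech_feat, lip_feat], dim=1)        # second splicing layer
        return self.recognition(spliced_2)                           # text-token logits
```

In this sketch the per-speaker direction and voiceprint are simply broadcast along the time axis before the first concatenation, which is one straightforward way to splice per-utterance conditioning information with a time-varying signal.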
As an implementation manner of the embodiment of the present invention, as shown in fig. 4, before the step of acquiring the images and the voice signals of the multiple speakers and the voiceprint information of each speaker, the method may further include:
s401, acquiring a conference image in a conference, performing lip movement detection on the conference image, and determining a target speaker who is speaking;
the server may obtain a conference image in the conference, where the conference image may be an image in the conference video, and the conference image corresponds to a certain time point in the conference video, for example, the conference image a corresponds to a conference picture in the conference video at 1 minute and 13 seconds. For voice recognition, the server performs lip movement detection on the conference image, and may determine a target speaker that is speaking, where the target speaker may be one or more speakers, and is not particularly limited herein.
S402, determining the identity information of the target speaker based on a pre-established face library;
after determining the target speaker who is speaking, the server may determine the identity information of the target speaker based on a pre-established face library and a face image of the speaker. In order to determine the identity information of the target speaker, a face library may be established in advance, and the face library may store face model information and corresponding identity information of each person, which are acquired in advance, for example, a correspondence between a face feature and a name.
Before the conference begins, the terminal can obtain the list of the conference participants, the list comprises the identity information of the conference participants, the terminal can extract the face features of the conference participants from the face library according to the identity information in the list of the conference participants, and the face features of the conference participants are recorded, so that the registration of the conference participants is completed. The terminal can correspondingly store the face characteristics of the participants and the identity information of the participants in the local terminal, or correspondingly record the face characteristics and the identity information of the participants and send the face characteristics and the identity information to the server, which is reasonable.
S403, acquiring the voice signal of the target speaker, and extracting voiceprint information of the voice signal;
in one embodiment, when the target speaker speaks alone for the first time, the server may directly acquire the voice signal of the target speaker collected by the voice collection device in the time period when the target speaker speaks alone, and extract the voiceprint information of the voice signal.
In another embodiment, if the target speaker speaks for the first time, the target speaker is the situation that multiple persons including the target speaker speak simultaneously, then the server may acquire the voice signals of the multiple speakers acquired by the voice acquisition device in a time period in which the multiple persons including the target speaker speak simultaneously, extract the voice signal of the target speaker from the voice signals of the multiple speakers according to the lip movement information and the direction information of the target speaker, and extract the voiceprint information of the voice signal.
In the two embodiments above, the voice collecting device may be a microphone array, and beamforming may be applied to the voice signals collected by the microphone array; that is, the output of each array element is given a delay or phase compensation and amplitude weighting so as to form a beam pointing in a specific direction. In this way the server can obtain a more accurate voice signal of the target speaker, so that the extracted voiceprint information is more accurate.
The extracting of the voiceprint information from the voice signal may employ technologies such as Time Delay Neural Network (TDNN) and Probabilistic Linear Discriminant Analysis (PLDA), and the beamforming may employ Minimum Variance Distortionless Response (MVDR), which is not specifically limited herein.
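Since the patent only names MVDR without detailing it, the following NumPy sketch shows one common per-frequency-bin MVDR formulation for reference; the STFT layout, the diagonal loading and the steering-vector construction are assumptions of the sketch, not the patent's implementation.

```python
import numpy as np

# Rough per-frequency-bin MVDR sketch (a common textbook formulation), shown only
# because the patent names MVDR; it is not a reproduction of the patent's method.
# `mixture_stft` has shape (mics, freq_bins, frames); `steering` has shape
# (mics, freq_bins) and would be built from the target speaker's direction and the
# array geometry.

def mvdr_beamform(mixture_stft: np.ndarray, steering: np.ndarray,
                  diag_load: float = 1e-6) -> np.ndarray:
    mics, bins_, frames = mixture_stft.shape
    out = np.zeros((bins_, frames), dtype=complex)
    for f in range(bins_):
        x = mixture_stft[:, f, :]                        # (mics, frames) for this bin
        r = x @ x.conj().T / frames                      # spatial covariance estimate
        r += diag_load * np.eye(mics)                    # diagonal loading for stability
        d = steering[:, f][:, None]                      # steering vector toward the speaker
        w = np.linalg.solve(r, d)
        w = w / (d.conj().T @ w)                         # MVDR weights: R^-1 d / (d^H R^-1 d)
        out[f, :] = (w.conj().T @ x)[0]                  # enhanced single-channel spectrum
    return out
```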
S404, recording the voiceprint information and the identity information correspondingly.
After determining the identity information of the target speaker and acquiring the voiceprint information of the target speaker, the server may record the identity information and the voiceprint information of the target speaker correspondingly, so as to obtain the correspondence between the target speaker and the target speaker's voiceprint information in the conference. For example, if the target speaker is target speaker A and voiceprint information 1 of target speaker A has been extracted, "target speaker A - voiceprint information 1" can be recorded correspondingly. The record may be kept in a table or in another form, which is not specifically limited herein. For example, it can be shown in the following table:
| Serial number | Speaker | Voiceprint information |
| --- | --- | --- |
| 1 | Target speaker A | Voiceprint information 1 |
| 2 | Target speaker B | Voiceprint information 1 |
| 3 | Target speaker C | Voiceprint information 3 |
As can be seen, in this embodiment the server may obtain a conference image in the conference, perform lip movement detection on it, determine the target speaker who is speaking, determine the identity information of the target speaker based on a pre-established face library, obtain the voice signal of the target speaker, extract the voiceprint information of that voice signal, and record the voiceprint information and the identity information correspondingly. In the related art, voiceprint registration of participants is performed before the conference starts; however, the voiceprint information of the same participant can fluctuate considerably across different time periods after registration, which may lead to a low voice recognition rate in actual use. In this embodiment, participants do not need to register voiceprints before the conference starts; instead, the voice signals uttered by the participants are extracted during the conference to register the voiceprint information. This avoids the problem that the environment changes between before and after the conference starts, or that large fluctuations in the participants' voiceprint information make the pre-registered voiceprint information inaccurate, so the voiceprint information of the target speaker is more accurate and the accuracy of subsequent voice recognition is improved.
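A short sketch tying steps S401 to S404 together is given below; every helper in it (lip-movement detection, face-library matching, voiceprint extraction) is a hypothetical placeholder standing in for the components described above, not an API defined by the patent.

```python
# Sketch of the in-conference voiceprint registration flow (steps S401-S404).
# All helper functions here are hypothetical placeholders.

voiceprint_registry = {}   # identity -> voiceprint, the "corresponding record" of S404

def register_voiceprints(conference_image, audio_segment,
                         detect_speaking_faces, match_face, extract_voiceprint):
    for face in detect_speaking_faces(conference_image):       # S401: lip-movement detection
        identity = match_face(face)                            # S402: look up the face library
        if identity is None or identity in voiceprint_registry:
            continue                                           # unknown face or already registered
        voiceprint = extract_voiceprint(audio_segment, face)   # S403: e.g. a TDNN embedding
        voiceprint_registry[identity] = voiceprint             # S404: record the correspondence
```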
As an embodiment of the present invention, as shown in fig. 5, the step of recognizing the utterance image and determining the direction information of each speaker may include:
s501, recognizing the speaking image and determining facial pixel points of each speaker;
The server can recognize the speaking image and determine the facial pixel points of each speaker in it. For each speaker, the server may either select any one of the facial pixel points as the position of the speaker's face in the image, or calculate the average of the facial pixel points and take the point corresponding to that average as the position; neither option is specifically limited here.
S502, for each speaker, determining angle information of the speaker relative to the voice acquisition equipment based on the position of the facial pixel point of the speaker in the speech image, the pre-calibrated parameter of the image acquisition equipment for shooting the speech image and the position of the voice acquisition equipment, and taking the angle information as the direction information of the speaker.
In an embodiment, after obtaining the position of the facial pixel point of the speaker in the speech image, the server may calculate the position of the speaker in the conference scene based on the position of the facial pixel point of the speaker in the speech image and a pre-calibrated parameter of the image capturing device capturing the speech image, and may calculate the angle information of the speaker relative to the voice capturing device as the direction information of the speaker based on the relative position of the voice capturing device and the camera.
In one embodiment, the image capturing device is a camera and the voice capturing device is a microphone array. A three-dimensional coordinate system is established with the position of the camera in the conference scene as the origin, with the X axis and the Y axis forming the horizontal plane. The server can extract the facial pixel points of each speaker from utterance image 1 and take the point corresponding to the average value of the facial pixel points as the position of speaker A. The three-dimensional coordinates (x1, y1, z1) of speaker A in this coordinate system are then calculated from the internal parameters and external parameters of the camera. With the microphone array located at (x2, y2, z2) in the same coordinate system, the angle whose tangent is |x1 - x2| / |y1 - y2| is calculated in the horizontal plane and used as the azimuth information of the speaker.
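A small numerical sketch of this example is given below, under the assumption that the azimuth is the angle whose tangent is |x1 - x2| / |y1 - y2| in the horizontal (X-Y) plane; the coordinates used are made up for illustration.

# Illustrative sketch: speaker at (x1, y1, z1) and microphone array at (x2, y2, z2),
# both in the camera-centered coordinate system; the horizontal angle of the speaker
# relative to the microphone array is used as the azimuth information.
import math

def speaker_azimuth(speaker_xyz, mic_xyz):
    dx = abs(speaker_xyz[0] - mic_xyz[0])
    dy = abs(speaker_xyz[1] - mic_xyz[1])
    return math.degrees(math.atan2(dx, dy))  # angle in the horizontal plane, degrees

print(speaker_azimuth((1.8, 2.4, 0.3), (0.2, 0.5, 0.0)))  # ~40.1 degrees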
Therefore, in this embodiment, the server may identify the utterance image, determine the facial pixel point of each speaker, and determine, for each speaker, angle information of the speaker relative to the voice collecting device based on the position of the facial pixel point of the speaker in the utterance image, the pre-calibrated parameter of the image collecting device that shoots the utterance image, and the position of the voice collecting device, as the direction information of the speaker, so that the server may accurately determine the direction information of the speaker, and further may ensure the accuracy of subsequent voice identification.
As an implementation manner of the embodiment of the present invention, as shown in fig. 6, the training manner of the speech recognition model may include:
s601, obtaining the multi-user voice sample and an initial model;
The server can obtain a multi-user voice sample and an initial model, wherein the multi-user voice sample includes the lip movement information, voiceprint information and azimuth information of each of a plurality of users, the voice signal generated by the plurality of users speaking simultaneously, and the text information corresponding to each user. The structure of the initial model is the same as that of the speech recognition model; that is, the initial model may include a residual layer, a first splicing layer, a convolutional layer, a second splicing layer and a recognition layer. The initial parameters of the initial model may be default values or may be randomly initialized, and are not specifically limited herein.
S602, each multi-user voice sample comprises text information corresponding to each user, and the text information is used as a sample label;
The server can obtain the text information corresponding to each user in each multi-user voice sample. The text information may be determined manually, or may be predetermined so that a plurality of users utter voice signals according to their corresponding text information at the same time, thereby producing the multi-user voice sample. The text information corresponding to each user in a multi-user voice sample can then be used as the sample label of that sample.
S603, inputting each multi-user voice sample into the initial model to obtain predicted text information;
For each user included in a multi-user voice sample, the lip movement information of the user can be input into the residual layer of the initial model for feature extraction, and the resulting lip features are input into the second splicing layer. The voiceprint information of the user, the azimuth information of the user and the voice signal generated by the plurality of users speaking simultaneously are input into the first splicing layer for splicing, and the spliced result is input into the convolutional layer.
The convolutional layer may extract the voice feature corresponding to the user from the voice signal generated by the plurality of users speaking simultaneously, based on the voiceprint information of the user, the azimuth information of the user and the phase characteristics among the plurality of array elements included in the microphone array, and input the voice feature into the second splicing layer. To ensure that the trained speech recognition model can accurately process the speech signal, the microphone array used to collect the training samples may be the same as the microphone array described in the above embodiments.
Furthermore, the second splicing layer can splice the voice features and the lip features and input the spliced features into the recognition layer, and the recognition layer performs voice recognition based on the spliced features to obtain text information, which serves as the predicted text information.
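The following is a hedged PyTorch-style sketch of this forward pass, with assumed (illustrative) feature dimensions and layer sizes. It is not the actual network defined by this embodiment, only a schematic of how the residual layer, the two splicing layers, the convolutional layer and the recognition layer fit together.

# Minimal sketch of the forward pass in S603. All dimensions, the specific layer
# choices inside each block and the vocabulary size are assumptions.
import torch
import torch.nn as nn

class SpeechRecognitionSketch(nn.Module):
    def __init__(self, lip_dim=256, vp_dim=128, az_dim=2, audio_dim=512,
                 hidden=256, vocab=5000):
        super().__init__()
        # vp_dim, az_dim, audio_dim document the expected input sizes only
        self.residual = nn.Sequential(nn.Linear(lip_dim, lip_dim), nn.ReLU())
        self.conv = nn.Conv1d(1, hidden, kernel_size=3, padding=1)
        self.recognition = nn.Linear(lip_dim + hidden, vocab)

    def forward(self, lip, voiceprint, azimuth, audio):
        lip_feat = lip + self.residual(lip)                      # residual layer
        fused = torch.cat([voiceprint, azimuth, audio], dim=-1)  # first splicing layer
        speech = self.conv(fused.unsqueeze(1)).mean(dim=-1)      # conv layer -> speech feature
        joint = torch.cat([speech, lip_feat], dim=-1)            # second splicing layer
        return self.recognition(joint)                           # recognition layer

model = SpeechRecognitionSketch()
logits = model(torch.randn(4, 256), torch.randn(4, 128),
               torch.randn(4, 2), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 5000])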
S604, based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label, adjusting the model parameters of the initial model until the initial model converges to obtain the voice recognition model.
Because the current initial model may not yet be able to accurately recognize the speech signal, the model parameters of the initial model may be adjusted based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label, so that the parameters of the initial model become increasingly appropriate and the accuracy of speech recognition improves, until the initial model converges. The parameters of the initial model may be adjusted using a gradient descent algorithm, a stochastic gradient descent algorithm, or the like, which is not specifically limited herein.
In one embodiment, a function value of the loss function may be calculated based on a difference between the predicted text information and the sample label, and when the function value reaches a preset value, it is determined that the current initial model converges, resulting in a speech recognition model. In one embodiment, after the number of iterations of the multi-user speech sample reaches a preset number, the initial model may be considered to be converged, and a speech recognition model is obtained.
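A hedged sketch of this adjustment loop is shown below. It reuses the model sketched above; the optimizer, loss function and thresholds are illustrative assumptions rather than choices made by this embodiment.

# Illustrative sketch of S604: adjust the initial model's parameters based on the
# difference between the predicted text information and the sample label, stopping
# when the loss reaches a preset value or after a preset number of iterations.
import torch
import torch.nn as nn

def train(model, samples, labels, lr=1e-3, loss_target=0.05, max_iters=10000):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # or a stochastic variant
    criterion = nn.CrossEntropyLoss()
    loss = None
    for _ in range(max_iters):
        for (lip, voiceprint, azimuth, audio), text_label in zip(samples, labels):
            logits = model(lip, voiceprint, azimuth, audio)
            loss = criterion(logits, text_label)  # difference from the sample label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss is not None and loss.item() <= loss_target:  # preset value reached
            break
    return model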
As can be seen, in this embodiment, the server may obtain the multi-user voice samples and the initial model, use text information corresponding to each user in each of the multi-user voice samples as a sample tag, input each of the multi-user voice samples into the initial model to obtain predicted text information, and adjust model parameters of the initial model based on a difference between the predicted text information corresponding to each of the multi-user voice samples and the sample tag until the initial model converges to obtain the voice recognition model. By the training mode, a model which can accurately identify lip movement information, voiceprint information, azimuth information and voice signals generated by simultaneous speaking of multiple users can be obtained through training, so that the accuracy of subsequent voice identification is ensured.
As an implementation manner of the embodiment of the present invention, the method may further include:
and generating a conference record based on the text information corresponding to each speaker.
Because the conference video contains both situations in which multiple speakers speak at the same time and situations in which a single speaker speaks, the server can record the text information corresponding to the speakers in these different situations in chronological order to generate the conference record.
For example, the text information corresponding to the speech signal uttered by speaker A at time a is "the conference content is the work report of the previous quarter"; at time b, after speaker A finishes speaking, speaker B and speaker C speak simultaneously, the text information corresponding to speaker B's speech signal is "my department completed one project in the previous quarter", and the text information corresponding to speaker C's speech signal is "I have a problem I would like to understand". The server may then generate the following conference record: Time a: speaker A, the conference content is the work report of the previous quarter; Time b: speaker B, my department completed one project in the previous quarter; speaker C, I have a problem I would like to understand.
In an embodiment, the meeting record may further include information such as a meeting location, a meeting name, and the like, which is not limited herein.
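A minimal sketch of assembling such a record is shown below; the entry format and the optional location/name fields are assumptions for illustration.

# Illustrative sketch: build a conference record from (time, speaker, text) entries,
# grouping simultaneous speakers under the same time in chronological order.
from collections import defaultdict

def build_meeting_record(entries, location=None, name=None):
    by_time = defaultdict(list)
    for time, speaker, text in entries:
        by_time[time].append(f"{speaker}: {text}")
    lines = []
    if name:
        lines.append(f"Meeting: {name}")
    if location:
        lines.append(f"Location: {location}")
    for time in sorted(by_time):
        lines.append(f"{time}: " + "; ".join(by_time[time]))
    return "\n".join(lines)

print(build_meeting_record([
    ("time a", "Speaker A", "the conference content is the work report of the previous quarter"),
    ("time b", "Speaker B", "my department completed one project in the previous quarter"),
    ("time b", "Speaker C", "I have a problem I would like to understand"),
]))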
As can be seen, in this embodiment, the server may generate a conference record based on the text information corresponding to each speaker. Because the server can record, in chronological order, both the situations in which multiple speakers speak simultaneously and those in which a single speaker speaks, and can still perform accurate voice recognition to obtain accurate text information when multiple speakers speak at the same time, no additional conference recording personnel are required, which saves labor and cost.
Corresponding to the above speech recognition method, an embodiment of the present invention further provides a speech recognition apparatus. A speech recognition apparatus provided in an embodiment of the present invention is described below.
As shown in fig. 7, a speech recognition apparatus may include:
a first obtaining module 710, configured to obtain utterance images, voice signals, and voiceprint information of each of a plurality of speakers in a conference, where the voice signals include voice signals generated by the plurality of speakers speaking simultaneously;
a first determining module 720, configured to identify the utterance image, and determine orientation information and lip movement information of each speaker;
the recognition module 730 is configured to, for each speaker, input lip movement information, voiceprint information, orientation information of the speaker and the speech signal into a pre-trained speech recognition model to obtain text information corresponding to the speaker, where the speech recognition model is obtained based on training of a multi-user speech sample, and the multi-user speech sample includes the lip movement information, the voiceprint information, the orientation information of each user and the speech signal generated by a multi-user speech at the same time.
It can be seen that in the scheme provided in the embodiment of the present invention, a server may obtain utterance images, voice signals, and voiceprint information of each speaker of multiple speakers in a conference, where a voice signal includes a voice signal generated by multiple speakers speaking at the same time, identify an utterance image, determine direction information and lip movement information of each speaker, and for each speaker, input the lip movement information, voiceprint information, direction information, and voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is obtained by training based on a multi-user voice sample, and the multi-user voice sample includes the lip movement information, voiceprint information, direction information, and voice signal generated by multiple users speaking at the same time. By the scheme, the server can input the acquired speech images and speech signals of the multiple speakers and the voiceprint information of each speaker into the speech recognition model, and the speech signals of the multiple speakers do not need to be separated according to different speakers, so that the completeness of the frequency spectrums of the speech signals of the different speakers is guaranteed, and the accuracy of speech recognition is improved.
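As a minimal illustration of how the acquisition, determining and recognition modules cooperate, the sketch below loops over speakers and feeds the shared multi-speaker voice signal to the model together with each speaker's own lip movement, voiceprint and orientation information. The object and attribute names are assumptions, not an API defined by this embodiment.

# Illustrative sketch: per-speaker recognition without separating the shared voice
# signal; the model extracts each speaker's speech feature internally.
def recognize_all(model, speakers, shared_voice_signal):
    """speakers: iterable of objects with .lip, .voiceprint, .azimuth, .identity."""
    results = {}
    for spk in speakers:
        results[spk.identity] = model(spk.lip, spk.voiceprint, spk.azimuth,
                                      shared_voice_signal)
    return results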
As an implementation manner of the embodiment of the present invention, the voice signal may be a voice signal collected by a microphone array, where the microphone array includes a plurality of array elements;
the identification module 730 may include:
and the first recognition unit is used for inputting the lip movement information, the voiceprint information, the azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker.
As an implementation manner of the embodiment of the present invention, the speech recognition model may include: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
the first recognition unit may include:
the first extraction subunit is used for performing feature extraction on the lip movement information by the residual error layer to obtain lip features, and inputting the lip features into the second splicing layer;
the first splicing subunit is used for splicing the voice signal, the azimuth information and the voiceprint information by the first splicing layer and inputting a spliced result to the convolutional layer;
a second extraction subunit, configured to, by the convolutional layer, extract a speech feature corresponding to the speaker from the speech signal based on the azimuth information, the voiceprint information, and phase characteristics among the plurality of array elements, and input the speech feature to the second concatenation layer;
the second splicing subunit is used for splicing the voice feature and the lip feature by the second splicing layer and inputting the spliced feature into the recognition layer;
and the recognition subunit is used for performing voice recognition on the recognition layer based on the spliced features to obtain corresponding text information of the speaker and outputting the text information.
As an implementation manner of the embodiment of the present invention, as shown in fig. 8, the apparatus may further include:
a second obtaining module 740, configured to obtain a conference image in a conference, perform lip movement detection on the conference image, and determine a target speaker who is speaking;
a second determining module 750, configured to determine identity information of the target speaker based on a pre-established face library;
a third obtaining module 760, configured to obtain a voice signal of the target speaker, and extract voiceprint information of the voice signal;
a recording module 770, configured to correspondingly record the voiceprint information and the identity information.
As an implementation manner of the embodiment of the present invention, the first determining module 720 may include:
the second identification unit is used for identifying the speaking image and determining facial pixel points of each speaker;
and the determining unit is used for determining the angle information of the speaker relative to the voice collecting equipment as the direction information of the speaker according to the position of the facial pixel point of the speaker in the speaking image, the pre-calibrated parameter of the image collecting equipment for shooting the speaking image and the position of the voice collecting equipment.
As an implementation manner of the embodiment of the present invention, the speech recognition model is obtained by pre-training through a model training module, and the model training module may include:
the sample acquisition unit is used for acquiring the multi-user voice sample and an initial model;
the label determining unit is used for taking the text information corresponding to each user in each multi-user voice sample as a sample label;
the text prediction unit is used for inputting each multi-user voice sample into the initial model to obtain predicted text information;
and the parameter adjusting unit is used for adjusting the model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
As an implementation manner of the embodiment of the present invention, as shown in fig. 9, the apparatus may further include:
a generating module 780, configured to generate a meeting record based on the text information corresponding to each speaker.
Correspondingly, the embodiment of the present invention provides a speech recognition system corresponding to the speech recognition method, and a speech recognition system provided by the embodiment of the present invention is introduced below.
As shown in fig. 10, a voice recognition system includes a server 1004 and a terminal 1003 provided with an image capture device 1001 and a voice capture device 1002, wherein:
the image capturing device 1001 is configured to capture an image in a conference;
the voice acquisition device 1002 is used for acquiring voice signals in a conference;
the terminal 1003 is configured to send the image and the voice signal to the server 1004;
the server 1004 is configured to receive the image and the voice signal, and to perform the steps of the voice recognition method according to any one of the above embodiments.
It can be seen that in the scheme provided in the embodiment of the present invention, an image acquisition device can acquire an image in a conference, a voice acquisition device can acquire a voice signal in the conference, a terminal can send the image and the voice signal to a server, and the server can acquire a speech image, a voice signal, and voiceprint information of each of a plurality of speakers in the conference, wherein the voice signal includes a voice signal generated by the plurality of speakers speaking at the same time, recognize the speech image, determine orientation information and lip movement information of each speaker, and, for each speaker, input the lip movement information, the voiceprint information, the orientation information, and the voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker, wherein the voice recognition model is obtained by training based on a multi-user voice sample, and the multi-user voice sample includes the lip movement information, voiceprint information and orientation information of each user, and a voice signal generated by multiple users speaking simultaneously. By this scheme, the server can input the speaking images and the voice signals of a plurality of speakers and the voiceprint information of each speaker into the voice recognition model, and the voice signals of the plurality of speakers do not need to be separated according to different speakers, so that the completeness of the frequency spectrums of the voice signals of different speakers is ensured, and the accuracy of voice recognition is improved.
The embodiment of the present invention further provides a server, as shown in fig. 11, including a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102 and the memory 1103 communicate with each other through the communication bus 1104.
a memory 1103 for storing a computer program;
the processor 1101 is configured to implement the steps of the speech recognition method according to any of the above embodiments when executing the program stored in the memory 1103.
It can be seen that in the scheme provided in the embodiment of the present invention, a server may obtain utterance images, voice signals, and voiceprint information of each speaker of multiple speakers in a conference, where a voice signal includes a voice signal generated by multiple speakers speaking at the same time, identify an utterance image, determine direction information and lip movement information of each speaker, and for each speaker, input the lip movement information, voiceprint information, direction information, and voice signal of the speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker, where the voice recognition model is obtained by training based on a multi-user voice sample, and the multi-user voice sample includes the lip movement information, voiceprint information, direction information, and voice signal generated by multiple users speaking at the same time. By the scheme, the server can input the acquired speech images and speech signals of the multiple speakers and the voiceprint information of each speaker into the speech recognition model, and the speech signals of the multiple speakers do not need to be separated according to different speakers, so that the completeness of the frequency spectrums of the speech signals of the different speakers is guaranteed, and the accuracy of speech recognition is improved.
The communication bus mentioned in the above server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the server and other devices.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the speech recognition method described in any of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the speech recognition method of any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, a speech recognition method, apparatus, system, server, computer-readable storage medium, and computer program product are described in a relatively simple manner, as they are substantially similar to the method embodiments, with reference to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (17)

1. A method of speech recognition, the method comprising:
acquiring speaking images, voice signals and voiceprint information of each of a plurality of speakers in a conference, wherein the voice signals comprise voice signals generated by speaking of the plurality of speakers at the same time;
recognizing the speaking image, and determining the direction information and lip movement information of each speaker;
and inputting the lip movement information, the voiceprint information, the azimuth information and the voice signals of each speaker into a pre-trained voice recognition model to obtain text information corresponding to the speaker, wherein the voice recognition model is obtained based on multi-user voice sample training, and the multi-user voice sample comprises the lip movement information, the voiceprint information, the azimuth information of each user and the voice signals generated by multi-user simultaneous speaking.
2. The method of claim 1, wherein the speech signal is a speech signal collected by a microphone array, the microphone array comprising a plurality of array elements;
the step of inputting the lip movement information, the voiceprint information, the orientation information and the voice signal of the speaker into a pre-trained voice recognition model to obtain the text information corresponding to the speaker comprises the following steps:
inputting the lip movement information, the voiceprint information, the orientation information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the orientation information, the voiceprint information and the phase characteristics among the array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker.
3. The method of claim 2, wherein the speech recognition model comprises: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
the step of the voice recognition model extracting the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and performing voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker includes:
the residual error layer performs feature extraction on the lip movement information to obtain lip features, and the lip features are input into the second splicing layer;
the first splicing layer splices the voice signal, the azimuth information and the voiceprint information and inputs a spliced result to the convolutional layer;
the convolutional layer extracts a voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and inputs the voice feature into the second splicing layer;
the second splicing layer splices the voice features and the lip features and inputs the spliced features into the recognition layer;
and the recognition layer performs voice recognition based on the spliced features to obtain corresponding text information of the speaker and outputs the text information.
4. The method of claim 1, wherein prior to the step of obtaining images, voice signals, and voiceprint information for each of a plurality of speakers, the method further comprises:
acquiring a conference image in a conference, carrying out lip movement detection on the conference image, and determining a target speaker who is speaking;
determining the identity information of the target speaker based on a pre-established face library;
acquiring a voice signal of the target speaker, and extracting voiceprint information of the voice signal;
and correspondingly recording the voiceprint information and the identity information.
5. The method of claim 1, wherein the step of identifying the speech image and determining orientation information for each speaker comprises:
identifying the speaking image and determining facial pixel points of each speaker;
and for each speaker, determining angle information of the speaker relative to the voice acquisition equipment as azimuth information of the speaker based on the position of the facial pixel point of the speaker in the speech image, the pre-calibrated parameter of the image acquisition equipment for shooting the speech image and the position of the voice acquisition equipment.
6. The method according to any one of claims 1-5, wherein the training of the speech recognition model comprises:
acquiring the multi-user voice sample and an initial model;
taking the text information corresponding to each user in each multi-user voice sample as a sample label;
inputting each multi-user voice sample into the initial model to obtain predicted text information;
and adjusting the model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
7. The method according to any one of claims 1-5, further comprising:
and generating a conference record based on the text information corresponding to each speaker.
8. A speech recognition apparatus, characterized in that the apparatus comprises:
the conference processing device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring speaking images, voice signals and voiceprint information of each of a plurality of speakers in a conference, and the voice signals comprise voice signals generated by the plurality of speakers speaking at the same time;
the first determining module is used for identifying the speaking image and determining the direction information and the lip movement information of each speaker;
and the recognition module is used for inputting the lip movement information, the voiceprint information and the direction information of each speaker and the voice signals into a pre-trained voice recognition model to obtain text information corresponding to the speaker, wherein the voice recognition model is obtained based on the training of a multi-user voice sample, and the multi-user voice sample comprises the lip movement information, the voiceprint information and the direction information of each user and the voice signals generated by the simultaneous speaking of multiple users.
9. The apparatus of claim 8, wherein the speech signal is a speech signal collected by a microphone array, the microphone array comprising a plurality of array elements;
the identification module comprises:
and the first recognition unit is used for inputting the lip movement information, the voiceprint information, the azimuth information and the voice signal of the speaker into a pre-trained voice recognition model, so that the voice recognition model extracts the voice feature corresponding to the speaker from the voice signal based on the azimuth information, the voiceprint information and the phase characteristics among the array elements, and performs voice recognition by combining the voice feature with the lip movement information to obtain the text information corresponding to the speaker.
10. The apparatus of claim 9, wherein the speech recognition model comprises: the device comprises a residual error layer, a first splicing layer, a convolution layer, a second splicing layer and an identification layer;
the first recognition unit includes:
the first extraction subunit is used for performing feature extraction on the lip movement information by the residual error layer to obtain lip features, and inputting the lip features into the second splicing layer;
the first splicing subunit is used for splicing the voice signal, the azimuth information and the voiceprint information by the first splicing layer and inputting a spliced result to the convolutional layer;
a second extraction subunit, configured to, by the convolutional layer, extract a speech feature corresponding to the speaker from the speech signal based on the azimuth information, the voiceprint information, and phase characteristics among the plurality of array elements, and input the speech feature to the second concatenation layer;
the second splicing subunit is used for splicing the voice feature and the lip feature by the second splicing layer and inputting the spliced feature into the recognition layer;
and the recognition subunit is used for performing voice recognition on the recognition layer based on the spliced features to obtain corresponding text information of the speaker and outputting the text information.
11. The apparatus of claim 8, further comprising:
the second acquisition module is used for acquiring a conference image in a conference, performing lip movement detection on the conference image and determining a target speaker who is speaking;
the second determining module is used for determining the identity information of the target speaker based on a pre-established face library;
the third acquisition module is used for acquiring the voice signal of the target speaker and extracting the voiceprint information of the voice signal;
and the recording module is used for correspondingly recording the voiceprint information and the identity information.
12. The apparatus of claim 8, wherein the first determining module comprises:
the second identification unit is used for identifying the speaking image and determining facial pixel points of each speaker;
and the determining unit is used for determining the angle information of the speaker relative to the voice collecting equipment as the direction information of the speaker according to the position of the facial pixel point of the speaker in the speaking image, the pre-calibrated parameter of the image collecting equipment for shooting the speaking image and the position of the voice collecting equipment.
13. The apparatus according to any one of claims 8-12, wherein the speech recognition model is pre-trained by a model training module, the model training module comprising:
the sample acquisition unit is used for acquiring the multi-user voice sample and an initial model;
the label determining unit is used for taking the text information corresponding to each user in each multi-user voice sample as a sample label;
the text prediction unit is used for inputting each multi-user voice sample into the initial model to obtain predicted text information;
and the parameter adjusting unit is used for adjusting the model parameters of the initial model based on the difference between the predicted text information corresponding to each multi-user voice sample and the sample label until the initial model converges to obtain the voice recognition model.
14. The apparatus of any one of claims 8-12, further comprising:
and the generating module is used for generating a conference record based on the text information corresponding to each speaker.
15. A voice recognition system, characterized by comprising a server and a terminal provided with an image acquisition device and a voice acquisition device, wherein:
the image acquisition equipment is used for acquiring images in a conference;
the voice acquisition equipment is used for acquiring voice signals in a conference;
the terminal is used for sending the image and the voice signal to the server;
the server for receiving the image and the speech signal and performing the method steps of any of claims 1-7.
16. A server, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
17. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN202111048642.5A 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium Pending CN113611308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048642.5A CN113611308A (en) 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111048642.5A CN113611308A (en) 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium

Publications (1)

Publication Number Publication Date
CN113611308A true CN113611308A (en) 2021-11-05

Family

ID=78342837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048642.5A Pending CN113611308A (en) 2021-09-08 2021-09-08 Voice recognition method, device, system, server and storage medium

Country Status (1)

Country Link
CN (1) CN113611308A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050354A (en) * 2022-08-10 2022-09-13 北京百度网讯科技有限公司 Digital human driving method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190008137A (en) * 2017-07-13 2019-01-23 한국전자통신연구원 Apparatus for deep learning based text-to-speech synthesis using multi-speaker data and method for the same
CN110223686A (en) * 2019-05-31 2019-09-10 联想(北京)有限公司 Audio recognition method, speech recognition equipment and electronic equipment
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110767226A (en) * 2019-10-30 2020-02-07 山西见声科技有限公司 Sound source positioning method and device with high accuracy, voice recognition method and system, storage equipment and terminal
CN110808048A (en) * 2019-11-13 2020-02-18 联想(北京)有限公司 Voice processing method, device, system and storage medium
CN111833899A (en) * 2020-07-27 2020-10-27 腾讯科技(深圳)有限公司 Voice detection method based on multiple sound zones, related device and storage medium
CN113129893A (en) * 2019-12-30 2021-07-16 Oppo(重庆)智能科技有限公司 Voice recognition method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination