CN111862967A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111862967A
CN111862967A
Authority
CN
China
Prior art keywords
voice, sequence, vector, speech, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010265398.7A
Other languages
Chinese (zh)
Inventor
蒋栋蔚 (Jiang Dongwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202010265398.7A
Publication of CN111862967A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Abstract

The application provides a voice recognition method and device, an electronic device, and a storage medium. Received speech to be recognized is segmented, according to the receiving time sequence, into a plurality of speech sequences of a preset duration, and the high-dimensional feature vector corresponding to each speech sequence is determined. The high-dimensional feature vector of each speech sequence is then input into a speech recognition model in segmentation order to obtain the text sequence corresponding to each speech sequence, and the text information of the speech to be recognized is determined based on the obtained text sequences and the segmentation order corresponding to each text sequence. In this way, the speech sequences obtained by segmentation as voice information is received in real time can be input into the speech recognition model immediately, in segmentation order, so that online speech recognition is realized quickly and conveniently with high recognition accuracy.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech recognition method, an apparatus, an electronic device, and a storage medium.
Background
Automatic Speech Recognition (ASR) takes speech as its object of study and lets a machine automatically recognize and understand human speech through speech signal processing. Speech recognition technology enables a machine to convert speech signals into the corresponding text through a process of recognition and understanding.
Generally, speech recognition is performed in offline scenarios and can rarely be achieved in online scenarios. How to realize online speech recognition while guaranteeing recognition accuracy is therefore a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the present application provides at least a speech recognition method, a speech recognition device, an electronic device, and a storage medium. By associating a segmentation order with each speech sequence received in real time and inputting the speech sequences into a speech recognition model in that order, online speech recognition can be implemented and the accuracy of speech recognition improved.
According to a first aspect of the present application, there is provided a speech recognition method comprising:
according to a receiving time sequence, segmenting received voice to be recognized into a plurality of voice sequences with preset duration;
determining a high-dimensional feature vector corresponding to each voice sequence;
according to the segmentation order, sequentially inputting the high-dimensional feature vector of each voice sequence into a voice recognition model to obtain a text sequence corresponding to each voice sequence;
and determining the text information of the voice to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence.
In some embodiments of the present application, the high-dimensional feature vector corresponding to each speech sequence is determined according to the following steps:
framing each voice sequence according to a preset length, and windowing the framed voice sequence to obtain spectrogram information corresponding to each voice sequence;
extracting voice features from the spectrogram information frame by frame to obtain a voice feature vector corresponding to each voice sequence;
and coding the voice feature vectors to obtain high-dimensional feature vectors corresponding to each voice sequence.
In some embodiments of the present application, for each speech sequence, sequentially inputting the high-dimensional feature vector of each speech sequence into the speech recognition model according to the segmentation order to obtain a text sequence corresponding to each speech sequence, including:
according to the segmentation order, sequentially determining a voice sequence to be processed as a current voice sequence and determining a high-dimensional feature vector corresponding to the current voice sequence as a current voice vector;
acquiring an initial state vector corresponding to the current voice sequence;
and inputting the current voice vector and the initial state vector into the voice recognition model to obtain a text sequence of the current voice sequence.
In some embodiments of the present application, the initial state vector is an intermediate state vector output after a previous speech sequence, which is arranged before the current speech sequence in segmentation order, is input to the speech recognition model.
In some embodiments of the present application, when the current speech sequence is the first speech sequence in the segmentation order, the initial state vector is a predetermined state vector.
In some embodiments of the present application, the inputting the current speech vector and the initial state vector into the speech recognition model to obtain a text sequence of the current speech sequence includes:
inputting the current voice vector and the initial state vector into a decoding layer of the voice recognition model to obtain an intermediate text vector corresponding to the current voice vector, a target position of the intermediate text vector in the current voice vector, a position weight corresponding to the target position and an intermediate state vector;
determining an intermediate speech vector for decoding processing based on the obtained position weight and the current speech vector;
taking the determined intermediate voice vector as the current voice vector and the obtained intermediate state vector as the initial state vector, and continuing decoding until a preset number of decoding passes has been completed, then stopping;
and determining the text sequence of the current voice sequence based on the plurality of intermediate text vectors obtained by decoding and the position weight of each intermediate text vector.
In some embodiments of the present application, the decoding layer comprises a decoder and a classifier; inputting the current speech vector and the initial state vector into a decoding layer of the speech recognition model to obtain an intermediate text vector corresponding to the current speech vector, a target position of the intermediate text vector in the current speech vector and a position weight corresponding to the target position, and an intermediate state vector, including:
inputting the current speech vector and the initial state vector into the decoder to obtain an intermediate text vector corresponding to the current speech vector, different positions of the intermediate text vector in the current speech vector, position weights corresponding to each position, and an intermediate state vector;
and inputting different positions of the intermediate text vector in the current speech vector and the position weight corresponding to each position into the classifier to obtain a target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position.
In some embodiments of the present application, the determining text information of a speech to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence includes:
and merging the text sequences according to the segmentation order to obtain the text information of the voice to be recognized.
In some embodiments of the present application, before segmenting the received speech to be recognized into a plurality of speech sequences of a preset duration according to the receiving timing sequence, the speech recognition method further includes training the speech recognition model according to the following steps:
acquiring a voice information sample and text information corresponding to the voice information sample;
and training the voice recognition model according to the voice information sample and the text information corresponding to the voice information sample.
According to a second aspect of the present application, there is provided a speech recognition apparatus comprising:
the segmentation module is used for segmenting the received voice to be recognized into a plurality of voice sequences with preset duration according to the receiving time sequence;
the first determining module is used for determining a high-dimensional feature vector corresponding to each voice sequence;
the generating module is used for sequentially inputting the high-dimensional feature vector of each voice sequence into the voice recognition model according to the segmentation order to obtain a text sequence corresponding to each voice sequence;
and the second determining module is used for determining the text information of the voice to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence.
In some embodiments of the present application, the first determining module is configured to determine the high-dimensional feature vector corresponding to each speech sequence according to the following steps:
framing each voice sequence according to a preset length, and windowing the framed voice sequence to obtain spectrogram information corresponding to each voice sequence;
extracting voice features from the spectrogram information frame by frame to obtain a voice feature vector corresponding to each voice sequence;
and coding the voice feature vectors to obtain high-dimensional feature vectors corresponding to each voice sequence.
In some embodiments of the present application, the generating module comprises:
the determining unit is used for sequentially determining the voice sequence to be processed as a current voice sequence and determining a high-dimensional feature vector corresponding to the current voice sequence as a current voice vector according to the segmentation order;
an obtaining unit, configured to obtain an initial state vector corresponding to the current speech sequence;
and the generating unit is used for inputting the current voice vector and the initial state vector into the voice recognition model to obtain a text sequence of the current voice sequence.
In some embodiments of the present application, the initial state vector is an intermediate state vector output after a previous speech sequence, which is arranged before the current speech sequence in segmentation order, is input to the speech recognition model.
In some embodiments of the present application, when the current speech sequence is the first speech sequence in the segmentation order, the initial state vector is a predetermined state vector.
In some embodiments of the present application, the generating unit comprises:
a first generating subunit, configured to input the current speech vector and the initial state vector into a decoding layer of the speech recognition model, so as to obtain an intermediate text vector corresponding to the current speech vector, a target position of the intermediate text vector in the current speech vector, a position weight corresponding to the target position, and an intermediate state vector;
a first determining subunit, configured to determine an intermediate speech vector for decoding processing based on the obtained position weight and the current speech vector;
a stopping subunit, configured to take the determined intermediate speech vector as the current speech vector and the obtained intermediate state vector as the initial state vector, and to continue decoding until a preset number of decoding passes has been completed, then stop decoding;
and the second determining subunit is used for determining the text sequence of the current voice sequence based on the plurality of intermediate text vectors obtained by decoding and the position weight of each intermediate text vector.
In some embodiments of the present application, the decoding layer comprises a decoder and a classifier; the first generating subunit is specifically configured to:
inputting the current speech vector and the initial state vector into the decoder to obtain an intermediate text vector corresponding to the current speech vector, different positions of the intermediate text vector in the current speech vector, position weights corresponding to each position, and an intermediate state vector;
and inputting different positions of the intermediate text vector in the current speech vector and the position weight corresponding to each position into the classifier to obtain a target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position.
In some embodiments of the present application, the second determining module is configured to determine the text information of the speech to be recognized according to the following steps:
and merging the text sequences according to the segmentation order to obtain the text information of the voice to be recognized.
In some embodiments of the present application, the speech recognition apparatus further comprises a training module; the training module is used for training the voice recognition model according to the following steps:
acquiring a voice information sample and text information corresponding to the voice information sample;
and training the voice recognition model according to the voice information sample and the text information corresponding to the voice information sample.
According to a third aspect of the present application, there is provided an electronic device comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the speech recognition method as described above.
According to a fourth aspect of the present application, a computer-readable storage medium is provided, having stored thereon a computer program which, when being executed by a processor, performs the steps of the speech recognition method as described above.
In the embodiment of the application, received speech to be recognized is segmented, according to the receiving time sequence, into a plurality of speech sequences of a preset duration, and the high-dimensional feature vector corresponding to each speech sequence is determined. The high-dimensional feature vector of each speech sequence is then input into the speech recognition model in segmentation order to obtain the text sequence corresponding to each speech sequence, and the text information of the speech to be recognized is determined based on the obtained text sequences and the segmentation order corresponding to each text sequence. In this way, the speech sequences obtained by segmentation as voice information is received in real time can be input into the speech recognition model immediately, in segmentation order, so that online speech recognition is realized quickly and conveniently with high recognition accuracy.
Further, the speech sequences to be processed are determined in turn as the current speech sequence according to the segmentation order, and the high-dimensional feature vector corresponding to the current speech sequence is determined as the current speech vector. After the initial state vector corresponding to the current speech sequence is obtained, the current speech vector and the initial state vector are input into the speech recognition model to obtain the text sequence of the current speech sequence. Using the state vector together with the high-dimensional feature vector of the speech sequence as the input of the speech recognition model improves the accuracy of speech recognition.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic architecture diagram of a speech recognition system according to an embodiment of the present application;
fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a flow chart of a speech recognition method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the structure of the generating module in FIG. 4;
FIG. 6 is a schematic diagram of the structure of the generating unit in FIG. 5;
fig. 7 is a second schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
In order to enable a person skilled in the art to use the present disclosure, the following embodiments are given in connection with the specific application scenario "speech recognition". It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the present application is described primarily in terms of speech recognition, it should be understood that this is merely one exemplary embodiment.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
One aspect of the present application relates to a speech recognition system. The voice recognition system can obtain the text sequence corresponding to each voice sequence by inputting the high-dimensional characteristic vector of each voice sequence into the voice recognition model in sequence, and further obtain the text information of the voice to be recognized, so that the accuracy of voice recognition can be guaranteed, and on-line voice recognition can be realized.
It is worth noting that, before the present application, few technical solutions could implement speech recognition in an online scenario, and those that achieved online recognition could not also guarantee the accuracy of speech recognition.
Fig. 1 is a schematic architecture diagram of a speech recognition system according to an embodiment of the present application. The voice recognition system may be an online transportation service platform for transportation services such as taxi, designated drive service, express, carpool, bus service, driver rental, or regular bus service, or any combination thereof, which may provide voice recognition services. The speech recognition system may include one or more of a server 110, a network 120, a database 130, a service provider terminal 140, a service requester terminal 150.
In some embodiments, the server 110 may include a processor. The processor may process information and/or data related to speech recognition to perform one or more of the functions described herein. For example, the processor may segment the received speech to be recognized into a plurality of speech sequences with preset duration according to the receiving time sequence, and sequentially input the high-dimensional feature vector of each speech sequence into the speech recognition model according to the segmentation order to obtain a text sequence corresponding to each speech sequence, and further obtain text information of the speech to be recognized.
Among them, the service provider terminal 140 and the service requester terminal 150 may be terminal devices, which are not limited to mobile terminals and personal computers.
In some embodiments, a processor may include one or more processing cores (e.g., a single-core or multi-core processor). Merely by way of example, a processor may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction Set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, or the like, or any combination thereof.
In some embodiments, a database 130 may be connected to the network 120 to communicate with one or more components in the speech recognition system (e.g., the server 110, the service provider terminal 140, the service requester terminal 150). One or more components in the speech recognition system may access data or instructions stored in database 130 via network 120. In some embodiments, the database 130 may be directly connected to one or more components in the speech recognition system, or the database 130 may be part of the server 110.
The following describes the speech recognition method provided in the embodiment of the present application in detail with reference to the content described in the speech recognition system shown in fig. 1.
Referring to fig. 2, fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present application. The method can be executed by a processor in the speech recognition system, and the specific execution process is as follows:
S201: and segmenting the received voice to be recognized into a plurality of voice sequences with preset duration according to the receiving time sequence.
In this step, the speech to be recognized is received in real time and intercepted, according to the receiving time sequence and the preset duration, into a plurality of speech sequences of the preset duration.
For example, if the speech to be recognized is a 2 s segment of speech information and the preset duration is 500 ms, 4 speech sequences can be intercepted from the speech to be recognized.
It should be noted that the receiving time sequence refers to the order of the receiving times. The preset duration can be set according to actual needs, with reference to the sequence length that the speech recognition model is suited to recognize; preferably, the preset duration can be set to 300 ms.
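As an illustration only, a minimal sketch of this segmentation step (assuming 16 kHz mono audio held in a NumPy array; `segment_speech` and its parameters are hypothetical names, not the patent's implementation):

```python
import numpy as np

def segment_speech(samples: np.ndarray, sample_rate: int = 16000,
                   preset_duration_ms: int = 500) -> list:
    """Cut a received waveform into consecutive chunks of a preset
    duration, in receiving order."""
    chunk_len = int(sample_rate * preset_duration_ms / 1000)
    # Chunk i covers samples [i * chunk_len, (i + 1) * chunk_len).
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples), chunk_len)]

# A 2 s utterance at a 500 ms preset duration yields 4 speech sequences,
# matching the example above.
speech = np.zeros(2 * 16000, dtype=np.float32)
assert len(segment_speech(speech)) == 4
```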
S202: and determining the corresponding high-dimensional feature vector of each voice sequence.
In the step, after the voice to be recognized is segmented into a plurality of continuous voice sequences with preset duration, for each voice sequence in the plurality of voice sequences corresponding to the voice to be recognized, a high-dimensional feature vector corresponding to each voice sequence is determined.
Here, the high-dimensional feature vector can be understood as a high-level sound feature obtained by encoding acoustic features; specifically, a relatively long speech sequence is converted into a short high-level sound sequence, that is, the sound features are extracted. Compared with inputting the original speech sequence into the speech recognition model directly, inputting the high-dimensional feature vector of each speech sequence can, on the one hand, greatly shorten the recognition time and, on the other hand, better suit the speech recognition model, which extracts useful sound features more readily from the relatively short high-dimensional feature vectors.
Further, determining the high-dimensional feature vector corresponding to each speech sequence in step S202 includes the following steps:
framing each voice sequence according to a preset length, and windowing the framed voice sequence to obtain spectrogram information corresponding to each voice sequence; extracting voice features from the spectrogram information frame by frame to obtain a voice feature vector corresponding to each voice sequence; and coding the voice feature vectors to obtain high-dimensional feature vectors corresponding to each voice sequence.
In this step, when determining the high-dimensional feature vector corresponding to each speech sequence, the speech sequence first needs to be preprocessed. Specifically, the speech sequence is framed according to a preset length, that is, divided into small segments of fixed length, and windowing is applied to the framed speech sequence, where windowing means bringing each frame of the speech sequence into a window function. This yields the spectrogram information corresponding to each speech sequence, which represents the spectral information of the speech. Speech features are then extracted from the spectrogram information frame by frame to obtain the speech feature vector corresponding to each speech sequence; the speech features may be fbank features. This completes the preprocessing of the speech sequence. Finally, the speech feature vector corresponding to each speech sequence is input into an encoding network for encoding, which yields the high-dimensional feature vector corresponding to the speech sequence.
Here, the preset length can be set according to actual needs; generally one frame is 10-30 ms, so the duration corresponding to the preset length is much shorter than the length of the speech sequence, that is, the preset length is much shorter than the preset duration.
It should be noted that the speech sequence is framed because the speech signal changes rapidly, which is inconvenient for Fourier transformation; within one frame the speech sequence has enough periods and does not change too drastically, so a Fourier transform can conveniently be performed on each frame of speech. Since speech changes continuously over a long range and cannot be processed without fixed characteristics, each frame of the speech sequence is brought into a window function and the values outside the window are set to 0, which mitigates the discontinuity of the signal at the two ends of each frame. Here, the encoding network may be a Bi-directional Long Short-Term Memory network (BLSTM) comprising multiple layers; the speech feature vector corresponding to each speech sequence is encoded layer by layer through the encoding network to obtain the high-dimensional feature vector corresponding to the speech sequence, where the dimension of the speech feature vector corresponding to each speech sequence is much larger than the dimension of the high-dimensional feature vector corresponding to the speech sequence.
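A minimal NumPy sketch of this preprocessing, assuming 16 kHz audio and 25 ms frames with a 10 ms shift (the fbank/mel-filterbank step and the BLSTM encoder are only indicated in comments, not implemented):

```python
import numpy as np

def chunk_to_spectrogram(chunk: np.ndarray, sample_rate: int = 16000,
                         frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Frame one speech sequence, window each frame, and return a magnitude
    spectrogram; an fbank front end would apply a mel filterbank on top."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 25 ms per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # frame shift
    window = np.hamming(frame_len)                   # values outside are 0
    frames = np.stack([chunk[s:s + frame_len] * window
                       for s in range(0, len(chunk) - frame_len + 1, hop_len)])
    # Per-frame FFT magnitudes carry the spectral information of the speech;
    # the resulting feature vectors would then be fed to the encoding network.
    return np.abs(np.fft.rfft(frames, axis=1))
```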
S203: and according to the segmentation order, sequentially inputting the high-dimensional feature vector of each voice sequence into the voice recognition model to obtain a text sequence corresponding to each voice sequence.
In this step, the high-dimensional feature vectors of the speech sequences corresponding to the speech to be recognized are input into the speech recognition model in segmentation order, and the speech recognition model processes the high-dimensional feature vector of each speech sequence to obtain the text sequence corresponding to each speech sequence, where the text sequence is the text recognition result.
Here, the speech recognition model may be a Transformer model containing a self-attention mechanism; self-attention is the means by which the Transformer folds its "understanding" of other related tokens into the token currently being processed.
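For orientation, the generic scaled dot-product self-attention used in Transformer models can be sketched as follows (a textbook formulation, not the patent's exact network):

```python
import numpy as np

def self_attention(x: np.ndarray, wq: np.ndarray,
                   wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a sequence x of shape (T, d):
    each position is re-expressed as a weighted sum over all positions."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over positions gives the attention (position) weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```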
Further, for each voice sequence, sequentially inputting the high-dimensional feature vector of each voice sequence into the voice recognition model according to the segmentation order in step S203 to obtain a text sequence corresponding to each voice sequence, including the following steps:
step a: and according to the segmentation order, sequentially determining the voice sequence to be processed as a current voice sequence, and determining a high-dimensional feature vector corresponding to the current voice sequence as a current voice vector.
In this step, when performing speech recognition on a speech to be recognized, each speech sequence corresponding to the speech to be recognized is processed in sequence according to a segmentation order, and specifically, when it turns to process a certain speech sequence, the speech sequence is determined as a current speech sequence, and a high-dimensional feature vector corresponding to the current speech sequence is determined as a current speech vector, and then the current speech vector is processed so as to obtain a text sequence corresponding to the current speech sequence.
Step b: and acquiring an initial state vector corresponding to the current voice sequence.
In this step, when performing speech recognition on the current speech sequence, the initial state vector corresponding to the current speech sequence needs to be obtained. The initial state vector corresponding to the current speech sequence is related to the position of the current speech sequence in the segmentation order of the speech to be recognized, and generally falls into two cases: the current speech sequence is a middle or the last speech sequence in the segmentation order, or the current speech sequence is the first speech sequence in the segmentation order.
It should be noted that after the high-dimensional feature vector corresponding to each speech sequence is input into the speech recognition model, in addition to the text sequence corresponding to the speech sequence, an intermediate state vector corresponding to the speech sequence is also output; this intermediate state vector can represent the state information of the speech sequence to a certain extent.
Case one: if the current speech sequence is a middle or the last speech sequence in the segmentation order, the initial state vector is the intermediate state vector output after the previous speech sequence, arranged before the current speech sequence in the segmentation order, was input into the speech recognition model.
Case two: if the current speech sequence is the first speech sequence in the segmentation order, the initial state vector is a preset state vector.
Here, the preset state vector may be a one-dimensional constant vector.
Step c: and inputting the current voice vector and the initial state vector into the voice recognition model to obtain a text sequence of the current voice sequence.
In this step, after the initial state vector corresponding to the current speech sequence is obtained, the current speech vector corresponding to the current speech sequence is input into the speech recognition model together with the initial state vector, and the text sequence corresponding to the current speech sequence can be obtained.
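Tying steps a-c together, a hedged sketch of the chunk-by-chunk loop (`encode`, `model`, and the `(text, state)` return convention are assumptions for illustration, not the patent's API):

```python
def recognize_online(speech_chunks, encode, model, preset_state):
    """Process speech sequences in segmentation order, threading each
    chunk's output state vector into the next chunk (steps a-c)."""
    state = preset_state                  # case two: first chunk uses a preset vector
    text_sequences = []
    for chunk in speech_chunks:           # segmentation order == receiving order
        speech_vector = encode(chunk)     # high-dimensional feature vector
        # The model returns the chunk's text sequence and the intermediate
        # state vector that seeds the next chunk (case one above).
        text_seq, state = model(speech_vector, state)
        text_sequences.append(text_seq)
    return text_sequences
```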
S204: and determining the text information of the voice to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence.
In this step, after the high-dimensional feature vectors of the speech sequences corresponding to the speech to be recognized have been input into the speech recognition model in segmentation order and the text sequence corresponding to each speech sequence has been obtained, the obtained text sequences are assembled according to the segmentation order corresponding to each text sequence, yielding the text information of the speech to be recognized.
Further, in step S204, determining text information of the speech to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence, includes the following steps:
and merging the text sequences according to the segmentation order to obtain the text information of the voice to be recognized.
In this step, the text sequences corresponding to the speech sequences are merged according to the segmentation order, that is, the individual text sequences are connected into one continuous text sequence, yielding the complete, continuous text information corresponding to the speech to be recognized.
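Continuing the sketch above, the merging step reduces to a concatenation in segmentation order:

```python
def merge_text_sequences(text_sequences) -> str:
    """Connect the per-chunk text sequences, already in segmentation
    order, into the continuous text of the speech to be recognized."""
    return "".join(text_sequences)
```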
According to the voice recognition method provided by the embodiment of the application, received speech to be recognized is segmented, according to the receiving time sequence, into a plurality of speech sequences of a preset duration, and the high-dimensional feature vector corresponding to each speech sequence is determined. The high-dimensional feature vector of each speech sequence is then input into a speech recognition model in segmentation order to obtain the text sequence corresponding to each speech sequence, and the text information of the speech to be recognized is determined based on the obtained text sequences and the segmentation order corresponding to each text sequence. In this way, the speech sequences obtained by segmentation as voice information is received in real time can be input into the speech recognition model immediately, in segmentation order, so that online speech recognition is realized quickly and conveniently with high recognition accuracy.
Referring to fig. 3, fig. 3 is a flowchart of a speech recognition method according to another embodiment of the present application. The method can be executed by a processor in the speech recognition system, and the specific execution process is as follows:
S301: and segmenting the received voice to be recognized into a plurality of voice sequences with preset duration according to the receiving time sequence.
S302: and determining the corresponding high-dimensional feature vector of each voice sequence.
S303: and for each voice sequence, sequentially determining the voice sequence to be processed as a current voice sequence and determining a high-dimensional feature vector corresponding to the current voice sequence as a current voice vector according to the segmentation order.
S304: and acquiring an initial state vector corresponding to the current voice sequence.
S305: and inputting the current voice vector and the initial state vector into a decoding layer of a voice recognition model to obtain an intermediate text vector corresponding to the current voice vector, a target position of the intermediate text vector in the current voice vector, a position weight corresponding to the target position and an intermediate state vector.
In this step, when performing speech recognition on each speech sequence, the speech sequence is determined as the current speech sequence and the high-dimensional feature vector corresponding to it is determined as the current speech vector. The current speech vector and the initial state vector corresponding to it are then input together into the first decoding layer of the speech recognition model to obtain the intermediate text vector corresponding to the current speech vector, the target position of the intermediate text vector in the current speech vector, the position weight corresponding to the target position, and the intermediate state vector corresponding to the current speech vector, which completes the decoding process at the first decoding layer. The output of the first decoding layer is then input into the second decoding layer of the speech recognition model for decoding, and so on until decoding at all decoding layers of the speech recognition model is completed; after the whole decoding process is finished, the text sequence corresponding to the current speech vector, that is, the text sequence corresponding to the speech sequence, is obtained.
Here, the speech recognition model includes a plurality of decoding layers connected in series, and the current speech vector and the initial state vector corresponding to the current speech sequence complete decoding once through each decoding layer.
Further, training the speech recognition model according to the following steps: acquiring a voice information sample and text information corresponding to the voice information sample; and training the voice recognition model according to the voice information sample and the text information corresponding to the voice information sample.
In this step, before the speech recognition model is used to recognize the speech to be recognized, the speech recognition model needs to be trained to improve its recognition accuracy. Specifically, a large number of speech information samples are obtained together with the text information corresponding to each sample, where the text information corresponding to each speech information sample may be an accurate, manually produced transcription. The speech recognition model is then trained on the speech information samples and their corresponding text information. In the training process, each speech information sample is input into the speech recognition model, the recognized text information is compared with the reference text information corresponding to the sample, the parameters of the speech recognition model are adjusted according to the comparison result, and training continues in the same way until the recognition accuracy of the speech recognition model reaches a preset threshold, at which point the training process ends.
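One possible shape of such a training loop, sketched in PyTorch purely for illustration (the dataset format, loss function, and accuracy criterion are assumptions; the patent does not fix them):

```python
import torch

def train_speech_model(model, dataset, lr=1e-4, target_acc=0.95, max_epochs=50):
    """Hedged training sketch: fit the model on (feature, transcript-id)
    pairs until a preset accuracy threshold is reached."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        correct, total = 0, 0
        for features, target_ids in dataset:    # manually transcribed labels
            logits = model(features)            # assumed shape: (T, vocab)
            loss = loss_fn(logits, target_ids)  # compare with reference text
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                    # adjust model parameters
            correct += (logits.argmax(-1) == target_ids).sum().item()
            total += target_ids.numel()
        if correct / total >= target_acc:       # preset recognition threshold
            return
```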
Here, the speech recognition model may be a Transformer model containing a self-attention mechanism; self-attention is the means by which the Transformer folds its "understanding" of other related tokens into the token currently being processed.
Further, the decoding layer comprises a decoder and a classifier; in step S305, inputting the current speech vector and the initial state vector into a decoding layer of the speech recognition model, to obtain an intermediate text vector corresponding to the current speech vector, a target position of the intermediate text vector in the current speech vector, and a position weight corresponding to the target position, and an intermediate state vector, including the following steps:
step A: and inputting the current voice vector and the initial state vector into the decoder to obtain an intermediate text vector corresponding to the current voice vector, different positions of the intermediate text vector in the current voice vector, position weights corresponding to each position, and an intermediate state vector.
In this step, each decoding layer in the speech recognition model includes a decoder and a classifier. Inputting the current speech vector and the initial state vector corresponding to each speech sequence into a decoding layer of the speech recognition model essentially means first inputting the current speech vector and the initial state vector corresponding to the speech sequence into the decoder, which yields the intermediate text vector corresponding to the current speech vector, the different positions of the intermediate text vector in the current speech vector, the position weight corresponding to each position, and the intermediate state vector.
Step B: inputting different positions of the intermediate text vector in the current speech vector and the position weight corresponding to each position into the classifier to obtain a target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position.
In this step, after the decoder outputs the different positions of the intermediate text vector in the current speech vector and the position weight corresponding to each position, these positions and their position weights are input into the classifier to obtain the target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position.
Here, the classifier may be a Softmax logistic regression classifier. A classifier determines the category to which the object to be classified belongs; in the technical solution of the present application, the classifier determines the target position of the intermediate text vector in the current speech vector and the position weight corresponding to that target position.
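A small sketch of how such a Softmax classifier stage could pick the target position from the decoder's per-position scores (shapes and names are assumptions, not the patent's implementation):

```python
import numpy as np

def classify_target_position(position_scores: np.ndarray):
    """Softmax over the decoder's per-position scores, then pick the
    target position and its weight (illustrative classifier stage)."""
    weights = np.exp(position_scores - position_scores.max())
    weights /= weights.sum()
    target = int(weights.argmax())         # target position
    return target, float(weights[target])  # and its position weight
```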
S306: and determining an intermediate voice vector for decoding processing based on the obtained position weight and the current voice vector.
In this step, after the current speech vector and the initial state vector corresponding to each speech sequence have been input into a decoding layer of the speech recognition model and the target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position have been obtained, the intermediate speech vector for the next decoding pass can be obtained from the position weight corresponding to the target position and the current speech vector; specifically, the position weight is multiplied by the current speech vector corresponding to the speech sequence to obtain the intermediate speech vector.
It should be noted that determining the intermediate speech vector for decoding from the obtained position weight and the current speech vector can be handled with a self-attention mechanism, which folds the "understanding" of the previous intermediate text vector into the intermediate text vector currently being generated.
Here, the speech recognition model includes a plurality of decoding layers, and the input of each decoding layer depends on the output of the previous decoding layer. Decoding may proceed from top to bottom, with each decoding layer taking as input the intermediate speech vector and the intermediate state vector output by the previous decoding layer.
S307: and taking the determined intermediate voice vector as the current voice vector and the obtained intermediate state vector as the initial state vector, and continuing decoding until a preset number of decoding passes has been completed, then stopping.
After one pass of decoding is completed, the intermediate speech vector determined by the previous decoding layer is taken as the current speech vector and the obtained intermediate state vector as the initial state vector; the current speech vector and initial state vector so determined then serve as the input of the current decoding layer, and decoding continues in this way through the successive decoding layers of the speech recognition model. After decoding at all decoding layers of the speech recognition model is completed, the text sequence corresponding to the current speech sequence is determined.
Here, the preset number of times may be set to the number of decoding layers in the speech recognition model, so that decoding stops once the number of decoding passes equals the preset number; the decoding process is then complete, that is, the text sequence of the current speech sequence is obtained.
It should be noted that instead of stopping after a preset number of passes, a left-to-right beam search algorithm may be used, with the beam width set to the number of decoding layers in the speech recognition model, so that decoding starts from the <SOS> tag and ends when the <EOS> tag is encountered.
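A generic sketch of such a left-to-right beam search (the `step` interface and the tag handling are assumptions for illustration, not the patent's procedure):

```python
def beam_search(step, sos, eos, beam_width, max_len=100):
    """Left-to-right beam search sketch: step(prefix) is assumed to return
    a list of (token, log_prob) continuations for the given prefix."""
    beams = [([sos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:                 # finished hypothesis
                candidates.append((prefix, score))
                continue
            for token, log_prob in step(prefix):
                candidates.append((prefix + [token], score + log_prob))
        # Keep only the beam_width highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(p[-1] == eos for p, _ in beams):   # every beam has ended
            break
    return beams[0][0]
```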
S308: and determining the text sequence of the current voice sequence based on the plurality of intermediate text vectors obtained by decoding and the position weight of each intermediate text vector.
In this step, after the high-dimensional feature vector corresponding to the current speech sequence, that is, the current speech vector, has been passed through the decoding layers of the speech recognition model, a plurality of intermediate text vectors and the position weight of each intermediate text vector are obtained. The position of each intermediate text vector in the text sequence can be determined from its position weight, and once the positions of the intermediate text vectors are determined, the text sequence of the current speech sequence is determined.
S309: and determining the text information of the voice to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence.
For S301, S302, and S309, reference may be made to the descriptions of S201, S202, and S204, respectively; the same technical effects can be achieved, so the details are not repeated here.
Here, following the execution order of the above steps, the complete process of recognizing each speech sequence in the application to obtain its corresponding text sequence is described, comprising the following steps:
Step (1): determine the high-dimensional feature vector H0 corresponding to the current speech sequence X, and take the high-dimensional feature vector H0 as the current speech vector H0.
Step (2): obtain the initial state vector C0 corresponding to the current speech sequence X.
Step (3): input the current speech vector H0 and the initial state vector C0 into the first decoding layer to obtain the intermediate state vector C1 corresponding to the current speech sequence X, the intermediate text vector y0 corresponding to the current speech vector H0, and the position weight α0 corresponding to the target position of the intermediate text vector y0 in the current speech vector H0.
Step (4): determine the intermediate speech vector H1 corresponding to the current speech sequence X according to the position weight α0 and the current speech vector H0.
Step (5): take the determined intermediate speech vector H1 as the new current speech vector and the obtained intermediate state vector C1 as the new initial state vector, and count the number of decoding passes. If the preset number of times has not been reached, return to step (3) to continue decoding; if it has, stop decoding and determine the text sequence Y of the current speech sequence X according to the obtained intermediate text vectors and the position weight of each intermediate text vector.
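The five steps can be condensed into a short loop, sketched under the same notation (`decode_layer` and its return convention are assumptions standing in for one decoding pass of the model):

```python
def decode_chunk(h0, c0, decode_layer, preset_times):
    """Condensed form of steps (1)-(5): h0 is the current speech vector,
    c0 the initial state vector of the current speech sequence."""
    h, c = h0, c0
    outputs = []
    for _ in range(preset_times):          # stop after the preset number of passes
        y, position, alpha, c = decode_layer(h, c)   # step (3)
        outputs.append((y, position, alpha))
        h = alpha * h                      # step (4): weight the speech vector
    # Step (5) end: the text sequence Y is assembled from the intermediate
    # text vectors, ordered according to their position weights.
    return outputs
```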
In the embodiment of the application, received speech to be recognized is segmented, according to the receiving time sequence, into a plurality of speech sequences of a preset duration, and the high-dimensional feature vector corresponding to each speech sequence is determined. The high-dimensional feature vector of each speech sequence is then input into the speech recognition model in segmentation order to obtain the text sequence corresponding to each speech sequence, and the text information of the speech to be recognized is determined based on the obtained text sequences and the segmentation order corresponding to each text sequence. In this way, the speech sequences obtained by segmentation as voice information is received in real time can be input into the speech recognition model immediately, in segmentation order, so that online speech recognition is realized quickly and conveniently with high recognition accuracy.
Referring to fig. 4 to 7, fig. 4 is a schematic structural diagram of a speech recognition apparatus 400 according to an embodiment of the present application; FIG. 5 is a schematic diagram of the generating module 430 in FIG. 4; fig. 6 is a schematic structural diagram of the generation unit 436 in fig. 5; fig. 7 is a second schematic structural diagram of a speech recognition apparatus 400 according to an embodiment of the present application.
As shown in fig. 4 and 7, the speech recognition apparatus 400 includes:
a segmentation module 410, configured to segment a received speech to be recognized into multiple speech sequences of a preset duration according to a receiving timing sequence;
a first determining module 420, configured to determine a high-dimensional feature vector corresponding to each speech sequence;
the generating module 430 is configured to sequentially input the high-dimensional feature vector of each speech sequence into the speech recognition model according to the segmentation order, so as to obtain a text sequence corresponding to each speech sequence;
and a second determining module 440, configured to determine text information of the speech to be recognized based on the obtained multiple text sequences and the segmentation order corresponding to each text sequence.
Further, as shown in fig. 4, the first determining module 420 is configured to determine the high-dimensional feature vector corresponding to each speech sequence according to the following steps:
framing each voice sequence according to a preset length, and windowing the framed voice sequence to obtain spectrogram information corresponding to each voice sequence;
extracting voice features from the spectrogram information frame by frame to obtain a voice feature vector corresponding to each voice sequence;
and coding the voice feature vectors to obtain high-dimensional feature vectors corresponding to each voice sequence.
Further, as shown in fig. 5, the generating module 430 includes:
a determining unit 432, configured to sequentially determine, according to a segmentation order, a speech sequence to be processed as a current speech sequence, and determine a high-dimensional feature vector corresponding to the current speech sequence as a current speech vector;
an obtaining unit 434, configured to obtain an initial state vector corresponding to the current speech sequence;
a generating unit 436, configured to input the current speech vector and the initial state vector into the speech recognition model, so as to obtain a text sequence of the current speech sequence.
Further, the initial state vector is an intermediate state vector output after a previous speech sequence arranged before the current speech sequence is input to the speech recognition model in the segmentation order.
Further, when the current speech sequence is the first speech sequence in the segmentation order, the initial state vector is a preset state vector.
Further, as shown in fig. 6, the generating unit 436 includes:
a first generating subunit 4361, configured to input the current speech vector and the initial state vector into a decoding layer of the speech recognition model, to obtain an intermediate text vector corresponding to the current speech vector, a target position of the intermediate text vector in the current speech vector, a position weight corresponding to the target position, and an intermediate state vector;
a first determining subunit 4362, configured to determine an intermediate speech vector for decoding processing based on the obtained position weight and the current speech vector;
a stopping subunit 4363, configured to take the determined intermediate speech vector as the current speech vector and the obtained intermediate state vector as the initial state vector, and to continue decoding until a preset number of decoding iterations has been performed;
a second determining subunit 4364, configured to determine a text sequence of the current speech sequence based on the decoded multiple intermediate text vectors and the position weight of each intermediate text vector.
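A minimal sketch of this loop under assumed interfaces: `decoder` returns the intermediate text vector, the candidate positions, their position weights, and an intermediate state vector, and `classify_position` (sketched after the next list) selects the target position. Both the re-weighting of the speech vector and the final selection threshold are illustrative assumptions, not the application's mandated computations:

```python
def decode_sequence(speech_vec, init_state, decoder, classify_position,
                    n_steps: int = 32):
    """Fixed-length decoding loop of generating unit 436."""
    state, decoded = init_state, []
    for _ in range(n_steps):  # stop after a preset number of iterations
        text_vec, positions, weights, state = decoder(speech_vec, state)
        target_pos, target_weight = classify_position(positions, weights)
        decoded.append((text_vec, target_weight))
        # form the intermediate speech vector from the position weight and the
        # current speech vector, then decode again with the new state
        speech_vec = target_weight * speech_vec
    # keep the text vectors whose position weights are strong enough
    # (the 0.5 threshold is an illustrative choice)
    return [tv for tv, w in decoded if w >= 0.5]
```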
Further, as shown in fig. 6, the decoding layer includes a decoder and a classifier; the first generating subunit 4361 is specifically configured to (see the sketch after this list):
inputting the current speech vector and the initial state vector into the decoder to obtain an intermediate text vector corresponding to the current speech vector, different positions of the intermediate text vector in the current speech vector, position weights corresponding to each position, and an intermediate state vector;
and inputting different positions of the intermediate text vector in the current speech vector and the position weight corresponding to each position into the classifier to obtain a target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position.
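Under one plausible reading of the classifier, the target position is the position with the highest softmax-normalized weight; this selection rule is an assumption made for illustration:

```python
import numpy as np

def classify_position(positions: np.ndarray, weights: np.ndarray):
    """Return the target position and the position weight assigned to it."""
    probs = np.exp(weights - weights.max())  # numerically stable softmax
    probs /= probs.sum()
    idx = int(np.argmax(probs))
    return positions[idx], float(probs[idx])
```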
Further, as shown in fig. 4, the second determining module 440 is configured to determine the text information of the speech to be recognized according to the following step (a one-line sketch follows):
merging the text sequences according to the segmentation order to obtain the text information of the speech to be recognized.
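A trivial sketch of the merge, assuming each text sequence is paired with its segmentation index:

```python
def merge_text(indexed_sequences):
    """indexed_sequences: iterable of (segmentation_index, text_sequence)."""
    return "".join(text for _, text in sorted(indexed_sequences))
```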
Further, as shown in fig. 7, the speech recognition apparatus 400 further includes a training module 450; the training module 450 is configured to train the speech recognition model according to the following steps (a training sketch follows the list):
acquiring a speech information sample and text information corresponding to the speech information sample;
and training the speech recognition model according to the speech information sample and the text information corresponding to the speech information sample.
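A minimal supervised-training sketch; `model.update` is a hypothetical per-sample update standing in for whatever optimization the implementation uses, which the application leaves unspecified:

```python
def train(model, samples):
    """samples: iterable of (speech_information_sample, reference_text) pairs."""
    for speech, text in samples:
        model.update(speech, text)  # one supervised update per labeled pair
    return model
```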
In the embodiment of the application, the received speech to be recognized is segmented into a plurality of speech sequences of a preset duration according to the receiving time sequence, and a high-dimensional feature vector corresponding to each speech sequence is determined; the high-dimensional feature vector of each speech sequence is then input into the speech recognition model in segmentation order to obtain a text sequence corresponding to each speech sequence, and the text information of the speech to be recognized is determined based on the obtained text sequences and the segmentation order corresponding to each text sequence. In this way, each speech sequence obtained by segmentation can be input into the speech recognition model as soon as it is received, so that online speech recognition is realized quickly and conveniently, with high recognition accuracy.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device 800 includes a processor 810, a memory 820, and a bus 830.
The memory 820 stores machine-readable instructions executable by the processor 810. When the electronic device 800 runs, the processor 810 and the memory 820 communicate through the bus 830, and when the machine-readable instructions are executed by the processor 810, the steps of the speech recognition method in the method embodiments shown in fig. 2 and fig. 3 may be performed.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the speech recognition method in the method embodiments shown in fig. 2 and fig. 3.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division into units is only one kind of logical division, and other divisions are possible in an actual implementation; likewise, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may in essence be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application and shall be covered by its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A speech recognition method, characterized in that the speech recognition method comprises:
according to a receiving time sequence, segmenting received speech to be recognized into a plurality of speech sequences of a preset duration;
determining a high-dimensional feature vector corresponding to each speech sequence;
according to the segmentation order, sequentially inputting the high-dimensional feature vector of each speech sequence into a speech recognition model to obtain a text sequence corresponding to each speech sequence;
and determining the text information of the speech to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence.
2. The speech recognition method of claim 1, wherein the high-dimensional feature vector corresponding to each speech sequence is determined according to the following steps:
framing each speech sequence according to a preset length, and windowing the framed speech sequences to obtain spectrogram information corresponding to each speech sequence;
extracting speech features from the spectrogram information frame by frame to obtain a speech feature vector corresponding to each speech sequence;
and encoding the speech feature vectors to obtain the high-dimensional feature vector corresponding to each speech sequence.
3. The speech recognition method according to claim 1, wherein the sequentially inputting the high-dimensional feature vector of each speech sequence into the speech recognition model according to the segmentation order to obtain the text sequence corresponding to each speech sequence comprises:
according to the segmentation order, sequentially determining a speech sequence to be processed as a current speech sequence, and determining a high-dimensional feature vector corresponding to the current speech sequence as a current speech vector;
acquiring an initial state vector corresponding to the current speech sequence;
and inputting the current speech vector and the initial state vector into the speech recognition model to obtain a text sequence of the current speech sequence.
4. The speech recognition method of claim 3, wherein the initial state vector is an intermediate state vector output after a previous speech sequence, which is arranged before the current speech sequence in the segmentation order, is input to the speech recognition model.
5. The speech recognition method of claim 4, wherein, when the current speech sequence is the first speech sequence in the segmentation order, the initial state vector is a preset state vector.
6. The speech recognition method of claim 3, wherein the inputting the current speech vector and the initial state vector into the speech recognition model to obtain a text sequence of the current speech sequence comprises:
inputting the current speech vector and the initial state vector into a decoding layer of the speech recognition model to obtain an intermediate text vector corresponding to the current speech vector, a target position of the intermediate text vector in the current speech vector, a position weight corresponding to the target position, and an intermediate state vector;
determining an intermediate speech vector for decoding processing based on the obtained position weight and the current speech vector;
taking the determined intermediate speech vector as the current speech vector and the obtained intermediate state vector as the initial state vector, and continuing decoding until a preset number of decoding iterations has been performed;
and determining the text sequence of the current speech sequence based on the plurality of intermediate text vectors obtained by decoding and the position weight of each intermediate text vector.
7. The speech recognition method of claim 6, wherein the decoding layer comprises a decoder and a classifier, and the inputting the current speech vector and the initial state vector into the decoding layer of the speech recognition model to obtain the intermediate text vector corresponding to the current speech vector, the target position of the intermediate text vector in the current speech vector, the position weight corresponding to the target position, and the intermediate state vector comprises:
inputting the current speech vector and the initial state vector into the decoder to obtain an intermediate text vector corresponding to the current speech vector, different positions of the intermediate text vector in the current speech vector, position weights corresponding to each position, and an intermediate state vector;
and inputting the different positions of the intermediate text vector in the current speech vector and the position weight corresponding to each position into the classifier to obtain the target position of the intermediate text vector in the current speech vector and the position weight corresponding to the target position.
8. The speech recognition method according to claim 1, wherein the determining text information of the speech to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence comprises:
and merging the text sequences according to the segmentation order to obtain the text information of the speech to be recognized.
9. The speech recognition method according to claim 1, wherein, before the segmenting the received speech to be recognized into a plurality of speech sequences of a preset duration according to the receiving time sequence, the speech recognition method further comprises training the speech recognition model according to the following steps:
acquiring a speech information sample and text information corresponding to the speech information sample;
and training the speech recognition model according to the speech information sample and the text information corresponding to the speech information sample.
10. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the segmentation module is used for segmenting received speech to be recognized into a plurality of speech sequences of a preset duration according to a receiving time sequence;
the first determining module is used for determining a high-dimensional feature vector corresponding to each speech sequence;
the generating module is used for sequentially inputting the high-dimensional feature vector of each speech sequence into a speech recognition model according to the segmentation order to obtain a text sequence corresponding to each speech sequence;
and the second determining module is used for determining the text information of the speech to be recognized based on the obtained plurality of text sequences and the segmentation order corresponding to each text sequence.
11. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the speech recognition method according to any one of claims 1 to 9.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to one of claims 1 to 9.
CN202010265398.7A 2020-04-07 2020-04-07 Voice recognition method and device, electronic equipment and storage medium Pending CN111862967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010265398.7A CN111862967A (en) 2020-04-07 2020-04-07 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010265398.7A CN111862967A (en) 2020-04-07 2020-04-07 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111862967A true CN111862967A (en) 2020-10-30

Family

ID=72986009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010265398.7A Pending CN111862967A (en) 2020-04-07 2020-04-07 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111862967A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016119604A1 (en) * 2015-01-26 2016-08-04 阿里巴巴集团控股有限公司 Voice information search method and apparatus, and server
CN107221333A (en) * 2016-03-21 2017-09-29 中兴通讯股份有限公司 A kind of identity authentication method and device
CN110570845A (en) * 2019-08-15 2019-12-13 武汉理工大学 Voice recognition method based on domain invariant features
CN110473531A (en) * 2019-09-05 2019-11-19 腾讯科技(深圳)有限公司 Audio recognition method, device, electronic equipment, system and storage medium
CN110689876A (en) * 2019-10-14 2020-01-14 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951209A (en) * 2021-01-27 2021-06-11 科大讯飞股份有限公司 Voice recognition method, device, equipment and computer readable storage medium
CN112951209B (en) * 2021-01-27 2023-12-01 中国科学技术大学 Voice recognition method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination