CN113851111A - Voice recognition method and voice recognition device - Google Patents

Info

Publication number
CN113851111A
Authority
CN
China
Prior art keywords
recognition, window, target, voice, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111067913.1A
Other languages
Chinese (zh)
Inventor
郭莉莉
王旭阳
洪密
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202111067913.1A priority Critical patent/CN113851111A/en
Publication of CN113851111A publication Critical patent/CN113851111A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Abstract

The embodiments of the present application provide a voice recognition method and a voice recognition device. The method includes: performing windowing processing on a voice data stream and determining the voice data located within a window; performing object recognition processing on the voice data within the window, adjusting the length of the window according to the object recognition processing result, and determining the voice data within the adjusted window as a target voice segment; and performing voice recognition processing on the target voice segment based on a recognition model to obtain a target recognition result. In this way, when the voice data stream is windowed, the window length is flexibly adjusted according to the result of the object recognition processing to obtain target voice segments of different sizes, so that both recognition speed and recognition quality are taken into account, comprehensively improving the voice recognition performance of end-to-end voice recognition scenarios.

Description

Voice recognition method and voice recognition device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method and a speech recognition apparatus.
Background
End-to-end voice recognition is a recognition technology whose input is a voice signal and whose output is a text signal. It aims to go directly from voice input to decoded recognition, without the laborious alignment work and pronunciation-dictionary construction of conventional pipelines, and can therefore save a large amount of time; it is applied in a variety of scenarios.
In the related art, one mainstream framework for end-to-end speech recognition is a neural network based on the self-attention (Attention) model. However, this framework does not support streaming scenarios, i.e. real-time recognition, which poses difficulties for practical applications. To solve this problem, the speech stream is usually windowed in real-time speech recognition scenarios to obtain a plurality of speech segments (Chunks), so that streaming end-to-end speech recognition can be implemented with an Attention model. However, the windowing results in the related art are not ideal, leading to poor speech recognition performance.
Disclosure of Invention
The embodiments of the present application provide a voice recognition method and a voice recognition device that take both recognition speed and recognition quality into account, comprehensively improving the voice recognition performance of end-to-end voice recognition scenarios.
The technical solutions of the present application are realized as follows:
in a first aspect, an embodiment of the present application provides a speech recognition method, where the method includes:
performing windowing processing on a voice data stream, and determining the voice data located within a window;
performing object recognition processing on the voice data within the window, adjusting the length of the window according to the object recognition processing result, and determining the voice data within the adjusted window as a target voice segment;
and performing voice recognition processing on the target voice segment based on a recognition model to obtain a target recognition result.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, which includes a processing unit, an adjusting unit, and a recognition unit; wherein,
a processing unit configured to perform windowing on the voice data stream and determine voice data located within a window;
the adjusting unit is configured to perform object recognition processing on the voice data in the window, perform length adjustment on the window according to the object recognition processing result, and determine the voice data in the adjusted window as a target voice section;
and the recognition unit is configured to perform voice recognition processing on the target voice section based on the recognition model to obtain a target recognition result.
The embodiments of the present application provide a voice recognition method and a voice recognition device, in which windowing processing is performed on a voice data stream and the voice data within a window is determined; object recognition processing is performed on the voice data within the window, the length of the window is adjusted according to the object recognition processing result, and the voice data within the adjusted window is determined as a target voice segment; and voice recognition processing is performed on the target voice segment based on a recognition model to obtain a target recognition result. In this way, when the voice data stream is windowed, the window length is flexibly adjusted according to the result of the object recognition processing to obtain target voice segments of different sizes, so that both recognition speed and recognition quality are taken into account, comprehensively improving the voice recognition performance of end-to-end voice recognition scenarios.
Drawings
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of another speech recognition network according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant application and are not limiting of the application. It should be noted that, for the convenience of description, only the parts related to the related applications are shown in the drawings.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" are used merely to distinguish similar objects and do not represent a specific ordering of the objects. It should be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that shown or described herein.
It should be understood that end-to-end speech recognition is mainly divided into two frameworks: one based on the RNN-T (Recurrent Neural Network Transducer) model and one based on the self-attention model, in particular the Attention model. However, an Attention-based neural network (Transformer) needs to refer to context features and therefore does not support streaming recognition, so it cannot meet the requirements of practical application scenarios.
To solve this problem, it is generally necessary to window the speech stream, segmenting it into a plurality of different speech segments (generally called Chunks) for input, so as to construct context information. Here, the Chunk size is generally fixed. It has been found that if the Chunk is too long it causes a large delay, while if it is too short the recognition result is poor.
Based on this, the embodiments of the present application provide a speech recognition method whose basic idea is as follows: perform windowing processing on a voice data stream and determine the voice data within a window; perform object recognition processing on the voice data within the window, adjust the length of the window according to the object recognition processing result, and determine the voice data within the adjusted window as a target voice segment; and perform voice recognition processing on the target voice segment based on a recognition model to obtain a target recognition result. In this way, when the voice data stream is windowed, the window length is flexibly adjusted according to the object recognition processing result to obtain target speech segments of different sizes; both recognition speed and recognition quality are taken into account, comprehensively improving the voice recognition performance of end-to-end voice recognition scenarios.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In an embodiment of the present application, referring to fig. 1, a flowchart of a speech recognition method provided in an embodiment of the present application is shown. As shown in fig. 1, the method may include:
s101: the stream of speech data is windowed, and speech data located within the window is determined.
It should be noted that the speech recognition method provided in the embodiments of the present application applies to various electronic devices; for example, the electronic device may be a computer, a smartphone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a navigation device, smart furniture, a vehicle-mounted device, and the like, which is not specifically limited in the embodiments of the present application.
In the embodiment of the present application, the speech recognition method is explained with end-to-end speech recognition as an application scenario, but this does not constitute a relevant limitation.
In end-to-end speech recognition, it is often necessary to perform windowing (also called segmentation) on the speech data stream so as to divide the continuous stream into a plurality of speech segments (Chunks), and then perform speech recognition in units of Chunks. For example, an Attention-based speech recognition model depends on context to perform recognition, but in an end-to-end speech scenario the speech stream must be translated in real time, so "context information" is lacking. By windowing, the real-time voice data stream is divided into different speech segments, and the segments before and after a given segment can be regarded as its "context information", so that the Attention-based speech recognition model can be applied to end-to-end speech recognition.
However, in related embodiments the Chunk size is fixed: the larger the Chunk, the higher the recognition accuracy but the longer the delay; the smaller the Chunk, the faster the recognition but the lower the accuracy. How to set the Chunk size reasonably therefore becomes an urgent problem to be solved.
For this reason, the embodiments of the present application provide a voice recognition method that can flexibly adjust the Chunk size, taking both recognition speed and recognition quality into account and thereby comprehensively improving the voice recognition performance of end-to-end voice recognition scenarios.
S102: and carrying out object recognition processing on the voice data in the window, carrying out length adjustment on the window according to the object recognition processing result, and determining the voice data in the adjusted window as a target voice section.
When windowing the voice data stream, object recognition processing is performed on the voice data in the window, and the length of the window is adjusted based on the result of the object recognition processing to determine the target voice segment. Here, the length of the window refers to the number of speech frames contained in the window.
In the embodiments of the present application, the window is a virtual concept: it refers to the algorithmic construct used to divide the voice data stream during windowing, not to any physical entity. For example, the window may be a rectangular window (Rectwin), a Hamming window, a Hanning window (Hann), and the like.
The object recognition processing determines whether a recognition object (also called a minimum recognizable unit, or Token) exists in the voice data. It should be understood that a Chunk has recognizable meaning only if it contains one or more recognition objects. The definition of Token differs by usage scenario; for example, a character (Char), a word (Word), or a phoneme may be adopted as the Token.
That is, in a speech recognition scenario the length of a recognition object is uncertain; it is influenced by the speaker's habits, the phonetic characteristics of the words, the scenario in which the voice data stream is produced, and so on. For the speech recognition process, however, it is desirable that each Chunk contains at least one Token, in order to achieve a better recognition result and reduce recognition delay.
In the embodiments of the present application, the length of the window is adjusted according to the object recognition processing result, and the voice data within the adjusted window is determined as the target Chunk; that is, the Chunk size can be flexibly adjusted according to the Token duration so that each Chunk contains at least one Token, improving the speech recognition result.
In some embodiments, the window may be adjusted by extension. The adjusting the length of the window according to the result of the object recognition processing may include:
if the object recognition processing result indicates that the voice data in the window does not have the recognition object, the length of the window is extended until the voice data in the adjusted window has the recognition object;
if the object recognition processing result indicates that the voice data in the window has a recognition object, the length of the window is kept unchanged.
It should be noted that if a Token exists in the voice data within the window, that voice data may be determined as a Chunk and sent to the subsequent processing stage; otherwise, if no Token exists in the voice data within the window, the window length is extended until a recognition object exists in the voice data within the window, at which point the voice data within the adjusted window is determined as a Chunk and sent to the subsequent processing stage.
Further, in some embodiments, the extending the length of the window may include:
and adjusting the length of the window to be the initial length of the preset multiple.
It should be noted that when the window is extended, it may be extended by one initial length each time. Taking an initial length of 5 frames as an example: if no Token exists in the voice data within the window, the window length is increased by 5 frames and the voice data within the window is detected again; if there is still no Token after the 5-frame increase, another 5 frames are added, and so on, until a Token exists in the voice data within the adjusted window.
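For illustration only, the extension loop just described can be sketched in Python as follows; `has_token` stands in for any frame-level detector (for example, the gate network described later) and is a hypothetical name, as is everything else in this sketch.

```python
def grow_window(frames, start, init_len=5, has_token=None):
    """Extend the window by one initial length (5 frames here) at a time
    until the speech data inside it contains at least one Token."""
    end = min(start + init_len, len(frames))
    while not has_token(frames[start:end]) and end < len(frames):
        end = min(end + init_len, len(frames))  # grow by the initial length again
    return frames[start:end]  # candidate Chunk (may still lack a Token at stream end)
```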
In other embodiments, the window may be adjusted by shortening. The adjusting the length of the window according to the result of the object recognition processing may include:
if the object recognition processing result indicates that the voice data in the window has the recognition object, the length of the window is shortened according to the position of the recognition object.
It should be noted that, if Token exists in the speech data in the window, the length of the window may be shortened according to the boundary position of Token, so as to obtain Chunk with less redundant data, and improve the efficiency of speech recognition.
Further, in some embodiments, after determining the target speech segment, the method may further include:
if the length of the adjusted window is different from the initial length of the window, performing length recovery processing on the adjusted window to obtain the window with the initial length;
the window is slid over the stream of speech data and the step of determining the speech data located within the window is performed again.
It should be noted that, after the target voice segment is determined, if the window is subjected to length adjustment, the length of the window needs to be restored to the initial length; the window then continues to slide in the voice data stream, thereby continuing the windowing process.
Here, the sliding distance of the window is generally smaller than the window length, i.e. different Chunks may contain overlapping speech frames, so as to avoid missing a Token.
It should be noted that the criterion for determining whether a recognition object exists in a piece of voice data may be chosen according to the actual application scenario; for example, the ending boundary of a Token may be used as the criterion for Token existence, i.e. a Token is determined to exist in the voice data as long as a boundary exists in one frame of the voice data.
Thus, in some embodiments, performing object recognition processing on speech data within a window may include:
carrying out frame-by-frame detection on the voice data in the window;
if no frame is detected to contain an object boundary, determining that the object recognition processing result indicates that no recognition object exists in the voice data within the window;
and if an object boundary is detected in at least one frame, determining that the object recognition processing result indicates that a recognition object exists in the voice data within the window.
It should be noted that in other embodiments, a criterion requiring the ending boundaries of two Tokens may also be adopted. In that case, a recognition object is determined to be detected only when at least two frames of the voice data within the window are object boundaries.
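As a sketch, the single-boundary criterion above and the stricter two-boundary variant reduce to the following (with `is_boundary` a hypothetical per-frame classifier):

```python
def object_recognition_result(window_frames, is_boundary):
    # A recognition object exists iff at least one frame is an object boundary
    return any(is_boundary(f) for f in window_frames)

def object_recognition_result_strict(window_frames, is_boundary):
    # Stricter variant: require the ending boundaries of two Tokens
    return sum(1 for f in window_frames if is_boundary(f)) >= 2
```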
It should be understood that the object recognition processing may be implemented by a variety of models, such as a gate network, a CTC model, a Scotnet boundary model, a Token duration model, and so forth.
In summary, the embodiments of the present application provide a method that can cut a continuous data stream into data segments (Chunks) of different sizes. The method can be applied to the field of end-to-end voice recognition, and also to the fields of image processing, signal processing, and the like.
Taking speech recognition as an example, the embodiments of the present application further provide several data segment size adjustment methods, such as window extension, window shortening, and window extension plus window shortening; several examples are given below.
Example one:
setting a window with a smaller initial length, for example, the initial length is 5 frames;
windowing the voice data stream by using the window, and carrying out object recognition processing on the voice data positioned in the window frame by frame;
if a certain voice frame is detected as an object boundary, determining the voice data within the window as a Chunk;
if none of the voice frames is an object boundary, increasing the length of the window by 5 frames and returning to the step of performing object recognition processing on the voice data within the window frame by frame, until a certain voice frame is detected as an object boundary; the voice data within the window (i.e. the adjusted window) at that point is determined as a Chunk.
Example two:
setting a window with a larger initial length, for example, the initial length is 15 frames;
performing windowing processing on the voice data stream by using the window, and performing object recognition processing on the voice data positioned in the window frame by frame from a starting point (or an end point);
if a certain voice frame is detected as an object boundary, shortening the ending edge of the window to that voice frame, and determining the voice data within the window as a Chunk.
Example three:
setting a window with a moderate initial length, for example, the initial length is 10 frames;
performing windowing processing on the voice data stream by using the window, and performing object recognition processing on the voice data positioned in the window frame by frame from a starting point (or an end point);
if a certain voice frame is detected as an object boundary, shortening the ending edge of the window to that voice frame, and determining the voice data within the window as a Chunk;
if none of the voice frames is an object boundary, increasing the length of the window by 5 frames and returning to the step of performing object recognition processing on the voice data within the window frame by frame.
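Pulling example three together, a minimal Python sketch of the combined extend-and-shorten loop might look like this. `boundary_frame`, which returns the index of the first detected object boundary in the window (or None), is a hypothetical helper; for simplicity the window here slides by the full Chunk length, whereas the text notes the hop is usually smaller so that Chunks overlap.

```python
def chunk_stream(frames, init_len=10, grow=5, boundary_frame=None):
    """Cut a frame sequence into Chunks: extend the window while no object
    boundary is found, then shorten its ending edge to the boundary frame."""
    chunks, start = [], 0
    while start < len(frames):
        end = min(start + init_len, len(frames))
        # extend: no boundary detected yet and more frames are available
        while (b := boundary_frame(frames[start:end])) is None and end < len(frames):
            end = min(end + grow, len(frames))
        if b is not None:
            end = start + b + 1  # shorten the ending edge to the boundary frame
        chunks.append(frames[start:end])
        start = end  # window length is restored to init_len for the next round
    return chunks
```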
Therefore, with the above methods, Chunks can be determined in the voice data stream, and the Chunk size changes flexibly with the Token duration, so that each Chunk contains at least one Token, improving the efficiency and accuracy of speech recognition.
S103: and carrying out voice recognition processing on the target voice section based on the recognition model to obtain a target recognition result.
It should be noted that, after the target speech segment is obtained, the target speech segment is sent to a subsequent recognition model (or called as a speech recognition model) to obtain a target recognition result.
It should be appreciated that in the context of end-to-end speech recognition, the speech data stream may be windowed into a plurality of speech segments that may be continuously entered into the recognition model. In some embodiments, the recognition model recognizes the target speech segment in combination with the speech segments preceding and following the target speech segment.
Further, in some embodiments, the recognition model includes an encoder and a decoder; the performing, based on the recognition model, speech recognition processing on the target speech segment to obtain a target recognition result may include:
encoding the target speech segment by using an encoder to obtain feature vector information;
and decoding the characteristic vector information by using a decoder to obtain a target identification result.
It should be noted that the encoder-decoder is among the most common neural network architectures. Specifically, the encoder is configured to determine the feature vector information of the target speech segment, and the decoder is configured to perform decoding processing to determine the target recognition result.
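As an illustration of this encoder/decoder split, here is a minimal PyTorch sketch; the use of Transformer layers and all layer sizes are assumptions, since the text only requires some encoder that yields feature vector information and some decoder that turns it into a recognition result.

```python
import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    """Sketch: the encoder turns a Chunk of speech features into feature
    vector information; the decoder turns that into per-Token scores."""
    def __init__(self, feat_dim=80, d_model=256, vocab_size=5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, chunk, history_emb):
        memory = self.encoder(self.proj(chunk))     # feature vector information
        hidden = self.decoder(history_emb, memory)  # decode against the Chunk
        return self.out(hidden)                     # scores over the Token vocabulary

# e.g. RecognitionModel()(torch.randn(1, 30, 80), torch.randn(1, 4, 256))
```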
Here, since the data sent to the encoder is in units of Chunks, the decoder can, in principle, only decode in units of Chunks. However, in the speech recognition field the more efficient and stable decoding unit is the Token; that is, the recognition model combines the Tokens before and after a target Token to recognize it. Therefore, in some embodiments the recognition model further includes at least one position recognition model, and the decoding process performed on the feature vector information by the decoder to obtain the target recognition result includes:
carrying out position identification processing on the target voice segment by utilizing at least one position identification model to obtain at least one boundary position information;
performing confidence screening on at least one boundary position information to determine target boundary position information;
and inputting the target boundary position information and the feature vector information into a decoder to obtain a target identification result.
It should be noted that the boundary position information is used to indicate the specific position of Token in each Chunk. In this way, the decoder can decode in Token units according to the specific positions of tokens, thereby improving the efficiency and correctness of identification.
At present, many position recognition models based on different principles can determine the specific position of a Token in voice data, but each has advantages and disadvantages, and relying on only one of them inevitably reduces accuracy. Therefore, in the embodiments of the present application, several position recognition models may perform the position recognition processing, and the result with the highest confidence is selected to determine the target boundary information. This improves the accuracy of the boundary position information and thereby the accuracy of the recognition result.
Of course, in application scenarios where speed has priority, a single position recognition model may be used for the position recognition processing; in that case the confidence screening step can be skipped and the result of that model used directly to determine the target boundary position information.
Further, the position recognition model includes at least one of: a connectionist temporal classification model, a boundary identification model, and a duration identification model. Performing position recognition processing on the target speech segment with at least one position recognition model to obtain at least one piece of boundary position information includes at least the following:
analyzing the number of objects in the feature vector information using the connectionist temporal classification model, and determining first boundary position information;
performing boundary position analysis processing on the target voice segment by using the boundary identification model to determine second boundary position information;
carrying out object duration analysis processing on the target voice segment by using the duration recognition model to determine third boundary position information;
determining the first boundary position information, the second boundary position information, and the third boundary position information as at least one boundary position information.
It should be noted that the connectionist temporal classification model may be a CTC (Connectionist Temporal Classification) network, which can count the Tokens in the target speech segment and output confidence information. The boundary identification model may be a Scotnet network, which can judge Word boundaries (each Word regarded as one Token) and output confidence information. The duration identification model may be a Token duration network, which can analyze the duration of each Token in the target speech segment and score its confidence. In addition, the CTC network can label each frame of the target speech segment to indicate whether the current frame is Blank, thereby determining the extent of each Token.
In other words, the connectionist temporal classification model, the boundary identification model, and the duration identification model can all produce boundary position information. Therefore, the confidences of the results output by the three models are compared, and the model with the highest confidence is selected to determine the final target boundary information, improving recognition accuracy.
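The confidence screening step itself is simple; a sketch with hypothetical data shapes:

```python
def screen_boundaries(candidates):
    """candidates: (boundary_positions, confidence) pairs, e.g. from the CTC,
    Scotnet and Token duration models; keep the most confident result."""
    boundaries, _ = max(candidates, key=lambda c: c[1])
    return boundaries  # target boundary position information

# screen_boundaries([([5, 12], 0.71), ([6, 12], 0.88), ([5, 11], 0.64)]) -> [6, 12]
```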
In addition, the above position recognition model may also be selectively used for the object recognition processing in step S102.
The embodiments of the present application provide a voice recognition method in which windowing processing is performed on a voice data stream and the voice data within a window is determined; object recognition processing is performed on the voice data within the window, the length of the window is adjusted according to the object recognition processing result, and the voice data within the adjusted window is determined as a target voice segment; and voice recognition processing is performed on the target voice segment based on a recognition model to obtain a target recognition result. Thus, on the one hand, the method can determine the Chunk size according to the Token duration, taking both processing speed and recognition accuracy into account; on the other hand, the recognition model includes several position recognition models and determines the target boundary position information by multi-stage judgment, avoiding the boundary position deviation caused by a single judgment criterion and further improving recognition accuracy.
In an embodiment of the present application, refer to fig. 2, which shows a schematic structural diagram of a speech recognition network 20 provided in an embodiment of the present application. The speech recognition network 20 is capable of performing speech recognition methods as previously described.
As shown in fig. 2, the speech recognition network 20 at least partially includes a speech input module 201, a Chunk setting module 202, a gate network module 203, a Chunk confirmation module 204, an Encoder 205, a connectionist temporal classification model 206, a Decoder 207, and a recognition result output module 208.
The voice input module 201 is configured to collect a voice signal from a terminal device such as a mobile phone or a notebook computer, and form a voice data stream.
A Chunk setting module 202, configured to intercept a smaller Chunk in the voice data stream according to the initial window length. For example, the preset length may be 5 frames.
The gate network module 203 is configured to perform object recognition processing on the Chunk output by the Chunk setting module 202 to determine whether the current Chunk has a recognition object; and if the current Chunk does not have the identification object, adding 5 frames to the current Chunk, and performing the object identification processing again until the identification object is detected.
It should be noted that the gate network module 203 can be built from various network models. Illustratively, it may consist of several fully connected layers or a Convolutional Neural Network (CNN). The input of the gate network module 203 is a segment of the voice data signal; the module judges the signal frame by frame and outputs, for each frame, a judgment of whether the current frame is a Token boundary.
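For example, such a gate network could be sketched in PyTorch as a small per-frame CNN classifier; the exact architecture below is an assumption, consistent only with the "fully connected or CNN layers" description above.

```python
import torch.nn as nn

# Input: (batch, 80, num_frames) speech features;
# output: (batch, 1, num_frames) per-frame probability of a Token boundary.
gate_net = nn.Sequential(
    nn.Conv1d(80, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=1),
    nn.Sigmoid(),
)
```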
A Chunk confirming module 204, configured to confirm the size of the Chunk according to the output result of the gate network.
Here, if the gate network module 203 outputs that the current frame is not a Token boundary, judgment proceeds in a loop to the next frame. If no frame of the current Chunk is judged to be a Token boundary, the Chunk length is increased to 2, 3, 4, ... times the initial length, and the object recognition processing continues on the lengthened Chunk; conversely, if the current frame is a Token boundary, the current Chunk is determined as the target Chunk and input to the encoder 205.
The encoder module 205 is configured to encode the Chunk and output hidden-layer nodes, i.e. the feature vector information of the Chunk.
Here, the Encoder module 205 may be a Conformer-based encoder; for example, the encoder may include a down-sampling module, a positional encoding module, a Self-Attention module, a Feed-forward module, and a CNN module.
The connectionist temporal classification model 206 is configured to identify the specific positions of Tokens within the Chunk, i.e. to determine the target boundary position information.
Here, the connectionist temporal classification model 206 is in essence a position recognition model, which can be implemented with a CTC network. The CTC network can obtain frame-level alignment information and can model the mapping between two sequences of unequal length, because it adds a Blank symbol to the reference symbol set, meaning that a frame has no predicted output. The prediction output of a CTC network may therefore contain many Blank symbols, which are finally removed, together with repeated labels, by blank removal and de-duplication. This property suits the sequence modeling problem in speech recognition research: the character or phoneme sequence of each speech segment is known in advance, but its alignment with the input feature sequence is uncertain, and the character or phoneme sequence is generally much shorter than the input frame sequence, so the CTC network can be used for speech sequence prediction.
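The blank-removal and de-duplication step can be sketched as the standard greedy CTC collapse (shown here purely for illustration):

```python
def ctc_collapse(frame_labels, blank="-"):
    """Collapse a frame-level CTC output: drop Blank symbols and merge
    consecutive repeats of the same label."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != blank and lab != prev:
            out.append(lab)
        prev = lab
    return out

# ctc_collapse(list("--hh-e-ll-lo--")) -> ['h', 'e', 'l', 'l', 'o']
```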
And the decoder 207 is configured to decode the feature vector information according to the target boundary information, and determine a decoding result.
And the recognition result output module 208 is configured to determine a target recognition result according to the decoding result.
Here, the Decoder 207 may be a Decoder, and the recognition result output module 208 may be a scoring algorithm (e.g., Softmax algorithm).
In particular, the Decoder and the recognition result output module may be logically regarded as a whole. Illustratively, the "decoder" of the foregoing embodiments is actually the combination of the decoder 207 and the recognition result output module 208. In other words, the input of the decoder can be considered to be the feature vector information and the target boundary information, and its output the target recognition result.
It should be noted that the decoder 207 may adopt an Attention-based framework; in that case the Token's context information is also needed to better determine the prediction result. Illustratively, the decoder 207 may consist of two Self-Attention modules: the first takes the historical predicted Token sequence (i.e. the context information of the target Token) as input and outputs a vector representing the features of the next Token predicted from that context; the second takes the output of the first module and the output of the Encoder as input, and the Token label is predicted through Softmax to determine the final prediction result.
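With assumed dimensions, that two-stage structure might be sketched as follows (illustrative only; the module and variable names are hypothetical):

```python
import torch.nn as nn

d_model, vocab = 256, 5000
self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out = nn.Linear(d_model, vocab)

def predict_next_token(history_emb, encoder_out):
    h, _ = self_attn(history_emb, history_emb, history_emb)  # stage 1: attend over past Tokens
    h, _ = cross_attn(h, encoder_out, encoder_out)           # stage 2: attend over Encoder output
    return out(h[:, -1]).softmax(-1)                         # Softmax over next-Token labels
```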
In summary, the speech recognition network 20 involves both Chunk determination and speech recognition.
On the one hand, the Chunk determination part comprises the voice input module 201, the Chunk setting module 202, the gate network module 203 and the Chunk confirmation module 204; it is mainly used to flexibly determine the Chunk duration according to the Token duration, taking both recognition speed and recognition efficiency into account.
On the other hand, the speech recognition part (equivalent to the aforementioned recognition model) includes the encoder 205, the connectionist temporal classification model 206, the decoder 207, and the recognition result output module 208. The encoder 205 and the decoder 207 are each composed of Multi-Head self-Attention modules (Multi-Head Attention). The input speech is windowed, features are extracted and fed into the encoder 205; a CTC network produces frame-based classification results (indicating whether each frame is an object boundary); and whenever the CTC network outputs a non-Blank Token, the current frame node is input to the decoder 207 for decoding, predicting the next Token label.
It should be understood that in the related art, dynamic Chunks are often used when training the neural network; after training, a suitable Chunk size must be determined according to the actual application requirements, mainly according to how much delay is acceptable. For example, in an offline scenario the Chunk may be larger; in real-time scenarios the Chunk is reduced as needed. However, this approach has the following disadvantage: during the actual processing of the voice data stream to be processed, the Chunk size is fixed and not adjusted, so a larger Chunk wastes resources while a smaller Chunk degrades recognition performance.
Therefore, in the embodiments of the present application, a modeling unit (comprising the Chunk setting module 202, the gate network module 203 and the Chunk confirmation module 204) is first used to estimate the Token duration; Tokens of different durations then use Chunks of different sizes, and decoding is performed with Chunks of different sizes, which both reduces unnecessary resource waste and improves recognition accuracy.
Specifically, in the embodiments of the present application, a small fixed Chunk is used as input; the Chunk is fed into the gate network to predict the Token duration (the Token may be a Word); the Chunk size is adjusted according to that duration to obtain the target Chunk; the target Chunk is then sent to the encoder for feature extraction; and the hidden-layer information output by the encoder (equivalent to the feature vector information) is input to the Decoder to predict the current Token.
The embodiments of the present application provide a speech recognition method, whose specific implementation is elaborated in this embodiment. It can be seen that when windowing the voice data stream, the window length is flexibly adjusted according to the result of the object recognition processing to obtain target speech segments of different sizes, i.e. the Chunk size is determined according to the Token duration; both recognition speed and recognition quality are taken into account, comprehensively improving the speech recognition performance of end-to-end speech recognition scenarios.
In yet another embodiment, refer to fig. 3, which shows a schematic structural diagram of a speech recognition network 30 provided in the embodiment of the present application. The speech recognition network 30 is capable of performing speech recognition methods as previously described.
As shown in fig. 3, the speech recognition network 30 at least partially includes a speech input module 301, a Chunk setting module 302, a boundary recognition model 303, a duration recognition model 304, an Encoder 305, a connectionist temporal classification model 306, a confidence judgment module 307, a Decoder 308, and a recognition result output module 309.
The voice input module 301 is configured to collect a voice signal from a terminal device such as a mobile phone or a notebook computer, and form a voice data stream.
A Chunk setting module 302, configured to perform windowing on a voice data stream to obtain multiple chunks.
And the boundary identification model 303 is configured to perform boundary identification processing on each Chunk to obtain a boundary identification result and a first confidence degree.
Here, the boundary recognition model 303 may employ a Scotnet network, which may be composed of several fully connected layers or a CNN network. The Scotnet network can judge the voice data in Chunk frame by frame to determine which frame is the boundary of Word; and if the current Chunk has no Word boundary, judging the next Chunk. In the embodiment of the present application, one Word is one Token.
And the duration recognition model 304 is configured to perform Token duration recognition processing on the target Chunk to obtain a duration recognition result and a second confidence level.
Here, the duration recognition model 304 may adopt a Token duration network, and may be composed of several fully connected layers or CNN networks, where the input is a segment of speech signal, and the output scores the Token duration and the confidence of the Token duration in the speech signal.
An Encoder 305, configured to encode the Chunk and output hidden-layer nodes, i.e. the feature vector information of the Chunk.
Here, the Encoder 305 may be a Conformer-based encoder; for example, the encoder may include a down-sampling module, a positional encoding module, a Self-Attention module, a Feed-forward module, and a CNN module.
The connectionist temporal classification model 306 is configured to perform Token-count statistics on each Chunk to obtain a count result and a third confidence.
Here, the connectionist temporal classification model 306 may employ a CTC network. Unlike the Scotnet network and the Token duration network, the CTC network preferentially takes the feature vector information of the Chunk as input.
The confidence judgment module 307 is configured to compare the first confidence, the second confidence and the third confidence, and determine target boundary information.
Here, the Scotnet network, the Token duration network and the CTC network can all be regarded as position recognition models for determining the Token positions in each Chunk. Therefore, for a specific Chunk, confidence comparison can determine which network's position recognition is the most accurate, and the final target boundary information is determined according to the network with the highest confidence.
And a Decoder (Decoder)308 for decoding the feature vector information according to the target boundary information and determining a decoding result.
And an identification result output module 309, configured to determine a target identification result according to the decoding result.
In summary, in a specific embodiment, the number of Tokens may be obtained from the peaks of the CTC curve, the Scotnet network is used to judge the Word boundaries, and the Token duration network is used to judge the Token durations; the target boundary information is then determined according to the network with the highest confidence, improving the accuracy of the recognition result.
In the related art, a single criterion (the number of Tokens, the Word boundary, or the Token duration) is generally used to process the output of the Encoder before it is input to the Decoder for decoding. However, with a single judgment criterion the boundary judgment may be inaccurate, which directly affects the Decoder performance and makes the recognition result inaccurate.
In the embodiments of the present application, the Token count and the Word boundary are screened by confidence, the result with the higher confidence is output, and the Encoder result is confirmed according to the Token duration. With this multi-stage hybrid approach, more accurate Encoder output information can be obtained, improving the recognition result.
The embodiments of the present application provide a voice recognition method, whose specific implementation is elaborated in this embodiment. It can be seen that the recognition model includes several position recognition models and determines the target boundary position information by multi-stage judgment, avoiding the boundary position deviation caused by a single judgment criterion and further improving recognition accuracy.
In a further embodiment of the present application, refer to fig. 4, which shows a schematic structural diagram of a speech recognition apparatus 40 provided in an embodiment of the present application. As shown in fig. 4, the speech recognition apparatus 40 includes a processing unit 401, an adjusting unit 402, and a recognition unit 403; wherein,
a processing unit 401 configured to perform windowing on the voice data stream, and determine voice data located in a window;
an adjusting unit 402, configured to perform object recognition processing on the voice data in the window, perform length adjustment on the window according to the object recognition processing result, and determine the voice data in the adjusted window as a target voice segment;
and a recognition unit 403 configured to perform speech recognition processing on the target speech segment based on the recognition model, and obtain a target recognition result.
In some embodiments, the adjusting unit 402 is specifically configured to, if the object recognition processing result indicates that the speech data in the window does not have the recognition object, extend the length of the window until the speech data in the adjusted window has the recognition object; if the object recognition processing result indicates that the voice data in the window has a recognition object, the length of the window is kept unchanged.
In some embodiments, the adjusting unit 402 is further configured to shorten the length of the window according to the position of the recognition object if the object recognition processing result indicates that the recognition object exists in the speech data in the window.
In some embodiments, the adjusting unit 402 is further configured to, if the length of the adjusted window is different from the initial length of the window, perform length recovery processing on the adjusted window to obtain a window with the initial length;
accordingly, the processing unit 401 is further configured to slide the window over the stream of speech data and to perform the step of determining the speech data located within the window again.
In some embodiments, the adjusting unit 402 is further configured to detect the voice data within the window frame by frame; if no frame is detected to contain an object boundary, determine that the object recognition processing result indicates that no recognition object exists in the voice data within the window; and if an object boundary is detected in at least one frame, determine that the object recognition processing result indicates that a recognition object exists in the voice data within the window.
In some embodiments, the adjusting unit 402 is further configured to adjust the length of the window to a preset multiple of the initial length.
In some embodiments, the recognition model includes an encoder and a decoder; an identifying unit 403, configured to perform encoding processing on the target speech segment by using an encoder, so as to obtain feature vector information;
and decoding the characteristic vector information by using a decoder to obtain a target identification result.
In some embodiments, the recognition model further includes at least one position recognition model, and the recognition unit 403 is further configured to perform position recognition processing on the target speech segment with the at least one position recognition model to obtain at least one piece of boundary position information; perform confidence screening on the at least one piece of boundary position information to determine target boundary position information; and input the target boundary position information and the feature vector information into the decoder to obtain the target recognition result.
In some embodiments, the position recognition model includes at least one of: a connectionist temporal classification model, a boundary identification model, and a duration identification model. The recognition unit 403 is specifically configured to perform object-count analysis processing on the feature vector information with the connectionist temporal classification model to determine first boundary position information; perform boundary position analysis processing on the target speech segment with the boundary identification model to determine second boundary position information; perform object duration analysis processing on the target speech segment with the duration identification model to determine third boundary position information; and determine the first boundary position information, the second boundary position information, and the third boundary position information as the at least one piece of boundary position information.
It is understood that in this embodiment, a "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc., and may also be a module, or may also be non-modular. Moreover, each component in the embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on this understanding, the technical solution of this embodiment, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Accordingly, this embodiment provides a computer storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any one of the preceding embodiments.
Based on the above components of the speech recognition device 40 and the computer storage medium, refer to fig. 5, which shows a schematic hardware structure of an electronic device 50 according to an embodiment of the present application. As shown in fig. 5, the electronic device 50 may include a communication interface 501, a memory 502, and a processor 503, with the various components coupled together through a bus system 504. It is understood that the bus system 504 is used to enable connection and communication between these components. In addition to a data bus, the bus system 504 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as the bus system 504 in fig. 5. The communication interface 501 is used for receiving and sending signals while information is received from and sent to other external network elements;
a memory 502 for storing a computer program capable of running on the processor 503;
a processor 503, configured to execute, when running the computer program:
windowing the voice data stream, and determining voice data in a window;
carrying out object recognition processing on the voice data in the window, carrying out length adjustment on the window according to the object recognition processing result, and determining the voice data in the adjusted window as a target voice section;
and carrying out voice recognition processing on the target voice section based on a recognition model to obtain a target recognition result.
It will be appreciated that the memory 502 in the embodiments of the present application can be volatile memory, non-volatile memory, or both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 502 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The processor 503 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 503. The processor 503 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EEPROM, or a register. The storage medium is located in the memory 502, and the processor 503 reads the information in the memory 502 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described in this application, or a combination thereof.
For a software implementation, the techniques described in this application may be implemented with modules (e.g., procedures, functions, and so on) that perform the corresponding functions. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Optionally, as another embodiment, the processor 503 is further configured to perform the steps of the method of any one of the preceding embodiments when running the computer program.
In a further embodiment of the present application, based on the above composition of the speech recognition apparatus 40 and referring to fig. 6, a composition structure diagram of another electronic device 50 provided in an embodiment of the present application is shown. As shown in fig. 6, the electronic device 50 includes at least the speech recognition apparatus 40 of any of the previous embodiments.
Since the electronic device 50 includes the speech recognition apparatus 40, when the voice data stream is windowed, the window length is flexibly adjusted according to the result of the object recognition processing to obtain target speech segments of different sizes. This balances recognition speed against recognition quality and thereby comprehensively improves the speech recognition performance of end-to-end speech recognition scenarios.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.
It should be noted that, in the present application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech recognition, the method comprising:
performing windowing processing on a voice data stream, and determining the voice data located within a window;
performing object recognition processing on the voice data within the window, adjusting the length of the window according to the object recognition processing result, and determining the voice data within the adjusted window as a target speech segment;
and performing speech recognition processing on the target speech segment based on a recognition model to obtain a target recognition result.
2. The speech recognition method according to claim 1, wherein the adjusting the length of the window according to the object recognition processing result comprises:
if the object recognition processing result indicates that no recognition object exists in the voice data within the window, extending the length of the window until a recognition object exists in the voice data within the adjusted window;
and if the object recognition processing result indicates that a recognition object exists in the voice data within the window, keeping the length of the window unchanged.
3. The speech recognition method according to claim 1, wherein the adjusting the length of the window according to the object recognition processing result further comprises:
if the object recognition processing result indicates that a recognition object exists in the voice data within the window, shortening the length of the window according to the position of the recognition object.
4. The speech recognition method according to claim 2 or 3, wherein, after the target speech segment is determined, the method further comprises:
if the length of the adjusted window differs from the initial length of the window, performing length recovery processing on the adjusted window to obtain a window of the initial length;
and sliding the window over the voice data stream, and performing the step of determining the voice data located within the window again.
5. The speech recognition method according to claim 2, wherein the performing object recognition processing on the voice data within the window comprises:
performing frame-by-frame detection on the voice data within the window;
if no object boundary is detected in any frame, determining that the object recognition processing result indicates that no recognition object exists in the voice data within the window;
and if an object boundary is detected in at least one frame, determining that the object recognition processing result indicates that a recognition object exists in the voice data within the window.
6. The speech recognition method according to claim 2, wherein the extending the length of the window comprises:
adjusting the length of the window to a preset multiple of its initial length.
7. The speech recognition method according to claim 1, wherein the recognition model comprises an encoder and a decoder, and the performing speech recognition processing on the target speech segment based on the recognition model to obtain the target recognition result comprises:
encoding the target speech segment by using the encoder to obtain feature vector information;
and decoding the feature vector information by using the decoder to obtain the target recognition result.
8. The speech recognition method according to claim 7, wherein the recognition model further comprises at least one position recognition model, and the decoding the feature vector information by using the decoder to obtain the target recognition result at least comprises:
performing position recognition processing on the target speech segment by using the at least one position recognition model to obtain at least one piece of boundary position information;
performing confidence screening on the at least one piece of boundary position information to determine target boundary position information;
and inputting the target boundary position information and the feature vector information into the decoder to obtain the target recognition result.
9. The speech recognition method according to claim 8, wherein the position recognition model comprises at least one of: a connectionist temporal classification model, a boundary recognition model, and a duration recognition model; and the performing position recognition processing on the target speech segment by using the at least one position recognition model to obtain the at least one piece of boundary position information at least comprises:
performing object-quantity analysis on the feature vector information by using the connectionist temporal classification model to determine first boundary position information;
performing boundary position analysis on the target speech segment by using the boundary recognition model to determine second boundary position information;
performing object duration analysis on the target speech segment by using the duration recognition model to determine third boundary position information;
and determining the first boundary position information, the second boundary position information, and the third boundary position information as the at least one piece of boundary position information.
10. A speech recognition apparatus, comprising a processing unit, an adjusting unit, and a recognition unit; wherein
the processing unit is configured to perform windowing processing on a voice data stream and determine the voice data located within a window;
the adjusting unit is configured to perform object recognition processing on the voice data within the window, adjust the length of the window according to the object recognition processing result, and determine the voice data within the adjusted window as a target speech segment;
and the recognition unit is configured to perform speech recognition processing on the target speech segment based on a recognition model to obtain a target recognition result.
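
For readers who want to see how claims 7 to 9 fit together, the following is a minimal Python sketch under stated assumptions: the encoder, the decoder, and the position recognition models are plain callables invented here for illustration, and each position model is assumed to return a (boundary positions, confidence) pair. None of these names or signatures come from the patent itself.

from typing import Callable, Sequence

def decode_with_position_models(
    encoder: Callable,                    # claim 7: target speech segment -> feature vectors
    decoder: Callable,                    # claim 7: (features, boundaries) -> recognition result
    position_models: Sequence[Callable],  # claim 9: e.g. CTC, boundary, and duration models
    segment,
) -> str:
    # Claim 7: encode the target speech segment into feature vector information.
    features = encoder(segment)
    # Claim 9: each position model proposes one piece of boundary position
    # information, assumed here to arrive as a (boundaries, confidence) pair.
    proposals = [model(segment, features) for model in position_models]
    # Claim 8: confidence screening -- keep the highest-confidence proposal
    # as the target boundary position information.
    target_boundaries, _ = max(proposals, key=lambda p: p[1])
    # Claim 8: the target boundary position information and the feature
    # vector information are input into the decoder together.
    return decoder(features, target_boundaries)

Under these assumptions, adding or removing any one position model only changes the proposals list, which matches the "at least one of" wording of claim 9.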
CN202111067913.1A 2021-09-13 2021-09-13 Voice recognition method and voice recognition device Pending CN113851111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111067913.1A CN113851111A (en) 2021-09-13 2021-09-13 Voice recognition method and voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111067913.1A CN113851111A (en) 2021-09-13 2021-09-13 Voice recognition method and voice recognition device

Publications (1)

Publication Number Publication Date
CN113851111A true CN113851111A (en) 2021-12-28

Family

ID=78973830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111067913.1A Pending CN113851111A (en) 2021-09-13 2021-09-13 Voice recognition method and voice recognition device

Country Status (1)

Country Link
CN (1) CN113851111A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363466A (en) * 2022-03-22 2022-04-15 长沙居美网络科技有限公司 Intelligent cloud calling system based on AI
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device


Similar Documents

Publication Publication Date Title
US20200402500A1 (en) Method and device for generating speech recognition model and storage medium
US11776531B2 (en) Encoder-decoder models for sequence to sequence mapping
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
JP5901001B1 (en) Method and device for acoustic language model training
JP6677419B2 (en) Voice interaction method and apparatus
US11749259B2 (en) Proper noun recognition in end-to-end speech recognition
CN110097870B (en) Voice processing method, device, equipment and storage medium
CN111429887B (en) Speech keyword recognition method, device and equipment based on end-to-end
CN113851111A (en) Voice recognition method and voice recognition device
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
CN115004296A (en) Two-wheeled end-to-end speech recognition based on consultation model
US20230186901A1 (en) Attention-Based Joint Acoustic and Text On-Device End-to-End Model
WO2023130951A1 (en) Speech sentence segmentation method and apparatus, electronic device, and storage medium
CN108682415B (en) Voice search method, device and system
US20220399013A1 (en) Response method, terminal, and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
US12057124B2 (en) Reducing streaming ASR model delay with self alignment
CN115148211A (en) Audio sensitive content detection method, computer device and computer program product
CN111951785B (en) Voice recognition method and device and terminal equipment
US12027154B2 (en) Emitting word timings with end-to-end models
CN116364061A (en) Multi-scene voice recognition method, device, computer equipment and storage medium
CN117581233A (en) Artificial intelligence system for sequence-to-sequence processing with dual causal and non-causal limited self-attention for streaming applications
CN111785259A (en) Information processing method and device and electronic equipment
US20230252225A1 (en) Automatic Text Summarisation Post-processing for Removal of Erroneous Sentences
CN118430522A (en) Speech recognition method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination