CN111128191A - Online end-to-end voice transcription method and system - Google Patents

Online end-to-end voice transcription method and system Download PDF

Info

Publication number
CN111128191A
Authority
CN
China
Prior art keywords
layer
network
chinese character
sequence
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911415035.0A
Other languages
Chinese (zh)
Other versions
CN111128191B (en)
Inventor
张鹏远
缪浩然
程高峰
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201911415035.0A priority Critical patent/CN111128191B/en
Publication of CN111128191A publication Critical patent/CN111128191A/en
Application granted granted Critical
Publication of CN111128191B publication Critical patent/CN111128191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The invention provides an online end-to-end voice transcription method and system. In one embodiment, acoustic features are extracted from an audio file; the acoustic features are subjected to nonlinear transformation and down-sampling to output a first feature sequence; the first feature sequence is partitioned into blocks, each block is input into an encoder in turn, and multiple groups of second feature sequences are output; the second feature sequences are modeled, and multiple groups of Chinese character sequences are output and scored; the Chinese character sequence with the highest score is taken as the final transcription result. The encoder structure is improved so that partitioned audio can be processed, and the decoder structure is improved so that Chinese characters are output from truncated audio, so that text is transcribed while the audio is still being input.

Description

Online end-to-end voice transcription method and system
Technical Field
The invention relates to the technical field of voice transcription, in particular to an online end-to-end voice transcription method and system.
Background
The voice transcription technology is an important technology for converting input audio into text, and is also an important research content in the field of human-computer interaction.
Traditional speech transcription technology comprises an acoustic model, a pronunciation dictionary and a language model, and builds a complex decoding network with a weighted finite-state transducer to convert an acoustic feature sequence into a text sequence. The emerging end-to-end speech transcription technology instead uses a single neural network model to convert acoustic features directly into a text sequence, which greatly simplifies the decoding flow of voice transcription. However, current high-performance end-to-end voice transcription must wait for the complete audio input before it can begin converting to a text sequence, which limits the application of end-to-end voice transcription to online, real-time transcription tasks.
Disclosure of Invention
In view of this, the embodiments of the present application provide an online end-to-end voice transcription method and system, which overcome the problem that existing end-to-end voice transcription technology cannot be applied to real-time online transcription tasks. By improving the encoder and decoder structures of the end-to-end model, conversion into a text sequence can begin without relying on the complete audio.
In a first aspect, the present invention provides an online end-to-end voice transcription method, including:
acquiring an audio file, and extracting acoustic features from the audio file;
performing nonlinear transformation and down-sampling on the acoustic features and outputting a first feature sequence;
partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
and taking the Chinese character sequence with the highest score as a final transcription result.
Optionally, the obtaining the audio file, and extracting the acoustic features from the audio file includes:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
Optionally, the processing the second feature sequence, outputting a plurality of groups of chinese character sequences, and scoring the plurality of groups of chinese character sequences includes:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
Optionally, the modeling the second feature sequence by the decoder, and scoring the output multiple groups of chinese character sequences includes:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
In a second aspect, the present invention provides an online end-to-end voice transcription system, including:
a collecting unit: the system is used for acquiring audio and extracting acoustic features from the audio;
a processing unit: the acoustic feature extraction unit is used for carrying out nonlinear transformation and down sampling on the acoustic features extracted by the acquisition unit and outputting a first feature sequence; partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
the processing unit is further used for modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
an output unit: and the Chinese character sequence with the highest score in the Chinese character sequences output by the processing unit is used as a final transcription result and is output.
Optionally, extracting acoustic features from the audio comprises:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
Optionally, the processing the second feature sequence, outputting a plurality of groups of chinese character sequences, and scoring the plurality of groups of chinese character sequences includes:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
Optionally, the modeling the second feature sequence by the decoder, and scoring the output multiple groups of chinese character sequences includes:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
The embodiments of the application provide an online end-to-end voice transcription method and system. In one embodiment, logarithmic Mel-spectrum features are extracted from the audio as frame-level acoustic features; a front-end neural network is constructed to perform nonlinear transformation and down-sampling on the logarithmic Mel-spectrum features; an online encoder based on the self-attention mechanism is constructed to model the output feature sequence of the front-end neural network and output a new group of feature sequences; an online decoder based on the self-attention mechanism is constructed to model the feature sequence output by the encoder and output multiple groups of Chinese character sequences; and a beam search algorithm finds the character sequence with the highest score, which is taken as the final transcription result. The encoder structure is improved so that partitioned audio can be processed, and the decoder structure is improved so that Chinese characters are output from truncated audio, so that text is transcribed while the audio is still being input.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of an online end-to-end voice transcription system according to the present invention;
FIG. 2 is a flow chart of a method for on-line end-to-end voice transcription in accordance with the present invention;
FIG. 3 is a flow chart of a process for an online encoder based on a self-attention mechanism for a sequence of features input thereto;
fig. 4 is a flow chart of the processing of a feature sequence input to an online decoder based on a self-attention mechanism.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Fig. 1 is a schematic structural diagram of an online end-to-end voice transcription system according to the present invention. Referring to fig. 1, the online end-to-end voice transcription system in the embodiment of the present invention includes: an acquisition unit 101, a processing unit 102 and an output unit 103.
The acquisition unit 101 is configured to acquire an audio signal and to pre-emphasize it by passing it through a high-pass filter, boosting the high-frequency portion of the audio signal.
The filtered audio signal is divided into frames of 25 ms with a 10 ms frame shift, and each frame is windowed with a Hamming window. A fast Fourier transform is then applied to each frame to obtain its spectrum, from which the energy spectrum of each frame is derived. The energy of each frame's energy spectrum passing through the Mel filters is computed and the logarithm is taken to obtain the logarithmic Mel spectrum; with 80 Mel filters, each frame yields an 80-dimensional logarithmic Mel-spectrum feature.
The processing unit 102 includes: a first processing unit 1021, a second processing unit 1022 and a third processing unit 1023.
The first processing unit 1021 is used to build a front-end neural network. The front-end neural network comprises two two-dimensional convolution layers, a linear layer and a position coding layer. The convolution kernels have size 3 and stride 2, with 256 kernels per layer; after the two two-dimensional convolution layers the feature sequence is one quarter of its original length; the linear layer projects the convolution output features to 256 dimensions; and the position coding layer adds the 256-dimensional position features to the output features of the linear layer.
The logarithmic Mel-spectrum feature sequence extracted by the acquisition unit 101 is input into the front-end neural network for nonlinear transformation and down-sampling; the output sequence is one quarter of the original length.
The second processing unit 1022 is configured to construct an on-line encoder based on the self-attention mechanism, where the encoder is composed of 12 identical sub-modules stacked together, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network, and a layer normalization network stacked together.
The feature sequence output from the front-end neural network is partitioned into blocks, each block is input into the online encoder in turn, and multiple groups of new feature sequences are output. The feature sequence output from the online encoder has the same length as the feature sequence input to it.
The third processing unit 1023 is used to construct an online decoder based on the self-attention mechanism. The decoder is a stack of 6 identical sub-modules, each of which consists in turn of a self-attention network, a residual network, a layer normalization network, a truncated attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
Before the output features of the online encoder are input into the online decoder, a start symbol is fed to the online decoder as its starting point: the word embedding of the start symbol is added to the position feature and input into the self-attention-based online decoder.
The output unit 103 is configured to output the chinese character sequence with the highest score in the chinese character sequences output by the processing unit 102 as a final transcription result.
In one possible embodiment, the output unit 103 uses a beam search algorithm to control the number of Chinese characters output by the third processing unit 1023 at each step: the Chinese characters are sorted by score from high to low, and the top ten are fed back into the decoder to output the next Chinese character, until the decoder outputs the terminator. The Chinese character sequence with the highest score is then taken as the final transcription result.
Fig. 2 is a flowchart of an online end-to-end voice transcription method according to the present invention, and referring to fig. 2, an online end-to-end voice transcription method includes steps S201 to S205:
step S201: and acquiring an audio file, and extracting acoustic features of the audio file.
Logarithmic Mel-spectrum features are extracted from the audio as frame-level acoustic features. Specifically, pre-emphasis is applied to the acquired audio file to boost its high-frequency part, i.e. the speech signal in the audio file is passed through a high-pass filter:
H(z) = 1 - 0.97z^(-1)    (1)
the audio in the audio file is framed and windowed, wherein each frame is 25 ms, the frame is shifted by 10 ms, and the window function is a Hamming window.
And performing fast Fourier transform on each frame to obtain a frequency spectrum of each frame, and further processing the frequency spectrum of each frame to obtain an energy spectrum of each frame.
And calculating the energy of the energy spectrum of each frame passing through the Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum, wherein the number of the Mel filters is 80, so that each frame obtains 80-dimensional logarithmic Mel spectrum characteristics.
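Purely as an illustration of step S201 (not text from the patent), a minimal NumPy sketch of this pipeline, pre-emphasis with the filter of formula (1), 25 ms frames with a 10 ms shift, Hamming windowing, FFT energy spectra and an 80-filter log-Mel projection, might look as follows; the helper mel_filterbank, the 16 kHz sample rate and the 512-point FFT are illustrative assumptions.

```python
import numpy as np


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filterbank (illustrative helper, not from the patent)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def log_mel_features(signal, sample_rate=16000, n_mels=80, n_fft=512):
    """80-dim log-Mel features: pre-emphasis, 25 ms frames / 10 ms shift, Hamming, FFT."""
    # Pre-emphasis: H(z) = 1 - 0.97 z^(-1), i.e. y[n] = x[n] - 0.97 x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_len = int(0.025 * sample_rate)    # 25 ms
    frame_shift = int(0.010 * sample_rate)  # 10 ms
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)

    fbank = mel_filterbank(n_mels, n_fft, sample_rate)
    features = np.zeros((n_frames, n_mels))
    for t in range(n_frames):
        frame = emphasized[t * frame_shift:t * frame_shift + frame_len] * window
        spectrum = np.fft.rfft(frame, n_fft)
        energy = np.abs(spectrum) ** 2                  # energy spectrum of the frame
        features[t] = np.log(fbank @ energy + 1e-10)    # 80-dim log-Mel feature
    return features
```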
Step S202: the acoustic features are non-linearly transformed and down-sampled and a first feature sequence is output.
The extracted acoustic features are input into a front-end neural network, which performs nonlinear transformation and down-sampling on them and outputs a first feature sequence; the output is one quarter of the original length.
In one possible embodiment, the constructed front-end neural network comprises two two-dimensional convolution layers, a linear layer and a position coding layer. The convolution kernels have size 3 and stride 2, with 256 kernels per layer; after the two two-dimensional convolution layers the length of the feature sequence becomes one quarter of the original; the linear layer projects the convolution output features to 256 dimensions; and the position coding layer adds the output features of the linear layer to the 256-dimensional position features. The overall computation is:
Y = ReLU(Conv(ReLU(Conv(X))))    (2)
z_i = Linear(y_i) + p_i    (3)
where Conv(·) denotes a convolution layer and ReLU(·) denotes the activation function
ReLU(x) = max(0, x)    (4)
Linear(·) denotes the linear layer; X and Y denote the logarithmic Mel-spectrum feature sequence and the output feature sequence of the two two-dimensional convolution layers, respectively; and y_i, p_i and z_i denote the i-th output feature of the convolution layers, the i-th position feature and the i-th output feature of the front-end neural network, respectively. Each dimension of the position feature is computed as:
p_{i,2k+1} = sin(i / 10000^(k/128))    (5)
p_{i,2k+2} = cos(i / 10000^(k/128))    (6)
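The following PyTorch sketch illustrates one way such a front end (formulas (2) to (6)) could be assembled. It is not the patented implementation: the padding, the assumed 80-dimensional log-Mel input, and the class and parameter names are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FrontEnd(nn.Module):
    """Two stride-2 Conv2d layers, a linear projection to 256 dims, and sinusoidal
    position features, mirroring formulas (2)-(6) (illustrative sketch)."""

    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Assumes 80-dim log-Mel input; two stride-2 convs leave 20 frequency bins.
        self.linear = nn.Linear(d_model * 20, d_model)
        # p_{i,2k+1} = sin(i / 10000^(k/128)), p_{i,2k+2} = cos(i / 10000^(k/128))
        pos = torch.arange(max_len).unsqueeze(1).float()       # i
        k = torch.arange(d_model // 2).float()                 # k = 0 .. 127
        angle = pos / torch.pow(10000.0, k / (d_model // 2))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, frames, 80) log-Mel features
        y = self.conv(x.unsqueeze(1))                  # (batch, 256, frames/4, 20)
        b, c, t, f = y.shape
        y = y.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.linear(y) + self.pe[:t]            # z_i = Linear(y_i) + p_i
```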
step S203: and partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder, and outputting a plurality of groups of second characteristic sequences.
In one possible embodiment, each block of the feature sequence is input into an online encoder based on the self-attention mechanism. The encoder is formed by stacking 12 identical sub-modules; each sub-module consists in turn of a self-attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network, and the output features of every layer are 256-dimensional.
The first feature sequence is partitioned into blocks of length 64. Each block of the feature sequence is input into the encoder for processing; the processing flow, shown in fig. 3 (and sketched in code after this list of steps), comprises steps S2031 to S2038:
step S2031: inputting each feature sequence into the self-attention network of the first module of the encoder, and calculating as follows:
SAN(Q,K,V)=[head1,…,head4]WO(7)
head_i = softmax( (Q W_i^Q)(K W_i^K)^T / sqrt(d) ) (V W_i^V)    (8)
where W_i^Q, W_i^K, W_i^V and W^O are parameter matrices; Q is the current block of the feature sequence with 64 future input features spliced on its right side; K is the current block with 64 historical input features spliced on its left side and 64 future input features spliced on its right side; and V is the same as K.
Step S2032: the input and output characteristics from the attention network are added as the output characteristics of the residual network.
Step S2033: the output characteristics of the residual error network are input into the normalized network, and the calculation is as follows:
Figure BDA0002350970320000082
Figure BDA0002350970320000083
Figure BDA0002350970320000084
wherein the mean μ and variance var are calculated for each frame of input features h, by means of model parameters wiAnd biCarrying out regular and linear transformation on each dimension value of h and outputting a new characteristic sequence
Figure BDA0002350970320000085
Step S2034: inputting the output characteristics of the layer normalized network into the full-connection network, wherein the calculation formula of the network is as follows:
F(x)=max(0,xW1+b1)W2+b2(12)
step S2035: the input and output characteristics of the fully connected network are added as the output characteristics of the residual network.
Step S2036: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2037: judging whether the current module is the last sub-module or not, and if the current module is the last sub-module, ending the calculation flow of the encoder; otherwise, step S2038 is performed.
Step S2038: the save layer normalizes the output characteristics of the network as input from the attention network for the next sub-module and performs step S2032.
And taking the output characteristics of the storage layer normalized network as a history frame for splicing in the next characteristic processing process, and inputting the output characteristics into the next submodule.
In one possible embodiment, the first feature sequence input to the online encoder and the second feature sequence output by the online encoder have the same length.
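As an illustration of steps S2031 to S2038, the sketch below shows one encoder sub-module with the history and future splicing described above. It is a simplified sketch rather than the patented implementation; the feed-forward width, the use of torch.nn.MultiheadAttention and the handling of the spliced future frames are assumptions.

```python
import torch
import torch.nn as nn


class EncoderSubModule(nn.Module):
    """One encoder sub-module: self-attention, residual, layer norm,
    feed-forward (as in formula (12)), residual, layer norm."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, block, history, future):
        # Q: current block with 64 future frames spliced on the right.
        q = torch.cat([block, future], dim=1)
        # K = V: 64 history frames + current block + 64 future frames.
        kv = torch.cat([history, block, future], dim=1)
        attn_out, _ = self.attn(q, kv, kv)
        x = self.norm1(q + attn_out)        # residual + layer norm (S2032-S2033)
        x = self.norm2(x + self.ffn(x))     # feed-forward + residual + layer norm
        # Only the part corresponding to the current block is passed on.
        return x[:, :block.size(1)]
```

Twelve such sub-modules would be stacked, and each sub-module's layer-normalized output would also be cached as the 64 history frames spliced when the next block is processed (step S2038).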
Step S204: and processing the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences.
The second feature sequence output by the online encoder is modeled by an online decoder. The online decoder based on the self-attention mechanism is formed by stacking 6 identical sub-modules; each sub-module consists in turn of a self-attention network, a residual network, a layer normalization network, a truncated attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network, and the output features of every layer are 256-dimensional.
The second feature sequence output by the online encoder is processed by the online decoder; the processing flow, shown in fig. 4, comprises the following steps:
step S2041: the character sequence formed by adding the word embedding character and the position character of the starting symbol and the character sequence output by the decoder are input into the attention network. The calculation formula in the self-attention network is the same as the formula (7) and the formula (8). Wherein Q represents an input feature, K represents a feature sequence composed of input features of the current history and all histories, and V and K are the same.
Step S2042: the input and output characteristics from the attention network are added as the output characteristics of the residual network.
Step S2043: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2044: ith output characteristic q of a layer-normalized networkiInputting the truncated attention network, the calculation formula is as follows, and the output characteristic k of each encoderjSequentially calculating j ═ 1,2, …:
e_{i,j} = q_i k_j^T / sqrt(d)    (13)
p_{i,j} = sigmoid(e_{i,j})    (14)
When some j satisfies p_{i,j} > 0.5 and j is greater than the previous truncation point, it is taken as the truncation point t_i of the current truncated attention network, and the output feature is computed as:
c_i = Σ_{j=1}^{t_i} ( exp(e_{i,j}) / Σ_{l=1}^{t_i} exp(e_{i,l}) ) k_j    (15)
step S2045: adding the input and output characteristics of the truncated attention network to serve as the output characteristics of the residual error network;
step S2046: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2047: and inputting the output characteristics of the layer normalized network into the full-connection network, wherein the calculation formula is the same as that in the step S2034.
Step S2048: and adding the input and output characteristics of the fully-connected network to be used as the output characteristics of the residual error network, and inputting the output characteristics of the residual error network into the normalized network. The calculation formula is the same as that in step S2033;
step S2049: judging whether the current module is the last sub-module, if so, executing the step S20411; otherwise, step S20410 is executed.
Step S20410: the saving layer normalizes the output characteristics of the network as the input from the attention network of the next submodule, and performs step S2042.
Step S20411: the output characteristics of the layer normalized network are input into a Chinese character classifier, a plurality of groups of Chinese characters and corresponding scores are output, each Chinese character is added with the word embedding and position characteristics, and an input decoder predicts the next Chinese character.
Step S205: and taking the Chinese character sequence with the highest score as a final transcription result.
A beam search algorithm controls the number of Chinese characters output by the decoder at each step: the Chinese characters are sorted by score from high to low, and the top ten are each input back into the decoder to output the next Chinese character, until the decoder outputs the terminator. After decoding finishes, the Chinese character sequence with the highest score is taken as the final transcription result.
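A minimal sketch of this beam search (beam width ten, as in the "top ten" above) is given below, assuming a decoder callable that returns log-probabilities over the Chinese-character vocabulary for the next position; the interface and parameter names are assumptions made for the example.

```python
import torch


def beam_search(decoder, enc_out, sos_id, eos_id, beam=10, max_len=100):
    """Keep the ten best hypotheses at each step; stop a hypothesis at the terminator."""
    hyps = [([sos_id], 0.0)]                 # (character sequence, cumulative score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in hyps:
            log_probs = decoder(enc_out, torch.tensor(prefix))   # (vocab_size,)
            top_scores, top_ids = log_probs.topk(beam)
            for s, c in zip(top_scores.tolist(), top_ids.tolist()):
                candidates.append((prefix + [c], score + s))
        candidates.sort(key=lambda h: h[1], reverse=True)
        hyps = []
        for prefix, score in candidates[:beam]:
            (finished if prefix[-1] == eos_id else hyps).append((prefix, score))
        if not hyps:                          # every surviving hypothesis has ended
            break
    best = max(finished or hyps, key=lambda h: h[1])
    return best[0][1:]                        # drop the start symbol
```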
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (10)

1. An online end-to-end voice transcription method comprises the following steps:
acquiring an audio file, and extracting acoustic features from the audio file;
performing nonlinear transformation and down-sampling on the acoustic features and outputting a first feature sequence;
partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
and taking the Chinese character sequence with the highest score as a final transcription result.
2. The method of claim 1, wherein the obtaining an audio file and the extracting acoustic features from the audio file comprises:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
3. The method of claim 1, wherein the encoder is an on-line encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
4. The method of claim 1, wherein the processing the second signature sequence, outputting a plurality of chinese character sequences and scoring the plurality of chinese character sequences comprises:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
5. The method of claim 4, wherein the decoder modeling the second token sequence and scoring the output plurality of kanji sequences comprises:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
6. An online end-to-end voice transcription system, comprising:
a collecting unit: the system is used for acquiring audio and extracting acoustic features from the audio;
a processing unit: the acoustic feature extraction unit is used for carrying out nonlinear transformation and down sampling on the acoustic features extracted by the acquisition unit and outputting a first feature sequence; partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
the processing unit is further used for modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
an output unit: and the Chinese character sequence with the highest score in the Chinese character sequences output by the processing unit is used as a final transcription result and is output.
7. The system of claim 6, wherein extracting acoustic features from the audio comprises:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
8. The system of claim 6, wherein the encoder is an on-line encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
9. The system of claim 6, wherein said processing the second token sequence, outputting a plurality of chinese character sequences and scoring the plurality of chinese character sequences comprises:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
10. The system of claim 9, wherein the decoder models the second token sequence and scoring the output plurality of kanji sequences comprises:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
CN201911415035.0A 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system Active CN111128191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911415035.0A CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911415035.0A CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Publications (2)

Publication Number Publication Date
CN111128191A true CN111128191A (en) 2020-05-08
CN111128191B CN111128191B (en) 2023-03-28

Family

ID=70506638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911415035.0A Active CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Country Status (1)

Country Link
CN (1) CN111128191B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493812A (en) * 2009-03-06 2009-07-29 中国科学院软件研究所 Tone-character conversion method
CN107590135A (en) * 2016-07-07 2018-01-16 三星电子株式会社 Automatic translating method, equipment and system
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
US20190236451A1 (en) * 2016-10-10 2019-08-01 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493812A (en) * 2009-03-06 2009-07-29 中国科学院软件研究所 Tone-character conversion method
CN107590135A (en) * 2016-07-07 2018-01-16 三星电子株式会社 Automatic translating method, equipment and system
US20190236451A1 (en) * 2016-10-10 2019-08-01 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VASWANI, ASHISH et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof

Also Published As

Publication number Publication date
CN111128191B (en) 2023-03-28

Similar Documents

Publication Publication Date Title
US11948066B2 (en) Processing sequences using convolutional neural networks
EP3680894B1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
CN111477221B (en) Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN112037798B (en) Voice recognition method and system based on trigger type non-autoregressive model
CN111048082B (en) Improved end-to-end speech recognition method
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
EP4018437B1 (en) Optimizing a keyword spotting system
CN111339278B (en) Method and device for generating training speech generating model and method and device for generating answer speech
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
WO2009075990A1 (en) Grapheme-to-phoneme conversion using acoustic data
CN111916058A (en) Voice recognition method and system based on incremental word graph re-scoring
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
CN111798840A (en) Voice keyword recognition method and device
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113327585B (en) Automatic voice recognition method based on deep neural network
CN111128191B (en) Online end-to-end voice transcription method and system
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113160828A (en) Intelligent auxiliary robot interaction method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant