CN111128191A - Online end-to-end voice transcription method and system - Google Patents

Online end-to-end voice transcription method and system Download PDF

Info

Publication number
CN111128191A
Authority
CN
China
Prior art keywords
layer
network
chinese character
sequence
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911415035.0A
Other languages
Chinese (zh)
Other versions
CN111128191B (en)
Inventor
张鹏远
缪浩然
程高峰
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201911415035.0A priority Critical patent/CN111128191B/en
Publication of CN111128191A publication Critical patent/CN111128191A/en
Application granted granted Critical
Publication of CN111128191B publication Critical patent/CN111128191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The invention provides an online end-to-end voice transcription method and system. In one embodiment, acoustic features are extracted from an audio file; the acoustic features are subjected to nonlinear transformation and down-sampling to output a first feature sequence; the first feature sequence is partitioned into blocks, each block is input into an encoder in turn, and multiple groups of second feature sequences are output; the second feature sequences are modeled, and multiple groups of Chinese character sequences are output and scored; the Chinese character sequence with the highest score is taken as the final transcription result. The encoder structure is improved so that partitioned audio can be processed, and the decoder structure is improved so that Chinese characters are output from truncated audio, so that text is transcribed while the audio is still being input.

Description

Online end-to-end voice transcription method and system
Technical Field
The invention relates to the technical field of voice transcription, in particular to an online end-to-end voice transcription method and system.
Background
The voice transcription technology is an important technology for converting input audio into text, and is also an important research content in the field of human-computer interaction.
Traditional speech transcription technology comprises an acoustic model, a pronunciation dictionary and a language model, and builds a complex decoding network with a weighted finite-state transducer to convert an acoustic feature sequence into a text sequence. The emerging end-to-end speech transcription technology instead uses a single neural network model to convert acoustic features directly into a text sequence, which greatly simplifies the decoding flow of voice transcription. However, current high-performance end-to-end voice transcription must wait for the complete audio input before it can begin converting to a text sequence, which limits the application of end-to-end voice transcription to online, real-time transcription tasks.
Disclosure of Invention
In view of this, the embodiments of the present application provide an online end-to-end voice transcription method and system, which overcome the problem that existing end-to-end voice transcription technology cannot be applied to real-time online transcription tasks. By improving the encoder and decoder structures of the end-to-end model, conversion into a text sequence can begin without relying on the complete audio.
In a first aspect, the present invention provides an online end-to-end voice transcription method, including:
acquiring an audio file, and extracting acoustic features from the audio file;
performing nonlinear transformation and down-sampling on the acoustic features and outputting a first feature sequence;
partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
and taking the Chinese character sequence with the highest score as a final transcription result.
Optionally, the obtaining the audio file, and extracting the acoustic features from the audio file includes:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
Optionally, the processing the second feature sequence, outputting a plurality of groups of chinese character sequences, and scoring the plurality of groups of chinese character sequences includes:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
Optionally, the modeling the second feature sequence by the decoder, and scoring the output multiple groups of chinese character sequences includes:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
In a second aspect, the present invention provides an online end-to-end voice transcription system, including:
a collecting unit: the system is used for acquiring audio and extracting acoustic features from the audio;
a processing unit: the acoustic feature extraction unit is used for carrying out nonlinear transformation and down sampling on the acoustic features extracted by the acquisition unit and outputting a first feature sequence; partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
the processing unit is further used for modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
an output unit: and the Chinese character sequence with the highest score in the Chinese character sequences output by the processing unit is used as a final transcription result and is output.
Optionally, extracting acoustic features from the audio comprises:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
Optionally, the processing the second feature sequence, outputting a plurality of groups of chinese character sequences, and scoring the plurality of groups of chinese character sequences includes:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
Optionally, the modeling the second feature sequence by the decoder, and scoring the output multiple groups of chinese character sequences includes:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
The embodiments of the application provide an online end-to-end voice transcription method and system. In one embodiment, logarithmic Mel-spectrum features are extracted from the audio as frame-level acoustic features; a front-end neural network is constructed to perform nonlinear transformation and down-sampling on the logarithmic Mel-spectrum features; an online encoder based on the self-attention mechanism is constructed to model the output feature sequence of the front-end neural network and output a new group of feature sequences; an online decoder based on the self-attention mechanism is constructed to model the feature sequence output by the encoder and output multiple groups of Chinese character sequences; and a beam search algorithm finds the character sequence with the highest score, which is taken as the final transcription result. The encoder structure is improved so that partitioned audio can be processed, and the decoder structure is improved so that Chinese characters are output from truncated audio, so that text is transcribed while the audio is still being input.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of an online end-to-end voice transcription system according to the present invention;
FIG. 2 is a flow chart of a method for on-line end-to-end voice transcription in accordance with the present invention;
FIG. 3 is a flow chart of a process for an online encoder based on a self-attention mechanism for a sequence of features input thereto;
fig. 4 is a flow chart of the processing of a feature sequence input to an online decoder based on a self-attention mechanism.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Fig. 1 is a schematic structural diagram of an online end-to-end voice transcription system according to the present invention. Referring to fig. 1, the online end-to-end voice transcription system in the embodiment of the present invention includes: an acquisition unit 101, a processing unit 102 and an output unit 103.
The acquisition unit 101 is configured to acquire an audio signal and to pre-emphasize it by passing it through a high-pass filter, boosting the high-frequency portion of the audio signal.
The filtered audio signal is divided into frames of 25 ms with a 10 ms frame shift, and each frame is windowed with a Hamming window. A fast Fourier transform is then applied to each frame to obtain its spectrum, from which the energy spectrum of each frame is derived. The energy of each frame's energy spectrum passing through the Mel filters is computed and the logarithm is taken to obtain the logarithmic Mel spectrum; with 80 Mel filters, each frame yields an 80-dimensional logarithmic Mel-spectrum feature.
The processing unit 102 includes: a first processing unit 1021, a second processing unit 1022 and a third processing unit 1023.
The first processing unit 1021 is used to build a front-end neural network. The front-end neural network comprises two two-dimensional convolution layers, a linear layer and a position coding layer. The convolution kernels have size 3 and stride 2, with 256 kernels per layer; after the two two-dimensional convolution layers the feature sequence is one quarter of its original length; the linear layer projects the convolution output features to 256 dimensions; and the position coding layer adds the 256-dimensional position features to the output features of the linear layer.
The logarithmic Mel-spectrum feature sequence extracted by the acquisition unit 101 is input into the front-end neural network for nonlinear transformation and down-sampling; the output sequence is one quarter of the original length.
The second processing unit 1022 is configured to construct an on-line encoder based on the self-attention mechanism, where the encoder is composed of 12 identical sub-modules stacked together, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network, and a layer normalization network stacked together.
The feature sequence output from the front-end neural network is partitioned into blocks, each block is input into the online encoder in turn, and multiple groups of new feature sequences are output. The feature sequence output from the online encoder has the same length as the feature sequence input to it.
The third processing unit 1023 is used to construct an online decoder based on the self-attention mechanism. The decoder is a stack of 6 identical sub-modules, each of which consists in turn of a self-attention network, a residual network, a layer normalization network, a truncated attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
Before the output features of the online encoder are input into the online decoder, a start symbol is fed to the online decoder as its starting point: the word embedding of the start symbol is added to the position feature and input into the self-attention-based online decoder.
The output unit 103 is configured to output the chinese character sequence with the highest score in the chinese character sequences output by the processing unit 102 as a final transcription result.
In one possible embodiment, the output unit 103 uses a beam search algorithm to control the number of Chinese characters output by the third processing unit 1023 at each step: the Chinese characters are sorted by score from high to low, and the top ten are fed back into the decoder to output the next Chinese character, until the decoder outputs the terminator. The Chinese character sequence with the highest score is then taken as the final transcription result.
Fig. 2 is a flowchart of an online end-to-end voice transcription method according to the present invention, and referring to fig. 2, an online end-to-end voice transcription method includes steps S201 to S205:
step S201: and acquiring an audio file, and extracting acoustic features of the audio file.
Logarithmic Mel-spectrum features are extracted from the audio as frame-level acoustic features. Specifically, pre-emphasis is applied to the acquired audio file to boost its high-frequency part, i.e. the speech signal in the audio file is passed through a high-pass filter:
H(z) = 1 - 0.97z^(-1)    (1)
the audio in the audio file is framed and windowed, wherein each frame is 25 ms, the frame is shifted by 10 ms, and the window function is a Hamming window.
And performing fast Fourier transform on each frame to obtain a frequency spectrum of each frame, and further processing the frequency spectrum of each frame to obtain an energy spectrum of each frame.
And calculating the energy of the energy spectrum of each frame passing through the Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum, wherein the number of the Mel filters is 80, so that each frame obtains 80-dimensional logarithmic Mel spectrum characteristics.
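Purely as an illustration of step S201 (not text from the patent), a minimal NumPy sketch of this pipeline, pre-emphasis with the filter of formula (1), 25 ms frames with a 10 ms shift, Hamming windowing, FFT energy spectra and an 80-filter log-Mel projection, might look as follows; the helper mel_filterbank, the 16 kHz sample rate and the 512-point FFT are illustrative assumptions.

```python
import numpy as np


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filterbank (illustrative helper, not from the patent)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def log_mel_features(signal, sample_rate=16000, n_mels=80, n_fft=512):
    """80-dim log-Mel features: pre-emphasis, 25 ms frames / 10 ms shift, Hamming, FFT."""
    # Pre-emphasis: H(z) = 1 - 0.97 z^(-1), i.e. y[n] = x[n] - 0.97 x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_len = int(0.025 * sample_rate)    # 25 ms
    frame_shift = int(0.010 * sample_rate)  # 10 ms
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)

    fbank = mel_filterbank(n_mels, n_fft, sample_rate)
    features = np.zeros((n_frames, n_mels))
    for t in range(n_frames):
        frame = emphasized[t * frame_shift:t * frame_shift + frame_len] * window
        spectrum = np.fft.rfft(frame, n_fft)
        energy = np.abs(spectrum) ** 2                  # energy spectrum of the frame
        features[t] = np.log(fbank @ energy + 1e-10)    # 80-dim log-Mel feature
    return features
```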
Step S202: the acoustic features are non-linearly transformed and down-sampled and a first feature sequence is output.
The extracted acoustic features are input into a front-end neural network, which performs nonlinear transformation and down-sampling on them and outputs a first feature sequence; the output is one quarter of the original length.
In one possible embodiment, the constructed front-end neural network comprises two two-dimensional convolution layers, a linear layer and a position coding layer. The convolution kernels have size 3 and stride 2, with 256 kernels per layer; after the two two-dimensional convolution layers the length of the feature sequence becomes one quarter of the original; the linear layer projects the convolution output features to 256 dimensions; and the position coding layer adds the output features of the linear layer to the 256-dimensional position features. The overall computation is:
Y = ReLU(Conv(ReLU(Conv(X))))    (2)
z_i = Linear(y_i) + p_i    (3)
where Conv(·) denotes a convolution layer and ReLU(·) denotes the activation function
ReLU(x) = max(0, x)    (4)
Linear(·) denotes the linear layer; X and Y denote the logarithmic Mel-spectrum feature sequence and the output feature sequence of the two two-dimensional convolution layers, respectively; and y_i, p_i and z_i denote the i-th output feature of the convolution layers, the i-th position feature and the i-th output feature of the front-end neural network, respectively. Each dimension of the position feature is computed as:
p_{i,2k+1} = sin(i / 10000^(k/128))    (5)
p_{i,2k+2} = cos(i / 10000^(k/128))    (6)
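The following PyTorch sketch illustrates one way such a front end (formulas (2) to (6)) could be assembled. It is not the patented implementation: the padding, the assumed 80-dimensional log-Mel input, and the class and parameter names are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FrontEnd(nn.Module):
    """Two stride-2 Conv2d layers, a linear projection to 256 dims, and sinusoidal
    position features, mirroring formulas (2)-(6) (illustrative sketch)."""

    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Assumes 80-dim log-Mel input; two stride-2 convs leave 20 frequency bins.
        self.linear = nn.Linear(d_model * 20, d_model)
        # p_{i,2k+1} = sin(i / 10000^(k/128)), p_{i,2k+2} = cos(i / 10000^(k/128))
        pos = torch.arange(max_len).unsqueeze(1).float()       # i
        k = torch.arange(d_model // 2).float()                 # k = 0 .. 127
        angle = pos / torch.pow(10000.0, k / (d_model // 2))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, frames, 80) log-Mel features
        y = self.conv(x.unsqueeze(1))                  # (batch, 256, frames/4, 20)
        b, c, t, f = y.shape
        y = y.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.linear(y) + self.pe[:t]            # z_i = Linear(y_i) + p_i
```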
step S203: and partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder, and outputting a plurality of groups of second characteristic sequences.
In one possible embodiment, each block of the feature sequence is input into an online encoder based on the self-attention mechanism. The encoder is formed by stacking 12 identical sub-modules; each sub-module consists in turn of a self-attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network, and the output features of every layer are 256-dimensional.
The first feature sequence is partitioned into blocks of length 64. Each block of the feature sequence is input into the encoder for processing; the processing flow, shown in fig. 3 (and sketched in code after this list of steps), comprises steps S2031 to S2038:
step S2031: inputting each feature sequence into the self-attention network of the first module of the encoder, and calculating as follows:
SAN(Q,K,V)=[head1,…,head4]WO(7)
head_i = softmax( (Q W_i^Q)(K W_i^K)^T / sqrt(d) ) (V W_i^V)    (8)
where W_i^Q, W_i^K, W_i^V and W^O are parameter matrices; Q is the current block of the feature sequence with 64 future input features spliced on its right side; K is the current block with 64 historical input features spliced on its left side and 64 future input features spliced on its right side; and V is the same as K.
Step S2032: the input and output characteristics from the attention network are added as the output characteristics of the residual network.
Step S2033: the output characteristics of the residual error network are input into the normalized network, and the calculation is as follows:
Figure BDA0002350970320000082
Figure BDA0002350970320000083
Figure BDA0002350970320000084
wherein the mean μ and variance var are calculated for each frame of input features h, by means of model parameters wiAnd biCarrying out regular and linear transformation on each dimension value of h and outputting a new characteristic sequence
Figure BDA0002350970320000085
Step S2034: inputting the output characteristics of the layer normalized network into the full-connection network, wherein the calculation formula of the network is as follows:
F(x)=max(0,xW1+b1)W2+b2(12)
step S2035: the input and output characteristics of the fully connected network are added as the output characteristics of the residual network.
Step S2036: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2037: judging whether the current module is the last sub-module or not, and if the current module is the last sub-module, ending the calculation flow of the encoder; otherwise, step S2038 is performed.
Step S2038: the save layer normalizes the output characteristics of the network as input from the attention network for the next sub-module and performs step S2032.
And taking the output characteristics of the storage layer normalized network as a history frame for splicing in the next characteristic processing process, and inputting the output characteristics into the next submodule.
In one possible embodiment, the first feature sequence input to the online encoder and the second feature sequence output by the online encoder have the same length.
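As an illustration of steps S2031 to S2038, the sketch below shows one encoder sub-module with the history and future splicing described above. It is a simplified sketch rather than the patented implementation; the feed-forward width, the use of torch.nn.MultiheadAttention and the handling of the spliced future frames are assumptions.

```python
import torch
import torch.nn as nn


class EncoderSubModule(nn.Module):
    """One encoder sub-module: self-attention, residual, layer norm,
    feed-forward (as in formula (12)), residual, layer norm."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, block, history, future):
        # Q: current block with 64 future frames spliced on the right.
        q = torch.cat([block, future], dim=1)
        # K = V: 64 history frames + current block + 64 future frames.
        kv = torch.cat([history, block, future], dim=1)
        attn_out, _ = self.attn(q, kv, kv)
        x = self.norm1(q + attn_out)        # residual + layer norm (S2032-S2033)
        x = self.norm2(x + self.ffn(x))     # feed-forward + residual + layer norm
        # Only the part corresponding to the current block is passed on.
        return x[:, :block.size(1)]
```

Twelve such sub-modules would be stacked, and each sub-module's layer-normalized output would also be cached as the 64 history frames spliced when the next block is processed (step S2038).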
Step S204: and processing the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences.
The second feature sequence output by the online encoder is modeled by an online decoder. The online decoder based on the self-attention mechanism is formed by stacking 6 identical sub-modules; each sub-module consists in turn of a self-attention network, a residual network, a layer normalization network, a truncated attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network, and the output features of every layer are 256-dimensional.
The second feature sequence output by the online encoder is processed by the online decoder; the processing flow, shown in fig. 4, comprises the following steps:
step S2041: the character sequence formed by adding the word embedding character and the position character of the starting symbol and the character sequence output by the decoder are input into the attention network. The calculation formula in the self-attention network is the same as the formula (7) and the formula (8). Wherein Q represents an input feature, K represents a feature sequence composed of input features of the current history and all histories, and V and K are the same.
Step S2042: the input and output characteristics from the attention network are added as the output characteristics of the residual network.
Step S2043: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2044: ith output characteristic q of a layer-normalized networkiInputting the truncated attention network, the calculation formula is as follows, and the output characteristic k of each encoderjSequentially calculating j ═ 1,2, …:
e_{i,j} = q_i k_j^T / sqrt(d)    (13)
p_{i,j} = sigmoid(e_{i,j})    (14)
When some j satisfies p_{i,j} > 0.5 and j is greater than the previous truncation point, it is taken as the truncation point t_i of the current truncated attention network, and the output feature is computed as:
c_i = Σ_{j=1}^{t_i} ( exp(e_{i,j}) / Σ_{l=1}^{t_i} exp(e_{i,l}) ) k_j    (15)
step S2045: adding the input and output characteristics of the truncated attention network to serve as the output characteristics of the residual error network;
step S2046: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2047: and inputting the output characteristics of the layer normalized network into the full-connection network, wherein the calculation formula is the same as that in the step S2034.
Step S2048: and adding the input and output characteristics of the fully-connected network to be used as the output characteristics of the residual error network, and inputting the output characteristics of the residual error network into the normalized network. The calculation formula is the same as that in step S2033;
step S2049: judging whether the current module is the last sub-module, if so, executing the step S20411; otherwise, step S20410 is executed.
Step S20410: the saving layer normalizes the output characteristics of the network as the input from the attention network of the next submodule, and performs step S2042.
Step S20411: the output characteristics of the layer normalized network are input into a Chinese character classifier, a plurality of groups of Chinese characters and corresponding scores are output, each Chinese character is added with the word embedding and position characteristics, and an input decoder predicts the next Chinese character.
Step S205: and taking the Chinese character sequence with the highest score as a final transcription result.
A beam search algorithm controls the number of Chinese characters output by the decoder at each step: the Chinese characters are sorted by score from high to low, and the top ten are each input back into the decoder to output the next Chinese character, until the decoder outputs the terminator. After decoding finishes, the Chinese character sequence with the highest score is taken as the final transcription result.
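A minimal sketch of this beam search (beam width ten, as in the "top ten" above) is given below, assuming a decoder callable that returns log-probabilities over the Chinese-character vocabulary for the next position; the interface and parameter names are assumptions made for the example.

```python
import torch


def beam_search(decoder, enc_out, sos_id, eos_id, beam=10, max_len=100):
    """Keep the ten best hypotheses at each step; stop a hypothesis at the terminator."""
    hyps = [([sos_id], 0.0)]                 # (character sequence, cumulative score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in hyps:
            log_probs = decoder(enc_out, torch.tensor(prefix))   # (vocab_size,)
            top_scores, top_ids = log_probs.topk(beam)
            for s, c in zip(top_scores.tolist(), top_ids.tolist()):
                candidates.append((prefix + [c], score + s))
        candidates.sort(key=lambda h: h[1], reverse=True)
        hyps = []
        for prefix, score in candidates[:beam]:
            (finished if prefix[-1] == eos_id else hyps).append((prefix, score))
        if not hyps:                          # every surviving hypothesis has ended
            break
    best = max(finished or hyps, key=lambda h: h[1])
    return best[0][1:]                        # drop the start symbol
```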
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (10)

1. An online end-to-end voice transcription method comprises the following steps:
acquiring an audio file, and extracting acoustic features from the audio file;
performing nonlinear transformation and down-sampling on the acoustic features and outputting a first feature sequence;
partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
and taking the Chinese character sequence with the highest score as a final transcription result.
2. The method of claim 1, wherein the obtaining an audio file and the extracting acoustic features from the audio file comprises:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
3. The method of claim 1, wherein the encoder is an on-line encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
4. The method of claim 1, wherein the processing the second signature sequence, outputting a plurality of chinese character sequences and scoring the plurality of chinese character sequences comprises:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
5. The method of claim 4, wherein the decoder modeling the second token sequence and scoring the output plurality of kanji sequences comprises:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
6. An online end-to-end voice transcription system, comprising:
a collecting unit: the system is used for acquiring audio and extracting acoustic features from the audio;
a processing unit: the acoustic feature extraction unit is used for carrying out nonlinear transformation and down sampling on the acoustic features extracted by the acquisition unit and outputting a first feature sequence; partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences;
the processing unit is further used for modeling the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences;
an output unit: and the Chinese character sequence with the highest score in the Chinese character sequences output by the processing unit is used as a final transcription result and is output.
7. The system of claim 6, wherein extracting acoustic features from the audio comprises:
and extracting logarithmic Mel spectral features from the obtained audio file as frame-level acoustic features.
8. The system of claim 6, wherein the encoder is an on-line encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
9. The system of claim 6, wherein said processing the second token sequence, outputting a plurality of chinese character sequences and scoring the plurality of chinese character sequences comprises:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second characteristic sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 6 identical sub-modules, wherein each sub-module consists in turn of a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully-connected network layer, a residual network layer and a layer normalization layer.
10. The system of claim 9, wherein the decoder models the second token sequence and scoring the output plurality of kanji sequences comprises:
sequentially passing multiple groups of second characteristic sequences through the 6 sub-modules of the decoder, and inputting the output characteristics of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs a plurality of groups of Chinese characters and scores corresponding to each group of Chinese characters;
and taking the top ten Chinese characters to input the decoder to output the next Chinese character until the decoder outputs the terminator.
CN201911415035.0A 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system Active CN111128191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911415035.0A CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911415035.0A CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Publications (2)

Publication Number Publication Date
CN111128191A true CN111128191A (en) 2020-05-08
CN111128191B CN111128191B (en) 2023-03-28

Family

ID=70506638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911415035.0A Active CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Country Status (1)

Country Link
CN (1) CN111128191B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493812A (en) * 2009-03-06 2009-07-29 中国科学院软件研究所 Tone-character conversion method
CN107590135A (en) * 2016-07-07 2018-01-16 三星电子株式会社 Automatic translating method, equipment and system
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
US20190236451A1 (en) * 2016-10-10 2019-08-01 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493812A (en) * 2009-03-06 2009-07-29 中国科学院软件研究所 Tone-character conversion method
CN107590135A (en) * 2016-07-07 2018-01-16 三星电子株式会社 Automatic translating method, equipment and system
US20190236451A1 (en) * 2016-10-10 2019-08-01 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VASWANI, ASHISH et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886A (en) * 2020-07-27 2020-10-27 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof

Also Published As

Publication number Publication date
CN111128191B (en) 2023-03-28

Similar Documents

Publication Publication Date Title
US11948066B2 (en) Processing sequences using convolutional neural networks
EP3680894B1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
CN111477221B (en) Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN112037798B (en) Voice recognition method and system based on trigger type non-autoregressive model
CN111048082B (en) Improved end-to-end speech recognition method
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
EP4018437B1 (en) Optimizing a keyword spotting system
CN111339278B (en) Method and device for generating training speech generating model and method and device for generating answer speech
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
WO2009075990A1 (en) Grapheme-to-phoneme conversion using acoustic data
CN111916058A (en) Voice recognition method and system based on incremental word graph re-scoring
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
CN111798840A (en) Voice keyword recognition method and device
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113327585B (en) Automatic voice recognition method based on deep neural network
CN111128191B (en) Online end-to-end voice transcription method and system
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113160828A (en) Intelligent auxiliary robot interaction method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant