CN111128191B - Online end-to-end voice transcription method and system - Google Patents

Online end-to-end voice transcription method and system

Info

Publication number
CN111128191B
CN111128191B CN201911415035.0A
Authority
CN
China
Prior art keywords
network
layer
output
chinese character
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911415035.0A
Other languages
Chinese (zh)
Other versions
CN111128191A (en)
Inventor
张鹏远
缪浩然
程高峰
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201911415035.0A priority Critical patent/CN111128191B/en
Publication of CN111128191A publication Critical patent/CN111128191A/en
Application granted granted Critical
Publication of CN111128191B publication Critical patent/CN111128191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides an online end-to-end voice transcription method and system. In one embodiment, acoustic features are extracted from an audio file; the acoustic features are nonlinearly transformed and down-sampled, and a first feature sequence is output; the first feature sequence is partitioned into blocks, each block is sequentially input into an encoder, and multiple groups of second feature sequences are output; the second feature sequences are modeled, multiple groups of Chinese character sequences are output, and the multiple groups of Chinese character sequences are scored; the Chinese character sequence with the highest score is taken as the final transcription result. The encoder structure is improved so that it can process the partitioned audio, and the decoder structure is improved so that Chinese characters are output on the basis of truncated audio, allowing text to be transcribed while the audio is still being input.

Description

Online end-to-end voice transcription method and system
Technical Field
The invention relates to the technical field of voice transcription, in particular to an online end-to-end voice transcription method and system.
Background
The voice transcription technology is an important technology for converting input audio into text, and is also an important research content in the field of human-computer interaction.
The traditional speech transcription technology comprises an acoustic model, a pronunciation dictionary and a language model, and constructs a complex decoding network by means of a weighted finite-state machine to convert an acoustic feature sequence into a text sequence. The emerging end-to-end speech transcription technology adopts a single neural network model to directly convert acoustic features into a text sequence, which greatly simplifies the decoding flow of voice transcription. However, current high-performance end-to-end voice transcription must wait for the complete audio input before starting to convert it into a text sequence, which limits the application of end-to-end voice transcription technology to online, real-time transcription tasks.
Disclosure of Invention
In view of this, the embodiments of the present application provide an online end-to-end voice transcription method and system that overcome the problem that existing end-to-end voice transcription technology cannot be applied to real-time online transcription tasks: by improving the encoder and decoder structures of the end-to-end model, conversion into a text sequence can begin without relying on the complete audio.
In a first aspect, the present invention provides an online end-to-end voice transcription method, including:
acquiring an audio file, and extracting acoustic features from the audio file;
performing nonlinear transformation and down-sampling on the acoustic features and outputting a first feature sequence;
partitioning the first feature sequence, sequentially inputting each block of the feature sequence into an encoder and outputting multiple groups of second feature sequences;
modeling the second feature sequence, outputting multiple groups of Chinese character sequences and scoring the multiple groups of Chinese character sequences;
and taking the Chinese character sequence with the highest score as a final transcription result.
Optionally, acquiring the audio file and extracting the acoustic features from the audio file includes:
extracting logarithmic Mel spectral features from the acquired audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
Optionally, processing the second feature sequence, outputting multiple groups of Chinese character sequences, and scoring the multiple groups of Chinese character sequences includes:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second feature sequence and scores the output multiple groups of Chinese character sequences;
the decoder is a stack of 6 identical sub-modules, wherein each sub-module is, in order, a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected network layer, a residual network layer and a layer normalization layer.
Optionally, the decoder modeling the second feature sequence and scoring the output multiple groups of Chinese character sequences includes:
sequentially passing multiple groups of second feature sequences through the 6 sub-modules of the decoder, and inputting the output features of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs multiple groups of Chinese characters and a score corresponding to each group of Chinese characters;
and the top ten Chinese characters are input back into the decoder to output the next Chinese character, until the decoder outputs the terminator.
In a second aspect, the present invention provides an online end-to-end voice transcription system, including:
the acquisition unit: used for acquiring audio and extracting acoustic features from the audio;
a processing unit: used for performing nonlinear transformation and down-sampling on the acoustic features extracted by the acquisition unit and outputting a first feature sequence, and for partitioning the first feature sequence, sequentially inputting each block of the feature sequence into an encoder and outputting multiple groups of second feature sequences;
the processing unit is also used for modeling the second feature sequence, outputting multiple groups of Chinese character sequences and scoring the multiple groups of Chinese character sequences;
an output unit: used for taking the Chinese character sequence with the highest score among the Chinese character sequences output by the processing unit as the final transcription result and outputting it.
Optionally, extracting acoustic features from the audio comprises:
extracting logarithmic Mel spectral features from the acquired audio file as frame-level acoustic features.
Optionally, the encoder is an online encoder based on a self-attention mechanism;
the encoder is formed by stacking 12 identical sub-modules, and each sub-module is sequentially formed by stacking a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network.
Optionally, processing the second feature sequence, outputting multiple groups of Chinese character sequences, and scoring the multiple groups of Chinese character sequences includes:
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second feature sequence and scores the output multiple groups of Chinese character sequences;
the decoder is a stack of 6 identical sub-modules, wherein each sub-module comprises, in order, a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected network layer, a residual network layer and a layer normalization layer.
Optionally, the decoder modeling the second feature sequence and scoring the output multiple groups of Chinese character sequences includes:
sequentially passing multiple groups of second feature sequences through the 6 sub-modules of the decoder, and inputting the output features of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs multiple groups of Chinese characters and a score corresponding to each group of Chinese characters;
and the top ten Chinese characters are input back into the decoder to output the next Chinese character, until the decoder outputs the terminator.
The embodiments of the present application provide an online end-to-end voice transcription method and system. In one embodiment, logarithmic Mel spectral features are extracted from the audio as frame-level acoustic features; a front-end neural network is constructed to perform nonlinear transformation and down-sampling on the logarithmic Mel spectral features; an online encoder based on a self-attention mechanism is constructed to model the output feature sequence of the front-end neural network and output a group of new feature sequences; an online decoder based on a self-attention mechanism is constructed to model the feature sequence output by the encoder and output multiple groups of Chinese character sequences; and a beam search algorithm is used to find the character sequence with the highest score, which is taken as the final transcription result. The encoder structure is improved so that it can process the partitioned audio, and the decoder structure is improved so that Chinese characters are output on the basis of truncated audio, allowing text to be transcribed while the audio is still being input.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of an online end-to-end voice transcription system according to the present invention;
FIG. 2 is a flow chart of a method for on-line end-to-end voice transcription according to the present invention;
FIG. 3 is a flow chart of the processing, by the self-attention-based online encoder, of the feature sequence input to it;
FIG. 4 is a flow chart of the processing, by the self-attention-based online decoder, of the feature sequence input to it.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Fig. 1 is a schematic structural diagram of an online end-to-end voice transcription system according to the present invention. Referring to fig. 1, the online end-to-end voice transcription system in the embodiment of the present invention includes: an acquisition unit 101, a processing unit 102 and an output unit 103.
The acquisition unit 101 is configured to acquire an audio signal and to pre-emphasize the acquired audio signal with a high-pass filter, boosting its high-frequency portion.
The audio signal passed through the high-pass filter is framed, with 25 ms per frame and a 10 ms frame shift. Each frame is windowed with a Hamming window. A fast Fourier transform is then performed on each frame to obtain its spectrum, from which the energy spectrum of each frame is obtained. The energy of each frame's energy spectrum passed through the Mel filters is computed and the logarithm is taken to obtain the logarithmic Mel spectrum; since the number of Mel filters is 80, each frame yields an 80-dimensional logarithmic Mel spectral feature.
The processing unit 102 includes: a first processing unit 1021, a second processing unit 1022 and a third processing unit 1023.
The first processing unit 1021 is used to build a front-end neural network. The front-end neural network comprises two two-dimensional convolutional layers, one linear layer and one positional-encoding layer. The convolution kernel size of the convolutional layers is 3, the stride is 2 and the number of convolution kernels is 256; after the two two-dimensional convolutional layers the length of the feature sequence becomes one quarter of the original. The linear layer projects the output features of the convolutional layers to 256 dimensions, and the positional-encoding layer adds the output features of the linear layer to the 256-dimensional position features.
The logarithmic Mel spectral feature sequence extracted by the acquisition unit 101 is input into the front-end neural network, which performs nonlinear transformation and down-sampling; the down-sampling factor is four, so the output is one quarter of the original length.
The second processing unit 1022 is configured to construct an online encoder based on a self-attention mechanism. The encoder is a stack of 12 identical sub-modules, and each sub-module is, in order, a stack of a self-attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
The feature sequence output from the front-end neural network is partitioned into blocks, each block is sequentially input into the online encoder, and multiple groups of new feature sequences are output. The feature sequence output from the online encoder has the same length as the feature sequence input to it.
The third processing unit 1023 is used to construct an online decoder based on a self-attention mechanism. The decoder is a stack of 6 identical sub-modules, and each sub-module is, in order, a stack of a self-attention network, a residual network, a layer normalization network, a truncated attention network, a residual network, a layer normalization network, a fully connected network, a residual network and a layer normalization network.
Before the output features of the online encoder are input into the online decoder, a start symbol is input to the online decoder as its starting point. The word-embedding feature of the start symbol is added to the position feature and input into the self-attention-based online decoder.
The output unit 103 is configured to output the chinese character sequence with the highest score in the chinese character sequences output by the processing unit 102 as a final transcription result.
In one possible embodiment, the output unit 103 uses a beam search algorithm to control the number of Chinese characters the third processing unit 1023 outputs at each step; the Chinese characters are sorted by score from high to low, and the top ten are input back into the decoder to output the next Chinese character, until the decoder outputs the terminator. After the search ends, the Chinese character sequence with the highest score is taken as the final transcription result.
Fig. 2 is a flowchart of an online end-to-end voice transcription method according to the present invention, and referring to fig. 2, the online end-to-end voice transcription method includes steps S201 to S205:
step S201: and acquiring an audio file, and extracting acoustic features of the audio file.
Logarithmic mel-frequency spectral features are extracted from the audio as acoustic features at the frame level. The method specifically comprises the following steps: and pre-emphasis is carried out on the acquired audio file, and the high-frequency part is promoted. I.e. passing the speech signal in the audio file through a high pass filter:
H(z)=1-0.97z -1 (1)
the audio in the audio file is framed and windowed, wherein each frame is 25 ms, the frame is shifted by 10 ms, and the window function is a Hamming window.
And performing fast Fourier transform on each frame to obtain a frequency spectrum of each frame, and further processing the frequency spectrum of each frame to obtain an energy spectrum of each frame.
And calculating the energy of the energy spectrum of each frame passing through the Mel filter, and taking the logarithm to obtain a logarithmic Mel spectrum, wherein the number of the Mel filters is 80, so that each frame obtains 80-dimensional logarithmic Mel spectrum characteristics.
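Where a concrete reference helps, the following is a minimal NumPy sketch of the feature-extraction pipeline described above (pre-emphasis per equation (1), 25 ms frames with a 10 ms shift, Hamming window, FFT, an 80-filter Mel bank, and a logarithm). The function name, the 512-point FFT size and the triangular-filterbank construction are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def log_mel_features(signal, sr=16000, n_mels=80, frame_len=0.025, frame_shift=0.010):
    """Frame-level 80-dim log-Mel features: pre-emphasis, framing, Hamming window, FFT, Mel filters, log."""
    # Pre-emphasis: H(z) = 1 - 0.97 z^-1
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing: 25 ms frames with a 10 ms shift
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fshift)
    frames = np.stack([emphasized[i * fshift: i * fshift + flen] for i in range(n_frames)])

    # Hamming window, FFT, energy (power) spectrum
    frames = frames * np.hamming(flen)
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 80 triangular Mel filters spanning 0 Hz to the Nyquist frequency
    mel_hi = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    hz_points = 700.0 * (10.0 ** (np.linspace(0.0, mel_hi, n_mels + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    # Energy through each Mel filter, then the logarithm
    return np.log(power @ fbank.T + 1e-10)   # shape: (n_frames, 80)
```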
Step S202: the acoustic features are nonlinearly transformed and downsampled and a first feature sequence is output.
The extracted acoustic features are input into a front-end neural network, which performs nonlinear transformation and down-sampling on them and outputs the first feature sequence; the down-sampling factor is four.
In one possible embodiment, the constructed front-end neural network comprises two two-dimensional convolutional layers, one linear layer and one positional-encoding layer. The convolution kernel size of the convolutional layers is 3, the stride is 2 and the number of convolution kernels is 256; after the two two-dimensional convolutional layers the length of the feature sequence becomes one quarter of the original. The linear layer projects the output features of the convolutional layers to 256 dimensions, and the positional-encoding layer adds the output features of the linear layer to the 256-dimensional position features. The overall calculation is:
Y = ReLU(Conv(ReLU(Conv(X))))    (2)
z_i = Linear(y_i) + p_i    (3)
wherein Conv (·) denotes a convolutional layer; reLU (·) represents an activation function, whose expression is:
ReLU(x)=max(0,x) (4)
linear (·) represents a Linear layer, X and Y represent a logarithmic Mel-spectrum feature sequence and an output feature sequence of two-dimensional convolution layers, respectively, Y i ,p i ,z i Respectively representing the output characteristic of the ith two-dimensional convolutional layer, the ith position characteristic and the output characteristic of the ith front-end neural network, wherein the calculation formula of each dimension of the position characteristic is as follows:
p_{i,2k+1} = sin(i / 10000^(k/128))    (5)
p_{i,2k+2} = cos(i / 10000^(k/128))    (6)
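As a concrete illustration of the front-end described by equations (2)-(6), here is a small PyTorch sketch assuming 80-dimensional log-Mel input and a 256-dimensional model width. The convolution padding of 1, the flattening of the frequency axis before the linear layer and the class and parameter names are assumptions; the patent only specifies kernel size 3, stride 2, 256 kernels and the sinusoidal position features.

```python
import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    """Front-end sketch: two 2-D convolutions (kernel 3, stride 2, 256 channels) with ReLU,
    a linear projection to 256 dims, and additive sinusoidal position features, cf. eqs. (2)-(6)."""
    def __init__(self, n_mels=80, d_model=256, max_len=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Linear layer projecting the flattened conv output (channels x reduced frequency axis) to 256 dims
        self.linear = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)
        # Position features: p_{i,2k+1} = sin(i / 10000^(k/128)), p_{i,2k+2} = cos(i / 10000^(k/128))
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        k = torch.arange(d_model // 2, dtype=torch.float32)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos / 10000 ** (k / (d_model // 2)))
        pe[:, 1::2] = torch.cos(pos / 10000 ** (k / (d_model // 2)))
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (batch, time, 80) log-Mel features
        y = self.conv(x.unsqueeze(1))          # (batch, 256, ~time/4, ~80/4)
        b, c, t, f = y.shape
        y = y.transpose(1, 2).reshape(b, t, c * f)
        return self.linear(y) + self.pe[:t]    # z_i = Linear(y_i) + p_i; length is about 1/4 of the input
```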
step S203: and partitioning the first characteristic sequence, sequentially inputting each block of characteristic sequence into an encoder and outputting a plurality of groups of second characteristic sequences.
In one possible embodiment, each block of the feature sequence is input into an online encoder based on a self-attention mechanism. The encoder is a stack of 12 identical sub-modules; each sub-module is, in order, a self-attention network layer, a residual network layer, a layer normalization layer, a fully connected network layer, a residual network layer and a layer normalization layer, and the output features of every layer are 256-dimensional.
The first feature sequence is partitioned into blocks, each of length 64. Each block of the feature sequence is input into the encoder for processing; the processing flow is shown in fig. 3 and includes steps S2031-S2038:
step S2031: inputting each feature sequence into the self-attention network of the first module of the encoder, and calculating as follows:
SAN(Q, K, V) = [head_1, ..., head_4] W^O    (7)
head_i = softmax(Q W_i^Q (K W_i^K)^T / sqrt(d_k)) V W_i^V    (8)
where W_i^Q, W_i^K, W_i^V and W^O are parameter matrices and d_k is the dimension of each attention head; Q is a block of the feature sequence with 64 future input features spliced on its right side; K is the same block of the feature sequence with 64 historical input features spliced on its left side and 64 future input features spliced on its right side; and V is the same as K.
Step S2032: the input and output features of the self-attention network are added to give the output features of the residual network.
Step S2033: the output features of the residual network are input into the layer normalization network, computed as:
μ = (1/d) Σ_i h_i    (9)
var = (1/d) Σ_i (h_i - μ)^2    (10)
ĥ_i = w_i (h_i - μ) / sqrt(var + ε) + b_i    (11)
where d is the dimensionality of h, the mean μ and variance var are computed over each frame's input feature h, and the model parameters w_i and b_i apply a normalization and linear transformation to each dimension value of h, outputting a new feature ĥ.
Step S2034: the output features of the layer normalization network are input into the fully connected network, computed as:
F(x) = max(0, x W_1 + b_1) W_2 + b_2    (12)
step S2035: the input and output characteristics of the fully connected network are added as the output characteristics of the residual network.
Step S2036: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2037: judging whether the current module is the last sub-module or not, and if the current module is the last sub-module, ending the calculation flow of the encoder; otherwise, step S2038 is performed.
Step S2038: the save layer normalizes the output characteristics of the network as input from the attention network for the next sub-module and performs step S2032.
And taking the output characteristic of the storage layer normalized network as a history frame for splicing in the next characteristic processing process, and inputting the output characteristic into the next submodule.
In one possible embodiment, the first signature sequence input to the on-line encoder and the second signature sequence output by the on-line encoder are the same length.
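To make the block-wise processing of steps S2031-S2038 concrete, the following PyTorch sketch shows one encoder sub-module operating on a 64-frame block whose query side is extended with 64 future frames and whose key/value side additionally splices 64 history frames, as described above. It uses torch's built-in multi-head attention in place of equations (7)-(8); the feed-forward hidden size (1024) and the way history is handed in by the caller are assumptions.

```python
import torch
import torch.nn as nn

class OnlineEncoderLayer(nn.Module):
    """One encoder sub-module: self-attention, residual, layer norm, feed-forward, residual, layer norm."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, block, history, future):
        # Q: current 64-frame block with 64 future frames spliced on the right;
        # K = V: 64 history frames + block + 64 future frames (cf. the description of eq. (7)).
        q = torch.cat([block, future], dim=1)
        kv = torch.cat([history, block, future], dim=1)
        attn_out, _ = self.attn(q, kv, kv)
        x = self.norm1(q + attn_out)                 # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))               # fully connected network + residual + layer norm
        # Only the frames of the current block are emitted; the caller saves this layer's output
        # as the history to splice when the next block is processed (step S2038).
        return x[:, :block.size(1)]
```

A 12-layer stack of such sub-modules then processes the blocks of the first feature sequence one after another, and the output for each block has the same length as the block itself.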
Step S204: and processing the second characteristic sequence, outputting a plurality of groups of Chinese character sequences and scoring the plurality of groups of Chinese character sequences.
The second feature sequence output by the online encoder is modeled by an online decoder. The self-attention-based online decoder is a stack of 6 identical sub-modules; each sub-module is, in order, a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected network layer, a residual network layer and a layer normalization layer, and the output features of every layer are 256-dimensional.
The second feature sequence is processed by the online decoder; the processing flow is shown in fig. 4 and comprises the following steps:
step S2041: the character sequence formed by adding the word embedding character of the starting symbol and the position character and the character sequence output by the decoder are input into the attention network. The calculation formula in the self-attention network is the same as the formula (7) and the formula (8). Wherein Q represents an input feature, K represents a feature sequence composed of input features of the current history and all histories, and V and K are the same.
Step S2042: the input and output characteristics from the attention network are added as the output characteristics of the residual network.
Step S2043: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2044: the i-th output feature q_i of the layer normalization network is input into the truncated attention network. The following is computed sequentially for each encoder output feature k_j, j = 1, 2, ...:
e_{ij} = q_i k_j^T / sqrt(d)    (13)
p_{ij} = sigmoid(e_{ij})    (14)
When some j satisfies p_{ij} > 0.5 and j is greater than the previous truncation point, j is established as the truncation point t_i of the current truncated attention network, and the output feature is calculated:
c_i = Σ_{j=1}^{t_i} α_{ij} k_j, where α_{ij} = exp(e_{ij}) / Σ_{j'=1}^{t_i} exp(e_{ij'})    (15)
step S2045: adding the input and output characteristics of the truncated attention network to serve as the output characteristics of the residual error network;
step S2046: and (4) inputting the output characteristics of the residual error network into the normalized network, wherein the calculation formula is the same as that in the step S2033.
Step S2047: and inputting the output characteristics of the layer normalized network into the full-connection network, wherein the calculation formula is the same as that in the step S2034.
Step S2048: and adding the input and output characteristics of the fully-connected network to be used as the output characteristics of the residual error network, and inputting the output characteristics of the residual error network into the normalized network. The calculation formula is the same as that in step S2033;
step S2049: judging whether the current module is the last sub-module, if so, executing the step S20411; otherwise, go to step S20410.
Step S20410: the saving layer normalizes the output characteristics of the network as the input from the attention network of the next submodule, and performs step S2042.
Step S20411: the output characteristics of the layer normalized network are input into a Chinese character classifier, a plurality of groups of Chinese characters and corresponding scores are output, each Chinese character is added with the word embedding and position characteristics, and an input decoder predicts the next Chinese character.
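A small sketch of the truncated-attention step S2044 for a single decoder position i may help. Because equations (13)-(15) appear above only in reconstructed form, the scaled dot-product energy, the sigmoid truncation probability and the softmax over the truncated prefix are assumptions about their exact shape; the monotonic-threshold logic (p_{ij} > 0.5 beyond the previous truncation point) follows the text.

```python
import torch

def truncated_attention(q_i, enc_keys, prev_trunc, d=256):
    """Sketch of truncated attention for one decoder position i.

    q_i:        (d,) query from the layer normalization output
    enc_keys:   (T, d) encoder output features k_1..k_T seen so far
    prev_trunc: number of encoder frames up to and including the previous truncation point (0 at the start)
    Returns (context vector, new truncation point)."""
    # Energies and truncation probabilities (assumed forms of eqs. (13)-(14))
    e = enc_keys @ q_i / d ** 0.5              # e_{ij}, j = 1..T
    p = torch.sigmoid(e)                       # p_{ij}

    # First j past the previous truncation point with p_{ij} > 0.5 becomes the new truncation point
    candidates = (p[prev_trunc:] > 0.5).nonzero()
    if len(candidates) == 0:
        return None, prev_trunc                # no truncation point yet; wait for more encoder blocks
    t_i = prev_trunc + int(candidates[0]) + 1  # number of encoder frames up to the truncation point

    # Attend over k_1..k_{t_i} only (assumed form of eq. (15))
    alpha = torch.softmax(e[:t_i], dim=0)
    context = alpha @ enc_keys[:t_i]
    return context, t_i
```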
Step S205: the Chinese character sequence with the highest score is taken as the final transcription result.
A beam search algorithm is adopted to control the number of Chinese characters the decoder outputs at each step; the Chinese characters are sorted by score from high to low, and the top ten are each input back into the decoder to output the next Chinese character, until the decoder outputs the terminator. After the search ends, the Chinese character sequence with the highest score is taken as the final transcription result.
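For completeness, here is a minimal sketch of the beam search just described (beam width ten, stop on the terminator, return the highest-scoring hypothesis). The function decoder_step is a hypothetical stand-in for one pass through the decoder and the Chinese character classifier; its name and signature are not from the patent.

```python
def beam_search(decoder_step, sos_id, eos_id, beam_width=10, max_len=100):
    """Minimal beam search; decoder_step(prefix) returns (char_id, log_prob) candidates for the next character."""
    beams = [([sos_id], 0.0)]                        # (character sequence, cumulative log score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:
                finished.append((seq, score))        # hypothesis ended with the terminator
                continue
            for char_id, logp in decoder_step(seq):  # classifier scores for the next character
                candidates.append((seq + [char_id], score + logp))
        if not candidates:                           # every hypothesis has terminated
            break
        # Sort by score from high to low and keep the top ten to feed back into the decoder
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    # The Chinese character sequence with the highest score is the final transcription result
    return max(finished + beams, key=lambda c: c[1])[0]
```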
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are only embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention shall be included within the scope of the present invention.

Claims (6)

1. An online end-to-end voice transcription method, comprising:
acquiring an audio file, and extracting acoustic features from the audio file, comprising:
extracting logarithmic Mel spectral features from the acquired audio file as frame-level acoustic features;
performing nonlinear transformation and down-sampling on the acoustic features and outputting a first feature sequence;
partitioning the first feature sequence, sequentially inputting each block of the feature sequence into an encoder and outputting multiple groups of second feature sequences;
modeling the second feature sequence, outputting multiple groups of Chinese character sequences and scoring the multiple groups of Chinese character sequences;
constructing an online decoder based on a self-attention mechanism, wherein the decoder models the second feature sequence and scores the output multiple groups of Chinese character sequences;
the decoder is a stack of 6 identical sub-modules, wherein each sub-module is, in order, a self-attention network layer, a residual network layer, a layer normalization layer, a truncated attention network layer, a residual network layer, a layer normalization layer, a fully connected network layer, a residual network layer and a layer normalization layer;
the truncated attention network processes the output features in the second feature sequence one by one, determines a truncation point, truncates the second feature sequence into a plurality of subsequences to calculate output features, and sends the output features to the other sub-modules for processing to obtain final output features;
inputting the final output features into a Chinese character classifier for scoring;
and taking the Chinese character sequence with the highest score as the final transcription result.
2. The method of claim 1, wherein the encoder is an on-line encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
3. The method of claim 1, wherein the decoder modeling the second feature sequence and scoring the output multiple groups of Chinese character sequences comprises:
sequentially passing multiple groups of second feature sequences through the 6 sub-modules of the decoder, and inputting the output features of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs multiple groups of Chinese characters and a score corresponding to each group of Chinese characters;
and the top ten Chinese characters are input back into the decoder to output the next Chinese character, until the decoder outputs the terminator.
4. An online end-to-end voice transcription system, comprising:
an acquisition unit: used for acquiring audio and extracting acoustic features from the audio, comprising:
extracting logarithmic Mel spectral features from the acquired audio file as frame-level acoustic features;
a processing unit: used for performing nonlinear transformation and down-sampling on the acoustic features extracted by the acquisition unit and outputting a first feature sequence, and for partitioning the first feature sequence, sequentially inputting each block of the feature sequence into an encoder and outputting multiple groups of second feature sequences;
the processing unit is further used for modeling the second feature sequence, outputting multiple groups of Chinese character sequences and scoring the multiple groups of Chinese character sequences;
the processing unit constructs an online decoder based on a self-attention mechanism, and the decoder models the second feature sequence and scores the output multiple groups of Chinese character sequences;
the decoder is formed by stacking 5 identical sub-modules, wherein each sub-module is composed of a self-attention network, a residual error network, a layer of normalized network, a layer of truncated attention network, a layer of residual error network, a layer of normalized network, a layer of fully-connected network, a layer of residual error network and a layer of normalized network;
the truncated attention network processes the output features in the second feature sequence one by one, determines a truncation point, truncates the second feature sequence into a plurality of subsequences to calculate output features, and sends the output features to the other sub-modules for processing to obtain final output features;
inputting the final output features into a Chinese character classifier for scoring;
an output unit: and the Chinese character sequence with the highest score in the Chinese character sequences output by the processing unit is used as a final transcription result and is output.
5. The system of claim 4, wherein the encoder is an on-line encoder based on a self-attention mechanism;
the encoder is composed of 12 identical sub-modules in a stacked mode, and each sub-module is composed of a self-attention network, a residual error network, a layer normalization network, a full-connection network, a residual error network and a layer normalization network in a stacked mode.
6. The system of claim 4, wherein the decoder modeling the second feature sequence and scoring the output multiple groups of Chinese character sequences comprises:
sequentially passing multiple groups of second feature sequences through the 6 sub-modules of the decoder, and inputting the output features of the layer normalization network of the last sub-module into the Chinese character classifier;
the Chinese character classifier outputs multiple groups of Chinese characters and a score corresponding to each group of Chinese characters;
and the top ten Chinese characters are input back into the decoder to output the next Chinese character, until the decoder outputs the terminator.
CN201911415035.0A 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system Active CN111128191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911415035.0A CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911415035.0A CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Publications (2)

Publication Number Publication Date
CN111128191A CN111128191A (en) 2020-05-08
CN111128191B true CN111128191B (en) 2023-03-28

Family

ID=70506638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911415035.0A Active CN111128191B (en) 2019-12-31 2019-12-31 Online end-to-end voice transcription method and system

Country Status (1)

Country Link
CN (1) CN111128191B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833886B (en) * 2020-07-27 2021-03-23 中国科学院声学研究所 Fully-connected multi-scale residual error network and voiceprint recognition method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018071389A1 (en) * 2016-10-10 2018-04-19 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493812A (en) * 2009-03-06 2009-07-29 中国科学院软件研究所 Tone-character conversion method
CN107590135A (en) * 2016-07-07 2018-01-16 三星电子株式会社 Automatic translating method, equipment and system
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Attention Is All You Need; Vaswani, Ashish et al.; 31st Conference on Neural Information Processing Systems; 2017-12-09; pages 1-11 *

Also Published As

Publication number Publication date
CN111128191A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
US11948066B2 (en) Processing sequences using convolutional neural networks
CN111429889B (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN111477221B (en) Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111048082B (en) Improved end-to-end speech recognition method
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN111933129A (en) Audio processing method, language model training method and device and computer equipment
CN113205817B (en) Speech semantic recognition method, system, device and medium
EP4018437B1 (en) Optimizing a keyword spotting system
CN111339278B (en) Method and device for generating training speech generating model and method and device for generating answer speech
CN110277088B (en) Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
CN111916058A (en) Voice recognition method and system based on incremental word graph re-scoring
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN111798840A (en) Voice keyword recognition method and device
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
CN113646835A (en) Joint automatic speech recognition and speaker binarization
CN112184859A (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111128191B (en) Online end-to-end voice transcription method and system
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113327585B (en) Automatic voice recognition method based on deep neural network
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN113160828A (en) Intelligent auxiliary robot interaction method and system, electronic equipment and storage medium
CN111583902A (en) Speech synthesis system, method, electronic device, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant