Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of this disclosure without inventive effort fall within the scope of this disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
FIG. 1 illustrates a schematic diagram of an exemplary operating environment 100 in which embodiments of the present disclosure can be implemented. In the operating environment 100, a terminal device 102, a network 104, and a server 106 are included.
It should be understood that the number of terminal devices, networks, and servers in FIG. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation.
The terminal device 102 includes, but is not limited to, a cell phone, a tablet computer, a notebook computer, a desktop computer, and the like. A user may interact with the server 106 through the network 104 using the terminal device 102 to receive or send messages, etc.
The network 104 is used to provide a communication link between the terminal device 102 and the server 106. The network 104 may be of various connection types, such as a wired communication link, a wireless communication link, and so forth.
The server 106 may be a server that provides voice recognition services. The server 106 may be a single physical server, a server cluster of multiple servers, a cloud server, or the like.
For example, the user collects the voice signal to be recognized using the terminal device 102 and uploads it to the server 106 through the network 104. After receiving the current voice signal to be recognized, the server 106 may extract feature information of the signal; input the feature information into a Deep-FSMN model to obtain an output sequence representing the probability of each phoneme; process the output sequence through a CTC model to obtain a corresponding phoneme sequence; and then convert the phoneme sequence into a final text sequence through a language model as the recognition result. The server 106 may feed the recognition result back to the terminal device 102; after receiving the returned recognition result, the terminal device 102 may display it for the user to view.
It should be noted that, the scheme provided by the embodiment of the present disclosure is not limited to application to the above-mentioned speech recognition application scenario, and may also be extended to any other application scenario.
It should be noted that, the voice recognition method provided in the embodiments of the present disclosure may be executed by the server 106, and accordingly, the voice recognition device may be disposed in the server 106. However, in other embodiments of the present disclosure, the terminal may also have a similar function as the server, thereby performing the voice recognition scheme provided by the embodiments of the present disclosure.
Fig. 2 shows a flowchart of a speech recognition method 200 according to an embodiment of the present disclosure. The method 200 may be performed by the server 106 in fig. 1.
Step 210, obtaining a voice characteristic sequence of a voice signal to be recognized;
in some embodiments, the terminal device 102 collects the voice signal to be recognized by detecting voice with a VAD (Voice Activity Detection) algorithm and uploads the voice signal to be recognized to the server 106 through the network 104. In the embodiments of the disclosure, each batch obtained by VAD detection is 100 ms, and streaming recognition is performed in units of one batch. The server 106 obtains the feature information of the current voice signal to be recognized by framing and windowing the voice signal and extracting voice features.
In some embodiments, the extracted speech features may be log power spectrum or MFCC (Mel-Frequency Cepstral Coefficient) features.
In some embodiments, the speech signal to be recognized is first framed and windowed; an FFT (Fast Fourier Transform) is then applied to each frame, the discrete power spectrum after the FFT is determined, and the logarithm of the discrete power spectrum is taken to obtain the log power spectrum, which serves as the speech feature.
In some embodiments, the speech signal to be recognized is framed and windowed; an FFT is applied to each frame and the discrete power spectrum is determined; the power spectrum is passed through a Mel-frequency filter bank and the logarithm is taken; and a discrete cosine transform is applied to obtain the cepstral coefficients, from which differential coefficients are also computed. The cepstral coefficients and differential coefficients together form the MFCC features. With MFCC feature extraction, 13 feature values are extracted per frame of speech data by default.
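As a rough illustration of the framing, windowing, and FFT steps described above, the following numpy sketch computes a log power spectrum. The frame length, hop, and FFT size are illustrative assumptions, not values mandated by the embodiments:

```python
import numpy as np

def log_power_spectrum(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the signal, apply a Hamming window, take the FFT of each
    frame, and return the log of the discrete power spectrum."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.fft.rfft(frame, n=n_fft)      # FFT of one frame
        power = np.abs(spectrum) ** 2 / n_fft       # discrete power spectrum
        frames.append(np.log(power + 1e-10))        # log power spectrum
    return np.array(frames)

# 16 kHz signal: 400-sample frames (25 ms) with a 160-sample hop (10 ms)
sig = np.random.randn(16000)
feats = log_power_spectrum(sig)
print(feats.shape)  # (98, 257): 98 frames, n_fft/2 + 1 frequency bins
```

The full MFCC pipeline would additionally pass each power spectrum through a Mel filter bank, take the logarithm, and apply a discrete cosine transform.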
In some embodiments, the window length is selected in the windowing process taking the pitch period of the speech signal into account. It is generally considered that a speech frame should contain 1 to 7 pitch periods. However, the pitch period varies greatly, from 2 ms for females and children to 14 ms for older males (i.e., the pitch frequency varies from 500 Hz down to 70 Hz), so the selection of the window length N is difficult. Typically, at a sampling frequency of 8 kHz, N is chosen as a compromise between 80 and 160 points, i.e., a duration of 10 to 20 ms. With conventional acoustic models, the duration of each frame of speech is typically 10 ms. In the disclosed embodiments, a Low Frame Rate (LFR) modeling scheme is employed: speech frames at adjacent moments are bound together as input, and an average output target is predicted from the target outputs of these frames. Three-frame splicing may be employed without losing model performance. Through the LFR scheme, the input and output can be reduced to one third of the original, greatly improving the efficiency of acoustic-score computation and decoding during speech recognition.
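The three-frame splicing of the LFR scheme can be sketched as follows; the feature dimensions and tail-padding policy are illustrative assumptions:

```python
import numpy as np

def lfr_stack(features, m=3, n=3):
    """Low Frame Rate: splice m adjacent frames into one super-frame and
    advance by n frames, reducing the sequence length to roughly 1/n."""
    T, D = features.shape
    out = []
    for t in range(0, T, n):
        chunk = features[t:t + m]
        if len(chunk) < m:   # pad the tail by repeating the last frame
            pad = np.repeat(chunk[-1:], m - len(chunk), axis=0)
            chunk = np.concatenate([chunk, pad], axis=0)
        out.append(chunk.reshape(-1))
    return np.array(out)

feats = np.random.randn(100, 80)   # 100 frames of 80-dim features
stacked = lfr_stack(feats)
print(stacked.shape)               # (34, 240): one third the frames, 3x the dim
```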
Step 220, inputting the voice characteristic sequence into a Deep-FSMN model which is trained in advance to obtain an output sequence representing the probability of each phoneme;
in some embodiments, as shown in FIG. 3, the Deep-FSMN model includes an Input Layer, a plurality of Deep-FSMN Layers, and an Output Layer; each Deep-FSMN Layer includes a Linear Layer, an activation function layer (ReLU Layer), and a Memory Block. Each Deep-FSMN layer contains 2048 hidden nodes and a 512-dimensional projection.
The number of Deep-FSMN layers may be any number, such as 3, 6, or 10.
The memory module is used to store historical and future information useful for judging the current speech frame. A skip connection is added between the memory modules, so that the historical information of a lower-layer memory module can flow directly into a higher-layer memory module. During back propagation, the gradient of the higher-layer memory module can likewise flow directly into the lower-layer memory module, which mitigates the vanishing gradient problem. Since the memory modules of the Deep-FSMN layers have the same dimension, the skip connection can be realized by an identity transformation.
In some embodiments, the parameters of the memory module include an order and a step size. The order represents the number of historical or future frames the memory module extracts, and may be 5, 10, 15, 20, etc.; the step size indicates how many adjacent frames are skipped when the memory module extracts information from historical or future frames, and may be 1 or 2.
Specifically, the memory module computes a dimension-wise weighted sum over a number of frames preceding and following the current frame, together with the low-dimensional projection of the current frame:

p̃_t^l = H(p̃_t^(l-1)) + p_t^l + Σ_{i=0}^{N1} a_i^l ⊙ p^l_{t-s1·i} + Σ_{j=1}^{N2} c_j^l ⊙ p^l_{t+s2·j}

where p̃_t^(l-1) denotes the output of the memory module of layer l-1 at time t, and H(·) is the skip-connection transform between memory modules.

s1 and s2 respectively denote the encoding strides for historical and future moments; for example, s1 = 2 means that one adjacent frame is skipped when the historical information is encoded, i.e., a value is taken as input every other time step.

N1 and N2 respectively denote the numbers of historical and future frames extracted by the memory module, i.e., the orders.
When the input speech feature sequence is short, or the prediction-delay requirement is strict, a smaller memory-module future order N2 can be used, in which case only information from frames near the current frame is used to predict the output of the current frame. If the input speech feature sequence is long, or in scenarios where prediction delay is less important, a larger future order N2 may be used, so that long-range information in the speech feature sequence can be effectively utilized and modeled, improving the performance of the Deep-FSMN model.
In some embodiments, for the last batch identified by the VAD, the number of future frames to extract may be set to zero (N2 = 0) to reduce the delay.
a_i^l and c_j^l respectively denote the weights the memory module uses when computing the dimension-wise weighted sum over historical and future frames.
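A minimal numpy sketch of the memory module's dimension-wise weighted sum follows. It implements only the history/future summation terms (the inter-layer skip connection and linear projection are omitted), and all dimensions, orders, and strides are illustrative:

```python
import numpy as np

def memory_block(p, t, a, c, s1=1, s2=1):
    """Dimension-wise weighted sum of historical and future frames around
    the current frame p[t].  p: (T, D) projected features; a: (N1+1, D)
    history weights (row i = 0 weights the current frame); c: (N2, D)
    future weights.  Strides s1/s2 skip adjacent frames; out-of-range
    frames are simply skipped."""
    T, D = p.shape
    out = np.zeros(D)
    for i in range(a.shape[0]):            # history side, stride s1
        if t - s1 * i >= 0:
            out += a[i] * p[t - s1 * i]
    for j in range(1, c.shape[0] + 1):     # future side, stride s2
        if t + s2 * j < T:
            out += c[j - 1] * p[t + s2 * j]
    return out

T, D, N1, N2 = 20, 8, 5, 2
p = np.random.randn(T, D)
a = np.random.randn(N1 + 1, D)             # order N1 = 5 historical frames
c = np.random.randn(N2, D)                 # order N2 = 2 future frames
m = memory_block(p, 10, a, c, s1=2, s2=1)  # s1 = 2: every other history frame
print(m.shape)  # (8,)
```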
In some embodiments, the Deep-FSMN model adopts an attention mechanism: the features of each moment in the memory module are multiplied with the features of other moments to compute the correlation between frames; frames with high correlation are attended to, and the weights of highly correlated frames are increased.
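The frame-correlation weighting described above can be sketched as a single-head self-attention step; learned projections are omitted and the frame counts are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frames):
    """Multiply each frame's features with those of every other frame to
    score pairwise correlation, then re-weight frames so that highly
    correlated frames contribute more to the output."""
    scores = frames @ frames.T / np.sqrt(frames.shape[1])  # (T, T) correlations
    weights = softmax(scores, axis=1)                      # each row sums to 1
    return weights @ frames                                # re-weighted frames

frames = np.random.randn(10, 16)   # 10 memory-module frames, 16-dim features
out = self_attention(frames)
print(out.shape)  # (10, 16)
```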
In some embodiments, the output sequence representing the probability of each phoneme gives, for each frame, the probability over a preset classification; for example, if the number of phoneme categories is 26 (26 letters), the preset classification has 28 classes, comprising the 26 phonemes plus the blank and no-label classes.
In some embodiments, the Deep-FSMN model is trained by the standard back-propagation (BP) algorithm using the LibriSpeech dataset, a free open-source English dataset of about one thousand hours.
In some embodiments, a teacher-student framework is employed:
training a "non-streaming" large model using manually labeled small-scale speech sample data, with large look-back and look-ahead orders (i.e., N1 and N2 set to large values);
inputting large-scale speech samples that are not manually labeled into the trained "non-streaming" large model, and taking the output result as the label of each speech sample, where the label is the probability of a phoneme over the preset classification;
and training a "streaming" small model using the labels and the corresponding speech data, where the number of frames the streaming small model references is preset to be smaller.
"Non-streaming" simply means that the recognition result is returned after the user has spoken the whole sentence, while in "streaming" mode the recognition result is returned while the user speaks.
In some embodiments, 60 frames of historical context and 60 frames of future context (600 milliseconds at 10 ms per frame) are taken as the upper limit of the context length required for acoustic modeling. That is, N1 and N2 are set to 60 and a "non-streaming" large model is trained; then N1 is set to 15 and N2 is set to 5, and a "streaming" small model is trained.
In some embodiments, the depth of the "non-streaming" large model is large, e.g., 10 layers, while the depth of the "streaming" small model is small, e.g., 6 or 3 layers.
The "streaming" small model trained by this method has a small size (on the order of megabytes) and improved prediction speed while losing little precision, so the model can be distributed to clients.
Step 230, inputting the output sequence into a pre-trained CTC (Connectionist Temporal Classification) model to obtain a corresponding phoneme sequence;
in some embodiments, assuming that an M-frame speech signal generates feature information of M frames, where M is a positive integer greater than or equal to 1, the feature information is processed by the Deep-FSMN model to obtain M speech feature vectors, and the M speech feature vectors are processed by the CTC model to obtain N pronunciation units, where N is a positive integer less than or equal to M and greater than or equal to 1.
CTC mainly solves the correspondence problem between the labeling sequence and the input sequence in traditional neural network models. A blank symbol is added to the label symbol set, and the network is used for labeling: when no effective output can be determined, the blank symbol is output; when an effective unit (an output unit other than the one corresponding to the blank symbol) can be sufficiently determined, an effective symbol (a symbol other than blank) is output. In this way, the peak (spike) positions of the effective symbols in the label can be obtained from the CTC output.
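The collapse from a per-frame output path (with blanks between spikes) to a phoneme sequence is CTC's B mapping: merge consecutive repeats, then remove blanks. A minimal sketch, with an illustrative symbol alphabet:

```python
def ctc_collapse(path, blank='-'):
    """CTC's B mapping: merge consecutive repeated symbols, then remove
    blanks.  The blank lets the model output 'no decision yet' between
    the spikes of effective symbols."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# Per-frame outputs with blanks between spikes collapse to the phoneme sequence;
# the blank between the two 'l' runs preserves the doubled phoneme.
print(ctc_collapse(['-', 'h', 'h', '-', 'e', '-', '-', 'l', 'l', '-', 'l', 'o']))
# ['h', 'e', 'l', 'l', 'o']
```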
Compared with traditional acoustic model training, training with CTC as the loss function is completely end-to-end: only one input sequence and one output sequence are needed, without aligning the data in advance. Therefore, data alignment and one-by-one labeling are unnecessary, and CTC directly outputs sequence-prediction probabilities without external post-processing.
In some embodiments, the gradients of the loss function with respect to the (unnormalized) output probabilities are calculated, and the CTC model is trained by the back-propagation algorithm.
Step 240, inputting the phoneme sequence into a language model, and converting the phoneme sequence into a final text sequence as a recognition result.
In some embodiments, the language model is an RNN-LM; the phoneme sequence output by the CTC model is processed by the RNN-LM to obtain the speech recognition result of the speech to be recognized.
In some embodiments, the RNN-LM is an RNN (Recurrent Neural Network)-based language model obtained by training on speech training samples, e.g., by taking the characters output by the speech recognition network model for a speech training sample as input and the text content of the sample as the target. In specific applications, the RNN-LM can utilize the speech training samples more effectively than conventional language models.
According to the embodiment of the disclosure, the following technical effects are achieved:
1. improving a memory module of the FSMN model by adopting a self-attention mechanism to obtain an improved Deep-FSMN model;
2. the improved Deep-FSMN model performance is improved;
3. by setting parameters, the time delay is reduced;
4. model training is carried out by adopting a teacher-student framework, so that the model is simplified, and the training cost is reduced;
5. the Deep-FSMN model and the CTC model are adopted, so that the amount of computation is reduced and the recognition effect is improved.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
Fig. 4 shows a block diagram of a speech recognition apparatus 400 according to an embodiment of the disclosure. The apparatus 400 may be included in the server 106 of fig. 1 or implemented as the server 106. As shown in fig. 4, the apparatus 400 includes:
a voice feature sequence obtaining module 410, configured to obtain a voice feature sequence of a voice signal to be recognized currently;
a voice feature sequence processing module 420, configured to input the voice feature sequence into a Deep-FSMN model obtained by training in advance, to obtain an output sequence representing probabilities of each phoneme;
the peak position obtaining module 430 is configured to input the output sequence into a pre-trained CTC model to obtain a corresponding phoneme sequence;
the recognition result obtaining module 440 is configured to input the phoneme sequence into a language model, and convert the phoneme sequence into a final text sequence as a recognition result.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
Fig. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present disclosure. The device 500 may be used to implement the server 106 of fig. 1. As shown, the device 500 includes a Central Processing Unit (CPU) 501 that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit 501 performs the various methods and processes described above, such as method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by CPU 501, one or more steps of method 200 described above may be performed. Alternatively, in other embodiments, CPU 501 may be configured to perform method 200 by any other suitable means (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), etc.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, the present disclosure provides a voice recognition method, including: acquiring a voice feature sequence of a current voice signal to be recognized; inputting the voice feature sequence into a Deep-FSMN model trained in advance to obtain an output sequence representing the probability of each phoneme; inputting the output sequence into a pre-trained CTC model to obtain a corresponding phoneme sequence; and inputting the phoneme sequence into a language model, and converting the phoneme sequence into a final text sequence as the recognition result.
According to one or more embodiments of the present disclosure, in the voice recognition method provided by the present disclosure, the voice signal to be recognized is a voice signal collected by detecting voice by using a VAD algorithm, and is in batch.
According to one or more embodiments of the present disclosure, in the voice recognition method provided by the present disclosure, the voice feature sequence is a log power spectrum or mel-frequency cepstrum coefficient feature.
According to one or more embodiments of the present disclosure, in the voice recognition method provided by the present disclosure, the Deep-FSMN model includes an input layer, N Deep-FSMN layers, and an output layer, where N is a positive integer greater than or equal to 1; each Deep-FSMN layer comprises a linear function layer, an activation function layer and a memory module; jump connection is added between the memory modules; the Deep-FSMN model employs an attention mechanism.
According to one or more embodiments of the present disclosure, in the voice recognition method provided by the present disclosure, the Deep-FSMN model is obtained through training with a teacher-student framework, including: training a "non-streaming" large model using manually labeled small-scale speech sample data; inputting large-scale speech samples that are not manually labeled into the trained "non-streaming" large model, and taking the output result as the speech sample label; and training a "streaming" small model using the labels and the corresponding speech data.
According to one or more embodiments of the present disclosure, in the voice recognition method provided by the present disclosure, the depth of the "non-streaming" large model is greater than the depth of the "streaming" small model; and the order of the memory module of the "non-streaming" large model is greater than the order of the memory module of the "streaming" small model.
According to one or more embodiments of the present disclosure, in the voice recognition method provided by the present disclosure, for the last batch recognized by the VAD, the order of the extracted future time frame of the memory module is set to zero.
In accordance with one or more embodiments of the present disclosure, in the speech recognition method provided by the present disclosure, the input and output of the Deep-FSMN model adopts a low frame rate LFR modeling scheme.
According to one or more embodiments of the present disclosure, there is provided a voice recognition apparatus, including: a voice feature sequence acquisition module, configured to acquire a voice feature sequence of a current voice signal to be recognized; a voice feature sequence processing module, configured to input the voice feature sequence into a Deep-FSMN model trained in advance to obtain an output sequence representing the probability of each phoneme; a peak position obtaining module, configured to input the output sequence into a pre-trained CTC model to obtain a corresponding phoneme sequence; and a recognition result obtaining module, configured to input the phoneme sequence into a language model and convert the phoneme sequence into a final text sequence as the recognition result.
According to one or more embodiments of the present disclosure, in the voice recognition device provided by the present disclosure, the voice signal to be recognized is a voice signal collected by detecting voice by using a VAD algorithm, and is in batch.
In accordance with one or more embodiments of the present disclosure, in the voice recognition apparatus provided by the present disclosure, the voice feature sequence is a log power spectrum or mel-frequency cepstrum coefficient feature.
According to one or more embodiments of the present disclosure, in the voice recognition apparatus provided by the present disclosure, the Deep-FSMN model includes an input layer, N Deep-FSMN layers, and an output layer, where N is a positive integer greater than or equal to 1; each Deep-FSMN layer comprises a linear function layer, an activation function layer and a memory module; jump connection is added between the memory modules; the Deep-FSMN model employs an attention mechanism.
According to one or more embodiments of the present disclosure, in the voice recognition apparatus provided by the present disclosure, the Deep-FSMN model is obtained through training with a teacher-student framework, including: training a "non-streaming" large model using manually labeled small-scale speech sample data; inputting large-scale speech samples that are not manually labeled into the trained "non-streaming" large model, and taking the output result as the speech sample label; and training a "streaming" small model using the labels and the corresponding speech data.
According to one or more embodiments of the present disclosure, in the voice recognition apparatus provided by the present disclosure, the depth of the "non-streaming" large model is greater than the depth of the "streaming" small model; and the order of the memory module of the "non-streaming" large model is greater than the order of the memory module of the "streaming" small model.
According to one or more embodiments of the present disclosure, in the voice recognition apparatus provided by the present disclosure, for the last batch recognized by the VAD, the order of the extracted future time frame of the memory module is set to zero.
In accordance with one or more embodiments of the present disclosure, in the voice recognition apparatus provided by the present disclosure, the input/output of the Deep-FSMN model adopts a low frame rate LFR modeling scheme.
Moreover, although operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.