CN112951213A - End-to-end online voice detection and recognition method, system and equipment


Info

Publication number
CN112951213A
Authority
CN
China
Prior art keywords
voice
recognition
data
model
detection
Prior art date
Legal status
Granted
Application number
CN202110175961.6A
Other languages
Chinese (zh)
Other versions
CN112951213B (en)
Inventor
周世玉 (Zhou Shiyu)
徐波 (Xu Bo)
李蒙 (Li Meng)
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
2021-02-09
Filing date
2021-02-09
Publication date
2021-06-11
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110175961.6A
Publication of CN112951213A
Application granted
Publication of CN112951213B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal


Abstract

The invention belongs to the field of voice detection and recognition, and particularly relates to an end-to-end online voice detection and recognition method, system and device, aiming at solving the problems that existing online voice recognition technology must train and deploy multiple models, computes inefficiently, requires complex deployment and parameter-tuning procedures, and depends heavily on labeled sample data. The invention comprises the following steps: obtaining a pre-trained wav2vec2.0 model through self-supervised training on unlabeled voice data; fine-tuning the model in two stages, training it on multitask voice data to obtain a multitask model for voice detection and recognition; and, for online audio data, performing blocking and edge splicing, then obtaining real-time voice recognition text through online recognition and edge elimination by the multitask model. The method depends little on labeled data, has few model parameters and a simple structure, reduces computation through joint modeling, can be used in low-resource scenarios with high real-time requirements, and recognizes accurately with high precision.

Description

End-to-end online voice detection and recognition method, system and equipment
Technical Field
The invention belongs to the field of voice detection and recognition, and particularly relates to an end-to-end online voice detection and recognition method, system and device.
Background
With the spread of smart devices and other innovative applications, speech recognition has become an important entry point for human-computer interaction and is now widely applied in scenarios such as voice input, voice search, speech translation, and smart homes. Some of these scenarios, such as voice control and meeting minutes, place high demands on real-time online recognition.
At present, the mainstream online speech recognition approach trains and deploys both a speech detection model and a speech recognition model: the detection model first locates the speech portions of the audio signal, and the recognition model then performs online recognition on those portions. The disadvantages are: the two models must be trained, deployed, and cascaded, so the accuracy of the speech detection model strongly affects the performance of the downstream speech recognition model, and in practical applications the detection model's parameters must be tuned dynamically for each scenario to keep both models performing at their best. In addition, prior-art model training requires a large amount of labeled sample data, depends heavily on those labels, and leaves the robustness and generalization of the models in need of improvement.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides an end-to-end online voice detection and recognition method. It integrates the voice detection model and the voice recognition model into a single model, simplifying training and deployment. Multi-task learning jointly optimizes the voice detection module and the voice recognition module inside this unified model, using the complementarity between the tasks to improve the performance of each and to strengthen the robustness of the model. The method makes full use of massive unlabeled audio data: self-supervised learning trains the pre-trained model, which both reduces the supervised stage's dependence on labeled data and increases its robustness. The method therefore suits building online voice recognition systems for low-resource languages, where a system can be set up quickly with only a small amount of labeled data. The end-to-end online voice detection and recognition method comprises the following steps:
step A10, based on the acquired unlabeled voice data, performing self-supervised training with a wav2vec2.0 model to obtain a pre-trained wav2vec2.0 model;
step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain a voice recognition single-task model;
step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model;
step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model;
step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain real-time voice recognition text.
In some preferred embodiments, the trained multi-task model for speech detection and recognition is trained by:
step B10, for each group of data in the multitask training data, respectively obtaining the predicted output of the corresponding task of each group of data through the voice detection and recognition multitask model; the multitask training data comprises training data of voice detection and voice recognition which are respectively obtained;
step B20, respectively calculating loss values between the predicted output and the true value of the corresponding tasks of each group of data, and obtaining the loss value weight corresponding to each task through model self-learning;
step B30, weighting each loss value by the loss value weight corresponding to each task to obtain the multitask loss value of the multitask model for voice detection and recognition;
step B40, updating the parameters of the voice detection and recognition multitask model along the gradient descent direction of the multitask loss value, and returning to step B10 for iterative training until the set end-of-training condition is reached, obtaining the voice detection and recognition multitask model.
In some preferred embodiments, the multitask training data during training and the audio data during online application are both preprocessed by edge splicing before entering the voice detection and recognition multitask model, as follows:
each piece of the multitask training data is blocked at a random size and/or each piece of the online application audio data is blocked at a preset fixed size, and set audio of a set length is spliced onto the edges of each audio block, yielding the preprocessed multitask training data and/or the preprocessed online application audio data.
In some preferred embodiments, after passing through the voice detection and recognition multitask model, the preprocessed multitask training data and/or the preprocessed online application audio data are post-processed by eliminating the spliced results, as follows:
the preprocessed multitask training data and/or the preprocessed online application audio data are input into the voice detection and recognition multitask model, and the logits output of the spliced audio is removed from the model output, leaving the logits output of the multitask training data and/or the online application audio data.
In some preferred embodiments, a model distillation step is further provided between step A30 and step A40, as follows:
step A40a, using the voice detection and recognition multitask model as the teacher network, and constructing a voice detection and recognition multitask model with fewer parameters and network layers than the teacher network as the student network; the teacher network has M layers and the student network has N layers, with M greater than N;
step A40b, obtaining the distilled student voice detection and recognition multitask model by letting the m-th layer of the student network learn the output distribution of the n-th layer of the teacher network; wherein 1 ≤ n ≤ M, 1 ≤ m ≤ N, m = n/k, and k is a positive integer representing the layer-count relationship between the teacher network and the student network.
In some preferred embodiments, when the m-th layer of the student network learns the output distribution of the n-th layer of the teacher network in step A40b, the losses of the learning process include an attention matrix loss, a hidden layer output loss and a final output layer loss;
the attention matrix loss is the MSE loss with which each attention head of the student network learns the attention score matrix of the corresponding attention head of the teacher network;
the hidden layer output loss is the MSE loss with which each Transformer hidden layer output of the student network learns the corresponding hidden layer output of the teacher network;
the final output layer loss is the cross-entropy loss with which the final output of the student network learns that of the teacher network.
In another aspect of the present invention, an end-to-end online voice detection and recognition system is provided, which includes:
the voice data acquisition module is used for acquiring and inputting a user voice data stream as the data to be detected and recognized;
the data caching unit comprises a voice detection caching module and a voice recognition caching module and is used for caching data to be detected and recognized on line;
the voice preprocessing unit is used for carrying out voice format normalization, denoising and dereverberation preprocessing operations on the data in the voice detection cache module;
the voice detection and recognition unit comprises a voice activation detection module, a state judgment module and a voice recognition module; the voice activation detection module calculates the voice detection classification probability of each frame of preprocessed data in the voice detection cache module; the state judgment module judges the state of each frame according to that probability: if the state is a voice start point, data begin to be fed into the voice recognition module; if the voice recognition block boundary state is activated, voice recognition is performed and the corresponding voice recognition cache is cleared; and if the state is a voice end point, the corresponding voice recognition cache is cleared and the real-time voice recognition text is output;
and the recognition result display module is used for displaying the real-time voice recognition text.
In some preferred embodiments, the state determination module has a state determination method that:
step C10, setting a threshold T on the voice detection classification probability: if the voice detection classification probability of the current frame is greater than T, the current frame is speech; otherwise the current frame is silence;
step C20, setting a voice activation threshold B and a voice ending threshold E: if the length of a run of consecutive speech frames is greater than B, the state is speech; if the length of a run of consecutive silence frames is greater than E, the state is non-speech;
step C30, if the state changes from non-speech to speech, the frame where the state becomes speech is the voice start point; if the state changes from speech to non-speech, the frame where the state was speech is the voice end point; when the accumulated data length in the voice recognition block reaches L, the voice recognition block boundary state is activated; wherein L is the set voice recognition boundary activation length.
In a third aspect of the present invention, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor for execution by the processor to implement the end-to-end online voice detection and recognition method described above.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions for being executed by the computer to implement the end-to-end online voice detection and recognition method described above.
The invention has the beneficial effects that:
(1) The end-to-end online voice detection and recognition method of the invention fuses the voice detection and voice recognition tasks, needs no additional voice detection model, fully utilizes the complementarity of the two tasks, improves the performance and robustness of each through unified modeling, and simplifies system deployment.
(2) The end-to-end online voice detection and recognition method uses the pre-trained wav2vec2.0 model for the voice detection task and achieves a strong detection effect, improving voice detection performance.
(3) The end-to-end online voice detection and recognition method caches data in cache blocks before detection and recognition; the cache block size can be adjusted flexibly, giving stable, high-performance recognition.
(4) The end-to-end online voice detection and recognition method uses the pre-trained wav2vec2.0 model for the voice recognition task, greatly reducing dependence on the amount of training data, so it can be applied quickly to low-resource voice recognition tasks.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of an end-to-end online voice detection and recognition method of the present invention;
FIG. 2 is a schematic diagram of the voice detection and recognition multitask joint modeling of the end-to-end online voice detection and recognition method of the present invention;
FIG. 3 is a schematic diagram of data block splicing according to an embodiment of the end-to-end online voice detection and recognition method of the present invention;
FIG. 4 is a schematic diagram of model distillation for one embodiment of the end-to-end online speech detection and recognition method of the present invention;
FIG. 5 is a system framework diagram of one embodiment of the end-to-end online speech detection and recognition method of the present invention;
FIG. 6 is a schematic diagram of online voice detection and decoding according to an embodiment of the end-to-end online voice detection and recognition method of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention provides an end-to-end online voice detection and recognition method based on joint end-to-end modeling of voice detection and recognition on top of a pre-trained model. It unifies voice recognition and voice detection in one model, converts continuous voice input into text output, and simplifies the deployment of online voice recognition. In addition, pre-training on a large amount of unlabeled data reduces dependence on the number of labeled samples, strengthens the robustness of the model, greatly reduces the voice detection task's dependence on model structure and parameters, and improves computational efficiency. In the modeling, the voice recognition task is trained and predicted with blocking and edge splicing, which supports different latency requirements in use and improves low-latency recognition performance.
The model of the invention is fine-tuned in two stages from a pre-trained model whose structure is wav2vec2.0: the first-stage fine-tuning trains a voice recognition single-task model, and the second-stage fine-tuning trains the voice detection and recognition multitask model.
The end-to-end online voice detection and recognition method disclosed by the invention comprises the following steps:
step A10, based on the acquired unlabeled voice data, performing self-supervised training with a wav2vec2.0 model to obtain a pre-trained wav2vec2.0 model;
step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain a voice recognition single-task model;
step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model;
step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model;
step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain real-time voice recognition text.
In order to more clearly describe the end-to-end online voice detection and recognition method of the present invention, the following describes the steps in the embodiment of the present invention in detail with reference to fig. 1 and 2.
The end-to-end online voice detection and recognition method of the first embodiment of the invention comprises the steps A10-A50, and the steps are described in detail as follows:
and step A10, performing self-supervision training by using a wav2vec2.0 model based on the obtained unlabeled voice data to obtain a pre-training wav2vec2.0 model.
The data selected for wav2vec2.0 self-supervised pre-training are massive unlabeled voice data: the wav2vec2.0 model is trained on large-scale unlabeled data with a contrastive loss, and the resulting feature representations can replace traditional acoustic features, so the model depends little on labeled sample data, computes efficiently, and runs in real time.
Step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain the voice recognition single-task model.
Fine-tuning from the pre-trained model saves a large amount of computing resources and time, improves computational efficiency, and can even improve accuracy.
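A minimal sketch of this stage-one fine-tuning, using the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, vocabulary size, and tensor shapes are illustrative assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechRecognitionSingleTask(nn.Module):
    """Pre-trained wav2vec2.0 encoder plus one fully-connected layer (step A20)."""
    def __init__(self, pretrained="facebook/wav2vec2-base", vocab_size=4000):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        self.head = nn.Linear(self.encoder.config.hidden_size, vocab_size + 1)  # +1: CTC blank

    def forward(self, waveform):                     # waveform: (batch, samples)
        frames = self.encoder(waveform).last_hidden_state
        return self.head(frames)                     # per-frame logits: (batch, frames, vocab+1)

model = SpeechRecognitionSingleTask()
ctc = nn.CTCLoss(blank=model.head.out_features - 1, zero_infinity=True)

waveform = torch.randn(2, 16000)                     # two one-second dummy utterances
log_probs = model(waveform).log_softmax(-1).transpose(0, 1)   # (frames, batch, vocab+1)
targets = torch.randint(0, 4000, (2, 12))            # dummy label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), log_probs.size(0)),
           target_lengths=torch.full((2,), 12))
loss.backward()                                      # one supervised CTC fine-tuning step
```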
Step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model.
The trained voice detection and recognition multitask model is obtained by the following training method:
step B10, for each group of data in the multitask training data, respectively obtaining the predicted output of the corresponding task of each group of data through the voice detection and recognition multitask model; the multitask training data comprises training data of voice detection and voice recognition which are respectively obtained;
step B20, respectively calculating loss values between the predicted output and the true value of the corresponding tasks of each group of data, and obtaining the loss value weight corresponding to each task through model self-learning;
the process of obtaining the weight of the loss value corresponding to each task through model self-learning may refer to a task uncertainty loss learning technology proposed by Kendall, Alex and the like, which is not described in detail herein.
Step B30, weighting each loss value by the loss value weight corresponding to each task to obtain the multitask loss value of the multitask model for voice detection and recognition;
step B40, updating the parameters of the voice detection and recognition multitask model along the gradient descent direction of the multitask loss value, and returning to step B10 for iterative training until the set end-of-training condition is reached, obtaining the voice detection and recognition multitask model.
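A sketch of the self-learned loss weighting of steps B20-B30, following the task-uncertainty formulation of Kendall et al. cited above; the weights are learned as log-variances, and the stand-in loss values are placeholders:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learns one log-variance s_i per task; total = sum_i exp(-s_i) * L_i + s_i."""
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, task_loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * task_loss + self.log_vars[i]
        return total

weighting = UncertaintyWeightedLoss(num_tasks=2)
detection_loss = torch.tensor(0.7, requires_grad=True)    # placeholder task losses
recognition_loss = torch.tensor(2.3, requires_grad=True)
multitask_loss = weighting([detection_loss, recognition_loss])
multitask_loss.backward()                                 # step B40: descend the combined gradient
```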
To support better streaming detection and recognition, a sliding-block edge-splicing method processes the multitask voice data used for model training during both training and prediction. Fig. 3 shows the data block splicing of one embodiment of the end-to-end online voice detection and recognition method: each section of audio in the acquired multitask voice data is blocked at a random size, and set audio of a set length is spliced onto the edges of each block, yielding the training data of the voice detection and recognition multitask model.
For the voice detection and recognition multitask model, the multitask training data during training and the audio data during online application are both preprocessed by edge splicing, as follows:
each piece of the multitask training data is blocked at a random size and/or each piece of the online application audio data is blocked at a preset fixed size, and set audio of a set length is spliced onto the edges of each audio block, yielding the preprocessed multitask training data and/or the preprocessed online application audio data.
After passing through the voice detection and recognition multitask model, the preprocessed multitask training data and/or the preprocessed online application audio data are post-processed by eliminating the spliced results, as follows:
the preprocessed multitask training data and/or the preprocessed online application audio data are input into the voice detection and recognition multitask model, and the logits output of the spliced audio is removed from the model output, leaving the logits output of the multitask training data and/or the online application audio data.
The audio data are blocked randomly in the training stage, while the block size is set independently in the prediction stage; set audio of a certain length is spliced onto the edges of each data block, the block at the beginning of a sample being spliced only on its right edge and the block at the end of the sample only on its left edge;
the edge-spliced data blocks are input into the model, where the outputs corresponding to the edge parts only provide additional context and do not participate in prediction;
the edge parts of each block's prediction result are eliminated, and the results are then merged.
Fig. 4 shows the model distillation of one embodiment of the end-to-end online voice detection and recognition method. A model distillation step is further provided between step A30 and step A40: Transformer network-layer distillation is performed on the wav2vec2.0 structure so that the student network learns the self-attention matrices and hidden layer outputs of the teacher network, and prediction-layer distillation makes the student network's prediction layer learn the output probability distribution of the teacher network's prediction layer, specifically:
step A40a, using the voice detection and recognition multitask model as the teacher network, and constructing a voice detection and recognition multitask model with fewer parameters and network layers than the teacher network as the student network; the teacher network has M layers and the student network has N layers, with M greater than N;
step A40b, obtaining the distilled student voice detection and recognition multitask model by letting the m-th layer of the student network learn the output distribution of the n-th layer of the teacher network; wherein 1 ≤ n ≤ M, 1 ≤ m ≤ N, m = n/k, and k is a positive integer representing the layer-count relationship between the teacher network and the student network. In one embodiment of the invention, k is 4. Here the layer-count relationship between teacher and student in distillation learning uses a multiple mapping; in other application scenarios other mappings may be chosen, for example an exponential mapping or another pre-specified mapping, which the invention does not detail here.
Distillation across every k layers: the m-th layer of the student network learns the n-th layer of the teacher network, and the losses of this learning process include the attention matrix loss and the hidden layer output loss:
attention matrix loss: each attention head of the student network learns, via MSE loss, the attention score matrix of the corresponding attention head of the teacher network;
hidden layer output loss: each Transformer hidden layer output of the student network learns, via MSE loss, the corresponding hidden layer output of the teacher network;
prediction-layer distillation: the final prediction layer of the student network learns the output probability distribution of the teacher network's prediction layer, with cross-entropy loss.
Step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model.
Step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain the real-time voice recognition text.
Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, in order to achieve the effect of the present embodiments, the steps may not be executed in such an order, and may be executed simultaneously (in parallel) or in an inverse order, and these simple variations are within the scope of the present invention.
As shown in fig. 5, an end-to-end online voice detection and recognition system according to a second embodiment of the present invention includes:
the voice data acquisition module is used for acquiring and inputting a user voice data stream as the data to be detected and recognized;
the data caching unit comprises a voice detection caching module and a voice recognition caching module and is used for caching data to be detected and recognized on line;
the voice preprocessing unit is used for carrying out voice format normalization, denoising and dereverberation preprocessing operations on the data in the voice detection cache module;
the voice detection and recognition unit comprises a voice activation detection module, a state judgment module and a voice recognition module; the voice activation detection module calculates the voice detection classification probability of each frame of preprocessed data in the voice detection cache module; the state judgment module judges the state of each frame according to that probability: if the state is a voice start point, data begin to be fed into the voice recognition module; if the voice recognition block boundary state is activated, voice recognition is performed and the corresponding voice recognition cache is cleared; and if the state is a voice end point, the corresponding voice recognition cache is cleared and the real-time voice recognition text is output;
and the recognition result display module is used for displaying the real-time voice recognition text.
As shown in fig. 6, the online voice detection and decoding of one embodiment of the end-to-end online voice detection and recognition method of the present invention proceeds as follows:
step one, voice signal data are input into the voice detection block for voice activation detection; the network structure used is the bottom CNN structure of wav2vec2.0;
step two, the detection results in the voice detection block are input into a state transition model, which judges the state of each frame (voice start point / voice recognition block boundary / voice end point); the detection result is the voice detection classification probability of each frame;
step three, operation proceeds according to the state obtained in step two. If the state is a voice start point, data begin to be placed into the voice recognition block; if the state is a voice recognition block boundary, the voice in the voice recognition block is recognized, the block is cleared, and data continue to be stored into it; if the state is a voice end point, the data in the current voice recognition block are recognized and cleared, and the recognition result of the complete utterance is recorded. The modeling units of the voice recognition are Chinese characters/letters, and the decoding method is beam search. The external language model may be an N-gram statistical language model or a neural-network language model.
The state judgment method of the state judgment module is as follows:
step C10, setting a threshold T on the voice detection classification probability: if the voice detection classification probability of the current frame is greater than T, the current frame is speech; otherwise the current frame is silence;
step C20, setting a voice activation threshold B and a voice ending threshold E: if the length of a run of consecutive speech frames is greater than B, the state is speech; if the length of a run of consecutive silence frames is greater than E, the state is non-speech;
step C30, if the state changes from non-speech to speech, the frame where the state becomes speech is the voice start point; if the state changes from speech to non-speech, the frame where the state was speech is the voice end point; when the accumulated data length in the voice recognition block reaches L, the voice recognition block boundary state is activated; wherein L is the set voice recognition boundary activation length.
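A sketch of this frame-level state machine; the thresholds T, B, E and the activation length L are deployment parameters, and the values below are placeholders only:

```python
T, B, E, L = 0.5, 5, 20, 480   # probability threshold, start/end run lengths, block length

class SpeechStateMachine:
    """Frame-level state judgment of steps C10-C30."""
    def __init__(self):
        self.in_speech = False
        self.speech_run = 0    # consecutive speech frames (step C20)
        self.silence_run = 0   # consecutive silence frames (step C20)
        self.block_len = 0     # frames accumulated in the recognition block

    def step(self, p_speech):
        """Returns 'start', 'boundary', 'end', or None for one frame."""
        is_speech = p_speech > T                        # step C10
        self.speech_run = self.speech_run + 1 if is_speech else 0
        self.silence_run = 0 if is_speech else self.silence_run + 1
        if not self.in_speech and self.speech_run > B:  # non-speech -> speech
            self.in_speech = True
            self.block_len = self.speech_run
            return "start"                              # voice start point
        if self.in_speech:
            self.block_len += 1
            if self.silence_run > E:                    # speech -> non-speech
                self.in_speech = False
                self.block_len = 0
                return "end"                            # recognize, clear, output text
            if self.block_len >= L:                     # accumulated length reaches L
                self.block_len = 0
                return "boundary"                       # recognize and clear the block
        return None
```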
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the end-to-end online voice detection and recognition system provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
An electronic apparatus according to a third embodiment of the present invention includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor for execution by the processor to implement the end-to-end online voice detection and recognition method described above.
A computer-readable storage medium of a fourth embodiment of the present invention stores computer instructions for being executed by the computer to implement the end-to-end online voice detection and recognition method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. An end-to-end online voice detection and recognition method, characterized in that the online voice detection and recognition method comprises:
step A10, based on the acquired unlabeled voice data, performing self-supervised training with a wav2vec2.0 model to obtain a pre-trained wav2vec2.0 model;
step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain a voice recognition single-task model;
step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model;
step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model;
step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain real-time voice recognition text.
2. The method for end-to-end online voice detection and recognition according to claim 1, wherein the trained multitask model for voice detection and recognition is trained by:
step B10, for each group of data in the multitask training data, respectively obtaining the predicted output of the corresponding task of each group of data through the voice detection and recognition multitask model; the multitask training data comprises training data of voice detection and voice recognition which are respectively obtained;
step B20, respectively calculating loss values between the predicted output and the true value of the corresponding tasks of each group of data, and obtaining the loss value weight corresponding to each task through model self-learning;
step B30, weighting each loss value by the loss value weight corresponding to each task to obtain the multitask loss value of the multitask model for voice detection and recognition;
step B40, updating the parameters of the voice detection and recognition multitask model along the gradient descent direction of the multitask loss value, and returning to step B10 for iterative training until the set end-of-training condition is reached, obtaining the voice detection and recognition multitask model.
3. The end-to-end online voice detection and recognition method according to claim 1 or 2, wherein the multitask training data during training and the audio data during online application are both preprocessed by edge splicing before entering the voice detection and recognition multitask model, as follows:
each piece of the multitask training data is blocked at a random size and/or each piece of the online application audio data is blocked at a preset fixed size, and set audio of a set length is spliced onto the edges of each audio block, yielding the preprocessed multitask training data and/or the preprocessed online application audio data.
4. The end-to-end online voice detection and recognition method of claim 3, wherein, after passing through the voice detection and recognition multitask model, the preprocessed multitask training data and/or the preprocessed online application audio data are post-processed by eliminating the spliced results, as follows:
the preprocessed multitask training data and/or the preprocessed online application audio data are input into the voice detection and recognition multitask model, and the logits output of the spliced audio is removed from the model output, leaving the logits output of the multitask training data and/or the online application audio data.
5. The end-to-end online voice detection and recognition method of claim 1, wherein a model distillation step is further provided between step A30 and step A40, the method comprising:
step A40a, using the voice detection and recognition multitask model as the teacher network, and constructing a voice detection and recognition multitask model with fewer parameters and network layers than the teacher network as the student network; the teacher network has M layers and the student network has N layers, with M greater than N;
step A40b, obtaining the distilled student voice detection and recognition multitask model by letting the m-th layer of the student network learn the output distribution of the n-th layer of the teacher network; wherein 1 ≤ n ≤ M, 1 ≤ m ≤ N, m = n/k, and k is a positive integer representing the layer-count relationship between the teacher network and the student network.
6. The end-to-end online voice detection and recognition method according to claim 5, wherein, when the m-th layer of the student network learns the output distribution of the n-th layer of the teacher network in step A40b, the losses of the learning process include an attention matrix loss, a hidden layer output loss and a final output layer loss;
the attention matrix loss is the MSE loss with which each attention head of the student network learns the attention score matrix of the corresponding attention head of the teacher network;
the hidden layer output loss is the MSE loss with which each Transformer hidden layer output of the student network learns the corresponding hidden layer output of the teacher network;
the final output layer loss is the cross-entropy loss with which the final output of the student network learns that of the teacher network.
7. An end-to-end online voice detection and recognition system, comprising:
the voice data acquisition module is used for acquiring and inputting a user voice data stream as the data to be detected and recognized;
the data caching unit comprises a voice detection caching module and a voice recognition caching module and is used for caching data to be detected and recognized on line;
the voice preprocessing unit is used for carrying out voice format normalization, denoising and dereverberation preprocessing operations on the data in the voice detection cache module;
the voice detection and recognition unit comprises a voice activation detection module, a state judgment module and a voice recognition module; the voice activation detection module calculates the voice detection classification probability of each frame of preprocessed data in the voice detection cache module; the state judgment module judges the state of each frame according to that probability: if the state is a voice start point, data begin to be fed into the voice recognition module; if the voice recognition block boundary state is activated, voice recognition is performed and the corresponding voice recognition cache is cleared; and if the state is a voice end point, the corresponding voice recognition cache is cleared and the real-time voice recognition text is output;
and the recognition result display module is used for displaying the real-time voice recognition text.
8. The system according to claim 7, wherein the status determining module determines the status by:
step C10, setting a threshold T on the voice detection classification probability: if the voice detection classification probability of the current frame is greater than T, the current frame is speech; otherwise the current frame is silence;
step C20, setting a voice activation threshold B and a voice ending threshold E: if the length of a run of consecutive speech frames is greater than B, the state is speech; if the length of a run of consecutive silence frames is greater than E, the state is non-speech;
step C30, if the state changes from non-speech to speech, the frame where the state becomes speech is the voice start point; if the state changes from speech to non-speech, the frame where the state was speech is the voice end point; when the accumulated data length in the voice recognition block reaches L, the voice recognition block boundary state is activated; wherein L is the set voice recognition boundary activation length.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor for execution by the processor to implement the end-to-end online voice detection and recognition method of any of claims 1-6.
10. A computer-readable storage medium storing computer instructions for execution by the computer to implement the end-to-end online voice detection and recognition method of any one of claims 1-6.
CN202110175961.6A, filed 2021-02-09 (priority 2021-02-09): End-to-end online voice detection and recognition method, system and equipment. Granted as CN112951213B; legal status Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110175961.6A 2021-02-09 2021-02-09 End-to-end online voice detection and recognition method, system and equipment (granted as CN112951213B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110175961.6A 2021-02-09 2021-02-09 End-to-end online voice detection and recognition method, system and equipment (granted as CN112951213B)

Publications (2)

Publication Number Publication Date
CN112951213A 2021-06-11
CN112951213B 2022-05-24

Family

ID=76244553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175961.6A Active CN112951213B (en) 2021-02-09 2021-02-09 End-to-end online voice detection and recognition method, system and equipment

Country Status (1)

Country Link
CN (1) CN112951213B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065021A (en) * 2018-10-18 2018-12-21 江苏师范大学 The end-to-end dialect identification method of confrontation network is generated based on condition depth convolution
WO2020146873A1 (en) * 2019-01-11 2020-07-16 Applications Technology (Apptek), Llc System and method for direct speech translation system
US20200372897A1 (en) * 2019-05-23 2020-11-26 Google Llc Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
CN110689879A (en) * 2019-10-10 2020-01-14 中国科学院自动化研究所 Method, system and device for training end-to-end voice transcription model
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU Jiawen (刘佳文) et al., "Transformer-based continuous speech recognition for Vietnamese", Journal of Information Engineering University *
LIU Juanhong (刘娟宏) et al., "End-to-end speech recognition with deep convolutional neural networks", Computer Applications and Software *
YANG Hongwu (杨鸿武) et al., "End-to-end Mandarin speech recognition based on an improved hybrid CTC/attention architecture", Journal of Northwest Normal University (Natural Science Edition) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327595A (en) * 2021-06-16 2021-08-31 北京语言大学 Pronunciation deviation detection method and device and storage medium
CN113782000A (en) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN113782000B (en) * 2021-09-29 2022-04-12 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN114218424A (en) * 2022-02-22 2022-03-22 杭州一知智能科技有限公司 Voice interaction method and system for tone word insertion based on wav2vec
CN114218424B (en) * 2022-02-22 2022-05-13 杭州一知智能科技有限公司 Voice interaction method and system for tone word insertion based on wav2vec
WO2024023946A1 (en) * 2022-07-26 2024-02-01 日本電信電話株式会社 Speech processing device, speech processing method, and speech processing program
CN115662401A (en) * 2022-12-14 2023-01-31 国家电网有限公司客户服务中心 Customer service call voice recognition method based on continuous learning
CN115662401B (en) * 2022-12-14 2023-03-10 国家电网有限公司客户服务中心 Customer service call voice recognition method based on continuous learning
CN117252213A (en) * 2023-07-06 2023-12-19 天津大学 End-to-end speech translation method using synthesized speech as supervision information
CN117252213B (en) * 2023-07-06 2024-05-31 天津大学 End-to-end speech translation method using synthesized speech as supervision information

Also Published As

Publication number Publication date
CN112951213B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
CN112951213B (en) End-to-end online voice detection and recognition method, system and equipment
CN110728997B (en) Multi-modal depression detection system based on context awareness
CN110718223B (en) Method, apparatus, device and medium for voice interaction control
US11043209B2 (en) System and method for neural network orchestration
CN111931929B (en) Training method and device for multitasking model and storage medium
US9728183B2 (en) System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
KR102281676B1 (en) Audio classification method based on neural network for waveform input and analyzing apparatus
KR20210015967A (en) End-to-end streaming keyword detection
CN111653274B (en) Wake-up word recognition method, device and storage medium
KR20220130565A (en) Keyword detection method and apparatus thereof
CN109582839B (en) Writing auxiliary method and writing auxiliary client
CN115512692B (en) Voice recognition method, device, equipment and storage medium
KR20230093826A (en) Video data labeling method and devicd for animal detection and classification
CN113421593A (en) Voice evaluation method and device, computer equipment and storage medium
Cheng et al. Video reasoning for conflict events through feature extraction
KR102462144B1 (en) AI Chatbot System with MR Content for Tutoring
KR102564570B1 (en) System and method for analyzing multimodal emotion
Cao et al. Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition.
CN111210830B (en) Voice awakening method and device based on pinyin and computer equipment
Deonise et al. Improved Speech Activity Detection Model Using Convolutional Neural Networks
CN118053420A (en) Speech recognition method, apparatus, device, medium and program product
Schuller et al. Chain of audio processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant