CN114512122A - Acoustic model training method, speech recognition algorithm, storage medium, and electronic device


Info

Publication number
CN114512122A
CN114512122A
Authority
CN
China
Prior art keywords: acoustic model; small sliding window; prediction result; audio feature map; audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210103816.1A
Other languages
Chinese (zh)
Inventor
吴才泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sisu Technology Co ltd
Original Assignee
Shenzhen Sisu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sisu Technology Co ltd filed Critical Shenzhen Sisu Technology Co ltd
Priority to CN202210103816.1A
Publication of CN114512122A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an acoustic model training method, a speech recognition algorithm, a storage medium, and an electronic device. The acoustic model training method comprises: obtaining an audio feature map from a data set; obtaining an original acoustic model; segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows; inputting the small sliding windows into the original acoustic model for calculation to obtain a plurality of small sliding window prediction results; stripping and splicing each small sliding window prediction result to obtain an overall prediction result; and determining an effective speech recognition acoustic model according to the overall prediction result and a connectionist temporal classification (CTC) algorithm, thereby solving the technical problem that time-slice segmentation strongly affects recognition accuracy in speech recognition.

Description

Acoustic model training method, speech recognition algorithm, storage medium, and electronic device
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to an acoustic model training method and apparatus, a storage medium, and an electronic device.
Background
Embedded natural-language speech recognition is a technology that records audio in real time, extracts audio features, and transcribes them into the corresponding text on low-compute, low-resource embedded boards at the device side; it is mainly applied in fields such as real-time speech translation and voice-interaction robots. The speech recognition process roughly comprises audio preprocessing, acoustic model recognition, and language model recognition, where acoustic model recognition is the key step of converting audio features into acoustic phonemes.
The embedded device performs preliminary preprocessing on the audio recorded in real time to obtain an FBANK feature map, which is input to the acoustic model for calculation; the acoustic model predicts the acoustic phoneme corresponding to each frame of features from the local and long-term characteristics of the FBANK feature map. In Chinese corpus recognition, acoustic models typically model phonemes on initials and finals, toneless pinyin, or tonal pinyin. Taking tonal-pinyin modeling as an example, the acoustic model outputs the predicted tonal pinyin corresponding to the audio feature map.
At present, acoustic models are generally based on deep learning, with model structures such as CNN+CTC, DFSMN+CTC, and Conformer. The Conformer model requires a large amount of computation and cannot be applied on low-compute embedded devices, so most common device-side models are still CTC-based.
Conventional acoustic model training uses whole-sentence corpora, and the corresponding inference can only predict after the complete feature map has been acquired; only then can speech be converted into text.
In a real-time speech recognition scenario, only the acoustic information before the current moment is available, so an ordinary acoustic model must be modified to meet the requirements of streaming recognition, namely: the acoustic model computes and predicts the phonemes of the corresponding frames from an acoustic feature map of finite length that contains only past information.
In the prior art, streaming recognition is achieved by segmenting the speech into pieces, recognizing each piece separately, and splicing the recognition results together as the overall recognition result.
However, a short segmentation time slice provides insufficient context and lowers accuracy, while a long time slice causes excessive speech recognition latency. Moreover, because the CTC method cannot strictly align the output phonemes in time, phonemes may be dropped or repeated at the splice points in the spliced result, further reducing recognition accuracy.
Disclosure of Invention
The invention aims to provide an acoustic model training method and apparatus, a storage medium, and an electronic device, to solve the prior-art technical problem that time-slice segmentation strongly affects recognition accuracy in speech recognition.
In order to achieve the above object, the present invention provides an acoustic model training method, including:
acquiring an audio feature map in a data set;
obtaining an original acoustic model;
segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows;
inputting the plurality of small sliding windows into the original acoustic model for calculation to obtain a plurality of small sliding window prediction results;
stripping and splicing each small sliding window prediction result to obtain an overall prediction result;
and determining an effective speech recognition acoustic model according to the overall prediction result and a connectionist temporal classification (CTC) algorithm.
Optionally, the step of obtaining the audio feature map in the data set includes:
acquiring a plurality of original audio data;
and performing data processing on the plurality of original audio data to obtain a data set containing a plurality of audio feature maps.
Optionally, the step of segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows, further includes:
carrying out random time translation and cepstral mean and variance normalization on the audio feature map data;
and updating the audio feature map to the processed audio feature map.
Optionally, the step of stripping and splicing each small sliding window prediction result to obtain the overall prediction result includes:
removing half of the second preset time length from the front end and the rear end of each small sliding window prediction result to obtain an actual predicted segment;
and splicing the actual predicted segments to obtain the overall prediction result.
Optionally, the step of determining an effective speech recognition acoustic model according to the overall prediction result and the connectionist temporal classification algorithm includes:
inputting the overall prediction result into the connectionist temporal classification algorithm to obtain a probability value;
when the probability value falls within a first reliable preset range, confirming that the current training is effective;
and updating the original acoustic model to the trained speech recognition acoustic model, and taking the trained speech recognition acoustic model as the effective speech recognition acoustic model.
Optionally, the step of inputting the overall prediction result into the connectionist temporal classification algorithm to obtain a probability value further includes:
when the probability value does not fall within the first reliable preset range, continuing the connectionist temporal classification algorithm to obtain the probability value, and training the original acoustic model again on the data set.
Optionally, before the obtaining of the original acoustic model, the method further comprises:
acquiring a non-streaming recognition acoustic model;
and taking the non-streaming recognition acoustic model as the original acoustic model.
In order to achieve the above object, the present invention further provides a speech recognition algorithm, including:
acquiring a real-time audio feature map and an effective speech recognition acoustic model;
segmenting the real-time audio feature map and performing sliding-window processing to obtain a plurality of small sliding windows;
inputting the small sliding windows into the effective speech recognition acoustic model to determine a plurality of small sliding window recognition results;
stripping and splicing the small sliding window recognition results, and determining the recognition result corresponding to the real-time audio feature map;
wherein the effective speech recognition acoustic model is determined according to the acoustic model training method described above.
In order to achieve the above object, the present invention further provides a storage medium storing at least one executable instruction which, when run on an electronic device, causes the electronic device to perform the operations of the acoustic model training method described above.
In order to achieve the above object, the present invention also provides an electronic device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the acoustic model training method described above.
The invention obtains an audio feature map from a data set; obtains an original acoustic model; segments the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and slides a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows; inputs the small sliding windows into the original acoustic model for calculation to obtain a plurality of small sliding window prediction results; strips and splices each prediction result to obtain an overall prediction result; and determines an effective speech recognition acoustic model according to the overall prediction result and a connectionist temporal classification (CTC) algorithm. Through these steps, the training data are processed before being used to train the effective speech recognition acoustic model, so that during speech recognition the final model allows the audio to be time-segmented arbitrarily without changing the recognition result, solving the prior-art technical problem that the segmentation time slice strongly affects recognition accuracy.
Drawings
The invention is further described below with reference to the accompanying drawings and embodiments.
FIG. 1 is a flow diagram illustrating a method for acoustic model training in one embodiment.
FIG. 2 is a flow diagram of a speech recognition algorithm in one embodiment.
Detailed Description
Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
In order to solve the prior-art technical problem that time-slice segmentation strongly affects recognition accuracy in speech recognition, the invention provides an acoustic model training method and apparatus, a storage medium, and an electronic device.
In an embodiment, as shown in FIG. 1, the acoustic model training method includes:
s1, acquiring an audio characteristic map in the data set;
the data set is actually a training set established by existing voice data, the training set at this time may include one or more, the voice data may be in various languages such as chinese, japanese, english, and the like, and the training set may also be divided into various functional training sets such as commonly used phrases, professional words, and the like. Therefore, the acoustic model can be conveniently trained by a user according to different requirements so as to obtain the acoustic model with more accurate speech recognition after training. The audio feature map is an image in which the horizontal axis is time and the vertical axis is a frequency domain.
S2, acquiring an original acoustic model;
the original acoustic model at this time is selected from acoustic models in the prior art.
S3, segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows;
The first preset time length and the second preset time length can be set as required. Since the audio feature map is still a time-varying signal, segmenting it according to the first preset time length yields a plurality of segmented audio feature maps of equal duration. A window is then slid over the segmented audio feature maps according to the second preset time length: each subsequent window covers the tail of the previous one, so that the time span of each small sliding window overlaps the previous small sliding window by the second preset time length. This data processing preserves the integrity of the audio feature map and lets adjacent windows compensate for one another.
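As a minimal sketch of this windowing (an illustration, not the patented implementation: the feature map is assumed to be a NumPy array indexed by frame, and t_frame and t_shift, hypothetical names for the first and second preset time lengths, are given in frames rather than milliseconds):

```python
import numpy as np

def make_small_sliding_windows(feature_map, t_frame, t_shift):
    """Cut a frame-indexed feature map into overlapping small sliding windows.

    t_frame: first preset time length (window size, in frames).
    t_shift: second preset time length (overlap between consecutive windows).
    """
    stride = t_frame - t_shift  # each window starts this many frames after the last
    windows = []
    start = 0
    while start + t_frame <= feature_map.shape[0]:
        windows.append(feature_map[start:start + t_frame])
        start += stride
    return windows  # window i spans frames (i*stride, i*stride + t_frame)
```

With the concrete values used later in the text (a 1536 ms window and 768 ms overlap), adjacent windows share exactly half of their frames.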
S4, inputting the small sliding windows into the original acoustic model for calculation to obtain a plurality of small sliding window prediction results;
At this point the small sliding windows are input into the original acoustic model for feature extraction and model training. Each small sliding window represents a complete segment of speech; inputting one into the original acoustic model yields one small sliding window prediction result, so a plurality of windows yields a plurality of prediction results.
S5, stripping and splicing the small sliding window prediction results to obtain an overall prediction result;
Since the segmented audio feature maps from the previous steps overlap in time, the small sliding window prediction results also contain repeated segments. Stripping essentially removes these repeated segments; the stripped small sliding window prediction results are then spliced in time order to obtain the overall prediction result. The following example illustrates the stripping process:
Denote the first preset time length by T_frame and the second preset time length by T_shift. The time spans of the small sliding window prediction results are then (0, T_frame), (T_frame - T_shift, 2T_frame - T_shift), (2T_frame - 2T_shift, 3T_frame - 2T_shift), and so on. Taking the first two windows (0, T_frame) and (T_frame - T_shift, 2T_frame - T_shift) as an example, they overlap on the section (T_frame - T_shift, T_frame), and stripping deletes this overlap exactly once: for some split point X inside the overlap, the prediction result on (X, T_frame) is deleted from the first window and the prediction result on (T_frame - T_shift, X) is deleted from the second window. Any split point X within the overlapped section may be chosen; as long as the overlapped time of the two sections is deleted exactly once in the final result, the technical scheme and technical effect of the invention are realized.
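A minimal sketch of this strip-and-splice step follows, under the same assumptions as the windowing sketch above and using the split point detailed later in the text (half the overlap stripped from each end of every window):

```python
def strip_and_splice(window_preds, t_shift):
    """Strip half the overlap (t_shift / 2 frames) from the front and rear of
    every small sliding window prediction, then splice the middles in time order.

    The kept middle pieces tile the time axis exactly once, so each overlapped
    section is deleted exactly once, as described above.
    """
    half = t_shift // 2
    trimmed = [pred[half:len(pred) - half] for pred in window_preds]
    return np.concatenate(trimmed, axis=0)
```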
And S6, determining an effective speech recognition acoustic model according to the overall prediction result and the connectionist temporal classification algorithm.
When the acoustic model is actually trained, a suitable speech training set is selected for the application scenario and processed as in S3 before training, so the first and second preset time lengths can be chosen to fit the actual application. The processed small sliding windows are input into the original acoustic model for calculation to obtain the small sliding window prediction results. Because consecutive windows overlap by the second preset time length, i.e. the speech is repeated, the loss from a recognition error in a single segment is avoided; and because different segmentations affect recognition precision differently, the combined recognition of different segments in the previous and next windows further improves accuracy. Finally, each small sliding window prediction result is stripped and spliced to eliminate the repeated results and form a complete overall prediction result, and the effective speech recognition acoustic model is determined from the overall prediction result and the connectionist temporal classification algorithm. With this scheme, when the trained effective speech recognition acoustic model is applied to speech recognition, the audio can be time-segmented arbitrarily without changing the recognition result, solving the prior-art technical problem that the time slice strongly affects recognition accuracy.
Optionally, the original acoustic model is an original non-streaming recognized acoustic model or a convolution-based acoustic model.
By selecting the original acoustic model as an original non-streaming recognition acoustic model or a convolution-based acoustic model and combining it with the present acoustic model training method, the model can be optimized into a streaming recognition acoustic model while its computation amount and precision remain essentially unchanged. The optimization is achieved in a simple way, without extra labeling work, by modifying only the head-and-tail data processing of the model during training.
In one embodiment, the step of obtaining the audio feature map in the data set comprises:
acquiring a plurality of original audio data;
and performing data processing on the plurality of original audio data to obtain a data set containing a plurality of audio feature maps.
The audio feature map is the image, with time on the horizontal axis and frequency on the vertical axis, that audio editing software displays. The original audio data is essentially an analog signal, and the audio feature map is obtained by front-end processing of this signal. Through these steps, the audio feature maps in the data set are standardized, which makes training and recognition convenient.
Optionally, taking the Fbank audio feature map as an example, preprocessing is roughly divided into the following steps:
The collected speech is a time-domain signal. Pre-emphasis, framing, windowing, and Fourier transform are applied to obtain a frequency-domain signal; the amplitude of this signal is then squared, mel-filtered, and log-taken to obtain the Fbank audio feature map. Pre-emphasis boosts the signal-to-noise ratio of the high-frequency part through a first-order high-pass filter.
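For concreteness, a self-contained NumPy sketch of this pipeline follows; the frame length, frame step, FFT size, mel-band count, and pre-emphasis coefficient are illustrative assumptions, not values from the patent:

```python
def mel_filterbank(n_mels, n_fft, sample_rate):
    """Standard triangular mel filters (textbook construction)."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def fbank(signal, sample_rate=16000, frame_len=400, frame_step=160,
          n_fft=512, n_mels=40, preemph=0.97):
    """Fbank feature map: pre-emphasis, framing, windowing, FFT,
    amplitude squaring, mel filtering, logarithm."""
    # pre-emphasis: first-order high-pass filter boosting the high band's SNR
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # framing + Hamming windowing
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Fourier transform and amplitude squaring -> power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # mel filtering and logarithm -> (time, n_mels) Fbank feature map
    return np.log(power @ mel_filterbank(n_mels, n_fft, sample_rate).T + 1e-10)
```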
In an embodiment, the step of segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows further includes:
performing random time translation and cepstral mean and variance normalization (CMVN) on the audio feature map data;
and updating the audio feature map to the processed audio feature map.
It should be noted that, before the random time translation and cepstral mean and variance normalization (CMVN) are applied, the originally acquired audio feature map data must first be preprocessed. Taking Fbank audio feature map data as an example, this is the pre-emphasis, framing, windowing, Fourier transform, amplitude squaring, mel filtering, and log-taking pipeline described above.
In the above embodiment, random time translation means generating a random integer within a certain range and shifting the audio feature map data by that value, which effectively prevents overfitting during training. Cepstral mean and variance normalization (CMVN) computes the mean and variance of the feature map and normalizes it, which enhances the robustness of the acoustic model.
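Both operations are a few lines each; the following sketch assumes a frame-indexed feature array and a hypothetical maximum shift of 16 frames (the patent does not fix the range of the random integer):

```python
def shift_and_cmvn(feat, max_shift=16, rng=None):
    """Random time translation followed by CMVN on one feature map."""
    rng = rng or np.random.default_rng()
    # random time translation: roll the frames by a random integer,
    # which helps prevent the model from overfitting to absolute positions
    shift = int(rng.integers(-max_shift, max_shift + 1))
    feat = np.roll(feat, shift, axis=0)
    # CMVN: zero mean and unit variance per frequency band
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-8)
```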
In an embodiment, the step of stripping and splicing each small sliding window prediction result to obtain the overall prediction result includes:
removing half of the second preset time length from the front end and the rear end of each small sliding window prediction result to obtain an actual predicted segment;
in the above, it is illustrated that, when the first predetermined time length is denoted as T _ frame and the second predetermined time length is denoted as T _ shift, the time period of the prediction result with small sliding window should be (0-T _ frame), (T _ frame-T _ shift, 2T _ frame-T _ shift), (2T _ frame-2T _ shift, 3T _ frame-2T _ shift), and so on, and the example is (0-T _ frame), (T _ frame-T _ shift, 2T _ frame-T _ shift), and the example is stripping that the part between the overlapped parts (T _ frame-T _ shift, T _ frame) and (T _ frame-T _ shift, 2T _ frame-T _ shift) in the process is removed, for example, (0-T _ frame-T _ shift) is removed, x) the small sliding window prediction result of the time section, (X, T _ frame) is deleted from the small sliding window prediction result of the time section, (2T _ frame-T _ shift), or the time section of the overlapped part of (0-T _ frame) and (2T _ frame-T _ shift) is deleted arbitrarily, so that the overlapped time of the two time sections is completely deleted as a final result.
Based on the above procedure, in this embodiment the split point is taken at the middle of the overlap, so that half of the second preset time length (T_shift/2) is stripped from each side: the prediction result on (T_frame - T_shift/2, T_frame) is deleted from the first window (0, T_frame), and the prediction result on (T_frame - T_shift, T_frame - T_shift/2) is deleted from the second window (T_frame - T_shift, 2T_frame - T_shift). Fixing the stripped length at half of the second preset time length makes the deletion simple to perform without resorting to a more complex algorithm, which reduces the cost of the scheme.
Further, take T_frame = 1536 ms and T_shift = 768 ms as a concrete example.
The windows are then
[0, 1536 ms], [768 ms, 2304 ms], [1536 ms, 3072 ms], ...
That is, each window is T_frame = 1536 ms long and the next window slides by 768 ms relative to the current one (here T_shift = T_frame/2, so the slide also equals 768 ms). The stripping and splicing of step S5 then means that the first window keeps the result of its middle 768 ms ([384 ms, 1152 ms]), the second window likewise keeps its middle 768 ms, and so on, and the kept pieces are spliced together in order.
Each actual predicted segment is then spliced to obtain the overall prediction result.
This splicing is fast and complete; moreover, it avoids the phoneme deletion or repetition that can otherwise occur at the splice points, further improving recognition accuracy.
In one embodiment, the step of determining an effective speech recognition acoustic model according to the overall prediction result and the chain timing classification algorithm comprises:
determining a loss value and an error rate of the current training according to the connectionist temporal classification algorithm, the overall prediction result, and the original acoustic model;
and when the loss value is smaller than a first preset threshold and the error rate is lower than a second preset threshold,
confirming that the trained speech recognition acoustic model is effective.
The overall prediction result is input into the connectionist temporal classification algorithm to obtain a probability value.
Taking CTC loss as the connectionist temporal classification algorithm as an example:
The predicted input of the CTC (Connectionist Temporal Classification) loss is the overall prediction result, and the label input is the ordinary whole-corpus label. Owing to the properties of CTC, the output of the whole model automatically converges toward the correct label value under the original acoustic model. During this process the loss value and the error rate are continuously monitored, and only when the loss value is smaller than the first preset threshold and the error rate is lower than the second preset threshold is the trained acoustic model confirmed as an effective speech recognition acoustic model, achieving an unsupervised, automatically aligned training effect. The first and second preset thresholds may be set by the user or determined through repeated experiments, and are not limited here.
When the probability value falls within the first reliable preset range, the current training is confirmed as effective;
and the original acoustic model is updated to the trained speech recognition acoustic model, which is taken as the effective speech recognition acoustic model.
In an embodiment, the step of inputting the overall prediction result into the connectionist temporal classification algorithm to obtain a probability value further includes:
when the probability value does not fall within the first reliable preset range, continuing the connectionist temporal classification algorithm to obtain the probability value, and re-training the original acoustic model on the data set.
Through the connectionist temporal classification algorithm, the output of the whole model automatically converges toward the correct label value, achieving an unsupervised, automatically aligned training effect.
In an embodiment, before obtaining the original acoustic model, the method further comprises:
acquiring a non-streaming recognition acoustic model;
and taking the non-streaming recognition acoustic model as the original acoustic model.
Using a non-streaming recognition acoustic model as the original acoustic model, combined with the present acoustic model training method, the original non-streaming model can be optimally trained into a streaming recognition acoustic model while keeping the main structure of the original model, adding no labeling workload, and modifying only the head-and-tail structure of the model during training, with its computation amount and precision essentially unchanged.
In order to solve the above problem, the present invention further provides a speech recognition algorithm which, as shown in FIG. 2, includes:
s10, acquiring a real-time audio characteristic map and an effective speech recognition acoustic model;
the audio feature map is an image whose horizontal axis is time and vertical axis is a frequency domain. The effective speech recognition acoustic model is determined in accordance with the acoustic model training method described above.
S20, segmenting the real-time audio feature map and performing sliding-window processing to obtain a plurality of small sliding windows;
It should be noted that the segmentation and sliding-window processing here use the same first preset time length and second preset time length that were used in the training step "segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows". After training of the effective speech recognition acoustic model is completed, the first and second preset time lengths used in training are deployed on the embedded device side, so that the recognition precision matches the trained precision.
S30, inputting the small sliding windows into the effective speech recognition acoustic model to determine a plurality of small sliding window recognition results;
S40, stripping and splicing the small sliding window recognition results, and determining the recognition result corresponding to the real-time audio feature map;
Note that in this case stripping only needs to delete the overlapped portions; of course, stripping half of the second preset time length from the front and rear of each small sliding window recognition result, as in training, gives an even better effect.
In the above embodiments, the valid speech recognition acoustic model is determined according to the acoustic model training method as described above.
When the acoustic model is actually trained, a suitable speech training set is selected for the application scenario and processed as in step S3 of the acoustic model training method before training, so that first and second preset time lengths appropriate to the actual application can be used. The processed small sliding windows are input into the original acoustic model for calculation to obtain the small sliding window prediction results. Because consecutive windows overlap by the second preset time length, i.e. the speech is repeated, the loss from a recognition error in a single segment is avoided; and because different segmentations affect recognition precision differently, the combined recognition of different segments in the previous and next windows improves accuracy. Finally, the effective speech recognition acoustic model is determined according to the overall prediction result and the connectionist temporal classification algorithm, and the parameters of the acoustic model are adjusted so that its overall output approaches, as closely as possible, the speech signals represented by the audio feature maps in the data set. With this scheme, when the trained model is applied to speech recognition, the audio can be time-segmented arbitrarily without changing the recognition result, solving the prior-art problem that the time slice strongly affects recognition accuracy. In addition, recognizing the real-time audio feature map on this principle keeps the window length and the context relationship consistent between training and inference, exploits the properties of CTC to achieve automatic label alignment, and allows the spliced results to join up in order, achieving streaming recognition while preserving recognition precision as far as possible.
Optionally, on the embedded device, the audio feature map is fed frame by frame into the effective speech recognition acoustic model, with the first preset time length as the window length and the second preset time length as the sliding step, and 1/2 of the second preset time length is stripped from the front and rear of each output, so that the recognition result of the current window is produced in real time.
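A sketch of this deployment loop, under the same assumptions as the earlier sketches (note that with T_shift = T_frame/2 as in the concrete example, the slide T_frame - T_shift equals the second preset time length, matching the deployment described here):

```python
def streaming_recognize(feature_frames, model, t_frame, t_shift):
    """Feed feature frames to the model window by window and emit, in real
    time, the stripped middle of each window's output."""
    half, stride, buf = t_shift // 2, t_frame - t_shift, []
    for frame in feature_frames:      # one Fbank frame at a time
        buf.append(frame)
        if len(buf) == t_frame:
            pred = model(np.stack(buf))        # assumed: (t_frame, classes) scores
            yield pred[half:t_frame - half]    # current window's stripped result
            buf = buf[stride:]                 # keep the t_shift overlap frames
```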
In order to solve the above problem, the present invention further provides a storage medium, where at least one executable instruction is stored, and when the executable instruction is executed on an electronic device, the electronic device executes the operations of the acoustic model training method described above.
It should be noted that, since the storage medium of the present application implements all the steps of the acoustic model training method, it can also realize all the schemes of that method, with the same beneficial effects, which are not repeated here.
The above apparatus embodiments perform the acoustic model training method of the method embodiments and are merely illustrative: units described as separate components may or may not be physically separate, i.e. they may be located in one place or distributed over a plurality of network elements, and some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.
In order to solve the above problem, the present invention further provides an electronic device, including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the acoustic model training method as described above.
It should be noted that, because the electronic device of the present application includes all the steps of the acoustic model training method, the electronic device may also implement all the schemes of the acoustic model training method, and has the same beneficial effects, and details are not repeated herein.
The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations are described, but any combination of these technical features that contains no contradiction should be considered within the scope of this specification.

Claims (10)

1. An acoustic model training method, characterized in that the acoustic model training method comprises:
acquiring an audio feature map in a data set;
obtaining an original acoustic model;
segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows;
inputting the plurality of small sliding windows into the original acoustic model for calculation to obtain a plurality of small sliding window prediction results;
stripping and splicing each small sliding window prediction result to obtain an overall prediction result;
and determining an effective speech recognition acoustic model according to the overall prediction result and a connectionist temporal classification (CTC) algorithm.
2. The acoustic model training method of claim 1, wherein the step of obtaining an audio feature map in the data set comprises:
acquiring a plurality of original audio data;
and performing data processing on the plurality of original audio data to obtain a data set containing a plurality of audio feature maps.
3. The acoustic model training method of claim 1, wherein the step of segmenting the audio feature map according to a first preset time length to obtain a plurality of segmented audio feature maps, and sliding a window over each segmented audio feature map according to a second preset time length to obtain a plurality of small sliding windows, further comprises:
carrying out random time translation and cepstral mean and variance normalization on the audio feature map data;
and updating the audio feature map to the processed audio feature map.
4. The acoustic model training method of claim 1, wherein the step of stripping and splicing each small sliding window prediction result to obtain the overall prediction result comprises:
removing half of the second preset time length from the front end and the rear end of each small sliding window prediction result to obtain an actual predicted segment;
and splicing the actual predicted segments to obtain the overall prediction result.
5. The acoustic model training method of claim 1, wherein the step of determining an effective speech recognition acoustic model according to the overall prediction result and the connectionist temporal classification algorithm comprises:
inputting the overall prediction result into the connectionist temporal classification algorithm to obtain a probability value;
when the probability value falls within a first reliable preset range, confirming that the current training is effective;
and updating the original acoustic model to the trained speech recognition acoustic model, and taking the trained speech recognition acoustic model as the effective speech recognition acoustic model.
6. The acoustic model training method of claim 5, wherein the step of inputting the overall prediction result into the connectionist temporal classification algorithm to obtain a probability value further comprises:
when the probability value does not fall within the first reliable preset range, continuing the connectionist temporal classification algorithm to obtain the probability value, and training the original acoustic model again on the data set.
7. The acoustic model training method of claim 1, wherein before the obtaining of the original acoustic model, the method further comprises:
acquiring a non-streaming recognition acoustic model;
and taking the non-streaming recognition acoustic model as the original acoustic model.
8. A speech recognition algorithm, comprising:
acquiring a real-time audio feature map and an effective speech recognition acoustic model;
segmenting the real-time audio feature map and performing sliding-window processing to obtain a plurality of small sliding windows;
inputting the small sliding windows into the effective speech recognition acoustic model to determine a plurality of small sliding window recognition results;
stripping and splicing the small sliding window recognition results, and determining the recognition result corresponding to the real-time audio feature map;
wherein the effective speech recognition acoustic model is determined according to the acoustic model training method according to any one of claims 1-7.
9. A storage medium having stored therein at least one executable instruction that, when executed on an electronic device, causes the electronic device to perform operations of the acoustic model training method according to any one of claims 1 to 7.
10. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the acoustic model training method of any of claims 1-7.
CN202210103816.1A 2022-01-27 2022-01-27 Acoustic model training method, speech recognition algorithm, storage medium, and electronic device Pending CN114512122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103816.1A CN114512122A (en) 2022-01-27 2022-01-27 Acoustic model training method, speech recognition algorithm, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210103816.1A CN114512122A (en) 2022-01-27 2022-01-27 Acoustic model training method, speech recognition algorithm, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
CN114512122A true CN114512122A (en) 2022-05-17

Family

ID=81549936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103816.1A Pending CN114512122A (en) 2022-01-27 2022-01-27 Acoustic model training method, speech recognition algorithm, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN114512122A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171700A (en) * 2022-06-13 2022-10-11 武汉大学 Voiceprint recognition voice assistant method based on pulse neural network
CN115171700B (en) * 2022-06-13 2024-04-26 武汉大学 Voiceprint recognition voice assistant method based on impulse neural network

Similar Documents

Publication Publication Date Title
US11276407B2 (en) Metadata-based diarization of teleconferences
CN107657947B (en) Speech processing method and device based on artificial intelligence
CN109545190B (en) Speech recognition method based on keywords
CN109065031B (en) Voice labeling method, device and equipment
US8543402B1 (en) Speaker segmentation in noisy conversational speech
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
CN110827801A (en) Automatic voice recognition method and system based on artificial intelligence
CN103544955A (en) Method of recognizing speech and electronic device thereof
CN110265001B (en) Corpus screening method and device for speech recognition training and computer equipment
JP2023542685A (en) Speech recognition method, speech recognition device, computer equipment, and computer program
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
CN111613215B (en) Voice recognition method and device
CN113838460A (en) Video voice recognition method, device, equipment and storage medium
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
DE112020004348T5 (en) DETECTING AND RECOVERING OUT OF VOCABULARY WORDS IN SPEECH-TO-TEXT TRANSCRIPTION SYSTEMS
CN108364655B (en) Voice processing method, medium, device and computing equipment
US20220157322A1 (en) Metadata-based diarization of teleconferences
CN114512122A (en) Acoustic model training method, speech recognition algorithm, storage medium, and electronic device
CN110853627A (en) Method and system for voice annotation
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN114398952B (en) Training text generation method and device, electronic equipment and storage medium
CN114822516A (en) Acoustic model training method, speech recognition algorithm, storage medium, and electronic device
CN112397059B (en) Voice fluency detection method and device
CN113160796B (en) Language identification method, device and equipment for broadcast audio and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination