CN112420075B - Multitask-based phoneme detection method and device


Info

Publication number
CN112420075B
Authority
CN
China
Prior art keywords
phoneme
subsequence
basic
sequence
speech
Prior art date
Legal status
Active
Application number
CN202011156288.3A
Other languages
Chinese (zh)
Other versions
CN112420075A (en)
Inventor
谢川
Current Assignee
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN202011156288.3A
Publication of CN112420075A
Application granted
Publication of CN112420075B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for comparison or discrimination
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multitask-based phoneme detection method comprising the following steps: step A1) training a phoneme detection model; step A2) obtaining a speech sequence to be detected; step A3) segmenting the speech sequence into a plurality of basic subsequences; step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences; step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores; step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence; step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the phoneme position, and if not, returning to step A4). The invention solves the technical problems that the phoneme recognition task and the phoneme alignment task cannot be completed at the same time, that phoneme alignment accuracy is low, and that the two tasks cannot share learning results.

Description

Multitask-based phoneme detection method and device
Technical Field
The invention relates to the field of data intelligence, in particular to a multi-task-based phoneme detection method and device.
Background
With the development of deep learning, technologies built on deep speech processing, such as speech recognition, voiceprint recognition, speech synthesis, and speech emotion analysis, have made continuous breakthroughs. Phonemes are the smallest speech units divided according to the natural attributes of speech; they are the basis of most speech processing and are also very important for the rapid response of deep speech processing systems in practical scenarios. At the same time, very few existing speech databases contain phoneme alignment information, and each database follows its own phoneme definition specification, so inconsistent phoneme definitions are common and greatly hinder research in phoneme-related speech fields: because phoneme definitions are not shared across databases, data cannot be pooled or expanded, and data augmentation cannot be performed at the phoneme level. Manual phoneme annotation, for its part, greatly increases cost, consuming large amounts of labor and time; it cannot be applied to large volumes of data and cannot meet the training-data requirements of current algorithms.
Disclosure of Invention
The invention aims to provide a multitask-based phoneme detection method and device that solve the technical problems that the phoneme recognition task and the phoneme alignment task cannot be completed simultaneously, that phoneme alignment accuracy is low, and that the two tasks cannot share learning results.
The invention solves the problems through the following technical scheme:
a phoneme detection method based on multitask comprises the following steps:
step A1) training a phoneme detection model;
step A2) obtaining a voice sequence to be detected;
step A3) dividing the speech sequence into a plurality of basic subsequences;
step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences;
step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores;
step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence;
step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the phoneme position, and if not, returning to step A4), as sketched below.
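Read as an algorithm, steps A1) to A7) form an iterative refinement loop over each basic subsequence. The following is a minimal Python sketch of that control flow only; every helper it names (train_model, segment, transform_endpoints, detect, terminated) is a hypothetical stand-in for a component described later in the disclosure, not an interface defined by the patent.

```python
# Control-flow sketch of steps A1)-A7); all helpers are hypothetical stand-ins.
def detect_phonemes(audio, train_data):
    model = train_model(train_data)                       # step A1
    sequence = audio                                      # step A2
    results = []
    for base in segment(sequence):                        # step A3
        while True:
            candidates = transform_endpoints(base)        # step A4
            scored = [(detect(model, c), c) for c in candidates]        # step A5
            (phoneme, conf), base = max(scored, key=lambda t: t[0][1])  # step A6
            if terminated(base, conf):                    # step A7
                results.append((phoneme, base))           # phoneme + position
                break
    return results
```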
Further, the predicted phoneme of the transformed subsequence with the highest confidence in step A7) is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
Further, the method for segmenting the speech sequence into a plurality of basic subsequences in step A3) is either based on a fixed phoneme count or based on a window. In the fixed-phoneme-count method, the number of phonemes contained in the speech sequence is detected by a speech recognition or phoneme recognition method, and the speech sequence is divided equally or randomly into that many basic subsequences. In the window-based method, a width W and a step S are preset, where the width W is the length of a basic subsequence and the step S is the distance the window moves from the previous position to the next after each segmentation.
Further, the method for generating transformed subsequences from a basic subsequence in step A4) comprises translating the two end-point positions of the basic subsequence by equal distances or scaling the two end-point positions relative to the center of the sequence.
Further, the phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retraining model; the retraining model is trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updates its parameters using the overlap ratio between the labeled phoneme end-point positions and the closest transformed subsequence positions.
Further, the termination condition in step A7) is that the confidence difference between the phoneme recognition results of two successive iterations is smaller than a set value a, the IOU of the highest-confidence sequences from two successive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum number of iterations N, where N is any positive integer; that is, when N is 1, no iteration is performed.
The invention also provides a multitask-based phoneme detection device for implementing the above multitask-based phoneme detection method, comprising a speech data module, a speech sequence segmentation module, and a phoneme detection module, wherein the speech data module is in signal connection with the speech sequence segmentation module, and the speech sequence segmentation module is in signal connection with the phoneme detection module.
Further, the speech data module is used for acquiring the speech sequence to be detected.
Further, the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences.
Further, the phoneme detection module is used for obtaining the detection result and the phoneme position of each basic subsequence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention is provided with a voice acquisition module, a voice sequence segmentation module and a phoneme detection module, firstly, a phoneme recognition model is trained, then, input audio information is received, the audio sequence is segmented into a plurality of audio subsequences, phoneme detection is carried out on each subsequence to obtain a detection information matrix, then, each subsequence is re-segmented by referring to each subsequence detection information matrix, and finally, each subsequence range and a corresponding recognition phoneme result are output according to a subsequence correction result. The method solves the problems that the two tasks of phoneme recognition and phoneme alignment cannot be trained and executed simultaneously or the existing method cannot achieve high accuracy in the tasks of phoneme recognition and phoneme alignment simultaneously, also solves the problems that the existing method cannot estimate the aliasing part between phonemes and affects the phoneme alignment result to reduce the accuracy, provides more accurate data support for technologies such as rear-end voiceprint recognition and the like, and improves the accuracy of a voice system.
Drawings
FIG. 1 is a view showing the structure of the apparatus of the present invention.
FIG. 2 is a flow chart of the present invention.
FIG. 3 is a flow chart of the phoneme detection model training process of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
a phoneme detection method based on multitask comprises the following steps:
Step A1) training a phoneme detection model; as shown in FIG. 3, training the phoneme detection model comprises the following steps:
Step B1) setting the set of phonemes/syllables to be recognized and arranging the phonemes/syllables in the set in a fixed order. Phonemes are the smallest speech units divided according to the natural attributes of speech and are analyzed according to the pronunciation actions within a syllable, one action constituting one phoneme.
Step B2) obtaining one or more speaker datasets containing the speakers' speech, together with the phoneme position label files corresponding to that speech and to the phoneme/syllable ordering.
Step B3) segmenting according to the phoneme label files of step B2) to obtain phoneme segmentation subsequences as the expected results.
Step B4) equally dividing the original speech sequence corresponding to the phoneme/syllable sequence of the speaker's speech into a number of basic subsequences equal to the number of phonemes; these basic subsequences form a set.
Step B5) inputting each basic subsequence into a multi-scale transformation model to obtain multi-scale subsequences, which are combined into a multi-scale transformed subsequence set. The multi-scale transformation model produces 5 sequence feature matrices at different scales: sequence a is the basic subsequence expanded 0.1L to the left (L being the subsequence length), sequence b is the basic subsequence expanded outward from the center by a factor of 1.1, sequence c is the basic subsequence itself, sequence d is the basic subsequence contracted toward the center by a factor of 0.9, and sequence e is the basic subsequence expanded 0.1L to the right.
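A minimal sketch of this five-scale endpoint transform follows, assuming a subsequence is represented by sample indices (start, end) on the full audio; reading 0.1L as one tenth of the subsequence length, and the clipping to the audio bounds, are assumptions.

```python
# Sketch of the five-scale transform of step B5; names and clipping behaviour
# are illustrative assumptions, not part of the patent text.
def multi_scale_transforms(start, end, total_len):
    """Return the five transformed (start, end) windows a-e."""
    L = end - start
    center = (start + end) / 2
    candidates = [
        (start - 0.1 * L, end),                  # a: expand 0.1L to the left
        (center - 0.55 * L, center + 0.55 * L),  # b: scale by 1.1 about the center
        (start, end),                            # c: the basic subsequence itself
        (center - 0.45 * L, center + 0.45 * L),  # d: scale by 0.9 about the center
        (start, end + 0.1 * L),                  # e: expand 0.1L to the right
    ]
    # Clip to the audio and round to integer sample indices.
    return [(max(0, round(s)), min(total_len, round(e))) for s, e in candidates]
```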
Step B6) extracting features from the basic subsequences: MFCC extraction is performed on the basic subsequences of step B4) and the multi-scale transformed subsequences of step B5), with the specific parameters set to a 25 ms window function, a 10 ms step, a 13-dimensional MFCC, 13-dimensional first-order deltas, and 13-dimensional second-order deltas, giving a 39-dimensional feature extraction result and thus a feature matrix for each basic subsequence.
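A sketch of this 39-dimensional extraction is shown below; the use of librosa is an assumption (the patent names no library), and the 16 kHz sampling rate is taken from step A2.

```python
import librosa
import numpy as np

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)   # 16 kHz wav, as in step A2
    n_fft = int(0.025 * sr)                    # 25 ms window -> 400 samples
    hop = int(0.010 * sr)                      # 10 ms step   -> 160 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)           # 13-dim first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)  # 13-dim second-order deltas
    return np.vstack([mfcc, d1, d2])           # (39, n_frames) feature matrix
```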
Step B7) constructing a phoneme recognition model whose specific parameters comprise a three-layer RNN; the feature matrix of each basic subsequence and the corresponding phoneme category obtained in step B3) serve as the input and prediction target of the phoneme recognition model, which is pre-trained to obtain a pre-trained acoustic model for phoneme recognition.
Step B8) inputting the feature matrix of each basic subsequence from step B6) into the phoneme recognition model and obtaining a phoneme probability matrix as output.
Step B9) constructing a phoneme regression model, set as a 3-layer one-dimensional CNN; its input is the feature matrix of each basic subsequence from step B6), and its output is a candidate box for the basic subsequence on the whole audio frame.
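One way to realize the two heads of steps B7 and B9 is sketched below in PyTorch; the patent fixes only the layer counts, so the framework, GRU cells, hidden sizes, and pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    """Step B7: three-layer RNN over 39-dim MFCC frames -> phoneme probabilities."""
    def __init__(self, n_phonemes, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=39, hidden_size=hidden,
                          num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)      # softmax output layer (B10)

    def forward(self, x):                             # x: (batch, frames, 39)
        h, _ = self.rnn(x)
        return torch.softmax(self.out(h[:, -1]), dim=-1)

class PhonemeRegressor(nn.Module):
    """Step B9: 3-layer 1-D CNN -> (start, end) candidate box on the audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(39, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 2, kernel_size=3, padding=1),  # 2 outputs: start, end
        )

    def forward(self, x):                             # x: (batch, 39, frames)
        return self.net(x).mean(dim=-1)               # (batch, 2) interval
```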
Step B10) combining the phoneme regression model and the phoneme recognition model to obtain a combined loss function, obtaining the phoneme category and phoneme position information, and updating the phoneme detection model. The specific settings are: 1. the output layer of the phoneme recognition model is a softmax layer, and ReLU is adopted as the network loss function; 2. the phoneme regression model uses a one-dimensional DIOU as its network loss function; 3. the combined loss function is 0.4 times the phoneme recognition loss plus 0.6 times the phoneme regression loss.
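A sketch of this combined loss follows. The patent does not spell out the one-dimensional DIOU formula, so the interval-IoU-plus-center-distance form below is a plausible 1-D analogue of the standard 2-D DIoU, and the negative-log-likelihood recognition term is likewise a stand-in assumption.

```python
import torch
import torch.nn.functional as F

def diou_1d(pred, target):
    """pred, target: (batch, 2) intervals given as (start, end)."""
    inter = (torch.min(pred[:, 1], target[:, 1]) -
             torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    iou = inter / union.clamp(min=1e-8)
    enclose = (torch.max(pred[:, 1], target[:, 1]) -
               torch.min(pred[:, 0], target[:, 0])).clamp(min=1e-8)
    center_dist = (pred.mean(dim=1) - target.mean(dim=1)).abs()
    return (1.0 - iou + (center_dist / enclose) ** 2).mean()

def combined_loss(phoneme_probs, phoneme_labels, pred_boxes, true_boxes):
    # Recognition term: NLL over the softmax outputs (a stand-in assumption).
    rec_loss = F.nll_loss(torch.log(phoneme_probs + 1e-8), phoneme_labels)
    reg_loss = diou_1d(pred_boxes, true_boxes)       # 1-D DIoU regression term
    return 0.4 * rec_loss + 0.6 * reg_loss           # weights from step B10
```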
As shown in FIG. 2, the detection process of the phoneme recognition model comprises the following steps. Step A2) acquiring the speech sequence to be detected, i.e. speech acquisition: a wav file with a 16 kHz sampling rate is collected; the recognized pinyin result is obtained from the ASR speech recognition output; the phoneme recognition result is obtained from a pinyin-phoneme mapping table; and the number of phonemes is counted, thereby initializing the speech sequence. Information features of the speech sequence are then extracted: MFCC extraction is performed on the obtained speech sequence with a 25 ms window function, a 10 ms step, a 13-dimensional MFCC, 13-dimensional first-order deltas, and 13-dimensional second-order deltas, giving a 39-dimensional information feature extraction result.
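A toy sketch of the pinyin-to-phoneme lookup follows; the patent only states that a pinyin-phoneme mapping table is used, so the table entries and function name here are hypothetical.

```python
# Hypothetical pinyin-to-phoneme table; a real table covers all Mandarin syllables.
PINYIN_TO_PHONEMES = {
    "zhong": ["zh", "ong"],
    "guo": ["g", "uo"],
}

def phonemes_from_pinyin(pinyin_syllables):
    """Map ASR pinyin output to a phoneme sequence and count the phonemes."""
    phonemes = [p for syl in pinyin_syllables for p in PINYIN_TO_PHONEMES[syl]]
    return phonemes, len(phonemes)

# phonemes_from_pinyin(["zhong", "guo"]) -> (["zh", "ong", "g", "uo"], 4)
```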
Step A3) segmenting the speech sequence into a plurality of basic subsequences: according to the number of phonemes, the speech audio sequence is divided equally into a number of basic subsequences equal to the number of phonemes. The segmentation methods are: in the fixed-phoneme-count method, the number of phonemes contained in the speech sequence is detected by a speech recognition or phoneme recognition method and the speech sequence is divided equally or randomly into that many basic subsequences; in the window-based method, the speech sequence is segmented with a preset width W and step S, where the width W is the length of a basic subsequence and the step S is the distance the window moves from the previous position to the next after each segmentation. Both strategies are sketched below.
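A minimal sketch of the two segmentation strategies, operating on sample indices; the exact random-split and window semantics are assumptions read from the description above.

```python
import random

def split_by_phoneme_count(total_len, n_phonemes, equal=True):
    """Divide [0, total_len) into n_phonemes basic subsequences."""
    if equal:
        bounds = [round(i * total_len / n_phonemes) for i in range(n_phonemes + 1)]
    else:  # random split: sort n_phonemes - 1 random cut points
        cuts = sorted(random.sample(range(1, total_len), n_phonemes - 1))
        bounds = [0] + cuts + [total_len]
    return list(zip(bounds[:-1], bounds[1:]))

def split_by_window(total_len, width_w, step_s):
    """Slide a window of width W by step S to produce basic subsequences."""
    return [(s, min(s + width_w, total_len))
            for s in range(0, total_len - width_w + step_s, step_s)]
```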
Step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences. The transformed subsequences are generated from a basic subsequence by translating its two end-point positions by equal distances or by scaling the two end-point positions relative to the center of the sequence. The specific setting uses 5 sequences at different scales, as in step B5): sequence a is the basic subsequence expanded 0.1L to the left, sequence b is the basic subsequence expanded outward from the center by a factor of 1.1, sequence c is the basic subsequence itself, sequence d is the basic subsequence contracted toward the center by a factor of 0.9, and sequence e is the basic subsequence expanded 0.1L to the right.
Step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores. The detection process is: 1. perform phoneme recognition on the transformed subsequence set; 2. perform phoneme regression on each scale of the transformed subsequences according to the recognition results.
Step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence.
Step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the phoneme position, and if not, returning to step A4). The termination condition is that the confidence difference between the phoneme recognition results of two successive iterations is smaller than a set value a, the IOU of the highest-confidence sequences from two successive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum number of iterations N, where N is any positive integer; that is, when N is 1, no iteration is performed. The predicted phoneme of the transformed subsequence with the highest confidence is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
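Putting steps A4) to A7) together for one basic subsequence gives the refinement loop below, reusing the multi_scale_transforms sketch from step B5. The detect argument stands in for the trained phoneme detection model and is assumed to return (phoneme, confidence) for a window; a, b, c_pct, and max_iters correspond to the thresholds a, b, c%, and N, and their default values are placeholders, since the patent leaves them unspecified.

```python
def interval_iou(x, y):
    """IoU of two (start, end) intervals."""
    inter = max(0, min(x[1], y[1]) - max(x[0], y[0]))
    union = (x[1] - x[0]) + (y[1] - y[0]) - inter
    return inter / union if union > 0 else 0.0

def refine_subsequence(detect, start, end, total_len,
                       a=0.01, b=0.9, c_pct=50.0, max_iters=10):
    """Steps A4-A7 for one basic subsequence; detect(window) -> (phoneme, conf)."""
    window, conf = (start, end), 0.0
    phoneme = None
    for _ in range(max_iters):                                   # step A7: N cap
        prev_window, prev_conf = window, conf
        candidates = multi_scale_transforms(*window, total_len)  # step A4
        scored = [(detect(w), w) for w in candidates]            # step A5
        (phoneme, conf), window = max(scored, key=lambda t: t[0][1])  # step A6
        # Step A7 termination as stated in the patent: small confidence change,
        # successive best-window IOU below c%, and confidence above b.
        if (abs(conf - prev_conf) < a and conf > b and
                interval_iou(window, prev_window) * 100 < c_pct):
            break
    return phoneme, window, conf   # phoneme result plus start/end positions
```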
The phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retraining model; the retraining model is trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updates its parameters using the overlap ratio between the labeled phoneme end-point positions and the closest transformed subsequence positions.
Referring to FIG. 1, a multitask-based phoneme detection device for implementing the multitask-based phoneme detection method comprises a speech data module, a speech sequence segmentation module, and a phoneme detection module, wherein the speech data module is in signal connection with the speech sequence segmentation module, and the speech sequence segmentation module is in signal connection with the phoneme detection module. The speech data module is used for acquiring the speech sequence to be detected. The speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences. The phoneme detection module is used for obtaining the detection result and phoneme position of each basic subsequence.
The invention provides a speech data module, a speech sequence segmentation module, and a phoneme detection module. A phoneme recognition model is first trained; input audio is then received and the audio sequence is segmented into a plurality of audio subsequences; phoneme detection is performed on each subsequence to obtain a detection information matrix; each subsequence is then re-segmented with reference to its detection information matrix; and finally the range of each subsequence and the corresponding recognized phoneme are output according to the subsequence correction results. The method solves the problems that phoneme recognition and phoneme alignment cannot be trained and executed simultaneously and that existing methods cannot achieve high accuracy on both tasks at once; it also solves the problem that existing methods cannot estimate the aliasing between adjacent phonemes, which degrades the phoneme alignment result and reduces accuracy, and it provides more accurate data support for back-end technologies such as voiceprint recognition, improving the accuracy of the speech system.
Although the invention has been described herein with reference to the illustrated embodiments, which are preferred embodiments of the invention, the invention is not limited thereto; many other modifications and embodiments will be apparent to those skilled in the art and fall within the spirit and scope of the principles of this disclosure.

Claims (9)

1. A multitask-based phoneme detection method, characterized in that the method comprises the following steps:
step A1) training a phoneme detection model;
step A2) obtaining a voice sequence to be detected;
step A3) dividing the speech sequence into a plurality of basic subsequences, the speech audio sequence being divided equally, according to the number of phonemes, into a number of basic subsequences equal to the number of phonemes;
step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences;
step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores;
step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence;
step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the start and end points of the phoneme, and if not, returning to step A4); the termination condition in step A7) is that the confidence difference between the phoneme recognition results of two successive iterations is smaller than a set value a, the IOU of the highest-confidence sequences from two successive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum number of iterations N, where N is any positive integer, i.e. when N is 1 no iteration is performed.
2. The multitask-based phoneme detection method of claim 1, characterized in that: the predicted phoneme of the transformed subsequence with the highest confidence in step A7) is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
3. The multitask-based phoneme detection method of claim 1, characterized in that: the method for segmenting the speech sequence into a plurality of basic subsequences in step A3) comprises detecting the number of phonemes contained in the speech sequence by a speech recognition or phoneme recognition method, and dividing the speech sequence equally or randomly into that many basic subsequences.
4. The multitask-based phoneme detection method of claim 1, characterized in that: the method for generating transformed subsequences from a basic subsequence in step A4) comprises translating the two end-point positions of the basic subsequence by equal distances or scaling the two end-point positions relative to the center of the sequence.
5. The multitask-based phoneme detection method of claim 1, characterized in that: the phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retraining model, the retraining model being trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updating its parameters using the overlap ratio between the labeled phoneme end-point positions and the closest transformed subsequence positions.
6. A multitask-based phoneme detection device for implementing the multitask-based phoneme detection method of any one of claims 1-5, characterized in that: it comprises a speech data module, a speech sequence segmentation module, and a phoneme detection module, the speech data module being in signal connection with the speech sequence segmentation module, and the speech sequence segmentation module being in signal connection with the phoneme detection module.
7. The multitask-based phoneme detection device of claim 6, characterized in that: the speech data module is used for acquiring the speech sequence to be detected.
8. The multitask-based phoneme detection device of claim 6, characterized in that: the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences.
9. The multitask-based phoneme detection device of claim 6, characterized in that: the phoneme detection module is used for obtaining the detection result and phoneme position of each basic subsequence.
CN202011156288.3A 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device Active CN112420075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011156288.3A CN112420075B (en) 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011156288.3A CN112420075B (en) 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device

Publications (2)

Publication Number Publication Date
CN112420075A CN112420075A (en) 2021-02-26
CN112420075B (en) 2022-08-19

Family

ID=74840585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011156288.3A Active CN112420075B (en) 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device

Country Status (1)

Country Link
CN (1) CN112420075B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889083B (en) * 2021-11-03 2022-12-02 广州博冠信息科技有限公司 Voice recognition method and device, storage medium and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110010243A (en) * 2009-07-24 2011-02-01 고려대학교 산학협력단 System and method for searching phoneme boundaries
CN104239289A (en) * 2013-06-24 2014-12-24 富士通株式会社 Syllabication method and syllabication device
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109036384A (en) * 2018-09-06 2018-12-18 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109377981A (en) * 2018-11-22 2019-02-22 四川长虹电器股份有限公司 The method and device of phoneme alignment
CN110223673A (en) * 2019-06-21 2019-09-10 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of voice, storage medium, electronic equipment
WO2019198265A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Corporation Speech recognition system and method using speech recognition system
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390278A (en) * 1991-10-08 1995-02-14 Bell Canada Phoneme based speech recognition
TWI220511B (en) * 2003-09-12 2004-08-21 Ind Tech Res Inst An automatic speech segmentation and verification system and its method
US8135590B2 (en) * 2007-01-11 2012-03-13 Microsoft Corporation Position-dependent phonetic models for reliable pronunciation identification
JP2010230695A (en) * 2007-10-22 2010-10-14 Toshiba Corp Speech boundary estimation apparatus and method
WO2014108890A1 (en) * 2013-01-09 2014-07-17 Novospeech Ltd Method and apparatus for phoneme separation in an audio signal

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110010243A (en) * 2009-07-24 2011-02-01 고려대학교 산학협력단 System and method for searching phoneme boundaries
CN104239289A (en) * 2013-06-24 2014-12-24 富士通株式会社 Syllabication method and syllabication device
WO2019198265A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Corporation Speech recognition system and method using speech recognition system
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109036384A (en) * 2018-09-06 2018-12-18 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109377981A (en) * 2018-11-22 2019-02-22 四川长虹电器股份有限公司 The method and device of phoneme alignment
CN110223673A (en) * 2019-06-21 2019-09-10 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of voice, storage medium, electronic equipment
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning

Also Published As

Publication number Publication date
CN112420075A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN109087648B (en) Counter voice monitoring method and device, computer equipment and storage medium
CN109192213B (en) Method and device for real-time transcription of court trial voice, computer equipment and storage medium
JP5059115B2 (en) Voice keyword identification method, apparatus, and voice identification system
Sainath et al. Exemplar-based sparse representation features: From TIMIT to LVCSR
US6845357B2 (en) Pattern recognition using an observable operator model
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN108885870A (en) For by combining speech to TEXT system with speech to intention system the system and method to realize voice user interface
US20210350791A1 (en) Accent detection method and accent detection device, and non-transitory storage medium
EP2617030A1 (en) Deep belief network for large vocabulary continuous speech recognition
CN111798840A (en) Voice keyword recognition method and device
CN111292763B (en) Stress detection method and device, and non-transient storage medium
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN112420075B (en) Multitask-based phoneme detection method and device
Kherdekar et al. Convolution neural network model for recognition of speech for words used in mathematical expression
CN112686041B (en) Pinyin labeling method and device
CN113744727A (en) Model training method, system, terminal device and storage medium
CN113763939B (en) Mixed voice recognition system and method based on end-to-end model
JP7505582B2 (en) SPEAKER DIARIZATION METHOD, SPEAKER DIARIZATION DEVICE, AND SPEAKER DIARIZATION PROGRAM
JP7505584B2 (en) SPEAKER DIARIZATION METHOD, SPEAKER DIARIZATION DEVICE, AND SPEAKER DIARIZATION PROGRAM
Abraham et al. Articulatory Feature Extraction Using CTC to Build Articulatory Classifiers Without Forced Frame Alignments for Speech Recognition.
JP3589044B2 (en) Speaker adaptation device
CN113593525A (en) Method, device and storage medium for training accent classification model and accent classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant