CN112420075B - Multitask-based phoneme detection method and device
- Publication number: CN112420075B (application CN202011156288.3A)
- Authority: CN (China)
- Prior art keywords: phoneme, subsequence, basic, sequence, speech
- Prior art date: 2020-10-26
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
All classifications fall under G (Physics), G10 (Musical instruments; acoustics), G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):
- G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
- G10L2015/0631: Creating reference templates; clustering
Abstract
The invention discloses a multitask-based phoneme detection method comprising the following steps: step A1) training a phoneme detection model; step A2) obtaining a speech sequence to be detected; step A3) segmenting the speech sequence into a plurality of basic subsequences; step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences; step A5) inputting all transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores; step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence; step A7) judging whether the basic subsequence meets the termination condition: if so, obtaining and outputting the phoneme detection result and the phoneme position; if not, returning to step A4). The invention solves the technical problems that the phoneme recognition task and the phoneme alignment task cannot be completed simultaneously, that phoneme alignment accuracy is low, and that the two tasks cannot share learned results.
Description
Technical Field
The invention relates to the field of data intelligence, and in particular to a multitask-based phoneme detection method and device.
Background
With the development of deep learning, technologies built on deep speech processing, such as speech recognition, voiceprint recognition, speech synthesis, and speech emotion analysis, have advanced continuously. Phonemes are the smallest speech units, divided according to the natural attributes of speech; they play a very important role in deep speech processing and are the basis of most of it, and they are also of great significance for the rapid response of deep speech processing systems in practical scenarios. However, few existing datasets contain phoneme alignment information, and each database follows its own phoneme definition conventions, so inconsistent phoneme definitions across databases are common and greatly hinder phoneme-level speech research: data from different databases cannot be pooled, datasets cannot be extended, and data augmentation cannot be performed at the phoneme level. Manual phoneme annotation, meanwhile, incurs large labor and time costs, cannot be applied to large volumes of data, and therefore cannot satisfy the training-data requirements of current algorithms.
Disclosure of Invention
The invention aims to provide a multitask-based phoneme detection method and device that solve the technical problems that the phoneme recognition task and the phoneme alignment task cannot be completed simultaneously, that phoneme alignment accuracy is low, and that the two tasks cannot share learned results.
The invention solves the above problems through the following technical scheme:
A multitask-based phoneme detection method comprises the following steps:
step A1) training a phoneme detection model;
step A2) obtaining a speech sequence to be detected;
step A3) segmenting the speech sequence into a plurality of basic subsequences;
step A4) moving the end points of the basic subsequence to obtain a set of transformed subsequences;
step A5) inputting all transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores;
step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence;
step A7) judging whether the basic subsequence meets the termination condition: if so, obtaining and outputting the phoneme detection result and the phoneme position; if not, returning to step A4).
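For illustration, the loop of steps A3) through A7) can be sketched as follows. This is a minimal sketch, not the patented implementation: the function interfaces, the representation of a subsequence as a (start, end) index pair, and the simplified termination test (stop once confidence no longer improves) are all assumptions.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # (start, end) sample indices of a subsequence

def detect_phonemes(
    n_samples: int,
    n_phonemes: int,
    score: Callable[[Interval], Tuple[str, float]],    # phoneme detection model, step A5)
    transforms: Callable[[Interval], List[Interval]],  # end-point moves, step A4)
    max_iters: int = 10,
) -> List[Tuple[str, Interval]]:
    """Steps A3)-A7): split, transform, score, keep the best, iterate."""
    width = n_samples // n_phonemes
    results = []
    for i in range(n_phonemes):                        # step A3): equal division
        base: Interval = (i * width, (i + 1) * width)
        label, conf = score(base)
        for _ in range(max_iters):                     # loop of steps A4)-A7)
            cands = [(score(c), c) for c in transforms(base)]
            (new_label, new_conf), best = max(cands, key=lambda x: x[0][1])
            if new_conf <= conf:                       # simplified termination, step A7)
                break
            base, label, conf = best, new_label, new_conf  # step A6)
        results.append((label, base))                  # phoneme and its position
    return results
```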
Further, the predicted phoneme of the transformed subsequence with the highest confidence in step A7) is used as the final phoneme detection result of the corresponding basic subsequence from step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence from step A3).
Further, the method for segmenting the speech sequence into a plurality of basic subsequences in step A3) is based on either a fixed phoneme count or a window. In the fixed-phoneme-count method, the number of phonemes contained in the speech sequence is detected by a speech recognition or phoneme recognition method, and the speech sequence is equally or randomly divided into that many basic subsequences. In the window-based method, a width W and a step S are preset: the width W is the length of a basic subsequence, and the step S is the distance the window moves from one position to the next after each division.
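A minimal sketch of the two segmentation methods, assuming the speech is held in a NumPy array and the phoneme count has already been obtained from speech or phoneme recognition; subsequences are returned as (start, end) index pairs.

```python
import numpy as np

def split_by_phoneme_count(speech: np.ndarray, n_phonemes: int):
    """Fixed-phoneme-count method: equal division into n_phonemes subsequences."""
    bounds = np.linspace(0, len(speech), n_phonemes + 1, dtype=int)
    return [(int(bounds[i]), int(bounds[i + 1])) for i in range(n_phonemes)]

def split_by_window(speech: np.ndarray, W: int, S: int):
    """Window-based method: subsequences of width W, the window advancing by step S."""
    return [(start, start + W) for start in range(0, len(speech) - W + 1, S)]
```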
Further, the method for generating transformed subsequences from a basic subsequence in step A4) comprises translating the two end points of the basic subsequence equidistantly, or scaling the two end points of the basic subsequence relative to the sequence centre.
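The two end-point operations can be sketched as follows; treating a subsequence as a (start, end) pair is an assumption about the representation, not part of the claims.

```python
def translate(seg, shift):
    """Move both end points equidistantly (left for shift < 0, right for shift > 0)."""
    start, end = seg
    return (start + shift, end + shift)

def scale(seg, ratio):
    """Scale both end points about the sequence centre (expand for ratio > 1)."""
    start, end = seg
    centre = (start + end) / 2.0
    half = (end - start) * ratio / 2.0
    return (int(round(centre - half)), int(round(centre + half)))
```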
Further, the phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retrainable model; the retrainable model is trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updates its parameters using the overlap ratio between the labelled phoneme end-point positions and the closest transformed subsequence positions.
Further, the termination condition in step A7) is that the confidence difference between two consecutive phoneme recognitions is smaller than a set value a, the IoU of the two highest-confidence sequences from consecutive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum N, where N is any positive integer, i.e. when N is 1 no iteration is performed.
The invention also provides a multitask-based phoneme detection device for implementing the above multitask-based phoneme detection method, comprising a speech data module, a speech sequence segmentation module, and a phoneme detection module, the speech data module being signal-connected to the speech sequence segmentation module and the speech sequence segmentation module being signal-connected to the phoneme detection module.
Further, the speech data module is used for obtaining the speech sequence to be detected.
Further, the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences.
Further, the phoneme detection module is used for obtaining the detection result and phoneme position of each basic subsequence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The invention provides a speech data module, a speech sequence segmentation module, and a phoneme detection module. A phoneme recognition model is first trained; input audio is then received and the audio sequence is segmented into a plurality of audio subsequences; phoneme detection is performed on each subsequence to obtain a detection information matrix; each subsequence is then re-segmented with reference to its detection information matrix; finally, each subsequence range and the corresponding recognized phoneme are output according to the subsequence correction results. The method solves the problems that the two tasks of phoneme recognition and phoneme alignment cannot be trained and executed simultaneously and that existing methods cannot achieve high accuracy on both tasks at once; it also addresses the inability of existing methods to estimate the aliased region between adjacent phonemes, which degrades phoneme alignment accuracy; and it provides more accurate data support for back-end technologies such as voiceprint recognition, improving the accuracy of speech systems.
Drawings
FIG. 1 is a view showing the structure of the apparatus of the present invention.
FIG. 2 is a flow chart of the present invention.
FIG. 3 is a flow chart of the phoneme detection model training process of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
A multitask-based phoneme detection method comprises the following steps:
Step A1) trains a phoneme detection model. As shown in fig. 3, training the phoneme detection model comprises the following steps:
Step B1) sets the set of phonemes/syllables to be recognized and arranges the phonemes/syllables in the set in a fixed order; a phoneme/syllable is the smallest speech unit divided according to the natural attributes of speech and is analysed from the pronunciation actions within a syllable, one action constituting one phoneme.
Step B2) obtains one or more speaker datasets containing the speakers' speech, the phoneme position label files corresponding to that speech, and the phoneme/syllable ordering.
Step B3) segments the speech according to the phoneme label files of step B2) to obtain phoneme segmentation subsequences, which serve as the expected results.
Step B4) equally divides the original speech sequence, generated according to the phoneme/syllable order corresponding to the speaker's speech, into basic subsequences equal in number to the phonemes; these basic subsequences form a set.
Step B5) inputs each basic subsequence into a multi-scale transformation model to obtain multi-scale subsequences, which are combined into a multi-scale transformed subsequence set. The multi-scale transformation model produces five sequence feature matrices at different scales: sequence a is the basic subsequence shifted 0.1L to the left, sequence b is the basic subsequence expanded outward from the centre by a ratio of 1.1, sequence c is the basic subsequence itself, sequence d is the basic subsequence contracted toward the centre by a ratio of 0.9, and sequence e is the basic subsequence shifted 0.1L to the right.
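A sketch of the five-scale set of step B5), reading the 0.1L moves as equidistant shifts of both end points, consistent with the translate/scale operations described for step A4); this reading, and the (start, end) representation, are assumptions.

```python
def multi_scale_transforms(seg):
    """Five scales of step B5): a/e are 0.1*L shifts, b/d are centre scalings,
    c is the unchanged basic subsequence."""
    start, end = seg
    L = end - start
    shift = int(round(0.1 * L))
    centre = (start + end) / 2.0

    def scaled(ratio):
        half = L * ratio / 2.0
        return (int(round(centre - half)), int(round(centre + half)))

    return [
        (start - shift, end - shift),  # a: moved 0.1*L to the left
        scaled(1.1),                   # b: expanded outward from the centre by 1.1
        (start, end),                  # c: the basic subsequence itself
        scaled(0.9),                   # d: contracted toward the centre by 0.9
        (start + shift, end + shift),  # e: moved 0.1*L to the right
    ]
```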
Step B6) extracts features from the basic subsequences of step B4) and the multi-scale transformed subsequences of step B5) by MFCC extraction, with the specific parameters set to a 25 ms window, a 10 ms step, 13 MFCC dimensions, 13 first-order delta dimensions, and 13 second-order delta dimensions, giving a 39-dimensional feature extraction result and thus a feature matrix for each subsequence.
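The 39-dimensional features of step B6) can be reproduced, for example, with librosa (the choice of library is ours, not the patent's): 13 MFCCs plus first- and second-order deltas, with a 25 ms window and a 10 ms step.

```python
import numpy as np
import librosa

def mfcc_39(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Step B6) features: 13 MFCC + 13 delta + 13 delta-delta = 39 dims per frame,
    25 ms window and 10 ms step (400 and 160 samples at 16 kHz)."""
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2], axis=0)  # feature matrix, shape (39, n_frames)
```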
Step B7) constructs a phoneme recognition model, specifically a three-layer RNN, taking the feature matrix of each basic subsequence as input and the corresponding phoneme category obtained in step B3) as the prediction target, and pre-trains the model, yielding a pre-trained acoustic model for phoneme recognition.
Step B8) inputs the feature matrix of each basic subsequence from step B6) into the phoneme recognition model, which outputs a phoneme probability matrix.
Step B9) constructs a phoneme regression model, specifically a three-layer one-dimensional CNN, which takes the feature matrix of each basic subsequence from step B6) as input and outputs a candidate frame for the basic subsequence over the whole audio sequence.
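A PyTorch sketch of the models of steps B7) and B9), combined into one module for brevity: a three-layer RNN recognition head and a three-layer one-dimensional CNN regression head. Hidden sizes, channel counts, and the pooling used to reduce per-frame regression outputs to a single (start, end) candidate frame are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class PhonemeDetector(nn.Module):
    """Steps B7)/B9): a 3-layer RNN phoneme recognition head and a 3-layer
    1-D CNN regression head predicting a (start, end) candidate frame."""

    def __init__(self, n_phonemes: int, feat_dim: int = 39, hidden: int = 128):
        super().__init__()
        self.rnn = nn.RNN(feat_dim, hidden, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(hidden, n_phonemes)        # softmax output layer
        self.regressor = nn.Sequential(                        # 1-D CNN over frames
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 2, kernel_size=3, padding=1),        # per-frame (start, end)
        )

    def forward(self, x: torch.Tensor):       # x: (batch, n_frames, 39)
        h, _ = self.rnn(x)
        logits = self.classifier(h[:, -1])    # phoneme class logits (softmax in the loss)
        frames = self.regressor(x.transpose(1, 2))             # (batch, 2, n_frames)
        box = frames.mean(dim=2)              # pooled (start, end) candidate frame
        return logits, box
```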
Step B10) combines the phoneme regression model and the phoneme recognition model into a joint loss function, obtains the phoneme category and position information, and updates the phoneme detection model. Specifically: 1. the output layer of the phoneme recognition model is a softmax layer, and Relu is adopted as the network loss function; 2. the phoneme regression model uses a one-dimensional DIOU as its network loss function; 3. the joint loss function is 0.4 times the phoneme recognition loss plus 0.6 times the phoneme regression loss.
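A sketch of the joint loss of step B10). A one-dimensional DIoU penalizes both low overlap and centre distance between the predicted and labelled intervals. Since ReLU is unusual as a classification loss, cross-entropy (which pairs with the stated softmax output layer) stands in for the recognition term here; that substitution is our assumption, not the patent's statement.

```python
import torch
import torch.nn.functional as F

def diou_loss_1d(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1-D DIoU loss over (start, end) intervals of shape (batch, 2)."""
    inter = (torch.min(pred[:, 1], target[:, 1])
             - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    iou = inter / union.clamp(min=1e-8)
    centre_dist = ((pred.sum(dim=1) - target.sum(dim=1)) / 2).pow(2)
    enclose = (torch.max(pred[:, 1], target[:, 1])
               - torch.min(pred[:, 0], target[:, 0])).clamp(min=1e-8)
    return (1.0 - iou + centre_dist / enclose.pow(2)).mean()

def combined_loss(logits, box_pred, labels, box_true):
    """Step B10): 0.4 x recognition loss + 0.6 x regression loss."""
    return (0.4 * F.cross_entropy(logits, labels)
            + 0.6 * diou_loss_1d(box_pred, box_true))
```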
As shown in fig. 2, the detection process of the phoneme recognition model comprises the following steps. Step A2) obtains the speech sequence to be detected: a wav file with a 16 kHz sampling rate is captured; a recognized pinyin result is obtained from the ASR speech recognition output; the phoneme recognition result is obtained from a pinyin-to-phoneme mapping table; and the number of phonemes is counted, thereby initializing the speech sequence. Information features of the speech sequence are then extracted by MFCC, with a 25 ms window, a 10 ms step, 13 MFCC dimensions, 13 first-order delta dimensions, and 13 second-order delta dimensions, giving a 39-dimensional feature extraction result.
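A toy sketch of the phoneme-count initialization of step A2); the two-entry pinyin-to-phoneme table is a hypothetical stand-in for the real mapping table, which the patent does not reproduce.

```python
# Hypothetical fragment of a pinyin-to-phoneme mapping table.
PINYIN_TO_PHONEMES = {"ni": ["n", "i"], "hao": ["h", "ao"]}

def count_phonemes(pinyin_tokens):
    """Map the ASR pinyin result to phonemes and count them (step A2))."""
    phonemes = [p for token in pinyin_tokens for p in PINYIN_TO_PHONEMES[token]]
    return phonemes, len(phonemes)

# e.g. count_phonemes(["ni", "hao"]) -> (["n", "i", "h", "ao"], 4)
```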
Step A3) segments the speech sequence into a plurality of basic subsequences, the audio sequence being equally divided, according to the phoneme count, into as many basic subsequences as there are phonemes. The segmentation method is based on either a fixed phoneme count or a window: in the fixed-phoneme-count method, the number of phonemes contained in the speech sequence is detected by a speech recognition or phoneme recognition method and the speech sequence is equally or randomly divided into that many basic subsequences; in the window-based method, a width W and a step S are preset, the width W being the length of a basic subsequence and the step S being the distance the window moves from one position to the next after each division.
Step A4) moves the end points of the basic subsequence to obtain a set of transformed subsequences. The transformed subsequences are generated by translating the two end points of the basic subsequence equidistantly or by scaling the two end points relative to the sequence centre. Specifically, five sequences at different scales are set: sequence a is the basic subsequence shifted 0.1L to the left, sequence b is the basic subsequence expanded outward from the centre by a ratio of 1.1, sequence c is the basic subsequence itself, sequence d is the basic subsequence contracted toward the centre by a ratio of 0.9, and sequence e is the basic subsequence shifted 0.1L to the right.
Step A5) inputs all transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores. The detection process comprises: 1. performing phoneme recognition on the transformed subsequence set; 2. performing phoneme regression on each scale of the transformed subsequences according to the recognition results.
Step A6) takes the transformed subsequence with the highest confidence as the new basic subsequence.
Step A7) judges whether the basic subsequence meets the termination condition: if so, the phoneme detection result and phoneme position are obtained and output; if not, the process returns to step A4). The termination condition is that the confidence difference between two consecutive phoneme recognitions is smaller than a set value a, the IoU of the two highest-confidence sequences from consecutive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum N, where N is any positive integer, i.e. when N is 1 no iteration is performed. The predicted phoneme of the transformed subsequence with the highest confidence is taken as the final phoneme detection result of the basic subsequence of step A3), and its two end-point positions as the phoneme start and end positions of that basic subsequence.
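A sketch of the termination test of step A7). The thresholds a, b, c and the maximum iteration count N are left unspecified by the patent, and the source is ambiguous about whether the conditions are alternatives or must hold jointly; the sketch treats them as alternatives, with default values chosen only for illustration.

```python
def iou_1d(a, b):
    """IoU of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def should_terminate(prev_best, curr_best, iteration,
                     a=0.01, b=0.9, c=0.95, N=10):
    """Step A7): stop on a small confidence change, the IoU test on the two
    best sequences from consecutive iterations, a confident prediction, or
    an exhausted iteration budget."""
    seg_prev, conf_prev = prev_best
    seg_curr, conf_curr = curr_best
    return (abs(conf_curr - conf_prev) < a
            or iou_1d(seg_prev, seg_curr) < c  # direction as stated in the text
            or conf_curr > b
            or iteration >= N)
```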
The phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retrainable model, the retrainable model being trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data and updating its parameters using the overlap ratio between the labelled phoneme end-point positions and the closest transformed subsequence positions.
Referring to fig. 1, a multitask-based phoneme detection device for implementing the above multitask-based phoneme detection method comprises a speech data module, a speech sequence segmentation module, and a phoneme detection module, the speech data module being signal-connected to the speech sequence segmentation module and the speech sequence segmentation module being signal-connected to the phoneme detection module. The speech data module is used for obtaining the speech sequence to be detected; the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences; the phoneme detection module is used for obtaining the detection result and phoneme position of each basic subsequence.
The invention provides a speech data module, a speech sequence segmentation module, and a phoneme detection module. A phoneme recognition model is first trained; input audio is then received and the audio sequence is segmented into a plurality of audio subsequences; phoneme detection is performed on each subsequence to obtain a detection information matrix; each subsequence is then re-segmented with reference to its detection information matrix; finally, each subsequence range and the corresponding recognized phoneme are output according to the subsequence correction results. The method solves the problems that the two tasks of phoneme recognition and phoneme alignment cannot be trained and executed simultaneously and that existing methods cannot achieve high accuracy on both tasks at once; it also addresses the inability of existing methods to estimate the aliased region between adjacent phonemes, which degrades phoneme alignment accuracy; and it provides more accurate data support for back-end technologies such as voiceprint recognition, improving the accuracy of speech systems.
Although the invention has been described herein with reference to the illustrated embodiments, these are preferred embodiments only and the invention is not limited thereto; many other modifications and embodiments will be apparent to those skilled in the art and fall within the spirit and scope of the principles of this disclosure.
Claims (9)
1. A multitask-based phoneme detection method, characterized by comprising the following steps:
step A1) training a phoneme detection model;
step A2) obtaining a speech sequence to be detected;
step A3) segmenting the speech sequence into a plurality of basic subsequences, the audio sequence being equally divided, according to the phoneme count, into as many basic subsequences as there are phonemes;
step A4) moving the end points of the basic subsequence to obtain a set of transformed subsequences;
step A5) inputting all transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores;
step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence;
step A7) judging whether the basic subsequence meets the termination condition: if so, obtaining and outputting the phoneme detection result and the start and end points of the phoneme; if not, returning to step A4); the termination condition in step A7) is that the confidence difference between two consecutive phoneme recognitions is smaller than a set value a, the IoU of the two highest-confidence sequences from consecutive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum N, where N is any positive integer, i.e. when N is 1 no iteration is performed.
2. The multitask-based phoneme detection method according to claim 1, characterized in that: the predicted phoneme of the transformed subsequence with the highest confidence in step A7) is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
3. The multitask-based phoneme detection method according to claim 1, characterized in that: the method for segmenting the speech sequence into a plurality of basic subsequences in step A3) comprises: detecting the number of phonemes contained in the speech sequence by a speech recognition or phoneme recognition method, and equally or randomly dividing the speech sequence into that many basic subsequences according to the phoneme count.
4. The multitask-based phoneme detection method according to claim 1, characterized in that: the method for generating transformed subsequences from the basic subsequence in step A4) comprises translating the two end points of the basic subsequence equidistantly or scaling the two end points of the basic subsequence relative to the sequence centre.
5. The multitask-based phoneme detection method according to claim 1, characterized in that: the phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retrainable model, the retrainable model being trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data and updating its parameters using the overlap ratio between the labelled phoneme end-point positions and the closest transformed subsequence positions.
6. A multitask-based phoneme detection device, characterized in that: it implements the multitask-based phoneme detection method according to any one of claims 1-5 and comprises a speech data module, a speech sequence segmentation module, and a phoneme detection module, the speech data module being signal-connected to the speech sequence segmentation module and the speech sequence segmentation module being signal-connected to the phoneme detection module.
7. The multitask-based phoneme detection device according to claim 6, characterized in that: the speech data module is used for obtaining the speech sequence to be detected.
8. The device according to claim 6, characterized in that: the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences.
9. The device according to claim 6, characterized in that: the phoneme detection module is used for obtaining the detection result and the phoneme position of each basic subsequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011156288.3A CN112420075B (en) | 2020-10-26 | 2020-10-26 | Multitask-based phoneme detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112420075A CN112420075A (en) | 2021-02-26 |
CN112420075B true CN112420075B (en) | 2022-08-19 |
Family
ID=74840585
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011156288.3A Active CN112420075B (en) | 2020-10-26 | 2020-10-26 | Multitask-based phoneme detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112420075B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113889083B (en) * | 2021-11-03 | 2022-12-02 | 广州博冠信息科技有限公司 | Voice recognition method and device, storage medium and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20110010243A (en) * | 2009-07-24 | 2011-02-01 | 고려대학교 산학협력단 | System and method for searching phoneme boundaries |
CN104239289A (en) * | 2013-06-24 | 2014-12-24 | 富士通株式会社 | Syllabication method and syllabication device |
CN108899047A (en) * | 2018-08-20 | 2018-11-27 | 百度在线网络技术(北京)有限公司 | The masking threshold estimation method, apparatus and storage medium of audio signal |
CN109036384A (en) * | 2018-09-06 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN109377981A (en) * | 2018-11-22 | 2019-02-22 | 四川长虹电器股份有限公司 | The method and device of phoneme alignment |
CN110223673A (en) * | 2019-06-21 | 2019-09-10 | 龙马智芯(珠海横琴)科技有限公司 | The processing method and processing device of voice, storage medium, electronic equipment |
WO2019198265A1 (en) * | 2018-04-13 | 2019-10-17 | Mitsubishi Electric Corporation | Speech recognition system and method using speech recognition system |
CN112883726A (en) * | 2021-01-21 | 2021-06-01 | 昆明理工大学 | Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5390278A (en) * | 1991-10-08 | 1995-02-14 | Bell Canada | Phoneme based speech recognition |
TWI220511B (en) * | 2003-09-12 | 2004-08-21 | Ind Tech Res Inst | An automatic speech segmentation and verification system and its method |
US8135590B2 (en) * | 2007-01-11 | 2012-03-13 | Microsoft Corporation | Position-dependent phonetic models for reliable pronunciation identification |
JP2010230695A (en) * | 2007-10-22 | 2010-10-14 | Toshiba Corp | Speech boundary estimation apparatus and method |
WO2014108890A1 (en) * | 2013-01-09 | 2014-07-17 | Novospeech Ltd | Method and apparatus for phoneme separation in an audio signal |
Also Published As
Publication number | Publication date |
---|---|
CN112420075A (en) | 2021-02-26 |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | GR01 | Patent grant | 