CN112420075B - Multitask-based phoneme detection method and device


Info

Publication number
CN112420075B
Authority
CN
China
Prior art keywords
phoneme
subsequence
basic
sequence
speech
Prior art date
Legal status
Active
Application number
CN202011156288.3A
Other languages
Chinese (zh)
Other versions
CN112420075A (en)
Inventor
谢川
Current Assignee
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN202011156288.3A
Publication of CN112420075A
Application granted
Publication of CN112420075B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for comparison or discrimination
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multitask-based phoneme detection method comprising the following steps: step A1) training a phoneme detection model; step A2) obtaining a speech sequence to be detected; step A3) segmenting the speech sequence into a plurality of basic subsequences; step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences; step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores; step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence; step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the phoneme position, and if not, returning to step A4). The invention solves the technical problems that the phoneme recognition task and the phoneme alignment task cannot be completed at the same time, that phoneme alignment accuracy is low, and that the two tasks cannot share learning results.

Description

Multitask-based phoneme detection method and device
Technical Field
The invention relates to the field of data intelligence, in particular to a multi-task-based phoneme detection method and device.
Background
With the development of deep learning, technologies built on deep speech processing, such as speech recognition, voiceprint recognition, speech synthesis, and speech emotion analysis, have made continuous breakthroughs. Phonemes are the smallest speech units divided according to the natural attributes of speech; they are the basis of most speech processing and are also very important for the rapid response of deep speech processing systems in practical scenarios. At the same time, very few existing speech databases contain phoneme alignment information, and each database follows its own phoneme definition specification, so inconsistent phoneme definitions are common and greatly hinder research in phoneme-related speech fields: because phoneme definitions are not shared across databases, data cannot be pooled or expanded, and data augmentation cannot be performed at the phoneme level. Manual phoneme annotation, for its part, greatly increases cost, consuming large amounts of labor and time; it cannot be applied to large volumes of data and cannot meet the training-data requirements of current algorithms.
Disclosure of Invention
The invention aims to provide a multitask-based phoneme detection method and device that solve the technical problems that the phoneme recognition task and the phoneme alignment task cannot be completed simultaneously, that phoneme alignment accuracy is low, and that the two tasks cannot share learning results.
The invention solves the problems through the following technical scheme:
a phoneme detection method based on multitask comprises the following steps:
step A1) training a phoneme detection model;
step A2) obtaining a voice sequence to be detected;
step A3) dividing the speech sequence into a plurality of basic subsequences;
step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences;
step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores;
step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence;
step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the phoneme position, and if not, returning to step A4), as sketched below.
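Read as an algorithm, steps A1) to A7) form an iterative refinement loop over each basic subsequence. The following is a minimal Python sketch of that control flow only; every helper it names (train_model, segment, transform_endpoints, detect, terminated) is a hypothetical stand-in for a component described later in the disclosure, not an interface defined by the patent.

```python
# Control-flow sketch of steps A1)-A7); all helpers are hypothetical stand-ins.
def detect_phonemes(audio, train_data):
    model = train_model(train_data)                       # step A1
    sequence = audio                                      # step A2
    results = []
    for base in segment(sequence):                        # step A3
        while True:
            candidates = transform_endpoints(base)        # step A4
            scored = [(detect(model, c), c) for c in candidates]        # step A5
            (phoneme, conf), base = max(scored, key=lambda t: t[0][1])  # step A6
            if terminated(base, conf):                    # step A7
                results.append((phoneme, base))           # phoneme + position
                break
    return results
```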
Further, the predicted phoneme of the transformed subsequence with the highest confidence in step A7) is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
Further, the method for segmenting the speech sequence into a plurality of basic subsequences in step A3) is either based on a fixed phoneme count or based on a window. In the fixed-phoneme-count method, the number of phonemes contained in the speech sequence is detected by a speech recognition or phoneme recognition method, and the speech sequence is divided equally or randomly into that many basic subsequences. In the window-based method, a width W and a step S are preset, where the width W is the length of a basic subsequence and the step S is the distance the window moves from the previous position to the next after each segmentation.
Further, the method for generating transformed subsequences from a basic subsequence in step A4) comprises translating the two end-point positions of the basic subsequence by equal distances or scaling the two end-point positions relative to the center of the sequence.
Further, the phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retraining model; the retraining model is trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updates its parameters using the overlap ratio between the labeled phoneme end-point positions and the closest transformed subsequence positions.
Further, the termination condition in step A7) is that the confidence difference between the phoneme recognition results of two successive iterations is smaller than a set value a, the IOU of the highest-confidence sequences from two successive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum number of iterations N, where N is any positive integer; that is, when N is 1, no iteration is performed.
The invention also provides a multitask-based phoneme detection device for implementing the above multitask-based phoneme detection method, comprising a speech data module, a speech sequence segmentation module, and a phoneme detection module, wherein the speech data module is in signal connection with the speech sequence segmentation module, and the speech sequence segmentation module is in signal connection with the phoneme detection module.
Further, the speech data module is used for acquiring the speech sequence to be detected.
Further, the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences.
Further, the phoneme detection module is used for obtaining the detection result and the phoneme position of each basic subsequence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention is provided with a voice acquisition module, a voice sequence segmentation module and a phoneme detection module, firstly, a phoneme recognition model is trained, then, input audio information is received, the audio sequence is segmented into a plurality of audio subsequences, phoneme detection is carried out on each subsequence to obtain a detection information matrix, then, each subsequence is re-segmented by referring to each subsequence detection information matrix, and finally, each subsequence range and a corresponding recognition phoneme result are output according to a subsequence correction result. The method solves the problems that the two tasks of phoneme recognition and phoneme alignment cannot be trained and executed simultaneously or the existing method cannot achieve high accuracy in the tasks of phoneme recognition and phoneme alignment simultaneously, also solves the problems that the existing method cannot estimate the aliasing part between phonemes and affects the phoneme alignment result to reduce the accuracy, provides more accurate data support for technologies such as rear-end voiceprint recognition and the like, and improves the accuracy of a voice system.
Drawings
FIG. 1 is a view showing the structure of the apparatus of the present invention.
FIG. 2 is a flow chart of the present invention.
FIG. 3 is a flow chart of the phoneme detection model training process of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
a phoneme detection method based on multitask comprises the following steps:
Step A1) training a phoneme detection model; as shown in FIG. 3, training the phoneme detection model comprises the following steps:
Step B1) setting the set of phonemes/syllables to be recognized and arranging the phonemes/syllables in the set in a fixed order. Phonemes are the smallest speech units divided according to the natural attributes of speech and are analyzed according to the pronunciation actions within a syllable, one action constituting one phoneme.
Step B2) obtaining one or more speaker datasets containing the speakers' speech, together with the phoneme position label files corresponding to that speech and to the phoneme/syllable ordering.
Step B3) segmenting according to the phoneme label files of step B2) to obtain phoneme segmentation subsequences as the expected results.
Step B4) equally dividing the original speech sequence corresponding to the phoneme/syllable sequence of the speaker's speech into a number of basic subsequences equal to the number of phonemes; these basic subsequences form a set.
Step B5) inputting each basic subsequence into a multi-scale transformation model to obtain multi-scale subsequences, which are combined into a multi-scale transformed subsequence set. The multi-scale transformation model produces 5 sequence feature matrices at different scales: sequence a is the basic subsequence expanded 0.1L to the left (L being the subsequence length), sequence b is the basic subsequence expanded outward from the center by a factor of 1.1, sequence c is the basic subsequence itself, sequence d is the basic subsequence contracted toward the center by a factor of 0.9, and sequence e is the basic subsequence expanded 0.1L to the right.
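A minimal sketch of this five-scale endpoint transform follows, assuming a subsequence is represented by sample indices (start, end) on the full audio; reading 0.1L as one tenth of the subsequence length, and the clipping to the audio bounds, are assumptions.

```python
# Sketch of the five-scale transform of step B5; names and clipping behaviour
# are illustrative assumptions, not part of the patent text.
def multi_scale_transforms(start, end, total_len):
    """Return the five transformed (start, end) windows a-e."""
    L = end - start
    center = (start + end) / 2
    candidates = [
        (start - 0.1 * L, end),                  # a: expand 0.1L to the left
        (center - 0.55 * L, center + 0.55 * L),  # b: scale by 1.1 about the center
        (start, end),                            # c: the basic subsequence itself
        (center - 0.45 * L, center + 0.45 * L),  # d: scale by 0.9 about the center
        (start, end + 0.1 * L),                  # e: expand 0.1L to the right
    ]
    # Clip to the audio and round to integer sample indices.
    return [(max(0, round(s)), min(total_len, round(e))) for s, e in candidates]
```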
Step B6) extracting features from the basic subsequences: MFCC extraction is performed on the basic subsequences of step B4) and the multi-scale transformed subsequences of step B5), with the specific parameters set to a 25 ms window function, a 10 ms step, a 13-dimensional MFCC, 13-dimensional first-order deltas, and 13-dimensional second-order deltas, giving a 39-dimensional feature extraction result and thus a feature matrix for each basic subsequence.
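A sketch of this 39-dimensional extraction is shown below; the use of librosa is an assumption (the patent names no library), and the 16 kHz sampling rate is taken from step A2.

```python
import librosa
import numpy as np

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)   # 16 kHz wav, as in step A2
    n_fft = int(0.025 * sr)                    # 25 ms window -> 400 samples
    hop = int(0.010 * sr)                      # 10 ms step   -> 160 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)           # 13-dim first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)  # 13-dim second-order deltas
    return np.vstack([mfcc, d1, d2])           # (39, n_frames) feature matrix
```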
Step B7) constructing a phoneme recognition model whose specific parameters comprise a three-layer RNN; the feature matrix of each basic subsequence and the corresponding phoneme category obtained in step B3) serve as the input and prediction target of the phoneme recognition model, which is pre-trained to obtain a pre-trained acoustic model for phoneme recognition.
Step B8) inputting the feature matrix of each basic subsequence from step B6) into the phoneme recognition model and obtaining a phoneme probability matrix as output.
Step B9) constructing a phoneme regression model, set as a 3-layer one-dimensional CNN; its input is the feature matrix of each basic subsequence from step B6), and its output is a candidate box for the basic subsequence on the whole audio frame.
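One way to realize the two heads of steps B7 and B9 is sketched below in PyTorch; the patent fixes only the layer counts, so the framework, GRU cells, hidden sizes, and pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    """Step B7: three-layer RNN over 39-dim MFCC frames -> phoneme probabilities."""
    def __init__(self, n_phonemes, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=39, hidden_size=hidden,
                          num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)      # softmax output layer (B10)

    def forward(self, x):                             # x: (batch, frames, 39)
        h, _ = self.rnn(x)
        return torch.softmax(self.out(h[:, -1]), dim=-1)

class PhonemeRegressor(nn.Module):
    """Step B9: 3-layer 1-D CNN -> (start, end) candidate box on the audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(39, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 2, kernel_size=3, padding=1),  # 2 outputs: start, end
        )

    def forward(self, x):                             # x: (batch, 39, frames)
        return self.net(x).mean(dim=-1)               # (batch, 2) interval
```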
Step B10) combining the phoneme regression model and the phoneme recognition model to obtain a combined loss function, obtaining the phoneme category and phoneme position information, and updating the phoneme detection model. The specific settings are: 1. the output layer of the phoneme recognition model is a softmax layer, and ReLU is adopted as the network loss function; 2. the phoneme regression model uses a one-dimensional DIOU as its network loss function; 3. the combined loss function is 0.4 times the phoneme recognition loss plus 0.6 times the phoneme regression loss.
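A sketch of this combined loss follows. The patent does not spell out the one-dimensional DIOU formula, so the interval-IoU-plus-center-distance form below is a plausible 1-D analogue of the standard 2-D DIoU, and the negative-log-likelihood recognition term is likewise a stand-in assumption.

```python
import torch
import torch.nn.functional as F

def diou_1d(pred, target):
    """pred, target: (batch, 2) intervals given as (start, end)."""
    inter = (torch.min(pred[:, 1], target[:, 1]) -
             torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - inter
    iou = inter / union.clamp(min=1e-8)
    enclose = (torch.max(pred[:, 1], target[:, 1]) -
               torch.min(pred[:, 0], target[:, 0])).clamp(min=1e-8)
    center_dist = (pred.mean(dim=1) - target.mean(dim=1)).abs()
    return (1.0 - iou + (center_dist / enclose) ** 2).mean()

def combined_loss(phoneme_probs, phoneme_labels, pred_boxes, true_boxes):
    # Recognition term: NLL over the softmax outputs (a stand-in assumption).
    rec_loss = F.nll_loss(torch.log(phoneme_probs + 1e-8), phoneme_labels)
    reg_loss = diou_1d(pred_boxes, true_boxes)       # 1-D DIoU regression term
    return 0.4 * rec_loss + 0.6 * reg_loss           # weights from step B10
```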
As shown in FIG. 2, the detection process of the phoneme recognition model comprises the following steps. Step A2) acquiring the speech sequence to be detected, i.e. speech acquisition: a wav file with a 16 kHz sampling rate is collected; the recognized pinyin result is obtained from the ASR speech recognition output; the phoneme recognition result is obtained from a pinyin-phoneme mapping table; and the number of phonemes is counted, thereby initializing the speech sequence. Information features of the speech sequence are then extracted: MFCC extraction is performed on the obtained speech sequence with a 25 ms window function, a 10 ms step, a 13-dimensional MFCC, 13-dimensional first-order deltas, and 13-dimensional second-order deltas, giving a 39-dimensional information feature extraction result.
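A toy sketch of the pinyin-to-phoneme lookup follows; the patent only states that a pinyin-phoneme mapping table is used, so the table entries and function name here are hypothetical.

```python
# Hypothetical pinyin-to-phoneme table; a real table covers all Mandarin syllables.
PINYIN_TO_PHONEMES = {
    "zhong": ["zh", "ong"],
    "guo": ["g", "uo"],
}

def phonemes_from_pinyin(pinyin_syllables):
    """Map ASR pinyin output to a phoneme sequence and count the phonemes."""
    phonemes = [p for syl in pinyin_syllables for p in PINYIN_TO_PHONEMES[syl]]
    return phonemes, len(phonemes)

# phonemes_from_pinyin(["zhong", "guo"]) -> (["zh", "ong", "g", "uo"], 4)
```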
Step A3) segmenting the speech sequence into a plurality of basic subsequences: according to the number of phonemes, the speech audio sequence is divided equally into a number of basic subsequences equal to the number of phonemes. The segmentation methods are: in the fixed-phoneme-count method, the number of phonemes contained in the speech sequence is detected by a speech recognition or phoneme recognition method and the speech sequence is divided equally or randomly into that many basic subsequences; in the window-based method, the speech sequence is segmented with a preset width W and step S, where the width W is the length of a basic subsequence and the step S is the distance the window moves from the previous position to the next after each segmentation. Both strategies are sketched below.
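A minimal sketch of the two segmentation strategies, operating on sample indices; the exact random-split and window semantics are assumptions read from the description above.

```python
import random

def split_by_phoneme_count(total_len, n_phonemes, equal=True):
    """Divide [0, total_len) into n_phonemes basic subsequences."""
    if equal:
        bounds = [round(i * total_len / n_phonemes) for i in range(n_phonemes + 1)]
    else:  # random split: sort n_phonemes - 1 random cut points
        cuts = sorted(random.sample(range(1, total_len), n_phonemes - 1))
        bounds = [0] + cuts + [total_len]
    return list(zip(bounds[:-1], bounds[1:]))

def split_by_window(total_len, width_w, step_s):
    """Slide a window of width W by step S to produce basic subsequences."""
    return [(s, min(s + width_w, total_len))
            for s in range(0, total_len - width_w + step_s, step_s)]
```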
Step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences. The transformed subsequences are generated from a basic subsequence by translating its two end-point positions by equal distances or by scaling the two end-point positions relative to the center of the sequence. The specific setting uses 5 sequences at different scales, as in step B5): sequence a is the basic subsequence expanded 0.1L to the left, sequence b is the basic subsequence expanded outward from the center by a factor of 1.1, sequence c is the basic subsequence itself, sequence d is the basic subsequence contracted toward the center by a factor of 0.9, and sequence e is the basic subsequence expanded 0.1L to the right.
Step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores. The detection process is: 1. perform phoneme recognition on the transformed subsequence set; 2. perform phoneme regression on each scale of the transformed subsequences according to the recognition results.
Step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence.
Step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the phoneme position, and if not, returning to step A4). The termination condition is that the confidence difference between the phoneme recognition results of two successive iterations is smaller than a set value a, the IOU of the highest-confidence sequences from two successive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum number of iterations N, where N is any positive integer; that is, when N is 1, no iteration is performed. The predicted phoneme of the transformed subsequence with the highest confidence is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
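Putting steps A4) to A7) together for one basic subsequence gives the refinement loop below, reusing the multi_scale_transforms sketch from step B5. The detect argument stands in for the trained phoneme detection model and is assumed to return (phoneme, confidence) for a window; a, b, c_pct, and max_iters correspond to the thresholds a, b, c%, and N, and their default values are placeholders, since the patent leaves them unspecified.

```python
def interval_iou(x, y):
    """IoU of two (start, end) intervals."""
    inter = max(0, min(x[1], y[1]) - max(x[0], y[0]))
    union = (x[1] - x[0]) + (y[1] - y[0]) - inter
    return inter / union if union > 0 else 0.0

def refine_subsequence(detect, start, end, total_len,
                       a=0.01, b=0.9, c_pct=50.0, max_iters=10):
    """Steps A4-A7 for one basic subsequence; detect(window) -> (phoneme, conf)."""
    window, conf = (start, end), 0.0
    phoneme = None
    for _ in range(max_iters):                                   # step A7: N cap
        prev_window, prev_conf = window, conf
        candidates = multi_scale_transforms(*window, total_len)  # step A4
        scored = [(detect(w), w) for w in candidates]            # step A5
        (phoneme, conf), window = max(scored, key=lambda t: t[0][1])  # step A6
        # Step A7 termination as stated in the patent: small confidence change,
        # successive best-window IOU below c%, and confidence above b.
        if (abs(conf - prev_conf) < a and conf > b and
                interval_iou(window, prev_window) * 100 < c_pct):
            break
    return phoneme, window, conf   # phoneme result plus start/end positions
```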
The phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retraining model; the retraining model is trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updates its parameters using the overlap ratio between the labeled phoneme end-point positions and the closest transformed subsequence positions.
Referring to FIG. 1, a multitask-based phoneme detection device for implementing the multitask-based phoneme detection method comprises a speech data module, a speech sequence segmentation module, and a phoneme detection module, wherein the speech data module is in signal connection with the speech sequence segmentation module, and the speech sequence segmentation module is in signal connection with the phoneme detection module. The speech data module is used for acquiring the speech sequence to be detected. The speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences. The phoneme detection module is used for obtaining the detection result and phoneme position of each basic subsequence.
The invention provides a speech data module, a speech sequence segmentation module, and a phoneme detection module. A phoneme recognition model is first trained; input audio is then received and the audio sequence is segmented into a plurality of audio subsequences; phoneme detection is performed on each subsequence to obtain a detection information matrix; each subsequence is then re-segmented with reference to its detection information matrix; and finally the range of each subsequence and the corresponding recognized phoneme are output according to the subsequence correction results. The method solves the problems that phoneme recognition and phoneme alignment cannot be trained and executed simultaneously and that existing methods cannot achieve high accuracy on both tasks at once; it also solves the problem that existing methods cannot estimate the aliasing between adjacent phonemes, which degrades the phoneme alignment result and reduces accuracy, and it provides more accurate data support for back-end technologies such as voiceprint recognition, improving the accuracy of the speech system.
Although the invention has been described herein with reference to the illustrated embodiments, which are preferred embodiments of the invention, the invention is not limited thereto; many other modifications and embodiments will be apparent to those skilled in the art and fall within the spirit and scope of the principles of this disclosure.

Claims (9)

1. A multitask-based phoneme detection method, characterized in that the method comprises the following steps:
step A1) training a phoneme detection model;
step A2) obtaining a voice sequence to be detected;
step A3) dividing the speech sequence into a plurality of basic subsequences, the speech audio sequence being divided equally, according to the number of phonemes, into a number of basic subsequences equal to the number of phonemes;
step A4) moving the end points of each basic subsequence to obtain a set of transformed subsequences;
step A5) inputting all the transformed subsequences into the phoneme detection model to obtain predicted phonemes and corresponding confidence scores;
step A6) taking the transformed subsequence with the highest confidence as the new basic subsequence;
step A7) judging whether the basic subsequence meets the termination condition; if so, obtaining and outputting the phoneme detection result and the start and end points of the phoneme, and if not, returning to step A4); the termination condition in step A7) is that the confidence difference between the phoneme recognition results of two successive iterations is smaller than a set value a, the IOU of the highest-confidence sequences from two successive iterations is smaller than c%, the phoneme recognition confidence is greater than b, and the number of iterations is greater than or equal to a preset maximum number of iterations N, where N is any positive integer, i.e. when N is 1 no iteration is performed.
2. The multitask-based phoneme detection method of claim 1, characterized in that: the predicted phoneme of the transformed subsequence with the highest confidence in step A7) is used as the final phoneme detection result of the basic subsequence of step A3), and the two end-point positions of that transformed subsequence are used as the phoneme start and end positions of the basic subsequence of step A3).
3. The multitask-based phoneme detection method of claim 1, characterized in that: the method for segmenting the speech sequence into a plurality of basic subsequences in step A3) comprises detecting the number of phonemes contained in the speech sequence by a speech recognition or phoneme recognition method, and dividing the speech sequence equally or randomly into that many basic subsequences.
4. The multitask-based phoneme detection method of claim 1, characterized in that: the method for generating transformed subsequences from a basic subsequence in step A4) comprises translating the two end-point positions of the basic subsequence by equal distances or scaling the two end-point positions relative to the center of the sequence.
5. The multitask-based phoneme detection method of claim 1, characterized in that: the phoneme detection model in step A1) comprises a convolutional neural network, an SVM, or a retraining model, the retraining model being trained on the text information of the speech data and the phoneme end-point position labels corresponding to the speech data, and updating its parameters using the overlap ratio between the labeled phoneme end-point positions and the closest transformed subsequence positions.
6. A multitask-based phoneme detection device for implementing the multitask-based phoneme detection method of any one of claims 1-5, characterized in that: it comprises a speech data module, a speech sequence segmentation module, and a phoneme detection module, the speech data module being in signal connection with the speech sequence segmentation module, and the speech sequence segmentation module being in signal connection with the phoneme detection module.
7. The multitask-based phoneme detection device of claim 6, characterized in that: the speech data module is used for acquiring the speech sequence to be detected.
8. The multitask-based phoneme detection device of claim 6, characterized in that: the speech sequence segmentation module is used for segmenting the speech sequence into a plurality of basic subsequences.
9. The multitask-based phoneme detection device of claim 6, characterized in that: the phoneme detection module is used for obtaining the detection result and phoneme position of each basic subsequence.
CN202011156288.3A 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device Active CN112420075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011156288.3A CN112420075B (en) 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011156288.3A CN112420075B (en) 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device

Publications (2)

Publication Number Publication Date
CN112420075A CN112420075A (en) 2021-02-26
CN112420075B (en) 2022-08-19

Family

ID=74840585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011156288.3A Active CN112420075B (en) 2020-10-26 2020-10-26 Multitask-based phoneme detection method and device

Country Status (1)

Country Link
CN (1) CN112420075B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889083B (en) * 2021-11-03 2022-12-02 广州博冠信息科技有限公司 Voice recognition method and device, storage medium and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110010243A (en) * 2009-07-24 2011-02-01 고려대학교 산학협력단 System and method for searching phoneme boundaries
CN104239289A (en) * 2013-06-24 2014-12-24 富士通株式会社 Syllabication method and syllabication device
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109036384A (en) * 2018-09-06 2018-12-18 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109377981A (en) * 2018-11-22 2019-02-22 四川长虹电器股份有限公司 The method and device of phoneme alignment
CN110223673A (en) * 2019-06-21 2019-09-10 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of voice, storage medium, electronic equipment
WO2019198265A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Corporation Speech recognition system and method using speech recognition system
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390278A (en) * 1991-10-08 1995-02-14 Bell Canada Phoneme based speech recognition
TWI220511B (en) * 2003-09-12 2004-08-21 Ind Tech Res Inst An automatic speech segmentation and verification system and its method
US8135590B2 (en) * 2007-01-11 2012-03-13 Microsoft Corporation Position-dependent phonetic models for reliable pronunciation identification
JP2010230695A (en) * 2007-10-22 2010-10-14 Toshiba Corp Speech boundary estimation apparatus and method
WO2014108890A1 (en) * 2013-01-09 2014-07-17 Novospeech Ltd Method and apparatus for phoneme separation in an audio signal

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110010243A (en) * 2009-07-24 2011-02-01 고려대학교 산학협력단 System and method for searching phoneme boundaries
CN104239289A (en) * 2013-06-24 2014-12-24 富士通株式会社 Syllabication method and syllabication device
WO2019198265A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Corporation Speech recognition system and method using speech recognition system
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109036384A (en) * 2018-09-06 2018-12-18 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109377981A (en) * 2018-11-22 2019-02-22 四川长虹电器股份有限公司 The method and device of phoneme alignment
CN110223673A (en) * 2019-06-21 2019-09-10 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of voice, storage medium, electronic equipment
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning

Also Published As

Publication number Publication date
CN112420075A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN109087648B (en) Counter voice monitoring method and device, computer equipment and storage medium
CN109192213B (en) Method and device for real-time transcription of court trial voice, computer equipment and storage medium
JP5059115B2 (en) Voice keyword identification method, apparatus, and voice identification system
Sainath et al. Exemplar-based sparse representation features: From TIMIT to LVCSR
US6845357B2 (en) Pattern recognition using an observable operator model
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN108885870A (en) For by combining speech to TEXT system with speech to intention system the system and method to realize voice user interface
US20210350791A1 (en) Accent detection method and accent detection device, and non-transitory storage medium
EP2617030A1 (en) Deep belief network for large vocabulary continuous speech recognition
CN111798840A (en) Voice keyword recognition method and device
CN111292763B (en) Stress detection method and device, and non-transient storage medium
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
CN112420075B (en) Multitask-based phoneme detection method and device
Kherdekar et al. Convolution neural network model for recognition of speech for words used in mathematical expression
CN112686041B (en) Pinyin labeling method and device
CN113744727A (en) Model training method, system, terminal device and storage medium
CN113763939B (en) Mixed voice recognition system and method based on end-to-end model
JP7505582B2 (en) SPEAKER DIARIZATION METHOD, SPEAKER DIARIZATION DEVICE, AND SPEAKER DIARIZATION PROGRAM
JP7505584B2 (en) SPEAKER DIARIZATION METHOD, SPEAKER DIARIZATION DEVICE, AND SPEAKER DIARIZATION PROGRAM
Abraham et al. Articulatory Feature Extraction Using CTC to Build Articulatory Classifiers Without Forced Frame Alignments for Speech Recognition.
JP3589044B2 (en) Speaker adaptation device
CN113593525A (en) Method, device and storage medium for training accent classification model and accent classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant