CN103345922A - Large-length voice full-automatic segmentation method
- Publication number: CN103345922A (application CN201310280159.9A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a fully automatic segmentation method for long speech: a zero-labeling automatic sentence segmentation algorithm with high accuracy. The algorithm fuses an HMM-based unsupervised forced-alignment algorithm with a semi-supervised learning method. By establishing an iteration mechanism based on a common time axis, the semi-supervised minimal-labeling sentence segmentation algorithm automatically expands the small accurate label set provided by the zero-labeling sentence segmentation algorithm, so that the accurate label set is maximized; the original long speech is then cut into smaller paragraphs or sentence sets at the sentence boundaries found to be correct. Because the method fuses forced alignment under the HMM with the Co_training method of semi-supervised learning, no manual intervention is needed during the segmentation of long speech and segmentation accuracy remains high. The method can be applied to the rapid automatic construction of speech corpora.
Description
Technical Field
The invention belongs to the technical fields of speech synthesis, speech recognition, and speech retrieval and labeling, and relates to a fully automatic segmentation method for long speech.
Background
Two approaches to speech synthesis currently prevail. The first is HMM-based trainable speech synthesis (Trainable TTS), such as CLUSTERGEN of Carnegie Mellon University (CMU) in the United States and the HTS speech synthesis engine developed by the Nagoya Institute of Technology in Japan, both of which use parametric statistical (Parametric Statistical) synthesis. The second is speech synthesis based on a large speech corpus (corpus-based TTS), such as KX-PSOLA (1993) of the Institute of Acoustics, Chinese Academy of Sciences, and the synthesis techniques adopted by telecom platforms, which synthesize speech by unit selection and waveform concatenation. The core of both techniques is a well-labeled, high-accuracy speech corpus. A speech corpus (speech Corpus) is currently constructed by recording a large text sentence by sentence and then labeling it manually sentence by sentence: because the speech units at the beginning and end of a single-sentence recording often differ from the units inside the sentence, forcing the speech units to align to a given label (a Transcript, converted directly from the text) by the Viterbi algorithm requires manual boundary adjustment. FIG. 1 shows the general steps of the conventional method for constructing a speech corpus. This method of construction is highly subjective, the manual labeling lacks consistency, and a large amount of cost and time is spent. Meanwhile, single-sentence recording inevitably loses the rich prosodic features and context information contained in natural language. Such prosody and context information is hard to recover, yet it contributes to speech understanding and conveys pragmatic meaning such as tone, speech structure and the speaker's emotion. This information provides parameters essential to synthesizing more expressive speech.
Single-sentence recording and manual labeling are therefore the bottleneck preventing existing speech synthesis engines from becoming more expressive. From this point of view, finding a method that accurately and automatically segments naturally recorded long speech containing multiple paragraphs into single sentences is a key problem for reducing the construction cost of a speech corpus and improving the expressiveness of speech synthesis (Viterbi forced alignment can then be performed directly on such single sentences without manual boundary adjustment, because within long speech the differences among beginning, middle and ending speech units are small).
On the problem of automatic sentence segmentation, most existing methods simply pursue segmentation accuracy and recall and require a large amount of manual labeling, or reduce the labeling amount somewhat on that basis; these two lines of research are not repeated here. Fully automatic sentence segmentation has received little study. In 2011, Alan W Black and Kishore Prahallad of CMU proposed a method for automatically segmenting a voice book (VoiceBook) and then building a speech synthesis engine from the result. Their label-free automatic sentence segmentation algorithm derives sentence boundaries from the spectral parameters of the speech; although it requires no labels and ensures that the segmentation results are accurate, it wastes a great deal of data, and only 40.4% accuracy can be guaranteed. The semi-supervised minimal-labeling automatic sentence segmentation algorithm relies on prosodic parameters to detect and classify sentence boundaries; although it reduces the labeling amount significantly, high accuracy cannot be guaranteed.
Disclosure of Invention
To solve the above technical problems, the invention provides a fully automatic segmentation method for long speech: a label-free automatic sentence segmentation algorithm with high accuracy. The algorithm fuses an unsupervised forced-alignment algorithm based on an HMM (hidden Markov model) with a semi-supervised learning method. By establishing an iteration mechanism based on a common time axis, the semi-supervised minimal-labeling sentence segmentation algorithm automatically expands the small accurate label set provided by the label-free sentence segmentation algorithm, maximizing the accurate label set; the original long speech is then segmented into smaller paragraphs or sentence sets at the correct sentence boundaries obtained.
The technical scheme is as follows:
a fully automatic segmentation method for long speech comprises the following steps:
(1) the zero-labeling sentence segmentation system (ZLSS) method provides accurate time data for the labeled sentence boundaries (periods), and a HashMap tracking and lookup mechanism maps this time data, according to its position on the time axis, to the input of the minimal-labeling sentence classification system (MLSS) algorithm;
(2) a boundary feature extraction program uses the mapped time data to extract the corresponding data-frame features from the original file, in preparation for the classification iterations of collaborative training (Co_training). Note that the boundary feature extraction program is embedded in the MLSS algorithm, and its extraction object is the original long, multi-paragraph audio; the time information of the corresponding sentence boundary points is likewise relative to the initial long speech. Before the subsequent steps are performed, this extraction program extracts the feature information of all candidate sentence boundaries (Candidate Sentence Boundaries);
(3) the boundary feature information of the correct period positions extracted in the previous step is added to the MLSS training set for Co_training, and further classification yields more new periods; the MLSS algorithm is in fact a binary classifier based on maximum entropy (Maximum Entropy) and the Co_training algorithm;
(4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program further maps these to the time axis consistent with ZLSS before the next step is carried out;
(5) a judgment is made as to whether a new period was found in this iteration; if not, the whole iteration process ends, and if a new period was found, the next step is carried out;
(6) after the time-point information output by the conversion program is obtained, the segmentation method provided by ZLSS further segments the current long speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
(7) one point must be noted: in each iteration, the ZLSS method segments the original text and speech into relatively smaller paragraphs or sentences at the found sentence positions, discarding the original time information while retaining only the new sentence information (times relative to the smaller paragraphs) found in the current iteration; a HashMap-based tracking and lookup mechanism is therefore adopted to map all found correct sentence times uniformly onto the initial time axis, in preparation for the next classification iteration;
(8) the above steps are performed repeatedly.
Further preferably, a HashMap-based tracking and lookup mechanism is adopted to map all found correct sentence-time information uniformly onto the initial time axis, in preparation for the next classification iteration.
Further preferably, the ZLSS method treats the silence at a sentence boundary as an independent phoneme sil: first, the hidden Markov model of each phoneme in the speech is trained by an HMM-based unsupervised method and the flat-start training algorithm, and the phoneme sequence of the long speech is aligned with its text by Viterbi forced alignment; sentences are then segmented at the sentence-ending symbols in the text, and finally a strict checking mechanism judges whether each segmented sentence is correct, yielding a smaller set of correct boundary labels.
Further preferably, the ZLSS method introduces an iterative algorithm: first, the long speech is segmented into paragraph speech and sentence speech at the correct, verified sil boundaries given by the above checking mechanism; then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the previous iteration, i.e. whether a new correct sil has been found; if so, the resulting speech and text replace those of the previous round, the HMMs are retrained, and the iteration continues; if not, the iteration process is over.
Further preferably, the method further comprises automatically expanding the accurate label set:
first, prosodic features are used to classify audio frame segments (Frame-Clips) into vowel/consonant/pause (V/C/P); second, minimal-labeling sentence boundary detection is realized within the Co_training and Active Learning frameworks, searching for sentence boundaries inside pauses; finally, a strict prosodic-feature error detection mechanism is studied to determine accurate sentence boundaries.
Further preferably, the V/C/P classification is as follows: first, the original audio data is divided into non-overlapping 20 ms frames; the energy (Energy), zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are then calculated, and the energy curve and pitch curve are smoothed;
after the above features are extracted, each data frame is given a vowel/consonant/pause classification according to these three features; the classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining the energy threshold:
to find a suitable energy threshold for detecting pauses, statistics are gathered over the labeled data provided by ZLSS according to the following principle:
I. each item of labeled data provided by the labeling system comprises a number of data frames; once the data frames are obtained, the mean of their energies is calculated;
II. the energy threshold is set to the maximum of all the energy means;
after these statistics, the energy threshold was set to 0.005;
2) the mean and variance of the zero-crossing rate (MZCR and VZCR) are calculated, and the ZCR threshold TZCR is defined as:
TZCR = MZCR + 0.005·VZCR;
3) the data frames are V/C/P classified by the following criteria, with FrameType indicating the type of frame:
if ZCR > TZCR, FrameType is Consonant;
otherwise, if Energy < 0.005, FrameType is Pause;
otherwise, FrameType is Vowel;
4) frames classified by V/C/P are merged by category: consecutive data frames of the same category are treated as one variable-length unit of that category, and if a short consonant unit lies between two adjacent pause units, the two units are merged and the middle C is replaced by P;
5) vowel segmentation is performed: if a vowel unit lasts too long, it is split at its energy valley; in cases where the energy has no valley, it is split into equal-length parts.
The duration of a short consonant unit is less than two frames.
Further preferably, an error detection mechanism is introduced, whose workflow is briefly as follows:
A) periods, question marks and semicolons are taken as sentence boundary identifiers on the text, and a, e, i, o, u represent vowels on the text; the total number of vowels on the text is calculated and recorded as TV; for each boundary on the text, the number of vowels from the start of the text to the boundary and from the boundary to the end of the text is calculated and recorded as TP and TS respectively; several vowels joined together are treated as one vowel;
B) for each candidate boundary found by the classifier, the total number of vowels AV is calculated from the V/C/P classification result, along with the number of vowels from the start of the V/C/P classification result to the candidate boundary and from the candidate boundary to the end of the audio, recorded as AP and AS respectively;
C) if either |AP/AV − TP/TV| or |AS/AV − TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary. Note that the value 0.015 is obtained statistically: the above formula is computed over a certain number of original text paragraphs and speech, and the average is taken as the threshold for the sentence-boundary decision.
Further preferably, the Co_training-based minimal-labeling sentence segmentation algorithm has four steps: first, V/C/P classification of the audio; then feature extraction over the data frames; then training and classification with the added classifiers; finally, the classification result is sent to the checking mechanism to further ensure its correctness.
Preferably, the labeling result automatically obtained in the ZLSS iteration is used as the input of the MLSS algorithm; the MLSS algorithm expands the automatic labeling result, and the expanded labeling result is then used as the input of the ZLSS algorithm, so that valid labels keep expanding. The ZLSS and MLSS algorithms thus form a rolling iteration in which each is the other's input and output; labels are continuously expanded until a maximized accurate label set is finally formed, the entire result being obtained automatically.
Compared with the prior art, the invention performs well on the segmentation of multi-paragraph long speech, mainly in two respects. First, the proposed method avoids generating sentence boundaries (Sentence Boundary) containing human errors, as SFA-1 may; such errors harm boundary detection and prosodic-parameter extraction and thus directly degrade the quality of the final synthesized speech. Second, although the SFA-2 method reduces the memory consumption of the whole process and the computational cost, it still merges several clauses into one large paragraph, which prevents the Viterbi forced-alignment algorithm from performing optimally: the combinatorial explosion of the search paths causes misplacement of the decoded state sequence and degrades the performance of the whole system considerably. The present invention avoids this disadvantage of the SFA-2 method.
Drawings
FIG. 1 is a general process for constructing a conventional corpus of speech;
FIG. 2 is a schematic diagram of a fully automatic label-free sentence segmentation;
FIG. 3 is a hash table trace lookup mechanism;
FIG. 4 is the checking mechanism; FIG. 4A is an example identified as a correct period; FIG. 4B is an example identified as an erroneous period;
FIG. 5 is an annotation system ZLSS iterative algorithm;
FIG. 6 is a vowel segmentation rule;
FIG. 7 is a set of features for sentence boundary detection;
FIG. 8 is a numerical calculation illustration;
FIG. 9 is a Co _ training based minimized annotated sentence segmentation system;
FIG. 10 is a histogram of classification performance of the classification system.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings and an embodiment.
Automatic sentence segmentation based on spectral-parameter HMMs (hidden Markov models) and prosodic parameters
Introduction to the fully automatic sentence segmentation algorithm
First, the label-free automatic sentence segmentation algorithm and the semi-supervised minimal-labeling sentence segmentation algorithm are regarded as two subsystems of the overall automatic sentence segmentation framework. The former, a label-free sentence segmentation algorithm based on the forced-alignment algorithm, serves as the labeling system providing a small amount of trainable accurate data and is renamed the Sub-Labeling System (ZLSS). The latter, the minimal-labeling automatic sentence segmentation algorithm, serves as the classification system for automatically expanding the small accurately labeled data set provided by ZLSS and is renamed the Sub-Classification System (MLSS). A description of the entire algorithm follows, as shown in FIG. 2.
The algorithm steps are as follows:
1) the ZLSS method provides accurate labeling data (period time information), which the HashMap tracking and lookup mechanism maps, according to its position on the time axis, to the input of the MLSS algorithm;
2) the boundary feature extraction program uses the mapped time data to extract the corresponding data-frame features from the original file, in preparation for the Co_training classification iterations below;
3) the boundary feature information of the correct period positions extracted in the previous step is added to the MLSS training set for Co_training, and further classification yields new periods;
4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program maps these to the time axis consistent with ZLSS before the next step;
5) a judgment is made as to whether a new period was found in this iteration; if not, the whole iteration process ends, and if a new period was found, the next step is carried out;
6) after the time-point information output by the conversion program is obtained, the segmentation method provided by ZLSS further segments the current long speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
7) the above steps are performed repeatedly (a minimal control-flow sketch of this loop follows).
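The control flow of steps 1)-7) can be summarized in code as below. This is a minimal structural sketch only: `zlss_label`, `extract_boundary_features`, `mlss_co_training`, `to_original_axis` and `zlss_segment` are hypothetical placeholders standing in for the subsystems described in this document, not real APIs.

```python
# Minimal sketch of the ZLSS/MLSS rolling iteration (steps 1-7).
# All helper names are hypothetical stand-ins for the subsystems in the text.
def fully_automatic_segmentation(audio, text):
    boundaries = set(zlss_label(audio, text))          # step 1: seed labels from ZLSS
    while True:
        feats = extract_boundary_features(audio, boundaries)   # step 2
        candidates = mlss_co_training(feats)                   # step 3: Co_training classifier
        new = set(to_original_axis(candidates)) - boundaries   # step 4: back to the ZLSS time axis
        if not new:                                            # step 5: no new period found
            break
        boundaries |= new
        audio, text = zlss_segment(audio, text, boundaries)    # step 6: re-cut the corpus
    return sorted(boundaries)                                  # step 7 repeats via the loop
```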
The HashMap tracking and lookup mechanism. One point must be noted: in each iteration, the ZLSS method segments the original text and speech into relatively smaller paragraphs or sentences at the found sentence positions; the time information of the original text and speech is discarded while only the new sentence information (times relative to the smaller paragraphs) found in the current iteration is retained. A HashMap-based tracking and lookup mechanism is therefore adopted to map all found correct sentence times uniformly onto the initial time axis, in preparation for the next classification iteration, as shown in FIG. 3.
FIG. 3 shows three iterations of tracking and lookup performed on a given piece of audio data. Three iterations are performed in total, each finding a new period, corresponding in turn to sentence points I, II and III. At each iteration, however, the system replaces the current corpus with the paragraphs or sentences segmented at the newly found period; in other words, the period found by the current iteration loses its position in the original audio file, which complicates the detection and classification performed later. We therefore need to extract the feature parameters of the data frames at the original audio position of each newly found sentence boundary and add them to the training set for classification and detection.
As the figure shows, when the third iteration finds a new period, how is it located in the original audio? The position of the previous iteration's sentence point relative to the original audio is looked up through the hash table; pushing back in the same way finds the positions of all earlier sentence points, so the new sentence points found in this iteration are mapped in turn to their positions in the original audio file. The shaded portion of the figure represents sentences that have been correctly segmented and removed from the current corpus. FIG. 2 shows the overall schematic of fully automatic sentence segmentation.
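A runnable sketch of this idea follows: each cut segment remembers its start offset in the original audio, so a sentence time found inside a segment can be pushed back to the initial axis. The class and segment identifiers here are illustrative assumptions, not part of the patent.

```python
# Minimal HashMap-based tracking mechanism: segment_id -> start offset (seconds)
# of that segment in the ORIGINAL audio, so local times map back to the initial axis.
class HashMapTracker:
    def __init__(self):
        self.offsets = {"root": 0.0}

    def register_segment(self, parent_id, segment_id, start_in_parent):
        # a new segment cut out of `parent_id`, beginning at `start_in_parent` seconds
        self.offsets[segment_id] = self.offsets[parent_id] + start_in_parent

    def to_original_time(self, segment_id, t_in_segment):
        # map a boundary time found in the current segment back to the original axis
        return self.offsets[segment_id] + t_in_segment

tracker = HashMapTracker()
tracker.register_segment("root", "iter1_seg2", 37.4)      # 2nd segment of iteration 1
tracker.register_segment("iter1_seg2", "iter2_seg1", 5.1) # cut again in iteration 2
print(tracker.to_original_time("iter2_seg1", 2.0))        # -> 44.5 s on the original axis
```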
practical feasibility of maximized accurate full-automatic labeling algorithm
Because all the speech studied by the invention has well-matched text, the spectral parameters of the speech signal and the corresponding text can be forcibly aligned by the Viterbi algorithm using the HTK toolkit, giving the boundary information of the speech segments (the HMM sequence) on the basis of automatic phoneme segmentation. Speech phenomena are very rich, however, especially in Chinese, and modeling with spectral parameters alone cannot represent them all. The segmentation result may therefore exhibit the following problems: (1) for some boundary types, the automatic segmentation may be offset from natural manual segmentation; (2) actual speech may contain mispronunciation (mis-pronunciation). Current methods solve these problems with manual adjustment and proofreading, which plainly contradicts the goal of this invention.
To address these problems, the idea adopted here is to combine the spectral-parameter HMM sentence segmentation algorithm with the prosodic-parameter sentence segmentation method so as to maximize the accurate label set. The fusion of the two methods is feasible because they share the same time scale: they can therefore be placed in a mutual iteration on the same time axis, each serving as the other's input and output in a rolling iteration.
Label-free automatic sentence segmentation based on spectral-parameter HMMs (hidden Markov models)
Segmentation principle of the ZLSS algorithm
This algorithm automatically forms the accurate initial label set of the system. In this algorithm, the silence at a sentence boundary is treated as an independent phoneme (sil). First, the hidden Markov model of each phoneme in the speech is trained by an HMM-based unsupervised method and the flat-start training algorithm, and the phoneme sequence of the long speech is aligned with its text by Viterbi forced alignment (forced-alignment). Sentences are then segmented according to the sentence-ending symbols in the text (e.g. periods, question marks, exclamation marks, semicolons). Finally, a strict checking mechanism judges whether each segmented sentence is correct, yielding a smaller set of correct boundary labels.
Error detection mechanism
The checking mechanism is introduced because, when long speech data is forcibly aligned with its phoneme sequence, the complexity of the Viterbi algorithm and the consumption of computer storage grow with the data length while the alignment quality tends to fall, and alignment errors can even occur. To solve this problem, the checking mechanism is added to further ensure that every period found is correct.
After the long speech and the long phoneme sequence (obtained by dictionary lookup) are aligned by Viterbi, the position of the phoneme sequence within the speech is determined. First, the position of each period in the audio file is preliminarily fixed from the alignment result, and the adjacent (preceding and following) phoneme information and time information are obtained at the same time. After the phonemes preliminarily determined to precede and follow the period's sil phoneme are obtained, the single-phoneme recognizer provided by the system recognizes these adjacent segments. The recognition result is compared with the phonemes at the corresponding positions of the text: if the recognition results of the preceding and following phonemes both match the text, the period is accepted as correct; otherwise it is considered an erroneous period or not a period at all. FIG. 4 gives a sample illustration of the checking mechanism.
In the example of FIG. 4A, the last phoneme of the previous sentence and the first phoneme of the next sentence are ang and er respectively, and the recognition results are likewise ang and er, indicating a correct period. In FIG. 4B the last phoneme of the previous sentence is recognized incorrectly, so the Viterbi decoding is considered to contain errors and the sentence boundary is not accepted as correct.
Automatic sentence segmentation iterative algorithm
An iterative algorithm is introduced here. First, the long speech is segmented into paragraph speech and sentence speech at the correct, verified sil boundaries given by the above checking mechanism. Then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the previous iteration, i.e. whether a new correct sil has been found. If so, the resulting speech and text replace those of the previous round, the HMMs are retrained, and the iteration continues; if not, the iteration process is over.
The rationale for iterating is that as the iteration proceeds, new periods keep being found, and the original long speech and text are cut at the found periods into smaller paragraphs and sentences, from which more accurate Chinese phoneme HMMs can be trained; these in turn find more new periods, until no further new period can be found and the iteration ends.
It should also be noted that in the forced-alignment algorithm, the segmentation accuracy for boundary errors within 20 ms to 50 ms exceeds 95%; the phonemes therefore lie essentially near the phoneme segmentation boundary, and a sliding mechanism (Sliding Mechanism) is added to search for the phonemes before and after the sil at the boundary, improving the detection rate of correct sil. FIG. 5 gives a detailed flow chart of this algorithm.
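The loop of FIG. 5 can be summarized structurally as below; `flat_start_train`, `align_and_verify_sils` and `cut_at_sils` are hypothetical stand-ins for the HTK-based training, alignment and checking steps described above, not real APIs.

```python
# Structural sketch of the ZLSS iteration: retrain, align, verify, re-cut,
# and stop when no new correct sil is found.
def zlss_iterate(audio_set, text_set):
    prev_total = -1
    while True:
        hmms = flat_start_train(audio_set, text_set)             # retrain phone HMMs
        sils = align_and_verify_sils(hmms, audio_set, text_set)  # keep verified sils only
        audio_set, text_set = cut_at_sils(audio_set, text_set, sils)
        total = len(audio_set)                   # sentences + paragraphs obtained so far
        if total == prev_total:                  # no new correct sil: iteration is over
            return audio_set, text_set
        prev_total = total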
Minimal-labeling sentence segmentation based on prosodic features
Principle of minimal-labeling sentence segmentation
The accurate label set provided by the ZLSS labeling system above is not sufficient to construct a speech corpus capable of synthesizing more natural speech. This section therefore studies a prosodic-feature, minimal-labeling sentence segmentation method, based on semi-supervised learning and active learning theory, for automatically expanding the accurate label set.
First, prosodic features are used to classify audio frame segments (Frame-Clips) into vowel/consonant/pause (V/C/P). Then, minimal-labeling sentence boundary detection (finding sentence boundaries within pauses) is realized within the Co_training and Active Learning architectures. Finally, exact sentence boundaries are determined by a strict prosodic-feature error detection mechanism (e.g. comparing the vowel/consonant/pause ratios of text and audio).
Classification of V/C/P
First, the original audio data is divided into non-overlapping 20 ms frames; the energy (Energy), zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are then calculated, and the energy curve and pitch curve are smoothed.
After the features are extracted, each data frame is given a vowel/consonant/pause classification according to these three features. The classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining the energy threshold:
pauses are an important cue for sentence boundary detection, and the Chinese news-broadcast speech environment used here is relatively stable, so a suitable energy threshold can be found to detect pauses. Because different speech has different characteristics, the energy threshold for news speech is determined by gathering statistics over the labeled data provided by ZLSS according to the following principle:
I. each item of labeled data provided by the labeling system comprises a number of data frames; once the data frames are obtained, the mean of their energies is calculated.
II. The energy threshold is then set to the maximum of all the energy means.
After these statistics, the energy threshold was set to 0.005.
2) the mean and variance of the zero-crossing rate (MZCR and VZCR) are calculated, and the ZCR threshold TZCR is defined as:
TZCR = MZCR + 0.005·VZCR
3) the data frames are V/C/P classified by the following criteria, with FrameType indicating the type of frame:
if ZCR > TZCR, FrameType is Consonant;
otherwise, if Energy < 0.005, FrameType is Pause;
otherwise, FrameType is Vowel.
4) Frames classified by V/C/P are merged by category: consecutive data frames of the same category are treated as one variable-length unit of that category. If a short consonant unit (duration less than two frames) lies between two adjacent pause units, the two pause units are merged and the middle C is replaced by P.
5) Because several vowels may be detected as one large vowel, some vowel units may last too long; vowels must therefore be segmented to avoid errors in feature calculation. If a vowel unit lasts too long, it is split at its energy valley; in cases where the energy has no valley, it is split into equal-length parts.
Statistics over the V/C/P classification results show that 15 frames is the most frequent vowel duration and the average vowel duration is 16 frames, so 15 frames is taken as the threshold. The splitting rule is shown in FIG. 6.
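Steps 1)-5) can be made concrete as below. This is a runnable sketch assuming 16 kHz mono audio as a numpy array; pitch is extracted in the text for later prosodic features but is not used by the classification rules themselves, so it is omitted here, and the long-vowel split falls back to equal halves (the text prefers an energy valley when one exists).

```python
import numpy as np

def vcp_classify(signal, sr=16000, frame_ms=20, energy_thresh=0.005):
    frame_len = sr * frame_ms // 1000
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)                                 # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero-crossing rate
    tzcr = zcr.mean() + 0.005 * zcr.var()                               # TZCR = MZCR + 0.005*VZCR
    labels = np.where(zcr > tzcr, "C",
             np.where(energy < energy_thresh, "P", "V"))                # step 3 criteria
    return merge_and_split(list(labels))

def merge_and_split(labels, max_vowel_frames=15):
    # step 4: absorb a 1-frame consonant run squeezed between two pauses
    for i in range(1, len(labels) - 1):
        if labels[i] == "C" and labels[i - 1] == "P" and labels[i + 1] == "P":
            labels[i] = "P"
    # collect variable-length same-category runs: (type, first_frame, end_frame)
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    # step 5 (simplified): split over-long vowel runs into equal halves
    out = []
    for typ, s, e in runs:
        if typ == "V" and e - s > max_vowel_frames:
            mid = (s + e) // 2
            out += [("V", s, mid), ("V", mid, e)]
        else:
            out.append((typ, s, e))
    return out
```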
Feature extraction
FIG. 7 illustrates the features used to describe the context information of candidate boundaries. In FIG. 7 the pause feature of the candidate boundary is combined with the speech-rate feature, and the prosodic features of the vowels adjacent to the candidate boundary are also used to represent prosodic changes near the boundary point.
A pause is one of the most important indicators for sentence boundary detection, and prosodic information also plays an important role in sentence segmentation. The rate of speech (ROS), which affects the duration of pauses between sentences, is also incorporated into the sentence-boundary feature set. We therefore use the pause features, speech rate and prosodic features as three feature sets for distinguishing and detecting sentence boundaries.
ROS is defined as:
ROS = n / Σ d_i
where n is the number of vowels and d_i is the duration of the i-th vowel. Pauses and consonants are not included in the calculation of speech rate.
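Continuing the V/C/P sketch above, ROS can be computed directly from the variable-length runs; this helper is an assumed convenience consistent with the definition just given.

```python
# ROS = n / sum(d_i): only vowel units enter the count and the duration sum;
# pause and consonant runs are excluded, as the text states.
def rate_of_speech(runs, frame_ms=20):
    vowel_durations = [(end - start) * frame_ms / 1000.0
                       for typ, start, end in runs if typ == "V"]
    return len(vowel_durations) / sum(vowel_durations) if vowel_durations else 0.0
```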
Error detection mechanism
Since the Co_training-based classification above uses maximum-entropy classifiers, the correctness of the classification result cannot be fully guaranteed. To avoid unnecessary manual proofreading of the classification result, an error detection mechanism is introduced again in this algorithm, aimed at further ensuring the correctness of the results given by the binary classifier: the true sentence boundaries must be filtered out of the preliminary classification result set. The workflow is briefly as follows:
1) periods, question marks and semicolons are taken as sentence boundary identifiers on the text, and a, e, i, o, u represent vowels on the text; the total number of vowels on the text is calculated and recorded as TV. For each boundary on the text, the number of vowels from the start of the text to the boundary and from the boundary to the end of the text is calculated and recorded as TP and TS respectively. Several vowels joined together are treated as one vowel.
2) For each candidate boundary found by the classifier, the total number of vowels AV is calculated from the V/C/P classification result, along with the number of vowels from the start of the V/C/P classification result (the audio) to the candidate boundary and from the candidate boundary to the end of the audio, recorded as AP and AS respectively. Since vowel splitting could distort the vowel counts, the V/C/P result used here has had categories merged but vowel units left unsplit.
3) If either |AP/AV − TP/TV| or |AS/AV − TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary. The reason is that although the numbers of vowels may differ greatly, the relative "position" of the same boundary on the text and on the audio should be the same, i.e. AP/AV and TP/TV should differ very little; the method thus sorts out real sentence boundaries by this position judgment, as shown in FIG. 8:
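A runnable sketch of this check follows. The 0.015 tolerance comes from the text; collapsing adjacent text vowels a/e/i/o/u into one vowel is implemented with a regular expression, and the example numbers are illustrative only.

```python
import re

def count_text_vowels(text, boundary_index=None):
    """Count vowel groups in text, or in text[:boundary_index] if given;
    each maximal run of a/e/i/o/u counts as ONE vowel (step 1)."""
    span = text if boundary_index is None else text[:boundary_index]
    return len(re.findall(r"[aeiou]+", span.lower()))

def is_true_boundary(ap, av, tp, tv, tol=0.015):
    # the relative boundary "positions" on audio (AP/AV) and text (TP/TV) must agree
    ts, as_ = tv - tp, av - ap          # vowel counts after the boundary
    return abs(ap / av - tp / tv) < tol or abs(as_ / av - ts / tv) < tol

# e.g. a candidate with 120 of 300 audio vowels before it, against 41 of 100
# text vowels, gives |0.40 - 0.41| = 0.01 < 0.015 and is accepted
print(is_true_boundary(120, 300, 41, 100))   # True
```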
Co_training-based minimal-labeling sentence segmentation algorithm
The algorithm proceeds in four steps: first, V/C/P classification of the audio; then feature extraction over the data frames; then training and classification with the added classifiers; finally, the classification result is sent to the checking mechanism to further ensure its correctness. The detailed steps are shown in FIG. 9:
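The Co_training step can be sketched as below. The text specifies a maximum-entropy binary classifier; scikit-learn's LogisticRegression (a maximum-entropy model) is used here as a stand-in, the two feature "views" (pause/speech-rate vs. prosody, per FIG. 7) are an assumption, and a full Co_training would alternate the teaching role between the two classifiers rather than letting view A teach alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xa, Xb, y, Ua, Ub, rounds=10, grow=5):
    """Xa/Xb: labeled numpy arrays in views A and B; Ua/Ub: one unlabeled pool in both views."""
    clf_a = LogisticRegression(max_iter=1000).fit(Xa, y)
    clf_b = LogisticRegression(max_iter=1000).fit(Xb, y)
    for _ in range(rounds):
        if len(Ua) == 0:
            break
        # view A pseudo-labels the pool; its most confident picks are added
        # (with A's labels) to BOTH views' training sets
        conf = clf_a.predict_proba(Ua).max(axis=1)
        pick = np.argsort(conf)[-grow:]
        Xa = np.vstack([Xa, Ua[pick]])
        Xb = np.vstack([Xb, Ub[pick]])
        y = np.concatenate([y, clf_a.predict(Ua[pick])])
        keep = np.setdiff1d(np.arange(len(Ua)), pick)
        Ua, Ub = Ua[keep], Ub[keep]
        clf_a = LogisticRegression(max_iter=1000).fit(Xa, y)   # retrain both views
        clf_b = LogisticRegression(max_iter=1000).fit(Xb, y)
    return clf_a, clf_b
```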
Experimental results and data analysis
Experiments on a standard data set (Benchmark) and analysis of results
First, to compare with the automatic sentence segmentation algorithms proposed in the prior art, we performed experiments on the same standard data set with the same evaluation measures. Note that we cannot guarantee that the total number of sentences obtained after segmenting the selected training corpus equals the total in the reference document; only the total duration is controlled to match. A 42-minute corpus was selected as the training set and compared against the prior automatic sentence segmentation methods FA-0, SFA-1 and SFA-2 for long corpora. The same synthesis tool, Clustergen, and the same training corpus (42 minutes of speech and text provided on LibriVox) were used to build a speech synthesis engine; the parametric-statistical synthesis method has low model space-time overhead and high flexibility. The long corpus was then divided into separate sentences, 653 in total, by the method provided by the invention (HAZ-SAS).
The quality of the synthesized speech can be measured with the Mel-Cepstral Distortion (MCD). The segmented sentences are divided into a training set and a test set and passed to the Clustergen synthesis engine for training and synthesis; the corresponding MCD value is then calculated from the synthesis result according to formula (1). By comparing the influence of different segmentation methods on the synthesis result under the same training set, and the difference in synthesis results under different test sets using the same method, it can be clearly seen that the sentence segmentation method adopted by the invention markedly improves speech synthesis quality. The MCD values under different test sets were calculated and compared with the two prior methods, as shown in Table 1:
Table 1: comparison of MCD values and experimental data for different test sets
Φe in the table denotes the EMMA e-book audio and corresponding text set provided by LibriVox, from which the corresponding time segments were extracted. Note that FA-0 corresponds to the experimental result of the Viterbi algorithm without any modification; SFA-1 and SFA-2 are the results obtained after corresponding modifications to the Viterbi algorithm. The MCD is calculated as:
MCD = (10 / ln 10) · √( 2 · Σ_d (mc_d^(s) − mc_d^(o))² )    (1)
where mc_d^(s) and mc_d^(o) represent the feature vector values of the synthesized audio and the original audio, respectively. From the experimental data it is easy to see that, under the same training set, with the sentence segmentation algorithm adopted by the invention and a test set of 9 sentences lasting 4 min, the MCD value falls by 0.08 relative to SFA-1. This indicates that the method locates sentence boundaries more accurately and improves the final synthesis quality, and can be applied to the automatic construction of a speech corpus.
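Formula (1) as reconstructed above is the standard mel-cepstral distortion; a runnable sketch over time-aligned per-frame mel-cepstra follows. Averaging over frames and the exact coefficient range are assumptions, since the patent does not specify them.

```python
import numpy as np

def mel_cepstral_distortion(mc_synth, mc_orig):
    """mc_*: arrays of shape (n_frames, n_dims) of time-aligned mel-cepstra."""
    diff = mc_synth - mc_orig
    # per-frame MCD in dB, per formula (1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return per_frame.mean()   # mean MCD over all frames
```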
HAZ-SAS system performance evaluation and experimental analysis
The above experiments were performed on a standard data set whose recordings are relatively clean and whose text correspondence is accurate. To test the segmentation performance of the proposed algorithm on ordinary data, the following experiment was also performed: Chinese news-broadcast speech and corresponding text downloaded from the Internet were used, totaling 70 paragraphs of speech and 9447 seconds (about 2.6 hours); the experiment again used the HTK toolkit for HMM training and the Viterbi forced-alignment algorithm.
The sentence-boundary detection performance of the ZLSS and MLSS subsystems before and after the two segmentation methods are fused, and the correctly segmented sentences output by the complete fully automatic sentence segmentation system, are given in turn. First, Table 2 shows the labeling capability of the ZLSS method alone after one complete iteration.
Sentence segmentation accuracy is defined as:
sentence segmentation accuracy = (number of correctly segmented sentences / total number of sentences) × 100%
Table 2: labeling capability of ZLSS alone after one complete iteration
Clearly, the ZLSS method alone falls short of the ideal requirement and cannot provide enough accurately labeled data to construct a Chinese speech corpus quickly for use in speech synthesis. Next, sentence-boundary detection experiments were performed on the MLSS method, with the following data:
Table 3: statistical results of MLSS classification performance on sentence boundaries
For ease of analysis, the boundary classification performance can be drawn as a histogram from the above results. It is easy to see that the classification performance of the system improves continuously as the labeling system's capacity to provide label sets grows. In addition, under the same training set, the buffer size affects the classification performance; as FIG. 10 shows, the larger the buffer, the higher the classification performance.
The 42.2% of labeled data provided by ZLSS is fed into MLSS, the information features of the corresponding sentence points are extracted, collaborative training (Co_training) is performed, and further iterative classification yields more correct sentence points. The results are shown in Table 4:
Table 4: iterative classification results
The MLSS adopted by the invention thus shows good classification performance, and with the error detection mechanism added, sentence segmentation accuracy improves greatly. The data above also show that the classifier's performance clearly increases with the amount of training data. Finally, the complete fully automatic sentence segmentation system was run; after four iterations its output is shown in Table 5:
Table 5: results of the fully automatic sentence segmentation system
From all the experimental data above it follows that:
1) The fully automatic sentence segmentation system has good segmentation accuracy and, more importantly, obtains far more correct sentences than the original subsystems alone, with no manual participation in the whole process.
2) Meanwhile, in the flat-start training algorithm, shorter speech input can be used to train better HMMs; in the forced-alignment process, the Viterbi decoding space is reduced and alignment accuracy is improved.
3) The labeling data of the whole system is generated automatically, so the system can be regarded as zero-labeling (Zero-Labeling), which greatly improves classification efficiency and saves cost.
4) The proposed automatic sentence segmentation algorithm still has room for improvement. For example, when Viterbi forced alignment is performed on a large corpus for the first time, the overall demands on computer performance are significantly higher than in prior methods, because the iterative algorithm performs Viterbi decoding on the entire speech in the first iteration; this places relatively high demands on both processor performance and memory size.
The invention provides a fully automatic sentence segmentation algorithm based on spectral parameters (Mel-Cepstral Parameters) and prosodic parameters (Prosodic Parameters), which fuses forced alignment under the HMM with the Co_training method of semi-supervised learning, ensuring that no manual intervention is needed and that high segmentation accuracy is achieved during the sentence segmentation of long speech. The method can be applied to the rapid automatic construction of speech corpora.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.
Claims (10)
1. A fully automatic segmentation method for long speech, characterized by comprising the following steps:
(1) the ZLSS method provides accurate time data for the labeled periods, and a HashMap tracking and lookup mechanism maps this time data, according to its position on the time axis, to the input of the MLSS algorithm;
(2) a boundary feature extraction program uses the mapped time data to extract the corresponding data-frame features from the original file, in preparation for the Co_training classification iterations;
(3) the boundary feature information of the correct sentence positions extracted in the previous step is added to the MLSS training set for Co_training, and further classification yields new sentences;
(4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program further maps these to the time axis consistent with ZLSS before the next step is carried out;
(5) a judgment is made as to whether a new period was found in this iteration; if not, the whole iteration process ends, and if a new period was found, the next step is carried out;
(6) after the time-point information output by the conversion program is obtained, the segmentation method provided by ZLSS further segments the current long speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
(7) the above steps are performed repeatedly.
2. The method for fully automatic segmentation of long speech according to claim 1, characterized in that a HashMap-based tracking and lookup mechanism is employed to map all found correct sentence-time information uniformly onto the initial time axis, in preparation for the next classification iteration.
3. The method as claimed in claim 1, characterized in that the ZLSS method treats the silence at a sentence boundary as an independent phoneme sil; first, the hidden Markov model of each phoneme in the speech is trained by an HMM-based unsupervised method and the flat-start training algorithm, and the phoneme sequence of the long speech is aligned with its text by Viterbi forced alignment; sentences are then segmented at the sentence-ending symbols in the text, and finally a strict checking mechanism judges whether each segmented sentence is correct, thereby obtaining a smaller set of correct boundary labels.
4. The fully automatic segmentation method for long speech according to claim 1, characterized in that the ZLSS method introduces an iterative algorithm: first, the long speech is segmented into paragraph speech and sentence speech at the correct, verified sil boundaries given by the above checking mechanism; then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the result of the previous iteration, i.e. whether a new correct sil has been found; if so, the resulting speech and text replace those of the previous round, the HMMs are retrained, and the iteration continues; if not, the iteration process is over.
5. The method according to claim 1, further comprising automatically expanding the accurate label set:
first, prosodic features are used to classify audio frame segments (Frame-Clips) into vowel/consonant/pause (V/C/P); second, minimal-labeling sentence boundary detection is realized within the Co_training and Active Learning frameworks, searching for sentence boundaries inside pauses; finally, a strict prosodic-feature error detection mechanism is studied to determine accurate sentence boundaries.
6. The method according to claim 5, characterized in that the V/C/P classification is as follows: first, the original audio data is divided into non-overlapping 20 ms frames; the energy (Energy), zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are then calculated, and the energy curve and pitch curve are smoothed;
after the above features are extracted, each data frame is given a vowel/consonant/pause classification according to these three features; the classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining the energy threshold:
to find a suitable energy threshold for detecting pauses, statistics are gathered over the labeled data provided by ZLSS according to the following principle:
I. each item of labeled data provided by the labeling system comprises a number of data frames; once the data frames are obtained, the mean of their energies is calculated;
II. the energy threshold is set to the maximum of all the energy means;
after these statistics, the energy threshold was set to 0.005;
2) the mean MZCR and variance VZCR of the zero-crossing rate are calculated, and the ZCR threshold TZCR is defined as:
TZCR = MZCR + 0.005·VZCR;
3) the data frames are V/C/P classified by the following criteria, with FrameType indicating the type of frame:
if ZCR > TZCR, FrameType is Consonant;
otherwise, if Energy < 0.005, FrameType is Pause;
otherwise, FrameType is Vowel;
4) frames classified by V/C/P are merged by category: consecutive data frames of the same category are treated as one variable-length unit of that category, and if a short consonant unit lies between two adjacent pause units, the two units are merged and the middle C is replaced by P;
5) vowel segmentation is performed: if a vowel unit lasts too long, it is split at its energy valley; in cases where the energy has no valley, it is split into equal-length parts.
7. The method according to claim 6, characterized in that the duration of a short consonant unit is less than two frames.
8. The method according to claim 6, wherein an error detection mechanism is introduced, and the workflow thereof is as follows:
A) periods, question marks and semicolons are taken as sentence boundary identifiers in the text, and a, e, i, o and u are taken as the vowels of the text; the total number of vowels in the text is counted and recorded as TV; for each boundary in the text, the numbers of vowels from the start of the text to the boundary and from the boundary to the end of the text are counted and recorded as TP and TS respectively; several adjacent vowels are treated as a single vowel;
B) for each candidate boundary found by the classifier, the total number of vowels AV is calculated from the V/C/P classification result, and the numbers of vowels from the start of the V/C/P classification result to the candidate boundary and from the candidate boundary to the end of the audio are calculated and recorded as AP and AS respectively;
C) if either |AP/AV - TP/TV| or |AS/AV - TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary.
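The check in steps A) to C) compares vowel densities on the text side and the audio side. Below is a minimal sketch of that comparison. Two caveats: the second comparison is written with TS/TV, which the symmetry of step A) implies (the published text prints TP/TV there), and the boundary positions are assumed to be a character offset in the text and a frame index in the V/C/P label sequence.

```python
import re

def text_vowel_counts(text: str, boundary: int):
    """Step A): count vowel groups (adjacent vowels count once) in the
    whole text (TV), before the boundary (TP) and after it (TS)."""
    count = lambda s: len(re.findall(r"[aeiou]+", s.lower()))
    return count(text), count(text[:boundary]), count(text[boundary:])

def audio_vowel_counts(labels: list, boundary: int):
    """Step B): count 'V' segments in the whole V/C/P sequence (AV),
    before the candidate boundary (AP) and after it (AS)."""
    def segs(ls):
        return sum(1 for i, l in enumerate(ls)
                   if l == "V" and (i == 0 or ls[i - 1] != "V"))
    return segs(labels), segs(labels[:boundary]), segs(labels[boundary:])

def is_correct_boundary(text, t_bound, labels, a_bound, tol=0.015):
    """Step C): accept the candidate when either normalized difference
    of vowel counts is below the tolerance."""
    tv, tp, ts = text_vowel_counts(text, t_bound)
    av, ap, as_ = audio_vowel_counts(labels, a_bound)
    if av == 0 or tv == 0:
        return False              # degenerate case: no vowels found
    return (abs(ap / av - tp / tv) < tol
            or abs(as_ / av - ts / tv) < tol)
```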
9. The method according to claim 1, wherein the minimally-labeled sentence segmentation algorithm based on Co_training is divided into four steps: firstly, V/C/P classification is performed on the audio; then, features are extracted from the data frames; next, classifiers are trained and applied; and finally, the classification result is sent to the checking mechanism to further ensure its correctness.
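These four steps chain the pieces above into a semi-supervised loop. A hypothetical sketch follows; the claim does not spell out the two feature views or the confidence rule, so the standard co-training convention (two views, agreement filtering) is assumed, with the claim's checking mechanism as a final gate. All callables are hypothetical stand-ins.

```python
def co_training_mlss(labeled, unlabeled, view_a, view_b, fit, predict, check,
                     max_rounds=10):
    """Minimally-labeled sentence segmentation, claim 9 (sketch).
    labeled     -- list of (pause_clip, is_sentence_boundary) seed pairs
    unlabeled   -- pause clips still to be labeled
    view_a/b    -- the two prosodic feature views (hypothetical)
    fit/predict -- train / apply a classifier (hypothetical)
    check       -- the error detection mechanism of claim 8"""
    for _ in range(max_rounds):
        clf_a = fit([(view_a(x), y) for x, y in labeled])
        clf_b = fit([(view_b(x), y) for x, y in labeled])
        accepted, remaining = [], []
        for x in unlabeled:
            ya, yb = predict(clf_a, view_a(x)), predict(clf_b, view_b(x))
            # keep only predictions on which both views agree AND which
            # survive the checking mechanism
            if ya == yb and check(x, ya):
                accepted.append((x, ya))
            else:
                remaining.append(x)
        if not accepted:
            break                  # no more confident labels to add
        labeled.extend(accepted)
        unlabeled = remaining
    return labeled
```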
10. The method for full-automatic segmentation of long speech according to claim 1, wherein the labeling result automatically obtained from the ZLSS iteration is used as the input of the MLSS algorithm; the MLSS algorithm expands the automatic labeling result, and the expanded labeling result is in turn used as the input of ZLSS to continue expanding the set of effective labels.
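The outer loop alternating the two algorithms can be stated compactly. This is a sketch under the assumption that each algorithm takes the current label set and returns an (ideally larger) one; the stop-when-no-growth rule mirrors claim 4 and is an assumption here, since claim 10 does not state a termination condition. `zlss` and `mlss` are hypothetical callables for the algorithms of claims 4 and 9.

```python
def alternate_zlss_mlss(audio, text, zlss, mlss):
    """Claim 10 (sketch): ZLSS output seeds MLSS; MLSS expands the labels;
    the expanded set is fed back into ZLSS until no further labels appear."""
    labels = []
    while True:
        labels = zlss(audio, text, labels)      # ZLSS iteration (claim 4)
        expanded = mlss(audio, text, labels)    # MLSS expansion (claim 9)
        if len(expanded) <= len(labels):
            return expanded                     # no effective growth: stop
        labels = expanded
```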
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310280159.9A CN103345922B (en) | 2013-07-05 | 2013-07-05 | A kind of large-length voice full-automatic segmentation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103345922A true CN103345922A (en) | 2013-10-09 |
CN103345922B CN103345922B (en) | 2016-07-06 |
Family
ID=49280713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310280159.9A Expired - Fee Related CN103345922B (en) | 2013-07-05 | 2013-07-05 | A kind of large-length voice full-automatic segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103345922B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1163009A (en) * | 1994-09-30 | 1997-10-22 | 摩托罗拉公司 | Method and system for recognizing a boundary between sounds in continuous speech |
CN1252592A (en) * | 1998-10-28 | 2000-05-10 | 国际商业机器公司 | Command boundary discriminator of conversation natural language |
US6169972B1 (en) * | 1998-02-27 | 2001-01-02 | Kabushiki Kaisha Toshiba | Information analysis and method |
WO2007003505A1 (en) * | 2005-07-01 | 2007-01-11 | France Telecom | Method and device for segmenting and labelling the contents of an input signal in the form of a continuous flow of undifferentiated data |
CN102063898A (en) * | 2010-09-27 | 2011-05-18 | 北京捷通华声语音技术有限公司 | Method for predicting prosodic phrases |
Cited By (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761064A (en) * | 2013-12-27 | 2014-04-30 | 圆展科技股份有限公司 | Automatic voice input system and method |
CN104463208A (en) * | 2014-12-09 | 2015-03-25 | 北京工商大学 | Multi-view semi-supervised collaboration classification algorithm with combination of agreement and disagreement label rules |
CN104978961A (en) * | 2015-05-25 | 2015-10-14 | 腾讯科技(深圳)有限公司 | Audio processing method, device and terminal |
CN105047202A (en) * | 2015-05-25 | 2015-11-11 | 腾讯科技(深圳)有限公司 | Audio processing method, device and terminal |
CN104978961B (en) * | 2015-05-25 | 2019-10-15 | 广州酷狗计算机科技有限公司 | A kind of audio-frequency processing method, device and terminal |
CN105047202B (en) * | 2015-05-25 | 2019-04-16 | 广州酷狗计算机科技有限公司 | A kind of audio-frequency processing method, device and terminal |
CN105161094A (en) * | 2015-06-26 | 2015-12-16 | 徐信 | System and method for manually adjusting cutting point in audio cutting of voice |
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
CN107578769A (en) * | 2016-07-04 | 2018-01-12 | 科大讯飞股份有限公司 | Speech data mask method and device |
CN107578769B (en) * | 2016-07-04 | 2021-03-23 | 科大讯飞股份有限公司 | Voice data labeling method and device |
CN106373592B (en) * | 2016-08-31 | 2019-04-23 | 北京华科飞扬科技股份公司 | Audio holds processing method and the system of making pauses in reading unpunctuated ancient writings of making an uproar |
CN106157951B (en) * | 2016-08-31 | 2019-04-23 | 北京华科飞扬科技股份公司 | Carry out the automatic method for splitting and system of audio punctuate |
CN106157951A (en) * | 2016-08-31 | 2016-11-23 | 北京华科飞扬科技股份公司 | Carry out automatic method for splitting and the system of audio frequency punctuate |
CN106373592A (en) * | 2016-08-31 | 2017-02-01 | 北京华科飞扬科技股份公司 | Audio noise tolerance punctuation processing method and system |
CN106504773A (en) * | 2016-11-08 | 2017-03-15 | 上海贝生医疗设备有限公司 | A kind of wearable device and voice and activities monitoring system |
CN106782508A (en) * | 2016-12-20 | 2017-05-31 | 美的集团股份有限公司 | The cutting method of speech audio and the cutting device of speech audio |
CN107657947B (en) * | 2017-09-20 | 2020-11-24 | 百度在线网络技术(北京)有限公司 | Speech processing method and device based on artificial intelligence |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
CN108597497A (en) * | 2018-04-03 | 2018-09-28 | 中译语通科技股份有限公司 | A kind of accurate synchronization system of subtitle language and method, information data processing terminal |
CN108597497B (en) * | 2018-04-03 | 2020-09-08 | 中译语通科技股份有限公司 | Subtitle voice accurate synchronization system and method and information data processing terminal |
CN110390930A (en) * | 2018-04-15 | 2019-10-29 | 高翔 | A kind of method and system of audio text check and correction |
WO2019227547A1 (en) * | 2018-05-31 | 2019-12-05 | 平安科技(深圳)有限公司 | Voice segmenting method and apparatus, and computer device and storage medium |
CN110364145A (en) * | 2018-08-02 | 2019-10-22 | 腾讯科技(深圳)有限公司 | A kind of method and device of the method for speech recognition, voice punctuate |
CN110164420A (en) * | 2018-08-02 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method and device of the method for speech recognition, voice punctuate |
US11430428B2 (en) | 2018-08-02 | 2022-08-30 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for segmenting sentences for speech recognition |
CN110164420B (en) * | 2018-08-02 | 2022-07-19 | 腾讯科技(深圳)有限公司 | Voice recognition method, and method and device for sentence breaking by voice |
CN109377998A (en) * | 2018-12-11 | 2019-02-22 | 科大讯飞股份有限公司 | A kind of voice interactive method and device |
CN109871537B (en) * | 2019-01-31 | 2022-12-27 | 沈阳雅译网络技术有限公司 | High-precision Thai sentence segmentation method |
CN109871537A (en) * | 2019-01-31 | 2019-06-11 | 沈阳雅译网络技术有限公司 | A kind of high-precision Thai subordinate sentence method |
US11645110B2 (en) | 2019-03-13 | 2023-05-09 | International Business Machines Corporation | Intelligent generation and organization of user manuals |
CN110277104A (en) * | 2019-06-21 | 2019-09-24 | 上海乂学教育科技有限公司 | Word pronunciation training system |
CN110277104B (en) * | 2019-06-21 | 2021-08-06 | 上海松鼠课堂人工智能科技有限公司 | Word voice training system |
CN110428841A (en) * | 2019-07-16 | 2019-11-08 | 河海大学 | A kind of vocal print dynamic feature extraction method based on random length mean value |
CN110428841B (en) * | 2019-07-16 | 2021-09-28 | 河海大学 | Voiceprint dynamic feature extraction method based on indefinite length mean value |
CN110400580B (en) * | 2019-08-30 | 2022-06-17 | 北京百度网讯科技有限公司 | Audio processing method, apparatus, device and medium |
CN110400580A (en) * | 2019-08-30 | 2019-11-01 | 北京百度网讯科技有限公司 | Audio-frequency processing method, device, equipment and medium |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN110930997B (en) * | 2019-12-10 | 2022-08-16 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN111883169A (en) * | 2019-12-12 | 2020-11-03 | 马上消费金融股份有限公司 | Audio file cutting position processing method and device |
CN111883169B (en) * | 2019-12-12 | 2021-11-23 | 马上消费金融股份有限公司 | Audio file cutting position processing method and device |
CN111310413A (en) * | 2020-02-20 | 2020-06-19 | 阿基米德(上海)传媒有限公司 | Intelligent broadcasting program audio strip removing method and device based on program series list |
CN111310413B (en) * | 2020-02-20 | 2023-03-03 | 阿基米德(上海)传媒有限公司 | Intelligent broadcasting program audio strip removing method and device based on program series list |
CN112261214A (en) * | 2020-10-21 | 2021-01-22 | 广东商路信息科技有限公司 | Network voice communication automatic test method and system |
CN112420016B (en) * | 2020-11-20 | 2022-06-03 | 四川长虹电器股份有限公司 | Method and device for aligning synthesized voice and text and computer storage medium |
CN112420016A (en) * | 2020-11-20 | 2021-02-26 | 四川长虹电器股份有限公司 | Method and device for aligning synthesized voice and text and computer storage medium |
CN113593528B (en) * | 2021-06-30 | 2022-05-17 | 北京百度网讯科技有限公司 | Training method and device of voice segmentation model, electronic equipment and storage medium |
CN113593528A (en) * | 2021-06-30 | 2021-11-02 | 北京百度网讯科技有限公司 | Training method and device of voice segmentation model, electronic equipment and storage medium |
CN116483960A (en) * | 2023-03-30 | 2023-07-25 | 阿波罗智联(北京)科技有限公司 | Dialogue identification method, device, equipment and storage medium |
CN116483960B (en) * | 2023-03-30 | 2024-01-02 | 阿波罗智联(北京)科技有限公司 | Dialogue identification method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN103345922B (en) | 2016-07-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103345922B (en) | A kind of large-length voice full-automatic segmentation method | |
CN102800314B (en) | English sentence recognizing and evaluating system with feedback guidance and method | |
CN105632484B (en) | Speech database for speech synthesis pause information automatic marking method and system | |
CN107039034B (en) | Rhythm prediction method and system | |
CN101436403B (en) | Method and system for recognizing tone | |
CN101727902B (en) | Method for estimating tone | |
CN111640418B (en) | Prosodic phrase identification method and device and electronic equipment | |
Stan et al. | ALISA: An automatic lightly supervised speech segmentation and alignment tool | |
Stan et al. | A grapheme-based method for automatic alignment of speech and text data | |
CN101950560A (en) | Continuous voice tone identification method | |
CN111128128B (en) | Voice keyword detection method based on complementary model scoring fusion | |
Mamiya et al. | Lightly supervised GMM VAD to use audiobook for speech synthesiser | |
Chen et al. | The ustc system for blizzard challenge 2011 | |
Stanek et al. | Algorithms for vowel recognition in fluent speech based on formant positions | |
Stan et al. | Lightly supervised discriminative training of grapheme models for improved sentence-level alignment of speech and text data. | |
Ling et al. | The USTC system for blizzard challenge 2012 | |
Yu et al. | Overview of SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013 | |
Bartkova et al. | Prosodic parameters and prosodic structures of French emotional data | |
CN104240699A (en) | Simple and effective phrase speech recognition method | |
Ahmed et al. | Technique for automatic sentence level alignment of long speech and transcripts. | |
Milne | Improving the accuracy of forced alignment through model selection and dictionary restriction | |
Tesser et al. | Experiments with signal-driven symbolic prosody for statistical parametric speech synthesis | |
Wiśniewski et al. | Automatic detection and classification of phoneme repetitions using HTK toolkit | |
Li et al. | Grammar-based semi-supervised incremental learning in automatic speech recognition and labeling | |
Mehrabani et al. | Nativeness Classification with Suprasegmental Features on the Accent Group Level. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: No. 5 Yushan Road, Qingdao, Shandong 266000
Applicant after: Zhang Wei
Address before: Room B517, South Building, College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100
Applicant before: Zhang Wei
COR | Change of bibliographic data | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2016-07-06; Termination date: 2020-07-05