CN103345922A - Large-length voice full-automatic segmentation method
- Publication number: CN103345922A (application CN201310280159.9A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a fully automatic segmentation method for long speech: a zero-labeling automatic sentence segmentation algorithm with high accuracy. The algorithm fuses an HMM-based unsupervised forced-alignment algorithm with a semi-supervised learning method. By establishing an iteration mechanism based on a common time axis, the semi-supervised minimal-labeling sentence segmentation algorithm automatically expands the small accurate label set provided by the zero-labeling sentence segmentation algorithm, so that the accurate label set is maximized; the original long speech is then cut into smaller paragraphs or sentence sets at the sentence boundaries found to be correct. Because the method fuses forced alignment under the HMM with the Co_training method of semi-supervised learning, no manual intervention is needed during the segmentation of long speech and segmentation accuracy remains high. The method can be applied to the rapid automatic construction of speech corpora.
Description
Technical Field
The invention belongs to the technical fields of speech synthesis, speech recognition, and speech retrieval and labeling, and relates to a fully automatic segmentation method for long speech.
Background
Two approaches to speech synthesis currently prevail. The first is HMM-based trainable speech synthesis (Trainable TTS), such as CLUSTERGEN of Carnegie Mellon University (CMU) in the United States and the HTS speech synthesis engine developed by the Nagoya Institute of Technology in Japan, both of which use parametric statistical (Parametric Statistical) synthesis. The second is speech synthesis based on a large speech corpus (corpus-based TTS), such as KX-PSOLA (1993) of the Institute of Acoustics, Chinese Academy of Sciences, and the synthesis techniques adopted by telecom platforms, which synthesize speech by unit selection and waveform concatenation. The core of both techniques is a well-labeled, high-accuracy speech corpus. A speech corpus (speech Corpus) is currently constructed by recording a large text sentence by sentence and then labeling it manually sentence by sentence: because the speech units at the beginning and end of a single-sentence recording often differ from the units inside the sentence, forcing the speech units to align to a given label (a Transcript, converted directly from the text) by the Viterbi algorithm requires manual boundary adjustment. FIG. 1 shows the general steps of the conventional method for constructing a speech corpus. This method of construction is highly subjective, the manual labeling lacks consistency, and a large amount of cost and time is spent. Meanwhile, single-sentence recording inevitably loses the rich prosodic features and context information contained in natural language. Such prosody and context information is hard to recover, yet it contributes to speech understanding and conveys pragmatic meaning such as tone, speech structure and the speaker's emotion. This information provides parameters essential to synthesizing more expressive speech.
Single-sentence recording and manual labeling are therefore the bottleneck preventing existing speech synthesis engines from becoming more expressive. From this point of view, finding a method that accurately and automatically segments naturally recorded long speech containing multiple paragraphs into single sentences is a key problem for reducing the construction cost of a speech corpus and improving the expressiveness of speech synthesis (Viterbi forced alignment can then be performed directly on such single sentences without manual boundary adjustment, because within long speech the differences among beginning, middle and ending speech units are small).
On the problem of automatic sentence segmentation, most existing methods simply pursue segmentation accuracy and recall and require a large amount of manual labeling, or reduce the labeling amount somewhat on that basis; these two lines of research are not repeated here. Fully automatic sentence segmentation has received little study. In 2011, Alan W Black and Kishore Prahallad of CMU proposed a method for automatically segmenting a voice book (VoiceBook) and then building a speech synthesis engine from the result. Their label-free automatic sentence segmentation algorithm derives sentence boundaries from the spectral parameters of the speech; although it requires no labels and ensures that the segmentation results are accurate, it wastes a great deal of data, and only 40.4% accuracy can be guaranteed. The semi-supervised minimal-labeling automatic sentence segmentation algorithm relies on prosodic parameters to detect and classify sentence boundaries; although it reduces the labeling amount significantly, high accuracy cannot be guaranteed.
Disclosure of Invention
To solve the above technical problems, the invention provides a fully automatic segmentation method for long speech: a label-free automatic sentence segmentation algorithm with high accuracy. The algorithm fuses an unsupervised forced-alignment algorithm based on an HMM (hidden Markov model) with a semi-supervised learning method. By establishing an iteration mechanism based on a common time axis, the semi-supervised minimal-labeling sentence segmentation algorithm automatically expands the small accurate label set provided by the label-free sentence segmentation algorithm, maximizing the accurate label set; the original long speech is then segmented into smaller paragraphs or sentence sets at the correct sentence boundaries obtained.
The technical scheme is as follows:
a fully automatic segmentation method for long speech comprises the following steps:
(1) the zero-labeling sentence segmentation system (ZLSS) method provides accurate time data for the labeled sentence boundaries (periods), and a HashMap tracking and lookup mechanism maps this time data, according to its position on the time axis, to the input of the minimal-labeling sentence classification system (MLSS) algorithm;
(2) a boundary feature extraction program uses the mapped time data to extract the corresponding data-frame features from the original file, in preparation for the classification iterations of collaborative training (Co_training). Note that the boundary feature extraction program is embedded in the MLSS algorithm, and its extraction object is the original long, multi-paragraph audio; the time information of the corresponding sentence boundary points is likewise relative to the initial long speech. Before the subsequent steps are performed, this extraction program extracts the feature information of all candidate sentence boundaries (Candidate Sentence Boundaries);
(3) the boundary feature information of the correct period positions extracted in the previous step is added to the MLSS training set for Co_training, and further classification yields more new periods; the MLSS algorithm is in fact a binary classifier based on maximum entropy (Maximum Entropy) and the Co_training algorithm;
(4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program further maps these to the time axis consistent with ZLSS before the next step is carried out;
(5) a judgment is made as to whether a new period was found in this iteration; if not, the whole iteration process ends, and if a new period was found, the next step is carried out;
(6) after the time-point information output by the conversion program is obtained, the segmentation method provided by ZLSS further segments the current long speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
(7) one point must be noted: in each iteration, the ZLSS method segments the original text and speech into relatively smaller paragraphs or sentences at the found sentence positions, discarding the original time information while retaining only the new sentence information (times relative to the smaller paragraphs) found in the current iteration; a HashMap-based tracking and lookup mechanism is therefore adopted to map all found correct sentence times uniformly onto the initial time axis, in preparation for the next classification iteration;
(8) the above steps are performed repeatedly.
Further preferably, a HashMap-based tracking and lookup mechanism is adopted to map all found correct sentence-time information uniformly onto the initial time axis, in preparation for the next classification iteration.
Further preferably, the ZLSS method treats the silence at a sentence boundary as an independent phoneme sil: first, the hidden Markov model of each phoneme in the speech is trained by an HMM-based unsupervised method and the flat-start training algorithm, and the phoneme sequence of the long speech is aligned with its text by Viterbi forced alignment; sentences are then segmented at the sentence-ending symbols in the text, and finally a strict checking mechanism judges whether each segmented sentence is correct, yielding a smaller set of correct boundary labels.
Further preferably, the ZLSS method introduces an iterative algorithm: first, the long speech is segmented into paragraph speech and sentence speech at the correct, verified sil boundaries given by the above checking mechanism; then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the previous iteration, i.e. whether a new correct sil has been found; if so, the resulting speech and text replace those of the previous round, the HMMs are retrained, and the iteration continues; if not, the iteration process is over.
Further preferably, the method further comprises automatically expanding the accurate label set:
first, prosodic features are used to classify audio frame segments (Frame-Clips) into vowel/consonant/pause (V/C/P); second, minimal-labeling sentence boundary detection is realized within the Co_training and Active Learning frameworks, searching for sentence boundaries inside pauses; finally, a strict prosodic-feature error detection mechanism is studied to determine accurate sentence boundaries.
Further preferably, the V/C/P classification is as follows: first, the original audio data is divided into non-overlapping 20 ms frames; the energy (Energy), zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are then calculated, and the energy curve and pitch curve are smoothed;
after the above features are extracted, each data frame is given a vowel/consonant/pause classification according to these three features; the classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining the energy threshold:
to find a suitable energy threshold for detecting pauses, statistics are gathered over the labeled data provided by ZLSS according to the following principle:
I. each item of labeled data provided by the labeling system comprises a number of data frames; once the data frames are obtained, the mean of their energies is calculated;
II. the energy threshold is set to the maximum of all the energy means;
after these statistics, the energy threshold was set to 0.005;
2) the mean and variance of the zero-crossing rate (MZCR and VZCR) are calculated, and the ZCR threshold TZCR is defined as:
TZCR = MZCR + 0.005·VZCR;
3) the data frames are V/C/P classified by the following criteria, with FrameType indicating the type of frame:
if ZCR > TZCR, FrameType is Consonant;
otherwise, if Energy < 0.005, FrameType is Pause;
otherwise, FrameType is Vowel;
4) frames classified by V/C/P are merged by category: consecutive data frames of the same category are treated as one variable-length unit of that category, and if a short consonant unit lies between two adjacent pause units, the two units are merged and the middle C is replaced by P;
5) vowel segmentation is performed: if a vowel unit lasts too long, it is split at its energy valley; in cases where the energy has no valley, it is split into equal-length parts.
The duration of a short consonant unit is less than two frames.
Further preferably, an error detection mechanism is introduced, whose workflow is briefly as follows:
A) periods, question marks and semicolons are taken as sentence boundary identifiers on the text, and a, e, i, o, u represent vowels on the text; the total number of vowels on the text is calculated and recorded as TV; for each boundary on the text, the number of vowels from the start of the text to the boundary and from the boundary to the end of the text is calculated and recorded as TP and TS respectively; several vowels joined together are treated as one vowel;
B) for each candidate boundary found by the classifier, the total number of vowels AV is calculated from the V/C/P classification result, along with the number of vowels from the start of the V/C/P classification result to the candidate boundary and from the candidate boundary to the end of the audio, recorded as AP and AS respectively;
C) if either |AP/AV − TP/TV| or |AS/AV − TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary. Note that the value 0.015 is obtained statistically: the above formula is computed over a certain number of original text paragraphs and speech, and the average is taken as the threshold for the sentence-boundary decision.
Further preferably, the Co_training-based minimal-labeling sentence segmentation algorithm has four steps: first, V/C/P classification of the audio; then feature extraction over the data frames; then training and classification with the added classifiers; finally, the classification result is sent to the checking mechanism to further ensure its correctness.
Preferably, the labeling result automatically obtained in the ZLSS iteration is used as the input of the MLSS algorithm; the MLSS algorithm expands the automatic labeling result, and the expanded labeling result is then used as the input of the ZLSS algorithm, so that valid labels keep expanding. The ZLSS and MLSS algorithms thus form a rolling iteration in which each is the other's input and output; labels are continuously expanded until a maximized accurate label set is finally formed, the entire result being obtained automatically.
Compared with the prior art, the invention performs well on the segmentation of multi-paragraph long speech, mainly in two respects. First, the proposed method avoids generating sentence boundaries (Sentence Boundary) containing human errors, as SFA-1 may; such errors harm boundary detection and prosodic-parameter extraction and thus directly degrade the quality of the final synthesized speech. Second, although the SFA-2 method reduces the memory consumption of the whole process and the computational cost, it still merges several clauses into one large paragraph, which prevents the Viterbi forced-alignment algorithm from performing optimally: the combinatorial explosion of the search paths causes misplacement of the decoded state sequence and degrades the performance of the whole system considerably. The present invention avoids this disadvantage of the SFA-2 method.
Drawings
FIG. 1 is a general process for constructing a conventional corpus of speech;
FIG. 2 is a schematic diagram of a fully automatic label-free sentence segmentation;
FIG. 3 is a hash table trace lookup mechanism;
FIG. 4 is the checking mechanism; FIG. 4A is an example identified as a correct period; FIG. 4B is an example identified as an erroneous period;
FIG. 5 is an annotation system ZLSS iterative algorithm;
FIG. 6 is a vowel segmentation rule;
FIG. 7 is a set of features for sentence boundary detection;
FIG. 8 is a numerical calculation illustration;
FIG. 9 is a Co _ training based minimized annotated sentence segmentation system;
FIG. 10 is a histogram of classification performance of the classification system.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings and an embodiment.
Automatic sentence segmentation based on spectral-parameter HMMs (hidden Markov models) and prosodic parameters
Introduction to the fully automatic sentence segmentation algorithm
First, the label-free automatic sentence segmentation algorithm and the semi-supervised minimal-labeling sentence segmentation algorithm are regarded as two subsystems of the overall automatic sentence segmentation framework. The former, a label-free sentence segmentation algorithm based on the forced-alignment algorithm, serves as the labeling system providing a small amount of trainable accurate data and is renamed the Sub-Labeling System (ZLSS). The latter, the minimal-labeling automatic sentence segmentation algorithm, serves as the classification system for automatically expanding the small accurately labeled data set provided by ZLSS and is renamed the Sub-Classification System (MLSS). A description of the entire algorithm follows, as shown in FIG. 2.
The algorithm steps are as follows:
1) the ZLSS method provides accurate labeling data (period time information), which the HashMap tracking and lookup mechanism maps, according to its position on the time axis, to the input of the MLSS algorithm;
2) the boundary feature extraction program uses the mapped time data to extract the corresponding data-frame features from the original file, in preparation for the Co_training classification iterations below;
3) the boundary feature information of the correct period positions extracted in the previous step is added to the MLSS training set for Co_training, and further classification yields new periods;
4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program maps these to the time axis consistent with ZLSS before the next step;
5) a judgment is made as to whether a new period was found in this iteration; if not, the whole iteration process ends, and if a new period was found, the next step is carried out;
6) after the time-point information output by the conversion program is obtained, the segmentation method provided by ZLSS further segments the current long speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
7) the above steps are performed repeatedly (a minimal control-flow sketch of this loop follows).
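The control flow of steps 1)-7) can be summarized in code as below. This is a minimal structural sketch only: `zlss_label`, `extract_boundary_features`, `mlss_co_training`, `to_original_axis` and `zlss_segment` are hypothetical placeholders standing in for the subsystems described in this document, not real APIs.

```python
# Minimal sketch of the ZLSS/MLSS rolling iteration (steps 1-7).
# All helper names are hypothetical stand-ins for the subsystems in the text.
def fully_automatic_segmentation(audio, text):
    boundaries = set(zlss_label(audio, text))          # step 1: seed labels from ZLSS
    while True:
        feats = extract_boundary_features(audio, boundaries)   # step 2
        candidates = mlss_co_training(feats)                   # step 3: Co_training classifier
        new = set(to_original_axis(candidates)) - boundaries   # step 4: back to the ZLSS time axis
        if not new:                                            # step 5: no new period found
            break
        boundaries |= new
        audio, text = zlss_segment(audio, text, boundaries)    # step 6: re-cut the corpus
    return sorted(boundaries)                                  # step 7 repeats via the loop
```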
The HashMap tracking and lookup mechanism. One point must be noted: in each iteration, the ZLSS method segments the original text and speech into relatively smaller paragraphs or sentences at the found sentence positions; the time information of the original text and speech is discarded while only the new sentence information (times relative to the smaller paragraphs) found in the current iteration is retained. A HashMap-based tracking and lookup mechanism is therefore adopted to map all found correct sentence times uniformly onto the initial time axis, in preparation for the next classification iteration, as shown in FIG. 3.
FIG. 3 shows three iterations of tracking and lookup performed on a given piece of audio data. Three iterations are performed in total, each finding a new period, corresponding in turn to sentence points I, II and III. At each iteration, however, the system replaces the current corpus with the paragraphs or sentences segmented at the newly found period; in other words, the period found by the current iteration loses its position in the original audio file, which complicates the detection and classification performed later. We therefore need to extract the feature parameters of the data frames at the original audio position of each newly found sentence boundary and add them to the training set for classification and detection.
As the figure shows, when the third iteration finds a new period, how is it located in the original audio? The position of the previous iteration's sentence point relative to the original audio is looked up through the hash table; pushing back in the same way finds the positions of all earlier sentence points, so the new sentence points found in this iteration are mapped in turn to their positions in the original audio file. The shaded portion of the figure represents sentences that have been correctly segmented and removed from the current corpus. FIG. 2 shows the overall schematic of fully automatic sentence segmentation.
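A runnable sketch of this idea follows: each cut segment remembers its start offset in the original audio, so a sentence time found inside a segment can be pushed back to the initial axis. The class and segment identifiers here are illustrative assumptions, not part of the patent.

```python
# Minimal HashMap-based tracking mechanism: segment_id -> start offset (seconds)
# of that segment in the ORIGINAL audio, so local times map back to the initial axis.
class HashMapTracker:
    def __init__(self):
        self.offsets = {"root": 0.0}

    def register_segment(self, parent_id, segment_id, start_in_parent):
        # a new segment cut out of `parent_id`, beginning at `start_in_parent` seconds
        self.offsets[segment_id] = self.offsets[parent_id] + start_in_parent

    def to_original_time(self, segment_id, t_in_segment):
        # map a boundary time found in the current segment back to the original axis
        return self.offsets[segment_id] + t_in_segment

tracker = HashMapTracker()
tracker.register_segment("root", "iter1_seg2", 37.4)      # 2nd segment of iteration 1
tracker.register_segment("iter1_seg2", "iter2_seg1", 5.1) # cut again in iteration 2
print(tracker.to_original_time("iter2_seg1", 2.0))        # -> 44.5 s on the original axis
```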
practical feasibility of maximized accurate full-automatic labeling algorithm
Because all the speech studied by the invention has well-matched text, the spectral parameters of the speech signal and the corresponding text can be forcibly aligned by the Viterbi algorithm using the HTK toolkit, giving the boundary information of the speech segments (the HMM sequence) on the basis of automatic phoneme segmentation. Speech phenomena are very rich, however, especially in Chinese, and modeling with spectral parameters alone cannot represent them all. The segmentation result may therefore exhibit the following problems: (1) for some boundary types, the automatic segmentation may be offset from natural manual segmentation; (2) actual speech may contain mispronunciation (mis-pronunciation). Current methods solve these problems with manual adjustment and proofreading, which plainly contradicts the goal of this invention.
To address these problems, the idea adopted here is to combine the spectral-parameter HMM sentence segmentation algorithm with the prosodic-parameter sentence segmentation method so as to maximize the accurate label set. The fusion of the two methods is feasible because they share the same time scale: they can therefore be placed in a mutual iteration on the same time axis, each serving as the other's input and output in a rolling iteration.
Label-free automatic sentence segmentation based on spectral-parameter HMMs (hidden Markov models)
Segmentation principle of the ZLSS algorithm
This algorithm automatically forms the accurate initial label set of the system. In this algorithm, the silence at a sentence boundary is treated as an independent phoneme (sil). First, the hidden Markov model of each phoneme in the speech is trained by an HMM-based unsupervised method and the flat-start training algorithm, and the phoneme sequence of the long speech is aligned with its text by Viterbi forced alignment (forced-alignment). Sentences are then segmented according to the sentence-ending symbols in the text (e.g. periods, question marks, exclamation marks, semicolons). Finally, a strict checking mechanism judges whether each segmented sentence is correct, yielding a smaller set of correct boundary labels.
Error detection mechanism
The checking mechanism is introduced because, when long speech data is forcibly aligned with its phoneme sequence, the complexity of the Viterbi algorithm and the consumption of computer storage grow with the data length while the alignment quality tends to fall, and alignment errors can even occur. To solve this problem, the checking mechanism is added to further ensure that every period found is correct.
After the long speech and the long phoneme sequence (obtained by dictionary lookup) are aligned by Viterbi, the position of the phoneme sequence within the speech is determined. First, the position of each period in the audio file is preliminarily fixed from the alignment result, and the adjacent (preceding and following) phoneme information and time information are obtained at the same time. After the phonemes preliminarily determined to precede and follow the period's sil phoneme are obtained, the single-phoneme recognizer provided by the system recognizes these adjacent segments. The recognition result is compared with the phonemes at the corresponding positions of the text: if the recognition results of the preceding and following phonemes both match the text, the period is accepted as correct; otherwise it is considered an erroneous period or not a period at all. FIG. 4 gives a sample illustration of the checking mechanism.
In the example of FIG. 4A, the last phoneme of the previous sentence and the first phoneme of the next sentence are ang and er respectively, and the recognition results are likewise ang and er, indicating a correct period. In FIG. 4B the last phoneme of the previous sentence is recognized incorrectly, so the Viterbi decoding is considered to contain errors and the sentence boundary is not accepted as correct.
Automatic sentence segmentation iterative algorithm
An iterative algorithm is introduced here. First, the long speech is segmented into paragraph speech and sentence speech at the correct, verified sil boundaries given by the above checking mechanism. Then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the previous iteration, i.e. whether a new correct sil has been found. If so, the resulting speech and text replace those of the previous round, the HMMs are retrained, and the iteration continues; if not, the iteration process is over.
The rationale for iterating is that as the iteration proceeds, new periods keep being found, and the original long speech and text are cut at the found periods into smaller paragraphs and sentences, from which more accurate Chinese phoneme HMMs can be trained; these in turn find more new periods, until no further new period can be found and the iteration ends.
It should also be noted that in the forced-alignment algorithm, the segmentation accuracy for boundary errors within 20 ms to 50 ms exceeds 95%; the phonemes therefore lie essentially near the phoneme segmentation boundary, and a sliding mechanism (Sliding Mechanism) is added to search for the phonemes before and after the sil at the boundary, improving the detection rate of correct sil. FIG. 5 gives a detailed flow chart of this algorithm.
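The loop of FIG. 5 can be summarized structurally as below; `flat_start_train`, `align_and_verify_sils` and `cut_at_sils` are hypothetical stand-ins for the HTK-based training, alignment and checking steps described above, not real APIs.

```python
# Structural sketch of the ZLSS iteration: retrain, align, verify, re-cut,
# and stop when no new correct sil is found.
def zlss_iterate(audio_set, text_set):
    prev_total = -1
    while True:
        hmms = flat_start_train(audio_set, text_set)             # retrain phone HMMs
        sils = align_and_verify_sils(hmms, audio_set, text_set)  # keep verified sils only
        audio_set, text_set = cut_at_sils(audio_set, text_set, sils)
        total = len(audio_set)                   # sentences + paragraphs obtained so far
        if total == prev_total:                  # no new correct sil: iteration is over
            return audio_set, text_set
        prev_total = total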
Minimal-labeling sentence segmentation based on prosodic features
Principle of minimal-labeling sentence segmentation
The accurate label set provided by the ZLSS labeling system above is not sufficient to construct a speech corpus capable of synthesizing more natural speech. This section therefore studies a prosodic-feature, minimal-labeling sentence segmentation method, based on semi-supervised learning and active learning theory, for automatically expanding the accurate label set.
First, prosodic features are used to classify audio frame segments (Frame-Clips) into vowel/consonant/pause (V/C/P). Then, minimal-labeling sentence boundary detection (finding sentence boundaries within pauses) is realized within the Co_training and Active Learning architectures. Finally, exact sentence boundaries are determined by a strict prosodic-feature error detection mechanism (e.g. comparing the vowel/consonant/pause ratios of text and audio).
Classification of V/C/P
First, the original audio data is divided into non-overlapping 20 ms frames; the energy (Energy), zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are then calculated, and the energy curve and pitch curve are smoothed.
After the features are extracted, each data frame is given a vowel/consonant/pause classification according to these three features. The classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining the energy threshold:
pauses are an important cue for sentence boundary detection, and the Chinese news-broadcast speech environment used here is relatively stable, so a suitable energy threshold can be found to detect pauses. Because different speech has different characteristics, the energy threshold for news speech is determined by gathering statistics over the labeled data provided by ZLSS according to the following principle:
I. each item of labeled data provided by the labeling system comprises a number of data frames; once the data frames are obtained, the mean of their energies is calculated.
II. The energy threshold is then set to the maximum of all the energy means.
After these statistics, the energy threshold was set to 0.005.
2) the mean and variance of the zero-crossing rate (MZCR and VZCR) are calculated, and the ZCR threshold TZCR is defined as:
TZCR = MZCR + 0.005·VZCR
3) the data frames are V/C/P classified by the following criteria, with FrameType indicating the type of frame:
if ZCR > TZCR, FrameType is Consonant;
otherwise, if Energy < 0.005, FrameType is Pause;
otherwise, FrameType is Vowel.
4) Frames classified by V/C/P are merged by category: consecutive data frames of the same category are treated as one variable-length unit of that category. If a short consonant unit (duration less than two frames) lies between two adjacent pause units, the two pause units are merged and the middle C is replaced by P.
5) Because several vowels may be detected as one large vowel, some vowel units may last too long; vowels must therefore be segmented to avoid errors in feature calculation. If a vowel unit lasts too long, it is split at its energy valley; in cases where the energy has no valley, it is split into equal-length parts.
Statistics over the V/C/P classification results show that 15 frames is the most frequent vowel duration and the average vowel duration is 16 frames, so 15 frames is taken as the threshold. The splitting rule is shown in FIG. 6.
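Steps 1)-5) can be made concrete as below. This is a runnable sketch assuming 16 kHz mono audio as a numpy array; pitch is extracted in the text for later prosodic features but is not used by the classification rules themselves, so it is omitted here, and the long-vowel split falls back to equal halves (the text prefers an energy valley when one exists).

```python
import numpy as np

def vcp_classify(signal, sr=16000, frame_ms=20, energy_thresh=0.005):
    frame_len = sr * frame_ms // 1000
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)                                 # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero-crossing rate
    tzcr = zcr.mean() + 0.005 * zcr.var()                               # TZCR = MZCR + 0.005*VZCR
    labels = np.where(zcr > tzcr, "C",
             np.where(energy < energy_thresh, "P", "V"))                # step 3 criteria
    return merge_and_split(list(labels))

def merge_and_split(labels, max_vowel_frames=15):
    # step 4: absorb a 1-frame consonant run squeezed between two pauses
    for i in range(1, len(labels) - 1):
        if labels[i] == "C" and labels[i - 1] == "P" and labels[i + 1] == "P":
            labels[i] = "P"
    # collect variable-length same-category runs: (type, first_frame, end_frame)
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    # step 5 (simplified): split over-long vowel runs into equal halves
    out = []
    for typ, s, e in runs:
        if typ == "V" and e - s > max_vowel_frames:
            mid = (s + e) // 2
            out += [("V", s, mid), ("V", mid, e)]
        else:
            out.append((typ, s, e))
    return out
```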
Feature extraction
FIG. 7 illustrates the features used to describe the context information of candidate boundaries. In FIG. 7 the pause feature of the candidate boundary is combined with the speech-rate feature, and the prosodic features of the vowels adjacent to the candidate boundary are also used to represent prosodic changes near the boundary point.
A pause is one of the most important indicators for sentence boundary detection, and prosodic information also plays an important role in sentence segmentation. The rate of speech (ROS), which affects the duration of pauses between sentences, is also incorporated into the sentence-boundary feature set. We therefore use the pause features, speech rate and prosodic features as three feature sets for distinguishing and detecting sentence boundaries.
ROS is defined as:
ROS = n / Σ d_i
where n is the number of vowels and d_i is the duration of the i-th vowel. Pauses and consonants are not included in the calculation of speech rate.
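Continuing the V/C/P sketch above, ROS can be computed directly from the variable-length runs; this helper is an assumed convenience consistent with the definition just given.

```python
# ROS = n / sum(d_i): only vowel units enter the count and the duration sum;
# pause and consonant runs are excluded, as the text states.
def rate_of_speech(runs, frame_ms=20):
    vowel_durations = [(end - start) * frame_ms / 1000.0
                       for typ, start, end in runs if typ == "V"]
    return len(vowel_durations) / sum(vowel_durations) if vowel_durations else 0.0
```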
Error detection mechanism
Since the Co_training-based classification above uses maximum-entropy classifiers, the correctness of the classification result cannot be fully guaranteed. To avoid unnecessary manual proofreading of the classification result, an error detection mechanism is introduced again in this algorithm, aimed at further ensuring the correctness of the results given by the binary classifier: the true sentence boundaries must be filtered out of the preliminary classification result set. The workflow is briefly as follows:
1) periods, question marks and semicolons are taken as sentence boundary identifiers on the text, and a, e, i, o, u represent vowels on the text; the total number of vowels on the text is calculated and recorded as TV. For each boundary on the text, the number of vowels from the start of the text to the boundary and from the boundary to the end of the text is calculated and recorded as TP and TS respectively. Several vowels joined together are treated as one vowel.
2) For each candidate boundary found by the classifier, the total number of vowels AV is calculated from the V/C/P classification result, along with the number of vowels from the start of the V/C/P classification result (the audio) to the candidate boundary and from the candidate boundary to the end of the audio, recorded as AP and AS respectively. Since vowel splitting could distort the vowel counts, the V/C/P result used here has had categories merged but vowel units left unsplit.
3) If either |AP/AV − TP/TV| or |AS/AV − TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary. The reason is that although the numbers of vowels may differ greatly, the relative "position" of the same boundary on the text and on the audio should be the same, i.e. AP/AV and TP/TV should differ very little; the method thus sorts out real sentence boundaries by this position judgment, as shown in FIG. 8:
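A runnable sketch of this check follows. The 0.015 tolerance comes from the text; collapsing adjacent text vowels a/e/i/o/u into one vowel is implemented with a regular expression, and the example numbers are illustrative only.

```python
import re

def count_text_vowels(text, boundary_index=None):
    """Count vowel groups in text, or in text[:boundary_index] if given;
    each maximal run of a/e/i/o/u counts as ONE vowel (step 1)."""
    span = text if boundary_index is None else text[:boundary_index]
    return len(re.findall(r"[aeiou]+", span.lower()))

def is_true_boundary(ap, av, tp, tv, tol=0.015):
    # the relative boundary "positions" on audio (AP/AV) and text (TP/TV) must agree
    ts, as_ = tv - tp, av - ap          # vowel counts after the boundary
    return abs(ap / av - tp / tv) < tol or abs(as_ / av - ts / tv) < tol

# e.g. a candidate with 120 of 300 audio vowels before it, against 41 of 100
# text vowels, gives |0.40 - 0.41| = 0.01 < 0.015 and is accepted
print(is_true_boundary(120, 300, 41, 100))   # True
```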
Co_training-based minimal-labeling sentence segmentation algorithm
The algorithm proceeds in four steps: first, V/C/P classification of the audio; then feature extraction over the data frames; then training and classification with the added classifiers; finally, the classification result is sent to the checking mechanism to further ensure its correctness. The detailed steps are shown in FIG. 9:
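The Co_training step can be sketched as below. The text specifies a maximum-entropy binary classifier; scikit-learn's LogisticRegression (a maximum-entropy model) is used here as a stand-in, the two feature "views" (pause/speech-rate vs. prosody, per FIG. 7) are an assumption, and a full Co_training would alternate the teaching role between the two classifiers rather than letting view A teach alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xa, Xb, y, Ua, Ub, rounds=10, grow=5):
    """Xa/Xb: labeled numpy arrays in views A and B; Ua/Ub: one unlabeled pool in both views."""
    clf_a = LogisticRegression(max_iter=1000).fit(Xa, y)
    clf_b = LogisticRegression(max_iter=1000).fit(Xb, y)
    for _ in range(rounds):
        if len(Ua) == 0:
            break
        # view A pseudo-labels the pool; its most confident picks are added
        # (with A's labels) to BOTH views' training sets
        conf = clf_a.predict_proba(Ua).max(axis=1)
        pick = np.argsort(conf)[-grow:]
        Xa = np.vstack([Xa, Ua[pick]])
        Xb = np.vstack([Xb, Ub[pick]])
        y = np.concatenate([y, clf_a.predict(Ua[pick])])
        keep = np.setdiff1d(np.arange(len(Ua)), pick)
        Ua, Ub = Ua[keep], Ub[keep]
        clf_a = LogisticRegression(max_iter=1000).fit(Xa, y)   # retrain both views
        clf_b = LogisticRegression(max_iter=1000).fit(Xb, y)
    return clf_a, clf_b
```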
Experimental results and data analysis
Experiments on a standard data set (Benchmark) and analysis of results
First, to compare with the automatic sentence segmentation algorithms proposed in the prior art, we performed experiments on the same standard data set with the same evaluation measures. Note that we cannot guarantee that the total number of sentences obtained after segmenting the selected training corpus equals the total in the reference document; only the total duration is controlled to match. A 42-minute corpus was selected as the training set and compared against the prior automatic sentence segmentation methods FA-0, SFA-1 and SFA-2 for long corpora. The same synthesis tool, Clustergen, and the same training corpus (42 minutes of speech and text provided on LibriVox) were used to build a speech synthesis engine; the parametric-statistical synthesis method has low model space-time overhead and high flexibility. The long corpus was then divided into separate sentences, 653 in total, by the method provided by the invention (HAZ-SAS).
The quality of the synthesized speech can be measured with the Mel-Cepstral Distortion (MCD). The segmented sentences are divided into a training set and a test set and passed to the Clustergen synthesis engine for training and synthesis; the corresponding MCD value is then calculated from the synthesis result according to formula (1). By comparing the influence of different segmentation methods on the synthesis result under the same training set, and the difference in synthesis results under different test sets using the same method, it can be clearly seen that the sentence segmentation method adopted by the invention markedly improves speech synthesis quality. The MCD values under different test sets were calculated and compared with the two prior methods, as shown in Table 1:
Table 1: comparison of MCD values and experimental data for different test sets
Φe in the table denotes the EMMA e-book audio and corresponding text set provided by LibriVox, from which the corresponding time segments were extracted. Note that FA-0 corresponds to the experimental result of the Viterbi algorithm without any modification; SFA-1 and SFA-2 are the results obtained after corresponding modifications to the Viterbi algorithm. The MCD is calculated as:
MCD = (10 / ln 10) · √( 2 · Σ_d (mc_d^(s) − mc_d^(o))² )    (1)
where mc_d^(s) and mc_d^(o) represent the feature vector values of the synthesized audio and the original audio, respectively. From the experimental data it is easy to see that, under the same training set, with the sentence segmentation algorithm adopted by the invention and a test set of 9 sentences lasting 4 min, the MCD value falls by 0.08 relative to SFA-1. This indicates that the method locates sentence boundaries more accurately and improves the final synthesis quality, and can be applied to the automatic construction of a speech corpus.
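Formula (1) as reconstructed above is the standard mel-cepstral distortion; a runnable sketch over time-aligned per-frame mel-cepstra follows. Averaging over frames and the exact coefficient range are assumptions, since the patent does not specify them.

```python
import numpy as np

def mel_cepstral_distortion(mc_synth, mc_orig):
    """mc_*: arrays of shape (n_frames, n_dims) of time-aligned mel-cepstra."""
    diff = mc_synth - mc_orig
    # per-frame MCD in dB, per formula (1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return per_frame.mean()   # mean MCD over all frames
```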
HAZ-SAS system performance evaluation and experimental analysis
The above experiments were performed on a standard data set whose recordings are relatively clean and whose text correspondence is accurate. To test the segmentation performance of the proposed algorithm on ordinary data, the following experiment was also performed: Chinese news-broadcast speech and corresponding text downloaded from the Internet were used, totaling 70 paragraphs of speech and 9447 seconds (about 2.6 hours); the experiment again used the HTK toolkit for HMM training and the Viterbi forced-alignment algorithm.
The sentence-boundary detection performance of the ZLSS and MLSS subsystems before and after the two segmentation methods are fused, and the correctly segmented sentences output by the complete fully automatic sentence segmentation system, are given in turn. First, Table 2 shows the labeling capability of the ZLSS method alone after one complete iteration.
Sentence segmentation accuracy is defined as:
sentence segmentation accuracy = (number of correctly segmented sentences / total number of sentences) × 100%
Table 2: labeling capability of ZLSS alone after one complete iteration
Clearly, the ZLSS method alone falls short of the ideal requirement and cannot provide enough accurately labeled data to construct a Chinese speech corpus quickly for use in speech synthesis. Next, sentence-boundary detection experiments were performed on the MLSS method, with the following data:
Table 3: statistical results of MLSS classification performance on sentence boundaries
For ease of analysis, the boundary classification performance can be drawn as a histogram from the above results. It is easy to see that the classification performance of the system improves continuously as the labeling system's capacity to provide label sets grows. In addition, under the same training set, the buffer size affects the classification performance; as FIG. 10 shows, the larger the buffer, the higher the classification performance.
The 42.2% of labeled data provided by ZLSS is fed into MLSS, the information features of the corresponding sentence points are extracted, collaborative training (Co_training) is performed, and further iterative classification yields more correct sentence points. The results are shown in Table 4:
Table 4: iterative classification results
The MLSS adopted by the invention thus shows good classification performance, and with the error detection mechanism added, sentence segmentation accuracy improves greatly. The data above also show that the classifier's performance clearly increases with the amount of training data. Finally, the complete fully automatic sentence segmentation system was run; after four iterations its output is shown in Table 5:
Table 5: results of the fully automatic sentence segmentation system
From all the experimental data above it follows that:
1) The fully automatic sentence segmentation system has good segmentation accuracy and, more importantly, obtains far more correct sentences than the original subsystems alone, with no manual participation in the whole process.
2) Meanwhile, in the flat-start training algorithm, shorter speech input can be used to train better HMMs; in the forced-alignment process, the Viterbi decoding space is reduced and alignment accuracy is improved.
3) The labeling data of the whole system is generated automatically, so the system can be regarded as zero-labeling (Zero-Labeling), which greatly improves classification efficiency and saves cost.
4) The proposed automatic sentence segmentation algorithm still has room for improvement. For example, when Viterbi forced alignment is performed on a large corpus for the first time, the overall demands on computer performance are significantly higher than in prior methods, because the iterative algorithm performs Viterbi decoding on the entire speech in the first iteration; this places relatively high demands on both processor performance and memory size.
The invention provides a fully automatic sentence segmentation algorithm based on spectral parameters (Mel-Cepstral Parameters) and prosodic parameters (Prosodic Parameters), which fuses forced alignment under the HMM with the Co_training method of semi-supervised learning, ensuring that no manual intervention is needed and that high segmentation accuracy is achieved during the sentence segmentation of long speech. The method can be applied to the rapid automatic construction of speech corpora.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.
Claims (10)
1. A fully automatic segmentation method for long speech, characterized by comprising the following steps:
(1) the ZLSS method provides accurate time data for the labeled periods, and a HashMap tracking and lookup mechanism maps this time data, according to its position on the time axis, to the input of the MLSS algorithm;
(2) a boundary feature extraction program uses the mapped time data to extract the corresponding data-frame features from the original file, in preparation for the Co_training classification iterations;
(3) the boundary feature information of the correct sentence positions extracted in the previous step is added to the MLSS training set for Co_training, and further classification yields new sentences;
(4) the classification result gives only the starting frame and ending frame corresponding to each period position; a conversion program further maps these to the time axis consistent with ZLSS before the next step is carried out;
(5) a judgment is made as to whether a new period was found in this iteration; if not, the whole iteration process ends, and if a new period was found, the next step is carried out;
(6) after the time-point information output by the conversion program is obtained, the segmentation method provided by ZLSS further segments the current long speech and text into smaller and more numerous paragraphs or sentences, and the result replaces the initial speech and text set of the previous iteration;
(7) the above steps are performed repeatedly.
2. The method for fully automatic segmentation of long speech according to claim 1, characterized in that a HashMap-based tracking and lookup mechanism is employed to map all found correct sentence-time information uniformly onto the initial time axis, in preparation for the next classification iteration.
3. The method as claimed in claim 1, characterized in that the ZLSS method treats the silence at a sentence boundary as an independent phoneme sil; first, the hidden Markov model of each phoneme in the speech is trained by an HMM-based unsupervised method and the flat-start training algorithm, and the phoneme sequence of the long speech is aligned with its text by Viterbi forced alignment; sentences are then segmented at the sentence-ending symbols in the text, and finally a strict checking mechanism judges whether each segmented sentence is correct, thereby obtaining a smaller set of correct boundary labels.
4. The fully automatic segmentation method for long speech according to claim 1, characterized in that the ZLSS method introduces an iterative algorithm: first, the long speech is segmented into paragraph speech and sentence speech at the correct, verified sil boundaries given by the above checking mechanism; then it is judged whether the total number of sentences and paragraphs currently obtained has increased relative to the result of the previous iteration, i.e. whether a new correct sil has been found; if so, the resulting speech and text replace those of the previous round, the HMMs are retrained, and the iteration continues; if not, the iteration process is over.
5. The method according to claim 1, further comprising automatically expanding the accurate label set:
first, prosodic features are used to classify audio frame segments (Frame-Clips) into vowel/consonant/pause (V/C/P); second, minimal-labeling sentence boundary detection is realized within the Co_training and Active Learning frameworks, searching for sentence boundaries inside pauses; finally, a strict prosodic-feature error detection mechanism is studied to determine accurate sentence boundaries.
6. The method according to claim 5, characterized in that the V/C/P classification is as follows: first, the original audio data is divided into non-overlapping 20 ms frames; the energy (Energy), zero-crossing rate (ZCR) and pitch frequency (Pitch) of each frame are then calculated, and the energy curve and pitch curve are smoothed;
after the above features are extracted, each data frame is given a vowel/consonant/pause classification according to these three features; the classification algorithm (V/C/P Classification) is described as follows:
1) calculating and determining the energy threshold:
to find a suitable energy threshold for detecting pauses, statistics are gathered over the labeled data provided by ZLSS according to the following principle:
I. each item of labeled data provided by the labeling system comprises a number of data frames; once the data frames are obtained, the mean of their energies is calculated;
II. the energy threshold is set to the maximum of all the energy means;
after these statistics, the energy threshold was set to 0.005;
2) the mean MZCR and variance VZCR of the zero-crossing rate are calculated, and the ZCR threshold TZCR is defined as:
TZCR = MZCR + 0.005·VZCR;
3) the data frames are V/C/P classified by the following criteria, with FrameType indicating the type of frame:
if ZCR > TZCR, FrameType is Consonant;
otherwise, if Energy < 0.005, FrameType is Pause;
otherwise, FrameType is Vowel;
4) frames classified by V/C/P are merged by category: consecutive data frames of the same category are treated as one variable-length unit of that category, and if a short consonant unit lies between two adjacent pause units, the two units are merged and the middle C is replaced by P;
5) vowel segmentation is performed: if a vowel unit lasts too long, it is split at its energy valley; in cases where the energy has no valley, it is split into equal-length parts.
7. The method according to claim 6, characterized in that the duration of a short consonant unit is less than two frames.
8. The method according to claim 6, wherein an error detection mechanism is introduced, and the workflow thereof is as follows:
A) periods, question marks and semicolons are taken as sentence boundary identifiers in the text, and a, e, i, o and u are taken as the vowels of the text; the total number of vowels in the text is counted and recorded as TV; for each boundary in the text, the numbers of vowels from the start of the text to the boundary and from the boundary to the end of the text are counted and recorded as TP and TS respectively; several adjacent vowels are treated as a single vowel;
B) for each candidate boundary found by the classifier, the total number of vowels AV is calculated from the V/C/P classification result, and the numbers of vowels from the start of the V/C/P classification result to the candidate boundary and from the candidate boundary to the end of the audio are calculated and recorded as AP and AS respectively;
C) if either |AP/AV - TP/TV| or |AS/AV - TS/TV| is less than 0.015, the boundary is considered a correct sentence boundary.
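The check in steps A) to C) compares vowel densities on the text side and the audio side. Below is a minimal sketch of that comparison. Two caveats: the second comparison is written with TS/TV, which the symmetry of step A) implies (the published text prints TP/TV there), and the boundary positions are assumed to be a character offset in the text and a frame index in the V/C/P label sequence.

```python
import re

def text_vowel_counts(text: str, boundary: int):
    """Step A): count vowel groups (adjacent vowels count once) in the
    whole text (TV), before the boundary (TP) and after it (TS)."""
    count = lambda s: len(re.findall(r"[aeiou]+", s.lower()))
    return count(text), count(text[:boundary]), count(text[boundary:])

def audio_vowel_counts(labels: list, boundary: int):
    """Step B): count 'V' segments in the whole V/C/P sequence (AV),
    before the candidate boundary (AP) and after it (AS)."""
    def segs(ls):
        return sum(1 for i, l in enumerate(ls)
                   if l == "V" and (i == 0 or ls[i - 1] != "V"))
    return segs(labels), segs(labels[:boundary]), segs(labels[boundary:])

def is_correct_boundary(text, t_bound, labels, a_bound, tol=0.015):
    """Step C): accept the candidate when either normalized difference
    of vowel counts is below the tolerance."""
    tv, tp, ts = text_vowel_counts(text, t_bound)
    av, ap, as_ = audio_vowel_counts(labels, a_bound)
    if av == 0 or tv == 0:
        return False              # degenerate case: no vowels found
    return (abs(ap / av - tp / tv) < tol
            or abs(as_ / av - ts / tv) < tol)
```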
9. The method according to claim 1, wherein the minimally-labeled sentence segmentation algorithm based on Co_training is divided into four steps: firstly, V/C/P classification is performed on the audio; then, features are extracted from the data frames; next, classifiers are trained and applied; and finally, the classification result is sent to the checking mechanism to further ensure its correctness.
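These four steps chain the pieces above into a semi-supervised loop. A hypothetical sketch follows; the claim does not spell out the two feature views or the confidence rule, so the standard co-training convention (two views, agreement filtering) is assumed, with the claim's checking mechanism as a final gate. All callables are hypothetical stand-ins.

```python
def co_training_mlss(labeled, unlabeled, view_a, view_b, fit, predict, check,
                     max_rounds=10):
    """Minimally-labeled sentence segmentation, claim 9 (sketch).
    labeled     -- list of (pause_clip, is_sentence_boundary) seed pairs
    unlabeled   -- pause clips still to be labeled
    view_a/b    -- the two prosodic feature views (hypothetical)
    fit/predict -- train / apply a classifier (hypothetical)
    check       -- the error detection mechanism of claim 8"""
    for _ in range(max_rounds):
        clf_a = fit([(view_a(x), y) for x, y in labeled])
        clf_b = fit([(view_b(x), y) for x, y in labeled])
        accepted, remaining = [], []
        for x in unlabeled:
            ya, yb = predict(clf_a, view_a(x)), predict(clf_b, view_b(x))
            # keep only predictions on which both views agree AND which
            # survive the checking mechanism
            if ya == yb and check(x, ya):
                accepted.append((x, ya))
            else:
                remaining.append(x)
        if not accepted:
            break                  # no more confident labels to add
        labeled.extend(accepted)
        unlabeled = remaining
    return labeled
```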
10. The method for full-automatic segmentation of long speech according to claim 1, wherein the labeling result automatically obtained from the ZLSS iteration is used as the input of the MLSS algorithm; the MLSS algorithm expands the automatic labeling result, and the expanded labeling result is in turn used as the input of ZLSS to continue expanding the set of effective labels.
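The outer loop alternating the two algorithms can be stated compactly. This is a sketch under the assumption that each algorithm takes the current label set and returns an (ideally larger) one; the stop-when-no-growth rule mirrors claim 4 and is an assumption here, since claim 10 does not state a termination condition. `zlss` and `mlss` are hypothetical callables for the algorithms of claims 4 and 9.

```python
def alternate_zlss_mlss(audio, text, zlss, mlss):
    """Claim 10 (sketch): ZLSS output seeds MLSS; MLSS expands the labels;
    the expanded set is fed back into ZLSS until no further labels appear."""
    labels = []
    while True:
        labels = zlss(audio, text, labels)      # ZLSS iteration (claim 4)
        expanded = mlss(audio, text, labels)    # MLSS expansion (claim 9)
        if len(expanded) <= len(labels):
            return expanded                     # no effective growth: stop
        labels = expanded
```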
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310280159.9A CN103345922B (en) | 2013-07-05 | 2013-07-05 | A kind of large-length voice full-automatic segmentation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103345922A true CN103345922A (en) | 2013-10-09 |
CN103345922B CN103345922B (en) | 2016-07-06 |
Family
ID=49280713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310280159.9A Expired - Fee Related CN103345922B (en) | 2013-07-05 | 2013-07-05 | A kind of large-length voice full-automatic segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103345922B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1163009A (en) * | 1994-09-30 | 1997-10-22 | 摩托罗拉公司 | Method and system for recognizing a boundary between sounds in continuous speech |
CN1252592A (en) * | 1998-10-28 | 2000-05-10 | 国际商业机器公司 | Command boundary discriminator of conversation natural language |
US6169972B1 (en) * | 1998-02-27 | 2001-01-02 | Kabushiki Kaisha Toshiba | Information analysis and method |
WO2007003505A1 (en) * | 2005-07-01 | 2007-01-11 | France Telecom | Method and device for segmenting and labelling the contents of an input signal in the form of a continuous flow of undifferentiated data |
CN102063898A (en) * | 2010-09-27 | 2011-05-18 | 北京捷通华声语音技术有限公司 | Method for predicting prosodic phrases |
Cited By (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761064A (en) * | 2013-12-27 | 2014-04-30 | 圆展科技股份有限公司 | Automatic voice input system and method |
CN104463208A (en) * | 2014-12-09 | 2015-03-25 | 北京工商大学 | Multi-view semi-supervised collaboration classification algorithm with combination of agreement and disagreement label rules |
CN104978961A (en) * | 2015-05-25 | 2015-10-14 | 腾讯科技(深圳)有限公司 | Audio processing method, device and terminal |
CN105047202A (en) * | 2015-05-25 | 2015-11-11 | 腾讯科技(深圳)有限公司 | Audio processing method, device and terminal |
CN104978961B (en) * | 2015-05-25 | 2019-10-15 | 广州酷狗计算机科技有限公司 | A kind of audio-frequency processing method, device and terminal |
CN105047202B (en) * | 2015-05-25 | 2019-04-16 | 广州酷狗计算机科技有限公司 | A kind of audio-frequency processing method, device and terminal |
CN105161094A (en) * | 2015-06-26 | 2015-12-16 | 徐信 | System and method for manually adjusting cutting point in audio cutting of voice |
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
CN107578769A (en) * | 2016-07-04 | 2018-01-12 | 科大讯飞股份有限公司 | Speech data mask method and device |
CN107578769B (en) * | 2016-07-04 | 2021-03-23 | 科大讯飞股份有限公司 | Voice data labeling method and device |
CN106373592B (en) * | 2016-08-31 | 2019-04-23 | 北京华科飞扬科技股份公司 | Audio holds processing method and the system of making pauses in reading unpunctuated ancient writings of making an uproar |
CN106157951B (en) * | 2016-08-31 | 2019-04-23 | 北京华科飞扬科技股份公司 | Carry out the automatic method for splitting and system of audio punctuate |
CN106157951A (en) * | 2016-08-31 | 2016-11-23 | 北京华科飞扬科技股份公司 | Carry out automatic method for splitting and the system of audio frequency punctuate |
CN106373592A (en) * | 2016-08-31 | 2017-02-01 | 北京华科飞扬科技股份公司 | Audio noise tolerance punctuation processing method and system |
CN106504773A (en) * | 2016-11-08 | 2017-03-15 | 上海贝生医疗设备有限公司 | A kind of wearable device and voice and activities monitoring system |
CN106782508A (en) * | 2016-12-20 | 2017-05-31 | 美的集团股份有限公司 | The cutting method of speech audio and the cutting device of speech audio |
CN107657947B (en) * | 2017-09-20 | 2020-11-24 | 百度在线网络技术(北京)有限公司 | Speech processing method and device based on artificial intelligence |
CN107657947A (en) * | 2017-09-20 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | Method of speech processing and its device based on artificial intelligence |
CN108597497A (en) * | 2018-04-03 | 2018-09-28 | 中译语通科技股份有限公司 | A kind of accurate synchronization system of subtitle language and method, information data processing terminal |
CN108597497B (en) * | 2018-04-03 | 2020-09-08 | 中译语通科技股份有限公司 | Subtitle voice accurate synchronization system and method and information data processing terminal |
CN110390930A (en) * | 2018-04-15 | 2019-10-29 | 高翔 | A kind of method and system of audio text check and correction |
WO2019227547A1 (en) * | 2018-05-31 | 2019-12-05 | 平安科技(深圳)有限公司 | Voice segmenting method and apparatus, and computer device and storage medium |
CN110364145A (en) * | 2018-08-02 | 2019-10-22 | 腾讯科技(深圳)有限公司 | A kind of method and device of the method for speech recognition, voice punctuate |
CN110164420A (en) * | 2018-08-02 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method and device of the method for speech recognition, voice punctuate |
US11430428B2 (en) | 2018-08-02 | 2022-08-30 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for segmenting sentences for speech recognition |
CN110164420B (en) * | 2018-08-02 | 2022-07-19 | 腾讯科技(深圳)有限公司 | Voice recognition method, and method and device for sentence breaking by voice |
CN109377998A (en) * | 2018-12-11 | 2019-02-22 | 科大讯飞股份有限公司 | A kind of voice interactive method and device |
CN109871537B (en) * | 2019-01-31 | 2022-12-27 | 沈阳雅译网络技术有限公司 | High-precision Thai sentence segmentation method |
CN109871537A (en) * | 2019-01-31 | 2019-06-11 | 沈阳雅译网络技术有限公司 | A kind of high-precision Thai subordinate sentence method |
US11645110B2 (en) | 2019-03-13 | 2023-05-09 | International Business Machines Corporation | Intelligent generation and organization of user manuals |
CN110277104A (en) * | 2019-06-21 | 2019-09-24 | 上海乂学教育科技有限公司 | Word pronunciation training system |
CN110277104B (en) * | 2019-06-21 | 2021-08-06 | 上海松鼠课堂人工智能科技有限公司 | Word voice training system |
CN110428841A (en) * | 2019-07-16 | 2019-11-08 | 河海大学 | A kind of vocal print dynamic feature extraction method based on random length mean value |
CN110428841B (en) * | 2019-07-16 | 2021-09-28 | 河海大学 | Voiceprint dynamic feature extraction method based on indefinite length mean value |
CN110400580B (en) * | 2019-08-30 | 2022-06-17 | 北京百度网讯科技有限公司 | Audio processing method, apparatus, device and medium |
CN110400580A (en) * | 2019-08-30 | 2019-11-01 | 北京百度网讯科技有限公司 | Audio-frequency processing method, device, equipment and medium |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN110930997B (en) * | 2019-12-10 | 2022-08-16 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN111883169A (en) * | 2019-12-12 | 2020-11-03 | 马上消费金融股份有限公司 | Audio file cutting position processing method and device |
CN111883169B (en) * | 2019-12-12 | 2021-11-23 | 马上消费金融股份有限公司 | Audio file cutting position processing method and device |
CN111310413A (en) * | 2020-02-20 | 2020-06-19 | 阿基米德(上海)传媒有限公司 | Intelligent broadcasting program audio strip removing method and device based on program series list |
CN111310413B (en) * | 2020-02-20 | 2023-03-03 | 阿基米德(上海)传媒有限公司 | Intelligent broadcasting program audio strip removing method and device based on program series list |
CN112261214A (en) * | 2020-10-21 | 2021-01-22 | 广东商路信息科技有限公司 | Network voice communication automatic test method and system |
CN112420016B (en) * | 2020-11-20 | 2022-06-03 | 四川长虹电器股份有限公司 | Method and device for aligning synthesized voice and text and computer storage medium |
CN112420016A (en) * | 2020-11-20 | 2021-02-26 | 四川长虹电器股份有限公司 | Method and device for aligning synthesized voice and text and computer storage medium |
CN113593528B (en) * | 2021-06-30 | 2022-05-17 | 北京百度网讯科技有限公司 | Training method and device of voice segmentation model, electronic equipment and storage medium |
CN113593528A (en) * | 2021-06-30 | 2021-11-02 | 北京百度网讯科技有限公司 | Training method and device of voice segmentation model, electronic equipment and storage medium |
CN116483960A (en) * | 2023-03-30 | 2023-07-25 | 阿波罗智联(北京)科技有限公司 | Dialogue identification method, device, equipment and storage medium |
CN116483960B (en) * | 2023-03-30 | 2024-01-02 | 阿波罗智联(北京)科技有限公司 | Dialogue identification method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN103345922B (en) | 2016-07-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103345922B (en) | A kind of large-length voice full-automatic segmentation method | |
CN102800314B (en) | English sentence recognizing and evaluating system with feedback guidance and method | |
CN105632484B (en) | Speech database for speech synthesis pause information automatic marking method and system | |
CN107039034B (en) | Rhythm prediction method and system | |
CN101436403B (en) | Method and system for recognizing tone | |
CN101727902B (en) | Method for estimating tone | |
CN111640418B (en) | Prosodic phrase identification method and device and electronic equipment | |
Stan et al. | ALISA: An automatic lightly supervised speech segmentation and alignment tool | |
Stan et al. | A grapheme-based method for automatic alignment of speech and text data | |
CN101950560A (en) | Continuous voice tone identification method | |
CN111128128B (en) | Voice keyword detection method based on complementary model scoring fusion | |
Mamiya et al. | Lightly supervised GMM VAD to use audiobook for speech synthesiser | |
Chen et al. | The ustc system for blizzard challenge 2011 | |
Stanek et al. | Algorithms for vowel recognition in fluent speech based on formant positions | |
Stan et al. | Lightly supervised discriminative training of grapheme models for improved sentence-level alignment of speech and text data. | |
Ling et al. | The USTC system for blizzard challenge 2012 | |
Yu et al. | Overview of SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013 | |
Bartkova et al. | Prosodic parameters and prosodic structures of French emotional data | |
CN104240699A (en) | Simple and effective phrase speech recognition method | |
Ahmed et al. | Technique for automatic sentence level alignment of long speech and transcripts. | |
Milne | Improving the accuracy of forced alignment through model selection and dictionary restriction | |
Tesser et al. | Experiments with signal-driven symbolic prosody for statistical parametric speech synthesis | |
Wiśniewski et al. | Automatic detection and classification of phoneme repetitions using HTK toolkit | |
Li et al. | Grammar-based semi-supervised incremental learning in automatic speech recognition and labeling | |
Mehrabani et al. | Nativeness Classification with Suprasegmental Features on the Accent Group Level. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: No. 5 Yushan Road, Qingdao, Shandong 266000
Applicant after: Zhang Wei
Address before: Room B517, South Building, College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100
Applicant before: Zhang Wei
COR | Change of bibliographic data | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 2016-07-06; Termination date: 2020-07-05