CN101266790A - Device and method for automatic time marking of text file - Google Patents


Info

Publication number
CN101266790A
Authority
CN
China
Prior art keywords
time
text file
sentence
voice
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007100886277A
Other languages
Chinese (zh)
Inventor
颜铭祥
颜睿余
赵平峡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Micro Star International Co Ltd
Original Assignee
Micro Star International Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Micro Star International Co Ltd filed Critical Micro Star International Co Ltd
Priority to CNA2007100886277A priority Critical patent/CN101266790A/en
Publication of CN101266790A publication Critical patent/CN101266790A/en
Pending legal-status Critical Current

Abstract

The invention relates to a device and a method for automatically marking time in a text file. In the method, a receiving module receives a text file and an audio file, the text file consisting of a plurality of sentences. A speech recognition module converts the sentences of the text file into a speech model, divides the audio file into a plurality of frames according to an interval time and numbers the frames in order, extracts a characteristic parameter from the speech data of each frame, and computes an optimal speech path that matches the frames against the speech model. A marking module uses the optimal speech path to capture the number of the frame corresponding to the beginning of each sentence, obtains from that frame number and the interval time the start time of each sentence within the audio file, and marks the start time in the text file. With this method each sentence in the text file is automatically marked with its corresponding start time in the audio file, so the time need not be marked sentence by sentence by hand as in the prior art, saving considerable time and labor.

Description

Apparatus and method for automatically marking time in a text file
Technical field
The present invention relates to an apparatus and method for marking time in a text file, and in particular to an apparatus and method that mark time in a text file automatically by means of speech recognition.
Background technology
Whether on language-learning devices or on music players (for example, MP3 players), most current equipment offers a lyrics-synchronization function: while the user listens to a spoken passage or a song, the corresponding text (the transcript or the lyrics) is displayed in step with the audio file as it plays. The user can thus listen to the audio file while reading the text that corresponds to it, which improves the efficiency of language study or song learning.
At present the most common lyrics-synchronization file is the LRC file. Put simply, an LRC file places a time tag in front of each passage of text, and the time tag gives the start time of that passage within the audio file. Playing the audio file from that time therefore produces exactly the speech content that corresponds to the passage. It is because file formats such as LRC exist that many products and software packages with lyrics synchronization have appeared on the market.
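For illustration, a minimal LRC fragment looks like the following (the time tags and text here are invented, not taken from any real song). Each line starts with a [minutes:seconds.hundredths] tag giving the start time, within the audio file, of the text that follows it:

```
[00:12.00]First line of the lyrics
[00:17.20]Second line of the lyrics
[00:21.10]Third line of the lyrics
```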
With present technology, however, most LRC files are produced by hand: the time corresponding to each sentence is tagged by listening to the audio file and matching it against the text, sentence by sentence. This causes a great waste of time and labor.
For example, Taiwan, China patent application No. 92117564, "Editing system for karaoke lyrics and method for editing and displaying the same", provides an interface executed on a computer through which the user edits lyrics to match the rhythm of the karaoke music and defines the start time of each passage, so that during display the corresponding characters change accurately in time with the song and the user can easily sing along. The disclosed technique still has the user edit the lyrics against the music, i.e. it adopts the manual time-marking approach described above to give the karaoke text file (the lyrics) its lyrics-synchronization function.
In the current research literature, there are attempts to organize and structure key vocabularies and apply fast algorithms, achieving keyword extraction over large vocabularies and studying the resulting discrimination and recognition performance. Other work discusses, on a PDA-based speech control system, the influence of neural-network-based vector quantization on recognition accuracy. The methods used include extracting speech characteristic parameters with digital signal processing, preprocessing with vector quantization, and using hidden Markov models as the main recognition and training algorithm.
The documents mentioned above focus mainly on speech recognition techniques; they do not achieve the function of automatically marking in a text file the times of the corresponding audio file. How to mark time in a text file automatically, saving the time and money spent on manual time marking, is therefore a problem in urgent need of a solution.
Summary of the invention
In view of this, the present invention proposes an apparatus and method for automatically marking time in a text file, using speech recognition to perform the marking automatically. With the invention, each sentence in the text file is automatically marked with its corresponding time in the audio file, so there is no need, as in the prior art, to mark sentence by sentence by hand the times at which the text corresponds to the audio file, significantly saving time and labor.
The device for automatically marking time in a text file proposed by the invention comprises a receiving module, a speech recognition module, and a marking module.
The receiving module receives a text file and an audio file, the text file consisting of a plurality of sentences. The speech recognition module converts the sentences of the text file into a speech model, divides the audio file into a plurality of frames according to an interval time and numbers them in order, and computes the optimal speech path that matches the frames against the speech model. The marking module captures, according to the optimal speech path, the number of the frame corresponding to the beginning of each sentence, obtains from the frame number and the interval time the start time of each sentence within the audio file, and marks the start time in the text file.
The invention also proposes a method of automatically marking time in a text file by speech recognition, comprising the following steps: receive a text file and an audio file, the text file consisting of a plurality of sentences; convert the sentences of the text file into a speech model; divide the audio file into a plurality of frames according to an interval time and number them in order; compute the optimal speech path that matches the frames against the speech model; capture, according to the optimal speech path, the number of the frame corresponding to the beginning of each sentence; obtain, from the frame number and the interval time, the start time of each sentence within the audio file; and finally, mark the start time in the text file.
By the method of the invention, each sentence in the text file is automatically marked with its start time in the audio file; there is no need, as in the prior art, to mark the time sentence by sentence by hand, saving a great deal of time and labor.
Preferred embodiments of the invention and their effects are described below with reference to the drawings.
Description of drawings
Fig. 1 is a schematic diagram of the device for automatically marking time in a text file according to the present invention.
Fig. 2 is a schematic diagram of the speech recognition module.
Fig. 3 is a schematic diagram of the optimal speech path.
Fig. 4 is a flowchart of the method of automatically marking time in a text file according to the present invention.
Fig. 5 is a detailed flowchart of computing the optimal speech path.
The reference numerals are as follows:
S1, S2, S3, S4: sentences
F1~FN: frames
10: text file
12: audio file
14: speech model
20: receiving module
30: speech recognition module
32: capture module
34: first computing module
36: second computing module
38: optimal speech path
40: marking module
Embodiment
Fig. 1 is a schematic diagram of the device for automatically marking time in a text file according to the present invention. As shown in Fig. 1, the device comprises a receiving module 20, a speech recognition module 30, and a marking module 40.
The receiving module 20 receives a text file 10 and an audio file 12, which correspond to each other. For example, the audio file 12 may record the speech of an English reading session while the text file 10 records the words of that session; or the audio file 12 may be a popular song while the text file 10 holds its lyrics. The text file 10 is simply an ordinary article recording the text that corresponds to the audio file 12, and since an article consists of a plurality of sentences, the text file 10 also consists of a plurality of sentences.
The speech recognition module 30 converts all the sentences in the text file 10 into a speech model. The speech model is a hidden Markov model (HMM). A hidden Markov model is a statistical model describing a Markov process with hidden, unknown parameters: from the observable outputs one infers the hidden parameters of the process, which are then used for further analysis. Most current speech recognition systems adopt hidden Markov models, using probabilistic models to describe pronunciation and regarding the production of a short stretch of speech as a sequence of state transitions in a Markov model.
As for converting the text file 10 into a speech model: if the text file 10 is Chinese, the pronunciation of each Chinese character is composed of an initial and a final (for example, the pronunciation of the character meaning "rent" consists of one initial and one final), so the speech model is a hidden Markov model trained on the Chinese initials and finals, and each sentence in the text file 10 is converted into a speech model composed of initials and finals. Conversely, if the text file 10 is English, the speech model is a hidden Markov model trained on the English vowels and consonants, and each sentence in the text file 10 is converted into a speech model composed of vowels and consonants.
The audio file 12 is then divided into a plurality of frames according to an interval time, and the frames are numbered in order; the interval time is approximately 23 to 30 milliseconds. In a hidden Markov model, the characteristic parameter presented by each frame can be regarded as the output under some state, and both the state transitions and the output under a state are described by probabilistic models. Whether hidden Markov models or other speech recognition approaches are used, dividing the audio file 12 into basic phonetic units (frames) before the subsequent recognition processing improves the convenience and accuracy of that processing and speeds up the computation.
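A minimal sketch of this framing step in Python (the 8 kHz sample rate and 25 ms interval are illustrative choices within the 23 to 30 ms range described above, not values fixed by the patent):

```python
def split_into_frames(samples, sample_rate, interval_ms=25):
    """Divide a sequence of audio samples into numbered, non-overlapping frames.

    Returns a list of (frame_number, frame_samples) pairs; frame numbers
    start at 0 and increase in playback order, as the patent requires.
    """
    frame_len = int(sample_rate * interval_ms / 1000)  # samples per frame
    frames = []
    for number, start in enumerate(range(0, len(samples), frame_len)):
        frames.append((number, samples[start:start + frame_len]))
    return frames

# One second of dummy audio at 8 kHz: 8000 samples, 200 samples per 25 ms frame
frames = split_into_frames(list(range(8000)), sample_rate=8000)
print(len(frames))        # 40 frames in one second
print(len(frames[0][1]))  # 200 samples per frame
```

Because the frame number encodes playback order, the start time of any frame can later be recovered as frame number times interval time.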
Next, the speech recognition module 30 takes the frames into which the audio file 12 has been divided and the speech model converted from the text file 10, and computes the optimal speech path that matches the frames against the speech model.
The marking module 40 captures, according to the optimal speech path produced by the speech recognition module 30, the number of the frame corresponding to the beginning of each sentence in the text file 10, and then obtains from the frame number and the interval time the start time of each sentence within the audio file 12. Suppose the text file 10 corresponding to the audio file 12 contains four sentences. If a frame whose start time in the audio file 12 is 30 seconds is, according to the recognition result, the beginning of the second sentence of the text file, then 30 seconds is the start time of the second sentence in the text file 10: when the playback time of the audio file 12 reaches 30 seconds, the content being played is the beginning of the second sentence. Likewise, if a frame whose start time is 55 seconds is the beginning of the third sentence, then 55 seconds is the start time at which the third sentence corresponds to the audio file 12, and so on.
Note that after the number of the frame corresponding to the beginning of each sentence has been captured from the optimal speech path, the interval time of a frame can be chosen freely according to the user's needs or computational requirements. The start time of each sentence is therefore computed by multiplying the number of the frame at which the sentence starts by the interval time of each frame. For example, suppose the interval time is set to 25 milliseconds and the frames do not overlap, so the audio file 12 is divided into one frame every 25 milliseconds. If the frame number captured via the optimal speech path for the beginning of the second sentence in the text file 10 is 1200, then, since each frame covers 25 milliseconds, the start time of the second sentence within the audio file 12 is the frame number multiplied by the interval time (1200 * 25 ms = 30 s), i.e. 30 seconds. Likewise, if the frame number for the beginning of the third sentence is 2200, its start time is 2200 * 25 ms = 55 s, i.e. 55 seconds.
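The start-time arithmetic above is a single multiplication; a sketch in Python reproducing the patent's worked numbers (frame 1200 and frame 2200 at 25 ms per frame):

```python
def frame_to_start_time(frame_number, interval_ms):
    """Convert a frame number into a start time in seconds:
    start time = frame number * interval time."""
    return frame_number * interval_ms / 1000.0

# The patent's worked example: 1200 * 25 ms = 30 s, 2200 * 25 ms = 55 s
print(frame_to_start_time(1200, 25))  # 30.0 (second sentence)
print(frame_to_start_time(2200, 25))  # 55.0 (third sentence)
```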
Finally, the marking module 40 marks the start times in the text file 10. After the start time of each sentence within the audio file 12 has been obtained, it is written into the text file 10. As in an LRC file, the text file 10 then records not only the text corresponding to the audio file 12 but also the start time of the beginning of each sentence. Playing the audio file 12 from the start time of a given sentence therefore produces exactly the speech content corresponding to that sentence, achieving lyrics synchronization. With the device of the invention, each sentence in the text file 10 is automatically marked with its start time in the audio file 12, with no need for the manual time marking of the prior art.
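A sketch of this marking step, assuming the start times have already been recovered from the optimal speech path; the [mm:ss.xx] tag format follows the LRC convention mentioned in the background section, and the sentences are invented placeholders:

```python
def mark_start_times(sentences, start_times_s):
    """Prefix each sentence with its start time as an LRC-style [mm:ss.xx] tag."""
    lines = []
    for text, t in zip(sentences, start_times_s):
        minutes, seconds = divmod(t, 60)
        lines.append("[%02d:%05.2f]%s" % (minutes, seconds, text))
    return "\n".join(lines)

marked = mark_start_times(
    ["First sentence.", "Second sentence.", "Third sentence."],
    [0, 30, 55],
)
print(marked)
# [00:00.00]First sentence.
# [00:30.00]Second sentence.
# [00:55.00]Third sentence.
```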
Please refer to Fig. 2, a schematic diagram of the speech recognition module. In the device of the invention, the speech recognition module 30 comprises a capture module 32, a first computing module 34, and a second computing module 36.
Speech signals have an important characteristic: even when the same word or sound is uttered at different times, the waveforms are not identical; speech is a dynamic, time-varying signal. Speech recognition consists of finding regularities in these dynamic signals: once the regularities are found, then no matter how the signal varies over time its characteristics can be identified and the signal can be recognized. In speech recognition these regularities are called characteristic parameters (feature parameters), i.e. parameters that represent the characteristics of the speech signal, and they are the basis of speech recognition. The capture module 32 therefore first captures the characteristic parameter corresponding to each frame of the audio file 12, for use in the subsequent recognition processing.
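As a stand-in for the characteristic parameters used in real systems (which are typically MFCC-style spectral features; the log-energy below is a deliberately simplified assumption for illustration), one feature value can be computed per frame like this:

```python
import math

def frame_log_energy(frame):
    """A toy per-frame characteristic parameter: log of the frame's energy.

    Real recognizers use richer features (e.g. MFCCs), but the shape of the
    pipeline is the same: one feature value or vector per numbered frame.
    """
    energy = sum(s * s for s in frame)
    return math.log(energy + 1e-10)  # small floor avoids log(0) on pure silence

loud = [0.5, -0.4, 0.3, -0.2]
quiet = [0.01, -0.01, 0.02, 0.0]
print(frame_log_energy(loud) > frame_log_energy(quiet))  # True
```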
As stated above, the speech model is a hidden Markov model, a probabilistic and statistical method well suited to describing the characteristics of speech: speech is a multi-parameter random signal, and its parameters can be inferred accurately through HMM processing. The first computing module 34 therefore next uses a first algorithm to compute a matching probability of each characteristic parameter against the speech model. The first algorithm can be the forward procedure algorithm or the backward procedure algorithm. If the number of states of the hidden Markov model is N and any state may transition to any other state, then the number of state sequences of length T is N^T; when T is large, computing the probability by enumeration is far too expensive. The forward or backward procedure is therefore adopted to speed up the computation of the matching probability of the characteristic parameters against the speech model.
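A minimal forward-procedure sketch for a discrete HMM (the two-state model and all its probabilities are invented for illustration; real acoustic models have many more states and continuous observation densities):

```python
def forward(obs, init, trans, emit):
    """Forward procedure: total probability of an observation sequence under
    an HMM, summing over all state paths in O(T * N^2) time instead of
    enumerating all N^T paths.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of state i emitting observation symbol o
    """
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state model with 2 observation symbols (0 and 1)
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
p = forward([0, 1, 0], init, trans, emit)
print(round(p, 5))  # total probability of observing 0, 1, 0
```

A useful sanity check on the recursion is that the probabilities of all possible observation sequences of a given length sum to 1.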
Fig. 3 is a schematic diagram of the optimal speech path. As shown in Fig. 3, the second computing module 36 takes the matching probabilities computed by the first computing module 34 and uses a second algorithm to compute the optimal speech path 38; the second algorithm can be the Viterbi algorithm. Suppose the text file 10 contains four sentences in order: S1, S2, S3, and S4. These four sentences are converted in order into the speech model 14, and the audio file 12 corresponding to the text file 10 is divided into a plurality of frames (F1~FN). The Viterbi algorithm performs recognition with the frames F1~FN of the audio file 12 on the horizontal axis and the speech model 14 converted from the text file 10 on the vertical axis, marking the speech model state most likely to correspond to each frame. After the characteristic parameters of all frames in the audio file 12 have been processed, the Viterbi algorithm yields the optimal speech path 38: the path along which the whole sequence of frames best matches the speech model.
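A minimal Viterbi sketch over the same kind of toy model (the two states and all probabilities are invented for illustration): it labels every frame with its most likely state, which is how the frame number at which a new sentence's model begins can be read off the path.

```python
def viterbi(obs, init, trans, emit):
    """Viterbi algorithm: the single most likely state sequence for an
    observation sequence under an HMM, recovered by backtracking."""
    n = len(init)
    delta = [init[i] * emit[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        # For each state j, the best predecessor state at the previous frame
        prev = [max(range(n), key=lambda i: delta[i] * trans[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * trans[prev[j]][j] * emit[j][o]
                 for j in range(n)]
        back.append(prev)
    # Backtrack from the best final state to recover the full path
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path

# Toy model: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 0, 1, 1, 1], init, trans, emit))  # [0, 0, 1, 1, 1]
```

The path switches state exactly where the observations switch character; in the patent's setting, the frame index at which the path enters the first state of a sentence's model is the frame number that the marking module converts into that sentence's start time.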
As shown in Figure 3 as seen, by optimal voice path 38 can capture each sentence beginning the numbering of corresponding frame.According to the numbering of the frame of each sentence and the interval time that each frame is comprised, can obtain the zero-time of the beginning of each sentence corresponding to voice document 12.
Fig. 4 is for to indicate the time method process flow diagram automatically for text file, and Fig. 4 comprises the following step:
Step S10: receive text file and voice document.Wherein, text file and voice document are file in correspondence with each other, and text file is made up of a plurality of sentence.
Step S20: the sentence in the conversion text file is a speech model.Wherein, speech model belongs to hidden Markov model.
Step S30: the voice document that step S10 is received, according to being divided into a plurality of frames and numbering in regular turn interval time.Wherein, be about 23~30 milliseconds interval time.
Step S40: calculate the optimal voice path that frame and speech model match each other.This step can be subdivided into three steps again, will introduce below.
Step S50: capture according to optimal voice path each sentence beginning the numbering of corresponding frame.
Step S60: according to the numbering of frame and the beginning that obtains each sentence interval time zero-time corresponding to voice document.Because can choose voluntarily the interval time of frame according to user's demand or according to the requirement on calculating.Therefore, the algorithm of the zero-time of each sentence can start pairing frame number by each obtained sentence of step S50 and be multiplied by the interval time of each frame and obtain.
Step S70: last, the zero-time of beginning that indicates each sentence is in text file.So, text file more writes down the zero-time of the beginning of each sentence except the word content of record corresponding to voice document.Therefore, voice document if from the zero-time of some sentences begin to play, just can hear and the corresponding voice content of this sentence word content, and reach the synchronous function of ci and qu.Each sentence in the text file can be indicated automatically zero-time by method of the present invention, not need for another example that conventional art equally utilizes artificial mode to indicate the time sentence by sentence, and then save a large amount of time and the cost of manpower corresponding to voice document.
Step S40 above, computing the optimal speech path that matches the frames against the speech model, comprises the following steps; please refer to Fig. 5, the detailed flowchart of computing the optimal speech path.
Step S42: capture the characteristic parameter corresponding to each frame. Although a speech signal is a dynamic, time-varying signal, as long as the regularity of each short-time segment of the signal (each frame) is found, its characteristics can be identified no matter how the signal varies over time, and the signal can be recognized. In speech recognition this regularity is called a characteristic parameter (feature parameter), a parameter representing the characteristics of the speech signal. The characteristic parameter of each frame is therefore captured first, for use in the subsequent recognition processing.
Step S44: use the first algorithm to compute the matching probability of each characteristic parameter against the speech model. The first algorithm can be the forward procedure algorithm or the backward procedure algorithm.
Step S46: from the matching probabilities computed in step S44, use the second algorithm to compute the optimal speech path. The second algorithm can be the Viterbi algorithm. As shown in Fig. 3, from the optimal speech path computed by the Viterbi algorithm the number of the frame corresponding to the beginning of each sentence in the text file is captured; from the frame number of each sentence and the interval time covered by each frame, the start time of the beginning of each sentence within the audio file is obtained.
Although the technical content of the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the scope of patent protection of the invention. Changes and refinements made by any person of ordinary skill in the art without departing from the spirit of the invention shall all be covered by the scope of patent protection of the invention, which shall be determined by the scope defined in the appended claims.

Claims (16)

1. A device for automatically marking time in a text file, comprising:
a receiving module, which receives a text file and an audio file, the text file consisting of a plurality of sentences;
a speech recognition module, which converts the sentences in the text file into a speech model, divides the audio file into a plurality of frames according to an interval time, numbers the frames in order, and computes an optimal speech path that matches the frames against the speech model; and
a marking module, which captures according to the optimal speech path the number of the frame corresponding to the beginning of each sentence, obtains from the frame number and the interval time the start time of the beginning of each sentence within the audio file, and marks the start time in the text file.
2. The device for automatically marking time in a text file as claimed in claim 1, wherein the speech model is a hidden Markov model.
3. The device for automatically marking time in a text file as claimed in claim 1, wherein the interval time is approximately 23 to 30 milliseconds.
4. The device for automatically marking time in a text file as claimed in claim 1, wherein the speech recognition module further comprises:
a capture module, which captures the characteristic parameter corresponding to each of the frames;
a first computing module, which uses a first algorithm to compute a matching probability of each of the characteristic parameters against the speech model; and
a second computing module, which computes the optimal speech path from the matching probabilities using a second algorithm.
5. The device for automatically marking time in a text file as claimed in claim 4, wherein the first algorithm is the forward procedure algorithm.
6. The device for automatically marking time in a text file as claimed in claim 4, wherein the first algorithm is the backward procedure algorithm.
7. The device for automatically marking time in a text file as claimed in claim 4, wherein the second algorithm is the Viterbi algorithm.
8. The device for automatically marking time in a text file as claimed in claim 1, wherein the start time is obtained by multiplying the number of the frame by the interval time.
9. A method of automatically marking time in a text file, comprising the following steps:
receiving a text file and an audio file, the text file consisting of a plurality of sentences;
converting the sentences in the text file into a speech model;
dividing the audio file into a plurality of frames according to an interval time and numbering them in order;
computing an optimal speech path that matches the frames against the speech model;
capturing according to the optimal speech path the number of the frame corresponding to the beginning of each sentence;
obtaining from the frame number and the interval time the start time of the beginning of each sentence within the audio file; and
marking the start time in the text file.
10. The method of automatically marking time in a text file as claimed in claim 9, wherein the speech model is a hidden Markov model.
11. The method of automatically marking time in a text file as claimed in claim 9, wherein the interval time is approximately 23 to 30 milliseconds.
12. The method of automatically marking time in a text file as claimed in claim 9, wherein the computing step further comprises the following steps:
capturing the characteristic parameter corresponding to each of the frames;
using a first algorithm to compute a matching probability of each of the characteristic parameters against the speech model; and
computing the optimal speech path from the matching probabilities using a second algorithm.
13. The method of automatically marking time in a text file as claimed in claim 12, wherein the first algorithm is the forward procedure algorithm.
14. The method of automatically marking time in a text file as claimed in claim 12, wherein the first algorithm is the backward procedure algorithm.
15. The method of automatically marking time in a text file as claimed in claim 12, wherein the second algorithm is the Viterbi algorithm.
16. The method of automatically marking time in a text file as claimed in claim 9, wherein the start time is obtained by multiplying the number of the frame by the interval time.
CNA2007100886277A 2007-03-16 2007-03-16 Device and method for automatic time marking of text file Pending CN101266790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007100886277A CN101266790A (en) 2007-03-16 2007-03-16 Device and method for automatic time marking of text file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007100886277A CN101266790A (en) 2007-03-16 2007-03-16 Device and method for automatic time marking of text file

Publications (1)

Publication Number Publication Date
CN101266790A true CN101266790A (en) 2008-09-17

Family

ID=39989144

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007100886277A Pending CN101266790A (en) 2007-03-16 2007-03-16 Device and method for automatic time marking of text file

Country Status (1)

Country Link
CN (1) CN101266790A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109217965A (en) * 2018-09-26 2019-01-15 杭州当虹科技股份有限公司 A kind of SDIOverIP reception inter-system synchronization method based on timestamp
CN109920406A (en) * 2019-03-28 2019-06-21 国家计算机网络与信息安全管理中心 A kind of dynamic voice recognition methods and system based on variable initial position
CN109920406B (en) * 2019-03-28 2021-12-03 国家计算机网络与信息安全管理中心 Dynamic voice recognition method and system based on variable initial position
CN112102847A (en) * 2020-09-09 2020-12-18 四川大学 Audio and slide content alignment method
CN112102847B (en) * 2020-09-09 2022-08-09 四川大学 Audio and slide content alignment method

Similar Documents

Publication Publication Date Title
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN101399036B (en) Device and method for conversing voice to be rap music
CN109285537B (en) Acoustic model establishing method, acoustic model establishing device, acoustic model synthesizing method, acoustic model synthesizing device, acoustic model synthesizing equipment and storage medium
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN101739870A (en) Interactive language learning system and method
CN103503015A (en) System for creating musical content using a client terminal
Fujihara et al. Lyrics-to-audio alignment and its application
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN110691258A (en) Program material manufacturing method and device, computer storage medium and electronic equipment
CN112289300B (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN111739536A (en) Audio processing method and device
Gu et al. Mm-alt: A multimodal automatic lyric transcription system
Lee et al. Word level lyrics-audio synchronization using separated vocals
CN101266790A (en) Device and method for automatic time marking of text file
CN112750421B (en) Singing voice synthesis method and device and readable storage medium
US20080189105A1 (en) Apparatus And Method For Automatically Indicating Time in Text File
Xu et al. Automatic music summarization based on temporal, spectral and cepstral features
Kruspe et al. Retrieval of Textual Song Lyrics from Sung Inputs.
CN1153127C (en) Intelligent common spoken Chinese phonetic input method and dictation machine
Schuller et al. Incremental acoustic valence recognition: an inter-corpus perspective on features, matching, and performance in a gating paradigm
Hacioglu et al. Parsing speech into articulatory events
Zhu Multimedia recognition of piano music based on the hidden markov model
Cen et al. Segmentation of speech signals in template-based speech to singing conversion
CN1182259A (en) Pronunciation training system and method
Iriondo et al. Objective and subjective evaluation of an expressive speech corpus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20080917