CN103210447A - Method and apparatus for producing script data - Google Patents

Method and apparatus for producing script data

Info

Publication number
CN103210447A
CN103210447A CN2011800538470A CN201180053847A
Authority
CN
China
Prior art keywords
information
data
sentence
word
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011800538470A
Other languages
Chinese (zh)
Inventor
任矿淳
金仁松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of CN103210447A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 20/00 Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B 20/10 Digital recording or reproducing
    • G11B 20/10527 Audio or video recording; Data buffering arrangements
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/102 Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B 27/105 Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/34 Indicating arrangements
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/027 Syllables being the recognition units
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 20/00 Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B 20/10 Digital recording or reproducing
    • G11B 20/10527 Audio or video recording; Data buffering arrangements
    • G11B 2020/10537 Audio or video recording
    • G11B 2020/10546 Audio or video recording specifically adapted for audio data
    • G11B 2020/10555 Audio or video recording specifically adapted for audio data wherein the frequency, the amplitude, or other characteristics of the audio signal is taken into account

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and apparatus for producing script data for audio data. The method for producing the script data includes: obtaining the total time information of the actual sound sections of the audio data; obtaining the total syllable count information for the sound sections on the basis of text data; calculating unit syllable time information corresponding to one syllable on the basis of the total time information and the total syllable count information; obtaining predicted playback-position information for the corresponding sound section of the audio data on the basis of the sound sections occupied up to a word or passage for which prediction is required in the text data and the unit syllable time information; and recording, as actual playback-position information, the silent section closest to the predicted playback position among the silent sections of the audio data located before or after the predicted playback position.

Description

Method and apparatus for generating script data
Technical Field
The present invention relates to a method and an apparatus for generating script data that contains information matching the script to audio data.
Background Art
Recently, with the development of portable electronic devices, the Internet, and the like, multimedia content of many kinds has become widely available and widely used. For example, audio files for listening practice in languages such as English, Japanese, and Chinese are widely distributed and used.
When an audio file for language study exists together with a script in corresponding text form, playing the audio file from the sound section corresponding to a desired position in the script requires that the matching information be inserted into the audio file when the audio file is created.
However, conventional audio files carrying predetermined playback-position information contain only information for playback by paragraph. To play a particular sentence within a long paragraph, the user must search the audio file directly for the corresponding sound section, which makes the audio file difficult to use.
Furthermore, when an audio file stores no playback-position information, it is difficult to play the sound section of the audio file corresponding to an arbitrary word or sentence of the script. For example, with such an audio file, to play the sound section corresponding to a desired word, sentence, or passage of the script, the user must operate the rewind or fast-forward buttons manually and locate the relevant section while personally checking the content of the audio file.
Summary of the Invention
Technical Problem
The present invention has been made to solve the above problems, and a primary object of the present invention is to provide a script data generation method that easily produces information matching the sentences or words of a script to the actual sound sections of an audio file.
Another object of the present invention is to provide a script data generation apparatus that can employ the script data generation method.
Solution to Problem
To solve the above technical problem, according to one aspect of the present invention, there is provided a script data generation method that generates script data on the basis of audio data and text data recording the content of the audio data, the script data having playback-position information of the audio data corresponding to the words or sentences of the text data, the method comprising: a first step of obtaining total time information of the sound sections of the audio data; a second step of obtaining total syllable count information for the sound sections on the basis of the text data; a third step of computing, on the basis of the total time information and the total syllable count information, unit syllable time information corresponding to one syllable; a fourth step of determining interval syllable count information, that is, the number of syllables occupied by the at least one sound section preceding a word or sentence to be predicted in the text data; a fifth step of obtaining, on the basis of the interval syllable count information and the unit syllable time information, predicted playback-position information for the word or sentence to be predicted in the audio data; a sixth step of comparing the value obtained by successively adding the time information of each sound section of the audio data with the predicted playback-position information; and a seventh step of, when the value obtained by successively adding the time information of each sound section coincides with the predicted playback-position information, storing in the text data a designated position within the silent section closest to the predicted playback position as the actual playback position of the word or sentence to be predicted.
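Because the first through fifth steps are purely arithmetic, they can be summarized compactly. The following sketch, given in Python with illustrative names (the present invention specifies no particular implementation), assumes the sound sections have already been extracted as (start, end) pairs in seconds; the sixth and seventh steps are sketched later in the detailed description.

```python
# A minimal sketch of the first through fifth steps; all names are
# illustrative, and sound sections are assumed given as (start, end) pairs.
def predict_playback_position(sound_sections, total_syllables,
                              preceding_syllables):
    # First step: total time information of the sound sections (A).
    total_time = sum(end - start for start, end in sound_sections)
    # Third step: unit syllable time (C = A / B).
    unit_time = total_time / total_syllables
    # Fourth and fifth steps: the predicted playback position is the
    # syllable count preceding the target word or sentence, times C.
    return preceding_syllables * unit_time
```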
In one embodiment, the first step comprises extracting the time information of each sound section from amplitude information, the amplitude information being the amplitude of the audio data as a function of time.
In one embodiment, the first step comprises: extracting the silent sections of the audio data from amplitude information, the amplitude information being the amplitude of the audio data as a function of time; and obtaining the total time information of the sound sections on the basis of the time information of each silent section obtained by the extraction and the total time information of the audio data itself.
In one embodiment, the second step comprises obtaining the syllable count of each word or sentence of the text data on the basis of pronunciation dictionary information that transcribes the text data in a syllabary, or in characters or symbols having the properties of a syllabary.
In one embodiment, the fourth step comprises selecting words or sentences one by one in order, from the first word or sentence of the text data through the last, as the word or sentence to be predicted.
In one embodiment, the sixth step comprises adding, one by one in order, the time information of each sound section from the first sound section of the audio data, which corresponds to the first word or sentence of the text data, through the last sound section, which corresponds to the last word or sentence, while searching for the position that coincides with the predicted playback-position information.
According to another aspect of the present invention, there is provided a script data generation apparatus that generates script data on the basis of audio data and text data recording the content of the audio data, the script data having playback-position information of the audio data corresponding to the words or sentences of the text data, the apparatus comprising: an amplitude comparison unit that obtains total time information of the sound sections of the audio data; a syllable count extraction unit that obtains total syllable count information for the sound sections on the basis of the text data; a unit time calculation unit that computes, on the basis of the total time information obtained from the amplitude comparison unit and the total syllable count information obtained from the syllable count extraction unit, unit syllable time information corresponding to one syllable; a playback position prediction unit that obtains predicted playback-position information for a word or sentence to be predicted in the audio data, on the basis of the interval syllable count occupied by the at least one sound section preceding that word or sentence in the text data and the unit syllable time information; and a playback position search unit that compares the value obtained by successively adding the time information of each sound section of the audio data with the predicted playback-position information and, when the value coincides with the predicted playback-position information, stores in the text data a designated position within the silent section closest to the predicted playback position as the actual playback position of the word or sentence to be predicted.
In one embodiment, the amplitude comparison unit extracts the sound sections from amplitude information and computes the total time information from the time information of each extracted sound section, the amplitude information being the amplitude of the audio data as a function of time.
In one embodiment, the amplitude comparison unit extracts the silent sections of the audio data from amplitude information and computes the total time information of the sound sections from the time information of each extracted silent section and the total time information of the audio data itself, the amplitude information being the amplitude of the audio data as a function of time.
In one embodiment, the syllable count extraction unit obtains the syllable count of each word or sentence of the text data on the basis of pronunciation dictionary information that transcribes the text data in a syllabary or in characters or symbols having the properties of a syllabary.
In one embodiment, the playback position prediction unit selects words or sentences one by one in order, from the first word or sentence of the text data through the last, as the word or sentence to be predicted.
In one embodiment, the playback position search unit adds, one by one in order, the time information of each sound section from the first sound section of the audio data, which corresponds to the first word or sentence of the text data, through the last sound section, which corresponds to the last word or sentence, while searching for the position that coincides with the predicted playback-position information.
Advantageous Effects of the Invention
According to the present invention, it is possible to provide a script data generation method, and script data produced by the method, that simply match specific positions in a prescribed script file in text form to the corresponding sound sections of an audio file, thereby improving the usability of the audio file and the convenience of the user.
In addition, it is possible to provide a program that uses the script data of the present invention and runs on a recording and/or reproducing apparatus, or a computer-executable storage medium that stores the script data.
Furthermore, according to the present invention, playback positions in the audio file can be matched not only passage by passage but also to any sentence or word in the script. A desired playback position of the audio data can therefore be designated freely on the script data, removing the inconvenience of searching for content in the audio data by listening to it personally, and greatly improving the usefulness of the script data for purposes such as language listening practice.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of an example of a recording and/or reproducing apparatus in which the script data generation method of the present invention can be employed.
Fig. 2 is a flowchart of an embodiment of the preliminary preparation procedure that can be employed in the script data generation method of the present invention.
Fig. 3a illustrates an embodiment of text data that can be employed in the preliminary preparation procedure of Fig. 2.
Fig. 3b illustrates an embodiment of audio data that can be employed in the preliminary preparation procedure of Fig. 2.
Fig. 4 is a flowchart of a script data generation method according to an embodiment of the present invention.
Fig. 5a illustrates an embodiment of text data that can be employed in the script data generation method of Fig. 4.
Fig. 5b illustrates an embodiment of script data of the present invention generated from the text data of Fig. 5a.
Fig. 6 illustrates an embodiment of audio data that can be employed in the script data generation method of Fig. 4.
Fig. 7a and Fig. 7b illustrate an embodiment of pronunciation dictionary information that can be employed in the script data generation method of Fig. 4.
Fig. 8 is a diagram explaining the process of predicting and searching for the playback position of the audio data corresponding to a specific position of the script data in the script data generation method of Fig. 4.
Fig. 9 is a block diagram of a script data generation apparatus according to an embodiment of the present invention.
Embodiments
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an example of a recording and/or reproducing apparatus in which the script data generation method of the present invention can be employed.
Referring to Fig. 1, a recording and/or reproducing apparatus embodying the script data generation method of the present invention is a device that, under a prescribed program, plays audio data as voice or sound and displays text data (140) on a screen. For example, such recording and/or reproducing apparatus include: a computer (120) that has a display screen and is connected to speakers (110); a portable terminal (130) with computing capability, such as a smartphone; a personal digital assistant (PDA), a kind of portable computer capable of recording and playing multimedia files; and audio players, such as an MP3 player, that have a touch-panel screen and output audio data through earphones or speakers.
Such a recording and/or reproducing apparatus (hereinafter simply called a reproducing apparatus) has a storage medium. In the present embodiment, the storage medium of the reproducing apparatus stores a program embodying the script data generation method and an application for executing the program. The storage medium also stores the data for carrying out the script data generation method according to the present embodiment.
For example, the data include: time information of the actual sound sections of the audio data; total syllable count information for the sound sections, obtained on the basis of the text data; unit syllable time information corresponding to one syllable of the text data; interval syllable count information up to the sound section corresponding to each word or sentence of the text data; and actual playback-position information of the audio data corresponding to a specific word or sentence, obtained on the basis of the interval syllable count information and the unit syllable time information.
The reproducing apparatus of the present embodiment can be used for listening practice in English, Japanese, Chinese, and the like. In other words, when an audio file (or audio data), for example for English listening practice, exists together with a script in corresponding text form (or text data), the reproducing apparatus of the present embodiment can carry an algorithm or program that matches each word, sentence, or passage of the script to the corresponding sound section.
To distinguish it from the script data discussed in the detailed description of the present embodiment, the script is called the original script or the text data. Here, the text data or original script means a conventional script in which the actual playback-position information according to the present embodiment is not yet recorded.
To provide the algorithm, or the script data, embodied by the reproducing apparatus, the inventors devised a scheme that, given audio data in a prescribed language such as English, Japanese, or Chinese and an original script in corresponding text form, determines the syllable count of the sound sections of the audio data on the basis of a syllabary corresponding to the original script, predicts from it the playback position in the audio data of a specific passage, sentence, or word of the original script, and can thereby generate script data in which the original script and the audio data are matched.
In the script data generation scheme, in order to determine the syllable count of the sound sections of the audio data, a preliminary operation of converting the original script into a syllabary, or into corresponding characters or symbols, is carried out first. Here, the syllabary, or the corresponding characters or symbols, is any writing system that can transcribe the language of the original script according to its pronunciation, and includes the kana of Japanese, the Hangul of Korean, the Cypriot syllabary, the Cherokee syllabary of Native Americans, and any other syllabary used now or in the past anywhere in the world, or at least a writing system having the properties of a syllabary. Here, a syllabary means a writing system in which one character represents one syllable of sound.
In one embodiment, the inventors, noting that Hangul, unlike English orthography, is a writing system with the properties of a syllabary, transcribed the pronunciation corresponding to each English word into Hangul, predicted from it how long each English word of the original script takes to pronounce, and selected from the audio file the section whose length best fits the prediction, thereby providing a script data generation method that matches the original script and the audio file at arbitrary positions. When the syllable count of each word or sentence of an English original script is to be determined in Hangul syllables, for example using an English-Korean pronunciation dictionary, length marks (macrons) are used in the Korean pronunciation to improve the prediction accuracy.
In addition, as another preliminary operation, a prepared voice file is divided in advance into pieces of voice information corresponding to the individual listening passages, for example the individual questions of an English listening test. Here, the voice information corresponding to each passage, or to at least one sentence of each passage, is called audio data.
This process is carried out so that, when a voice file containing a plurality of listening passages is provided as a single file, the spread of error that occurs when prediction is performed over the whole file is prevented, and the adverse influence of unwanted sound sections inserted mainly between passages, such as guidance announcements, telephone ring tones, and background noise, is reduced.
The preliminary operation is described in more detail below, as shown in Fig. 2. Fig. 2 is a flowchart of an embodiment of the preliminary preparation procedure that can be employed in the script data generation method of the present invention.
In the detailed description below, the case where Korean, which has the properties of a syllabary, is used as the means of predicting the syllable count of the words or sentences of the original script is taken as an example.
Referring to Fig. 2, an original script in text form is first input to the reproducing apparatus embodying the script data generation method according to an embodiment of the present invention (S210). The original script may be text data corresponding to the examination paper illustrated in Fig. 3a (see Fig. 5a).
Next, the number of listening passages (questions) is determined (S215). The number of listening passages is determined from the blank lines between adjacent text passages of the original script, or from the relatively long silent sections inserted between adjacent passages of the voice file.
Next, it is judged whether the voice file is provided passage by passage (S220). If the result of this step (S220) is that the voice file is not provided passage by passage, the voice file is input to a designated device, for example the device carrying out the preliminary preparation of the script data generation method of the present embodiment, so that the following prescribed operations can be performed on the voice file containing a plurality of passages (S225).
Next, the silent sections, in which no sound occurs, are extracted on the basis of the amplitude-versus-time information of the voice file (S230).
Next, the silent sections are sorted by length, for example by duration in descending order (S235). When the silent sections are sorted, the long silent sections between passages rank above the relatively short silent sections within a passage or sentence, so the number of listening passages (questions) can easily be determined from them.
Next, sections corresponding in number to the listening passages are selected, and the single voice file is separated passage by passage (S240). The purpose of this step is to determine the number of listening passages from the number of relatively clearly distinguishable long silent sections, and to prepare the corresponding voice information passage by passage.
Next, it is judged whether each piece of voice information coincides with each passage of the actual voice file (S245). If the result of the judgment in this step (S245) is that each piece of voice information coincides with each passage of the actual voice file, the preparation of the voice information corresponding to each listening passage of the voice file is complete (S250). The prepared voice information is stored in a prescribed memory or storage device of the reproducing apparatus.
On the other hand, if the result of the judgment in the above step (S245) is that the pieces of voice information and the passages of the actual voice file do not coincide, a correction operation on the voice information is carried out (S255). The case where they do not coincide arises mainly when the voice file contains guidance announcements, sound effects, or other material unrelated to the script; in such a case, the user can have the sound sections unrelated to the script skipped or excluded in the matching process. The correction operation can be carried out through the user interface of an audio editing program installed on the reproducing apparatus. When the correction operation brings each piece of voice information into agreement with the actual script passages, the preparation of the voice information corresponding to each listening passage (question) is complete (S250).
On the other hand, if the result of the judgment in the above step (S220) is that the voice file is provided passage by passage, each voice file provided passage by passage is input (S260). According to this step (S260), a voice file exists for each passage, so that when a prescribed passage (question) is selected in the script, playback can move to the corresponding voice file; there is no need to generate voice information passage by passage, and the preliminary operation can thus be completed.
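Under the stated assumptions, the passage-separation steps (S230 to S240) can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the silent sections are assumed given as (start, end) pairs, the passage count is assumed known from step S215, and cutting at the midpoint of each boundary silence is one arbitrary choice.

```python
# Hypothetical sketch of S235-S240: use the longest silent sections as
# passage boundaries and split one voice file into per-passage ranges.
def split_into_passages(silent_sections, total_length, passage_count):
    # S235: sort silent sections by duration in descending order; the
    # long silences between passages outrank the short pauses within one.
    boundaries = sorted(silent_sections,
                        key=lambda s: s[1] - s[0],
                        reverse=True)[:passage_count - 1]
    # S240: cut at the midpoint of each boundary silence, in time order.
    cuts = sorted((start + end) / 2 for start, end in boundaries)
    edges = [0.0] + cuts + [total_length]
    return list(zip(edges[:-1], edges[1:]))  # one (start, end) per passage
```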
Fig. 3a illustrates an embodiment of text data that can be employed in the preliminary preparation procedure of Fig. 2. Fig. 3b illustrates an embodiment of audio data that can be employed in the preliminary preparation procedure of Fig. 2.
As shown in Fig. 3a, the preliminary preparation for carrying out the script data generation method according to the present embodiment can be applied to an examination paper (300) for the foreign-language (English) section of a 2011 mock College Scholastic Ability Test, or to the corresponding original script. For example, the data are prepared so that when the square mark (322) in front of the second question (320) is selected with an input device such as a mouse or a touch panel, the corresponding sound section of the audio data is played.
The voice information of the listening passages for the second through seventh questions of the paper (300) (the third through sixth questions are omitted from the figure for convenience) is shown in Fig. 3b. For example, in Fig. 3a and Fig. 3b, the second through seventh questions of the paper (300) correspond to the six sound sections (P2 to P7) of the voice file (380) containing the voice information of the respective passages.
According to the present embodiment, when a plurality of pieces of voice information for listening passages exist in a single voice file, the silent sections (here, silent sections S1 to S6) are extracted on the basis of the amplitude-versus-time information of the voice file, and on the basis of the extracted silent sections the six listening passages (P2 to P7) corresponding to the second through seventh questions are distinguished or separated passage by passage, so that the voice information is prepared passage by passage.
Fig. 4 is a flowchart of a script data generation method according to an embodiment of the present invention.
Referring to Fig. 4, in the script data generation method of the present embodiment, the voice information to be worked on in a voice file is first selected (S410). In this step (S410), the voice information may be the voice information of a specific passage obtained from a voice file containing a plurality of listening passages by the preliminary preparation shown in Fig. 2, or a voice file containing a single listening passage. A voice file containing the voice information of one passage, or one piece of voice information, corresponds to the audio data.
Next, in the selected audio data, the total time of the sound sections in which actual sound occurs is determined on the basis of the amplitude information (S415). The purpose of this step is to remove the silent sections from the voice information of the passage concerned, for example one question section of a listening test, and to determine the total time of all the sound sections in which sound actually occurs [see the white waveform sections of Fig. 6(b)].
Distinguishing the sound sections from the silent sections on the basis of the amplitude information may include taking sections at or above a prescribed reference amplitude as sound sections. For example, taking the loudest sound as 100%, sections with sound at or below 3%, 5%, or 7% can be judged to be silent sections.
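As one possible reading of this step, the segmentation can be sketched as follows. The frame length, the 5% default threshold, and the minimum pause length are assumptions for illustration, not values fixed by the present invention.

```python
import numpy as np

# Illustrative amplitude-threshold silence detection over mono samples.
def find_silent_sections(samples, sample_rate, threshold_ratio=0.05,
                         frame_seconds=0.01, min_silence=0.02):
    peak = np.max(np.abs(samples))              # the loudest sound is 100%
    frame = int(frame_seconds * sample_rate)
    limit, sections, start = peak * threshold_ratio, [], None
    for i in range(0, len(samples), frame):
        t = i / sample_rate
        loud = np.max(np.abs(samples[i:i + frame])) > limit
        if not loud and start is None:
            start = t                           # a silent run begins here
        elif loud and start is not None:
            if t - start >= min_silence:        # keep only real pauses
                sections.append((start, t))
            start = None
    if start is not None:
        sections.append((start, len(samples) / sample_rate))
    return sections
```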
Next, the total syllable count of the sound sections is determined on the basis of the original script corresponding to the audio data (S420).
In this step (S420), the total syllable count of the sound sections can be obtained by determining, for example, the Hangul syllables of the English words in the original script. Because Hangul has the properties of a syllabary, Hangul syllables can be used as units that agree well between speaker and listener.
Furthermore, in this step (S420), in one embodiment the total syllable count of the sound sections can be determined on the basis of English-Korean pronunciation dictionary information that records the Hangul syllables corresponding to each English word. The English-Korean pronunciation dictionary information may include the pronunciation dictionary information shown in Fig. 7a.
Next, the total time obtained in the above step (S415) (hereinafter called A) is divided by the total syllable count obtained in the other step above (S420) (hereinafter called B) to compute the unit syllable time corresponding to one syllable (hereinafter called C) (S425). The unit syllable time obtained in this step is expressed by the following Mathematical Expression 1.
[Mathematical Expression 1]
C = A / B
Next, the interval syllable count occupied by the sound sections preceding the specific word or sentence to be predicted is determined (S430). Here, to distinguish it from the total syllable count, the syllable count determined here is called the interval syllable count.
Next, the interval syllable count is multiplied by the unit syllable time (C) to predict the time occupied by the sound sections preceding the word or sentence concerned (S435). In this step (S435), the time occupied by the sound sections preceding the predicted word or sentence corresponds to the predicted playback position of that word or sentence.
Next, the times of the sound sections of the audio data are added one by one in order from the beginning, until the predicted playback position of the word or sentence concerned is reached (S440).
Next, it is judged whether the sum of the sound-section times obtained in the above step (S440) coincides with the predicted playback position of the predicted playback-position information (S445). If the result of the judgment in this step (S445) is that they do not coincide, the process returns to the previous step (S440), the next sound section is added, and it is again judged whether the sum of the sound-section times coincides with the predicted playback position (S445). These two steps (S440, S445) are repeated until the sum of the sound-section times substantially coincides with the predicted playback position.
Next, if the result of the judgment in the step (S445) is that the sum of the sound-section times coincides with the predicted playback position, the silent section closest to the predicted playback position is searched for, and the closest silent section found is stored as the actual playback position of the word or sentence concerned (S450).
The silent section closest to the predicted playback position can be searched for as follows: the sound sections and silent sections of the audio data are added one by one in chronological order, the sound section containing the predicted playback position is extracted, and the distances from the predicted playback position to the two silent sections on either side of the extracted sound section are compared. Here, the sum of the sound-section times up to the predicted playback position of the audio data coincides with the time of the predicted playback position. In other words, the method according to the present embodiment can exclude the error introduced by the silent sections in the audio data.
The actual playback position may be a designated position within the silent section concerned, for example the start time, an intermediate time, or the end time of the silent section.
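Combining S440 through S450, the search can be sketched as follows. This sketch assumes the audio begins with a sound section, so that silent sections k-1 and k flank sound section k, and that the predicted position is measured in accumulated sound time as described above; the anchor argument reflects the start/middle/end choice just mentioned.

```python
# Illustrative sketch of S440-S450: map the predicted position onto the
# timeline, then store a designated point of the nearest silent section.
def locate_actual_position(sound_sections, silent_sections, predicted,
                           anchor="start"):
    voiced = 0.0
    for k, (start, end) in enumerate(sound_sections):
        if voiced + (end - start) >= predicted:
            break                    # this sound section holds the prediction
        voiced += end - start
    point = start + (predicted - voiced)   # prediction on the real timeline
    # Compare the silent sections on either side of that sound section.
    flanks = [silent_sections[j] for j in (k - 1, k)
              if 0 <= j < len(silent_sections)]
    s0, s1 = min(flanks, key=lambda s: min(abs(s[0] - point),
                                           abs(s[1] - point)))
    # The designated position may be the start, middle, or end time.
    return {"start": s0, "middle": (s0 + s1) / 2, "end": s1}[anchor]
```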
Next, it is judged whether the processing of all the sentences in the audio data is finished (S455). If the result of the judgment in this step (S455) is that the prediction and search are not finished for all sentences, the process returns to the earlier step (S430): the next word or sentence in the audio data is selected, the interval syllable count occupied by the sound sections preceding that next word or sentence to be predicted is obtained, the playback position of the corresponding sound section is predicted on the basis of the interval syllable count and the unit syllable time information, and the subsequent process of searching for the actual playback position is carried out.
Next, if the result of the judgment in the step (S455) is that the processing of all sentences is finished, it is judged whether the prediction and search according to the above process are finished for the remaining audio data, that is, the voice information of the remaining passages (S460).
If the result of the judgment in the above step (S460) is that the processing of all the audio data is not finished, the process returns to the first step (S410): the voice information corresponding to the next passage of the voice file being worked on is selected as the next audio data, and the process from that step onward is carried out for the next audio data. This process continues until the processing of all the audio data is finished.
Finally, if the result of the judgment in the above step (S460) is that the processing of all the audio data is finished, the matching of the voice file concerned and its original script is complete (S465). When the matching is complete, the corresponding playback-position information for each passage, sentence, or word of the text data can be stored in the text data, and the script data according to the present embodiment can thereby be generated.
Below, the matching process between a voice file and the corresponding original script of the present embodiment is described by taking as an example the matching of specific audio data in the voice file [see Fig. 6(b)] with the corresponding original script (see Fig. 5a).
Fig. 5a illustrates an embodiment of text data that can be employed in the script data generation method of Fig. 4. Fig. 5b illustrates an embodiment of script data of the present invention generated from the text data of Fig. 5a. Fig. 6 illustrates an embodiment of audio data that can be employed in the script data generation method of Fig. 4. Fig. 7a and Fig. 7b illustrate an embodiment of pronunciation dictionary information that can be employed in the script data generation method of Fig. 4.
Referring to Fig. 5a, the text data (500) is the original script corresponding to the second question of the 2011 mock College Scholastic Ability Test foreign-language (English) paper illustrated in Fig. 3a.
To listen again to the sound section of the corresponding voice file for a desired passage, sentence, or word of such an original script (500), the user faces the difficulty of having to search the voice file personally for the corresponding sound section.
However, according to the script data (500a) of the present embodiment shown in Fig. 5b, playback can move to the corresponding position in the voice file according to the playback-position information (530) for a sentence in the listening passage or the playback-position information (540) for a word, so that the corresponding sound section is played directly. At least one of these two pieces of playback-position information (530, 540) is held in the script data in the form of link information based on a hypertext protocol scheme, so that it is not visible in the actual script data (500a).
For example, as shown in Fig. 6, suppose a voice file (601) contains the voice information (610, 611, 612, 613, 614, 615) of a plurality of listening passages, and the sound section of a script sentence (550 of Fig. 5b), or of a word or phrase within that sentence, is to be predicted, the sentence corresponding to the audio data (620) of a specific sound section in the voice information of a specific passage, for example the first voice information (610). First, the total time of the actual sound sections of the audio data (620) of the sentence (550) is determined.
In the present embodiment, the total length of the audio data (620) being worked on is about 1.952 seconds, and the silent sections (632, 634) run from about 0.623 seconds to about 0.646 seconds and from about 1.413 seconds to about 1.439 seconds, so the total time of the actual sound sections is about 1.903 seconds, obtained by subtracting the total silent-section time of about 0.049 seconds from the total time of the audio data (620) of about 1.952 seconds.
Next, the total syllable count of the sentence (550) is obtained. The syllable count of the English sentence (550) of the present embodiment, "How do you like your new job, Chelsea?", can be obtained by reference to a phonetic transcription, more specifically to pronunciation dictionary information in a syllabary. In the present embodiment, as shown in Fig. 7a and Fig. 7b, a table (710) recording the Hangul syllables of the sentence (550) or a table (720) recording the Japanese kana syllables of the sentence (550) can be used as the pronunciation dictionary information. In the present embodiment, the syllable count of the sentence (550) is 16.5.
Next, the unit syllable time corresponding to one syllable is obtained. In the present embodiment, the unit syllable time is about 1.903 seconds divided by 16.5, that is, about 0.115333 seconds.
Next, the interval syllable count of the sentence or word to be predicted is obtained. The interval syllable count is then multiplied by the unit syllable time to obtain the predicted playback position, and the silent section closest to the predicted playback position is searched for, whereby the actual playback-position information of the sentence or word is obtained.
Incidentally, when syllable counts are obtained from the original script of Fig. 5b, it is preferable that Korean annotations and specific speaker labels such as "M:" and "W:" be omitted or excluded. It is of course still more preferable that the original script contain no such Korean annotations or speaker labels.
Below, the prediction and search process for the playback position is described in further detail with reference to Fig. 8.
Fig. 8 is a diagram explaining the process of predicting and searching for the playback position of the audio data corresponding to a specific position of the script data in the script data generation method of Fig. 4.
First, assume that the audio data (620) shown in Fig. 8(a) exist. As recorded below, the audio data (620) have sound sections corresponding to the prescribed sentence (550 of Fig. 5b):
"How do you like your new job, Chelsea?"
Assume further that the word to be predicted is "Chelsea?".
An embodiment of the process of predicting and searching for the sound section of the audio data corresponding to the word "Chelsea?" is now described in detail.
In the present embodiment, the audio data (620) have three sound sections (810, 820, 830) and two silent sections (632, 634). The first sound section (810) corresponds to the first phrase of the sentence (550), "How do you like", the second sound section (820) corresponds to the second phrase, "your new job", and the third sound section (830) corresponds to the third phrase, "Chelsea?".
First, the total time of the sound sections of the audio data (620) is determined. The total time is the sum of the times of the three sound sections (810, 820, 830). In the present embodiment, the first sound section (810) runs from 0 to about 0.623 seconds, the second sound section (820) from about 0.646 seconds to about 1.413 seconds, and the third sound section (830) from about 1.439 seconds to about 1.952 seconds. Accordingly, the total time of the sound sections is about 1.903 seconds.
Next, the total syllable count of the sound sections of the audio data (620) is determined on the basis of the sentence (550). The total syllable count can be obtained, for example, by using Hangul syllables as follows (see Fig. 7a):
(Hangul transcription of the pronunciation of "How do you like your new job, Chelsea?", shown as an image in the original document)
If a length mark (-) in the Hangul transcription above is assumed to count as 0.5 syllable, then in the present embodiment the total syllable count of the audio data, that is, of the sentence (550), is 16.5 syllables.
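A counter consistent with this counting rule can be sketched as follows. The rule that a length mark contributes 0.5 is taken from the text, while everything else (treating any precomposed Hangul block as one syllable, ignoring spaces and punctuation) is an assumption.

```python
# Illustrative Hangul syllable counter: each syllable block counts as 1,
# each length mark "-" as 0.5; other characters are ignored.
def count_syllables(transcription):
    count = 0.0
    for ch in transcription:
        if '\uAC00' <= ch <= '\uD7A3':   # precomposed Hangul syllable block
            count += 1.0
        elif ch == '-':                  # length mark: half a syllable
            count += 0.5
    return count

# For the transcription of the sentence (550), count_syllables(...) == 16.5.
```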
Next, the time required to pronounce one syllable of the audio data (620) is calculated. In the present embodiment, the time required to pronounce one syllable is calculated as the total time of the sound sections divided by the total syllable count. That is,
1.903 seconds / 16.5 syllables = 0.115333 seconds per syllable
Next, the interval syllable count occupied by the sound sections up to the word to be predicted is determined. In the present embodiment, the sound sections preceding the word "Chelsea?" comprise the first sound section (810) and the second sound section (820). The syllable count and the length of these two sound sections are therefore determined as follows.
The first phrase, "How do you like", is 7 Hangul syllables, so the time of the sound section corresponding to this phrase is predicted as:
7 syllables × 0.115333 seconds per syllable = 0.807331 seconds
The second phrase, "your new job", is 6 Hangul syllables, so the time of the sound section corresponding to this phrase is predicted as:
6 syllables × 0.115333 seconds per syllable = 0.691998 seconds
This means that the sound sections corresponding to the first and second phrases together occupy about 1.499329 seconds of sound sections. In the present embodiment, the predicted time corresponds to the predicted playback position, so the predicted playback position is about 1.499329 seconds.
Next, for the speech waveform of the audio data (620), the lengths of the sound sections up to each silent section (632; 634) are analyzed in order from the beginning. Here, the analysis proceeds until the silent section whose cumulative sound-section length is closest to the time predicted by the above process, that is, the predicted playback position, is found.
That is, as shown in Fig. 8(b) and (c), in the audio data (620) of the sentence (550), the length of the sound section (810) up to the first silent section (632) is about 0.623 seconds. The cumulative length of the sound sections (810, 820) up to the second silent section (634) is about 1.390 seconds. And the cumulative length of the sound sections (810, 820, 830) up to the third silent section (not shown) is about 1.903 seconds.
Therefore, when the second silent section (634) and the third silent section are compared, the silent section closest to the predicted playback position is the second silent section (634). That is, the second silent section, which is the closest to the sound-section position corresponding to the predicted playback position of about 1.499329 seconds, is stored as the actual playback position corresponding to the word "Chelsea?". Here, the actual playback position can be chosen as any time or position within the second silent section, which runs from about 1.413 seconds to about 1.439 seconds. The actual playback position found is combined with the word, sentence, or passage in hyperlink form and stored in the original script.
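The whole worked example can be checked numerically with the values given above; locate_actual_position is the sketch given earlier, and the section boundaries are the approximate figures from the text.

```python
# Reproducing the worked example with the approximate values from the text.
sound = [(0.0, 0.623), (0.646, 1.413), (1.439, 1.952)]  # sections 810/820/830
silence = [(0.623, 0.646), (1.413, 1.439)]              # sections 632/634

total = sum(end - start for start, end in sound)  # about 1.903 seconds
unit = total / 16.5                               # about 0.115333 s/syllable
predicted = (7 + 6) * unit                        # about 1.499329 seconds

# Cumulative sound length is about 0.623 s at the first silence and about
# 1.390 s at the second, so the second silence (1.413-1.439 s) is closer:
actual = locate_actual_position(sound, silence, predicted)  # about 1.413 s
```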
With the above configuration, the sound section corresponding to the word "Chelsea?" of the original script can be found easily and accurately in the audio data content. When this kind of analysis is applied to every word, sentence, or passage of the original script, the whole original script can easily be matched with the voice file, and the matching information for each word, sentence, or passage of the audio data, that is, the playback-position information, can be inserted into the original script to generate the script data according to the present embodiment.
The generated script data can be used not only as script data corresponding to the audio data itself, but can also be carried on, or attached to, a computer-executable storage medium as a program, or used by a reproducing and/or recording apparatus that executes such a storage medium and stores the audio data or the script data.
Incidentally, in the above detailed description, the pronunciation dictionary information of Fig. 7a and Fig. 7b serves only to illustrate an embodiment of the invention and does not limit the invention.
Fig. 9 is a block diagram of a script data generation apparatus according to an embodiment of the present invention.
Referring to Fig. 9, the script data generation apparatus (900) of the present embodiment is a device that generates, on the basis of audio data and text data recording the content of the audio data in written form, script data having playback-position information of the audio data corresponding to the words, sentences, or passages of the text data.
In one embodiment, the script data generation apparatus (900) has a microprocessor (910) and a storage device (920), and the microprocessor (910) carries out the script data generation process according to a program and/or application stored in the storage device (920).
Such a script data generation apparatus (900) has, as functional units implemented by the microprocessor (910) or operated according to a program or application executed by the microprocessor (910), an amplitude comparison unit (930), a syllable count extraction unit (940), a unit time calculation unit (950), a playback position prediction unit (960), and a playback position search unit (970).
The amplitude comparison unit (930) obtains the total time information of the actual sound sections of the voice information or audio data.
In one embodiment, the amplitude comparison unit (930) extracts the actual sound sections on the basis of the amplitude-versus-time information of the audio data and obtains the total time information from the extracted sound sections.
In another embodiment, the amplitude comparison unit (930) extracts the silent sections of the audio data on the basis of the amplitude-versus-time information of the audio data, obtains the total silent-section time information from the extracted silent sections, and computes the total time information of the sound sections by subtracting the total silent-section time information from the total time of the audio data.
The syllable count extraction unit (940) obtains the total syllable count information of the sound sections on the basis of syllabary information corresponding to the characters of the original script or text data.
In one embodiment, the syllable count extraction unit (940) obtains the syllable count of each word or sentence of the text data on the basis of pronunciation dictionary information that transcribes the text data in a syllabary. The syllabary may include Japanese, which can record each word of the audio data in syllabic units, or Korean, which has the properties of such a syllabary.
The unit time calculation unit (950) computes unit syllable time information corresponding to one syllable on the basis of the sound-section time information obtained from the amplitude comparison unit (930) and the total syllable count information obtained from the syllable count extraction unit (940).
The playback position prediction unit (960) obtains, from the amplitude comparison unit (930) and the syllable count extraction unit (940), the interval syllable count information occupied by the sound sections preceding each sentence, or preceding the word, sentence, or passage to be predicted, together with the unit syllable time information, and obtains from this information the length of the sound sections of the audio data corresponding to that word, sentence, or passage, that is, the predicted playback-position information.
The playback position search unit (970), on the basis of the predicted playback position obtained from the playback position prediction unit (960), determines as the actual playback-position information the silent section closest to the predicted playback position among the silent sections of the audio data located before and after the predicted playback position.
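How the five units fit together can be sketched as follows. The class layout and the function arguments are invented for illustration (the text specifies only the data flow), and find_silent_sections, count_syllables, and locate_actual_position refer to the sketches given earlier.

```python
# Hypothetical composition of the functional units (930-970); only the
# data flow between the units is taken from the text above.
class ScriptDataGenerator:
    def __init__(self, find_sections, count_syllables):
        self.find_sections = find_sections   # amplitude comparison unit (930)
        self.count = count_syllables         # syllable count extraction unit (940)

    def match(self, samples, rate, script_units):
        sound, silence = self.find_sections(samples, rate)       # 930
        total = sum(self.count(u) for u in script_units)         # 940
        unit_time = sum(e - s for s, e in sound) / total         # 950
        positions, preceding = [], 0.0
        for unit in script_units:
            predicted = preceding * unit_time                    # 960
            positions.append(                                    # 970
                locate_actual_position(sound, silence, predicted))
            preceding += self.count(unit)
        return positions            # one actual playback position per unit
```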
The script data generation apparatus (900) of the present embodiment can be mounted on the recording and/or reproducing apparatus illustrated in Fig. 1 and used in a form that generates the script data described above.
The present invention has been described above with reference to preferred embodiments, but the invention is not limited to those embodiments, and those skilled in the art can make various corrections and modifications from the above description. The present invention should therefore be interpreted with reference to the entire description, including the appended claims and the accompanying drawings, and all such equivalent corrections and modifications belong within the spirit and scope of the present invention.

Claims (12)

1. A script data generation method that generates script data on the basis of audio data and text data recording the content of the audio data, the script data having playback-position information of the audio data corresponding to words or sentences of the text data, the script data generation method comprising:
a first step of obtaining total time information of the sound sections of the audio data;
a second step of obtaining total syllable count information for the sound sections on the basis of the text data;
a third step of computing, on the basis of the total time information and the total syllable count information, unit syllable time information corresponding to one syllable;
a fourth step of determining interval syllable count information occupied by at least one sound section preceding a word or sentence to be predicted in the text data;
a fifth step of obtaining, on the basis of the interval syllable count information and the unit syllable time information, predicted playback-position information for the word or sentence to be predicted in the audio data;
a sixth step of comparing a value obtained by successively adding the time information of each sound section of the audio data with the predicted playback-position information; and
a seventh step of, when the value obtained by successively adding the time information of each sound section coincides with the predicted playback-position information, storing in the text data a designated position within the silent section closest to the predicted playback position as the actual playback position of the word or sentence to be predicted.
2. The script data generation method according to claim 1, wherein the first step comprises extracting the time information of each sound section from amplitude information, the amplitude information being the amplitude of the audio data as a function of time.
3. The script data generation method according to claim 1, wherein the first step comprises: extracting the silent sections of the audio data from amplitude information, the amplitude information being the amplitude of the audio data as a function of time; and obtaining the total time information of the sound sections from the time information of each silent section obtained by the extraction and the total time information of the audio data itself.
4. The script data generation method according to claim 1, wherein the second step comprises obtaining the syllable count of each word or sentence of the text data from pronunciation dictionary information of the text data transcribed in a syllabary.
5. The script data generation method according to claim 1, wherein the fourth step comprises selecting words or sentences one by one in order, from the first word or sentence of the text data through the last, as the word or sentence to be predicted.
6. The script data generation method according to claim 5, wherein the sixth step comprises adding, one by one in order, the time information of each sound section from the first sound section of the audio data, corresponding to the first word or sentence of the text data, through the last sound section of the audio data, corresponding to the last word or sentence, while searching for the position that coincides with the predicted playback-position information.
7. A script data generation apparatus that generates, based on audio data and text data recording the content of the audio data, script data containing reproduction position information of the audio data corresponding to the words or sentences of the text data, the apparatus comprising:
an amplitude comparing unit that obtains total time information of the voiced intervals of the audio data;
a syllable count extraction unit that obtains, based on the text data, total syllable count information for the voiced intervals;
a unit time computation unit that computes unit syllable time information corresponding to a single syllable from the total time information obtained by the amplitude comparing unit and the total syllable count information obtained by the syllable count extraction unit;
a reproduction position prediction unit that obtains a predicted reproduction position in the corresponding portion of the audio data, based on the interval syllable count occupied by the at least one voiced interval preceding a word or sentence to be predicted in the text data and on the unit syllable time information; and
a reproduction position search unit that compares the predicted reproduction position with the values obtained by successively summing the time information of each voiced interval of the audio data and, when such a value matches the predicted reproduction position, stores in the text data a designated position within the silent interval closest to the predicted reproduction position as the actual reproduction position of the word or sentence to be predicted.
8. The script data generation apparatus according to claim 7, wherein the amplitude comparing unit extracts the voiced intervals from amplitude information and computes the total time information from the time information of each extracted voiced interval, the amplitude information being the amplitude-over-time information of the audio data.
9. The script data generation apparatus according to claim 7, wherein the amplitude comparing unit extracts the silent intervals of the audio data from amplitude information and computes the total time information of the voiced intervals from the time information of each extracted silent interval and the total time information of the audio data itself, the amplitude information being the amplitude-over-time information of the audio data.
10. The script data generation apparatus according to claim 7, wherein the syllable count extraction unit obtains the syllable count of a word or sentence of the text data from pronouncing dictionary information in which the text data is transcribed in a syllabic script.
11. The script data generation apparatus according to claim 7, wherein the reproduction position prediction unit selects words or sentences one by one, from the first word or sentence of the text data through the last, as the word or sentence to be predicted.
12. The script data generation apparatus according to claim 11, wherein the reproduction position search unit successively sums the time information of each voiced interval, from the first voiced interval of the audio data corresponding to the first word or sentence of the text data through the final voiced interval corresponding to the last word or sentence, while searching one by one for the position that matches the predicted reproduction position.
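Claims 2, 3, 8 and 9 locate the voiced and silent intervals purely from the amplitude-over-time information of the audio. A minimal sketch, assuming normalised mono samples; the parameter names and default values here are the editor's assumptions, not values from the patent:

    import numpy as np

    def extract_intervals(samples, rate, threshold=0.02, min_gap=0.15):
        """Split audio into voiced and silent intervals by amplitude.

        samples   -- mono samples normalised to [-1.0, 1.0]
        rate      -- sample rate in Hz
        threshold -- amplitude below which a sample counts as silence
        min_gap   -- shortest quiet run (seconds) kept as a silent interval
        """
        quiet = np.abs(samples) < threshold
        # Boundaries where the signal flips between quiet and voiced.
        edges = np.flatnonzero(np.diff(quiet.astype(np.int8)))
        bounds = np.concatenate(([0], edges + 1, [len(samples)]))
        silent = []
        for a, b in zip(bounds[:-1], bounds[1:]):
            if quiet[a] and (b - a) / rate >= min_gap:
                silent.append((a / rate, b / rate))
        # Voiced intervals are the complement of the silent ones, so the
        # total voiced time equals the total running time minus the summed
        # silent time -- the computation recited in claims 3 and 9.
        voiced, cursor = [], 0.0
        for start, end in silent:
            if start > cursor:
                voiced.append((cursor, start))
            cursor = end
        total = len(samples) / rate
        if cursor < total:
            voiced.append((cursor, total))
        return voiced, silent

Claims 2 and 8 measure the voiced intervals directly; claims 3 and 9 instead sum only the silent intervals and subtract them from the total running time of the audio. The complement construction above realises the latter, and either route yields the same total time information.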
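Claims 4 and 10 take syllable counts not from the spelling of the text but from a pronouncing dictionary whose entries are written in a syllabic script, so counting syllables reduces to counting written units. A sketch under that assumption, using an invented two-entry dictionary with Hangul transcriptions, where one precomposed Hangul block is one syllable; the dictionary contents and lookup scheme are hypothetical:

    # Hypothetical pronouncing dictionary: spelling -> syllabic transcription.
    PRONOUNCING_DICT = {
        "script": "스크립트",  # 4 syllable blocks
        "data": "데이터",      # 3 syllable blocks
    }

    def count_syllables(word):
        """Count syllables via the word's syllabic-script transcription."""
        transcription = PRONOUNCING_DICT.get(word.lower())
        if transcription is None:
            raise KeyError(f"no pronunciation recorded for {word!r}")
        # One precomposed Hangul block (U+AC00..U+D7A3) is one syllable.
        return sum(1 for ch in transcription if "\uAC00" <= ch <= "\uD7A3")

    def sentence_syllables(sentence):
        return sum(count_syllables(word) for word in sentence.split())

Under this toy dictionary, sentence_syllables("script data") returns 7, which is the kind of per-sentence count consumed by the fourth step of claim 1.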
CN2011800538470A 2010-11-10 2011-11-09 Method and apparatus for producing script data Pending CN103210447A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
KR1020100111615A KR101030777B1 (en) 2010-11-10 2010-11-10 Method and apparatus for producing script data
KR10-2010-0111615 2010-11-10
PCT/KR2011/008522 WO2012064110A2 (en) 2010-11-10 2011-11-09 Method and apparatus for producing script data

Publications (1)

Publication Number Publication Date
CN103210447A (en) 2013-07-17

Family ID=44365384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011800538470A Pending CN103210447A (en) 2010-11-10 2011-11-09 Method and apparatus for producing script data

Country Status (3)

Country Link
KR (1) KR101030777B1 (en)
CN (1) CN103210447A (en)
WO (1) WO2012064110A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282941A * 2021-12-20 2022-04-05 MIGU Music Co., Ltd. Method, device and equipment for determining advertisement insertion position and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002318580A (en) * 2001-04-23 2002-10-31 Sony Corp Speech reproducing device, speech reproducing method, speech reproducing program, speech reproducing program storage medium and data storage medium
US7346506B2 (en) * 2003-10-08 2008-03-18 Agfa Inc. System and method for synchronized text display and audio playback
JP5313466B2 * 2007-06-28 2013-10-09 Nuance Communications, Inc. Technology to display audio content in sync with audio playback
JP5246948B2 * 2009-03-27 2013-07-24 KDDI Corporation Subtitle shift correction device, playback device, and broadcast device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1297561A * 1999-03-25 2001-05-30 Matsushita Electric Industrial Co., Ltd. Speech synthesizing system and speech synthesizing method
JP2005189454A (en) * 2003-12-25 2005-07-14 Casio Comput Co Ltd Text synchronous speech reproduction controller and program
CN1661673A * 2004-02-27 2005-08-31 Yamaha Corporation Speech synthesizer, method and recording medium for speech recording synthetic program
JP2010157816A (en) * 2008-12-26 2010-07-15 Toshiba Corp Subtitle information generating device, subtitle information generating method, and program

Also Published As

Publication number Publication date
WO2012064110A2 (en) 2012-05-18
KR101030777B1 (en) 2011-05-25
WO2012064110A3 (en) 2012-07-12

Similar Documents

Publication Publication Date Title
KR101324910B1 (en) Automatically creating a mapping between text data and audio data
Durand et al. The Oxford handbook of corpus phonology
JP5463385B2 (en) Automatic creation of mapping between text data and audio data
Schultz et al. Multilingual speech processing
US20200294487A1 (en) Hands-free annotations of audio text
US8352269B2 (en) Systems and methods for processing indicia for document narration
US8346557B2 (en) Systems and methods document narration
MacKenzie et al. Assessing the accuracy of existing forced alignment software on varieties of British English
CN105869446B An electronic reading device and a voice reading loading method
US20140250355A1 (en) Time-synchronized, talking ebooks and readers
KR20160111275A (en) Foreign language learning system and foreign language learning method
US20140013192A1 (en) Techniques for touch-based digital document audio and user interface enhancement
Alghamdi et al. Saudi accented Arabic voice bank
CN113901186A (en) Telephone recording marking method, device, equipment and storage medium
CN103210447A (en) Method and apparatus for producing script data
Arawjo et al. Typetalker: A speech synthesis-based multi-modal commenting system
Luz et al. Interface design strategies for computer-assisted speech transcription
CN102542854A (en) Method for learning pronunciation through role play
Nyhan et al. Oral history, audio-visual materials and digital humanities: a new ‘grand challenge’
Anderson et al. Internet delivery of time-synchronised multimedia: the SCOTS project
Yang et al. Modeling the acoustic correlates of expressive elements in text genres for expressive text-to-speech synthesis.
CN116010654A (en) Method, device, equipment and storage medium for adding audio annotation to video
MacCrossan Richard Brome Online. Edition
JP2008129475A (en) Voice text
CN117174092A (en) Mobile corpus transcription method and device based on voiceprint recognition and multi-modal analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130717