CN108090051A - Translation method and translator for a continuous long voice file - Google Patents

Translation method and translator for a continuous long voice file

Info

Publication number
CN108090051A
CN108090051A
Authority
CN
China
Prior art keywords
voice
speech
continuous long
speech segment
segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711388000.3A
Other languages
Chinese (zh)
Inventor
郑勇
金志军
王文祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Water World Co Ltd
Original Assignee
Shenzhen Water World Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Water World Co Ltd filed Critical Shenzhen Water World Co Ltd
Priority to CN201711388000.3A priority Critical patent/CN108090051A/en
Priority to PCT/CN2018/072007 priority patent/WO2019119552A1/en
Publication of CN108090051A publication Critical patent/CN108090051A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a translation method and a translator for a continuous long voice file. The translation method for a continuous long voice file comprises: parsing the continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and the first non-speech segments are distributed according to the timing in which they occur in the continuous long voice; sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced after the server translates the continuous long voice file; parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments; and replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same ordering position, to obtain the final translated voice file. The present invention preserves the rhythm, ambient sound, and natural sentence spacing of the continuous long voice file, improving the user experience.

Description

Translation method and translator for a continuous long voice file
Technical field
The present invention relates to electronic translation technology, and in particular to a translation method and a translator for a continuous long voice file.
Background art
In the field of electronic translation, a continuous long voice file produced in application scenarios such as education or recorded-speech translation is translated through the combined action of a speech recognition engine, a translation engine, and a synthesis engine, and the translated voice file is output through an electronic device such as a translator terminal. This facilitates communication between users of different languages and brings great convenience to people's lives. However, for each speech segment in a continuous long voice file, the voice file output by an existing translation engine carries no background-noise information, and the sentence intervals in the output file are preset fixed gaps. The translated voice file therefore loses the rhythm and natural sentence spacing of the original continuous long voice file, along with its language environment and character, resulting in a poor user experience.
The prior art therefore leaves room for improvement.
Summary of the invention
The main object of the present invention is to provide a translation method for a continuous long voice file, aiming to solve the technical problem that existing translation techniques cannot preserve the rhythm and natural sentence spacing of the original continuous long voice file.
The present invention proposes a translation method for a continuous long voice file, comprising:
parsing the continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and the first non-speech segments are distributed according to the timing in which they occur in the continuous long voice;
sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced after the server translates the continuous long voice file;
parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments;
replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same ordering position, to obtain the final translated voice file.
Preferably, the step of parsing the continuous long voice file to obtain the first speech segments and the first non-speech segments comprises:
processing the continuous long voice file by voice activity detection (VAD) analysis to obtain the arrangement of first speech frames and first non-speech frames;
obtaining the first speech segments and the first non-speech segments according to the arrangement of the first speech frames and the first non-speech frames.
Preferably, the step of obtaining the first speech segments and the first non-speech segments according to the arrangement of the first speech frames and the first non-speech frames comprises:
synthesizing each run of consecutively arranged first speech frames into a first speech segment, and each run of consecutively arranged first non-speech frames into a first non-speech segment.
Preferably, after the step of synthesizing each run of consecutively arranged first speech frames into a first speech segment and each run of consecutively arranged first non-speech frames into a first non-speech segment, the method comprises:
extracting each first non-speech segment;
storing each first non-speech segment in a non-speech-segment buffer according to the timing in which it occurs in the continuous long voice.
Preferably, the step of sending the continuous long voice file to the server for translation and receiving the audio code stream file produced after the server translates the continuous long voice file comprises:
sending the continuous long voice file to a speech recognition server;
receiving a first text file, corresponding to the continuous long voice file, fed back by the speech recognition server;
sending the first text file to a translation server;
receiving a second text file in the specified language, fed back by the translation server after translating the first text file;
sending the second text file to a speech synthesis server;
receiving the audio code stream file produced after the speech synthesis server converts the second text file.
Preferably, the step of parsing the audio code stream file to obtain the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments comprises:
analyzing the correspondence between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;
processing the audio code stream file by voice activity detection analysis to obtain the arrangement of second speech frames and second non-speech frames;
obtaining the second speech segments and the second non-speech segments according to the arrangement of the second speech frames and the second non-speech frames;
establishing a second-class one-to-one correspondence between the first speech segments and the second speech segments according to the first-class one-to-one correspondence;
obtaining, according to the second-class one-to-one correspondence and the timing in which the first speech segments and the first non-speech segments occur in the continuous long voice, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments.
The present invention also provides a translator, comprising:
a first parsing module, for parsing the continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and the first non-speech segments are distributed according to the timing in which they occur in the continuous long voice;
a sending/receiving module, for sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced after the server translates the continuous long voice file;
a second parsing module, for parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments;
a replacement module, for replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same ordering position, to obtain the final translated voice file.
Preferably, the first parsing module comprises:
a first processing unit, for processing the continuous long voice file by voice activity detection analysis to obtain the arrangement of first speech frames and first non-speech frames;
a first obtaining unit, for obtaining the first speech segments and the first non-speech segments according to the arrangement of the first speech frames and the first non-speech frames.
Preferably, the first obtaining unit comprises:
a synthesis subunit, for synthesizing, according to the arrangement, each run of consecutively arranged first speech frames into a first speech segment and each run of consecutively arranged first non-speech frames into a first non-speech segment.
Preferably, the first obtaining unit further comprises:
an extraction subunit, for extracting each first non-speech segment;
a storage subunit, for storing each first non-speech segment in a non-speech-segment buffer according to the timing in which it occurs in the continuous long voice.
Preferably, the sending/receiving module comprises:
a first sending unit, for sending the continuous long voice file to a speech recognition server;
a first receiving unit, for receiving a first text file, corresponding to the continuous long voice file, fed back by the speech recognition server;
a second sending unit, for sending the first text file to a translation server;
a second receiving unit, for receiving a second text file in the specified language, fed back by the translation server after translating the first text file;
a third sending unit, for sending the second text file to a speech synthesis server;
a third receiving unit, for receiving the audio code stream file produced after the speech synthesis server converts the second text file.
Preferably, the second parsing module comprises:
an analysis unit, for analyzing the correspondence between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;
a second processing unit, for processing the audio code stream file by voice activity detection analysis to obtain the arrangement of second speech frames and second non-speech frames;
a second obtaining unit, for obtaining the second speech segments and the second non-speech segments according to the arrangement of the second speech frames and the second non-speech frames;
an establishing unit, for establishing a second-class one-to-one correspondence between the first speech segments and the second speech segments according to the first-class one-to-one correspondence;
a third obtaining unit, for obtaining, according to the second-class one-to-one correspondence and the timing in which the first speech segments and the first non-speech segments occur in the continuous long voice, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments.
The present invention divides the original continuous long voice file into speech segments and non-speech segments and retains the non-speech segments of the original continuous long voice file, so that the translated audio code stream file has nearly the same rhythm, ambient sound, and natural sentence spacing as the original continuous long voice file. This lends machine translation a lively, natural feel and improves the user experience.
Description of the drawings
Fig. 1 is a flow diagram of the translation method for a continuous long voice file according to one embodiment of the present invention;
Fig. 2 is a flow diagram of step S1 according to one embodiment of the present invention;
Fig. 3 is a flow diagram of step S11 according to one embodiment of the present invention;
Fig. 4 is a flow diagram of step S2 according to one embodiment of the present invention;
Fig. 5 is a flow diagram of step S3 according to one embodiment of the present invention;
Fig. 6 is a structural diagram of the translator according to one embodiment of the present invention;
Fig. 7 is a structural diagram of the first parsing module according to one embodiment of the present invention;
Fig. 8 is a structural diagram of the first obtaining unit according to one embodiment of the present invention;
Fig. 9 is a structural diagram of the sending/receiving module according to one embodiment of the present invention;
Fig. 10 is a structural diagram of the second parsing module according to one embodiment of the present invention.
The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings and embodiments.
Specific embodiment
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, the translation method for a continuous long voice file according to one embodiment of the present invention comprises:
S1: parsing the continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and the first non-speech segments are distributed according to the timing in which they occur in the continuous long voice.
The terminal device of this embodiment takes a translator as an example. In this step, the continuous long voice file is parsed to obtain a data file in which the first speech segments and the first non-speech segments are arranged in alternating intervals, distributed according to the timing in which they occur in the continuous long voice, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
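The alternating layout described above can be sketched as a small data structure. This is a minimal illustration only; all names (`Segment`, `layout`) are hypothetical and not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    kind: str        # "speech" or "nonspeech"
    index: int       # 1-based position within its own kind, as in the text
    samples: list = field(default_factory=list)  # stand-in for raw audio

def layout(segments):
    """Return the ordering string for a parsed file, e.g. 'S1 N1 S2 N2'."""
    return " ".join(
        ("S" if s.kind == "speech" else "N") + str(s.index) for s in segments
    )

# A parsed file alternates speech and non-speech segments in original order:
parsed = [Segment("speech", 1), Segment("nonspeech", 1),
          Segment("speech", 2), Segment("nonspeech", 2)]
```

Keeping the two kinds indexed separately mirrors the patent's numbering (first speech segment 1..N, first non-speech segment 1..N) and makes the later positional replacement straightforward.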
S2: sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced after the server translates the continuous long voice file.
This step refers to the process in which the translator sends the continuous long voice file successively to a speech recognition server, a translation server, and a speech synthesis server for translation. The audio code stream file of this embodiment refers to the corresponding audio data obtained after translating the continuous long voice file, including speech data and non-speech data.
S3: parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments.
Because the audio code stream file of this embodiment is the audio data obtained by translating the continuous long voice file, the second speech segments and second non-speech segments in the audio code stream file have the same distribution order as the first speech segments and the first non-speech segments.
S4: replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same ordering position, to obtain the final translated voice file.
In this embodiment, in the audio code stream file, whose distribution order is identical to that of the continuous long voice file, each second non-speech segment is replaced with the first non-speech segment at the same ordering position, combining the first non-speech segments with the translated audio code stream file. The final translated voice file thus has the same rhythm, ambient sound, and natural sentence spacing as the original continuous long voice file, lending machine translation a lively, natural feel and improving the user experience.
Referring to Fig. 2, further, in one embodiment of the present invention, step S1 comprises:
S10: processing the continuous long voice file by voice activity detection analysis to obtain the arrangement of first speech frames and first non-speech frames.
In this embodiment, the translator performs VAD (Voice Activity Detection) analysis on the continuous long voice file to distinguish the first speech segments from the first non-speech segments, facilitating subsequent operations. For example, the continuous long voice file is processed frame by frame, with the frame duration set according to the characteristics of the voice signal; in GSM, for instance, 20 ms is one frame length. VAD first detects the beginning and end of each first speech segment in the continuous long voice file, and the time span of each first speech segment is obtained by algorithmic processing, for example using the ETSI VAD algorithm of the GSM communication system or the G.729 Annex B VAD algorithm. The parameter feature values extracted from the continuous long voice file by VAD are compared with thresholds to distinguish the first speech segments from the first non-speech segments. The arrangement in this step refers to the ordering information of the data file in which, after voice activity detection analysis, the continuous long voice file becomes alternating runs of consecutively arranged first speech frames and consecutively arranged first non-speech frames.
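The threshold comparison described above can be illustrated with a minimal energy-based sketch. Real VADs such as the ETSI GSM algorithm or G.729 Annex B use richer feature sets; plain frame energy against a fixed threshold is only an illustrative stand-in, and the names below are hypothetical:

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of PCM samples."""
    return sum(x * x for x in frame) / len(frame)

def vad_decisions(frames, threshold):
    """Return a per-frame verdict, 1 (speech) or 0 (non-speech), by comparing
    a single feature value (here: energy) against a threshold, as the text
    describes for the extracted parameter feature values."""
    return [1 if frame_energy(f) > threshold else 0 for f in frames]

# Two loud frames, one near-silent frame, one loud frame:
frames = [[100, -120, 90], [80, -100, 110], [2, -1, 3], [95, -105, 100]]
decisions = vad_decisions(frames, threshold=1000.0)
```

The verdict sequence produced here is exactly the input the segment-merging step consumes.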
S11: obtaining the first speech segments and the first non-speech segments according to the arrangement of the first speech frames and the first non-speech frames.
In this embodiment, the first speech segments and the first non-speech segments distinguished by VAD are given distinct coded marks for identification.
Referring to Fig. 3, further, in one embodiment of the present invention, step S11 comprises:
S112: synthesizing each run of consecutively arranged first speech frames into a first speech segment, and each run of consecutively arranged first non-speech frames into a first non-speech segment.
This embodiment distinguishes first speech frames from first non-speech frames by the VAD verdict: a verdict of 1 indicates a first speech frame, and a verdict of 0 indicates a background-noise frame (i.e., a first non-speech frame). After VAD processing, a sentence in the continuous long voice file becomes first speech segment 1, formed by merging consecutively arranged first speech frames 1 to m, followed by first non-speech segment 1, formed by merging consecutively arranged first non-speech frames 1 to k. Processing in this way, the continuous long voice file becomes a data file in which first speech segments 1 to N alternate one by one with first non-speech segments 1 to N.
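The merging of consecutive identical verdicts into segments is essentially run-length grouping. A minimal sketch under that reading (function names are illustrative, not from the patent):

```python
from itertools import groupby

def merge_frames(decisions):
    """Merge runs of identical VAD verdicts (1 = speech frame, 0 = non-speech
    frame) into segments, numbering each kind separately as in the text.
    Returns (kind, index-within-kind, frame-count) triples in file order."""
    counts = {1: 0, 0: 0}
    segments = []
    for verdict, run in groupby(decisions):
        counts[verdict] += 1
        kind = "speech" if verdict else "nonspeech"
        segments.append((kind, counts[verdict], len(list(run))))
    return segments

# Verdicts 1 1 0 0 0 1 -> speech segment 1 (2 frames),
# non-speech segment 1 (3 frames), speech segment 2 (1 frame):
segs = merge_frames([1, 1, 0, 0, 0, 1])
```

Each triple carries the per-kind index the later steps use to line up replacements by ordering position.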
Further, in one embodiment of the present invention, after step S112, the method comprises:
S113: extracting each first non-speech segment.
According to the distinct coded marks of the first speech segments and the first non-speech segments, the first non-speech segments are extracted from the continuous long voice file; for example, first non-speech segments 1, 2, ..., N, coded successively as T1, T2, ..., Tn, are extracted.
S114: storing each first non-speech segment in a non-speech-segment buffer according to the timing in which it occurs in the continuous long voice.
The non-speech-segment buffer of this step is set in a designated region of the translator's memory, so that the translated audio code stream file and the first non-speech segments can later be integrated, in the same order in which the first non-speech segments occur in the continuous long voice, and then output.
Referring to Fig. 4, in one embodiment of the present invention, step S2 comprises:
S20: sending the continuous long voice file to a speech recognition server.
S21: receiving a first text file, corresponding to the continuous long voice file, fed back by the speech recognition server.
In this step, the first text file corresponding to the continuous long voice file is obtained through the speech recognition server.
S22: sending the first text file to a translation server.
S23: receiving a second text file in the specified language, fed back by the translation server after translating the first text file.
In this step, the translation server translates the first text file to form the second text file in the specified language. For example, in Chinese-to-English translation, the first (Chinese) text file and the second (English) text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, and the sentences correspond one to one.
S24: sending the second text file to a speech synthesis server.
S25: receiving the audio code stream file produced after the speech synthesis server converts the second text file.
In this step, the second text file is sent in order to the speech synthesis server, which converts it in order into an audio code stream file in the specified language, for example an English audio code stream file.
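Steps S20 through S25 amount to chaining three servers. The sketch below models them as interchangeable callables, since the patent does not specify their interfaces; the function names, lambdas, and toy payloads are all assumptions for illustration:

```python
def translate_pipeline(voice_file, asr, mt, tts):
    """Send the file through ASR -> MT -> TTS, mirroring steps S20-S25.
    asr, mt, tts stand in for the speech recognition, translation, and
    speech synthesis servers respectively."""
    first_text = asr(voice_file)       # S20/S21: voice -> source-language text
    second_text = mt(first_text)       # S22/S23: source text -> target text
    audio_stream = tts(second_text)    # S24/S25: target text -> audio stream
    return first_text, second_text, audio_stream

# Toy stand-ins for the three servers; each sentence maps one-to-one:
first, second, audio = translate_pipeline(
    "input.wav",
    asr=lambda f: ["你好", "再见"],                 # recognized sentences
    mt=lambda t: ["Hello", "Goodbye"],              # translated sentences
    tts=lambda t: [f"<audio:{s}>" for s in t],      # synthesized audio per sentence
)
```

Treating the three stages as order-preserving, sentence-aligned transforms is what later lets the parsing step match second speech segments back to first speech segments.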
Referring to Fig. 5, further, in one embodiment of the present invention, step S3 comprises:
S30: analyzing the correspondence between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence.
In this step, the character string information of the text files is compared, and each sentence formed by the character strings of the first text file and of the second text file is marked, for example: 1st sentence, 2nd sentence, ..., Nth sentence. The one-to-one correspondence between the second text file and the first text file is then obtained by comparison.
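Under the patent's assumption that translation preserves sentence order and count, the first-class correspondence is positional. A minimal sketch under that assumption (names are illustrative):

```python
def sentence_correspondence(first_sentences, second_sentences):
    """Build the first-class one-to-one correspondence of step S30: pair the
    Nth sentence of the first text file with the Nth sentence of the second.
    Keys are 1-based, matching the '1st sentence, 2nd sentence' marking."""
    if len(first_sentences) != len(second_sentences):
        raise ValueError("sentence-level translation must preserve count")
    return {i + 1: (src, dst)
            for i, (src, dst) in enumerate(zip(first_sentences, second_sentences))}

mapping = sentence_correspondence(["你好", "再见"], ["Hello", "Goodbye"])
```

This positional mapping is what step S33 later lifts from sentences to speech segments.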
S31: processing the audio code stream file by voice activity detection analysis to obtain the arrangement of second speech frames and second non-speech frames.
In this step, the audio code stream file is processed by VAD to distinguish the second speech segments from the second non-speech segments. Each second speech segment corresponds to one sentence in the second text file, and the N second non-speech segments have identical time spans.
S32: obtaining the second speech segments and the second non-speech segments according to the arrangement of the second speech frames and the second non-speech frames.
In this embodiment, the second speech segments and second non-speech segments distinguished by VAD are likewise given distinct coded marks for identification.
S33: establishing a second-class one-to-one correspondence between the first speech segments and the second speech segments according to the first-class one-to-one correspondence.
Each first speech segment corresponds to one sentence in the first text file, and each second speech segment corresponds to one sentence in the second text file. From the one-to-one correspondence between the second text file and the first text file, the one-to-one correspondence between the first speech segments and the second speech segments is found, which in turn determines the one-to-one correspondence between the first non-speech segments and the second non-speech segments, enabling accurate replacement.
S34: obtaining, according to the second-class one-to-one correspondence and the timing in which the first speech segments and the first non-speech segments occur in the continuous long voice, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments.
Through the second-class one-to-one correspondence and the timing in which the first speech segments and first non-speech segments occur in the continuous long voice, this embodiment puts the speech segments of the translated audio code stream file into one-to-one correspondence with those of the continuous long voice file. The rhythm of the continuous long voice file (such as the differing intervals between sentences), its ambient sound (such as background music or applause), and its natural sentence spacing (i.e., the natural lengths of the non-speech segments) can then be better integrated with the translated audio code stream file, so that the final translated voice file is closer to the original language environment, improving the user experience.
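The final splice of step S4 can then be read as: walk the translated segment sequence and substitute each synthetic gap with the buffered original non-speech segment at the same ordering position. A minimal sketch with illustrative payloads standing in for raw audio:

```python
def splice_nonspeech(translated_segments, original_nonspeech):
    """Replace each non-speech segment of the translated audio with the
    original non-speech segment at the same ordering position, keeping the
    translated speech segments (step S4). Segments are (kind, payload) pairs."""
    originals = iter(original_nonspeech)
    result = []
    for kind, payload in translated_segments:
        # Non-speech positions take the next buffered original segment;
        # speech positions keep the translated audio unchanged.
        result.append((kind, next(originals) if kind == "nonspeech" else payload))
    return result

translated = [("speech", "Hello"), ("nonspeech", "synthetic-gap-1"),
              ("speech", "Goodbye"), ("nonspeech", "synthetic-gap-2")]
final = splice_nonspeech(translated, ["applause", "street-noise"])
```

Because both sequences share one distribution order, a single forward pass over the buffered originals suffices; no timestamps need to be recomputed.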
Referring to Fig. 6, the translator of one embodiment of the present invention comprises:
a first parsing module 1, for parsing the continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and the first non-speech segments are distributed according to the timing in which they occur in the continuous long voice.
The terminal device of this embodiment takes a translator as an example. In this embodiment, the first parsing module 1 parses the continuous long voice file to obtain a data file in which the first speech segments and the first non-speech segments are arranged in alternating intervals, distributed according to the timing in which they occur in the continuous long voice, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
a sending/receiving module 2, for sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced after the server translates the continuous long voice file.
In this embodiment, the sending/receiving module 2 sends the continuous long voice file in sequence to a speech recognition server, a translation server, and a speech synthesis server for translation. The audio code stream file of this embodiment refers to the corresponding audio data obtained after translating the continuous long voice file, including speech data and non-speech data.
a second parsing module 3, for parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and the first non-speech segments.
Because the audio code stream file of this embodiment is the audio data obtained by translating the continuous long voice file, the second speech segments and second non-speech segments obtained when the second parsing module 3 parses the audio code stream file have the same distribution order as the first speech segments and the first non-speech segments.
a replacement module 4, for replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same ordering position, to obtain the final translated voice file.
In this embodiment, in the audio code stream file, whose distribution order is identical to that of the continuous long voice file, the replacement module 4 replaces each second non-speech segment with the first non-speech segment at the same ordering position, integrating the first non-speech segments with the translated audio code stream file. The final translated voice file thus has the same rhythm, ambient sound, and natural sentence spacing as the original continuous long voice file, lending machine translation a lively, natural feel and improving the user experience.
Referring to Fig. 7, further, in one embodiment of the invention, the first parsing module 1 includes:
The first processing unit 10 is configured to process the continuous long voice file with voice activity detection (VAD) analysis, obtaining the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the first processing unit 10 applies VAD to the continuous long voice file to separate the first speech segments from the first non-speech segments for subsequent operations. For example, the continuous long voice file is processed frame by frame, with the frame duration set according to the characteristics of the speech signal; for instance, the 20 ms frame of GSM may serve as the frame length. VAD first detects the start and end of each first speech segment in the continuous long voice file, and an algorithm then computes the time span of each first speech segment. For example, using the ETSI VAD algorithm of the GSM communication system or the G.729 Annex B VAD algorithm, parameter feature values extracted from the continuous long voice file are compared with thresholds to distinguish the first speech segments from the first non-speech segments. The arrangement state in this embodiment refers to the arrangement information obtained after the VAD analysis of the first processing unit 10, when the continuous long voice file becomes a data file in which runs of consecutive first speech frames alternate with runs of consecutive first non-speech frames.
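As a rough illustration of the frame-level decision: the ETSI GSM and G.729 Annex B VADs cited above use multiple spectral features and adaptive thresholds, but the core idea of comparing a per-frame feature against a threshold can be sketched with short-term energy alone. The fixed `threshold` value here is arbitrary and purely illustrative.

```python
def vad_decisions(samples, frame_len=160, threshold=1e6):
    """Return one 0/1 decision per frame: 1 = speech frame, 0 = non-speech.

    frame_len=160 corresponds to a 20 ms frame at 8 kHz, as in GSM.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)  # short-term energy feature
        decisions.append(1 if energy > threshold else 0)
    return decisions

# Example: three frames of high-amplitude "speech" followed by two quiet frames.
signal = [1000] * 480 + [10] * 320
print(vad_decisions(signal))  # -> [1, 1, 1, 0, 0]
```

A production VAD would also apply hangover smoothing so that brief pauses inside a sentence are not misclassified as non-speech.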
The first obtaining unit 11 is configured to obtain the first speech segments and first non-speech segments according to the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the first speech segments and first non-speech segments distinguished by VAD are each labeled with distinct codes for identification.
Referring to Fig. 8, further, in one embodiment of the invention, the first obtaining unit 11 includes:
The synthesis subunit 112 is configured to merge, according to the arrangement state, consecutively arranged first speech frames into the first speech segments, and consecutively arranged first non-speech frames into the first non-speech segments.
This embodiment distinguishes first speech frames from first non-speech frames by the VAD decision: a decision of 1 indicates a first speech frame, and a decision of 0 indicates a background-noise frame (i.e., a first non-speech frame). Through the synthesis subunit 112, a sentence in the continuous long voice file that VAD has turned into consecutively arranged first speech frames 1 through m is merged into first speech segment 1, and consecutively arranged first non-speech frames 1 through k are merged into first non-speech segment 1. Proceeding in this way, the continuous long voice file becomes a data file in which first speech segments 1 through N and first non-speech segments 1 through N are distributed in alternation.
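The merging of consecutive like-valued frames into segments is a run-length grouping, which can be sketched with `itertools.groupby`. The `(kind, frame_count)` representation is an illustrative simplification; the actual subunit would carry the frame audio data as well.

```python
from itertools import groupby

def frames_to_segments(decisions):
    """Merge consecutive frames with the same VAD decision into segments.

    Returns (kind, frame_count) tuples in original order, where kind is
    'speech' (decision 1) or 'nonspeech' (decision 0).
    """
    return [('speech' if d == 1 else 'nonspeech', len(list(run)))
            for d, run in groupby(decisions)]

# Frames 1..m judged speech become first speech segment 1; the k non-speech
# frames that follow become first non-speech segment 1, and so on.
decisions = [1, 1, 1, 0, 0, 1, 1, 0]
print(frames_to_segments(decisions))
# -> [('speech', 3), ('nonspeech', 2), ('speech', 2), ('nonspeech', 1)]
```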
Further, in one embodiment of the invention, the first obtaining unit 11 also includes:
The extraction subunit 113 is configured to extract each first non-speech segment.
According to the distinct code marks of the first speech segments and first non-speech segments, the extraction subunit 113 extracts the first non-speech segments from the continuous long voice file; for example, first non-speech segments 1 through N, coded T1, T2, ..., Tn in order, are extracted.
The storing subunit 114 is configured to store each first non-speech segment in a non-speech-segment buffer according to the time order in which it occurs in the continuous long voice.
The non-speech-segment buffer of this embodiment is set in a designated region of the storing subunit 114, so that, following the original order of the first non-speech segments in the continuous long voice, the translated audio code stream file can be integrated with the first non-speech segments in sequence and then output.
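A minimal sketch of the extraction-and-buffering step, assuming the segment list produced above. The dictionary layout (`code`, `position`, `data`) is an illustrative choice, not the patent's data format; `position` records the sorting position needed later by the replacement module.

```python
def buffer_nonspeech(segments):
    """Extract non-speech segments into a buffer, preserving the time order
    in which they occur in the continuous long voice (codes T1, T2, ...)."""
    buffer = []
    for index, (kind, data) in enumerate(segments):
        if kind == 'nonspeech':
            buffer.append({'code': 'T%d' % (len(buffer) + 1),
                           'position': index,   # sorting position in the file
                           'data': data})
    return buffer

segments = [('speech', 'S1'), ('nonspeech', 'pause1'),
            ('speech', 'S2'), ('nonspeech', 'applause')]
for entry in buffer_nonspeech(segments):
    print(entry['code'], entry['position'], entry['data'])
# T1 1 pause1
# T2 3 applause
```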
With reference to Fig. 9, in one embodiment of the invention, above-mentioned sending/receiving module 2, including:
The first transmitting unit 20 is configured to send the continuous long voice file to a speech recognition server.
The first receiving unit 21 is configured to receive, from the speech recognition server, the first text file corresponding to the continuous long voice file.
In this embodiment, the first transmitting unit 20 sends the continuous long voice file to the speech recognition server, which converts it into the first text file corresponding to the continuous long voice file.
The second transmitting unit 22 is configured to send the first text file to a translation server.
The second receiving unit 23 is configured to receive, from the translation server, the second text file in the specified language obtained by translating the first text file.
In this embodiment, the second transmitting unit 22 sends the first text file to the translation server, which translates the first text file into the second text file in the specified language. For example, in Chinese-to-English translation, the Chinese first text file and the English second text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, sentence by sentence.
The third transmitting unit 24 is configured to send the second text file to a speech synthesis server.
The third receiving unit 25 is configured to receive the audio code stream file obtained after the speech synthesis server converts the second text file.
In this embodiment, the third transmitting unit 24 sends the second text file in order to the speech synthesis server, which converts it in sequence into the audio code stream file in the specified language, for example an English audio code stream file.
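The three-server pipeline can be sketched end to end with stub clients. The function names `recognize_speech`, `translate_text`, and `synthesize_speech` are illustrative stand-ins for the three servers, not APIs defined by the patent; the stub bodies return canned data so the data flow is visible.

```python
def recognize_speech(voice_file):
    """Speech recognition server: continuous long voice -> first text file.
    Stub: ignores its input and returns a fixed sentence list."""
    return ["你好。", "今天天气很好。"]          # one entry per sentence

def translate_text(first_text, target_lang):
    """Translation server: sentence-by-sentence, one-to-one translation.
    Stub: target_lang is illustrative; a fixed lookup table stands in for MT."""
    table = {"你好。": "Hello.", "今天天气很好。": "The weather is nice today."}
    return [table[s] for s in first_text]

def synthesize_speech(second_text):
    """Speech synthesis server: second text file -> audio code stream.
    Stub: represents each synthesized sentence as a ('speech', text) tuple."""
    return [("speech", s) for s in second_text]

first_text = recognize_speech("long_voice.wav")
second_text = translate_text(first_text, target_lang="en")
audio_stream = synthesize_speech(second_text)
print(second_text)   # -> ['Hello.', 'The weather is nice today.']
```

The sentence-by-sentence one-to-one property of the translation step is what the analysis unit below relies on to align the two text files.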
Referring to Fig. 10, further, in one embodiment of the invention, the second parsing module 3 includes:
The analysis unit 30 is configured to compare the first character-string information of the first text file with the second character-string information of the second text file, obtaining a first-type one-to-one correspondence.
In this embodiment, the analysis unit 30 compares the character-string information of the two text files and labels each sentence composed of character strings in the first and second text files — the 1st sentence, the 2nd sentence, ..., the Nth sentence — so that the comparison yields a one-to-one correspondence between the second text file and the first text file.
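Given the translation server's sentence-by-sentence output, the first-type correspondence reduces to pairing sentences by index. A minimal sketch, assuming both text files are already split into sentence lists:

```python
def first_correspondence(first_text, second_text):
    """Label each sentence (1st, 2nd, ..., Nth) and pair them by index,
    relying on the translation server's sentence-by-sentence output."""
    assert len(first_text) == len(second_text)  # one-to-one by construction
    return {i + 1: (src, dst)
            for i, (src, dst) in enumerate(zip(first_text, second_text))}

mapping = first_correspondence(["你好。", "再见。"], ["Hello.", "Goodbye."])
print(mapping[1])  # -> ('你好。', 'Hello.')
print(mapping[2])  # -> ('再见。', 'Goodbye.')
```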
The second processing unit 31 is configured to process the audio code stream file with voice activity detection analysis.
In this embodiment, the second processing unit 31 applies VAD to the audio code stream file to distinguish the second speech segments from the second non-speech segments. Each second speech segment corresponds to one sentence in the second text file, and the N second non-speech segments have identical time spans.
The second obtaining unit 32 is configured to obtain the second speech segments and second non-speech segments according to the arrangement state of the second speech frames and second non-speech frames.
In this embodiment, the second obtaining unit 32 obtains the second speech segments and second non-speech segments, which are likewise labeled with distinct codes for identification.
The establishing unit 33 is configured to establish, from the first-type one-to-one correspondence, a second-type one-to-one correspondence between the first speech segments and the second speech segments.
Each first speech segment corresponds to one sentence in the first text file, and each second speech segment corresponds to one sentence in the second text file. Using the one-to-one correspondence between the second and first text files, the establishing unit 33 finds the one-to-one correspondence between the first and second speech segments, and thereby determines the one-to-one correspondence between the first and second non-speech segments, so that replacement can be performed accurately.
The third obtaining unit 34 is configured to obtain, according to the second-type one-to-one correspondence and to the first speech segments and first non-speech segments in the time order of the continuous long voice, second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
In this embodiment, through the second-type one-to-one correspondence and the time order of the first speech segments and first non-speech segments in the continuous long voice, the third obtaining unit 34 establishes a segment-level correspondence between the translated audio code stream file and the continuous long voice file. This preserves the rhythm of the continuous long voice file (for example, the varying intervals between sentences), its natural environmental sounds (for example, background music or applause), and its natural sentence spacing (the natural lengths of the non-speech segments), all of which can then be merged with the translated audio code stream file. The final translated speech file is thus closer to the original speech environment, improving the user experience.
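The ordering step can be sketched as follows. This is an illustrative simplification under the assumption stated above — speech segment i of the original corresponds to speech segment i of the translation — so arranging the translated segments in the original distribution order is a matter of walking the original segment sequence and drawing from the matching pools:

```python
def order_translated_segments(first_segments, second_speech, second_nonspeech):
    """Arrange second speech/non-speech segments in the same distribution
    order as the first segments, using the sentence-level correspondence."""
    ordered = []
    speech_i = nonspeech_i = 0
    for kind, _ in first_segments:
        if kind == 'speech':
            ordered.append(('speech', second_speech[speech_i]))
            speech_i += 1
        else:
            ordered.append(('nonspeech', second_nonspeech[nonspeech_i]))
            nonspeech_i += 1
    return ordered

first = [('speech', 'CN-1'), ('nonspeech', 'music'),
         ('speech', 'CN-2'), ('nonspeech', 'applause')]
print(order_translated_segments(first, ['EN-1', 'EN-2'], ['gap1', 'gap2']))
# -> [('speech', 'EN-1'), ('nonspeech', 'gap1'),
#     ('speech', 'EN-2'), ('nonspeech', 'gap2')]
```

The output feeds the replacement module, which then swaps each synthetic gap for the buffered original non-speech segment at the same sorting position.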
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit its scope. Any equivalent structure or equivalent process derived from the description and drawings of the present invention, whether used directly or indirectly in other related technical fields, falls within the scope of protection of the present invention.

Claims (10)

1. A method for translating a continuous long voice file, comprising:
parsing the continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the time order in which they occur in the continuous long voice;
sending the continuous long voice file to a server for translation, and receiving the audio code stream file obtained after the server translates the continuous long voice file;
parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments;
replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorting position, to obtain a final translated speech file.
2. The method for translating a continuous long voice file according to claim 1, wherein the step of parsing the continuous long voice file to obtain first speech segments and first non-speech segments comprises:
processing the continuous long voice file with voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames;
obtaining the first speech segments and first non-speech segments according to the arrangement state of the first speech frames and first non-speech frames.
3. The method for translating a continuous long voice file according to claim 2, wherein the step of obtaining the first speech segments and first non-speech segments according to the arrangement state of the first speech frames and first non-speech frames comprises:
merging consecutively arranged first speech frames into the first speech segments, and merging consecutively arranged first non-speech frames into the first non-speech segments.
4. The method for translating a continuous long voice file according to claim 1, wherein the step of sending the continuous long voice file to a server for translation and receiving the audio code stream file obtained after the server translates the continuous long voice file comprises:
sending the continuous long voice file to a speech recognition server;
receiving, from the speech recognition server, a first text file corresponding to the continuous long voice file;
sending the first text file to a translation server;
receiving, from the translation server, a second text file in a specified language obtained by translating the first text file;
sending the second text file to a speech synthesis server;
receiving the audio code stream file obtained after the speech synthesis server converts the second text file.
5. The method for translating a continuous long voice file according to claim 4, wherein the step of parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments comprises:
comparing the first character-string information of the first text file with the second character-string information of the second text file to obtain a first-type one-to-one correspondence;
processing the audio code stream file with voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames;
obtaining the second speech segments and second non-speech segments according to the arrangement state of the second speech frames and second non-speech frames;
establishing, from the first-type one-to-one correspondence, a second-type one-to-one correspondence between the first speech segments and the second speech segments;
obtaining, according to the second-type one-to-one correspondence and to the first speech segments and first non-speech segments in the time order of the continuous long voice, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
6. A translator, comprising:
a first parsing module, configured to parse a continuous long voice file to obtain first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the time order in which they occur in the continuous long voice;
a sending/receiving module, configured to send the continuous long voice file to a server for translation, and to receive the audio code stream file obtained after the server translates the continuous long voice file;
a second parsing module, configured to parse the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments;
a replacement module, configured to replace each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorting position, to obtain a final translated speech file.
7. The translator according to claim 6, wherein the first parsing module comprises:
a first processing unit, configured to process the continuous long voice file with voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames;
a first obtaining unit, configured to obtain the first speech segments and first non-speech segments according to the arrangement state of the first speech frames and first non-speech frames.
8. The translator according to claim 7, wherein the first obtaining unit comprises:
a synthesis subunit, configured to merge, according to the arrangement state, consecutively arranged first speech frames into the first speech segments and consecutively arranged first non-speech frames into the first non-speech segments.
9. The translator according to claim 6, wherein the sending/receiving module comprises:
a first transmitting unit, configured to send the continuous long voice file to a speech recognition server;
a first receiving unit, configured to receive, from the speech recognition server, a first text file corresponding to the continuous long voice file;
a second transmitting unit, configured to send the first text file to a translation server;
a second receiving unit, configured to receive, from the translation server, a second text file in a specified language obtained by translating the first text file;
a third transmitting unit, configured to send the second text file to a speech synthesis server;
a third receiving unit, configured to receive the audio code stream file obtained after the speech synthesis server converts the second text file.
10. The translator according to claim 9, wherein the second parsing module comprises:
an analysis unit, configured to compare the first character-string information of the first text file with the second character-string information of the second text file to obtain a first-type one-to-one correspondence;
a second processing unit, configured to process the audio code stream file with voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames;
a second obtaining unit, configured to obtain the second speech segments and second non-speech segments according to the arrangement state of the second speech frames and second non-speech frames;
an establishing unit, configured to establish, from the first-type one-to-one correspondence, a second-type one-to-one correspondence between the first speech segments and the second speech segments;
a third obtaining unit, configured to obtain, according to the second-type one-to-one correspondence and to the first speech segments and first non-speech segments in the time order of the continuous long voice, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
CN201711388000.3A 2017-12-20 2017-12-20 The interpretation method and translator of continuous long voice document Pending CN108090051A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711388000.3A CN108090051A (en) 2017-12-20 2017-12-20 The interpretation method and translator of continuous long voice document
PCT/CN2018/072007 WO2019119552A1 (en) 2017-12-20 2018-01-09 Method for translating continuous long speech file, and translation machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711388000.3A CN108090051A (en) 2017-12-20 2017-12-20 The interpretation method and translator of continuous long voice document

Publications (1)

Publication Number Publication Date
CN108090051A true CN108090051A (en) 2018-05-29

Family

ID=62177614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711388000.3A Pending CN108090051A (en) 2017-12-20 2017-12-20 The interpretation method and translator of continuous long voice document

Country Status (2)

Country Link
CN (1) CN108090051A (en)
WO (1) WO2019119552A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101497A (en) * 2018-07-18 2018-12-28 深圳市锐曼智能技术有限公司 Voice collecting translating equipment, system and method
CN111862940A (en) * 2020-07-15 2020-10-30 百度在线网络技术(北京)有限公司 Earphone-based translation method, device, system, equipment and storage medium
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004109658A1 (en) * 2003-06-02 2004-12-16 International Business Machines Corporation Voice response system, voice response method, voice server, voice file processing method, program and recording medium
CN103167360A (en) * 2013-02-21 2013-06-19 中国对外翻译出版有限公司 Method for achieving multilingual subtitle translation
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
US20140163970A1 (en) * 2012-11-29 2014-06-12 Huawei Technologies Co., Ltd. Method for classifying voice conference minutes, device, and system
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
CN105719642A (en) * 2016-02-29 2016-06-29 黄博 Continuous and long voice recognition method and system and hardware equipment
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN107391498A (en) * 2017-07-28 2017-11-24 深圳市沃特沃德股份有限公司 Voice translation method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008243080A (en) * 2007-03-28 2008-10-09 Toshiba Corp Device, method, and program for translating voice
CN101458681A (en) * 2007-12-10 2009-06-17 株式会社东芝 Voice translation method and voice translation apparatus
CN101727904B (en) * 2008-10-31 2013-04-24 国际商业机器公司 Voice translation method and device
CN105912533B (en) * 2016-04-12 2019-02-12 苏州大学 Long sentence cutting method and device towards neural machine translation
CN106303695A (en) * 2016-08-09 2017-01-04 北京东方嘉禾文化发展股份有限公司 Audio translation multiple language characters processing method and system



Also Published As

Publication number Publication date
WO2019119552A1 (en) 2019-06-27

Similar Documents

Publication Publication Date Title
CN107169430B (en) Reading environment sound effect enhancement system and method based on image processing semantic analysis
CN103700370B (en) A kind of radio and television speech recognition system method and system
CN108364632B (en) Emotional Chinese text voice synthesis method
CN110147451B (en) Dialogue command understanding method based on knowledge graph
CN110853649A (en) Label extraction method, system, device and medium based on intelligent voice technology
CN102063904B (en) Melody extraction method and melody recognition system for audio files
CN109714608B (en) Video data processing method, video data processing device, computer equipment and storage medium
CN104143329A (en) Method and device for conducting voice keyword search
CN104166462A (en) Input method and system for characters
CN106157951B (en) Carry out the automatic method for splitting and system of audio punctuate
CN107578769A (en) Speech data mask method and device
CN105336342B (en) Speech recognition result evaluation method and system
CN108090051A (en) The interpretation method and translator of continuous long voice document
CN105895103A (en) Speech recognition method and device
CN106782615A (en) Speech data emotion detection method and apparatus and system
CN103164403A (en) Generation method of video indexing data and system
CN110111778A (en) A kind of method of speech processing, device, storage medium and electronic equipment
CN105869628A (en) Voice endpoint detection method and device
CN111489743A (en) Operation management analysis system based on intelligent voice technology
Alghifari et al. On the use of voice activity detection in speech emotion recognition
CN106550268B (en) Video processing method and video processing device
CN103474075B (en) Voice signal sending method and system, method of reseptance and system
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN109817223A (en) Phoneme marking method and device based on audio fingerprints
CN103474067A (en) Voice signal transmission method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180529