CN108090051A - Translation method and translator for continuous long speech files - Google Patents
Translation method and translator for continuous long speech files
- Publication number
- CN108090051A CN108090051A CN201711388000.3A CN201711388000A CN108090051A CN 108090051 A CN108090051 A CN 108090051A CN 201711388000 A CN201711388000 A CN 201711388000A CN 108090051 A CN108090051 A CN 108090051A
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- continuous long
- speech segment
- segments
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
Disclosed are a translation method and a translator for continuous long speech files. The translation method comprises: parsing the continuous long speech file to obtain first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the order in which they occur in the continuous long speech; sending the continuous long speech file to a server for translation, and receiving the audio code stream file after the server translates the continuous long speech file; parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments; and replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorted position, obtaining the final translated speech file. The invention preserves the rhythm, ambient sound, and natural sentence spacing of the continuous long speech file, improving the user experience.
Description
Technical field
The present invention relates to electronic translation technology, and in particular to a translation method and a translator for continuous long speech files.
Background technology
In the field of electronic translation, a continuous long speech file arising in scenarios such as education or recorded-speech translation is processed by the combined action of a speech recognition engine, a translation engine, and a synthesis engine to produce a translated speech file, which is then output through an electronic device such as a translator terminal, facilitating communication between users of different languages and bringing great convenience to daily life. However, the speech file that an existing translation engine outputs for each speech segment of a continuous long speech file carries no background-noise information, and the sentence intervals in that file are preset fixed gaps. The translated speech file therefore loses the rhythm and natural sentence spacing of the original continuous long speech file, as well as its language environment and character, resulting in a poor user experience.
Therefore, the prior art could be improved.
Summary of the invention
The main object of the present invention is to provide a translation method for continuous long speech files, aiming to solve the technical problem that existing translation techniques cannot preserve the rhythm and natural sentence spacing of the original continuous long speech file.
The present invention proposes a translation method for continuous long speech files, comprising:
parsing the continuous long speech file to obtain first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the order in which they occur in the continuous long speech;
sending the continuous long speech file to a server for translation, and receiving the audio code stream file after the server translates the continuous long speech file;
parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments;
replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorted position, obtaining the final translated speech file.
Preferably, the step of parsing the continuous long speech file to obtain the first speech segments and first non-speech segments comprises:
processing the continuous long speech file by voice activity detection to obtain the arrangement of first speech frames and first non-speech frames;
obtaining the first speech segments and first non-speech segments from the arrangement of the first speech frames and first non-speech frames.
Preferably, the step of obtaining the first speech segments and first non-speech segments from the arrangement of the first speech frames and first non-speech frames comprises:
combining each run of consecutively arranged first speech frames into a first speech segment, and each run of consecutively arranged first non-speech frames into a first non-speech segment.
Preferably, after the step of combining each run of consecutively arranged first speech frames into a first speech segment and each run of consecutively arranged first non-speech frames into a first non-speech segment, the method comprises:
extracting each first non-speech segment;
storing each first non-speech segment in a non-speech-segment buffer according to the order in which it occurs in the continuous long speech.
Preferably, the step of sending the continuous long speech file to the server for translation and receiving the audio code stream file after the server translates the continuous long speech file comprises:
sending the continuous long speech file to a speech recognition server;
receiving the first text file, corresponding to the continuous long speech file, fed back by the speech recognition server;
sending the first text file to a translation server;
receiving the second text file in the specified language, fed back by the translation server after translating the first text file;
sending the second text file to a speech synthesis server;
receiving the audio code stream file after the speech synthesis server converts the second text file.
Preferably, the step of parsing the audio code stream file to obtain the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments comprises:
analyzing the correspondence between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;
processing the audio code stream file by voice activity detection to obtain the arrangement of second speech frames and second non-speech frames;
obtaining the second speech segments and second non-speech segments from the arrangement of the second speech frames and second non-speech frames;
establishing, according to the first-class one-to-one correspondence, a second-class one-to-one correspondence between the first speech segments and the second speech segments;
obtaining, according to the second-class one-to-one correspondence and to the order in which the first speech segments and first non-speech segments occur in the continuous long speech, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
The present invention also provides a translator, comprising:
a first parsing module for parsing the continuous long speech file to obtain first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the order in which they occur in the continuous long speech;
a sending/receiving module for sending the continuous long speech file to the server for translation, and for receiving the audio code stream file after the server translates the continuous long speech file;
a second parsing module for parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments;
a replacement module for replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorted position, obtaining the final translated speech file.
Preferably, the first parsing module comprises:
a first processing unit for processing the continuous long speech file by voice activity detection to obtain the arrangement of first speech frames and first non-speech frames;
a first obtaining unit for obtaining the first speech segments and first non-speech segments from the arrangement of the first speech frames and first non-speech frames.
Preferably, the first obtaining unit comprises:
a synthesis subunit for combining, according to the arrangement, each run of consecutively arranged first speech frames into a first speech segment, and each run of consecutively arranged first non-speech frames into a first non-speech segment.
Preferably, the first obtaining unit further comprises:
an extraction subunit for extracting each first non-speech segment;
a storage subunit for storing each first non-speech segment in a non-speech-segment buffer according to the order in which it occurs in the continuous long speech.
Preferably, the sending/receiving module comprises:
a first sending unit for sending the continuous long speech file to a speech recognition server;
a first receiving unit for receiving the first text file, corresponding to the continuous long speech file, fed back by the speech recognition server;
a second sending unit for sending the first text file to a translation server;
a second receiving unit for receiving the second text file in the specified language, fed back by the translation server after translating the first text file;
a third sending unit for sending the second text file to a speech synthesis server;
a third receiving unit for receiving the audio code stream file after the speech synthesis server converts the second text file.
Preferably, the second parsing module comprises:
an analysis unit for analyzing the correspondence between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;
a second processing unit for processing the audio code stream file by voice activity detection to obtain the arrangement of second speech frames and second non-speech frames;
a second obtaining unit for obtaining the second speech segments and second non-speech segments from the arrangement of the second speech frames and second non-speech frames;
an establishing unit for establishing, according to the first-class one-to-one correspondence, a second-class one-to-one correspondence between the first speech segments and the second speech segments;
a third obtaining unit for obtaining, according to the second-class one-to-one correspondence and to the order in which the first speech segments and first non-speech segments occur in the continuous long speech, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
By dividing the original continuous long speech file into speech segments and non-speech segments and retaining non-speech segments identical to those of the original file, the present invention gives the translated audio code stream file almost the same rhythm, ambient sound, and natural sentence spacing as the original continuous long speech file, adding liveliness to the machine translation and improving the user experience.
Description of the drawings
Fig. 1 is a flow diagram of the translation method for continuous long speech files according to an embodiment of the invention;
Fig. 2 is a flow diagram of step S1 according to an embodiment of the invention;
Fig. 3 is a flow diagram of step S11 according to an embodiment of the invention;
Fig. 4 is a flow diagram of step S2 according to an embodiment of the invention;
Fig. 5 is a flow diagram of step S3 according to an embodiment of the invention;
Fig. 6 is a structural diagram of the translator according to an embodiment of the invention;
Fig. 7 is a structural diagram of the first parsing module according to an embodiment of the invention;
Fig. 8 is a structural diagram of the first obtaining unit according to an embodiment of the invention;
Fig. 9 is a structural diagram of the sending/receiving module according to an embodiment of the invention;
Fig. 10 is a structural diagram of the second parsing module according to an embodiment of the invention.
The realization of the object of the present invention, its functions, and its advantages will be further described with reference to the accompanying drawings and embodiments.
Detailed description of the embodiments
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, the translation method for continuous long speech files according to an embodiment of the invention comprises:
S1: parsing the continuous long speech file to obtain first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the order in which they occur in the continuous long speech.
The terminal device of this embodiment is exemplified by a translator. In this step, the continuous long speech file is parsed to obtain a data file in which the first speech segments and first non-speech segments are arranged in alternation, ordered as they occur in the continuous long speech, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
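The alternating layout described above can be modeled with a small data structure. The Python sketch below is illustrative only; the `Segment` class and its field names are assumptions, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One span of a parsed continuous long speech file."""
    is_speech: bool   # True: speech segment; False: non-speech segment
    start_ms: int     # position in the original timeline
    end_ms: int

def alternating_layout(segments):
    """Render the parsed file as the alternating sequence the text describes."""
    labels = []
    for i, seg in enumerate(segments):
        kind = "speech" if seg.is_speech else "non-speech"
        labels.append(f"{kind} {i // 2 + 1}")
    return labels

parsed = [
    Segment(True, 0, 1200), Segment(False, 1200, 1500),
    Segment(True, 1500, 2800), Segment(False, 2800, 3400),
]
print(alternating_layout(parsed))
# ['speech 1', 'non-speech 1', 'speech 2', 'non-speech 2']
```

The timeline fields make it possible later to restore each non-speech segment at exactly its original sorted position.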
S2: sending the continuous long speech file to the server for translation, and receiving the audio code stream file after the server translates the continuous long speech file.
This step refers to the process in which the translator sends the continuous long speech file in turn to the speech recognition server, the translation server, and the speech synthesis server for translation. The audio code stream file of this embodiment is the corresponding audio data obtained by translating the continuous long speech file, comprising both speech data and non-speech data.
S3: parsing the audio code stream file to obtain second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
Since the audio code stream file of this embodiment is the corresponding audio data obtained by translating the continuous long speech file, the second speech segments and second non-speech segments in the audio code stream file have the same distribution order as the first speech segments and first non-speech segments.
S4: replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorted position, obtaining the final translated speech file.
In the audio code stream file, whose distribution order is identical to that of the continuous long speech file, this embodiment replaces the second non-speech segment at each sorted position with the first non-speech segment at that position and combines the first non-speech segments with the translated audio code stream file, so that the final translated speech file has the same rhythm, ambient sound, and natural sentence spacing as the original continuous long speech file, adding liveliness to the machine translation and improving the user experience.
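Step S4 amounts to a splice over two segment lists that share the same alternating layout. The sketch below uses a hypothetical `(is_speech, payload)` pair per segment, a simplification rather than the patent's actual data format:

```python
def splice_translation(first_segments, second_segments):
    """Walk the translated stream and swap every second non-speech segment
    for the first non-speech segment at the same sorted position, keeping
    the translated speech segments. Both lists are assumed to hold
    (is_speech, payload) pairs in the same alternating timeline order."""
    assert len(first_segments) == len(second_segments)
    result = []
    for (orig_speech, orig_payload), (trans_speech, trans_payload) in zip(
            first_segments, second_segments):
        assert orig_speech == trans_speech, "layouts must match"
        # Keep translated speech; restore the original pauses / background sound.
        result.append(trans_payload if trans_speech else orig_payload)
    return result

original = [(True, "zh-1"), (False, "birdsong"), (True, "zh-2"), (False, "applause")]
translated = [(True, "en-1"), (False, "fixed-gap"), (True, "en-2"), (False, "fixed-gap")]
print(splice_translation(original, translated))
# ['en-1', 'birdsong', 'en-2', 'applause']
```

The fixed synthesized gaps are discarded, which is exactly how the method recovers the original rhythm and ambient sound.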
Referring to Fig. 2, further, in an embodiment of the invention, step S1 comprises:
S10: processing the continuous long speech file by voice activity detection to obtain the arrangement of the first speech frames and first non-speech frames.
In this embodiment the translator performs VAD (Voice Activity Detection) analysis on the continuous long speech file to distinguish the first speech segments from the first non-speech segments for subsequent processing. For example, the continuous long speech file is processed frame by frame, with the frame duration set according to the characteristics of the speech signal, e.g. the 20 ms frame length of GSM. VAD first detects the beginning and end of each first speech segment in the continuous long speech file and computes the time span of each first speech segment by an algorithm, for instance the ETSI VAD algorithm of the GSM communication system or the G.729 Annex B VAD algorithm: the feature values of the parameters extracted by VAD from the continuous long speech file are compared with thresholds to distinguish the first speech segments from the first non-speech segments. The arrangement in this step refers to the ordering information obtained after the voice activity detection analysis, when the continuous long speech file has become a data file in which runs of consecutively arranged first speech frames alternate with runs of consecutively arranged first non-speech frames.
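The frame-by-frame decision can be illustrated with a toy energy-threshold VAD. Real algorithms such as the ETSI GSM VAD or G.729 Annex B combine several spectral features with adaptive thresholds, so this sketch only shows the compare-feature-with-threshold idea on a single energy feature:

```python
def classify_frames(samples, frame_len=160, threshold=0.01):
    """Toy frame-level VAD: mark each 20 ms frame (160 samples at 8 kHz)
    as speech (1) or non-speech (0) by comparing its mean energy to a
    fixed threshold. Illustrative only; not the ETSI or G.729B algorithm."""
    decisions = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(1 if energy > threshold else 0)
    return decisions

# Two loud frames followed by two near-silent frames.
signal = [0.5] * 320 + [0.001] * 320
print(classify_frames(signal))  # [1, 1, 0, 0]
```

The resulting 0/1 sequence is the "arrangement" the text refers to: alternating runs of speech and non-speech frames.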
S11: obtaining the first speech segments and first non-speech segments from the arrangement of the first speech frames and first non-speech frames.
In this embodiment, the first speech segments and first non-speech segments distinguished by VAD are each given a distinct coded mark for identification.
Referring to Fig. 3, further, in an embodiment of the invention, step S11 comprises:
S112: combining each run of consecutively arranged first speech frames into a first speech segment, and each run of consecutively arranged first non-speech frames into a first non-speech segment.
This embodiment distinguishes the first speech frames from the first non-speech frames by the VAD decision: a decision of 1 marks a first speech frame, and a decision of 0 marks a background-noise frame (i.e. a first non-speech frame). After VAD processing, the sentences in the continuous long speech file become first speech segment 1, merged from consecutively arranged first speech frames 1 to m, and first non-speech segment 1, merged from consecutively arranged first non-speech frames 1 to k; proceeding in this way, the continuous long speech file becomes a data file in which consecutively arranged first speech segments 1 to N alternate one by one with consecutively arranged first non-speech segments 1 to N.
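Merging consecutive identical VAD decisions into segments is a run-length grouping, which can be sketched as follows (the dictionary fields are illustrative, not from the patent):

```python
from itertools import groupby

def merge_frames(decisions):
    """Merge runs of identical VAD decisions into segments: consecutive
    speech frames (1) become one speech segment, consecutive non-speech
    frames (0) one non-speech segment, in timeline order."""
    segments = []
    frame_idx = 0
    for value, run in groupby(decisions):
        n = len(list(run))
        segments.append({"speech": bool(value),
                         "first_frame": frame_idx,
                         "n_frames": n})
        frame_idx += n
    return segments

# Decisions for: speech 1, non-speech 1, speech 2, non-speech 2.
print(merge_frames([1, 1, 1, 0, 0, 1, 1, 0]))
```

Each resulting entry carries its frame offset, which serves as the coded mark that later lets the non-speech segments be extracted and stored in order.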
Further, in an embodiment of the invention, after step S112 the method comprises:
S113: extracting each first non-speech segment.
According to the distinct coded marks of the first speech segments and first non-speech segments, the first non-speech segments are extracted from the continuous long speech file; for example, first non-speech segments 1, 2, ..., N, coded T1, T2, ..., Tn in turn, are extracted.
S114: storing each first non-speech segment in the non-speech-segment buffer according to the order in which it occurs in the continuous long speech.
The non-speech-segment buffer of this step is set in a designated region of the translator's memory, so that the translated audio code stream file and the first non-speech segments can later be integrated and output in the same order in which the first non-speech segments occur in the continuous long speech.
Referring to Fig. 4, in an embodiment of the invention, step S2 comprises:
S20: sending the continuous long speech file to the speech recognition server.
S21: receiving the first text file, corresponding to the continuous long speech file, fed back by the speech recognition server.
In this step the speech recognition server produces the first text file corresponding to the continuous long speech file.
S22: sending the first text file to the translation server.
S23: receiving the second text file in the specified language, fed back by the translation server after translating the first text file.
In this step the translation server translates the first text file to form the second text file in the specified language. For example, in Chinese-to-English translation, the first (Chinese) text file and the second (English) text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, and the sentences correspond one by one.
S24: sending the second text file to the speech synthesis server.
S25: receiving the audio code stream file after the speech synthesis server converts the second text file.
In this step the second text file is sent in order to the speech synthesis server, so that it is converted in order into an audio code stream file in the specified language, for example an English audio code stream file.
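Steps S20 to S25 form a three-stage server pipeline. A minimal sketch follows, with the three servers replaced by injected stand-in callables because the patent specifies no concrete endpoints or protocols:

```python
def translate_long_speech(speech_file, recognize, translate, synthesize):
    """S20-S25 as one pipeline. The three callables stand in for the
    speech recognition server, translation server, and speech synthesis
    server; a real deployment would wrap network requests here."""
    first_text = recognize(speech_file)     # S20/S21: speech -> first text file
    second_text = translate(first_text)     # S22/S23: first -> second text file
    audio_stream = synthesize(second_text)  # S24/S25: second text -> audio code stream
    return audio_stream

# Stand-ins for the three servers, for illustration only.
recognize = lambda audio: "你好。 谢谢。"
translate = lambda zh: "Hello. Thank you."
synthesize = lambda text: f"<audio code stream for: {text}>"

print(translate_long_speech(b"...pcm bytes...", recognize, translate, synthesize))
# <audio code stream for: Hello. Thank you.>
```

Keeping the stages as separate callables mirrors the patent's division into three independent servers that each feed the next.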
Referring to Fig. 5, further, in an embodiment of the invention, step S3 comprises:
S30: analyzing the correspondence between the first character string information of the first text file and the second character string information of the second text file to obtain the first-class one-to-one correspondence.
By comparing the character string information of the two text files, this step marks each sentence formed by the character strings of the first text file and the second text file, e.g. sentence 1, sentence 2, ..., sentence N, so that the one-to-one correspondence between the second text file and the first text file is obtained by matching the marks.
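The sentence-marking comparison of step S30 can be sketched as splitting both text files into sentences and pairing them by index. The punctuation-based splitter below is an assumption for illustration; the patent does not specify how sentences are delimited:

```python
import re

def sentence_map(first_text, second_text):
    """Mark each sentence of the two text files with its index and pair
    them up, giving a first-class one-to-one correspondence. Assumes, as
    the embodiment does, that translation preserves sentence count and order."""
    split = lambda t: [s for s in re.split(r"[。.!?！？]\s*", t) if s]
    first = split(first_text)
    second = split(second_text)
    assert len(first) == len(second), "sentence counts must match"
    return {i: (first[i], second[i]) for i in range(len(first))}

mapping = sentence_map("你好。谢谢。", "Hello. Thank you.")
print(mapping[0])  # ('你好', 'Hello')
print(mapping[1])  # ('谢谢', 'Thank you')
```

The integer keys are the sentence marks; the same indices later link each first speech segment to its second speech segment.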
S31: processing the audio code stream file by voice activity detection to obtain the arrangement of the second speech frames and second non-speech frames.
This step processes the audio code stream file by VAD to distinguish the second speech segments from the second non-speech segments; each second speech segment corresponds to one sentence of the second text file, and the N second non-speech segments have identical time spans.
S32: obtaining the second speech segments and second non-speech segments from the arrangement of the second speech frames and second non-speech frames.
In this embodiment, the second speech segments and second non-speech segments distinguished by VAD are likewise each given a distinct coded mark for identification.
S33: establishing, according to the first-class one-to-one correspondence, the second-class one-to-one correspondence between the first speech segments and the second speech segments.
Each first speech segment corresponds to one sentence of the first text file, and each second speech segment corresponds to one sentence of the second text file. From the one-to-one correspondence between the second text file and the first text file, the one-to-one correspondence between the first speech segments and the second speech segments is found, which in turn determines the one-to-one correspondence between the first non-speech segments and the second non-speech segments, so that the replacement can be made accurately.
S34: obtaining, according to the second-class one-to-one correspondence and to the order in which the first speech segments and first non-speech segments occur in the continuous long speech, the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
Through the second-class one-to-one correspondence and the order in which the first speech segments and first non-speech segments occur in the continuous long speech, this embodiment puts the speech segments of the translated audio code stream file into one-to-one correspondence with those of the continuous long speech file, so that the rhythm of the continuous long speech file (e.g. the varying intervals between sentences), its ambient sound (e.g. background music or applause), and its natural sentence spacing (i.e. the natural length of the non-speech segments) can be better integrated with the translated audio code stream file. The final translated speech file is thus closer to the original language environment, improving the user experience.
Referring to Fig. 6, the translator of an embodiment of the invention comprises:
a first parsing module 1 for parsing the continuous long speech file to obtain the first speech segments and first non-speech segments, wherein the first speech segments and first non-speech segments are distributed according to the order in which they occur in the continuous long speech.
The terminal device of this embodiment is exemplified by a translator. In this embodiment the first parsing module 1 parses the continuous long speech file to obtain a data file in which the first speech segments and first non-speech segments are arranged in alternation, ordered as they occur in the continuous long speech, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
a sending/receiving module 2 for sending the continuous long speech file to the server for translation, and for receiving the audio code stream file after the server translates the continuous long speech file.
In this embodiment the sending/receiving module 2 sends the continuous long speech file in turn to the speech recognition server, the translation server, and the speech synthesis server for translation. The audio code stream file of this embodiment is the corresponding audio data obtained by translating the continuous long speech file, comprising both speech data and non-speech data.
a second parsing module 3 for parsing the audio code stream file to obtain the second speech segments and second non-speech segments whose distribution order is identical to that of the first speech segments and first non-speech segments.
Since the audio code stream file of this embodiment is the corresponding audio data obtained by translating the continuous long speech file, the second speech segments and second non-speech segments obtained by the second parsing module 3 from parsing the audio code stream file have the same distribution order as the first speech segments and first non-speech segments.
a replacement module 4 for replacing each second non-speech segment in the audio code stream file with the first non-speech segment at the same sorted position, obtaining the final translated speech file.
In this embodiment the replacement module 4, within the audio code stream file whose distribution order is identical to that of the continuous long speech file, replaces the second non-speech segment at each sorted position with the first non-speech segment at that position and integrates the first non-speech segments with the translated audio code stream file, so that the final translated speech file has the same rhythm, ambient sound, and natural sentence spacing as the original continuous long speech file, adding liveliness to the machine translation and improving the user experience.
Reference Fig. 7, further, in one embodiment of the invention, above-mentioned first parsing module 1, including:
First processing units 10 for passing through the above-mentioned continuous long voice document of voice activity detection analytical technology processing, obtain
Obtain the arrangement state of the first speech frame and the first non-speech frame.
In this embodiment, the first processing unit 10 applies VAD to the continuous long voice file to distinguish the first voice segments from the first non-speech segments for subsequent operations. For example, the continuous long voice file is processed frame by frame, with the frame duration set according to the characteristics of the speech signal, e.g. the 20 ms frame length used in GSM. VAD first detects the start and end points of each first voice segment in the continuous long voice file, and an algorithm then computes the time span of each first voice segment; for instance, the ETSI VAD algorithm from the GSM communication system or the G.729 Annex B VAD algorithm may be used. The VAD extracts characteristic parameter values from the continuous long voice file and compares them against a threshold to distinguish the first voice segments from the first non-speech segments. The arrangement state in this embodiment refers to the ordering information obtained after the VAD analysis by the first processing unit 10, when the continuous long voice file becomes a data file in which runs of consecutive first speech frames alternate with runs of consecutive first non-speech frames.
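The per-frame threshold comparison described above can be sketched as follows. This is a minimal illustration using short-time energy as the feature; real VADs such as the ETSI GSM algorithm or G.729 Annex B use richer feature sets, and the function name and default threshold here are assumptions.

```python
def vad_frame_decisions(samples, frame_len=160, threshold=100.0):
    """Classify each frame as speech (1) or non-speech (0) by comparing a
    per-frame feature (here, mean short-time energy) against a threshold.
    frame_len=160 corresponds to the 20 ms GSM frame length at 8 kHz."""
    decisions = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        decisions.append(1 if energy > threshold else 0)
    return decisions
```

The resulting decision sequence is the arrangement state: runs of 1s mark speech frames and runs of 0s mark non-speech frames.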
The first obtaining unit 11, which obtains each first voice segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the first voice segments and first non-speech segments distinguished by VAD are each given distinct coded markings for identification.
Referring to Fig. 8, further, in one embodiment of the invention, the first obtaining unit 11 includes:
The synthesis subunit 112, which, according to the arrangement state, merges consecutively arranged first speech frames into the respective first voice segments and consecutively arranged first non-speech frames into the respective first non-speech segments.
In this embodiment, first speech frames and first non-speech frames are distinguished by the VAD verdict: a verdict of 1 denotes a first speech frame, and a verdict of 0 denotes a background-noise frame (i.e. a first non-speech frame). Through the synthesis subunit 112, the sentences of the VAD-processed continuous long voice file are merged: consecutive first speech frames 1 through m become first voice segment 1, consecutive first non-speech frames 1 through k become first non-speech segment 1, and so on, until the continuous long voice file becomes a data file in which first voice segments 1 through N alternate one by one with first non-speech segments 1 through N.
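The merging of consecutive identically-judged frames into segments is a run-length grouping, which can be sketched as follows (an illustrative helper, with assumed names):

```python
from itertools import groupby

def merge_frames(decisions):
    """Merge consecutive identically-judged frames into segments.
    decisions: per-frame VAD verdicts (1 = speech frame, 0 = non-speech frame).
    Returns (kind, frame_count) pairs in original timing order, e.g.
    consecutive speech frames 1..m become the first voice segment."""
    return [("speech" if verdict == 1 else "non_speech", len(list(run)))
            for verdict, run in groupby(decisions)]
```

Each returned pair corresponds to one voice segment or one non-speech segment, already in the order in which it occurred in the continuous long voice.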
Further, in one embodiment of the invention, the first obtaining unit 11 also includes:
The extraction subunit 113, which extracts each first non-speech segment.
According to the distinct coded markings of the first voice segments and first non-speech segments, the extraction subunit 113 pulls the first non-speech segments out of the continuous long voice file; for example, first non-speech segments 1, 2, ..., N, coded T1, T2, ..., Tn in order, are extracted.
The storage subunit 114, which stores each first non-speech segment in a non-speech segment buffer according to the timing of its occurrence in the continuous long voice.
In this embodiment, the non-speech segment buffer is set in a designated region of the storage subunit 114, so that the first non-speech segments, kept in the same order as they occurred in the continuous long voice, can later be integrated with the translated audio code stream file and output.
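The extraction and buffering of the non-speech segments in generation order can be sketched as follows; the sequential codes T1, T2, ... mirror the coded markings described above, and the data representation is an assumption.

```python
def buffer_non_speech(segments):
    """Extract the non-speech segments in generation order and tag each with
    a sequential code (T1, T2, ...), as in the coded markings above.
    segments: list of (kind, payload) pairs in timing order."""
    buffer = []
    non_speech = (s for s in segments if s[0] == "non_speech")
    for n, (kind, payload) in enumerate(non_speech, start=1):
        buffer.append((f"T{n}", payload))
    return buffer
```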
With reference to Fig. 9, in one embodiment of the invention, above-mentioned sending/receiving module 2, including:
First transmitting element 20, for continuous long voice document to be sent to speech recognition server.
First receiving unit 21, for receiving above-mentioned speech recognition server feedback with above-mentioned continuous long voice document pair
The first text file answered.
Continuous long voice document is sent to speech recognition server by the present embodiment by the first transmitting element 20, through voice
Identification server is converted to and continuous long corresponding first text file of voice document.
The second transmitting unit 22, which sends the first text file to a translation server.
The second receiving unit 23, which receives from the translation server a second text file in the specified language, produced by translating the first text file.
In this embodiment, the second transmitting unit 22 sends the first text file to the translation server, which translates it into a second text file in the specified language. For example, in Chinese-to-English translation, the Chinese first text file and the English second text file stand in one-to-one correspondence: each Chinese sentence is translated into one English sentence, sentence by sentence.
The third transmitting unit 24, which sends the second text file to a speech synthesis server.
The third receiving unit 25, which receives the audio code stream file produced by the speech synthesis server converting the second text file.
In this embodiment, the third transmitting unit 24 sends the second text file to the speech synthesis server in order, and the server sequentially converts it into an audio code stream file in the specified language, for example an English audio code stream file.
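The three server round trips above form a recognition, translation, and synthesis pipeline, sketched below. The `asr`, `mt`, and `tts` callables are stand-ins for the three servers; their signatures are assumptions, not a real network API.

```python
def translate_pipeline(audio_file, asr, mt, tts, target_lang="en"):
    """Three-hop pipeline: speech recognition, text translation, speech
    synthesis, as performed by the sending/receiving module's six units."""
    first_text = asr(audio_file)               # continuous long voice -> first text file
    second_text = mt(first_text, target_lang)  # first text file -> second text file
    return tts(second_text)                    # second text file -> audio code stream file
```

In practice each hop would be a network request/response pair; the sketch only shows the data flow between the three servers.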
Referring to Figure 10, further, in one embodiment of the invention, the second parsing module 3 includes:
The analysis unit 30, which performs a correspondence analysis between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence.
In this embodiment, the analysis unit 30 comparatively analyzes the character string information of the text files and marks each sentence composed of the character strings in the first and second text files, e.g. the 1st sentence, the 2nd sentence, ..., the Nth sentence, so that comparison yields the one-to-one correspondence between the second text file and the first text file.
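The sentence-level correspondence can be sketched as an index-wise pairing. Splitting on a single separator character is a simplification introduced here for illustration; the patent only requires that sentences correspond pairwise.

```python
def sentence_correspondence(first_text, second_text, sep="."):
    """Build the first-class one-to-one correspondence: the N-th sentence of
    the first text file maps to the N-th sentence of the second text file."""
    src = [s.strip() for s in first_text.split(sep) if s.strip()]
    dst = [s.strip() for s in second_text.split(sep) if s.strip()]
    if len(src) != len(dst):
        raise ValueError("sentence counts differ; cannot align one-to-one")
    return list(zip(src, dst))
```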
The second processing unit 31, which processes the audio code stream file with voice activity detection analysis.
In this embodiment, the second processing unit 31 applies VAD to the audio code stream file to distinguish the second voice segments from the second non-speech segments. Each second voice segment corresponds to one sentence in the second text file, and the N second non-speech segments all have the same time span.
The second obtaining unit 32, which obtains each second voice segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames.
In this embodiment, the second voice segments and second non-speech segments obtained by the second obtaining unit 32 are likewise given distinct coded markings for identification.
The establishing unit 33, which establishes a second-class one-to-one correspondence between the first voice segments and the second voice segments according to the first-class one-to-one correspondence.
Each first voice segment corresponds to one sentence in the first text file, and each second voice segment corresponds to one sentence in the second text file. Using the correspondence between the second and first text files, the establishing unit 33 finds the one-to-one correspondence between the first voice segments and the second voice segments, and thereby determines the one-to-one correspondence between the first non-speech segments and the second non-speech segments, so that replacement can be performed accurately.
The third obtaining unit 34, which obtains, according to the second-class one-to-one correspondence and the timing with which the first voice segments and first non-speech segments occurred in the continuous long voice, the second voice segments and second non-speech segments whose distribution order is identical to that of the first voice segments and first non-speech segments.
In this embodiment, the third obtaining unit 34 uses the second-class one-to-one correspondence and that timing to establish a segment-by-segment correspondence between the translated audio code stream file and the continuous long voice file. This lets the rhythm of the continuous long voice file (e.g. the differing intervals between sentences), its background sounds (e.g. background music, applause), and its natural sentence spacing (i.e. the natural length of the non-speech segments) be integrated with the synthesized, translated audio code stream file, so that the final translated speech file is closer to the original language environment and improves the user experience.
The foregoing is merely a preferred embodiment of the present invention and does not limit its scope. Any equivalent structural or process transformation made using the contents of the description and drawings, whether applied directly or indirectly in other related technical fields, falls within the scope of protection of the present invention.
Claims (10)
1. A translation method for a continuous long voice file, characterized by comprising:
parsing the continuous long voice file to obtain first voice segments and first non-speech segments, wherein the first voice segments and first non-speech segments follow the timing with which they occurred in the continuous long voice;
sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced by the server translating the continuous long voice file;
parsing the audio code stream file to obtain second voice segments and second non-speech segments whose distribution order is identical to that of the first voice segments and first non-speech segments;
replacing, in the audio code stream file, each second non-speech segment with the first non-speech segment at the same sorting position, to obtain the final translated speech file.
2. The translation method for a continuous long voice file according to claim 1, characterized in that the step of parsing the continuous long voice file to obtain the first voice segments and first non-speech segments comprises:
processing the continuous long voice file by voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames;
obtaining each first voice segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
3. The translation method for a continuous long voice file according to claim 2, characterized in that the step of obtaining each first voice segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames comprises:
merging consecutively arranged first speech frames into the respective first voice segments, and consecutively arranged first non-speech frames into the respective first non-speech segments.
4. The translation method for a continuous long voice file according to claim 1, characterized in that the step of sending the continuous long voice file to a server for translation and receiving the audio code stream file produced by the server translating the continuous long voice file comprises:
sending the continuous long voice file to a speech recognition server;
receiving, from the speech recognition server, a first text file corresponding to the continuous long voice file;
sending the first text file to a translation server;
receiving, from the translation server, a second text file in a specified language produced by translating the first text file;
sending the second text file to a speech synthesis server;
receiving the audio code stream file produced by the speech synthesis server converting the second text file.
5. The translation method for a continuous long voice file according to claim 4, characterized in that the step of parsing the audio code stream file to obtain second voice segments and second non-speech segments whose distribution order is identical to that of the first voice segments and first non-speech segments comprises:
performing a correspondence analysis between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;
processing the audio code stream file by voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames;
obtaining each second voice segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames;
establishing a second-class one-to-one correspondence between the first voice segments and the second voice segments according to the first-class one-to-one correspondence;
obtaining, according to the second-class one-to-one correspondence and the timing with which the first voice segments and first non-speech segments occurred in the continuous long voice, the second voice segments and second non-speech segments whose distribution order is identical to that of the first voice segments and first non-speech segments.
6. A translator, characterized by comprising:
a first parsing module for parsing a continuous long voice file to obtain first voice segments and first non-speech segments, wherein the first voice segments and first non-speech segments follow the timing with which they occurred in the continuous long voice;
a sending/receiving module for sending the continuous long voice file to a server for translation, and receiving the audio code stream file produced by the server translating the continuous long voice file;
a second parsing module for parsing the audio code stream file to obtain second voice segments and second non-speech segments whose distribution order is identical to that of the first voice segments and first non-speech segments;
a replacement module for replacing, in the audio code stream file, each second non-speech segment with the first non-speech segment at the same sorting position, to obtain the final translated speech file.
7. The translator according to claim 6, characterized in that the first parsing module comprises:
a first processing unit for processing the continuous long voice file by voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames;
a first obtaining unit for obtaining each first voice segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
8. The translator according to claim 7, characterized in that the first obtaining unit comprises:
a synthesis subunit for merging, according to the arrangement state, consecutively arranged first speech frames into the respective first voice segments and consecutively arranged first non-speech frames into the respective first non-speech segments.
9. The translator according to claim 6, characterized in that the sending/receiving module comprises:
a first transmitting unit for sending the continuous long voice file to a speech recognition server;
a first receiving unit for receiving, from the speech recognition server, a first text file corresponding to the continuous long voice file;
a second transmitting unit for sending the first text file to a translation server;
a second receiving unit for receiving, from the translation server, a second text file in a specified language produced by translating the first text file;
a third transmitting unit for sending the second text file to a speech synthesis server;
a third receiving unit for receiving the audio code stream file produced by the speech synthesis server converting the second text file.
10. The translator according to claim 9, characterized in that the second parsing module comprises:
an analysis unit for performing a correspondence analysis between the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;
a second processing unit for processing the audio code stream file by voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames;
a second obtaining unit for obtaining each second voice segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames;
an establishing unit for establishing a second-class one-to-one correspondence between the first voice segments and the second voice segments according to the first-class one-to-one correspondence;
a third obtaining unit for obtaining, according to the second-class one-to-one correspondence and the timing with which the first voice segments and first non-speech segments occurred in the continuous long voice, the second voice segments and second non-speech segments whose distribution order is identical to that of the first voice segments and first non-speech segments.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711388000.3A CN108090051A (en) | 2017-12-20 | 2017-12-20 | The interpretation method and translator of continuous long voice document |
PCT/CN2018/072007 WO2019119552A1 (en) | 2017-12-20 | 2018-01-09 | Method for translating continuous long speech file, and translation machine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711388000.3A CN108090051A (en) | 2017-12-20 | 2017-12-20 | The interpretation method and translator of continuous long voice document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108090051A true CN108090051A (en) | 2018-05-29 |
Family
ID=62177614
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711388000.3A Pending CN108090051A (en) | 2017-12-20 | 2017-12-20 | The interpretation method and translator of continuous long voice document |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108090051A (en) |
WO (1) | WO2019119552A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101497A (en) * | 2018-07-18 | 2018-12-28 | 深圳市锐曼智能技术有限公司 | Voice collecting translating equipment, system and method |
CN111862940A (en) * | 2020-07-15 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Earphone-based translation method, device, system, equipment and storage medium |
WO2021109000A1 (en) * | 2019-12-03 | 2021-06-10 | 深圳市欢太科技有限公司 | Data processing method and apparatus, electronic device, and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004109658A1 (en) * | 2003-06-02 | 2004-12-16 | International Business Machines Corporation | Voice response system, voice response method, voice server, voice file processing method, program and recording medium |
CN103167360A (en) * | 2013-02-21 | 2013-06-19 | 中国对外翻译出版有限公司 | Method for achieving multilingual subtitle translation |
CN103400580A (en) * | 2013-07-23 | 2013-11-20 | 华南理工大学 | Method for estimating importance degree of speaker in multiuser session voice |
US20140163970A1 (en) * | 2012-11-29 | 2014-06-12 | Huawei Technologies Co., Ltd. | Method for classifying voice conference minutes, device, and system |
CN104252861A (en) * | 2014-09-11 | 2014-12-31 | 百度在线网络技术(北京)有限公司 | Video voice conversion method, video voice conversion device and server |
CN105719642A (en) * | 2016-02-29 | 2016-06-29 | 黄博 | Continuous and long voice recognition method and system and hardware equipment |
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
CN107391498A (en) * | 2017-07-28 | 2017-11-24 | 深圳市沃特沃德股份有限公司 | Voice translation method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008243080A (en) * | 2007-03-28 | 2008-10-09 | Toshiba Corp | Device, method, and program for translating voice |
CN101458681A (en) * | 2007-12-10 | 2009-06-17 | 株式会社东芝 | Voice translation method and voice translation apparatus |
CN101727904B (en) * | 2008-10-31 | 2013-04-24 | 国际商业机器公司 | Voice translation method and device |
CN105912533B (en) * | 2016-04-12 | 2019-02-12 | 苏州大学 | Long sentence cutting method and device towards neural machine translation |
CN106303695A (en) * | 2016-08-09 | 2017-01-04 | 北京东方嘉禾文化发展股份有限公司 | Audio translation multiple language characters processing method and system |
- 2017-12-20: CN application CN201711388000.3A filed; patent CN108090051A (en), status Pending
- 2018-01-09: WO application PCT/CN2018/072007 filed; publication WO2019119552A1 (en), Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004109658A1 (en) * | 2003-06-02 | 2004-12-16 | International Business Machines Corporation | Voice response system, voice response method, voice server, voice file processing method, program and recording medium |
US20140163970A1 (en) * | 2012-11-29 | 2014-06-12 | Huawei Technologies Co., Ltd. | Method for classifying voice conference minutes, device, and system |
CN103167360A (en) * | 2013-02-21 | 2013-06-19 | 中国对外翻译出版有限公司 | Method for achieving multilingual subtitle translation |
CN103400580A (en) * | 2013-07-23 | 2013-11-20 | 华南理工大学 | Method for estimating importance degree of speaker in multiuser session voice |
CN104252861A (en) * | 2014-09-11 | 2014-12-31 | 百度在线网络技术(北京)有限公司 | Video voice conversion method, video voice conversion device and server |
CN105719642A (en) * | 2016-02-29 | 2016-06-29 | 黄博 | Continuous and long voice recognition method and system and hardware equipment |
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
CN107391498A (en) * | 2017-07-28 | 2017-11-24 | 深圳市沃特沃德股份有限公司 | Voice translation method and device |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101497A (en) * | 2018-07-18 | 2018-12-28 | 深圳市锐曼智能技术有限公司 | Voice collecting translating equipment, system and method |
WO2021109000A1 (en) * | 2019-12-03 | 2021-06-10 | 深圳市欢太科技有限公司 | Data processing method and apparatus, electronic device, and storage medium |
CN111862940A (en) * | 2020-07-15 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Earphone-based translation method, device, system, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2019119552A1 (en) | 2019-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107169430B (en) | Reading environment sound effect enhancement system and method based on image processing semantic analysis | |
CN103700370B (en) | A kind of radio and television speech recognition system method and system | |
CN108364632B (en) | Emotional Chinese text voice synthesis method | |
CN110147451B (en) | Dialogue command understanding method based on knowledge graph | |
CN110853649A (en) | Label extraction method, system, device and medium based on intelligent voice technology | |
CN102063904B (en) | Melody extraction method and melody recognition system for audio files | |
CN109714608B (en) | Video data processing method, video data processing device, computer equipment and storage medium | |
CN104143329A (en) | Method and device for conducting voice keyword search | |
CN104166462A (en) | Input method and system for characters | |
CN106157951B (en) | Carry out the automatic method for splitting and system of audio punctuate | |
CN107578769A (en) | Speech data mask method and device | |
CN105336342B (en) | Speech recognition result evaluation method and system | |
CN108090051A (en) | The interpretation method and translator of continuous long voice document | |
CN105895103A (en) | Speech recognition method and device | |
CN106782615A (en) | Speech data emotion detection method and apparatus and system | |
CN103164403A (en) | Generation method of video indexing data and system | |
CN110111778A (en) | A kind of method of speech processing, device, storage medium and electronic equipment | |
CN105869628A (en) | Voice endpoint detection method and device | |
CN111489743A (en) | Operation management analysis system based on intelligent voice technology | |
Alghifari et al. | On the use of voice activity detection in speech emotion recognition | |
CN106550268B (en) | Video processing method and video processing device | |
CN103474075B (en) | Voice signal sending method and system, method of reseptance and system | |
CN112466287B (en) | Voice segmentation method, device and computer readable storage medium | |
CN109817223A (en) | Phoneme marking method and device based on audio fingerprints | |
CN103474067A (en) | Voice signal transmission method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180529 |