WO2019119552A1

WO2019119552A1 - Method for translating continuous long speech file, and translation machine

Info

Publication number: WO2019119552A1
Application number: PCT/CN2018/072007
Authority: WO
Inventors: 郑勇; 金志军; 王文祺
Original assignee: 深圳市沃特沃德股份有限公司
Priority date: 2017-12-20
Filing date: 2018-01-09
Publication date: 2019-06-27
Also published as: CN108090051A

Abstract

A method for translating a continuous long speech file, and a translation machine. The method for translating a continuous long speech file comprises: parsing a continuous long speech file to obtain first speech segments and first non-speech segments (S1), wherein the first speech segments and the first non-speech segments are distributed according to a time sequence generated in the continuous long speech; sending the continuous long speech file to a server for translation, and receiving an audio code stream file obtained after the server translates the continuous long speech file (S2); parsing the audio code stream file to obtain second speech segments and second non-speech segments having the same distribution sequence as that of the first speech segments and the first non-speech segments (S3); and replacing the second non-speech segments with the first non-speech segments at the same ranking positions in the audio code stream file to obtain a final translated speech file (S4). The rhythm, language background sound, and sentence natural interval of a continuous long speech file are reserved, and the use experience of users is improved.

Description

Continuous long speech file translation method and translation machine

Technical field

The present invention relates to electronic translation techniques, and more particularly to a translation method and a translation machine for continuous long speech files.

Background technique

In the field of electronic translation, continuous speech files in educational, recording and other application scenarios are coordinated by a speech recognition engine, a translation engine, and a synthesis engine to obtain a translated voice file, which is outputted by an electronic device terminal such as a translation machine, which is convenient. The communication and communication between users of different languages provides great convenience for people's lives. However, the existing translation engine has no background noise information for the voice file translated and outputted for each voice segment in the continuous long voice file, and the statement interval in the voice file is a preset fixed output interval, so that the translated voice file is separated. The rhythm of the original continuous long speech file and the natural interval of the sentence, the language environment and taste of the original continuous long speech file are lost, and the user experience performance is poor.

Therefore, the prior art has yet to be improved.

technical problem

The main object of the present invention is to provide a method for translating a continuous long speech file, which aims to solve the technical problem that the rhythm of the original continuous long speech file and the natural interval of the sentence cannot be preserved in the existing translation technology.

Technical solution

The invention provides a method for translating a continuous long speech file, comprising:

Parsing the continuous long speech file to obtain each of the first speech segments and each of the first non-speech segments, wherein each of the first speech segments and each of the first non-speech segments are distributed according to a time sequence generated in the continuous long speech;

Transmitting the continuous long voice file to the server for translation, and receiving the audio stream file after the server translates the continuous long voice file;

Parsing the audio stream file to obtain each second speech segment and each second non-speech segment having the same distribution order as each of the first speech segment and each of the first non-speech segments;

The first non-speech segment of the same sorting position is replaced with the second non-speech segment in the audio stream file to obtain a final translated voice file.

Preferably, the step of parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment comprises:

Processing the continuous long speech file by using a voice activity detection and analysis technology to obtain an arrangement state of the first speech frame and the first non-speech frame;

Each of the first speech segments and each of the first non-speech segments is obtained according to the arrangement state of the first speech frame and the first non-speech frame.

Preferably, the step of obtaining each of the first voice segment and each of the first non-speech segments according to the arrangement state of the first voice frame and the first non-speech frame includes:

The successively arranged first speech frames are respectively synthesized into the respective first speech segments, and the successively arranged first non-speech frames are respectively synthesized into the respective first non-speech segments.

Preferably, after the step of synthesizing the first voice segments that are consecutively arranged into the first voice segments and combining the first non-speech frames that are consecutively arranged into the first non-speech segments, the method includes:

Extracting each of the first non-speech segments;

And storing each of the first non-speech segments in a non-speech segment buffer according to a sequence generated in the continuous long speech;

Preferably, the step of transmitting the continuous long voice file to the server for translation, and receiving the audio stream file after the server translates the continuous long voice file comprises:

Sending a continuous long voice file to the voice recognition server;

Receiving, by the voice recognition server, the first text file corresponding to the continuous long voice file;

Sending the first text file above to the translation server;

Receiving, by the above translation server, a second text file of the specified language after translating the first text file;

Sending the second text file to the speech synthesis server;

Receiving the audio stream file after the voice synthesis server converts the second text file.

Preferably, the step of parsing the audio stream file to obtain the second speech segments and the second non-speech segments in the same order as the first speech segments and the first non-speech segments, includes:

And correspondingly analyzing the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence relationship;

Processing the audio stream file by using a voice activity detection and analysis technology to obtain an arrangement state of the second voice frame and the second non-voice frame;

Obtaining each second speech segment and each second non-speech segment according to the arrangement state of the second speech frame and the second non-speech frame;

And establishing a second type one-to-one correspondence between each of the first voice segment and each of the second voice segments according to the first type one-to-one correspondence relationship;

Obtaining the first speech segment and each of the first non-speech according to the second type one-to-one correspondence relationship, and each first speech segment and each first non-speech segment according to a sequence generated in the continuous long speech Each of the second speech segments and the second non-speech segments are in the same order.

The invention also provides a translation machine comprising:

a first parsing module, configured to parse the continuous long speech file, to obtain each of the first speech segments and each of the first non-speech segments, wherein each of the first speech segments and each of the first non-speech segments are generated according to the continuous long speech Timing distribution

a sending and receiving module, configured to send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file;

a second parsing module, configured to parse the audio stream file, and obtain each second speech segment and each second non-speech segment that are in the same order as each of the first speech segment and each of the first non-speech segments;

And a replacement module, configured to replace the first non-speech segment of the same sorting position with the second non-speech segment in the audio stream file to obtain a final translated voice file.

Preferably, the first parsing module includes:

a first processing unit, configured to process the continuous long voice file by using a voice activity detection and analysis technology, to obtain an arrangement state of the first voice frame and the first non-speech frame;

And a first obtaining unit, configured to obtain each of the first speech segments and each of the first non-speech segments according to the arrangement state of the first speech frame and the first non-speech frame.

Preferably, the first obtaining unit includes:

And a synthesizing sub-unit, configured to respectively synthesize the first speech segments that are consecutively arranged according to the foregoing arrangement state, and synthesize the first non-speech frames that are consecutively arranged into the first non-speech segments.

Preferably, the first obtaining unit further includes:

Extracting a subunit for extracting each of the first non-speech segments;

And a storage subunit, configured to store each of the first non-speech segments in a non-speech segment buffer according to a sequence generated in the continuous long speech.

Preferably, the sending and receiving module includes:

a first sending unit, configured to send the continuous long voice file to the voice recognition server;

a first receiving unit, configured to receive a first text file corresponding to the continuous long voice file fed back by the voice recognition server;

a second sending unit, configured to send the first text file to the translation server;

a second receiving unit, configured to receive a second text file of the specified language after the translation of the first text file that is fed back by the translation server;

a third sending unit, configured to send the second text file to the voice synthesis server;

And a third receiving unit, configured to receive the audio stream file after the voice synthesis server converts the second text file.

Preferably, the foregoing second parsing module includes:

The analyzing unit is configured to analyze the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence;

a second processing unit, configured to process the audio stream file by using a voice activity detection and analysis technology, to obtain an arrangement state of the second voice frame and the second non-speech frame;

a second obtaining unit, configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frame and the second non-speech frame;

a establishing unit, configured to establish, according to the first type one-to-one correspondence, the second type one-to-one correspondence between each of the first voice segment and each of the second voice segments;

a third obtaining unit, configured to obtain, according to the second type one-to-one correspondence relationship, and each first speech segment and each first non-speech segment according to a sequence generated in the continuous long speech, obtain the first speech Each of the segments and each of the first non-speech segments are distributed in the same order of the second speech segment and each of the second non-speech segments.

Beneficial effect

The invention distinguishes the original continuous long speech file into a speech segment and a non-speech segment, and retains the same non-speech segment as the original continuous long speech file, so that the translated audio stream code file has a compared with the original continuous long speech file. Almost the same rhythm, language background sounds and natural intervals of sentences increase the vivid vitality of machine translation and improve the user experience.

DRAWINGS

1 is a schematic flow chart of a method for translating a continuous long speech file according to an embodiment of the present invention;

2 is a schematic flow chart of step S1 of an embodiment of the present invention;

FIG. 3 is a schematic flowchart of step S11 according to an embodiment of the present invention;

4 is a schematic flow chart of step S2 of an embodiment of the present invention;

FIG. 5 is a schematic flowchart of step S3 according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a translation machine according to an embodiment of the present invention; FIG.

FIG. 7 is a schematic structural diagram of a first parsing module according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a first obtaining unit according to an embodiment of the present invention; FIG.

FIG. 9 is a schematic structural diagram of a transmitting and receiving module according to an embodiment of the present invention; FIG.

FIG. 10 is a schematic structural diagram of a second parsing module according to an embodiment of the present invention.

The implementation, functional features, and advantages of the present invention will be further described in conjunction with the embodiments.

BEST MODE FOR CARRYING OUT THE INVENTION

It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to FIG. 1, a method for translating a continuous long voice file according to an embodiment of the present invention includes:

S1: Parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein each first speech segment and each first non-speech segment are distributed according to a time sequence generated in the continuous long speech.

The terminal device of this embodiment takes a translation machine as an example. In this step, by analyzing the continuous long speech file, a data file of alternately arranged intervals between each first speech segment and each first non-speech segment is obtained, and each first speech segment and each first non-speech segment are continuous in the above-mentioned continuous length. The time distribution generated in the voice is expressed, for example, as: a first speech segment 1, a first non-speech segment 1, a first speech segment 2, a first non-speech segment 2, a first speech segment 3, a first non-speech segment 3, ..., the first speech segment N, the first non-speech segment N.

S2: Send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file.

This step refers to a process in which a continuous long speech file is sequentially sent by a translation machine to a speech recognition server, a translation server, and a speech synthesis server for translation. The audio stream file of this embodiment refers to corresponding audio data obtained after translating a continuous long speech file, including voice data and non-speech data.

S3: Parsing the audio stream file to obtain each second speech segment and each second non-speech segment having the same distribution order as the first speech segment and each of the first non-speech segments.

The audio stream file in this embodiment is the audio data obtained by translating the continuous long speech files one by one, so each second speech segment and each second non-speech segment in the audio stream file and each first speech segment and Each of the first non-speech segments is distributed in the same order.

S4: replacing the first non-speech segment of the same sorting position with the same distribution order in the audio stream file to replace the second non-speech segment to obtain a final translated voice file.

In this embodiment, in the audio stream file with the same distribution order of the continuous long voice files, the second non-speech file at the same sorting position is replaced with the first non-speech segment, and the first non-speech segment and the translated audio are The code stream files are integrated, so that the final translated voice file has the same rhythm, language background sound and natural interval of the sentence as the original continuous long voice file, which increases the vivid vitality of the machine translation and improves the user experience.

Referring to FIG. 2, in an embodiment of the present invention, step S1 includes:

S10: processing the continuous long speech file by using a voice activity detection and analysis technology to obtain an arrangement state of the first speech frame and the first non-speech frame.

In this embodiment, the first voice segment and the first non-speech segment in the continuous long voice file are distinguished by a VAD (Voice Activity Detection) for the continuous long voice file by the translation machine, so as to facilitate subsequent operations. For example, continuous speech files are processed by frame, and the duration of each frame is set according to the characteristics of the voice signal. For example, the time of 20 ms of GSM is the frame length. Firstly, the start and end of each first speech segment in the continuous long speech file are detected by the VAD, and the time length of each first speech segment of the continuous long speech file is obtained by an algorithm, such as using the ETSI VAD in the GSM communication system. The algorithm or the G.729 Annex B VAD algorithm compares the parameter feature values of the continuous long speech files extracted by the VAD with a threshold value to distinguish the first speech segment from the first non-speech segment. The arrangement state of this step refers to the arrangement when the continuous long speech file becomes a data file in which the first speech frame of the continuous arrangement and the first non-speech frame of the continuous arrangement are alternately distributed after the speech activity detection analysis processing. information.

S11: Obtain each first speech segment and each first non-speech segment according to the arrangement state of the first speech frame and the first non-speech frame.

In this embodiment, each of the first speech segments and the first non-speech segments separated by the VAD are respectively marked with different coding marks for identification.

Referring to FIG. 3, in an embodiment of the present invention, step S11 includes:

S112: synthesize the first speech frames that are consecutively arranged into the first speech segments, and synthesize the first non-speech frames that are consecutively arranged into the first non-speech segments.

In this embodiment, the first speech frame and the first non-speech frame are discriminated by the result of the VAD, for example, the judgment result is 1, which is the first speech frame; and the judgment result is 0, that is, the background noise frame (ie, the first non-speech frame). And changing the statement in the continuous long speech file into the first speech segment 1 in which the VDD-processed consecutively arranged first speech frame 1 to the first speech frame m are merged, and the first non-speech frame 1 continuously arranged The first non-speech segment 1 merged into the first non-speech frame k is sequentially processed to change the continuous long speech file into the first speech segment 1 to the first speech segment N, and the first non-speech continuously arranged A data file in which the segment 1 to the first non-speech segment N are alternately distributed.

Further, in an embodiment of the present invention, after step S112,

S113: Extract each of the foregoing first non-speech segments.

Extracting the first non-speech segment in the continuous long speech file according to different coding flags of the first speech segment and each of the first non-speech segments, for example, the first non-speech code that is sequentially encoded as T1, T2...

Tn Segments

1, 2...N are extracted.

S114: Store each of the first non-speech segments in a non-speech segment buffer according to a sequence generated in the continuous long speech.

The non-speech segment buffer of this step is set in a designated area of the translation machine memory, so as to sequentially convert the translated audio stream file and the first order according to the same arrangement order of the first non-speech segments in the continuous long speech. A non-speech segment is combined and output.

Referring to FIG. 4, in an embodiment of the present invention, step S2 includes:

S20: Send the continuous long voice file to the voice recognition server.

S21: Receive a first text file corresponding to the continuous long voice file fed back by the voice recognition server.

This step is a first text file corresponding to the continuous long voice file obtained by the voice recognition server.

S22: Send the first text file above to the translation server.

S23: Receive a second text file of the specified language after the translation of the first text file, which is fed back by the translation server.

This step translates the first text file through the translation server to form a second text file of the specified language. For example, the Chinese translation into English, the first text file of Chinese and the second text file of English have a one-to-one correspondence, that is, a sentence translated into English in Chinese, one sentence per sentence.

S24: Send the second text file to the voice synthesis server.

S25: Receive an audio stream file after the voice synthesis server converts the second text file.

In this step, the second text file is sequentially sent to the speech synthesis server to sequentially convert the second text file into an audio stream file of a specified language, for example, an audio stream file in English.

Referring to FIG. 5, in an embodiment of the present invention, step S3 includes:

S30: Correlate the first character string information of the first text file with the second character string information of the second text file to obtain a first-class one-to-one correspondence.

This step marks each sentence of the first text file and the second text file by comparing the string information of the text file, for example: the first sentence, the second sentence, ... The Nth sentence, in order to obtain a one-to-one correspondence between the second text file and the first text file by corresponding comparison.

S31: processing the audio stream file by using a voice activity detection and analysis technology to obtain an arrangement state of the second voice frame and the second non-voice frame.

In this step, the audio stream file is processed by the VAD to distinguish the second speech segment and the second non-speech segment in the audio stream file, and each second speech segment corresponds to a sentence in the second text file, N second Non-speech segments have the same length of time.

S32: Obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frame and the second non-speech frame.

In this embodiment, each of the second speech segments and the second non-speech segments separated by the VAD are also respectively labeled with different coding marks for identification.

S33: Establish a second type one-to-one correspondence between each of the first voice segment and each of the second voice segments according to the first type one-to-one correspondence.

Each of the first speech segments corresponds to a sentence in the first text file, and each of the second speech segments corresponds to a sentence in the second text file, and each one is found according to the one-to-one correspondence between the second text file and the first text file. A one-to-one correspondence between the first speech segment and each of the second speech segments to determine a one-to-one correspondence between each of the first non-speech segments and each of the second non-speech segments for accurate replacement.

S34: Obtain, according to the second type one-to-one correspondence, and each of the first speech segments and each of the first non-speech segments, according to the timing generated in the continuous long speech, and the first speech segments and the first non-speech Each of the second speech segments and the second non-speech segments are in the same order.

In this embodiment, the translated audio stream file and the continuous long voice file are performed by the second type one-to-one correspondence, and each first voice segment and each first non-speech segment are generated according to the timing generated in the continuous long voice. One-to-one correspondence of speech segments, so that the rhythm of continuous speech files (such as different intervals between each sentence and each sentence), language background sounds (such as background music, applause, etc.) and natural intervals of sentences (ie, non-speech segments) The natural length) can be better integrated with the translated audio stream file, so that the final translated voice file is closer to the original language environment and improves the user experience.

Referring to FIG. 6, a translation machine according to an embodiment of the present invention includes:

The first parsing module 1 is configured to parse the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein each first speech segment and each first non-speech segment are in the continuous long speech The resulting timing distribution.

The terminal device of this embodiment takes a translation machine as an example. In this embodiment, the continuous speech file is parsed by the first parsing module 1 to obtain data files of alternate intervals of the first speech segment and each of the first non-speech segments, each first speech segment and each first non-speech The segment is distributed according to the time series generated in the continuous continuous speech, for example, as: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, A non-speech segment 3, ..., a first speech segment N, and a first non-speech segment N.

The sending and receiving module 2 is configured to send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file.

In this embodiment, the continuous and long voice files are sequentially sent to the voice recognition server, the translation server, and the voice synthesis server through the transmitting and receiving module 2 for translation. The audio stream file of this embodiment refers to corresponding audio data obtained after translating a continuous long speech file, including voice data and non-speech data.

The second parsing module 3 is configured to parse the audio stream file, and obtain each second speech segment and each second non-speech segment that are in the same order as the first speech segment and each of the first non-speech segments.

The audio stream file in this embodiment is the audio data obtained by translating the continuous long speech file one by one, so that the second stream segment file and the second non-voice are obtained by parsing the audio stream file through the second parsing module 3. The segment is distributed in the same order as each of the first speech segments and each of the first non-speech segments.

The replacing module 4 is configured to replace the first non-speech segment of the same sorting position with the second non-speech segment in the audio stream file to obtain a final translated voice file.

In this embodiment, in the audio stream file with the same distribution order of the continuous long voice files, the second non-speech file at the same sorting position is replaced by the replacement module 4 with the first non-speech segment, and the first non-speech segment is The translated audio stream file is integrated, so that the final translated voice file has the same rhythm, language background sound and natural interval of the sentence as the original continuous long voice file, which increases the vivid vitality of machine translation and improves the user experience.

Referring to FIG. 7, further, in an embodiment of the present invention, the first parsing module 1 includes:

The first processing unit 10 is configured to process the continuous long speech file by using a voice activity detection and analysis technology to obtain an arrangement state of the first speech frame and the first non-speech frame.

In this embodiment, the first processing unit 10 performs VAD on the continuous long voice file, and the first voice segment and the first non-speech segment in the continuous long voice file are distinguished to facilitate subsequent operations. For example, continuous speech files are processed by frame, and the duration of each frame is set according to the characteristics of the voice signal. For example, the time of 20 ms of GSM is the frame length. Firstly, the start and end of each first speech segment in the continuous long speech file are detected by the VAD, and the time length of each first speech segment of the continuous long speech file is obtained by an algorithm, such as using the ETSI VAD in the GSM communication system. The algorithm or the G.729 Annex B VAD algorithm compares the parameter feature values of the continuous long speech files extracted by the VAD with a threshold value to distinguish the first speech segment from the first non-speech segment. The arrangement state of the present embodiment means that after the first activity processing detection and analysis processing by the first processing unit 10, the continuous speech file becomes a continuous arrangement of the first speech frame and the consecutively arranged first non-speech frames are alternately distributed. Arrangement information when the data file.

The first obtaining unit 11 is configured to obtain each of the first speech segments and each of the first non-speech segments according to the arrangement state of the first speech frame and the first non-speech frame.

Referring to FIG. 8, further, in an embodiment of the present invention, the first obtaining unit 11 includes:

The synthesizing sub-unit 112 is configured to synthesize the first speech frames that are consecutively arranged into the first speech segments according to the arrangement state, and synthesize the first non-speech frames that are consecutively arranged into the first non-speech segments.

In this embodiment, the first speech frame and the first non-speech frame are discriminated by the result of the VAD, for example, the judgment result is 1, which is the first speech frame; and the judgment result is 0, that is, the background noise frame (ie, the first non-speech frame). And the sentence in the continuous long speech file is changed into the first speech segment 1 of the continuously arranged first speech frame 1 to the first speech frame m after the VAD processing by the synthesizing sub-unit 112, and the consecutively arranged The first non-speech segment 1 from which the non-speech frame 1 to the first non-speech frame k are merged is sequentially processed, and the continuous long speech file is changed into the first speech segment 1 to the first speech segment N which are continuously arranged, and is continuously arranged. The first non-speech segment 1 to the first non-speech segment N are one-to-one data files alternately distributed.

Further, in an embodiment of the present invention, the first obtaining unit 11 further includes:

The extracting sub-unit 113 is configured to extract each of the first non-speech segments.

And extracting, by the extracting sub-unit 113, the first non-speech segment in the continuous long speech file according to different coding marks of the first speech segment and each of the first non-speech segments, for example, sequentially encoding T1, T2, ... Tn The first

non-speech segments

1, 2...N are extracted.

The storage sub-unit 114 is configured to store each of the first non-speech segments in a non-speech segment buffer according to a sequence generated in the continuous long speech.

The non-speech segment buffer area of this embodiment is set in a designated area of the storage sub-unit 114, so that the translated audio stream file is sequentially sequentially according to the same arrangement order of the first non-speech segments in the generation of the continuous long speech. It is integrated with the first non-speech segment and then output.

Referring to FIG. 9, in an embodiment of the present invention, the sending and receiving module 2 includes:

The first sending unit 20 is configured to send the continuous long voice file to the voice recognition server.

The first receiving unit 21 is configured to receive a first text file corresponding to the continuous long voice file fed back by the voice recognition server.

In this embodiment, the continuous transmission voice file is sent to the voice recognition server by the first sending unit 20, and converted into a first text file corresponding to the continuous long voice file by the voice recognition server.

The second sending unit 22 is configured to send the first text file to the translation server.

The second receiving unit 23 is configured to receive a second text file of the specified language after the translation of the first text file that is fed back by the translation server.

In this embodiment, the second sending unit 22 sends the first text file to the translation server, and the first text file is translated by the translation server to form a second text file of the specified language. For example, Chinese is translated into English, and the first text file of Chinese and the second text file of English are in one-to-one correspondence, that is, a sentence in which a Chinese sentence is translated into English, and each sentence corresponds one-to-one.

The third sending unit 24 is configured to send the second text file to the voice synthesis server.

The third receiving unit 25 is configured to receive the audio stream file after the voice synthesis server converts the second text file.

In this embodiment, the second text file is sequentially sent to the voice synthesis server by the third sending unit 24 to sequentially convert the second text file into an audio stream file of a specified language, for example, an audio stream file of English.

Referring to FIG. 10, further, in an embodiment of the present invention, the second parsing module 3 includes:

The analyzing unit 30 is configured to analyze the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence.

In this embodiment, the analysis unit 30 compares and analyzes the character string information of the text file, and marks each sentence of the first text file and the second text file, for example, the first sentence and the second sentence. Words, ..., Nth sentence, in order to obtain a one-to-one correspondence between the second text file and the first text file by corresponding comparison.

The second processing unit 31 is configured to process the audio stream file by using a voice activity detection and analysis technology.

In this embodiment, the second processing unit 31 performs VAD processing on the audio stream file to distinguish the second speech segment and the second non-speech segment in the audio stream file, and each second speech segment corresponds to the second text file. In one sentence, the N second non-speech segments have the same length of time.

The second obtaining unit 32 is configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frame and the second non-speech frame.

In this embodiment, each second speech segment and each second non-speech segment are obtained by the second obtaining unit 32, and different coding marks are also respectively used for identification.

The establishing unit 33 is configured to establish a second type one-to-one correspondence between each of the first voice segment and each of the second voice segments according to the first type one-to-one correspondence.

Each of the first speech segments corresponds to a sentence in the first text file, each second speech segment corresponds to a sentence in the second text file, and the establishing unit 33 is configured according to the one-to-one correspondence between the second text file and the first text file. And finding a one-to-one correspondence between each of the first speech segments and each of the second speech segments, so as to determine a one-to-one correspondence between each of the first non-speech segments and each of the second non-speech segments, so as to be accurately replaced.

a third obtaining unit 34, configured to obtain, according to the second type one-to-one correspondence relationship, and each first speech segment and each first non-speech segment according to a sequence generated in the continuous long speech, obtain the first speech segment Each of the second speech segments and each of the second non-speech segments are arranged in the same order as the first non-speech segments.

The third obtaining unit 34 of the embodiment obtains the translated audio stream file by using the second type one-to-one correspondence, and each first speech segment and each first non-speech segment according to the timing generated in the continuous long speech. And the one-to-one correspondence of the speech segments of the continuous long speech file, so that the rhythm of the continuous long speech file (such as the different interval between each sentence and each sentence), the language background sound (such as background music, applause, etc.) and the statement nature The interval (that is, the natural length of the non-speech segment) can be better integrated with the translated audio stream file, so that the final translated voice file is closer to the original language environment and improves the user experience.

The above is only the preferred embodiment of the present invention, and is not intended to limit the scope of the invention, and the equivalent structure or equivalent process transformations made by the description of the invention and the drawings are directly or indirectly applied to other related The technical field is equally included in the scope of patent protection of the present invention.

Claims

A method for translating a continuous long speech file, comprising:

Parsing the continuous long speech file to obtain each of the first speech segments and each of the first non-speech segments, wherein each of the first speech segments and each of the first non-speech segments are distributed according to a time sequence generated in the continuous long speech;

Transmitting the continuous long voice file to a server for translation, and receiving an audio stream file after the server translates the continuous long voice file;

Parsing the audio stream file to obtain each second speech segment and each second non-speech segment having the same distribution order as the first speech segment and each first non-speech segment;

The first non-speech segment of the same sorting position is replaced with the second non-speech segment in the audio stream file to obtain a final translated voice file.
The method for translating a continuous long speech file according to claim 1, wherein the step of parsing the continuous long speech file to obtain each of the first speech segment and each of the first non-speech segments comprises:

Processing the continuous long voice file by using a voice activity detection and analysis technology to obtain an arrangement state of the first voice frame and the first non-speech frame;

Each of the first speech segments and each of the first non-speech segments is obtained according to an arrangement state of the first speech frame and the first non-speech frame.
The method for translating a continuous long speech file according to claim 2, wherein the first speech segment and each first non-speech are obtained according to an arrangement state of the first speech frame and the first non-speech frame Steps of the paragraph, including:

The successively arranged first speech frames are respectively synthesized into the first speech segments, and the consecutively arranged first non-speech frames are respectively synthesized into the first non-speech segments.
The method for translating a continuous long speech file according to claim 3, wherein the firstly arranged first speech frames are respectively synthesized into the first speech segments, and the first non-speech frames arranged consecutively After separately synthesizing each of the first non-speech segments, the method includes:

Extracting each of the first non-speech segments;

Each of the first non-speech segments is stored in a non-speech segment buffer according to a timing generated in the continuous long speech.
The method for translating a continuous long speech file according to claim 1, wherein the transmitting the continuous long speech file to a server for translation, and receiving the audio stream after the server translates the continuous long speech file The steps of the file, including:

Sending a continuous long voice file to the voice recognition server;

Receiving, by the voice recognition server, a first text file corresponding to the continuous long voice file;

Sending the first text file to a translation server;

Receiving, by the translation server, a second text file of a specified language after translating the first text file;

Sending the second text file to a voice synthesis server;

Receiving an audio stream file after the voice synthesis server converts the second text file.
The method for translating a continuous long speech file according to claim 5, wherein the parsing the audio stream file to obtain the same distribution order as the first speech segments and the first non-speech segments The steps of the second voice segment and each of the second non-speech segments include:

And correspondingly analyzing the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence relationship;

Processing the audio code stream file by using a voice activity detection and analysis technology to obtain an arrangement state of the second voice frame and the second non-speech frame;

Obtaining each second speech segment and each second non-speech segment according to an arrangement state of the second speech frame and the second non-speech frame;

Establishing a first-to-one correspondence between the first voice segment and each of the second voice segments according to the first-class one-to-one correspondence;

Obtaining, according to the second type one-to-one correspondence relationship, and each of the first speech segments and each of the first non-speech segments according to the timing generated in the continuous long speech, obtaining the first speech segment and each of the first non-speech segments Each of the second speech segments and the second non-speech segments are in the same order.
A translation machine, comprising:

a first parsing module, configured to parse a continuous long speech file, to obtain each first speech segment and each first non-speech segment, wherein each first speech segment and each first non-speech segment are generated according to the continuous long speech Timing distribution

a sending and receiving module, configured to send the continuous long voice file to a server for translation, and receive an audio stream file after the server translates the continuous long voice file;

a second parsing module, configured to parse the audio stream file, and obtain each second speech segment and each second non-speech segment that are in the same order as the first speech segment and each first non-speech segment;

And a replacement module, configured to replace the first non-speech segment of the same sorting position with the second non-speech segment in the audio stream file to obtain a final translated voice file.
The translation machine according to claim 7, wherein the first parsing module comprises:

a first processing unit, configured to process the continuous long voice file by using a voice activity detection and analysis technology, to obtain an arrangement state of the first voice frame and the first non-speech frame;

And a first obtaining unit, configured to obtain each of the first voice segment and each of the first non-speech segments according to the arrangement state of the first voice frame and the first non-speech frame.
The translation machine according to claim 8, wherein the first obtaining unit comprises:

a synthesizing subunit, configured to respectively synthesize the first speech frames that are consecutively arranged according to the arrangement state into each of the first speech segments, and synthesize the first non-speech frames that are consecutively arranged into the first non-speech segment.
The translation machine according to claim 9, wherein the first obtaining unit further comprises:

Extracting a subunit for extracting each of the first non-speech segments;

And a storage subunit, configured to store each of the first non-speech segments in a non-speech segment buffer according to a sequence generated in the continuous long speech.
The translation machine according to claim 7, wherein the transmitting and receiving module comprises:

a first sending unit, configured to send the continuous long voice file to the voice recognition server;

a first receiving unit, configured to receive a first text file corresponding to the continuous long voice file fed back by the voice recognition server;

a second sending unit, configured to send the first text file to a translation server;

a second receiving unit, configured to receive a second text file of the specified language after the translation of the first text file that is fed back by the translation server;

a third sending unit, configured to send the second text file to a voice synthesis server;

And a third receiving unit, configured to receive an audio stream file after the voice synthesis server converts the second text file.
The translation machine according to claim 11, wherein the second parsing module comprises:

The analyzing unit is configured to analyze the first character string information of the first text file and the second character string information of the second text file to obtain a first-class one-to-one correspondence relationship;

a second processing unit, configured to process the audio code stream file by using a voice activity detection and analysis technology, to obtain an arrangement state of the second voice frame and the second non-speech frame;

a second obtaining unit, configured to obtain each second speech segment and each second non-speech segment according to an arrangement state of the second speech frame and the second non-speech frame;

And a establishing unit, configured to establish, according to the first type one-to-one correspondence, a first-to-one correspondence between each of the first voice segment and each of the second voice segments;

a third obtaining unit, configured to obtain, according to the second type one-to-one correspondence, and each first speech segment and each first non-speech segment according to a sequence generated in the continuous long speech, obtain the first Each of the second speech segment and each of the second non-speech segments in the same order in which the speech segment and each of the first non-speech segments are distributed.