WO2019119552A1 - Method for translating continuous long speech file, and translation machine - Google Patents

Method for translating continuous long speech file, and translation machine

Info

Publication number
WO2019119552A1
WO2019119552A1 (PCT/CN2018/072007)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
file
voice
continuous long
segment
Application number
PCT/CN2018/072007
Other languages
French (fr)
Chinese (zh)
Inventor
郑勇
金志军
王文祺
Original Assignee
深圳市沃特沃德股份有限公司
Application filed by 深圳市沃特沃德股份有限公司
Publication of WO2019119552A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/40 — Processing or translation of natural language
    • G06F 40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • the present invention relates to electronic translation techniques, and more particularly to a translation method and a translation machine for continuous long speech files.
  • in the field of electronic translation, continuous long speech files in education, recording, and other application scenarios are processed by a speech recognition engine, a translation engine, and a synthesis engine working together to obtain a translated voice file, which is output by an electronic terminal such as a translation machine, making communication between users of different languages convenient and greatly facilitating people's lives.
  • however, for each voice segment in the continuous long voice file, the voice file translated and output by the existing translation engine carries no background-noise information, and the sentence intervals in the output voice file are preset fixed intervals, so the translated voice file departs from the rhythm and natural sentence intervals of the original continuous long speech file; the language environment and character of the original are lost, and the user experience is poor.
  • the main object of the present invention is to provide a method for translating a continuous long speech file, aiming to solve the technical problem that existing translation technology cannot preserve the rhythm and natural sentence intervals of the original continuous long speech file.
  • the invention provides a method for translating a continuous long speech file, comprising: parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech; sending the continuous long speech file to a server for translation, and receiving the audio stream file after the server translates the continuous long speech file; parsing the audio stream file to obtain each second speech segment and each second non-speech segment in the same distribution order as the first speech segments and first non-speech segments; and, in the audio stream file, replacing the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file.
  • the step of parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment comprises: processing the continuous long speech file using voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames; and obtaining each first speech segment and each first non-speech segment according to that arrangement state.
  • the step of obtaining each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames includes: synthesizing the consecutively arranged first speech frames into the respective first speech segments, and synthesizing the consecutively arranged first non-speech frames into the respective first non-speech segments.
  • after the synthesizing step, the method includes: extracting each first non-speech segment, and storing each first non-speech segment in a non-speech segment buffer according to the timing generated in the continuous long speech.
  • the step of sending the continuous long voice file to the server for translation and receiving the audio stream file after the server translates it comprises: sending the continuous long voice file to the speech recognition server; receiving the first text file corresponding to the continuous long voice file fed back by the speech recognition server; sending the first text file to the translation server; receiving the second text file in the specified language fed back by the translation server after translating the first text file; sending the second text file to the speech synthesis server; and receiving the audio stream file after the speech synthesis server converts the second text file.
  • the step of parsing the audio stream file to obtain the second speech segments and second non-speech segments in the same distribution order as the first speech segments and first non-speech segments includes: analyzing the first string information of the first text file against the second string information of the second text file to obtain a first-type one-to-one correspondence; processing the audio stream file using voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames; obtaining each second speech segment and each second non-speech segment according to that arrangement state; establishing a second-type one-to-one correspondence between each first speech segment and each second speech segment according to the first-type correspondence; and, according to the second-type correspondence and the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, obtaining the second speech segments and second non-speech segments distributed in the same order.
  • the invention also provides a translation machine comprising:
  • a first parsing module configured to parse the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech;
  • a sending and receiving module configured to send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file
  • a second parsing module configured to parse the audio stream file, and obtain each second speech segment and each second non-speech segment that are in the same order as each of the first speech segment and each of the first non-speech segments;
  • a replacement module configured to replace, in the audio stream file, the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file.
  • the first parsing module includes:
  • a first processing unit configured to process the continuous long voice file by using a voice activity detection and analysis technology, to obtain an arrangement state of the first voice frame and the first non-speech frame;
  • a first obtaining unit configured to obtain each of the first speech segments and each of the first non-speech segments according to the arrangement state of the first speech frame and the first non-speech frame.
  • the first obtaining unit includes:
  • a synthesizing sub-unit configured to synthesize, according to the arrangement state, the consecutively arranged first speech frames into the first speech segments, and the consecutively arranged first non-speech frames into the first non-speech segments.
  • the first obtaining unit further includes: an extracting subunit configured to extract each of the first non-speech segments; and a storage subunit configured to store each of the first non-speech segments in a non-speech segment buffer according to the timing generated in the continuous long speech.
  • the sending and receiving module includes:
  • a first sending unit configured to send the continuous long voice file to the speech recognition server;
  • a first receiving unit configured to receive the first text file corresponding to the continuous long voice file fed back by the speech recognition server;
  • a second sending unit configured to send the first text file to the translation server;
  • a second receiving unit configured to receive the second text file in the specified language fed back by the translation server after translating the first text file;
  • a third sending unit configured to send the second text file to the speech synthesis server;
  • a third receiving unit configured to receive the audio stream file after the speech synthesis server converts the second text file.
  • the foregoing second parsing module includes:
  • an analyzing unit configured to analyze the first string information of the first text file against the second string information of the second text file to obtain a first-type one-to-one correspondence;
  • a second processing unit configured to process the audio stream file by using a voice activity detection and analysis technology, to obtain an arrangement state of the second voice frame and the second non-speech frame;
  • a second obtaining unit configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frame and the second non-speech frame;
  • an establishing unit configured to establish, according to the first-type one-to-one correspondence, a second-type one-to-one correspondence between each first speech segment and each second speech segment;
  • a third obtaining unit configured to obtain, according to the second-type one-to-one correspondence and the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
  • the invention distinguishes the original continuous long speech file into speech segments and non-speech segments and retains the same non-speech segments as the original file, so that the translated audio stream file has almost the same rhythm, background sounds, and natural sentence intervals as the original continuous long speech file, which makes machine translation more vivid and improves the user experience.
  • FIG. 1 is a schematic flow chart of a method for translating a continuous long speech file according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of step S1 according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of step S11 according to an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of step S2 according to an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of step S3 according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a translation machine according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a first parsing module according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a first obtaining unit according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of a transmitting and receiving module according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a second parsing module according to an embodiment of the present invention.
  • referring to FIG. 1, a method for translating a continuous long voice file according to an embodiment of the present invention includes: S1: Parse the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech.
  • the terminal device of this embodiment takes a translation machine as an example.
  • in this step, the continuous long speech file is parsed to obtain a data file in which the first speech segments and first non-speech segments are alternately arranged, distributed according to the timing generated in the continuous long speech, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
  • S2: Send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file.
  • This step refers to a process in which a continuous long speech file is sequentially sent by a translation machine to a speech recognition server, a translation server, and a speech synthesis server for translation.
  • the audio stream file of this embodiment refers to corresponding audio data obtained after translating a continuous long speech file, including voice data and non-speech data.
  • S3: Parse the audio stream file to obtain each second speech segment and each second non-speech segment in the same distribution order as the first speech segments and first non-speech segments. The audio stream file in this embodiment is the audio data obtained by translating the continuous long speech file segment by segment, so the second speech segments and second non-speech segments in the audio stream file are distributed in the same order as the first speech segments and first non-speech segments.
  • S4: In the audio stream file, replace the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file. In this embodiment, in the audio stream file, which has the same distribution order as the continuous long speech file, the second non-speech segment at each ranking position is replaced with the corresponding first non-speech segment, and the first non-speech segments are integrated with the translated audio stream file, so that the final translated voice file has the same rhythm, background sounds, and natural sentence intervals as the original continuous long voice file, making machine translation more vivid and improving the user experience; a minimal sketch of this replacement follows.
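
To make the replacement concrete, here is a minimal sketch in Python (the patent specifies no implementation language; the list-of-(kind, audio) segment representation and the function name are illustrative assumptions):

```python
def replace_non_speech(original_segments, translated_segments):
    """Step S4 sketch: in the translated audio stream, swap each second
    non-speech segment for the first non-speech segment at the same ranking
    position in the original continuous long speech file."""
    assert len(original_segments) == len(translated_segments), \
        "step S3 guarantees the same distribution order"
    output = []
    for (orig_kind, orig_audio), (_, trans_audio) in zip(original_segments,
                                                         translated_segments):
        if orig_kind == "non-speech":
            output.append(orig_audio)   # keep the original rhythm and background sound
        else:
            output.append(trans_audio)  # keep the translated speech
    return b"".join(output)
```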
  • step S1 includes:
  • S10: Process the continuous long speech file using voice activity detection analysis to obtain the arrangement state of the first speech frames and first non-speech frames.
  • in this embodiment, the translation machine performs VAD (Voice Activity Detection) on the continuous long voice file to distinguish the first voice segments from the first non-speech segments, facilitating subsequent operations.
  • for example, the continuous long speech file is processed frame by frame, with the duration of each frame set according to the characteristics of the speech signal; for instance, the 20 ms frame of GSM is used as the frame length.
  • first, the start and end of each first speech segment in the continuous long speech file are detected by the VAD, and the duration of each first speech segment is obtained algorithmically; for example, the ETSI VAD algorithm or the G.729 Annex B VAD algorithm used in GSM communication systems compares the parameter feature values extracted by the VAD against a threshold to distinguish the first speech segments from the first non-speech segments.
  • the arrangement state in this step refers to the arrangement information obtained after voice activity detection analysis, when the continuous long speech file becomes a data file in which runs of consecutive first speech frames alternate with runs of consecutive first non-speech frames.
  • S11: Obtain each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames. In this embodiment, the first speech segments and first non-speech segments separated by the VAD are marked with different codes for identification; a frame-level sketch follows.
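
The frame classification of steps S10 and S11 can be sketched as follows; this substitutes a bare energy threshold for the ETSI VAD or G.729 Annex B algorithms named above, and the 16 kHz sample rate, 16-bit PCM input, and threshold value are all illustrative assumptions:

```python
import array

FRAME_MS = 20  # GSM-style frame length, as in the example above

def frame_energies(pcm: bytes, sample_rate: int = 16000) -> list[float]:
    """Split 16-bit mono PCM into 20 ms frames and compute mean energy per frame."""
    samples = array.array("h", pcm)
    frame_len = sample_rate * FRAME_MS // 1000
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [sum(s * s for s in frame) / len(frame) for frame in frames]

def vad_decisions(pcm: bytes, threshold: float = 1.0e6) -> list[int]:
    """One decision per frame: 1 = first speech frame, 0 = background-noise
    frame (first non-speech frame). A real system would use the ETSI VAD or
    G.729 Annex B algorithm instead of this bare energy threshold."""
    return [1 if energy > threshold else 0 for energy in frame_energies(pcm)]
```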
  • step S11 includes:
  • S112: Synthesize the consecutively arranged first speech frames into the respective first speech segments, and the consecutively arranged first non-speech frames into the respective first non-speech segments.
  • in this embodiment, first speech frames and first non-speech frames are distinguished by the VAD decision result; for example, a decision of 1 indicates a first speech frame and a decision of 0 indicates a background-noise frame (that is, a first non-speech frame). A sentence in the continuous long speech file thus becomes first speech segment 1, merged from the consecutively arranged first speech frames 1 to m, while the consecutively arranged first non-speech frames 1 to k are merged into first non-speech segment 1; processing continues in this way until the continuous long speech file becomes a data file in which the consecutively arranged first speech segments 1 to N alternate with the first non-speech segments 1 to N, as sketched below.
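
A sketch of the merging in step S112, under the same illustrative representation as above (frames as byte strings, one VAD decision per frame); itertools.groupby performs the run-length grouping:

```python
from itertools import groupby

def frames_to_segments(frames: list[bytes], decisions: list[int]):
    """Step S112 sketch: merge each run of consecutively arranged frames that
    share the same VAD decision into one segment, producing the alternating
    speech-segment / non-speech-segment data file described above."""
    segments, index = [], 0
    for decision, run in groupby(decisions):
        run_len = len(list(run))
        audio = b"".join(frames[index:index + run_len])
        segments.append(("speech" if decision == 1 else "non-speech", audio))
        index += run_len
    return segments  # e.g. [("speech", ...), ("non-speech", ...), ...]
```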
  • further, after step S112, the method includes S113: Extract each of the first non-speech segments. According to the different coding marks of the first speech segments and first non-speech segments, the first non-speech segments in the continuous long speech file are extracted; for example, the first non-speech segments 1, 2, ..., N, coded in sequence as T1, T2, ..., Tn, are extracted.
  • S114: Store each of the first non-speech segments in a non-speech segment buffer according to the timing generated in the continuous long speech.
  • the non-speech segment buffer in this step is set in a designated area of the translation machine's memory, so that the translated audio stream file can be sequentially integrated with the first non-speech segments, in the same order in which the first non-speech segments were generated in the continuous long speech, before being output; a buffer sketch follows.
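
Steps S113 and S114 might then look like the following sketch, with the coding marks T1, T2, ..., Tn modeled as simple sequence keys into an in-memory buffer (an assumption; the patent only requires a designated memory area):

```python
def buffer_non_speech(segments: list[tuple[str, bytes]]) -> dict[str, bytes]:
    """Steps S113-S114 sketch: extract the first non-speech segments from the
    alternating segment list and store them, in generation order, in a buffer
    keyed by their coding marks T1, T2, ..., Tn."""
    non_speech_buffer: dict[str, bytes] = {}
    for kind, audio in segments:
        if kind == "non-speech":
            non_speech_buffer[f"T{len(non_speech_buffer) + 1}"] = audio
    return non_speech_buffer  # dicts preserve insertion order in Python 3.7+
```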
  • step S2 includes:
  • S20: Send the continuous long voice file to the speech recognition server.
  • S21: Receive the first text file corresponding to the continuous long voice file fed back by the speech recognition server.
  • in this step, the speech recognition server converts the continuous long voice file into the corresponding first text file.
  • S22: Send the first text file to the translation server. S23: Receive the second text file in the specified language fed back by the translation server after translating the first text file. In this step, the first text file is translated by the translation server to form a second text file in the specified language.
  • for example, when Chinese is translated into English, the Chinese first text file and the English second text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, sentence by sentence.
  • S24: Send the second text file to the speech synthesis server. S25: Receive the audio stream file after the speech synthesis server converts the second text file.
  • in this step, the second text file is sent to the speech synthesis server in sentence order, so that it is converted sequentially into an audio stream file in the specified language, for example an English audio stream file; an end-to-end sketch of S20 through S25 follows.
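
An end-to-end sketch of steps S20 through S25; the HTTP transport, endpoint URLs, and payload shapes are placeholders invented for illustration, since the patent does not specify how the translation machine talks to the three servers:

```python
import requests  # HTTP is an assumption; the patent does not specify a transport

# Placeholder endpoints standing in for the three servers named above.
ASR_URL = "https://asr.example.com/recognize"
MT_URL = "https://mt.example.com/translate"
TTS_URL = "https://tts.example.com/synthesize"

def translate_pipeline(voice_file: bytes, target_lang: str) -> bytes:
    """Steps S20-S25 sketch: relay the continuous long voice file through the
    speech recognition, translation, and speech synthesis servers in turn."""
    first_text = requests.post(ASR_URL, data=voice_file).text            # S20/S21
    second_text = requests.post(
        MT_URL, json={"text": first_text, "to": target_lang}).text       # S22/S23
    return requests.post(
        TTS_URL, json={"text": second_text, "lang": target_lang}).content  # S24/S25
```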
  • step S3 includes:
  • S30: Analyze the first string information of the first text file against the second string information of the second text file to obtain a first-type one-to-one correspondence. In this step, the string information of the two text files is compared, and each sentence of the first and second text files is marked, for example: sentence 1, sentence 2, ..., sentence N, so that the one-to-one correspondence between the second text file and the first text file is obtained by comparison.
  • S31: Process the audio stream file using voice activity detection analysis to obtain the arrangement state of the second speech frames and second non-speech frames.
  • in this embodiment, the audio stream file is processed by the VAD to distinguish the second speech segments from the second non-speech segments; each second speech segment corresponds to a sentence in the second text file, and the N second non-speech segments all have the same duration.
  • S32: Obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames. The second speech segments and second non-speech segments separated by the VAD are likewise marked with different codes for identification.
  • S33: Establish a second-type one-to-one correspondence between each first speech segment and each second speech segment according to the first-type one-to-one correspondence.
  • each first speech segment corresponds to a sentence in the first text file, and each second speech segment corresponds to a sentence in the second text file; according to the one-to-one correspondence between the second text file and the first text file, the one-to-one correspondence between each first speech segment and each second speech segment is found, which in turn determines the one-to-one correspondence between each first non-speech segment and each second non-speech segment, enabling accurate replacement.
  • S34: According to the second-type one-to-one correspondence, and the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, obtain the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
  • in this step, the second-type one-to-one correspondence aligns the translated audio stream file with the continuous long voice file according to the timing in which the first speech segments and first non-speech segments were generated, so that the rhythm of the continuous long speech file (such as the varying intervals between sentences), the background sounds (such as background music or applause), and the natural sentence intervals (that is, the natural lengths of the non-speech segments) can be integrated with the translated audio stream file, bringing the final translated voice file closer to the original language environment and improving the user experience; a sentence-alignment sketch follows.
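
A sketch of the correspondence logic in steps S30, S33, and S34, assuming the two text files split into the same number of sentences and the parsed segment lists are positionally aligned; the regex sentence splitter is a naive stand-in for real, language-aware segmentation:

```python
import re

def sentence_pairs(first_text: str, second_text: str) -> list[tuple[str, str]]:
    """Step S30 sketch: mark sentence 1, sentence 2, ... in both text files and
    pair them up as the first-type one-to-one correspondence."""
    def split(text: str) -> list[str]:
        return [s.strip() for s in re.split(r"(?<=[.!?。！？])\s+", text) if s.strip()]
    first, second = split(first_text), split(second_text)
    assert len(first) == len(second), "translation is sentence-for-sentence"
    return list(zip(first, second))

def align_segments(original_segments, translated_segments):
    """Steps S33-S34 sketch: since speech segment i corresponds to sentence i in
    both text files, pairing the parsed segment lists by position yields the
    second-type one-to-one correspondence for speech and non-speech segments."""
    assert len(original_segments) == len(translated_segments)
    return list(zip(original_segments, translated_segments))
```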
  • referring to FIG. 6, a translation machine according to an embodiment of the present invention includes:
  • the first parsing module 1 is configured to parse the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech.
  • the terminal device of this embodiment takes a translation machine as an example.
  • the continuous long speech file is parsed by the first parsing module 1 to obtain a data file in which the first speech segments and first non-speech segments are alternately arranged, distributed according to the timing generated in the continuous long speech, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
  • the sending and receiving module 2 is configured to send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file.
  • the continuous long voice file is sequentially sent to the speech recognition server, the translation server, and the speech synthesis server through the sending and receiving module 2 for translation.
  • the audio stream file of this embodiment refers to corresponding audio data obtained after translating a continuous long speech file, including voice data and non-speech data.
  • the second parsing module 3 is configured to parse the audio stream file and obtain each second speech segment and each second non-speech segment in the same distribution order as the first speech segments and first non-speech segments.
  • the audio stream file in this embodiment is the audio data obtained by translating the continuous long speech file segment by segment, so the second speech segments and second non-speech segments obtained by parsing the audio stream file through the second parsing module 3 are distributed in the same order as the first speech segments and first non-speech segments.
  • the replacement module 4 is configured to replace, in the audio stream file, the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file.
  • in this embodiment, the replacement module 4 replaces the second non-speech segment at each ranking position with the corresponding first non-speech segment and integrates the first non-speech segments with the translated audio stream file, so that the final translated voice file has the same rhythm, background sounds, and natural sentence intervals as the original continuous long voice file, making machine translation more vivid and improving the user experience.
  • the first parsing module 1 includes:
  • the first processing unit 10 is configured to process the continuous long speech file by using a voice activity detection and analysis technology to obtain an arrangement state of the first speech frame and the first non-speech frame.
  • the first processing unit 10 performs VAD on the continuous long voice file, distinguishing the first voice segments from the first non-speech segments to facilitate subsequent operations.
  • for example, the continuous long speech file is processed frame by frame, with the duration of each frame set according to the characteristics of the speech signal; for instance, the 20 ms frame of GSM is used as the frame length.
  • first, the start and end of each first speech segment in the continuous long speech file are detected by the VAD, and the duration of each first speech segment is obtained algorithmically; for example, the ETSI VAD algorithm or the G.729 Annex B VAD algorithm used in GSM communication systems compares the parameter feature values extracted by the VAD against a threshold to distinguish the first speech segments from the first non-speech segments.
  • the arrangement state in this embodiment refers to the arrangement information obtained after the voice activity detection analysis performed by the first processing unit 10, when the continuous long speech file becomes a data file in which runs of consecutive first speech frames alternate with runs of consecutive first non-speech frames.
  • the first obtaining unit 11 is configured to obtain each of the first speech segments and each of the first non-speech segments according to the arrangement state of the first speech frame and the first non-speech frame.
  • each of the first speech segments and the first non-speech segments separated by the VAD are respectively marked with different coding marks for identification.
  • the first obtaining unit 11 includes:
  • the synthesizing sub-unit 112 is configured to synthesize, according to the arrangement state, the consecutively arranged first speech frames into the first speech segments, and the consecutively arranged first non-speech frames into the first non-speech segments.
  • first speech frames and first non-speech frames are distinguished by the VAD decision result; for example, a decision of 1 indicates a first speech frame and a decision of 0 indicates a background-noise frame (that is, a first non-speech frame). After VAD processing by the synthesizing sub-unit 112, a sentence in the continuous long speech file becomes first speech segment 1, merged from the consecutively arranged first speech frames 1 to m, while the consecutively arranged first non-speech frames 1 to k are merged into first non-speech segment 1; processing continues in this way until the continuous long speech file becomes a data file in which the consecutively arranged first speech segments 1 to N alternate one-to-one with the first non-speech segments 1 to N.
  • the first obtaining unit 11 further includes:
  • the extracting sub-unit 113 is configured to extract each of the first non-speech segments.
  • the extracting sub-unit 113 extracts the first non-speech segments from the continuous long speech file according to the different coding marks of the first speech segments and first non-speech segments; for example, the first non-speech segments 1, 2, ..., N, coded in sequence as T1, T2, ..., Tn, are extracted.
  • the storage sub-unit 114 is configured to store each of the first non-speech segments in a non-speech segment buffer according to a sequence generated in the continuous long speech.
  • the non-speech segment buffer of this embodiment is set in a designated area of the storage sub-unit 114, so that the translated audio stream file can be sequentially integrated with the first non-speech segments, in the same order in which the first non-speech segments were generated in the continuous long speech, before being output.
  • the sending and receiving module 2 includes:
  • the first sending unit 20 is configured to send the continuous long voice file to the voice recognition server.
  • the first receiving unit 21 is configured to receive a first text file corresponding to the continuous long voice file fed back by the voice recognition server.
  • the continuous long voice file is sent to the speech recognition server by the first sending unit 20 and converted by the speech recognition server into the corresponding first text file.
  • the second sending unit 22 is configured to send the first text file to the translation server.
  • the second receiving unit 23 is configured to receive a second text file of the specified language after the translation of the first text file that is fed back by the translation server.
  • the second sending unit 22 sends the first text file to the translation server, and the first text file is translated by the translation server to form a second text file of the specified language.
  • for example, when Chinese is translated into English, the Chinese first text file and the English second text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, sentence by sentence.
  • the third sending unit 24 is configured to send the second text file to the voice synthesis server.
  • the third receiving unit 25 is configured to receive the audio stream file after the voice synthesis server converts the second text file.
  • the second text file is sent to the speech synthesis server in sentence order by the third sending unit 24, so that it is converted sequentially into an audio stream file in the specified language, for example an English audio stream file.
  • the second parsing module 3 includes:
  • the analyzing unit 30 is configured to analyze the first string information of the first text file against the second string information of the second text file to obtain a first-type one-to-one correspondence.
  • the analyzing unit 30 compares the string information of the two text files and marks each sentence of the first and second text files, for example: sentence 1, sentence 2, ..., sentence N, so as to obtain the one-to-one correspondence between the second text file and the first text file by comparison.
  • the second processing unit 31 is configured to process the audio stream file using voice activity detection analysis to obtain the arrangement state of the second speech frames and second non-speech frames.
  • the second processing unit 31 performs VAD processing on the audio stream file to distinguish the second speech segments from the second non-speech segments; each second speech segment corresponds to a sentence in the second text file, and the N second non-speech segments all have the same duration.
  • the second obtaining unit 32 is configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frame and the second non-speech frame.
  • each second speech segment and each second non-speech segment obtained by the second obtaining unit 32 is likewise marked with a different code for identification.
  • the establishing unit 33 is configured to establish a second-type one-to-one correspondence between each first speech segment and each second speech segment according to the first-type one-to-one correspondence.
  • each first speech segment corresponds to a sentence in the first text file and each second speech segment corresponds to a sentence in the second text file; according to the one-to-one correspondence between the second text file and the first text file, the establishing unit 33 finds the one-to-one correspondence between each first speech segment and each second speech segment, which in turn determines the one-to-one correspondence between each first non-speech segment and each second non-speech segment, enabling accurate replacement.
  • the third obtaining unit 34 is configured to obtain, according to the second-type one-to-one correspondence and the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
  • in this embodiment, the third obtaining unit 34 uses the second-type one-to-one correspondence, together with the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, to align the translated audio stream file with the speech segments of the continuous long speech file, so that the rhythm of the continuous long speech file (such as the varying intervals between sentences), the background sounds (such as background music or applause), and the natural sentence intervals (that is, the natural lengths of the non-speech segments) can be integrated with the translated audio stream file, bringing the final translated voice file closer to the original language environment and improving the user experience.

Abstract

A method for translating a continuous long speech file, and a translation machine. The method comprises: parsing a continuous long speech file to obtain first speech segments and first non-speech segments (S1), wherein the first speech segments and the first non-speech segments are distributed according to the time sequence generated in the continuous long speech; sending the continuous long speech file to a server for translation, and receiving an audio code stream file obtained after the server translates the continuous long speech file (S2); parsing the audio code stream file to obtain second speech segments and second non-speech segments having the same distribution order as the first speech segments and the first non-speech segments (S3); and replacing the second non-speech segments with the first non-speech segments at the same ranking positions in the audio code stream file to obtain a final translated speech file (S4). The rhythm, background sound, and natural sentence intervals of the continuous long speech file are preserved, improving the user experience.

Description

Method for translating a continuous long speech file, and translation machine
Technical Field
The present invention relates to electronic translation techniques, and more particularly to a method for translating continuous long speech files and a translation machine.
Background Art
In the field of electronic translation, continuous long speech files in education, recording, and other application scenarios are processed by a speech recognition engine, a translation engine, and a synthesis engine working together to obtain a translated voice file, which is output by an electronic terminal such as a translation machine, making communication between users of different languages convenient and greatly facilitating people's lives. However, for each voice segment in the continuous long voice file, the voice file translated and output by the existing translation engine carries no background-noise information, and the sentence intervals in the output voice file are preset fixed intervals, so the translated voice file departs from the rhythm and natural sentence intervals of the original continuous long speech file; the language environment and character of the original are lost, and the user experience is poor.
Therefore, the prior art still needs to be improved.
Technical Problem
The main object of the present invention is to provide a method for translating a continuous long speech file, aiming to solve the technical problem that existing translation technology cannot preserve the rhythm and natural sentence intervals of the original continuous long speech file.
Technical Solution
The present invention provides a method for translating a continuous long speech file, comprising:
parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech;
sending the continuous long speech file to a server for translation, and receiving the audio stream file after the server translates the continuous long speech file;
parsing the audio stream file to obtain each second speech segment and each second non-speech segment in the same distribution order as the first speech segments and first non-speech segments;
replacing, in the audio stream file, the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file.
Preferably, the step of parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment comprises:
processing the continuous long speech file using voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames;
obtaining each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
Preferably, the step of obtaining each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames includes:
synthesizing the consecutively arranged first speech frames into the respective first speech segments, and synthesizing the consecutively arranged first non-speech frames into the respective first non-speech segments.
Preferably, after the step of synthesizing the consecutively arranged first speech frames into the respective first speech segments and the consecutively arranged first non-speech frames into the respective first non-speech segments, the method includes:
extracting each of the first non-speech segments;
storing each of the first non-speech segments in a non-speech segment buffer according to the timing generated in the continuous long speech.
Preferably, the step of sending the continuous long voice file to the server for translation and receiving the audio stream file after the server translates the continuous long voice file comprises:
sending the continuous long voice file to the speech recognition server;
receiving the first text file corresponding to the continuous long voice file fed back by the speech recognition server;
sending the first text file to the translation server;
receiving the second text file in the specified language fed back by the translation server after translating the first text file;
sending the second text file to the speech synthesis server;
receiving the audio stream file after the speech synthesis server converts the second text file.
Preferably, the step of parsing the audio stream file to obtain the second speech segments and second non-speech segments in the same distribution order as the first speech segments and first non-speech segments includes:
analyzing the first string information of the first text file against the second string information of the second text file to obtain a first-type one-to-one correspondence;
processing the audio stream file using voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames;
obtaining each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames;
establishing a second-type one-to-one correspondence between each first speech segment and each second speech segment according to the first-type one-to-one correspondence;
according to the second-type one-to-one correspondence, and the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, obtaining the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
The present invention also provides a translation machine, comprising:
a first parsing module configured to parse the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech;
a sending and receiving module configured to send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file;
a second parsing module configured to parse the audio stream file and obtain each second speech segment and each second non-speech segment in the same distribution order as the first speech segments and first non-speech segments;
a replacement module configured to replace, in the audio stream file, the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file.
Preferably, the first parsing module includes:
a first processing unit configured to process the continuous long speech file using voice activity detection analysis to obtain the arrangement state of first speech frames and first non-speech frames;
a first obtaining unit configured to obtain each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
Preferably, the first obtaining unit includes:
a synthesizing sub-unit configured to synthesize, according to the arrangement state, the consecutively arranged first speech frames into the first speech segments, and the consecutively arranged first non-speech frames into the first non-speech segments.
Preferably, the first obtaining unit further includes:
an extracting sub-unit configured to extract each of the first non-speech segments;
a storage sub-unit configured to store each of the first non-speech segments in a non-speech segment buffer according to the timing generated in the continuous long speech.
Preferably, the sending and receiving module includes:
a first sending unit configured to send the continuous long voice file to the speech recognition server;
a first receiving unit configured to receive the first text file corresponding to the continuous long voice file fed back by the speech recognition server;
a second sending unit configured to send the first text file to the translation server;
a second receiving unit configured to receive the second text file in the specified language fed back by the translation server after translating the first text file;
a third sending unit configured to send the second text file to the speech synthesis server;
a third receiving unit configured to receive the audio stream file after the speech synthesis server converts the second text file.
Preferably, the second parsing module includes:
an analyzing unit configured to analyze the first string information of the first text file against the second string information of the second text file to obtain a first-type one-to-one correspondence;
a second processing unit configured to process the audio stream file using voice activity detection analysis to obtain the arrangement state of second speech frames and second non-speech frames;
a second obtaining unit configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames;
an establishing unit configured to establish a second-type one-to-one correspondence between each first speech segment and each second speech segment according to the first-type one-to-one correspondence;
a third obtaining unit configured to obtain, according to the second-type one-to-one correspondence and the timing in which the first speech segments and first non-speech segments were generated in the continuous long speech, the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
Beneficial Effects
The present invention distinguishes the original continuous long speech file into speech segments and non-speech segments and retains the same non-speech segments as the original file, so that the translated audio stream file has almost the same rhythm, background sounds, and natural sentence intervals as the original continuous long speech file, which makes machine translation more vivid and improves the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for translating a continuous long speech file according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of step S1 according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of step S11 according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of step S2 according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of step S3 according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a translation machine according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the first parsing module according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of the first obtaining unit according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of the sending and receiving module according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of the second parsing module according to an embodiment of the present invention.
The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
Referring to FIG. 1, a method for translating a continuous long voice file according to an embodiment of the present invention includes:
S1: Parse the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein the first speech segments and first non-speech segments are distributed according to the timing generated in the continuous long speech.
The terminal device of this embodiment takes a translation machine as an example. In this step, the continuous long speech file is parsed to obtain a data file in which the first speech segments and first non-speech segments are alternately arranged, distributed according to the timing generated in the continuous long speech, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
S2: Send the continuous long voice file to the server for translation, and receive the audio stream file after the server translates the continuous long voice file.
This step refers to the process in which the continuous long speech file is sequentially sent by the translation machine to the speech recognition server, the translation server, and the speech synthesis server for translation. The audio stream file of this embodiment refers to the corresponding audio data obtained after translating the continuous long speech file, including voice data and non-speech data.
S3: Parse the audio stream file to obtain each second speech segment and each second non-speech segment in the same distribution order as the first speech segments and first non-speech segments.
The audio stream file in this embodiment is the audio data obtained by translating the continuous long speech file segment by segment, so the second speech segments and second non-speech segments in the audio stream file are distributed in the same order as the first speech segments and first non-speech segments.
S4: In the audio stream file, replace the second non-speech segments with the first non-speech segments at the same ranking positions to obtain the final translated voice file.
In this embodiment, in the audio stream file, which has the same distribution order as the continuous long speech file, the second non-speech segment at each ranking position is replaced with the corresponding first non-speech segment, and the first non-speech segments are integrated with the translated audio stream file, so that the final translated voice file has the same rhythm, background sounds, and natural sentence intervals as the original continuous long voice file, making machine translation more vivid and improving the user experience.
Referring to FIG. 2, further, in an embodiment of the present invention, step S1 includes:
S10: Process the continuous long speech file using voice activity detection analysis to obtain the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the translation machine performs VAD (Voice Activity Detection) on the continuous long voice file to distinguish the first voice segments from the first non-speech segments, facilitating subsequent operations. For example, the continuous long speech file is processed frame by frame, with the duration of each frame set according to the characteristics of the speech signal; for instance, the 20 ms frame of GSM is used as the frame length. First, the start and end of each first speech segment in the continuous long speech file are detected by the VAD, and the duration of each first speech segment is obtained algorithmically; for example, the ETSI VAD algorithm or the G.729 Annex B VAD algorithm used in GSM communication systems compares the parameter feature values extracted by the VAD against a threshold to distinguish the first speech segments from the first non-speech segments. The arrangement state in this step refers to the arrangement information obtained after voice activity detection analysis, when the continuous long speech file becomes a data file in which runs of consecutive first speech frames alternate with runs of consecutive first non-speech frames.
S11: Obtain each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the first speech segments and first non-speech segments separated by the VAD are marked with different codes for identification.
参照图3,进一步地,本发明一实施例中,步骤S11,包括:Referring to FIG. 3, in an embodiment of the present invention, step S11 includes:
S112:将连续排布的第一语音帧分别合成各上述第一语音段,将连续排布的第一非语音帧分别合成各上述第一非语音段。S112: synthesize the first speech frames that are consecutively arranged into the first speech segments, and synthesize the first non-speech frames that are consecutively arranged into the first non-speech segments.
本实施例通过VAD的判决结果辨别第一语音帧和第一非语音帧,比如:判决结果是1,就是第一语音帧;判决结果是0,就是背景噪声帧(即第一非语音帧),并把连续长语音文件中的语句,变为经VAD处理后的连续排列的第一语音帧1到第一语音帧m合并成的第一语音段1,连续排列的第一非语音帧1到第一非语音帧k合并成的第一非语音段1,依次处理,将连续长语音文件变为连续排列的第一语音段1至第一语音段N,与连续排列的第一非语音段1至第一非语音段N一一交替分布的数据文件。In this embodiment, the first speech frame and the first non-speech frame are discriminated by the result of the VAD, for example, the judgment result is 1, which is the first speech frame; and the judgment result is 0, that is, the background noise frame (ie, the first non-speech frame). And changing the statement in the continuous long speech file into the first speech segment 1 in which the VDD-processed consecutively arranged first speech frame 1 to the first speech frame m are merged, and the first non-speech frame 1 continuously arranged The first non-speech segment 1 merged into the first non-speech frame k is sequentially processed to change the continuous long speech file into the first speech segment 1 to the first speech segment N, and the first non-speech continuously arranged A data file in which the segment 1 to the first non-speech segment N are alternately distributed.
Further, in an embodiment of the present invention, after step S112, the method includes:
S113: Extract each of the first non-speech segments.
According to the distinct coding marks of the first speech segments and the first non-speech segments, the first non-speech segments are extracted from the continuous long speech file; for example, first non-speech segments 1, 2, ..., N, coded in sequence as T1, T2, ..., Tn, are extracted.
S114: Store each first non-speech segment in a non-speech segment buffer according to the order in which it was generated in the continuous long speech.
The non-speech segment buffer of this step is set in a designated area of the translation machine's memory, so that the translated audio stream file can later be recombined with the first non-speech segments and output, with the first non-speech segments in the same order in which they were generated in the continuous long speech.
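Steps S113 and S114 can be sketched as below, assuming the segment tuples produced by frames_to_segments above; the codes T1, T2, ... are modeled as simple sequence indices, and the Python list standing in for the designated buffer region in memory is an illustrative assumption.

```python
def buffer_non_speech_segments(segments):
    """Extract the non-speech segments and store them in generation order.

    Each buffered entry keeps its sequence code (T1, T2, ...) and its
    frame extent so that it can later be recombined with the translated
    audio stream at the same ordering position.
    """
    buffer = []
    for label, start, n_frames in segments:
        if label == 'non-speech':
            buffer.append({'code': f'T{len(buffer) + 1}',
                           'start': start,
                           'n_frames': n_frames})
    return buffer
```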
Referring to FIG. 4, in an embodiment of the present invention, step S2 includes:
S20: Send the continuous long speech file to a speech recognition server.
S21: Receive the first text file corresponding to the continuous long speech file fed back by the speech recognition server.
In this step, the first text file corresponding to the continuous long speech file is obtained from the speech recognition server.
S22: Send the first text file to a translation server.
S23: Receive the second text file, in the specified language, fed back by the translation server after it translates the first text file.
In this step, the translation server translates the first text file to form the second text file in the specified language. For example, when Chinese is translated into English, the Chinese first text file and the English second text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, sentence by sentence.
S24: Send the second text file to a speech synthesis server.
S25: Receive the audio stream file produced by the speech synthesis server converting the second text file.
In this step, the second text file is sent in order to the speech synthesis server, which converts it sentence by sentence into an audio stream file in the specified language, for example an English audio stream file.
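Taken together, steps S20 through S25 form a speech-recognition, machine-translation, and speech-synthesis pipeline of three request/response round trips. The sketch below shows that flow over hypothetical HTTP endpoints; the URLs, payload fields, and use of the requests library are illustrative assumptions, as the specification does not define the server interfaces.

```python
import requests

def translate_speech_file(audio_bytes, target_lang='en'):
    """Run the S20-S25 round trips against hypothetical servers."""
    # S20/S21: the speech recognition server returns the first text file.
    first_text = requests.post('https://asr.example.com/recognize',
                               data=audio_bytes).json()['text']
    # S22/S23: the translation server returns the second text file
    # in the specified language.
    second_text = requests.post('https://mt.example.com/translate',
                                json={'text': first_text,
                                      'target': target_lang}).json()['text']
    # S24/S25: the speech synthesis server returns the audio stream file.
    return requests.post('https://tts.example.com/synthesize',
                         json={'text': second_text}).content
```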
Referring to FIG. 5, further, in an embodiment of the present invention, step S3 includes:
S30: Analyze the first character string information of the first text file against the second character string information of the second text file to obtain a first one-to-one correspondence.
In this step, by comparing the character string information of the text files, each sentence formed by the character strings of the first text file and of the second text file is marked, for example as sentence 1, sentence 2, ..., sentence N, so that the one-to-one correspondence between the second text file and the first text file is obtained by matching comparison.
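As an illustration of step S30, the marking can be sketched as an index-wise pairing of sentences, as below; splitting on end punctuation is an assumed simplification, since the sentence-by-sentence translation of step S23 already guarantees the one-to-one order.

```python
import re

def first_correspondence(first_text, second_text):
    """Pair sentence i of the first text file with sentence i of the second.

    Splitting on end punctuation (Latin and CJK) is an illustrative
    assumption; the pairing itself relies on the sentence-wise translation.
    """
    def sentences(text):
        return [s for s in re.split(r'(?<=[.!?。！？])\s*', text) if s]
    return list(zip(sentences(first_text), sentences(second_text)))
```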
S31: Process the audio stream file using voice activity detection analysis to obtain the arrangement state of the second speech frames and second non-speech frames.
In this step, the audio stream file is processed by VAD to distinguish the second speech segments from the second non-speech segments in the audio stream file; each second speech segment corresponds to one sentence of the second text file, and the N second non-speech segments all have the same duration.
S32: Obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames.
In this embodiment, the second speech segments and the second non-speech segments separated by the VAD are likewise given distinct coding marks so that they can be identified.
S33: Establish a second one-to-one correspondence between each first speech segment and each second speech segment according to the first one-to-one correspondence.
Each first speech segment corresponds to one sentence of the first text file, and each second speech segment corresponds to one sentence of the second text file. From the one-to-one correspondence between the second text file and the first text file, the one-to-one correspondence between each first speech segment and each second speech segment is found, which in turn determines the one-to-one correspondence between each first non-speech segment and each second non-speech segment, so that the replacement can be performed precisely.
S34: According to the second one-to-one correspondence, and according to the order in which the first speech segments and first non-speech segments were generated in the continuous long speech, obtain the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
In this embodiment, the second one-to-one correspondence, together with the order in which the first speech segments and first non-speech segments were generated in the continuous long speech, puts the translated audio stream file into one-to-one correspondence with the speech segments of the continuous long speech file. The rhythm of the continuous long speech file (such as the varying intervals between sentences), its background sounds (such as background music or applause), and the natural pauses between sentences (that is, the natural lengths of the non-speech segments) can thus be better merged with the translated audio stream file, making the final translated speech file closer to the original language environment and improving the user experience.
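A minimal sketch of the final splice follows, assuming both files have been decomposed by the routines above into alternating segment lists of matching order, with audio carried as NumPy arrays; that representation of the audio stream file is an illustrative simplification.

```python
import numpy as np

def splice_translation(second_segments, first_non_speech_audio):
    """Replace each second non-speech segment with the first non-speech
    segment at the same ordering position and concatenate the result.

    second_segments: ordered list of ('speech'|'non-speech', np.ndarray).
    first_non_speech_audio: ordered list of np.ndarray, the buffered
    first non-speech segments (pauses, background sounds).
    """
    out, gap = [], 0
    for label, audio in second_segments:
        if label == 'non-speech':
            out.append(first_non_speech_audio[gap])  # original pause/background
            gap += 1
        else:
            out.append(audio)  # translated speech is kept as-is
    return np.concatenate(out)
```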
Referring to FIG. 6, a translation machine according to an embodiment of the present invention includes:
A first parsing module 1, configured to parse the continuous long speech file to obtain the first speech segments and the first non-speech segments, wherein the first speech segments and the first non-speech segments are distributed in the order in which they were generated in the continuous long speech.
The terminal device of this embodiment is exemplified by a translation machine. In this embodiment, the first parsing module 1 parses the continuous long speech file to obtain a data file in which the first speech segments and the first non-speech segments are arranged at alternating intervals, distributed in the order in which they were generated in the continuous long speech, for example: first speech segment 1, first non-speech segment 1, first speech segment 2, first non-speech segment 2, first speech segment 3, first non-speech segment 3, ..., first speech segment N, first non-speech segment N.
A sending and receiving module 2, configured to send the continuous long speech file to a server for translation, and to receive the audio stream file produced by the server translating the continuous long speech file.
In this embodiment, the sending and receiving module 2 sends the continuous long speech file in turn to a speech recognition server, a translation server, and a speech synthesis server for translation. The audio stream file of this embodiment refers to the corresponding audio data obtained after the continuous long speech file is translated, including speech data and non-speech data.
A second parsing module 3, configured to parse the audio stream file to obtain the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
The audio stream file of this embodiment is the audio data obtained by translating the continuous long speech file in one-to-one correspondence; accordingly, when the second parsing module 3 parses the audio stream file, the resulting second speech segments and second non-speech segments are distributed in the same order as the first speech segments and first non-speech segments.
A replacement module 4, configured to replace, in the audio stream file, the second non-speech segment at each ordering position with the first non-speech segment at the same position, to obtain the final translated speech file.
In this embodiment, in the audio stream file, whose distribution order matches that of the continuous long speech file, the replacement module 4 replaces the second non-speech segment at each ordering position with the first non-speech segment, merging the first non-speech segments into the translated audio stream file. The final translated speech file therefore has the same rhythm, background sounds, and natural sentence intervals as the original continuous long speech file, which gives the machine translation a livelier feel and improves the user experience.
Referring to FIG. 7, further, in an embodiment of the present invention, the first parsing module 1 includes:
A first processing unit 10, configured to process the continuous long speech file using voice activity detection analysis to obtain the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the first processing unit 10 performs VAD on the continuous long speech file, distinguishing the first speech segments from the first non-speech segments to facilitate subsequent operations. For example, the continuous long speech file is processed frame by frame, with the frame duration set according to the characteristics of the speech signal; in GSM, for instance, the frame length is 20 ms. The VAD first detects the start and end of each first speech segment in the continuous long speech file, and the duration of each first speech segment is obtained algorithmically: using, for example, the ETSI VAD algorithm or the G.729 Annex B VAD algorithm from GSM communication systems, the parameter feature values extracted by the VAD are compared against a threshold to distinguish the first speech segments from the first non-speech segments. The arrangement state of this embodiment refers to the arrangement information of the continuous long speech file after the VAD analysis of the first processing unit 10 has turned it into a data file in which runs of consecutive first speech frames alternate with runs of consecutive first non-speech frames.
A first obtaining unit 11, configured to obtain each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
In this embodiment, the first speech segments and the first non-speech segments separated by the VAD are given distinct coding marks so that they can be identified.
Referring to FIG. 8, further, in an embodiment of the present invention, the first obtaining unit 11 includes:
A synthesizing subunit 112, configured to merge, according to the arrangement state, consecutively arranged first speech frames into the respective first speech segments, and consecutively arranged first non-speech frames into the respective first non-speech segments.
In this embodiment, first speech frames and first non-speech frames are distinguished by the VAD decision result: a decision of 1 indicates a first speech frame, and a decision of 0 indicates a background-noise frame (that is, a first non-speech frame). The synthesizing subunit 112 then converts the sentences in the continuous long speech file: after VAD processing, consecutively arranged first speech frames 1 through m are merged into first speech segment 1, and consecutively arranged first non-speech frames 1 through k are merged into first non-speech segment 1. Processing in this way in sequence turns the continuous long speech file into a data file in which first speech segments 1 through N alternate one by one with first non-speech segments 1 through N.
Further, in an embodiment of the present invention, the first obtaining unit 11 further includes:
An extraction subunit 113, configured to extract each of the first non-speech segments.
According to the distinct coding marks of the first speech segments and the first non-speech segments, the extraction subunit 113 extracts the first non-speech segments from the continuous long speech file; for example, first non-speech segments 1, 2, ..., N, coded in sequence as T1, T2, ..., Tn, are extracted.
A storage subunit 114, configured to store each first non-speech segment in a non-speech segment buffer according to the order in which it was generated in the continuous long speech.
The non-speech segment buffer of this embodiment is set in a designated area of the storage subunit 114, so that the translated audio stream file can later be recombined with the first non-speech segments and output, with the first non-speech segments in the same order in which they were generated in the continuous long speech.
Referring to FIG. 9, in an embodiment of the present invention, the sending and receiving module 2 includes:
A first sending unit 20, configured to send the continuous long speech file to a speech recognition server.
A first receiving unit 21, configured to receive the first text file corresponding to the continuous long speech file fed back by the speech recognition server.
In this embodiment, the first sending unit 20 sends the continuous long speech file to the speech recognition server, which converts it into the first text file corresponding to the continuous long speech file.
A second sending unit 22, configured to send the first text file to a translation server.
A second receiving unit 23, configured to receive the second text file, in the specified language, fed back by the translation server after it translates the first text file.
In this embodiment, the second sending unit 22 sends the first text file to the translation server, which translates it to form the second text file in the specified language. For example, when Chinese is translated into English, the Chinese first text file and the English second text file are in one-to-one correspondence: each Chinese sentence is translated into one English sentence, sentence by sentence.
A third sending unit 24, configured to send the second text file to a speech synthesis server.
A third receiving unit 25, configured to receive the audio stream file produced by the speech synthesis server converting the second text file.
In this embodiment, the third sending unit 24 sends the second text file in order to the speech synthesis server, which converts it sentence by sentence into an audio stream file in the specified language, for example an English audio stream file.
Referring to FIG. 10, further, in an embodiment of the present invention, the second parsing module 3 includes:
An analysis unit 30, configured to analyze the first character string information of the first text file against the second character string information of the second text file to obtain the first one-to-one correspondence.
In this embodiment, the analysis unit 30 compares the character string information of the text files and marks each sentence formed by the character strings of the first text file and of the second text file, for example as sentence 1, sentence 2, ..., sentence N, so that the one-to-one correspondence between the second text file and the first text file is obtained by matching comparison.
A second processing unit 31, configured to process the audio stream file using voice activity detection analysis to obtain the arrangement state of the second speech frames and second non-speech frames.
In this embodiment, the second processing unit 31 performs VAD on the audio stream file to distinguish the second speech segments from the second non-speech segments in the audio stream file; each second speech segment corresponds to one sentence of the second text file, and the N second non-speech segments all have the same duration.
A second obtaining unit 32, configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames.
In this embodiment, the second speech segments and second non-speech segments obtained by the second obtaining unit 32 are likewise given distinct coding marks so that they can be identified.
An establishing unit 33, configured to establish the second one-to-one correspondence between each first speech segment and each second speech segment according to the first one-to-one correspondence.
Each first speech segment corresponds to one sentence of the first text file, and each second speech segment corresponds to one sentence of the second text file. From the one-to-one correspondence between the second text file and the first text file, the establishing unit 33 finds the one-to-one correspondence between each first speech segment and each second speech segment, which in turn determines the one-to-one correspondence between each first non-speech segment and each second non-speech segment, so that the replacement can be performed precisely.
A third obtaining unit 34, configured to obtain, according to the second one-to-one correspondence and the order in which the first speech segments and first non-speech segments were generated in the continuous long speech, the second speech segments and second non-speech segments distributed in the same order as the first speech segments and first non-speech segments.
In this embodiment, using the second one-to-one correspondence and the order in which the first speech segments and first non-speech segments were generated in the continuous long speech, the third obtaining unit 34 obtains the one-to-one correspondence between the translated audio stream file and the speech segments of the continuous long speech file. The rhythm of the continuous long speech file (such as the varying intervals between sentences), its background sounds (such as background music or applause), and the natural pauses between sentences (that is, the natural lengths of the non-speech segments) can thus be better merged with the translated audio stream file, making the final translated speech file closer to the original language environment and improving the user experience.
The above description covers only preferred embodiments of the present invention and does not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (12)

  1. A method for translating a continuous long speech file, comprising:
    parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein each first speech segment and each first non-speech segment are distributed in the order in which they were generated in the continuous long speech;
    sending the continuous long speech file to a server for translation, and receiving an audio stream file produced by the server translating the continuous long speech file;
    parsing the audio stream file to obtain each second speech segment and each second non-speech segment distributed in the same order as the first speech segments and first non-speech segments;
    replacing, in the audio stream file, the second non-speech segment at each ordering position with the first non-speech segment at the same position, to obtain a final translated speech file.
  2. The method for translating a continuous long speech file according to claim 1, wherein the step of parsing the continuous long speech file to obtain each first speech segment and each first non-speech segment comprises:
    processing the continuous long speech file by voice activity detection analysis to obtain an arrangement state of first speech frames and first non-speech frames;
    obtaining each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
  3. The method for translating a continuous long speech file according to claim 2, wherein the step of obtaining each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames comprises:
    merging consecutively arranged first speech frames into the respective first speech segments, and merging consecutively arranged first non-speech frames into the respective first non-speech segments.
  4. The method for translating a continuous long speech file according to claim 3, wherein, after the step of merging consecutively arranged first speech frames into the respective first speech segments and merging consecutively arranged first non-speech frames into the respective first non-speech segments, the method comprises:
    extracting each first non-speech segment;
    storing each first non-speech segment in a non-speech segment buffer according to the order in which it was generated in the continuous long speech.
  5. The method for translating a continuous long speech file according to claim 1, wherein the step of sending the continuous long speech file to a server for translation and receiving the audio stream file produced by the server translating the continuous long speech file comprises:
    sending the continuous long speech file to a speech recognition server;
    receiving a first text file corresponding to the continuous long speech file fed back by the speech recognition server;
    sending the first text file to a translation server;
    receiving a second text file, in a specified language, fed back by the translation server after translating the first text file;
    sending the second text file to a speech synthesis server;
    receiving an audio stream file produced by the speech synthesis server converting the second text file.
  6. The method for translating a continuous long speech file according to claim 5, wherein the step of parsing the audio stream file to obtain each second speech segment and each second non-speech segment distributed in the same order as the first speech segments and first non-speech segments comprises:
    analyzing first character string information of the first text file against second character string information of the second text file to obtain a first one-to-one correspondence;
    processing the audio stream file by voice activity detection analysis to obtain an arrangement state of second speech frames and second non-speech frames;
    obtaining each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames;
    establishing a second one-to-one correspondence between each first speech segment and each second speech segment according to the first one-to-one correspondence;
    obtaining, according to the second one-to-one correspondence and the order in which the first speech segments and first non-speech segments were generated in the continuous long speech, each second speech segment and each second non-speech segment distributed in the same order as the first speech segments and first non-speech segments.
  7. A translation machine, comprising:
    a first parsing module, configured to parse a continuous long speech file to obtain each first speech segment and each first non-speech segment, wherein each first speech segment and each first non-speech segment are distributed in the order in which they were generated in the continuous long speech;
    a sending and receiving module, configured to send the continuous long speech file to a server for translation, and to receive an audio stream file produced by the server translating the continuous long speech file;
    a second parsing module, configured to parse the audio stream file to obtain each second speech segment and each second non-speech segment distributed in the same order as the first speech segments and first non-speech segments;
    a replacement module, configured to replace, in the audio stream file, the second non-speech segment at each ordering position with the first non-speech segment at the same position, to obtain a final translated speech file.
  8. The translation machine according to claim 7, wherein the first parsing module comprises:
    a first processing unit, configured to process the continuous long speech file by voice activity detection analysis to obtain an arrangement state of first speech frames and first non-speech frames;
    a first obtaining unit, configured to obtain each first speech segment and each first non-speech segment according to the arrangement state of the first speech frames and first non-speech frames.
  9. The translation machine according to claim 8, wherein the first obtaining unit comprises:
    a synthesizing subunit, configured to merge, according to the arrangement state, consecutively arranged first speech frames into the respective first speech segments, and consecutively arranged first non-speech frames into the respective first non-speech segments.
  10. The translation machine according to claim 9, wherein the first obtaining unit further comprises:
    an extraction subunit, configured to extract each first non-speech segment;
    a storage subunit, configured to store each first non-speech segment in a non-speech segment buffer according to the order in which it was generated in the continuous long speech.
  11. The translation machine according to claim 7, wherein the sending and receiving module comprises:
    a first sending unit, configured to send the continuous long speech file to a speech recognition server;
    a first receiving unit, configured to receive a first text file corresponding to the continuous long speech file fed back by the speech recognition server;
    a second sending unit, configured to send the first text file to a translation server;
    a second receiving unit, configured to receive a second text file, in a specified language, fed back by the translation server after translating the first text file;
    a third sending unit, configured to send the second text file to a speech synthesis server;
    a third receiving unit, configured to receive an audio stream file produced by the speech synthesis server converting the second text file.
  12. The translation machine according to claim 11, wherein the second parsing module comprises:
    an analysis unit, configured to analyze first character string information of the first text file against second character string information of the second text file to obtain a first one-to-one correspondence;
    a second processing unit, configured to process the audio stream file by voice activity detection analysis to obtain an arrangement state of second speech frames and second non-speech frames;
    a second obtaining unit, configured to obtain each second speech segment and each second non-speech segment according to the arrangement state of the second speech frames and second non-speech frames;
    an establishing unit, configured to establish a second one-to-one correspondence between each first speech segment and each second speech segment according to the first one-to-one correspondence;
    a third obtaining unit, configured to obtain, according to the second one-to-one correspondence and the order in which the first speech segments and first non-speech segments were generated in the continuous long speech, each second speech segment and each second non-speech segment distributed in the same order as the first speech segments and first non-speech segments.
PCT/CN2018/072007 2017-12-20 2018-01-09 Method for translating continuous long speech file, and translation machine WO2019119552A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711388000.3 2017-12-20
CN201711388000.3A CN108090051A (en) 2017-12-20 2017-12-20 Method for translating continuous long speech file, and translation machine

Publications (1)

Publication Number Publication Date
WO2019119552A1 true WO2019119552A1 (en) 2019-06-27

Family

ID=62177614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072007 WO2019119552A1 (en) 2017-12-20 2018-01-09 Method for translating continuous long speech file, and translation machine

Country Status (2)

Country Link
CN (1) CN108090051A (en)
WO (1) WO2019119552A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101497A (en) * 2018-07-18 2018-12-28 深圳市锐曼智能技术有限公司 Voice collecting translating equipment, system and method
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN111862940A (en) * 2020-07-15 2020-10-30 百度在线网络技术(北京)有限公司 Earphone-based translation method, device, system, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008243080A (en) * 2007-03-28 2008-10-09 Toshiba Corp Device, method, and program for translating voice
CN101458681A (en) * 2007-12-10 2009-06-17 株式会社东芝 Voice translation method and voice translation apparatus
CN101727904A (en) * 2008-10-31 2010-06-09 国际商业机器公司 Voice translation method and device
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
CN105912533A (en) * 2016-04-12 2016-08-31 苏州大学 Method and device for long statement segmentation aiming at neural machine translation
CN106303695A (en) * 2016-08-09 2017-01-04 北京东方嘉禾文化发展股份有限公司 Audio translation multiple language characters processing method and system
CN107391498A (en) * 2017-07-28 2017-11-24 深圳市沃特沃德股份有限公司 Voice translation method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE381755T1 (en) * 2003-06-02 2008-01-15 Ibm VOICE RESPONSIVE SYSTEM, VOICE RESPONSIVE METHOD, VOICE SERVER, VOICE FILE PROCESSING METHOD, PROGRAM AND RECORDING MEDIUM
CN102968991B (en) * 2012-11-29 2015-01-21 华为技术有限公司 Method, device and system for sorting voice conference minutes
CN103167360A (en) * 2013-02-21 2013-06-19 中国对外翻译出版有限公司 Method for achieving multilingual subtitle translation
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN105719642A (en) * 2016-02-29 2016-06-29 黄博 Continuous and long voice recognition method and system and hardware equipment
CN107305541B (en) * 2016-04-20 2021-05-04 科大讯飞股份有限公司 Method and device for segmenting speech recognition text

Also Published As

Publication number Publication date
CN108090051A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN110049270B (en) Multi-person conference voice transcription method, device, system, equipment and storage medium
US7983910B2 (en) Communicating across voice and text channels with emotion preservation
JP6469252B2 (en) Account addition method, terminal, server, and computer storage medium
KR20230043250A (en) Synthesis of speech from text in a voice of a target speaker using neural networks
CN110853615B (en) Data processing method, device and storage medium
US6119086A (en) Speech coding via speech recognition and synthesis based on pre-enrolled phonetic tokens
KR20190104941A (en) Speech synthesis method based on emotion information and apparatus therefor
US20040073423A1 (en) Phonetic speech-to-text-to-speech system and method
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN101154220A (en) Machine translation apparatus and method
CN110149805A (en) Double-directional speech translation system, double-directional speech interpretation method and program
WO2019119552A1 (en) Method for translating continuous long speech file, and translation machine
CN102903361A (en) Instant call translation system and instant call translation method
WO2013027360A1 (en) Voice recognition system, recognition dictionary logging system, and audio model identifier series generation device
CN109714608B (en) Video data processing method, video data processing device, computer equipment and storage medium
KR20100111164A (en) Spoken dialogue processing apparatus and method for understanding personalized speech intention
KR20150145024A (en) Terminal and server of speaker-adaptation speech-recognition system and method for operating the system
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN110119514A (en) The instant translation method of information, device and system
WO2019075829A1 (en) Voice translation method and apparatus, and translation device
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN114125506B (en) Voice auditing method and device
KR20180033875A (en) Method for translating speech signal and electronic device thereof
CN110310620B (en) Speech fusion method based on native pronunciation reinforcement learning
JP2000010578A (en) Voice message transmission/reception system, and voice message processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18891357; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18891357; Country of ref document: EP; Kind code of ref document: A1)