WO2019019135A1 - Speech translation method and apparatus - Google Patents

Speech translation method and apparatus

Info

Publication number
WO2019019135A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
information
background noise
speech
voice information
Prior art date
Application number
PCT/CN2017/094874
Other languages
English (en)
French (fr)
Inventor
蒋壮
郑勇
张立新
王文琪
温平
Original Assignee
深圳市沃特沃德股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市沃特沃德股份有限公司
Priority to PCT/CN2017/094874 priority Critical patent/WO2019019135A1/zh
Publication of WO2019019135A1 publication Critical patent/WO2019019135A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a voice translation method and apparatus.
  • the translation processing of voice information by a communication terminal mainly comprises three processes: recognition, translation, and synthesis.
  • the translated voice information is composed of speech frames and mute frames; a mute frame is actually a blank frame, a gap between speech frames. The translated voice information therefore contains only speech, without the background sound of the real environment, which greatly reduces the authenticity of the dialogue between the two parties and degrades the user experience.
  • the main object of the present invention is to provide a speech translation method and apparatus, aiming to solve the technical problem that background sound missing from translated speech information reduces the authenticity of the dialogue.
  • an embodiment of the present invention provides a voice translation method, where the method includes the following steps: acquiring original voice information; extracting background noise frames from the original voice information; translating the original voice information to obtain translated voice information; identifying the mute frames in the translated voice information; and superimposing the background noise frames onto the mute frames, so that the translated voice information contains background noise information.
  • Embodiments of the present invention also provide a voice translation apparatus, where the apparatus includes:
  • a voice information acquiring module configured to obtain original voice information
  • a background noise extraction module configured to extract a background noise frame from the original voice information
  • a voice translation processing module configured to translate the original voice information to obtain translated voice information;
  • a mute recognition module configured to identify a mute frame in the translated speech information
  • a background noise superimposing module configured to superimpose the background noise frame on the mute frame in the translated speech information, so that the translated speech information includes information of background noise.
  • in the speech translation method provided by an embodiment of the present invention, background noise frames are extracted from the original speech information, the mute frames in the translated speech information are then identified, and the background noise frames are finally superimposed onto those mute frames, so that the translated speech information contains background noise. The user can therefore hear not only clear speech but also the background sound of the real environment, which increases the authenticity of the dialogue between the two parties and enhances the user experience.
  • FIG. 1 is a flow chart of an embodiment of a speech translation method of the present invention
  • FIG. 2 is a schematic diagram of a fragment of original voice information in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a background noise frame extracted from the original voice information in FIG. 2 in the embodiment of the present invention.
  • FIG. 4 is another schematic diagram of a fragment of original voice information in an embodiment of the present invention.
  • FIG. 5 is a flowchart of the translation processing of the original voice information in an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a fragment of translated speech information in an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of translated speech information with background noise added in an embodiment of the present invention.
  • FIG. 8 is a system block diagram of an application scenario of a speech translation method according to an embodiment of the present invention.
  • FIG. 9 is a system block diagram of still another application scenario of a speech translation method according to an embodiment of the present invention.
  • FIG. 10 is a system block diagram of still another application scenario of a speech translation method according to an embodiment of the present invention.
  • FIG. 11 is a system block diagram of still another application scenario of a speech translation method according to an embodiment of the present invention.
  • FIG. 12 is a block diagram showing an embodiment of a speech translation apparatus of the present invention.
  • FIG. 13 is a block diagram of a voice acquisition module of FIG. 12;
  • FIG. 14 is a block diagram of a background noise extraction module of FIG. 12;
  • FIG. 15 is a block diagram of the identification unit of FIG. 14;
  • FIG. 16 is a block diagram of the mute recognition module of FIG. 12; FIG. 17 is a block diagram of the background noise superimposing module of FIG. 12.
  • the voice translation method and apparatus of the embodiments of the present invention can be applied to various terminal devices, and is particularly applicable to a VOLTE terminal, which is a communication terminal based on VOLTE (Voice over LTE) technology.
  • VoLTE is an IP data transmission technology that does not require a 2G/3G network. All services are carried on a 4G network, which enables data and voice services to be unified under the same network.
  • of course, it can also be applied to other terminal devices; the present invention is not limited in this respect.
  • referring to FIG. 1, an embodiment of the speech translation method of the present invention comprises the following steps. S11: Acquire original speech information.
  • the terminal device may collect the original voice information through a sound collection device such as a microphone, or may receive the original voice information sent by the opposite end.
  • taking a VOLTE terminal as an example, the VOLTE terminal establishes a voice communication connection with the opposite end. On the uplink, the VOLTE terminal collects the original voice information through the microphone and buffers it; on the downlink, it receives the original voice information sent by the peer and buffers it.
  • the original voice information is composed of a plurality of voice information frames, including speech frames and background noise frames. As shown in FIG. 2, a segment of original voice information comprises background noise frames 1 to m and speech frames 1 to n.
  • in step S12, the terminal device first identifies the background noise frames in the original voice information, then adds timestamp marks to them in chronological order, and finally saves the background noise frames.
  • as shown in FIG. 3, the background noise frames 1 to m extracted from FIG. 2 are schematically shown.
  • the terminal device identifies the background noise frame in the original voice information by using voice activity detection (VAD).
  • the terminal device performs voice activity detection on the original voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame.
  • the length of each voice information frame can be set according to the signal characteristics of the original voice information; for example, for a Global System for Mobile Communications (GSM) speech signal, 20 ms is used as the frame length. The voice activity detection algorithm can be the GSM ETSI VAD algorithm or the G.729 Annex B VAD algorithm.
  • after obtaining the parameter feature value of each voice information frame, the terminal device compares it with a preset threshold and judges whether it is less than or equal to the threshold: if so, the frame is decided to be a background noise frame; if the value is greater than the threshold, the frame is decided to be a speech frame. Traversing every frame of the original speech information identifies all of its speech frames and background noise frames.
  • the parameter feature value here refers to the energy of each frame of the speech signal, usually measured by its level amplitude.
  • the threshold value can be set according to actual needs, such as setting according to empirical data and experimental data.
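  • To make the frame classification above concrete, the following minimal sketch labels 20 ms frames by energy. It is an illustration only, not the patented implementation; the 16 kHz sample rate and the use of mean squared amplitude as the parameter feature value are assumptions for the example.

```python
import numpy as np

FRAME_MS = 20           # frame length, per the GSM example above
SAMPLE_RATE = 16000     # assumed sample rate for this sketch
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def classify_frames(samples: np.ndarray, threshold: float) -> list[str]:
    """Split a mono signal into 20 ms frames and label each one 'noise'
    or 'speech' by comparing its energy against a preset threshold."""
    labels = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN].astype(np.float64)
        energy = float(np.mean(frame ** 2))   # energy as mean squared level
        labels.append('noise' if energy <= threshold else 'speech')
    return labels
```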
  • optionally, when the terminal device receives original voice information sent by the opposite end that has already been denoised by the peer, the original voice information consists of speech frames and Silence Descriptor (SID) frames; a SID frame is the result of denoising a background noise frame. As shown in FIG. 4, a segment of such denoised original voice information comprises SID frames and speech frames.
  • the terminal device parses the original voice information, identifies the SID frames through frame feature information, and then adds preset noise information into the SID frames, thereby restoring them to background noise frames; the frame format of the background noise frames is converted into the same frame format as the later translated speech information, and timestamp marks are added to the background noise frames in chronological order before they are saved.
  • of course, the background noise in this case is only simulated background noise, not the background noise of the peer user's real environment.
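  • The SID handling can be pictured with the short sketch below. Since the patent does not specify the comfort-noise algorithm or the frame layout, the dict-based frame representation, the uniform noise generator, and the fixed amplitude are all assumptions for illustration.

```python
import numpy as np

def restore_background_noise(frames: list[dict], frame_len: int = 320,
                             amplitude: float = 0.01) -> list[dict]:
    """Replace each SID frame with a frame of preset simulated noise and
    attach a chronological timestamp mark, mirroring the step above.
    A frame is assumed to be {'type': 'sid' | 'speech', 'data': ndarray}."""
    noise_frames = []
    for index, frame in enumerate(frames):
        if frame['type'] == 'sid':
            data = amplitude * np.random.uniform(-1.0, 1.0, frame_len)
            noise_frames.append({'timestamp': index, 'data': data})
    return noise_frames
```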
  • S13 Perform translation processing on the original voice information to obtain translated voice information.
  • the embodiment of the present invention does not limit the order of steps S12 and S13; in some embodiments, steps S12 and S13 may also be performed simultaneously.
  • the terminal device may obtain the translated voice information after performing translation processing locally.
  • the original voice information may also be sent to the server, and the translated voice information is returned by the server.
  • for example, consider a VOLTE terminal that performs translation through a server. The VOLTE terminal sends the original voice information to the server for translation, so that the server translates it from one language into another, obtains the translated voice information, and sends it to the VOLTE terminal, which receives the translated voice information.
  • the VOLTE terminal may send the original voice information directly to the server in the form of a voice data stream.
  • the VOLTE terminal sends the original voice information to the server in the form of a data packet.
  • the VOLTE terminal first records the original first-language voice information into individual voice files and caches them, then sends each cached voice file to the server in turn in the form of data packets.
  • the translation process mainly includes three stages: recognition, translation, and synthesis.
  • the three processes can be completed by one server or by two or three servers.
  • the server includes a voice recognition server, a translation server, and a voice synthesis server.
  • the VOLTE terminal establishes an IP-based connection with the voice recognition server and sets the recognition information, that is, the language types to be recognized, including the language type of the local end and optionally that of the peer; establishes an IP-based connection with the translation server and sets the translation information, that is, the languages to translate between, including the mapping from the local end to the peer and optionally the reverse mapping; and establishes an IP-based connection with the speech synthesis server and sets the synthesis information, that is, the type of speech to synthesize, such as male or female voice and speech rate.
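  • The three connection setups amount to a per-call configuration. A sketch of what such settings might look like is shown below; all field names are hypothetical, since the patent only describes the information abstractly.

```python
# Hypothetical per-call settings mirroring the three setups described above
translation_settings = {
    "recognition": {"local_lang": "zh-CN", "peer_lang": "en-US"},
    "translation": {"from": "zh-CN", "to": "en-US"},  # local-to-peer mapping
    "synthesis": {"voice": "female", "rate": 1.0},    # voice type, speech rate
}
```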
  • as shown in FIG. 5, the specific process for the VOLTE terminal to send the original voice information to the server for translation is as follows. S131: Send the original voice information to the voice recognition server, so that the voice recognition server recognizes the original voice information as a first character string.
  • the VOLTE terminal first records the original voice information as voice files and caches them, then sends each cached voice file to the voice recognition server in the form of data packets. After receiving a voice file, the voice recognition server recognizes it according to the preset recognition information, obtains the first character string, and returns the first character string to the VOLTE terminal.
  • S132 Receive a first character string returned by the voice recognition server.
  • S133: Send the first character string to the translation server, so that the translation server translates it into a second character string. After receiving the first character string, the VOLTE terminal sends it to the translation server; the translation server translates it according to the preset translation information into a second character string (that is, a character string in the other language) and returns the second character string to the VOLTE terminal.
  • S134 Receive a second character string returned by the translation server.
  • S135. Send the second character string to the voice synthesis server, so that the voice synthesis server synthesizes the second character string into voice information.
  • after receiving the second character string, the VOLTE terminal sends it to the voice synthesis server. The speech synthesis server synthesizes the second character string according to the preset synthesis information into voice information in the other language; this voice information is the translated voice information.
  • S136 Receive voice information returned by the voice synthesis server, where the voice information is translated voice information.
  • the speech synthesis server returns the translated voice information to the VOLTE terminal in the form of a voice stream.
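  • Steps S131 to S136 reduce to three request/response round trips from the terminal. A minimal client-side sketch follows; the endpoint URLs, payload fields, and the choice of HTTP are assumptions for illustration, since the patent only requires IP-based connections.

```python
import requests

# Hypothetical endpoints; the patent only specifies IP-based connections.
ASR_URL = "http://asr.example.com/recognize"
MT_URL = "http://mt.example.com/translate"
TTS_URL = "http://tts.example.com/synthesize"

def translate_voice_file(voice_file: bytes, src: str, dst: str) -> bytes:
    # S131/S132: recognize the recorded voice file as a first string
    first = requests.post(ASR_URL, data=voice_file, params={"lang": src}).text
    # S133/S134: translate the first string into a second string
    second = requests.post(MT_URL, json={"text": first,
                                         "from": src, "to": dst}).text
    # S135/S136: synthesize the second string into translated voice
    return requests.post(TTS_URL, json={"text": second,
                                        "voice": "female"}).content
```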
  • in other embodiments, the recognition, translation, and synthesis of the original voice information may also be completed by a single server. For example, the VOLTE terminal transmits the original voice information to the server, and the server recognizes, translates, and synthesizes it and returns the result to the VOLTE terminal.
  • in still other embodiments, the recognition, translation, and synthesis of the original voice information may be completed by two servers. For example, the VOLTE terminal sends the original voice information to the first server, which recognizes and translates it and returns the result to the VOLTE terminal; the VOLTE terminal then sends the recognized and translated information to the second server, which synthesizes the voice information and returns it to the VOLTE terminal.
  • as another example, the VOLTE terminal sends the original voice information to the first server, which only performs recognition and returns the result; the VOLTE terminal then sends the recognized information to the second server, which translates and synthesizes the voice information and returns it to the VOLTE terminal.
  • S14: Identify the mute frames in the translated speech information. The translated speech information is also composed of a plurality of voice information frames, including speech frames and mute frames. As shown in FIG. 6, a segment of translated speech information comprises mute frames 1 to k and speech frames 1 to L.
  • in step S14, the terminal device performs voice activity detection on the translated voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame.
  • the voice activity detection algorithm may adopt the GSM ETSI VAD algorithm or the G.729 Annex B VAD algorithm; of course, other algorithms may also be used, and the present invention is not limited in this respect.
  • the terminal device compares the parameter feature value with the preset threshold and judges whether it is less than or equal to the threshold: if so, the frame is decided to be a mute frame; if the value is greater than the threshold, the frame is decided to be a speech frame. Traversing every frame identifies all speech frames and mute frames and yields the starting point of each.
  • the parameter feature value here again refers to the energy of each frame of the speech signal, usually measured by its level amplitude.
  • the threshold value can be set according to actual needs, such as setting based on empirical data and experimental data.
  • in step S15, the terminal device first adds timestamp marks to the mute frames in chronological order, and then, according to the timestamp marks of the background noise frames and of the mute frames, superimposes each background noise frame onto the corresponding mute frame in the translated speech information; that is, the background noise frames and mute frames are merged in chronological order, so that the translated speech information contains the background noise information.
  • FIG. 7 schematically shows a fragment of the translated speech information with background noise added, including background noise frames 1 to k (because a mute frame is a blank frame, after a background noise frame is superimposed on it only the background noise frame effectively remains) and speech frames 1 to L.
  • preferably, the terminal device determines whether there are excess background noise frames; when there are (that is, when the number of background noise frames exceeds the number of mute frames), the terminal device clears the excess background noise frames to avoid affecting the speech frames and to preserve voice quality.
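  • The merge in step S15 pairs frames by their timestamp marks. Under the same dict-based frame representation assumed earlier, a sketch might look like this:

```python
def superimpose_noise(translated_frames: list[dict],
                      noise_frames: list[dict]) -> list[dict]:
    """Overwrite each mute frame with the noise frame of matching
    chronological rank; excess noise frames are simply dropped."""
    mute_positions = [i for i, f in enumerate(translated_frames)
                      if f['type'] == 'mute']
    for rank, pos in enumerate(mute_positions):
        if rank < len(noise_frames):
            # a mute frame is blank, so only the noise frame remains
            translated_frames[pos] = {'type': 'noise',
                                      'data': noise_frames[rank]['data']}
    # noise frames beyond the number of mute frames are discarded here,
    # so they can never overlap a speech frame
    return translated_frames
```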
  • after superimposing the background noise frames onto the mute frames, the terminal device may output the translated voice information, or may send it to the opposite end, which outputs it. The user can thus hear not only the speech but also the background sound, making the dialogue between the two parties more realistic. Moreover, the background noise frames do not overlap the speech frames, so the speech frames are not affected and the speech remains clearly audible.
  • for example, on an uplink call the VOLTE terminal sends the translated voice information to the opposite end through the voice channel. After receiving it, the peer processes the voice information through its audio path and finally outputs it through a sound device (handset, speaker, etc.); the peer user can then hear the VOLTE terminal user's voice and the background sound of that user's environment. On a downlink call, the VOLTE terminal processes the translated voice information through its audio path and outputs it through a sound device; the VOLTE terminal user can then hear the peer user's voice and the real or simulated background sound of the peer's environment.
  • in the speech translation method of the embodiment of the present invention, background noise frames are extracted from the original speech information, the mute frames in the translated speech information are then identified, and the background noise frames are finally superimposed onto those mute frames, so that the translated speech information includes background noise information. The user can therefore hear not only clear speech but also the background sound of the real environment, which increases the authenticity of the dialogue between the two parties and enhances the user experience.
  • the embodiment of the present invention can be applied to the application scenario shown in FIG. 8, in which VOLTE terminal A and VOLTE terminal B establish a connection through an IP Multimedia Subsystem (IMS) network and each is connected to the voice recognition server, the translation server, and the voice synthesis server. Both VOLTE terminal A and VOLTE terminal B use the voice translation method of the embodiment of the invention to process the original voice information collected at the local end before sending it to the peer, and the peer directly outputs the processed voice information.
  • the embodiment can also be applied to the scenarios shown in FIG. 9 to FIG. 11. In FIG. 9, VOLTE terminal A and voice terminal B establish a connection through the IMS network, and VOLTE terminal A is connected to the voice recognition server, the translation server, and the voice synthesis server. On an uplink call, VOLTE terminal A processes the original voice information collected at the local end with the voice translation method of this embodiment and then sends it to the peer, which outputs it directly. On a downlink call, VOLTE terminal A processes the original voice information sent by the peer with the same method and outputs the processed voice information.
  • in FIG. 10, VOLTE terminal A connects through the IMS network to the gateway between the IMS network and a 2G/3G network, voice terminal B connects to the same gateway through the 2G/3G network, and VOLTE terminal A is connected to the voice recognition server, the translation server, and the voice synthesis server. On an uplink call, VOLTE terminal A processes the original voice information collected at the local end with the voice translation method of this embodiment and then sends it to voice terminal B, which directly outputs the processed voice information. On a downlink call, VOLTE terminal A processes the original voice information sent by voice terminal B with the same method and outputs the processed voice information.
  • in FIG. 11, VOLTE terminal A connects through the IMS network to the gateway between the IMS network and the public switched telephone network (PSTN), voice terminal B connects to the same gateway through the PSTN, and VOLTE terminal A is connected to the voice recognition server, the translation server, and the voice synthesis server. On an uplink call, VOLTE terminal A processes the original voice information collected at the local end with the voice translation method of this embodiment and then sends it to voice terminal B, which directly outputs the processed voice information. On a downlink call, VOLTE terminal A processes the original voice information sent by voice terminal B with the same method and outputs the processed voice information.
  • the processing delay of the speech recognition server is generally less than 3 seconds, the processing delay of the translation server is generally less than 200 milliseconds, and the processing delay of the speech synthesis server is generally less than 200 milliseconds; the transmission delay of the IMS network is generally on the order of seconds. Using the high data rate and low latency of LTE communication, multi-language real-time translation of voice calls is implemented on the VOLTE terminal; the speech translation processing is fast and its delay small, so the user's call is not affected.
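  • Summing the stated bounds gives a rough worst-case budget for one translated utterance; the arithmetic below simply adds the figures quoted above, with 1 second standing in, as an assumption, for the "order of seconds" IMS transmission delay.

```python
# Rough worst-case one-way delay budget from the figures quoted above (s)
asr = 3.0        # speech recognition, generally < 3 s
mt = 0.2         # translation, generally < 200 ms
tts = 0.2        # synthesis, generally < 200 ms
network = 1.0    # IMS transmission, assumed ~1 s ("order of seconds")

print(f"worst-case added delay ~= {asr + mt + tts + network:.1f} s")  # 4.4 s
```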
  • referring to FIG. 12, an embodiment of the speech translation apparatus of the present invention comprises a voice information acquiring module 10, a background noise extraction module 20, a speech translation processing module 30, a mute recognition module 40, and a background noise superimposition module 50.
  • the voice information acquiring module 10 is configured to acquire original voice information.
  • the voice information acquiring module 10 may collect original voice information through a sound collecting device such as a microphone, or may receive original voice information sent by the opposite end.
  • as shown in FIG. 13, the voice information acquiring module 10 includes a collecting unit 11 and a receiving unit 12, wherein the collecting unit 11 is configured to collect original voice information and the receiving unit 12 is configured to receive the original voice information sent by the opposite end.
  • taking application to a VOLTE terminal as an example, the VOLTE terminal establishes a voice communication connection with the opposite end. On the uplink, the collecting unit 11 collects the original voice information through the microphone and buffers it; on the downlink, the receiving unit 12 receives the original voice information sent by the peer and buffers it.
  • Background noise extraction module 20 is configured to extract a background noise frame from the original speech information.
  • the original speech information is composed of a plurality of speech information frames, including speech frames and background noise frames; as shown in FIG. 2, a fragment of the original speech information comprises background noise frames 1 to m and speech frames 1 to n.
  • as shown in FIG. 14, the background noise extraction module 20 includes an identification unit 21, a marking unit 22, and a saving unit 23, wherein: the identification unit 21 is configured to identify the background noise frames in the original voice information; the marking unit 22 is configured to add timestamp marks to the background noise frames in chronological order; and the saving unit 23 is configured to save the background noise frames. As shown in FIG. 3, the background noise frames 1 to m extracted from FIG. 2 are schematically shown.
  • the recognition unit 21 identifies the background noise frame in the original voice information by voice activity detection (VAD).
  • as shown in FIG. 15, the identification unit 21 includes a first obtaining unit 211, a first judging unit 212, and a first deciding unit 213, wherein: the first obtaining unit 211 is configured to perform voice activity detection on the original voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame; the first judging unit 212 is configured to judge whether the parameter feature value is less than or equal to the threshold; and the first deciding unit 213 is configured to decide that a voice information frame is a background noise frame when its parameter feature value is less than or equal to the threshold, and a speech frame when the value is greater than the threshold.
  • the recognition unit 21 By traversing each frame in the original speech information, the recognition unit 21 recognizes all the speech frames and background noise frames in the original speech information.
  • the parameter characteristic value here refers to the energy value of each frame of the speech signal, which is usually measured by the level amplitude value.
  • the threshold value can be set according to actual needs, such as setting according to empirical data and experimental data.
  • the length of each voice information frame can be set according to the signal characteristics of the original voice information; for example, for a Global System for Mobile Communications (GSM) speech signal, 20 ms is used as the frame length. The voice activity detection algorithm can be the GSM ETSI VAD algorithm or the G.729 Annex B VAD algorithm.
  • optionally, when the voice information acquiring module 10 receives original voice information sent by the opposite end that has already been denoised by the peer, the original voice information consists of speech frames and Silence Descriptor (SID) frames; a SID frame is the result of denoising a background noise frame. As shown in FIG. 4, such a denoised segment comprises SID frames and speech frames.
  • in this case, the background noise extraction module 20 parses the original voice information, identifies the SID frames through frame feature information, and adds preset noise information into the SID frames, thereby restoring them to background noise frames; the frame format of the background noise frames is converted into the same frame format as the later translated speech information, and timestamp marks are added to the background noise frames in chronological order before they are saved.
  • of course, the background noise in this case is only simulated background noise, not the background noise of the peer user's real environment.
  • the speech translation processing module 30 is configured to perform translation processing on the original speech information to obtain translated speech information.
  • the voice translation processing module 30 may obtain the translated voice information after performing the translation locally, or may send the original voice information to the server, which performs the translation and returns the translated voice information.
  • take the case where the voice translation processing module 30 performs translation through a server as an example. The voice translation processing module 30 sends the original voice information to the server for translation, so that the server translates it from one language into another, obtains the translated voice information, and sends it back to the voice translation processing module 30, which receives the translated voice information.
  • the voice translation processing module 30 may directly transmit the original voice information to the server in the form of a voice data stream.
  • the voice translation processing module 30 sends the original voice information to the server in the form of a data packet.
  • the voice translation processing module 30 first records the original first-language voice information into individual voice files and caches them, then sends each cached voice file to the server in turn in the form of data packets.
  • Translation processing mainly includes three processes of identification, translation and synthesis. These three processes can be completed by one server or by two or three servers.
  • the server includes a voice recognition server, a translation server, and a voice synthesis server.
  • taking the application of the apparatus of the embodiment of the present invention to a VOLTE terminal as an example, the VOLTE terminal establishes an IP-based connection with the voice recognition server and sets the recognition information, that is, the language types to be recognized, including the language type of the local end and optionally that of the peer; establishes an IP-based connection with the translation server and sets the translation information, that is, the languages to translate between, including the mapping from the local end to the peer and optionally the reverse mapping; and establishes an IP-based connection with the voice synthesis server and sets the synthesis information, that is, the type of speech to synthesize, such as male or female voice and speech rate.
  • the mute recognition module 40 is configured to recognize the mute frame in the translated speech information.
  • the translated speech information is also composed of a plurality of speech information frames, including speech frames and mute frames. As shown in FIG. 6, a segment of the translated speech information comprises mute frames 1 to k and speech frames 1 to L.
  • as shown in FIG. 16, the mute recognition module 40 includes a second obtaining unit 41, a second judging unit 42, and a second deciding unit 43, wherein: the second obtaining unit 41 is configured to perform voice activity detection on the translated voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame; the second judging unit 42 is configured to judge whether the parameter feature value is less than or equal to the threshold; and the second deciding unit 43 is configured to decide that a voice information frame is a mute frame when its parameter feature value is less than or equal to the threshold.
  • by traversing every frame of the translated speech information, the mute recognition module 40 can recognize all speech frames and mute frames in it.
  • the parameter eigenvalue here refers to the energy value of each frame of the speech signal, usually measured by the level amplitude value.
  • the threshold value can be set according to actual needs, such as setting based on empirical data and experimental data.
  • the background noise superimposing module 50 is configured to superimpose the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
  • as shown in FIG. 17, the background noise superimposing module 50 includes a mark adding unit 51 and a noise superimposing unit 52, wherein: the mark adding unit 51 is configured to add timestamp marks to the mute frames in chronological order; and the noise superimposing unit 52 is configured to superimpose each background noise frame onto the corresponding mute frame in the translated speech information according to the timestamp marks of the background noise frames and of the mute frames, so that the translated speech information contains background noise information. As shown in FIG. 7, a segment of the translated speech information with background noise added comprises background noise frames 1 to k (because a mute frame is a blank frame, after superimposition only the background noise frame effectively remains) and speech frames 1 to L.
  • preferably, the noise superimposing unit 52 includes a merging unit and a clearing unit, wherein: the merging unit is configured to merge the background noise frames and the mute frames in chronological order; and the clearing unit is configured to determine whether there are excess background noise frames and, when there are (that is, when the number of background noise frames exceeds the number of mute frames), to clear the excess background noise frames to avoid affecting the speech frames and to preserve voice quality.
  • further, the device includes a voice information sending module configured to send the translated voice information to the peer end. The peer user can thus hear not only the voice but also the background sound, making the conversation between the two parties more realistic. Moreover, the background noise frames do not overlap the voice frames, so the voice frames are not affected and the peer user can hear the speech clearly.
  • further, the apparatus includes a voice information output module configured to output the translated voice information. The local user can thus hear not only the voice but also the background sound, making the conversation between the two parties more realistic. Moreover, the background noise frames do not overlap the voice frames, so the voice frames are not affected and the local user can hear the speech clearly.
  • for example, on an uplink call of the VOLTE terminal, the voice information sending module sends the translated voice information to the peer through the voice channel; after receiving it, the peer processes the voice information through its audio path and finally outputs it through a sound device (handset, speaker, etc.), so the peer user can hear the VOLTE terminal user's voice and the background sound of that user's environment. On a downlink call, the voice information output module processes the translated voice information through the audio path and finally outputs it through a sound device, so the VOLTE terminal user can hear the peer user's voice and the real or simulated background sound of the peer's environment.
  • the speech translation apparatus of the embodiment of the present invention extracts background noise frames from the original speech information, then recognizes the mute frames in the translated speech information, and finally superimposes the background noise frames onto those mute frames, so that the translated speech information includes background noise information. The user can therefore hear not only clear speech but also the background sound of the real environment, which increases the authenticity of the dialogue between the two parties and enhances the user experience.
  • an embodiment of the present invention also provides a terminal device, the terminal device including a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, where the application is configured to perform the speech translation method. The speech translation method comprises the following steps: acquiring original speech information; extracting background noise frames from the original speech information; translating the original speech information to obtain translated speech information; identifying the mute frames in the translated speech information; and superimposing the background noise frames onto the mute frames in the translated speech information, so that the translated speech information contains background noise information.
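  • Putting the pieces together, the claimed method reduces to the five steps sketched below. The outline chains the illustrative helpers above and two hypothetical ones (translate_frames, frame_energy) that stand in for the translation round trips and the VAD feature computation; it is an assumption-laden outline, not the patented implementation.

```python
def speech_translation_pipeline(original_frames: list[dict],
                                threshold: float, src: str, dst: str):
    """Outline of the five claimed steps, reusing the earlier sketches."""
    # S12: extract and timestamp the background noise frames
    noise_frames = [{'timestamp': i, 'data': f['data']}
                    for i, f in enumerate(original_frames)
                    if f['type'] == 'noise']
    # S13: translate (e.g., via the three server round trips shown earlier);
    # translate_frames is a hypothetical helper
    translated_frames = translate_frames(original_frames, src, dst)
    # S14: identify mute frames; frame_energy is a hypothetical helper
    for f in translated_frames:
        if frame_energy(f['data']) <= threshold:
            f['type'] = 'mute'
    # S15: superimpose the saved noise frames onto the mute frames
    return superimpose_noise(translated_frames, noise_frames)
```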
  • the present invention includes apparatus related to performing one or more of the operations described herein.
  • These devices may be specially designed and manufactured for the required purposes, or may also include known devices in a general purpose computer.
  • These devices have computer programs stored therein that are selectively activated or reconfigured.
  • such computer programs may be stored in a device-readable (e.g., computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium in which a device (e.g., a computer) stores or transmits information in a readable form.

Abstract

A speech translation method and apparatus, the method comprising the following steps: acquiring original voice information (S11); extracting background noise frames from the original voice information (S12); translating the original voice information to obtain translated voice information (S13); identifying the mute frames in the translated voice information (S14); and superimposing the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information (S15).

Description

Speech translation method and apparatus
Technical Field
[0001] The present invention relates to the field of communications technologies, and in particular to a speech translation method and apparatus.
Background
[0002] As communication terminals come into ever wider use, people can use them for many functions, such as listening to music, watching videos, and making voice calls. Voice calling is a basic and common function of a communication terminal: even people separated by thousands of miles can talk remotely through their terminals, which effectively shortens the distance between people.
[0003] Meanwhile, with the globalization and internationalization of the economy, people from different countries interact ever more closely. People from different countries usually speak different languages; when at least one of two users cannot understand the other's language and the other user cannot speak the first user's language either, the two users need the translation function of a communication terminal to understand each other's speech.
Technical Problem
[0004] In the prior art, the translation processing of voice information by a communication terminal mainly comprises three processes: recognition, translation, and synthesis. The translated voice information is composed of speech frames and mute frames; a mute frame is actually a blank frame, a gap between speech frames. The translated voice information therefore contains only speech, without the background sound of the real environment, which greatly reduces the authenticity of the dialogue between the two parties and degrades the user experience.
Technical Solution
[0005] The main object of the present invention is to provide a speech translation method and apparatus, aiming to solve the technical problem that background sound missing from translated voice information reduces the authenticity of the dialogue.
[0006] To achieve the above object, an embodiment of the present invention provides a speech translation method, the method comprising the following steps:
[0007] acquiring original voice information;
[0008] extracting background noise frames from the original voice information;
[0009] translating the original voice information to obtain translated voice information; [0010] identifying the mute frames in the translated voice information;
[0011] superimposing the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
[0012] An embodiment of the present invention also provides a speech translation apparatus, the apparatus comprising:
[0013] a voice information acquiring module configured to acquire original voice information;
[0014] a background noise extraction module configured to extract background noise frames from the original voice information;
[0015] a voice translation processing module configured to translate the original voice information to obtain translated voice information;
[0016] a mute recognition module configured to identify the mute frames in the translated voice information;
[0017] a background noise superimposing module configured to superimpose the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
Advantageous Effects of the Invention
[0018] In the speech translation method provided by an embodiment of the present invention, background noise frames are extracted from the original voice information, the mute frames in the translated voice information are then identified, and the background noise frames are finally superimposed onto those mute frames, so that the translated voice information contains background noise. The user can therefore hear not only clear speech but also the background sound of the real environment, which increases the authenticity of the dialogue between the two parties and enhances the user experience.
Brief Description of the Drawings
[0019] FIG. 1 is a flowchart of an embodiment of the speech translation method of the present invention;
[0020] FIG. 2 is a schematic diagram of a segment of original voice information in an embodiment of the present invention;
[0021] FIG. 3 is a schematic diagram of the background noise frames extracted from the original voice information of FIG. 2 in an embodiment of the present invention;
[0022] FIG. 4 is another schematic diagram of a segment of original voice information in an embodiment of the present invention;
[0023] FIG. 5 is a flowchart of the translation processing of the original voice information in an embodiment of the present invention;
[0024] FIG. 6 is a schematic diagram of a segment of translated voice information in an embodiment of the present invention;
[0025] FIG. 7 is a schematic diagram of translated voice information with background noise added in an embodiment of the present invention; [0026] FIG. 8 is a system block diagram of an application scenario of the speech translation method of an embodiment of the present invention;
[0027] FIG. 9 is a system block diagram of another application scenario of the speech translation method of an embodiment of the present invention;
[0028] FIG. 10 is a system block diagram of yet another application scenario of the speech translation method of an embodiment of the present invention;
[0029] FIG. 11 is a system block diagram of yet another application scenario of the speech translation method of an embodiment of the present invention;
[0030] FIG. 12 is a block diagram of an embodiment of the speech translation apparatus of the present invention;
[0031] FIG. 13 is a block diagram of the voice information acquiring module of FIG. 12;
[0032] FIG. 14 is a block diagram of the background noise extraction module of FIG. 12;
[0033] FIG. 15 is a block diagram of the identification unit of FIG. 14;
[0034] FIG. 16 is a block diagram of the mute recognition module of FIG. 12;
[0035] FIG. 17 is a block diagram of the background noise superimposing module of FIG. 12.
Best Mode for Carrying Out the Invention
[0036] It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
[0037] The speech translation method and apparatus of the embodiments of the present invention can be applied to various terminal devices and are particularly suitable for a VOLTE terminal, that is, a communication terminal based on VoLTE (Voice over LTE) technology. VoLTE is an IP data transmission technology that needs no 2G/3G network: all services are carried on the 4G network, unifying data and voice services on the same network. Of course, the method and apparatus can also be applied to other terminal devices; the present invention is not limited in this respect.
[0038] Referring to FIG. 1, an embodiment of the speech translation method of the present invention comprises the following steps. [0039] S11: Acquire original voice information.
[0040] In step S11, the terminal device may collect original voice information through a sound collection device such as a microphone, or may receive original voice information sent by the opposite end.
[0041] Taking a VOLTE terminal as an example, the VOLTE terminal establishes a voice communication connection with the opposite end. On the uplink, the VOLTE terminal collects the original voice information through the microphone and buffers it. On the downlink, the VOLTE terminal receives the original voice information sent by the opposite end and buffers it.
[0042] S12: Extract background noise frames from the original voice information.
[0043] The original voice information is composed of a plurality of voice information frames, including speech frames and background noise frames. As shown in FIG. 2, a segment of original voice information comprises background noise frames 1 to m and speech frames 1 to n.
[0044] In step S12, the terminal device first identifies the background noise frames in the original voice information, then adds timestamp marks to the background noise frames in chronological order, and finally saves the background noise frames. As shown in FIG. 3, the background noise frames 1 to m extracted from FIG. 2 are schematically shown.
[0045] In the embodiment of the present invention, the terminal device identifies the background noise frames in the original voice information through voice activity detection (VAD).
[0046] Specifically, the terminal device performs voice activity detection on the original voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame. The frame length can be set according to the signal characteristics of the original voice information; for example, for a Global System for Mobile Communications (GSM) speech signal, 20 ms is used as the frame length, and the voice activity detection algorithm can be the GSM ETSI VAD algorithm or the G.729 Annex B VAD algorithm.
[0047] After obtaining the parameter feature value of each voice information frame, the terminal device compares it with a preset threshold and judges whether the parameter feature value is less than or equal to the threshold: if so, the frame is decided to be a background noise frame; if the value is greater than the threshold, the frame is decided to be a speech frame. Traversing every frame of the original voice information identifies all its speech frames and background noise frames. The parameter feature value here refers to the energy of each frame of the speech signal, usually measured by its level amplitude. The threshold can be set according to actual needs, for example on the basis of empirical or experimental data.
[0048] Optionally, when the terminal device receives original voice information sent by the opposite end that has already been denoised by the peer, the original voice information consists of speech frames and Silence Descriptor (SID) frames; a SID frame is the result of denoising a background noise frame. As shown in FIG. 4, a segment of denoised original voice information comprises SID frames 1 to n and speech frames 1 to n.
[0049] The terminal device parses the original voice information, identifies the SID frames through frame feature information, and then adds preset noise information into the SID frames, thereby restoring them to background noise frames; the frame format of the background noise frames is converted into the same frame format as the later translated voice information, and timestamp marks are added to the background noise frames in chronological order before they are saved. Of course, the background noise in this case is only simulated background noise, not the background noise of the peer user's real environment.
[0050] S13: Translate the original voice information to obtain translated voice information.
[0051] The embodiment of the present invention does not limit the order of steps S12 and S13; in some embodiments, steps S12 and S13 may also be performed simultaneously.
[0052] In the embodiment of the present invention, the terminal device may perform the translation locally to obtain the translated voice information, or may send the original voice information to a server, which performs the translation and returns the translated voice information.
[0053] For example, take the case where a VOLTE terminal performs translation through a server. The VOLTE terminal sends the original voice information to the server for translation, so that the server translates the original voice information from one language into another, obtains the translated voice information, and sends it to the VOLTE terminal, which receives the translated voice information.
[0054] The VOLTE terminal may send the original voice information directly to the server as a voice data stream; preferably, the VOLTE terminal sends the original voice information to the server packetized as data packets. For example, the VOLTE terminal first records the original first-language voice information into individual voice files and caches them, then sends each cached voice file to the server in turn in the form of data packets.
[0055] The translation processing mainly comprises three processes: recognition, translation, and synthesis. These three processes may be completed by one server, or by two or three servers.
[0056] In the embodiment of the present invention, the servers include a voice recognition server, a translation server, and a voice synthesis server. The VOLTE terminal establishes an IP-based connection with the voice recognition server and sets the recognition information, that is, the language types to be recognized, including the language type of the local end and optionally that of the peer; establishes an IP-based connection with the translation server and sets the translation information, that is, the languages to translate between, including the mapping from the local end to the peer and optionally the reverse mapping; and establishes an IP-based connection with the voice synthesis server and sets the synthesis information, that is, the type of speech to synthesize, such as male or female voice and speech rate.
[0057] As shown in FIG. 5, the specific process by which the VOLTE terminal sends the original voice information to the server for translation is as follows. [0058] S131: Send the original voice information to the voice recognition server, so that the voice recognition server recognizes the original voice information as a first character string.
[0059] The VOLTE terminal first records the original voice information as voice files and caches them, then sends each cached voice file to the voice recognition server in turn in the form of data packets. After receiving a voice file, the voice recognition server recognizes it according to the preset recognition information, obtains the first character string, and returns the first character string to the VOLTE terminal.
[0060] S132: Receive the first character string returned by the voice recognition server.
[0061] S133: Send the first character string to the translation server, so that the translation server translates the first character string into a second character string.
[0062] After receiving the first character string, the VOLTE terminal sends it to the translation server. After receiving the first character string, the translation server translates it according to the preset translation information into a second character string (that is, a character string in the other language) and returns the second character string to the VOLTE terminal.
[0063] S134: Receive the second character string returned by the translation server.
[0064] S135: Send the second character string to the voice synthesis server, so that the voice synthesis server synthesizes the second character string into voice information.
[0065] After receiving the second character string, the VOLTE terminal sends it to the voice synthesis server. After receiving the second character string, the voice synthesis server synthesizes it according to the preset synthesis information into voice information in the other language; this voice information is the translated voice information.
[0066] S136: Receive the voice information returned by the voice synthesis server; this voice information is the translated voice information.
[0067] The voice synthesis server returns the translated voice information to the VOLTE terminal in the form of a voice stream.
[0068] In other embodiments, the recognition, translation, and synthesis of the original voice information may also be completed by one server. For example, the VOLTE terminal sends the original voice information to the server, and the server recognizes, translates, and synthesizes the voice information and returns it to the VOLTE terminal.
[0069] In still other embodiments, the recognition, translation, and synthesis of the original voice information may also be completed by two servers. For example, the VOLTE terminal sends the original voice information to a first server, which recognizes and translates it and returns the result to the VOLTE terminal; the VOLTE terminal then sends the recognized and translated voice information to a second server, which synthesizes it and returns the voice information to the VOLTE terminal. As another example, the VOLTE terminal sends the original voice information to the first server, which recognizes it and returns the result to the VOLTE terminal; the VOLTE terminal then sends the recognized voice information to the second server, which translates and synthesizes it and returns the voice information to the VOLTE terminal.
[0070] After obtaining the translated voice information, the terminal device proceeds to the next step S14.
[0071] S14: Identify the mute frames in the translated voice information.
[0072] The translated voice information is also composed of a plurality of voice information frames, including speech frames and mute frames. As shown in FIG. 6, a segment of translated voice information comprises mute frames 1 to k and speech frames 1 to L.
[0073] In step S14, the terminal device performs voice activity detection on the translated voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame. The voice activity detection algorithm may be the GSM ETSI VAD algorithm or the G.729 Annex B VAD algorithm; of course, other algorithms may also be used, and the present invention is not limited in this respect.
[0074] After obtaining the parameter feature value of each voice information frame, the terminal device compares it with the preset threshold and judges whether the parameter feature value is less than or equal to the threshold: if so, the frame is decided to be a mute frame; if the value is greater than the threshold, the frame is decided to be a speech frame. Traversing every frame of the translated voice information identifies all its speech frames and mute frames and yields the starting point of each speech frame and mute frame. The parameter feature value here refers to the energy of each frame of the speech signal, usually measured by its level amplitude. The threshold can be set according to actual needs, for example on the basis of empirical or experimental data.
[0075] S15: Superimpose the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
[0076] In step S15, the terminal device first adds timestamp marks to the mute frames in chronological order, and then, according to the timestamp marks of the background noise frames and of the mute frames, superimposes each background noise frame onto the corresponding mute frame in the translated voice information; that is, the background noise frames and mute frames are merged in chronological order, so that the translated voice information contains the background noise information. As shown in FIG. 7, a segment of translated voice information with background noise added comprises background noise frames 1 to k (because a mute frame is a blank frame, after a background noise frame is superimposed on it only the background noise frame effectively remains) and speech frames 1 to L.
[0077] Preferably, the terminal device determines whether there are excess background noise frames; when there are (that is, when the number of background noise frames exceeds the number of mute frames), the terminal device clears the excess background noise frames to avoid affecting the speech frames and to preserve voice quality.
[0078] After superimposing the background noise frames onto the mute frames in the translated voice information, the terminal device may output the translated voice information, or may send it to the opposite end, which outputs it. The user can thus hear not only the speech but also the background sound, making the dialogue between the two parties more realistic. Moreover, the background noise frames do not overlap the speech frames, so the speech frames are not affected and the user can hear the speech clearly.
[0079] For example, on an uplink call the VOLTE terminal sends the translated voice information to the opposite end through the voice channel. After receiving the voice information, the peer processes it through its audio path and finally outputs it through a sound device (handset, speaker, etc.); the peer user can then hear the VOLTE terminal user's voice and the background sound of that user's environment. On a downlink call, the VOLTE terminal processes the translated voice information through its audio path and finally outputs it through a sound device (handset, speaker, etc.); the VOLTE terminal user can then hear the peer user's voice and the real or simulated background sound of the peer's environment.
[0080] In the speech translation method of the embodiment of the present invention, background noise frames are extracted from the original voice information, the mute frames in the translated voice information are then identified, and the background noise frames are finally superimposed onto those mute frames, so that the translated voice information contains background noise. The user can therefore hear not only clear speech but also the background sound of the real environment, which increases the authenticity of the dialogue between the two parties and enhances the user experience.
[0081] The embodiment of the present invention can be applied to the application scenario shown in FIG. 8, in which VOLTE terminal A and VOLTE terminal B establish a connection through an IP Multimedia Subsystem (IMS) network and each is connected to the voice recognition server, the translation server, and the voice synthesis server. Both VOLTE terminal A and VOLTE terminal B use the voice translation method of the embodiment of the present invention to process the original voice information collected at the local end before sending it to the peer, and the peer directly outputs the processed voice information.
[0082] The embodiment of the present invention can also be applied to the application scenarios shown in FIG. 9 to FIG. 11. In FIG. 9, VOLTE terminal A and voice terminal B establish a connection through the IMS network, and VOLTE terminal A is connected to the voice recognition server, the translation server, and the voice synthesis server. On an uplink call, VOLTE terminal A processes the original voice information collected at the local end with the voice translation method of this embodiment and then sends it to the peer, which outputs it directly. On a downlink call, VOLTE terminal A processes the original voice information sent by the peer with the same method and outputs the processed voice information.
[0083] In FIG. 10, VOLTE terminal A connects through the IMS network to the gateway between the IMS network and a 2G/3G network, voice terminal B connects to the same gateway through the 2G/3G network, and VOLTE terminal A is connected to the voice recognition server, the translation server, and the voice synthesis server. On an uplink call, VOLTE terminal A processes the original voice information collected at the local end with the voice translation method of this embodiment and then sends it to voice terminal B, which directly outputs the processed voice information. On a downlink call, VOLTE terminal A processes the original voice information sent by voice terminal B with the same method and outputs the processed voice information.
[0084] In FIG. 11, VOLTE terminal A connects through the IMS network to the gateway between the IMS network and the public switched telephone network (PSTN), voice terminal B connects to the same gateway through the PSTN, and VOLTE terminal A is connected to the voice recognition server, the translation server, and the voice synthesis server. On an uplink call, VOLTE terminal A processes the original voice information collected at the local end with the voice translation method of this embodiment and then sends it to voice terminal B, which directly outputs the processed voice information. On a downlink call, VOLTE terminal A processes the original voice information sent by voice terminal B with the same method and outputs the processed voice information.
[0085] The processing delay of the voice recognition server is generally less than 3 seconds, the processing delay of the translation server is generally less than 200 milliseconds, the processing delay of the voice synthesis server is generally less than 200 milliseconds, and the transmission delay of the IMS network is generally on the order of seconds. Using the high data rate and low latency of LTE communication, a multi-language real-time translation function for voice calls is implemented on the VOLTE terminal; the speech translation processing is fast and its delay small, so the user's call is not affected.
[0086] Referring to FIG. 12, an embodiment of the speech translation apparatus of the present invention comprises a voice information acquiring module 10, a background noise extraction module 20, a voice translation processing module 30, a mute recognition module 40, and a background noise superimposing module 50.
[0087] Voice information acquiring module 10: configured to acquire original voice information.
[0088] The voice information acquiring module 10 may collect original voice information through a sound collection device such as a microphone, or may receive original voice information sent by the opposite end.
[0089] As shown in FIG. 13, the voice information acquiring module 10 includes a collecting unit 11 and a receiving unit 12, wherein the collecting unit 11 is configured to collect original voice information and the receiving unit 12 is configured to receive the original voice information sent by the opposite end.
[0090] Taking application to a VOLTE terminal as an example, the VOLTE terminal establishes a voice communication connection with the opposite end. On the uplink, the collecting unit 11 collects the original voice information through the microphone and buffers it. On the downlink, the receiving unit 12 receives the original voice information sent by the opposite end and buffers it.
[0091] Background noise extraction module 20: configured to extract background noise frames from the original voice information.
[0092] The original voice information is composed of a plurality of voice information frames, including speech frames and background noise frames. As shown in FIG. 2, a segment of original voice information comprises background noise frames 1 to m and speech frames 1 to n.
[0093] As shown in FIG. 14, the background noise extraction module 20 includes an identification unit 21, a marking unit 22, and a saving unit 23, wherein: the identification unit 21 is configured to identify the background noise frames in the original voice information; the marking unit 22 is configured to add timestamp marks to the background noise frames in chronological order; and the saving unit 23 is configured to save the background noise frames. As shown in FIG. 3, the background noise frames 1 to m extracted from FIG. 2 are schematically shown.
[0094] In the embodiment of the present invention, the identification unit 21 identifies the background noise frames in the original voice information through voice activity detection (VAD).
[0095] As shown in FIG. 15, the identification unit 21 includes a first obtaining unit 211, a first judging unit 212, and a first deciding unit 213, wherein: the first obtaining unit 211 is configured to perform voice activity detection on the original voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame; the first judging unit 212 is configured to judge whether the parameter feature value is less than or equal to the threshold; and the first deciding unit 213 is configured to decide that a voice information frame is a background noise frame when its parameter feature value is less than or equal to the threshold, and a speech frame when the value is greater than the threshold. [0096] By traversing every frame of the original voice information, the identification unit 21 identifies all its speech frames and background noise frames. The parameter feature value here refers to the energy of each frame of the speech signal, usually measured by its level amplitude. The threshold can be set according to actual needs, for example on the basis of empirical or experimental data.
[0097] The frame length can be set according to the signal characteristics of the original voice information; for example, for a Global System for Mobile Communications (GSM) speech signal, 20 ms is used as the frame length, and the voice activity detection algorithm can be the GSM ETSI VAD algorithm or the G.729 Annex B VAD algorithm.
[0098] Optionally, when the voice information acquiring module 10 receives original voice information sent by the opposite end that has already been denoised by the peer, the original voice information consists of speech frames and Silence Descriptor (SID) frames; a SID frame is the result of denoising a background noise frame. As shown in FIG. 4, a segment of denoised original voice information comprises SID frames 1 to n and speech frames 1 to n.
[0099] In this case, the background noise extraction module 20 parses the original voice information, identifies the SID frames through frame feature information, and then adds preset noise information into the SID frames, thereby restoring them to background noise frames; the frame format of the background noise frames is converted into the same frame format as the later translated voice information, and timestamp marks are added to the background noise frames in chronological order before they are saved. Of course, the background noise in this case is only simulated background noise, not the background noise of the peer user's real environment.
[0100] Voice translation processing module 30: configured to translate the original voice information to obtain translated voice information.
[0101] In the embodiment of the present invention, the voice translation processing module 30 may perform the translation locally to obtain the translated voice information, or may send the original voice information to a server, which performs the translation and returns the translated voice information.
[0102] For example, take the case where the voice translation processing module 30 performs translation through a server. The voice translation processing module 30 sends the original voice information to the server for translation, so that the server translates the original voice information from one language into another, obtains the translated voice information, and sends it to the voice translation processing module 30, which receives the translated voice information. [0103] The voice translation processing module 30 may send the original voice information directly to the server as a voice data stream; preferably, it sends the original voice information to the server packetized as data packets. For example, the voice translation processing module 30 first records the original first-language voice information into individual voice files and caches them, then sends each cached voice file to the server in turn in the form of data packets.
[0104] The translation processing mainly comprises three processes: recognition, translation, and synthesis. These three processes may be completed by one server, or by two or three servers.
[0105] In the embodiment of the present invention, the servers include a voice recognition server, a translation server, and a voice synthesis server. Taking the application of the apparatus of the embodiment of the present invention to a VOLTE terminal as an example, the VOLTE terminal establishes an IP-based connection with the voice recognition server and sets the recognition information, that is, the language types to be recognized, including the language type of the local end and optionally that of the peer; establishes an IP-based connection with the translation server and sets the translation information, that is, the languages to translate between, including the mapping from the local end to the peer and optionally the reverse mapping; and establishes an IP-based connection with the voice synthesis server and sets the synthesis information, that is, the type of speech to synthesize, such as male or female voice and speech rate.
[0106] Mute recognition module 40: configured to identify the mute frames in the translated voice information.
[0107] The translated voice information is also composed of a plurality of voice information frames, including speech frames and mute frames. As shown in FIG. 6, a segment of translated voice information comprises mute frames 1 to k and speech frames 1 to L.
[0108] As shown in FIG. 16, the mute recognition module 40 includes a second obtaining unit 41, a second judging unit 42, and a second deciding unit 43, wherein: the second obtaining unit 41 is configured to perform voice activity detection on the translated voice information, processing it frame by frame to obtain the parameter feature value of each voice information frame; the second judging unit 42 is configured to judge whether the parameter feature value is less than or equal to the threshold; and the second deciding unit 43 is configured to decide that a voice information frame is a mute frame when its parameter feature value is less than or equal to the threshold.
[0109] By traversing every frame of the translated voice information, the mute recognition module 40 can recognize all its speech frames and mute frames. The parameter feature value here refers to the energy of each frame of the speech signal, usually measured by its level amplitude. The threshold can be set according to actual needs, for example on the basis of empirical or experimental data.
[0110] Background noise superimposing module 50: configured to superimpose the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
[0111] As shown in FIG. 17, the background noise superimposing module 50 includes a mark adding unit 51 and a noise superimposing unit 52, wherein: the mark adding unit 51 is configured to add timestamp marks to the mute frames in chronological order; and the noise superimposing unit 52 is configured to superimpose each background noise frame onto the corresponding mute frame in the translated voice information according to the timestamp marks of the background noise frames and of the mute frames, so that the translated voice information contains background noise information. As shown in FIG. 7, a segment of translated voice information with background noise added comprises background noise frames 1 to k (because a mute frame is a blank frame, after superimposition only the background noise frame effectively remains) and speech frames 1 to L.
[0112] Preferably, the noise superimposing unit 52 includes a merging unit and a clearing unit, wherein: the merging unit is configured to merge the background noise frames and the mute frames in chronological order; and the clearing unit is configured to determine whether there are excess background noise frames and, when there are (that is, when the number of background noise frames exceeds the number of mute frames), to clear the excess background noise frames to avoid affecting the speech frames and to preserve voice quality.
[0113] Further, the apparatus includes a voice information sending module configured to send the translated voice information to the opposite end. The peer user can thus hear not only the voice but also the background sound, making the conversation between the two parties more realistic. Moreover, the background noise frames do not overlap the voice frames, so the voice frames are not affected and the peer user can hear the speech clearly.
[0114] Further, the apparatus includes a voice information output module configured to output the translated voice information. The local user can thus hear not only the voice but also the background sound, making the conversation between the two parties more realistic. Moreover, the background noise frames do not overlap the voice frames, so the voice frames are not affected and the local user can hear the speech clearly.
[0115] For example, on an uplink call of the VOLTE terminal, the voice information sending module sends the translated voice information to the opposite end through the voice channel. After receiving the voice information, the peer processes it through its audio path and finally outputs it through a sound device (handset, speaker, etc.); the peer user can then hear the VOLTE terminal user's voice and the background sound of that user's environment. On a downlink call, the voice information output module processes the translated voice information through the audio path and finally outputs it through a sound device (handset, speaker, etc.); the VOLTE terminal user can then hear the peer user's voice and the real or simulated background sound of the peer's environment.
[0116] The speech translation apparatus of the embodiment of the present invention extracts background noise frames from the original voice information, then identifies the mute frames in the translated voice information, and finally superimposes the background noise frames onto those mute frames, so that the translated voice information contains background noise. The user can therefore hear not only clear speech but also the background sound of the real environment, which increases the authenticity of the dialogue between the two parties and enhances the user experience.
[0117] An embodiment of the present invention also provides a terminal device, the terminal device including a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the speech translation method. The speech translation method comprises the following steps: acquiring original voice information; extracting background noise frames from the original voice information; translating the original voice information to obtain translated voice information; identifying the mute frames in the translated voice information; and superimposing the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
[0118] Those skilled in the art will understand that the present invention includes apparatus related to performing one or more of the operations described in this application. These apparatus may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer. These devices have computer programs stored in them that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (e.g., computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, the computer-readable medium including but not limited to any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium in which a device (e.g., a computer) stores or transmits information in a readable form.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the present invention. Those skilled in the art may implement the present invention in many variant ways without departing from its scope and spirit; for example, a feature of one embodiment may be used in another embodiment to obtain yet another embodiment. Any modification, equivalent replacement, or improvement made within the technical concept of the present invention shall fall within the scope of the present invention.

Claims

1. A speech translation method, comprising the following steps: acquiring original voice information; extracting background noise frames from the original voice information; translating the original voice information to obtain translated voice information; identifying the mute frames in the translated voice information; and superimposing the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
2. The speech translation method according to claim 1, wherein the step of extracting background noise frames from the original voice information comprises: identifying the background noise frames in the original voice information; adding timestamp marks to the background noise frames in chronological order; and saving the background noise frames.
3. The speech translation method according to claim 2, wherein the step of identifying the background noise frames in the original voice information comprises: performing voice activity detection on the original voice information to obtain the parameter feature value of each voice information frame; judging whether the parameter feature value is less than or equal to a threshold; and when the parameter feature value is less than or equal to the threshold, deciding that the voice information frame is a background noise frame.
4. The speech translation method according to claim 1, wherein the step of identifying the mute frames in the translated voice information comprises: performing voice activity detection on the translated voice information to obtain the parameter feature value of each voice information frame; judging whether the parameter feature value is less than or equal to a threshold; and when the parameter feature value is less than or equal to the threshold, deciding that the voice information frame is a mute frame.
5. The speech translation method according to claim 2, wherein the step of superimposing the background noise frames onto the mute frames in the translated voice information comprises: adding timestamp marks to the mute frames in chronological order; and superimposing each background noise frame onto the corresponding mute frame in the translated voice information according to the timestamp marks of the background noise frames and the timestamp marks of the mute frames.
6. The speech translation method according to claim 5, wherein the step of superimposing each background noise frame onto the corresponding mute frame in the translated voice information comprises: merging the background noise frames and the mute frames in chronological order; and when there are excess background noise frames, clearing the excess background noise frames.
7. The speech translation method according to claim 1, wherein the step of acquiring original voice information comprises: collecting original voice information.
8. The speech translation method according to claim 7, further comprising, after the step of superimposing the background noise frames onto the mute frames in the translated voice information: sending the translated voice information to the opposite end.
9. The speech translation method according to claim 1, wherein the step of acquiring original voice information comprises: receiving original voice information sent by the opposite end.
10. The speech translation method according to claim 9, further comprising, after the step of superimposing the background noise frames onto the mute frames in the translated voice information: outputting the translated voice information.
11. A speech translation apparatus, comprising: a voice information acquiring module configured to acquire original voice information; a background noise extraction module configured to extract background noise frames from the original voice information; a voice translation processing module configured to translate the original voice information to obtain translated voice information; a mute recognition module configured to identify the mute frames in the translated voice information; and a background noise superimposing module configured to superimpose the background noise frames onto the mute frames in the translated voice information, so that the translated voice information contains background noise information.
12. The speech translation apparatus according to claim 11, wherein the background noise extraction module comprises: an identification unit configured to identify the background noise frames in the original voice information; a marking unit configured to add timestamp marks to the background noise frames in chronological order; and a saving unit configured to save the background noise frames.
13. The speech translation apparatus according to claim 12, wherein the identification unit comprises: a first obtaining unit configured to perform voice activity detection on the original voice information and obtain the parameter feature value of each voice information frame; a first judging unit configured to judge whether the parameter feature value is less than or equal to a threshold; and a first deciding unit configured to decide that the voice information frame is a background noise frame when the parameter feature value is less than or equal to the threshold.
14. The speech translation apparatus according to claim 11, wherein the mute recognition module comprises: a second obtaining unit configured to perform voice activity detection on the translated voice information and obtain the parameter feature value of each voice information frame; a second judging unit configured to judge whether the parameter feature value is less than or equal to a threshold; and a second deciding unit configured to decide that the voice information frame is a mute frame when the parameter feature value is less than or equal to the threshold.
15. The speech translation apparatus according to claim 12, wherein the background noise superimposing module comprises: a mark adding unit configured to add timestamp marks to the mute frames in chronological order; and a noise superimposing unit configured to superimpose each background noise frame onto the corresponding mute frame in the translated voice information according to the timestamp marks of the background noise frames and the timestamp marks of the mute frames.
16. The speech translation apparatus according to claim 15, wherein the noise superimposing unit comprises: a merging unit configured to merge the background noise frames and the mute frames in chronological order; and a clearing unit configured to clear excess background noise frames when there are excess background noise frames.
17. The speech translation apparatus according to claim 11, wherein the voice information acquiring module comprises a collecting unit, the collecting unit being configured to collect original voice information.
18. The speech translation apparatus according to claim 17, wherein the apparatus further comprises a voice information sending module, the voice information sending module being configured to send the translated voice information to the opposite end.
19. The speech translation apparatus according to claim 11, wherein the voice information acquiring module comprises a receiving unit, the receiving unit being configured to receive original voice information sent by the opposite end.
20. The speech translation apparatus according to claim 19, wherein the apparatus further comprises a voice information output module, the voice information output module being configured to output the translated voice information.
PCT/CN2017/094874 2017-07-28 2017-07-28 Speech translation method and apparatus WO2019019135A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/094874 WO2019019135A1 (zh) 2017-07-28 2017-07-28 Speech translation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/094874 WO2019019135A1 (zh) 2017-07-28 2017-07-28 Speech translation method and apparatus

Publications (1)

Publication Number Publication Date
WO2019019135A1 true WO2019019135A1 (zh) 2019-01-31

Family

ID=65039960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/094874 WO2019019135A1 (zh) 2017-07-28 2017-07-28 语音翻译方法和装置

Country Status (1)

Country Link
WO (1) WO2019019135A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1157442A * 1995-11-15 1997-08-20 株式会社日立制作所 Character recognition translation system and speech recognition system
CN101087319A * 2006-06-05 2007-12-12 华为技术有限公司 Method and apparatus for sending and receiving background noise, and silence compression system
CN102903361A * 2012-10-15 2013-01-30 Itp创新科技有限公司 Instant call translation system and method
US20160283469A1 (en) * 2015-03-25 2016-09-29 Babelman LLC Wearable translation device


Similar Documents

Publication Publication Date Title
US9516497B2 (en) Systems and methods for detecting call provenance from call audio
US10832696B2 (en) Speech signal cascade processing method, terminal, and computer-readable storage medium
CN105448303A Speech signal processing method and apparatus
CN103514882B Speech recognition method and system
WO2019164574A1 (en) Transcription of communications
CN112242149B Audio data processing method and apparatus, earphone, and computer-readable storage medium
US20230326468A1 (en) Audio processing of missing audio information
WO2019075829A1 Speech translation method and apparatus, and translation device
CN111199751B Microphone shielding method and apparatus, and electronic device
WO2018166367A1 Real-time reminding method and apparatus in a real-time conversation, storage medium, and electronic apparatus
CN105933181A Call delay evaluation method and apparatus
CN107391498B Speech translation method and apparatus
US20210312143A1 (en) Real-time call translation system and method
CN111199745A Advertisement recognition method, device, media platform, terminal, server, and medium
CN113284500A Audio processing method and apparatus, electronic device, and storage medium
CN1838663B Implementation method for detecting VoIP applications in an IP network
WO2019019135A1 Speech translation method and apparatus
TWI282547B (en) A method and apparatus to perform speech recognition over a voice channel
CN116800725A Data processing method and apparatus
CN110931004A Voice dialogue analysis method and apparatus implemented based on docking technology
CN113593587B Speech separation method and apparatus, storage medium, and electronic apparatus
GB2516208A (en) Noise reduction in voice communications
US20200184973A1 (en) Transcription of communications
CN114285910A System and method for reshaping audio formats between a communication terminal and the Internet
US11431767B2 (en) Changing a communication session

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17918728

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17918728

Country of ref document: EP

Kind code of ref document: A1