WO2016165590A1 - Speech translation method and device

Speech translation method and device

Info

Publication number
WO2016165590A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
voiceprint feature
voice data
extracted
module
Application number
PCT/CN2016/078895
Other languages
French (fr)
Chinese (zh)
Inventor
张丽竹
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2016165590A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language

Definitions

  • The present invention relates to the field of speech translation technology, and in particular to a speech translation method and apparatus.
  • The main purpose of the embodiments of the present invention is to provide a speech translation method and device, aiming to solve the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency.
  • A speech translation method provided by an embodiment of the present invention includes the following steps: extracting a voiceprint feature of first voice data when the first voice data is received; determining a language category corresponding to the extracted voiceprint feature; acquiring a pre-stored second language when the language category corresponding to the extracted voiceprint feature is a first language; and converting the first voice data from the first language into second voice data corresponding to the second language.
  • Preferably, the step of determining the language category corresponding to the extracted voiceprint feature includes:
  • judging whether the extracted voiceprint feature matches a pre-stored voiceprint feature of the first language;
  • when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the first language; and
  • when the extracted voiceprint feature does not match the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the second language.
  • Preferably, the step of converting the first voice data from the first language into the second voice data corresponding to the second language includes:
  • converting the first voice data into first text data corresponding to the first language, according to the first language;
  • translating the first text data into second text data corresponding to the second language; and
  • synthesizing the second text data into the second voice data.
  • Preferably, after the converting step, the method further includes: outputting the second voice data.
  • Preferably, before the step of extracting the voiceprint feature of the first voice data when the first voice data is received, the method further includes:
  • receiving a setting instruction for the first language and the second language;
  • providing a language-category selection interface according to the setting instruction, for the user to select the first language and the second language;
  • saving the first language and the second language when the user selects them; and
  • extracting a voiceprint feature of the voice data corresponding to the first language, and saving the voiceprint feature.
  • In addition, an embodiment of the present invention further provides a speech translation apparatus, including:
  • an extraction module, configured to extract a voiceprint feature of first voice data when the first voice data is received;
  • a determination module, configured to determine a language category corresponding to the extracted voiceprint feature;
  • an acquisition module, configured to acquire a pre-stored second language when the language category corresponding to the extracted voiceprint feature is a first language; and
  • a conversion module, configured to convert the first voice data from the first language into second voice data corresponding to the second language.
  • Preferably, the determination module includes a judging unit and a determining unit,
  • wherein the judging unit is configured to judge whether the extracted voiceprint feature matches a pre-stored voiceprint feature of the first language; and
  • the determining unit is configured to determine that the language category corresponding to the extracted voiceprint feature is the first language when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, and to determine that the language category is the second language when it does not.
  • Preferably, the conversion module includes a conversion unit, a translation unit, and a synthesis unit,
  • wherein the conversion unit is configured to convert the first voice data into first text data corresponding to the first language, according to the first language;
  • the translation unit is configured to translate the first text data into second text data corresponding to the second language; and
  • the synthesis unit is configured to synthesize the second text data into the second voice data.
  • Preferably, the speech translation apparatus further includes an output module, configured to output the second voice data.
  • Preferably, the speech translation apparatus further includes a receiving module, a providing module, and a saving module,
  • wherein the receiving module is configured to receive a setting instruction for the first language and the second language;
  • the providing module is configured to provide a language-category selection interface according to the setting instruction, for the user to select the first language and the second language;
  • the saving module is configured to save the first language and the second language when the user selects them, and is further configured to save the voiceprint feature of the first language; and
  • the extraction module is further configured to extract a voiceprint feature of the voice data corresponding to the first language.
  • Compared with the prior art, the embodiments of the present invention receive voice data, extract the voiceprint feature corresponding to the voice data, determine the language category corresponding to the extracted voiceprint feature, acquire the pre-stored second language when that language category is the first language, and convert the first voice data from the first language into the second voice data corresponding to the second language. Different languages are thus distinguished accurately, and the speech of one language is automatically converted into the speech of another, improving the effectiveness of communication.
  • FIG. 1 is a schematic flowchart of a first embodiment of the speech translation method of the present invention;
  • FIG. 2 is a detailed flowchart of an embodiment of step S40 in FIG. 1;
  • FIG. 3 is a schematic flowchart of a second embodiment of the speech translation method of the present invention;
  • FIG. 4 is a schematic flowchart of a third embodiment of the speech translation method of the present invention;
  • FIG. 5 is a schematic diagram of the functional modules of a first embodiment of the speech translation apparatus of the present invention;
  • FIG. 6 is a schematic diagram of the detailed functional modules of an embodiment of the determination module in FIG. 5;
  • FIG. 7 is a schematic diagram of the detailed functional modules of an embodiment of the conversion module in FIG. 5;
  • FIG. 8 is a schematic diagram of the functional modules of a second embodiment of the speech translation apparatus of the present invention.
  • The main solution of the embodiments of the present invention is: extracting a voiceprint feature of first voice data when the first voice data is received; determining the language category corresponding to the extracted voiceprint feature; acquiring a pre-stored second language when the language category corresponding to the extracted voiceprint feature is the first language; and converting the first voice data from the first language into second voice data corresponding to the second language.
  • Based on the above problem, the present invention provides a speech translation method.
  • Referring to FIG. 1, FIG. 1 is a schematic flowchart of the first embodiment of the speech translation method of the present invention.
  • In one embodiment, the speech translation method includes:
  • Step S10: extracting a voiceprint feature of the first voice data when the first voice data is received.
  • Voice data is received in real time, and voiceprint features are extracted from it. The voiceprint features may be extracted during the conversation, and the emphasis of extraction may differ according to the languages selected, for example distinguishing dialects within a language or distinguishing Chinese from English; extraction may also focus on features that identify the speaker's accent and manner of pronunciation.
  • The voiceprint feature may be extracted by pre-processing the first voice data: the pre-processing samples, quantizes, pre-emphasizes, and windows the first voice data, converting the original first voice data into an N-dimensional feature vector from which the voiceprint feature of the first voice data is obtained.
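  • For illustration only, the following is a minimal sketch of such pre-processing in Python with NumPy. The frame length, hop size, pre-emphasis coefficient, and the dimension N are illustrative assumptions (the embodiments do not fix them), and the simple log band-energy features stand in for whatever voiceprint features a real implementation extracts.

```python
import numpy as np

def extract_voiceprint(signal: np.ndarray, n_dims: int = 32,
                       frame_len: int = 400, hop: int = 160,
                       pre_emphasis: float = 0.97) -> np.ndarray:
    # Pre-emphasis: y[t] = x[t] - a * x[t-1], boosting high frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Pad so that at least one full frame is available.
    emphasized = np.pad(emphasized, (0, max(0, frame_len - len(emphasized))))
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Per-frame log band energies, averaged over time into a single
    # N-dimensional feature vector.
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(spectrum, n_dims, axis=1)
    log_energies = np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-10)
    return log_energies.mean(axis=0)  # shape: (n_dims,)
```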
  • The first voice data may be received through a microphone, a Bluetooth headset, or the like; other receiving modes are not excluded.
  • Step S20: determining the language category corresponding to the extracted voiceprint feature.
  • A voiceprint model is established from the extracted voiceprint features, and it is determined whether this model matches the voiceprint model of a pre-stored language category. The voiceprint feature model may be chosen differently according to the configured languages, appropriately increasing the weight of certain voiceprint features associated with a particular language.
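  • A minimal sketch of this matching step, continuing the sketch above: the pre-stored reference vector, the per-feature weights that raise the proportion of language-specific features, and the decision threshold are all illustrative assumptions rather than values given by the embodiments.

```python
import numpy as np

def weighted_cosine(a: np.ndarray, b: np.ndarray, w: np.ndarray) -> float:
    # Cosine similarity after scaling each feature by its weight.
    aw, bw = a * w, b * w
    return float(np.dot(aw, bw) / (np.linalg.norm(aw) * np.linalg.norm(bw)))

def classify_language(features: np.ndarray,
                      first_lang_ref: np.ndarray,
                      first_lang_weights: np.ndarray,
                      threshold: float = 0.8) -> str:
    # A match with the first language's stored voiceprint means the first
    # language; otherwise the utterance is attributed to the second language.
    score = weighted_cosine(features, first_lang_ref, first_lang_weights)
    return "first_language" if score >= threshold else "second_language"
```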
  • Step S30: acquiring a pre-stored second language when the language category corresponding to the extracted voiceprint feature is the first language.
  • It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. When they match, the other language in the conversation scene is acquired as the second language; when they do not match, the language category corresponding to the extracted voiceprint feature is determined to be the second language. Taking a Chinese-English conversation scene as an example, where the first language is Chinese and the second language is English: after the voiceprint features of the voice data are extracted, it is judged whether they match the pre-stored Chinese voiceprint features.
  • If the extracted voiceprint feature matches the pre-stored Chinese voiceprint feature, the language category corresponding to the extracted voiceprint feature is determined to be Chinese, and the other language in the conversation scene is therefore English.
  • If the extracted voiceprint feature does not match the pre-stored Chinese voiceprint feature, the language category corresponding to the voiceprint feature is English, and the other language in the conversation scene is therefore Chinese.
  • Step S40: converting the first voice data from the first language into second voice data corresponding to the second language.
  • After the first language and the second language are determined, the first language, the second language, and the first voice data are transmitted to a cloud server, which processes the first voice data and converts it, according to the first language, into second voice data corresponding to the second language.
  • The processing of the received voice data may also be split, partly performed on the cloud server and partly performed locally.
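  • A minimal sketch of handing the conversion off to a cloud server is given below. The endpoint URL and the JSON request and response shapes are hypothetical; the embodiments state only that the two languages and the first voice data are transmitted and that second-language voice data is returned.

```python
import base64

import requests  # third-party HTTP client, assumed available

CLOUD_ENDPOINT = "https://example.com/api/translate-speech"  # hypothetical

def convert_via_cloud(first_voice_data: bytes,
                      first_language: str,
                      second_language: str) -> bytes:
    payload = {
        "source_language": first_language,
        "target_language": second_language,
        "audio": base64.b64encode(first_voice_data).decode("ascii"),
    }
    resp = requests.post(CLOUD_ENDPOINT, json=payload, timeout=30)
    resp.raise_for_status()
    # Assume the server answers with the synthesized second-language audio.
    return base64.b64decode(resp.json()["audio"])
```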
  • Specifically, referring to FIG. 2, the process of converting the first voice data from the first language into the second voice data corresponding to the second language may be:
  • Step S41: converting the first voice data into first text data corresponding to the first language, according to the first language;
  • Step S42: translating the first text data into second text data corresponding to the second language;
  • Step S43: synthesizing the second text data into the second voice data.
  • In this embodiment, taking the case where the first language is Chinese and the second language is English: after Chinese and English are acquired, the Chinese voice data is converted into Chinese text data according to Chinese; the Chinese text data is translated into English text data; the converted Chinese and English text data may be displayed on an interface; and finally the English text data is synthesized into English voice data.
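  • Steps S41 to S43 form a three-stage pipeline. The sketch below shows the data flow only; the recognize, translate, and synthesize callables are placeholders for whatever speech-recognition, machine-translation, and speech-synthesis back ends an implementation plugs in, since none are named by the embodiments.

```python
from typing import Callable

def translate_speech(first_voice_data: bytes,
                     first_language: str,
                     second_language: str,
                     recognize: Callable[[bytes, str], str],
                     translate: Callable[[str, str, str], str],
                     synthesize: Callable[[str, str], bytes]) -> bytes:
    # Step S41: first-language speech -> first-language text.
    first_text = recognize(first_voice_data, first_language)
    # Step S42: first-language text -> second-language text.
    second_text = translate(first_text, first_language, second_language)
    # Step S43: second-language text -> second-language speech.
    return synthesize(second_text, second_language)
```

  • Applied to the example above, the pipeline carries Chinese voice data to Chinese text, Chinese text to English text, and English text to English voice data.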
  • In this embodiment, when the first voice data is received, its voiceprint feature is extracted; the language category corresponding to the extracted voiceprint feature is determined; when that language category is the first language, the pre-stored second language is acquired; and the first voice data is converted from the first language into second voice data corresponding to the second language. Different languages are thus distinguished accurately through speech recognition, and the speech of one language is automatically converted into the speech of another, improving the effectiveness of communication.
  • Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the speech translation method of the present invention. Based on the first method embodiment, step S20 includes:
  • Step S21: judging whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language;
  • Step S22: when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the first language;
  • Step S23: when the extracted voiceprint feature does not match the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the second language.
  • It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. If it matches, the language category corresponding to the first voice data is the first language, and the second language is the other language in the conversation scene; otherwise, the language category corresponding to the first voice data is the second language.
  • When the first language and the second language are acquired, they are displayed so that the user can check whether they are correct.
  • The first and second languages may be displayed by a voice announcement of the current languages, by highlighting them, or in other ways, set according to the user's needs and/or the performance of the system.
  • Further, after step S40, the method further includes:
  • Step S50: outputting the second voice data.
  • The second voice data may be output directly through a speaker or through an earphone, set according to the user's needs and/or the performance of the system.
  • This embodiment judges whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, and when it matches, determines that the language category corresponding to the voiceprint feature is the first language.
  • Determining the language category through the voiceprint feature improves the accuracy of recognition and further improves the effectiveness of communication.
  • Referring to FIG. 4, FIG. 4 is a schematic flowchart of the third embodiment of the speech translation method of the present invention. Based on the first method embodiment, before step S10 the method further includes:
  • Step S60: receiving a setting instruction for the first language and the second language;
  • Step S70: providing a language-category selection interface according to the setting instruction, for the user to select the first language and the second language;
  • Step S80: saving the first language and the second language when the user selects them;
  • Step S90: extracting a voiceprint feature of the voice data corresponding to the first language, and saving the voiceprint feature.
  • The setting instruction for the first language and the second language may be received at the beginning of the conversation; when it is received, a language-category selection interface is provided according to the instruction for the user to select the first language and the second language, and the selected languages are saved.
  • The first language and the second language may also be selected by voice, according to the user's needs and/or the performance of the system. After the first and second languages are saved, the first voice data corresponding to the first language is received, its voiceprint feature is extracted, and the voiceprint feature is saved.
  • The first and second languages may be language names such as Chinese or English, or may be set by geographical name, such as Guangdong or Canada; if a geographical name is set, the voiceprint features corresponding to the main local language category may be pre-stored locally.
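  • A minimal sketch of this setup flow (steps S60 to S90), assuming a simple in-memory settings object; the TranslationSettings type and the extract_voiceprint callable are illustrative assumptions, not structures defined by the embodiments.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TranslationSettings:
    first_language: str = ""
    second_language: str = ""
    first_voiceprint: List[float] = field(default_factory=list)

def on_languages_selected(settings: TranslationSettings,
                          first: str, second: str) -> None:
    # Step S80: save the language pair chosen on the selection interface.
    settings.first_language = first
    settings.second_language = second

def enroll_first_speaker(settings: TranslationSettings,
                         voice_data: bytes,
                         extract_voiceprint: Callable[[bytes], List[float]]) -> None:
    # Step S90: extract the first speaker's voiceprint feature and save it,
    # so that later utterances can be matched against it.
    settings.first_voiceprint = extract_voiceprint(voice_data)
```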
  • In other embodiments of the present invention, the speech translation method may also be applied to a multi-language conference in which, for example, four languages A, B, C, and D are used.
  • In the conference, an interface is provided for each user to select his or her own language; after the user selects a language, the selection is transmitted to the cloud server through Bluetooth or Wi-Fi of a transmission module.
  • The four languages A, B, C, and D, together with the voiceprint features corresponding to each, are pre-stored on the cloud server.
  • When voice data is received, its voiceprint feature is extracted, and it is determined whether the extracted voiceprint feature matches the voiceprint feature of a pre-stored language category.
  • Taking a match with the pre-stored voiceprint feature of language A as an example: when the extracted voiceprint feature matches the pre-stored voiceprint feature of language A, the language category corresponding to the extracted voiceprint feature is determined to be language A.
  • The pre-stored languages B, C, and D are acquired from the cloud server; the received voice data is converted into A-language text data according to language A; the A text data is then translated into B, C, and D text data; the B, C, and D text data are converted into B, C, and D voice data respectively; and the results are finally transmitted, through Bluetooth or Wi-Fi of the transmission module, to the speakers or earphones of the users of languages B, C, and D.
  • In this way, when voice data is received, its voiceprint feature can be extracted, the language category corresponding to the voiceprint feature can be determined from the stored correspondence between voiceprint features and languages, and different languages are distinguished accurately through speech recognition, improving the effectiveness of communication.
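  • A minimal sketch of the conference flow just described: the speaker's language is identified by voiceprint match, the utterance is recognized once, and the result is fanned out to every other conference language. All back-end callables are placeholders, as in the earlier sketches.

```python
from typing import Callable, List

def conference_fanout(voice_data: bytes,
                      languages: List[str],                  # e.g. ["A", "B", "C", "D"]
                      identify: Callable[[bytes], str],      # voiceprint -> language
                      recognize: Callable[[bytes, str], str],
                      translate: Callable[[str, str, str], str],
                      synthesize: Callable[[str, str], bytes],
                      send_to_user: Callable[[str, bytes], None]) -> None:
    spoken = identify(voice_data)            # e.g. language A
    text = recognize(voice_data, spoken)     # A voice -> A text
    for target in languages:
        if target == spoken:
            continue
        target_text = translate(text, spoken, target)    # A text -> B/C/D text
        target_voice = synthesize(target_text, target)   # text -> target voice
        send_to_user(target, target_voice)   # e.g. over Bluetooth or Wi-Fi
```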
  • The execution body of the speech translation method in each of the first to third embodiments above may be a speech translation device, or terminal equipment to which a speech translation device is coupled. Further, the speech translation method may be implemented by a client translation program installed on such a device, including but not limited to a mobile phone, a tablet, a notebook computer, and the like.
  • The present invention further provides a speech translation apparatus.
  • FIG. 5 is a schematic diagram of functional modules of a first embodiment of a speech translation apparatus according to the present invention.
  • In one embodiment, the speech translation apparatus includes an extraction module 10, a determination module 20, an acquisition module 30, and a conversion module 40.
  • The extraction module 10 is configured to extract a voiceprint feature of the first voice data when the first voice data is received.
  • Voice data is received in real time, and voiceprint features are extracted from it. The voiceprint features may be extracted during the conversation, and the emphasis of extraction may differ according to the languages selected, for example distinguishing dialects within a language or distinguishing Chinese from English; extraction may also focus on features that identify the speaker's accent and manner of pronunciation.
  • The voiceprint feature may be extracted by pre-processing the first voice data: the pre-processing samples, quantizes, pre-emphasizes, and windows the first voice data, converting the original first voice data into an N-dimensional feature vector from which the voiceprint feature of the first voice data is obtained.
  • The first voice data may be received through a microphone, a Bluetooth headset, or the like; other receiving modes are not excluded.
  • The determination module 20 is configured to determine the language category corresponding to the extracted voiceprint feature.
  • A voiceprint model is established from the extracted voiceprint features, and it is determined whether this model matches the voiceprint model of a pre-stored language category. The voiceprint feature model may be chosen differently according to the configured languages, appropriately increasing the weight of certain voiceprint features associated with a particular language.
  • Preferably, the determination module 20 includes a judging unit 21 and a determining unit 22,
  • wherein the judging unit 21 is configured to judge whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language; and
  • the determining unit 22 is configured to determine that the language category corresponding to the extracted voiceprint feature is the first language when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, and to determine that the language category is the second language when it does not.
  • It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. If it matches, the language category corresponding to the first voice data is the first language, and the second language is the other language in the conversation scene; otherwise, the language category corresponding to the first voice data is the second language.
  • When the first language and the second language are acquired, they are displayed so that the user can check whether they are correct.
  • The first and second languages may be displayed by a voice announcement of the current languages, by highlighting them, or in other ways, set according to the user's needs and/or the performance of the system.
  • The acquisition module 30 is configured to acquire a pre-stored second language when the language category corresponding to the extracted voiceprint feature is the first language.
  • It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. When they match, the other language in the conversation scene is acquired as the second language; when they do not match, the language category corresponding to the extracted voiceprint feature is determined to be the second language. Taking a Chinese-English conversation scene as an example, where the first language is Chinese and the second language is English: after the voiceprint features of the voice data are extracted, it is judged whether they match the pre-stored Chinese voiceprint features.
  • If the extracted voiceprint feature matches the pre-stored Chinese voiceprint feature, the language category corresponding to the extracted voiceprint feature is determined to be Chinese, and the other language in the conversation scene is therefore English.
  • If the extracted voiceprint feature does not match the pre-stored Chinese voiceprint feature, the language category corresponding to the extracted voiceprint feature is English, and the other language in the conversation scene is therefore Chinese.
  • The conversion module 40 is configured to convert the first voice data from the first language into second voice data corresponding to the second language.
  • The processing of the received voice data may also be partly performed on the cloud server and partly performed locally.
  • Preferably, the conversion module 40 includes a conversion unit 41, a translation unit 42, and a synthesis unit 43,
  • wherein the conversion unit 41 is configured to convert the first voice data into first text data corresponding to the first language, according to the first language;
  • the translation unit 42 is configured to translate the first text data into second text data corresponding to the second language; and
  • the synthesis unit 43 is configured to synthesize the second text data into the second voice data.
  • In this embodiment, taking the case where the first language is Chinese and the second language is English: after Chinese and English are acquired, the Chinese voice data is converted into Chinese text data according to Chinese; the Chinese text data is translated into English text data; the converted Chinese and English text data may be displayed on an interface; and finally the English text data is synthesized into English voice data.
  • In this embodiment, the voiceprint feature of the first voice data is extracted; the language category corresponding to the extracted voiceprint feature is determined; when that language category is the first language, the pre-stored second language is acquired; and the first voice data is converted from the first language into second voice data corresponding to the second language. Different languages are distinguished accurately through speech recognition, improving the effectiveness of communication.
  • FIG. 8 is a schematic diagram of functional modules of a second embodiment of a speech translation apparatus according to the present invention.
  • The speech translation apparatus of this embodiment further includes an output module 50, a receiving module 60, a providing module 70, and a saving module 80.
  • The output module 50 is configured to output the second voice data.
  • The second voice data may be output directly through a speaker or through an earphone, set according to the user's needs and/or the performance of the system.
  • The receiving module 60 is configured to receive a setting instruction for the first language and the second language.
  • The providing module 70 is configured to provide a language-category selection interface according to the setting instruction, for the user to select the first language and the second language.
  • The saving module 80 is configured to save the first language and the second language when the user selects them, and is further configured to save the voiceprint feature of the first language.
  • The extraction module 10 is further configured to extract a voiceprint feature of the voice data corresponding to the first language.
  • The setting instruction for the first language and the second language may be received at the beginning of the conversation; when it is received, a language-category selection interface is provided according to the instruction for the user to select the first language and the second language, and the selected languages are saved.
  • The first language and the second language may also be selected by voice, according to the user's needs and/or the performance of the system. After the first and second languages are saved, the first voice data corresponding to the first language is received, its voiceprint feature is extracted, and the voiceprint feature is saved.
  • The first and second languages may be language names such as Chinese or English, or may be set by geographical name, such as Guangdong or Canada; if a geographical name is set, the voiceprint features corresponding to the main local language category may be pre-stored locally.
  • In other embodiments of the present invention, the speech translation method may also be applied to a multi-language conference in which, for example, four languages A, B, C, and D are used.
  • In the conference, an interface is provided for each user to select his or her own language; after the user selects a language, the selection is transmitted to the cloud server through Bluetooth or Wi-Fi of a transmission module.
  • The four languages A, B, C, and D, together with the voiceprint features corresponding to each, are pre-stored on the cloud server.
  • When voice data is received, its voiceprint feature is extracted, and it is determined whether the extracted voiceprint feature matches the voiceprint feature of a pre-stored language category.
  • Taking a match with the pre-stored voiceprint feature of language A as an example: when the extracted voiceprint feature matches the pre-stored voiceprint feature of language A, the language category corresponding to the extracted voiceprint feature is determined to be language A.
  • The pre-stored languages B, C, and D are acquired from the cloud server; the received voice data is converted into A-language text data according to language A; the A text data is then translated into B, C, and D text data; the B, C, and D text data are converted into B, C, and D voice data respectively; and the results are finally transmitted, through Bluetooth or Wi-Fi of the transmission module, to the speakers or earphones of the users of languages B, C, and D.
  • In this way, when voice data is received, its voiceprint feature can be extracted, the language category corresponding to the voiceprint feature can be determined from the stored correspondence between voiceprint features and languages, and different languages are distinguished accurately, improving the effectiveness of communication.
  • The technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
  • The foregoing embodiments of the present invention can be applied in the field of speech translation technology. They solve the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency; different languages are distinguished accurately, and the speech of one language is automatically converted into the speech of another, improving the effectiveness of communication.

Abstract

A speech translation method and device. The method comprises the steps of: when first speech data is received, extracting a voiceprint characteristic of the first speech data (S10); determining a language type corresponding to the extracted voiceprint characteristic (S20); when the language type corresponding to the extracted voiceprint characteristic is a first language, acquiring a pre-stored second language (S30); and converting the first speech data from the first language into second speech data corresponding to the second language (S40). In the method, different languages are differentiated by extracting voiceprint characteristics, and the speech of one language is automatically converted into the speech of another language, thereby improving the effectiveness of communication.

Description

Speech translation method and device
Technical field
The present invention relates to the field of speech translation technology, and in particular to a speech translation method and apparatus.
Background art
When communicating with people who speak different languages, speech recognition, translation, and speech synthesis technologies can already be combined to convert the speech of one language into the speech of another for direct and effective communication. Although current speech recognition technology has recognition models for most languages, existing speech translation software and devices require the user to manually switch the source and target languages before communicating so that the corresponding recognition and translation can be performed; they cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency.
The above content is only intended to assist in understanding the technical solutions of the present invention, and does not constitute an admission that it is prior art.
Summary of the invention
The main purpose of the embodiments of the present invention is to provide a speech translation method and device, aiming to solve the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency.
To achieve the above objective, a speech translation method provided by an embodiment of the present invention includes the following steps:
extracting a voiceprint feature of first voice data when the first voice data is received;
determining a language category corresponding to the extracted voiceprint feature;
acquiring a pre-stored second language when the language category corresponding to the extracted voiceprint feature is a first language; and
converting the first voice data from the first language into second voice data corresponding to the second language.
Preferably, the step of determining the language category corresponding to the extracted voiceprint feature includes:
judging whether the extracted voiceprint feature matches a pre-stored voiceprint feature of the first language;
when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the first language; and
when the extracted voiceprint feature does not match the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the second language.
Preferably, the step of converting the first voice data from the first language into the second voice data corresponding to the second language includes:
converting the first voice data into first text data corresponding to the first language, according to the first language;
translating the first text data into second text data corresponding to the second language; and
synthesizing the second text data into the second voice data.
Preferably, after the step of converting the first voice data from the first language into the second voice data corresponding to the second language, the method further includes: outputting the second voice data.
Preferably, before the step of extracting the voiceprint feature of the first voice data when the first voice data is received, the method further includes:
receiving a setting instruction for the first language and the second language;
providing a language-category selection interface according to the setting instruction, for the user to select the first language and the second language;
saving the first language and the second language when the user selects them; and
extracting a voiceprint feature of the voice data corresponding to the first language, and saving the voiceprint feature.
In addition, to achieve the above objective, an embodiment of the present invention further provides a speech translation apparatus, including:
an extraction module, configured to extract a voiceprint feature of first voice data when the first voice data is received;
a determination module, configured to determine a language category corresponding to the extracted voiceprint feature;
an acquisition module, configured to acquire a pre-stored second language when the language category corresponding to the extracted voiceprint feature is a first language; and
a conversion module, configured to convert the first voice data from the first language into second voice data corresponding to the second language.
Preferably, the determination module includes a judging unit and a determining unit,
wherein the judging unit is configured to judge whether the extracted voiceprint feature matches a pre-stored voiceprint feature of the first language; and
the determining unit is configured to determine that the language category corresponding to the extracted voiceprint feature is the first language when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, and to determine that the language category is the second language when it does not.
Preferably, the conversion module includes a conversion unit, a translation unit, and a synthesis unit,
wherein the conversion unit is configured to convert the first voice data into first text data corresponding to the first language, according to the first language;
the translation unit is configured to translate the first text data into second text data corresponding to the second language; and
the synthesis unit is configured to synthesize the second text data into the second voice data.
Preferably, the speech translation apparatus further includes an output module, configured to output the second voice data.
Preferably, the speech translation apparatus further includes a receiving module, a providing module, and a saving module,
wherein the receiving module is configured to receive a setting instruction for the first language and the second language;
the providing module is configured to provide a language-category selection interface according to the setting instruction, for the user to select the first language and the second language;
the saving module is configured to save the first language and the second language when the user selects them, and is further configured to save the voiceprint feature of the first language; and
the extraction module is further configured to extract a voiceprint feature of the voice data corresponding to the first language.
Compared with the prior art, the embodiments of the present invention receive voice data, extract the voiceprint feature corresponding to the voice data, determine the language category corresponding to the extracted voiceprint feature, acquire the pre-stored second language when that language category is the first language, and convert the first voice data from the first language into the second voice data corresponding to the second language. Different languages are thus distinguished accurately, and the speech of one language is automatically converted into the speech of another, improving the effectiveness of communication.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a first embodiment of the speech translation method of the present invention;
FIG. 2 is a detailed flowchart of an embodiment of step S40 in FIG. 1;
FIG. 3 is a schematic flowchart of a second embodiment of the speech translation method of the present invention;
FIG. 4 is a schematic flowchart of a third embodiment of the speech translation method of the present invention;
FIG. 5 is a schematic diagram of the functional modules of a first embodiment of the speech translation apparatus of the present invention;
FIG. 6 is a schematic diagram of the detailed functional modules of an embodiment of the determination module in FIG. 5;
FIG. 7 is a schematic diagram of the detailed functional modules of an embodiment of the conversion module in FIG. 5;
FIG. 8 is a schematic diagram of the functional modules of a second embodiment of the speech translation apparatus of the present invention.
The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
The main solution of the embodiments of the present invention is: extracting a voiceprint feature of first voice data when the first voice data is received; determining the language category corresponding to the extracted voiceprint feature; acquiring a pre-stored second language when the language category corresponding to the extracted voiceprint feature is the first language; and converting the first voice data from the first language into second voice data corresponding to the second language. This effectively avoids the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency. Different languages are distinguished accurately through speech recognition, and the speech of one language is automatically converted into the speech of another, improving the effectiveness of communication.
Existing speech translation software and devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency.
Based on the above problem, the present invention provides a speech translation method.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of the first embodiment of the speech translation method of the present invention.
In one embodiment, the speech translation method includes:
Step S10: extracting a voiceprint feature of first voice data when the first voice data is received.
Voice data is received in real time, and voiceprint features are extracted from it. The voiceprint features may be extracted during the conversation, and the emphasis of extraction may differ according to the languages selected, for example distinguishing dialects within a language or distinguishing Chinese from English; extraction may also focus on features that identify the speaker's accent and manner of pronunciation. The voiceprint feature may be extracted by pre-processing the first voice data: the pre-processing samples, quantizes, pre-emphasizes, and windows the first voice data, converting the original first voice data into an N-dimensional feature vector from which the voiceprint feature of the first voice data is obtained. The first voice data may be received through a microphone, a Bluetooth headset, or the like; other receiving modes are not excluded.
Step S20: determining the language category corresponding to the extracted voiceprint feature.
A voiceprint model is established from the extracted voiceprint features, and it is determined whether this model matches the voiceprint model of a pre-stored language category. The voiceprint feature model may be chosen differently according to the configured languages, appropriately increasing the weight of certain voiceprint features associated with a particular language.
Step S30: acquiring a pre-stored second language when the language category corresponding to the extracted voiceprint feature is the first language.
It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. When they match, the other language in the conversation scene is acquired as the second language; when they do not match, the language category corresponding to the extracted voiceprint feature is determined to be the second language. Taking a Chinese-English conversation scene as an example, where the first language is Chinese and the second language is English: after the voiceprint features of the voice data are extracted, it is judged whether they match the pre-stored Chinese voiceprint features. If the extracted voiceprint feature matches the pre-stored Chinese voiceprint feature, the language category corresponding to the extracted voiceprint feature is determined to be Chinese, and the other language in the conversation scene is therefore English. If the extracted voiceprint feature does not match the pre-stored Chinese voiceprint feature, the language category corresponding to the voiceprint feature is English, and the other language in the conversation scene is therefore Chinese.
Step S40: converting the first voice data from the first language into second voice data corresponding to the second language.
After the first language and the second language are determined, the first language, the second language, and the first voice data are transmitted to a cloud server, which processes the first voice data and converts it, according to the first language, into second voice data corresponding to the second language. The processing of the received voice data may also be split, partly performed on the cloud server and partly performed locally.
Specifically, referring to FIG. 2, the process of converting the first voice data from the first language into the second voice data corresponding to the second language may be:
Step S41: converting the first voice data into first text data corresponding to the first language, according to the first language;
Step S42: translating the first text data into second text data corresponding to the second language;
Step S43: synthesizing the second text data into the second voice data.
In this embodiment, taking the case where the first language is Chinese and the second language is English: after Chinese and English are acquired, the Chinese voice data is converted into Chinese text data according to Chinese; the Chinese text data is translated into English text data; the converted Chinese and English text data may be displayed on an interface; and finally the English text data is synthesized into English voice data.
In this embodiment, when the first voice data is received, its voiceprint feature is extracted; the language category corresponding to the extracted voiceprint feature is determined; when that language category is the first language, the pre-stored second language is acquired; and the first voice data is converted from the first language into second voice data corresponding to the second language. Different languages are thus distinguished accurately through speech recognition, and the speech of one language is automatically converted into the speech of another, improving the effectiveness of communication.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the speech translation method of the present invention. Based on the first method embodiment, step S20 includes:
Step S21: judging whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language;
Step S22: when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the first language;
Step S23: when the extracted voiceprint feature does not match the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the second language.
It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. If it matches, the language category corresponding to the first voice data is the first language, and the second language is the other language in the conversation scene; otherwise, the language category corresponding to the first voice data is the second language. When the first language and the second language are acquired, they are displayed so that the user can check whether they are correct. The first and second languages may be displayed by a voice announcement of the current languages, by highlighting them, or in other ways, set according to the user's needs and/or the performance of the system. When the user finds that the first language and the second language are wrong, an instruction to reset them is received; a language-category selection interface is provided according to the instruction, for the user to select the first language and the second language; and when the user selects them, the first language and the second language are saved. The first voice data corresponding to the first language is then received, its voiceprint feature is extracted, and the voiceprint feature of the first language is saved. After the voiceprint feature is saved, the original voiceprint feature is adjusted and updated. When voice data is received again, its voiceprint feature is extracted and it is judged whether this feature matches the updated voiceprint feature.
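As an illustration of this adjust-and-update step, the following minimal sketch nudges the stored first-language voiceprint toward each newly saved feature vector using an exponential moving average; the update rate alpha is an illustrative assumption, since the embodiments do not specify how the stored features are updated.

```python
import numpy as np

def update_voiceprint(stored: np.ndarray, new_features: np.ndarray,
                      alpha: float = 0.2) -> np.ndarray:
    # Blend the newly extracted voiceprint into the stored reference so
    # that later matching runs against the updated feature vector.
    return (1.0 - alpha) * stored + alpha * new_features
```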
Further, after step S40, the method further includes:
Step S50: outputting the second voice data.
The second voice data may be output directly through a speaker or through an earphone, set according to the user's needs and/or the performance of the system.
This embodiment judges whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, and when it matches, determines that the language category corresponding to the voiceprint feature is the first language. Determining the language category through the voiceprint feature improves the accuracy of recognition and further improves the effectiveness of communication.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of the third embodiment of the speech translation method of the present invention. Based on the first method embodiment, before step S10 the method further includes:
Step S60: receiving a setting instruction for the first language and the second language;
Step S70: providing a language-category selection interface according to the setting instruction, for the user to select the first language and the second language;
Step S80: saving the first language and the second language when the user selects them;
Step S90: extracting a voiceprint feature of the voice data corresponding to the first language, and saving the voiceprint feature.
The setting instruction for the first language and the second language may be received at the start of a dialogue. When the setting instruction is received, a language-category selection interface is provided according to the instruction, for the user to select the first language and the second language; when the user selects them, the first language and the second language are saved. The two languages may also be selected by voice, set according to the user's needs and/or the performance of the system. After the first language and the second language are saved, the first voice data corresponding to the first language is received, its voiceprint feature is extracted, and the voiceprint feature is saved. The first and second languages may be specific languages such as Chinese or English, or may be set by region name, such as Guangdong or Canada; if a region name is set, voiceprint features corresponding to the main local language of that region may be pre-stored locally.
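A minimal sketch of the setup flow of steps S60 to S90 follows; the `choose` callback standing in for the selection interface and the `extract_voiceprint` helper are hypothetical placeholders, not disclosed interfaces.

```python
settings = {}

def handle_setting_instruction(available_languages, choose):
    # S70: provide a language-category selection interface via `choose`
    first = choose(available_languages, "Select first language")
    second = choose(available_languages, "Select second language")
    # S80: save the selected first and second languages
    settings["first_language"] = first
    settings["second_language"] = second

def enroll_first_language(first_voice_data, extract_voiceprint):
    # S90: extract and save the voiceprint feature of the first language
    settings["first_voiceprint"] = extract_voiceprint(first_voice_data)
```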
In other embodiments of the present invention, the speech translation method may also operate in a multi-language conference. For example, with four languages A, B, C, and D, an interface is provided in the conference for each user to select his or her own language; after a user selects a language, the selection is transmitted to a cloud server through the transmission module, for example over Bluetooth or Wi-Fi. The four languages A, B, C, and D and their corresponding voiceprint features are pre-stored in the cloud server. When voice data is received, its voiceprint feature is extracted, and it is judged whether the extracted feature matches the pre-stored voiceprint feature of a registered language category. Taking a match with the pre-stored voiceprint feature of language A as an example: when the extracted feature matches the stored feature of language A, the language category corresponding to the extracted feature is determined to be language A. The pre-stored languages B, C, and D are obtained from the cloud server; the received voice data is converted into A text data corresponding to language A; the A text data is translated into B, C, and D text data; the B, C, and D text data are synthesized into B, C, and D voice data; and the results are finally transmitted, through Bluetooth or Wi-Fi of the transmission module, to the speakers or earphones of the users of languages B, C, and D. This effectively avoids the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency, and achieves accurate language discrimination through speech recognition together with automatic conversion of speech in one language into speech in another, thereby improving the effectiveness of communication.
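The conference flow above can be sketched as follows; `asr`, `translate`, `tts`, and `send` stand in for whatever recognition, translation, synthesis, and transport backends the cloud server uses, and the nearest-voiceprint matching rule is an assumption for illustration.

```python
import numpy as np

def best_match(feature, voiceprints):
    """Pick the registered language whose stored voiceprint is closest
    to the extracted feature (e.g. language 'A')."""
    return max(voiceprints, key=lambda lang: float(np.dot(feature, voiceprints[lang])))

def route_utterance(voice_data, feature, voiceprints, asr, translate, tts, send):
    source = best_match(feature, voiceprints)       # detected source language
    text = asr(voice_data, language=source)         # speech -> source text
    for target in voiceprints:                      # fan out to B, C, D, ...
        if target != source:
            translated = translate(text, source, target)
            send(target, tts(translated, language=target))
```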
In this embodiment, the first language, the second language, and the voiceprint feature of the first language are pre-stored, so that when voice data is received its voiceprint feature can be extracted and the corresponding language category determined from the correspondence between the first-language voiceprint feature and the first language; different languages are thus distinguished accurately through speech recognition, improving the effectiveness of communication.
The execution body of the speech translation methods of the first to third embodiments above may be a speech translation device, or a translation device in signal connection with a speech translation device. Further, the speech translation method may be implemented by a client translation program installed on the speech translation device, where the speech translation device includes, but is not limited to, a mobile phone, a pad, or a notebook computer.
The present invention further provides a speech translation apparatus.
Referring to FIG. 5, FIG. 5 is a schematic diagram of the functional modules of a first embodiment of the speech translation apparatus of the present invention.
In an embodiment, the speech translation apparatus includes: an extraction module 10, a determination module 20, an acquisition module 30, and a conversion module 40.
The extraction module 10 is configured to extract the voiceprint feature of first voice data when the first voice data is received.
Voice data is received in real time, and voiceprint features are extracted from the received voice data. The extraction may take place during the conversation and may emphasize different features depending on the selected languages, for example distinguishing dialects within a language or Chinese versus English, or focusing on the speaker's accent and manner of pronunciation. The voiceprint feature may be extracted by pre-processing the first voice data, where the pre-processing samples, quantizes, pre-emphasizes, and windows the first voice data, converting the raw first voice data into an N-dimensional feature vector from which the voiceprint feature is obtained. The first voice data may be received through a microphone, a Bluetooth headset, or other receiving means.
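As a hedged illustration of this pre-processing chain (pre-emphasis, framing, windowing, reduction to an N-dimensional feature vector), the sketch below uses a Hamming window, fixed frame sizes, and log-energy bands as assumptions; the embodiment does not fix these choices.

```python
import numpy as np

def extract_voiceprint(samples, frame_len=400, hop=160, n_dims=13):
    """Turn raw (already sampled and quantized) audio into an
    N-dimensional voiceprint feature vector."""
    # Pre-emphasis: boost high frequencies relative to low ones
    emphasized = np.append(samples[0], samples[1:] - 0.97 * samples[:-1])
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len] * window  # windowing
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Collapse the spectrum into n_dims log-energy bands
        bands = np.array_split(spectrum, n_dims)
        features.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    # Average over frames to obtain a single N-dimensional vector
    return np.mean(features, axis=0)
```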
The determination module 20 is configured to determine the language category corresponding to the extracted voiceprint feature.
A voiceprint model is established from the extracted voiceprint feature, and it is judged whether that model matches the voiceprint model of a pre-stored language category. Different voiceprint feature models may be selected depending on the configured languages, appropriately increasing the weight of the voiceprint features that are characteristic of a particular language.
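One way to increase the weight of certain voiceprint features for a particular language, as described above, is a per-language weight vector applied before scoring; the weight values and the choice of which dimensions to emphasize are purely hypothetical here.

```python
import numpy as np

def weighted_similarity(feature, model, weights):
    """Cosine similarity after rescaling the dimensions assumed to be
    most informative for the configured language."""
    f, m = feature * weights, model * weights
    return float(np.dot(f, m) / (np.linalg.norm(f) * np.linalg.norm(m)))

# e.g. when Chinese is configured, emphasise tone-related dimensions
# (assumed here to be the last five of a 13-dimensional feature)
chinese_weights = np.array([1.0] * 8 + [1.5] * 5)
```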
Specifically, referring to FIG. 6, the determination module 20 includes a judging unit 21 and a determination unit 22.
The judging unit 21 is configured to judge whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language.
The determination unit 22 is configured to determine, when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, that the language category corresponding to the extracted voiceprint feature is the first language, and to determine, when it does not match, that the language category corresponding to the extracted voiceprint feature is the second language.
It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. If it matches, the language category corresponding to the first voice data is the first language, and the second language is the other language of the dialogue scene; otherwise, the language category corresponding to the first voice data is the second language. When the first language and the second language are acquired, they are displayed so that the user can verify whether they are correct; the display may be a voice broadcast of the current first and second languages, highlighting of the current first and second languages, or another manner, set according to the user's needs and/or the performance of the system. When the user determines that the first language and the second language are wrong, an instruction to reset them is received; a language-category selection interface is provided according to the instruction, for the user to select the first language and the second language; and when the user selects them, the two languages are saved. The first voice data corresponding to the first language is then received, its voiceprint feature is extracted, and the voiceprint feature is saved. After the voiceprint feature is saved, the original voiceprint feature is adjusted and updated. When voice data is received again, its voiceprint feature is extracted, and it is judged whether that feature matches the updated voiceprint feature.
The acquisition module 30 is configured to acquire the pre-stored second language when the language category corresponding to the extracted voiceprint feature is the first language.
It is judged whether the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language. When it matches, the other language of the dialogue scene is acquired as the second language; when it does not match, the language category corresponding to the extracted voiceprint feature is determined to be the second language. Taking a Chinese-English dialogue scene as an example, where the first language is Chinese and the second language is English: after the voiceprint feature of the voice data is extracted, it is judged whether the extracted feature matches the pre-stored Chinese voiceprint feature. If it matches, the language category of the extracted feature is Chinese, and the other language of the dialogue scene is English; if it does not match, the language category of the extracted feature is English, and the other language of the dialogue scene is Chinese.
The conversion module 40 is configured to convert the first voice data from the first language into second voice data corresponding to the second language.
After the first language and the second language are determined, the first language, the second language, and the first voice data are transmitted to a cloud server, so that the cloud server processes the first voice data and converts it, according to the first language, into the second voice data corresponding to the second language. The processing of the received voice data may also be performed partly on the cloud server and partly locally.
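A sketch of handing the language pair and the first voice data to a cloud server follows; the endpoint URL, the header names, and the use of the `requests` library are assumptions, and a real deployment may split the processing between the device and the cloud as noted above.

```python
import requests

def convert_in_cloud(first_language, second_language, voice_bytes):
    response = requests.post(
        "https://cloud.example.com/speech-translate",  # hypothetical endpoint
        data=voice_bytes,
        headers={
            "X-First-Language": first_language,
            "X-Second-Language": second_language,
        },
    )
    response.raise_for_status()
    return response.content  # second voice data synthesized by the server
```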
Specifically, referring to FIG. 7, the conversion module 40 includes a conversion unit 41, a translation unit 42, and a synthesis unit 43.
The conversion unit 41 is configured to convert the first voice data into first text data corresponding to the first language, according to the first language.
The translation unit 42 is configured to translate the first text data into second text data corresponding to the second language.
The synthesis unit 43 is configured to synthesize the second text data into the second voice data.
In this embodiment, taking Chinese as the first language and English as the second language: after Chinese and English are acquired, the Chinese voice data is converted into Chinese text data according to Chinese; the Chinese text data is translated into English text data; the converted Chinese and English text data may be displayed on an interface; and finally the English text data is synthesized into English voice data.
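The three-unit chain (conversion unit 41, translation unit 42, synthesis unit 43) reduces to the following sketch; `asr`, `translate`, and `tts` are placeholders for any speech-recognition, machine-translation, and text-to-speech backends, not disclosed APIs.

```python
def convert_first_to_second(voice_data, asr, translate, tts,
                            first_language="zh", second_language="en"):
    first_text = asr(voice_data, language=first_language)            # unit 41
    second_text = translate(first_text, first_language, second_language)  # unit 42
    # The interface may display first_text and second_text at this point
    return tts(second_text, language=second_language)                # unit 43
```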
In this embodiment, when first voice data is received, its voiceprint feature is extracted; the language category corresponding to the extracted feature is determined; when that category is the first language, the pre-stored second language is acquired; and the first voice data is converted from the first language into second voice data corresponding to the second language. Different languages are thus distinguished accurately through speech recognition, improving the effectiveness of communication.
Referring to FIG. 8, FIG. 8 is a schematic diagram of the functional modules of a second embodiment of the speech translation apparatus of the present invention.
Based on the first embodiment above, the speech translation apparatus of this embodiment further includes an output module 50, a receiving module 60, a providing module 70, and a saving module 80.
The output module 50 is configured to output the second voice data.
The second voice data may be output directly through a speaker or an earphone, set according to the user's needs and/or the performance of the system.
The receiving module 60 is configured to receive a setting instruction for the first language and the second language.
The providing module 70 is configured to provide a language-category selection interface according to the setting instruction, for the user to select the first language and the second language.
The saving module 80 is configured to save the first language and the second language when the user selects them, and is further configured to save the voiceprint feature of the first language.
The extraction module 10 is further configured to extract the voiceprint feature of the voice data corresponding to the first language.
The setting instruction for the first language and the second language may be received at the start of a dialogue. When the setting instruction is received, a language-category selection interface is provided according to the instruction, for the user to select the first language and the second language; when the user selects them, the first language and the second language are saved. The two languages may also be selected by voice, set according to the user's needs and/or the performance of the system. After the first language and the second language are saved, the first voice data corresponding to the first language is received, its voiceprint feature is extracted, and the voiceprint feature is saved. The first and second languages may be specific languages such as Chinese or English, or may be set by region name, such as Guangdong or Canada; if a region name is set, voiceprint features corresponding to the main local language of that region may be pre-stored locally.
In other embodiments of the present invention, the speech translation method may also operate in a multi-language conference, as described above for the method embodiments. For example, with four languages A, B, C, and D, an interface is provided in the conference for each user to select his or her own language; after a user selects a language, the selection is transmitted to a cloud server through the transmission module, for example over Bluetooth or Wi-Fi. The four languages A, B, C, and D and their corresponding voiceprint features are pre-stored in the cloud server. When voice data is received, its voiceprint feature is extracted, and it is judged whether the extracted feature matches the pre-stored voiceprint feature of a registered language category. Taking a match with the pre-stored voiceprint feature of language A as an example: when the extracted feature matches the stored feature of language A, the language category corresponding to the extracted feature is determined to be language A. The pre-stored languages B, C, and D are obtained from the cloud server; the received voice data is converted into A text data corresponding to language A; the A text data is translated into B, C, and D text data; the B, C, and D text data are synthesized into B, C, and D voice data; and the results are finally transmitted, through Bluetooth or Wi-Fi of the transmission module, to the speakers or earphones of the users of languages B, C, and D. This effectively avoids the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency, and achieves accurate language discrimination through speech recognition together with automatic conversion of speech in one language into speech in another, thereby improving the effectiveness of communication.
In this embodiment, the first language, the second language, and the voiceprint feature of the first language are pre-stored, so that when voice data is received its voiceprint feature can be extracted and the corresponding language category determined from the correspondence between the first-language voiceprint feature and the first language, accurately distinguishing different languages and improving the effectiveness of communication.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments. From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc), which includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, and details are not repeated here.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Industrial Applicability
The above embodiments of the present invention can be applied to the field of speech translation technology. They solve the problem that existing speech translation software or devices cannot accurately distinguish different languages through speech recognition, which leads to low communication efficiency, by accurately distinguishing different languages and automatically converting speech in one language into speech in another language, thereby improving the effectiveness of communication.

Claims (10)

  1. A speech translation method, comprising the steps of:
    when first voice data is received, extracting a voiceprint feature of the first voice data;
    determining a language category corresponding to the extracted voiceprint feature;
    when the language category corresponding to the extracted voiceprint feature is a first language, acquiring a pre-stored second language;
    converting the first voice data from the first language into second voice data corresponding to the second language.
  2. The speech translation method according to claim 1, wherein the step of determining the language category corresponding to the extracted voiceprint feature comprises:
    judging whether the extracted voiceprint feature matches a pre-stored voiceprint feature of the first language;
    when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the first language;
    when the extracted voiceprint feature does not match the pre-stored voiceprint feature of the first language, determining that the language category corresponding to the extracted voiceprint feature is the second language.
  3. The speech translation method according to claim 1, wherein the step of converting the first voice data from the first language into the second voice data corresponding to the second language comprises:
    converting the first voice data into first text data corresponding to the first language, according to the first language;
    translating the first text data into second text data corresponding to the second language;
    synthesizing the second text data into the second voice data.
  4. The speech translation method according to claim 3, wherein after the step of converting the first voice data from the first language into the second voice data corresponding to the second language, the method further comprises:
    outputting the second voice data.
  5. The speech translation method according to any one of claims 1 to 4, wherein before the step of extracting the voiceprint feature of the first voice data when the first voice data is received, the method further comprises:
    receiving a setting instruction for the first language and the second language;
    providing a language-category selection interface according to the setting instruction, for a user to select the first language and the second language;
    when the user selects the first language and the second language, saving the first language and the second language;
    extracting the voiceprint feature of voice data corresponding to the first language, and saving the voiceprint feature of the first language.
  6. A speech translation apparatus, comprising:
    an extraction module, configured to extract a voiceprint feature of first voice data when the first voice data is received;
    a determination module, configured to determine a language category corresponding to the extracted voiceprint feature;
    an acquisition module, configured to acquire a pre-stored second language when the language category corresponding to the extracted voiceprint feature is a first language;
    a conversion module, configured to convert the first voice data from the first language into second voice data corresponding to the second language.
  7. The speech translation apparatus according to claim 6, wherein the determination module comprises a judging unit and a determination unit,
    the judging unit being configured to judge whether the extracted voiceprint feature matches a pre-stored voiceprint feature of the first language;
    the determination unit being configured to determine, when the extracted voiceprint feature matches the pre-stored voiceprint feature of the first language, that the language category corresponding to the extracted voiceprint feature is the first language, and to determine, when it does not match, that the language category corresponding to the extracted voiceprint feature is the second language.
  8. The speech translation apparatus according to claim 6, wherein the conversion module comprises a conversion unit, a translation unit, and a synthesis unit,
    the conversion unit being configured to convert the first voice data into first text data corresponding to the first language, according to the first language;
    the translation unit being configured to translate the first text data into second text data corresponding to the second language;
    the synthesis unit being configured to synthesize the second text data into the second voice data.
  9. The speech translation apparatus according to claim 6, further comprising an output module configured to output the second voice data.
  10. The speech translation apparatus according to any one of claims 6 to 9, further comprising a receiving module, a providing module, and a saving module,
    the receiving module being configured to receive a setting instruction for the first language and the second language;
    the providing module being configured to provide a language-category selection interface according to the setting instruction, for a user to select the first language and the second language;
    the saving module being configured to save the first language and the second language when the user selects them, and further configured to save the voiceprint feature of the first language;
    the extraction module being further configured to extract the voiceprint feature of voice data corresponding to the first language.
PCT/CN2016/078895 2015-04-13 2016-04-08 Speech translation method and device WO2016165590A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510172421.7A CN106156009A (en) 2015-04-13 2015-04-13 Voice translation method and device
CN201510172421.7 2015-04-13

Publications (1)

Publication Number Publication Date
WO2016165590A1 true WO2016165590A1 (en) 2016-10-20

Family

ID=57125556

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/078895 WO2016165590A1 (en) 2015-04-13 2016-04-08 Speech translation method and device

Country Status (2)

Country Link
CN (1) CN106156009A (en)
WO (1) WO2016165590A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239613A (en) * 2022-02-23 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Real-time voice translation method, device, equipment and storage medium

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315740A (en) * 2017-01-20 2017-11-03 北京分音塔科技有限公司 A kind of real-time voice intertranslation device
CN108733656A (en) * 2017-04-14 2018-11-02 深圳市领芯者科技有限公司 Speech translation apparatus, system and method
CN107749296A (en) * 2017-10-12 2018-03-02 深圳市沃特沃德股份有限公司 Voice translation method and device
CN107731232A (en) * 2017-10-17 2018-02-23 深圳市沃特沃德股份有限公司 Voice translation method and device
CN107910004A (en) * 2017-11-10 2018-04-13 科大讯飞股份有限公司 Voiced translation processing method and processing device
CN107861955B (en) * 2017-11-14 2021-09-28 维沃移动通信有限公司 Translation method and mobile terminal
WO2019104556A1 (en) * 2017-11-29 2019-06-06 深圳市沃特沃德股份有限公司 Translation method and device
CN108281145B (en) * 2018-01-29 2021-07-02 南京地平线机器人技术有限公司 Voice processing method, voice processing device and electronic equipment
CN108447486B (en) * 2018-02-28 2021-12-03 科大讯飞股份有限公司 Voice translation method and device
CN108966066A (en) * 2018-03-07 2018-12-07 深圳市哈尔马科技有限公司 A kind of real time translation interactive system based on wireless headset
CN109121123A (en) * 2018-07-03 2019-01-01 Oppo广东移动通信有限公司 Information processing method and related product
CN109005480A (en) * 2018-07-19 2018-12-14 Oppo广东移动通信有限公司 Information processing method and related product
CN109147769B (en) * 2018-10-17 2020-12-22 北京猎户星空科技有限公司 Language identification method, language identification device, translation machine, medium and equipment
CN109344415A (en) * 2018-12-13 2019-02-15 深圳市友杰智新科技有限公司 E-book intelligent sound reads aloud implementation method
CN110428813B (en) * 2019-07-23 2022-04-22 北京奇艺世纪科技有限公司 Voice understanding method and device, electronic equipment and medium
CN110442881A (en) * 2019-08-06 2019-11-12 上海祥久智能科技有限公司 A kind of information processing method and device of voice conversion
CN110956950A (en) * 2019-12-02 2020-04-03 联想(北京)有限公司 Data processing method and device and electronic equipment
CN112989847A (en) * 2021-03-11 2021-06-18 读书郎教育科技有限公司 Recording translation system and method of scanning pen

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1334532A (en) * 2000-07-13 2002-02-06 白涛 Automatic simultaneous interpretation system between multiple languages for GSM
CN1602483A (en) * 2001-12-17 2005-03-30 内维尼·加雅拉特尼 Real time translator and method of performing real time translation of a plurality of spoken word languages
JP2011128260A (en) * 2009-12-16 2011-06-30 Nec Corp Foreign language conversation support device, method, program and phone terminal device
CN103309854A (en) * 2013-06-08 2013-09-18 开平市中铝实业有限公司 Translator system for taxis
CN103838714A (en) * 2012-11-22 2014-06-04 北大方正集团有限公司 Method and device for converting voice information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894548B (en) * 2010-06-23 2012-07-04 清华大学 Modeling method and modeling device for language identification
CN202772966U (en) * 2012-09-03 2013-03-06 上海三旗通信科技股份有限公司 Mobile phone having global barrier-free communication function
CN103117059B (en) * 2012-12-27 2015-05-06 内蒙古科技大学 Voice signal characteristics extracting method based on tensor decomposition
CN103117061B (en) * 2013-02-05 2016-01-20 广东欧珀移动通信有限公司 A kind of voice-based animals recognition method and device


Also Published As

Publication number Publication date
CN106156009A (en) 2016-11-23

Similar Documents

Publication Publication Date Title
WO2016165590A1 (en) Speech translation method and device
US11114091B2 (en) Method and system for processing audio communications over a network
US10079014B2 (en) Name recognition system
US9864745B2 (en) Universal language translator
US9552815B2 (en) Speech understanding method and system
WO2017054122A1 (en) Speech recognition system and method, client device and cloud server
US20140365200A1 (en) System and method for automatic speech translation
KR20180026687A (en) Terminal and handsfree device for servicing handsfree automatic interpretation, and method thereof
US9747282B1 (en) Translation with conversational overlap
JP2016527587A5 (en) Hybrid offline / online speech translation system and method
WO2016101571A1 (en) Voice translation method, communication method and related device
WO2015149359A1 (en) Method for automatically adjusting volume, volume adjustment apparatus and electronic device
US20180286388A1 (en) Conference support system, conference support method, program for conference support device, and program for terminal
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
US20180288109A1 (en) Conference support system, conference support method, program for conference support apparatus, and program for terminal
JP6614080B2 (en) Spoken dialogue system and spoken dialogue method
WO2019075829A1 (en) Voice translation method and apparatus, and translation device
US20180288110A1 (en) Conference support system, conference support method, program for conference support device, and program for terminal
WO2019169686A1 (en) Voice translation method and apparatus, and computer device
CN109559744B (en) Voice data processing method and device and readable storage medium
KR20110132960A (en) Method and apparatus for improving automatic interpretation function by use of mutual communication between portable interpretation terminals
US20190066676A1 (en) Information processing apparatus
TWM515143U (en) Speech translating system and translation processing apparatus
JP2006268710A (en) Translation system
KR102622350B1 (en) Electronic apparatus and control method thereof

Legal Events

Code | Title | Description
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16779565; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 16779565; Country of ref document: EP; Kind code of ref document: A1
Kind code of ref document: A1