WO2020077868A1 - Simultaneous interpretation method, apparatus, computer device and storage medium (同声传译方法、装置、计算机设备和存储介质) - Google Patents

Simultaneous interpretation method, apparatus, computer device and storage medium

Info

Publication number
WO2020077868A1
WO2020077868A1 · PCT/CN2018/124800
Authority
WO
WIPO (PCT)
Prior art keywords
voice
simultaneous interpretation
model
language
simultaneous
Prior art date
Application number
PCT/CN2018/124800
Other languages
English (en)
French (fr)
Inventor
李晨光
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020077868A1

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 40/00 - Handling natural language data
            • G06F 40/40 - Processing or translation of natural language
              • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 13/00 - Speech synthesis; Text to speech systems
            • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
              • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
          • G10L 15/00 - Speech recognition
            • G10L 15/005 - Language recognition
            • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
              • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
            • G10L 15/04 - Segmentation; Word boundary detection
              • G10L 15/05 - Word boundary detection
            • G10L 15/08 - Speech classification or search
              • G10L 15/16 - Speech classification or search using artificial neural networks
            • G10L 15/26 - Speech to text systems

Definitions

  • This application relates to a method, apparatus, computer equipment and storage medium for simultaneous interpretation.
  • Simultaneous interpretation refers to a mode of translation in which the interpreter renders the speaker's content to the audience continuously, without interrupting the speech. It is highly academic and professional: besides its wide use at international conferences, it is also used in many fields such as diplomacy and foreign affairs, negotiations, business activities, news media, training and lecturing, television broadcasting, and international arbitration.
  • A simultaneous interpretation method, apparatus, computer device, and storage medium are provided.
  • A simultaneous interpretation method includes:
  • receiving to-be-interpreted voice data, and determining the to-be-interpreted language category corresponding to the to-be-interpreted voice data;
  • acquiring a simultaneous interpretation requirement, the requirement including a target interpretation language and an interpretation voice output requirement;
  • querying a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model being constructed from the translation correspondence between the to-be-interpreted language category and the target interpretation language;
  • importing the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
  • performing voice feature processing on the model voice data according to the interpretation voice output requirement, and outputting simultaneous interpretation voice data.
  • A simultaneous interpretation apparatus includes:
  • a to-be-interpreted data receiving module, configured to receive to-be-interpreted voice data and determine the to-be-interpreted language category corresponding to it;
  • a requirement acquisition module, configured to acquire the simultaneous interpretation requirement, which includes the target interpretation language and the interpretation voice output requirement;
  • a model query module, configured to query the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model being constructed from the translation correspondence between the two;
  • a model voice data acquisition module, configured to import the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
  • an interpretation voice data acquisition module, configured to perform voice feature processing on the model voice data according to the interpretation voice output requirement and output simultaneous interpretation voice data.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processors, cause the processors to perform the steps of the simultaneous interpretation method provided in any embodiment of the present application.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the processors to perform the steps of the simultaneous interpretation method provided in any embodiment of the present application.
  • FIG. 1 is an application scenario diagram of a simultaneous interpretation method according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of a simultaneous interpretation method according to one or more embodiments.
  • FIG. 3 is a schematic flowchart of the steps of constructing a voice simultaneous interpretation model library according to one or more embodiments.
  • FIG. 4 is a schematic flowchart of a simultaneous interpretation method in another embodiment.
  • FIG. 5 is a block diagram of a simultaneous interpretation device according to one or more embodiments.
  • FIG. 6 is a block diagram of a computer device according to one or more embodiments.
  • The simultaneous interpretation method provided by this application can be applied in the environment shown in FIG. 1.
  • The first terminal 102 and the second terminal 106 each communicate with the server 104 over a network.
  • The first terminal 102 sends to-be-interpreted voice data to the server 104.
  • The server 104 determines the to-be-interpreted language category of the received voice data, and queries the preset voice simultaneous interpretation model corresponding to that category and the target interpretation language.
  • The voice simultaneous interpretation model is constructed from the translation correspondence between the to-be-interpreted language category and the target interpretation language.
  • The to-be-interpreted voice data is imported into the model to obtain model voice data; the model voice data is then processed according to the interpretation voice output requirement, and the resulting simultaneous interpretation voice data is sent to the second terminal 106.
  • The first terminal 102 and the second terminal 106 may be, but are not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices.
  • The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
  • In one embodiment, as shown in FIG. 2, a simultaneous interpretation method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
  • Step S201: Receive to-be-interpreted voice data, and determine the to-be-interpreted language category corresponding to the voice data.
  • The to-be-interpreted voice data is the source voice data that needs to be translated. It can be captured by the voice signal collector of the first terminal 102 from a voice source, for example the speech signal of a conference speaker. The to-be-interpreted language category is the language to which the source voice data belongs, such as Chinese, English, French, or German. In specific applications, the language categories can be further refined; Chinese, for example, can be divided into dialect sub-languages such as Mandarin, Cantonese, Wu, Sichuanese, and Minnan.
  • After receiving the to-be-interpreted voice data uploaded by the first terminal 102, the server 104 may determine the corresponding language category from features of the voice data, for example its phoneme features.
  • Step S203: Acquire the simultaneous interpretation requirement. The requirement includes the target interpretation language and the interpretation voice output requirement.
  • After the server 104 receives the to-be-interpreted voice data and determines its language category, it must also determine the target interpretation language. The target interpretation language is the language into which the voice data is to be translated and output; for example, in English-to-Chinese interpretation, English is the to-be-interpreted language category and Chinese is the target interpretation language.
  • The interpretation voice output requirement specifies the voice features of the output voice data. It can include timbre requirements such as a male, female, or child voice, and voice style requirements such as cheerful, somber, or excited. Adjusting the voice features of the output through this requirement allows the output to meet the actual needs of various scenarios and users.
  • The simultaneous interpretation requirement may be sent to the server 104 by the second terminal 106 that receives the interpretation output.
  • Step S205: Query the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language. The model is constructed from the translation correspondence between the to-be-interpreted language category and the target interpretation language.
  • The voice simultaneous interpretation model translates the input to-be-interpreted voice data and outputs voice data in the target interpretation language. The model is set according to its input and output languages. For example, when the to-be-interpreted language category is English, it must be combined with a target interpretation language such as Chinese, German, or French to determine the corresponding English-to-Chinese, English-to-German, or English-to-French voice interpretation model.
  • After the to-be-interpreted language category and the target interpretation language are determined, the corresponding preset voice simultaneous interpretation model is queried according to the two.
  • Step S207: Import the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data.
  • After the voice simultaneous interpretation model is obtained, the received to-be-interpreted voice data is input into the model for translation processing, and the corresponding model voice data is output.
  • In a specific implementation, the voice simultaneous interpretation model can be obtained by combining a speech recognition model, a text translation model, and a target-language speech model. The speech recognition model may be, but is not limited to, a hidden Markov model or a machine learning model based on an artificial neural network algorithm, such as an LSTM recurrent neural network; it performs speech recognition on the to-be-interpreted voice data to obtain the text, in the to-be-interpreted language, corresponding to that voice data. The text translation model can be constructed based on a character matching algorithm such as the KMP algorithm; it translates the to-be-interpreted language text output by the speech recognition model into target-language text corresponding to the target interpretation language. The target-language speech model extracts the corresponding voice data from a preset target speech database according to the target-language text output by the text translation model, then synthesizes and outputs the final model voice data, thereby completing the interpretation processing.
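  • As a rough illustration of how these three components could chain together, the following Python sketch wires a speech recognition step, a text translation step, and a target-language speech step into one pipeline. The class and method names (`VoiceInterpretationModel`, `transcribe`, `translate`, `synthesize`) are hypothetical placeholders, not APIs named by the patent; a real system would plug in trained models.

```python
# Minimal sketch of the interpretation pipeline described above:
# speech recognition -> text translation -> target-language speech.
# All three component interfaces are hypothetical placeholders.

class VoiceInterpretationModel:
    """Combines ASR, text translation, and target-language speech synthesis."""

    def __init__(self, recognizer, translator, speech_model):
        self.recognizer = recognizer      # e.g. an HMM or LSTM-based ASR model
        self.translator = translator      # e.g. a mapping/character-matching translator
        self.speech_model = speech_model  # looks up audio in a target speech database

    def interpret(self, voice_data: bytes) -> bytes:
        # 1. Recognize source-language text from the to-be-interpreted voice data.
        source_text = self.recognizer.transcribe(voice_data)
        # 2. Translate into the target interpretation language.
        target_text = self.translator.translate(source_text)
        # 3. Synthesize model voice data from the target-language text.
        return self.speech_model.synthesize(target_text)
```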
  • Step S209: Perform voice feature processing on the model voice data according to the interpretation voice output requirement, and output simultaneous interpretation voice data.
  • After the voice simultaneous interpretation model outputs the translated model voice data, voice feature processing is applied to it according to the interpretation voice output requirement in the simultaneous interpretation requirement, and the resulting simultaneous interpretation voice data is output.
  • Voice feature processing may include, but is not limited to, timbre processing, such as switching between male and female voices, and voice style processing, such as switching between emotional styles like cheerful, excited, or sad.
  • In the above simultaneous interpretation method, the to-be-interpreted language category of the received voice data is determined, and the corresponding preset voice simultaneous interpretation model is queried according to that category and the target interpretation language. The model is constructed from the translation correspondence between the to-be-interpreted language category and the target interpretation language. The to-be-interpreted voice data is imported into the model to obtain model voice data, voice feature processing is then applied according to the interpretation voice output requirement, and the simultaneous interpretation voice data is output, thereby realizing simultaneous interpretation.
  • During the process, no dedicated interpreter performs manual translation, which avoids the influence of human factors and effectively improves both the efficiency and the sound quality of simultaneous interpretation.
  • In one embodiment, the step of determining the to-be-interpreted language category includes: extracting voice feature phonemes from the to-be-interpreted voice data; querying a preset language phoneme classification model, which is obtained by training on the voice feature phonemes corresponding to the various language categories; and inputting the extracted phonemes into the model to obtain the to-be-interpreted language category corresponding to the voice data.
  • Different languages have different pronunciation rules. The phoneme, the smallest unit of speech divided according to the natural attributes of speech, differs across languages. The Chinese word "putonghua" (Mandarin), for example, consists of 3 syllables that can be split into the 8 phonemes "p, u, t, o, ng, h, u, a". English has 48 phonemes, of which 20 are vowels and 28 are consonants; of the 26 English letters, 5 are vowels, 19 are consonants, and 2 are semi-vowels. Language categories can therefore be distinguished by their phoneme characteristics.
  • In this embodiment, when determining the to-be-interpreted language category, voice feature phonemes are extracted from the to-be-interpreted voice data and used to judge the language category.
  • A preset language phoneme classification model is queried; it is obtained by training on the voice feature phonemes of the various language categories, and it classifies the language according to the input phonemes to determine which to-be-interpreted language category they correspond to.
  • The language phoneme classification model can be a neural network model trained with an artificial neural network algorithm on the phoneme features of each language.
  • The extracted voice feature phonemes are input into the model, which outputs the to-be-interpreted language category corresponding to the voice data.
  • In a specific application, when inputting the phonemes into the model, the phonemes extracted from the voice data can first be filtered according to the model's input requirements, and only the phonemes that satisfy those requirements are input into the model for language category determination.
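  • A minimal sketch of phoneme-based language classification, assuming phonemes have already been extracted as string labels. Instead of the neural network the text describes, it uses a simple nearest-centroid comparison of phoneme frequency vectors, which illustrates the same idea of matching phoneme statistics against per-language profiles; the two profiles shown are toy data, not trained values.

```python
from collections import Counter
import math

# Toy per-language phoneme frequency profiles (stand-ins for trained data).
LANGUAGE_PROFILES = {
    "zh": {"p": 0.10, "u": 0.25, "t": 0.10, "ng": 0.20, "h": 0.15, "a": 0.20},
    "en": {"th": 0.15, "ae": 0.20, "s": 0.25, "t": 0.20, "ih": 0.20},
}

def classify_language(phonemes: list[str]) -> str:
    """Return the language whose phoneme profile best matches the input."""
    counts = Counter(phonemes)
    total = sum(counts.values()) or 1
    freqs = {ph: c / total for ph, c in counts.items()}

    def distance(profile):
        # Euclidean distance between the observed and profile frequency vectors.
        keys = set(profile) | set(freqs)
        return math.sqrt(sum((profile.get(k, 0.0) - freqs.get(k, 0.0)) ** 2
                             for k in keys))

    return min(LANGUAGE_PROFILES, key=lambda lang: distance(LANGUAGE_PROFILES[lang]))

print(classify_language(["p", "u", "t", "o", "ng", "h", "u", "a"]))  # likely "zh"
```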
  • In one embodiment, the step of extracting voice feature phonemes from the to-be-interpreted voice data includes: digitizing the voice data to obtain digitized to-be-interpreted data; performing endpoint detection on the digitized data, and framing the endpoint-detected data into to-be-interpreted voice frame data; and extracting the voice feature phonemes from the voice frame data.
  • Generally, the voice data collected by the first terminal 102 through a voice signal collector such as a microphone is an analog signal containing redundant information such as background noise and channel distortion. The analog signal must be preprocessed: digitization steps such as anti-aliasing filtering, sampling, and A/D conversion, followed by pre-emphasis, windowing and framing, and endpoint detection. Filtering out unimportant information and background noise effectively improves the efficiency and quality of the interpretation processing.
  • In this embodiment, the voice data is first digitized (anti-aliasing filtering, sampling, A/D conversion) to obtain the digitized to-be-interpreted data. Endpoint detection is then performed on the digitized data to determine its start and end, and the endpoint-detected data is divided frame by frame into segments of frame signals, i.e. the to-be-interpreted voice frame data, from which the voice feature phonemes can be extracted.
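  • The digitization itself happens in hardware, but the pre-emphasis, framing, windowing, and a crude energy-based endpoint detection can be sketched with NumPy as below. The frame length, hop size, and energy threshold are illustrative values chosen for this sketch, not parameters specified by the patent.

```python
import numpy as np

def preprocess(samples: np.ndarray, frame_len: int = 400, hop: int = 160,
               energy_thresh: float = 1e-4) -> np.ndarray:
    """Pre-emphasize, frame, window, and keep only voiced frames."""
    # Pre-emphasis boosts high frequencies attenuated during recording.
    emphasized = np.append(samples[0], samples[1:] - 0.97 * samples[:-1])

    # Split into overlapping frames (speech is quasi-stationary per frame).
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])

    # Apply a Hamming window to reduce spectral leakage at frame edges.
    frames *= np.hamming(frame_len)

    # Crude endpoint detection: drop frames whose mean energy is near silence.
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > energy_thresh]

frames = preprocess(np.random.randn(16000) * 0.1)  # 1 s of fake 16 kHz audio
print(frames.shape)
```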
  • In some embodiments, the step of querying the preset voice simultaneous interpretation model includes: querying a preset voice simultaneous interpretation model library; querying, from the library, the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and configuring the output language of that multilingual model according to the target interpretation language to obtain the voice simultaneous interpretation model.
  • In this embodiment, the model library stores the multilingual simultaneous interpretation models corresponding to the various to-be-interpreted language categories. A multilingual simultaneous interpretation model is an interpretation model with a fixed input language category; configuring its output language according to the actual target interpretation language yields a voice simultaneous interpretation model that satisfies that target language.
  • When querying, the model library is searched for the multilingual model matching the to-be-interpreted language category, and the output language of that model is then configured according to the target interpretation language. The resulting voice simultaneous interpretation model can receive to-be-interpreted voice data in the input language category and, after translation processing, output simultaneous interpretation voice data in the target interpretation language, thereby realizing simultaneous interpretation of the voice data.
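  • One way to picture the model library and the output-language configuration is a registry keyed by the input language, where each multilingual model is specialized to one target language on demand. This is a hypothetical sketch of the lookup flow, not the patent's actual data structures; the model names are placeholders.

```python
# Hypothetical model library: input language -> multilingual model.
class MultilingualModel:
    def __init__(self, source_lang, translators):
        self.source_lang = source_lang
        self.translators = translators  # target language -> translation backend

    def configure_output(self, target_lang):
        # Specialize the fixed-input model for one target interpretation language.
        if target_lang not in self.translators:
            raise KeyError(f"no {self.source_lang}->{target_lang} model")
        return self.translators[target_lang]

MODEL_LIBRARY = {
    "en": MultilingualModel("en", {"zh": "en->zh model", "de": "en->de model"}),
    "zh": MultilingualModel("zh", {"en": "zh->en model"}),
}

def query_model(source_lang, target_lang):
    return MODEL_LIBRARY[source_lang].configure_output(target_lang)

print(query_model("en", "zh"))  # "en->zh model"
```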
  • In one embodiment, as shown in FIG. 3, before the step of querying the preset voice simultaneous interpretation model library, the construction of the library includes the following steps:
  • Step S301: Acquire the preset speech recognition model corresponding to the to-be-interpreted language category. The speech recognition model outputs, from the to-be-interpreted voice data, the text of the to-be-interpreted language.
  • In this embodiment, a voice simultaneous interpretation model is obtained from a multilingual simultaneous interpretation model after its output language is configured for the target interpretation language. Each multilingual model is itself a combination of a speech recognition model, a text translation model, and a target-language speech model, and the multilingual models are collected and stored together in the voice simultaneous interpretation model library.
  • When creating the library, the preset speech recognition model corresponding to each to-be-interpreted language category is first acquired. The speech recognition model may be, but is not limited to, a hidden Markov model or a machine learning model based on an artificial neural network algorithm; it performs speech recognition on the to-be-interpreted voice data to obtain the corresponding text in that language. For example, a speech recognition model for Chinese can transcribe received Chinese voice data into Chinese characters.
  • Step S303: Construct a text translation model from the historical translation data between the to-be-interpreted language text and the target-language text corresponding to the target interpretation language. The text translation model outputs the target-language text according to the to-be-interpreted language text.
  • Based on big-data analysis of that historical translation data, a mapping relationship is established between the to-be-interpreted language text and the target-language text. The mapping is not limited to character mapping, word mapping, phrase mapping, and common-expression mapping, where common expressions may include famous sayings, colloquialisms, proverbs, aphorisms, and slang.
  • For example, for the famous Chinese saying "己所不欲勿施于人" ("Do not do unto others what you would not have done unto you"), a mapping can be established between the widely recognized official translation and the Chinese expression.
  • The text translation model can be constructed from this mapping relationship, and it outputs the target-language text corresponding to the target interpretation language according to the input to-be-interpreted language text.
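  • A toy sketch of mapping-based translation with longest-match-first lookup, so that a whole common expression such as a proverb is matched before its individual characters. The two-entry table is invented for illustration; the patent's model would be built from large-scale historical translation data.

```python
# Toy mapping table: longest entries first so whole expressions win over words.
MAPPING = {
    "己所不欲勿施于人": "Do not do unto others what you would not have done unto you",
    "普通话": "Mandarin",
}

def translate(text: str) -> str:
    out, i = [], 0
    entries = sorted(MAPPING, key=len, reverse=True)  # longest match first
    while i < len(text):
        for src in entries:
            if text.startswith(src, i):
                out.append(MAPPING[src])
                i += len(src)
                break
        else:
            out.append(text[i])  # pass through unmapped characters
            i += 1
    return " ".join(out)

print(translate("己所不欲勿施于人"))
```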
  • Step S305: Construct the target-language speech model from the target-language text and the voice data corresponding to that text in the target interpretation language.
  • The target-language speech model extracts the voice data corresponding to the target-language text from a preset target speech database, then synthesizes and outputs the final model voice data. It can be built on a character matching algorithm: the target-language text is matched against the text entries of the target speech database, and the corresponding model voice data is queried and output.
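  • The character-matching lookup against a target speech database can be pictured as below: each database entry pairs a text unit with its recorded waveform, and synthesis concatenates the matched units. The database contents and the greedy word-level matching strategy are illustrative assumptions, not details from the patent.

```python
import numpy as np

# Hypothetical target speech database: text unit -> waveform samples.
SPEECH_DB = {
    "hello": np.zeros(8000),   # placeholder 0.5 s clips at 16 kHz
    "world": np.ones(8000),
}

def synthesize(target_text: str) -> np.ndarray:
    """Greedily match words against the database and concatenate their audio."""
    clips = []
    for word in target_text.lower().split():
        if word in SPEECH_DB:
            clips.append(SPEECH_DB[word])
        # A real system would fall back to sub-word units for unknown words.
    return np.concatenate(clips) if clips else np.array([])

audio = synthesize("Hello world")
print(audio.shape)  # (16000,)
```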
  • Step S307: Combine the speech recognition model, the text translation model, and the target-language speech model in sequence to obtain the multilingual simultaneous interpretation model.
  • After the three models are obtained, they are combined in order. In a specific application, a one-to-many mapping can be established between the speech recognition model of a to-be-interpreted language category and the text translation models and target-language speech models of the various target interpretation languages, so that the output language of the multilingual model can be configured to meet the output requirements of each target interpretation language.
  • Step S309: Obtain the voice simultaneous interpretation model library from the multilingual simultaneous interpretation models.
  • The multilingual models corresponding to the various to-be-interpreted language categories are collected into the voice simultaneous interpretation model library. During interpretation, the voice simultaneous interpretation model is obtained by configuring the output language of a multilingual model according to the target interpretation language; the received to-be-interpreted voice data is input into it for translation processing, and the corresponding model voice data is output, completing the simultaneous interpretation processing.
  • In some embodiments, the interpretation voice output requirement includes a scene requirement and a user requirement. The step of processing the model voice data and outputting the simultaneous interpretation voice data then includes: querying the preset scene voice database corresponding to the scene requirement, the database storing scene voice expression data that satisfies the scene requirement; updating the model voice data with the scene voice expression data to obtain scene voice data; and configuring the scene voice data according to the user requirement, then outputting the simultaneous interpretation voice data.
  • Based on the different application scenarios of simultaneous interpretation and the users it faces, the final output voice data can be configured flexibly to suit actual needs. The scene requirement corresponds to the application scenario, such as international conferences, foreign affairs, negotiations, business activities, or news media; the user requirement corresponds to the output audience, covering attributes such as gender, timbre, and style.
  • When processing the model voice data, the preset scene voice database corresponding to the scene requirement is queried. Different scenarios call for different expressions in the output voice data, such as spoken versus written language or specialized vocabulary, and the scene voice expression data for each scenario can be stored in the scene voice database in advance, from which the data satisfying the current scene requirement is extracted.
  • The model voice data is then updated with the scene voice expression data, for example by substituting the scene expressions for the corresponding original expressions and re-synthesizing to obtain the scene voice data. The scene voice data is finally configured according to the user requirement, and the resulting simultaneous interpretation voice data is output. This satisfies the various needs of the output scene and its users, broadens the applicable environments of simultaneous interpretation, and improves the interpretation effect.
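  • If the update step is done at the text stage before re-synthesis, it can be sketched as a dictionary-driven substitution of scene-appropriate expressions, as below. The example scene databases (a formal "conference" register versus a casual one) are invented for illustration, and the naive substring replacement is a simplification of whatever matching a real system would use.

```python
# Hypothetical scene voice databases: scene -> {plain expression: scene expression}.
SCENE_DB = {
    "conference": {"ok": "agreed", "a lot of": "a substantial number of"},
    "casual": {"agreed": "ok"},
}

def apply_scene(model_text: str, scene: str) -> str:
    """Replace plain expressions with ones that suit the interpretation scene."""
    expressions = SCENE_DB.get(scene, {})
    for plain, styled in expressions.items():
        # Naive substring replacement; real matching would be token-aware.
        model_text = model_text.replace(plain, styled)
    return model_text  # would then be re-synthesized into scene voice data

print(apply_scene("There are a lot of delegates, ok?", "conference"))
```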
  • In some embodiments, the user requirement includes a voice timbre requirement and a voice style requirement. The step of configuring the scene voice data and outputting the simultaneous interpretation voice data then includes: switching the timbre of the scene voice data according to the timbre requirement to obtain timbre voice data that satisfies it, then switching the style of the timbre voice data according to the style requirement and outputting the simultaneous interpretation voice data.
  • The timbre requirement may include, but is not limited to, male, female, and child voices; the style requirement may include styles such as cheerful, somber, excited, or the same source style as the voice signal being translated. In general, a default output timbre and style can be set, for example a male voice in the source style, and the user can personalize the default output by switching the timbre and style, so that the corresponding interpretation voice data is output.
  • When configuring the scene voice data according to the user requirement, the timbre is switched first, for example from the default male voice to a female voice, yielding timbre voice data that satisfies the timbre requirement; the style is then switched according to the style requirement, for example from the source style to a somber style, yielding the simultaneous interpretation voice data.
  • Switching the timbre and style of the model voice data output by the voice simultaneous interpretation model according to the user requirement adapts the output to the needs of all kinds of end users, broadens the applicable environments of simultaneous interpretation, and improves the interpretation effect.
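  • Timbre and style switching are signal-processing operations on the synthesized waveform. The sketch below fakes both with the simplest possible stand-ins: a crude resampling pitch shift for the male/female timbre switch and a gain plus tempo tweak for style. A real system would use proper vocoder- or voice-conversion-based processing; the parameter values here are arbitrary assumptions.

```python
import numpy as np

def shift_pitch(samples: np.ndarray, factor: float) -> np.ndarray:
    """Naive pitch shift by resampling (also changes duration; demo only)."""
    idx = np.arange(0, len(samples), factor)
    return np.interp(idx, np.arange(len(samples)), samples)

def apply_user_requirements(samples: np.ndarray, timbre: str, style: str) -> np.ndarray:
    # Timbre switch: raise pitch for a female voice, lower it for male.
    if timbre == "female":
        samples = shift_pitch(samples, 1.25)
    elif timbre == "male":
        samples = shift_pitch(samples, 0.8)
    # Style switch: model a somber style as quieter and slightly slower.
    if style == "somber":
        samples = shift_pitch(samples, 0.95) * 0.7
    return samples

out = apply_user_requirements(np.sin(np.linspace(0, 2000, 16000)), "female", "somber")
print(len(out))
```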
  • In one embodiment, as shown in FIG. 4, a simultaneous interpretation method is provided, which includes the following steps:
  • Step S401: Receive to-be-interpreted voice data;
  • Step S402: Extract voice feature phonemes from the to-be-interpreted voice data;
  • Step S403: Query the preset language phoneme classification model;
  • Step S404: Input the voice feature phonemes into the model to obtain the to-be-interpreted language category corresponding to the voice data.
  • In this embodiment, the first terminal 102 receives, through a voice signal collector, the source voice data that needs to be translated; the server 104 receives this voice data from the first terminal 102 and extracts the voice feature phonemes from it.
  • Extraction may include: digitizing the voice data to obtain digitized to-be-interpreted data; performing endpoint detection on the digitized data; framing the endpoint-detected data into to-be-interpreted voice frame data; and extracting the voice feature phonemes from the voice frame data.
  • After the phonemes used to judge the language category are extracted, they are input into the language phoneme classification model, which is obtained by training on the voice feature phonemes of the various language categories, and the model outputs the to-be-interpreted language category corresponding to the voice data.
  • Step S405: Acquire the simultaneous interpretation requirement, which includes the target interpretation language and the interpretation voice output requirement.
  • The requirement is sent to the server 104 by the second terminal 106 that receives the interpretation output. The target interpretation language is the language into which the to-be-interpreted voice data is to be translated and output, and the interpretation voice output requirement specifies the voice features of the output voice data; adjusting the output's voice features through this requirement satisfies the actual needs of various scenarios and users.
  • Step S406: Query the preset voice simultaneous interpretation model library;
  • Step S407: Query, from the library, the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category;
  • Step S408: Configure the output language of the multilingual model according to the target interpretation language to obtain the voice simultaneous interpretation model;
  • Step S409: Import the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data.
  • The model library stores the multilingual simultaneous interpretation models corresponding to the various to-be-interpreted language categories; each is an interpretation model with a fixed input language category whose output language can be configured to obtain a voice simultaneous interpretation model satisfying the target interpretation language. After the voice simultaneous interpretation model is obtained, the received to-be-interpreted voice data is input into it for translation processing, and the corresponding model voice data is output.
  • Step S410: The interpretation voice output requirement includes a scene requirement and a user requirement; query the preset scene voice database corresponding to the scene requirement, the database storing scene voice expression data that satisfies the scene requirement;
  • Step S411: Update the model voice data with the scene voice expression data to obtain scene voice data;
  • Step S412: Configure the scene voice data according to the user requirement and output the simultaneous interpretation voice data.
  • The user requirement includes a voice timbre requirement and a voice style requirement. Configuring the scene voice data according to the user requirement may include: switching the timbre of the scene voice data according to the timbre requirement to obtain timbre voice data that satisfies it, then switching the style of the timbre voice data according to the style requirement to obtain the simultaneous interpretation voice data.
  • Although the steps in the flowcharts of FIGS. 2-4 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution order is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may comprise multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential: they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
  • In one embodiment, as shown in FIG. 5, a simultaneous interpretation apparatus is provided, which includes: a to-be-interpreted data receiving module 501, a requirement acquisition module 503, a model query module 505, a model voice data acquisition module 507, and an interpretation voice data acquisition module 509, where:
  • the to-be-interpreted data receiving module 501 is configured to receive to-be-interpreted voice data and determine the to-be-interpreted language category corresponding to it;
  • the requirement acquisition module 503 is configured to acquire the simultaneous interpretation requirement, which includes the target interpretation language and the interpretation voice output requirement;
  • the model query module 505 is configured to query the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model being constructed from the translation correspondence between the two;
  • the model voice data acquisition module 507 is configured to import the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
  • the interpretation voice data acquisition module 509 is configured to perform voice feature processing on the model voice data according to the interpretation voice output requirement and output simultaneous interpretation voice data.
  • In the above apparatus, the receiving module determines the language category of the received voice data; the query module queries the corresponding preset voice simultaneous interpretation model according to that category and the target interpretation language; the model voice data acquisition module imports the voice data into the model to obtain model voice data; and the interpretation voice data acquisition module applies voice feature processing according to the interpretation voice output requirement and outputs the simultaneous interpretation voice data, thereby realizing simultaneous interpretation.
  • In one embodiment, the to-be-interpreted data receiving module 501 includes a feature phoneme extraction unit, a phoneme classification model query unit, and a language determination unit, where: the feature phoneme extraction unit is configured to extract voice feature phonemes from the to-be-interpreted voice data; the phoneme classification model query unit is configured to query the preset language phoneme classification model, which is obtained by training on the voice feature phonemes of the various language categories; and the language determination unit is configured to input the phonemes into the model and obtain the to-be-interpreted language category corresponding to the voice data.
  • In one embodiment, the feature phoneme extraction unit includes a digitization subunit, a framing subunit, and a feature phoneme extraction subunit, where: the digitization subunit is configured to digitize the to-be-interpreted voice data to obtain digitized to-be-interpreted data; the framing subunit is configured to perform endpoint detection on the digitized data and frame the endpoint-detected data into to-be-interpreted voice frame data; and the feature phoneme extraction subunit is configured to extract the voice feature phonemes from the voice frame data.
  • In one embodiment, the model query module 505 includes a model library query unit, a multilingual model query unit, and a voice interpretation model acquisition unit, where: the model library query unit is configured to query the preset voice simultaneous interpretation model library; the multilingual model query unit is configured to query, from the library, the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and the voice interpretation model acquisition unit is configured to configure the output language of the multilingual model according to the target interpretation language to obtain the voice simultaneous interpretation model.
  • In one embodiment, the apparatus further includes a speech recognition model module, a text translation model module, a target-language speech model module, a multilingual interpretation model module, and a model library construction module, where: the speech recognition model module is configured to acquire the preset speech recognition model corresponding to the to-be-interpreted language category, the speech recognition model outputting the to-be-interpreted language text according to the to-be-interpreted voice data; the text translation model module is configured to construct the text translation model from the historical translation data between the to-be-interpreted language text and the target-language text, the text translation model outputting the target-language text according to the to-be-interpreted language text; the target-language speech model module is configured to construct the target-language speech model from the target-language text and its corresponding voice data in the target interpretation language; the multilingual interpretation model module is configured to combine the speech recognition model, the text translation model, and the target-language speech model in sequence to obtain the multilingual simultaneous interpretation model; and the model library construction module is configured to obtain the voice simultaneous interpretation model library from the multilingual simultaneous interpretation models.
  • In some embodiments, the interpretation voice output requirement includes a scene requirement and a user requirement, and the interpretation voice data acquisition module 509 includes a scene voice database query unit, a scene voice data acquisition unit, and a user requirement configuration unit, where: the scene voice database query unit is configured to query the preset scene voice database corresponding to the scene requirement, the database storing scene voice expression data that satisfies the scene requirement; the scene voice data acquisition unit is configured to update the model voice data with the scene voice expression data to obtain scene voice data; and the user requirement configuration unit is configured to configure the scene voice data according to the user requirement and output the simultaneous interpretation voice data.
  • In some embodiments, the user requirement includes a voice timbre requirement and a voice style requirement, and the user requirement configuration unit includes a timbre switching subunit and a style switching subunit, where: the timbre switching subunit is configured to switch the timbre of the scene voice data according to the timbre requirement to obtain timbre voice data that satisfies it; and the style switching subunit is configured to switch the style of the timbre voice data according to the style requirement and output the simultaneous interpretation voice data.
  • Each module in the above simultaneous interpretation apparatus can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
  • In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6.
  • The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capabilities.
  • The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium.
  • The network interface communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement a simultaneous interpretation method.
  • Those skilled in the art can understand that FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processors, cause the processors to perform the steps of the simultaneous interpretation method provided in any embodiment of the present application.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the processors to perform the steps of the simultaneous interpretation method provided in any embodiment of the present application.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in many forms, such as dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A simultaneous interpretation method, comprising: receiving to-be-interpreted voice data and determining the to-be-interpreted language category corresponding to it; acquiring a simultaneous interpretation requirement, which includes a target interpretation language and an interpretation voice output requirement; querying a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model being constructed from the translation correspondence between the two; importing the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and performing voice feature processing on the model voice data according to the interpretation voice output requirement, and outputting simultaneous interpretation voice data.

Description

Simultaneous interpretation method, apparatus, computer device and storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 2018112114143, filed with the China Patent Office on October 17, 2018 and entitled "Simultaneous interpretation method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to a simultaneous interpretation method, apparatus, computer device, and storage medium.
Background
Simultaneous interpretation, "SI" for short, refers to a mode of translation in which the interpreter renders the speaker's content to the audience continuously, without interrupting the speech. It is highly academic and professional: besides its wide use at international conferences, it is also used in many fields such as diplomacy and foreign affairs, negotiations, business activities, news media, training and lecturing, television broadcasting, and international arbitration.
However, the inventor realized that the current simultaneous interpretation process relies on professional interpreters performing manual interpretation; it is greatly affected by the interpreters' personal factors, and the efficiency and sound quality of the interpretation are limited.
Summary
According to various embodiments disclosed in this application, a simultaneous interpretation method, apparatus, computer device, and storage medium are provided.
A simultaneous interpretation method includes:
receiving to-be-interpreted voice data, and determining the to-be-interpreted language category corresponding to the to-be-interpreted voice data;
acquiring a simultaneous interpretation requirement, the requirement including a target interpretation language and an interpretation voice output requirement;
querying a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model being constructed from the translation correspondence between the to-be-interpreted language category and the target interpretation language;
importing the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
performing voice feature processing on the model voice data according to the interpretation voice output requirement, and outputting simultaneous interpretation voice data.
A simultaneous interpretation apparatus includes:
a to-be-interpreted data receiving module, configured to receive to-be-interpreted voice data and determine the to-be-interpreted language category corresponding to it;
a requirement acquisition module, configured to acquire the simultaneous interpretation requirement, which includes the target interpretation language and the interpretation voice output requirement;
a model query module, configured to query the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model being constructed from the translation correspondence between the two;
a model voice data acquisition module, configured to import the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
an interpretation voice data acquisition module, configured to perform voice feature processing on the model voice data according to the interpretation voice output requirement and output simultaneous interpretation voice data.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processors, cause the processors to perform the steps of the simultaneous interpretation method provided in any embodiment of this application.
One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the processors to perform the steps of the simultaneous interpretation method provided in any embodiment of this application.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings required by the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is an application scenario diagram of a simultaneous interpretation method according to one or more embodiments.
FIG. 2 is a schematic flowchart of a simultaneous interpretation method according to one or more embodiments.
FIG. 3 is a schematic flowchart of the steps of constructing a voice simultaneous interpretation model library according to one or more embodiments.
FIG. 4 is a schematic flowchart of a simultaneous interpretation method in another embodiment.
FIG. 5 is a block diagram of a simultaneous interpretation apparatus according to one or more embodiments.
FIG. 6 is a block diagram of a computer device according to one or more embodiments.
Detailed Description
To make the technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application, not to limit it.
The simultaneous interpretation method provided by this application can be applied in the environment shown in FIG. 1. The first terminal 102 and the second terminal 106 each communicate with the server 104 over a network. The first terminal 102 sends to-be-interpreted voice data to the server 104; the server 104 determines the to-be-interpreted language category of the received voice data and queries the preset voice simultaneous interpretation model corresponding to that category and the target interpretation language, the model being constructed from the translation correspondence between the two. The to-be-interpreted voice data is imported into the model to obtain model voice data, voice feature processing is then applied according to the interpretation voice output requirement, and the resulting simultaneous interpretation voice data is sent to the second terminal 106, thereby realizing simultaneous interpretation. The first terminal 102 and the second terminal 106 may be, but are not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices; the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a simultaneous interpretation method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
Step S201: Receive to-be-interpreted voice data, and determine the to-be-interpreted language category corresponding to the voice data.
The to-be-interpreted voice data is the source voice data that needs to be translated; it can be captured by the voice signal collector of the first terminal 102 from a voice source, for example the speech signal of a conference speaker. The to-be-interpreted language category is the language to which the source voice data belongs, such as Chinese, English, French, or German. In specific applications, the language categories can be further refined; Chinese, for example, can be divided into dialect sub-languages such as Mandarin, Cantonese, Wu, Sichuanese, and Minnan. In one embodiment, after receiving the to-be-interpreted voice data uploaded by the first terminal 102, the server 104 may determine the corresponding language category from features of the voice data, for example its phoneme features.
Step S203: Acquire the simultaneous interpretation requirement, which includes the target interpretation language and the interpretation voice output requirement.
After receiving the to-be-interpreted voice data sent by the first terminal 102 and determining its language category, the server 104 must also determine the target interpretation language, i.e. the language into which the voice data is to be translated and output. For example, in English-to-Chinese interpretation, English is the to-be-interpreted language category and Chinese is the target interpretation language. The interpretation voice output requirement specifies the voice features of the output voice data; it can include timbre requirements such as a male, female, or child voice, and voice style requirements such as cheerful, somber, or excited. Adjusting the voice features of the output through this requirement allows the output to meet the actual needs of various scenarios and users. In one embodiment, the simultaneous interpretation requirement can be sent to the server 104 by the second terminal 106 that receives the interpretation output.
Step S205: Query the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language; the model is constructed from the translation correspondence between the two.
The voice simultaneous interpretation model translates the input to-be-interpreted voice data and outputs voice data in the target interpretation language; it is set according to its input and output languages. For example, when the to-be-interpreted language category is English, it must be combined with a target interpretation language such as Chinese, German, or French to determine the corresponding English-to-Chinese, English-to-German, or English-to-French voice interpretation model. In one embodiment, after the to-be-interpreted language category and the target interpretation language are determined, the corresponding preset voice simultaneous interpretation model is queried according to the two.
Step S207: Import the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data.
After the voice simultaneous interpretation model is obtained, the received to-be-interpreted voice data is input into the model for translation processing, and the corresponding model voice data is output. In a specific implementation, the voice simultaneous interpretation model can be obtained by combining a speech recognition model, a text translation model, and a target-language speech model. The speech recognition model may be, but is not limited to, a hidden Markov model or a machine learning model based on an artificial neural network algorithm, such as an LSTM recurrent neural network; it performs speech recognition on the to-be-interpreted voice data to obtain the text, in the to-be-interpreted language, corresponding to that voice data. The text translation model can be constructed based on a character matching algorithm such as the KMP algorithm; it translates the to-be-interpreted language text output by the speech recognition model into the target-language text corresponding to the target interpretation language. The target-language speech model extracts the corresponding voice data from a preset target speech database according to the target-language text output by the text translation model, then synthesizes and outputs the final model voice data, thereby completing the simultaneous interpretation processing.
Step S209: Perform voice feature processing on the model voice data according to the interpretation voice output requirement, and output simultaneous interpretation voice data.
After the voice simultaneous interpretation model outputs the translated model voice data, voice feature processing is applied to it according to the interpretation voice output requirement in the simultaneous interpretation requirement, and the resulting simultaneous interpretation voice data is output. Voice feature processing may include, but is not limited to, timbre processing, such as switching between male and female voices, and voice style processing, such as switching between emotional styles like cheerful, excited, or sad. Processing the voice features of the model voice data gives the final output different sound characteristics, no longer limited to the voice of a human interpreter, so that it suits various interpretation scenarios and user groups and improves the sound quality of simultaneous interpretation.
In the above simultaneous interpretation method, the to-be-interpreted language category of the received voice data is determined, the corresponding preset voice simultaneous interpretation model is queried according to that category and the target interpretation language, the to-be-interpreted voice data is imported into the model to obtain model voice data, and voice feature processing is applied according to the interpretation voice output requirement, outputting the simultaneous interpretation voice data and thereby realizing simultaneous interpretation. No dedicated interpreter performs manual translation, which avoids the influence of human factors and effectively improves both the efficiency and the sound quality of simultaneous interpretation.
In one embodiment, the step of determining the to-be-interpreted language category includes: extracting voice feature phonemes from the to-be-interpreted voice data; querying a preset language phoneme classification model, which is obtained by training on the voice feature phonemes corresponding to the various language categories; and inputting the extracted phonemes into the model to obtain the to-be-interpreted language category corresponding to the voice data.
Different languages have different pronunciation rules. The phoneme, the smallest unit of speech divided according to the natural attributes of speech, differs across languages. The Chinese word "putonghua" (普通话, Mandarin) consists of 3 syllables that can be split into the 8 phonemes "p, u, t, o, ng, h, u, a"; English has 48 phonemes, of which 20 are vowels and 28 are consonants, and of the 26 English letters, 5 are vowels, 19 are consonants, and 2 are semi-vowels. Language categories can therefore be distinguished by their phoneme characteristics.
In this embodiment, when determining the to-be-interpreted language category, voice feature phonemes are extracted from the voice data and used to judge the language category. The preset language phoneme classification model, obtained by training on the voice feature phonemes of the various language categories, classifies the language according to the input phonemes; it can be a neural network model trained with an artificial neural network algorithm on the phoneme features of each language. The extracted phonemes are input into the model, which outputs the to-be-interpreted language category corresponding to the voice data.
In a specific application, when inputting the phonemes into the model, the phonemes extracted from the voice data can first be filtered according to the model's input requirements, and only those satisfying the requirements are input into the model for language category determination.
在其中一个实施例中,从待同传语音数据中提取语音特征音素的步骤包括:对待同传语音数据进行数字化处理,得到数字化待同传数据;对数字化待同传数据进行端点检测处理,并对端点检测处理后的数字化待同传数据进行语音分帧处理,得到待同传语音帧数据;从待同传语音帧数据中提取语音特征音素。
一般地,由第一终端102通过语音信号采集器,如话筒采集得到的待同传语音数据为模拟信号,其包括冗余信息,如背景噪声、信道失真等,需要对该模拟信号进行预处理,如进行反混叠滤波、采样、A/D转换等过程进行数字化处理,之后要进行包括预加重、加窗和分帧、端点检测等处理,以滤除掉其中的不重要的信息以及背景噪声,能够有效提高同声传译的处理效率和处理效果。
In this embodiment, when extracting voice feature phonemes from the to-be-interpreted voice data, the to-be-interpreted voice data is first digitized, including anti-aliasing filtering, sampling, and A/D conversion, to obtain digitized to-be-interpreted data. Endpoint detection is then performed on the digitized to-be-interpreted data to determine its beginning and end, and the endpoint-detected digitized to-be-interpreted data is divided into successive frame signals by voice framing, yielding the to-be-interpreted voice frame data, from which the voice feature phonemes can be extracted.
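The sketch below illustrates, under simplifying assumptions, the endpoint detection and framing steps on an already digitized signal; the energy threshold and the frame sizes (25 ms frames with a 10 ms hop at a 16 kHz sampling rate) are example values, not ones prescribed by the disclosure:

```python
import numpy as np

def endpoint_detect(samples: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    # Simple energy-based endpoint detection: keep the span between the first
    # and last samples whose amplitude exceeds the threshold.
    active = np.flatnonzero(np.abs(samples) > threshold)
    return samples[active[0]:active[-1] + 1] if active.size else samples[:0]

def frame_signal(samples: np.ndarray, frame_len: int = 400,
                 hop: int = 160) -> np.ndarray:
    # Split the trimmed signal into overlapping frames for feature extraction.
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])
```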
In some embodiments, the step of querying the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language includes: querying a preset voice simultaneous interpretation model library; querying, from the voice simultaneous interpretation model library, the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and configuring the output language of the multilingual simultaneous interpretation model according to the target interpretation language to obtain the voice simultaneous interpretation model.
In this embodiment, the voice simultaneous interpretation model library stores the multilingual simultaneous interpretation models corresponding to the various to-be-interpreted language categories. A multilingual simultaneous interpretation model is an interpretation model with a fixed input language category; by configuring its output language according to the actual target interpretation language, a voice simultaneous interpretation model satisfying the target interpretation language can be obtained. In one embodiment, when querying the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the model library is queried; the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category is retrieved from the library, and its output language is then configured according to the target interpretation language to obtain a voice simultaneous interpretation model satisfying the target interpretation language. This model can receive to-be-interpreted voice data of the to-be-interpreted language category and, after translation processing, output simultaneous interpretation voice data corresponding to the target interpretation language, thereby realizing simultaneous interpretation of the voice data.
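A hedged sketch of this two-step lookup follows: fetch the multilingual model for the source language from the library, then fix its output language. The dictionary structure and the pipeline placeholders are assumptions for illustration only:

```python
class MultilingualModel:
    # Fixed input language, configurable output language.
    def __init__(self, source_lang: str, outputs: dict):
        self.source_lang = source_lang
        self.outputs = outputs  # target language -> configured pipeline

    def configure_output(self, target_lang: str):
        return self.outputs[target_lang]

# Toy model library keyed by the to-be-interpreted language category.
model_library = {
    "en": MultilingualModel("en", {"zh": "en->zh pipeline",
                                   "de": "en->de pipeline"}),
}

def get_interpretation_model(source_lang: str, target_lang: str):
    return model_library[source_lang].configure_output(target_lang)

pipeline = get_interpretation_model("en", "zh")   # -> "en->zh pipeline"
```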
In one embodiment, as shown in FIG. 3, before the step of querying the preset voice simultaneous interpretation model library, the construction of the voice simultaneous interpretation model library includes the following steps:
Step S301: acquire a preset speech recognition model corresponding to the to-be-interpreted language category, where the speech recognition model is used to output, according to the to-be-interpreted voice data, the to-be-interpreted language text corresponding to the to-be-interpreted language category.
In this embodiment, the voice simultaneous interpretation model can be obtained by configuring the output language of a multilingual simultaneous interpretation model according to the target interpretation language; the multilingual simultaneous interpretation model is obtained by combining a speech recognition model, a text translation model, and a target language voice model, and the various multilingual simultaneous interpretation models are collected and stored together in the voice simultaneous interpretation model library. In one embodiment, when creating the voice simultaneous interpretation model library, on the one hand, the preset speech recognition model corresponding to the to-be-interpreted language category is acquired; it is used to output, according to the to-be-interpreted voice data, the to-be-interpreted language text corresponding to the to-be-interpreted language category. The speech recognition model may be, but is not limited to, a hidden Markov model or a machine learning model based on artificial neural network algorithms; it performs speech recognition on the to-be-interpreted voice data to obtain the to-be-interpreted language text corresponding to the to-be-interpreted voice data in the to-be-interpreted language category. For example, a speech recognition model for the Chinese language can transcribe received Chinese voice data into Chinese characters.
Step S303: construct a text translation model according to the historical translation data between the to-be-interpreted language text and the target language text corresponding to the target interpretation language, where the text translation model is used to output, according to the to-be-interpreted language text, the target language text corresponding to the target interpretation language.
On the other hand, based on big-data analysis of the historical translation data between the to-be-interpreted language text and the target language text corresponding to the target interpretation language, a mapping relationship between the two is established, including but not limited to character mappings, word mappings, phrase mappings, and common-expression mappings, where common expressions may include famous quotes, colloquialisms, proverbs, maxims, and slang. In a specific application, for example, for the Chinese maxim "己所不欲勿施于人" ("do not do to others what you do not want done to yourself"), a mapping can be established between the Chinese expression and the wording of its widely recognized official translation. A text translation model can be constructed from this mapping relationship; through it, the target language text corresponding to the target interpretation language can be output according to the to-be-interpreted language text.
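The sketch below shows a toy mapping-based translation table of the kind described, with the idiom from the example above as an entry; substituting the longest matches first lets idioms win over their component words. The entries are illustrative, not drawn from real historical translation data:

```python
# Toy Chinese-to-English mapping table; real entries would be mined from
# historical translation data at the character, word, phrase, and
# common-expression levels.
phrase_map_zh_en = {
    "己所不欲勿施于人": "do not do to others what you do not want done to yourself",
    "你好": "hello",
}

def translate_text(text: str, phrase_map: dict) -> str:
    # Substitute longest matches first so idioms win over component words.
    for source in sorted(phrase_map, key=len, reverse=True):
        text = text.replace(source, phrase_map[source])
    return text
```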
Step S305: construct a target language voice model according to the target language text and the voice data corresponding to the target language text in the target interpretation language.
In addition, a target language voice model is constructed for extracting the voice data corresponding to the target language text from a preset target voice database and for synthesizing and outputting the final model voice data. The target language voice model may be constructed on the basis of a string matching algorithm: the target language text is matched against the text corresponding to the voice data in the preset target voice database, and the corresponding model voice data is retrieved and output.
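As a simplified illustration of this text-to-speech-database matching, the following sketch maps target-language text units to stored waveforms and concatenates the matches; the database contents are placeholder arrays, not real recordings:

```python
import numpy as np

# Hypothetical target voice database mapping target-language text units to
# recorded waveforms; real entries would come from the preset database.
target_voice_db = {
    "hello": np.zeros(1600),
    "world": np.ones(1600) * 0.1,
}

def synthesize_model_voice(text_units: list) -> np.ndarray:
    # Match each text unit against the database and concatenate the clips
    # into the final model voice data.
    clips = [target_voice_db[u] for u in text_units if u in target_voice_db]
    return np.concatenate(clips) if clips else np.zeros(0)
```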
Step S307: combine the speech recognition model, the text translation model, and the target language voice model in sequence to obtain the multilingual simultaneous interpretation model.
After the speech recognition model, the text translation model, and the target language voice model are obtained, they are combined in order to obtain the multilingual simultaneous interpretation model. In a specific application, a one-to-many mapping relationship can be established between the speech recognition model corresponding to the to-be-interpreted language category and the text translation models and target language voice models corresponding to the various target interpretation languages, so that the output language of the multilingual simultaneous interpretation model can be configured to satisfy the output requirements of the various target interpretation languages.
Step S309: obtain the voice simultaneous interpretation model library from the multilingual simultaneous interpretation models.
After the multilingual simultaneous interpretation models are obtained, the models corresponding to the various to-be-interpreted language categories are collected to form the voice simultaneous interpretation model library. During simultaneous interpretation, the voice simultaneous interpretation model is obtained by configuring the output language of the multilingual simultaneous interpretation model according to the target interpretation language; the received to-be-interpreted voice data is input into the voice simultaneous interpretation model for translation processing, and the corresponding model voice data is output, realizing the simultaneous interpretation processing.
In some embodiments, the interpretation voice output requirement includes an interpretation scenario requirement and an interpretation user requirement. The step of performing voice feature processing on the model voice data according to the interpretation voice output requirement and outputting the simultaneous interpretation voice data includes: querying a preset scenario voice database corresponding to the interpretation scenario requirement, where the scenario voice database stores scenario voice expression data satisfying the interpretation scenario requirement; updating the model voice data with the scenario voice expression data to obtain scenario voice data; and configuring the scenario voice data according to the interpretation user requirement, and outputting the simultaneous interpretation voice data.
Based on the different application scenarios of simultaneous interpretation and the users it serves, the final output interpretation voice data can be flexibly configured to suit various actual needs. In this embodiment, the interpretation voice output requirement includes an interpretation scenario requirement and an interpretation user requirement: the interpretation scenario requirement corresponds to the application scenario of the interpretation, such as international conferences, diplomacy and foreign affairs, meetings and negotiations, business activities, and news media; the interpretation user requirement corresponds to the target audience of the output, covering aspects such as gender, timbre, and style.
When performing voice feature processing on the model voice data, the preset scenario voice database corresponding to the interpretation scenario requirement is queried; this database stores scenario voice expression data satisfying the interpretation scenario requirement. Different interpretation scenarios call for different expressions in the output voice data, such as spoken versus written language and specialized vocabulary; the scenario voice expression data corresponding to each interpretation scenario requirement can be stored in the scenario voice database in advance, and the data satisfying the interpretation scenario requirement can be extracted by querying the database. After the scenario voice expression data is obtained, the model voice data is updated with it, for example by substituting the scenario voice expression data for the corresponding original expression data, and the scenario voice data is then synthesized. The scenario voice data is then configured according to the interpretation user requirement, and the final simultaneous interpretation voice data is obtained and output, thereby satisfying the various needs of the output-side scenarios and users, extending the applicable environments of interpretation, and improving the effect of simultaneous interpretation.
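A text-level sketch of this scenario expression replacement is shown below; real replacement would operate on the expression data behind the voice, and every database entry here is an invented example rather than content from an actual scenario voice database:

```python
# Invented scenario database: scenario -> {default wording: scenario wording}.
scenario_db = {
    "diplomatic": {"ok": "we are in agreement",
                   "a lot of": "a substantial number of"},
    "business": {"ok": "agreed"},
}

def apply_scenario(text: str, scenario: str) -> str:
    # Substitute scenario-appropriate expressions for the default wording
    # before synthesis.
    for default, expression in scenario_db.get(scenario, {}).items():
        text = text.replace(default, expression)
    return text
```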
In some embodiments, the interpretation user requirement includes a voice timbre requirement and a voice style requirement. The step of configuring the scenario voice data according to the interpretation user requirement and outputting the simultaneous interpretation voice data includes: switching the timbre of the scenario voice data according to the voice timbre requirement to obtain timbre voice data satisfying the voice timbre requirement; and switching the style of the timbre voice data according to the voice style requirement, and outputting the simultaneous interpretation voice data.
In this embodiment, the interpretation user requirement includes a voice timbre requirement and a voice style requirement. The voice timbre requirement may include, but is not limited to, timbres such as a male, female, or child voice; the voice style requirement may include styles such as cheerful, somber, excited, or the source style identical to that of the voice signal to be translated. Generally, a default output voice timbre and style can be set, for example a male voice in the source style, and the user can personalize the default output by switching the voice timbre and style, with the corresponding interpretation voice data output accordingly. In one embodiment, when configuring the scenario voice data according to the interpretation user requirement, the timbre of the scenario voice data is switched according to the voice timbre requirement, for example from the default male voice to a female voice, yielding timbre voice data satisfying the voice timbre requirement; the style of the timbre voice data is then switched according to the voice style requirement, for example from the source style to a somber style, yielding the simultaneous interpretation voice data. Switching the voice timbre and the voice style of the model voice data output by the voice simultaneous interpretation model according to the interpretation user requirement can adapt to the needs of various users at the interpretation output end, extending the applicable environments of interpretation and improving the effect of simultaneous interpretation.
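The default-plus-override behavior described here can be sketched as a small preference object whose fields would drive timbre and style transforms such as those sketched earlier; the defaults follow the text's example of a source-style male voice, and the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class UserOutputPrefs:
    timbre: str = "male"    # default timbre per the example in the text
    style: str = "source"   # default: keep the source speaker's style

prefs = UserOutputPrefs()
prefs.timbre = "female"     # user switches the default male voice to female
prefs.style = "somber"      # and the source style to somber
```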
In one embodiment, as shown in FIG. 4, a simultaneous interpretation method is provided, including the following steps:
Step S401: receive to-be-interpreted voice data;
Step S402: extract voice feature phonemes from the to-be-interpreted voice data;
Step S403: query a preset language phoneme classification model;
Step S404: input the voice feature phonemes into the language phoneme classification model to obtain the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
In this embodiment, the first terminal 102 receives, through a voice signal collector, the source voice data that needs to be translated from a voice source; the server 104 receives this to-be-interpreted voice data sent by the first terminal 102 and extracts voice feature phonemes from it, which may specifically include: digitizing the to-be-interpreted voice data to obtain digitized to-be-interpreted data; performing endpoint detection on the digitized to-be-interpreted data, and performing voice framing on the endpoint-detected digitized to-be-interpreted data to obtain to-be-interpreted voice frame data; and extracting the voice feature phonemes from the to-be-interpreted voice frame data. After the voice feature phonemes used to determine the to-be-interpreted language category are extracted from the to-be-interpreted voice data, they are input into the language phoneme classification model, which is obtained by training on the voice feature phonemes corresponding to various language categories and which outputs the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
Step S405: acquire an interpretation requirement, where the interpretation requirement includes a target interpretation language and an interpretation voice output requirement.
The interpretation requirement is sent to the server 104 by the second terminal 106 that receives the simultaneous interpretation output. The target interpretation language is the target language category into which the to-be-interpreted voice data needs to be translated and output, and the interpretation voice output requirement may be the voice feature requirement of the voice data to be output. Adjusting the voice features of the output voice data according to the interpretation voice output requirement can satisfy the actual needs of various scenarios and various users.
Step S406: query a preset voice simultaneous interpretation model library;
Step S407: query, from the voice simultaneous interpretation model library, the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category;
Step S408: configure the output language of the multilingual simultaneous interpretation model according to the target interpretation language to obtain the voice simultaneous interpretation model;
Step S409: feed the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data.
The voice simultaneous interpretation model library stores the multilingual simultaneous interpretation models corresponding to the various to-be-interpreted language categories; a multilingual simultaneous interpretation model is an interpretation model with a fixed input language category, and by configuring its output language according to the actual target interpretation language, a voice simultaneous interpretation model satisfying the target interpretation language can be obtained. After the voice simultaneous interpretation model is obtained, the received to-be-interpreted voice data is input into it for translation processing, and the corresponding model voice data is output.
Step S410: where the interpretation voice output requirement includes an interpretation scenario requirement and an interpretation user requirement, query a preset scenario voice database corresponding to the interpretation scenario requirement, the scenario voice database storing scenario voice expression data satisfying the interpretation scenario requirement;
Step S411: update the model voice data with the scenario voice expression data to obtain scenario voice data;
Step S412: configure the scenario voice data according to the interpretation user requirement, and output the simultaneous interpretation voice data.
In this embodiment, based on the different application scenarios of simultaneous interpretation and the users it serves, the final output interpretation voice data can be flexibly configured to suit various actual needs. In one embodiment, the interpretation user requirement includes a voice timbre requirement and a voice style requirement, and configuring the scenario voice data according to the interpretation user requirement may include: switching the timbre of the scenario voice data according to the voice timbre requirement to obtain timbre voice data satisfying the voice timbre requirement; and switching the style of the timbre voice data according to the voice style requirement to obtain the simultaneous interpretation voice data.
It should be understood that although the steps in the flowcharts of FIGS. 2-4 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, a simultaneous interpretation apparatus is provided, including: a to-be-interpreted data receiving module 501, an interpretation requirement acquisition module 503, an interpretation model query module 505, a model voice data acquisition module 507, and an interpretation voice data acquisition module 509, wherein:
The to-be-interpreted data receiving module 501 is configured to receive to-be-interpreted voice data and determine the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
The interpretation requirement acquisition module 503 is configured to acquire an interpretation requirement, where the interpretation requirement includes a target interpretation language and an interpretation voice output requirement.
The interpretation model query module 505 is configured to query a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, where the voice simultaneous interpretation model is constructed on the basis of the translation correspondence between the to-be-interpreted language category and the target interpretation language.
The model voice data acquisition module 507 is configured to feed the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data.
The interpretation voice data acquisition module 509 is configured to perform voice feature processing on the model voice data according to the interpretation voice output requirement and output simultaneous interpretation voice data.
In the above simultaneous interpretation apparatus, the to-be-interpreted data receiving module determines the to-be-interpreted language category corresponding to the received to-be-interpreted voice data; the interpretation model query module queries the corresponding preset voice simultaneous interpretation model according to the to-be-interpreted language category and the target interpretation language, the model being constructed on the basis of the translation correspondence between the two; the model voice data acquisition module feeds the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and the interpretation voice data acquisition module performs voice feature processing on the model voice data according to the interpretation voice output requirement and outputs the simultaneous interpretation voice data, thereby realizing simultaneous interpretation. During the simultaneous interpretation process, no dedicated human interpreter is needed, which avoids the influence of human factors and effectively improves the efficiency and the sound effect of simultaneous interpretation.
In one embodiment, the to-be-interpreted data receiving module 501 includes a feature phoneme extraction unit, a phoneme classification model query unit, and a to-be-interpreted language determination unit, wherein: the feature phoneme extraction unit is configured to extract voice feature phonemes from the to-be-interpreted voice data; the phoneme classification model query unit is configured to query a preset language phoneme classification model, the language phoneme classification model being obtained by training on the voice feature phonemes corresponding to various language categories; and the to-be-interpreted language determination unit is configured to input the voice feature phonemes into the language phoneme classification model to obtain the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
In some embodiments, the feature phoneme extraction unit includes a digitization subunit, a framing subunit, and a feature phoneme extraction subunit, wherein: the digitization subunit is configured to digitize the to-be-interpreted voice data to obtain digitized to-be-interpreted data; the framing subunit is configured to perform endpoint detection on the digitized to-be-interpreted data and perform voice framing on the endpoint-detected digitized to-be-interpreted data to obtain to-be-interpreted voice frame data; and the feature phoneme extraction subunit is configured to extract the voice feature phonemes from the to-be-interpreted voice frame data.
In one embodiment, the interpretation model query module 505 includes an interpretation model library query unit, a multilingual interpretation model query unit, and a voice simultaneous interpretation model acquisition unit, wherein: the interpretation model library query unit is configured to query a preset voice simultaneous interpretation model library; the multilingual interpretation model query unit is configured to query, from the voice simultaneous interpretation model library, the multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and the voice simultaneous interpretation model acquisition unit is configured to configure the output language of the multilingual simultaneous interpretation model according to the target interpretation language to obtain the voice simultaneous interpretation model.
In one embodiment, the apparatus further includes a speech recognition model module, a text translation model module, a target language voice model module, a multilingual interpretation model module, and an interpretation model library construction module, wherein: the speech recognition model module is configured to acquire a preset speech recognition model corresponding to the to-be-interpreted language category, the speech recognition model being used to output, according to the to-be-interpreted voice data, the to-be-interpreted language text corresponding to the to-be-interpreted language category; the text translation model module is configured to construct a text translation model according to the historical translation data between the to-be-interpreted language text and the target language text corresponding to the target interpretation language, the text translation model being used to output, according to the to-be-interpreted language text, the target language text corresponding to the target interpretation language; the target language voice model module is configured to construct a target language voice model according to the target language text and the voice data corresponding to the target language text in the target interpretation language; the multilingual interpretation model module is configured to combine the speech recognition model, the text translation model, and the target language voice model in sequence to obtain the multilingual simultaneous interpretation model; and the interpretation model library construction module is configured to obtain the voice simultaneous interpretation model library from the multilingual simultaneous interpretation models.
In some embodiments, the interpretation voice output requirement includes an interpretation scenario requirement and an interpretation user requirement; the interpretation voice data acquisition module 509 includes a scenario voice database query unit, a scenario voice data acquisition unit, and a user requirement configuration unit, wherein: the scenario voice database query unit is configured to query a preset scenario voice database corresponding to the interpretation scenario requirement, the scenario voice database storing scenario voice expression data satisfying the interpretation scenario requirement; the scenario voice data acquisition unit is configured to update the model voice data with the scenario voice expression data to obtain scenario voice data; and the user requirement configuration unit is configured to configure the scenario voice data according to the interpretation user requirement and output the simultaneous interpretation voice data.
In one embodiment, the interpretation user requirement includes a voice timbre requirement and a voice style requirement; the user requirement configuration unit includes a timbre switching subunit and a style switching subunit, wherein: the timbre switching subunit is configured to switch the timbre of the scenario voice data according to the voice timbre requirement to obtain timbre voice data satisfying the voice timbre requirement; and the style switching subunit is configured to switch the style of the timbre voice data according to the voice style requirement and output the simultaneous interpretation voice data.
For the specific limitations of the simultaneous interpretation apparatus, reference may be made to the limitations of the simultaneous interpretation method above, which are not repeated here. Each module in the above simultaneous interpretation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement a simultaneous interpretation method.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
A computer device is provided, including a memory and one or more processors, where the memory stores computer-readable instructions which, when executed by the processors, implement the steps of the simultaneous interpretation method provided in any embodiment of the present application.
One or more non-volatile computer-readable storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the steps of the simultaneous interpretation method provided in any embodiment of the present application.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments may be accomplished by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments have been described; however, as long as there is no contradiction in the combinations of these technical features, they should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the patent of the present application shall be subject to the appended claims.

Claims (20)

  1. A simultaneous interpretation method, comprising:
    receiving to-be-interpreted voice data, and determining a to-be-interpreted language category corresponding to the to-be-interpreted voice data;
    acquiring an interpretation requirement, the interpretation requirement comprising a target interpretation language and an interpretation voice output requirement;
    querying a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the voice simultaneous interpretation model being constructed on the basis of a translation correspondence between the to-be-interpreted language category and the target interpretation language;
    feeding the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
    performing voice feature processing on the model voice data according to the interpretation voice output requirement, and outputting simultaneous interpretation voice data.
  2. The method according to claim 1, wherein the determining the to-be-interpreted language category corresponding to the to-be-interpreted voice data comprises:
    extracting voice feature phonemes from the to-be-interpreted voice data;
    querying a preset language phoneme classification model, the language phoneme classification model being obtained by training on voice feature phonemes corresponding to various language categories; and
    inputting the voice feature phonemes into the language phoneme classification model to obtain the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
  3. The method according to claim 2, wherein the extracting voice feature phonemes from the to-be-interpreted voice data comprises:
    digitizing the to-be-interpreted voice data to obtain digitized to-be-interpreted data;
    performing endpoint detection on the digitized to-be-interpreted data, and performing voice framing on the endpoint-detected digitized to-be-interpreted data to obtain to-be-interpreted voice frame data; and
    extracting the voice feature phonemes from the to-be-interpreted voice frame data.
  4. The method according to claim 1, wherein the querying the preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language comprises:
    querying a preset voice simultaneous interpretation model library;
    querying, from the voice simultaneous interpretation model library, a multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and
    configuring an output language of the multilingual simultaneous interpretation model according to the target interpretation language to obtain the voice simultaneous interpretation model.
  5. The method according to claim 4, wherein before the querying the preset voice simultaneous interpretation model library, the method further comprises:
    acquiring a preset speech recognition model corresponding to the to-be-interpreted language category, the speech recognition model being used to output, according to the to-be-interpreted voice data, to-be-interpreted language text corresponding to the to-be-interpreted language category;
    constructing a text translation model according to historical translation data between the to-be-interpreted language text and target language text corresponding to the target interpretation language, the text translation model being used to output, according to the to-be-interpreted language text, the target language text corresponding to the target interpretation language;
    constructing a target language voice model according to the target language text and voice data corresponding to the target language text in the target interpretation language;
    combining the speech recognition model, the text translation model, and the target language voice model in sequence to obtain the multilingual simultaneous interpretation model; and
    obtaining the voice simultaneous interpretation model library from the multilingual simultaneous interpretation model.
  6. The method according to any one of claims 1 to 5, wherein the interpretation voice output requirement comprises an interpretation scenario requirement and an interpretation user requirement, and the performing voice feature processing on the model voice data according to the interpretation voice output requirement and outputting the simultaneous interpretation voice data comprises:
    querying a preset scenario voice database corresponding to the interpretation scenario requirement, the scenario voice database storing scenario voice expression data satisfying the interpretation scenario requirement;
    updating the model voice data with the scenario voice expression data to obtain scenario voice data; and
    configuring the scenario voice data according to the interpretation user requirement, and outputting the simultaneous interpretation voice data.
  7. The method according to claim 6, wherein the interpretation user requirement comprises a voice timbre requirement and a voice style requirement, and the configuring the scenario voice data according to the interpretation user requirement and outputting the simultaneous interpretation voice data comprises:
    switching a timbre of the scenario voice data according to the voice timbre requirement to obtain timbre voice data satisfying the voice timbre requirement; and
    switching a style of the timbre voice data according to the voice style requirement, and outputting the simultaneous interpretation voice data.
  8. A simultaneous interpretation apparatus, comprising:
    a to-be-interpreted data receiving module, configured to receive to-be-interpreted voice data and determine a to-be-interpreted language category corresponding to the to-be-interpreted voice data;
    an interpretation requirement acquisition module, configured to acquire an interpretation requirement, the interpretation requirement comprising a target interpretation language and an interpretation voice output requirement;
    an interpretation model query module, configured to query a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the voice simultaneous interpretation model being constructed on the basis of a translation correspondence between the to-be-interpreted language category and the target interpretation language;
    a model voice data acquisition module, configured to feed the to-be-interpreted voice data into the voice simultaneous interpretation model and output model voice data; and
    an interpretation voice data acquisition module, configured to perform voice feature processing on the model voice data according to the interpretation voice output requirement and output simultaneous interpretation voice data.
  9. The apparatus according to claim 8, wherein the to-be-interpreted data receiving module comprises:
    a feature phoneme extraction unit, configured to extract voice feature phonemes from the to-be-interpreted voice data;
    a phoneme classification model query unit, configured to query a preset language phoneme classification model, the language phoneme classification model being obtained by training on voice feature phonemes corresponding to various language categories; and
    a to-be-interpreted language determination unit, configured to input the voice feature phonemes into the language phoneme classification model to obtain the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
  10. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    receiving to-be-interpreted voice data, and determining a to-be-interpreted language category corresponding to the to-be-interpreted voice data;
    acquiring an interpretation requirement, the interpretation requirement comprising a target interpretation language and an interpretation voice output requirement;
    querying a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the voice simultaneous interpretation model being constructed on the basis of a translation correspondence between the to-be-interpreted language category and the target interpretation language;
    feeding the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
    performing voice feature processing on the model voice data according to the interpretation voice output requirement, and outputting simultaneous interpretation voice data.
  11. The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    extracting voice feature phonemes from the to-be-interpreted voice data;
    querying a preset language phoneme classification model, the language phoneme classification model being obtained by training on voice feature phonemes corresponding to various language categories; and
    inputting the voice feature phonemes into the language phoneme classification model to obtain the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
  12. The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    digitizing the to-be-interpreted voice data to obtain digitized to-be-interpreted data;
    performing endpoint detection on the digitized to-be-interpreted data, and performing voice framing on the endpoint-detected digitized to-be-interpreted data to obtain to-be-interpreted voice frame data; and
    extracting the voice feature phonemes from the to-be-interpreted voice frame data.
  13. The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    querying a preset voice simultaneous interpretation model library;
    querying, from the voice simultaneous interpretation model library, a multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and
    configuring an output language of the multilingual simultaneous interpretation model according to the target interpretation language to obtain the voice simultaneous interpretation model.
  14. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    acquiring a preset speech recognition model corresponding to the to-be-interpreted language category, the speech recognition model being used to output, according to the to-be-interpreted voice data, to-be-interpreted language text corresponding to the to-be-interpreted language category;
    constructing a text translation model according to historical translation data between the to-be-interpreted language text and target language text corresponding to the target interpretation language, the text translation model being used to output, according to the to-be-interpreted language text, the target language text corresponding to the target interpretation language;
    constructing a target language voice model according to the target language text and voice data corresponding to the target language text in the target interpretation language;
    combining the speech recognition model, the text translation model, and the target language voice model in sequence to obtain the multilingual simultaneous interpretation model; and
    obtaining the voice simultaneous interpretation model library from the multilingual simultaneous interpretation model.
  15. The computer device according to any one of claims 10 to 14, wherein the interpretation voice output requirement comprises an interpretation scenario requirement and an interpretation user requirement, and the processor, when executing the computer-readable instructions, further performs the following steps:
    querying a preset scenario voice database corresponding to the interpretation scenario requirement, the scenario voice database storing scenario voice expression data satisfying the interpretation scenario requirement;
    updating the model voice data with the scenario voice expression data to obtain scenario voice data; and
    configuring the scenario voice data according to the interpretation user requirement, and outputting the simultaneous interpretation voice data.
  16. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving to-be-interpreted voice data, and determining a to-be-interpreted language category corresponding to the to-be-interpreted voice data;
    acquiring an interpretation requirement, the interpretation requirement comprising a target interpretation language and an interpretation voice output requirement;
    querying a preset voice simultaneous interpretation model corresponding to the to-be-interpreted language category and the target interpretation language, the voice simultaneous interpretation model being constructed on the basis of a translation correspondence between the to-be-interpreted language category and the target interpretation language;
    feeding the to-be-interpreted voice data into the voice simultaneous interpretation model to obtain model voice data; and
    performing voice feature processing on the model voice data according to the interpretation voice output requirement, and outputting simultaneous interpretation voice data.
  17. The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    extracting voice feature phonemes from the to-be-interpreted voice data;
    querying a preset language phoneme classification model, the language phoneme classification model being obtained by training on voice feature phonemes corresponding to various language categories; and
    inputting the voice feature phonemes into the language phoneme classification model to obtain the to-be-interpreted language category corresponding to the to-be-interpreted voice data.
  18. The storage media according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    digitizing the to-be-interpreted voice data to obtain digitized to-be-interpreted data;
    performing endpoint detection on the digitized to-be-interpreted data, and performing voice framing on the endpoint-detected digitized to-be-interpreted data to obtain to-be-interpreted voice frame data; and
    extracting the voice feature phonemes from the to-be-interpreted voice frame data.
  19. The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    querying a preset voice simultaneous interpretation model library;
    querying, from the voice simultaneous interpretation model library, a multilingual simultaneous interpretation model corresponding to the to-be-interpreted language category; and
    configuring an output language of the multilingual simultaneous interpretation model according to the target interpretation language to obtain the voice simultaneous interpretation model.
  20. The storage media according to claim 19, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    acquiring a preset speech recognition model corresponding to the to-be-interpreted language category, the speech recognition model being used to output, according to the to-be-interpreted voice data, to-be-interpreted language text corresponding to the to-be-interpreted language category;
    constructing a text translation model according to historical translation data between the to-be-interpreted language text and target language text corresponding to the target interpretation language, the text translation model being used to output, according to the to-be-interpreted language text, the target language text corresponding to the target interpretation language;
    constructing a target language voice model according to the target language text and voice data corresponding to the target language text in the target interpretation language;
    combining the speech recognition model, the text translation model, and the target language voice model in sequence to obtain the multilingual simultaneous interpretation model; and
    obtaining the voice simultaneous interpretation model library from the multilingual simultaneous interpretation model.
PCT/CN2018/124800 2018-10-17 2018-12-28 Simultaneous interpretation method and apparatus, computer device, and storage medium WO2020077868A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811211414.3 2018-10-17
CN201811211414.3A CN109448698A (zh) 2018-10-17 2018-10-17 Simultaneous interpretation method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020077868A1 true WO2020077868A1 (zh) 2020-04-23

Family

ID=65547183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124800 WO2020077868A1 (zh) 2018-10-17 2018-12-28 Simultaneous interpretation method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109448698A (zh)
WO (1) WO2020077868A1 (zh)

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
CN110008481B (zh) * 2019-04-10 2023-04-28 南京魔盒信息科技有限公司 Translated speech generation method and apparatus, computer device, and storage medium
WO2021077333A1 (zh) * 2019-10-23 2021-04-29 深圳市欢太科技有限公司 Simultaneous interpretation method and apparatus, and storage medium
CN111144138A (zh) * 2019-12-17 2020-05-12 Oppo广东移动通信有限公司 Simultaneous interpretation method and apparatus, and storage medium
CN112818705B (zh) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 Multilingual speech translation system and method based on inter-group consensus
CN112818703B (zh) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 Multilingual consensus translation system and method based on multi-thread communication

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101340676A (zh) * 2008-08-21 2009-01-07 深圳华为通信技术有限公司 Method, apparatus, and mobile terminal for realizing simultaneous translation
US20100235161A1 (en) * 2009-03-11 2010-09-16 Samsung Electronics Co., Ltd. Simultaneous interpretation system
CN102693729A (zh) * 2012-05-15 2012-09-26 北京奥信通科技发展有限公司 Personalized voice reading method and system, and terminal having the system
CN107992485A (zh) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 Simultaneous interpretation method and apparatus
CN108447486A (zh) * 2018-02-28 2018-08-24 科大讯飞股份有限公司 Voice translation method and apparatus

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
KR100485909B1 (ko) * 2002-11-06 2005-04-29 삼성전자주식회사 Three-party call automatic interpretation system and method
CN101008942A (zh) * 2006-01-25 2007-08-01 北京金远见电脑技术有限公司 Machine translation apparatus and machine translation method
JP2009186820A (ja) * 2008-02-07 2009-08-20 Hitachi Ltd Speech processing system, speech processing program, and speech processing method
CN103559879B (zh) * 2013-11-08 2016-01-06 安徽科大讯飞信息科技股份有限公司 Acoustic feature extraction method and apparatus in a language identification system
CN106486125A (zh) * 2016-09-29 2017-03-08 安徽声讯信息技术有限公司 Simultaneous interpretation system based on speech recognition technology
CN108009159A (zh) * 2017-11-30 2018-05-08 上海与德科技有限公司 Simultaneous interpretation method and mobile terminal
CN108595443A (zh) * 2018-03-30 2018-09-28 浙江吉利控股集团有限公司 Simultaneous translation method and apparatus, intelligent vehicle-mounted terminal, and storage medium


Also Published As

Publication number Publication date
CN109448698A (zh) 2019-03-08

Similar Documents

Publication Publication Date Title
WO2020077868A1 (zh) Simultaneous interpretation method and apparatus, computer device, and storage medium
WO2019165748A1 (zh) Speech translation method and apparatus
CN114401438B (zh) Video generation method and apparatus for a virtual digital human, storage medium, and terminal
KR20220004737A (ko) Multilingual speech synthesis and cross-lingual voice cloning
KR20210106397A (ko) Voice conversion method, apparatus, and electronic device
JP2017058674A (ja) Apparatus and method for speech recognition, apparatus and method for learning conversion parameters, computer program, and electronic device
US11093110B1 (en) Messaging feedback mechanism
JP2017040919A (ja) Speech recognition apparatus, speech recognition method, and speech recognition system
CN109545183A (zh) Text processing method and apparatus, electronic device, and storage medium
CN111445898B (zh) Language identification method and apparatus, electronic device, and storage medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN110852075B (zh) Speech transcription method and apparatus with automatic punctuation, and readable storage medium
CN109256133A (zh) Voice interaction method, apparatus, device, and storage medium
US10714087B2 (en) Speech control for complex commands
KR102174922B1 (ko) Interactive sign-language-to-speech and speech-to-sign-language translation apparatus reflecting the user's emotion or intention
KR20240053639A (ko) Speaker-turn-based online speaker diarization using constrained spectral clustering
KR20200069727A (ko) Translation support system and method reflecting linguistic characteristic information based on interlocutor relationships
CN112463942A (zh) Text processing method and apparatus, electronic device, and computer-readable storage medium
US20190121860A1 (en) Conference And Call Center Speech To Text Machine Translation Engine
US8527270B2 (en) Method and apparatus for conducting an interactive dialogue
CN111354362A (zh) Method and apparatus for assisting communication of the hearing-impaired
JP2024529889A (ja) Robust direct speech-to-speech translation
Reddy et al. Indian sign language generation from live audio or text for tamil
TWI769520B (zh) Multilingual speech recognition and translation method and related system
CN113409761A (zh) Speech synthesis method and apparatus, electronic device, and computer-readable storage medium

Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18937271; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
32PN  Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06/08/2021))
122   Ep: pct application non-entry in european phase (Ref document number: 18937271; Country of ref document: EP; Kind code of ref document: A1)