WO2021077333A1 - Simultaneous interpretation method and device, and storage medium - Google Patents
Simultaneous interpretation method and device, and storage medium
- Publication number
- WO2021077333A1 (PCT application PCT/CN2019/112790)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- result
- translation
- scene
- model
- target
- Prior art date
Classifications
- G06F17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions (G: Physics; G06: Computing; G06F: Electric digital data processing)
- G10L15/00 — Speech recognition (G: Physics; G10: Musical instruments; Acoustics; G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding)
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
Definitions
- the embodiments of the present application relate to the field of speech processing technology, and in particular, to a simultaneous interpretation method and device, and a storage medium.
- Artificial Intelligence (AI) simultaneous interpretation can recognize the collected voice and obtain the voice recognition result.
- the translation model is used to translate the voice recognition result of the voice data to be interpreted to obtain the target translation result.
- the speech synthesis model synthesizes the target translation result into the corresponding speech.
- AI simultaneous interpretation can be used not only at international conferences, product launches, and other events, but also in people's daily life. For example, at work, AI simultaneous interpretation can be used for technology sharing sessions or video conferences; in daily life, it can meet relevant needs in social or travel scenarios. However, in existing AI simultaneous interpretation, the interpretation mode is fixed and single, and the accuracy of the interpretation results is low.
- the embodiments of the present application expect to provide a simultaneous interpretation method and device, and a storage medium.
- a method of simultaneous interpretation including:
- the translation synthesis model is a model corresponding to the source language and the target language, and the source language is the language of the voice data to be interpreted;
- the identifying of the actual application scenario corresponding to the voice data to be interpreted includes:
- the application scenario indicated by the target instruction is determined as the actual application scenario.
- the identifying of the actual application scenario corresponding to the voice data to be interpreted includes:
- Information processing is performed on the scene characterization information corresponding to the voice data to be simultaneously translated to obtain a processing result
- the scene characterization information includes at least one of the following: the speech recognition result and image video information;
- the processing result includes at least one of the following: a text classification result and a scene object recognition result;
- the scene characterization information includes the voice recognition result
- the information processing of the scene characterization information corresponding to the voice data to be simultaneously translated to obtain the processing result includes:
- the speech recognition result is classified according to a preset classification system or standard, and the text classification result is obtained.
- when the scene characterization information includes the image and video information, before the information processing of the scene characterization information corresponding to the voice data to be interpreted to obtain the processing result, the method further includes:
- the image and video information includes at least one of the following: scene video and scene image;
- the information processing of the scene characterization information corresponding to the voice data to be interpreted to obtain the processing result includes:
- the processing result includes the text classification result and the scene object recognition result
- the identifying the actual application scene according to the processing result includes:
- the actual application scenario is determined.
- the translation synthesis model includes a target translation model and a target synthesis model
- the determination of the translation synthesis model based on the actual application scenario includes:
- each of the multiple translation models is a model used to realize text conversion between the source language and the target language;
- each of the multiple speech synthesis models is a model used to perform speech synthesis on text in the target language.
- the use of the translation synthesis model to perform translation synthesis processing on the speech recognition result to obtain the simultaneous interpretation result includes:
- the embodiment of the present application provides a simultaneous interpretation device, which includes:
- the first recognition module is configured to perform voice recognition on the voice data to be interpreted to obtain a voice recognition result;
- the second recognition module is configured to recognize the actual application scene corresponding to the voice data to be interpreted;
- a model determination module configured to determine a translation synthesis model based on the actual application scenario; the translation synthesis model is a model corresponding to the original language and the target language, and the original language is the language category of the voice data to be simultaneously translated;
- the translation synthesis module is configured to use the translation synthesis model to perform translation synthesis processing on the speech recognition result to obtain a simultaneous interpretation result.
- the second identification module is configured to receive a target instruction; determine the application scenario indicated by the target instruction as the actual application scenario.
- the second recognition module is configured to perform information processing on the scene characterization information corresponding to the voice data to be simultaneously translated to obtain a processing result;
- the scene characterization information includes at least one of the following: the speech recognition result and the image and video information;
- the processing result includes at least one of the following: a text classification result and a scene object recognition result; and the actual application scene is identified according to the processing result.
- the scene characterization information includes the speech recognition result
- the second recognition module is configured to classify the speech recognition result according to a preset classification system or standard to obtain the text classification result.
- the scene characterization information includes the image and video information
- the second recognition module is configured to obtain the image and video information corresponding to the voice data to be simultaneously translated
- the image and video information includes at least one of the following: scene video and scene image; object recognition is performed on the image and video information to obtain the scene object recognition result.
- the processing result includes the text classification result and the scene object recognition result
- the second recognition module is configured to determine the first application scene according to the text classification result; determine the second application scene according to the scene object recognition result; and determine the actual application scene from the first application scene and the second application scene.
- the translation synthesis model includes a target translation model and a target synthesis model, and the model determination module is configured to determine the target translation model according to the actual application scenario and the correspondence between multiple translation models and different application scenarios, each of the multiple translation models being a model for realizing text conversion between the source language and the target language; and to determine the target synthesis model according to the actual application scenario and the correspondence between multiple speech synthesis models and different application scenarios, each of the multiple speech synthesis models being a model for performing speech synthesis on text in the target language.
- the translation synthesis module is configured to use the target translation model to translate the speech recognition result from the source language to the target language to obtain the target translation result, and to use the target synthesis model to perform speech synthesis on the target translation result to obtain the simultaneous interpretation result.
- the embodiment of the present application provides a simultaneous interpretation device, the device includes a processor and a memory;
- the processor is configured to execute the simultaneous interpretation program stored in the memory to realize the above-mentioned simultaneous interpretation method.
- the embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the above simultaneous interpretation method is realized.
- in the technical solution of the embodiments of the present application, voice recognition is performed on the voice data to be interpreted to obtain the voice recognition result; the actual application scenario corresponding to the voice data to be interpreted is identified; and the translation synthesis model is determined based on the actual application scenario, the translation synthesis model being a model corresponding to the source language and the target language.
- the source language is the language of the voice data to be interpreted; using the translation synthesis model, the speech recognition result is translated and synthesized to obtain the simultaneous interpretation result.
- the technical solution provided by the embodiments of the present application recognizes the application scenarios of the voice data to be simultaneously interpreted, so that the corresponding model is used for simultaneous interpretation according to the determined application scenarios, and the accuracy and flexibility of the simultaneous interpretation results are improved.
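As a rough illustration only, the overall flow described by this solution could be sketched as follows; every component function and model name here is a hypothetical stub standing in for real ASR, scene-recognition, translation, and synthesis models, not the patented implementation.

```python
# Minimal sketch of the scenario-aware interpretation pipeline.
# All component functions are hypothetical stand-ins.

def recognize_speech(voice_data: bytes) -> str:
    """Stub ASR (S101): pretend the audio bytes are their own transcript."""
    return voice_data.decode("utf-8")

def identify_scenario(transcript: str) -> str:
    """Stub scene recognition (S102) based on the transcript alone."""
    if "department" in transcript.lower():
        return "hospital"
    return "general"

# Scenario -> (translation model, synthesis model) correspondence (cf. Tables 1/2).
MODELS = {
    "hospital": ("translation_model_3", "synthesis_model_3"),
    "general": ("translation_model_6", "synthesis_model_6"),
}

def interpret(voice_data: bytes) -> dict:
    transcript = recognize_speech(voice_data)              # S101: speech recognition
    scenario = identify_scenario(transcript)               # S102: scene identification
    translation_model, synthesis_model = MODELS[scenario]  # model selection
    return {"scenario": scenario,
            "translation_model": translation_model,
            "synthesis_model": synthesis_model}

result = interpret(b"Please visit the cardiology department")
```

The point of the sketch is only the data flow: the scenario identified in S102 drives which translation and synthesis models process the recognition result.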
- FIG. 1 is a first schematic flowchart of a simultaneous interpretation method provided by an embodiment of this application;
- FIG. 2 is a second schematic flowchart of a simultaneous interpretation method provided by an embodiment of this application;
- FIG. 3 is a schematic diagram of an exemplary speech processing process provided by an embodiment of this application;
- FIG. 4 is a first structural diagram of a simultaneous interpretation device provided by an embodiment of this application;
- FIG. 5 is a second structural diagram of a simultaneous interpretation device provided by an embodiment of this application.
- FIG. 1 is a first schematic flowchart of a simultaneous interpretation method provided by an embodiment of this application.
- the simultaneous interpretation method mainly includes the following steps:
- S101: Perform voice recognition on the voice data to be interpreted to obtain a voice recognition result.
- the simultaneous interpretation device may first perform voice recognition on the voice data to be simultaneously interpreted, so as to obtain the voice recognition result.
- the voice data to be simultaneously translated may be any voice that requires voice translation, for example, voice collected in real time in an application scenario.
- the voice data to be interpreted can be voices in any type of language.
- the specific voice data to be interpreted is not limited in this embodiment of the application.
- the voice data to be interpreted may be collected by a specific voice collection device and then transmitted to the simultaneous interpretation device for voice translation processing.
- the simultaneous interpretation device can be equipped with a voice collection device, so as to directly collect the voice data to be simultaneously interpreted.
- the specific source of the voice data to be interpreted is not limited in this embodiment of the application.
- the simultaneous interpretation device may use voice recognition technology, that is, convert the voice data to be interpreted through recognition and understanding, so as to obtain the voice recognition result.
- the voice recognition result is actually the language text of the voice data to be interpreted, which is not limited in the embodiment of the present application.
- the specific speech recognition process is the prior art, and will not be repeated here.
- the simultaneous interpretation device can identify the actual application scenario corresponding to the voice data to be simultaneously interpreted.
- application scenarios can be divided into large-scale international conferences, small-scale work conferences, public service places, public social places, social applications, and general scenarios.
- public service places can be waiting halls, government office halls, etc.
- public social places can be coffee shops, concert halls, etc.
- the actual application scenario corresponding to the voice data to be interpreted is actually the application scenario in which the voice data was collected.
- the specific actual application scenario is not limited in the embodiment of this application.
- the simultaneous interpretation device identifying the actual application scenario of the voice data to be interpreted includes: receiving a target instruction; and determining the application scenario indicated by the target instruction as the actual application scenario.
- when the user needs the simultaneous interpretation device to translate the voice data to be interpreted, the user can independently determine the actual application scenario corresponding to the voice data according to the environment in which it is collected.
- through a specific interactive interface or touch keys, the target instruction for indicating the actual application scenario is sent to the simultaneous interpretation device, and the simultaneous interpretation device can receive the target instruction and determine the actual application scenario according to it.
- the user observes that the application scenario for acquiring the voice data to be simultaneously interpreted is a large-scale international conference. Therefore, a target instruction indicating that the actual application scenario is a large-scale international conference can be sent to the simultaneous interpretation device.
- the simultaneous interpretation device receives the target instruction, it can be determined that the actual application scenario is a large-scale international conference.
- FIG. 2 is a schematic diagram of a process for identifying actual application scenarios provided by an embodiment of the application.
- the simultaneous interpretation device identifying the actual application scenario corresponding to the voice data to be simultaneously interpreted may further include the following steps:
- S201: Perform information processing on the scene characterization information corresponding to the voice data to be interpreted to obtain a processing result.
- the simultaneous interpretation device may perform information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted, and obtain the processing result.
- the scene characterization information corresponding to the voice data to be interpreted includes at least one of the following: voice recognition results and image and video information.
- the processing results include at least one of the following: text classification results and scene object recognition results.
- the scene characterization information includes a voice recognition result
- the simultaneous interpretation device performing information processing on the scene characterization information corresponding to the voice data to be interpreted to obtain the processing result includes: classifying the speech recognition result according to a preset classification system or standard to obtain the text classification result.
- a preset classification system or standard is stored in the simultaneous interpretation device, so that the speech recognition result can be classified according to the preset classification system or standard, and the text classification result can be obtained.
- the specific preset classification system or standard may be determined in advance according to actual needs, and is not limited in the embodiment of the present application.
- the simultaneous interpretation device can specifically retrieve specific keywords from the speech recognition result, and mark and count them to obtain the text classification result.
- for example, the simultaneous interpretation device retrieves the keyword "department" from the speech recognition result, marks and counts its occurrences, and obtains the text classification result.
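The keyword mark-and-count classification described above might look like the following minimal sketch; the keyword lists and the occurrence threshold are hypothetical stand-ins for the patent's preset classification system or standard.

```python
from collections import Counter

# Hypothetical preset classification standard: scenario -> trigger keywords.
CLASSIFICATION_STANDARD = {
    "hospital": ["department", "doctor", "patient"],
    "small work meeting": ["work summary", "agenda"],
}
FIRST_THRESHOLD = 2  # minimum keyword occurrences to accept a scenario

def classify_text(transcript: str) -> str:
    """Count keyword hits per scenario; fall back to the general scene."""
    text = transcript.lower()
    counts = Counter()
    for scenario, keywords in CLASSIFICATION_STANDARD.items():
        for kw in keywords:
            counts[scenario] += text.count(kw)
    scenario, hits = counts.most_common(1)[0]
    return scenario if hits >= FIRST_THRESHOLD else "general"

label = classify_text("The surgery department and radiology department are on floor 3")
```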
- the scene characterization information includes image and video information
- when the scene characterization information includes image and video information, before the simultaneous interpretation device performs information processing on the scene characterization information corresponding to the voice data to be interpreted to obtain the processing result, the method further includes: obtaining the image and video information corresponding to the voice data to be interpreted; the image and video information includes at least one of the following: scene video and scene image.
- the simultaneous interpretation device performs information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted to obtain the processing result, including: performing object recognition on the image and video information to obtain the scene object recognition result.
- the scene image corresponding to the voice data to be interpreted is the image of the application scene captured when the voice data is collected;
- the scene video corresponding to the voice data to be interpreted is the video of the application scene captured when the voice data is collected.
- the scene image and/or scene video corresponding to the voice data to be interpreted can be obtained through a specific image acquisition device and transmitted to the simultaneous interpretation device, so that the simultaneous interpretation device obtains the scene image and/or scene video.
- the simultaneous interpretation device may also be equipped with an image acquisition device, so as to directly collect the scene images and/or scene videos corresponding to the voice data to be interpreted.
- the scene image and/or the scene video corresponding to the voice data to be interpreted can also be acquired in other ways, and the specific acquisition method is not limited in this embodiment of the application.
- the simultaneous interpretation device can use a specific recognition algorithm to identify people and objects from scene images and/or scene videos, and mark the names of the people and objects together with the corresponding confidences, so as to obtain the scene object recognition result.
- the specific object to be recognized and the specific algorithm used for recognition can be preset according to actual requirements, and the embodiment of the present application does not limit it.
- for example, the simultaneous interpretation device can recognize and mark the people and objects in the scene image and/or scene video, so as to obtain the scene object recognition result.
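As an illustration, the scene object recognition result described above (object names plus confidences) could be represented as below; the detector itself is a stub, since the patent does not fix a particular recognition algorithm.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    name: str          # e.g. "person", "desk"
    confidence: float  # detector confidence in [0, 1]

def recognize_objects(scene_image):
    """Stub detector: a real system would run an object-detection model
    on the scene image/video. Hypothetical fixed output for illustration."""
    return [DetectedObject("person", 0.97), DetectedObject("desk", 0.91)]

def names_above(detections, threshold=0.5):
    """Keep only the names of confidently detected objects."""
    return [d.name for d in detections if d.confidence >= threshold]

objs = names_above(recognize_objects(None))
```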
- the simultaneous interpretation device can identify the actual application scenario based on the processing result after performing information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted.
- when the processing result obtained by the simultaneous interpretation device includes both the text classification result and the scene object recognition result, it can analyze the two results together, thereby identifying the actual application scene.
- the simultaneous interpretation device can analyze the two results separately, identify two application scenarios, and further combine the two application scenarios to determine the actual application scenario.
- the specific method for determining the actual application scenario is not limited in the embodiment of this application.
- for example, the text classification result is that the speech recognition result of the voice data to be interpreted includes the keyword "department", and the number of occurrences reaches the first threshold.
- the scene object recognition result is that multiple individuals are all wearing medical clothing. Therefore, the simultaneous interpretation device can identify the actual application scene as a hospital based on these two results.
- the processing result includes the text classification result and the scene object recognition result
- the simultaneous interpretation device recognizing the actual application scene according to the processing result may include: identifying the first application scene based on the text classification result; identifying the second application scene according to the scene object recognition result; and determining the actual application scene from the first application scene and the second application scene.
- that is, the first application scene can be identified according to the text classification result, the second application scene can be identified according to the scene object recognition result, and the actual application scene is determined from the first application scene and the second application scene.
- if the first application scene and the second application scene are the same, the simultaneous interpretation device can determine that scene as the actual application scene. If the first application scene and the second application scene are two different application scenes, the simultaneous interpretation device can select one of them as the actual application scene according to preset selection rules.
- the simultaneous interpretation device may store a preset selection rule, and the selection rule may be determined according to the accuracy of the text classification result and the scene object recognition result.
- it can also be determined according to other actual needs, which is not limited in the embodiment of the present application.
- for example, the preset selection rule may be: if the first application scene identified according to the text classification result is different from the second application scene identified according to the scene object recognition result, the application scene with the broader scope is selected as the actual application scene.
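The example selection rule above, preferring the broader-scope scenario when the two identified scenes disagree, can be sketched as follows; the scope ranking values are hypothetical, as the patent leaves the concrete rule to actual needs.

```python
# Hypothetical scope ranking: higher value = broader application scenario.
SCOPE_RANK = {
    "general": 5,
    "large-scale international conference": 4,
    "public service place": 3,
    "public social place": 2,
    "small work meeting": 1,
}

def select_actual_scenario(first: str, second: str) -> str:
    """If the two identified scenarios agree, use that scenario;
    otherwise fall back to the one with the broader scope."""
    if first == second:
        return first
    return first if SCOPE_RANK[first] >= SCOPE_RANK[second] else second

chosen = select_actual_scenario("small work meeting", "public service place")
```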
- for example, the text classification result obtained by the simultaneous interpretation device is that the speech recognition result includes the keyword "work summary" and its number of occurrences is greater than the first threshold; therefore, the first application scene is determined to be a small work meeting.
- the scene object recognition result obtained by the simultaneous interpretation device is that the scene video includes a desk, an office chair, and a person wearing a badge.
- correspondingly, the second application scene is recognized as a small work meeting. That is, the application scenes determined by the simultaneous interpretation device using the text classification result and the scene object recognition result are the same, and therefore the actual application scene is determined to be a small work meeting.
- the processing result obtained by the simultaneous interpretation device may also include any one of the text classification result and the scene object recognition result, so as to identify the actual application scene according to the result.
- if the simultaneous interpretation device only obtains the text classification result, the first application scene identified according to the text classification result is the actual application scene.
- if the simultaneous interpretation device only obtains the scene object recognition result, the second application scene identified according to the scene object recognition result is the actual application scene.
- the execution order of step S101 and step S102 is not limited by the embodiment of the present application.
- the translation synthesis model is a model corresponding to the source language and the target language, and the source language is the language of the voice data to be interpreted.
- when the simultaneous interpretation device recognizes the actual application scenario corresponding to the voice data to be interpreted, it may determine the corresponding translation synthesis model based on the actual application scenario.
- the original language is the language category of the voice data to be simultaneously translated.
- the target language is the language into which the user needs the voice data to be interpreted, and it can be preset according to actual needs.
- the translation synthesis model includes a target translation model and a target synthesis model.
- the simultaneous interpretation device determining the translation synthesis model based on the actual application scenario includes: determining the target translation model according to the actual application scenario and the correspondence between multiple translation models and different application scenarios, each of the multiple translation models being a model used to realize text conversion between the source language and the target language; and determining the target synthesis model according to the actual application scenario and the correspondence between multiple speech synthesis models and different application scenarios, each of the multiple speech synthesis models being a model for performing speech synthesis on text in the target language.
- the simultaneous interpretation device stores multiple language translation models and multiple speech synthesis models, and each translation model and speech synthesis model corresponds to an application scenario.
- multiple language translation models and multiple speech synthesis models are not limited in the embodiment of the present application.
- after the simultaneous interpretation device recognizes the actual application scenario corresponding to the voice data to be interpreted, it can search for the corresponding translation model in Table 1, thereby determining the found translation model as the target translation model.
- corresponding translation models can be pre-trained.
- for example, large-scale speech samples can be used to train a translation model for the general scene.
- the general scene is a scene without strong features; therefore, the speech samples used can be derived from various application scenarios. After that, speech samples from different application scenarios with strong characteristics, such as large-scale international conferences, are collected, and adaptive training is performed on the basis of the translation model corresponding to the general scene to obtain the corresponding translation models, so that different translation models have different translation modes and styles.
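A correspondence table like Table 1 might be stored as a simple mapping with a fallback to the general-scene model; the model names below are illustrative placeholders, not the patent's actual table entries.

```python
# Hypothetical correspondence table (cf. Table 1): scenario -> translation model.
TRANSLATION_MODELS = {
    "large-scale international conference": "language translation model 1",
    "small work meeting": "language translation model 2",
    "public service place": "language translation model 3",
    "public social place": "language translation model 4",
    "social application": "language translation model 5",
    "general": "language translation model 6",
}

def lookup_translation_model(scenario: str) -> str:
    # Unrecognized scenarios fall back to the general-scene model.
    return TRANSLATION_MODELS.get(scenario, TRANSLATION_MODELS["general"])

model = lookup_translation_model("large-scale international conference")
```

The same lookup pattern applies to the speech synthesis models in Table 2.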
- Table 2: correspondence between application scenarios and speech synthesis models:
  Large-scale international conference — Speech synthesis model 1;
  Small work meeting — Speech synthesis model 2;
  Public service place — Speech synthesis model 3;
  Public social place — Speech synthesis model 4;
  Social application — Speech synthesis model 5;
  General scene — Speech synthesis model 6.
- after the simultaneous interpretation device recognizes the actual application scenario corresponding to the voice data to be interpreted, it can search for the corresponding speech synthesis model in Table 2, so as to determine the found speech synthesis model as the target synthesis model.
- each speech synthesis model has a different speech synthesis style.
- for example, speech synthesis model 1 is trained so that it can synthesize speech with a serious, deep intonation;
- synthesis model 3 is trained for public service places and can synthesize sweet and lively speech.
- multiple translation models and multiple speech synthesis models may also be stored in the server, and the server may iteratively update these models on a regular basis.
- after the simultaneous interpretation device determines the target translation model and the target synthesis model, it can use the translation synthesis model to perform translation synthesis processing on the speech recognition result to obtain the simultaneous interpretation result.
- the simultaneous interpretation device using the translation synthesis model to perform translation synthesis processing on the speech recognition result to obtain the simultaneous interpretation result includes: using the target translation model to translate the speech recognition result from the source language to the target language to obtain the target translation result; and using the target synthesis model to perform speech synthesis on the target translation result to obtain the simultaneous interpretation result.
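The two-stage processing just described, translating first and then synthesizing, composes naturally as two function calls; both models here are hypothetical stubs that only tag their input, standing in for real translation and synthesis models.

```python
def translate(text: str, model: str) -> str:
    """Stub target translation model: tags the text instead of translating it."""
    return f"[{model}] {text}"

def synthesize(text: str, model: str) -> bytes:
    """Stub target synthesis model: returns placeholder 'audio' bytes."""
    return f"<{model} audio of: {text}>".encode("utf-8")

def translate_and_synthesize(recognition_result: str,
                             translation_model: str,
                             synthesis_model: str) -> bytes:
    # Stage 1: source-language text -> target-language text (target translation result).
    target_translation = translate(recognition_result, translation_model)
    # Stage 2: target-language text -> target-language speech (interpretation result).
    return synthesize(target_translation, synthesis_model)

audio = translate_and_synthesize("hello", "translation model 1", "synthesis model 1")
```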
- the target translation model can translate the speech recognition result into language text of the target language that conforms to the style of the actual application scenario; this language text is the target translation result.
- the specific target translation result is not limited in the embodiment of this application.
- for example, the simultaneous interpretation device determines that the actual application scenario is a large-scale international conference, and therefore finds language translation model 1 in Table 1 as the target translation model. The device can then use language translation model 1 to translate the speech recognition result of the voice data to be interpreted from the original language to the target language, obtaining the target translation result.
- the target translation result is the target-language text corresponding to the speech recognition result, and has a formal, written style that suits the actual application scenario.
- after the simultaneous interpretation device determines the target translation result, it can use the target synthesis model to perform speech synthesis on the target translation result to obtain the simultaneous interpretation result.
- in the related art, the speech synthesis mode is fixed and single. In the embodiment of the present application, the simultaneous interpretation device determines the target synthesis model of the target language according to the actual application scenario, and the target synthesis model can synthesize the target translation result into target-language speech that is better suited to the actual application scenario.
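The two-step translate-then-synthesize processing can be sketched as follows. The `translate` and `synthesize` callables are hypothetical stand-ins for the target translation model and the target synthesis model; the toy lambdas at the end exist only to make the sketch runnable:

```python
def translate_and_synthesize(recognition_result, translate, synthesize):
    """Apply the target translation model, then the target synthesis
    model, to turn a speech recognition result into interpreted speech."""
    target_translation_result = translate(recognition_result)  # original -> target-language text
    return synthesize(target_translation_result)               # target-language text -> speech

# Toy stand-ins: uppercasing plays the role of translation, and a
# tagged string plays the role of a synthesized waveform.
demo = translate_and_synthesize(
    "hello", lambda text: text.upper(), lambda text: "speech<" + text + ">"
)
```

Because the two stages are separate, the device can swap either model independently when the recognized scenario changes, which is exactly the flexibility the embodiment describes.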
- Fig. 3 is a schematic diagram of an exemplary speech processing process provided by an embodiment of the application.
- as shown in Fig. 3, the simultaneous interpretation device can first perform speech recognition on the voice data to be interpreted and classify the resulting text to obtain the text classification result. Object recognition can also be performed on the scene video corresponding to the voice data to be interpreted to obtain the scene object recognition result. The actual application scenario is then determined, and the translation synthesis model, that is, the target translation model and the target synthesis model, is determined according to the actual application scenario. The target translation model is used to translate the speech recognition result of the voice data to be interpreted, and finally the target synthesis model is used to synthesize the translation result to obtain the simultaneous interpretation result.
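The Fig. 3 flow described above can be summarized as a short pipeline. Every stage below (speech recognition, text classification, object recognition, scene fusion, translation, synthesis) is a toy stand-in invented for illustration, not an actual component of the application:

```python
# End-to-end sketch of the Fig. 3 flow; every stage is a toy stand-in.
def asr(voice_data):                  # speech recognition
    return voice_data["transcript"]

def classify_text(text):              # text classification result -> scene
    return "large_international_conference" if "delegates" in text else "general"

def recognize_objects(video):         # scene object recognition result -> scene
    return "large_international_conference" if "podium" in video else "general"

def fuse(text_scene, object_scene):   # determine the actual application scenario
    return text_scene if text_scene != "general" else object_scene

def interpret(voice_data, scene_video):
    recognition_result = asr(voice_data)
    scene = fuse(classify_text(recognition_result), recognize_objects(scene_video))
    translated = "[" + scene + "] " + recognition_result  # stand-in translation
    return "speech(" + translated + ")"                   # stand-in synthesis

out = interpret({"transcript": "the delegates convene"}, ["podium", "flags"])
```

The point of the sketch is the ordering: scene determination happens after recognition but before the translation and synthesis stages, so both of those stages can be chosen per scenario.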
- the simultaneous interpretation method performs speech recognition on the voice data to be interpreted to obtain the speech recognition result; recognizes the actual application scenario corresponding to the voice data to be interpreted; and determines the translation synthesis model based on the actual application scenario. The translation synthesis model is the model corresponding to the original language and the target language, and the original language is the language category of the voice data to be interpreted. The translation synthesis model is then used to translate and synthesize the speech recognition result to obtain the simultaneous interpretation result.
- the technical solution provided by the embodiments of the present application recognizes the application scenario of the voice data to be simultaneously interpreted, so that the corresponding model is used for simultaneous interpretation according to the determined application scenario, thereby improving the accuracy and flexibility of the simultaneous interpretation result.
- FIG. 4 is a first structural diagram of a simultaneous interpretation device provided by an embodiment of the application. As shown in Figure 4, the simultaneous interpretation device includes:
- the first recognition module 401 is configured to perform speech recognition on the voice data to be simultaneously interpreted to obtain a speech recognition result;
- the second recognition module 402 is configured to recognize the actual application scenario corresponding to the voice data to be simultaneously interpreted;
- the model determination module 403 is configured to determine a translation synthesis model based on the actual application scenario; the translation synthesis model is a model corresponding to the original language and the target language, and the original language is the language category of the voice data to be simultaneously interpreted;
- the translation synthesis module 404 is configured to use the translation synthesis model to perform translation synthesis processing on the speech recognition result to obtain a simultaneous interpretation result.
- the second recognition module 402 is configured to receive a target instruction and determine the application scenario indicated by the target instruction as the actual application scenario.
- the second recognition module 402 is configured to perform information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted to obtain a processing result; the scene characterization information includes at least one of the following: the speech recognition result and image and video information; the processing result includes at least one of the following: a text classification result and a scene object recognition result; and the actual application scenario is identified according to the processing result.
- when the scene characterization information includes the speech recognition result, the second recognition module 402 is configured to classify the speech recognition result according to a preset classification system or standard to obtain the text classification result.
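One way such a preset classification system could work is keyword matching against per-scenario vocabularies. The sketch below is purely an assumption: the keyword sets are invented for illustration, and a real system would more likely use a trained text classifier:

```python
# Hypothetical keyword taxonomy; a real system would use a trained classifier.
SCENE_KEYWORDS = {
    "large_international_conference": {"summit", "delegates", "resolution"},
    "public_service_place": {"ticket", "counter", "queue"},
}

def classify_text(recognition_result):
    """Map a speech recognition result to a scene by keyword overlap."""
    words = set(recognition_result.lower().split())
    best = max(SCENE_KEYWORDS, key=lambda scene: len(words & SCENE_KEYWORDS[scene]))
    # No keyword hits at all: fall back to the general scene.
    return best if words & SCENE_KEYWORDS[best] else "general"
```

The fallback to a general scene mirrors the role the general-scene models play in Tables 1 and 2.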
- when the scene characterization information includes the image and video information, the second recognition module 402 is configured to obtain the image and video information corresponding to the voice data to be simultaneously interpreted; the image and video information includes at least one of the following: a scene video and a scene image; and object recognition is performed on the image and video information to obtain the scene object recognition result.
- when the processing result includes the text classification result and the scene object recognition result, the second recognition module 402 is configured to determine the first application scene according to the text classification result, determine the second application scene according to the scene object recognition result, and determine the actual application scene from the first application scene and the second application scene.
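The application does not specify how the actual scene is chosen from the first and second application scenes. One plausible rule, sketched here purely as an assumption, prefers agreement, otherwise prefers the strong-featured scene over the general one, and falls back on per-source confidence scores only on a genuine conflict:

```python
def resolve_scene(text_scene, object_scene, text_conf=0.5, object_conf=0.5):
    """Hypothetical fusion of the text-derived (first) and
    object-derived (second) application scenes."""
    if text_scene == object_scene:   # both sources agree
        return text_scene
    if text_scene == "general":      # prefer the stronger-featured scene
        return object_scene
    if object_scene == "general":
        return text_scene
    # Genuine conflict between two strong-featured scenes: use confidence.
    return text_scene if text_conf >= object_conf else object_scene
```

Any such rule would live inside the second recognition module 402; the confidence parameters are invented for the sketch.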
- the translation synthesis model includes a target translation model and a target synthesis model; the model determination module 403 is configured to determine the target translation model according to the actual application scenario and the correspondence between multiple translation models and different application scenarios, each of the multiple translation models being a model for realizing text conversion between the original language and the target language; and to determine the target synthesis model according to the actual application scenario and the correspondence between multiple speech synthesis models and different application scenarios, each of the multiple speech synthesis models being a model for performing speech synthesis on text of the target language.
- the translation synthesis module 404 is configured to use the target translation model to translate the speech recognition result from the original language to the target language to obtain the target translation result, and to use the target synthesis model to perform speech synthesis on the target translation result to obtain the simultaneous interpretation result.
- the first recognition module 401, the second recognition module 402, the model determination module 403, and the translation synthesis module 404 may be implemented by a processor.
- when the simultaneous interpretation device provided in the above embodiment performs simultaneous interpretation, the division into the above program modules is used only as an example for illustration. In actual applications, the above processing can be allocated to different program modules as needed; that is, the internal structure of the device can be divided into different program modules to complete all or part of the processing described above.
- the simultaneous interpretation device provided in the foregoing embodiment and the simultaneous interpretation method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
- FIG. 5 is a second structural diagram of a simultaneous interpretation device provided by an embodiment of this application.
- the simultaneous interpretation device includes: a processor 501, a memory 502, and a communication bus 503;
- the communication bus 503 is configured to implement a communication connection between the processor 501 and the memory 502;
- the processor 501 is configured to execute the simultaneous interpretation program stored in the memory 502 to implement the foregoing simultaneous interpretation method.
- the embodiment of the application provides a simultaneous interpretation device, which performs speech recognition on the voice data to be interpreted to obtain the speech recognition result; recognizes the actual application scenario corresponding to the voice data to be interpreted; and determines the translation synthesis model based on the actual application scenario. The translation synthesis model is the model corresponding to the original language and the target language, and the original language is the language category of the voice data to be interpreted. Using the translation synthesis model, the speech recognition result is translated and synthesized to obtain the simultaneous interpretation result.
- the simultaneous interpretation device provided by the embodiment of the present application recognizes the application scenario of the voice data to be simultaneously interpreted, so that the corresponding model is used for simultaneous interpretation according to the determined application scenario, which improves the accuracy and flexibility of the simultaneous interpretation result.
- the embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by one or more processors, the above simultaneous interpretation method is implemented.
- the computer-readable storage medium may be a volatile memory, such as random-access memory (RAM), or a non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also be a device including one or any combination of the above memories, such as a mobile phone, computer, tablet device, or personal digital assistant.
- this application can be provided as methods, systems, or computer program products. Therefore, this application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
- these computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes in the schematic diagram and/or one or more blocks in the block diagram.
- these computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes in the schematic diagram and/or one or more blocks in the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
Description
Application scenario | Translation model |
Large-scale international conference | Translation model 1 |
Small working meeting | Translation model 2 |
Public service place | Translation model 3 |
Public social place | Translation model 4 |
Social applications | Translation model 5 |
General scene | Translation model 6 |
Application scenario | Speech synthesis model |
Large-scale international conference | Speech synthesis model 1 |
Small working meeting | Speech synthesis model 2 |
Public service place | Speech synthesis model 3 |
Public social place | Speech synthesis model 4 |
Social applications | Speech synthesis model 5 |
General scene | Speech synthesis model 6 |
Claims (11)
- A simultaneous interpretation method, comprising: performing speech recognition on voice data to be simultaneously interpreted to obtain a speech recognition result; recognizing an actual application scenario corresponding to the voice data to be simultaneously interpreted; determining a translation synthesis model based on the actual application scenario, the translation synthesis model being a model corresponding to an original language and a target language, and the original language being the language category of the voice data to be simultaneously interpreted; and performing translation synthesis processing on the speech recognition result by using the translation synthesis model to obtain a simultaneous interpretation result.
- The method according to claim 1, wherein recognizing the actual application scenario corresponding to the voice data to be simultaneously interpreted comprises: receiving a target instruction; and determining the application scenario indicated by the target instruction as the actual application scenario.
- The method according to claim 1, wherein recognizing the actual application scenario corresponding to the voice data to be simultaneously interpreted comprises: performing information processing on scene characterization information corresponding to the voice data to be simultaneously interpreted to obtain a processing result, the scene characterization information comprising at least one of the speech recognition result and image and video information, and the processing result comprising at least one of a text classification result and a scene object recognition result; and recognizing the actual application scenario according to the processing result.
- The method according to claim 3, wherein the scene characterization information comprises the speech recognition result, and performing information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted to obtain the processing result comprises: classifying the speech recognition result according to a preset classification system or standard to obtain the text classification result.
- The method according to claim 3, wherein the scene characterization information comprises the image and video information, and before performing information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted to obtain the processing result, the method further comprises: acquiring the image and video information corresponding to the voice data to be simultaneously interpreted, the image and video information comprising at least one of a scene video and a scene image; and correspondingly, performing information processing on the scene characterization information corresponding to the voice data to be simultaneously interpreted to obtain the processing result comprises: performing object recognition on the image and video information to obtain the scene object recognition result.
- The method according to claim 3, wherein the processing result comprises the text classification result and the scene object recognition result, and recognizing the actual application scenario according to the processing result comprises: recognizing a first application scenario according to the text classification result; recognizing a second application scenario according to the scene object recognition result; and determining the actual application scenario from the first application scenario and the second application scenario.
- The method according to claim 1, wherein the translation synthesis model comprises a target translation model and a target synthesis model, and determining the translation synthesis model based on the actual application scenario comprises: determining the target translation model according to the actual application scenario and a correspondence between multiple translation models and different application scenarios, each of the multiple translation models being a model for realizing text conversion between the original language and the target language; and determining the target synthesis model according to the actual application scenario and a correspondence between multiple speech synthesis models and different application scenarios, each of the multiple speech synthesis models being a model for performing speech synthesis on text of the target language.
- The method according to claim 7, wherein performing translation synthesis processing on the speech recognition result by using the translation synthesis model to obtain the simultaneous interpretation result comprises: translating the speech recognition result from the original language to the target language by using the target translation model to obtain a target translation result; and performing speech synthesis on the target translation result by using the target synthesis model to obtain the simultaneous interpretation result.
- A simultaneous interpretation device, comprising: a first recognition module configured to perform speech recognition on voice data to be simultaneously interpreted to obtain a speech recognition result; a second recognition module configured to recognize an actual application scenario corresponding to the voice data to be simultaneously interpreted; a model determination module configured to determine a translation synthesis model based on the actual application scenario, the translation synthesis model being a model corresponding to an original language and a target language, and the original language being the language category of the voice data to be simultaneously interpreted; and a translation synthesis module configured to perform translation synthesis processing on the speech recognition result by using the translation synthesis model to obtain a simultaneous interpretation result.
- A simultaneous interpretation device, comprising a processor and a memory, wherein the processor is configured to execute a simultaneous interpretation program stored in the memory to implement the simultaneous interpretation method according to any one of claims 1 to 8.
- A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the simultaneous interpretation method according to any one of claims 1 to 8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/112790 WO2021077333A1 (zh) | 2019-10-23 | 2019-10-23 | 同声传译方法及装置、存储介质 |
CN201980099626.3A CN114303187A (zh) | 2019-10-23 | 2019-10-23 | 同声传译方法及装置、存储介质 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/112790 WO2021077333A1 (zh) | 2019-10-23 | 2019-10-23 | 同声传译方法及装置、存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021077333A1 true WO2021077333A1 (zh) | 2021-04-29 |
Family
ID=75619575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/112790 WO2021077333A1 (zh) | 2019-10-23 | 2019-10-23 | 同声传译方法及装置、存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114303187A (zh) |
WO (1) | WO2021077333A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116227504A (zh) * | 2023-02-08 | 2023-06-06 | 广州数字未来文化科技有限公司 | 一种同传翻译的通讯方法、系统、设备及存储介质 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391839A (zh) * | 2014-11-13 | 2015-03-04 | 百度在线网络技术(北京)有限公司 | 机器翻译方法和装置 |
US20180293230A1 (en) * | 2018-06-14 | 2018-10-11 | Chun-Ai Tu | Multifunction simultaneous interpretation device |
CN109448698A (zh) * | 2018-10-17 | 2019-03-08 | 深圳壹账通智能科技有限公司 | 同声传译方法、装置、计算机设备和存储介质 |
CN109614628A (zh) * | 2018-11-16 | 2019-04-12 | 广州市讯飞樽鸿信息技术有限公司 | 一种基于智能硬件的翻译方法与翻译系统 |
-
2019
- 2019-10-23 WO PCT/CN2019/112790 patent/WO2021077333A1/zh active Application Filing
- 2019-10-23 CN CN201980099626.3A patent/CN114303187A/zh active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116227504A (zh) * | 2023-02-08 | 2023-06-06 | 广州数字未来文化科技有限公司 | 一种同传翻译的通讯方法、系统、设备及存储介质 |
CN116227504B (zh) * | 2023-02-08 | 2024-01-23 | 广州数字未来文化科技有限公司 | 一种同传翻译的通讯方法、系统、设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN114303187A (zh) | 2022-04-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19949937 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19949937 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.10.2022) |
|