CN115171645A - Dubbing method and device, electronic equipment and storage medium - Google Patents

Dubbing method and device, electronic equipment and storage medium

Info

Publication number
CN115171645A
Authority: CN (China)
Prior art keywords: audio, dubbing, sub, target, text
Legal status: Pending
Application number: CN202210772621.6A
Other languages: Chinese (zh)
Inventors: 刘坚, 李秋平, 王明轩
Current Assignee: Beijing Youzhuju Network Technology Co Ltd
Original Assignee: Beijing Youzhuju Network Technology Co Ltd
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority: CN202210772621.6A
Publication: CN115171645A (pending)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G06T5/80
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236 - Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 - Details of television systems
    • H04N5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265 - Mixing

Abstract

The present disclosure relates to a dubbing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a final translated text corresponding to the audio/video to be dubbed; generating a first dubbing audio based on the final translated text and a cross-language TTS model, where the cross-language TTS model is obtained by learning from the original text and original audio corresponding to the audio/video to be dubbed, the first dubbing audio contains target voice characteristic information, and the target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed; performing mouth-shape correction, based on the first dubbing audio, on the image information corresponding to the first dubbing audio in the video to be dubbed; and synthesizing the mouth-shape-corrected image information with the first dubbing audio to obtain the dubbed video. The method requires no human voice actors, the generated dubbing audio carries the character's voice characteristics, and the mouth shapes in the picture are consistent with the dubbing, which improves both dubbing efficiency and dubbing quality.

Description

Dubbing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of dubbing technologies, and in particular, to a dubbing method and apparatus, an electronic device, and a storage medium.
Background
To broaden the domestic reach of overseas film and television works, the language of the content needs to be localized, mainly in one of two ways: "translated dubbing" or "original sound plus subtitles". Translated dubbing greatly reduces the difficulty for the audience of understanding the work and is therefore popular with viewers.
At present, translated dubbing mainly relies on manual dubbing by voice actors, which entails long production cycles and high costs, and the dubbing quality of many overseas film and television works is poor under the influence of the recording equipment, the recording venue, the available dubbing time, and the skill of the voice actors.
Disclosure of Invention
To solve, or at least partially solve, the above technical problem, the present disclosure provides a dubbing method and apparatus, an electronic device, and a storage medium.
In a first aspect, the present disclosure provides a dubbing method, including:
acquiring a final translated text corresponding to the audio/video to be dubbed;
generating a first dubbing audio based on the final translated text and a cross-language TTS model; the cross-language TTS model is obtained by learning from the original text and original audio corresponding to the audio/video to be dubbed, the first dubbing audio contains target voice characteristic information, and the target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed;
performing, based on the first dubbing audio, mouth-shape correction on the image information corresponding to the first dubbing audio in the audio/video to be dubbed;
and synthesizing the mouth-shape-corrected image information with the first dubbing audio to obtain a dubbed video.
In a second aspect, the present disclosure also provides a dubbing apparatus, including:
the acquisition module, configured to acquire a final translated text corresponding to the audio/video to be dubbed;
the generating module, configured to generate a first dubbing audio based on the final translated text and a cross-language TTS model; the cross-language TTS model is obtained by learning from the original text and original audio corresponding to the audio/video to be dubbed, the first dubbing audio contains target voice characteristic information, and the target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed;
the correction module, configured to perform, based on the first dubbing audio, mouth-shape correction on the image information corresponding to the first dubbing audio in the audio/video to be dubbed;
and the synthesis module, configured to synthesize the mouth-shape-corrected image information with the second dubbing audio to obtain a dubbed video.
In a third aspect, the present disclosure also provides an electronic device, including:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the dubbing method described above.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the dubbing method described above.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have the following advantages:
The technical solution provided by the embodiments of the present disclosure acquires a final translated text corresponding to the audio/video to be dubbed; generates a first dubbing audio based on the final translated text and a cross-language TTS model, where the cross-language TTS model is obtained by learning from the original text and original audio corresponding to the audio/video to be dubbed, the first dubbing audio contains target voice characteristic information, and the target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed; performs mouth-shape correction, based on the first dubbing audio, on the image information corresponding to the first dubbing audio in the video to be dubbed; and synthesizes the mouth-shape-corrected image information with the first dubbing audio to obtain the dubbed video. This provides a way to dub the audio/video to be dubbed automatically. No voice actors are required, the generated dubbing audio carries the characters' voice characteristics, and the mouth shapes in the picture match the dubbing, so dubbing efficiency is greatly improved, the time and cost of dubbing are reduced, and the whole process is unaffected by recording equipment, recording venues, available dubbing time, or the skill of voice actors, thereby improving dubbing quality.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
To more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below; obviously, other drawings can be derived from these drawings by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a dubbing method provided in an embodiment of the present disclosure;
fig. 2 is a flowchart of another dubbing method provided in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a translation proofreading interface provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another translation proofreading interface provided by an embodiment of the present disclosure;
fig. 5 is a flowchart of another dubbing method provided in an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a dubbing apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments of the present disclosure may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure; however, the present disclosure may also be practiced in other ways than those described herein. Obviously, the embodiments described in the specification are only some, rather than all, of the embodiments of the present disclosure.
Fig. 1 is a flowchart of a dubbing method provided in an embodiment of the present disclosure. This embodiment is applicable to the case where a client dubs a video. The method may be executed by a dubbing apparatus, which may be implemented in software and/or hardware and configured in an electronic device such as a terminal, including but not limited to a smartphone, a handheld computer, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, an all-in-one machine, a smart home device, and the like. Alternatively, this embodiment is applicable to dubbing videos on a server; in that case the dubbing apparatus may likewise be implemented in software and/or hardware and configured in an electronic device such as a server.
As shown in fig. 1, the method may specifically include:
and S110, acquiring a final translation text corresponding to the audio and video to be matched.
The audio and video to be dubbed refers to the video needing dubbing. Optionally, the audio/video to be distributed includes fast production of a movie, a documentary and the like, and also can be large-volume movie and television works, such as movies, television shows and the like.
The audio and video to be matched has an original text and a final translated text. The language of the original text is the same as the language of the audio and video to be matched, and the final translated text is obtained by translating the original text. And finally, the language of the translated text is different from that of the audio and video to be matched. The final translated text is used to generate first dubbed audio.
The different languages mean different types of languages, and the types of languages may be divided according to countries or regions, for example, chinese and english belong to different languages, korean and japanese belong to different languages, a sichuan dialect and a shanghai dialect belong to different languages, and english and american english belong to different languages.
Illustratively, the audio/video to be provided is a movie in english version, and the original text is a dialogue, self-white or voice-over of a character presented in text form using english. If one wishes to dub in chinese for this english version of the movie, the final translation text is a dialogue, self-white, or voice-over of the character presented in text in chinese.
It should be noted that, in practice, if the audio/video to be matched includes subtitle information, the original text is the subtitle information directly separated from the audio/video to be matched. If the audio and video to be matched does not comprise subtitle information, the original text is text information obtained by extracting voice and audio from the video and performing voice recognition on the basis of the extracted voice and audio.
There are various ways to implement this step, which should not be limited in this application. Illustratively, the implementation method of this step includes: acquiring an original text corresponding to the audio and video to be matched; the language of the original text is the same as the language of the audio and video to be matched; obtaining a final translation text based on the original text; the absolute value of the difference value between the number of the phonemes corresponding to the final translated text and the number of the target phonemes is smaller than or equal to a set threshold value; the target phoneme number is determined based on the duration of the audio information corresponding to the original text in the audio and video to be matched.
S120, generating a first dubbing audio based on the final translated text and a cross-language TTS model; the cross-language TTS model is obtained by learning from the original text and original audio corresponding to the video to be dubbed, the first dubbing audio contains target voice characteristic information, and the target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed.
A first dubbing audio is generated based on the final translated text; the first dubbing audio contains target voice characteristic information, which is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed.
The first dubbing audio is the speech expected to be heard after dubbing, i.e., the lines spoken by the characters of the audio/video to be dubbed. In practice, the first dubbing audio is the speech obtained by dubbing a character's dialogue, monologue, or voice-over in the video to be dubbed. The language of the first dubbing audio is the same as the language of the final translated text.
The target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed. Optionally, the target voice characteristic information includes at least one of the timbre feature information, the emotion feature information, and the speech-flow feature information of that character.
Based on the timbre feature information, the viewer can easily tell which character is speaking. Based on the emotion feature information, the viewer can readily feel the dramatic conflicts in the audio/video to be dubbed and become immersed in them.
The speech flow is the stream of spoken language formed by combining characters, words, and sentences to express meaning. Within a speech stream, the initials, finals, or tones of some syllables are affected by the neighboring phonemes of adjacent syllables; this phenomenon is called speech-flow sound change. Speech-flow sound changes include assimilation, dissimilation, weakening, and elision. Assimilation means that of two adjacent different sounds in the stream, one becomes identical or similar to the other in one or several features under the other's influence. Dissimilation means that of two originally identical or similar sounds, one for some reason becomes different from its original pronunciation. Weakening means that some sounds in the stream become weaker and lighter than they originally were. Elision means that some sounds in the stream are not pronounced or are lost because the stream is contracted. The speech-flow feature information includes at least one of assimilation, dissimilation, weakening, and elision.
Based on the speech-flow feature information, it is easier for the audience to distinguish which character is speaking.
S130, performing, based on the first dubbing audio, mouth-shape correction on the image information corresponding to the first dubbing audio in the video to be dubbed.
There are various ways to implement this step, which this application does not limit. Illustratively, one implementation includes: determining the phonemes corresponding to the first dubbing audio; determining the mouth shape corresponding to each phoneme; and correcting, using the mouth shape corresponding to each phoneme, the image information corresponding to the first dubbing audio in the video to be dubbed. A phoneme is the smallest unit of speech divided according to the natural attributes of speech: from an acoustic point of view, it is the smallest unit of speech divided by sound quality; from a physiological point of view, one articulatory action forms one phoneme. For example, [ma] contains the two articulatory actions [m] and [a], which are two phonemes.
In one embodiment, a mouth-shape database is constructed in advance, in which a set of phoneme-to-mouth-shape correspondences is stored. All phonemes corresponding to the first dubbing audio can be obtained by analyzing the first dubbing audio, and the timestamp corresponding to each phoneme can then be obtained in combination with the audio/video to be dubbed.
As those skilled in the art will understand, to achieve the dubbing effect, the first dubbing audio subsequently needs to be synthesized with the image information of the video to be dubbed. After synthesis, each phoneme in the first dubbing audio corresponds to a specific time, which is the timestamp of that phoneme. The mouth shape corresponding to each phoneme in the first dubbing audio can be obtained by querying the mouth-shape database, the image information corresponding to each phoneme can be located from its timestamp, and that image information is corrected using the corresponding mouth shape, yielding the corrected image information.
Illustratively, suppose the first phoneme [m] in the first dubbing audio has a certain timestamp and the mouth shape corresponding to the phoneme [m] is mouth shape M; the mouth shape in the image information at that timestamp is then corrected to mouth shape M. These steps are repeated until the image information in the video to be dubbed has been corrected for every phoneme in the first dubbing audio, finally yielding the mouth-shape-corrected image information.
Optionally, the mouth-shape database may be built by analyzing the audio of existing videos together with the lip shapes in their image information.
Optionally, in practice, a single common mouth-shape database may be constructed, i.e., whichever audio/video is being dubbed, the mouth shape corresponding to each phoneme in the first dubbing audio is obtained by querying this common database. Alternatively, a separate mouth-shape database may be established for each audio/video to be dubbed and queried during mouth-shape correction of that audio/video. Alternatively, a separate mouth-shape database may be established for each actor; during mouth-shape correction, the database corresponding to the actor playing the character to which the first dubbing audio belongs is queried.
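Purely by way of a non-limiting illustration, the phoneme-to-mouth-shape lookup and per-timestamp correction described above can be sketched as follows. All names (load_mouth_shape_db, phoneme_timestamps, apply_mouth_shape) are hypothetical placeholders and are not prescribed by this disclosure.

```python
from typing import Dict, List, Tuple

# All names below are illustrative placeholders, not part of this disclosure or any real library.

def load_mouth_shape_db() -> Dict[str, str]:
    # In practice this would be the pre-built phoneme -> mouth-shape database;
    # a tiny inline table stands in for it here.
    return {"m": "closed-lips", "a": "open-wide"}

def phoneme_timestamps(dubbing_wav: str) -> List[Tuple[str, float]]:
    # Would come from forced alignment of the first dubbing audio; stubbed out here.
    raise NotImplementedError("plug in a forced aligner of your choice")

def apply_mouth_shape(video_frames, time_s: float, shape: str) -> None:
    # Would re-render the lips of the frame(s) at time_s; stubbed out here.
    raise NotImplementedError("plug in a lip re-rendering model")

def correct_mouth_shapes(video_frames, dubbing_wav: str) -> None:
    mouth_db = load_mouth_shape_db()
    for phoneme, time_s in phoneme_timestamps(dubbing_wav):
        shape = mouth_db.get(phoneme)
        if shape is not None:           # leave the frame unchanged if no entry exists
            apply_mouth_shape(video_frames, time_s, shape)
```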
S140, synthesizing the mouth-shape-corrected image information with the first dubbing audio to obtain a dubbed video.
According to the technical solution of this embodiment, a final translated text corresponding to the audio/video to be dubbed is acquired; a first dubbing audio is generated based on the final translated text, where the first dubbing audio contains target voice characteristic information, i.e., the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed; mouth-shape correction is performed, based on the first dubbing audio, on the image information corresponding to the first dubbing audio in the video to be dubbed; and the mouth-shape-corrected image information is synthesized with the first dubbing audio to obtain the dubbed video. This provides a way to dub the video automatically. No voice actors are required, the generated dubbing audio carries the characters' voice characteristics, and the mouth shapes in the picture match the dubbing, so dubbing efficiency is greatly improved, the time and cost of dubbing are reduced, and the whole process is unaffected by recording equipment, recording venues, available dubbing time, or the skill of voice actors, thereby improving dubbing quality.
Fig. 2 is a flowchart of a method for implementing step S110 in fig. 1 according to an embodiment of the present disclosure. Referring to fig. 2, the method includes:
S111, acquiring the original text corresponding to the audio/video to be dubbed, the language of the original text being the same as the language of the audio/video to be dubbed; then executing S112.
The original text is the text on which the translation is based. In practice, if the audio/video to be dubbed contains subtitle information, the original text is the subtitle information directly separated from the audio/video to be dubbed; if it does not, the original text is the text obtained by extracting the audio from the audio/video to be dubbed and performing speech recognition on the extracted audio.
S112, obtaining an intermediate translated text based on the original text; then executing S113.
The intermediate translated text is the translation result obtained by translating the original text.
S113, judging whether the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is less than or equal to a set threshold; if yes, executing S114; if not, executing S115.
In this step, the notion of a phoneme is the same as that introduced above in the description of mouth-shape correction.
The number of phonemes corresponding to the intermediate translated text is the number of phonemes of the audio that would result from converting the intermediate translated text into audio; in other words, it is the total number of phonemes that need to be uttered when the intermediate translated text is read aloud.
The target number of phonemes is determined based on the duration, in the audio/video to be dubbed, of the audio information corresponding to the original text. Specifically, it is the number of phonemes that can be accommodated within that duration, where "can be accommodated" means the number of phonemes that can be uttered within the duration when dubbing at a preset speech rate. This application does not limit the preset speech rate, but it must be ensured that the user can understand the dubbed content at that rate.
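Purely by way of a non-limiting illustration, this phoneme-count check can be sketched as follows; the grapheme-to-phoneme helper and the preset speech rate value are assumptions, not part of this disclosure.

```python
from typing import List

def text_to_phonemes(text: str, language: str) -> List[str]:
    # Hypothetical grapheme-to-phoneme helper; any G2P / phonemizer tool could stand in here.
    raise NotImplementedError("plug in a G2P tool of your choice")

def target_phoneme_count(segment_duration_s: float, preset_rate_pps: float) -> int:
    # Number of phonemes the original audio segment "can accommodate"
    # at the preset speech rate (phonemes per second).
    return round(segment_duration_s * preset_rate_pps)

def translation_meets_target(translated_text: str, language: str,
                             segment_duration_s: float,
                             preset_rate_pps: float = 12.0,   # illustrative value only
                             threshold: int = 0) -> bool:
    n_phonemes = len(text_to_phonemes(translated_text, language))
    target = target_phoneme_count(segment_duration_s, preset_rate_pps)
    return abs(n_phonemes - target) <= threshold
```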
S114, taking the intermediate translated text as the final translated text.
If the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is less than or equal to the set threshold, the intermediate translated text is considered to meet the target and is used as the final translated text.
S115, modifying the intermediate translated text in response to a modification instruction for the intermediate translated text; then executing S113.
If the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is greater than the set threshold, the intermediate translated text is considered not to meet the target and must be modified further. After the modification, S113 is executed again, i.e., it is judged anew whether the modified intermediate translated text meets the target.
This technical solution ensures that the number of phonemes of the final translated text is appropriate, so that the number of phonemes of the first dubbing audio subsequently generated from it is also appropriate. When the first dubbing audio is synthesized with the corrected image information into a video, audio-visual synchronization is achieved while each character's speech rate remains natural; no character speaks too fast or too slow, which makes the video easier to understand and improves dubbing quality.
On the basis of the above technical solution, optionally, the method further includes: displaying a translation proofreading interface, where the translation proofreading interface includes the original text, the intermediate translated text, and proofreading target information; the proofreading target information indicates whether the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is less than or equal to the set threshold. The translation proofreading interface is a page that helps a proofreader proofread the intermediate translated text, and it allows the proofreader to edit and modify that text. Because the interface displays the original text, the intermediate translated text, and the proofreading target information together, the proofreader can identify the problems with the current intermediate translation from the proofreading target information and modify the intermediate translated text with reference to the original text, which achieves the proofreading purpose and reduces proofreading difficulty.
Optionally, after the intermediate translated text is modified in response to a modification instruction, the method further includes: updating the proofreading target information. This helps the proofreader see immediately whether the modified intermediate translated text still has problems.
On the basis of the above technical solutions, the original text includes one or more sub original texts, the intermediate translated text includes one or more sub intermediate translated texts, and the proofreading target information includes one or more pieces of sub proofreading target information. A sub original text, a sub intermediate translated text, and a piece of sub proofreading target information form a proofreading information group; within any proofreading information group, the three correspond to one another. The essence of this arrangement is to reduce the difficulty of proofreading the intermediate translated text by breaking the whole into parts. Optionally, in practice, each sentence is one sub original text.
In practice, the translation proofreading interface may display the original text, the intermediate translated text, and the proofreading target information in various ways, which this application does not limit. Two display methods are given below by way of example.
Method one
The translation proofreading interface includes a first area and a second area. The sub original texts are displayed in sequence in the first area along the vertical direction, and the sub intermediate translated texts are displayed in sequence in the second area along the vertical direction. Each piece of sub proofreading target information is also displayed in the second area, and the distance between its display position and the display position of the corresponding sub intermediate translated text is smaller than a set distance threshold. Within any proofreading information group, the sub original text and the sub intermediate translated text are displayed side by side horizontally.
That this distance is smaller than the set distance threshold means that each piece of sub proofreading target information is displayed close to its corresponding sub intermediate translated text, so that the proofreader can visually confirm the correspondence between them.
Fig. 3 is a schematic diagram of a translation proofreading interface according to an embodiment of the present disclosure. Referring to FIG. 3, the translation proofreading interface includes a first area A and a second area B. The original text 10 includes sub original text 11, sub original text 12, and sub original text 13, which are displayed in sequence in the first area A along the vertical direction. The intermediate translated text 20 includes sub intermediate translated text 21, sub intermediate translated text 22, and sub intermediate translated text 23, which are displayed in sequence in the second area B along the vertical direction. The proofreading target information 30 includes sub proofreading target information 31, sub proofreading target information 32, and sub proofreading target information 33, which are likewise displayed in sequence in the second area B along the vertical direction.
The sub original text 11, the sub intermediate translated text 21, and the sub proofreading target information 31 form one proofreading information group. The sub intermediate translated text 21 is the translation result of the sub original text 11, and the sub proofreading target information 31 indicates whether the absolute value of the difference between the number of phonemes corresponding to the sub intermediate translated text 21 and the target number of phonemes of this group is less than or equal to the set threshold. The target number of phonemes of this group is determined based on the duration, in the audio/video to be dubbed, of the audio information corresponding to the sub original text 11. The three items therefore correspond to one another.
The relationships among the sub original text 12, the sub intermediate translated text 22, and the sub proofreading target information 32, and among the sub original text 13, the sub intermediate translated text 23, and the sub proofreading target information 33, are similar and are not repeated here.
With continued reference to fig. 3, in any proofreading information group the sub original text and the sub intermediate translated text are displayed side by side horizontally, and the sub proofreading target information is located at the upper right corner of the corresponding sub intermediate translated text, which makes it easy for the proofreader to identify the problems with each sub intermediate translated text.
Method two
The translation proofreading interface includes a third area; all proofreading information groups are displayed in sequence in the third area along the vertical direction. Within any proofreading information group, the sub original text and the sub intermediate translated text are displayed one above the other for comparison, and the distance between the display position of each piece of sub proofreading target information and the display position of the corresponding sub intermediate translated text is smaller than the set distance threshold.
As before, this means that each piece of sub proofreading target information is displayed close to its corresponding sub intermediate translated text, so that the proofreader can intuitively see which piece of sub proofreading target information belongs to which sub intermediate translated text.
FIG. 4 is a schematic diagram of another translation proofreading interface provided by an embodiment of the present disclosure. Referring to FIG. 4, the interface includes a third area C, in which all proofreading information groups are arranged in sequence along the vertical direction, and within each group the sub original text and the sub intermediate translated text are arranged vertically. For example, the sub original text 11 and the sub intermediate translated text 21 of the first proofreading information group are arranged vertically, the sub original text 12 and the sub intermediate translated text 22 of the second group are arranged vertically, and the sub original text 13 and the sub intermediate translated text 23 of the third group are arranged vertically.
Each piece of sub proofreading target information is located at the upper right corner of the corresponding sub intermediate translated text. Illustratively, the sub proofreading target information 31 is located at the upper right corner of the corresponding sub intermediate translated text 21, so that the proofreader can easily identify the problems with each sub intermediate translated text and proofread it against the corresponding sub original text, which helps improve proofreading efficiency and accuracy.
In the above embodiments, each piece of sub proofreading target information may alternatively be displayed to the left of, to the right of, above, or below the corresponding sub intermediate translated text.
In one embodiment, within any proofreading information group, if the absolute value of the difference between the number of phonemes corresponding to the current sub intermediate translated text and the target number of phonemes of that group is less than or equal to the set threshold, the sub proofreading target information is in the target-achieved state; if that absolute value is greater than the set threshold, the sub proofreading target information is in the target-not-achieved state. The target number of phonemes of a proofreading information group is determined based on the sub original text in that group.
Illustratively, let the set threshold be 0. With continued reference to fig. 3 or fig. 4, in the first proofreading information group the difference between the number of phonemes of the sub intermediate translated text 21 and the target number of phonemes is 0, so the sub proofreading target information 31 is in the target-achieved state. In the second group the difference for the sub intermediate translated text 22 is -5, whose absolute value (5) is greater than 0, so the sub proofreading target information 32 is in the target-not-achieved state. In the third group the difference for the sub intermediate translated text 23 is +1, whose absolute value (1) is greater than 0, so the sub proofreading target information 33 is also in the target-not-achieved state.
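As a non-limiting sketch, the per-group status and its displayed content could be computed as follows; the names and the rendering are illustrative only and are not prescribed by this disclosure.

```python
from dataclasses import dataclass

@dataclass
class ProofreadingStatus:
    achieved: bool
    delta: int          # phonemes of sub intermediate translated text minus target phonemes

def group_status(sub_translation_phonemes: int, target_phonemes: int,
                 threshold: int = 0) -> ProofreadingStatus:
    delta = sub_translation_phonemes - target_phonemes
    return ProofreadingStatus(achieved=abs(delta) <= threshold, delta=delta)

def render(status: ProofreadingStatus) -> str:
    # Achieved groups could show a prompt such as a check mark; the others the signed
    # difference, mirroring the example values of Fig. 3/4 (0, -5, +1).
    return "√" if status.achieved else f"{status.delta:+d}"
```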
In one embodiment, a specific way to determine the target number of phonemes corresponding to a proofreading information group based on the sub original text in that group may include: obtaining, from the audio/video to be dubbed, the start time and the end time of the sub original text, and hence its duration. Since the duration is proportional to the number of phonemes it can accommodate, a functional relationship can be constructed in advance with the duration as the independent variable and the accommodatable number of phonemes as the dependent variable. From this functional relationship and the duration of the sub original text, the number of phonemes that can be accommodated within that duration, i.e., the target number of phonemes of the proofreading information group to which the sub original text belongs, can be obtained.
In one embodiment, the sub proofreading target information in the target-achieved state includes a target-achieved prompt, and the sub proofreading target information in the target-not-achieved state includes the difference between the number of phonemes corresponding to the current sub intermediate translated text and the target number of phonemes. With continued reference to fig. 3 or fig. 4, the target-achieved prompt is "√". This arrangement lets the proofreader quickly see which sub intermediate translated texts already meet the target and need no further modification, and which do not and must be modified further, thereby improving proofreading efficiency.
Alternatively, in practice, the sub proofreading target information in the target-not-achieved state may instead include the difference between the target number of phonemes and the number of phonemes corresponding to the current sub intermediate translated text.
In another embodiment, the background color of the sub proofreading target information in the target-achieved state differs from that in the target-not-achieved state; and/or the border color of the sub proofreading target information in the target-achieved state differs from that in the target-not-achieved state. This likewise lets the proofreader quickly see which sub intermediate translated texts meet the target and which still need modification, improving proofreading efficiency.
On the basis of the above technical solutions, optionally, referring to fig. 3 or fig. 4, the translation proofreading interface further includes an audio/video playing window D for playing the audio/video to be dubbed. This allows the proofreader to watch the video and/or listen to the audio while proofreading the sub intermediate translated texts, meeting diverse proofreading needs.
On the basis of the above technical solutions, optionally, the method further includes: determining a target duration corresponding to the first dubbing audio, and adjusting the duration of the first dubbing audio so that the absolute value of the difference between the duration of the first dubbing audio and the target duration is less than or equal to a preset time length. The target duration is the length of time the first dubbing audio is allowed to occupy when played in the audio/video to be dubbed. When the audio/video to be dubbed contains the original dubbing, the target duration is the time the character to which the first dubbing audio belongs spends speaking the original text in the audio/video to be dubbed.
Illustratively, suppose the audio/video to be dubbed contains a scene in which character A converses with character B, character A is being dubbed, and character A speaks "XXXXXX" in a first language, which serves as the original text. Translating "XXXXXX" yields the final translated text "YYYYYYYY" in a second language, and the first dubbing audio is generated based on "YYYYYYYY". The duration of the first dubbing audio is the length of time required to play it, and the target duration corresponding to the first dubbing audio is the length of time character A takes to say "XXXXXX" in the audio/video to be dubbed.
Adjusting the duration of the first dubbing audio includes, but is not limited to, increasing or decreasing its playback speed. Adjusting the duration so that the absolute value of the difference between the duration of the first dubbing audio and the target duration is less than or equal to the preset time length further achieves audio-visual synchronization.
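As a non-limiting illustration of such a duration adjustment by time-stretching, the sketch below assumes the librosa and soundfile packages are available; this disclosure does not prescribe any particular tool.

```python
import librosa
import soundfile as sf

def fit_dubbing_to_target(in_wav: str, out_wav: str, target_duration_s: float) -> None:
    y, sr = librosa.load(in_wav, sr=None)            # keep the original sample rate
    current_duration_s = len(y) / sr
    # rate > 1 speeds the audio up (shorter), rate < 1 slows it down (longer).
    rate = current_duration_s / target_duration_s
    y_stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(out_wav, y_stretched, sr)
```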
On the basis of the above technical solutions, optionally, the method further includes: determining a background-sound audio based on the audio/video to be dubbed; and synthesizing the background-sound audio with the first dubbing audio to obtain a second dubbing audio. S140 is then replaced by: synthesizing the mouth-shape-corrected image information with the second dubbing audio to obtain the dubbed video. The essence of this arrangement is that the dubbed video still contains the background sound of the original video, so no information is lost after dubbing and the dubbed video retains a high artistic effect.
Optionally, before the background-sound audio and the first dubbing audio are synthesized into the second dubbing audio, the background-sound audio may be optimized, for example by removing noise from it; the optimized background-sound audio is then synthesized with the first dubbing audio to obtain the second dubbing audio.
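As a non-limiting illustration, once the two tracks share a sample rate and have been length-aligned, synthesizing (mixing) the background-sound audio with the first dubbing audio can be sketched as a weighted sum of waveforms; the reverberation processing mentioned later in this description is omitted here.

```python
import numpy as np
import soundfile as sf

def mix_background_and_dubbing(background_wav: str, dubbing_wav: str, out_wav: str,
                               dubbing_gain: float = 1.0,
                               background_gain: float = 0.8) -> None:
    bg, sr_bg = sf.read(background_wav)
    dub, sr_dub = sf.read(dubbing_wav)
    assert sr_bg == sr_dub, "tracks are assumed to share one sample rate"
    # Assume both tracks use the same channel layout and were pre-aligned in time.
    n = min(len(bg), len(dub))
    mixed = background_gain * bg[:n] + dubbing_gain * dub[:n]
    mixed = np.clip(mixed, -1.0, 1.0)                # avoid clipping artifacts
    sf.write(out_wav, mixed, sr_bg)
```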
Fig. 5 is a flowchart of another dubbing method according to an embodiment of the present disclosure; it is a concrete example of the method of fig. 1. Referring to fig. 5, the dubbing method can be divided into three stages: background-sound processing, TTS model training, and dubbing mouth-shape simulation.
In the background-sound processing stage, the audio/video to be dubbed is acquired; the original audio in the video to be dubbed is separated by track into background-sound audio and human-voice audio, where the human-voice audio is the original audio; and the background-sound audio is optimized to obtain the optimized background-sound audio.
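This disclosure does not name a separation tool; as a non-limiting illustration, an off-the-shelf two-stem source separator such as Spleeter could perform the vocals/background split, roughly as sketched below (treat the exact API usage as an assumption).

```python
from spleeter.separator import Separator

def split_tracks(audio_path: str, out_dir: str) -> None:
    # The "2stems" model separates the mix into vocals (the original human-voice audio)
    # and accompaniment (used here as the background-sound audio).
    separator = Separator("spleeter:2stems")
    separator.separate_to_file(audio_path, out_dir)
    # Results land in <out_dir>/<track_name>/vocals.wav and accompaniment.wav.
```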
In the TTS model training stage, it is judged whether the audio/video to be dubbed contains subtitles. If it does, the subtitles are the original text; if it does not, speech recognition is performed on the original audio to obtain the original text. Machine learning is then performed on the original audio and the original text to obtain a cross-language TTS model that carries the voice characteristics of the character to which the original audio belongs.
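This disclosure does not specify a TTS framework; the sketch below is purely illustrative, with a hypothetical CrossLingualTTS class standing in for whatever voice-cloning, cross-language TTS is actually used. It is not a real library API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio_path: str   # a segment of the original (human-voice) audio
    text: str         # the corresponding line of the original text

class CrossLingualTTS:
    """Hypothetical cross-language TTS with voice cloning; placeholder only."""

    def fit(self, utterances: List[Utterance]) -> "CrossLingualTTS":
        # Learn the character's timbre/prosody from original audio + text pairs.
        raise NotImplementedError

    def synthesize(self, translated_text: str, language: str, out_wav: str) -> None:
        # Produce dubbing audio in the target language with the learned voice.
        raise NotImplementedError

# Sketched usage: train per character, then synthesize the first dubbing audio.
# model = CrossLingualTTS().fit(character_utterances)
# model.synthesize(final_translated_text, language="zh", out_wav="first_dubbing.wav")
```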
In the dubbing mouth-shape simulation stage, the original text is translated to obtain the intermediate translated text. It is then judged, sentence by sentence, whether the absolute value of the difference between the number of phonemes corresponding to the current sentence of the intermediate translated text and the target number of phonemes for that sentence is less than or equal to the set threshold. If so, the intermediate translated text is taken as the final translated text, and the dubbing audio (i.e., the first dubbing audio) is generated based on the final translated text and the cross-language TTS model. If not, modification prompt information is output to indicate that the absolute value of the difference for the current sentence exceeds the set threshold, so that a translation proofreader can modify the intermediate translated text accordingly; after the modification, the judgment is made again, and this is repeated until the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is less than or equal to the set threshold.
It is then judged whether the absolute value of the difference between the duration of the dubbing audio (i.e., the first dubbing audio) and the target duration (the length of the audio track corresponding to the picture that the dubbing audio accompanies in the audio/video to be dubbed) is less than or equal to the preset time length. If so, the duration of the dubbing audio is left unchanged; if not, the duration is adjusted until the condition is met. Once the condition is met, reverberation processing is performed on the dubbing audio together with the optimized background-sound audio to obtain the second dubbing audio.
Finally, the mouth-shape database is queried to determine the mouth shape corresponding to each phoneme in the dubbing audio (i.e., the first dubbing audio); the image information corresponding to the dubbing audio in the video to be dubbed is mouth-shape-corrected using those mouth shapes; and the mouth-shape-corrected image information is synthesized with the second dubbing audio to obtain the dubbed video.
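As a non-limiting illustration of the final synthesis step, the mouth-shape-corrected picture can be muxed with the second dubbing audio using a stock tool such as ffmpeg; the file names below are placeholders.

```python
import subprocess

def mux_video_and_audio(corrected_video: str, second_dubbing_audio: str, out_video: str) -> None:
    # Copy the (already mouth-shape-corrected) video stream and replace the
    # audio stream with the second dubbing audio, re-encoded as AAC.
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", corrected_video,
         "-i", second_dubbing_audio,
         "-map", "0:v", "-map", "1:a",
         "-c:v", "copy", "-c:a", "aac",
         "-shortest", out_video],
        check=True,
    )
```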
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined actions, but those skilled in the art will recognize that the present disclosure is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art will also appreciate that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily all required by the present disclosure.
Fig. 6 is a schematic structural diagram of a dubbing apparatus in an embodiment of the present disclosure. The dubbing apparatus provided by the embodiments of the present disclosure may be configured in a client or in a server. Referring to fig. 6, the dubbing apparatus specifically includes:
the obtaining module 310, configured to obtain a final translated text corresponding to the audio/video to be dubbed;
a generating module 320, configured to generate a first dubbing audio based on the final translated text and a cross-language TTS model; the cross-language TTS model is obtained by learning from the original text and original audio corresponding to the audio/video to be dubbed, the first dubbing audio contains target voice characteristic information, and the target voice characteristic information is the voice characteristic information of the character to which the first dubbing audio belongs in the audio/video to be dubbed;
the correction module 330, configured to perform, based on the first dubbing audio, mouth-shape correction on the image information corresponding to the first dubbing audio in the audio/video to be dubbed;
and the synthesizing module 340, configured to synthesize the mouth-shape-corrected image information with the second dubbing audio to obtain a dubbed video.
Further, the obtaining module 310 is configured to:
acquire the original text corresponding to the audio/video to be dubbed, the language of the original text being the same as that of the audio/video to be dubbed;
and obtain the final translated text based on the original text, where the absolute value of the difference between the number of phonemes corresponding to the final translated text and the target number of phonemes is less than or equal to a set threshold, and the target number of phonemes is determined based on the duration, in the audio/video to be dubbed, of the audio information corresponding to the original text.
Further, the obtaining module 310 is configured to:
obtain an intermediate translated text based on the original text;
judge whether the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is less than or equal to the set threshold;
if so, take the intermediate translated text as the final translated text;
if not, modify the intermediate translated text in response to a modification instruction for the intermediate translated text.
Further, the apparatus also includes a display module, configured to:
display a translation proofreading interface, where the translation proofreading interface includes the original text, the intermediate translated text, and proofreading target information; the proofreading target information indicates whether the absolute value of the difference between the number of phonemes corresponding to the intermediate translated text and the target number of phonemes is less than or equal to the set threshold.
Further, the display module is further configured to update the proofreading target information after the intermediate translated text is modified in response to a modification instruction for the intermediate translated text.
Further, the original text includes one or more sub original texts; the intermediate translated text includes one or more sub intermediate translated texts; the proofreading target information includes one or more pieces of sub proofreading target information;
a sub original text, a sub intermediate translated text, and a piece of sub proofreading target information form a proofreading information group; within any proofreading information group, the sub original text, the sub intermediate translated text, and the sub proofreading target information correspond to one another.
Further, within any proofreading information group, if the absolute value of the difference between the number of phonemes corresponding to the current sub intermediate translated text and the target number of phonemes is less than or equal to the set threshold, the sub proofreading target information is in the target-achieved state; if that absolute value is greater than the set threshold, the sub proofreading target information is in the target-not-achieved state; the target number of phonemes is determined based on the sub original text in the proofreading information group.
Further, the sub-collation target information in the target achievement state includes a target achievement prompt;
the sub-proof target information in the target unachieved state comprises a difference value between the number of phonemes corresponding to the current sub-intermediate translation text and the target number of phonemes; or the sub proofreading target information in the target unachieved state includes a difference value between the target phoneme number and the phoneme number corresponding to the current sub intermediate translation text.
Further, a background color of the sub-collation target information in the target achieved state is different from a background color of the sub-collation target information in the target unachieved state; and/or,
the frame color of the sub-collation target information in the target achievement state is different from the frame color of the sub-collation target information in the target unachieved state.
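For illustration only (the field names, prompt wording and colors below are assumptions, not features recited above), one piece of sub-proofreading target information could be derived as follows:

```python
from dataclasses import dataclass

@dataclass
class SubProofTargetInfo:
    achieved: bool          # target achieved / target unachieved state
    prompt: str             # text shown to the proofreader
    background_color: str   # styling hint distinguishing the two states

def build_sub_proof_target_info(current_phonemes: int, target_phonemes: int,
                                threshold: int = 3) -> SubProofTargetInfo:
    diff = current_phonemes - target_phonemes
    if abs(diff) <= threshold:
        # Target-achieved state: show an achievement prompt.
        return SubProofTargetInfo(True, "target achieved", "#e6ffe6")
    # Target-unachieved state: show how far the phoneme count is from the target,
    # e.g. "+4 phonemes" or "-2 phonemes", with a different background color.
    sign = "+" if diff > 0 else ""
    return SubProofTargetInfo(False, f"{sign}{diff} phonemes", "#ffe6e6")
```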
Further, the translation proofreading interface comprises a first area and a second area;
each sub original text is sequentially displayed in the first area along the vertical direction, and each sub intermediate translation text is sequentially displayed in the second area along the vertical direction; each piece of sub-proofreading target information is displayed in the second area, and the distance between the display position of each piece of sub-proofreading target information and the display position of the corresponding sub-intermediate translation text is smaller than a set distance threshold;
in any one of the proofreading information sets, the sub original text and the sub intermediate translation text are in a transverse contrast relationship.
Further, the translation proofreading interface comprises a third area; all the proofreading information groups are sequentially displayed in the third area along the vertical direction;
in any one of the proofreading information groups, the sub original text and the sub intermediate translation text are in a vertical contrast relationship;
and the distance between the display position of each piece of sub-proofreading target information and the display position of the corresponding sub-intermediate translation text is smaller than a set distance threshold value.
Further, the apparatus further comprises an adjusting module configured to:
determining a target duration corresponding to the first dubbed audio,
and adjusting the duration of the first dubbing audio so that the absolute value of the difference between the duration of the first dubbing audio and the target duration is less than or equal to a preset time length.
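One way this duration adjustment could be realized is by time-stretching the dubbing audio, sketched here with librosa's phase-vocoder stretch; the disclosure does not name any particular library, so the choice below is an assumption.

```python
import librosa
import soundfile as sf

def stretch_dubbing_to_target(in_path: str, out_path: str,
                              target_duration_s: float,
                              tolerance_s: float = 0.1) -> None:
    """Stretch or compress the first dubbing audio toward the target duration."""
    y, sr = librosa.load(in_path, sr=None)
    current = librosa.get_duration(y=y, sr=sr)
    if abs(current - target_duration_s) > tolerance_s:
        # rate > 1 shortens the audio, rate < 1 lengthens it; pitch is preserved.
        y = librosa.effects.time_stretch(y, rate=current / target_duration_s)
    sf.write(out_path, y, sr)
```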
Further, the synthesis module is further configured to:
determining a background sound audio based on the audio and video to be matched;
synthesizing the background sound audio and the first dubbing audio to obtain a second dubbing audio;
and synthesizing the image information with the corrected mouth shape with the second dubbing audio to obtain a dubbed video.
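A minimal sketch of this mixing-and-muxing step, assuming both tracks are WAV files at the same sample rate and that ffmpeg is available on the system; none of these assumptions come from the disclosure.

```python
import subprocess
import numpy as np
import soundfile as sf

def make_second_dubbing(background_path: str, dubbing_path: str,
                        mixed_path: str, bg_gain: float = 0.6) -> None:
    """Mix the background sound audio with the first dubbing audio."""
    bg, sr = sf.read(background_path)
    dub, sr2 = sf.read(dubbing_path)
    assert sr == sr2, "both tracks are assumed to share one sample rate"
    n = min(len(bg), len(dub))
    mixed = np.clip(bg_gain * bg[:n] + dub[:n], -1.0, 1.0)
    sf.write(mixed_path, mixed, sr)

def mux_video_with_audio(video_path: str, audio_path: str, out_path: str) -> None:
    """Replace the video's audio track with the second dubbing audio via ffmpeg."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
                    "-map", "0:v", "-map", "1:a", "-c:v", "copy", out_path],
                   check=True)
```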
Further, the correction module is further configured to:
determining phonemes comprised by the first dubbing audio;
determining a mouth shape corresponding to each phoneme;
and correcting the mouth shape of the image information corresponding to the first dubbing audio in the audio and video to be dubbed by using the mouth shape corresponding to each phoneme.
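As a hedged illustration of the phoneme-to-mouth-shape step (the phoneme labels, viseme names and alignment source below are assumptions), the correction can be planned as a simple lookup from time-aligned phonemes to visemes:

```python
# Hypothetical phoneme-to-viseme lookup used to plan the mouth-shape correction.
PHONEME_TO_VISEME = {
    "AA": "open", "AE": "open", "IY": "smile", "UW": "round",
    "M": "closed", "B": "closed", "P": "closed", "F": "teeth-on-lip",
}

def viseme_track(aligned_phonemes):
    """Turn (phoneme, start_s, end_s) triples into a per-interval mouth-shape plan."""
    return [(start, end, PHONEME_TO_VISEME.get(ph, "neutral"))
            for ph, start, end in aligned_phonemes]

# Example: a forced-alignment step (not shown here) would supply the triples.
print(viseme_track([("M", 0.00, 0.08), ("AA", 0.08, 0.25)]))
```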
The dubbing apparatus provided in the embodiment of the present disclosure is capable of executing steps executed by a client or a server in the dubbing method provided in the embodiment of the present disclosure, and has execution steps and beneficial effects, which are not described herein again.
Fig. 7 is a schematic structural diagram of an electronic device in an embodiment of the disclosure. Referring now specifically to fig. 7, a schematic diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device 1000 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), wearable electronic devices, and the like, and fixed terminals such as digital TVs, desktop computers, smart home devices, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003 to implement a dubbing method of an embodiment as described in the present disclosure. In the RAM 1003, various programs and information necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange information. While fig. 7 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart, thereby implementing the dubbing method as described above. In such an embodiment, the computer program may be downloaded and installed from the network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include an information signal propagated in baseband or as part of a carrier wave, in which computer readable program code is carried. Such a propagated information signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital information communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring a final translation text corresponding to the audio and video to be matched;
generating a first dubbing audio based on the final translated text and the cross-language TTS model; the cross-language TTS model is obtained based on learning of an original text and an original audio corresponding to the audio and video to be matched, the first dubbing audio comprises target voice characteristic information, and the target voice characteristic information is the voice characteristic information of a role to which the first dubbing audio belongs in the audio and video to be matched;
based on the first dubbing audio, correcting the mouth shape of image information corresponding to the first dubbing audio in the audio and video to be dubbed;
and synthesizing the image information with the corrected mouth shape with the first dubbing audio to obtain a dubbed video.
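The four steps above can be read as one pipeline; the skeleton below is purely illustrative, with every callable standing in for a component described in the disclosure rather than a real API.

```python
from typing import Callable

def run_dubbing_pipeline(video_path: str,
                         get_final_translation: Callable[[str], str],
                         cross_lingual_tts: Callable[[str], bytes],
                         correct_mouth_shapes: Callable[[str, bytes], object],
                         synthesize: Callable[[object, bytes], str]) -> str:
    final_text = get_final_translation(video_path)                      # final translated text
    first_dubbing = cross_lingual_tts(final_text)                       # keeps the speaker's voice traits
    corrected_frames = correct_mouth_shapes(video_path, first_dubbing)  # mouth-shape correction
    return synthesize(corrected_frames, first_dubbing)                  # path of the dubbed video
```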
Optionally, when the one or more programs are executed by the electronic device, the electronic device may further perform other steps described in the above embodiments.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the dubbing processing methods provided by the present disclosure.
According to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing any of the dubbing processing methods as provided by the present disclosure.
The disclosed embodiments also provide a computer program product comprising a computer program or instructions which, when executed by a processor, implement the dubbing processing method as described above.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (17)

1. A dubbing method, comprising:
acquiring a final translation text corresponding to the audio and video to be matched;
generating a first dubbing audio based on the final translated text and a cross-lingual TTS model; the cross-language TTS model is obtained based on learning of an original text and an original audio corresponding to the audio and video to be matched, the first dubbing audio comprises target voice characteristic information, and the target voice characteristic information is the voice characteristic information of a role to which the first dubbing audio belongs in the audio and video to be matched;
based on the first dubbing audio, correcting the mouth shape of image information corresponding to the first dubbing audio in the audio and video to be dubbed;
and synthesizing the image information with the corrected mouth shape with the first dubbing audio to obtain a dubbed video.
2. The dubbing method according to claim 1, wherein the obtaining of the final translation text corresponding to the audio and video to be dubbed comprises:
acquiring an original text corresponding to the audio and video to be matched; the language of the original text is the same as the language of the audio and video to be matched;
obtaining a final translation text based on the original text; the absolute value of the difference value between the number of the phonemes corresponding to the final translated text and the target number of the phonemes is less than or equal to a set threshold value; and the target phoneme number is determined based on the duration of the audio information corresponding to the original text in the audio and video to be matched.
3. The dubbing method of claim 2, wherein the obtaining a final translation text based on the original text comprises:
obtaining an intermediate translation text based on the original text;
judging whether the absolute value of the difference value between the number of phonemes corresponding to the intermediate translation text and the number of target phonemes is less than or equal to a set threshold value or not;
if so, taking the intermediate translation text as the final translation text;
if not, responding to a modification instruction of the intermediate translation text, and modifying the intermediate translation text.
4. The dubbing method according to claim 3, further comprising:
displaying a translation proofreading interface, wherein the translation proofreading interface comprises the original text, the intermediate translation text and proofreading target information; the proofreading target information is used for indicating whether the absolute value of the difference value between the number of phonemes corresponding to the intermediate translation text and the target number of phonemes is smaller than or equal to a set threshold value.
5. The dubbing method of claim 4, wherein after the modifying the intermediate translation text in response to the modification instruction for the intermediate translation text, further comprising:
and updating the proofreading target information.
6. The dubbing method of claim 4,
the original text comprises one or more sub-original texts; the intermediate translation text comprises one or more sub intermediate translation texts; the collation target information includes one or more pieces of sub-collation target information;
one of the sub original texts, one of the sub intermediate translation texts and one of the sub collation target information form a collation information set; in any one of the collation information sets, the sub original text, the sub intermediate translation text, and the sub collation target information have a correspondence relationship.
7. The dubbing method of claim 6,
in any one of the proofreading information sets, if an absolute value of a difference between the number of phonemes corresponding to the current sub-intermediate translation text and the number of target phonemes is smaller than or equal to a set threshold, the sub-proofreading target information is in a target achievement state; if the absolute value of the difference value between the number of phonemes corresponding to the current sub-intermediate translation text and the number of target phonemes is larger than a set threshold value, the sub-proof target information is in a target unreachable state; the target phoneme number is determined based on the sub-original text in the collation information set.
8. The method of claim 7,
the sub-collation target information in the target achievement state includes a target achievement prompt;
the sub-proof target information in the target unachieved state comprises a difference value between the number of phonemes corresponding to the current sub-intermediate translation text and the target number of phonemes; or the sub-proof target information in the target unachieved state includes a difference between the target phoneme number and the phoneme number corresponding to the current sub-intermediate translation text.
9. The method of claim 7,
a background color of the sub-collation target information in the target achieved state is different from a background color of the sub-collation target information in the target unachieved state; and/or,
the frame color of the sub-collation target information in the target achievement state is different from the frame color of the sub-collation target information in the target unachieved state.
10. The method of claim 6, wherein the translation checking interface includes a first area and a second area;
each sub original text is sequentially displayed in the first area along the vertical direction, and each sub intermediate translation text is sequentially displayed in the second area along the vertical direction; each piece of sub-proofreading target information is displayed in the second area, and the distance between the display position of each piece of sub-proofreading target information and the display position of the corresponding sub-intermediate translation text is smaller than a set distance threshold;
in any one of the collation information sets, the sub original text and the sub intermediate translation text are in a transverse contrast relationship.
11. The method of claim 6, wherein the translation proofing interface includes a third area; all the proofreading information groups are sequentially displayed in the third area along the vertical direction;
in any one of the proofreading information groups, the sub original text and the sub intermediate translation text are in a vertical contrast relationship;
and the distance between the display position of each piece of sub-proofreading target information and the display position of the corresponding sub-intermediate translation text is smaller than a set distance threshold.
12. The dubbing method according to claim 1, further comprising:
determining a target duration corresponding to the first dubbed audio,
and adjusting the duration of the first dubbing audio so that the absolute value of the difference between the duration of the first dubbing audio and the target duration is less than or equal to a preset time length.
13. The dubbing method according to claim 1, further comprising:
determining a background sound audio based on the audio and video to be matched;
synthesizing the background sound audio and the first dubbing audio to obtain a second dubbing audio;
the synthesizing the image information with the corrected mouth shape with the first dubbing audio to obtain the dubbed video comprises:
and synthesizing the image information with the corrected mouth shape with the second dubbing audio to obtain a dubbed video.
14. The dubbing method according to claim 1, wherein the mouth shape correction of the image information corresponding to the first dubbing audio in the audio/video to be dubbed further comprises:
determining phonemes comprised by the first dubbing audio;
determining a mouth shape corresponding to each phoneme;
and correcting the mouth shape of the image information corresponding to the first dubbing audio in the audio and video to be dubbed by utilizing the mouth shape corresponding to each phoneme.
15. A dubbing apparatus, comprising:
the acquisition module is used for acquiring a final translation text corresponding to the audio and video to be matched;
the generating module is used for generating a first dubbing audio based on the final translated text and the cross-language TTS model; the cross-language TTS model is obtained based on learning of an original text and an original audio corresponding to the audio and video to be matched, the first dubbing audio comprises target voice characteristic information, and the target voice characteristic information is the voice characteristic information of a role to which the first dubbing audio belongs in the audio and video to be matched;
the correction module is used for correcting the mouth shape of the image information corresponding to the first dubbing audio in the audio and video to be dubbed on the basis of the first dubbing audio;
and the synthesis module is used for synthesizing the image information subjected to mouth shape correction and the first dubbing audio to obtain a dubbed video.
16. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-14.
17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-14.
CN202210772621.6A 2022-06-30 2022-06-30 Dubbing method and device, electronic equipment and storage medium Pending CN115171645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210772621.6A CN115171645A (en) 2022-06-30 2022-06-30 Dubbing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210772621.6A CN115171645A (en) 2022-06-30 2022-06-30 Dubbing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115171645A true CN115171645A (en) 2022-10-11

Family

ID=83489254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210772621.6A Pending CN115171645A (en) 2022-06-30 2022-06-30 Dubbing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115171645A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116248974A (en) * 2022-12-29 2023-06-09 南京硅基智能科技有限公司 Video language conversion method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination