CN115967840A - Method, equipment and device for generating multilingual video and readable storage medium - Google Patents

Method, equipment and device for generating multilingual video and readable storage medium

Info

Publication number
CN115967840A
Authority
CN
China
Prior art keywords
audio
target
segment
language
original
Prior art date
Legal status
Pending
Application number
CN202211356981.4A
Other languages
Chinese (zh)
Inventor
张广谱
邹文聪
Current Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202211356981.4A priority Critical patent/CN115967840A/en
Publication of CN115967840A publication Critical patent/CN115967840A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method, a device, an apparatus, and a readable storage medium for generating a multilingual video. The method extracts, from the original audio data, a first audio segment set consisting of the segments that carry linguistic text information in the original language; translates each first audio segment in the set into a second audio segment in a target language to obtain a second audio segment set; and, based on the correspondence between segments before and after translation, replaces each first audio segment in the original audio data with its corresponding second audio segment. The result is a video file that plays audio in the target language, meeting users' needs for spoken-language learning and listening practice in that language.

Description

Method, equipment and device for generating multilingual video and readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method, a device, and an apparatus for generating a multilingual video, and a readable storage medium.
Background
At present, foreign programs watched on televisions, computers, and similar devices are usually provided with translated subtitles or dubbed audio produced by manual translation, so that most viewers can follow the program content. However, as program types and network resources have diversified, users can now find the foreign program resources they want to watch (video clips, movies, television series, and so on) on their own, and most of these self-sourced resources come with neither dubbed translation nor translated subtitles, making them difficult for ordinary viewers to understand. Although some software on the market can translate subtitles, none translates the audio itself, which fails to satisfy users who want to learn spoken language and practice listening by watching videos.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a method, a device, an apparatus, and a readable storage medium for generating a multilingual video, and aims to solve the technical problem that existing translation software cannot meet the needs of users who want to learn spoken language and practice listening by watching videos.
In order to achieve the above object, the present invention provides a method for generating a multilingual video, wherein the method for generating the multilingual video includes:
determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
extracting a first audio fragment set of an original language from the original audio data;
translating the first audio clip set and generating a second audio clip set corresponding to the target language;
replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
and loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
Further, the step of translating the first audio clip set and generating a second audio clip set corresponding to the target language includes:
obtaining a first audio clip from the first audio clip set;
translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment;
obtaining a next first audio segment, and executing the step of translating the first audio segment based on multiple translation schemes to obtain multiple selectable audio segments of the target language until obtaining a second audio segment corresponding to each first audio segment in the first audio segment set;
and taking each second audio segment as the second audio segment set.
Further, after the step of screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment, the method further includes:
comparing the first duration of the first audio segment with the second duration of the corresponding second audio segment;
and if the first duration differs from the second duration, performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration, so as to make the second duration consistent with the first duration.
Further, the step of performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration includes:
when the first duration is greater than the second duration, inserting preset smooth audio frames into the second audio segment at a preset interval, wherein the number of inserted preset smooth audio frames is determined by the duration difference;
and when the first duration is less than the second duration, extracting translated audio frames from the second audio segment at a preset interval, wherein the number of extracted translated audio frames is determined by the duration difference.
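The frame supplement/extraction rule above can be sketched as follows, operating on a list of audio frames. The text does not specify the exact interval policy, so the even spacing here is an assumption, and the function names are illustrative, not from the patent:

```python
def equalize_frames(second_frames, first_count, make_smooth_frame):
    """Insert or drop frames at roughly even positions so the translated
    (second) segment ends up with exactly `first_count` frames.

    `make_smooth_frame(frames, pos)` supplies the smoothing frame to insert
    at position `pos` (for example, a copy or an average of the neighbours).
    """
    frames = list(second_frames)
    diff = first_count - len(frames)
    if diff > 0:  # second segment too short: insert smooth frames
        step = max(1, len(frames) // diff)
        for k in range(diff):
            pos = min((k + 1) * step + k, len(frames))
            frames.insert(pos, make_smooth_frame(frames, pos))
    elif diff < 0:  # second segment too long: drop translated frames
        step = max(1, len(frames) // (-diff))
        for k in range(-diff):
            del frames[min((k + 1) * step - k - 1, len(frames) - 1)]
    return frames
```

When `diff` is zero the segment is returned unchanged, matching the "durations already consistent" case.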
Further, before the step of inserting the preset smooth audio frames into the second audio segment at the preset interval, the method further includes:
acquiring the audio frames adjacent to the insertion position of each preset smooth audio frame in the second audio segment;
and taking an adjacent audio frame, or the average of the adjacent audio frames, as the preset smooth audio frame.
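The neighbour-averaging rule for building a preset smooth audio frame can be sketched as below, assuming frames are fixed-length sample arrays; the function name is illustrative:

```python
import numpy as np

def average_smooth_frame(frames, pos):
    """Build the preset smooth frame for insertion position `pos` as the
    mean of the two adjacent audio frames (falling back to a copy of the
    single neighbour at the segment edges)."""
    left = frames[max(pos - 1, 0)]
    right = frames[min(pos, len(frames) - 1)]
    return (np.asarray(left, dtype=float) + np.asarray(right, dtype=float)) / 2.0
```

Averaging the neighbours keeps the inserted frame close to the local waveform, so the padding is less audible than inserting silence.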
Further, the target video file further includes an original subtitle file, and after the step of determining the target video file and the target language based on the preset user operation, the method further includes:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
Further, the step of generating a target subtitle file based on a subtitle time axis in the original subtitle file and each of the second subtitle speech segments includes:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
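The subtitle steps above reuse the original subtitle time axis: each translated cue keeps the start/end timestamps of the cue it was translated from. A minimal sketch, under the assumption that cues are (start, end, text) triples and `translate` is any text-level translation function:

```python
def build_target_subtitles(original_cues, translate):
    """Generate the target subtitle file's cues by keeping each cue's
    timestamps (the original subtitle time axis) and translating its text,
    per the before/after correspondence."""
    return [(start, end, translate(text)) for start, end, text in original_cues]
```

Because the timeline is untouched, the translated subtitles stay synchronized with the video without any re-timing step.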
In addition, to achieve the above object, the present invention further provides a multilingual video-generating apparatus, including:
the acquisition module is used for determining a target video file and a target language based on preset user operation and acquiring original audio data of the target video file;
the extraction module is used for extracting a first audio clip set of an original language from the original audio data;
the translation module is used for translating the first audio fragment set and generating a second audio fragment set corresponding to the target language;
a replacing module, configured to replace, based on the correspondence before and after translation, each first audio segment in the first audio segment set in the original audio data with each second audio segment in the second audio segment set to obtain target audio data;
and the covering module is used for loading the target audio data to the target video file and covering the original audio data so as to generate a video file corresponding to the target language.
In addition, to achieve the above object, the present invention also provides a multilingual video generation apparatus, including: a memory, a processor, and a multilingual video generation program stored on the memory and executable on the processor, wherein the multilingual video generation program, when executed by the processor, implements the steps of the multilingual video generation method described above.
Further, to achieve the above object, the present invention provides a computer-readable storage medium having a multilingual video generation program stored thereon, which, when executed by a processor, implements the steps of the multilingual video generation method as described above.
The method, device, apparatus, and readable storage medium for generating a multilingual video determine a target video file and a target language based on a preset user operation, and acquire the original audio data of the target video file; extract a first audio segment set in the original language from the original audio data; translate the first audio segment set to generate a second audio segment set corresponding to the target language; replace each first audio segment in the original audio data with the corresponding second audio segment based on the correspondence before and after translation to obtain target audio data; and load the target audio data into the target video file to cover the original audio data and generate a video file corresponding to the target language. By extracting the first audio segment set from the segments carrying linguistic text information in the original language, translating each first audio segment into a second audio segment in the target language, and replacing each first audio segment in the original audio data with its corresponding second audio segment, a video file that plays audio in the target language is obtained, meeting users' needs for spoken-language learning and listening practice in the target language.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a multilingual video-generating method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a multilingual video-generating method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a multilingual video-generating method according to a third embodiment of the present invention;
FIG. 5 is a flowchart illustrating a multilingual video-generating method according to a fourth embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for the multilingual video generation method of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The device of the embodiment of the present invention may be a television, or an electronic terminal device with data receiving, sending, and processing functions, such as a smart phone, a PC, a tablet computer, or a portable computer.
As shown in fig. 1, the apparatus may include: a processor 1001, e.g. a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the device may also include a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, WiFi modules, and so forth. The sensors include, for example, light sensors and motion sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and the magnitude and direction of gravity when the terminal is stationary, and can be used for applications that recognize the attitude of the mobile terminal (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tapping); of course, the mobile device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described here again.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a multilingual video generation program therein.
In the device shown in fig. 1, the network interface 1004 is mainly used for connecting a backend server and communicating data with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the multilingual video generation program stored in the memory 1005, and perform the following operations:
determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
extracting a first audio fragment set of an original language from the original audio data;
translating the first audio clip set and generating a second audio clip set corresponding to the target language;
replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
and loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and further perform the following operations:
the step of translating the first audio segment set and generating a second audio segment set corresponding to the target language comprises:
obtaining a first audio clip from the first audio clip set;
translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language;
screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment;
obtaining a next first audio clip, and executing the step of translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language until obtaining a second audio clip corresponding to each first audio clip in the first audio clip set;
and taking each second audio segment as the second audio segment set.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and also perform the following operations:
after the step of screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment, the method further includes:
comparing the first duration of the first audio segment with the second duration of the corresponding second audio segment;
and if the first duration differs from the second duration, performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration, so as to make the second duration consistent with the first duration.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and also perform the following operations:
the step of performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration includes:
when the first duration is greater than the second duration, inserting preset smooth audio frames into the second audio segment at a preset interval, wherein the number of inserted preset smooth audio frames is determined by the duration difference;
and when the first duration is less than the second duration, extracting translated audio frames from the second audio segment at a preset interval, wherein the number of extracted translated audio frames is determined by the duration difference.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and further perform the following operations:
before the step of inserting the preset smooth audio frames into the second audio segment at the preset interval, the method further includes:
acquiring the audio frames adjacent to the insertion position of the preset smooth audio frame in the second audio segment;
and taking the adjacent audio frames or the average frames of the adjacent audio frames as the preset smooth audio frames.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and also perform the following operations:
the target video file further comprises an original subtitle file, and after the step of determining the target video file and the target language based on the preset user operation, the method further comprises:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and further perform the following operations:
the step of generating the target subtitle file based on the subtitle time axis in the original subtitle file and each second subtitle speech segment comprises the following steps:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
Referring to fig. 2, a method for generating a multilingual video according to a first embodiment of the present invention includes:
step S10, determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
In this embodiment, the execution body of the method for generating a multilingual video may be an intelligent device such as a television, a computer, or a mobile phone. The user can select the target video file to be translated and the target language to translate into, such as Chinese, English, Japanese, or Korean. The preset user operation is an interactive operation between the user and the intelligent device; for example, the user may select the target video file and the target language through a touch screen or a remote controller. A video file typically contains video data and audio data, and the original audio data is extracted separately from the target video file.
Step S20, extracting a first audio clip set of an original language from the original audio data;
Specifically, consider program videos (video clips, movies, or television series), especially those watched by users who want to learn spoken language and practice listening by watching videos. Such video files usually contain two types of sound: background sound, and speech that carries linguistic text information. The speech is the translatable part and the part useful for spoken-language learning and listening practice, so in this embodiment the portion of the original audio data that needs to be translated is the speech containing linguistic text information. Speech recognition technology can be used to extract the sound segments containing linguistic text information from the original audio data to obtain the first audio segment set, which corresponds to the language of the original video, such as English. To recognize these segments, a machine learning algorithm may be used, for example a binary classifier trained on samples labeled as background sound segments and samples labeled as segments containing linguistic text information; the classifier identifies the speech segments in the original audio data, and each recognized and extracted segment forms the first audio segment set.
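The embodiment leaves the segment detector open (speech recognition or a trained binary classifier). As a minimal illustrative stand-in for it — not the patent's classifier — a short-time-energy detector can mark the spans likely to contain speech; all names and thresholds here are assumptions:

```python
import numpy as np

def extract_speech_segments(samples, sr, frame_ms=30, energy_ratio=4.0):
    """Return (start_s, end_s) spans whose short-time energy suggests speech.

    Frames whose energy exceeds `energy_ratio` times the median frame energy
    are treated as speech; adjacent speech frames are merged into segments.
    """
    hop = int(sr * frame_ms / 1000)
    n = len(samples) // hop
    energies = np.array([np.mean(samples[i * hop:(i + 1) * hop] ** 2)
                         for i in range(n)])
    threshold = energy_ratio * np.median(energies)
    is_speech = energies > threshold

    segments, start = [], None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, n * hop / sr))
    return segments
```

A real implementation would replace the energy heuristic with the trained classifier the text describes, but the segment-merging logic stays the same.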
Step S30, translating the first audio clip set and generating a second audio clip set corresponding to the target language;
It can be understood that the first audio segment set contains many sound segments carrying linguistic text information; a single such segment is a first audio segment (in the original language). Each first audio segment in the set is translated to obtain a corresponding second audio segment (in the target language), and the collection of second audio segments is the second audio segment set. The translation from a first audio segment in the original language to a second audio segment in the target language may use speech recognition on the first audio segment, or may translate the original subtitles of the original video file and then synthesize speech from the translation to obtain the second audio segment. In addition, the speech synthesis model can be configured with multiple voice models (such as a standard male voice and a standard female voice), and the target voice model can likewise be determined based on a preset user operation, to meet different users' preferences for voice type. Translation technology and AI (Artificial Intelligence) speech synthesis technology are both mature, so the detailed translation and synthesis process is not described here.
Step S40, replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
Specifically, according to the correspondence before and after translation, each first audio segment in the first audio segment set has a corresponding second audio segment in the second audio segment set. All first audio segments are cut out of the original audio data, and each second audio segment is then placed at the position of its corresponding first audio segment. For example, if first audio segment A is translated to obtain second audio segment A, first audio segment A is cut out of the original audio data and second audio segment A is placed at its position; replacing every first audio segment in this way yields the target audio data. It can be understood that the target audio data obtained at this point is audio data in the target language.
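The cut-and-splice replacement described above can be sketched as follows. Function and variable names are illustrative, and the second segments are assumed to have already been length-matched to the first segments:

```python
import numpy as np

def replace_segments(original, replacements, sr):
    """Splice each translated (second) segment into the timeline position of
    the first segment it was translated from.

    `replacements` maps (start_s, end_s) spans of first segments to
    equal-length arrays of translated samples, mirroring the patent's
    before/after correspondence."""
    out = original.copy()
    for (start_s, end_s), translated in replacements.items():
        a, b = int(start_s * sr), int(end_s * sr)
        assert len(translated) == b - a, "durations must already match"
        out[a:b] = translated
    return out
```

Because every span outside the replaced segments is copied unchanged, the background sound of the original audio is preserved.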
And S50, loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
Specifically, the original audio data in the target video file is covered by the target audio data to obtain the video file playing the target language audio, and the user can practice spoken language or hearing of the target language based on the video file playing the target language audio, so that the learning requirement of the user is met.
In this embodiment, a target video file and a target language are determined based on a preset user operation, and the original audio data of the target video file is acquired; a first audio segment set in the original language is extracted from the original audio data; the first audio segment set is translated to generate a second audio segment set corresponding to the target language; each first audio segment in the original audio data is replaced with the corresponding second audio segment based on the correspondence before and after translation to obtain the target audio data; and the target audio data is loaded into the target video file to cover the original audio data and generate a video file corresponding to the target language. By extracting the first audio segment set from the segments carrying linguistic text information in the original language, translating each first audio segment into a second audio segment in the target language, and replacing each first audio segment with its corresponding second audio segment, a video file that plays audio in the target language is obtained, meeting users' needs for spoken-language learning and listening practice in the target language.
Further, referring to fig. 3, a second embodiment of the multilingual video generation method of the present invention is proposed based on the first embodiment.
The step of translating the first audio segment set and generating a second audio segment set corresponding to the target language comprises:
step S310, acquiring a first audio clip from the first audio clip set;
Specifically, the first audio clip set includes a plurality of first audio clips, and a first audio clip is acquired from the set either arbitrarily or in order along the audio timeline.
Step S320, translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
Specifically, the multiple translation schemes refer to the multiple possible renderings available when an acquired first audio segment is translated into the target language. For example, suppose the original language of a first audio segment is English and its content is: "If China is to be a great nation, this dream must come true." With Chinese as the target language, the same first audio segment admits several Chinese renderings, which back-translate roughly as: "If China is to become a great country, this dream must be realized"; "If China is to become a great country, this dream will surely be realized"; and "If China is to become a great country, it must realize this dream." The selectable audio segments translated from the same first audio segment therefore share the same meaning, but their durations can differ because the renderings differ.
Step S330, selecting the selectable audio segment with the duration closest to the duration of the first audio segment from the selectable audio segments as a second audio segment corresponding to the first audio segment;
Specifically, continuing the example above, the first audio segment is "If China is to be a great nation, this dream must come true," with a duration of 5 s. The selectable audio segments produced by the multiple translation schemes are: selectable audio segment 1, "If China is to become a great country, this dream must be realized" (duration 4 s); selectable audio segment 2, "If China is to become a great country, this dream will surely be realized" (duration 5 s); and selectable audio segment 3, "If China is to become a great country, it must realize this dream" (duration 6 s). Since the 5 s duration of selectable audio segment 2 is closest to the 5 s duration of the first audio segment, selectable audio segment 2 is taken as the corresponding second audio segment. It can be understood that, in this embodiment, one first audio segment yields selectable translated segments of different durations under different translation schemes, and the one closest in duration to the first audio segment is used as its second audio segment; this avoids the problem that replacing a first audio segment with a second audio segment of a different duration would disturb the synchronized playback of video and audio.
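Step S330's duration-based screening reduces to a minimum-distance selection over the candidates. A small sketch using the 4 s / 5 s / 6 s example above; names are illustrative, not from the patent:

```python
def pick_second_segment(first_duration_s, candidates):
    """Select, from several translation candidates of the same segment, the
    one whose duration is closest to the first (original) segment's
    duration. `candidates` is a list of (translated_text, duration_s)."""
    return min(candidates, key=lambda c: abs(c[1] - first_duration_s))

candidates = [
    ("translation scheme 1", 4.0),
    ("translation scheme 2", 5.0),
    ("translation scheme 3", 6.0),
]
best = pick_second_segment(5.0, candidates)
# with a 5 s original segment, the 5 s candidate is selected
```

Ties (two candidates equally close) are not addressed by the text; `min` simply keeps the first one encountered.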
Step S340, acquiring a next first audio clip, and performing the step of translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language until a second audio clip corresponding to each first audio clip in the first audio clip set is obtained;
step S350, using each second audio clip as the second audio clip set.
Specifically, the next first audio segment is obtained from the first audio segment set, and the step of translating the first audio segment based on multiple translation schemes to obtain multiple selectable audio segments of the target language is performed again, until every first audio segment in the first audio segment set has a correspondingly generated second audio segment; the collection of these second audio segments is then used as the second audio segment set.
In this embodiment, for the same first audio segment, translated selectable audio segments of different durations may be obtained through different translation schemes; the selectable audio segment whose duration is closest to that of the first audio segment is taken as the corresponding second audio segment, and each first audio segment is processed in this way to obtain the second audio segment set. It can be understood that, when determining the second audio segment corresponding to a first audio segment, screening by duration keeps the two durations close, thereby avoiding the problem that a duration mismatch disrupts synchronous playback of video and audio when the second audio segment replaces the corresponding first audio segment in the original audio data.
Further, referring to fig. 4, a third embodiment of the multilingual video generation method of the present invention is proposed based on the second embodiment of the multilingual video generation method of the present invention.
After the step of selecting, from the selectable audio segments, the selectable audio segment whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment, the method includes:
step S331, comparing the first duration of the first audio segment with the second duration of the corresponding second audio segment;
step S332, if the first duration is different from the second duration, performing frame extraction or frame supplement processing on the second audio segment based on a time interval difference between the first duration and the second duration, so that the first duration and the second duration are consistent.
Further, when the first duration is greater than the second duration, preset smooth audio frames are inserted into the second audio segment at a preset interval, wherein the number of inserted preset smooth audio frames is determined by the time interval difference; and when the first duration is less than the second duration, translated audio frames are extracted from the second audio segment at a preset interval, wherein the number of extracted translated audio frames is likewise determined by the time interval difference.
Further, before the step of inserting the preset smooth audio frames into the second audio segment at the preset interval, the method includes: acquiring the audio frames adjacent to the insertion position of each preset smooth audio frame in the second audio segment; and taking one of the adjacent audio frames, or the average frame of the adjacent audio frames, as the preset smooth audio frame.
It can be understood that the duration of the screened second audio segment may still differ from that of the first audio segment; in that case, the second audio segment may be subjected to frame extraction or frame supplement processing. Specifically, the first duration of the first audio segment is compared with the second duration of the second audio segment. If the two differ, the video file that plays the target-language audio may still exhibit audio-video desynchronization during actual playback; conversely, if they are the same, no frame extraction or frame supplement processing is needed. Two cases arise. If the first duration is greater than the second duration, frame supplement processing is performed on the second audio segment: preset smooth audio frames are inserted into the second audio segment at a preset interval. The preset interval is an interval measured in audio frames, for example, one preset smooth audio frame inserted every b audio frames; its size can be freely set by a technician so that the preset smooth audio frames are distributed uniformly through the second audio segment. Further, each preset smooth audio frame may be generated from the frames adjacent to its insertion position. For example, if a preset smooth audio frame is to be inserted between audio frame A and audio frame B in the second audio segment, either audio frame A or audio frame B may be used directly, or the average frame of audio frames A and B may be used as the preset smooth audio frame.
The number of preset smooth audio frames to insert is derived from the time interval difference between the first duration and the second duration: it can be obtained by multiplying the time interval difference by the audio frame rate of the second audio segment. For example, if the time interval difference is 2 s and the audio frame rate is 25 frames per second, the insertion count is 50, so 50 preset smooth audio frames are inserted into the second audio segment to bring its second duration into line with the first duration of the first audio segment. If instead the first duration is less than the second duration, frame extraction processing is performed on the second audio segment: the audio frames of the second audio segment (the translated audio frames) are extracted at the preset interval. The number of translated audio frames to extract is likewise determined by the time interval difference, and the detailed process is not repeated here.
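The frame-count arithmetic and the uniform insertion of averaged smoothing frames described above can be sketched as follows. This is an illustrative toy (frames are modeled as plain numbers, and all function names are hypothetical), not the patent's implementation:

```python
# Number of frames to insert (positive) or drop (negative) is the
# duration difference multiplied by the audio frame rate.
def frames_to_adjust(first_duration, second_duration, frame_rate):
    return round((first_duration - second_duration) * frame_rate)

# Insert n_insert "smooth" frames at a uniform interval; each inserted
# frame is the average of the two frames adjacent to its position.
def pad_with_smooth_frames(frames, n_insert):
    if n_insert <= 0:
        return list(frames)
    interval = max(1, len(frames) // (n_insert + 1))
    out, inserted = [], 0
    for i, frame in enumerate(frames):
        out.append(frame)
        if inserted < n_insert and (i + 1) % interval == 0 and i + 1 < len(frames):
            out.append((frame + frames[i + 1]) / 2)  # average of neighbours
            inserted += 1
    return out

print(frames_to_adjust(7.0, 5.0, 25))  # 2 s short of target at 25 fps -> insert 50
print(frames_to_adjust(5.0, 6.0, 25))  # 1 s over target at 25 fps -> drop 25
print(pad_with_smooth_frames([0, 2, 4, 6, 8, 10], 2))
```

Real PCM frames would be arrays of samples rather than scalars, and production systems often prefer time-stretching (resampling without pitch change) over raw frame insertion, but the counting logic is the same.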
In this embodiment, when the second duration of the screened second audio segment still differs from the first duration of the corresponding first audio segment, frame extraction or frame supplement processing is performed on the second audio segment according to the difference between the two durations, so that the durations of the second audio segment and the first audio segment are kept consistent. This further prevents sound and picture from falling out of sync when the video file playing the target-language audio is actually played.
Further, referring to fig. 5, a fourth embodiment of the multilingual video generation method of the present invention is proposed based on the first embodiment of the multilingual video generation method of the present invention.
The target video file further comprises an original subtitle file, and after the step of determining the target video file and the target language based on the preset user operation, the method further comprises:
step S601, translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
step S602, generating a target caption file based on a caption time axis in an original caption file and each second caption speech segment;
Further, according to the correspondence before and after translation, each second caption language segment is set at the position on the timeline corresponding to its first caption language segment, to generate the target caption file.
Step S603, loading the target subtitle file to the target video file.
Specifically, the target video file usually further includes an original subtitle file, which is usually in the original language. Each first caption language segment in the original subtitle file is translated into a second caption language segment in the target language, so each first caption language segment has a corresponding second caption language segment (the correspondence before and after translation). The start and end time of each first caption language segment is determined by the subtitle timeline of the original subtitle file; that is, these times are recorded on the subtitle timeline. Each second caption language segment is then placed at the position of its corresponding first caption language segment on the subtitle timeline, thereby generating a target subtitle file that displays the target language. It should be noted that, when the video file is played, the original subtitles and the target subtitles may be displayed synchronously, assisting the user in spoken-language learning and listening practice.
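The subtitle step above keeps the original timeline and only swaps each cue's text. A minimal sketch, under the assumption that a cue is a (start, end, text) tuple and that a translation lookup is available (both the cue structure and the `translations` mapping are illustrative, not part of the patent):

```python
# Build target-language subtitles by reusing the original timeline:
# every translated cue appears at exactly the times the original did.

def build_target_subtitles(original_cues, translations):
    """original_cues: list of (start_s, end_s, text); translations: text -> text."""
    return [(start, end, translations[text]) for start, end, text in original_cues]

cues = [(0.0, 2.5, "hello"), (2.5, 5.0, "goodbye")]
target = build_target_subtitles(cues, {"hello": "你好", "goodbye": "再见"})
print(target)  # same timecodes, translated text
```

To show original and target subtitles simultaneously, as the passage suggests, a player could simply load both files, since their cues share identical timecodes.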
Further, referring to fig. 6, an embodiment of the present invention further provides a multilingual video generation apparatus 1000, where the multilingual video generation apparatus 1000 includes:
an obtaining module 100, configured to determine a target video file and a target language based on a preset user operation, and obtain original audio data of the target video file;
an extracting module 200, configured to extract a first audio clip set of an original language from the original audio data;
the translation module 300 is configured to translate the first audio clip set and generate a second audio clip set corresponding to the target language;
a replacing module 400, configured to replace, based on a correspondence relationship between before and after translation, each first audio segment in the first audio segment set in the original audio data with each second audio segment in the second audio segment set to obtain target audio data;
an overlay module 500, configured to load the target audio data into the target video file and overlay the original audio data to generate a video file corresponding to the target language.
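The five modules compose into a single pipeline. The following is a hypothetical sketch of that composition only; the audio here is a toy string, and every function name merely stands in for the module of the same name in the apparatus above:

```python
# Compose acquisition -> extraction -> translation -> replacement -> overlay.
def generate_multilingual_video(video, target_language,
                                extract, translate, replace, overlay):
    original_audio = video["audio"]                      # acquisition module
    first_segments = extract(original_audio)             # extraction module
    second_segments = [translate(seg, target_language)   # translation module
                       for seg in first_segments]
    target_audio = replace(original_audio,               # replacing module
                           first_segments, second_segments)
    return overlay(video, target_audio)                  # overlay module

# Toy stand-ins: "audio" is a string, segments are words, "translation"
# is upper-casing -- just enough to exercise the data flow.
video = {"audio": "hello world", "frames": []}
result = generate_multilingual_video(
    video, "zh",
    extract=lambda audio: audio.split(),
    translate=lambda seg, lang: seg.upper(),
    replace=lambda audio, firsts, seconds: " ".join(seconds),
    overlay=lambda v, audio: {**v, "audio": audio},
)
print(result["audio"])  # HELLO WORLD
```

The point of the sketch is the interface between stages: each module consumes exactly what the previous one produces, so each can be replaced (e.g. a different translation backend) without touching the others.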
Optionally, the translation module 300 is further configured to:
obtaining a first audio clip from the first audio clip set;
translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
screening the selectable audio segments with the duration closest to the duration of the first audio segment from the selectable audio segments as second audio segments corresponding to the first audio segment;
obtaining a next first audio clip, and executing the step of translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language until obtaining a second audio clip corresponding to each first audio clip in the first audio clip set;
and taking each second audio segment as the second audio segment set.
Optionally, the translation module 300 is further configured to:
comparing the size of the first time length of the first audio segment with the corresponding second time length of the second audio segment;
and if the first duration is different from the second duration, performing frame extraction or frame supplement processing on the second audio clip based on the time interval difference between the first duration and the second duration so as to keep the first duration and the second duration consistent.
Optionally, the translation module 300 is further configured to:
when the first time length is longer than the second time length, a preset smooth audio frame is added and inserted in the second audio segment according to a preset interval, wherein the adding and inserting quantity of the preset smooth audio frame is determined by the time interval difference;
and when the first time length is less than the second time length, extracting the translated audio frames in the second audio clip according to a preset interval, wherein the extraction quantity of the translated audio frames is determined by the time difference.
Optionally, the translation module 300 is further configured to:
acquiring adjacent audio frames of the position of the preset smooth audio frame in the second audio segment in an interpolation mode;
and taking the adjacent audio frames or the average frames of the adjacent audio frames as the preset smooth audio frames.
Optionally, the translation module 300 is further configured to:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
Optionally, the translation module 300 is further configured to:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
The multilingual video generation device provided by the invention adopts the multilingual video generation method of the above embodiments, and solves the technical problem that existing translation software cannot meet the needs of users who wish to learn spoken language and practice listening by watching videos. Compared with the prior art, the beneficial effects of the multilingual video generation device provided by this embodiment of the invention are the same as those of the multilingual video generation method provided by the above embodiments, and its other technical features are the same as those disclosed in the method embodiments, which are not repeated here.
In addition, an embodiment of the present invention further provides a device for generating a multilingual video, where the device for generating a multilingual video includes: the system comprises a memory, a processor and a multilingual video generation program which is stored on the memory and can run on the processor, wherein the multilingual video generation program realizes the steps of the multilingual video generation method when the processor executes the multilingual video generation program.
The specific implementation of the multilingual video generation device of the present invention is substantially the same as the embodiments of the multilingual video generation method described above, and is not described here again.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a multilingual video generation program is stored, and when the multilingual video generation program is executed by a processor, the steps of the multilingual video generation method are implemented as described above.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the multilingual video generation method described above, and is not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a television, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims (10)

1. A method for generating a multilingual video, the method comprising:
determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
extracting a first audio fragment set of an original language from the original audio data;
translating the first audio clip set and generating a second audio clip set corresponding to the target language;
replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
and loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
2. The method of claim 1, wherein translating the first set of audio segments and generating a second set of audio segments corresponding to the target language comprises:
obtaining a first audio clip from the first audio clip set;
translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
screening the selectable audio segments with the duration closest to the duration of the first audio segment from the selectable audio segments as second audio segments corresponding to the first audio segment;
obtaining a next first audio segment, and executing the step of translating the first audio segment based on multiple translation schemes to obtain multiple selectable audio segments of the target language until obtaining a second audio segment corresponding to each first audio segment in the first audio segment set;
and taking each second audio segment as the second audio segment set.
3. A method of generating a multilingual video according to claim 2, wherein, after the step of selecting, from among the selectable audio clips, one of the selectable audio clips having a duration closest to the duration of the first audio clip as the second audio clip corresponding to the first audio clip, the method comprises:
comparing the size of the first time length of the first audio segment with the corresponding second time length of the second audio segment;
and if the first duration is different from the second duration, performing frame extraction or frame supplement processing on the second audio clip based on the time interval difference between the first duration and the second duration so as to keep the first duration and the second duration consistent.
4. The method of claim 3, wherein said step of performing frame extraction or frame supplement processing on said second audio segment based on the time interval difference between said first duration and said second duration comprises:
when the first time length is longer than the second time length, a preset smooth audio frame is inserted in the second audio segment according to a preset interval, wherein the insertion quantity of the preset smooth audio frame is determined by the time interval difference;
and when the first time length is less than the second time length, extracting the translated audio frames in the second audio clip according to a preset interval, wherein the extraction quantity of the translated audio frames is determined by the time difference.
5. The method of claim 4, wherein before the step of interpolating the predetermined smooth audio frames at predetermined intervals in the second audio piece, the method comprises:
acquiring adjacent audio frames of the interpolation positions of the preset smooth audio frames in the second audio clip;
and taking the adjacent audio frames or the average frames of the adjacent audio frames as the preset smooth audio frames.
6. The method for generating multilingual video according to claim 1, wherein said target video file further comprises an original subtitle file, and wherein said step of determining the target video file and the target language based on the predetermined user operation further comprises:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
7. The method for generating multilingual video according to claim 6, wherein the step of generating the target subtitle file based on the subtitle timeline in the original subtitle file and each of the second subtitle segments comprises:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
8. A multilingual video-generating apparatus, comprising:
the acquisition module is used for determining a target video file and a target language based on preset user operation and acquiring original audio data of the target video file;
the extraction module is used for extracting a first audio clip set of an original language from the original audio data;
the translation module is used for translating the first audio clip set and generating a second audio clip set corresponding to the target language;
a replacing module, configured to replace each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on a correspondence between before and after translation to obtain target audio data;
and the covering module is used for loading the target audio data to the target video file and covering the original audio data so as to generate a video file corresponding to the target language.
9. A multilingual video generation apparatus, characterized in that it comprises: memory, processor and multilingual video generation program stored on said memory and executable on said processor, said multilingual video generation program implementing the steps of the multilingual video generation method according to any one of claims 1 to 7 when executed by said processor.
10. A computer-readable storage medium, characterized in that a multilingual video generation program is stored on the computer-readable storage medium, and when executed by a processor, implements the steps of the multilingual video generation method according to any one of claims 1 to 7.
CN202211356981.4A 2022-11-01 2022-11-01 Method, equipment and device for generating multilingual video and readable storage medium Pending CN115967840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211356981.4A CN115967840A (en) 2022-11-01 2022-11-01 Method, equipment and device for generating multilingual video and readable storage medium


Publications (1)

Publication Number Publication Date
CN115967840A true CN115967840A (en) 2023-04-14

Family

ID=87362538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211356981.4A Pending CN115967840A (en) 2022-11-01 2022-11-01 Method, equipment and device for generating multilingual video and readable storage medium

Country Status (1)

Country Link
CN (1) CN115967840A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination