CN115967840A - Method, equipment and device for generating multilingual video and readable storage medium - Google Patents

Method, equipment and device for generating multilingual video and readable storage medium

Info

Publication number
CN115967840A
Authority
CN
China
Prior art keywords
audio
target
segment
language
original
Prior art date
Legal status
Pending
Application number
CN202211356981.4A
Other languages
Chinese (zh)
Inventor
张广谱
邹文聪
Current Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202211356981.4A priority Critical patent/CN115967840A/en
Publication of CN115967840A publication Critical patent/CN115967840A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method, a device, an apparatus, and a readable storage medium for generating a multilingual video. The method extracts, from the original audio data, a first audio segment set consisting of the segments that carry linguistic text information in the original language; translates each first audio segment in the set into a second audio segment in a target language to obtain a second audio segment set; and, based on the correspondence between segments before and after translation, replaces each first audio segment in the original audio data with its corresponding second audio segment. The result is a video file that plays audio in the target language, meeting users' needs for spoken-language learning and listening practice in that language.

Description

Method, equipment and device for generating multilingual video and readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method, a device, and an apparatus for generating a multilingual video, and a readable storage medium.
Background
At present, foreign programs watched on televisions, computers, and similar devices are usually provided with translated subtitles or dubbed audio produced by manual translation, so that most viewers can follow the program content. However, as program types and network resources have diversified, users can now find the foreign program resources they want to watch (video clips, movies, television series, and so on) on their own, and most of these self-sourced resources come with neither dubbed translation nor translated subtitles, making them difficult for ordinary viewers to understand. Although some software on the market can translate subtitles, none translates the audio itself, which fails to satisfy users who want to learn spoken language and practice listening by watching videos.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a method, a device, an apparatus, and a readable storage medium for generating a multilingual video, and aims to solve the technical problem that existing translation software cannot meet the needs of users who want to learn spoken language and practice listening by watching videos.
In order to achieve the above object, the present invention provides a method for generating a multilingual video, wherein the method for generating the multilingual video includes:
determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
extracting a first audio fragment set of an original language from the original audio data;
translating the first audio clip set and generating a second audio clip set corresponding to the target language;
replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
and loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
Further, the step of translating the first audio clip set and generating a second audio clip set corresponding to the target language includes:
obtaining a first audio clip from the first audio clip set;
translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment;
obtaining a next first audio segment, and executing the step of translating the first audio segment based on multiple translation schemes to obtain multiple selectable audio segments of the target language until obtaining a second audio segment corresponding to each first audio segment in the first audio segment set;
and taking each second audio segment as the second audio segment set.
Further, after the step of screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment, the method further includes:
comparing the first duration of the first audio segment with the second duration of the corresponding second audio segment;
and if the first duration differs from the second duration, performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration, so as to make the second duration consistent with the first duration.
Further, the step of performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration includes:
when the first duration is greater than the second duration, inserting preset smooth audio frames into the second audio segment at a preset interval, wherein the number of inserted preset smooth audio frames is determined by the duration difference;
and when the first duration is less than the second duration, extracting translated audio frames from the second audio segment at a preset interval, wherein the number of extracted translated audio frames is determined by the duration difference.
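The frame supplement/extraction rule above can be sketched as follows, operating on a list of audio frames. The text does not specify the exact interval policy, so the even spacing here is an assumption, and the function names are illustrative, not from the patent:

```python
def equalize_frames(second_frames, first_count, make_smooth_frame):
    """Insert or drop frames at roughly even positions so the translated
    (second) segment ends up with exactly `first_count` frames.

    `make_smooth_frame(frames, pos)` supplies the smoothing frame to insert
    at position `pos` (for example, a copy or an average of the neighbours).
    """
    frames = list(second_frames)
    diff = first_count - len(frames)
    if diff > 0:  # second segment too short: insert smooth frames
        step = max(1, len(frames) // diff)
        for k in range(diff):
            pos = min((k + 1) * step + k, len(frames))
            frames.insert(pos, make_smooth_frame(frames, pos))
    elif diff < 0:  # second segment too long: drop translated frames
        step = max(1, len(frames) // (-diff))
        for k in range(-diff):
            del frames[min((k + 1) * step - k - 1, len(frames) - 1)]
    return frames
```

When `diff` is zero the segment is returned unchanged, matching the "durations already consistent" case.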
Further, before the step of inserting the preset smooth audio frames into the second audio segment at the preset interval, the method further includes:
acquiring the audio frames adjacent to the insertion position of each preset smooth audio frame in the second audio segment;
and taking an adjacent audio frame, or the average of the adjacent audio frames, as the preset smooth audio frame.
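The neighbour-averaging rule for building a preset smooth audio frame can be sketched as below, assuming frames are fixed-length sample arrays; the function name is illustrative:

```python
import numpy as np

def average_smooth_frame(frames, pos):
    """Build the preset smooth frame for insertion position `pos` as the
    mean of the two adjacent audio frames (falling back to a copy of the
    single neighbour at the segment edges)."""
    left = frames[max(pos - 1, 0)]
    right = frames[min(pos, len(frames) - 1)]
    return (np.asarray(left, dtype=float) + np.asarray(right, dtype=float)) / 2.0
```

Averaging the neighbours keeps the inserted frame close to the local waveform, so the padding is less audible than inserting silence.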
Further, the target video file further includes an original subtitle file, and after the step of determining the target video file and the target language based on the preset user operation, the method further includes:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
Further, the step of generating a target subtitle file based on a subtitle time axis in the original subtitle file and each of the second subtitle speech segments includes:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
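The subtitle steps above reuse the original subtitle time axis: each translated cue keeps the start/end timestamps of the cue it was translated from. A minimal sketch, under the assumption that cues are (start, end, text) triples and `translate` is any text-level translation function:

```python
def build_target_subtitles(original_cues, translate):
    """Generate the target subtitle file's cues by keeping each cue's
    timestamps (the original subtitle time axis) and translating its text,
    per the before/after correspondence."""
    return [(start, end, translate(text)) for start, end, text in original_cues]
```

Because the timeline is untouched, the translated subtitles stay synchronized with the video without any re-timing step.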
In addition, to achieve the above object, the present invention further provides a multilingual video-generating apparatus, including:
the acquisition module is used for determining a target video file and a target language based on preset user operation and acquiring original audio data of the target video file;
the extraction module is used for extracting a first audio clip set of an original language from the original audio data;
the translation module is used for translating the first audio fragment set and generating a second audio fragment set corresponding to the target language;
a replacing module, configured to replace, based on the correspondence before and after translation, each first audio segment in the first audio segment set in the original audio data with each second audio segment in the second audio segment set to obtain target audio data;
and the covering module is used for loading the target audio data to the target video file and covering the original audio data so as to generate a video file corresponding to the target language.
In addition, to achieve the above object, the present invention also provides a multilingual video generation apparatus, including: a memory, a processor, and a multilingual video generation program stored on the memory and executable on the processor, wherein the multilingual video generation program, when executed by the processor, implements the steps of the multilingual video generation method described above.
Further, to achieve the above object, the present invention provides a computer-readable storage medium having a multilingual video generation program stored thereon, which, when executed by a processor, implements the steps of the multilingual video generation method as described above.
The method, device, apparatus, and readable storage medium for generating a multilingual video determine a target video file and a target language based on a preset user operation, and acquire the original audio data of the target video file; extract a first audio segment set in the original language from the original audio data; translate the first audio segment set to generate a second audio segment set corresponding to the target language; replace each first audio segment in the original audio data with the corresponding second audio segment based on the correspondence before and after translation to obtain target audio data; and load the target audio data into the target video file to cover the original audio data and generate a video file corresponding to the target language. By extracting the first audio segment set from the segments carrying linguistic text information in the original language, translating each first audio segment into a second audio segment in the target language, and replacing each first audio segment in the original audio data with its corresponding second audio segment, a video file that plays audio in the target language is obtained, meeting users' needs for spoken-language learning and listening practice in the target language.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a multilingual video-generating method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a multilingual video-generating method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a multilingual video-generating method according to a third embodiment of the present invention;
FIG. 5 is a flowchart illustrating a multilingual video-generating method according to a fourth embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for the multilingual video generation method of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The device of the embodiment of the present invention may be a television, or an electronic terminal device with data receiving, sending, and processing functions, such as a smart phone, a PC, a tablet computer, or a portable computer.
As shown in fig. 1, the apparatus may include: a processor 1001, e.g. a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the device may also include a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, WiFi modules, and so forth. The sensors include, for example, light sensors and motion sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and the magnitude and direction of gravity when the terminal is stationary, and can be used for applications that recognize the attitude of the mobile terminal (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tapping); of course, the mobile device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described here again.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a multilingual video generation program therein.
In the device shown in fig. 1, the network interface 1004 is mainly used for connecting a backend server and communicating data with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the multilingual video generation program stored in the memory 1005, and perform the following operations:
determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
extracting a first audio fragment set of an original language from the original audio data;
translating the first audio clip set and generating a second audio clip set corresponding to the target language;
replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
and loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and further perform the following operations:
the step of translating the first audio segment set and generating a second audio segment set corresponding to the target language comprises:
obtaining a first audio clip from the first audio clip set;
translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language;
screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment;
obtaining a next first audio clip, and executing the step of translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language until obtaining a second audio clip corresponding to each first audio clip in the first audio clip set;
and taking each second audio segment as the second audio segment set.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and also perform the following operations:
after the step of screening, from the selectable audio segments, the one whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment, the method further includes:
comparing the first duration of the first audio segment with the second duration of the corresponding second audio segment;
and if the first duration differs from the second duration, performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration, so as to make the second duration consistent with the first duration.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and also perform the following operations:
the step of performing frame extraction or frame supplement processing on the second audio segment based on the duration difference between the first duration and the second duration includes:
when the first duration is greater than the second duration, inserting preset smooth audio frames into the second audio segment at a preset interval, wherein the number of inserted preset smooth audio frames is determined by the duration difference;
and when the first duration is less than the second duration, extracting translated audio frames from the second audio segment at a preset interval, wherein the number of extracted translated audio frames is determined by the duration difference.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and further perform the following operations:
before the step of inserting the preset smooth audio frames into the second audio segment at the preset interval, the method further includes:
acquiring the audio frames adjacent to the insertion position of the preset smooth audio frame in the second audio segment;
and taking the adjacent audio frames or the average frames of the adjacent audio frames as the preset smooth audio frames.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and also perform the following operations:
the target video file further comprises an original subtitle file, and after the step of determining the target video file and the target language based on the preset user operation, the method further comprises:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
Further, the processor 1001 may call the multilingual video generation program stored in the memory 1005, and further perform the following operations:
the step of generating the target subtitle file based on the subtitle time axis in the original subtitle file and each second subtitle speech segment comprises the following steps:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
Referring to fig. 2, a method for generating a multilingual video according to a first embodiment of the present invention includes:
step S10, determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
In this embodiment, the execution body of the method for generating a multilingual video may be an intelligent device such as a television, a computer, or a mobile phone. The user can select the target video file to be translated and the target language to translate into, such as Chinese, English, Japanese, or Korean. The preset user operation is an interactive operation between the user and the intelligent device; for example, the user may select the target video file and the target language through a touch screen or a remote controller. A video file typically contains video data and audio data, and the original audio data is extracted separately from the target video file.
Step S20, extracting a first audio clip set of an original language from the original audio data;
Specifically, consider program videos (video clips, movies, or television series), especially those watched by users who want to learn spoken language and practice listening by watching videos. Such video files usually contain two types of sound: background sound, and speech that carries linguistic text information. The speech is the translatable part and the part useful for spoken-language learning and listening practice, so in this embodiment the portion of the original audio data that needs to be translated is the speech containing linguistic text information. Speech recognition technology can be used to extract the sound segments containing linguistic text information from the original audio data to obtain the first audio segment set, which corresponds to the language of the original video, such as English. To recognize these segments, a machine learning algorithm may be used, for example a binary classifier trained on samples labeled as background sound segments and samples labeled as segments containing linguistic text information; the classifier identifies the speech segments in the original audio data, and each recognized and extracted segment forms the first audio segment set.
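The embodiment leaves the segment detector open (speech recognition or a trained binary classifier). As a minimal illustrative stand-in for it — not the patent's classifier — a short-time-energy detector can mark the spans likely to contain speech; all names and thresholds here are assumptions:

```python
import numpy as np

def extract_speech_segments(samples, sr, frame_ms=30, energy_ratio=4.0):
    """Return (start_s, end_s) spans whose short-time energy suggests speech.

    Frames whose energy exceeds `energy_ratio` times the median frame energy
    are treated as speech; adjacent speech frames are merged into segments.
    """
    hop = int(sr * frame_ms / 1000)
    n = len(samples) // hop
    energies = np.array([np.mean(samples[i * hop:(i + 1) * hop] ** 2)
                         for i in range(n)])
    threshold = energy_ratio * np.median(energies)
    is_speech = energies > threshold

    segments, start = [], None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, n * hop / sr))
    return segments
```

A real implementation would replace the energy heuristic with the trained classifier the text describes, but the segment-merging logic stays the same.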
Step S30, translating the first audio clip set and generating a second audio clip set corresponding to the target language;
It can be understood that the first audio segment set contains many sound segments carrying linguistic text information; a single such segment is a first audio segment (in the original language). Each first audio segment in the set is translated to obtain a corresponding second audio segment (in the target language), and the collection of second audio segments is the second audio segment set. The translation from a first audio segment in the original language to a second audio segment in the target language may use speech recognition on the first audio segment, or may translate the original subtitles of the original video file and then synthesize speech from the translation to obtain the second audio segment. In addition, the speech synthesis model can be configured with multiple voice models (such as a standard male voice and a standard female voice), and the target voice model can likewise be determined based on a preset user operation, to meet different users' preferences for voice type. Translation technology and AI (Artificial Intelligence) speech synthesis technology are both mature, so the detailed translation and synthesis process is not described here.
Step S40, replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
Specifically, according to the correspondence before and after translation, each first audio segment in the first audio segment set has a corresponding second audio segment in the second audio segment set. All first audio segments are cut out of the original audio data, and each second audio segment is then placed at the position of its corresponding first audio segment. For example, if first audio segment A is translated to obtain second audio segment A, first audio segment A is cut out of the original audio data and second audio segment A is placed at its position; replacing every first audio segment in this way yields the target audio data. It can be understood that the target audio data obtained at this point is audio data in the target language.
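The cut-and-splice replacement described above can be sketched as follows. Function and variable names are illustrative, and the second segments are assumed to have already been length-matched to the first segments:

```python
import numpy as np

def replace_segments(original, replacements, sr):
    """Splice each translated (second) segment into the timeline position of
    the first segment it was translated from.

    `replacements` maps (start_s, end_s) spans of first segments to
    equal-length arrays of translated samples, mirroring the patent's
    before/after correspondence."""
    out = original.copy()
    for (start_s, end_s), translated in replacements.items():
        a, b = int(start_s * sr), int(end_s * sr)
        assert len(translated) == b - a, "durations must already match"
        out[a:b] = translated
    return out
```

Because every span outside the replaced segments is copied unchanged, the background sound of the original audio is preserved.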
And S50, loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
Specifically, the original audio data in the target video file is covered by the target audio data to obtain the video file playing the target language audio, and the user can practice spoken language or hearing of the target language based on the video file playing the target language audio, so that the learning requirement of the user is met.
In this embodiment, a target video file and a target language are determined based on a preset user operation, and the original audio data of the target video file is acquired; a first audio segment set in the original language is extracted from the original audio data; the first audio segment set is translated to generate a second audio segment set corresponding to the target language; each first audio segment in the original audio data is replaced with the corresponding second audio segment based on the correspondence before and after translation to obtain the target audio data; and the target audio data is loaded into the target video file to cover the original audio data and generate a video file corresponding to the target language. By extracting the first audio segment set from the segments carrying linguistic text information in the original language, translating each first audio segment into a second audio segment in the target language, and replacing each first audio segment with its corresponding second audio segment, a video file that plays audio in the target language is obtained, meeting users' needs for spoken-language learning and listening practice in the target language.
Further, referring to fig. 3, a second embodiment of the multilingual video generation method of the present invention is proposed based on the first embodiment.
The step of translating the first audio segment set and generating a second audio segment set corresponding to the target language comprises:
step S310, acquiring a first audio clip from the first audio clip set;
Specifically, the first audio clip set includes a plurality of first audio clips, and a first audio clip is acquired from the set either arbitrarily or in order along the audio timeline.
Step S320, translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
Specifically, the multiple translation schemes refer to the multiple possible renderings available when an acquired first audio segment is translated into the target language. For example, suppose the original language of a first audio segment is English and its content is: "If China is to be a great nation, this dream must come true." With Chinese as the target language, the same first audio segment admits several Chinese renderings, which back-translate roughly as: "If China is to become a great country, this dream must be realized"; "If China is to become a great country, this dream will surely be realized"; and "If China is to become a great country, it must realize this dream." The selectable audio segments translated from the same first audio segment therefore share the same meaning, but their durations can differ because the renderings differ.
Step S330, selecting the selectable audio segment with the duration closest to the duration of the first audio segment from the selectable audio segments as a second audio segment corresponding to the first audio segment;
Specifically, continuing the example above, the first audio segment is "If China is to be a great nation, this dream must come true," with a duration of 5 s. The selectable audio segments produced by the multiple translation schemes are: selectable audio segment 1, "If China is to become a great country, this dream must be realized" (duration 4 s); selectable audio segment 2, "If China is to become a great country, this dream will surely be realized" (duration 5 s); and selectable audio segment 3, "If China is to become a great country, it must realize this dream" (duration 6 s). Since the 5 s duration of selectable audio segment 2 is closest to the 5 s duration of the first audio segment, selectable audio segment 2 is taken as the corresponding second audio segment. It can be understood that, in this embodiment, one first audio segment yields selectable translated segments of different durations under different translation schemes, and the one closest in duration to the first audio segment is used as its second audio segment; this avoids the problem that replacing a first audio segment with a second audio segment of a different duration would disturb the synchronized playback of video and audio.
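Step S330's duration-based screening reduces to a minimum-distance selection over the candidates. A small sketch using the 4 s / 5 s / 6 s example above; names are illustrative, not from the patent:

```python
def pick_second_segment(first_duration_s, candidates):
    """Select, from several translation candidates of the same segment, the
    one whose duration is closest to the first (original) segment's
    duration. `candidates` is a list of (translated_text, duration_s)."""
    return min(candidates, key=lambda c: abs(c[1] - first_duration_s))

candidates = [
    ("translation scheme 1", 4.0),
    ("translation scheme 2", 5.0),
    ("translation scheme 3", 6.0),
]
best = pick_second_segment(5.0, candidates)
# with a 5 s original segment, the 5 s candidate is selected
```

Ties (two candidates equally close) are not addressed by the text; `min` simply keeps the first one encountered.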
Step S340, acquiring a next first audio clip, and performing the step of translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language until a second audio clip corresponding to each first audio clip in the first audio clip set is obtained;
step S350, using each second audio clip as the second audio clip set.
Specifically, the next first audio segment is obtained from the first audio segment set, and the step of translating the first audio segment based on multiple translation schemes to obtain multiple selectable audio segments of the target language is performed again, until every first audio segment in the first audio segment set has a correspondingly generated second audio segment; the collection of these second audio segments is then used as the second audio segment set.
In this embodiment, for the same first audio segment, translated selectable audio segments of different durations may be obtained through different translation schemes; the selectable audio segment whose duration is closest to that of the first audio segment is taken as the corresponding second audio segment, and each first audio segment is processed in this way to obtain the second audio segment set. It can be understood that, when determining the second audio segment corresponding to a first audio segment, screening by duration keeps the two durations close, thereby avoiding the problem that a duration mismatch disrupts synchronous playback of video and audio when the second audio segment replaces the corresponding first audio segment in the original audio data.
Further, referring to fig. 4, a third embodiment of the multilingual video generation method of the present invention is proposed based on the second embodiment of the multilingual video generation method of the present invention.
After the step of selecting, from the selectable audio segments, the selectable audio segment whose duration is closest to that of the first audio segment as the second audio segment corresponding to the first audio segment, the method includes:
step S331, comparing the first duration of the first audio segment with the second duration of the corresponding second audio segment;
step S332, if the first duration is different from the second duration, performing frame extraction or frame supplement processing on the second audio segment based on a time interval difference between the first duration and the second duration, so that the first duration and the second duration are consistent.
Further, when the first duration is greater than the second duration, preset smooth audio frames are inserted into the second audio segment at a preset interval, wherein the number of inserted preset smooth audio frames is determined by the time interval difference; and when the first duration is less than the second duration, translated audio frames are extracted from the second audio segment at a preset interval, wherein the number of extracted translated audio frames is likewise determined by the time interval difference.
Further, before the step of inserting the preset smooth audio frames into the second audio segment at the preset interval, the method includes: acquiring the audio frames adjacent to the insertion position of each preset smooth audio frame in the second audio segment; and taking one of the adjacent audio frames, or the average frame of the adjacent audio frames, as the preset smooth audio frame.
It can be understood that the duration of the screened second audio segment may still differ from that of the first audio segment; in that case, the second audio segment may be subjected to frame extraction or frame supplement processing. Specifically, the first duration of the first audio segment is compared with the second duration of the second audio segment. If the two differ, the video file that plays the target-language audio may still exhibit audio-video desynchronization during actual playback; conversely, if they are the same, no frame extraction or frame supplement processing is needed. Two cases arise. If the first duration is greater than the second duration, frame supplement processing is performed on the second audio segment: preset smooth audio frames are inserted into the second audio segment at a preset interval. The preset interval is an interval measured in audio frames, for example, one preset smooth audio frame inserted every b audio frames; its size can be freely set by a technician so that the preset smooth audio frames are distributed uniformly through the second audio segment. Further, each preset smooth audio frame may be generated from the frames adjacent to its insertion position. For example, if a preset smooth audio frame is to be inserted between audio frame A and audio frame B in the second audio segment, either audio frame A or audio frame B may be used directly, or the average frame of audio frames A and B may be used as the preset smooth audio frame.
The number of preset smooth audio frames to insert is derived from the time interval difference between the first duration and the second duration: it can be obtained by multiplying the time interval difference by the audio frame rate of the second audio segment. For example, if the time interval difference is 2 s and the audio frame rate is 25 frames per second, the insertion count is 50, so 50 preset smooth audio frames are inserted into the second audio segment to bring its second duration into line with the first duration of the first audio segment. If instead the first duration is less than the second duration, frame extraction processing is performed on the second audio segment: the audio frames of the second audio segment (the translated audio frames) are extracted at the preset interval. The number of translated audio frames to extract is likewise determined by the time interval difference, and the detailed process is not repeated here.
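The frame-count arithmetic and the uniform insertion of averaged smoothing frames described above can be sketched as follows. This is an illustrative toy (frames are modeled as plain numbers, and all function names are hypothetical), not the patent's implementation:

```python
# Number of frames to insert (positive) or drop (negative) is the
# duration difference multiplied by the audio frame rate.
def frames_to_adjust(first_duration, second_duration, frame_rate):
    return round((first_duration - second_duration) * frame_rate)

# Insert n_insert "smooth" frames at a uniform interval; each inserted
# frame is the average of the two frames adjacent to its position.
def pad_with_smooth_frames(frames, n_insert):
    if n_insert <= 0:
        return list(frames)
    interval = max(1, len(frames) // (n_insert + 1))
    out, inserted = [], 0
    for i, frame in enumerate(frames):
        out.append(frame)
        if inserted < n_insert and (i + 1) % interval == 0 and i + 1 < len(frames):
            out.append((frame + frames[i + 1]) / 2)  # average of neighbours
            inserted += 1
    return out

print(frames_to_adjust(7.0, 5.0, 25))  # 2 s short of target at 25 fps -> insert 50
print(frames_to_adjust(5.0, 6.0, 25))  # 1 s over target at 25 fps -> drop 25
print(pad_with_smooth_frames([0, 2, 4, 6, 8, 10], 2))
```

Real PCM frames would be arrays of samples rather than scalars, and production systems often prefer time-stretching (resampling without pitch change) over raw frame insertion, but the counting logic is the same.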
In this embodiment, when the second duration of the screened second audio segment still differs from the first duration of the corresponding first audio segment, frame extraction or frame supplement processing is performed on the second audio segment according to the difference between the two durations, so that the durations of the second audio segment and the first audio segment are kept consistent. This further prevents sound and picture from falling out of sync when the video file playing the target-language audio is actually played.
Further, referring to fig. 5, a fourth embodiment of the multilingual video generation method of the present invention is proposed based on the first embodiment of the multilingual video generation method of the present invention.
The target video file further comprises an original subtitle file, and after the step of determining the target video file and the target language based on the preset user operation, the method further comprises:
step S601, translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
step S602, generating a target caption file based on a caption time axis in an original caption file and each second caption speech segment;
Further, according to the correspondence before and after translation, each second caption language segment is set at the position on the timeline corresponding to its first caption language segment, to generate the target caption file.
Step S603, loading the target subtitle file to the target video file.
Specifically, the target video file usually further includes an original subtitle file, which is usually in the original language. Each first caption language segment in the original subtitle file is translated into a second caption language segment in the target language, so each first caption language segment has a corresponding second caption language segment (the correspondence before and after translation). The start and end time of each first caption language segment is determined by the subtitle timeline of the original subtitle file; that is, these times are recorded on the subtitle timeline. Each second caption language segment is then placed at the position of its corresponding first caption language segment on the subtitle timeline, thereby generating a target subtitle file that displays the target language. It should be noted that, when the video file is played, the original subtitles and the target subtitles may be displayed synchronously, assisting the user in spoken-language learning and listening practice.
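The subtitle step above keeps the original timeline and only swaps each cue's text. A minimal sketch, under the assumption that a cue is a (start, end, text) tuple and that a translation lookup is available (both the cue structure and the `translations` mapping are illustrative, not part of the patent):

```python
# Build target-language subtitles by reusing the original timeline:
# every translated cue appears at exactly the times the original did.

def build_target_subtitles(original_cues, translations):
    """original_cues: list of (start_s, end_s, text); translations: text -> text."""
    return [(start, end, translations[text]) for start, end, text in original_cues]

cues = [(0.0, 2.5, "hello"), (2.5, 5.0, "goodbye")]
target = build_target_subtitles(cues, {"hello": "你好", "goodbye": "再见"})
print(target)  # same timecodes, translated text
```

To show original and target subtitles simultaneously, as the passage suggests, a player could simply load both files, since their cues share identical timecodes.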
Further, referring to fig. 6, an embodiment of the present invention further provides a multilingual video generation apparatus 1000, where the multilingual video generation apparatus 1000 includes:
an obtaining module 100, configured to determine a target video file and a target language based on a preset user operation, and obtain original audio data of the target video file;
an extracting module 200, configured to extract a first audio clip set of an original language from the original audio data;
the translation module 300 is configured to translate the first audio clip set and generate a second audio clip set corresponding to the target language;
a replacing module 400, configured to replace, based on a correspondence relationship between before and after translation, each first audio segment in the first audio segment set in the original audio data with each second audio segment in the second audio segment set to obtain target audio data;
an overlay module 500, configured to load the target audio data into the target video file and overlay the original audio data to generate a video file corresponding to the target language.
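The five modules compose into a single pipeline. The following is a hypothetical sketch of that composition only; the audio here is a toy string, and every function name merely stands in for the module of the same name in the apparatus above:

```python
# Compose acquisition -> extraction -> translation -> replacement -> overlay.
def generate_multilingual_video(video, target_language,
                                extract, translate, replace, overlay):
    original_audio = video["audio"]                      # acquisition module
    first_segments = extract(original_audio)             # extraction module
    second_segments = [translate(seg, target_language)   # translation module
                       for seg in first_segments]
    target_audio = replace(original_audio,               # replacing module
                           first_segments, second_segments)
    return overlay(video, target_audio)                  # overlay module

# Toy stand-ins: "audio" is a string, segments are words, "translation"
# is upper-casing -- just enough to exercise the data flow.
video = {"audio": "hello world", "frames": []}
result = generate_multilingual_video(
    video, "zh",
    extract=lambda audio: audio.split(),
    translate=lambda seg, lang: seg.upper(),
    replace=lambda audio, firsts, seconds: " ".join(seconds),
    overlay=lambda v, audio: {**v, "audio": audio},
)
print(result["audio"])  # HELLO WORLD
```

The point of the sketch is the interface between stages: each module consumes exactly what the previous one produces, so each can be replaced (e.g. a different translation backend) without touching the others.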
Optionally, the translation module 300 is further configured to:
obtaining a first audio clip from the first audio clip set;
translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
screening the selectable audio segments with the duration closest to the duration of the first audio segment from the selectable audio segments as second audio segments corresponding to the first audio segment;
obtaining a next first audio clip, and executing the step of translating the first audio clip based on multiple translation schemes to obtain multiple selectable audio clips of the target language until obtaining a second audio clip corresponding to each first audio clip in the first audio clip set;
and taking each second audio segment as the second audio segment set.
Optionally, the translation module 300 is further configured to:
comparing the size of the first time length of the first audio segment with the corresponding second time length of the second audio segment;
and if the first duration is different from the second duration, performing frame extraction or frame supplement processing on the second audio clip based on the time interval difference between the first duration and the second duration so as to keep the first duration and the second duration consistent.
Optionally, the translation module 300 is further configured to:
when the first time length is longer than the second time length, a preset smooth audio frame is added and inserted in the second audio segment according to a preset interval, wherein the adding and inserting quantity of the preset smooth audio frame is determined by the time interval difference;
and when the first time length is less than the second time length, extracting the translated audio frames in the second audio clip according to a preset interval, wherein the extraction quantity of the translated audio frames is determined by the time difference.
Optionally, the translation module 300 is further configured to:
acquiring adjacent audio frames of the position of the preset smooth audio frame in the second audio segment in an interpolation mode;
and taking the adjacent audio frames or the average frames of the adjacent audio frames as the preset smooth audio frames.
Optionally, the translation module 300 is further configured to:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
Optionally, the translation module 300 is further configured to:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
The multilingual video generation device provided by the invention adopts the multilingual video generation method of the above embodiments, and solves the technical problem that existing translation software cannot meet the needs of users who wish to learn spoken language and practice listening by watching videos. Compared with the prior art, the beneficial effects of the multilingual video generation device provided by this embodiment of the invention are the same as those of the multilingual video generation method provided by the above embodiments, and its other technical features are the same as those disclosed in the method embodiments, which are not repeated here.
In addition, an embodiment of the present invention further provides a device for generating a multilingual video, where the device for generating a multilingual video includes: the system comprises a memory, a processor and a multilingual video generation program which is stored on the memory and can run on the processor, wherein the multilingual video generation program realizes the steps of the multilingual video generation method when the processor executes the multilingual video generation program.
The specific implementation of the multilingual video generation device of the present invention is substantially the same as the embodiments of the multilingual video generation method described above, and is not described here again.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a multilingual video generation program is stored, and when the multilingual video generation program is executed by a processor, the steps of the multilingual video generation method are implemented as described above.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the multilingual video generation method described above, and is not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a television, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims (10)

1. A method for generating a multilingual video, the method comprising:
determining a target video file and a target language based on preset user operation, and acquiring original audio data of the target video file;
extracting a first audio fragment set of an original language from the original audio data;
translating the first audio clip set and generating a second audio clip set corresponding to the target language;
replacing each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on the corresponding relation before and after translation to obtain target audio data;
and loading the target audio data to the target video file and covering the original audio data to generate a video file corresponding to the target language.
2. The method of claim 1, wherein translating the first set of audio segments and generating a second set of audio segments corresponding to the target language comprises:
obtaining a first audio clip from the first audio clip set;
translating the first audio segment based on a plurality of translation schemes to obtain a plurality of selectable audio segments of the target language;
screening the selectable audio segments with the duration closest to the duration of the first audio segment from the selectable audio segments as second audio segments corresponding to the first audio segment;
obtaining a next first audio segment, and executing the step of translating the first audio segment based on multiple translation schemes to obtain multiple selectable audio segments of the target language until obtaining a second audio segment corresponding to each first audio segment in the first audio segment set;
and taking each second audio segment as the second audio segment set.
3. A method of generating a multilingual video according to claim 2, wherein, after the step of selecting, from among the selectable audio clips, one of the selectable audio clips having a duration closest to the duration of the first audio clip as the second audio clip corresponding to the first audio clip, the method comprises:
comparing the size of the first time length of the first audio segment with the corresponding second time length of the second audio segment;
and if the first duration is different from the second duration, performing frame extraction or frame supplement processing on the second audio clip based on the time interval difference between the first duration and the second duration so as to keep the first duration and the second duration consistent.
4. The method of claim 3, wherein said step of performing frame extraction or frame supplement processing on said second audio segment based on the time interval difference between said first duration and said second duration comprises:
when the first time length is longer than the second time length, a preset smooth audio frame is inserted in the second audio segment according to a preset interval, wherein the insertion quantity of the preset smooth audio frame is determined by the time interval difference;
and when the first time length is less than the second time length, extracting the translated audio frames in the second audio clip according to a preset interval, wherein the extraction quantity of the translated audio frames is determined by the time difference.
5. The method of claim 4, wherein before the step of interpolating the predetermined smooth audio frames at predetermined intervals in the second audio piece, the method comprises:
acquiring adjacent audio frames of the interpolation positions of the preset smooth audio frames in the second audio clip;
and taking the adjacent audio frames or the average frames of the adjacent audio frames as the preset smooth audio frames.
6. The method for generating multilingual video according to claim 1, wherein said target video file further comprises an original subtitle file, and wherein said step of determining the target video file and the target language based on the predetermined user operation further comprises:
translating each first caption language segment in the original caption file to obtain each second caption language segment corresponding to the target language;
generating a target subtitle file based on a subtitle time axis in the original subtitle file and each second subtitle speech segment;
and loading the target subtitle file to the target video file.
7. The method for generating multilingual video according to claim 6, wherein the step of generating the target subtitle file based on the subtitle timeline in the original subtitle file and each of the second subtitle segments comprises:
and setting each second caption language segment in the position corresponding to the first caption language segment in the time axis according to the corresponding relation before and after translation to generate the target caption file.
8. A multilingual video-generating apparatus, comprising:
the acquisition module is used for determining a target video file and a target language based on preset user operation and acquiring original audio data of the target video file;
the extraction module is used for extracting a first audio clip set of an original language from the original audio data;
the translation module is used for translating the first audio clip set and generating a second audio clip set corresponding to the target language;
a replacing module, configured to replace each first audio clip in the first audio clip set in the original audio data with each second audio clip in the second audio clip set based on a correspondence between before and after translation to obtain target audio data;
and the covering module is used for loading the target audio data to the target video file and covering the original audio data so as to generate a video file corresponding to the target language.
9. A multilingual video generation apparatus, characterized in that it comprises: memory, processor and multilingual video generation program stored on said memory and executable on said processor, said multilingual video generation program implementing the steps of the multilingual video generation method according to any one of claims 1 to 7 when executed by said processor.
10. A computer-readable storage medium, characterized in that a multilingual video generation program is stored on the computer-readable storage medium, and when executed by a processor, implements the steps of the multilingual video generation method according to any one of claims 1 to 7.
CN202211356981.4A 2022-11-01 2022-11-01 Method, equipment and device for generating multilingual video and readable storage medium Pending CN115967840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211356981.4A CN115967840A (en) 2022-11-01 2022-11-01 Method, equipment and device for generating multilingual video and readable storage medium


Publications (1)

Publication Number Publication Date
CN115967840A true CN115967840A (en) 2023-04-14

Family

ID=87362538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211356981.4A Pending CN115967840A (en) 2022-11-01 2022-11-01 Method, equipment and device for generating multilingual video and readable storage medium

Country Status (1)

Country Link
CN (1) CN115967840A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination