CN112954453A - Video dubbing method and apparatus, storage medium, and electronic device - Google Patents

Video dubbing method and apparatus, storage medium, and electronic device

Info

Publication number
CN112954453A
Authority
CN
China
Prior art keywords
video
sub
dubbed
style
dubbing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110179770.7A
Other languages
Chinese (zh)
Other versions
CN112954453B (en)
Inventor
张同新
姚佳立
张昊宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110179770.7A priority Critical patent/CN112954453B/en
Publication of CN112954453A publication Critical patent/CN112954453A/en
Application granted granted Critical
Publication of CN112954453B publication Critical patent/CN112954453B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present disclosure relates to a video dubbing method and apparatus, a storage medium, and an electronic device. The method includes: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the script to be dubbed of a sub-video to be dubbed into a style prediction model, and acquiring a style label output by the style prediction model; and generating dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed. The present disclosure can make the dubbing of a video more lively and natural.

Description

Video dubbing method and apparatus, storage medium, and electronic device
Technical Field
The present disclosure relates to the field of video processing, and in particular, to a video dubbing method and apparatus, a storage medium, and an electronic device.
Background
Video is a common multimedia form, and with the rapid development of information technology, obtaining information through video has become a common part of daily life. Functions that automatically dub videos have appeared, but these automatic dubbing schemes generally cannot adjust the dubbing style according to the content of the video, so the resulting dubbing is not vivid or natural.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video dubbing method, the method comprising: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the script to be dubbed of a sub-video to be dubbed into a style prediction model, and acquiring a style label output by the style prediction model; and generating dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
In a second aspect, the present disclosure provides a video dubbing apparatus, the apparatus comprising: a scene determining module, configured to split a video to be dubbed into a plurality of sub-videos according to video scenes; a style determining module, configured to input the script to be dubbed of a sub-video to be dubbed into a style prediction model and acquire a style label output by the style prediction model; and a dubbing generation module, configured to generate dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including a storage device having a computer program stored thereon, and a processing device configured to execute the computer program to implement the steps of the method according to the first aspect of the present disclosure.
Through the above technical solutions, at least the following technical effects can be achieved:
The video to be dubbed is split into different sub-videos according to scenes, a style label is generated for each sub-video, and dubbing is generated for each sub-video based on its style label, so that sub-videos of different styles receive dubbing in different styles and the dubbing of the video is more natural and vivid.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
fig. 1 is a flow chart illustrating a method of dubbing video in accordance with an exemplary disclosed embodiment.
Fig. 2 is a block diagram illustrating a video dubbing apparatus according to an exemplary disclosed embodiment.
FIG. 3 is a block diagram illustrating an electronic device according to an exemplary disclosed embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" used in this disclosure are illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flow chart illustrating a method of dubbing video in accordance with an exemplary disclosed embodiment, the method comprising, as shown in fig. 1:
and S11, splitting the audio and video to be matched into a plurality of sub-videos according to the video scenes.
The video scene refers to a scene generated by the change of the preset shooting condition of the video, for example, different sub-mirrors can generate different video scenes, and different shooting environments can also generate different video scenes; the sub-video may be a video clip generated by dividing a time point at which a video scene changes as a division point, and for example, the video content when the shooting angle of the video is angle 1 and the video content when the shooting angle is angle 2 may be regarded as two different sub-videos, the video content when the shooting location of the video is location 1 and the video content when the shooting location is location 2 may be regarded as two different sub-videos, and the video content when the shooting object of the video is person 1 and the video content when the shooting object of the video is person 2 may be regarded as two different sub-videos.
In a possible implementation, a sub-video whose length is below a preset threshold may be merged with an adjacent sub-video to obtain a new sub-video. This reduces the number of scenes for which style prediction is needed, provides more features for style label generation, and improves the efficiency of style label generation.
For example, suppose a video is divided into 8 sub-videos with lengths of 10 s, 5 s, 7 s, 40 s, 8 s, 17 s, 3 s, and 22 s for sub-videos 1 through 8, respectively. With a preset threshold of 20 seconds, sub-videos 1, 2, and 3 may be merged into one sub-video, sub-video 4 may be kept as an independent sub-video, sub-videos 5 and 6 may be merged into one sub-video, and sub-videos 7 and 8 may be merged into one sub-video, as illustrated in the sketch below.
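Purely as an illustration, the greedy left-to-right merging sketched below in Python reproduces the example above; the SubVideo type, the 20-second default threshold, and the greedy policy itself are assumptions made for the sketch rather than details fixed by the disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubVideo:
    start: float  # seconds on the video timeline
    end: float

    @property
    def length(self) -> float:
        return self.end - self.start

def merge_short_sub_videos(sub_videos: List[SubVideo], min_length: float = 20.0) -> List[SubVideo]:
    """Greedily merge each sub-video shorter than `min_length` with its neighbor
    until every resulting segment reaches the threshold (or no neighbor is left)."""
    merged: List[SubVideo] = []
    for sv in sub_videos:
        if merged and merged[-1].length < min_length:
            # Extend the previous (still too short) segment instead of starting a new one.
            merged[-1] = SubVideo(merged[-1].start, sv.end)
        else:
            merged.append(sv)
    # If the final segment is still too short, fold it into the one before it.
    if len(merged) > 1 and merged[-1].length < min_length:
        merged[-2:] = [SubVideo(merged[-2].start, merged[-1].end)]
    return merged

# The 8 sub-videos from the example (lengths 10, 5, 7, 40, 8, 17, 3, 22 seconds)
# collapse into four segments of 22, 40, 25, and 25 seconds.
bounds = [0, 10, 15, 22, 62, 70, 87, 90, 112]
clips = [SubVideo(a, b) for a, b in zip(bounds, bounds[1:])]
print([round(c.length) for c in merge_short_sub_videos(clips)])  # [22, 40, 25, 25]
```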
In a possible implementation, the video may also be split into a preset number of scenes. For example, if the preset number of scenes is four, corresponding to a background portion, a narration portion, a highlight portion, and an ending portion, the video may be divided into four sub-videos based on the image content and/or text content of the video.
In a possible implementation, the video to be dubbed may be split into a plurality of sub-videos according to video scenes based on a scene segmentation model.
The training steps of the scene segmentation model are as follows:
inputting a sample video into the scene segmentation model to be trained, acquiring the segmentation points output by the model, and adjusting the parameters of the model based on the labeled segmentation points of the sample video, the segmentation points output by the model, and a preset loss function, until the difference between the labeled segmentation points and the segmentation points output by the model satisfies a training condition or the number of iterations satisfies a training condition.
The preset loss function penalizes the difference between the segmentation points output by the model and the labeled segmentation points.
The labeled segmentation points of the sample video may be determined according to the segmentation requirement. For example, when the video needs to be split by shot, the segmentation points may be labeled in advance according to the storyboard shots; when the video needs to be split according to its plot, the segmentation points may be labeled according to the plot, and a plot label may be added to each segmentation point, such as a background segmentation point, a narration segmentation point, a highlight segmentation point, or an ending segmentation point. To improve the accuracy of the scene segmentation model, the type of the sample videos may match the type of the video to be segmented: when the video to be segmented is a movie, the sample videos are also movies; when the video to be segmented is a popular-science video, the sample videos are also popular-science videos. In a possible implementation, multiple scene segmentation models can be trained in advance for different video categories, and the model corresponding to the category of the video to be segmented can be selected when splitting. A minimal training-loop sketch follows.
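The disclosure does not fix the model architecture or the exact form of the preset loss function; the sketch below assumes a PyTorch model that scores each frame as a boundary or not and uses binary cross-entropy as a stand-in for that loss, with a simple loss threshold standing in for the training condition. All of these choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_scene_split_model(model: nn.Module,
                            loader,                 # yields (frames, boundary_labels)
                            epochs: int = 10,
                            lr: float = 1e-4,
                            loss_threshold: float = 1e-3):
    """Minimal training loop for a scene-segmentation model.

    Assumes the model maps frame features [B, T, D] to per-frame boundary
    logits [B, T]; `boundary_labels` is a float tensor marking labeled
    split points with 1.0 and all other frames with 0.0.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # stand-in for the "preset loss function"
    for _ in range(epochs):
        total = 0.0
        for frames, boundary_labels in loader:
            optimizer.zero_grad()
            scores = model(frames)                    # [B, T] boundary logits
            loss = criterion(scores, boundary_labels) # penalizes predicted vs. labeled split points
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / max(len(loader), 1) < loss_threshold:
            break  # training condition: gap to the labeled split points is small enough
    return model
```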
In a possible implementation, when the video to be dubbed has corresponding script content, the script content can be split into scenes by a semantic discrimination model, with script passages expressing different content assigned to different scenes, and the video to be dubbed is then divided into a plurality of sub-videos according to the scenes into which the script has been divided.
S12, inputting the script to be dubbed of a sub-video to be dubbed into the style prediction model, and obtaining the style label output by the style prediction model.
The script to be dubbed corresponding to a sub-video can be obtained in the following two ways:
In the first way, subtitle content is recognized from the sub-video to be dubbed and used as the script to be dubbed. When the video has text subtitles, the script to be dubbed can be obtained by recognizing the subtitle content. Because the subtitle content corresponds to the video timeline, the timeline positions of the subtitle content are determined at the same time as the subtitle content is obtained, so that the dubbing audio can be added at the corresponding timeline positions after dubbing is finished.
In the second way, the script content of the video to be dubbed is acquired, the time information of the sub-video to be dubbed is acquired, and the script to be dubbed corresponding to the sub-video is determined from the script content based on the time information. When the video to be dubbed has corresponding script content, the part corresponding to the sub-video can be extracted from it directly. For example, when the script content carries timeline information, the time information of the sub-video can be extracted, the corresponding timeline range determined, and the script content within that range taken as the script to be dubbed; when the script content carries no timeline information, the time information of the sub-video can be extracted, the position of the sub-video within the video to be dubbed determined, the corresponding character position located in the script content, and the script content at that character position taken as the script to be dubbed. Both cases are illustrated in the sketch below.
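As a rough illustration of the second way, the Python sketch below selects a sub-video's script either by timeline overlap or, when no timeline information is available, by proportional character position; the ScriptLine type and the proportional fallback are assumptions made for the example, not requirements of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ScriptLine:
    text: str
    start: Optional[float] = None  # seconds on the video timeline, if available
    end: Optional[float] = None

def script_for_sub_video(script: List[ScriptLine],
                         sub_start: float, sub_end: float,
                         video_length: float) -> str:
    """Pick the part of the full script that belongs to one sub-video.

    If the script carries timeline information, keep the lines whose time range
    overlaps the sub-video; otherwise map the sub-video's position in the video
    to a character range in the concatenated script.
    """
    timed = [s for s in script if s.start is not None and s.end is not None]
    if timed:
        return " ".join(s.text for s in timed
                        if s.start < sub_end and s.end > sub_start)
    full_text = " ".join(s.text for s in script)
    lo = int(len(full_text) * sub_start / video_length)
    hi = int(len(full_text) * sub_end / video_length)
    return full_text[lo:hi]
```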
Through the style prediction model, the script to be dubbed can be tagged with style labels. The style labels may include emotion labels such as excited, happy, or sad; genre labels such as horror, comedy, or tragedy; and plot-trend labels such as highlight, ending, or exposition.
In one possible embodiment, the training step of the style prediction model is as follows:
inputting a sample text into the style prediction model to be trained, acquiring the style label output by the model, and adjusting the parameters of the model based on the labeled style label of the sample text, the style label output by the model, and a preset loss function, so that the style label output by the model approaches the labeled style label; training can be stopped when the difference between the two labels satisfies a preset condition or the number of training iterations reaches a preset number.
The preset loss function penalizes the difference between the style label output by the model and the labeled style label. A minimal stand-in for such a text classifier is sketched below.
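The disclosure does not prescribe a particular classifier, so the following sketch uses a TF-IDF plus logistic-regression pipeline as a stand-in for the style prediction model; the sample texts and labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled scripts -> style labels (illustrative only).
sample_texts = ["the hero finally wins the battle",
                "they said a quiet goodbye in the rain"]
sample_labels = ["excited", "sad"]

# A bag-of-words classifier stands in for the style prediction model; the
# cross-entropy inside LogisticRegression plays the role of the preset loss.
style_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
style_model.fit(sample_texts, sample_labels)

print(style_model.predict(["the hero wins"]))  # most likely ['excited'] given the toy data
```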
S13, generating dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
The dubbing audio can be generated by a dubbing model, program, or engine with a stylized dubbing function, and the selectable labels of the style prediction model can be set according to the style types supported by the dubbing program. For example, when the dubbing program supports the three style types happy, excited, and sad, the labels that the style prediction model can output may be mapped onto these three styles; several labels that all express happiness, for instance, would all correspond to the happy style type, as in the mapping sketched below.
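A hypothetical mapping of this kind might look as follows; the label names and the engine's style categories are placeholders, since the disclosure does not name a specific dubbing engine.

```python
# Hypothetical mapping from the style prediction model's labels to the coarser
# style categories a particular TTS/dubbing engine happens to support.
LABEL_TO_ENGINE_STYLE = {
    "joyful": "happy",
    "delighted": "happy",
    "cheerful": "happy",
    "thrilled": "excited",
    "tense": "excited",
    "sorrowful": "sad",
    "melancholy": "sad",
}

def engine_style(predicted_label: str, default: str = "neutral") -> str:
    """Collapse fine-grained style labels onto the dubbing engine's style set."""
    return LABEL_TO_ENGINE_STYLE.get(predicted_label, default)
```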
After the dubbing audio of a sub-video is generated, it can be inserted into the video at the position corresponding to that sub-video. Dubbing all of the sub-videos yields the complete dubbed video, in which the dubbing style of each scene matches the visual style of that scene, making the video more vivid and natural.
In a possible implementation, the video length corresponding to each piece of subtitle content may be determined, a dubbing speed may be determined based on that video length and the character length of the script to be dubbed, and the dubbing audio of the sub-video to be dubbed may be generated at that speed based on the style label and the script to be dubbed.
That is, when the script to be dubbed is obtained by recognizing text subtitles from the video, the length of video covered by each text subtitle can be determined, and the dubbing speed can be chosen from that video length and the script length so that the dubbing finishes within the time span of the subtitle. For example, when the video span is short, the dubbing speed may be increased; when it is long, the dubbing speed may be decreased or set to a preset dubbing speed, which may be a natural recitation speed obtained experimentally. One way to compute such a speed is sketched below.
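A minimal sketch of such a rate selection, assuming the rate is measured in characters per second and that the baseline and cap values are arbitrary placeholders:

```python
def dubbing_speed(text_length: int,
                  video_length_s: float,
                  natural_rate: float = 4.0,  # characters per second, assumed baseline
                  max_rate: float = 6.0) -> float:
    """Pick a speaking rate so the dubbed line fits inside its subtitle's video span.

    If reading at the natural rate would overrun the available time, speed up
    (capped at `max_rate`); otherwise keep the natural recitation rate.
    """
    if video_length_s <= 0:
        return natural_rate
    required = text_length / video_length_s  # rate needed to finish in time
    return min(max(required, natural_rate), max_rate)
```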
In a possible embodiment, the script content may be split into sentences to obtain a plurality of script sentences, the timeline information of each script sentence within the sub-video is determined based on the time information of the sub-video to be dubbed, and the text subtitle and the dubbing audio corresponding to each script sentence are added to the sub-video to be dubbed based on that sentence's timeline information.
That is, the pacing of the script content can be adjusted sentence by sentence, which makes the dubbing speed more natural and allows the text subtitles to be added to the video at reasonable points, avoiding the reading discomfort caused by subtitles that are too long or too short. A sketch of one such sentence-level timeline allocation follows.
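The sketch below splits a sub-video's script on common sentence-ending punctuation and spreads the sentences over the sub-video's time range in proportion to their length; this proportional allocation is only one possible way to assign the timeline information the disclosure calls for.

```python
import re
from typing import List, Tuple

def allocate_sentence_timeline(script: str,
                               sub_start: float,
                               sub_end: float) -> List[Tuple[str, float, float]]:
    """Split a sub-video's script into sentences and spread them over the
    sub-video's time range in proportion to their character length.

    Returns (sentence, start_time, end_time) triples that can drive both the
    subtitle track and the per-sentence dubbing audio.
    """
    sentences = [s.strip() for s in re.split(r"[。！？.!?]+", script) if s.strip()]
    total_chars = sum(len(s) for s in sentences) or 1
    duration = sub_end - sub_start
    timeline, cursor = [], sub_start
    for s in sentences:
        span = duration * len(s) / total_chars
        timeline.append((s, cursor, cursor + span))
        cursor += span
    return timeline
```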
In a possible implementation, the video to be dubbed can be split into a plurality of sub-videos by the scene segmentation model, a scene label obtained for each sub-video, and both the script to be dubbed of a sub-video and the scene label of that sub-video input into the style prediction model to obtain the style label output by the style prediction model.
In this case the scene segmentation model not only splits the video into sub-videos but also adds scene labels to them. For example, labels such as "office building", "forest", or "street" can be added according to the background, or labels such as "background", "highlight", or "ending" can be added according to the plot. The style prediction model then generates the style label from both the scene label and the script to be dubbed, so the generated style label represents the style of the video more accurately.
Through the above technical solutions, at least the following technical effects can be achieved:
The video to be dubbed is split into different sub-videos according to scenes, a style label is generated for each sub-video, and dubbing is generated for each sub-video based on its style label, so that sub-videos of different styles receive dubbing in different styles and the dubbing of the video is more natural and vivid.
Fig. 2 is a block diagram illustrating a video dubbing apparatus 200 according to an exemplary disclosed embodiment. As shown in fig. 2, the apparatus 200 comprises:
A scene determining module 210, configured to split the video to be dubbed into a plurality of sub-videos according to video scenes.
A style determining module 220, configured to input the script to be dubbed of a sub-video to be dubbed into the style prediction model and obtain the style label output by the style prediction model.
A dubbing generating module 230, configured to generate the dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
In a possible implementation, the apparatus further includes a text recognition module, configured to recognize subtitle content from the sub-video to be dubbed and use the subtitle content as the script to be dubbed.
In a possible implementation, the apparatus further includes a script acquisition module, configured to acquire the script content of the video to be dubbed; acquire the time information of the sub-video to be dubbed; and determine the script to be dubbed corresponding to the sub-video from the script content based on the time information.
In a possible implementation, the scene determining module 210 is configured to split the video to be dubbed into a plurality of sub-videos through the scene segmentation model and obtain a scene label for each sub-video; the style determining module 220 is configured to input the script to be dubbed of the sub-video and the scene label of the sub-video into the style prediction model and obtain the style label output by the style prediction model.
In a possible implementation, the apparatus further includes a length determining module, configured to determine the video length corresponding to each piece of subtitle content; the dubbing generating module 230 is configured to determine the dubbing speed based on the video length corresponding to each piece of subtitle content and the character length of the script to be dubbed, and to generate the dubbing audio of the sub-video to be dubbed at that speed based on the style label and the script to be dubbed.
In a possible implementation, the apparatus further includes a time determining module, configured to split the script content into sentences to obtain a plurality of script sentences; determine the timeline information of each script sentence within the sub-video based on the time information of the sub-video to be dubbed; and add the text subtitle corresponding to each script sentence and the dubbing audio corresponding to that sentence to the sub-video to be dubbed based on the sentence's timeline information.
In a possible implementation, the scene determining module 210 is configured to split the video to be dubbed into a plurality of sub-videos according to video scenes based on the scene segmentation model. The training steps of the scene segmentation model are as follows: inputting a sample video into the scene segmentation model to be trained, acquiring the segmentation points output by the model, and adjusting the parameters of the model based on the labeled segmentation points of the sample video, the segmentation points output by the model, and a first loss function.
In a possible embodiment, the training steps of the style prediction model are as follows: inputting a sample text into the style prediction model to be trained, acquiring the style label output by the model, and adjusting the parameters of the model based on the labeled style label of the sample text, the style label output by the model, and a second loss function.
The steps performed by each module have been described in detail in the corresponding method embodiments and are not repeated here.
Through the above technical solutions, at least the following technical effects can be achieved:
The video to be dubbed is split into different sub-videos according to scenes, a style label is generated for each sub-video, and dubbing is generated for each sub-video based on its style label, so that sub-videos of different styles receive dubbing in different styles and the dubbing of the video is more natural and vivid.
Referring now to FIG. 3, a block diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic device 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 309, or installed from the storage means 308, or installed from the ROM 302. The computer program, when executed by the processing device 301, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some implementations, electronic devices may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a video dubbing method comprising: splitting a video to be dubbed into a plurality of sub-videos according to video scenes; inputting the script to be dubbed of a sub-video to be dubbed into a style prediction model, and acquiring a style label output by the style prediction model; and generating dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
Example 2 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure: recognizing subtitle content from the sub-video to be dubbed and using the subtitle content as the script to be dubbed.
Example 3 provides the method of example 1, further comprising, in accordance with one or more embodiments of the present disclosure: acquiring the script content of the video to be dubbed; acquiring the time information of the sub-video to be dubbed; and determining the script to be dubbed corresponding to the sub-video from the script content based on the time information.
Example 4 provides the method of example 1, wherein splitting the video to be dubbed into a plurality of sub-videos according to video scenes includes: splitting the video to be dubbed into a plurality of sub-videos through a scene segmentation model, and acquiring a scene label of each sub-video; and wherein inputting the script to be dubbed of the sub-video to be dubbed into the style prediction model and acquiring the style label output by the style prediction model includes: inputting the script to be dubbed of the sub-video and the scene label of the sub-video into the style prediction model, and acquiring the style label output by the style prediction model.
Example 5 provides the method of example 2, further comprising, in accordance with one or more embodiments of the present disclosure: determining the video length corresponding to each piece of subtitle content; wherein generating the dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed includes: determining a dubbing speed based on the video length corresponding to each piece of subtitle content and the character length of the script to be dubbed; and generating the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the script to be dubbed.
Example 6 provides the method of example 3, further comprising, in accordance with one or more embodiments of the present disclosure: splitting the script content into sentences to obtain a plurality of script sentences; determining the timeline information of each script sentence within the sub-video based on the time information of the sub-video to be dubbed; and adding the text subtitle corresponding to each script sentence and the dubbing audio corresponding to that sentence to the sub-video to be dubbed based on the sentence's timeline information.
Example 7 provides the method of example 1, wherein splitting the video to be dubbed into a plurality of sub-videos according to video scenes includes: splitting the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model; the training steps of the scene segmentation model are as follows: inputting a sample video into the scene segmentation model to be trained, acquiring the segmentation points output by the model, and adjusting the parameters of the model based on the labeled segmentation points of the sample video, the segmentation points output by the model, and a first loss function.
Example 8 provides the method of example 1, wherein the training steps of the style prediction model are as follows: inputting a sample text into the style prediction model to be trained, acquiring the style label output by the model, and adjusting the parameters of the model based on the labeled style label of the sample text, the style label output by the model, and a second loss function.
Example 9 provides, in accordance with one or more embodiments of the present disclosure, a video dubbing apparatus comprising: a scene determining module, configured to split a video to be dubbed into a plurality of sub-videos according to video scenes; a style determining module, configured to input the script to be dubbed of a sub-video to be dubbed into a style prediction model and acquire a style label output by the style prediction model; and a dubbing generation module, configured to generate dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
Example 10 provides the apparatus of example 9, further including, in accordance with one or more embodiments of the present disclosure, a text recognition module configured to recognize subtitle content from the sub-video to be dubbed and use it as the script to be dubbed.
Example 11 provides the apparatus of example 9, further including, in accordance with one or more embodiments of the present disclosure, a script acquisition module configured to acquire the script content of the video to be dubbed; acquire the time information of the sub-video to be dubbed; and determine the script to be dubbed corresponding to the sub-video from the script content based on the time information.
Example 12 provides the apparatus of example 9, in accordance with one or more embodiments of the present disclosure, wherein the scene determining module is configured to split the video to be dubbed into a plurality of sub-videos through a scene segmentation model and obtain a scene label of each sub-video; and the style determining module is configured to input the script to be dubbed of the sub-video and the scene label of the sub-video into the style prediction model and acquire the style label output by the style prediction model.
Example 13 provides the apparatus of example 10, further including, in accordance with one or more embodiments of the present disclosure, a length determining module configured to determine the video length corresponding to each piece of subtitle content; wherein the dubbing generation module is configured to determine a dubbing speed based on the video length corresponding to each piece of subtitle content and the character length of the script to be dubbed, and to generate the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the script to be dubbed.
Example 14 provides the apparatus of example 11, further including, in accordance with one or more embodiments of the present disclosure, a time determining module configured to split the script content into sentences to obtain a plurality of script sentences; determine the timeline information of each script sentence within the sub-video based on the time information of the sub-video to be dubbed; and add the text subtitle corresponding to each script sentence and the dubbing audio corresponding to that sentence to the sub-video to be dubbed based on the sentence's timeline information.
Example 15 provides the apparatus of example 9, in accordance with one or more embodiments of the present disclosure, wherein the scene determining module is configured to split the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model; and the training steps of the scene segmentation model are as follows: inputting a sample video into the scene segmentation model to be trained, acquiring the segmentation points output by the model, and adjusting the parameters of the model based on the labeled segmentation points of the sample video, the segmentation points output by the model, and a first loss function.
Example 16 provides the apparatus of example 9, wherein the training steps of the style prediction model are as follows: inputting a sample text into the style prediction model to be trained, acquiring the style label output by the model, and adjusting the parameters of the model based on the labeled style label of the sample text, the style label output by the model, and a second loss function.
The foregoing description is only illustrative of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, a technical solution formed by replacing the above features with (but not limited to) features with similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (11)

1. A video dubbing method, the method comprising:
splitting a video to be dubbed into a plurality of sub-videos according to video scenes;
inputting the script to be dubbed of a sub-video to be dubbed into a style prediction model, and acquiring a style label output by the style prediction model; and
generating dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
2. The method of claim 1, further comprising:
recognizing subtitle content from the sub-video to be dubbed, and using the subtitle content as the script to be dubbed.
3. The method of claim 1, further comprising:
acquiring the script content of the video to be dubbed;
acquiring time information of the sub-video to be dubbed; and
determining the script to be dubbed corresponding to the sub-video to be dubbed from the script content based on the time information.
4. The method according to claim 1, wherein splitting the video to be dubbed into a plurality of sub-videos according to video scenes comprises:
splitting the video to be dubbed into a plurality of sub-videos through a scene segmentation model, and acquiring a scene label of each sub-video;
and wherein inputting the script to be dubbed of the sub-video to be dubbed into the style prediction model and acquiring the style label output by the style prediction model comprises:
inputting the script to be dubbed of the sub-video to be dubbed and the scene label of the sub-video into the style prediction model, and acquiring the style label output by the style prediction model.
5. The method of claim 2, further comprising:
determining the video length corresponding to each piece of subtitle content;
wherein generating dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed comprises:
determining a dubbing speed based on the video length corresponding to each piece of subtitle content and the character length of the script to be dubbed; and
generating the dubbing audio of the sub-video to be dubbed at the dubbing speed based on the style label and the script to be dubbed.
6. The method of claim 3, further comprising:
splitting the script content into sentences to obtain a plurality of script sentences;
determining timeline information of each script sentence within the sub-video based on the time information of the sub-video to be dubbed; and
adding a text subtitle corresponding to each script sentence and dubbing audio corresponding to that script sentence to the sub-video to be dubbed based on the timeline information of the script sentence.
7. The method according to claim 1, wherein splitting the video to be dubbed into a plurality of sub-videos according to video scenes comprises:
splitting the video to be dubbed into a plurality of sub-videos according to video scenes based on a scene segmentation model;
wherein the training steps of the scene segmentation model are as follows:
inputting a sample video into the scene segmentation model to be trained, acquiring segmentation points output by the scene segmentation model, and adjusting parameters of the scene segmentation model based on labeled segmentation points of the sample video, the segmentation points output by the scene segmentation model, and a first loss function.
8. The method of claim 1, wherein the style prediction model is trained by:
inputting a sample text into the style prediction model to be trained, acquiring a style label output by the style prediction model, and adjusting parameters of the style prediction model based on a labeled style label of the sample text, the style label output by the style prediction model, and a second loss function.
9. A video dubbing apparatus, the apparatus comprising:
a scene determining module, configured to split a video to be dubbed into a plurality of sub-videos according to video scenes;
a style determining module, configured to input the script to be dubbed of a sub-video to be dubbed into a style prediction model and acquire a style label output by the style prediction model; and
a dubbing generation module, configured to generate dubbing audio of the sub-video to be dubbed based on the style label and the script to be dubbed.
10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 8.
CN202110179770.7A 2021-02-07 2021-02-07 Video dubbing method and device, storage medium and electronic equipment Active CN112954453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110179770.7A CN112954453B (en) 2021-02-07 2021-02-07 Video dubbing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110179770.7A CN112954453B (en) 2021-02-07 2021-02-07 Video dubbing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112954453A true CN112954453A (en) 2021-06-11
CN112954453B CN112954453B (en) 2023-04-28

Family

ID=76244958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110179770.7A Active CN112954453B (en) 2021-02-07 2021-02-07 Video dubbing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112954453B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018120362A (en) * 2017-01-24 2018-08-02 日本放送協会 Scene variation point model learning device, scene variation point detection device and programs thereof
CN109391842A (en) * 2018-11-16 2019-02-26 维沃移动通信有限公司 A kind of dubbing method, mobile terminal
CN110971969A (en) * 2019-12-09 2020-04-07 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable storage medium
CN111031386A (en) * 2019-12-17 2020-04-17 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium
CN112188117A (en) * 2020-08-29 2021-01-05 上海量明科技发展有限公司 Video synthesis method, client and system
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114205671A (en) * 2022-01-17 2022-03-18 百度在线网络技术(北京)有限公司 Video content editing method and device based on scene alignment
CN114938473A (en) * 2022-05-16 2022-08-23 上海幻电信息科技有限公司 Comment video generation method and device
CN114938473B (en) * 2022-05-16 2023-12-12 上海幻电信息科技有限公司 Comment video generation method and comment video generation device

Also Published As

Publication number Publication date
CN112954453B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111970577A (en) Subtitle editing method and device and electronic equipment
CN113259740A (en) Multimedia processing method, device, equipment and medium
CN110969012A (en) Text error correction method and device, storage medium and electronic equipment
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN112397104B (en) Audio and text synchronization method and device, readable medium and electronic equipment
CN113778419B (en) Method and device for generating multimedia data, readable medium and electronic equipment
CN112929746A (en) Video generation method and device, storage medium and electronic equipment
CN109815448B (en) Slide generation method and device
CN112954453B (en) Video dubbing method and device, storage medium and electronic equipment
CN112380365A (en) Multimedia subtitle interaction method, device, equipment and medium
CN113886612A (en) Multimedia browsing method, device, equipment and medium
CN111897950A (en) Method and apparatus for generating information
CN113889113A (en) Sentence dividing method and device, storage medium and electronic equipment
CN113011169A (en) Conference summary processing method, device, equipment and medium
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN113628097A (en) Image special effect configuration method, image recognition method, image special effect configuration device and electronic equipment
CN112309389A (en) Information interaction method and device
CN112530472B (en) Audio and text synchronization method and device, readable medium and electronic equipment
CN115379136A (en) Special effect prop processing method and device, electronic equipment and storage medium
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
CN113885741A (en) Multimedia processing method, device, equipment and medium
CN112905838A (en) Information retrieval method and device, storage medium and electronic equipment
CN115967833A (en) Video generation method, device and equipment meter storage medium
CN109889737B (en) Method and apparatus for generating video
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant