CN111538851B - Method, system, equipment and storage medium for automatically generating demonstration video - Google Patents

Method, system, equipment and storage medium for automatically generating demonstration video

Info

Publication number
CN111538851B
CN111538851B (application CN202010301638.4A)
Authority
CN
China
Prior art keywords: slide, video, text, display, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010301638.4A
Other languages
Chinese (zh)
Other versions
CN111538851A (en)
Inventor
焦金珂
李健
武卫东
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202010301638.4A
Publication of CN111538851A
Application granted
Publication of CN111538851B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/438 Presentation of query results
    • G06F16/4387 Presentation of query results by the use of playlists
    • G06F16/4393 Multimedia presentations, e.g. slide shows, multimedia albums
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/04 Synchronising
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling

Abstract

The application provides a method, a system, a device and a storage medium for automatically generating a presentation video, in the technical field of data processing. The goal is to turn a PPT file automatically into a video with dubbing and subtitles, without manually recording explanatory speech. The method comprises the following steps: taking each slide in the presentation in turn as the current slide; generating an audio clip from the display text and the remark explanation text of the current slide; setting the display duration of the current slide in the video according to the duration of the audio clip; concatenating the slides in order, each for its corresponding display duration, to obtain a silent video; and adding the audio clip corresponding to each slide to the silent video in order, to generate the presentation video.

Description

Method, system, equipment and storage medium for automatically generating demonstration video
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, a system, an apparatus, and a storage medium for automatically generating a presentation video.
Background
In scenarios such as online classes, large lectures and large conferences, explanatory content is played back as video. In a classroom scenario, a teacher prepares the teaching content in a PPT file in advance, presents it as charts, text and their combinations, and then produces the classroom video by playing the PPT file while recording voice. In a large-conference or lecture scenario, a user likewise edits the report or lecture content into a PPT file in advance, presents it in chart form and so on, and records the lecture or report video by playing the PPT file while recording voice. Moreover, PowerPoint's built-in video-export function also requires audio to be inserted into the PPT file, or narration to be recorded in advance, and it places high demands on the PowerPoint version.
These ways of obtaining a video from a PPT file require third-party screen-recording software, or require the user to click through the PPT manually, showing it page by page while a speaker talks over it. This is tedious and consumes both time and effort.
Disclosure of Invention
The embodiments of the present application provide a method, a system, a device and a storage medium for automatically generating a presentation video, aiming to turn a PPT file automatically into a video with dubbing and subtitles, without manual click-through presentation and without manually recorded explanatory speech.
A first aspect of the embodiments of the present application provides a method for automatically generating a presentation video, the method comprising:
taking each slide in the presentation in turn as the current slide;
generating an audio clip from the display text and the remark explanation text of the current slide;
setting the display duration of the current slide in the video according to the duration of the audio clip;
concatenating the slides in order, each for its corresponding display duration, to obtain a silent video;
and adding the audio clip corresponding to each slide to the silent video in order, to generate the presentation video.
Optionally, the method further comprises:
reading and recording the total number of pages of the presentation;
converting the current slide into a picture;
numbering the picture with the page number of the current slide;
numbering the text content with the page number of the current slide, the text content being extracted from at least one of the display text and the remark explanation text of the current slide;
taking the number of the text content as the number of the audio clip;
wherein setting the display duration of the current slide in the video according to the duration of the audio clip comprises:
setting the frame count of the picture according to the duration of the audio clip;
taking the frame-count-edited picture as display picture frames;
and wherein concatenating the slides in order, each for its corresponding display duration, to obtain the silent video comprises:
when the numbers of all audio clips and the numbers of all pictures correspond one to one, using the picture numbers as the display order of the display picture frames;
concatenating the display picture frames corresponding to each slide in that display order to obtain the silent video.
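The frame-count editing above can be illustrated with a small sketch. This is an illustrative model only, not the patent's implementation: the frame rate, function names and rounding rule are assumptions. The display duration of a slide is realized by repeating its picture for duration × fps frames:

```python
import math

def frames_for_clip(audio_duration_s: float, fps: int = 25) -> int:
    """Number of video frames a slide's picture must fill so that the
    slide stays on screen exactly as long as its audio clip."""
    # Round up so the picture never disappears before its audio ends.
    return math.ceil(audio_duration_s * fps)

def build_silent_frame_sequence(durations_s, fps=25):
    """Given per-slide audio durations (in page order), return the list of
    picture numbers making up the silent video, one entry per frame."""
    sequence = []
    for page_number, duration in enumerate(durations_s, start=1):
        sequence.extend([page_number] * frames_for_clip(duration, fps))
    return sequence
```

Because the frame sequence is built strictly in page-number order, the one-to-one numbering of pictures and audio clips described above is what keeps picture and sound aligned.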
Optionally, numbering the text content with the page number of the current slide comprises:
when the current slide has neither display text nor remark explanation text, using a no-content mark as the text content;
numbering the no-content mark with the page number of the current slide;
and taking the number of the text content as the number of the audio clip comprises:
when the text content is the no-content mark, generating a silent audio clip of preset duration and taking the number of the no-content mark as the number of the silent audio clip.
Optionally, after generating the audio clip from the display text and the remark explanation text of the current slide, the method further comprises:
obtaining a first start timestamp and a first end timestamp of each audio clip from the durations of the audio clips corresponding to the slides and the total number of pages;
wherein setting the display duration of the current slide in the video according to the duration of the audio clip comprises:
obtaining a second start timestamp and a second end timestamp of the current slide in the video from the first start timestamp and the first end timestamp of its audio clip;
and adding the audio clip corresponding to each slide to the silent video in order, to generate the presentation video, comprises:
associating, in page-number order, the second start timestamp of each slide in the silent video with the first start timestamp of the corresponding audio clip, and the second end timestamp of each slide with the first end timestamp of the corresponding audio clip;
adding the audio clip corresponding to each slide to the associated silent video to generate the presentation video.
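One way to read the timestamp bookkeeping above: since the slides play back to back, the first start and end timestamps of the audio clips are simply cumulative sums of the clip durations, and the second (video) timestamps of each slide coincide with them. A minimal sketch, with illustrative function names:

```python
def clip_timestamps(durations_s):
    """Return (start, end) timestamps for each audio clip, in page order.
    Clip k starts where clip k-1 ends, so audio and picture stay aligned."""
    timestamps = []
    t = 0.0
    for d in durations_s:
        timestamps.append((t, t + d))
        t += d
    return timestamps

def slide_timestamps(durations_s):
    """The slide's display interval in the video equals its clip's interval."""
    return clip_timestamps(durations_s)
```

With this association in place, muxing the audio clips into the silent video cannot drift, because both streams share the same interval boundaries.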
Optionally, the method further comprises:
dividing the numbered text content into a plurality of display sentences according to the positions of the punctuation marks in the text content;
obtaining a third start timestamp and a third end timestamp for each display sentence from the first start timestamp and the first end timestamp of the audio clip;
arranging the display sentences in order by their third start and end timestamps to obtain a subtitle file corresponding to the audio clip;
taking the number of the text content as the number of the subtitle file;
adding the subtitle file corresponding to each slide to the presentation video according to the subtitle-file number, to obtain a presentation video with subtitles.
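The subtitle step can be sketched as follows: split the text content on punctuation, then give each sentence a share of the clip's interval. The proportional-to-length allocation rule here is an illustrative heuristic, not the patent's specified method:

```python
import re

def split_sentences(text):
    """Split text content into display sentences at punctuation marks
    (both ASCII and CJK punctuation)."""
    parts = re.split(r'(?<=[。！？!?.;；,，])\s*', text)
    return [p for p in parts if p]

def sentence_timestamps(sentences, clip_start, clip_end):
    """Allocate the clip's [start, end] interval across the sentences,
    proportionally to sentence length (an illustrative heuristic)."""
    total = sum(len(s) for s in sentences)
    spans, t = [], clip_start
    for s in sentences:
        dt = (clip_end - clip_start) * len(s) / total
        spans.append((t, t + dt))
        t += dt
    return spans
```

The resulting per-sentence (third) timestamps can then be serialized in a subtitle format such as SRT and numbered with the text content's number, as described above.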
A second aspect of the embodiments of the present application provides a system for automatically generating a presentation video, the system comprising:
a text extraction and analysis module, configured to take each slide in the presentation in turn as the current slide;
a speech synthesis module, configured to generate an audio clip from the display text and the remark explanation text of the current slide;
a video generation module, configured to set the display duration of the current slide in the video according to the duration of the audio clip;
the video generation module is further configured to concatenate the slides in order, each for its corresponding display duration, to obtain a silent video;
the video generation module is further configured to add the audio clip corresponding to each slide to the silent video in order, to generate the presentation video.
Optionally, the text extraction and analysis module is further configured to read and record the total number of pages of the presentation;
the system for automatically generating a presentation video further comprises:
a picture generation module, configured to read and record the total number of pages of the presentation;
the picture generation module is further configured to convert the current slide into a picture;
the picture generation module is further configured to number the picture with the page number of the current slide;
the text extraction and analysis module is further configured to number the text content with the page number of the current slide, the text content being extracted from at least one of the display text and the remark explanation text of the current slide;
the speech synthesis module is further configured to take the number of the text content as the number of the audio clip;
the video generation module is further configured to set the frame count of the picture according to the duration of the audio clip;
the video generation module is further configured to take the frame-count-edited picture as display picture frames;
the video generation module is further configured to use the picture numbers as the display order of the display picture frames when the numbers of all audio clips and the numbers of all pictures correspond one to one;
the video generation module is further configured to concatenate the display picture frames corresponding to each slide in that display order to obtain the silent video.
Optionally, the text extraction and analysis module is further configured to use a no-content mark as the text content when the current slide has neither display text nor remark explanation text;
the text extraction and analysis module is further configured to number the no-content mark with the page number of the current slide;
the speech synthesis module is further configured to generate a silent audio clip of preset duration when the text content is the no-content mark, and to take the number of the no-content mark as the number of the silent audio clip.
Optionally, in the system for automatically generating a presentation video:
the speech synthesis module is further configured to obtain a first start timestamp and a first end timestamp of each audio clip from the durations of the audio clips corresponding to the slides and the total number of pages;
the video generation module is further configured to obtain a second start timestamp and a second end timestamp of the current slide in the video from the first start timestamp and the first end timestamp of its audio clip;
the video generation module is further configured to associate, in page-number order, the second start timestamp of each slide in the silent video with the first start timestamp of the corresponding audio clip, and the second end timestamp of each slide with the first end timestamp of the corresponding audio clip;
the video generation module is further configured to add the audio clip corresponding to each slide to the associated silent video to generate the presentation video.
Optionally, the system for automatically generating a presentation video further comprises:
a subtitle generation module, configured to divide the numbered text content into a plurality of display sentences according to the positions of the punctuation marks in the text content;
the subtitle generation module is further configured to obtain a third start timestamp and a third end timestamp for each display sentence from the first start timestamp and the first end timestamp of the audio clip;
the video generation module is further configured to arrange the display sentences in order by their third start and end timestamps to obtain a subtitle file corresponding to the audio clip;
the video generation module is further configured to take the number of the text content as the number of the subtitle file;
the video generation module is further configured to add the subtitle file corresponding to each slide to the presentation video according to the subtitle-file number, to obtain a presentation video with subtitles.
A third aspect of the embodiments of the present application provides a readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect of the present application when executing the program.
In the method for automatically generating a presentation video provided by the embodiments of the present application, each slide is converted into a picture to obtain the display picture frames shown in the video, and the display text and remark text of each slide are extracted as text content, from which an audio clip is generated. The picture, audio clip and subtitle file derived from the same slide are associated in the page-number order of the presentation, so that the images, dubbing and subtitles of the synthesized presentation video correspond to one another and remain coherent.
Because the duration of the audio clip corresponding to each slide is used as that slide's display duration in the video, each explanatory audio clip stays in the video exactly as long as its slide does, so the spoken explanation and the image content are synchronized while each slide is displayed.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of the steps of a method for automatically generating a presentation video according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a system for automatically generating a presentation video according to an embodiment of the present application;
FIG. 3 is a diagram showing the effect of pictures corresponding to each slide in a presentation video synthesized according to the embodiment of the present application;
fig. 4 is a flowchart illustrating steps of a method for adding subtitles according to an embodiment of the present application;
fig. 5 is a flowchart of automatically generating a presentation video according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the prior art, a PPT (PowerPoint) file is converted into a video mainly by one of the following two schemes:
1. Using the video-export function built into PowerPoint. This function requires a relatively recent software version and is not universally applicable. Taking PowerPoint 2016 as an example, to export a video file with dubbing, a prepared audio clip must be inserted into every PPT page, or PowerPoint's recording function must be used first, narrating while the slide show plays, before the export function can be used. This scheme is cumbersome, recording narration is time-consuming and requires a microphone, and the software-version requirement is high.
2. Using third-party screen-recording software. This requires an additional third-party application and a microphone, and the recording requires running the PPT slide show while the presenter speaks in sync. This scheme depends strongly on third-party applications and is likewise tedious, time-consuming and largely manual.
Neither scheme can generate the video automatically or add dubbing and subtitles automatically, so both depend on software and on microphone hardware; in addition, both are slow to operate and require manual dubbing and recording, consuming considerable manpower.
In view of these problems, an embodiment of the present application provides a method for automatically generating a presentation video, applied to a system for automatically generating a presentation video: a PPT file is input into the system, and a video with dubbing and subtitles is generated automatically.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a method for automatically generating a presentation video according to an embodiment of the present application.
Step S11: taking each slide in the presentation in turn as the current slide;
here, the current slide is the slide whose text is currently being extracted, the slide whose text corresponds to the audio clip currently being synthesized, or the slide whose content is currently being rendered as an image in the synthesized video.
Step S12: generating an audio clip from the display text and the remark explanation text of the current slide;
the remark explanation text is the text in the notes area shown when a slide is selected in the PowerPoint software. It is entered by the user to explain or expand on the content of each slide page.
The display text is the text a viewer can see when the slide is shown, excluding text inside tables.
Step S13: setting the display duration of the current slide in the video according to the duration of the audio clip;
using the duration of the audio clip corresponding to a slide as that slide's display duration in the video guarantees that picture and speech stay synchronized.
The display duration is how long a single slide page remains on screen in the video.
Step S14: concatenating the slides in order, each for its corresponding display duration, to obtain a silent video;
Step S15: adding the audio clip corresponding to each slide to the silent video in order, to generate the presentation video.
A presentation video is a video, generated from a presentation file (PPT file), that presents spoken content in a classroom, lecture hall or large conference.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a system for automatically generating a presentation video according to an embodiment of the present application. The process of the present application for obtaining a presentation video is described in detail with reference to fig. 2.
While processing the presentation, the system for automatically generating a presentation video must handle the slides page by page: it must derive text content from each slide page and also convert each slide page into a picture. Therefore, in the embodiments of the present application, each slide page is treated as the current slide in page-number order.
A PPT file is a slide presentation file; it may be a teacher's courseware or the slides a presenter shows during a talk.
Under Windows, PowerPoint is the tool software for creating presentations; multimedia such as tables, images, sound and text files can be inserted into the slides. Making a presentation with PowerPoint takes little time; it is simple to operate, easy to modify and easy to share. PowerPoint supports interaction through hyperlinks, so it offers some interactivity and is suitable for courseware with modest animation requirements.
The PPT file input into the system for automatically generating a presentation video is in ppt (or pptx) format.
Reading and recording the total number of pages of the presentation:
after receiving the input presentation, the picture generation module records its total number of pages. The total page count may be obtained by reading the presentation's configuration file.
A presentation here means a courseware, lecture or report file input in ppt or pptx format.
Converting the current slide into a picture, and numbering the picture with the page number of the current slide:
after the total page count has been read, the picture generation module renders the slides of the presentation into pictures page by page, in page-number order. The picture format is a common one such as jpg or png.
To keep the slide ordering intact, the picture generation module numbers each picture with the page number of the slide it came from.
For example, the first slide yields picture 1, the second slide yields picture 2, the Nth slide yields picture N, and so on.
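The page-number-based numbering can be made concrete with a small helper. The file-name convention and zero-padding here are illustrative choices, not part of the patent:

```python
def numbered_name(kind, page_number, total_pages, ext):
    """Name an artifact (picture, text, or audio clip) after its slide's
    page number. Zero-padding to the width of the total page count keeps
    lexicographic order identical to page order."""
    width = len(str(total_pages))
    return f"{kind}_{page_number:0{width}d}.{ext}"
```

For example, `numbered_name("picture", 3, 100, "png")` yields `picture_003.png`, so the picture, text content and audio clip of the same slide all carry the same number.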
The presentation is input to the picture generation module and also to the text extraction and analysis module.
Reading and recording the total number of pages of the presentation:
the text extraction and analysis module records the total page count of the input presentation in the same way as the picture generation module, to ensure that the text extracted from the presentation can be matched to the pictures generated from it. Because text extraction and picture generation are carried out independently by two modules of the system, numbering both the text content and the pictures with the presentation's page numbers guarantees their completeness and their one-to-one correspondence.
For example, for a 100-page presentation, converting every slide yields 100 pictures, and extracting the text of every slide yields 100 corresponding pieces of text content; the text content is numbered 1 to 100 and the pictures are numbered 1 to 100, and each piece of text content corresponds to the picture with the same number.
Numbering the text content with the page number of the current slide, the text content being extracted from at least one of the display text and the remark explanation text of the current slide:
when extracting text content from the slides page by page, the text extraction and analysis module ignores text inside pictures and tables.
If a slide page contains a large amount of text, or its remark explanation text is long, the text content extracted from that page is summarized.
A common way to judge "too much text" is to preset a word-count threshold based on the text distribution of the whole presentation; when the total number of extracted words exceeds the threshold, the text is summarized to simplify the text content.
The total number of words is the total count, for a single slide, of the display text plus the text in its notes column.
Summarization proceeds as follows:
1. Extract the specially formatted text in the slide, such as sentences or paragraphs highlighted with a special color, a larger font, or bold type. Specially formatted text is generally the slide's display text; the extracted display text carries a display mark, which is read to decide whether the current text content is specially formatted.
2. Use natural-language-understanding techniques such as text analysis to identify the central idea of the extracted text and compose fluent sentences and paragraphs from it, for example by extracting keywords and forming key sentences. The resulting fluent sentences and paragraphs are then used as the text content of that slide page.
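The second route might look like the following toy sketch. The scoring rule (rank sentences by how many of the text's most frequent words they contain) is a stand-in for the natural-language-understanding means the text mentions, not the patent's actual algorithm:

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Toy extractive summary: keep the sentences whose words are most
    frequent in the whole text, preserving their original order."""
    sentences = [s for s in re.split(r'(?<=[.!?。！？])\s*', text) if s]
    freq = Counter(re.findall(r'\w+', text.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in
                                       re.findall(r'\w+', sentences[i].lower())))
    keep = sorted(ranked[:max_sentences])  # restore original order
    return ' '.join(sentences[i] for i in keep)
```

A real system would apply such a step only when the word count exceeds the preset threshold described above; here the threshold check is omitted for brevity.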
When the current slide has neither display text nor remark explanation text, using a no-content mark as the text content, and numbering the no-content mark with the page number of the current slide:
if a slide page shows only pictures and charts, has no usable display text, and has no text in its notes column, its text content is set to the no-content mark.
The no-content mark may be a special symbol or an empty symbol. When empty text content is recognized, audio synthesis proceeds directly along the no-content path.
When numbering text content by page number, every piece of text content is numbered, whether it is extracted text, sentences and paragraphs composed through natural-language-understanding means, or a no-content mark; this keeps the text-content numbering continuous.
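The no-content path can be sketched with a sentinel. The sentinel value and function names are illustrative; the point is that every page yields some numbered text content, so the numbering never has gaps:

```python
NO_CONTENT = ""  # illustrative sentinel: an empty symbol marks a text-free slide

def text_content_for_slide(display_text, remark_text):
    """Extract the slide's text content, or return the no-content mark."""
    combined = (display_text or "").strip() + (remark_text or "").strip()
    return combined if combined else NO_CONTENT

def number_all_pages(slides):
    """slides: list of (display_text, remark_text) tuples in page order.
    Numbering stays continuous: pages without text still get an entry."""
    return {page: text_content_for_slide(d, r)
            for page, (d, r) in enumerate(slides, start=1)}
```

Downstream, the speech synthesis module checks for the sentinel and generates a silent clip instead of calling TTS, as the next paragraphs describe.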
After the text extraction and analysis module extracts the text content, the text content is input to the voice synthesis module. The speech synthesis module generates an audio clip from the text content.
The speech synthesis module can complete the relevant parameter configuration for synthesis in advance (e.g., timbre, volume, speech rate, intonation, background music) and supports user-defined adjustment of these parameters, so that the resulting audio clips meet the user's individual needs. Text content can be synthesized into audio clips in either of two ways: online speech synthesis (a speech synthesis model on a cloud server) or offline speech synthesis (local speech synthesis software).
When the text content is input into a speech synthesis module whose parameters are fully configured, the module generates user-customized audio clips.
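The parameter configuration and the carry-over of numbers from text contents to audio clips might be sketched as below. The TTS engine itself is stubbed out, since the patent allows either a cloud model (online) or local software (offline); the parameter names and defaults are illustrative assumptions.

```python
DEFAULT_PARAMS = {"voice": "female_1", "volume": 1.0, "rate": 1.0, "bgm": None}

def synthesize_all(numbered_texts, tts_engine, user_params=None):
    """Run TTS over every numbered text content; each resulting audio
    clip inherits its text content's number (= slide page number)."""
    params = {**DEFAULT_PARAMS, **(user_params or {})}  # user overrides defaults
    clips = {}
    for num, text in sorted(numbered_texts.items()):
        clips[num] = tts_engine(text, **params)
    return clips

# Stub standing in for an online or offline synthesis backend.
fake_engine = lambda text, **p: f"audio({text!r}, rate={p['rate']})"

clips = synthesize_all({1: "Hello", 2: "World"}, fake_engine, {"rate": 1.2})
print(clips[2])  # → audio('World', rate=1.2)
```

Swapping `fake_engine` for a real cloud or local synthesizer would leave the numbering logic unchanged.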
Taking the number of the text content as the number of the audio fragment;
in order to keep images and audio synchronized when the video is played, the audio clips are likewise numbered with the slide page numbers.
The number of each text content corresponds to the page number of its slide; since each audio clip takes its text content's number, the audio clip also corresponds to that page number, so the audio clip and the text content refer to the same slide.
Taking the number of the text content as the number of the audio fragment comprises the following steps:
and when the text content is the content-free mark, generating a mute audio fragment with preset duration, and taking the number of the content-free mark as the number of the mute audio fragment.
In particular, when the speech synthesis module recognizes that the text content under a certain number is the no-content mark, i.e., a certain slide carries no relevant text, it invokes the silent-clip generation method and numbers the silent audio clip with that text content's number, preventing the corresponding slide from being skipped over quickly in the video merely because it has no text.
The preset duration can be set according to the content of the presentation and the actual circumstances of the class or report. For example, when the meeting time is ample and the audience needs to study the chart on each slide in detail, the preset duration may be set to a longer value.
For example, consider a 20-page presentation promoting a tourist attraction, in which page 4 is a detail display of the main scenic spot. To impress the audience visually with an uncluttered landscape image, the author adds no text content to that page when producing the presentation. With the preset duration set to 5 seconds, when the text content numbered 4 is obtained, the speech synthesis module generates a 5-second silent audio clip, and the number 4 carries over to the silent clip. In the video finally synthesized by the video generation module, the slide on page 4 is therefore displayed for 5 seconds, giving the audience enough time to enjoy the scenery it shows.
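Generating a silent clip of the preset duration needs no TTS at all; below is a sketch using only Python's standard `wave` module. The 16 kHz sample rate and 16-bit mono format are assumptions, since the patent does not specify audio parameters.

```python
import io
import wave

SAMPLE_RATE = 16000  # assumed; not fixed by the patent

def silent_wav(seconds, sample_rate=SAMPLE_RATE):
    """Return WAV bytes holding `seconds` of 16-bit mono silence, used as
    the audio clip for a slide whose text content is the no-content mark."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()

clip = silent_wav(5)  # the patent's example: 5 s for the scenery slide
with wave.open(io.BytesIO(clip), "rb") as w:
    print(w.getnframes() / w.getframerate())  # → 5.0
```

The returned bytes can then be numbered and handed to the video generation module exactly like a synthesized clip.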
In the embodiment of the application, marking text-free slides with the no-content mark ensures that such slides are not missed when the audio clips are generated, so the video synthesized according to the audio durations does not skip them, guaranteeing that the synthesized video displays the presentation content in full.
After obtaining the audio clips, the speech synthesis module inputs them to the video generation module, which receives the pictures from the picture generation module at the same time.
The picture generation module, the speech synthesis module, and the text extraction and analysis module each process the slides of the presentation page by page, but their outputs together cover the entire presentation: the picture generation module outputs to the video generation module a picture for every slide in the presentation, and the speech synthesis module outputs an audio clip for the text content of every slide.
For example, for a presentation of 100 pages, the picture generation module outputs pictures numbered 1-100 and the speech synthesis module outputs audio clips numbered 1-100, the clip numbers and picture numbers being in one-to-one correspondence.
Accordingly, the video generation module may set the display duration, in the video, of the slide corresponding to the audio clip being processed according to that clip's duration.
Editing the frame number of the picture according to the duration of the audio fragment;
taking the picture after the frame number editing as a display picture frame;
the display duration of each slide in the video is determined by the number of frames of the pictures generated for that slide.
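Editing the frame count amounts to repeating a slide's picture for duration times frame-rate frames. A minimal sketch, assuming a 25 fps video (the patent does not fix a frame rate):

```python
FPS = 25  # assumed video frame rate

def frames_for(clip_seconds, fps=FPS):
    """Number of video frames a slide's picture occupies so that the
    slide stays on screen exactly as long as its audio clip."""
    return round(clip_seconds * fps)

print(frames_for(3))  # → 75  (a 3-second clip at 25 fps)
```

A video encoder would then emit the same display picture frame `frames_for(d)` times in a row.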
Sequentially superposing each slide according to the display time length corresponding to each slide to obtain silent video, wherein the method comprises the following steps:
when the numbers of all the audio clips and the numbers of all the pictures are in one-to-one correspondence, the numbers of the pictures are used as the display sequence of the display picture frames;
and superposing the display picture frames corresponding to each slide according to the display sequence of the display picture frames corresponding to each slide to obtain the silent video.
For example, referring to fig. 3, fig. 3 shows the picture display effect for each slide in a presentation video synthesized according to an embodiment of the present application. A 4-page presentation yields 4 corresponding pictures. If the audio clip obtained from the text content of slide 1 is 3 seconds, slide 2 is 4 seconds, slide 3 is 2 seconds, and slide 4 is 5 seconds, then in the final video picture 1 is displayed for 3 seconds, picture 2 for 4 seconds, picture 3 for 2 seconds, and picture 4 for 5 seconds.
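Ordering the display picture frames by their numbers and pairing each with its display duration can be sketched as below, using the 4-page example of fig. 3; the picture filenames are hypothetical placeholders.

```python
def build_schedule(pictures, clip_seconds):
    """Order the display picture frames by slide number and pair each
    with its display duration (= the matching audio clip's length)."""
    assert set(pictures) == set(clip_seconds), "numbers must match one-to-one"
    return [(pictures[n], clip_seconds[n]) for n in sorted(pictures)]

pictures = {1: "p1.png", 2: "p2.png", 3: "p3.png", 4: "p4.png"}
durations = {1: 3, 2: 4, 3: 2, 4: 5}  # seconds, from the fig. 3 example

print(build_schedule(pictures, durations))
# → [('p1.png', 3), ('p2.png', 4), ('p3.png', 2), ('p4.png', 5)]
```

The silent video is then the concatenation of these (picture, duration) pairs in order.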
Another embodiment of the application proposes a specific method of adding the audio clips to the silent video. When the audio clips of all slides are added to the silent video, besides adding each slide's audio clip to its corresponding picture in picture order, the timestamps of the corresponding audio must also be matched, so that hearing and vision stay synchronized.
Obtaining a first starting time stamp and a first ending time stamp of the audio fragment according to the duration of the audio fragment corresponding to each slide and the total page code number;
the first start time stamp is a start time stamp of any audio segment of all audio segments and the first end time stamp is an end time stamp of any audio segment of all audio segments.
Setting the display duration of the current slide in the video according to the duration of the audio fragment, wherein the setting comprises the following steps:
obtaining a second starting time stamp and a second ending time stamp of the current slide in the video according to the first starting time stamp and the first ending time stamp of the audio fragment;
the second start time stamp is a start time stamp of any picture among all pictures constituting the silent video, and the second end time stamp is an end time stamp of any picture among all pictures constituting the silent video.
Because the frame count of each picture matches the duration of the audio clip corresponding to that picture, the total duration of all audio clips of a presentation necessarily equals the total duration of the silent video assembled from the pictures.
Assume a presentation with 4 pages. The audio clip obtained from the text of slide 1 is 3 seconds, slide 2 is 4 seconds, slide 3 is 2 seconds, and slide 4 is 5 seconds, so the total duration of all audio clips is 14 seconds, and the total duration of the silent video is likewise 14 seconds. The audio clip of slide 1 therefore has a start timestamp of 0 seconds and an end timestamp of 3 seconds; slide 2, start 3 seconds and end 7 seconds; slide 3, start 7 seconds and end 9 seconds; slide 4, start 9 seconds and end 14 seconds. Taking the start and end timestamps of each audio clip as the start and end timestamps of each picture gives: picture 1 displayed in the silent video starts at 0 seconds and ends at 3 seconds; picture 2 starts at 3 seconds and ends at 7 seconds; picture 3 starts at 7 seconds and ends at 9 seconds; picture 4 starts at 9 seconds and ends at 14 seconds.
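The timestamp bookkeeping in this example is a simple running sum over the clip durations in slide-number order; a sketch:

```python
def clip_timestamps(durations):
    """Derive each clip's (start, end) timestamp from the running total of
    clip durations, in slide-number order. The same timestamps are then
    reused as the start/end timestamps of the pictures in the silent video."""
    stamps, t = [], 0
    for d in durations:
        stamps.append((t, t + d))
        t += d
    return stamps

print(clip_timestamps([3, 4, 2, 5]))
# → [(0, 3), (3, 7), (7, 9), (9, 14)]
```

Because pictures and clips share these stamps, aligning them pairwise yields exact audiovisual synchronization.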
Sequentially adding the audio clips corresponding to each slide to the silent video to generate a demonstration video, wherein the method comprises the following steps:
sequentially associating a second starting time stamp of each slide in the silent video with a first starting time stamp of an audio fragment corresponding to each slide according to the digital sequence of the total page code number, and associating a second ending time stamp of each slide in the silent video with a first ending time stamp of an audio fragment corresponding to each slide;
associating the first start timestamp with the second start timestamp means setting the second start timestamp according to the first start timestamp, so that the two correspond.
And adding the audio clips corresponding to each slide to the associated silent video to generate the demonstration video.
After association, every picture of the silent video has a start timestamp and an end timestamp, so the audio clips can be added in numbered order, with the start and end timestamps of each clip aligned one by one with those of its picture. Once all audio clips have been added, the automatic dubbing of the silent video is complete and the dubbed presentation video is obtained.
Another embodiment of the present application provides a method of adding subtitles. Referring to fig. 4, fig. 4 is a flowchart illustrating steps of a method for adding subtitles according to an embodiment of the present application.
Step S41: dividing the text content with the number into a plurality of display sentences according to the positions of the punctuation marks in the text content;
after the text extraction and analysis module obtains the text content of all slides in the presentation, it inputs all text contents to the speech synthesis module and to the subtitle generation module. After the speech synthesis module obtains the audio clip corresponding to each text content, it also inputs the clip to the subtitle generation module.
Step S42: obtaining a third starting time stamp and a third ending time stamp of each display sentence according to the first starting time stamp and the first ending time stamp of the audio fragment;
a subtitle time-code matching module is built into the subtitle generation module; through subtitle time-code matching, it processes the input text content and the audio clip of that text content to obtain the start and end timestamps of each display sentence.
The third start timestamp is the start timestamp of the display sentence. The third end timestamp is an end timestamp of the display sentence.
The start and end timestamps of the subtitle file corresponding to each slide and of the picture corresponding to each slide are both derived from the timestamps of that slide's audio clip, so the temporal correspondences among subtitle file, audio clip, and picture come from a single source, which guarantees their accuracy.
Step S43, according to the third starting time stamp and the third ending time stamp of each display sentence, arranging each display sentence in turn to obtain a subtitle file corresponding to the audio fragment;
the format of the subtitle file may be ass, ssa, srt or the like.
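If the srt format is chosen, arranging the display sentences by timestamp reduces to formatting (start, end, text) triples in the SubRip block layout. A hedged sketch, assuming sentence splitting and time-code matching have already happened upstream:

```python
def to_srt(sentences):
    """Render (start_s, end_s, text) display sentences as an SRT subtitle
    file; timestamps follow the SubRip convention HH:MM:SS,mmm."""
    def fmt(sec):
        ms = int(round(sec * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = [
        f"{i}\n{fmt(a)} --> {fmt(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(sentences, start=1)
    ]
    return "\n".join(blocks)

print(to_srt([(0, 3, "Welcome."), (3, 7, "This is slide two.")]))
```

An ass or ssa writer would differ only in the per-line formatting, not in the timestamp arithmetic.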
Step S44, taking the serial number of the text content as the serial number of the subtitle file;
associating the page numbers of the slides with the subtitle file, the audio clip, and the picture obtained from the same slide ensures the coherence of the generated presentation video in image, subtitles, and dubbing, and a seamless audiovisual effect.
And step S45, adding the subtitle file corresponding to each slide to the demonstration video according to the serial number of the subtitle file to obtain the demonstration video with the subtitle.
Referring to fig. 5, fig. 5 is a flowchart illustrating the automatic generation of a presentation video according to an embodiment of the present application. The process shown in fig. 5 may be implemented on a system for automatically generating a presentation video: after a presentation (PPT) is input into the system, a video with subtitles and dubbing is obtained directly.
First step: upload the PPT file to the system for automatically generating the presentation video.
Second step: the system synchronously generates pictures and audio clips. The picture generation module generates pictures page by page from the PPT file and numbers them; the text extraction and analysis module extracts text content from the PPT file, the text content is synthesized into speech to generate audio clips, and the audio clips are numbered synchronously.
Third step: the subtitle generation module generates a subtitle file from the extracted text content and the audio information.
Fourth step: the video generation module superimposes the pictures, audio clips, and subtitle files to produce the synthesized video. First, the pictures are made into a silent video according to the durations of the audio clips; then each audio clip is matched to the picture of the same slide; finally, the subtitle file is superimposed in correspondence with the audio clips.
Fifth step: the system exports the generated presentation video file.
In the scheme for automatically generating a presentation video provided by the embodiment of the application, each slide is converted into a picture to obtain a display picture frame for the video, the display text and remark text of each slide are extracted to obtain text content, and an audio clip is generated from the text content. Meanwhile, the picture, audio clip, and subtitle file obtained from the same slide are associated in order of the presentation's page numbers, so that the images, dubbing, and subtitles of the synthesized presentation video correspond to one another and remain coherent.
In the embodiment of the application, the duration of the audio clip corresponding to each slide is used as that slide's display duration in the video, so that the clip narrating each slide stays in the video exactly as long as the slide itself; when each slide is displayed, the narrated audio content and the image content are therefore synchronized.
Based on the same inventive concept, the embodiment of the application provides a system for automatically generating a presentation video. With continued reference to fig. 2, fig. 2 is a schematic structural diagram of a system for automatically generating a presentation video according to an embodiment of the present application. As shown in fig. 2, the system includes:
The text extraction and analysis module 21 is used for sequentially taking each slide in the presentation as a current slide one by one;
a voice synthesis module 22, configured to generate an audio clip according to the display text and the remark explanation text of the current slide;
a video generating module 23, configured to set a display duration of the current slide in a video according to a duration of the audio clip;
the video generating module 23 is further configured to sequentially superimpose each slide according to a display duration corresponding to each slide, so as to obtain a silent video;
the video generating module 23 is further configured to sequentially add an audio clip corresponding to each slide to the silent video, and generate a presentation video.
Optionally, the text extraction and analysis module is further configured to read and record a total page code number of the presentation;
the system for automatically generating a presentation video further comprises:
the picture generation module 24 is used for reading and recording the total page code number of the presentation;
the picture generation module 24 is further configured to convert the current slide into a picture;
the picture generation module 24 is further configured to number the pictures with the page numbers of the current slide;
The text extraction and analysis module 21 is further configured to number text contents with the page number of the current slide; the text content is obtained by extracting at least one of the display text and the remark explanation text of the current slide; the speech synthesis module 22 is further configured to take the number of the text content as the number of the audio segment;
the video generating module 23 is further configured to edit the frame number of the picture according to the duration of the audio clip;
the video generating module 23 is further configured to take the edited picture after the number of frames as a display picture frame;
the video generating module 23 is further configured to use the number of the picture as the display order of the display picture frame when the numbers of all the audio clips and the numbers of all the pictures are in one-to-one correspondence;
the video generating module 23 is further configured to superimpose the display picture frames corresponding to each slide according to the display order of the display picture frames corresponding to each slide, so as to obtain the silent video.
Optionally, the text extraction and analysis module is further configured to use a no-content mark as the text content when the current slide does not have the display text and does not have the remark explanation text;
The text extraction and analysis module 21 is further configured to number the no-content mark with a page number of the current slide;
the speech synthesis module 22 is further configured to generate a mute audio segment with a preset duration when the text content is the no-content mark, and take the number of the no-content mark as the number of the mute audio segment.
Optionally, the system for automatically generating the demonstration video further comprises:
the speech synthesis module 22 is further configured to obtain a first start timestamp and a first end timestamp of the audio segment according to the duration of the audio segment corresponding to each slide and the total page code number;
the video generating module 23 is further configured to obtain a second start time stamp and a second end time stamp of the current slide in the video according to the first start time stamp and the first end time stamp of the audio clip;
the video generating module 23 is further configured to sequentially associate, in numerical order of the total number of codes, a second start timestamp of each slide in the silent video with a first start timestamp of an audio clip corresponding to each slide, and associate a second end timestamp of each slide in the silent video with a first end timestamp of an audio clip corresponding to each slide;
The video generating module 23 is further configured to add an audio clip corresponding to each slide to the silent video after association, and generate the presentation video.
Optionally, the system for automatically generating the demonstration video further comprises:
A caption generating module 25, configured to divide the text content with the number into a plurality of display sentences according to the positions of the punctuation marks in the text content;
the subtitle generating module 25 is further configured to obtain a third start timestamp and a third end timestamp of each display sentence according to the first start timestamp and the first end timestamp of the audio clip;
the video generating module 23 is further configured to sequentially arrange each display sentence according to the third start timestamp and the third end timestamp of each display sentence, so as to obtain a subtitle file corresponding to the audio clip;
the video generating module 23 is further configured to take the number of the text content as the number of the subtitle file;
the video generating module 23 is further configured to add the subtitle file corresponding to each slide to the presentation video according to the number of the subtitle file, so as to obtain the presentation video with subtitles.
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for automatically generating a presentation video according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the steps in the method for automatically generating a presentation video according to any one of the foregoing embodiments of the present application.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive or illustrative manner, and each embodiment is mainly described by the differences from other embodiments, and identical and similar parts between the embodiments are mutually referred.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal device comprising the element.
The method, system, device, and storage medium for automatically generating a presentation video provided by the application have been described in detail above; the description is only intended to help understand the method and the core idea of the application. Meanwhile, those skilled in the art may vary the specific embodiments and the application scope according to the idea of the application; in view of the above, the contents of this specification should not be construed as limiting the application.

Claims (8)

1. A method of automatically generating a presentation video, the method comprising:
reading and recording the total page code number of the presentation; sequentially taking each slide in the presentation as a current slide one by one;
converting the current slide into a picture; numbering the pictures according to the page number of the current slide;
numbering the text content according to the page number of the current slide; the text content is obtained by extracting at least one of the display text and the remark explanation text of the current slide;
the step of numbering text contents according to the page number of the current slide comprises the following steps: when the current slide does not have the display text and does not have the remark explanation text, taking a no-content mark as the text content; numbering the content-free marks according to the page number of the current slide;
Generating an audio fragment according to the display text and the remark explanation text of the current slide; taking the number of the text content as the number of the audio fragment;
setting the display duration of the current slide in the video according to the duration of the audio fragment, wherein the setting comprises the following steps: editing the frame number of the picture according to the duration of the audio fragment; taking the picture after the frame number editing as a display picture frame;
sequentially superposing each slide according to the display time length corresponding to each slide to obtain silent video, wherein the method comprises the following steps: when the numbers of all the audio clips and the numbers of all the pictures are in one-to-one correspondence, the numbers of the pictures are used as the display sequence of the display picture frames; superposing the display picture frames corresponding to each slide according to the display sequence of the display picture frames corresponding to each slide to obtain the silent video;
and sequentially adding the audio clips corresponding to each slide to the silent video to generate a demonstration video.
2. The method of claim 1, wherein the numbering of the text content as the numbering of the audio segments comprises:
and when the text content is the content-free mark, generating a mute audio fragment with preset duration, and taking the number of the content-free mark as the number of the mute audio fragment.
3. The method of claim 1, wherein after generating the audio clip from the display text and the remark explanation text of the current slide, the method further comprises:
obtaining a first start timestamp and a first end timestamp of each audio clip according to the duration of the audio clip corresponding to each slide and the total page number;
wherein setting the display duration of the current slide in the video according to the duration of the audio clip comprises:
obtaining a second start timestamp and a second end timestamp of the current slide in the video from the first start timestamp and the first end timestamp of the audio clip;
and wherein sequentially adding the audio clip corresponding to each slide to the silent video to generate the demonstration video comprises:
sequentially associating, in the numerical order of the page numbers, the second start timestamp of each slide in the silent video with the first start timestamp of the corresponding audio clip, and the second end timestamp of each slide in the silent video with the first end timestamp of the corresponding audio clip;
and adding the audio clip corresponding to each slide to the associated silent video to generate the demonstration video.
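The timestamp bookkeeping of claim 3 is a running sum over clip durations. In this sketch (an illustration, not the patent's code) the slide's second start/end timestamps equal its clip's first start/end timestamps, since each slide is displayed for exactly as long as its clip plays:

```python
def clip_timestamps(durations):
    """durations: the audio-clip length in seconds for each slide, in
    page order.  Returns (start, end) pairs -- the first start/end
    timestamps of claim 3.  Because slide display duration is set from
    clip duration, the slide's second timestamps are identical, which
    is what makes the slide/clip association of claim 3 line up."""
    stamps, t = [], 0.0
    for d in durations:
        stamps.append((t, t + d))      # clip occupies [t, t + d)
        t += d                         # next clip starts where this ends
    return stamps
```

Muxing then reduces to placing clip *i* at its precomputed start offset in the silent video.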
4. The method according to claim 3, further comprising:
dividing the numbered text content into a plurality of display sentences according to the positions of the punctuation marks in the text content;
obtaining a third start timestamp and a third end timestamp of each display sentence from the first start timestamp and the first end timestamp of the audio clip;
arranging the display sentences in sequence according to their third start and end timestamps to obtain a subtitle file corresponding to the audio clip;
taking the number of the text content as the number of the subtitle file;
and adding the subtitle file corresponding to each slide to the demonstration video according to the number of the subtitle file to obtain a demonstration video with subtitles.
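The subtitle construction of claim 4 can be sketched as follows. Two assumptions not stated in the patent: the clip's time span is spread over the display sentences in proportion to their character length (a simple stand-in for alignment against the synthesized audio), and the output uses the SRT format, which the patent does not name.

```python
import re

def make_srt(text, clip_start, clip_end):
    """Split the numbered text content on punctuation into display
    sentences, assign each a third start/end timestamp inside the
    clip's [clip_start, clip_end] span proportionally to its length,
    and return the sentences as an SRT-style subtitle string."""
    sentences = [s for s in re.split(r"[,.;!?，。；！？]", text) if s.strip()]
    total = sum(len(s) for s in sentences)

    def fmt(t):  # seconds -> SRT timestamp "HH:MM:SS,mmm"
        h, rem = divmod(int(t * 1000), 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    entries, t = [], clip_start
    for i, s in enumerate(sentences, 1):
        end = t + (clip_end - clip_start) * len(s) / total
        entries.append(f"{i}\n{fmt(t)} --> {fmt(end)}\n{s.strip()}\n")
        t = end                      # next sentence starts where this ends
    return "\n".join(entries)
```

A real system would take the per-sentence boundaries from the TTS engine rather than from character counts; only the file-assembly step is faithful to the claim here.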
5. A system for automatically generating a demonstration video, the system comprising:
a text extraction and analysis module, used for taking each slide in the presentation in turn as the current slide;
a speech synthesis module, used for generating an audio clip from the display text and the remark explanation text of the current slide;
a video generation module, used for setting the display duration of the current slide in the video according to the duration of the audio clip;
the video generation module is further used for superimposing the slides in sequence according to the display duration corresponding to each slide to obtain a silent video;
the video generation module is further used for sequentially adding the audio clip corresponding to each slide to the silent video to generate the demonstration video;
the text extraction and analysis module is further used for reading and recording the total page number of the presentation;
the system further comprising:
a picture generation module, used for reading and recording the total page number of the presentation, converting the current slide into a picture, and numbering the picture according to the page number of the current slide;
the text extraction and analysis module is further used for numbering the text content according to the page number of the current slide, the text content being extracted from at least one of the display text and the remark explanation text of the current slide;
the speech synthesis module is further used for taking the number of the text content as the number of the audio clip;
the video generation module is further used for editing the frame count of the picture according to the duration of the audio clip, and taking the frame-edited picture as a display picture frame;
the video generation module is further used for, when the numbers of all the audio clips correspond one-to-one with the numbers of all the pictures, taking the picture numbers as the display order of the display picture frames, and superimposing the display picture frames corresponding to each slide according to that display order to obtain the silent video;
the text extraction and analysis module is further used for taking a no-content mark as the text content when the current slide has neither display text nor remark explanation text, and numbering the no-content mark according to the page number of the current slide.
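The text extraction and analysis module's numbering rule in claim 5, including the no-content fallback, can be sketched in Python. The slide representation and the `NO_CONTENT` marker value are illustrative assumptions, not from the patent:

```python
NO_CONTENT = "<no-content>"   # hypothetical sentinel for the no-content mark

def extract_text(slides):
    """Number each slide's text content by its page number (1-based).
    A slide with neither display text nor remark explanation text is
    assigned the no-content mark under the same number, so downstream
    numbering of audio clips stays one-to-one with the pictures."""
    numbered = {}
    for page, slide in enumerate(slides, 1):
        text = " ".join(filter(None, (slide.get("display"), slide.get("notes"))))
        numbered[page] = text if text.strip() else NO_CONTENT
    return numbered
```

The speech synthesis module would then emit a real clip for ordinary entries and a preset-duration silent clip for `NO_CONTENT` entries, keeping the clip numbering dense.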
6. The system for automatically generating a demonstration video according to claim 5, wherein the speech synthesis module is further used for generating a silent audio clip of a preset duration when the text content is the no-content mark, and taking the number of the no-content mark as the number of the silent audio clip.
7. A readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-4.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the method according to any one of claims 1-4.
CN202010301638.4A 2020-04-16 2020-04-16 Method, system, equipment and storage medium for automatically generating demonstration video Active CN111538851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010301638.4A CN111538851B (en) 2020-04-16 2020-04-16 Method, system, equipment and storage medium for automatically generating demonstration video


Publications (2)

Publication Number Publication Date
CN111538851A CN111538851A (en) 2020-08-14
CN111538851B true CN111538851B (en) 2023-09-12

Family

ID=71952273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010301638.4A Active CN111538851B (en) 2020-04-16 2020-04-16 Method, system, equipment and storage medium for automatically generating demonstration video

Country Status (1)

Country Link
CN (1) CN111538851B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112040322A (en) * 2020-08-20 2020-12-04 译发网络科技(大连)有限公司 Video specification making method
CN112102847B (en) * 2020-09-09 2022-08-09 四川大学 Audio and slide content alignment method
US11417366B1 (en) 2021-02-19 2022-08-16 William Craig Kenney Method and system for synchronizing presentation slide content with a soundtrack
CN113626013A (en) * 2021-08-04 2021-11-09 中国人民解放军战略支援部队航天工程大学 Automatic interpretation method and device for slides
CN113905254B (en) * 2021-09-03 2024-03-29 前海人寿保险股份有限公司 Video synthesis method, device, system and readable storage medium
CN114494951B (en) * 2022-01-12 2023-04-25 北京百度网讯科技有限公司 Video processing method, device, electronic equipment and storage medium
CN114390220B (en) * 2022-01-19 2023-12-08 中国平安人寿保险股份有限公司 Animation video generation method and related device
CN117240983B (en) * 2023-11-16 2024-01-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for automatically generating sound drama

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110055957A (en) * 2009-11-20 2011-05-26 김학식 The power points documents that synthesized voices and the method that makes various video files and the system using plug-in tts module in power points
CN103700292A (en) * 2013-12-25 2014-04-02 广州鸿根信息科技有限公司 System for making teaching video
CN104199911A (en) * 2014-08-28 2014-12-10 天脉聚源(北京)教育科技有限公司 Storage method and device for PPT
CN104572686A (en) * 2013-10-17 2015-04-29 北大方正集团有限公司 Method and device for processing PPT (power point) files
CN105791950A (en) * 2014-12-24 2016-07-20 珠海金山办公软件有限公司 Power Point video recording method and device
CN107943839A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture and word generation video
CN107948730A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture generation video
CN109195007A (en) * 2018-10-19 2019-01-11 深圳市轱辘汽车维修技术有限公司 Video generation method, device, server and computer readable storage medium
CN109218629A (en) * 2018-09-14 2019-01-15 三星电子(中国)研发中心 Video generation method, storage medium and device
CN109614537A (en) * 2018-12-06 2019-04-12 北京百度网讯科技有限公司 For generating the method, apparatus, equipment and storage medium of video
CN110807126A (en) * 2018-08-01 2020-02-18 腾讯科技(深圳)有限公司 Method, device, storage medium and equipment for converting article into video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098324A1 (en) * 2015-10-05 2017-04-06 Vitthal Srinivasan Method and system for automatically converting input text into animated video


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wei Xiangfeng; Yuan Yi; Zhang Quan; Chi Yuhuan. Research on the alignment of speech and text content in a rich-media environment. Intelligence Engineering (情报工程), 2019, (02), full text. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant