CN110807126B - Method, device, storage medium and equipment for converting article into video - Google Patents


Info

Publication number
CN110807126B
CN110807126B (application CN201810863876.7A)
Authority
CN
China
Prior art keywords: video, article, picture, playing, text
Prior art date
Legal status
Active
Application number
CN201810863876.7A
Other languages
Chinese (zh)
Other versions
CN110807126A (en)
Inventor
董霙
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810863876.7A
Publication of CN110807126A
Application granted
Publication of CN110807126B
Legal status: Active
Anticipated expiration

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the application discloses a method, a device, a storage medium and equipment for converting an article into a video, belonging to the technical field of computers. The method comprises the following steps: acquiring material in an article to be converted; converting text material in the material into audio; obtaining, from a template library, a video template matched with the category to which the article belongs, wherein the template library comprises at least one video template, the video template matched with the category of an article comprising text material comprises a typesetting template and a background picture, and the background picture matches the category to which the article belongs; and synthesizing the text material, the background picture and the audio according to the typesetting template to obtain the video. The embodiment solves the problem that, when an article comprises only text material, the pictures found according to keywords are inaccurate and cause a mismatch between the text material and the background picture in the video; the matching degree of the text material and the background picture is thereby guaranteed, and the quality of the video is improved.

Description

Method, device, storage medium and equipment for converting article into video
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method, a device, a storage medium and equipment for converting an article into a video.
Background
Reading is an important way for people to obtain information. Since articles are mostly presented in text form, reading them is time-consuming and laborious for users. An article can therefore be converted into a video, and the video reduces the difficulty for the user of acquiring the information.
In the related art, when an article only includes text material, a server searches a preset picture library for a background picture matched with a keyword in the text material, and then generates a video according to the text material and the background picture.
However, an editor who cannot find a suitable picture may publish the article without any illustration, so that the article includes only text material. The background picture found according to the keywords may then be inaccurate, which causes a mismatch between the text material and the background picture in the video.
Disclosure of Invention
The embodiment of the application provides a method, a device, a storage medium and equipment for converting an article into a video, which solve the problem that, when an article includes only text material, the background pictures found according to keywords are inaccurate and the text material in the video is mismatched with the background pictures. The technical scheme is as follows:
In one aspect, a method of converting an article to video is provided, the method comprising:
acquiring materials in an article to be converted;
converting text materials in the materials into audio, wherein the text materials are materials formed by characters in the articles;
obtaining a video template matched with the category to which the article belongs from a template library, wherein the template library comprises at least one video template, the video template matched with the category to which the article comprising text materials belongs comprises a typesetting template and a background picture, and the background picture is matched with the category to which the article belongs;
and synthesizing the text material, the background picture and the audio according to the typesetting template to obtain a video.
In one aspect, there is provided an apparatus for converting an article into video, the apparatus comprising:
the acquisition module is used for acquiring materials in the articles to be converted;
the conversion module is used for converting text materials in the materials obtained by the obtaining module into audio, wherein the text materials are materials formed by characters in the articles;
the acquisition module is further used for acquiring a video template matched with the category to which the article belongs from a template library, at least one video template is preset in the template library, the video template matched with the category to which the article comprising text materials belongs comprises a typesetting template and a background picture, and the background picture is matched with the category to which the article belongs;
And the synthesis module is used for synthesizing the text material, the background picture and the audio according to the typesetting template to obtain a video.
In one aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set loaded and executed by a processor to implement a method of converting an article into video as described above is provided.
In one aspect, there is provided an apparatus for converting an article into video, the apparatus comprising a processor and a memory having stored therein at least one instruction loaded and executed by the processor to implement a method for converting an article into video as described above.
The beneficial effects of the technical scheme provided by the embodiment of the application at least comprise:
when an article includes text material, a video template matched with the category to which the article belongs is obtained from a template library. Because the background picture in the video template matches the category to which the article belongs, and that category correctly reflects the content of the text material, the background picture is relevant to the content of the text material. This solves the problem that, when an article includes only text material, pictures found according to keywords are inaccurate and mismatch the text material; the matching degree of the text material and the background picture is thereby guaranteed, and the quality of the video is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system for converting articles to video according to some exemplary embodiments;
FIG. 2 is a schematic diagram of a server according to some exemplary embodiments;
FIG. 3 is a method flow diagram of a method for converting articles to video provided by one embodiment of the present application;
FIG. 4 is a method flow diagram of a method for converting an article to video provided in another embodiment of the present application;
FIG. 5 is a flow chart of a process for text material in an article provided in another embodiment of the present application;
FIG. 6 is a schematic diagram of converting an article provided in another embodiment of the present application into video;
FIG. 7 is a block diagram of the steps for converting an article to video provided by another embodiment of the present application;
FIG. 8 is a schematic diagram of the conversion of an article into video provided in accordance with another embodiment of the present application;
FIG. 9 is a block diagram of an apparatus for converting articles into video according to one embodiment of the present application;
fig. 10 is a block diagram of an apparatus for converting an article into video according to still another embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, an application scenario of the embodiments of the present application is explained.
When a user browses a web page through a browser, the user typically reads articles in the web page. The articles referred to here may include at least one of text material, picture material, and video material. Since most articles consist mainly of text, and some articles are provided with corresponding pictures or videos that help interpret the text visually, the following description takes as examples an article comprising text material, an article comprising text material and picture material, and an article comprising text material, picture material, and video material.
The text material may be a material formed by all or part of characters in the article, and optionally, the text in the article may be extracted in a summary manner to obtain the text material, which is described in detail below. The picture material may be a material composed of all or part of the pictures in the article, and optionally, when the number of the pictures in the article is small, the pictures related to the text material may be obtained from a pre-stored picture library as the pictures in the picture material, which is described in detail below. The video material may be a material composed of all or part of the videos in the article, and optionally, when the playing duration of the videos in the video material is long, a video segment in the video may be intercepted as the video material, which is described in detail below.
Because articles are mostly presented as text, and reading the text is laborious for the user, an article can be converted into a video; the user can then listen to the audio and watch the video pictures to clearly understand the information the article conveys, without having to labor through the text, which reduces the difficulty of acquiring the information. Alternatively,
because the number of articles on the network is large, the articles are mostly presented as text, and reading them is time-consuming, users cannot read the articles one by one; the articles therefore need to be converted into videos, so that the information they convey can be grasped quickly through the videos, and the user can then select the articles of interest for careful reading.
In addition, because video presentation is diverse, videos are more likely than dull text to attract users' attention, and users are more willing to read articles in this way.
In the related art, if an article includes text material, the server needs to extract keywords from the text material; for each keyword, search a preset picture library for a background picture matched with the keyword; and synthesize the text material and the background picture according to a typesetting template. For example, when the keyword extracted from an article introducing a machine learning algorithm is "machine learning", the server may find a picture of a person studying or a picture of a machine in the picture library. Such a picture does not match the content of the text material, so using it as the background picture causes a mismatch between the text material and the background picture, which affects the quality of the video. In the embodiment of the application, when the article includes text material, a video template matched with the category to which the article belongs is obtained from the template library; because the background picture in the video template matches that category, and the category correctly reflects the content of the text material, the background picture is relevant to the content of the text material, which guarantees the matching degree of the text material and the background picture and improves the quality of the video.
In addition, if the text material must match the background picture, converting an article into a video could be done by manual processing: the user trims the text of the article to obtain the text material; selects a background picture; sets typesetting templates for the text material and the picture material; and synthesizes the text material and the background picture according to the typesetting template. Because this manual flow is complex, video generation is inefficient; it also requires professional knowledge, so the user's learning cost is high. If the user lacks such knowledge, a professional has to be entrusted with the processing; the professional must still perform the same series of manual steps and spend time communicating with the user to clarify the user's requirements, so video generation remains inefficient, and when articles must be converted in batches, the time for the user to obtain the videos is long and is measured in days. In the embodiment of the application, text material in the article is automatically converted into audio, and a template library including at least one video template is preset, so that a video template matched with the category to which the article belongs is automatically selected from the template library. The video template includes a typesetting template and a background picture, so the server can synthesize the text material, the background picture and the audio according to the typesetting template to obtain the video, achieving automatic conversion of articles into videos without any manual processing by the user. That is, the user only needs to input the article to obtain the video through a one-stop service; the whole process is fully automatic, the operation is simple, the learning cost is zero, video generation is efficient, the production cycle is short, and articles can be converted in batches. In addition, since the template library includes at least one video template, multiple video templates can be offered for selection, which improves the effect of the video and yields high-quality videos.
The following describes products according to embodiments of the present application.
The embodiment of the present application may be implemented as a functional module in a website, as a standalone program, or as a plug-in within a program, which is not limited here. In one possible implementation, when implemented as a functional module in a website, the module may be named a "dynamic reading template" embedded in a site such as Cross builder. Cross builder is a responsive self-service website-building tool based on a fluid layout, where fluid layout refers to a layout mode in which the page size changes with the browser window size.
The system architecture of the embodiments of the present application is described below.
The server provided by the embodiment of the application may include a front end and a background, and the front end and the background may communicate through Thrift (a software framework). Thrift is a cross-language software framework that supports the json format, so the front end and the background can transfer data in json format. Thrift also supports nodejs (Node.js, an asynchronous communication mechanism), through which asynchronous communication between the front end and the background is achieved.
The front end may also be referred to as the client. When the client is a web browser, the user interaction interface may be implemented using HTML (HyperText Markup Language) + JS (JavaScript).
The background includes a preprocessor and a renderer, and the preprocessor and the renderer may be disposed on the same server or may be disposed on different servers, which is not limited in this embodiment. The pre-processor is used for pre-processing the articles to obtain pre-processing results, and sending the pre-processing results to the renderer, wherein the renderer is used for synthesizing and rendering the pre-processing results to obtain videos. The preprocessing here includes at least one of preprocessing of text material, preprocessing of picture material, and preprocessing of video material, the preprocessing results being explained in detail below.
When the preprocessor and the renderer are deployed on the same server, the server may be a Linux server, a Windows server, or another server, which is not limited in this embodiment. When they are deployed on different servers, the preprocessor may be deployed on a Linux server and the renderer on a Windows server, or each may be deployed on another type of server, which is not limited in this embodiment.
It should be noted that, when the preprocessor preprocesses the article and the renderer synthesizes and renders the preprocessing result, asynchronous communication can be realized through nodejs. In addition, when the languages of the preprocessor and the renderer are different, data may be transferred through the json format.
The preprocessor may be implemented by an LNMP architecture or by another architecture; fig. 1 takes the LNMP architecture as an example. The LNMP architecture is an Nginx + MySQL + PHP architecture under a Linux operating system: Nginx is a reverse proxy server that can provide load balancing for background servers or buffering for slower background servers; MySQL is a database that stores user information, intermediate data cached while preprocessing articles, the picture library used for matching text material, and so on; PHP (Hypertext Preprocessor) is the hypertext preprocessing language used for preprocessing the articles. Besides PHP, other languages such as Python and C++ may also be used to preprocess the articles, which is not limited in this embodiment.
In this embodiment, when converting a batch of articles, a first message queue (RabbitMQ) may be created in the preprocessor, and each article is added as a message to the first message queue, and the preprocessor processes the messages in the first message queue one by one. When converting text material to audio, a second message queue may be created in the preprocessor, each summary is added as a message to the second message queue, and the preprocessor processes the messages in the second message queue one by one.
Optionally, a third message queue may be created in the preprocessor, and after each preprocessing result is obtained by the preprocessor, the preprocessing result is added as a message to the third message queue, and the renderer processes the messages in the third message queue one by one. Of course, the third message queue may also be disposed in the renderer, or the third message queue may also be disposed between the preprocessing service and the renderer, which is not limited in this embodiment.
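By way of illustration, the following is a minimal sketch in Python of the first message queue described above, using the pika client for RabbitMQ; the queue name and the handler body are hypothetical, and this embodiment is not limited thereto.

import pika

# Connect to the RabbitMQ broker and declare the first message queue,
# into which each article to be converted is added as one message.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="articles")  # hypothetical queue name

def enqueue_article(article_text: str) -> None:
    """Add one article to the first message queue as a single message."""
    channel.basic_publish(exchange="", routing_key="articles",
                          body=article_text.encode("utf-8"))

def on_article(ch, method, properties, body):
    """Preprocessor callback: messages are handled one by one."""
    article = body.decode("utf-8")
    # ... preprocess the article here (abstract extraction, TTS, ...) ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="articles", on_message_callback=on_article)
# channel.start_consuming()  # blocks, processing queued articles in order

The second and third message queues follow the same pattern with their own queue names.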
Optionally, the renderers in the embodiment of the present application are diverse in composition and can provide various rendering modes. One renderer may be deployed in the server, or multiple renderers may be deployed. For example, the renderer may be FFMPEG (FFmpeg, a multimedia processing tool) software or PPT (PowerPoint) software, which is not limited in this embodiment.
Since the renderers are diverse in composition, they can provide various rendering modes and effects, so dynamic effects that would otherwise have to be implemented by programming can be completed by the renderer. A developer therefore does not need to attend to dynamic effects when programming; a designer can complete them instead. This separates development from design, simplifies development, and shortens the development cycle.
Optionally, the background may also use a docker containerized deployment to facilitate migrating services in the preprocessor or the renderer. For example, the preprocessing service in the preprocessor can be migrated to other servers through docker, so that multiple servers preprocess different articles in parallel, raising the preprocessing speed; or the rendering service in the renderer can be migrated to other servers through docker, so that multiple servers synthesize and render different preprocessing results in parallel, raising the rendering speed. Referring to fig. 2, fig. 2 shows the preprocessor and the renderers deployed on different servers, taking one preprocessor 210 and two renderers 220 as an example; the preprocessor 210 is connected to the two renderers 220, and the two renderers 220 can synthesize and render in parallel the multiple preprocessing results generated by the preprocessor 210 to obtain videos.
Referring to fig. 3, a flowchart of a method for converting an article into a video according to an embodiment of the present application is shown; the method may be applied in the server including the preprocessor and the renderer shown in fig. 1. The method of converting an article into a video comprises:
Step 301, acquiring materials in an article to be converted.
The server can acquire the article first and then extract the material from the article; the material that has been extracted from the article may also be obtained directly, as explained in step 401, and will not be repeated here.
When the server extracts the material from the article, if the article comprises a material, the server can extract the material; if the article includes at least two materials, the server can selectively extract the materials from the article according to the requirement of the user or default configuration. For example, when an article includes text material, picture material, and video material, and it is desired that the generated video includes text and pictures, the server may extract the text material and the picture material without extracting the video material.
In step 302, text material in the material is converted to audio.
Since text material consists of words, the server may call a TTS (Text To Speech) interface and perform speech conversion on the words one by one through the TTS interface to obtain the audio. Optionally, the TTS interface may be the interface of an existing text-to-speech service.
Besides performing speech conversion on the text material through the TTS interface, the text material can also be converted into audio by recording a human voice; because a human recording sounds natural, the resulting audio has a better effect.
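By way of illustration, the following Python sketch shows calling a TTS interface sentence by sentence; the endpoint URL and payload fields are hypothetical stand-ins for whichever text-to-speech service the server actually uses.

import json
import urllib.request

TTS_ENDPOINT = "https://tts.example.com/convert"  # hypothetical service URL

def sentence_to_audio(sentence: str) -> bytes:
    """Perform speech conversion on one sentence of the text material."""
    payload = json.dumps({"text": sentence, "format": "mp3"}).encode("utf-8")
    request = urllib.request.Request(TTS_ENDPOINT, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.read()  # raw audio bytes for this sentence

def text_material_to_clips(sentences: list[str]) -> list[bytes]:
    """Convert the text material one sentence at a time, one clip each."""
    return [sentence_to_audio(s) for s in sentences]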
Step 303, obtaining a video template matched with the category to which the article belongs from a template library, wherein the template library comprises at least one video template, the video template matched with the category to which the article comprising text materials belongs comprises a typesetting template and a background picture, and the background picture is matched with the category to which the article belongs.
When the material comprises text material, the video template comprises a typesetting template and a background picture, as described in step 405. When the material includes text material and picture material, or when the material includes text material, picture material and video material, the video template includes a typesetting template, as described in steps 406 and 407.
The typesetting template is a template for typesetting materials in articles. For example, when the material includes text material, typesetting information of the text may be specified by the typesetting template, and the typesetting information may include a play position, a font, a color, a font size, a special effect when playing and disappearing, and the like of the text. When the material includes a picture material, typesetting information of pictures in the picture material can be specified through a typesetting template, and the typesetting information can include play positions of the pictures, special effects when the pictures are played and disappeared, and the like. When the material includes video material, typesetting information of videos in the video material may be specified by a typesetting template, and the typesetting information may include a play position of the videos, special effects when playing and disappearing, and the like. The present embodiment does not limit the layout information specified by the layout template.
The background picture is a picture displayed as a background of the text material. The background picture is matched with the category to which the article belongs. For example, when the category to which the article belongs is an algorithm, the background picture may be a picture containing mathematical symbols; when the category to which the article belongs is administrative, the background picture can be a picture containing a television station anchor; when the category to which the article belongs is cartoon, the background picture may be a picture including cartoon characters, or the like, and the embodiment is not limited.
In this embodiment, a template library is preset in the server, and the template library may be generated by the server or may be obtained by the server from another device. Wherein the template library comprises at least one video template.
After the server acquires the template library, the category to which the article belongs can be acquired, and then the video template matched with the category is acquired from the template library.
And step 304, synthesizing the text material, the background picture and the audio according to the typesetting template to obtain the video.
The preprocessor in the server can compose the text material, the background picture and the audio into a preprocessing result according to the typesetting template, and send the preprocessing result to the renderer in json format; the renderer parses the json-format preprocessing result, generates a rendering file, and calls a rendering interface to execute a rendering command on the rendering file to obtain the video.
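By way of illustration, the json-format preprocessing result passed from the preprocessor to the renderer might look like the following Python sketch; all field names are hypothetical, since the embodiment only specifies that the result contains the key frames and the audio.

import json

preprocessing_result = {
    "template_id": "tech-article-01",      # matched video template (hypothetical id)
    "audio": "article_123_speech.mp3",     # audio converted from the text material
    "keyframes": [
        {"start": 1.0, "duration": 5.0,    # second playing time information
         "background": "bg_algorithm_1.png",
         "subtitle": "First sentence of the abstract."},
        {"start": 6.0, "duration": 5.0,
         "background": "bg_algorithm_2.png",
         "subtitle": "Second sentence of the abstract."},
    ],
}
message = json.dumps(preprocessing_result)  # transferred to the renderer as json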
In summary, in the method for converting an article into a video provided in the embodiment of the present application, when the article includes text material, a video template matched with the category to which the article belongs is obtained from a template library. Because the background picture in the video template matches the category to which the article belongs, and that category correctly reflects the content of the text material, the background picture is relevant to the content of the text material. This solves the problem that, when an article includes only text material, a picture found according to a keyword is inaccurate and mismatches the text material; the matching degree of the text material and the background picture is thereby guaranteed, and the quality of the video is improved.
The embodiment shown in fig. 3 briefly introduced the process of converting an article including text material into a video. In fact, whether the article includes only text material, text material and picture material, or text material, picture material and video material, the embodiment of the present application can generate a high-quality video, and it is highly inclusive of different materials. The process of converting these three kinds of articles into videos is described in detail below.
Referring to fig. 4, a flowchart of a method for converting an article into a video according to another embodiment of the present application is shown, and the method for converting an article into a video may be applied to the server including the preprocessor and the renderer shown in fig. 1. A method of converting the article to video, comprising:
in step 401, material in an article to be converted is obtained.
The present embodiment provides three acquisition modes of materials, and the three acquisition modes are described below.
In the first acquisition mode, an article to be converted sent by a client is acquired, and materials in the article are extracted.
The client side displays an input box, a user can edit an article in the input box, or the user can copy the edited article into the input box and click a sending button to trigger a sending instruction, and the client side sends the article to the server according to the sending instruction, and at the moment, the server receives the article and extracts materials in the article.
Besides text materials, the article may also include picture materials and video materials, and the extraction modes of the three materials are described below.
1) When the material is text material, the server may identify the words in the article and combine all of them into the text material. In general, however, an article contains a large amount of text, and displaying all of it in the video would make the video long to play, which is not conducive to the user quickly acquiring the information the article conveys; the server can therefore extract the important sentences of the article and determine the resulting abstract as the text material.
In this embodiment, the server may extract the abstract through a trained abstract extraction model, which may be referred to as an AI (Artificial Intelligence) abstract system, or through an abstract extraction algorithm. The two extraction modes are described below.
In a first extraction method, extracting materials in an article includes: inputting the articles into a abstract extraction model obtained by training in advance; the output of the abstract extraction model is determined to be text material.
The abstract extraction model can be obtained by training a server or by training other devices, and the server obtains the abstract extraction model from the other devices.
Taking the server training the abstract extraction model as an example: the server can obtain a large number of training samples, each comprising an article and an abstract obtained by manually extracting sentences from the article; the server builds an initial model using a neural network or a machine learning algorithm; and the initial model is trained with the training samples to obtain the abstract extraction model. Optionally, an abstract word number threshold may also be set when training the abstract extraction model, so that the number of words in the extracted abstract is less than the threshold.
After obtaining the abstract extraction model, the server can input the article into the abstract extraction model, the abstract extraction model extracts sentences in the article according to the abstract word number threshold, the extracted sentences are used as abstract to be output, and the abstract output by the abstract extraction model is determined to be text material by the server.
In a second extraction method, extracting materials in an article includes: word segmentation is carried out on the articles to obtain each keyword; for each sentence comprising the keyword, calculating the similarity of the sentence and other sentences; calculating the weight of the sentence according to the similarity, wherein the weight is used for representing the importance of the sentence; and selecting n sentences with highest weight values to form text materials, wherein n is a positive integer.
Wherein the server may extract the abstract in the article using a predetermined algorithm. That is, the server first determines each keyword in the article; and each sentence is regarded as a node, if two sentences have similarity, an undirected weighted edge exists between the nodes corresponding to the two sentences, the similarity between any two sentences can be calculated according to a similarity formula, and the similarity formula is not limited. After the similarity among the sentences is determined, a node diagram formed by the sentences serving as nodes can be constructed, the weight of each node is calculated according to a preset formula, and finally, the server sorts the sentences according to the order of the weight from high to low, and n sentences ranked in front are selected to obtain the abstract. When the predetermined algorithm is a TextRank algorithm, the predetermined formula is a TextRank formula, and the weight is a TextRank value.
It should be noted that the total number of words of the n sentences does not exceed the abstract word number threshold.
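By way of illustration, the following Python sketch implements the second extraction method under the TextRank formulation mentioned above; the similarity formula shown is the one commonly used with TextRank and is an assumption here, since this embodiment does not limit the similarity formula.

import math

def similarity(s1: str, s2: str) -> float:
    """Word-overlap similarity between two sentences, length-normalized."""
    w1, w2 = set(s1.split()), set(s2.split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank_abstract(sentences: list[str], n: int = 3,
                      d: float = 0.85, iterations: int = 30) -> list[str]:
    """Select the n highest-weight sentences, then restore article order."""
    count = len(sentences)
    sim = [[similarity(a, b) for b in sentences] for a in sentences]
    weights = [1.0] * count
    for _ in range(iterations):
        updated = []
        for i in range(count):
            rank = 0.0
            for j in range(count):
                if i == j or sim[j][i] == 0.0:
                    continue
                out_sum = sum(sim[j][k] for k in range(count) if k != j)
                if out_sum > 0.0:
                    rank += sim[j][i] / out_sum * weights[j]
            updated.append((1.0 - d) + d * rank)  # TextRank weight formula
        weights = updated
    top = sorted(range(count), key=lambda i: weights[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # restore article order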
After obtaining the abstract, the server can also revise some obvious problems in it. For example, the server may perform a similarity calculation on each pair of sentences in the abstract; when the similarity of two sentences is higher than a similarity threshold, it determines that the two sentences express the same meaning and deletes one of them, and then, within the abstract word number threshold, selects from the sentences of the article not yet chosen at least one sentence with the highest weight as a replacement, so that the total word count of the revised abstract still does not exceed the threshold. Alternatively, the server may reorder the sentences in the abstract according to their positions in the article. Alternatively, the server may delete a word that makes the abstract read poorly; for example, if the first sentence of the abstract is "but xxx", then since an opening sentence does not normally express a transition, the server may delete the word "but" to keep the sentence fluent. The present embodiment does not limit the manner of revision.
Since the modification of the abstract by the server may not meet the requirements of the user, optionally, after extracting the material in the article, the method further includes: transmitting the text material to a client; receiving a modified text material sent by a client, wherein the modified text material is obtained by modifying the text material displayed by the client by a user; and determining the modified text material as the finally obtained text material.
The user can modify the abstract to improve the quality of the abstract, so that the modified abstract can clearly express the central ideas of the articles.
Referring to fig. 5, a process flow for text material in an article is shown. It should be noted that after the abstract is obtained, intelligent sentence segmentation may further be performed on the abstract, and the resulting sentences are added as subtitles to the picture material, which is described in detail below.
2) When the material is a picture material, the server can firstly determine the number of pictures included in the article, and when the number of the pictures is greater than or equal to a picture number threshold value m, m pictures with the best picture quality are selected from the pictures to obtain the picture material, wherein m is a positive integer; when the number of the pictures is smaller than m, the server can firstly select k (k is smaller than m) pictures in the article, and then select m-k pictures matched with keywords in the article from a preset picture library to obtain picture materials. The picture library comprises various types of pictures and is used for providing pictures matched with the keywords. In general, the value of m may be set to 5-10, but may be set to other values, which is not limited in this embodiment.
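By way of illustration, the selection logic above can be sketched in Python as follows; the quality scorer and the keyword-based library lookup are hypothetical stand-ins for whatever scoring and matching the server actually applies.

def build_picture_material(article_pictures: list[str], keywords: list[str],
                           quality_score, library_lookup, m: int = 5) -> list[str]:
    """Select up to m pictures, topping up from the picture library if needed."""
    ranked = sorted(article_pictures, key=quality_score, reverse=True)
    if len(ranked) >= m:
        return ranked[:m]                   # enough pictures in the article itself
    picked = ranked                         # the k pictures in the article (k < m)
    needed = m - len(picked)                # fetch m - k pictures matching keywords
    picked += library_lookup(keywords, needed)
    return picked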
After the picture material is obtained, the server can also process the pictures in the picture material through the image processing library, so that the format and the size of each picture are unified. For example, the image processing library uniformly converts all the pictures into the same format, or the image processing library scales the sizes of the pictures so that the sizes of all the scaled pictures are the same, which is not limited in this embodiment. The image processing library may be an imagemagick library, or may be another library, which is not limited in this embodiment.
3) When the material is video material, the server can select at least one video from the videos in the article; for each video, when the playing duration of the video exceeds a duration threshold, the server can intercept from it a video clip whose length equals the duration threshold and use the clip as the video material. When the server selects one video, the duration threshold may be the playing duration of the video material described in step 407; when the server selects at least two videos, the duration threshold may be the playing duration of the video material divided by the number of videos, as described in detail in step 407.
In the second acquisition mode, an article link sent by the client is acquired, articles to be converted are acquired according to the article link, and materials in the articles are extracted.
The client side displays an input box, a user can input an article link in the input box, then clicks a sending button to trigger a sending instruction, the client side sends the article link to the server according to the sending instruction, at the moment, the server receives the article link, acquires the article according to the article link, and finally extracts materials in the article. The process of extracting the material by the server is described in detail in the first acquisition mode, and is not described herein.
In a third obtaining manner, a material sent by a client is obtained after the client extracts an article to be converted.
The client side displays an input box, a user can input at least one material in the article in the input box, and then clicks a sending button to trigger a sending instruction, and the client side sends the material to the server according to the sending instruction, and at the moment, the server receives the material.
Step 402, converting text material in the material into audio.
The process of converting text material into audio is described in step 302, and will not be described herein.
Optionally, when the video further includes background audio, converting text material in the material to audio includes: performing voice conversion on characters in the text material to obtain voice audio; acquiring background audio; and synthesizing the voice audio and the background audio to obtain the audio.
The background audio may be the background audio selected by the server at random, the background audio selected by the server and matched with the typesetting format, or the background audio selected by the user, which is not limited in this embodiment.
In step 403, a label of the article is obtained, where the label is used to indicate a category to which the article belongs.
The article may be of various categories, such as entertainment, physical, sports, child-care, economy, health, etc., and the embodiment is not limited thereto.
The present embodiment provides three ways of obtaining a tag, and the three ways of obtaining are described below respectively.
In the first acquisition mode, the labels of the articles sent by the client are acquired.
The client can send the article or the article link or the material, and can also send the label of the article, and the server acquires the label at the moment. Alternatively, the client may send a tag when sending an article or article link or material.
The tag sent by the client may be edited by the user according to the information transmitted by the article, or may be extracted from the description information of the article by the client, or may be extracted according to the keywords in the text material, and the embodiment does not limit the obtaining mode of the tag. Wherein the descriptive information is used to describe the subject matter of the article, the category to which it belongs, etc.
In the second acquisition mode, the labels of the articles are extracted from the descriptive information of the articles.
In a third approach, labels of articles are extracted from keywords in text material.
The server can segment keywords in the articles, and determines the category to which the articles belong according to the category to which the keywords belong. For example, if a keyword of stock is mentioned several times in an article and the category to which the keyword of stock belongs is economy, the server may determine that the category to which the article belongs is economy.
And step 404, obtaining the video templates corresponding to the labels from the template library, and determining the video templates as video templates matched with the articles.
The method comprises the steps that a plurality of data items are prestored in a template library, each data item comprises a corresponding relation between a tag and a video template, after the tag is acquired by a server, the data item containing the tag can be searched in the template library, and then the video template contained in the data item is determined to be a video template matched with the category to which an article belongs.
Where the material comprises text material, the video template comprises a typesetting template and a background picture, as described in step 405. When the material includes text material and picture material, or when the material includes text material, picture material and video material, the video template includes a typesetting template, as described in steps 406 and 407.
And 405, when the material comprises a text material, synthesizing the text material, the background picture and the audio according to the typesetting template to obtain a video, and ending the flow.
Synthesizing the text material, the background picture and the audio according to the typesetting template to obtain a video, wherein the method comprises the following steps:
1) And matching the audio with each sentence in the text material to obtain first playing time information of each sentence, wherein the first playing time information comprises the playing starting time and the playing duration of the sentence.
The first playing time information of each sentence is obtained in two ways, and the two ways of obtaining are described below.
In a first acquisition mode, when the audio is an audio fragment set obtained by dividing the text material and respectively performing voice conversion on each sentence, each audio fragment in the audio fragment set is respectively matched with each sentence, and first playing time information of the sentence is determined according to the audio fragment matched with the sentence.
Wherein the server obtains a collection of audio clips, and each audio clip corresponds to a sentence in the text material. The server may mark the playing time of each audio clip when each audio clip is obtained, and since the playing time of each audio clip is the playing time of the sentence matched with the audio clip, the first playing time information of the sentence may be determined according to the playing time of the audio clip.
In the second acquisition mode, when the audio is obtained by speech conversion of the whole text material, the text material is segmented into sentences; the audio is then segmented according to the sentences to obtain the audio clips; and the first playing time information of each sentence is determined according to the audio clip matched with that sentence.
Wherein the server obtains a complete audio. The server may segment the audio to obtain each audio segment, and mark the playing time of each audio segment, where the playing time of the audio segment is the playing time of the sentence matched with the audio segment, so that the first playing time information of the sentence may be determined according to the playing time of the audio segment.
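In either acquisition mode, once every sentence has a matched audio clip, the first playing time information follows directly from the clip durations. By way of illustration, a minimal Python sketch (durations in seconds; the dictionary fields are hypothetical names):

def first_play_times(sentences: list[str],
                     clip_durations: list[float]) -> list[dict]:
    """Derive each sentence's play start time and play duration from its clip."""
    times, cursor = [], 0.0
    for sentence, duration in zip(sentences, clip_durations):
        times.append({"sentence": sentence,
                      "start": cursor,        # play start time of the sentence
                      "duration": duration})  # play duration of the sentence
        cursor += duration                    # the next sentence starts here
    return times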
In this embodiment, the server may process the audio through processing software, where the processing software may be FFMPEG software, or other software, and this embodiment is not limited.
2) And synthesizing the text material, the background picture and the audio according to the typesetting template and the first playing time information to obtain the video.
The background picture here is a background picture in the video template, not a picture extracted from the article.
When the background picture comprises a picture, the playing start time of the picture is the playing start time of the video, and the playing time of the picture is the playing time of the video, so that second playing time information comprising the playing start time and the playing time of the picture is obtained; when the background picture comprises at least two pictures, the playing time length of the audio can be obtained through processing software, the playing time length is divided by the number of the pictures to obtain the playing time length of each picture, the position of each picture is designated, and the playing starting time of each picture is determined according to the playing time length and the position of each picture, so that second playing time information comprising the playing starting time of the picture and the playing time length is obtained. For example, the playing time of the pictures is 5s, the playing start time of the first picture is 1s, the playing start time of the second picture is 6s, the playing start time of the third picture is 11s, and so on.
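By way of illustration, the division of the audio duration among the background pictures can be sketched in Python as follows; the 1 s initial offset reproduces the example above (start times of 1 s, 6 s, 11 s) and is otherwise an assumption.

def second_play_times(audio_duration: float, picture_count: int,
                      offset: float = 1.0) -> list[dict]:
    """Compute play start time and play duration for each background picture."""
    duration = audio_duration / picture_count   # equal share of the audio
    return [{"picture_index": i,
             "start": offset + i * duration,
             "duration": duration}
            for i in range(picture_count)]

# second_play_times(15.0, 3) -> starts at 1.0 s, 6.0 s, 11.0 s, each lasting 5.0 s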
The preprocessor in the server can produce animation according to the typesetting template, the text material, the first playing time information, the background picture and the second playing time information, namely, key frames in the video are produced, a preprocessing result comprising the key frames and the audio is obtained, the preprocessing result is sent to the renderer through a json format, the renderer analyzes the preprocessing result in the json format, a rendering file is generated, a rendering interface is called, and a rendering command is executed on the rendering file by using the rendering interface, so that the video is obtained. The rendering file herein may be a compilable engineering template file in xml format.
It should be noted that the server may also add information such as the subject of the article on the first of the background pictures, perform intelligent sentence segmentation on the abstract, and add the resulting sentences as subtitles to the remaining pictures for playing. When adding subtitles, a threshold for the number of words played per frame can be set; when the word count of a sentence exceeds this threshold, the sentence can be split into at least two phrases, each added to a picture as the subtitle of one frame.
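By way of illustration, the subtitle split can be sketched in Python as follows; the threshold of 20 characters per frame is a hypothetical value, since this embodiment does not fix the per-frame word number threshold.

def split_subtitle(sentence: str, max_chars: int = 20) -> list[str]:
    """Split a long sentence into phrases, one subtitle frame per phrase."""
    if len(sentence) <= max_chars:
        return [sentence]
    return [sentence[i:i + max_chars]
            for i in range(0, len(sentence), max_chars)]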
Step 406, when the material includes text material and picture material, matching each sentence in the audio and text material to obtain first playing time information of each sentence, where the first playing time information includes a playing start time and a playing duration of the sentence; determining the playing time length of each picture in the picture material according to the playing time length of the audio, and determining the playing start time of each picture according to the position and the playing time length of each picture in the article to obtain second playing time information; and synthesizing the text material, the picture material and the audio according to the typesetting template, the first playing time information and the second playing time information to obtain a video, and ending the flow.
The difference between the step 406 and the step 405 is that the pictures in the picture material in the step 406 are extracted from the article, and the pictures have a definite position relationship in the article, so the server needs to determine the playing start time of each picture according to the position and the playing time of each picture in the article.
Step 407, when the material includes text material, picture material and video material, matching each sentence in the audio and text material to obtain first playing time information of each sentence, where the first playing time information includes playing start time and playing duration of the sentence; determining the playing time length of the video material and the playing time length of each picture in the picture material according to the playing time length of the audio, and determining the playing start time of the video material and each picture according to the positions of the video material and each picture in the article, the playing time length of the video material and the playing time length of each picture to obtain third playing time information; and synthesizing the text material, the picture material, the video material and the audio according to the typesetting template, the first playing time information and the third playing time information to obtain a video, and ending the flow.
The two calculation modes of the playing time length of each picture in the video material and the picture material are respectively described below.
In the first calculation mode, the server may determine the playing duration of the video material, so that the playing duration of the video material is smaller than the playing duration of the audio. That is, no matter how many videos the video material includes, it is necessary to ensure that the sum of the play durations of all videos is smaller than the play duration of the audio. It should be noted that, when the playing time period of the video material is greater than or equal to the playing time period of the audio, the video clip may be intercepted from the video in the video material as the final video material. Because the playing time length of the video material is determined, the server can subtract the playing time length of the video from the playing time length of the audio, and then divide the remaining time length by the number of the pictures to obtain the playing time length of each picture. For example, the playing time of the audio is 30s, the playing time of the video is 5s, and 5 pictures are included, and the playing time of each picture is (30-5)/5=5 s.
In the second calculation mode, the server may treat each video in the video material as a picture; the server then divides the playing duration of the audio by the sum of the number of pictures and the number of videos to obtain the playing duration of each picture and of each video. For example, if the playing duration of the audio is 30s, the picture material contains 3 pictures, and the video material contains 2 videos, the playing duration of each picture and each video is 30/(3+2)=6s.
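By way of illustration, the two calculation modes can be sketched in Python as follows (durations in seconds):

def picture_duration_mode1(audio_s: float, video_s: float,
                           picture_count: int) -> float:
    """Mode 1: subtract the video duration, split the rest among the pictures."""
    return (audio_s - video_s) / picture_count      # e.g. (30 - 5) / 5 = 5 s

def item_duration_mode2(audio_s: float, picture_count: int,
                        video_count: int) -> float:
    """Mode 2: treat each video as a picture and share the audio evenly."""
    return audio_s / (picture_count + video_count)  # e.g. 30 / (3 + 2) = 6 s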
Step 407 differs from step 406 in that in step 407, the material further includes video material, and since the video and the pictures have a clear positional relationship in the article, the server needs to determine the playing start time of the video material and each picture according to the positions of the video material and each picture in the article, the playing time of the video material, and the playing time of each picture. For example, the playing duration of the video and the pictures is 5s, and the video is located between the second picture and the third picture, the playing start time of the first picture is 1s, the playing start time of the second picture is 6s, the playing start time of the video is 11s, the playing start time of the third picture is 16s, and so on.
After determining the playing duration of the video material, when the playing duration of a video in the article is longer than the playing duration of the video material, the server also needs to intercept a video clip from that video to serve as the video material. The server inputs the video and the playing duration of the video material into a video interception model; the model identifies the key frames in the video, determines the content of the video according to the key frames, and intercepts from the video, according to that content, a video clip whose playing duration equals the playing duration of the video material. The video clip output by the video interception model is determined as the video material.
The video capturing model can be obtained through training by a server or training by other equipment, and the server can obtain the video capturing model from the other equipment.
Taking a server training video interception model as an example for explanation, the server can acquire a large number of training samples, wherein each training sample comprises a video, a set playing time length and a video fragment which is obtained by manually intercepting key contents in the video and has the playing time length of the set playing time length; the server builds an initial model by using a neural network or a machine learning algorithm; the initial model is trained by a large number of training samples, and a video capture model is obtained.
After the video interception model is obtained, if the article only comprises one video, the server can input the playing time length of the video and the video material into the video interception model; if the article includes at least two videos, for each video, the server may input the playing duration of the video corresponding to the video in the video and the video material into the video capturing model. The video interception model identifies key frames in the video, the content of the video is determined according to the key frames, a video segment with playing time length being the playing time length of the video material is intercepted from the video according to the content, the video segment is output, and the video segment output by the video interception model is determined to be the video material by the server.
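By way of illustration, once the video interception model has chosen where the key content starts, the clip itself can be cut with the ffmpeg command-line tool, as in the following Python sketch; the model call and the file names are hypothetical.

import subprocess

def intercept_clip(video_path: str, clip_path: str,
                   start_s: float, duration_s: float) -> None:
    """Cut a clip of the required playing duration out of the source video."""
    subprocess.run([
        "ffmpeg", "-ss", str(start_s),   # seek to the start chosen by the model
        "-i", video_path,
        "-t", str(duration_s),           # clip length = video-material duration
        "-c", "copy",                    # cut without re-encoding
        clip_path,
    ], check=True)

# start = capture_model.predict(video_path, duration_s)  # hypothetical model call
# intercept_clip("article_video.mp4", "clip.mp4", start, 5.0)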
Optionally, when the video material includes audio, before synthesizing the text material, the picture material, the video material and the audio according to the typesetting template, the first playing time information and the third playing time information to obtain the video, the method further includes: weakening the volume of the audio in the video material and synthesizing it with the audio converted from the text material; or masking the audio in the video material.
That is, the server may retain the audio in the video material and weaken its volume to reduce interference with the audio of the text material; alternatively, the server may discard the audio in the video material entirely to avoid such interference.
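Both options can be sketched with an off-the-shelf audio library; the snippet below uses pydub (the file names and the 18 dB attenuation are illustrative assumptions, and pydub requires FFmpeg to be installed):

```python
from pydub import AudioSegment  # pip install pydub

speech = AudioSegment.from_file("narration.mp3")  # audio converted from text material
clip_audio = AudioSegment.from_file("clip.mp4")   # audio track of the video material

ducked = clip_audio - 18          # weaken the video's own audio by 18 dB
mixed = speech.overlay(ducked)    # narration stays in the foreground
mixed.export("mixed.wav", format="wav")
# Masking instead: skip the overlay and export the narration alone.
```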
Referring to fig. 6, a schematic diagram of a server converting an article into a video is shown: the user first inputs an article, an article link, or material from the article; the server analyzes the keywords, intelligently picks high-quality pictures and an abstract from the article, generates an animation from the image-text content, converts the text into audio, and finally renders the synthesized video.
Referring to fig. 7, a block diagram illustrating the steps of a server converting an article into video is shown. The queue constructed in fig. 7 is the second message queue.
Referring to fig. 8, a schematic diagram of the user-side operations for converting an article into a video is shown: selecting a video template, inputting the article, modifying the abstract, and downloading the video. The user obtains the video in at most four operations. Note that selecting a video template and modifying the abstract are optional steps; if both are omitted, the user can download the video directly after inputting the article, which further simplifies the operation.
Optionally, after obtaining the video, the server may transcode it through processing software to obtain multiple videos adapted to different resolutions or in different formats; the user then downloads, through the client, the video adapted to the user's own device, thereby ensuring the playing effect. The processing software may be FFmpeg or other software, which is not limited in this embodiment.
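For example, a minimal transcoding pass over the rendered video could look like the following (standard FFmpeg options; the paths and the two target resolutions are illustrative):

```python
import subprocess

for height in (480, 720):
    subprocess.run(
        ["ffmpeg", "-y", "-i", "out.mp4",
         "-vf", f"scale=-2:{height}",   # keep aspect ratio, force even width
         "-c:v", "libx264", "-c:a", "aac",
         f"out_{height}p.mp4"],
        check=True,
    )
```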
In summary, in the method for converting an article into a video provided in the embodiment of the present application, when the article includes text material, a video template matching the category to which the article belongs is obtained from the template library. Since the background picture in the video template matches the category to which the article belongs, and that category correctly reflects the content of the text material, the background picture is relevant to the content of the text material. This solves the problem that, when the article includes text material, pictures found according to keywords may be inaccurate and mismatch the text material, thereby ensuring the matching degree of the text material and the background picture and improving the quality of the video.
After the abstract is obtained, it is sent to the client so that the user can modify it; because the user's modification improves the quality of the abstract, the modified abstract can clearly express the central idea of the article.
When the number of the pictures in the article is insufficient, the pictures matched with the keywords in the article can be selected from a preset picture library, so that the pictures of the video are enriched, and the quality of the video is improved.
By acquiring the article link sent by the client, obtaining the article to be converted according to the link, and extracting the materials from the article, the operation of uploading articles is simplified for the user.
The video template corresponding to the label is obtained from the template library and determined as the video template matched with the article, so that the obtained video template matches the category to which the article belongs, which improves the video effect.
Referring to fig. 9, a block diagram of an apparatus for converting an article into a video according to an embodiment of the present application is shown. The apparatus may be applied to a server including a preprocessor and a renderer as shown in fig. 1, and includes:
An obtaining module 910, configured to obtain materials in an article to be converted;
the conversion module 920 is configured to convert text materials in the materials obtained by the obtaining module 910 into audio, where the text materials are materials composed of characters in the article;
the obtaining module 910 is further configured to obtain a video template that is matched with a category to which the article belongs from a template library, where at least one video template is preset in the template library, and the video template that is matched with the category to which the article including the text material belongs includes a typesetting template and a background picture, and the background picture is matched with the category to which the article belongs;
and the synthesis module 930 is configured to synthesize the text material, the background picture and the audio according to the typesetting template to obtain a video.
Optionally, the acquiring module 910 is further configured to: acquiring a label of an article, wherein the label is used for indicating the category to which the article belongs; and acquiring the video templates corresponding to the labels from the template library, and determining the video templates as video templates matched with the articles.
Optionally, the acquiring module 910 is further configured to: acquiring labels of articles sent by a client; or extracting the label of the article from the description information of the article; or extracting the labels of the articles according to the keywords in the text material.
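The tag resolution order and the template lookup can be sketched as follows (a hedged illustration only: the dictionary-based library and all field names are invented for the example):

```python
def match_template(article, template_library, client_tag=None):
    """Resolve the article's tag -- client-supplied tag first, then the
    description tag, then the most frequent keyword -- and look up the
    matching video template."""
    tag = (client_tag
           or article.get("description_tag")
           or max(article["keywords"], key=article["keywords"].get, default=None))
    return template_library.get(tag, template_library["default"])

templates = {"tech": "template_tech", "food": "template_food",
             "default": "template_plain"}
art = {"description_tag": None, "keywords": {"food": 3, "recipe": 2}}
print(match_template(art, templates))   # -> template_food
```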
Optionally, the synthesizing module 930 is further configured to: matching the audio with each sentence in the text material to obtain first playing time information of each sentence, wherein the first playing time information comprises playing start time and playing duration of the sentence; and synthesizing the text material, the background picture and the audio according to the typesetting template and the first playing time information to obtain the video.
Optionally, when the material further includes a picture material, the video template matched with the category to which the article including the text material and the picture material belongs includes a typesetting template; the apparatus further comprises:
the matching module is used for, after the obtaining module 910 acquires from the template library the video template matched with the category to which the article belongs, matching the audio with each sentence in the text material to obtain first playing time information of each sentence, wherein the first playing time information comprises the playing start time and the playing duration of the sentence;
the determining module is used for determining the playing time length of each picture in the picture material according to the playing time length of the audio, and determining the playing start time of each picture according to the position and the playing time length of each picture in the article to obtain second playing time information;
The synthesizing module 930 is further configured to synthesize the text material, the picture material, and the audio according to the typesetting template, the first playing time information, and the second playing time information, so as to obtain a video.
Optionally, when the material further includes a picture material and a video material, the video template matched with the category to which the article including the text material, the picture material and the video material belongs includes a typesetting template; the apparatus further comprises:
the matching module is used for, after the obtaining module 910 acquires from the template library the video template matched with the category to which the article belongs, matching the audio with each sentence in the text material to obtain first playing time information of each sentence, wherein the first playing time information comprises the playing start time and the playing duration of the sentence;
the determining module is used for determining the playing time length of the video material and the playing time length of each picture in the picture material according to the playing time length of the audio, and determining the playing start time of the video material and each picture according to the positions of the video material and each picture in the article, the playing time length of the video material and the playing time length of each picture, so as to obtain third playing time information;
the synthesizing module 930 is further configured to synthesize the text material, the picture material, the video material and the audio according to the video template, the first playing time information and the third playing time information, so as to obtain a video.
Optionally, the apparatus further comprises:
the input module is used for inputting the playing time lengths of the video and the video materials into a video interception model after the determining module determines the playing time length of the video materials and the playing time length of each picture in the picture materials according to the playing time length of the audio, and intercepting a video fragment with the playing time length being the playing time length of the video materials from the video according to the content when the playing time length of the video in the article is longer than the playing time length of the video materials, wherein the video interception model is used for identifying key frames in the video;
and the determining module is also used for determining the video clips output by the video interception model as the video materials.
Optionally, when the video material includes audio, the synthesizing module 930 is further configured to: before synthesizing the text material, the picture material, the video material and the audio according to the typesetting template, the first playing time information and the third playing time information to obtain a video, weakening the volume of the audio in the video material, and synthesizing the audio in the video material and the audio obtained by converting the text material; or before synthesizing the text material, the picture material, the video material and the audio according to the typesetting template, the first playing time information and the third playing time information to obtain the video, shielding the audio in the video material.
Optionally, the matching module is further configured to: when the audio is a set of audio clips obtained by splitting the text material into sentences and performing speech conversion on each sentence, match each audio clip in the set with the corresponding sentence, and determine the first playing time information of the sentence according to its matched audio clip; or, when the audio is obtained by speech conversion of the whole text material, split the text material into sentences, segment the audio according to the sentences to obtain the audio clips, and determine the first playing time information of each sentence according to its matched audio clip.
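In the first mode, the per-sentence clip durations directly yield the first playing time information; a minimal sketch (names invented for the example):

```python
def sentence_timings(sentence_clips):
    """sentence_clips: (sentence, clip_duration_s) pairs in article order.
    Accumulates each sentence's playing start time and duration."""
    timings, start = [], 0.0
    for sentence, duration in sentence_clips:
        timings.append({"sentence": sentence, "start": start, "duration": duration})
        start += duration
    return timings

clips = [("First sentence.", 2.4), ("Second sentence.", 3.1), ("Third.", 1.2)]
for t in sentence_timings(clips):
    print(t)
```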
Optionally, the conversion module 920 is further configured to: performing voice conversion on characters in the text material to obtain voice audio; acquiring background audio; and synthesizing the voice audio and the background audio to obtain the audio.
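The speech/background synthesis can likewise be sketched with pydub (illustrative file names; the 20 dB attenuation is an assumed mixing level, and pydub requires FFmpeg):

```python
from pydub import AudioSegment  # pip install pydub

speech = AudioSegment.from_file("speech.wav")
music = AudioSegment.from_file("background.mp3") - 20   # keep the music quiet

# Loop the background until it covers the narration, trim, then mix.
looped = (music * (len(speech) // len(music) + 1))[: len(speech)]
final = speech.overlay(looped)
final.export("narration_with_bgm.wav", format="wav")
```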
Optionally, the acquiring module 910 is further configured to: acquiring an article to be converted sent by a client, and extracting materials in the article; or acquiring an article link sent by the client, acquiring an article to be converted according to the article link, and extracting materials in the article; or, acquiring the material sent by the client, wherein the material is obtained after the client extracts the article to be converted.
Optionally, when the material is text material, the obtaining module 910 is further configured to: input the article into an abstract extraction model obtained by training in advance, and determine the output of the abstract extraction model as the text material.
Optionally, when the material is text material, the obtaining module 910 is further configured to: perform word segmentation on the article to obtain keywords; for each sentence including a keyword, calculate the similarity between the sentence and the other sentences; calculate the weight of the sentence according to the similarity, where the weight represents the importance of the sentence; and select the n sentences with the highest weights to form the text material, where n is a positive integer, as sketched below.
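The sketch is a TextRank-flavoured illustration, not the patent's exact algorithm: similarity is plain word overlap and the weight is a PageRank score; a production system would use a proper word-segmentation tokenizer rather than whitespace splitting:

```python
import itertools
import networkx as nx

def extractive_summary(sentences, n=2):
    """Score sentences by PageRank over a word-overlap similarity graph
    and keep the n highest-weighted ones, in article order."""
    words = [set(s.lower().split()) for s in sentences]
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(words[i] & words[j])
        if overlap:
            g.add_edge(i, j, weight=overlap)
    scores = nx.pagerank(g, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

doc = ["Cats are small mammals.", "Cats hunt mice.",
       "The weather is nice.", "Mice fear cats and hide."]
print(extractive_summary(doc, n=2))
```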
Optionally, when the material is text material, the apparatus further comprises:
the sending module is used for sending the text materials to the client;
the receiving module is used for receiving the modified text material sent by the client, wherein the modified text material is obtained after the user modifies the text material displayed by the client;
and the determining module is used for determining the modified text material obtained by the receiving module as the finally obtained text material.
In summary, when an article includes text material, the apparatus for converting an article into a video provided in the embodiment of the present application obtains, from the template library, a video template matching the category to which the article belongs. Since the background picture in the video template matches that category, and the category correctly reflects the content of the text material, the background picture is relevant to the content of the text material. This solves the problem that pictures found according to keywords may be inaccurate and mismatch the text material, thereby ensuring the matching degree of the text material and the background picture and improving the quality of the video.
When the number of the pictures in the article is insufficient, the pictures matched with the keywords in the article can be selected from a preset picture library, so that the pictures of the video are enriched, and the quality of the video is improved.
By acquiring the article link sent by the client, obtaining the article to be converted according to the link, and extracting the materials from the article, the operation of uploading articles is simplified for the user.
The video template corresponding to the label is obtained from the template library and determined as the video template matched with the article, so that the obtained video template matches the category to which the article belongs, which improves the video effect.
Referring to fig. 10, a block diagram of a server according to an embodiment of the present application is shown. The server 1000 includes a Central Processing Unit (CPU) 1001, a system memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The server 1000 also includes a basic input/output system (I/O system) 1006 for supporting the transfer of information between various devices within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse or keyboard, through which the user enters information. The display 1008 and the input device 1009 are both connected to the central processing unit 1001 through an input/output controller connected to the system bus 1005. The basic input/output system 1006 may also include the input/output controller for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1007 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 1004 and mass storage devices 1007 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1000 may also be operated by a remote computer connected through a network such as the Internet. That is, the server 1000 may connect to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or may use the network interface unit 1011 to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs stored in the memory and configured to perform the method of converting an article into a video provided by the above embodiments.
One embodiment of the present application provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions loaded and executed by a processor to implement a method for converting an article into video as described above.
One embodiment of the present application provides an apparatus for converting an article into video, the apparatus for converting an article into video including a processor and a memory, the memory having at least one instruction stored therein, the instruction being loaded and executed by the processor to implement a method for converting an article into video as described above.
It should be noted that: in the device for converting an article into video according to the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device for converting an article into video is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the device for converting the article into the video provided in the above embodiment belongs to the same concept as the method embodiment for converting the article into the video, and the detailed implementation process of the device is referred to as the method embodiment, which is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is not intended to limit the embodiments of the present application, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the embodiments of the present application are intended to be included within the scope of the embodiments of the present application.

Claims (8)

1. A method of converting an article to video, the method comprising:
acquiring an article to be converted;
inputting the text in the article into an abstract extraction model, the abstract extraction model extracting sentences from the text according to an abstract word-count threshold and deleting non-fluent words from the extracted abstract, to obtain the text material of the article;
converting the text material to audio;
when the article further comprises pictures and the number of pictures included in the article is determined to be larger than a picture threshold, selecting from the article the pictures with the best quality, equal in number to the picture threshold, to obtain the picture material of the article;
when the article further comprises a video and the playing duration of the video in the article is greater than a duration threshold, inputting the video and the duration threshold into a video interception model, wherein the video interception model identifies key frames in the video, determines the content of the video according to the key frames, and intercepts from the video, according to that content, a video clip whose playing duration equals the duration threshold, as the video material of the article, the duration threshold being smaller than the playing duration of the audio;
obtaining a video template matched with the category to which the article belongs from a template library, wherein the template library comprises at least one video template;
in the case that the material of the article comprises only the text material, the matched video template comprises a typesetting template and a background picture, the background picture matching the category to which the article belongs, and the typesetting template specifying the playing position, font, color and character size of the words in the text material and the special effects upon their appearance and disappearance; and synthesizing the text material, the background picture and the audio according to the typesetting template to obtain a video;
in the case that the material of the article comprises the text material, the picture material and the video material, the matched video template comprises a typesetting template, the typesetting template specifying the playing position, font, color and character size of the words in the text material and the special effects upon their appearance and disappearance, the playing positions of the pictures in the picture material and the special effects upon their appearance and disappearance, and the playing positions of the videos in the video material and the special effects upon their appearance and disappearance;
matching the audio with each sentence in the text material to obtain first playing time information of each sentence, wherein the first playing time information comprises playing start time and playing duration of the sentence;
determining the playing duration of each picture in the picture material according to the playing duration of the audio, and determining the playing start time of each picture in the picture material according to the position of each picture in the article and its playing duration, to obtain second playing time information;
determining a playing start time of the video material according to the position of the video material in the article and the playing time of the video material to obtain third playing time information;
and synthesizing the text material, the picture material, the video material and the audio according to the typesetting template, the first playing time information, the second playing time information and the third playing time information to obtain a video.
2. The method of claim 1, wherein when the video material comprises audio, the method further comprises:
weakening the volume of the audio in the video material, and synthesizing the audio in the video material and the audio obtained by converting the text material; or alternatively, the process may be performed,
masking audio in the video material.
3. The method according to any one of claims 1 to 2, wherein said matching the audio with each sentence in the text material to obtain first playing time information of each sentence includes:
When the audio is an audio fragment set obtained by dividing the text material and respectively performing voice conversion on each obtained sentence, respectively matching each audio fragment in the audio fragment set with each sentence, and determining first playing time information of the sentence according to the audio fragment matched with the sentence;
when the audio is obtained by voice conversion of the whole text material, sentence segmentation is carried out on the text material to obtain each sentence; segmenting the audio according to each statement to obtain each audio segment; and determining the first playing time information of the sentence according to the audio fragment matched with the sentence.
4. The method of claim 1, wherein the obtaining, from a template library, a video template that matches a category to which the article belongs, comprises:
acquiring a label of the article, wherein the label is used for indicating the category to which the article belongs;
and acquiring a video template corresponding to the tag from the template library, and determining the video template as a video template matched with the category to which the article belongs.
5. The method according to claim 1, wherein the method further comprises:
transmitting the text material to a client;
receiving a modified text material sent by the client, wherein the modified text material is obtained by modifying the text material displayed by the client by a user;
and determining the modified text material as the finally obtained text material.
6. An apparatus for converting an article to video, the apparatus comprising:
the acquisition module is used for acquiring articles to be converted;
the input module is used for inputting the text in the article into an abstract extraction model, the abstract extraction model extracting sentences from the text according to an abstract word-count threshold and deleting non-fluent words from the extracted abstract, to obtain the text material of the article;
the conversion module is used for converting the text material into audio;
the determining module is used for, when the article further comprises pictures and the number of pictures included in the article is determined to be larger than a picture threshold, selecting from the article the pictures with the best quality, equal in number to the picture threshold, to obtain the picture material of the article;
the input module is further used for, when the article further comprises a video and the playing duration of the video in the article is greater than a duration threshold, inputting the video and the duration threshold into a video interception model, wherein the video interception model identifies key frames in the video, determines the content of the video according to the key frames, and intercepts from the video, according to that content, a video clip whose playing duration equals the duration threshold, as the video material of the article, the duration threshold being smaller than the playing duration of the audio;
The acquisition module is further used for acquiring a video template matched with the category to which the article belongs from a template library, wherein at least one video template is preset in the template library;
when the material of the article comprises only the text material, the matched video template comprises a typesetting template and a background picture, the background picture matching the category to which the article belongs, and the typesetting template specifying the playing position, font, color and character size of the words in the text material and the special effects upon their appearance and disappearance; the synthesis module is used for synthesizing the text material, the background picture and the audio according to the typesetting template to obtain a video;
in the case that the material of the article comprises the text material, the picture material and the video material, the matched video template comprises a typesetting template, the typesetting template specifying the playing position, font, color and character size of the words in the text material and the special effects upon their appearance and disappearance, the playing positions of the pictures in the picture material and the special effects upon their appearance and disappearance, and the playing positions of the videos in the video material and the special effects upon their appearance and disappearance;
The matching module is used for matching the audio with each sentence in the text material to obtain first playing time information of each sentence, wherein the first playing time information comprises the playing starting time and the playing duration of the sentence;
the determining module is further configured to determine a playing duration of each picture in the picture material according to the playing duration of the audio, and determine a playing start time of each picture in the picture material according to a position of each picture in the picture material in the article and the playing duration of each picture in the picture material, so as to obtain second playing time information;
the determining module is further configured to determine a playing start time of the video material according to a position of the video material in the article and a playing duration of the video material, so as to obtain third playing time information;
the synthesizing module is further configured to synthesize the text material, the picture material, the video material and the audio according to the typesetting template, the first playing time information, the second playing time information and the third playing time information, so as to obtain a video.
7. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of converting an article of any one of claims 1 to 5 into video.
8. An article to video conversion device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement the method of converting an article to video according to any one of claims 1 to 5.
CN201810863876.7A 2018-08-01 2018-08-01 Method, device, storage medium and equipment for converting article into video Active CN110807126B (en)

Publications (2)

CN110807126A (en) — published 2020-02-18
CN110807126B (en) — granted 2023-05-26

Family ID: 69486745



