CN112565875B - Method, device, equipment and computer readable storage medium for automatically generating video - Google Patents


Info

Publication number
CN112565875B
Authority
CN
China
Prior art keywords
multimedia content
video
node
data
module configured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011383389.4A
Other languages
Chinese (zh)
Other versions
CN112565875A (en)
Inventor
卞东海
彭卫华
罗雨
蒋帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011383389.4A
Publication of CN112565875A
Application granted
Publication of CN112565875B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/4355 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles

Abstract

According to example embodiments of the present disclosure, a method, an apparatus, a device, a computer-readable storage medium, and a computer program product for automatically generating a video are provided, relating to the fields of knowledge graphs, deep learning, and video creation. A method of automatically generating a video comprises: receiving a user input comprising first multimedia content and a key phrase for describing the video, the first multimedia content having at least one of a plurality of predetermined data formats; determining at least one node from a pre-constructed knowledge graph based on the key phrase; obtaining second multimedia content associated with the at least one node based on the first multimedia content; and generating the video based on the first multimedia content and the second multimedia content. In this way, videos can be generated automatically and efficiently.

Description

Method, device, equipment and computer readable storage medium for automatically generating video
Technical Field
Embodiments of the present disclosure relate generally to the field of information processing, and more particularly, to methods, apparatuses, devices, computer-readable storage media, and computer program products for automatically generating videos.
Background
With the development of mobile data networks, video's share of data on the internet is gradually surpassing that of text. In the technical field of video production, however, innovative applications based on artificial intelligence are still largely absent, and there is no established scheme for automatically producing videos. The traditional video production process has the following drawbacks: it places high demands on users, since producing a qualified video requires the producer to master a great deal of complex software, and the materials used to produce a video are not easy to collect. A scheme for automatically generating high-quality video is therefore needed.
Disclosure of Invention
According to an example embodiment of the present disclosure, a scheme for automatically generating a video is provided.
In a first aspect of the present disclosure, there is provided a method of automatically generating a video, comprising: receiving user input comprising first multimedia content and a key phrase for describing the video, the first multimedia content having at least one of a plurality of predetermined data formats; determining at least one node from a pre-constructed knowledge graph based on the key phrase; obtaining second multimedia content associated with the at least one node based on the first multimedia content; and generating the video based on the first multimedia content and the second multimedia content.
In a second aspect of the present disclosure, there is provided an apparatus for automatically generating a video, comprising: an input receiving module configured to receive a user input comprising first multimedia content and a key phrase for describing the video, the first multimedia content having at least one of a plurality of predetermined data formats; a first node determination module configured to determine at least one node from a pre-constructed knowledge graph based on the key phrase; a first multimedia content acquisition module configured to acquire second multimedia content associated with the at least one node based on the first multimedia content; and a first video generation module configured to generate the video based on the first multimedia content and the second multimedia content.
In a third aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored which, when executed by a processor, implements a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when executed by a processor, implement the method according to the first aspect of the present disclosure.
It should be understood that this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a flow diagram of an example of a process of automatically generating a video, according to some embodiments of the present disclosure;
FIG. 3 shows a flow diagram of another example of a process of automatically generating a video, in accordance with some embodiments of the present disclosure;
FIG. 4 shows a schematic block diagram of an apparatus for automatically generating video according to an embodiment of the present disclosure; and
FIG. 5 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the terms "include" and "comprise," and similar language, are to be construed as open-ended, i.e., "including but not limited to." The term "based on" should be understood as "based at least in part on." The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment." The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As mentioned above, the conventional scheme for authoring a video has the following drawbacks: (1) during video production, beyond the content shot by the user, various additional video materials are often needed to achieve the intended expressive effect, and obtaining them presents a professional threshold that is hard for ordinary users to cross: the materials are difficult to acquire, few in variety, and expensive; (2) composing the materials costs the user a great deal of time, for example determining the transition effects between materials of different formats or the position of text in the video.
Example embodiments of the present disclosure propose a scheme for automatically generating a video. In this scheme, first multimedia content input by a user and a description of the video to be generated are received. Second multimedia content is then acquired according to the first multimedia content and the description. Finally, the first multimedia content input by the user and the acquired second multimedia content are combined to generate the video. In this way, high-quality multimedia content for generating the video can be acquired automatically, and the video can be generated efficiently.
Fig. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. As shown, the example environment 100 includes a computing device 110, a database 120, a video 130, and a user 140. The computing device 110 may be connected to the database 120 and may also receive user input from the user 140. The database 120 may be any suitable database, centralized or distributed, including but not limited to databases based on knowledge-graph technology and retrieval-based databases.
In one embodiment, to generate the video 130, the computing device 110 may retrieve multimedia content, i.e., material, for the video 130 from a knowledge graph stored in the database. A knowledge graph essentially describes the semantic network of knowledge that objectively exists in the real world and the associations between pieces of knowledge. Based on their field of application, knowledge graphs are currently divided into general knowledge graphs and vertical knowledge graphs (also known as industry knowledge graphs). A general knowledge graph is not domain-specific and can be likened to structured encyclopedic knowledge; it contains a great deal of common-sense knowledge and emphasizes breadth. A vertical knowledge graph is oriented to a specific field, is constructed from industry knowledge, and emphasizes depth. The knowledge graph herein may be a video-specific knowledge graph in which each node stores multimodal data associated with that node, but it may also be a general knowledge graph; the present disclosure is not limited in this respect.
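For illustration, the following minimal Python sketch models one node of such a video-oriented multimodal knowledge graph. The class and field names (GraphNode, materials, neighbors) are assumptions made for this sketch, not the data model of this disclosure:

    from dataclasses import dataclass, field

    @dataclass
    class GraphNode:
        """One node of a video-oriented multimodal knowledge graph:
        the key phrases that identify the concept, plus the material
        associated with it in each predetermined data format."""
        name: str
        key_phrases: list = field(default_factory=list)
        materials: dict = field(default_factory=lambda: {
            "text": [], "picture": [], "video": [], "sound": [],
        })
        neighbors: list = field(default_factory=list)

    # The "engine" concept and its attribute sub-nodes can be created
    # when the graph framework is first built; the concrete multimodal
    # data is collected and attached to the nodes later.
    engine = GraphNode(name="engine", key_phrases=["engine", "motor"])
    engine.neighbors.append(GraphNode(name="displacement",
                                      key_phrases=["displacement"]))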
In an alternative embodiment, some of the data in the knowledge graph in the database 120 is incomplete. Taking an engine as an example: when the knowledge graph is initially constructed, the concept of an engine may include fuel consumption, color, displacement, brand, and model. These attributes are common knowledge, so concepts related to them can be added to the knowledge graph at construction time, one concept per node, establishing the overall framework of the graph. However, the specific fuel consumption, color, displacement, brand, and model vary from case to case and are not common knowledge, so they cannot be filled in directly and the corresponding data needs to be collected. Furthermore, the data available in the knowledge graph may be only monomodal, e.g., only in text format, which by itself cannot be used to author a video.
In the case of such an incomplete knowledge graph, the computing device 110 may retrieve data from across the network, outside the database, to reconstruct the knowledge graph. The network may be any suitable network, including but not limited to the internet, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), wired networks such as fiber-optic networks and coaxial cable, and wireless networks such as Wi-Fi, cellular telecommunications networks, Bluetooth, and the like.
The computing device 110 may be any suitable computing device, whether centralized or distributed, including but not limited to personal computers, servers, clients, hand-held or laptop devices, multiprocessors, microprocessors, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed clouds, combinations thereof, and the like.
The computing device 110 may also compose the video 130 using the multimedia content obtained as described above, and the detailed process of generating the video will be described below.
Fig. 2 illustrates a flow diagram of an example of a process 200 for automatically composing a video, according to some embodiments of the present disclosure. Process 200 may be implemented by computing device 110.
At 210, the computing device 110 receives a user input comprising first multimedia content and a key phrase for describing a video, the first multimedia content having at least one of a plurality of predetermined data formats.
The computing device 110 may receive, from the user 140, user input that includes first multimedia content provided by the user for video authoring; the computing device 110 uses this first multimedia content as the base data for generating the video. The first multimedia content may be data in a single format (text, picture, video, or sound) or multimodal data, and may be of a common structured type such as excel, json, or csv.
The user input also includes a key phrase describing the video, i.e., a brief description of what the user 140 wants the authored video to convey, which may be the subject of the video. For example, the first multimedia content input by user A is excel text data of February weather temperatures, and the key phrase input by user A is "February temperature trend chart"; or the first multimedia content input by user B is a picture of user B, and the key phrase input by user B is "user B takes you on a tour of world-famous attractions".
In one embodiment, the computing device 110 may clean the text data in the first multimedia content and perform a word-segmentation operation on it to extract entity and attribute information. The cleaned text data can be used to generate the subtitles, charts, titles, and similar elements of the video.
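As a flavor of this step, here is a minimal sketch of cleaning followed by word segmentation in Python, assuming a third-party Chinese segmenter such as jieba; the cleaning rules are illustrative assumptions, since the disclosure does not specify them:

    import re
    import jieba  # third-party Chinese word-segmentation library (assumed)

    def clean_and_segment(text):
        """Strip markup and whitespace noise, then segment into words."""
        text = re.sub(r"<[^>]+>", " ", text)       # drop HTML remnants
        text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
        return [tok for tok in jieba.lcut(text) if tok.strip()]

    words = clean_and_segment("2月  气温<b>趋势</b>数据")
    # The exact tokens depend on the segmenter's dictionary; entity and
    # attribute words would then be extracted from this token list.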
In one embodiment, the computing device 110 may summarize the video data in the first multimedia content to obtain the important material in it for the final composited video. The computing device 110 may further perform scene segmentation on the video data, apply a dedicated algorithm to understand each scene, obtaining information such as the topic, category, and persons involved, and select the most suitable scenes as material according to the key phrase input by the user.
In one embodiment, the computing device 110 may analyze the picture data in the first multimedia content and filter out pictures that are smaller than 300 x 300 pixels, contain advertisements, relate to politics, and so on; if too few pictures remain after filtering, it may query the multimodal knowledge graph to supplement them intelligently.
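For the size criterion specifically, a sketch with Pillow could look as follows; the 300 x 300 threshold comes from the text above, while the advertisement and politics checks are model-based and appear here only as a placeholder:

    from pathlib import Path
    from PIL import Image  # Pillow

    def keep_picture(path):
        """Apply the 300 x 300 minimum-size filter described above."""
        with Image.open(path) as img:
            width, height = img.size
        if width < 300 or height < 300:
            return False
        # Placeholder: the ad/politics classifiers are not specified here.
        return True

    kept = [p for p in Path("user_pictures").glob("*.jpg") if keep_picture(p)]
    # If too few pictures survive, the multimodal graph is queried for more.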
In one embodiment, the computing device 110 may convert the sound data in the first multimedia content into text data and/or convert text data into sound data, for use as subtitles or audio in the final video.
The above processing of different types of data is further described below.
At 220, the computing device 110 determines at least one node from the pre-constructed knowledge graph based on the key phrase. The pre-constructed multimodal knowledge graph may contain multiple nodes; each node may correspond to multiple similar key phrases and stores the associated multimodal data. The pre-constructed multimodal knowledge graph may be stored in the database 120 or constructed by the computing device 110; the present disclosure is not limited in this respect.
In one embodiment, the computing device 110 determines the degree of match between the key phrase and a target key phrase corresponding to a target node in the knowledge graph, and determines the target node as the at least one node if the degree of match is greater than a first predetermined threshold. Continuing the examples of user A and user B above: the computing device 110 determines that the degree of match between target key phrases such as "trend" and "situation" in the knowledge graph and the key phrase "trend chart" input by user A is greater than a first predetermined threshold of 0.8 (with a maximum of 1), and therefore determines the node A corresponding to those target key phrases as a target node; it may likewise determine, through "air temperature", the node A' corresponding to the target key phrases "weather" and "temperature" as a target node. The degree of match between key phrases can be obtained, for example, by computing the Euclidean distance between the phrases' vectors; the details are not repeated in this disclosure. Similarly, the computing device 110 may determine as a target node the node B whose target key phrase "historic site" matches the key phrase "point of interest" entered by user B with a degree greater than the first predetermined threshold of 0.8. The value 0.8 for the first predetermined threshold is illustrative, and the present disclosure is not limited to it.
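A minimal sketch of this matching step, assuming phrase vectors are already available from some embedding model; mapping Euclidean distance into a 0-to-1 score, as done below, is one reasonable choice rather than the definitive method:

    import numpy as np

    def match_score(vec_a, vec_b):
        """Map the Euclidean distance between two phrase vectors to a
        similarity score in (0, 1], where 1 means identical vectors."""
        return 1.0 / (1.0 + float(np.linalg.norm(vec_a - vec_b)))

    def target_nodes(query_vec, nodes, node_vecs, threshold=0.8):
        """Return every node whose target key phrase matches the user's
        key phrase above the first predetermined threshold."""
        return [n for n, v in zip(nodes, node_vecs)
                if match_score(query_vec, v) > threshold]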
At 230, the computing device 110 obtains second multimedia content associated with the at least one node based on the first multimedia content. For example, after determining the at least one node at 220 above, the computing device 110 may, via that node, determine associated second multimedia content in the knowledge graph to serve as supplemental material to the first multimedia content.
In one embodiment, the computing device 110 determines at least one data format not included in the first multimedia content and obtains data in that format associated with the at least one node as the second multimedia content. Continuing the examples of user A and user B: the computing device 110 determines that the first multimedia content input by user A is excel text data about weather and therefore includes no picture, video, or sound data. The computing device 110 may then determine, from the knowledge graph in the database 120, data associated with nodes A and A' in picture, video, and sound formats, such as, but not limited to, weather icons, the sound of thunder, background animations of snow, and animation templates for curve trends.
For user B, the computing device 110 determines that the first multimedia content is a picture of user B, so it may determine, from the knowledge graph in the database 120, data associated with node B in text, video, and sound formats, such as, but not limited to, videos, text introductions, and animations of world-famous attractions.
In another embodiment, the computing device 110 determines the amount of data of the multimedia content in at least one data format in the first multimedia content, and if that amount is less than a second predetermined threshold, the computing device 110 obtains data in that format associated with the at least one node as the second multimedia content. Continuing the examples of user A and user B: the computing device 110 determines that the amount of excel text data in the first multimedia content input by user A is less than a second predetermined threshold of 8 KB (for example, it records weather information for only 10 days of February), so the computing device 110 obtains data associated with node A' in that format as the second multimedia content, e.g., text information covering the weather for the whole of February.
For user B, the computing device 110 determines that the amount of data of the first multimedia content input by the user 140 is, for example, less than a second predetermined threshold of 20 MB, so the computing device 110 obtains data associated with node B in at least one data format as the second multimedia content, such as videos, text introductions, and animations of world-famous attractions.
Note that the values of the first and second predetermined thresholds are merely exemplary; the thresholds may be adjusted based on user input or by the computing device 110.
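The two embodiments above can be combined into one supplementation rule: take from the matched node the formats the user did not supply at all, plus the formats supplied with too little data. A hedged sketch follows; the format names and the byte-count measure of "data amount" are assumptions of the sketch:

    PREDETERMINED_FORMATS = ("text", "picture", "video", "sound")

    def supplement(first_content, node_materials, min_bytes):
        """Select second multimedia content from a matched node.
        first_content maps format -> bytes supplied by the user;
        node_materials maps format -> material list stored at the node."""
        second = {}
        for fmt in PREDETERMINED_FORMATS:
            supplied = first_content.get(fmt, b"")
            # Covers both embodiments: the format is missing entirely, or
            # its data amount is below the second predetermined threshold.
            if len(supplied) < min_bytes:
                second[fmt] = node_materials.get(fmt, [])
        return second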
By automatically analyzing the base material provided by the user and the user's requirements for the video, and matching both against the multimodal knowledge graph, the scheme automatically obtains high-quality data to make up for material whose amount or format is insufficient, thereby solving the difficulty of obtaining material in traditional schemes.
In an alternative embodiment, the computing device 110 may supplement the missing information in the given data by finding relevant fields in the multimodal graph using preset modules such as a video retrieval module, a similar-picture retrieval module, and a topic retrieval module.
At 240, the computing device 110 generates a video based on the first multimedia content and the second multimedia content. For example, the computing device 110 may further process the multimedia content in the various formats obtained above to generate a video.
In one embodiment, the computing device 110 semantically analyzes the text content in the first multimedia content and the second multimedia content to generate text elements. The computing device 110 then determines at least one of the position of each text element in the video, the size of the words in the text element, the display effect of the text element, and the display time of the text element, to generate the video.
For example, the computing device 110 semantically analyzes the text content in the multimedia content, associates it with picture information, and further determines its position in the associated picture, its display duration, text size, dynamic changes of position, dynamic text effects, and so on. Through these operations, textual content is associated with the image frames so that the textual information clearly describes each frame.
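The attributes determined for each text element can be carried in a small record like the following sketch; the field names are assumptions chosen to mirror the attributes listed above:

    from dataclasses import dataclass

    @dataclass
    class TextElement:
        """Per-element attributes determined by the semantic analysis."""
        content: str
        position: tuple       # (x, y) inside the associated frame
        font_size: int
        effect: str           # display effect, e.g. "fade_in"
        start_s: float        # display time: start ...
        duration_s: float     # ... and duration, in seconds

    subtitle = TextElement("February temperature trend", (40, 620),
                           28, "fade_in", start_s=2.0, duration_s=3.5)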
In another embodiment, the computing device 110 obtains the video content in the first multimedia content and the second multimedia content. The computing device 110 then determines a plurality of image frames in the video content that are associated with the key phrase, determines the order of those image frames in the video and the transition effects between them, and finally generates the video using the transition effects in that order. For example, the computing device 110 may determine key image frames in the video content from the key phrase input by the user 140, treating as the associated frames those whose degree of match with the key phrase exceeds a threshold; such frames usually best reflect the final presentation the user intends. The computing device 110 may then determine three kinds of transition effects among the associated frames: image-to-image, image-to-video, and video-to-image. Taking user B as an example, as described above, the computing device 110 may determine the order in which the pictures and videos of world-famous attractions of different styles appear in the final composited video and add transition effects between the picture and video data to make the transitions more natural.
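A sketch of the frame-selection and transition-planning logic described above; the match scores are assumed to come from whatever model compares frames with the key phrase, and the clip records are illustrative:

    def select_key_frames(frames, scores, threshold=0.5):
        """Keep frames whose match with the key phrase exceeds the
        threshold, ordered best match first."""
        picked = [(s, f) for s, f in zip(scores, frames) if s > threshold]
        picked.sort(key=lambda sf: sf[0], reverse=True)
        return [f for _, f in picked]

    def plan_transitions(clips):
        """Assign one of the three transition kinds named above, based
        on whether each neighbour is a still image or a video clip.
        Each clip is a dict like {"id": ..., "type": "image" | "video"}."""
        return [(a["id"], b["id"], f'{a["type"]}-to-{b["type"]}')
                for a, b in zip(clips, clips[1:])]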
By analyzing the attributes of the text, pictures, and videos in the material, the scheme can automatically arrange their order in the video according to those attributes and set the transition effects automatically, making the video smoother and more natural as a whole.
Thus, high-quality multimedia content for generating a video can be acquired automatically, and a high-quality video can be generated efficiently according to the relations among the multimedia content, solving the key problems that video production is technically demanding for users and that material is hard to obtain.
Fig. 3 illustrates a flow diagram of another example of a process 300 of automatically generating a video, in accordance with some embodiments of the present disclosure. The process 300 may be implemented by the computing device 110, which may implement the steps of the present disclosure on an underlying framework based on FFmpeg, combining text, pictures, video, sound, and so on into one video through a series of operations. FFmpeg is a relatively low-level video-processing tool on Linux systems whose main functions include video encoding and decoding. Other frameworks may of course be employed; the present disclosure is not limited in this respect.
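As a flavor of what building on FFmpeg means in practice, the following sketch shells out to the standard ffmpeg command line to merge one still picture and one audio track into a clip, one elementary operation that such a pipeline composes into a full video. The file names are placeholders:

    import subprocess

    def picture_plus_audio(picture, audio, out, seconds):
        """Loop a still picture for the given duration with audio under it."""
        subprocess.run([
            "ffmpeg", "-y",
            "-loop", "1", "-i", picture,   # repeat the still picture
            "-i", audio,                   # narration or background sound
            "-t", str(seconds),            # clip duration
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-shortest",
            out,
        ], check=True)

    picture_plus_audio("attraction.jpg", "narration.mp3", "segment.mp4", 5.0)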
At 310, the computing device 110 receives user input of first multimedia content and a key phrase describing a video from the user 140.
At 320, the computing device 110 processes the first multimedia content input by the user and, based on the user input, retrieves second multimedia content from the knowledge graph 360 to supplement the first multimedia content.
At 330, the computing device 110 performs base processing on the first multimedia content and the second multimedia content. The base processing includes but is not limited to: sound-related operations, including synthesis of multi-source sound, sound transformation, sound clipping, and the like; custom masking operations, which mainly serve the subsequent application processing 340 and can produce various animation effects; and video-related functions, specifically dynamic-effect operations such as resizing and repositioning the video and adjusting its color, frame count, and duration.
At 340, the computing device 110 performs further application processing on top of the base processing 330. Relative to the base processing 330, the application processing 340 provides a higher-level encapsulation of FFmpeg's functionality.
For example, the computing device 110 may use the application processing 340 to generate a chart video: it may analyze the data input by the user 140 and then add the data to the video 130, selecting different charts as the user 140 desires.
In one embodiment, the computing device 110 may use a custom-mask method, mainly applied where a chart can be displayed all at once: a dynamic template is selected first and then used as a mask over the chart currently being filled in, forming an animation effect from which the video is generated.
In another embodiment, the computing device 110 may use a method based on normalizing the coordinate axes, mainly applied where the chart cannot be displayed all at once. Taking user A's "February temperature trend chart" as an example, suppose only 7 days can be displayed at a time, i.e., the 28 days of trend information must be shown 7 days at a time. Displaying the trend line continuously in the video means the content of the chart changes continuously; to represent this change, the computing device 110 may compute the current data and keep moving the coordinate axes, displaying the change of the trend in real time to generate the required chart video.
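A sketch of that axis-sliding idea with matplotlib: every rendered frame shows a 7-day window that advances through the 28 days, so the trend line appears to scroll. The temperature data here is fabricated purely to make the sketch runnable; the frames would then be fed to FFmpeg:

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")                    # render frames off-screen
    import matplotlib.pyplot as plt

    days = np.arange(1, 29)                  # 28 days of February
    temps = 5 + 8 * np.sin(days / 5.0)       # illustrative temperatures

    fig, ax = plt.subplots()
    ax.plot(days, temps)
    ax.set_title("February temperature trend")
    for start in range(22):                  # slide the 7-day window
        ax.set_xlim(days[start], days[start + 6])
        fig.savefig(f"frame_{start:02d}.png")
    plt.close(fig)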
In the application processing 340, the computing device 110 may use an AR-character interface provided by the database 120 to automatically generate an AR character report from the input data. The computing device 110 may convert sound data into video data by first converting the sound to text and then computing the relation between the text and the time points of the sound, determining the size, position, and other attributes of the text at each time point to form a series of timed texts. The computing device 110 may also collect statistics on the length, number of entity words, redundancy, and so on of a given text string, scatter and mix the text accordingly, and then set the bullet-comment position, movement rate, color, and font size according to the video time, to generate automatic captions.
Finally, at 350, the computing device 110 composites the video 130. The computing device may first store the video 130 in the cloud and then send the storage address to the user 140 for viewing and sharing.
According to the present disclosure, automatic synthesis of high-quality video can be achieved by analyzing the user's input data, supplementing it, and then performing the processing described above.
For the specific implementation of each of steps 310 to 350, refer to the description of Fig. 2; it is not repeated here.
Fig. 4 shows a schematic block diagram of an apparatus 400 for automatically generating a video according to an embodiment of the present disclosure. As shown in Fig. 4, the apparatus 400 includes: an input receiving module 410 configured to receive a user input comprising first multimedia content and a key phrase for describing the video, the first multimedia content having at least one of a plurality of predetermined data formats; a first node determination module 420 configured to determine at least one node from a pre-constructed knowledge graph based on the key phrase; a first multimedia content acquisition module 430 configured to acquire second multimedia content associated with the at least one node based on the first multimedia content; and a first video generation module 440 configured to generate the video based on the first multimedia content and the second multimedia content.
In some embodiments, the first node determining module 420 may include: the matching module is configured to determine the matching degree between the key phrases and target key phrases corresponding to target nodes in the knowledge graph; and a second node determination module configured to determine the target node as at least one node if it is determined that the degree of match is greater than a first predetermined threshold.
In some embodiments, the first multimedia content acquisition module 430 comprises: a data format determination module configured to determine at least one data format not included in the first multimedia content; and a second multimedia content acquisition module configured to acquire data in the at least one data format associated with the at least one node as the second multimedia content.
In some embodiments, the first multimedia content acquisition module 430 comprises: a data amount determination module configured to determine the amount of data of multimedia content in at least one data format in the first multimedia content; and a third multimedia content acquisition module configured to acquire data in the at least one data format associated with the at least one node as the second multimedia content if the amount of data is determined to be less than a second predetermined threshold.
In some embodiments, wherein the first video generation module 440 comprises: a text element generation module configured to perform semantic analysis on text content in the first multimedia content and the second multimedia content to generate a text element; and a second video generation module configured to generate a video based on the text element.
In some embodiments, the second video generation module comprises a third video generation module configured to determine at least one of the position of the text element in the video, the size of the words in the text element, the display effect of the text element, and the display time of the text element, and to generate the video.
In some embodiments, wherein the first video generation module 440 comprises: the video content acquisition module is configured to acquire video content in the first multimedia content and the second multimedia content; an image frame determination module configured to determine a plurality of image frames in the video content associated with the key phrase; and a fourth video generation module configured to generate a video based on the plurality of image frames.
In some embodiments, the fourth video generation module comprises: a transition effect determination module configured to determine an order of the plurality of image frames in the video and transition effects between the plurality of image frames; and a fifth video generation module configured to generate the video using the transition effects in the order.
In some embodiments, the plurality of predetermined data formats includes at least one of a text data format, a picture data format, a video data format, and a sound data format.
Fig. 5 illustrates a schematic block diagram of an example device 500 that may be used to implement embodiments of the present disclosure. The device 500 may be used to implement the computing device 110 of Fig. 1. As shown, the device 500 includes a central processing unit (CPU) 510 that can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 520 or loaded from a storage unit 580 into a random access memory (RAM) 530. The RAM 530 can also store the various programs and data required for the operation of the device 500. The CPU 510, the ROM 520, and the RAM 530 are connected to one another by a bus 540. An input/output (I/O) interface 550 is also connected to the bus 540.
Various components in device 500 are connected to I/O interface 550, including: an input unit 560 such as a keyboard, a mouse, etc.; an output unit 570 such as various types of displays, speakers, and the like; a storage unit 580 such as a magnetic disk, an optical disk, or the like; and a communication unit 590 such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 590 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit 510 performs the various methods and processes described above, such as the process 200 and/or the process 300. For example, in some embodiments, the process 200 and/or the process 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 580. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 520 and/or the communication unit 590. When the computer program is loaded into the RAM 530 and executed by the CPU 510, one or more steps of the process 200 and/or the process 300 described above may be performed. Alternatively, in other embodiments, the CPU 510 may be configured to perform the process 200 and/or the process 300 by any other suitable means (e.g., by means of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the discussion above, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

1. A method of automatically generating video, comprising:
receiving user input comprising first multimedia content and a key phrase describing the video, the first multimedia content having at least one of a plurality of predetermined data formats, and the first multimedia content being multimedia content for the video obtained from a knowledge-graph stored in a database;
determining at least one node from a pre-constructed knowledge graph based on the key phrase;
obtaining second multimedia content associated with the at least one node based on the first multimedia content; and
generating the video based on the first multimedia content and the second multimedia content.
2. The method of claim 1, wherein determining at least one node from a pre-constructed knowledge-graph based on the key phrase comprises:
determining a degree of match between the key phrase and a target key phrase corresponding to a target node in the knowledge graph; and
if the degree of match is determined to be greater than a first predetermined threshold, determining the target node as the at least one node.
3. The method of claim 1, wherein obtaining second multimedia content associated with the at least one node based on the first multimedia content comprises:
determining at least one data format not included in the first multimedia content; and
obtaining data associated with the at least one node in the at least one data format as the second multimedia content.
4. The method of claim 1, wherein obtaining second multimedia content associated with the at least one node based on the first multimedia content comprises:
determining a data amount of multimedia content of at least one data format in the first multimedia content; and
if it is determined that the amount of data is less than a second predetermined threshold, obtaining data associated with the at least one node in the at least one data format as the second multimedia content.
5. The method of claim 1, wherein generating the video based on the first multimedia content and the second multimedia content comprises:
performing semantic analysis on text content in the first multimedia content and the second multimedia content to generate text elements; and
generating the video based on the text element.
6. The method of claim 5, wherein generating the video based on the text element comprises:
determining at least one of a position of the text element in a video, a word size in the text element, a display effect of the text element, and a display time of the text element to generate the video.
7. The method of claim 1, wherein generating the video based on the first multimedia content and the second multimedia content comprises:
acquiring video content in the first multimedia content and the second multimedia content;
determining a plurality of image frames in the video content associated with the key phrase; and
generating the video based on the plurality of image frames.
8. The method of claim 7, wherein generating the video based on the plurality of image frames comprises:
determining an order of the plurality of image frames in the video and a transition effect between the plurality of image frames; and
generating the video using the transition effect in the order.
9. The method of claim 1, wherein the plurality of predetermined data formats includes at least one of a text data format, a picture data format, a video data format, and a sound data format.
10. An apparatus for automatically generating video, comprising:
an input receiving module configured to receive a user input comprising first multimedia content and a key phrase for describing the video, the first multimedia content having at least one of a plurality of predetermined data formats, and the first multimedia content being multimedia content for the video obtained from a knowledge-graph stored in a database;
a first node determination module configured to determine at least one node from a pre-constructed knowledge-graph based on the key phrase;
a first multimedia content acquisition module configured to acquire second multimedia content associated with the at least one node based on the first multimedia content; and
a first video generation module configured to generate the video based on the first multimedia content and the second multimedia content.
11. The apparatus of claim 10, wherein the first node determination module comprises:
a matching module configured to determine a matching degree between the key phrase and a target key phrase corresponding to a target node in the knowledge-graph; and
a second node determination module configured to determine the target node as the at least one node if it is determined that the degree of match is greater than a first predetermined threshold.
12. The apparatus of claim 10, wherein the first multimedia content acquisition module comprises:
a data format determination module configured to determine at least one data format not included in the first multimedia content; and
a second multimedia content acquisition module configured to acquire data associated with the at least one node in the at least one data format as second multimedia content.
13. The apparatus of claim 10, wherein the first multimedia content acquisition module comprises:
a data amount determination module configured to determine a data amount of multimedia content of at least one data format in the first multimedia content; and
a third multimedia content obtaining module configured to obtain data associated with the at least one node in the at least one data format as second multimedia content if it is determined that the amount of data is less than a second predetermined threshold.
14. The apparatus of claim 10, wherein the first video generation module comprises:
a text element generation module configured to perform semantic analysis on text content in the first multimedia content and the second multimedia content to generate text elements; and
a second video generation module configured to generate the video based on the text element.
15. The apparatus of claim 14, wherein the second video generation module comprises:
a third video generation module configured to determine at least one of the position of the text element in the video, the size of the words in the text element, the display effect of the text element, and the display time of the text element, and to generate the video.
16. The apparatus of claim 10, wherein the first video generation module comprises:
a video content obtaining module configured to obtain video content of the first multimedia content and the second multimedia content;
an image frame determination module configured to determine a plurality of image frames in the video content associated with the key phrase; and
a fourth video generation module configured to generate the video based on the plurality of image frames.
17. The apparatus of claim 16, wherein the fourth video generation module comprises:
a transition effect determination module configured to determine an order of the plurality of image frames in the video and a transition effect between the plurality of image frames; and
a fifth video generation module configured to generate the video using the transition effect in the order.
18. The apparatus of claim 10, wherein the plurality of predetermined data formats includes at least one of a text data format, a picture data format, a video data format, and a sound data format.
19. An electronic device, the electronic device comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1 to 8.
20. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 9.
CN202011383389.4A 2020-11-30 2020-11-30 Method, device, equipment and computer readable storage medium for automatically generating video Active CN112565875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011383389.4A CN112565875B (en) 2020-11-30 2020-11-30 Method, device, equipment and computer readable storage medium for automatically generating video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011383389.4A CN112565875B (en) 2020-11-30 2020-11-30 Method, device, equipment and computer readable storage medium for automatically generating video

Publications (2)

Publication Number Publication Date
CN112565875A CN112565875A (en) 2021-03-26
CN112565875B (en) 2023-03-03

Family

ID=75045967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011383389.4A Active CN112565875B (en) 2020-11-30 2020-11-30 Method, device, equipment and computer readable storage medium for automatically generating video

Country Status (1)

Country Link
CN (1) CN112565875B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113438538B (en) * 2021-06-28 2023-02-10 康键信息技术(深圳)有限公司 Short video preview method, device, equipment and storage medium
CN114979054A (en) * 2022-05-13 2022-08-30 维沃移动通信有限公司 Video generation method and device, electronic equipment and readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309351A (en) * 2018-02-14 2019-10-08 阿里巴巴集团控股有限公司 Video image generation, device and the computer system of data object
CN109189938A (en) * 2018-08-31 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for updating knowledge mapping
CN109344291B (en) * 2018-09-03 2020-08-25 腾讯科技(武汉)有限公司 Video generation method and device
CN109614537A (en) * 2018-12-06 2019-04-12 北京百度网讯科技有限公司 For generating the method, apparatus, equipment and storage medium of video
CN110532404B (en) * 2019-09-03 2023-08-04 北京百度网讯科技有限公司 Source multimedia determining method, device, equipment and storage medium
CN111767796B (en) * 2020-05-29 2023-12-15 北京奇艺世纪科技有限公司 Video association method, device, server and readable storage medium

Also Published As

Publication number Publication date
CN112565875A (en) 2021-03-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant