CN115580758A - Video content generation method and device, electronic equipment and storage medium - Google Patents

Video content generation method and device, electronic equipment and storage medium

Info

Publication number
CN115580758A
CN115580758A (application CN202211249471.7A)
Authority
CN
China
Prior art keywords
video
content
information
target
semantic type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211249471.7A
Other languages
Chinese (zh)
Inventor
温梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202211249471.7A priority Critical patent/CN115580758A/en
Publication of CN115580758A publication Critical patent/CN115580758A/en
Pending legal-status Critical Current

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/816 - Monomedia components thereof involving special video data, e.g. 3D video
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the present application provides a video content generation method and apparatus, an electronic device, and a storage medium, and belongs to the technical field of artificial intelligence. The method includes: obtaining first topic information, and screening at least two target semantic types from preset video semantic types according to the first topic information; acquiring shooting information, and identifying shooting elements from the shooting information; performing content matching in a video material library according to each target semantic type, the first topic information, and the shooting elements, so as to obtain the material content corresponding to each target semantic type; and detecting interaction information, obtaining content to be combined from the material content corresponding to each target semantic type according to the interaction information, and combining the content to be combined corresponding to all the target semantic types to obtain the video content. In this way, the embodiment of the present application can generate personalized video content and improve video quality.

Description

Video content generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for generating video content, an electronic device, and a storage medium.
Background
Currently, short video creation is popular on all major network platforms, and many self-media creators share knowledge and attract customers by publishing short videos. However, existing short video production templates are still fairly uniform, so the video content produced by most creators tends to be similar, which affects the normal release of short videos.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a video content generation method and apparatus, an electronic device, and a storage medium, so as to generate personalized video content.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a video content generating method, where the method includes:
acquiring first topic information;
screening out at least two target semantic types from preset video semantic types according to the first topic information, wherein each target semantic type is used for determining different video semantic features;
acquiring shooting information, and identifying shooting elements from the shooting information;
according to each target semantic type, the first topic information and the shooting element, carrying out content matching in a video material library to obtain material content corresponding to the target semantic type;
detecting interaction information, and acquiring content to be combined from material content corresponding to each target semantic type according to the interaction information;
and combining the contents to be combined corresponding to all the target semantic types to obtain the video content.
In some embodiments, before the screening out the target semantic type from the preset video semantic types according to the first topic information, the method further comprises:
collecting a video sample;
analyzing the content of the video sample to obtain first video content, and extracting second topic information from the first video content;
acquiring a semantic type related to the first video content from preset video semantic types to serve as a first semantic type;
establishing a corresponding relation between the second topic information and the first semantic type;
the method for screening out the target semantic type from the preset video semantic types according to the first topic information comprises the following steps:
and screening a target semantic type from the video semantic types according to the first topic information and the corresponding relation.
In some embodiments, the obtaining, as the first semantic type, a semantic type related to the first video content from preset video semantic types includes:
acquiring a material classification standard according to a preset video semantic type, wherein the material classification standard is used for determining different semantic types in the video semantic type;
according to the material classification standard, performing content grouping processing on the first video content to obtain a plurality of groups of material data;
marking the semantic type corresponding to each group of the material data according to the video semantic type to add a first semantic type;
the method further comprises the following steps:
and adding the plurality of groups of material data into a video material library.
In some embodiments, after the extracting second topic information from the first video content, the method further comprises:
performing video capture processing according to the second topic information to obtain a target video;
performing content analysis on the target video to obtain second video content;
the content grouping processing is performed on the first video content according to the material classification standard to obtain a plurality of groups of material data, and the method comprises the following steps:
and performing content grouping processing on the first video content and the second video content according to the material classification standard to obtain a plurality of groups of material data.
In some embodiments, the performing content matching in a video material library according to each of the target semantic type, the first topic information, and the shooting element to obtain material content corresponding to the target semantic type includes:
acquiring a text material corresponding to each target semantic type from a video material library;
performing label identification according to the first topic information and the shooting element to obtain a text label and a multimedia label;
according to the text label, performing content matching in the text material to obtain text content;
acquiring a multimedia material corresponding to the text content from the video material library;
according to the multimedia tag, performing content matching in the multimedia material to obtain multimedia content;
and taking the text content and the multimedia content as material content corresponding to the target semantic type.
In some embodiments, the detecting the interaction information and obtaining the content to be combined from the material content corresponding to each target semantic type according to the interaction information includes:
detecting voice information, and performing information matching in the material content according to the voice information to obtain reference content;
performing display processing on the reference content;
and detecting editing information of the reference content, and updating the reference content according to the editing information to obtain the content to be combined.
In some embodiments, the combining the contents to be combined corresponding to all the target semantic types to obtain the video content includes:
acquiring sequencing information corresponding to each target semantic type;
acquiring a video time axis, wherein the video time axis comprises a plurality of playing time periods which are connected in series according to a time sequence;
acquiring, according to the sequencing information, the playing time period corresponding to each target semantic type from the video time axis, to serve as the target time period of that target semantic type;
and importing the content to be combined corresponding to each target semantic type into the target time period of the target semantic type to obtain the video content.
To achieve the above object, a second aspect of an embodiment of the present application proposes a video content generating apparatus, including:
the acquisition module is used for acquiring first topic information;
the screening module is used for screening at least two target semantic types from preset video semantic types according to the first topic information, wherein each target semantic type is used for determining different video semantic features;
the identification module is used for acquiring shooting information and identifying shooting elements from the shooting information;
the matching module is used for carrying out content matching in a video material library according to each target semantic type, the first topic information and the shooting element to obtain material content corresponding to the target semantic type;
the interaction module is used for detecting interaction information and acquiring content to be combined from material content corresponding to each target semantic type according to the interaction information;
and the combination module is used for combining the contents to be combined corresponding to all the target semantic types to obtain the video content.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program, which when executed by a processor implements the method of the first aspect.
With the video content generation method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present application, suitable target semantic types can be screened from preset video semantic types by obtaining the first topic information, thereby forming a semantic framework for the video content. On this basis, the shooting elements are identified to determine the shooting scene, so that they can be combined with the first topic information and the target semantic types to match, in the video material library, material content that fits both the shooting scene and the topic, achieving a more accurate material reference effect. Further, the content to be combined that meets the user's requirements is obtained from the material content according to the detected interaction information and combined into the final video content, which improves the degree of personalization of the generated video content and thus the video quality.
Drawings
Fig. 1 is a schematic flowchart of a video content generating method provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for constructing a corresponding relationship in an embodiment of the present application;
FIG. 3 is a schematic view of a specific flowchart of step S130 in FIG. 1;
FIG. 4 is a specific flowchart of step S140 in FIG. 1;
fig. 5 is a schematic structural diagram of a video content generating apparatus provided in an embodiment of the present application;
fig. 6 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, as well as in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science, which attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, expert systems, and the like. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand, and use human languages (such as Chinese, English, etc.); it is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, also commonly called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, public opinion analysis and viewpoint mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and other areas related to language processing.
Information Extraction: a text processing technology that extracts specified types of factual information, such as entities, relations, and events, from natural language text and outputs structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units such as sentences, paragraphs, and chapters, and text information is composed of smaller specific units such as words, phrases, sentences, and paragraphs, or combinations of these units. Extracting noun phrases, names of people, names of places, and the like from text data is text information extraction, and the information extracted by text information extraction technology can of course be of various types.
Image description (Image Caption): generating a natural language description for an image and using the generated description to help an application understand the semantics expressed in the visual scene of the image. For example, image description can convert an image search into a text search for classifying images and improving image search results. People usually need only a quick glance to describe the details of the visual scene of an image, but automatically adding descriptions to images is a comprehensive and arduous computer vision task that requires converting the complex information contained in an image into a natural language description. In contrast to common computer vision tasks, image captioning not only needs to identify objects in an image, but also needs to associate the identified objects with natural semantics and describe them in natural language. Thus, image description requires extracting deep features of an image, associating them with semantic features, and transforming them to generate the description.
Currently, short video creation is popular on all major network platforms, and many self-media creators share knowledge and attract customers by publishing short videos. However, existing short video production templates are still fairly uniform, so the video content produced by most creators tends to be similar, which affects the normal release of short videos.
Based on this, the embodiment of the application provides a video content generation method and device, an electronic device and a storage medium, and aims to generate personalized video content.
The video content generation method and apparatus, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described with reference to the following embodiments, and first, the video content generation method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. The artificial intelligence is a theory, a method, a technology and an application system which simulate, extend and expand human intelligence by using a digital computer or a machine controlled by the digital computer, sense the environment, acquire knowledge and obtain the best result by using the knowledge.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a video content generation method, and relates to the technical field of artificial intelligence. The video content generation method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like that implements a video content generation method, but is not limited to the above form. The following description will be given by taking the terminal as an example.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In each embodiment of the present application, when data related to the user identity or characteristic, such as user information, user behavior data, user history data, and user location information, is processed, permission or consent of the user is obtained, and the data collection, use, and processing comply with relevant laws and regulations and standards of relevant countries and regions. In addition, when the embodiment of the present application needs to acquire sensitive personal information of a user, individual permission or individual consent of the user is obtained through a pop-up window or a jump to a confirmation page, and after the individual permission or individual consent of the user is definitely obtained, necessary user-related data for enabling the embodiment of the present application to operate normally is acquired.
Fig. 1 is a schematic flowchart of a video content generating method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S100 to S150.
Step S100: first topic information is obtained.
In the embodiment of the present application, the first topic information may include at least one of topic keywords, phrases or sentences, which is not particularly limited. The obtaining mode of the first topic information includes but is not limited to: the terminal starts an application and generates an input box in an application interface, so that first topic information input in the input box by a user is received, for example, the user manually inputs a title of a shot work; the terminal generates a plurality of topic labels in the application interface, and acquires the topic label selected by the user from the plurality of topic labels as first topic information.
The plurality of topic labels may be obtained in ways that include, but are not limited to, the following: hot topic data (such as hot search lists, blog titles, blog labels, video titles, video introductions, video labels, high-frequency bullet-comment content, and the like) is collected from social network sites or information network sites; a plurality of hot labels is obtained from the hot topic data through word segmentation or clustering; each hot label is then evaluated statistically according to at least one statistical index among topic heat (such as the number of likes or a score) and occurrence frequency to obtain a statistical value for each hot label, so that the hot labels whose statistical values satisfy a preset condition are screened out as the topic labels. The preset condition can be manually specified and adjusted, for example, that the statistical value is greater than a specified value, and is not limited here.
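As a concrete illustration of this label screening step, the following Python sketch scores candidate hot labels with a weighted combination of topic heat and occurrence frequency. The weighting, the threshold, and the function names are assumptions made purely for illustration and are not part of the patent disclosure.

```python
from collections import Counter

def screen_topic_labels(hot_labels, heat_scores, min_statistic=100.0):
    # Occurrence frequency of each hot label in the collected hot topic data.
    frequency = Counter(hot_labels)
    selected = []
    for label, count in frequency.items():
        # Combined statistical value: topic heat (e.g. number of likes) plus frequency.
        # The 0.7 / 0.3 weighting is an illustrative assumption.
        statistic = 0.7 * heat_scores.get(label, 0.0) + 0.3 * count
        # Preset condition: the statistical value must exceed a specified value.
        if statistic > min_statistic:
            selected.append(label)
    return selected

# Example: labels extracted from hot-search titles, with per-label heat scores.
labels = ["autumn travel", "autumn travel", "insurance tips", "city walk"]
heat = {"autumn travel": 180.0, "insurance tips": 90.0, "city walk": 40.0}
topic_labels = screen_topic_labels(labels, heat)
```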
Step S110: and screening at least two target semantic types from preset video semantic types according to the first topic information.
In the embodiment of the present application, the preset video semantic types may include a plurality of semantic types related to video content, such as title, opening, feature, summary, and dialogue, which are not specifically limited. Each semantic type is used for determining different video semantic features, so the video semantic types determine the semantic structure and relations of the video content. Specifically, in step S110, the terminal may obtain a correspondence between topic information and semantic types, so as to obtain the target semantic types corresponding to the first topic information according to the correspondence. The correspondence may be manually specified or obtained through data analysis, and is not limited.
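A minimal sketch of this correspondence-based screening is shown below; the mapping contents and the default types are invented solely for illustration and are not taken from the patent.

```python
# Correspondence between topic information and semantic types (illustrative only).
TOPIC_TO_SEMANTIC_TYPES = {
    "autumn travel": ["title", "opening", "feature", "summary"],
    "insurance tips": ["title", "opening", "feature", "dialogue", "summary"],
}

def screen_target_semantic_types(first_topic_info, default=("title", "feature")):
    # Screen at least two target semantic types for the given first topic information.
    return TOPIC_TO_SEMANTIC_TYPES.get(first_topic_info, list(default))

print(screen_target_semantic_types("autumn travel"))
```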
Step S120: shooting information is acquired, and a shooting element is identified from the shooting information.
In this embodiment of the application, the shooting information may be a shooting picture shot by a shooting device of the terminal, or a screen recording picture obtained after screen recording is performed on an application interface of the terminal, and the like, and is not particularly limited. Accordingly, the photographing element may be a constituent object included in a photographing screen or a screen recording screen, and the photographing element may be obtained by performing image description processing (for example, using an image description model based on CNN + LSTM + attention) or image object detection (for example, using an object detection model based on R-CNN) on the photographing information, which is not particularly limited. For example, if the photographing information is a street view image, the photographing elements may include pedestrians, buildings, trees, street lamps, and the like. If the shooting information is a screen recording picture, the shooting elements may include software names, software icons, AI character images, other User Interface (UI) elements, and the like in the screen recording picture.
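For the shooting-element recognition mentioned above, one possible realization is an off-the-shelf object detector. The sketch below uses a pretrained Faster R-CNN from torchvision; the specific model, the truncated label map, and the score threshold are assumptions, since the embodiment only requires some image description or object detection model.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pretrained detector; the choice of torchvision's Faster R-CNN is an assumption.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

# Truncated COCO label map, for illustration only.
COCO_LABELS = {1: "person", 3: "car", 6: "bus", 10: "traffic light", 64: "potted plant"}

def identify_shooting_elements(frame, score_threshold=0.6):
    # frame: an H x W x 3 RGB image (NumPy array) taken from the shooting information.
    with torch.no_grad():
        prediction = detector([to_tensor(frame)])[0]
    elements = []
    for label, score in zip(prediction["labels"], prediction["scores"]):
        if float(score) >= score_threshold:
            elements.append(COCO_LABELS.get(int(label), f"class_{int(label)}"))
    return elements
```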
Step S130: and according to each target semantic type, the first topic information and the shooting element, performing content matching in a video material library to obtain material content corresponding to the target semantic type.
In the embodiment of the present application, the video material library is used for storing material data corresponding to different semantic types, and the material data includes, but is not limited to, text, pictures, expressions, charts, music, voice, video, animation, subtitles, picture filters, post effects, video engineering files, and the like. In practical application, a plurality of matching texts may be labeled on the material data in advance, and content matching (for example, based on BM25 or a deep-learning text matching algorithm) is performed between the first topic information together with the shooting elements and the matching texts of each semantic type, so as to obtain the matching texts satisfying the matching conditions as target texts; the data labeled with the target texts is then taken from the material data as the material content.
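Since BM25 is named above as one possible matching option, the following sketch ranks the pre-labeled matching texts against the topic keywords and shooting elements with the rank_bm25 package; the package choice and the simple whitespace tokenization are assumptions for illustration.

```python
from rank_bm25 import BM25Okapi

def match_target_text(query_terms, matching_texts):
    # matching_texts: matching texts labeled in advance for one semantic type.
    tokenized_corpus = [text.split() for text in matching_texts]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(query_terms)
    # The best-scoring matching text is taken as the target text.
    best_index = max(range(len(matching_texts)), key=lambda i: scores[i])
    return matching_texts[best_index], float(scores[best_index])

query = ["autumn", "street", "pedestrian"]      # topic keywords plus shooting elements
texts = ["autumn street scenery opening", "indoor product close-up", "city night walk"]
target_text, score = match_target_text(query, texts)
```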
In some optional embodiments, the matching text may include a style label, where the style label is used to determine an expression style of the video content. The style label may be general, story, expert, humorous, or educational, and may also be determined according to the current user of the terminal, so as to classify the material content in a personalized way according to the video production style of the current user, without specific limitation. Matching against the first topic information and the shooting elements ensures that the expression style of the video content fits the actual topic and the shooting scene.
Optionally, the terminal may determine the current user according to the account logged in on the terminal, and may also identify the current user from the shooting elements, which is not limited.
Step S140: and detecting the interaction information, and acquiring the content to be combined from the material content corresponding to each target semantic type according to the interaction information.
In the embodiment of the present application, the interactive information may include, but is not limited to, input information of a user in an application interface of the terminal, voice information of the user, image information captured by the user, and the like. Correspondingly, the terminal can collect voice information through the audio recording device, and collect image information through the shooting device.
In an optional implementation manner, the terminal may perform display processing on the material content corresponding to each target semantic type, such as displaying a text, playing a video, playing music, and the like in an application interface. And the terminal can acquire the selected content as the content to be combined when the user selects a certain content in the material contents through the application interface.
Step S150: and combining the contents to be combined corresponding to all the target semantic types to obtain the video contents.
It can be understood that, the combination processing of all contents to be combined may specifically be: and importing all the contents to be combined into the same video time axis, and adjusting the sequence of the content materials to be combined through the video time axis to finally obtain the video contents.
In some optional embodiments, the terminal may obtain the sorting information corresponding to each target semantic type, and obtain a video timeline, where the video timeline includes a plurality of playing time segments connected in series according to a time sequence, and at this time, a duration of each playing time segment may be preset. Based on the above, the terminal acquires the playing time periods corresponding to the target semantic types from the video time axis according to the sequencing information, and the playing time periods are used as the target time periods of the target semantic types. And importing the contents to be combined corresponding to each target semantic type into the target time period of the target semantic type to obtain the video contents. In practical application, the terminal can respond to the adjustment instruction, and respectively perform duration adjustment or sequencing adjustment on each time segment in the video time axis to obtain the adjusted video time axis. Optionally, the adjustment instruction may be an instruction generated by the terminal according to a manual adjustment operation detected from the application interface, and is not particularly limited.
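The playing-time-period structure described in this embodiment could be sketched as follows; the class names, the rendering helper, and the example durations are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class PlaySegment:
    semantic_type: str                 # e.g. "title", "opening", "feature"
    duration: float                    # preset duration of the playing time period, in seconds
    contents: list = field(default_factory=list)

@dataclass
class VideoTimeline:
    segments: list                     # playing time periods connected in time order

    def import_content(self, semantic_type, content):
        # Import content to be combined into the target time period of its semantic type.
        for segment in self.segments:
            if segment.semantic_type == semantic_type:
                segment.contents.append(content)
                return
        raise KeyError(f"no playing time period for semantic type {semantic_type!r}")

    def render_order(self):
        # Return (start, end, contents) tuples describing the combined video content.
        result, start = [], 0.0
        for segment in self.segments:
            result.append((start, start + segment.duration, segment.contents))
            start += segment.duration
        return result

timeline = VideoTimeline([PlaySegment("title", 3.0), PlaySegment("opening", 8.0),
                          PlaySegment("feature", 30.0), PlaySegment("summary", 6.0)])
timeline.import_content("opening", "opening line o3")
```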
In addition, in an optional implementation manner, the terminal may perform display processing on all the contents to be combined, and by acquiring the confirmation instruction for each content to be combined, all the contents to be combined are sorted according to the acquisition order of the confirmation instruction to obtain a sorting result, and then each content to be combined is sequentially combined according to the sorting result. For example, in practical application, when a user selects any content to be combined through an application interface, the terminal detects a confirmation instruction of the content to be combined.
In another optional implementation manner, the terminal may also obtain the sorting instructions for all the contents to be combined, and identify the sorting result from the sorting instructions, so as to sequentially combine the contents to be combined according to the sorting result. The generation mode of the sequencing instruction includes but is not limited to: and dragging the operation controls corresponding to the contents to be combined through the application interface by the user to adjust the arrangement sequence of the operation controls, and generating a sequencing instruction by the terminal according to the adjusted arrangement sequence of the operation controls.
In yet another alternative implementation, the terminal may also sequentially combine each content to be combined according to the semantic order set for each target semantic type. For example, assuming preset video semantic types including title, opening, feature, summary, and dialogue, the corresponding semantic order may be: title, opening, feature, dialogue, summary. That is to say, for content A to be combined corresponding to the title, content B to be combined corresponding to the opening, content C to be combined corresponding to the feature, content D to be combined corresponding to the summary, and content E to be combined corresponding to the dialogue, the video content generated by the terminal is ABCED.
It can be seen that, with the video content generation method provided by the embodiment of the present application, suitable target semantic types can be screened from preset video semantic types by obtaining the first topic information, thereby forming a semantic framework for the video content. On this basis, the shooting elements are identified to determine the shooting scene, so that they are combined with the first topic information and the target semantic types to match, in the video material library, material content that fits both the shooting scene and the topic, achieving a more accurate material reference effect. Further, the content to be combined that meets the user's requirements is obtained from the material content according to the detected interaction information and combined into the final video content, thereby improving the degree of personalization of the generated video content and the video quality.
In some optional embodiments, please refer to fig. 2, and fig. 2 is a schematic flow chart illustrating a process of constructing a corresponding relationship in an embodiment of the present application. As shown in fig. 2, before step S110, the following steps S200 to S230 may be further included.
Step S200: a video sample is collected.
In the embodiment of the application, the video sample may be a popular video, a highly collected video, or the like collected through channels such as a social network site, an information network site, a video network site, and the like, without limitation.
Step S210: and analyzing the content of the video sample to obtain first video content, and extracting second topic information from the first video content.
Illustratively, the content analysis of the video sample may include, but is not limited to, version analysis, material extraction, caption extraction, error correction, punctuation recognition, and the like.
In some optional implementations, step S210 may also be: key frames are extracted from the video sample as first video content. And obtaining frame contents by carrying out image description processing or image object detection on the key frames, thereby taking the frame contents of all the key frames as second topic information. The key frame is a video frame where a key action of a character or an object in the video sample is located, and the algorithm for extracting the key frame includes, but is not limited to: a key frame extraction algorithm based on motion analysis, for example, analyzing the optical flow of object motion in video frames contained in a video sample, and taking a video frame with the minimum optical flow moving times as a key frame; the key frame extraction algorithm based on video clustering divides video frames contained in a video sample into a plurality of clusters through clustering, and selects an image frame closest to a clustering center in each cluster as a key frame.
Step S220: and acquiring a semantic type related to the first video content from preset video semantic types to serve as the first semantic type.
In the embodiment of the application, the terminal may obtain the semantic type artificially labeled for the first video content, and may also identify the semantic type related to the first video content from the video semantic types in a data analysis manner, without limitation.
Further, in some alternative embodiments, step S220 may include, but is not limited to, the following steps S221 to S223.
Step S221: and acquiring a material classification standard according to a preset video semantic type, wherein the material classification standard is used for determining different semantic types in the video semantic type.
In an embodiment of the present application, the material classification criteria may include classification criteria set for different semantic types, respectively, and the classification criteria includes, but is not limited to, at least one of a video time period, a keyword of video content, and a video frame category.
Step S222: and performing content grouping processing on the first video content according to the material classification standard to obtain multiple groups of material data.
In some optional embodiments, after step S210, the terminal may further perform video capture processing (such as web crawler or video search) according to the second topic information to obtain a target video, so as to perform content analysis on the target video to obtain second video content. For content analysis of the target video, the description of content analysis of the video sample in step S210 may be referred to, and details are not repeated. Based on this, in step S222, the terminal may perform content grouping processing on the first video content and the second video content according to the material classification criteria to obtain multiple sets of material data. Therefore, the method can further retrieve a plurality of video resources of the Internet according to the topic information, and can improve the sample reliability of content analysis and material grouping.
Step S223: and marking the semantic type corresponding to each group of material data according to the video semantic type to add the first semantic type.
In some optional implementations, the terminal may collect reference videos, acquire as training data the grouped contents divided from the reference videos based on the material classification standard together with the semantic types labeled for the respective grouped contents, and train a recognition model using the training data. Specifically, in steps S222 and S223, the first video content is input into the recognition model for recognition, so as to obtain multiple sets of material data and the semantic type corresponding to each set of material data. The recognition model may use a model based on classification algorithms such as logistic regression, naive Bayes, decision trees, support vector machines, random forests, or gradient boosting trees, which is not specifically limited.
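As one way to realize such a recognition model, the sketch below trains a TF-IDF plus logistic regression classifier on grouped contents of reference videos; logistic regression is only one of the algorithms listed above, and the toy training texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: grouped contents of reference videos and their labeled semantic types.
group_texts = [
    "hello everyone and welcome to today's video",
    "to sum up, remember these three points",
    "let us look at the product in detail",
]
group_labels = ["opening", "summary", "feature"]

recognition_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
recognition_model.fit(group_texts, group_labels)

def label_material_groups(material_groups):
    # Mark the semantic type corresponding to each group of material data.
    return list(zip(material_groups, recognition_model.predict(material_groups)))
```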
Accordingly, after step S223, the terminal may add a plurality of sets of material data to the video material library. Therefore, by grouping, marking and storing the material data, the material content matched with the semantic type can be obtained from the video material library only by searching different semantic types.
Therefore, through the steps from S221 to S223, the video content is segmented and the semantic type is labeled according to the specified classification standard, so as to realize more intelligent semantic classification.
Step S230: and establishing a corresponding relation between the second topic information and the first semantic type.
Correspondingly, step S110 may specifically be: and screening a target semantic type from the video semantic types according to the first topic information and the corresponding relation.
It can be seen that, through the above steps S200 to S230, the correspondence between the topic information and the semantic types is established, which is convenient for quickly screening the matched semantic types according to the topic information in practical application.
In some optional embodiments, please refer to fig. 3, and fig. 3 is a schematic flowchart illustrating a specific process of step S130 in fig. 1. As shown in fig. 3, step S130 includes, but is not limited to, the following steps S131 to S136.
Step S131: and acquiring the text material corresponding to each target semantic type from a video material library.
Step S132: and performing label identification according to the first topic information and the shooting element to obtain a text label and a multimedia label.
In the embodiment of the present application, the text tag is used for labeling text data, the multimedia tag is used for labeling multimedia data, and the multimedia data is other media data besides text, which includes but is not limited to pictures, music, voice, video, animation, and the like.
Step S133: and according to the text label, performing content matching in the text material to obtain text content.
Step S134: and acquiring the multimedia material corresponding to the text content from the video material library.
In the embodiment of the present application, the video material library may adopt a hierarchical storage structure, a tree storage structure, or a chain storage structure to store the different text contents and the corresponding multimedia materials, which is not specifically limited. That is to say, a parent node may be constructed for a text content, and child nodes of that parent node may then be constructed from the multimedia materials corresponding to the text content. Specifically, when the multimedia material includes at least two materials of different media types, associated nodes may also be constructed for the materials of the different media types according to hierarchy. For example, the parent node corresponding to a text content is connected to different picture child nodes, and each picture child node is connected to a corresponding animation child node and sound child node respectively, so that a more comprehensive material chain can be obtained by traversing a node path.
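The parent/child node organisation of the video material library described above might be sketched as follows; the class name, the media-type strings, and the example file names are purely illustrative.

```python
class MaterialNode:
    """A node of a tree-structured video material library: a parent node holds a text
    content, and child nodes hold multimedia materials of different media types."""

    def __init__(self, content, media_type="text"):
        self.content = content
        self.media_type = media_type
        self.children = []

    def add_child(self, content, media_type):
        child = MaterialNode(content, media_type)
        self.children.append(child)
        return child

    def collect_chain(self):
        # Traverse the node path under this node to obtain the full material chain.
        chain = [(self.media_type, self.content)]
        for child in self.children:
            chain.extend(child.collect_chain())
        return chain

# A parent node for one text content, with a picture child that in turn carries
# an animation child and a sound child (file names are hypothetical).
text_node = MaterialNode("opening line about autumn streets")
picture = text_node.add_child("street_photo.jpg", "picture")
picture.add_child("falling_leaves.gif", "animation")
picture.add_child("city_ambience.mp3", "sound")
material_chain = text_node.collect_chain()
```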
Step S135: and according to the multimedia tag, performing content matching in the multimedia material to obtain the multimedia content.
The multimedia material comprises multimedia data marked with a plurality of labels, and the multimedia content is rapidly extracted through label matching.
Step S136: and taking the text content and the multimedia content as material content corresponding to the target semantic type.
It can be seen that, through steps S131 to S136, at least two interactive propagation media related to the topic information and the shooting scene elements can be screened out by combining tag identification and tag matching, which enriches the diversity of the material content and helps make the display of the video content more intuitive and vivid.
In some alternative embodiments, please refer to fig. 4, wherein fig. 4 is a schematic flowchart of step S140 in fig. 1. As shown in fig. 4, step S140 includes, but is not limited to, the following steps S141 to S143.
Step S141: and detecting the voice information, and performing information matching in the material content according to the voice information to obtain reference content.
In embodiments of the present application, the reference content may include, but is not limited to, text, pictures, music, voice, emoticons, charts, video, animation, subtitles, picture filters, post effects, video engineering files, and the like. In step S141, specifically, the terminal may collect voice data through the recording device, and recognize voice information from the voice data through an Automatic Speech Recognition (ASR) technology.
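One possible way to realize the voice detection and information matching in step S141 is sketched below with the SpeechRecognition package; the package, the recognizer backend, and the naive keyword-overlap matching are assumptions, since the embodiment only requires some ASR technique.

```python
import speech_recognition as sr

def detect_voice_and_match(material_contents):
    # material_contents: mapping from a content identifier to its descriptive text.
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:                  # the terminal's recording device
        audio = recognizer.listen(source, phrase_time_limit=5)
    spoken = recognizer.recognize_google(audio)      # one possible ASR backend
    # Naive information matching: keep material whose description shares words
    # with the recognised utterance (a deliberately simple illustration).
    spoken_words = set(spoken.lower().split())
    return {key: text for key, text in material_contents.items()
            if spoken_words & set(text.lower().split())}
```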
Step S142: and performing display processing on the reference content.
In step S142, specifically, the terminal may display text, pictures, subtitles, picture filters, post-effects, video engineering files, and the like, and play videos, animations, music, voices, and the like in the application interface for the user to refer to.
Step S143: and detecting editing information of the reference content, and updating the reference content according to the editing information to obtain the content to be combined.
In the embodiment of the present application, the editing information is information generated according to an editing operation on the reference content, where the editing operation may represent that a user edits the display content in the application interface through the application interface, and the editing operation includes, but is not limited to, content replacement, deletion, addition, or modification. For example, a user may edit the specific content of a certain text in the application interface, delete a certain picture, add a newly uploaded video, or modify a certain subtitle, etc.
Therefore, by adopting the steps S141 to S143, the reference content can be quickly screened out from the material content through the voice recognition, and the user can update the reference content according to the needs of the user, so that the personalization degree and diversity of the video content can be further improved.
Tables 1 and 2 below are given as examples. Table 1 is a schematic table of material content, and Table 2 is a schematic table of content to be combined; they should be understood as examples and do not constitute a specific limitation on the material content or the content to be combined. As shown in Table 1 and Table 2, the user can use the style label "general" and pick out the title h3, the opening o3, the features c1 and c3, the summary s3, and the dialogue a3 from the material content, so as to combine them into completely new video content, which is convenient to operate.
TABLE 1 material content schematic table
(Table 1 is reproduced as an image in the original publication and is not available in text form.)
Table 2 schematic table of contents to be combined
(Table 2 is reproduced as an image in the original publication and is not available in text form.)
Referring to fig. 5, an embodiment of the present application further provides a video content generating apparatus, which can implement the video content generating method described above, and the video content generating apparatus includes an obtaining module 510, a screening module 520, an identifying module 530, a matching module 540, an interacting module 550, and a combining module 560, where:
an obtaining module 510, configured to obtain first topic information.
The screening module 520 is configured to screen at least two target semantic types from preset video semantic types according to the first topic information, where each target semantic type is used to determine different video semantic features.
And an identifying module 530 for acquiring the shooting information and identifying the shooting element from the shooting information.
And the matching module 540 is configured to perform content matching in the video material library according to each target semantic type, the first topic information, and the shooting element, so as to obtain material content corresponding to the target semantic type.
And the interaction module 550 is configured to detect interaction information, and obtain content to be combined from the material content corresponding to each target semantic type according to the interaction information.
And the combination module 560 is configured to combine the contents to be combined corresponding to all the target semantic types to obtain video contents.
The specific implementation of the video content generating apparatus is substantially the same as the specific implementation of the video content generating method, and is not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, the memory stores a computer program, and the processor executes the computer program to implement the video content generation method. The electronic equipment can be any intelligent terminal including a tablet computer, a vehicle-mounted computer and the like.
Referring to fig. 6, fig. 6 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 601 may be implemented by a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided in the embodiment of the present application;
the memory 602 may be implemented in a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 602 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program codes are stored in the memory 602 and called by the processor 601 to execute the video content generating method according to the embodiments of the present disclosure;
an input/output interface 603 for inputting and outputting information;
the communication interface 604 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 605 that transfers information between the various components of the device (e.g., the processor 601, memory 602, input/output interfaces 603, and communication interfaces 604);
wherein the processor 601, the memory 602, the input/output interface 603 and the communication interface 604 are communicatively connected to each other within the device via a bus 605.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the video content generation method.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
It will be appreciated by those skilled in the art that the embodiments shown in the figures are not intended to limit the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of units described above is only a division by logical function, and other divisions may be used in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method for generating video content, the method comprising:
acquiring first topic information;
screening out at least two target semantic types from preset video semantic types according to the first topic information, wherein each target semantic type is used for determining a different video semantic feature;
acquiring shooting information, and identifying shooting elements from the shooting information;
according to each target semantic type, the first topic information and the shooting element, performing content matching in a video material library to obtain material content corresponding to the target semantic type;
detecting interaction information, and acquiring content to be combined from material content corresponding to each target semantic type according to the interaction information;
and combining the contents to be combined corresponding to all the target semantic types to obtain video content.
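As an illustration only, the following is a minimal Python sketch of the pipeline recited in claim 1. The function names, the keyword-overlap matching, and the list-of-dictionaries material library are assumptions made for this example, not the implementation described in the embodiments; the interaction step is reduced to a caller-supplied select callback.

```python
# Hypothetical sketch of the claim 1 pipeline; data structures and matching
# logic are illustrative assumptions only.
from typing import Dict, List

VIDEO_SEMANTIC_TYPES = ["opening", "explanation", "demonstration", "summary"]

def screen_target_semantic_types(topic_info: str, mapping: Dict[str, List[str]]) -> List[str]:
    """Screen at least two target semantic types for the topic."""
    types = [t for t in mapping.get(topic_info, []) if t in VIDEO_SEMANTIC_TYPES]
    # Fallback (an assumption): guarantee at least two types are returned.
    return types if len(types) >= 2 else VIDEO_SEMANTIC_TYPES[:2]

def identify_shooting_elements(shooting_info: Dict) -> List[str]:
    """Identify shooting elements (e.g. scene, person, prop) from shooting information."""
    return [v for k, v in shooting_info.items() if k in ("scene", "person", "prop")]

def match_material(semantic_type: str, topic_info: str, elements: List[str],
                   material_library: List[Dict]) -> List[Dict]:
    """Match material content in the library for one target semantic type."""
    keywords = {topic_info, *elements}
    return [m for m in material_library
            if m["semantic_type"] == semantic_type and keywords & set(m["tags"])]

def generate_video_content(topic_info, mapping, shooting_info, material_library, select):
    elements = identify_shooting_elements(shooting_info)
    combined = []
    for sem_type in screen_target_semantic_types(topic_info, mapping):
        candidates = match_material(sem_type, topic_info, elements, material_library)
        combined.append(select(sem_type, candidates))  # interaction-driven selection
    return combined  # combining the per-type contents yields the video content
```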
2. The method according to claim 1, wherein before the screening out of the target semantic type from the preset video semantic types according to the first topic information, the method further comprises:
collecting a video sample;
analyzing the content of the video sample to obtain first video content, and extracting second topic information from the first video content;
acquiring a semantic type related to the first video content from preset video semantic types to serve as a first semantic type;
establishing a corresponding relation between the second topic information and the first semantic type;
wherein the screening out of the target semantic type from the preset video semantic types according to the first topic information comprises:
and screening a target semantic type from the video semantic types according to the first topic information and the corresponding relation.
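A minimal sketch of the correspondence building in claim 2, assuming the content-analysis, topic-extraction, and labelling steps are supplied as callables; every name below is hypothetical.

```python
# Hypothetical sketch of claim 2: build a topic-to-semantic-type correspondence
# from collected video samples, then use it to screen target semantic types.
from collections import defaultdict
from typing import Dict, List

def build_topic_type_correspondence(video_samples, analyze_content, extract_topic,
                                    label_semantic_types) -> Dict[str, List[str]]:
    correspondence: Dict[str, List[str]] = defaultdict(list)
    for sample in video_samples:
        first_video_content = analyze_content(sample)             # content analysis
        second_topic = extract_topic(first_video_content)         # second topic information
        first_types = label_semantic_types(first_video_content)   # related semantic types
        for t in first_types:
            if t not in correspondence[second_topic]:
                correspondence[second_topic].append(t)
    return dict(correspondence)

def screen_by_correspondence(first_topic: str,
                             correspondence: Dict[str, List[str]]) -> List[str]:
    """Screen target semantic types using the first topic information and the correspondence."""
    return correspondence.get(first_topic, [])
```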
3. The method according to claim 2, wherein the obtaining the semantic type related to the first video content from the preset video semantic types as the first semantic type comprises:
acquiring a material classification standard according to a preset video semantic type, wherein the material classification standard is used for determining different semantic types in the video semantic type;
performing content grouping processing on the first video content according to the material classification standard to obtain a plurality of groups of material data;
according to the video semantic types, marking the semantic type corresponding to each group of the material data so as to obtain the first semantic type;
the method further comprises the following steps:
and adding the plurality of groups of material data into a video material library.
4. The method of claim 3, wherein after extracting the second topic information from the first video content, the method further comprises:
performing video capture processing according to the second topic information to obtain a target video;
performing content analysis on the target video to obtain second video content;
the content grouping processing is performed on the first video content according to the material classification standard to obtain a plurality of groups of material data, and the method comprises the following steps:
and performing content grouping processing on the first video content and the second video content according to the material classification standard to obtain a plurality of groups of material data.
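A hedged sketch of the grouping recited in claims 3 and 4, assuming the material classification standard can be represented as a mapping from a content category to a semantic type; the grouping key, library structure, and the omitted capture step are illustrative assumptions.

```python
# Hypothetical sketch of claims 3-4: group the first (and, if captured, second)
# video content by a material classification standard, label each group with a
# semantic type, and add the groups of material data to the video material library.
from collections import defaultdict
from typing import Dict, List

def group_and_label(video_content: List[Dict],
                    classification_standard: Dict[str, str],
                    material_library: List[Dict]) -> List[Dict]:
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for item in video_content:
        # Assumed form of the standard: content category -> semantic type.
        semantic_type = classification_standard.get(item["category"], "other")
        groups[semantic_type].append(item)
    labeled = [{"semantic_type": t, "items": items} for t, items in groups.items()]
    material_library.extend(labeled)   # add the groups of material data to the library
    return labeled
```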
5. The method according to any one of claims 1 to 4, wherein the performing content matching in a video material library according to each of the target semantic types, the first topic information, and the shooting element to obtain material content corresponding to the target semantic types comprises:
acquiring a text material corresponding to each target semantic type from a video material library;
performing label identification according to the first topic information and the shooting element to obtain a text label and a multimedia label;
according to the text label, performing content matching in the text material to obtain text content;
acquiring a multimedia material corresponding to the text content from the video material library;
according to the multimedia tag, performing content matching in the multimedia material to obtain multimedia content;
and taking the text content and the multimedia content as material content corresponding to the target semantic type.
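A minimal sketch of the two-stage matching in claim 5, assuming label identification reduces to splitting the topic text and taking the shooting elements, and assuming each text material carries identifiers of its associated multimedia materials; both assumptions are illustrative only.

```python
# Hypothetical sketch of claim 5: text labels select text content first, then
# multimedia labels select multimedia content linked to that text content.
from typing import Dict, List

def derive_labels(first_topic: str, shooting_elements: List[str]) -> Dict[str, set]:
    # Assumed label identification: topic words become text labels,
    # shooting elements become multimedia labels.
    return {"text": set(first_topic.split()), "multimedia": set(shooting_elements)}

def match_material_content(semantic_type: str, first_topic: str,
                           shooting_elements: List[str], library: Dict) -> Dict:
    labels = derive_labels(first_topic, shooting_elements)
    text_materials = [t for t in library["text"] if t["semantic_type"] == semantic_type]
    text_content = [t for t in text_materials if labels["text"] & set(t["labels"])]
    # Only multimedia material linked to the matched text content is considered.
    linked_ids = {mid for t in text_content for mid in t["multimedia_ids"]}
    multimedia = [m for m in library["multimedia"] if m["id"] in linked_ids]
    multimedia_content = [m for m in multimedia if labels["multimedia"] & set(m["labels"])]
    return {"text": text_content, "multimedia": multimedia_content}
```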
6. The method according to any one of claims 1 to 4, wherein the detecting interaction information and obtaining the content to be combined from the material content corresponding to each target semantic type according to the interaction information includes:
detecting voice information, and performing information matching in the material content according to the voice information to obtain reference content;
performing display processing on the reference content;
and detecting editing information of the reference content, and updating the reference content according to the editing information to obtain the content to be combined.
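A minimal sketch of the interaction step in claim 6, assuming speech recognition has already produced a text query; the display step is stood in for by a print call, and the editing information is assumed to arrive as per-item field overrides.

```python
# Hypothetical sketch of claim 6: match reference content by the recognized
# voice query, display it, then apply the detected editing information.
from typing import Dict, List, Optional

def select_content_to_combine(material_content: List[Dict],
                              voice_text: str,
                              edits: Optional[Dict] = None) -> List[Dict]:
    # Information matching: keep material whose tags overlap the voice query words.
    query_words = set(voice_text.split())
    reference = [m for m in material_content if query_words & set(m.get("tags", []))]
    print("reference content:", [m["id"] for m in reference])  # stand-in for display processing
    if edits:  # editing information detected on the reference content
        reference = [dict(m, **edits.get(m["id"], {})) for m in reference]
    return reference  # content to be combined
```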
7. The method according to any one of claims 1 to 4, wherein the combining the contents to be combined corresponding to all the target semantic types to obtain the video content comprises:
obtaining ordering information corresponding to each target semantic type;
acquiring a video time axis, wherein the video time axis comprises a plurality of playing time periods which are connected in series according to a time sequence;
acquiring playing time periods corresponding to the target semantic types from the video time axis according to the sequencing information to serve as target time periods of the target semantic types;
and importing the content to be combined corresponding to each target semantic type into the target time period of the target semantic type to obtain the video content.
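A minimal sketch of the timeline combination in claim 7, assuming the video time axis can be represented as an ordered list of (start, end) playing time periods and that the ordering information is simply an ordered list of target semantic types.

```python
# Hypothetical sketch of claim 7: map each ordered target semantic type to a
# playing time period on the video time axis and import its content there.
from typing import Dict, List, Tuple

def combine_on_timeline(content_by_type: Dict[str, list],
                        ordering: List[str],
                        timeline: List[Tuple[float, float]]) -> List[Dict]:
    video_content = []
    for sem_type, period in zip(ordering, timeline):  # periods connected in time order
        video_content.append({
            "semantic_type": sem_type,
            "period": period,                  # target time period for this type
            "content": content_by_type[sem_type],
        })
    return video_content
```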
8. A video content generating apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring first topic information;
the screening module is used for screening at least two target semantic types from preset video semantic types according to the first topic information, wherein each target semantic type is used for determining different video semantic features;
the identification module is used for acquiring shooting information and identifying shooting elements from the shooting information;
the matching module is used for carrying out content matching in a video material library according to each target semantic type, the first topic information and the shooting element to obtain material content corresponding to the target semantic type;
the interaction module is used for detecting interaction information and acquiring content to be combined from material content corresponding to each target semantic type according to the interaction information;
and the combination module is used for combining the contents to be combined corresponding to all the target semantic types to obtain the video contents.
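Purely as an illustration of how the modules of claim 8 could be wired together, the following hypothetical class composes callables like the functions sketched above; it is not the apparatus structure described in the embodiments.

```python
# Hypothetical composition of the claim 8 modules; each module is assumed to be
# an injected callable rather than a concrete device component.
class VideoContentGenerator:
    def __init__(self, acquisition, screening, identification,
                 matching, interaction, combination):
        self.acquisition = acquisition        # acquires first topic information
        self.screening = screening            # screens target semantic types
        self.identification = identification  # identifies shooting elements
        self.matching = matching              # matches material content
        self.interaction = interaction        # selects content to be combined
        self.combination = combination        # combines into video content

    def run(self, topic_source, shooting_info):
        topic = self.acquisition(topic_source)
        types = self.screening(topic)
        elements = self.identification(shooting_info)
        per_type = {t: self.matching(t, topic, elements) for t in types}
        to_combine = {t: self.interaction(c) for t, c in per_type.items()}
        return self.combination(to_combine)
```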
9. An electronic device, characterized in that the electronic device comprises a memory storing a computer program and a processor implementing the video content generation method of any one of claims 1 to 7 when the processor executes the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the video content generation method according to any one of claims 1 to 7.
CN202211249471.7A 2022-10-12 2022-10-12 Video content generation method and device, electronic equipment and storage medium Pending CN115580758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211249471.7A CN115580758A (en) 2022-10-12 2022-10-12 Video content generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211249471.7A CN115580758A (en) 2022-10-12 2022-10-12 Video content generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115580758A true CN115580758A (en) 2023-01-06

Family

ID=84585030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211249471.7A Pending CN115580758A (en) 2022-10-12 2022-10-12 Video content generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115580758A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009574A (en) * 2023-07-20 2023-11-07 天翼爱音乐文化科技有限公司 Hot spot video template generation method, system, equipment and storage medium
CN117009574B (en) * 2023-07-20 2024-05-28 天翼爱音乐文化科技有限公司 Hot spot video template generation method, system, equipment and storage medium
CN117336539A (en) * 2023-09-28 2024-01-02 北京风平智能科技有限公司 Video script production method and system for short video IP (Internet protocol) construction
CN117336539B (en) * 2023-09-28 2024-05-14 北京风平智能科技有限公司 Video script production method and system for short video IP (Internet protocol) construction

Similar Documents

Publication Publication Date Title
CN108009228B (en) Method and device for setting content label and storage medium
Labatut et al. Extraction and analysis of fictional character networks: A survey
JP7199756B2 (en) Method and system for automatic generation of video content integrated metadata using video metadata and script data
CN106878632B (en) Video data processing method and device
CN109582945B (en) Article generation method, article generation device and storage medium
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN103052953A (en) Information processing device, method of processing information, and program
CN102342124A (en) Method and apparatus for providing information related to broadcast programs
CN111339250B (en) Mining method for new category labels, electronic equipment and computer readable medium
CN108121715B (en) Character labeling method and character labeling device
CN107577672B (en) Public opinion-based script setting method and device
US10127824B2 (en) System and methods to create multi-faceted index instructional videos
CN112231563B (en) Content recommendation method, device and storage medium
CN114827752A (en) Video generation method, video generation system, electronic device, and storage medium
CN114996506A (en) Corpus generation method and device, electronic equipment and computer-readable storage medium
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
Gagnon et al. Towards computer-vision software tools to increase production and accessibility of video description for people with vision loss
CN114547373A (en) Method for intelligently identifying and searching programs based on audio
CN113407775B (en) Video searching method and device and electronic equipment
JP6603929B1 (en) Movie editing server and program
CN114513706B (en) Video generation method and device, computer equipment and storage medium
CN112333554B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN114996458A (en) Text processing method and device, equipment and medium
CN114064968A (en) News subtitle abstract generating method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination