CN114390217A - Video synthesis method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114390217A
Authority
CN
China
Prior art keywords
video
text information
segment
content
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210047948.7A
Other languages
Chinese (zh)
Inventor
冯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210047948.7A
Publication of CN114390217A
Legal status: Pending (Current)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265: Mixing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a video synthesis method, a video synthesis device, computer equipment and a storage medium. The method relates to the field of network media and the technical field of artificial intelligence, and comprises the following steps: acquiring content description text information of a target object; the content description text information is text information describing the content expressed by the target object; semantic feature extraction is carried out on the content description text information to obtain text semantic features; acquiring candidate video content characteristics; the candidate video content features are obtained by extracting semantic features of the picture contents of the candidate video clips; determining a video clip matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video clip; and synthesizing the object video based on the content description text information and the target video segment. The method can improve the efficiency of processing the multimedia data.

Description

Video synthesis method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video synthesis method and apparatus, a computer device, and a storage medium.
Background
With the development of computers and internet technologies, multimedia technologies have emerged, which refer to technologies that perform comprehensive processing and management on various media information such as text, data, graphics, images, animations, sounds, etc. through computers, so that users can perform real-time information interaction with computers through various senses. In more and more scenarios, multimedia data is processed using multimedia technology, for example, a video including a plurality of media data may be synthesized using multimedia technology.
At present, the amount of multimedia data on the internet keeps increasing. Before the multimedia data can be processed, the required media data needs to be manually screened out from the various available media data, and the manually screened media data is then processed using multimedia technology.
However, manually screening media data is time-consuming, which results in low efficiency of multimedia data processing.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video composition method, apparatus, computer device, storage medium, and computer program product capable of improving efficiency of processing multimedia data in response to the above technical problems.
In one aspect, the present application provides a video synthesis method. The method comprises the following steps: acquiring content description text information of a target object, the content description text information being text information describing the content expressed by the target object; performing semantic feature extraction on the content description text information to obtain text semantic features; acquiring candidate video content features, the candidate video content features being obtained by extracting semantic features from the picture content of candidate video segments; determining a video segment matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video segment; and synthesizing an object video corresponding to the target object based on the content description text information and the target video segment, wherein the object video comprises the target picture content in the target video segment, and the content description text information is correspondingly displayed when the target picture content in the object video is played.
On the other hand, the application also provides a video synthesis device. The device comprises: an information acquisition module, configured to acquire content description text information of a target object, the content description text information being text information describing the content expressed by the target object; a feature extraction module, configured to extract semantic features of the content description text information to obtain text semantic features; a feature acquisition module, configured to acquire candidate video content features, the candidate video content features being obtained by extracting semantic features from the picture content of candidate video segments; a video acquisition module, configured to determine a video segment matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video segment; and a video synthesis module, configured to synthesize an object video corresponding to the target object based on the content description text information and the target video segment, wherein the object video comprises the target picture content in the target video segment, and the content description text information is correspondingly displayed when the target picture content in the object video is played.
In some embodiments, the candidate video content features are a plurality; the content description text information comprises a plurality of text information segments; the feature extraction module is further configured to perform semantic feature extraction on each text information segment to obtain segment semantic features of the text information segment; and determining each fragment semantic feature as the text semantic feature.
In some embodiments, the candidate video content features are a plurality; the video acquisition module is further configured to determine a matching degree between each of the segment semantic features and each of the candidate video content features respectively; and determining video clips respectively matched with the text information clips based on the matching degree between each clip semantic feature and each candidate video content feature to obtain target video clips.
In some embodiments, the target video clips are multiple, and each target video clip is matched with one text information clip; the video acquisition module is further configured to determine, for each text information segment, a matching degree of a segment semantic feature of the text information segment and each candidate video content feature; based on the matching degree of the segment semantic features of the text information segments and the candidate video content features, screening the candidate video content features to obtain the video content features matched with the segment semantic features of the text information segments; and acquiring the video clip corresponding to the matched video content characteristic to obtain a target video clip matched with the text information clip.
In some embodiments, the video obtaining module is further configured to obtain a text information segment adjacent to the text information segment, so as to obtain an adjacent text information segment of the text information segment; calculating the difference between the segment semantic features of the adjacent text information segments and the segment semantic features of the text information segments to obtain feature difference information; performing feature fusion on the feature difference information and the segment semantic features to obtain fused semantic features; and determining the matching degree of the segment semantic features of the text information segments and each candidate video content feature based on the fusion semantic features.
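The difference-and-fusion step above can be pictured with a minimal sketch in PyTorch; the feature dimensionality (768) and the concatenation-based fusion are assumptions for illustration only, not the claimed implementation:

```python
import torch

def fuse_with_neighbor(segment_feat: torch.Tensor, neighbor_feat: torch.Tensor) -> torch.Tensor:
    """Fuse a text segment's semantic feature with the difference to its adjacent segment.

    Both inputs are assumed to be 1-D vectors of equal length; the concrete
    dimensionality and concatenation-based fusion are illustrative choices.
    """
    # Feature difference information between the adjacent segment and this segment.
    diff = neighbor_feat - segment_feat
    # Feature fusion: concatenation here; element-wise addition is another option.
    return torch.cat([segment_feat, diff], dim=-1)

# Example: 768-dimensional segment semantic features (dimension is an assumption).
fused = fuse_with_neighbor(torch.randn(768), torch.randn(768))  # shape: (1536,)
```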
In some embodiments, the feature extraction module is further configured to perform word segmentation processing on the text information segment to obtain a plurality of word segments; for each word segment, extracting semantic features of the word segment to obtain the word semantic features of the word segment; and performing feature fusion on the word meaning features of the word segments to obtain the segment semantic features of the text information segments.
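As a rough illustration of the "segment into words, extract word semantic features, fuse" flow, the following sketch uses the jieba word segmenter and mean pooling; neither tool nor fusion choice is prescribed by this application, and the word-vector lookup stands in for a trained feature extraction network:

```python
import jieba            # third-party Chinese word segmentation library
import numpy as np

def segment_semantic_feature(text_segment: str, word_vectors: dict, dim: int = 128) -> np.ndarray:
    """Word segmentation -> word semantic features -> feature fusion (mean pooling)."""
    words = jieba.lcut(text_segment)                              # word segments
    feats = [word_vectors.get(w, np.zeros(dim)) for w in words]   # word semantic features
    return np.mean(feats, axis=0) if feats else np.zeros(dim)     # fused segment feature
```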
In some embodiments, the apparatus further comprises a feature generation module, where the candidate video content features are generated by the feature generation module; the feature generation module is used for extracting semantic features of each video frame in the candidate video clips to obtain frame semantic features; and performing feature fusion on each frame semantic feature to obtain the candidate video content feature.
In some embodiments, the target video segment is a plurality of video segments, the content description text information comprises a plurality of text information segments, and each target video segment matches one text information segment; the video synthesis module is further configured to sequentially splice the target video segments matched with the text information segments according to the sequence of the text information segments in the content description text information, and determine the display time of the matched text information segments according to the playing time of each target video segment to synthesize the object video corresponding to the target object.
In some embodiments, the target object is a target text object; the content description text information is abstract description information of the content described by the target text object; the object video is a video for introducing the target text object.
In some embodiments, the video composition module is further configured to convert the content description text information into audio data; synthesizing an object video corresponding to the target object based on the content description text information, the target video segment and the audio data; and correspondingly displaying the content description text information and correspondingly playing the audio data when the target picture content of the object video is played.
In some embodiments, the video synthesis module is further configured to obtain candidate audio, and determine an audio style of the candidate audio; determining the content style of the content description text information, and determining candidate audio with the audio style matched with the content style of the content description text information as target audio; synthesizing an object video corresponding to the target object based on the content description text information, the target video segment and the target audio; and when the object video is played, correspondingly displaying the content description text information and correspondingly playing the target audio.
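A minimal sketch of the style-matching idea, assuming both the audio style and the content style are already available as discrete labels (how the styles are predicted is not shown here):

```python
from typing import List, Optional

def pick_target_audio(content_style: str, candidate_audios: List[dict]) -> Optional[dict]:
    """Return the first candidate audio whose style label matches the content style.

    Each candidate audio is assumed to look like {"id": "...", "style": "cheerful"}.
    """
    for audio in candidate_audios:
        if audio.get("style") == content_style:
            return audio
    return None  # no candidate audio matches the content style
```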
On the other hand, the application also provides computer equipment. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the video synthesis method when executing the computer program.
In another aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned video composition method.
In another aspect, the present application also provides a computer program product. The computer program product comprises a computer program, wherein the computer program realizes the steps of the video synthesis method when executed by a processor.
The video synthesis method, the video synthesis device, the computer equipment, the storage medium and the computer program product are used for obtaining content description text information of a target object, extracting semantic features from the content description text information to obtain text semantic features, obtaining candidate video content features, determining video content features matched with the text semantic features, obtaining video segments corresponding to the matched video content features to obtain target video segments, synthesizing object videos corresponding to the target object based on the content description text information and the target video segments, wherein the object videos comprise target picture contents in the target video segments, and the content description text information is correspondingly displayed when the target picture contents in the object videos are played. Because the content description text information is the text information describing the content expressed by the target object, and the candidate video content features are obtained by extracting the semantic features of the content expressed in the candidate video segments, when the candidate video content features are matched with the text semantic features, the content description text information is matched with the content expressed by the video segments, namely the similarity is higher, so that the text information and the video which are matched with each other are automatically determined, the efficiency of screening the multimedia data is improved, and the efficiency of processing the multimedia data is improved.
Drawings
FIG. 1 is a diagram of an environment in which a video compositing method may be applied in some embodiments;
FIG. 2 is a flow diagram illustrating a video compositing method in some embodiments;
FIG. 3 is a schematic diagram of a video composition interface in some embodiments;
FIG. 4 is a block diagram of a text feature extraction network in some embodiments;
FIG. 5 is a block diagram of a text feature extraction network in some embodiments;
FIG. 6 is a block diagram of an encoder in some embodiments;
FIG. 7 is a block diagram of a residual network in some embodiments;
FIG. 8 is a block diagram of a feature fusion network in some embodiments;
FIG. 9 is a graph of the effects of object video in some embodiments;
FIG. 10 is a schematic flow chart of a video compositing method in some embodiments;
FIG. 11 is a schematic diagram of a composite video in some embodiments;
FIG. 12 is a schematic diagram of the calculation of the degree of match in some embodiments;
FIG. 13 is a block diagram of the structure of a video compositing device in some embodiments;
FIG. 14 is a diagram of the internal structure of a computer device in some embodiments;
FIG. 15 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The video synthesis method provided by the application can be applied to the application environment shown in fig. 1. The application environment includes a terminal 102 and a server 104. Wherein the terminal 102 communicates with the server 104 via a network.
Specifically, the terminal 102 may send a video synthesis request to the server 104. The video synthesis request is used to request generation of an object video corresponding to a target object; the object video includes the target picture content of a target video segment, and the content description text information is correspondingly displayed when the target picture content in the object video is played. In response to the video synthesis request, the server 104 may obtain the content description text information of the target object, the content description text information being text information describing the content expressed by the target object; perform semantic feature extraction on the content description text information to obtain text semantic features; obtain candidate video content features, the candidate video content features being obtained by performing semantic feature extraction on the picture content of candidate video segments; determine, based on the matching degree between the text semantic features and the candidate video content features, a video segment matched with the content description text information to obtain the target video segment; and synthesize the object video corresponding to the target object based on the content description text information and the target video segment. The server 104 may return the synthesized object video corresponding to the target object to the terminal 102, and the terminal 102 may play the object video.
The terminal 102 may be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, smart televisions, vehicle-mounted terminals, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster or a cloud server formed by a plurality of servers.
It is to be understood that the above application scenario is only an example, and does not constitute a limitation to the video composition method provided in the embodiment of the present application, and the method provided in the embodiment of the present application may also be applied to other application scenarios, for example, the video composition method provided in the present application may be executed by the terminal 102, the terminal 102 may upload the object video corresponding to the synthesized target object to the server 104, the server 104 may store the object video corresponding to the target object, and may also forward the object video corresponding to the target object to other devices.
The video synthesis method provided by the application can be applied to the field of network media, for example, the video synthesis method provided by the application can be used for processing videos or texts in the field of network media.
The video composition provided by the present application may be based on artificial intelligence, for example, in the present application, the matching degree of the text semantic features and the candidate video content features may be determined by using a matching degree detection model, so as to determine the video content features matching the text semantic features. The matching degree detection model is an artificial intelligence-based model, such as a trained neural network model, and is used for determining the matching degree between the semantic features of the text and the candidate video content features.
Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service, the internet of vehicles, and smart traffic.
The scheme provided by the embodiment of the application relates to the technologies such as artificial neural network of artificial intelligence, and is specifically explained by the following embodiment:
in some embodiments, as shown in fig. 2, a video composition method is provided, where the method may be executed by a server or a terminal, or may be executed by both the terminal and the server, and in this embodiment, the method is described as applied to the server 104 in fig. 1, and includes the following steps:
step 202, acquiring content description text information of a target object; the content description text information is text information describing the content expressed by the target object.
The target object may comprise at least one of an image, a video, and a text, where the text includes, but is not limited to, a novel, a poem, a lyric poem, song lyrics, an article, text in a web page, text in a picture, text in a video, or text in a textbook. The content description text information may be the content described in the target object, or may be text information that briefly introduces the target object. For example, taking the target object being a novel as an example, the content description text information may include at least one of the title of the novel, a paragraph of the novel, an abstract of the novel, a brief introduction of the novel, and the like. The content description text information of the target object may be pre-stored in the server, or may be acquired by the server from other devices.
Specifically, the target object may be an object for which an object video is to be synthesized. The object video includes the target picture content of the target video segment, where the target picture content refers to the content presented by the video frames of the video. When the target picture content in the object video is played, the content description text information is correspondingly displayed, so that the object video can serve as a video introducing the content expressed by the target object. The terminal may display a video composition interface that includes a content description text information filling area, which is used to obtain the content description text information of the target object. When the terminal receives a video composition operation, it obtains, in response to the video composition operation, the content description text information of the target object filled in the content description text information filling area, and sends a video composition request carrying the content description text information to the server. In response to the video composition request, the server may query for a video segment matching the content expressed by the content description text information, perform video synthesis based on the queried video segment to obtain the object video corresponding to the target object, and return the object video to the terminal. The terminal displays the received object video corresponding to the target object in the video composition interface; for example, the video composition interface may include a video display area, and the terminal may display the received object video in the video display area. The video composition interface may further display a video composition control, and the video composition operation may be a trigger operation on the video composition control, where the trigger operation includes, but is not limited to, a mouse click operation or a touch operation.
In some embodiments, the video composition interface further includes an object identifier obtaining area, where the object identifier obtaining area is used to obtain an object identifier, and the object identifier is used to uniquely identify the object. The terminal may obtain content description text information of the target object filled in the content description text information filling area and an object identifier of the target object filled in the object identifier obtaining area in response to the video composition operation, and send a video composition request to the server, where the video composition request may include the content description text information and the object identifier, the server may obtain, in response to the video composition request, a video segment matching the content expressed by the content description text information by querying, perform video composition based on the queried video segment, and obtain an object video corresponding to the target object, and the synthesized object video may include at least one of the content description text information or the object identifier.
For example, as shown in (a) of fig. 3, an object identification acquisition area 304, a content description text information filling area 306, and a video composition control 308 are shown in the video composition interface 302. The object identification acquisition area 304 is filled with the name of the ancient poem "Pity Nong", and the content description text information filling area 306 is filled with the content of the poem, namely "A grain of millet is sown in spring, and ten thousand grains are harvested in autumn. No field in the four seas lies idle, yet farmers still starve to death." When the terminal receives a click operation on the video composition control 308, the terminal, in response to the click operation, generates a video composition request containing "Pity Nong" and the poem text, and sends the video composition request to the server. In response to the video composition request, the server generates an object video corresponding to the ancient poem "Pity Nong" and returns it to the terminal, and the terminal displays the object video to introduce the poem. As shown in (b) of fig. 3, a video display area 310 is displayed in the video composition interface 302, and a video introducing the ancient poem "Pity Nong" is displayed in the video display area 310; this video is the object video synthesized by the server for the poem.
And 204, extracting semantic features of the content description text information to obtain text semantic features.
The text semantic features are features obtained by extracting the semantics of the content description text information, can be features obtained by directly extracting the semantics of the content description text information, and can also be features obtained by segmenting the content description text information to obtain a plurality of text information segments and then extracting the semantics of each text information segment. The semantics of the textual information refer to the meaning that the textual information expresses.
Specifically, the server may directly extract semantic features from the content description text information to obtain text semantic features. Or, the server may divide the content description text information into a plurality of text information segments, and for each text information segment, the server may extract semantic features of the text information segment to obtain segment semantic features of the text information segment. The text semantic features are determined based on the segment semantic features of each text information segment, for example, the server may determine each segment semantic feature as a text semantic feature, that is, the text semantic features may include each segment semantic feature, or the server may perform feature fusion on each segment semantic feature, and determine the feature obtained by the fusion as a text semantic feature.
In some embodiments, the server may segment the content description text information to obtain a plurality of text information segments, for example, the server may determine a segmented character, compare the character in the content description text information with the segmented character, determine a position of the character with the consistent comparison in the content description text information as a segmentation position, and segment the content description text information at the segmentation position in the content description text information to obtain a plurality of text information segments. The cut characters include, but are not limited to, at least one of commas, periods, or semicolons. The server can also obtain a trained semantic segmentation model, and segment the content description text information by using the semantic segmentation model to obtain a plurality of text information segments. The semantic segmentation model is used for segmenting information according to semantics, and each segmented text information segment has certain semantics.
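For instance, punctuation-based segmentation could look like the following sketch (the exact set of segmentation characters is an assumption based on the examples above):

```python
import re

def split_description(text: str) -> list:
    """Split content description text information into text information segments
    at segmentation characters (commas, periods, semicolons)."""
    segments = re.split(r"[，。；,.;]", text)
    return [s.strip() for s in segments if s.strip()]

# Example with the ancient poem used later in this description.
print(split_description("春种一粒粟，秋收万颗子。四海无闲田，农夫犹饿死。"))
# -> ['春种一粒粟', '秋收万颗子', '四海无闲田', '农夫犹饿死']
```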
In some embodiments, the server may extract semantic features of the text information using a trained text feature extraction network. The text feature extraction network may be an artificial neural network, including but not limited to either a Word2Vec network or a BERT network. BERT is an abbreviation of Bidirectional Encoder Representations from Transformers, i.e. a Transformer-based bidirectional encoder representation; BERT may also be referred to as the encoder of a bidirectional Transformer. BERT is a language representation model. Fig. 4 shows a network structure diagram of BERT, where the classification identifier may be [CLS]; [CLS] indicates that the extracted features may be used for classification tasks, and CLS is an abbreviation of classification. In fig. 4, 512 is the length of the input data, "help prince mayuko" is part of the input data, and 12 is the number of encoder layers. BERT is a deep bidirectional pre-trained Transformer oriented towards semantic understanding. BERT is designed as a deep bidirectional model so that, from the first layer through to the last layer, the neural network captures the left and right context of the target word more effectively. As a language representation model, BERT can be trained with language representation objectives and achieves semantic understanding through a deep bidirectional Transformer model.
BERT is applied in the NLP (Natural Language Processing) field and improves the accuracy in many directions of the NLP field. BERT essentially learns a good feature representation for words by running self-supervised learning on large corpora. Self-supervised learning refers to supervised learning that runs on data without manual labels. In NLP tasks, the features produced by BERT can be used as the word embedding features (embeddings) of a task. BERT provides a model for transfer learning to other tasks; depending on the task, BERT may be fine-tuned or kept fixed and then used as a feature extractor. The network architecture of BERT uses a multi-layer Transformer structure, which reduces the distance between two words at any positions to 1 through the attention mechanism. Fig. 5 shows a network architecture diagram of BERT, in which each encoder may be a Transformer Block (Transformer module). Input feature 1 to input feature N form the feature sequence of a sentence, where input feature 1 may be denoted by E1, input feature 2 by E2, and input feature N by EN, E being an abbreviation of embedding. Output feature 1 to output feature N are the results output by the hidden layer, where output feature 1 may be denoted by T1, output feature 2 by T2, and output feature N by TN. The Transformer is an encoder-decoder structure formed by stacking a plurality of encoders and decoders. Fig. 6 shows the network structure of the Transformer, including the structures of the encoder and the decoder. The encoder includes a multi-head attention module (Multi-Head Attention), a summation and normalization module (Add & Norm), and a feed-forward neural network (Feed Forward), and is used to convert the input corpus into feature vectors. The decoder, whose inputs are the output of the encoder and the predicted result, includes a masked multi-head attention module (Masked Multi-Head Attention), a multi-head attention module, and a fully connected layer, and outputs the conditional probability of the final result. The encoder and the decoder also include summation and normalization modules and feed-forward neural networks. The N in "N ×" indicates that N identical modules are included in the encoder or decoder.
In some embodiments, the trained text feature extraction network may be a network in a trained match detection model. The matching degree detection model is used for determining the matching degree between the text semantic features and the video content features. The video content features are features obtained by extracting semantic features of the picture contents of the video clips. The matching degree detection model can also comprise a matching degree calculation network, the matching degree calculation network is a network for calculating the matching degree, the server can input the content description text information into the matching degree detection model, the semantic features of the content description text information are extracted by utilizing a text feature extraction network in the matching degree detection model to obtain text semantic features, the text semantic features and the video content features are input into the matching degree calculation network, and the matching degree between the text semantic features and the video content features is calculated.
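As an illustrative sketch only (the application does not prescribe a specific library or checkpoint), segment semantic features could be taken from a pretrained Chinese BERT via the Hugging Face transformers package, using the [CLS] vector mentioned above:

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed off-the-shelf Chinese BERT; the description only requires "a BERT network".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_semantic_feature(text_segment: str) -> torch.Tensor:
    """Encode one text information segment into a segment semantic feature,
    here taken from the [CLS] position of the last hidden layer."""
    inputs = tokenizer(text_segment, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)   # [CLS] vector, shape (768,)
```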
Step 206, obtaining candidate video content characteristics; the candidate video content features are obtained by extracting semantic features of the picture contents of the candidate video clips.
The candidate video content features may be pre-stored in the server, or may be obtained by the server from another device, or may be obtained by the server through semantic feature extraction on the picture content of the candidate video clip. The candidate video content features may be one or more. Plural means at least two. The candidate video clips can be pre-stored in the server, or can be obtained by the server from other devices. The candidate video segments are one or more. Each candidate video segment corresponds to a candidate video content feature. The candidate video content features corresponding to the candidate video clips are features obtained by extracting semantic features of the picture contents of the video clips. The candidate video content features are used to reflect the semantics, i.e. the expressed meanings, expressed by the candidate video segments. The candidate video content features may include features of a thing included in the candidate video segment, the thing may be animate or inanimate, including but not limited to at least one of a human, an animal, a plant, or a building. The candidate video content features may thus characterize the type of event included in the candidate video segment.
Specifically, the server may extract semantic features of the picture content of the candidate video segment by using a trained video feature extraction network to obtain the video content features of the candidate video segment, and determine the obtained video content features as the candidate video content features. The video feature extraction network may be a network in the matching degree detection model, or may be a network independent of the matching degree detection model. The video feature extraction network may be an artificial neural network including, but not limited to, either ResNet (residual network) or Node2Vec, for example ResNet50 or ResNet101. Compared with the ResNet50 model, Node2Vec occupies fewer hardware and computing resources. The residual network follows the idea of adding residual learning to the traditional convolutional neural network, which alleviates the problems of gradient vanishing and accuracy degradation (on the training set) in deep networks, so that the network can be made ever deeper while accuracy is ensured and speed is kept under control.
The residual network is applied in fields such as object classification and is part of the classical backbone neural networks for computer vision tasks; typical networks are ResNet50 and ResNet101, where 50 and 101 refer to the number of layers. ResNet50 is divided into 5 stages. Stage 0 is relatively simple and can be regarded as preprocessing of the input, while the last 4 stages are all composed of bottleneck layers with similar structures: stage 1 includes 3 bottleneck blocks, and stages 2 to 4 include 4, 6, and 3 bottleneck blocks, respectively. Fig. 7 shows an architecture diagram of ResNet50. In the figure, (3,224,224) indicates the number of channels (channel), the height (height), and the width (width) of the input, i.e. (C,H,W); when the height and width of the input are equal, the shape is written as (C,W). C indicates the number of input channels, H the input height, and W the input width; "input" here refers to the input data. In stage 0, the "7×7" in "convolutional layer: 7×7, 64, /2, BN, activation function" is the size of the convolution kernel, 64 is the number of convolution kernels, and "/2" means that the stride of the convolution kernel is 2. BN is an abbreviation of Batch Normalization, and the activation function may be, for example, a ReLU. The "3×3" in "max pooling layer: 3×3, /2" refers to the size of the max pooling kernel, and "/2" means that the stride of the kernel is 2. (64,56,56) are the number of channels, the height, and the width of the stage 0 output. Bottleneck layer 1 has different numbers of input and output channels, whereas bottleneck layer 2 has the same number of input and output channels. The "1×1" in "convolutional layer: 1×1, C1, /S, BN, activation function" is the size of the convolution kernel, C1 is the number of convolutional layer channels, and S represents the stride. Here, a convolutional layer refers to a layer of a Convolutional Neural Network (CNN), which can be denoted by Conv. Convolutional neural networks are feed-forward neural networks whose artificial neurons can respond to a portion of the surrounding units within their coverage, and they can be used for image processing. A convolutional neural network consists of one or more convolutional layers and a fully connected layer on top, and may also include associated weights and pooling layers.
In some embodiments, the candidate video clips include a plurality of video frames, and the server may extract semantic features of each video frame in the video clips to obtain frame semantic features of each video frame, and determine video content features of the candidate video clips based on the frame semantic features of each video frame, so as to obtain the candidate video content features. For example, the server may perform feature fusion on each frame semantic feature to obtain candidate video content features. Wherein the feature fusion comprises at least one of feature concatenation or feature addition.
In some embodiments, the server may extract semantic features from the video frames using the trained frame feature extraction network to obtain frame semantic features. The frame feature extraction network may be an artificial neural network including, but not limited to, any of a residual network or Node2 Vec. The frame feature extraction network may be a network in the matching degree detection model, or may be a network independent of the matching degree detection model.
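For illustration, frame semantic features could be produced by a ResNet50 with its classification head removed, e.g. via torchvision (the pretrained ImageNet weights and preprocessing below are assumptions, not part of this application):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet50 used as a frame feature extractor (torchvision >= 0.13 weights API assumed).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()      # drop the classifier, keep the 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_semantic_feature(frame) -> torch.Tensor:
    """frame: an H x W x 3 uint8 array, i.e. one decoded video frame."""
    x = preprocess(frame).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        return resnet(x).squeeze(0)             # frame semantic feature, shape (2048,)
```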
In some embodiments, the server may perform feature fusion on the frame semantic features of each video frame of the video segment by using the trained feature fusion network to obtain the video content features of the candidate video segment, i.e., obtain the candidate video content features. The feature fusion network may be an artificial neural network including, but not limited to, any one of a Long short-term memory (LSTM) network or a bidirectional LSTM network. Among them, LSTM is an RNN (Recurrent Neural Network). The bi-directional LSTM network may be denoted as BLSTM.
In some embodiments, the server may arrange the frame semantic features according to the order of the video frames in the video segment to obtain a frame semantic feature sequence, and the earlier the order of the video frames in the video segment is, the earlier the order of the frame semantic features of the video frames in the frame semantic feature sequence is. The server may perform feature fusion on each frame semantic feature based on the order (i.e., position) of the frame semantic feature in the frame semantic feature sequence to obtain a candidate video content feature. Taking the feature fusion network as a bidirectional LSTM network as an example, the architecture diagram of the feature fusion network is shown in fig. 8, the frame semantic feature sequence is input data of the feature fusion network, the video content feature is output data of the feature fusion network, and the frame semantic feature sequence is input into the feature fusion network for feature fusion, and the video content feature is obtained by fusion.
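A minimal sketch of such a bidirectional LSTM fusion network is given below; the feature sizes and the mean-pooling readout over the output sequence are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FrameFeatureFusion(nn.Module):
    """Bidirectional LSTM that fuses a frame semantic feature sequence
    into one video content feature."""
    def __init__(self, frame_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.blstm = nn.LSTM(frame_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, frame_dim), ordered as in the video segment.
        out, _ = self.blstm(frame_feats)         # (batch, num_frames, 2 * hidden)
        return out.mean(dim=1)                   # (batch, 2 * hidden) video content feature

fusion = FrameFeatureFusion()
video_content_feature = fusion(torch.randn(1, 16, 2048))   # 16 frames of one candidate segment
```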
And step 208, determining the video segment matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video segment.
The matching degree between the text semantic features and the candidate video content features is used for reflecting the similarity degree between the text semantic features and the candidate video content features, and the greater the matching degree is, the more similar the text semantic features and the candidate video content features are. The target video clip refers to a video clip matched with the content description text information. For example, if the content description text information is "kindergarten with respect to the other party", the video clip matching the content description text information may be a video embodying "red jacket", "antique", and "baryor". The target video segment may be one or more.
Specifically, there may be a plurality of candidate video content features. For each candidate video content feature, the server may determine the matching degree between the text semantic features and that candidate video content feature, so as to obtain the matching degree between each candidate video content feature and the text semantic features; determine, from among the candidate video content features, the video content features whose matching degree satisfies a greater-matching-degree condition to obtain the target video content features; and determine, based on the video segments corresponding to the target video content features, the video segment matching the content description text information, so as to obtain the target video segment. The greater-matching-degree condition may include at least one of the matching degree being the maximum or the matching degree being greater than a matching degree threshold. The matching degree threshold may be preset or set as required. For example, the server may determine the candidate video content feature with the largest matching degree as the target video content feature.
In some embodiments, the server may calculate a similarity between the text semantic features and the candidate video content features, determine a matching degree between the text semantic features and the candidate video content features based on the calculated similarity, the matching degree having a positive correlation with the similarity. For example, the server may determine the similarity as a degree of match. Wherein the similarity may be a cosine similarity, for example. The positive correlation refers to: under the condition that other conditions are not changed, the changing directions of the two variables are the same, and when one variable changes from large to small, the other variable also changes from large to small. It is understood that a positive correlation herein means that the direction of change is consistent, but does not require that when one variable changes at all, another variable must also change. For example, it may be set that the variable b is 100 when the variable a is 10 to 20, and the variable b is 120 when the variable a is 20 to 30. Thus, the change directions of a and b are both such that when a is larger, b is also larger. But b may be unchanged in the range of 10 to 20 a.
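For example, using cosine similarity directly as the matching degree could be written as follows (assuming the text and video features have already been mapped to the same dimensionality):

```python
import torch
import torch.nn.functional as F

def matching_degree(text_feat: torch.Tensor, video_feat: torch.Tensor) -> float:
    """Cosine similarity used directly as the matching degree, one of the
    options described above; both features are 1-D vectors of equal length."""
    return F.cosine_similarity(text_feat.unsqueeze(0), video_feat.unsqueeze(0)).item()
```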
In some embodiments, the server may divide the content description text information into a plurality of text information segments, where the text semantic features include segment semantic features of each text information segment, and for the segment semantic features of each text information segment, the server may determine a matching degree between each candidate video content feature and the segment semantic feature, and, based on the matching degree between the candidate video content feature and the segment semantic feature, filter a video segment matching the text information segment of the segment semantic feature from among the candidate video segments, for example, a video segment corresponding to a candidate video content feature with a highest matching degree, or a video segment corresponding to a candidate video content feature with a matching degree greater than a matching degree threshold may be determined as a video segment matching the text information segment of the segment semantic feature, and determining the target video clips based on the video clips respectively matched with the text information clips. The server may determine one, more, or all of the video clips respectively matching the respective text information clips as the target video clip. For example, the content description text information is divided into 2 text information segments, which are a text information segment 1 and a text information segment 2, respectively, a video segment matched with the text information segment 1 is a video segment 1, and a video segment matched with the text information segment 2 is a video segment 2, then at least one of the video segment 1 or the video segment 2 may be determined as a target video segment, for example, both the video segment 1 and the video segment 2 may be determined as target video segments, that is, 2 target video segments, which are the video segment 1 and the video segment 2, respectively.
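A compact sketch of this per-segment matching, assuming cosine similarity as the matching degree and the "largest matching degree" rule described above:

```python
import torch
import torch.nn.functional as F

def match_segments(segment_feats: torch.Tensor, video_feats: torch.Tensor) -> list:
    """For each text information segment, pick the candidate video segment with
    the highest matching degree.

    segment_feats: (num_text_segments, d); video_feats: (num_candidates, d).
    Returns one candidate index per text information segment.
    """
    sims = F.cosine_similarity(segment_feats.unsqueeze(1), video_feats.unsqueeze(0), dim=-1)
    return sims.argmax(dim=1).tolist()

# Example: 2 text segments and 5 candidate video segments with 512-d features.
indices = match_segments(torch.randn(2, 512), torch.randn(5, 512))
```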
In some embodiments, the server may perform feature fusion on the segment semantic features of each text information segment, use the fused features as text semantic features, and determine a video segment matched with the content description text information based on a matching degree between the text semantic features and candidate video content features to obtain a target video segment.
Step 210, synthesizing an object video corresponding to the target object based on the content description text information and the target video segment; the object video comprises the target picture content in the target video segment, and when the target picture content in the object video is played, the content description text information is correspondingly displayed.
Wherein the object video is a video synthesized based on the content description text information and the target video segment. The target picture content refers to picture content in the target video clip.
Specifically, the content description text information may be divided into a plurality of text information segments, a plurality of target video segments may be provided, and each target video segment is matched with one text information segment, that is, the target video segments are matched with the text information segments one by one. In the process of synthesizing the object video, the server may splice the target video segments, and determine the text information segment matched with the target video segment as the text displayed simultaneously by the target video segment, for example, the text information segment may be displayed in the form of a subtitle or a bullet screen.
In some embodiments, after the server synthesizes the object video, the object video may be sent to the terminal, and the terminal may play the object video, and in the process of playing the object video, while playing the picture in the target video clip, the text information clip matched with the target video clip is displayed.
In some embodiments, the server may convert the content description text information into audio data, and play the audio data of the content description text information while the pictures of the target video segments are played in the object video. Alternatively, the server may convert each text information segment into an audio segment, and play the audio segment of a text information segment while that text information segment is displayed.
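As one possible, purely illustrative realization of the splicing, subtitle display, and optional narration audio described above, the following sketch uses moviepy 1.x; the subtitle styling, the output settings, and the external text-to-speech step are all assumptions:

```python
from moviepy.editor import (VideoFileClip, TextClip, CompositeVideoClip,
                            AudioFileClip, concatenate_videoclips)

def compose_object_video(pairs, audio_path=None, out_path="object_video.mp4"):
    """pairs: list of (video_segment_path, text_information_segment), already
    ordered as in the content description text information."""
    clips = []
    for video_path, text in pairs:
        clip = VideoFileClip(video_path)
        subtitle = (TextClip(text, fontsize=40, color="white")   # requires ImageMagick
                    .set_duration(clip.duration)
                    .set_position(("center", "bottom")))
        clips.append(CompositeVideoClip([clip, subtitle]))       # text shown with its segment
    video = concatenate_videoclips(clips)                        # splice in text-segment order
    if audio_path:                                               # e.g. TTS audio of the description
        video = video.set_audio(AudioFileClip(audio_path))
    video.write_videofile(out_path)
```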
For example, take the case in which the target object is a novel and the content description text information is the brief introduction information of the novel. Fig. 9 shows an effect diagram of the synthesized object video, namely the object video generated for the novel "wedding bar", whose brief introduction information includes "becoming a happy family by worship with the other party". The picture shown in fig. 9 includes a man and a woman, and the woman wears a coat; the content of the picture matches the content expressed by "becoming a happy family by worship with the other party". If the brief introduction information also includes "spring flowers bloom", then as the video is played the picture is switched to a picture matching "spring flowers bloom"; for example, the picture may include flowers, and "spring flowers bloom" is displayed.
The video synthesis method includes the steps of obtaining content description text information of a target object, conducting semantic feature extraction on the content description text information to obtain text semantic features, obtaining candidate video content features, determining video content features matched with the text semantic features, obtaining video segments corresponding to the matched video content features to obtain target video segments, synthesizing object videos corresponding to the target object based on the content description text information and the target video segments, wherein the object videos comprise target picture contents in the target video segments, and when the target picture contents in the object videos are played, the content description text information is correspondingly displayed. Because the content description text information is the text information describing the content expressed by the target object, and the candidate video content features are obtained by extracting the semantic features of the content expressed in the candidate video clips, when the candidate video content features are matched with the text semantic features, the content description text information is matched with the content expressed by the video clips, namely the similarity is higher, so that the text information and the video which are matched with each other are automatically determined, the efficiency of screening the multimedia data is improved, and the efficiency of processing the multimedia data is improved. In addition, the video synthesis method realizes the automatic generation of the object video of the target object based on the content description text information of the target object and the video clip, thereby realizing the scheme of automatically generating the video for the object and improving the efficiency of generating the video.
In some embodiments, the content description textual information comprises a plurality of pieces of textual information; the semantic feature extraction of the content description text information to obtain text semantic features comprises the following steps: extracting semantic features of each text information fragment to obtain fragment semantic features of each text information fragment; and determining each segment semantic feature as a text semantic feature.
Specifically, the server may segment the content description text information to obtain a plurality of text information segments, and for each text information segment, the server may extract semantic features of the text information segment to obtain segment semantic features of the text information segment, and obtain text semantic features according to the segment semantic features respectively corresponding to each text information segment. For example, the server may determine, as the text semantic features, segment semantic features respectively corresponding to the text information segments, that is, the text semantic features include segment semantic features of the text information segments.
In the embodiment, the text semantic features are obtained according to the segment semantic features respectively corresponding to the text information segments, so that the obtained text semantic features comprise the segment semantic features of the text information segments, and the expressive force and the accuracy of the text semantic features are improved.
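As a minimal sketch of this segmentation-and-encoding step (not the patent's mandated implementation), the content description text information can be split on sentence punctuation and each resulting text information segment passed to an encoder; the function `encode_segment` below is a hypothetical stand-in for the text feature extraction network:

```python
import re
import numpy as np

def split_into_segments(description: str) -> list[str]:
    # Split the content description text on Chinese/English sentence punctuation;
    # each non-empty piece becomes one text information segment.
    pieces = re.split(r"[。！？!?.]\s*", description)
    return [p.strip() for p in pieces if p.strip()]

def encode_segment(segment: str, dim: int = 128) -> np.ndarray:
    # Hypothetical stand-in for the text feature extraction network: returns a
    # deterministic pseudo-random vector so the sketch runs end to end.
    rng = np.random.default_rng(len(segment))
    return rng.standard_normal(dim)

description = "A hypothetical novel synopsis. It has two sentences."
segments = split_into_segments(description)
text_semantic_features = [encode_segment(s) for s in segments]  # one feature per segment
print(len(segments), "segments ->", len(text_semantic_features), "segment semantic features")
```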
In some embodiments, the candidate video content features are plural; determining a video clip matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video clip, wherein the step of obtaining the target video clip comprises the following steps: respectively determining the matching degree between each fragment semantic feature and each candidate video content feature; and determining the video clips respectively matched with the text information clips based on the matching degree between each clip semantic feature and each candidate video content feature to obtain the target video clip.
The text semantic features comprise the segment semantic features of the text information segments. There may be one or more video segments matching a text information segment.
Specifically, the server may determine a matching degree between a segment semantic feature and a candidate video content feature, referred to as a feature matching degree. For each segment semantic feature, the server screens out, from the candidate video content features, the video content features matched with that segment semantic feature based on the feature matching degrees between the segment semantic feature and the candidate video content features. For example, the server may rank the candidate video content features in descending order of feature matching degree to obtain a video content feature sequence, in which a larger feature matching degree means an earlier position. The server may then take the video content features ranked before a ranking threshold as the video content features matched with the segment semantic feature; the ranking threshold may be preset or set as required, for example, the second or third position. Alternatively, the server may determine the maximum feature matching degree and take the video content feature corresponding to it as the video content feature matched with the segment semantic feature.
In some embodiments, the server may obtain the matching content feature corresponding to each segment semantic feature, where the matching content feature of a segment semantic feature is the video content feature matched with it, and the server may obtain the target video segments matched with the text information segments based on the video segments corresponding to the matching content features. For example, the server may take the video segment corresponding to each matching content feature as a target video segment, or screen one or more of those video segments as target video segments. The screening may be random, or may be based on feature matching degree: for each matching content feature, the server may weight the matching degrees between that matching content feature and the segment semantic features and take the result as the weighted matching degree of that matching content feature, thereby obtaining a weighted matching degree for each matching content feature, and then screen the target video segments based on these weighted matching degrees, for example by taking the video segment of the matching content feature with the maximum weighted matching degree as the target video segment, or by taking the video segments of the matching content features whose weighted matching degree is larger than a weighted matching degree threshold as the target video segments. The weighted matching degree threshold may be preset or set as required.
For example, there are 2 text information segments, which are respectively a text information segment 1 and a text information segment 2, the segment semantic feature of the text information segment 1 is a1, the matching content feature corresponding to a1 is B1, the segment semantic feature of the text information segment 2 is a2, and the matching content feature corresponding to a2 is B2, then for B1, the matching degree between B1 and a1 and the matching degree between B1 and a2 are weighted to obtain a weighted matching degree C1 of B1, for B2, the matching degree between B2 and a1 and the matching degree between B2 and a2 are weighted to obtain a weighted matching degree C2 of B2, and the video segment corresponding to the larger one of C1 and C2 and matching content feature is determined as the target video segment.
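A small numeric sketch of this weighted-matching selection, mirroring the A1/A2/B1/B2 example above; the similarity values and the equal weights are invented for illustration:

```python
import numpy as np

# Rows are matching content features B1, B2; columns are segment semantic features A1, A2.
# The entries stand for feature matching degrees (e.g. cosine similarities) and are made up.
match = np.array([
    [0.82, 0.15],   # B1 vs (A1, A2)
    [0.10, 0.77],   # B2 vs (A1, A2)
])

weights = np.array([0.5, 0.5])      # illustrative equal weighting
weighted = match @ weights          # weighted matching degrees C1, C2

best = int(np.argmax(weighted))
print(f"C = {weighted}; choose B{best + 1}'s video segment as the target video segment")
```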
In the embodiment, the video segments respectively matched with the text information segments are determined based on the matching degree between the semantic features of each segment and the candidate video content features to obtain the target video segments, so that the semantic features of the text information segments are fully considered in the process of determining the target video segments, and the accuracy of the target video segments is improved.
In some embodiments, the target video clips are multiple, and each target video clip is matched with one text information clip; determining video clips respectively matched with the text information clips based on the matching degree between each clip semantic feature and each candidate video content feature, and obtaining the target video clip comprises the following steps: for each text information segment, determining the matching degree of the segment semantic features of the text information segment and each candidate video content feature; based on the matching degree of the segment semantic features of the text information segments and each video content feature, screening the candidate video content features to obtain the video content features matched with the segment semantic features of the text information segments; and acquiring a video segment corresponding to the matched video content characteristics to obtain a target video segment matched with the text information segment.
Specifically, each text information segment matches one target video segment. For each text information segment, the server may calculate the matching degree between the segment semantic feature of the text information segment and each candidate video content feature. For example, if there are 2 candidate video content features, B1 and B2, then for the segment semantic feature A1 the server calculates the matching degree between A1 and B1 and the matching degree between A1 and B2 to obtain the feature matching degrees of A1. The server filters the candidate video content features according to the feature matching degrees to obtain the video content feature matched with each segment semantic feature, and these matched video content features are used to obtain the target video segments.
In this embodiment, the video segments corresponding to the matched video content features are obtained, and the target video segments matched with the text information segments are obtained, so that the video segments matched with the text information segments are determined as the target video segments, and the matching degree of the semantic features of the target video segments and the text is improved.
In some embodiments, determining a degree of matching of the segment semantic features of the text information segment with each candidate video content feature comprises: acquiring a text information segment adjacent to the text information segment to obtain an adjacent text information segment of the text information segment; calculating the difference between the segment semantic features of the adjacent text information segments and the segment semantic features of the text information segments to obtain feature difference information; performing feature fusion on the feature difference information and the segment semantic features to obtain fused semantic features; and determining the matching degree of the segment semantic features of the text information segments and each candidate video content feature based on the fusion semantic features.
Wherein adjacent means that positions in the content description text information are adjacent, and the adjacent text information piece may include at least one of a preceding text information piece or a succeeding text information piece. The preceding text information segment refers to a text information segment that is located before and adjacent to the text information segment. The following text information segment refers to a text information segment that is located after and adjacent to the text information segment. The feature difference information refers to a difference between segment semantic features.
Specifically, the server may arrange the text information segments according to positions of the text information segments in the content description text information to obtain a text information segment sequence, and the earlier the positions of the text information segments in the content description text information are, the earlier the order of the text information segments in the text information segment sequence is. For each text information segment, the server may obtain, from the sequence of text information segments, a text information segment arranged before and adjacent to the text information segment as a preceding text information segment of the text information segment. The server may obtain, from the sequence of text information segments, a text information segment arranged after and adjacent to the text information segment as a succeeding text information segment of the text information segment, and determine at least one of a preceding text information segment or a succeeding text information segment as an adjacent text information segment of the text information segment.
In some embodiments, the feature fusion may include at least one of feature concatenation or feature addition, e.g., the server may feature concatenate the feature difference information with the segment semantic features, taking the result of the concatenation as the fused semantic features of the segment semantic features.
In some embodiments, the server may calculate a matching degree between the fused semantic features and the candidate video content features, and determine the matching degree between the fused semantic features and the candidate video content features as a matching degree between the segment semantic features corresponding to the fused semantic features and the candidate video content features.
In the embodiment, the matching degree of the segment semantic features of the text information segments and each candidate video content feature is determined based on the fusion semantic features, the fusion semantic features are determined based on the segment semantic features and the segment semantic features of the adjacent text information segments, so that the fusion semantic features cover the information of the adjacent text information segments, the obtained matching degree not only considers the features of the text information segments but also refers to the features of the adjacent text information segments, and the calculated matching degree is more reasonable.
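A sketch of this adjacent-segment fusion under two stated assumptions: the preceding segment is used as the adjacent text information segment (the following one for the first segment), and fusion is plain feature concatenation; cosine similarity stands in for the matching degree:

```python
import numpy as np

def fused_semantic_feature(seg_feats, i):
    # seg_feats: list of segment semantic features in text order; i: index of the segment.
    neighbor = seg_feats[i - 1] if i > 0 else seg_feats[i + 1]
    difference = neighbor - seg_feats[i]               # feature difference information
    return np.concatenate([seg_feats[i], difference])  # feature fusion by concatenation

def matching_degree(fused, video_feat):
    # Cosine similarity as the matching degree; video_feat must share the fused dimension.
    denom = np.linalg.norm(fused) * np.linalg.norm(video_feat) + 1e-8
    return float(fused @ video_feat / denom)
```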
In some embodiments, extracting semantic features of the text information segment to obtain segment semantic features of the text information segment includes: performing word segmentation processing on the text information fragments to obtain a plurality of word fragments; extracting semantic features of the word segments to obtain the word semantic features of the word segments; and performing feature fusion on the word semantic features of each word segment to obtain the segment semantic features of the text information segments.
A word segment is a segment formed by words; each word may comprise one or more Chinese characters, or one or more English words. Each word segment may include one or more words. A word semantic feature is a feature obtained by extracting semantic features from a word segment.
Specifically, the server may obtain a word bank, perform word segmentation processing on the text information segment based on the word bank to obtain a plurality of word segments, where the word bank includes a plurality of words, and the server may divide a part of the text information segment, which is the same as a word in the word bank, into one word segment, so as to obtain a plurality of word segments.
In some embodiments, the server may input the word segments into the text feature extraction network to extract semantic features, so as to obtain word semantic features of the word segments.
In some embodiments, the server obtains the word semantic features of the word segments, performs a statistical operation on them, and takes the result of the statistical operation as the segment semantic feature of the text information segment. The statistical operation includes, but is not limited to, any one of a mean operation, a maximum-value operation, or a weighted calculation. For example, a weighted calculation may be performed on the word semantic features, and the result of the weighted calculation taken as the segment semantic feature of the text information segment.
In some embodiments, the server may obtain the trained feature fusion network, and input the term semantic features of each term segment into the feature fusion network for feature fusion to obtain the segment semantic features of the text information segment. For example, the server may arrange the word sense features of the word segments according to the positions of the word segments in the text information segment to obtain a word sense feature sequence, and the earlier the positions of the word segments in the text information segment are, the earlier the word sense features of the word segments are ordered in the word sense feature sequence. And inputting the word meaning characteristic sequence into a characteristic fusion network for characteristic fusion to obtain the segment semantic characteristics of the text information segment.
In the embodiment, word segmentation processing is performed on the text information segment to obtain a plurality of word segments, semantic features of the word segments are extracted for each word segment to obtain word semantic features of the word segments, feature fusion is performed on the word semantic features of each word segment to obtain segment semantic features of the text information segment, each word in the text information segment is fully considered in the process of determining the segment semantic features, and accuracy of the segment semantic features is improved.
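An illustrative sketch of this word-level pipeline, assuming jieba for word segmentation, a placeholder word-feature extractor, and mean pooling as the statistical fusion (a trained fusion network could replace the mean):

```python
import numpy as np
import jieba  # one common Chinese word-segmentation library; any tokenizer would do

def word_semantic_feature(word, dim=64):
    # Placeholder for the word-level feature extraction network.
    rng = np.random.default_rng(len(word))
    return rng.standard_normal(dim)

def segment_semantic_feature(text_segment):
    word_pieces = jieba.lcut(text_segment)                        # word segmentation
    word_feats = [word_semantic_feature(w) for w in word_pieces]  # word semantic features
    return np.mean(word_feats, axis=0)                            # statistical fusion
```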
In some embodiments, the candidate video content features are obtained by: for each video frame in the candidate video clips, semantic feature extraction is carried out on the video frame to obtain frame semantic features; and performing feature fusion on each frame semantic feature to obtain candidate video content features.
Specifically, the feature fusion includes any one of feature concatenation or feature addition. For example, the server may perform a weighted calculation on each frame semantic feature to obtain candidate video content features.
In some embodiments, the server may arrange the frame semantic features according to the order of the video frames in the video segment to obtain a frame semantic feature sequence, and the earlier the order of the video frames in the video segment is, the earlier the order of the frame semantic features of the video frames in the frame semantic feature sequence is. The server may perform feature fusion on each frame semantic feature based on the order (i.e., position) of the frame semantic feature in the frame semantic feature sequence to obtain a candidate video content feature. For example, the server may determine a weight of each frame semantic feature based on the sequence of the frame semantic features in the frame semantic feature sequence, and perform weighted calculation on each frame semantic feature by using the weight to obtain candidate video content features.
In the embodiment, semantic feature extraction is performed on each video frame in the candidate video clips to obtain frame semantic features, feature fusion is performed on each frame semantic feature to obtain candidate video content features, each video frame is fully considered in the process of determining the candidate video content features, and accuracy of the candidate video content features is improved.
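A sketch of fusing per-frame features into a candidate video content feature; the position-based weights below (later frames weighted slightly higher) are an illustrative choice, and mean pooling or a recurrent network would fit the same scheme:

```python
import numpy as np

def video_content_feature(frame_feats):
    # frame_feats: array of shape [num_frames, dim], one frame semantic feature per row,
    # already ordered by the frames' positions in the video segment.
    num_frames = frame_feats.shape[0]
    weights = np.arange(1, num_frames + 1, dtype=np.float64)
    weights /= weights.sum()                     # position-based weights summing to 1
    return weights @ frame_feats                 # weighted fusion over frames
```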
In some embodiments, the target video segment is multiple, the content description text information comprises multiple text information segments, and each target video segment is matched with one text information segment; synthesizing an object video corresponding to the target object based on the content description text information and the target video clip comprises: and sequentially splicing the target video clips matched with the text information clips according to the sequence of the text information clips in the content description text information, and determining the display time of the matched text information clips according to the playing time of each target video clip so as to synthesize the object video corresponding to the target object.
The playing time refers to the time for playing the target video clip, and the displaying time refers to the time for displaying the text information clip. The more forward the position of the text information segment in the content description text information is, the more forward the display time of the text information segment is, and the more forward the playing time of the target video segment matched with the text information segment is. The display time of the text information segment is the same as the playing time of the target video segment matched with the text information segment.
Specifically, the server may sequentially splice target video segments matched with the text information segments according to the sequence of the text information segments in the content description text information, and determine the presentation time of the matched text information segments according to the playing time of each target video segment to synthesize the object video corresponding to the target object. For example, the server may synthesize the target video with the text information piece as a subtitle of the target video piece matched therewith. When the object video is played, the text information segment is displayed in the form of subtitles. The server can also take the text information segment as a bullet screen of the target video segment matched with the text information segment to synthesize the object video. And when the object video is played, displaying the text information fragment in a bullet screen mode.
In some embodiments, the server may compose the object video using a video composition tool, for example ffmpeg. ffmpeg is a set of open-source computer programs that can be used to record and convert digital audio and video and to turn them into streams, providing a complete solution for recording, converting and streaming audio and video. It contains the advanced audio/video codec library libavcodec, much of whose code was developed from scratch to ensure high portability and codec quality.
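A hedged sketch of this splicing step using ffmpeg from Python: it writes a concat list and an SRT file whose display times follow the playing times of the spliced segments, then burns the subtitles in. The file names, codecs and the assumption that all clips share compatible encoding parameters are illustrative choices, not requirements of the method:

```python
import subprocess

def srt_time(seconds):
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_object_video(clip_paths, text_segments, durations, output="object_video.mp4"):
    # clip_paths: target video segments in the order of their matched text information
    # segments; durations: each segment's playing time in seconds; text_segments: the
    # matched text information segments, shown as subtitles over their segments.
    with open("segments.txt", "w", encoding="utf-8") as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
    start = 0.0
    with open("captions.srt", "w", encoding="utf-8") as f:
        for idx, (text, dur) in enumerate(zip(text_segments, durations), start=1):
            f.write(f"{idx}\n{srt_time(start)} --> {srt_time(start + dur)}\n{text}\n\n")
            start += dur
    # The concat demuxer assumes the clips share codec parameters; the subtitles
    # filter needs an ffmpeg build with libass support.
    subprocess.run([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "segments.txt",
        "-vf", "subtitles=captions.srt", "-c:v", "libx264", "-c:a", "aac", output,
    ], check=True)
```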
In the embodiment, the target video segments matched with the text information segments are sequentially spliced according to the sequence of the text information segments in the content description text information, the display time of the matched text information segments is determined according to the playing time of each target video segment, and the target video corresponding to the target object is synthesized, so that the matched text and the video segments are automatically synthesized, and the efficiency of synthesizing the multimedia data is improved.
In some embodiments, the target object is a target text object; the content description text information is abstract description information of the content described by the target text object; the object video is a video for introducing a target text object.
The target text object refers to an object in text form, including but not limited to at least one of a novel, a poem, a technical article and the like. For example, the target text object is a novel. The content described by the target text object includes, but is not limited to, at least one of scene-related content, story-related content, character-related content, location-related content, and the like. The abstract description information is information summarizing the content described by the target text object, and may include at least one of the characters, places, story lines, and the time or climate in which the story takes place, among other elements of the target text object. The object video may be a video introducing the target text object; for example, when the target text object is a novel, the object video may be a video promoting or introducing the novel, and may also be referred to as a novel promotion video.
In some embodiments, the server may obtain a candidate audio set, where the candidate audio set includes a plurality of candidate audios, and the server may select one or more candidate audios from the candidate audio set, and synthesize an object video corresponding to the target object based on the selected candidate audios, the content description text information, and the target video segment. And when the object video corresponding to the target object is played, playing the selected candidate audio. The server may select the candidate audio according to the popularity of the candidate audio, for example, may select the candidate audio with the largest popularity.
In this embodiment, the target object is a target text object, the content description text information is abstract description information of content described by the target text object, and the object video is a video for introducing the target text object, so that the video for introducing the target text object is automatically generated, and the processing efficiency of the multimedia data is improved.
In some embodiments, the step of synthesizing an object video corresponding to the target object based on the content description text information and the target video segment, wherein the object video includes the target picture content of the target video segment and the content description text information is correspondingly displayed when the target picture content in the object video is played, includes: converting the content description text information into audio data; and synthesizing the object video corresponding to the target object based on the content description text information, the target video segment and the audio data, so that when the target picture content of the object video is played, the content description text information is correspondingly displayed and the audio data is correspondingly played.
Specifically, the server may perform voice conversion on the content description text information, and convert the content description text information into audio data. The server may process the content description text information, the target video clip, and the audio data using a video synthesis tool, and synthesize an object video corresponding to the target object.
In some embodiments, each target video segment is matched with one text information segment, the server may perform voice conversion on each text information segment to obtain an audio segment corresponding to each text information segment, and process each text information segment, each target video segment, and each audio segment by using a video synthesis tool to synthesize an object video corresponding to a target object. For example, in the process of synthesizing the video, for each text information segment, the server may set the presentation time of the text information segment, the playing time of the audio segment of the text information segment, and the playing time of the target video segment matched with the text information segment to be the same time.
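A sketch of the per-segment speech conversion, using gTTS purely as an example backend (any text-to-speech service fits the scheme; the language code is an assumption):

```python
from gtts import gTTS  # example text-to-speech backend; any TTS engine could be used

def synthesize_audio_segments(text_segments, lang="zh-CN"):
    # Converts each text information segment into an audio segment file. When the
    # object video is composed, each audio segment is given the same playing time
    # as the target video segment and subtitle matched with its text segment.
    paths = []
    for i, segment in enumerate(text_segments):
        path = f"segment_{i}.mp3"
        gTTS(text=segment, lang=lang).save(path)
        paths.append(path)
    return paths
```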
In some embodiments, the server may send the synthesized object video to the terminal, and the terminal may play the object video; while the pictures of a target video segment are played, the terminal displays the text information segment matched with that target video segment and plays the audio segment of that text information segment.
In this embodiment, the content description text information is converted into audio data, and an object video corresponding to the object is synthesized based on the content description text information, the target video segment, and the audio data, so that the object is introduced in a text, video, and audio manner, and the expression capability of the object video is improved.
In some embodiments, the step of synthesizing an object video corresponding to the target object based on the content description text information and the target video segment, wherein the object video includes the target picture content of the target video segment and the content description text information is correspondingly displayed when the target picture content in the object video is played, includes: acquiring candidate audio and determining the audio style of the candidate audio; determining the content style of the content description text information, and taking the candidate audio whose audio style matches the content style of the content description text information as target audio; and synthesizing the object video corresponding to the target object based on the content description text information, the target video segment and the target audio, so that when the object video is played, the content description text information is correspondingly displayed and the target audio is correspondingly played.
The candidate audio may be pre-stored in the server, or may be acquired by the server from another device. The audio style is used to characterize the type of emotion expressed by the audio. The content style is used for representing the emotion type expressed by the content description text information. The emotion type includes but is not limited to at least one of calmness, excitement, cheerfulness and the like, and the target audio refers to candidate audio with an audio style consistent with the content style. The candidate audio may be plural.
Specifically, the server may compare the content style with the audio style of each candidate audio, and when they are consistent, determine that candidate audio as the target audio. The server may use the target audio as background music in the object video.
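A minimal sketch of this style comparison, with hypothetical style labels such as "cheerful" and "calm"; how the styles themselves are predicted is outside this snippet:

```python
def pick_target_audio(content_style, candidate_audios):
    # candidate_audios maps an audio file path to its audio style label; the first
    # candidate whose audio style is consistent with the content style is returned.
    for path, audio_style in candidate_audios.items():
        if audio_style == content_style:
            return path
    return None  # no candidate audio matches the content style

target_audio = pick_target_audio("cheerful", {"bgm_a.mp3": "calm", "bgm_b.mp3": "cheerful"})
```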
In this embodiment, the candidate audio whose audio style matches with the content style of the content description text information is determined as the target audio, and the object video corresponding to the target object is synthesized based on the content description text information, the target video segment, and the target audio, so that the emotion to be expressed by the content description text information is enhanced by the audio, and the expression capability of the object video is improved.
The present application also provides an application scenario to which the above video synthesis method is applied. The application scenario is one of generating a novel promotion video; specifically, as shown in fig. 10, the video synthesis method is applied in this scenario as follows:
Step 1002: the terminal sends a video synthesis request to the server, where the request carries the abstract description information of the novel.
Step 1004: the server segments the abstract description information to obtain a plurality of text information segments, and performs semantic feature extraction on each text information segment to obtain the segment semantic feature corresponding to each text information segment.
Fig. 11 shows a schematic diagram of video composition. In the figure, the text feature generation network may also be referred to as the text embedding network, and the video feature generation network may also be referred to as the video embedding network. The video generation module may also be referred to as the novel video generation module. In fig. 11, the abstract description information of the novel is segmented into n text information segments, and the server may input them into the text feature generation network to generate the segment semantic features respectively corresponding to the text information segments, that is, n segment semantic features, where segment semantic feature i is the segment semantic feature of text information segment i, 1 ≤ i ≤ n.
The text feature generation network may include a text feature extraction network and a text feature fusion network. The text feature extraction network may be, for example, a BERT network, and the text feature fusion network may be, for example, a bidirectional LSTM network. The server may perform word segmentation on a text information segment to obtain a plurality of word segments, input the word segments into the text feature extraction network to extract their word semantic features, arrange the word semantic features of the word segments into a word semantic feature sequence, and input that sequence into the text feature fusion network for fusion to obtain the segment semantic feature of the text information segment. For example, the embedding sequence output by the BERT network can be fed into the bidirectional LSTM, and the final hidden state of the forward direction and the final hidden state of the backward direction are then extracted and spliced into the final text embedding vector, i.e. the segment semantic feature.
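A sketch of this text feature generation network in PyTorch, assuming the Hugging Face `bert-base-chinese` checkpoint (the patent only names "a BERT network") and a single-layer bidirectional LSTM; in the described system these components are trained jointly with the video branch and the matching module rather than used off the shelf:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # checkpoint name assumed
bert = BertModel.from_pretrained("bert-base-chinese")
bilstm = nn.LSTM(input_size=768, hidden_size=256, bidirectional=True, batch_first=True)

def bert_bilstm_segment_feature(text_segment):
    with torch.no_grad():
        inputs = tokenizer(text_segment, return_tensors="pt")
        token_embeddings = bert(**inputs).last_hidden_state      # [1, seq_len, 768]
        _, (h_n, _) = bilstm(token_embeddings)                   # h_n: [2, 1, 256]
    # Splice the forward and backward final hidden states into the text embedding vector.
    return torch.cat([h_n[0], h_n[1]], dim=-1).squeeze(0)        # [512]
```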
Step 1006: the server determines, from the candidate video content feature library, the video content features respectively matched with the segment semantic features.
The candidate video content features may be generated using a video feature generation network, which may be a Resnet50 network model. For example, through the Resnet network structure, an embedding vector can be generated for each picture in a video segment from the video library, so that the video segment is mapped into an embedding vector sequence. The embedding vector sequence is then input into a bidirectional LSTM network to fuse long-range temporal information and generate a single embedding vector; this embedding vector is the video content feature. The generated embedding vectors are stored in the video content feature library, and the video segments in the video library correspond one-to-one to the video content features in the video content feature library. Matching is performed with embeddings: by extracting the embeddings of the text and of the videos, an embedding library of video segments is built, so that each video segment is discretized and has its own embedding id (identifier), and the video embedding library can be expanded automatically in production. The video library may be stored in a database of the server.
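A corresponding sketch of the video branch: a Resnet50 backbone produces one embedding per frame, a bidirectional LSTM fuses them into a single video content feature, and the result is stored in an embedding library keyed by an embedding id. Loading of trained weights, frame decoding and normalization are omitted and assumed to happen elsewhere:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)   # a trained checkpoint would be loaded in practice
backbone.fc = nn.Identity()         # keep the 2048-d global feature of each frame
backbone.eval()
frame_lstm = nn.LSTM(input_size=2048, hidden_size=256, bidirectional=True, batch_first=True)

def video_embedding(frames):
    # frames: tensor [num_frames, 3, 224, 224], already decoded and normalized.
    with torch.no_grad():
        frame_feats = backbone(frames).unsqueeze(0)           # [1, num_frames, 2048]
        _, (h_n, _) = frame_lstm(frame_feats)                 # [2, 1, 256]
    return torch.cat([h_n[0], h_n[1]], dim=-1).squeeze(0)     # [512] video content feature

embedding_library = {}  # embedding id -> video content feature, one entry per video segment

def add_to_library(embedding_id, frames):
    embedding_library[embedding_id] = video_embedding(frames)
```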
The Resnet50 network uses a CNN structure to build the overall model. To counter the phenomenon that, as the network becomes deeper, gradients vanish or weight updates decay with the number of layers, the network uses a residual connection structure. This mechanism lets gradients propagate through skip connections and directly connects the input and output of each small module, so that the output of a network module can directly reflect the original feature properties of its input.
The server may use a matching module to calculate the matching degree between a segment semantic feature and a candidate video content feature and determine the matched video content feature according to the matching degree. The matching module is a network for matching text with video, used to calculate the matching degree between text semantic features and video content features. As shown in fig. 12, the matching module includes fully connected (FC) layers. The segment semantic feature and the candidate video content feature are each passed through a fully connected layer for dimension conversion, producing a dimension-converted segment semantic feature and a dimension-converted video content feature; the segment semantic feature and the candidate video content feature may each have their own fully connected layer, and the fully connected layers convert the two features to the same dimension. A dot product (inner product) of the dimension-converted features can then be computed, i.e. a cosine similarity (also called cosine distance or spatial cosine distance) is calculated, and the cosine similarity is used as the matching degree. The matching degree can be directly quantified as a probability between 0 and 1.
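A sketch of such a matching module: one fully connected layer per branch projects the two embeddings into a shared dimension, and their cosine similarity serves as the matching degree; the dimensions and the mapping of the similarity into [0, 1] are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingModule(nn.Module):
    def __init__(self, text_dim=512, video_dim=512, shared_dim=256):
        super().__init__()
        self.text_fc = nn.Linear(text_dim, shared_dim)    # dimension conversion for text
        self.video_fc = nn.Linear(video_dim, shared_dim)  # dimension conversion for video

    def forward(self, text_feat, video_feat):
        t = self.text_fc(text_feat)
        v = self.video_fc(video_feat)
        cosine = F.cosine_similarity(t, v, dim=-1)        # inner-product-based similarity
        return (cosine + 1) / 2                           # squashed to a 0-1 matching degree
```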
The text feature generation network, the video feature generation network and the matching module can be obtained through joint training, and the text feature extraction network, the video feature extraction network and the matching module can be networks or modules in a matching degree detection model.
Step 1008: for each text information segment, the server obtains, from the video library, the video segment corresponding to the video content feature matched with the segment semantic feature of the text information segment, as the video segment matched with that text information segment.
The video content feature library comprises a plurality of candidate video content features. The video library includes the video segments corresponding to the respective candidate video content features, that is, the video library includes a plurality of candidate video segments. As shown in fig. 11, the server selects n video segments from the video library, where video segment i is the video segment matched with text information segment i.
Step 1010: the server synthesizes the novel promotion video based on each text information segment and the video segment matched with it, taking each text information segment as the subtitle of the video segment it matches.
As shown in fig. 11, each text information segment and the video segment matched with it are input into the video generation module, and the novel promotion video is generated. For example, the server may splice the video segments and attach each text information segment to its matched video segment as a subtitle, then resize the result into a portrait video, and finally generate the novel promotion video. As another example, the video segments may be spliced and mix-cut according to the text information segments and their matched video segments, and then fitted into a video template to generate the final novel promotion video.
Step 1012: the server returns the novel promotion video to the terminal.
Step 1014: the terminal plays the novel promotion video and, while playing the pictures of each video segment, displays the text information segment matched with that video segment.
In this video synthesis method, multimodal network matching is performed so that text is matched to video automatically, the matching speed is increased, and the production of novel videos is automated, improving the efficiency of producing novel promotion videos. A fully automatic text-to-video matching scheme is provided, saving cost and production time and allowing the whole pipeline to be industrialized. Matching text with video using a multimodal fusion model improves matching accuracy. By establishing an embedding library of video segments and a mapping between the embeddings and the video segments, each matching pass can search against the full set of videos in the library; such large-scale search matching greatly improves the matching degree between the text and the finally selected videos and improves the video generation effect.
Here, the multiple modalities can be multimedia data describing the same object, for example, video, pictures, voice, text and other information describing a particular object in the internet environment. Multimodal data can also refer to the same type of media data from different sensors, for example image data generated by different examination apparatuses in medical imaging, including but not limited to B-scan ultrasonography, computed tomography (CT), magnetic resonance, etc.; or the same object's data detected by different sensors in the internet-of-things context. Different modalities have different data structures, symbol systems and forms of representation.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict restriction on the order of execution, and the steps may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which need not be completed at the same time but may be performed at different times, and their execution order need not be sequential; they may be performed in turn or alternately with other steps, or with sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a video synthesis apparatus for implementing the video synthesis method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the video composition apparatus provided below can be referred to the limitations of the video composition method in the foregoing, and details are not described here.
In some embodiments, as shown in fig. 13, there is provided a video synthesizing apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: information acquisition module 1302, feature extraction module 1304, feature acquisition module 1306, video acquisition module 1308, and video composition module 1310, where:
an information obtaining module 1302, configured to obtain content description text information of a target object; the content description text information is text information describing the content expressed by the target object;
the feature extraction module 1304 is configured to perform semantic feature extraction on the content description text information to obtain text semantic features;
a feature obtaining module 1306, configured to obtain candidate video content features; the candidate video content features are obtained by extracting semantic features of the picture contents of the candidate video clips;
a video acquisition module 1308, configured to determine, based on the matching degree between the text semantic features and the candidate video content features, a video segment that matches the content description text information, so as to obtain a target video segment;
a video synthesizing module 1310 configured to synthesize an object video corresponding to the target object based on the content description text information and the target video segment; the target video comprises target picture content in the target video clip, and when the target picture content in the target video is played, content description text information is correspondingly displayed.
The video synthesis device obtains content description text information of a target object, performs semantic feature extraction on the content description text information to obtain text semantic features, obtains candidate video content features, determines video content features matched with the text semantic features, obtains video segments corresponding to the matched video content features to obtain target video segments, synthesizes target videos corresponding to the target object based on the content description text information and the target video segments, wherein the target videos comprise target picture contents in the target video segments, and correspondingly display the content description text information when the target picture contents in the target videos are played. Because the content description text information is the text information describing the content expressed by the target object, and the candidate video content features are obtained by extracting the semantic features of the content expressed in the candidate video clips, when the candidate video content features are matched with the text semantic features, the content description text information is matched with the content expressed by the video clips, namely the similarity is higher, so that the text information and the video which are matched with each other are automatically determined, the efficiency of screening the multimedia data is improved, and the efficiency of processing the multimedia data is improved.
In some embodiments, the content description textual information comprises a plurality of pieces of textual information; the feature extraction module is also used for extracting semantic features of the text information fragments to obtain the fragment semantic features of the text information fragments; and determining each segment semantic feature as a text semantic feature.
In some embodiments, the candidate video content features are plural; the video acquisition module is also used for respectively determining the matching degree between each fragment semantic feature and each candidate video content feature; and determining the video clips respectively matched with the text information clips based on the matching degree between each clip semantic feature and each candidate video content feature to obtain the target video clip.
In some embodiments, there are multiple target video segments, and each target video segment is matched with one text information segment; the video acquisition module is further used for determining, for each text information segment, the matching degree between the segment semantic feature of the text information segment and each candidate video content feature; screening out, from the candidate video content features, the video content feature matched with the segment semantic feature of the text information segment based on those matching degrees; and acquiring the video segment corresponding to the matched video content feature to obtain the target video segment matched with the text information segment.
In some embodiments, the video obtaining module is further configured to obtain a text information segment adjacent to the text information segment, to obtain an adjacent text information segment of the text information segment; calculating the difference between the segment semantic features of the adjacent text information segments and the segment semantic features of the text information segments to obtain feature difference information; performing feature fusion on the feature difference information and the segment semantic features to obtain fused semantic features; and determining the matching degree of the segment semantic features of the text information segments and each candidate video content feature based on the fusion semantic features.
In some embodiments, the feature extraction module is further configured to perform word segmentation processing on the text information segment to obtain a plurality of word segments; extracting semantic features of the word segments to obtain the word semantic features of the word segments; and performing feature fusion on the word semantic features of each word segment to obtain the segment semantic features of the text information segments.
In some embodiments, the apparatus further includes a feature generation module, where the candidate video content features are generated by the feature generation module; the feature generation module is used for extracting semantic features of each video frame in the candidate video clips to obtain frame semantic features; and performing feature fusion on each frame semantic feature to obtain candidate video content features.
In some embodiments, the target video segment is multiple, the content description text information comprises multiple text information segments, and each target video segment is matched with one text information segment; and the video synthesis module is also used for sequentially splicing the target video segments matched with the text information segments according to the sequence of the text information segments in the content description text information, and determining the display time of the matched text information segments according to the playing time of each target video segment so as to synthesize the object video corresponding to the target object.
In some embodiments, the target object is a target text object; the content description text information is abstract description information of the content described by the target text object; the object video is a video for introducing a target text object.
In some embodiments, the video composition module is further configured to convert the content description text information into audio data; synthesizing an object video corresponding to the target object based on the content description text information, the target video clip and the audio data; when the target picture content of the object video is played, the content description text information is correspondingly displayed and the audio data is correspondingly played.
In some embodiments, the video synthesis module is further configured to obtain candidate audio, and determine an audio style of the candidate audio; determining the content style of the content description text information, and determining candidate audio with the audio style matched with the content style of the content description text information as target audio; synthesizing an object video corresponding to the target object based on the content description text information, the target video clip and the target audio; when the object video is played, the content description text information and the corresponding playing target audio are correspondingly displayed.
For specific limitations of the video composition apparatus, reference may be made to the above limitations of the video composition method, which are not described herein again. The various modules in the video compositing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 14. The computer apparatus includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected by a system bus, and the communication interface, the display unit and the input device are connected by the input/output interface to the system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video compositing method. The display unit of the computer equipment is used for forming a visual and visible picture, and can be a display screen, a projection device or a virtual reality imaging device, the display screen can be a liquid crystal display screen or an electronic ink display screen, the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 15. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as content description text information, text semantic features, candidate video content features, candidate video segments, segment semantic features and the like. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a video compositing method.
It will be appreciated by those skilled in the art that the configurations shown in fig. 14 and 15 are block diagrams of only some of the configurations relevant to the present application, and do not constitute a limitation on the computing devices to which the present application may be applied, and a particular computing device may include more or less components than those shown, or some of the components may be combined, or have a different arrangement of components.
In some embodiments, there is further provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is also provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region. For example, the target object, the content description text information, the video clip and other data referred to in the present application are all obtained under sufficient authorization.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A video synthesis method, the method comprising:
acquiring content description text information of a target object; the content description text information is text information describing the content expressed by the target object;
performing semantic feature extraction on the content description text information to obtain text semantic features;
acquiring candidate video content characteristics; the candidate video content features are obtained by extracting semantic features of the picture contents of the candidate video clips;
determining a video segment matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video segment;
synthesizing an object video corresponding to the target object based on the content description text information and the target video segment; the object video comprises the picture content of the target video segment, and the content description text information is displayed correspondingly when that picture content is played in the object video.
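For illustration only (this sketch is not part of the claims): one way the matching step of claim 1 could be realized is as cosine similarity between a text semantic feature and the candidate video content features, assuming the text and visual encoders are provided elsewhere; the function names and the 512-dimensional features are hypothetical.

```python
# Illustrative sketch: selecting the target video segment whose content feature
# best matches the text semantic feature. Feature extraction is assumed to be
# done elsewhere by separate text and visual encoders.
import numpy as np

def cosine_similarity(a, b):
    # Matching degree between a text semantic feature and a video content feature.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_target_segment(text_feature, candidate_features):
    # Return the index of the candidate video segment with the highest matching degree.
    scores = [cosine_similarity(text_feature, f) for f in candidate_features]
    return int(np.argmax(scores))

# Hypothetical usage with 512-dimensional features.
text_feature = np.random.rand(512)
candidate_features = [np.random.rand(512) for _ in range(8)]
print("target segment index:", pick_target_segment(text_feature, candidate_features))
```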
2. The method of claim 1, wherein the content description text information comprises a plurality of text information segments, and the performing semantic feature extraction on the content description text information to obtain text semantic features comprises:
for each text information segment, extracting semantic features of the text information segment to obtain segment semantic features of the text information segment;
and determining the segment semantic features of the text information segments as the text semantic features.
3. The method of claim 2, wherein there are a plurality of candidate video content features, and the determining a video segment matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video segment comprises:
respectively determining the matching degree between each segment semantic feature and each candidate video content feature;
and determining video segments respectively matched with the text information segments based on the matching degrees between the segment semantic features and the candidate video content features to obtain target video segments.
4. The method according to claim 3, wherein there are a plurality of target video segments, each target video segment matches one text information segment, and the determining video segments respectively matched with the text information segments based on the matching degrees between the segment semantic features and the candidate video content features to obtain target video segments comprises:
for each text information segment, determining the matching degree between the segment semantic features of the text information segment and each candidate video content feature;
screening the candidate video content features based on the matching degrees between the segment semantic features of the text information segment and the candidate video content features to obtain the video content feature matched with the segment semantic features of the text information segment;
and acquiring the video segment corresponding to the matched video content feature to obtain the target video segment matched with the text information segment.
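For illustration only (not part of the claims): the per-segment matching of claims 3 and 4 can be viewed as a matrix of matching degrees between segment semantic features and candidate video content features, with each text information segment taking the best-scoring candidate. The shapes and dimensions below are hypothetical.

```python
# Illustrative sketch: matching each text information segment to its own
# target video segment via a matrix of matching degrees.
import numpy as np

def match_segments(segment_features, candidate_features):
    # segment_features:   (S, D) array, one row per text information segment
    # candidate_features: (C, D) array, one row per candidate video segment
    seg = segment_features / np.linalg.norm(segment_features, axis=1, keepdims=True)
    cand = candidate_features / np.linalg.norm(candidate_features, axis=1, keepdims=True)
    matching = seg @ cand.T              # (S, C) matching degrees
    return matching.argmax(axis=1)       # index of the best candidate for each segment

# Hypothetical usage: 3 text segments, 10 candidate segments, 256-dimensional features.
print(match_segments(np.random.rand(3, 256), np.random.rand(10, 256)))
```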
5. The method of claim 4, wherein the determining the matching degree between the segment semantic features of the text information segment and each candidate video content feature comprises:
acquiring a text information segment adjacent to the text information segment to obtain an adjacent text information segment of the text information segment;
calculating the difference between the segment semantic features of the adjacent text information segment and the segment semantic features of the text information segment to obtain feature difference information;
performing feature fusion on the feature difference information and the segment semantic features to obtain fused semantic features;
and determining the matching degree between the segment semantic features of the text information segment and each candidate video content feature based on the fused semantic features.
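For illustration only (not part of the claims): a minimal sketch of the difference-and-fusion step of claim 5, assuming simple additive fusion with an arbitrary 0.5 weight; a learned fusion layer could equally be used, and all names here are hypothetical.

```python
# Illustrative sketch: fold the difference to an adjacent segment's feature back
# into the current segment's feature before computing matching degrees.
import numpy as np

def matching_degrees_with_context(segment_feature, adjacent_feature, candidate_features):
    difference = adjacent_feature - segment_feature    # feature difference information
    fused = segment_feature + 0.5 * difference         # simple additive feature fusion
    fused = fused / (np.linalg.norm(fused) + 1e-8)
    cand = candidate_features / np.linalg.norm(candidate_features, axis=1, keepdims=True)
    return cand @ fused                                # one matching degree per candidate
```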
6. The method according to claim 2, wherein the extracting semantic features of the text information segment to obtain the segment semantic features of the text information segment comprises:
performing word segmentation on the text information segment to obtain a plurality of word segments;
for each word segment, extracting semantic features of the word segment to obtain the word semantic features of the word segment;
and performing feature fusion on the word semantic features of the word segments to obtain the segment semantic features of the text information segment.
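For illustration only (not part of the claims): one possible realization of claim 6, fusing word semantic features into a segment semantic feature by mean pooling. The whitespace tokenizer and hash-seeded embedding are placeholders standing in for a real word segmenter and embedding model.

```python
# Illustrative sketch: build a segment semantic feature from word semantic features.
import numpy as np

def word_feature(word, dim=128):
    # Placeholder embedding: a deterministic-per-run random vector per word.
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    return rng.random(dim)

def segment_feature(text_segment):
    word_features = [word_feature(w) for w in text_segment.split()]
    return np.mean(word_features, axis=0)   # feature fusion by mean pooling

print(segment_feature("a lone traveler crosses the desert at dusk").shape)
```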
7. The method of claim 1, wherein the candidate video content features are obtained by:
for each video frame in a candidate video clip, performing semantic feature extraction on the video frame to obtain frame semantic features;
and performing feature fusion on the frame semantic features to obtain the candidate video content feature of the candidate video clip.
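For illustration only (not part of the claims): a minimal sketch of the frame-feature fusion of claim 7, assuming mean pooling over frames; attention-weighted pooling is a common alternative.

```python
# Illustrative sketch: fuse per-frame semantic features into one clip content feature.
import numpy as np

def clip_content_feature(frame_features):
    # frame_features: (num_frames, D) array of frame semantic features for one clip.
    return np.asarray(frame_features).mean(axis=0)

print(clip_content_feature(np.random.rand(32, 512)).shape)   # -> (512,)
```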
8. The method according to claim 1, wherein there are a plurality of target video segments, the content description text information comprises a plurality of text information segments, and each target video segment matches one text information segment;
the synthesizing an object video corresponding to the target object based on the content description text information and the target video segment comprises:
sequentially splicing the target video segments matched with the text information segments according to the order of the text information segments in the content description text information, and determining the display time of each matched text information segment according to the playing time of the corresponding target video segment, so as to synthesize the object video corresponding to the target object.
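For illustration only (not part of the claims): a sketch of deriving subtitle display times from the playing time of each matched target video segment, as described in claim 8. The actual splicing of the segments could then be done with a tool such as ffmpeg's concat demuxer; the data below are hypothetical.

```python
# Illustrative sketch: each text information segment is displayed for the full
# playing time of its matched target video segment in the spliced object video.
def subtitle_schedule(text_segments, segment_durations):
    # text_segments:     text information segments, in their order in the description
    # segment_durations: playing time in seconds of each matched target video segment
    schedule, start = [], 0.0
    for text, duration in zip(text_segments, segment_durations):
        schedule.append((start, start + duration, text))   # display window for this text
        start += duration
    return schedule

print(subtitle_schedule(["opening scene", "turning point"], [4.0, 6.5]))
```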
9. The method of claim 1, wherein the target object is a target text object; the content description text information is abstract description information of the content described by the target text object; the object video is a video for introducing the target text object.
10. The method according to claim 1, wherein the synthesizing an object video corresponding to the target object based on the content description text information and the target video segment, the object video comprising the picture content of the target video segment and the content description text information being displayed correspondingly when that picture content is played in the object video, comprises:
converting the content description text information into audio data;
synthesizing an object video corresponding to the target object based on the content description text information, the target video segment and the audio data;
and correspondingly displaying the content description text information and playing the audio data when the picture content of the target video segment in the object video is played.
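For illustration only (not part of the claims): converting the content description text into audio data, as in claim 10, could use any text-to-speech service; gTTS is shown here purely as one off-the-shelf example, and the output path is hypothetical.

```python
# Illustrative sketch: synthesize an audio narration from the content description text.
from gtts import gTTS

def text_to_audio(content_text, out_path="narration.mp3"):
    gTTS(text=content_text).save(out_path)   # writes an MP3 narration of the text
    return out_path
```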
11. The method according to claim 1, wherein the synthesizing an object video corresponding to the target object based on the content description text information and the target video segment, the object video comprising the picture content of the target video segment and the content description text information being displayed correspondingly when that picture content is played in the object video, comprises:
acquiring candidate audio and determining the audio style of the candidate audio;
determining the content style of the content description text information, and determining candidate audio with the audio style matched with the content style of the content description text information as target audio;
synthesizing an object video corresponding to the target object based on the content description text information, the target video segment and the target audio;
and when the object video is played, correspondingly displaying the content description text information and correspondingly playing the target audio.
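For illustration only (not part of the claims): a minimal sketch of selecting a target audio whose style matches the content style of the description text, as in claim 11. The style labels and the upstream classifiers that produce them are assumptions for this example.

```python
# Illustrative sketch: pick the candidate audio whose style label matches the content style.
def pick_target_audio(content_style, candidate_audios):
    # candidate_audios: iterable of (audio_path, audio_style) pairs
    for path, style in candidate_audios:
        if style == content_style:
            return path
    return None   # no candidate audio matches the content style

print(pick_target_audio("calm", [("bgm1.mp3", "upbeat"), ("bgm2.mp3", "calm")]))
```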
12. A video synthesis apparatus, characterized in that the apparatus comprises:
the information acquisition module is used for acquiring content description text information of the target object; the content description text information is text information describing the content expressed by the target object;
the feature extraction module is used for extracting semantic features of the content description text information to obtain text semantic features;
the characteristic acquisition module is used for acquiring candidate video content characteristics; the candidate video content features are obtained by extracting semantic features of the picture contents of the candidate video clips;
the video acquisition module is used for determining a video clip matched with the content description text information based on the matching degree between the text semantic features and the candidate video content features to obtain a target video clip;
the video synthesis module is used for synthesizing an object video corresponding to the target object based on the content description text information and the target video clip; the object video comprises the picture content of the target video clip, and the content description text information is displayed correspondingly when that picture content is played in the object video.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 11 when executed by a processor.
CN202210047948.7A 2022-01-17 2022-01-17 Video synthesis method and device, computer equipment and storage medium Pending CN114390217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210047948.7A CN114390217A (en) 2022-01-17 2022-01-17 Video synthesis method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210047948.7A CN114390217A (en) 2022-01-17 2022-01-17 Video synthesis method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114390217A (en) 2022-04-22

Family

ID=81202604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210047948.7A Pending CN114390217A (en) 2022-01-17 2022-01-17 Video synthesis method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114390217A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579806A (en) * 2022-04-27 2022-06-03 阿里巴巴(中国)有限公司 Video detection method, storage medium and processor
CN115357755A (en) * 2022-08-10 2022-11-18 北京百度网讯科技有限公司 Video generation method, video display method and device
CN115396728A (en) * 2022-08-18 2022-11-25 维沃移动通信有限公司 Method and device for determining video playing multiple speed, electronic equipment and medium
CN116389853A (en) * 2023-03-29 2023-07-04 阿里巴巴(中国)有限公司 Video generation method
CN116389853B (en) * 2023-03-29 2024-02-06 阿里巴巴(中国)有限公司 Video generation method
CN116664726A (en) * 2023-07-26 2023-08-29 腾讯科技(深圳)有限公司 Video acquisition method and device, storage medium and electronic equipment
CN116664726B (en) * 2023-07-26 2024-02-09 腾讯科技(深圳)有限公司 Video acquisition method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Wu et al. Nüwa: Visual synthesis pre-training for neural visual world creation
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN114390217A (en) Video synthesis method and device, computer equipment and storage medium
CN110717017A (en) Method for processing corpus
CN112487182A (en) Training method of text processing model, and text processing method and device
US9613268B2 (en) Processing of images during assessment of suitability of books for conversion to audio format
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN111444357B (en) Content information determination method, device, computer equipment and storage medium
CN114390218A (en) Video generation method and device, computer equipment and storage medium
KR20200087977A (en) Multimodal ducument summary system and method
EP4310695A1 (en) Data processing method and apparatus, computer device, and storage medium
Mohamad Nezami et al. Towards generating stylized image captions via adversarial training
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN117011875A (en) Method, device, equipment, medium and program product for generating multimedia page
CN115640611B (en) Method for updating natural language processing model and related equipment
CN116977701A (en) Video classification model training method, video classification method and device
CN116208824A (en) Title generation method, computer device, storage medium, and computer program product
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Ullah et al. A review of multi-modal learning from the text-guided visual processing viewpoint
CN115186133A (en) Video generation method and device, electronic equipment and medium
Sra et al. Deepspace: Mood-based image texture generation for virtual reality from music

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination