CN113377971B - Multimedia resource generation method and device, electronic equipment and storage medium - Google Patents

Multimedia resource generation method and device, electronic equipment and storage medium

Info

Publication number
CN113377971B
Authority
CN
China
Prior art keywords
feature
target text
resource
information
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110598129.7A
Other languages
Chinese (zh)
Other versions
CN113377971A (en)
Inventor
王厚志
梅晓茸
张梦馨
刘旭东
叶小瑜
金梦
张德兵
郭晓锋
周伟浩
张辰怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110598129.7A
Publication of CN113377971A
Application granted
Publication of CN113377971B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 - Querying
    • G06F16/432 - Query formulation
    • G06F16/435 - Filtering based on additional data, e.g. user or group profiles
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Abstract

The disclosure relates to a multimedia resource generation method and device, an electronic device and a storage medium, and belongs to the technical field of multimedia. The method acquires, based on a target text input by a user, candidate resources related to the target text from a large collection of material resources, extracts keywords from the target text, and further matches the candidate resources against the target text through a multimodal matching model to obtain recommended resources. Based on the recommended resources, a target multimedia resource is generated for the user quickly and intelligently, while the correlation between the generated target multimedia resource and the target text is guaranteed.

Description

Multimedia resource generation method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of multimedia, and in particular relates to a method and a device for generating multimedia resources, electronic equipment and a storage medium.
Background
With the growing variety of applications, advertisers need to create advertisements based on their own applications in order to promote them. Multimedia resources, including pictures, videos and the like, can serve as a form of advertisement for promoting applications. For a novel-reading application, an advertiser often selects a picture related to the content of a novel as the advertisement and embeds a jump interface to the novel-reading application in the picture, so as to attract users to download the application.
In the above technology, the advertiser must select pictures manually and screen out those matching the content of the novel, which consumes considerable human resources, is costly, and is inefficient.
Disclosure of Invention
The disclosure provides a multimedia resource generation method and device, an electronic device and a storage medium, which can quickly and intelligently generate a target multimedia resource for a user while ensuring the correlation between the generated target multimedia resource and the target text. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a multimedia resource generating method, including:
responding to a received multimedia resource generation request, and acquiring a target text carried by the multimedia resource generation request;
acquiring a plurality of candidate resources based on the label of the target text, wherein the candidate resources are material resources whose label similarity with the label of the target text meets a first judgment condition;
determining at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, wherein the keyword information is used to represent keywords in the target text, and a recommended resource is a candidate resource whose matching parameter with the target text meets a matching condition;
and generating the target multimedia resource corresponding to the target text based on the at least one recommended resource.
In some embodiments, the determining at least one recommended resource from the plurality of candidate resources based on the keyword information of the target text includes:
processing the characteristics of the candidate resources and the keyword information of the target text through a multi-mode matching model to obtain matching parameters of the candidate resources and the target text, wherein the multi-mode matching model is obtained by training based on sample materials, sample texts and the matching parameters of the sample materials and the sample texts;
at least one recommended resource is determined from the plurality of candidate resources based on a matching parameter of the candidate resource and the target text.
In some embodiments, the multimodal matching model includes: a first fully connected neural network, a second fully connected neural network, a bidirectional attention module, a third fully connected neural network, a feature fusion layer and a multi-layer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the characteristics of the candidate resources and the keyword information;
The bidirectional attention module is used for acquiring a first feature based on the semantic mapping information and the keyword mapping information;
the third fully-connected neural network is used for acquiring a second feature and a third feature based on the feature of the candidate resource and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multi-layer perceptron is used for acquiring matching parameters of the candidate resource and the target text based on the fourth characteristic.
In some embodiments, the processing, by the multimodal matching model, the feature of the candidate resource and the keyword information of the target text, to obtain the matching parameters of the candidate resource and the target text includes:
mapping each semantic information in the characteristics of the candidate resource and each keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network respectively, and outputting semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information respectively, wherein the semantic mapping information is used for representing single semantics of the candidate resource, and the keyword mapping information is used for representing one keyword of the target text;
Mapping the semantic mapping information and the keyword mapping information through a bidirectional attention module to obtain the first feature, wherein the first feature is the matching information of the candidate resource and the target text;
mapping, through the third fully connected neural network, the feature of the candidate resource and the information of all the keywords each as a whole to a second semantic space, and outputting the second feature and the third feature, wherein the second feature is used to represent all the semantics of the candidate resource, and the third feature is used to represent all the keywords of the target text;
fusing the first feature, the second feature and the third feature through a feature fusion layer to obtain the fourth feature;
and processing the fourth characteristic through a multi-layer perceptron, and outputting the matching parameters of the candidate resource and the target text.
In some embodiments, the mapping the semantic mapping information and the keyword mapping information to obtain the first feature includes:
acquiring first matching information between each single piece of semantic mapping information and all of the keyword mapping information;
acquiring second matching information between each single piece of keyword mapping information and all of the semantic mapping information;
and splicing the first matching information and the second matching information to obtain the first feature.
In some embodiments, the method further comprises:
the audio data in the candidate resource is determined to be a recommended resource.
In some embodiments, the determining at least one recommended resource from the plurality of candidate resources comprises:
the lyric feature in the audio data is obtained, the lyric feature is matched with the keyword information of the target text through a deep learning model, the matching degree between the audio data and the target text is obtained, and the audio data with the matching degree meeting the audio matching condition is determined as the recommended resource.
In some embodiments, the obtaining the keyword information of the target text includes:
calculating the importance degree of all words contained in the target text, sorting the words in the target text based on the importance degree, taking the words ranked at the first K positions as the keywords of the target text, wherein K is an integer greater than 0 and less than the number of words contained in the target text;
and processing the extracted keywords to obtain the keyword information.
In some embodiments, the method further comprises:
And extracting titles from the target text according to the text components indicated by the text format template, and taking the extracted titles as keywords of the target text.
In some embodiments, the generating the target multimedia resource corresponding to the target text based on the at least one recommended resource includes any one of:
splicing the plurality of recommended resources to obtain the target multimedia resource;
the target multimedia resource is generated based on any multimedia resource template, which is multimedia resource data having editable fragments.
In some embodiments, the stitching the plurality of recommended resources to obtain the target multimedia resource includes:
determining at least one data group from the plurality of recommended resources, wherein each data group includes at least one type of data among image data and video data whose content similarity meets a condition;
and splicing the data within each data group together, and then splicing the spliced data groups in sequence.
In some embodiments, the method further comprises:
if the recommended resource comprises audio data, the audio data is used as the background audio of the generated multimedia resource.
In some embodiments, the generating the target multimedia asset based on any multimedia asset template comprises:
inserting the recommended resource in the editable segment of the multimedia asset template to generate the target multimedia asset.
In some embodiments, the inserting the recommended asset in the editable segment of the multimedia asset template to generate the target multimedia asset comprises:
and inserting the recommended resources of the corresponding type into the corresponding positions of the multimedia resource template according to the data types corresponding to the editable fragments in the multimedia resource template to obtain the target multimedia resources.
In some embodiments, the method further comprises:
acquiring the user click times of a plurality of videos of any object to be promoted, wherein the videos are generated based on different material resources;
and taking the material resource corresponding to the video, the clicking times of which meet the second judging condition, as the candidate resource.
According to a second aspect of the embodiments of the present disclosure, there is provided a multimedia resource generating apparatus, the apparatus including:
an obtaining unit configured to obtain a target text carried by a multimedia resource generation request in response to receiving the multimedia resource generation request;
the obtaining unit is further configured to acquire a plurality of candidate resources based on the label of the target text, wherein the candidate resources are material resources whose label similarity with the label of the target text meets a first judgment condition;
a determining unit configured to perform determination of at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, the keyword information being used to represent keywords in the target text, the recommended resource being a candidate resource for which a matching parameter with the target text meets a matching condition;
and the generating unit is configured to execute the generation of the target multimedia resource corresponding to the target text based on the at least one recommended resource.
In some embodiments, the determining unit is configured to perform processing on the feature of the candidate resource and the keyword information of the target text through a multi-modal matching model to obtain a matching parameter of the candidate resource and the target text, where the multi-modal matching model is obtained based on sample material, sample text, and matching parameter training of the sample material and the sample text, and at least one recommended resource is determined from the plurality of candidate resources based on the matching parameter of the candidate resource and the target text.
In some embodiments, the multimodal matching model includes: the system comprises a first full-connection neural network, a second full-connection neural network, a bidirectional attention module, a third full-connection neural network, a feature fusion layer and a multi-layer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the characteristics of the candidate resources and the keyword information;
the bidirectional attention module is used for acquiring a first feature based on the semantic mapping information and the keyword mapping information;
the third fully-connected neural network is used for acquiring a second feature and a third feature based on the feature of the candidate resource and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multi-layer perceptron is used for acquiring the matching parameters of the candidate resources and the target text based on the fourth characteristic.
In some embodiments, the determining unit further comprises:
a mapping subunit configured to perform mapping each semantic information in the feature of the candidate resource and each keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network, respectively, and output semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information, where the semantic mapping information is used to represent a single semantic of the candidate resource, and the keyword mapping information is used to represent a keyword of the target text;
The mapping subunit is configured to map the semantic mapping information and the keyword mapping information through a bidirectional attention module to obtain the first feature, wherein the first feature is the matching information of the candidate resource and the target text;
the mapping subunit is configured to perform integral mapping of the feature of the candidate resource and the feature information of all keywords to a second semantic space through a third fully connected neural network, and output the second feature and the third feature, wherein the second feature is used for representing all semantics of the candidate resource, and the third feature is used for representing all keywords of the target text;
a fusion subunit configured to perform fusion of the first feature, the second feature and the third feature through a feature fusion layer to obtain the fourth feature;
and the processing subunit is configured to perform processing on the fourth feature through a multi-layer perceptron and output matching parameters of the candidate resource and the target text.
In some embodiments, the mapping subunit is configured to perform obtaining first matching information of the single semantic mapping information and the all keyword mapping information, obtaining second matching information of the single keyword mapping information and the all semantic mapping information, and splicing the first matching information and the second matching information to obtain the first feature.
In some embodiments, the determining unit is configured to perform determining the audio data in the candidate resource as a recommended resource.
In some embodiments, the determining unit is configured to perform obtaining lyric features in the audio data, match the lyric features with keyword information of the target text through a deep learning model, obtain a matching degree between the audio data and the target text, and determine the audio data with the matching degree meeting an audio matching condition as the recommended resource.
In some embodiments, the obtaining unit is configured to perform calculating importance degrees of all words included in the target text, rank the words in the target text based on the importance degrees, use the words ranked in the first K positions as keywords of the target text, where K is an integer greater than 0 and less than the number of words included in the target text, and process the extracted keywords to obtain the keyword information.
In some embodiments, the obtaining unit is configured to extract a title from the target text according to the text components indicated by the text format template, and use the extracted title as a keyword of the target text.
In some embodiments, the generating unit is configured to perform any one of:
splicing the plurality of recommended resources to obtain the target multimedia resource;
the target multimedia asset is generated based on any multimedia asset template, which is multimedia asset data with editable fragments.
In some embodiments, the generating unit is configured to splice the plurality of recommended resources by determining at least one data group, where each data group includes at least one type of data among image data and video data whose content similarity meets the condition, splicing the data within each data group together, and then splicing the spliced data groups in sequence.
In some embodiments, the generating unit is configured to perform the audio data as background audio of the generated multimedia asset if the audio data is included in the recommended asset.
In some embodiments, the apparatus further comprises:
an audio processing unit configured to perform inserting the recommended asset in the editable segment of the multimedia asset template to generate the target multimedia asset.
In some embodiments, the generating unit is configured to execute inserting the recommended resources of the corresponding type into the corresponding positions of the multimedia resource template according to the data types corresponding to the editable clips in the multimedia resource template, so as to obtain the target multimedia resource.
In some embodiments, the obtaining unit is configured to perform obtaining a number of user clicks of a plurality of videos of any object to be promoted, where the plurality of videos are generated based on different material resources, and the material resource corresponding to the video whose number of user clicks meets the second judgment condition is used as the candidate resource.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising:
one or more processors;
a memory for storing the processor-executable program code;
wherein the processor is configured to execute the program code to implement the above-described multimedia asset generation method.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium comprising: the program code in the computer readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the above-described multimedia asset generation method.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described method of generating a multimedia resource.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of an implementation environment of a method for generating multimedia assets, according to an exemplary embodiment;
FIG. 2 is a basic block diagram illustrating a method of generating a multimedia asset according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a method of generating a multimedia asset, according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating a method of generating a multimedia asset, according to an exemplary embodiment;
FIG. 5 is a flowchart illustrating a method of extracting features of a material resource according to an exemplary embodiment;
FIG. 6 is a flowchart illustrating a method of extracting target text keyword information, according to an exemplary embodiment;
FIG. 7 is a diagram illustrating a multimodal matching model configuration in accordance with an illustrative embodiment;
FIG. 8 is a block diagram of a multimedia asset generation device according to an exemplary embodiment;
FIG. 9 is a block diagram of an electronic device, according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The data referred to in this disclosure may be data authorized by the user or sufficiently authorized by the parties.
Fig. 1 is a schematic diagram of an implementation environment of a method for generating a multimedia resource according to an embodiment of the present disclosure, referring to fig. 1, where the implementation environment includes: a terminal 101 and a server 102.
The terminal 101 may be at least one of a smart phone, a smart watch, a desktop computer, a portable laptop computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, and the like. The terminal 101 has a communication function and can access the Internet. The terminal 101 may refer to one of a plurality of terminals, and this embodiment only takes the terminal 101 as an example. Those skilled in the art will recognize that the number of terminals may be greater or smaller. The terminal 101 installs and runs an application supporting the generation of multimedia resources, which may be a live-streaming application, a video application, or another multimedia application.
The server 102 may be an independent physical server, a server cluster or distributed file system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The server 102 may be associated with a database for storing material resources, including but not limited to image data, video data, audio data, and the like. The server 102 and the terminal 101 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application. Alternatively, the number of servers 102 may be greater or smaller, which is not limited in the embodiments of the present application. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
Based on the above-described implementation environment, fig. 2 is a basic schematic block diagram of a multimedia asset generation method according to an exemplary embodiment, which is used in a server. As shown in fig. 2, the system architecture on which the multimedia resource generation method depends includes the following 5 parts:
(1) Data layer
The data layer is used to store material resources including, but not limited to: image data, video data, audio data, etc.
(2) Content understanding layer
The content understanding layer is used for extracting characteristics of material resources in the data layer, the obtained characteristics are used for matching with a target text in a subsequent matching layer, and the characteristic extraction comprises the following three steps:
manual labeling: based on the specific content of the material resources, labels are manually marked for the material resources in the data layer, for example, for an image data containing city streets, labels of city life are manually marked for the data image.
Feature extraction: based on the labeled material resources, the features of the material resources are extracted using a machine learning or deep learning method. In some embodiments, the features of a material resource include structural information of image data and video data, audio data features, and the like. The structural information of image data and video data refers to their semantic information; for example, if the clearly semantic content of an image includes sky, a building and a person, then the structural information of that image is the parts of the image containing the sky, the building and the person. Audio data features include, but are not limited to: pitch features, tempo features, lyric features, and the like, which are not limited in this embodiment.
Feature vector representation: and carrying out vectorization representation on the extracted characteristic information based on a vector representation method to obtain characteristic vectors corresponding to all the material resources.
(3) Recall layer
The recall layer is used to acquire candidate resources based on the target text input by the user, where candidate resources are material resources related to the target text. In some embodiments, the server obtains a plurality of candidate resources through a tag recall method: based on the tags of the material resources, material resources related to the target text are obtained and used as candidate resources. In some embodiments, the server obtains candidate resources through a posterior-data recall method, which takes material resources with higher user preference as candidate resources. For example, a first video and a second video are generated based on a first material resource and a second material resource respectively, the server publishes both videos on the same video platform, and if the first video receives more user clicks, the user preference for the first material resource is considered higher. Of course, the server may also obtain candidate resources through other recall methods, which is not limited in this embodiment.
(4) Matching layer
The matching layer is used to match the candidate resources against the target text and to provide one or more video generation strategies. In some embodiments, the server first obtains the keyword information of the target text, then, based on the features obtained by the content understanding layer, matches the keyword information with the features of each candidate resource using a multimodal matching model to obtain the matching parameter of each candidate resource and the target text; the higher the value of the matching parameter, the higher the matching degree. Optionally, the matching parameter is a matching probability value. This process is described in detail below; see the embodiment corresponding to fig. 4.
In some embodiments, the server provides one or more generation policies for the generation of subsequent target multimedia resources. The video generation strategy refers to a method for generating multimedia resources based on recommended resources acquired by a subsequent generation layer, for example, a plurality of recommended resources are spliced to obtain target multimedia resources, or the target multimedia resources are generated based on a template file and the recommended resources.
(5) Generating layer
The generation layer is used to provide the multimedia resource generation service. In some embodiments, the server obtains a plurality of recommended resources based on the matching parameters, where recommended resources are candidate resources with higher matching parameters relative to the target text. Based on the acquired recommended resources, the server generates the target multimedia resource using the generation strategy.
Fig. 3 is a flowchart illustrating a method of generating a multimedia resource, as shown in fig. 3, for use in a server, according to an exemplary embodiment, comprising the steps of:
In step 301, in response to receiving a multimedia resource generation request, the server obtains the target text carried by the multimedia resource generation request.
In step 302, the server obtains a plurality of candidate resources based on the label of the target text, where the candidate resources are material resources whose similarity between the label and the label of the target text meets a first judgment condition.
In step 303, the server determines at least one recommended resource from the plurality of candidate resources based on the keyword information of the target text, where the keyword information is used to represent a keyword in the target text, and the recommended resource is a candidate resource whose matching parameter matches a matching condition with the target text.
In step 304, the server generates the target multimedia resource corresponding to the target text based on the at least one recommended resource.
According to the technical scheme provided by the embodiment of the disclosure, candidate resources related to the target text are obtained from a large collection of material resources based on the target text input by the user, keywords of the target text are extracted, and the candidate resources are further matched against the target text through the multimodal matching model to obtain recommended resources. Based on the recommended resources, the target multimedia resource is generated for the user quickly and intelligently, while the correlation between the generated target multimedia resource and the target text is guaranteed.
Fig. 4 is a flowchart illustrating a multimedia asset generation method for use in a server, which is described below using video generation as an example, according to an exemplary embodiment. As shown in fig. 4, the method comprises the following steps:
In step 401, the server acquires the material resources.
In some embodiments, the server obtains the material resources from a database.
In step 402, the server performs feature extraction on the material resource to obtain features corresponding to the material resource.
In some embodiments, in response to the material resource being image data, the server invokes a faster region-based convolutional neural network (Faster Region-based Convolutional Neural Network, Faster R-CNN), takes the image data as input to the Faster R-CNN network, processes the image data through the network, and outputs the features corresponding to the image data. For example, as shown in fig. 5, the clearly semantic content of the target image data in fig. 5 includes a wine glass, people, a house and the sky; the Faster R-CNN network can extract these 4 pieces of semantic information, which constitute the features corresponding to the target image data. Optionally, the server vectorizes the 4 pieces of semantic information and uses the resulting semantic vectors as the features corresponding to the target image data.
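The region-level feature extraction described above could look roughly like the minimal sketch below, which assumes a pretrained torchvision Faster R-CNN detector (torchvision 0.13 or later); the score threshold and the way each detected region is turned into a vector are illustrative assumptions rather than details specified by this disclosure.

```python
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_semantic_features(image_path: str, score_thresh: float = 0.7) -> torch.Tensor:
    """Return one vector per clearly semantic region (e.g. person, building, sky)."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        detections = detector([image])[0]          # dict with boxes, labels, scores
    keep = detections["scores"] > score_thresh
    boxes, labels = detections["boxes"][keep], detections["labels"][keep]
    _, height, width = image.shape
    feats = []
    for box, label in zip(boxes, labels):
        # Stand-in region vector: normalized box geometry plus the class id.
        # A full system would instead pool backbone features inside the box.
        norm_box = box / torch.tensor([width, height, width, height])
        feats.append(torch.cat([norm_box, label.float().view(1)]))
    return torch.stack(feats) if feats else torch.zeros(0, 5)
```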
In some embodiments, in response to the material resource being video data, the server first decomposes the video data into multiple frames of images, and then performs feature extraction on the multiple frames of images based on the above method for performing feature extraction on image data, so as to obtain features corresponding to the video data.
By extracting the characteristics of the material resources, a better guiding effect is achieved on matching the follow-up process with the target text.
It should be noted that a material resource needs to be manually labeled before feature extraction is performed on it. Both manual labeling and feature extraction are completed before the server receives a video generation request for the first time, are performed only once, and do not need to be performed in real time each time the server receives a video generation request.
In step 403, the server obtains the target text entered by the user.
In some embodiments, the server obtains the target text input by the user. Taking an advertisement video as an example, the target text is a text representative of the object to be promoted: if an advertisement video needs to be generated for an electronic book application, the target text may be the synopsis of any electronic book provided by that application. The electronic book application is used to provide text electronic books, audiobooks and the like, which is not limited by the embodiments of the disclosure.
The server is provided with a video generation service, and in some embodiments, a user can access the video generation service through the terminal, input a target text based on the video generation service, trigger a video generation request, and respond to the received video generation request of the terminal, the server acquires the target text carried by the video generation request. And the server performs the steps of subsequent data recall and data matching based on the content of the target text.
In step 404, the server obtains a plurality of candidate resources based on the tag of the target text, where the candidate resources are material resources whose similarity between the tag and the tag of the target text meets the first judgment condition.
In some embodiments, the server obtains a plurality of candidate resources through a tag recall method. The server calculates the similarity between the label of each material resource and the label of the target text, and takes the material resources whose label similarity meets the first judgment condition as candidate resources.
In some embodiments, when calculating the similarity, the server uses a semantic similarity calculation model, inputting the label information of the material resource and the label information of the target text into the model to obtain the similarity between the two labels. The semantic similarity calculation model may be a multi-layer convolutional neural network, a recurrent neural network or another deep learning model, which is not limited in this embodiment. Optionally, the first judgment condition is that the similarity between the material resource label and the label of the target text is greater than a first threshold, or that the similarity ranks in the top N when sorted in descending order, where N is an integer greater than 0 and less than the number of material resource labels.
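As a hedged illustration of the tag recall above, the sketch below embeds each material label and the target-text label with a sentence encoder, ranks materials by cosine similarity, and keeps those passing a threshold or ranking in the top N; the encoder checkpoint, threshold and N are placeholders rather than values specified by this disclosure.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model

def recall_by_tag(target_tag: str, materials: list[dict], top_n: int = 50,
                  min_sim: float = 0.5) -> list[dict]:
    """materials: list of {"id": ..., "tag": ...} records describing material resources."""
    tag_vecs = encoder.encode([m["tag"] for m in materials], normalize_embeddings=True)
    query_vec = encoder.encode([target_tag], normalize_embeddings=True)[0]
    sims = tag_vecs @ query_vec                    # cosine similarity of normalized vectors
    order = np.argsort(-sims)[:top_n]              # top-N labels in descending similarity
    return [materials[i] for i in order if sims[i] >= min_sim]
```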
Alternatively, the server may obtain candidate resources through a posterior data recall method. The server acquires the number of user clicks of a plurality of videos of any object to be promoted, where the videos are generated based on different material resources, and takes the material resources corresponding to the videos whose number of user clicks meets the second judgment condition as candidate resources. Optionally, the second judgment condition is that the number of user clicks of the video is greater than a second threshold, or that the number of user clicks ranks in the top S when sorted in descending order, where S is an integer greater than 0 and less than the number of videos.
It should be noted that, the server may obtain the candidate resources through one or more recall methods in the above methods, and in response to the server adopting multiple recall methods, the server takes the resources obtained by all recall methods as the candidate resources, for example, the server obtains the first resource through the tag recall method and obtains the second resource through the posterior data recall method, and the candidate resources include the first resource and the second resource. Of course, the server may also obtain the candidate resource through other recall methods, which is not limited in this embodiment.
In step 405, the server performs keyword extraction on the target text, and obtains keyword information, where the keyword information is used to represent keywords in the target text.
In some embodiments, the server obtains the keyword information of the target text through term frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) statistics and a Bidirectional Encoder Representations from Transformers (BERT) model. As shown in fig. 6, the server uses the TF-IDF method to calculate the importance of all the words contained in the target text, ranks the words by importance, and takes the words ranked in the first K positions as the keywords of the target text, where K is an integer greater than 0 and less than the number of words contained in the target text. The server then uses a BERT model to vectorize the K keywords, and the resulting K keyword vectors are used as the K pieces of keyword information.
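A minimal sketch of this keyword-extraction step is given below: words are scored with TF-IDF, the top-K are kept as keywords, and each keyword is embedded with a BERT model. The checkpoint name, the reference corpus used to fit TF-IDF, and the mean pooling of BERT outputs are assumptions for illustration.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

BERT_NAME = "bert-base-chinese"                      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(BERT_NAME)
bert = AutoModel.from_pretrained(BERT_NAME).eval()

def extract_keyword_vectors(target_text: str, corpus: list[str], k: int = 5):
    """Return the top-K keywords of the target text and one BERT vector per keyword."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus + [target_text])           # TF-IDF needs a reference corpus
    scores = vectorizer.transform([target_text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    top_idx = scores.argsort()[::-1][:k]
    keywords = [vocab[i] for i in top_idx if scores[i] > 0]
    vectors = []
    for word in keywords:
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state            # (1, seq_len, 768)
        vectors.append(hidden.mean(dim=1).squeeze(0))            # one vector per keyword
    return keywords, (torch.stack(vectors) if vectors else torch.zeros(0, 768))
```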
In some embodiments, the server extracts a title from the target text based on the text components indicated by the text format template, and uses the extracted title as a keyword for the target text.
Through the keyword extraction process, the features of the target text are represented as vectors, which provides good guidance for the subsequent matching process.
In step 406, the server inputs the features of the candidate resource and the keyword information of the target text into a multimodal matching model to obtain the matching parameter of the candidate resource and the target text, where the multimodal matching model is trained based on sample materials, sample texts, and the matching parameters of the sample materials and the sample texts.
In some embodiments, in response to the candidate resource being image data, the server obtains the features of the image data and the keyword information of the target text, inputs them into the multimodal matching model, and obtains the matching parameter of the image data and the target text. As shown in fig. 7, the multimodal matching model includes: a first fully connected neural network, a second fully connected neural network, a third fully connected neural network, a bidirectional attention module, a feature fusion layer and a multi-layer perceptron. The multi-layer perceptron is used to obtain the matching parameter of each candidate resource and the target text based on the fourth feature.
In some embodiments, the matching process includes the following steps 406A to 406E:
In step 406A, the server maps each piece of semantic information in the features of the image data and each piece of keyword information of the target text to a first semantic space through the first fully connected neural network and the second fully connected neural network respectively, and outputs the semantic mapping information corresponding to each piece of semantic information and the keyword mapping information corresponding to each piece of keyword information, where the semantic mapping information represents a single semantic of the image data and the keyword mapping information represents one keyword of the target text. Optionally, the first fully connected neural network and the second fully connected neural network each contain one hidden layer, and their activation functions are sigmoid functions.
Through step 406A, each piece of semantic information of the image data and each piece of keyword information are independently mapped to the same semantic space, which guides the subsequent acquisition of the matching information between the image data and the keyword information.
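One way the two projection networks of step 406A could look is sketched below: a single sigmoid-activated hidden layer that maps each semantic vector and each keyword vector into the shared first semantic space. All dimensions are illustrative assumptions.

```python
import torch
from torch import nn

class Projection(nn.Module):
    """Maps one semantic vector or one keyword vector into the first semantic space."""
    def __init__(self, in_dim: int, hidden_dim: int = 256, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Sigmoid(),                       # sigmoid activation of the hidden layer
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (n, in_dim) -> (n, out_dim)
        return self.net(x)

semantic_proj = Projection(in_dim=2048)   # first fully connected network (image semantics)
keyword_proj = Projection(in_dim=768)     # second fully connected network (keywords)
```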
In step 406B, the server maps the semantic mapping information and the keyword mapping information through the bidirectional attention module to obtain the first feature, which is the matching information of the image data and the target text. The mapping process includes: acquiring first matching information between each single piece of semantic mapping information and all of the keyword mapping information, acquiring second matching information between each single piece of keyword mapping information and all of the semantic mapping information, and splicing the first matching information and the second matching information to obtain the first feature. Optionally, the first matching information is the matching degree between a single piece of semantic mapping information and all of the keyword mapping information, and the second matching information is the matching degree between a single piece of keyword mapping information and all of the semantic mapping information. Because the first feature is obtained through the bidirectional attention mechanism, it contains the matching information of the image data and the target text, which helps ensure the correlation between the target text and the target video subsequently generated by the server.
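As a hedged sketch of step 406B, the function below attends in both directions over the projected vectors (each semantic vector against all keyword vectors, and each keyword vector against all semantic vectors) and splices the two pooled results into the first feature; the dot-product similarity and mean pooling are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_attention(sem: torch.Tensor, kw: torch.Tensor) -> torch.Tensor:
    """sem: (n_sem, d) semantic mapping information, kw: (n_kw, d) keyword mapping information."""
    scores = sem @ kw.T                                  # (n_sem, n_kw) similarity matrix
    sem_to_kw = F.softmax(scores, dim=1) @ kw            # each semantic vs. all keywords
    kw_to_sem = F.softmax(scores.T, dim=1) @ sem         # each keyword vs. all semantics
    first_matching = sem_to_kw.mean(dim=0)               # pooled first matching information
    second_matching = kw_to_sem.mean(dim=0)              # pooled second matching information
    return torch.cat([first_matching, second_matching])  # spliced into the first feature
```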
In step 406C, the server maps the features of the image data and all of the keyword information, each as a whole, to a second semantic space through the third fully connected neural network, and outputs the second feature and the third feature, where the second feature represents all the semantic information of the image data and the third feature represents all the keywords of the target text. Optionally, the third fully connected neural network contains one hidden layer, and its activation function is a sigmoid function.
Through the step 406C, the features of the image data and all the keyword information are integrally mapped to the same semantic space, so as to lay a good foundation for subsequent feature fusion.
In step 406D, the server fuses the first feature, the second feature, and the third feature through the feature fusion layer to obtain a fourth feature.
In some embodiments, the server fuses the first feature, the second feature, and the third feature using three methods, namely computing a Kronecker product (Kronecker Product), vector concatenation (Vector Concatenation), and a self-attention mechanism (Self Attention), to obtain the fourth feature.
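The sketch below shows one plausible way to combine the three operations named above (Kronecker product, vector concatenation and self-attention) in a single fusion layer; which intermediate results are kept and how they are stacked is an assumption, and the first, second and third features are assumed to have been projected to a common dimension.

```python
import torch
from torch import nn

class FeatureFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.reduce = nn.Linear(dim * dim, dim)    # compress the Kronecker product

    def forward(self, f1: torch.Tensor, f2: torch.Tensor, f3: torch.Tensor) -> torch.Tensor:
        kron = self.reduce(torch.kron(f2, f3))     # Kronecker product of the two global views
        views = torch.stack([f1, kron, f2, f3])    # vector concatenation as a short sequence
        attended, _ = self.attn(views.unsqueeze(0), views.unsqueeze(0), views.unsqueeze(0))
        return attended.squeeze(0).flatten()       # fourth feature
```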
In step 406E, the server processes the fourth feature through a multi-layer perceptron (Multi-Layer Perceptron, MLP) and outputs the matching parameter of the image data and the target text.
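A minimal sketch of the multi-layer perceptron head is given below, assuming the fused fourth feature is mapped to a matching probability in [0, 1]; the layer sizes are illustrative.

```python
import torch
from torch import nn

class MatchingHead(nn.Module):
    """Maps the fourth feature to the matching parameter of a candidate resource and the target text."""
    def __init__(self, in_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),                          # matching probability in [0, 1]
        )

    def forward(self, fourth_feature: torch.Tensor) -> torch.Tensor:
        return self.mlp(fourth_feature)
```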
In some embodiments, in response to the candidate resource being video data, the server obtains the features of the multiple frames of images corresponding to the video data, obtains the matching parameter of each frame and the target text based on the above steps 406A to 406E, and calculates the average of the matching parameters of all the frames with the target text; this average is the matching parameter of the video data and the target text.
In the process of training the multimodal matching model, the server acquires training data, which includes sample materials, sample texts and sample matching parameters. The training is achieved through multiple iterations. In any iteration, feature extraction is performed on the sample materials to obtain the corresponding sample features, keyword extraction is performed on the sample texts to obtain the sample keyword information, and the sample features and the sample keyword information are input into the model to be trained. Whether the training end condition is reached is determined based on the output predicted matching parameters and the sample matching parameters: if it is reached, the model of the current iteration is determined to be the multimodal matching model; if not, the model parameters are adjusted and the next iteration is performed with the adjusted model, until the training end condition is reached. Optionally, the training end condition is: training ends if the bucketing accuracy of the predicted matching parameters is greater than 0.95 or the number of iterations reaches a second threshold.
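The iterative training described above could look roughly like the sketch below, under the assumption that the sample matching parameter is a binary label, that binary cross-entropy is used as the loss, and that a simple accuracy check stands in for the bucketing-accuracy stopping condition; the optimizer and learning rate are illustrative.

```python
import torch
from torch import nn

def train(model: nn.Module, loader, max_iters: int = 10_000, target_acc: float = 0.95):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.BCELoss()
    for step, (sample_features, sample_keywords, sample_match) in enumerate(loader):
        pred = model(sample_features, sample_keywords)      # predicted matching parameter
        loss = criterion(pred, sample_match.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Stand-in for the end-of-training condition described above.
        accuracy = ((pred > 0.5).float() == sample_match.float()).float().mean().item()
        if accuracy > target_acc or step + 1 >= max_iters:
            break
    return model
```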
It should be noted that the above multimodal matching model can obtain the matching parameter of a candidate resource and the target text, and experiments have verified that appropriately increasing the number of hidden-layer nodes in the three fully connected neural networks of the multimodal matching model significantly improves its accuracy.
In step 407, the server determines at least one recommended resource from the plurality of candidate resources based on the matching parameters of the candidate resources and the target text.
In some embodiments, based on the matching parameters of the candidate resources and the target text, the server takes the candidate resources meeting the matching condition as recommended resources. Optionally, the matching condition is: the matching parameter of the candidate resource and the target text is greater than a third threshold, or the matching parameter ranks in the top M when sorted in descending order, where M is an integer greater than 0 and less than the number of candidate resources.
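A short sketch of this selection rule follows: keep the candidate resources whose matching parameter exceeds a threshold, or otherwise the top-M candidates by matching parameter; the threshold and M are placeholders.

```python
def select_recommended(candidates, scores, threshold=None, top_m=None):
    """candidates: candidate resources; scores: their matching parameters with the target text."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        return [c for c, s in ranked if s > threshold]     # threshold rule
    return [c for c, _ in ranked[:top_m]]                  # top-M rule
```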
Steps 406 to 407 above only cover the matching of image data and video data among the candidate resources with the target text. In some embodiments, for audio data among the candidate resources, the server either directly determines the audio data as a recommended resource, or acquires the lyric features of the audio data, matches the lyric features with the keyword information of the target text through a deep learning model to obtain the matching degree between the audio data and the target text, and determines the audio data whose matching degree meets the audio matching condition as a recommended resource.
In step 408, the server generates a target video related to the target text based on the at least one recommended resource.
In some embodiments, the generation process described above includes any one of the following implementations:
In one implementation, the server splices the plurality of recommended resources to obtain the target video. For example, the plurality of recommended resources are divided into at least one data group, where each data group includes at least one type of data among image data and video data whose content similarity meets a condition; the data within each data group are spliced together, and the spliced data groups are then spliced in sequence. If the recommended resources include audio data, the audio data is used as the background audio of the generated video.
In another implementation, the server generates the target video based on any video template, where a video template is video data with editable clips into which recommended resources can be inserted. In the process of generating the target video based on the video template, according to the data type corresponding to each editable clip in the video template, recommended resources of the corresponding type are inserted into the corresponding positions of the video template to obtain the target video. Of course, the server may also generate the target video using other methods, which is not limited in this embodiment.
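Below is a hedged sketch of the splicing strategy, assuming the moviepy 1.x library: image and video items are concatenated group by group, and any recommended audio becomes the background track. The file-extension check, image duration, frame rate and output path are illustrative, and the audio is assumed to be at least as long as the spliced video.

```python
from moviepy.editor import (AudioFileClip, ImageClip, VideoFileClip,
                            concatenate_videoclips)

def splice_groups(data_groups, audio_path=None, image_duration=3, out_path="target.mp4"):
    """data_groups: list of data groups, each a list of image/video file paths."""
    clips = []
    for group in data_groups:                      # data groups are spliced in sequence
        for item in group:
            if item.lower().endswith((".jpg", ".png")):
                clips.append(ImageClip(item).set_duration(image_duration))
            else:
                clips.append(VideoFileClip(item))
    video = concatenate_videoclips(clips, method="compose")
    if audio_path:                                 # recommended audio as background audio
        video = video.set_audio(AudioFileClip(audio_path).subclip(0, video.duration))
    video.write_videofile(out_path, fps=24)
```

The template-based strategy could look roughly like the following sketch, under the assumption that a template exposes its editable clips as typed slots; each slot is filled with a recommended resource of the matching data type. The slot structure is an assumption for illustration.

```python
def fill_template(template_slots, recommended):
    """template_slots: e.g. [{"position": 0, "type": "image"}, {"position": 1, "type": "video"}]
    recommended: e.g. {"image": [...], "video": [...], "audio": [...]}"""
    pools = {data_type: list(items) for data_type, items in recommended.items()}
    filled = []
    for slot in sorted(template_slots, key=lambda s: s["position"]):
        pool = pools.get(slot["type"], [])
        # insert a recommended resource of the type the editable clip expects
        filled.append({**slot, "resource": pool.pop(0) if pool else None})
    return filled
```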
According to the technical scheme provided by the embodiment of the disclosure, candidate resources related to the target text are first obtained from a large collection of material resources based on the target text input by the user, keywords are then extracted from the target text, and the candidate resources are further matched against the target text through the multimodal matching model to obtain recommended resources. Based on the recommended resources, a target video is generated for the user quickly and intelligently, while the correlation between the generated target video and the target text is guaranteed.
Fig. 8 is a block diagram illustrating a multimedia asset generation device according to an exemplary embodiment. Referring to fig. 8, the apparatus includes an acquisition unit 801, a determination unit 802, and a generation unit 803.
An obtaining unit 801 configured to perform obtaining, in response to receiving a multimedia resource generation request, a target text carried by the multimedia resource generation request;
the obtaining unit 801 is further configured to acquire a plurality of candidate resources based on the label of the target text, where the candidate resources are material resources whose label similarity with the label of the target text meets a first judgment condition;
a determining unit 802 configured to perform determining at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, the keyword information being used to represent each keyword in the target text, the recommended resource being a candidate resource for which a matching parameter with the target text meets a matching condition;
And a generating unit 803 configured to generate the target multimedia resource corresponding to the target text based on the at least one recommended resource.
In some embodiments, the determining unit 802 is configured to perform processing on the feature of the candidate resource and the keyword information of the target text by using a multi-modal matching model to obtain a matching parameter of the candidate resource and the target text, where the multi-modal matching model is obtained by training based on the sample material, the sample text, and the matching parameter of the sample material and the sample text, and determine at least one recommended resource from the plurality of candidate resources based on the matching parameter of the candidate resource and the target text.
In some embodiments, the multimodal matching model includes: the system comprises a first full-connection neural network, a second full-connection neural network, a bidirectional attention module, a third full-connection neural network, a feature fusion layer and a multi-layer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the characteristics of the candidate resources and the keyword information;
the bidirectional attention module is used for acquiring a first feature based on the semantic mapping information and the keyword mapping information;
The third fully-connected neural network is used for acquiring a second feature and a third feature based on the feature of the candidate resource and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multi-layer perceptron is used for acquiring the matching parameters of the candidate resources and the target text based on the fourth characteristic.
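For illustration only, a minimal PyTorch sketch of a model with these components is shown below; the layer sizes, the mean-pooling, and the simplified attention step are assumptions, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class MultiModalMatcher(nn.Module):
    """Sketch of the described architecture: two per-token mapping networks,
    a bidirectional attention step, a whole-input mapping network,
    feature fusion, and a multi-layer perceptron producing a matching score."""
    def __init__(self, res_dim: int = 512, text_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.res_fc = nn.Linear(res_dim, hidden)       # first fully-connected network
        self.text_fc = nn.Linear(text_dim, hidden)     # second fully-connected network
        self.global_fc = nn.Linear(res_dim + text_dim, 2 * hidden)  # third network
        self.mlp = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, res_feats: torch.Tensor, kw_feats: torch.Tensor) -> torch.Tensor:
        # res_feats: (num_semantics, res_dim), kw_feats: (num_keywords, text_dim)
        sem_map = self.res_fc(res_feats)               # semantic mapping information
        kw_map = self.text_fc(kw_feats)                # keyword mapping information

        # bidirectional attention: each side attends over the other side
        scores = sem_map @ kw_map.T                                        # (S, K)
        res_to_text = (torch.softmax(scores, dim=-1) @ kw_map).mean(dim=0)   # first matching info
        text_to_res = (torch.softmax(scores.T, dim=-1) @ sem_map).mean(dim=0)  # second matching info
        first_feature = torch.cat([res_to_text, text_to_res], dim=-1)

        # whole-input mapping to a second semantic space (second and third features)
        global_in = torch.cat([res_feats.mean(dim=0), kw_feats.mean(dim=0)], dim=-1)
        second_and_third = self.global_fc(global_in)

        # feature fusion (simple concatenation here) followed by the MLP
        fused = torch.cat([first_feature, second_and_third], dim=-1)       # fourth feature
        return torch.sigmoid(self.mlp(fused))                              # matching parameter
```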
In some embodiments, the determining unit 802 further comprises:
a mapping subunit configured to map each piece of semantic information in the features of the candidate resource and each piece of keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network respectively, and to output semantic mapping information corresponding to each piece of semantic information and keyword mapping information corresponding to each piece of keyword information, where a piece of semantic mapping information represents a single semantic of the candidate resource and a piece of keyword mapping information represents one keyword of the target text;
the mapping subunit is further configured to map the semantic mapping information and the keyword mapping information through the bidirectional attention module to obtain the first feature, where the first feature is the matching information between the candidate resource and the target text;
the mapping subunit is further configured to map the features of the candidate resource as a whole and all of the keyword information as a whole to a second semantic space through the third fully-connected neural network, and to output the second feature and the third feature, where the second feature represents the overall semantics of the candidate resource and the third feature represents the overall keywords of the target text;
a fusion subunit configured to fuse the first feature, the second feature and the third feature through the feature fusion layer to obtain the fourth feature;
and a processing subunit configured to process the fourth feature through the multi-layer perceptron and output the matching parameters of the candidate resource and the target text.
In some embodiments, the mapping subunit is configured to obtain first matching information between a single piece of semantic mapping information and all of the keyword mapping information, obtain second matching information between a single piece of keyword mapping information and all of the semantic mapping information, and splice the first matching information and the second matching information to obtain the first feature.
In some embodiments, the determining unit 802 is configured to perform determining the audio data in the candidate resource as a recommended resource.
In some embodiments, the determining unit 802 is configured to obtain lyric features from the audio data, match the lyric features against the keyword information of the target text through a deep learning model to obtain a matching degree between the audio data and the target text, and determine the audio data whose matching degree meets an audio matching condition as the recommended resource.
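As a minimal sketch, assuming the lyric/keyword matching reduces to cosine similarity between embeddings produced by some deep learning encoder (the disclosure only states that a deep learning model is used), the audio matching condition might be checked as follows; the threshold value is hypothetical:

```python
import numpy as np

def audio_text_match(lyric_embedding: np.ndarray, keyword_embeddings: np.ndarray,
                     threshold: float = 0.6) -> bool:
    """Return True if the audio's lyrics are similar enough to the text keywords.

    lyric_embedding: (d,) vector produced by some deep learning encoder.
    keyword_embeddings: (k, d) matrix, one embedding per extracted keyword.
    The cosine-similarity threshold stands in for the audio matching condition.
    """
    lyric = lyric_embedding / (np.linalg.norm(lyric_embedding) + 1e-9)
    kws = keyword_embeddings / (np.linalg.norm(keyword_embeddings, axis=1, keepdims=True) + 1e-9)
    degree = float((kws @ lyric).mean())   # matching degree between audio and target text
    return degree >= threshold
```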
In some embodiments, the obtaining unit 801 is configured to calculate the importance of each word contained in the target text, rank the words in the target text by importance, take the words ranked in the top K positions as keywords of the target text, where K is an integer greater than 0 and less than the number of words contained in the target text, and process the extracted keywords to obtain the keyword information.
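For illustration only, a TF-IDF-style importance score is one way to realize the ranking described above (the disclosure does not specify how importance is calculated); the following sketch keeps the top-K words:

```python
import math
from collections import Counter
from typing import Dict, List

def top_k_keywords(words: List[str], doc_freq: Dict[str, int], num_docs: int, k: int) -> List[str]:
    """Rank the words of the target text by a TF-IDF-style importance score and keep the top K."""
    tf = Counter(words)

    def importance(word: str) -> float:
        idf = math.log((num_docs + 1) / (doc_freq.get(word, 0) + 1)) + 1.0
        return tf[word] / len(words) * idf

    ranked = sorted(set(words), key=importance, reverse=True)
    return ranked[:k]
```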
In some embodiments, the obtaining unit 801 is configured to extract a title from the target text according to the text components indicated by the text format template, and to use the extracted title as a keyword of the target text.
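As a minimal sketch, assuming the text format template marks the title as the first line of the target text (an assumption; the disclosure only says the title is extracted according to the indicated text components):

```python
from typing import Optional

def extract_title(target_text: str) -> Optional[str]:
    """Treat the first non-empty line as the title component of the target text."""
    first_line = target_text.strip().split("\n", 1)[0].strip()
    return first_line or None
```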
In some embodiments, the generating unit 803 is configured to perform any one of:
Splicing the plurality of recommended resources to obtain the target multimedia resource;
the target multimedia asset is generated based on any multimedia asset template, which is multimedia asset data with editable fragments.
In some embodiments, the generating unit 803 is configured to group the plurality of recommended resources into at least one data group, where each data group includes at least one type of data, among image data and video data, whose content similarity meets a condition; the data within each group are spliced together, and the spliced groups are then spliced sequentially, group by group.
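For illustration only, the grouping-and-splicing step might be realized with a greedy pass over the recommended resources; the pairwise `similar` predicate stands in for the unspecified content similarity condition:

```python
from typing import Callable, List

def group_and_splice(resources: List[object],
                     similar: Callable[[object, object], bool]) -> List[object]:
    """Greedily group content-similar resources, then splice the groups in sequence."""
    groups: List[List[object]] = []
    for res in resources:
        for group in groups:
            if similar(group[0], res):
                group.append(res)
                break
        else:
            groups.append([res])
    # splice within each group, then concatenate the groups in order
    spliced: List[object] = []
    for group in groups:
        spliced.extend(group)
    return spliced
```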
In some embodiments, the generating unit 803 is configured to use the audio data as the background audio of the generated multimedia resource if audio data is included in the recommended resources.
In some embodiments, the apparatus further comprises:
an audio processing unit configured to insert the recommended resources into the editable segments of the multimedia resource template to generate the target multimedia resource.
In some embodiments, the generating unit 803 is configured to insert recommended resources of the corresponding types into the corresponding positions of the multimedia resource template according to the data types corresponding to the editable clips in the multimedia resource template, so as to obtain the target multimedia resource.
In some embodiments, the obtaining unit 801 is configured to obtain the number of user clicks on a plurality of videos of any object to be promoted, where the plurality of videos are generated based on different material resources, and to use the material resources corresponding to the videos whose number of user clicks meets the second judgment condition as the candidate resources.
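As a minimal sketch, assuming the second judgment condition is a simple click-count threshold (the disclosure does not specify it), candidate material resources could be selected as follows:

```python
from typing import Dict, List

def candidates_by_clicks(video_clicks: Dict[str, int],
                         video_to_material: Dict[str, str],
                         min_clicks: int) -> List[str]:
    """Keep the material resources whose promotional videos drew enough user clicks."""
    return [video_to_material[video] for video, clicks in video_clicks.items()
            if clicks >= min_clicks and video in video_to_material]
```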
The above embodiments take a server as an example of the electronic device; the structure of the electronic device is described below. Fig. 9 is a block diagram of an electronic device according to an exemplary embodiment. The electronic device 900 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Units, CPU) 901 and one or more memories 902, where the one or more memories 902 store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors 901 to implement the procedures performed by the electronic device in the multimedia resource generation method provided by the above method embodiments. Of course, the electronic device 900 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, e.g. a memory 902 comprising program code, which is executable by the processor 901 of the electronic device 900 to perform the above-described multimedia asset generation method. Alternatively, the computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Compact-Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the above-described method of generating multimedia resources.
In some embodiments, the computer program related to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (28)

1. A method for generating a multimedia asset, the method comprising:
responding to a received multimedia resource generation request, and acquiring a target text carried by the multimedia resource generation request;
acquiring a plurality of candidate resources based on the label of the target text, wherein the candidate resources are material resources of which the similarity between the label and the label of the target text accords with a first judgment condition;
Processing the characteristics of the candidate resources through a first full-connection neural network in the multi-mode matching model to obtain semantic mapping information, and processing the keyword information of the target text through a second full-connection neural network in the multi-mode matching model to obtain keyword mapping information;
processing the semantic mapping information and the keyword mapping information through a bidirectional attention module in the multi-mode matching model to obtain a first feature;
processing the characteristics of the candidate resources and the keyword information through a third fully-connected neural network in the multi-mode matching model to obtain second characteristics and third characteristics;
processing the first feature, the second feature and the third feature through a feature fusion layer in the multi-mode matching model to obtain a fourth feature;
processing the fourth feature through a multi-layer perceptron in the multi-mode matching model to obtain matching parameters of the candidate resource and the target text;
determining at least one recommended resource from the plurality of candidate resources based on the matching parameters;
and generating a target multimedia resource corresponding to the target text based on the at least one recommended resource.
2. The method for generating a multimedia resource according to claim 1, wherein the processing the characteristics of the candidate resources through the first fully-connected neural network in the multi-mode matching model to obtain semantic mapping information, processing the keyword information of the target text through the second fully-connected neural network in the multi-mode matching model to obtain keyword mapping information, processing the semantic mapping information and the keyword mapping information through the bidirectional attention module in the multi-mode matching model to obtain the first feature, processing the characteristics of the candidate resources and the keyword information through the third fully-connected neural network in the multi-mode matching model to obtain the second feature and the third feature, processing the first feature, the second feature and the third feature through the feature fusion layer in the multi-mode matching model to obtain the fourth feature, and processing the fourth feature through the multi-layer perceptron in the multi-mode matching model to obtain the matching parameters of the candidate resources and the target text comprises:
mapping each semantic information in the characteristics of the candidate resources and each keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network respectively, and outputting semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information respectively, wherein the semantic mapping information is used for representing a single semantic of the candidate resources, and the keyword mapping information is used for representing one keyword of the target text;
Mapping the semantic mapping information and the keyword mapping information through the bidirectional attention module to obtain the first feature, wherein the first feature is the matching information of the candidate resource and the target text;
through the third fully-connected neural network, the characteristics of the candidate resources and all the keyword information are respectively and integrally mapped to a second semantic space, the second characteristics and the third characteristics are output, the second characteristics are used for representing all the semantics of the candidate resources, and the third characteristics are used for representing all the keywords of the target text;
fusing the first feature, the second feature and the third feature through the feature fusion layer to obtain the fourth feature;
and processing the fourth characteristic through the multi-layer perceptron, and outputting the matching parameters of the candidate resource and the target text.
3. The method of claim 2, wherein mapping the semantic mapping information and the keyword mapping information to obtain the first feature comprises:
acquiring first matching information of single semantic mapping information and all keyword mapping information;
Acquiring second matching information of single keyword mapping information and all semantic mapping information;
and splicing the first matching information and the second matching information to obtain the first characteristic.
4. The method of claim 1, further comprising:
and determining the audio data in the candidate resources as recommended resources.
5. The method of claim 1, wherein said determining at least one recommended resource from the plurality of candidate resources comprises:
and obtaining lyric features in the audio data, matching the lyric features with the keyword information of the target text through a deep learning model to obtain the matching degree between the audio data and the target text, and determining the audio data with the matching degree meeting the audio matching condition as the recommended resource.
6. The method of claim 1, wherein the process of obtaining the keyword information of the target text comprises:
calculating importance degrees of all words contained in the target text, sorting the words in the target text based on the importance degrees, taking the words ranked at the first K positions as keywords of the target text, wherein K is an integer greater than 0 and less than the number of words contained in the target text;
And processing the extracted keywords to obtain the keyword information.
7. The method of multimedia asset generation according to claim 6, further comprising:
and extracting titles from the target text according to the text components indicated by the text format template, and taking the extracted titles as keywords of the target text.
8. The method for generating a multimedia resource according to claim 1, wherein the generating the target multimedia resource corresponding to the target text based on the at least one recommended resource comprises any one of:
splicing the plurality of recommended resources to obtain the target multimedia resources;
the target multimedia asset is generated based on any multimedia asset template, which is multimedia asset data with editable fragments.
9. The method of claim 8, wherein the splicing the plurality of recommended resources to obtain the target multimedia resource comprises:
determining at least one data set by using the plurality of recommended resources, wherein each data set comprises at least one type of data in image data and video data with the content similarity meeting the condition;
And splicing the data in one data group together, and splicing the spliced data groups sequentially according to the data groups.
10. The method of multimedia asset generation according to claim 8, further comprising:
and if the recommended resource comprises audio data, taking the audio data as the background audio of the generated multimedia resource.
11. The method of claim 8, wherein generating the target multimedia asset based on any multimedia asset template comprises:
inserting the recommended resources in the editable segment of the multimedia asset template to generate the target multimedia asset.
12. The method of claim 11, wherein the inserting the recommended resource in the editable segment of the multimedia resource template to generate the target multimedia resource comprises:
and inserting the recommended resources of the corresponding types into the corresponding positions of the multimedia resource templates according to the data types corresponding to the editable fragments in the multimedia resource templates to obtain the target multimedia resources.
13. The method of claim 1, further comprising:
acquiring the user click times of a plurality of videos of any object to be promoted, wherein the videos are generated based on different material resources;
and taking the material resource corresponding to the video with the click times meeting the second judgment condition of the user as the candidate resource.
14. A multimedia asset generation device, the device comprising:
an obtaining unit configured to obtain a target text carried by a multimedia resource generation request in response to receiving the multimedia resource generation request;
the obtaining unit is configured to execute the label based on the target text to obtain a plurality of candidate resources, wherein the candidate resources are material resources of which the similarity between the label and the label of the target text meets a first judging condition;
a determination unit configured to perform:
processing the characteristics of the candidate resources through a first full-connection neural network in the multi-mode matching model to obtain semantic mapping information, and processing the keyword information of the target text through a second full-connection neural network in the multi-mode matching model to obtain keyword mapping information;
Processing the semantic mapping information and the keyword mapping information through a bidirectional attention module in the multi-mode matching model to obtain a first feature;
processing the characteristics of the candidate resources and the keyword information through a third fully-connected neural network in the multi-mode matching model to obtain second characteristics and third characteristics;
processing the first feature, the second feature and the third feature through a feature fusion layer in the multi-mode matching model to obtain a fourth feature;
processing the fourth feature through a multi-layer perceptron in the multi-mode matching model to obtain matching parameters of the candidate resource and the target text;
determining at least one recommended resource from the plurality of candidate resources based on the matching parameters;
and the generating unit is configured to execute the generation of the target multimedia resource corresponding to the target text based on the at least one recommended resource.
15. The apparatus according to claim 14, wherein the determining unit includes:
a mapping subunit configured to perform mapping each semantic information in the feature of the candidate resource and each keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network, and output semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information, where the semantic mapping information is used to represent a single semantic of the candidate resource, and the keyword mapping information is used to represent one keyword of the target text;
The mapping subunit is configured to map the semantic mapping information and the keyword mapping information through the bidirectional attention module to obtain the first feature, wherein the first feature is the matching information of the candidate resource and the target text;
the mapping subunit is configured to perform integral mapping of the features of the candidate resources and all the keyword information to a second semantic space through the third fully-connected neural network, and output the second features and the third features, wherein the second features are used for representing all the semantics of the candidate resources, and the third features are used for representing all the keywords of the target text;
a fusion subunit configured to perform fusion of the first feature, the second feature and the third feature through the feature fusion layer, so as to obtain the fourth feature;
and the processing subunit is configured to perform processing on the fourth feature through the multi-layer perceptron and output matching parameters of the candidate resource and the target text.
16. The apparatus according to claim 15, wherein the mapping subunit is configured to perform obtaining first matching information of single semantic mapping information and all keyword mapping information, obtaining second matching information of single keyword mapping information and all semantic mapping information, and concatenating the first matching information and the second matching information to obtain the first feature.
17. The apparatus according to claim 14, wherein the determining unit is configured to perform determining audio data in the candidate resources as recommended resources.
18. The apparatus according to claim 14, wherein the determining unit is configured to perform obtaining a lyric feature in audio data, match the lyric feature with keyword information of the target text by a deep learning model, obtain a matching degree between the audio data and the target text, and determine, as the recommended resource, audio data whose matching degree meets an audio matching condition.
19. The apparatus according to claim 14, wherein the obtaining unit is configured to perform calculation of importance degrees of all words contained in the target text, order words in the target text based on the importance degrees, regard the words ranked in the top K positions as keywords of the target text, K is an integer greater than 0 and less than the number of words contained in the target text, and process the extracted keywords to obtain the keyword information.
20. The apparatus according to claim 19, wherein the obtaining unit is configured to extract a title from the target text according to the text components indicated by the text format template, and use the extracted title as a keyword of the target text.
21. The multimedia asset generating apparatus according to claim 14, wherein the generating unit is configured to perform any one of:
splicing the plurality of recommended resources to obtain the target multimedia resources;
the target multimedia asset is generated based on any multimedia asset template, which is multimedia asset data with editable fragments.
22. The apparatus according to claim 21, wherein the generating unit is configured to perform the step of determining at least one data group from the plurality of recommended resources, each data group including at least one type of data of image data and video data having a content similarity conforming to a condition, splice data in one data group together, and splice the spliced data groups sequentially according to the data groups.
23. The multimedia asset generating apparatus according to claim 21, characterized in that the apparatus further comprises:
the generating unit is configured to use the audio data as the background audio of the generated multimedia resource if the recommended resources comprise audio data.
24. The multimedia asset generating apparatus according to claim 21, characterized in that the apparatus further comprises:
an audio processing unit configured to perform insertion of the recommended resources in the editable segment of the multimedia asset template to generate the target multimedia asset.
25. The apparatus according to claim 24, wherein the generating unit is configured to execute the insertion of the recommended resources of the respective types into the corresponding positions of the multimedia resource templates according to the data types corresponding to the respective editable fragments in the multimedia resource templates, to obtain the target multimedia resources.
26. The apparatus according to claim 14, wherein the obtaining unit is configured to perform obtaining a number of user clicks of a plurality of videos of any object to be promoted, the plurality of videos being generated based on different material resources, and the material resource corresponding to the video whose number of user clicks meets a second criterion being the candidate resource.
27. An electronic device, the electronic device comprising:
one or more processors;
a memory for storing the processor-executable program code;
wherein the processor is configured to execute the program code to implement the multimedia asset generation method of any of claims 1 to 13.
28. A computer readable storage medium, characterized in that program code in the computer readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the multimedia asset generation method of any of claims 1 to 13.
CN202110598129.7A 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium Active CN113377971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110598129.7A CN113377971B (en) 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113377971A CN113377971A (en) 2021-09-10
CN113377971B true CN113377971B (en) 2024-02-27

Family

ID=77574982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110598129.7A Active CN113377971B (en) 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113377971B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117184A (en) * 2021-11-05 2022-03-01 海南大学 DIKW resource transmission method and device oriented to intention calculation and reasoning
CN113821690B (en) * 2021-11-23 2022-03-08 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and device, electronic equipment and storage medium
CN114610429A (en) * 2022-03-14 2022-06-10 北京达佳互联信息技术有限公司 Multimedia interface display method and device, electronic equipment and storage medium
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium
CN116992111B (en) * 2023-09-28 2023-12-26 中国科学技术信息研究所 Data processing method, device, electronic equipment and computer storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114295A (en) * 2007-08-11 2008-01-30 腾讯科技(深圳)有限公司 Method for searching on-line advertisement resource and device thereof
CN106997243A (en) * 2017-03-28 2017-08-01 北京光年无限科技有限公司 Speech scene monitoring method and device based on intelligent robot
CN107562847A (en) * 2017-08-25 2018-01-09 广东欧珀移动通信有限公司 Information processing method and related product
WO2018049960A1 (en) * 2016-09-14 2018-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resource for text information
CN109165283A (en) * 2018-08-20 2019-01-08 北京智能管家科技有限公司 Resource recommendation method, device, equipment and storage medium
CN110267067A (en) * 2019-06-28 2019-09-20 广州酷狗计算机科技有限公司 Method, apparatus, equipment and the storage medium that direct broadcasting room is recommended
CN110516110A (en) * 2019-07-22 2019-11-29 平安科技(深圳)有限公司 Song generation method, device, computer equipment and storage medium
CN110852047A (en) * 2019-11-08 2020-02-28 腾讯科技(深圳)有限公司 Text score method, device and computer storage medium
CN110930969A (en) * 2019-10-14 2020-03-27 科大讯飞股份有限公司 Background music determination method and related equipment
CN111125086A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Method, device, storage medium and processor for acquiring data resources
CN111611490A (en) * 2020-05-25 2020-09-01 北京达佳互联信息技术有限公司 Resource searching method, device, equipment and storage medium
CN111767738A (en) * 2020-03-30 2020-10-13 北京沃东天骏信息技术有限公司 Label checking method, device, equipment and storage medium
CN111931041A (en) * 2020-07-03 2020-11-13 武汉卓尔数字传媒科技有限公司 Label recommendation method and device, electronic equipment and storage medium
CN111935537A (en) * 2020-06-30 2020-11-13 百度在线网络技术(北京)有限公司 Music video generation method and device, electronic equipment and storage medium
CN112035748A (en) * 2020-09-04 2020-12-04 腾讯科技(深圳)有限公司 Information recommendation method and device, electronic equipment and storage medium
CN112182281A (en) * 2019-07-05 2021-01-05 腾讯科技(深圳)有限公司 Audio recommendation method and device and storage medium

Also Published As

Publication number Publication date
CN113377971A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN113377971B (en) Multimedia resource generation method and device, electronic equipment and storage medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
US11151177B2 (en) Search method and apparatus based on artificial intelligence
CN110582025B (en) Method and apparatus for processing video
CN107832434B (en) Method and device for generating multimedia play list based on voice interaction
CN106973244A (en) Using it is Weakly supervised for image match somebody with somebody captions
CN111258995B (en) Data processing method, device, storage medium and equipment
CN112015949A (en) Video generation method and device, storage medium and electronic equipment
CN110968684A (en) Information processing method, device, equipment and storage medium
CN111831911A (en) Query information processing method and device, storage medium and electronic device
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN111626049A (en) Title correction method and device for multimedia information, electronic equipment and storage medium
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN111143617A (en) Automatic generation method and system for picture or video text description
CN113255354B (en) Search intention recognition method, device, server and storage medium
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN113704509A (en) Multimedia recommendation method and device, electronic equipment and storage medium
CN113919360A (en) Semantic understanding method, voice interaction method, device, equipment and storage medium
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN114398973B (en) Media content tag identification method, device, equipment and storage medium
CN111222011B (en) Video vector determining method and device
CN112749553B (en) Text information processing method and device for video file and server
CN114973086A (en) Video processing method and device, electronic equipment and storage medium
CN114610948A (en) Video classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant