CN113377971A - Multimedia resource generation method and device, electronic equipment and storage medium - Google Patents

Multimedia resource generation method and device, electronic equipment and storage medium

Info

Publication number
CN113377971A
Authority
CN
China
Prior art keywords
target text
information
matching
keyword
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110598129.7A
Other languages
Chinese (zh)
Other versions
CN113377971B (en)
Inventor
王厚志
梅晓茸
张梦馨
刘旭东
叶小瑜
金梦
张德兵
郭晓锋
周伟浩
张辰怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110598129.7A priority Critical patent/CN113377971B/en
Publication of CN113377971A publication Critical patent/CN113377971A/en
Application granted granted Critical
Publication of CN113377971B publication Critical patent/CN113377971B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/435Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Abstract

The disclosure relates to a multimedia resource generation method and apparatus, an electronic device, and a storage medium, and belongs to the field of multimedia technologies. The method includes: obtaining, based on a target text input by a user, candidate resources related to the target text from a large number of material resources; extracting keywords from the target text; matching the candidate resources against the target text through a multi-modal matching model to obtain recommended resources; and generating the target multimedia resource for the user quickly and intelligently based on the recommended resources, while ensuring the relevance between the generated target multimedia resource and the target text.

Description

Multimedia resource generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of multimedia technologies, and in particular, to a multimedia resource generation method and apparatus, an electronic device, and a storage medium.
Background
As the variety of applications keeps growing, advertisers need to produce advertisements for their own applications in order to promote them. Multimedia resources, which include pictures, videos, and the like, can serve as a form of advertisement for promoting an application. For a novel-reading application, an advertiser typically selects, according to the content of a novel, a picture related to that content as an advertisement and sets a jump interface to the novel-reading application in the picture, so as to attract users to download the application.
In the related art, the advertiser needs to manually browse pictures and select those matching the novel content, which consumes substantial human resources, resulting in high cost and low efficiency.
Disclosure of Invention
The present disclosure provides a multimedia resource generation method and apparatus, an electronic device, and a storage medium, which can quickly and intelligently generate a target multimedia resource for a user while ensuring the relevance between the generated target multimedia resource and a target text. The technical solutions of the present disclosure are as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a multimedia resource generation method, including:
in response to receiving a multimedia resource generation request, acquiring a target text carried by the multimedia resource generation request;
acquiring a plurality of candidate resources based on a tag of the target text, wherein the candidate resources are material resources whose tag similarity with the tag of the target text meets a first determination condition;
determining at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, wherein the keyword information is used for representing a keyword in the target text, and the recommended resource is a candidate resource whose matching parameter with the target text meets a matching condition;
and generating the target multimedia resource corresponding to the target text based on the at least one recommended resource.
In some embodiments, the determining at least one recommended resource from the plurality of candidate resources based on the keyword information of the target text includes:
processing the features of the candidate resources and the keyword information of the target text through a multi-modal matching model to obtain matching parameters between the candidate resources and the target text, wherein the multi-modal matching model is trained based on sample materials, sample texts, and matching parameters between the sample materials and the sample texts;
and determining at least one recommended resource from the plurality of candidate resources based on the matching parameters of the candidate resources and the target text.
In some embodiments, the multi-modal matching model includes: a first fully-connected neural network, a second fully-connected neural network, a bidirectional attention module, a third fully-connected neural network, a feature fusion layer, and a multi-layer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the features of the candidate resource and the keyword information;
the bidirectional attention module is used for acquiring a first characteristic based on the semantic mapping information and the keyword mapping information;
the third fully-connected neural network is used for acquiring a second feature and a third feature based on the feature of the candidate resource and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multilayer perceptron is used for acquiring matching parameters of the candidate resources and the target text based on the fourth feature.
In some embodiments, the processing, by the multi-modal matching model, the features of the candidate resource and the keyword information of the target text to obtain the matching parameters of the candidate resource and the target text includes:
mapping each semantic information in the characteristics of the candidate resources and each keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network respectively, and outputting semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information respectively, wherein the semantic mapping information is used for representing single semantics of the candidate resources, and the keyword mapping information is used for representing a keyword of the target text;
mapping the semantic mapping information and the keyword mapping information through a bidirectional attention module to obtain the first characteristic, wherein the first characteristic is matching information of the candidate resource and the target text;
mapping the features of the candidate resource as a whole and the information of all the keywords as a whole to a second semantic space through the third fully-connected neural network, and outputting the second feature and the third feature, wherein the second feature is used for representing all the semantics of the candidate resource, and the third feature is used for representing all the keywords of the target text;
fusing the first feature, the second feature and the third feature through a feature fusion layer to obtain a fourth feature;
and processing the fourth feature through a multilayer perceptron, and outputting a matching parameter of the candidate resource and the target text.
In some embodiments, the mapping the semantic mapping information and the keyword mapping information to obtain the first feature includes:
acquiring first matching information between the single semantic mapping information and all the keyword mapping information;
acquiring second matching information between the single keyword mapping information and all the semantic mapping information;
and splicing the first matching information and the second matching information to obtain the first characteristic.
In some embodiments, the method further comprises:
and determining the audio data in the candidate resource as the recommended resource.
In some embodiments, the determining at least one recommended resource from the plurality of candidate resources comprises:
and acquiring lyric characteristics in the audio data, matching the lyric characteristics with keyword information of the target text through a deep learning model to obtain the matching degree between the audio data and the target text, and determining the audio data with the matching degree meeting audio matching conditions as the recommended resource.
In some embodiments, the obtaining of the keyword information of the target text includes:
calculating the importance degrees of all words contained in the target text, sequencing the words in the target text based on the importance degrees, and taking the words at the first K positions as the keywords of the target text, wherein K is an integer which is greater than 0 and less than the number of the words contained in the target text;
and processing the extracted keywords to obtain the keyword information.
In some embodiments, the method further comprises:
and extracting a title from the target text according to each component of the text indicated by the text format template, and taking the extracted title as a keyword of the target text.
In some embodiments, the generating the target multimedia resource corresponding to the target text based on the at least one recommended resource includes any one of:
splicing the plurality of recommended resources to obtain the target multimedia resource;
the target multimedia asset is generated based on any multimedia asset template, which is multimedia asset data with editable segments.
In some embodiments, the splicing the recommended resources to obtain the target multimedia resource includes:
determining at least one data group from the plurality of recommended resources, wherein each data group includes at least one type of data among image data and video data whose content similarity meets a condition;
and splicing the data within each data group together, and then splicing the spliced data groups in sequence.
In some embodiments, the method further comprises:
and if the recommended resource comprises audio data, using the audio data as the generated background audio of the multimedia resource.
In some embodiments, the generating the target multimedia asset based on any multimedia asset template comprises:
inserting the recommended resource into the editable segment of the multimedia resource template to generate the target multimedia resource.
In some embodiments, the inserting the recommended resource in the editable segment of the multimedia resource template to generate the target multimedia resource comprises:
and inserting the recommended resources of the corresponding type into the corresponding positions of the multimedia resource template according to the data types corresponding to the editable segments in the multimedia resource template to obtain the target multimedia resources.
In some embodiments, the method further comprises:
acquiring the user click times of a plurality of videos of any object to be promoted, wherein the plurality of videos are generated based on different material resources;
and taking the material resource corresponding to the video whose user click times meet the second determination condition as the candidate resource.
According to a second aspect of the embodiments of the present disclosure, there is provided a multimedia resource generating apparatus, the apparatus including:
an acquiring unit configured to, in response to receiving a multimedia resource generation request, acquire a target text carried by the multimedia resource generation request;
the acquiring unit is further configured to acquire a plurality of candidate resources based on a tag of the target text, wherein the candidate resources are material resources whose tag similarity with the tag of the target text meets a first determination condition;
a determining unit configured to perform determining at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, the keyword information being used for representing a keyword in the target text, the recommended resource being a candidate resource for which a matching parameter with the target text meets a matching condition;
and the generating unit is configured to generate the target multimedia resource corresponding to the target text based on the at least one recommended resource.
In some embodiments, the determining unit is configured to process the features of the candidate resources and the keyword information of the target text through a multi-modal matching model to obtain matching parameters between the candidate resources and the target text, the multi-modal matching model being trained based on sample materials, sample texts, and matching parameters between the sample materials and the sample texts, and to determine at least one recommended resource from the candidate resources based on the matching parameters between the candidate resources and the target text.
In some embodiments, the multi-modal matching model comprises: a first fully-connected neural network, a second fully-connected neural network, a bidirectional attention module, a third fully-connected neural network, a feature fusion layer, and a multi-layer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the features of the candidate resources and the keyword information;
the bidirectional attention module is used for acquiring a first characteristic based on the semantic mapping information and the keyword mapping information;
the third fully-connected neural network is used for acquiring a second feature and a third feature based on the feature of the candidate resource and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multilayer perceptron is used for acquiring matching parameters of each candidate resource and the target text based on the fourth feature.
In some embodiments, the determining unit further comprises:
a mapping subunit, configured to perform mapping each semantic information in the features of the candidate resource and each keyword information of the target text to a first semantic space through the first fully-connected neural network and a second fully-connected neural network, and output semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information, respectively, where the semantic mapping information is used to represent a single semantic meaning of the candidate resource, and the keyword mapping information is used to represent a keyword of the target text;
the mapping subunit is configured to perform mapping on the semantic mapping information and the keyword mapping information through a bidirectional attention module to obtain the first feature, where the first feature is matching information of the candidate resource and the target text;
the mapping subunit is configured to perform overall mapping of the feature of the candidate resource and the feature information of all the keywords to a second semantic space through a third fully-connected neural network, and output the second feature and the third feature, where the second feature is used to represent all the semantics of the candidate resource, and the third feature is used to represent all the keywords of the target text;
a fusion subunit configured to perform fusion of the first feature, the second feature and the third feature through a feature fusion layer to obtain the fourth feature;
and the processing subunit is configured to perform processing on the fourth feature through a multilayer perceptron, and output a matching parameter of the candidate resource and the target text.
In some embodiments, the mapping subunit is configured to acquire first matching information between the single semantic mapping information and all the keyword mapping information, acquire second matching information between the single keyword mapping information and all the semantic mapping information, and concatenate the first matching information and the second matching information to obtain the first feature.
In some embodiments, the determining unit is configured to perform the determining of the audio data in the candidate resource as the recommended resource.
In some embodiments, the determining unit is configured to perform obtaining a lyric feature in the audio data, matching the lyric feature with keyword information of the target text through a deep learning model to obtain a matching degree between the audio data and the target text, and determining the audio data with the matching degree meeting an audio matching condition as the recommended resource.
In some embodiments, the obtaining unit is configured to perform calculating importance degrees of all words contained in the target text, rank the words in the target text based on the importance degrees, use the words ranked at the top K positions as keywords of the target text, where K is an integer greater than 0 and less than the number of words contained in the target text, and process the extracted keywords to obtain the keyword information.
In some embodiments, the obtaining unit is configured to perform extracting a title from the target text according to each component of the text indicated by the text format template, and taking the extracted title as a keyword of the target text.
In some embodiments, the generating unit is configured to perform any of:
splicing the plurality of recommended resources to obtain the target multimedia resource;
the target multimedia asset is generated based on any multimedia asset template, which is multimedia asset data with editable segments.
In some embodiments, the generating unit is configured to perform the steps of determining at least one data group including at least one type of data of image data and video data with content similarity meeting the condition, splicing the data in the data group, and sequentially splicing the spliced data groups according to the data groups.
In some embodiments, the generating unit is configured to perform, if the recommended resource includes audio data, using the audio data as the generated background audio of the multimedia resource.
In some embodiments, the apparatus further comprises:
an audio processing unit configured to perform inserting the recommended resource in the editable segment of the multimedia resource template to generate the target multimedia resource.
In some embodiments, the generating unit is configured to insert the recommended resource of the corresponding type into the corresponding position of the multimedia resource template according to the data type corresponding to each editable segment in the multimedia resource template, so as to obtain the target multimedia resource.
In some embodiments, the obtaining unit is configured to acquire user click times of a plurality of videos of any object to be promoted, the plurality of videos being generated based on different material resources, and to take the material resource corresponding to the video whose user click times meet the second determination condition as the candidate resource.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the multimedia resource generating method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium including: the program code in the computer readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the multimedia asset generation method described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the multimedia asset generation method described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a diagram illustrating an implementation environment for a method for generating a multimedia asset according to an exemplary embodiment;
FIG. 2 is a basic functional block diagram illustrating a method of multimedia asset generation according to an exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of multimedia asset generation according to an exemplary embodiment;
FIG. 4 is a flow chart illustrating a method of multimedia asset generation according to an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a method for story resource feature extraction in accordance with an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a method for target text keyword information extraction in accordance with an exemplary embodiment;
FIG. 7 is a diagram illustrating a multi-modal matching model architecture in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating a multimedia asset generation apparatus in accordance with an exemplary embodiment;
FIG. 9 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The data involved in the present disclosure may be data authorized by users or fully authorized by all parties.
Fig. 1 is a schematic diagram of an implementation environment of a multimedia resource generation method provided in an embodiment of the present disclosure, and referring to fig. 1, the implementation environment includes: a terminal 101 and a server 102.
The terminal 101 may be at least one of a smart phone, a smart watch, a desktop computer, a laptop computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, and the like. The terminal 101 has a communication function and can access the Internet. The terminal 101 may generally refer to one of a plurality of terminals, and this embodiment is illustrated only with the terminal 101; those skilled in the art will appreciate that the number of terminals may be greater or fewer. The terminal 101 is installed with and runs an application program supporting multimedia resource generation, which may be a live-streaming application, a video application, or another multimedia application.
The server 102 may be an independent physical server, a server cluster or a distributed file system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. Server 102 may have associated therewith a database for storing material assets including, but not limited to, image data, video data, audio data, and the like. The server 102 and the terminal 101 may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the application. Alternatively, the number of the servers 102 may be more or less, and the embodiment of the present application is not limited thereto. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
Based on the above implementation environment, fig. 2 is a basic schematic block diagram illustrating a multimedia resource generation method according to an exemplary embodiment, and the method is used in a server. As shown in fig. 2, the system architecture on which the multimedia resource generation method depends includes the following 5 parts:
(1) data layer
The data layer is used for storing material resources including but not limited to: image data, video data, audio data, and the like.
(2) Content understanding layer
The content understanding layer is used for extracting the characteristics of the material resources in the data layer, the obtained characteristics are used for matching with the target text in the subsequent matching layer, and the characteristic extraction comprises the following three steps:
Manual labeling: based on the specific content of the material resources, labels are manually added to the material resources in the data layer; for example, for image data containing city streets, a "city life" label is manually added to that image data.
Characteristic extraction: and based on the labeled material resources, extracting the characteristics of the material resources by adopting a machine learning or deep learning method. In some embodiments, the characteristics of the material resources include: structural information of image data and video data, audio data characteristics, and the like. The structure information of the image data and the video data refers to semantic information of the image data and the video data, and for example, for a piece of image data, if contents with obvious semantics in the image data include sky, buildings and people, the structure information of the image data is a part of the image data including the sky, the buildings and the people. Audio data features include, but are not limited to: pitch characteristics, rhythm characteristics, lyric characteristics, etc., which are not limited in this embodiment.
Feature vector representation: and performing vectorization representation on the extracted feature information based on a vector representation method to obtain feature vectors corresponding to the material resources.
(3) Recall layer
The recall layer is used for acquiring candidate resources based on the target text input by the user, wherein the candidate resources refer to material resources related to the target text. In some embodiments, the server obtains a plurality of candidate resources through a tag recall method, where the tag recall is to obtain a material resource related to the target text based on a tag of the material resource, and use the obtained material resource as a candidate resource. In some embodiments, the server obtains the candidate resource by a posterior data recall method, where the posterior data recall method is to take a material resource with a higher user preference degree as the candidate resource, for example, a first video and a second video are respectively generated based on the first material resource and the second material resource, and the server puts the first video and the second video on the same video platform, and if the first video is clicked more times by the user, the user preference degree of the first material resource is higher. Of course, the server may also obtain the candidate resource by other recall methods, which is not limited in this embodiment.
(4) Matching layer
The matching layer is used for matching the candidate resources with the target text and providing one or more video generation strategies. In some embodiments, the server first obtains the keyword information of the target text, and then, based on the features obtained by the content understanding layer, matches the keyword information with the features of each candidate resource using a multi-modal matching model to obtain a matching parameter between each candidate resource and the target text; the higher the value of the matching parameter, the higher the matching degree. Optionally, the matching parameter is a matching probability value. This process is described in detail in the embodiment corresponding to fig. 4 below.
In some embodiments, the server provides one or more generation policies for subsequent generation of the target multimedia asset. The video generation strategy refers to a method for generating multimedia resources based on recommended resources acquired by a subsequent generation layer, for example, a target multimedia resource is obtained by splicing a plurality of recommended resources, or the target multimedia resource is generated based on a template file and the recommended resources.
(5) Generation layer
The generation layer is used for providing a multimedia resource generation service. In some embodiments, the server obtains a plurality of recommended resources based on the matching parameters, where the recommended resources refer to candidate resources with larger matching parameters with the target text. And based on the acquired recommended resources, the server generates the target multimedia resources by adopting the generation strategy.
Fig. 3 is a flowchart illustrating a multimedia asset generation method according to an exemplary embodiment, which is used in a server, as shown in fig. 3, and includes the following steps:
in step 301, in response to receiving a multimedia resource generation request, a server obtains a target text carried by the multimedia resource generation request.
In step 302, the server obtains a plurality of candidate resources based on the tag of the target text, where the candidate resources are material resources whose similarity between the tag and the tag of the target text meets the first determination condition.
In step 303, the server determines at least one recommended resource from the plurality of candidate resources based on the keyword information of the target text, where the keyword information is used to represent the keyword in the target text, and the recommended resource is a candidate resource whose matching parameter with the target text meets the matching condition.
In step 304, the server generates the target multimedia resource corresponding to the target text based on the at least one recommended resource.
According to the technical scheme provided by the embodiment of the disclosure, based on the target text input by the user, the candidate resources related to the target text are obtained from a large number of material resources, the keyword extraction is carried out on the target text, the candidate resources are further matched with the target text through the multi-mode matching model, the recommended resources are further obtained, the target multimedia resources are rapidly and intelligently generated for the user based on the recommended resources, and meanwhile, the correlation between the generated target multimedia resources and the target text is ensured.
Fig. 4 is a flowchart illustrating a multimedia asset generating method used in a server according to an exemplary embodiment, and the following description takes video generation as an example. As shown in fig. 4, the method comprises the following steps:
in step 401, the server acquires a material resource.
In some embodiments, the server retrieves the material resources from a database.
In step 402, the server performs feature extraction on the material resources to obtain features corresponding to the material resources.
In some embodiments, in response to the material resource being image data, the server calls a Faster Region-based Convolutional Neural Network (Faster R-CNN), uses the image data as input data of the Faster R-CNN network, processes the image data through the network, and outputs the features corresponding to the image data. For example, as shown in fig. 5, the target image data in fig. 5 contains 4 pieces of content with obvious semantics; the Faster R-CNN network can extract these 4 pieces of semantic information, which constitute the features corresponding to the target image data. Optionally, the server performs vectorized representation of the 4 pieces of semantic information and takes the obtained semantic vectors as the features corresponding to the target image data.
In some embodiments, in response to that the material resource is video data, the server first decomposes the video data into multiple frames of images, and then performs feature extraction on the multiple frames of images respectively based on the above-described method for performing feature extraction on image data to obtain features corresponding to the video data.
By extracting the characteristics of the material resources, the subsequent matching with the target text is well guided.
It should be noted that, before feature extraction is performed on the material resources, the material resources need to be manually annotated. Both feature extraction and manual annotation are completed before the server receives a video generation request for the first time and only need to be performed once; there is no need to perform annotation and feature extraction in real time each time the server receives a video generation request.
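For illustration only, a minimal sketch of this offline feature-extraction step is given below, assuming PyTorch/torchvision and the pre-trained Faster R-CNN detector shipped with torchvision; the score threshold and the frame-sampling stride are illustrative assumptions and are not fixed by the disclosure.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Illustrative sketch: per-region semantic features of image data via a
# pre-trained Faster R-CNN; video data is decomposed into frames and reuses
# the image path. The threshold and frame stride are assumptions.

detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def image_features(image: Image.Image, score_threshold: float = 0.7) -> dict:
    """Return detected labels and boxes as the semantic features of the image."""
    with torch.no_grad():
        prediction = detector([to_tensor(image)])[0]
    keep = prediction["scores"] > score_threshold
    return {
        "labels": prediction["labels"][keep].tolist(),  # detected semantic categories
        "boxes": prediction["boxes"][keep].tolist(),    # image regions carrying them
    }

def video_features(frames: list, stride: int = 30) -> list:
    """Sample one frame out of every `stride` and extract image features from each."""
    return [image_features(frame) for frame in frames[::stride]]
```

In practice, the detected labels and region features would then be vectorized and stored alongside the manually annotated tags, so that they are already available when a video generation request arrives.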
In step 403, the server obtains the target text input by the user.
In some embodiments, the server obtains a target text input by the user, taking the production of an advertisement video as an example, where the target text is a representative text of an object to be promoted, and if an advertisement video needs to be generated for an electronic book application, the target text may be a profile of any electronic book, and the electronic book may be an electronic book provided by the application. The electronic book application is used for providing a text electronic book, an audio book, and the like, and the embodiment of the disclosure is not limited.
The server is provided with a video generation service, in some embodiments, a user can access the video generation service through the terminal, and input a target text based on the video generation service, trigger a video generation request, and in response to receiving the video generation request of the terminal, the server obtains the target text carried by the video generation request. And the server performs subsequent data recall and data matching based on the content of the target text.
In step 404, the server obtains a plurality of candidate resources based on the tag of the target text, where the candidate resources are material resources whose similarity between the tag and the tag of the target text meets the first determination condition.
In some embodiments, the server obtains the plurality of candidate resources through a tag recall method. The server calculates the similarity between the tag of each material resource and the tag of the target text, and takes the material resources whose tag similarity meets the first determination condition as the candidate resources.
In some embodiments, when calculating the similarity, the server inputs the tag information of the material resource and the tag information of the target text into a semantic similarity calculation model to obtain the similarity between the material resource tag and the tag of the target text. The semantic similarity calculation model may be a multilayer convolutional neural network, a recurrent neural network, or another deep learning model, which is not limited in this embodiment. Optionally, the determination condition is that the similarity between the material resource tag and the tag of the target text is greater than a first threshold, or that the similarity ranks among the top N in descending order, where N is an integer greater than 0 and less than the number of material resource tags.
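For illustration, a minimal sketch of the tag-recall step is shown below; the toy encode_tag function merely stands in for the semantic similarity calculation model, which the disclosure leaves open, and the threshold and N values are assumptions.

```python
import numpy as np

# Illustrative tag recall: score each material tag against the target-text tag
# and keep those meeting the first determination condition (threshold or top-N).

def encode_tag(tag: str, dim: int = 128) -> np.ndarray:
    """Toy stand-in for the semantic similarity model: hashed bag of words.
    In practice this would be a learned semantic encoder."""
    vec = np.zeros(dim)
    for word in tag.split():
        vec[hash(word) % dim] += 1.0
    return vec

def tag_recall(material_tags: dict, target_tag: str,
               first_threshold: float = 0.8, top_n: int = None) -> list:
    target_vec = encode_tag(target_tag)
    scored = []
    for material_id, tag in material_tags.items():
        vec = encode_tag(tag)
        sim = float(np.dot(vec, target_vec) /
                    (np.linalg.norm(vec) * np.linalg.norm(target_vec) + 1e-8))
        scored.append((material_id, sim))
    scored.sort(key=lambda item: item[1], reverse=True)
    if top_n is not None:
        return [material_id for material_id, _ in scored[:top_n]]
    return [material_id for material_id, sim in scored if sim > first_threshold]
```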
Alternatively, the server may obtain the candidate resource through an a posteriori data recall method. The server obtains the user click times of a plurality of videos of any object to be promoted, the videos are generated based on different material resources, and the material resources corresponding to the videos of which the user click times meet the second judgment condition are used as candidate resources. Optionally, the second determination condition is that the user click times of the videos are greater than a second threshold, or the user click times are located at the first S positions in descending order, where S is an integer greater than 0 and less than the number of the videos.
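Similarly, the posterior-data recall can be sketched as follows; the field names, the click threshold, and the top-S variant are illustrative assumptions.

```python
# Illustrative posterior-data recall: keep the material resources whose videos
# meet the second determination condition on user click times.

def posterior_recall(video_clicks: dict, video_to_material: dict,
                     second_threshold: int = 1000, top_s: int = None) -> list:
    ranked = sorted(video_clicks.items(), key=lambda kv: kv[1], reverse=True)
    if top_s is not None:
        selected = [video for video, _ in ranked[:top_s]]
    else:
        selected = [video for video, clicks in ranked if clicks > second_threshold]
    return [video_to_material[video] for video in selected]
```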
It should be noted that the server may obtain the candidate resources through one or more of the above-mentioned recall methods, and in response to the server adopting multiple recall methods, take the resources obtained by all the recall methods as the candidate resources, for example, the server obtains the first resource through the tag recall method, and obtains the second resource through the posterior data recall method, and then the candidate resources include the first resource and the second resource. Of course, the server may also obtain the candidate resource by other recall methods, which is not limited in this embodiment.
In step 405, the server performs keyword extraction on the target text to obtain keyword information, where the keyword information is used to represent a keyword in the target text.
In some embodiments, the server obtains the keyword information of the target text through Term Frequency-Inverse Document Frequency (TF-IDF) statistics and Bidirectional Encoder Representations from Transformers (BERT). As shown in fig. 6, the server calculates the importance degree of each word contained in the target text using the TF-IDF method, ranks the words in the target text based on the importance degrees, and takes the top K words as the keywords of the target text, where K is an integer greater than 0 and less than the number of words contained in the target text. The server then uses a BERT model to vectorize the K keywords, and the obtained K keyword vectors serve as the K pieces of keyword information.
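For illustration, the TF-IDF plus BERT pipeline described above could be sketched as follows, assuming scikit-learn for the TF-IDF statistics and a Hugging Face BERT checkpoint (the checkpoint name is an assumption) for keyword vectorization; a reference corpus is needed for the inverse-document-frequency part.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModel

# Illustrative keyword extraction: rank words by TF-IDF, keep the top K, then
# vectorize each keyword with BERT to obtain the keyword information.

def top_k_keywords(target_text: str, corpus: list, k: int = 5) -> list:
    """Rank words of the target text by TF-IDF importance and keep the top K."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus + [target_text])
    scores = vectorizer.transform([target_text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, score in ranked[:k] if score > 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
bert = AutoModel.from_pretrained("bert-base-chinese").eval()

def keyword_vectors(keywords: list) -> torch.Tensor:
    """Vectorize each keyword with BERT; the [CLS] embedding is the keyword information."""
    with torch.no_grad():
        encoded = tokenizer(keywords, padding=True, return_tensors="pt")
        return bert(**encoded).last_hidden_state[:, 0]  # shape (K, hidden_size)
```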
In some embodiments, the server extracts the title from the target text according to the components of the text indicated by the text format template, and takes the extracted title as the keyword of the target text.
Through the keyword extraction process, the characteristics of the target text are vectorized and expressed, and a better guiding effect is achieved on the subsequent matching process.
In step 406, the server inputs the features of the candidate resources and the keyword information of the target text into a multi-modal matching model to obtain matching parameters of the candidate resources and the target text, wherein the multi-modal matching model is obtained by training based on the sample materials, the sample text, and the matching parameters of the sample materials and the sample text.
In some embodiments, in response to the candidate resource being image data, the server obtains a feature of the image data and keyword information of the target text, inputs the feature and the keyword information into a multi-modal matching model, and obtains matching parameters of the image data and the target text, as shown in fig. 7, the multi-modal matching model includes: the system comprises a first fully-connected neural network, a second fully-connected neural network, a third fully-connected neural network, a bidirectional attention module, a feature fusion layer and a multilayer perceptron. The first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the characteristics and keyword information of candidate resources, the bidirectional attention module is used for acquiring first characteristics based on the semantic mapping information and the keyword mapping information, the third fully-connected neural network is used for acquiring second characteristics and third characteristics based on the characteristics and keyword information of the candidate resources, the characteristic fusion layer is used for acquiring fourth characteristics based on the first characteristics, the second characteristics and the third characteristics, and the multilayer perceptron is used for acquiring matching parameters of each candidate resource and a target text based on the fourth characteristics.
In some embodiments, the matching process includes the following steps 406A to 406E:
in step 406A, the server maps each semantic information in the features of the image data and each keyword information of the target text to a first semantic space through a first fully-connected neural network and a second fully-connected neural network, and outputs semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information, respectively, where the semantic mapping information is used to represent a single semantic meaning of the image data, and the keyword mapping information is used to represent a keyword of the target text. Optionally, the first fully-connected neural network and the second fully-connected neural network comprise a hidden layer, and the activation function of the first fully-connected neural network and the second fully-connected neural network is a sigmoid function.
Through the step 406A, each semantic information and each keyword information of the image data are independently mapped to the same semantic space, so that a guidance function is provided for subsequently acquiring matching information of the image data and the keyword data.
In step 406B, the server maps the semantic mapping information and the keyword mapping information through the bidirectional attention module to obtain a first feature, where the first feature is matching information between the image data and the target text. The mapping process includes: obtaining first matching information between single semantic mapping information and all the keyword mapping information, obtaining second matching information between single keyword mapping information and all the semantic mapping information, and splicing the first matching information and the second matching information to obtain the first feature. Optionally, the first matching information is the matching degree between the single semantic mapping information and all the keyword mapping information, and the second matching information is the matching degree between the single keyword mapping information and all the semantic mapping information. The first feature obtained based on the bidirectional attention mechanism contains matching information between the image data and the target text, which ensures the relevance between the subsequently generated target video and the target text.
In step 406C, the server maps the features of the image data as a whole and all the keyword information as a whole to a second semantic space through a third fully-connected neural network, and outputs a second feature and a third feature, where the second feature is used to represent all the semantic information of the image data, and the third feature is used to represent all the keywords of the target text. Optionally, the third fully-connected neural network includes a hidden layer, and the activation function of the third fully-connected neural network is a sigmoid function.
Through the step 406C, the features of the image data and all the keyword information are integrally mapped to the same semantic space, so that a better foundation is laid for subsequent feature fusion.
In step 406D, the server fuses the first feature, the second feature and the third feature through the feature fusion layer to obtain a fourth feature.
In some embodiments, the server fuses the first feature, the second feature, and the third feature using three methods, namely Kronecker product computation, vector concatenation, and a self-attention mechanism, to obtain the fourth feature.
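For illustration, the three fusion operations named above could be sketched as follows, assuming the first, second, and third features have already been projected to a common dimension; how the three fusion results are combined into the fourth feature is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

# Illustrative fusion of the first, second and third features into the fourth
# feature via Kronecker product, vector concatenation and self-attention.

def fuse(first: torch.Tensor, second: torch.Tensor, third: torch.Tensor) -> torch.Tensor:
    concat = torch.cat([first, second, third])                   # vector concatenation
    kron = torch.kron(second, third)                             # Kronecker product
    stack = torch.stack([first, second, third])                  # (3, dim), equal dims assumed
    weights = F.softmax(stack @ stack.t() / stack.shape[1] ** 0.5, dim=-1)
    attn = (weights @ stack).flatten()                           # self-attention output
    return torch.cat([concat, kron, attn])                       # fourth feature
```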
In step 406E, the server processes the fourth feature through a Multi-Layer perceptron (MLP), and outputs matching parameters of the image data and the target text.
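Putting steps 406A to 406E together, a minimal PyTorch sketch of the matching model might look as follows. The hidden sizes, the soft-attention form of the bidirectional matching, the mean-pooling of per-region features, and the use of plain concatenation in the fusion layer (in place of the richer fusion sketched above) are simplifying assumptions; the disclosure describes a single third fully-connected network, whereas two projections are used here because the two inputs have different dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalMatchingModel(nn.Module):
    """Illustrative sketch of the matching model described in steps 406A-406E."""

    def __init__(self, semantic_dim=2048, keyword_dim=768, hidden=256):
        super().__init__()
        # 406A: two fully-connected networks (one hidden layer, sigmoid activation)
        # mapping each semantic / keyword vector into the first shared semantic space.
        self.fc1 = nn.Sequential(nn.Linear(semantic_dim, hidden), nn.Sigmoid())
        self.fc2 = nn.Sequential(nn.Linear(keyword_dim, hidden), nn.Sigmoid())
        # 406C: mapping the resource features and all keyword information, each as a
        # whole, into a second space (modeled as two projections for simplicity).
        self.fc3_sem = nn.Sequential(nn.Linear(semantic_dim, hidden), nn.Sigmoid())
        self.fc3_kw = nn.Sequential(nn.Linear(keyword_dim, hidden), nn.Sigmoid())
        # 406E: multi-layer perceptron producing the matching parameter.
        self.mlp = nn.Sequential(
            nn.Linear(hidden * 4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def bidirectional_attention(self, sem, kw):
        # 406B: first matching info (each semantic vs. all keywords) and
        # second matching info (each keyword vs. all semantics), then spliced.
        scores = sem @ kw.t()                                   # (num_sem, num_kw)
        sem_to_kw = F.softmax(scores, dim=1) @ kw               # (num_sem, hidden)
        kw_to_sem = F.softmax(scores.t(), dim=1) @ sem          # (num_kw, hidden)
        return torch.cat([sem_to_kw.mean(0), kw_to_sem.mean(0)])  # first feature

    def forward(self, semantic_feats, keyword_feats):
        # semantic_feats: (num_sem, semantic_dim) per-region features of one image
        # keyword_feats:  (num_kw, keyword_dim) keyword information of the target text
        sem_map = self.fc1(semantic_feats)                      # semantic mapping info
        kw_map = self.fc2(keyword_feats)                        # keyword mapping info
        first = self.bidirectional_attention(sem_map, kw_map)   # first feature
        second = self.fc3_sem(semantic_feats.mean(0))           # all semantics of resource
        third = self.fc3_kw(keyword_feats.mean(0))              # all keywords of text
        fourth = torch.cat([first, second, third])              # 406D: fused feature
        return torch.sigmoid(self.mlp(fourth))                  # matching parameter
```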
In some embodiments, in response to the candidate resource being video data, the server obtains the features of the multiple frames of images corresponding to the video data, obtains, for each frame of image, the matching parameter between that image and the target text based on steps 406A to 406E, and calculates the average value of the matching parameters between all the images and the target text, which is the matching parameter between the video data and the target text.
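Continuing the sketch above, the per-frame averaging described for video data could be expressed as:

```python
import torch

# Illustrative averaging of per-frame matching parameters for video data,
# reusing the MultiModalMatchingModel sketched above.

def video_matching_parameter(model, frame_semantic_feats, keyword_feats) -> torch.Tensor:
    params = [model(frame, keyword_feats) for frame in frame_semantic_feats]
    return torch.stack(params).mean()
```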
In the process of training the multi-modal matching model, the server acquires training data, which includes sample materials, sample texts, and sample matching parameters. The training is carried out over multiple iterations. In any iteration, features are extracted from a sample material to obtain corresponding sample features, keywords are extracted from a sample text to obtain sample keyword information, the sample features and the sample keyword information are input into the model to be trained, and whether a training end condition is reached is determined based on the output predicted matching parameters and the sample matching parameters. If so, the model of this iteration is taken as the multi-modal matching model; if not, the model parameters are adjusted and the next iteration is executed based on the adjusted model until the training end condition is reached. Optionally, the training end condition is: the bucketing accuracy of the predicted matching parameters is greater than 0.95, or the number of iterations reaches a second threshold.
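For illustration, the iterative training described above could be sketched as follows; the optimizer, learning rate, binary cross-entropy loss, and the crude 0/1-bucket stand-in for the bucketing accuracy are assumptions.

```python
import torch

def train_matching_model(model, training_data, max_iterations=100, target_accuracy=0.95):
    """training_data: list of (sample_features, sample_keyword_info, sample_match) tensors."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.BCELoss()
    for _ in range(max_iterations):
        correct = 0
        for sample_feats, sample_keywords, sample_match in training_data:
            pred = model(sample_feats, sample_keywords)
            loss = loss_fn(pred, sample_match)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # crude stand-in for the bucketing accuracy: same 0/1 bucket as the label
            correct += int((pred > 0.5) == (sample_match > 0.5))
        if correct / len(training_data) >= target_accuracy:
            break  # training end condition reached
    return model
```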
It should be noted that the multi-modal matching model can obtain the matching parameters between the candidate resources and the target text. Experimental verification shows that appropriately increasing the number of hidden-layer nodes of the three fully-connected neural networks in the multi-modal matching model can remarkably improve the accuracy of the multi-modal matching model.
In step 407, the server determines at least one recommended resource from the plurality of candidate resources based on the matching parameter of the candidate resource and the target text.
In some embodiments, the server takes a candidate resource meeting a matching condition as a recommended resource based on a matching parameter of the candidate resource and the target text, and optionally, the matching condition is: the matching parameter of the candidate resource and the target text is larger than a third threshold, or the matching parameters are sorted from big to small at the top M positions, wherein M is an integer larger than 0 and smaller than the number of the candidate resources.
The above steps 406 to 407 only relate to matching the image data and the video data in the candidate resources against the target text. In some embodiments, for audio data in the candidate resources, the server directly determines the audio data as a recommended resource; or the server obtains lyric features in the audio data, matches the lyric features with the keyword information of the target text through a deep learning model to obtain the matching degree between the audio data and the target text, and determines the audio data whose matching degree meets an audio matching condition as a recommended resource, which is not limited in this embodiment.
In step 408, the server generates a target video related to the target text based on the at least one recommended resource.
In some embodiments, the generating process includes any one of the following implementations:
one implementation is as follows: and the server splices the plurality of recommended resources to obtain the target video. For example, the plurality of recommended resources are used for determining at least one data group, each data group comprises at least one type of data in image data and video data with content similarity meeting conditions, the data in one data group are spliced together, and the spliced data groups are sequentially spliced according to the data groups. And if the recommended resources comprise audio data, taking the audio data as the background audio of the generated video.
In some embodiments, the server generates the target video based on any video template, where a video template is video data with editable segments into which recommended resources can be inserted to generate the target video. In the process of generating the target video based on the video template, recommended resources of the corresponding types are inserted into the corresponding positions of the video template according to the data types corresponding to the editable segments in the video template, thereby obtaining the target video. Of course, the server may also generate the target video by other methods, which is not limited in this embodiment.
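For illustration, template-based generation could be sketched as follows; the segment structure and field names are assumptions.

```python
from dataclasses import dataclass

# Illustrative template filling: each editable segment declares a data type, and a
# recommended resource of that type is inserted at the segment's position.

@dataclass
class EditableSegment:
    position: int   # where the segment sits in the template timeline
    data_type: str  # e.g. "image", "video" or "audio"

def fill_template(segments: list, recommended: dict) -> dict:
    """recommended maps data type -> list of resources; returns position -> resource."""
    filled = {}
    pools = {data_type: list(resources) for data_type, resources in recommended.items()}
    for segment in sorted(segments, key=lambda s: s.position):
        pool = pools.get(segment.data_type, [])
        if pool:
            filled[segment.position] = pool.pop(0)
    return filled
```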
According to the technical scheme provided by the embodiment of the disclosure, firstly, candidate resources related to a target text are obtained from a large number of material resources based on the target text input by a user, then keywords are extracted from the target text, the candidate resources are further matched with the target text through a multi-mode matching model, recommended resources are further obtained, a target video is rapidly and intelligently generated for the user based on the recommended resources, and meanwhile, the correlation between the generated target video and the target text is guaranteed.
Fig. 8 is a block diagram illustrating a multimedia asset generation apparatus according to an example embodiment. Referring to fig. 8, the apparatus includes an acquisition unit 801, a determination unit 802, and a generation unit 803.
An obtaining unit 801 configured to perform, in response to receiving a multimedia resource generation request, obtaining a target text carried by the multimedia resource generation request;
the obtaining unit 801 is further configured to acquire a plurality of candidate resources based on a tag of the target text, where the candidate resources are material resources whose tag similarity with the tag of the target text meets a first determination condition;
a determining unit 802 configured to determine at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, the keyword information being used for representing each keyword in the target text, and the recommended resource being a candidate resource whose matching parameter with the target text meets a matching condition;
a generating unit 803 configured to generate the target multimedia resource corresponding to the target text based on the at least one recommended resource.
In some embodiments, the determining unit 802 is configured to process the features of the candidate resources and the keyword information of the target text through a multi-modal matching model to obtain matching parameters between the candidate resources and the target text, the multi-modal matching model being trained based on sample materials, sample texts, and matching parameters between the sample materials and the sample texts, and to determine at least one recommended resource from the candidate resources based on the matching parameters between the candidate resources and the target text.
In some embodiments, the multi-modal matching model comprises: a first fully-connected neural network, a second fully-connected neural network, a bidirectional attention module, a third fully-connected neural network, a feature fusion layer and a multilayer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the features of the candidate resources and the keyword information;
the bidirectional attention module is used for acquiring a first feature based on the semantic mapping information and the keyword mapping information;
the third fully-connected neural network is used for acquiring a second feature and a third feature based on the feature of the candidate resource and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multilayer perceptron is used for acquiring matching parameters of each candidate resource and the target text based on the fourth feature.
In some embodiments, the determining unit 802 further comprises:
a mapping subunit, configured to map each piece of semantic information in the features of the candidate resource and each piece of keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network respectively, and to output the semantic mapping information corresponding to each piece of semantic information and the keyword mapping information corresponding to each piece of keyword information, where the semantic mapping information represents a single semantic meaning of the candidate resource, and the keyword mapping information represents a keyword of the target text;
the mapping subunit is further configured to map the semantic mapping information and the keyword mapping information through the bidirectional attention module to obtain the first feature, where the first feature is matching information between the candidate resource and the target text;
the mapping subunit is further configured to map the features of the candidate resource as a whole, and all the keyword information as a whole, to a second semantic space through the third fully-connected neural network, and to output the second feature and the third feature, where the second feature represents all the semantics of the candidate resource, and the third feature represents all the keywords of the target text;
a fusion subunit, configured to fuse the first feature, the second feature and the third feature through the feature fusion layer to obtain the fourth feature;
and a processing subunit, configured to process the fourth feature through the multilayer perceptron and output the matching parameter between the candidate resource and the target text.
In some embodiments, the mapping subunit is configured to obtain first matching information between each single piece of semantic mapping information and all the keyword mapping information, obtain second matching information between each single piece of keyword mapping information and all the semantic mapping information, and concatenate the first matching information and the second matching information to obtain the first feature.
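As an illustration of how such a matching model could be instantiated, the following PyTorch sketch follows the structure above; the feature dimensions, the dot-product form of the bidirectional attention, the mean pooling used to build whole-resource and whole-text representations, and all layer sizes are assumptions of this sketch and are not fixed by the disclosure.

```python
# PyTorch sketch of the multi-modal matching model described above (assumed details noted).
import torch
import torch.nn as nn

class MultiModalMatcher(nn.Module):
    def __init__(self, sem_dim: int = 512, kw_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.fc_semantic = nn.Linear(sem_dim, hidden)   # first fully-connected network
        self.fc_keyword = nn.Linear(kw_dim, hidden)     # second fully-connected network
        self.fc_global = nn.Linear(sem_dim + kw_dim, 2 * hidden)  # third network -> 2nd & 3rd features
        self.fusion = nn.Linear(4 * hidden, hidden)     # feature fusion layer
        self.mlp = nn.Sequential(                       # multilayer perceptron head
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, semantics: torch.Tensor, keywords: torch.Tensor) -> torch.Tensor:
        # semantics: (n_sem, sem_dim), per-semantic features of one candidate resource
        # keywords:  (n_kw,  kw_dim),  per-keyword information of the target text
        sem_map = self.fc_semantic(semantics)           # semantic mapping information
        kw_map = self.fc_keyword(keywords)              # keyword mapping information

        # Bidirectional attention: each single semantic attends over all keywords,
        # each single keyword attends over all semantics; the two matching signals
        # are concatenated into the first feature.
        attn_sk = torch.softmax(sem_map @ kw_map.T, dim=-1)   # (n_sem, n_kw)
        first_match = (attn_sk @ kw_map).mean(dim=0)
        attn_ks = torch.softmax(kw_map @ sem_map.T, dim=-1)   # (n_kw, n_sem)
        second_match = (attn_ks @ sem_map).mean(dim=0)
        first_feature = torch.cat([first_match, second_match], dim=-1)

        # Third network maps the candidate resource as a whole and all keyword
        # information as a whole to a second semantic space.
        global_in = torch.cat([semantics.mean(dim=0), keywords.mean(dim=0)], dim=-1)
        second_feature, third_feature = self.fc_global(global_in).chunk(2, dim=-1)

        fourth_feature = self.fusion(
            torch.cat([first_feature, second_feature, third_feature], dim=-1))
        return self.mlp(fourth_feature)                 # matching parameter in [0, 1]
```

Under these assumptions, a matching parameter would be computed by calling the model once per candidate resource, and the candidate resources whose scores meet the matching condition would be kept as recommended resources.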
In some embodiments, the determining unit 802 is configured to determine the audio data among the candidate resources as a recommended resource.
In some embodiments, the determining unit 802 is configured to obtain a lyric feature from the audio data, match the lyric feature with the keyword information of the target text through a deep learning model to obtain a matching degree between the audio data and the target text, and determine the audio data whose matching degree meets an audio matching condition as the recommended resource.
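A minimal sketch of this audio selection step follows; the embedding function stands in for the deep learning model, and both it and the threshold are assumptions of the sketch rather than details fixed by the disclosure.

```python
# Sketch of selecting audio whose lyric feature matches the keyword information.
from typing import Callable, Dict, List
import math

AUDIO_MATCH_THRESHOLD = 0.6  # assumed audio matching condition

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def pick_background_audio(audio_candidates: List[Dict],
                          keyword_vector: List[float],
                          embed_lyrics: Callable[[str], List[float]]) -> List[Dict]:
    """Keep audio whose lyric feature matches the keyword information;
    embed_lyrics stands in for the deep learning model."""
    selected = []
    for audio in audio_candidates:
        lyric_vector = embed_lyrics(audio["lyrics"])    # lyric feature
        if cosine(lyric_vector, keyword_vector) >= AUDIO_MATCH_THRESHOLD:
            selected.append(audio)
    return selected
```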
In some embodiments, the obtaining unit 801 is configured to calculate the importance of each word contained in the target text, rank the words in the target text by importance, take the words at the first K positions as the keywords of the target text, where K is an integer greater than 0 and less than the number of words contained in the target text, and process the extracted keywords to obtain the keyword information.
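One possible reading of this keyword extraction step is sketched below, using a TF-IDF-style importance score; the tokenizer, the importance measure and the value of K are assumptions (for Chinese text a word segmenter would be needed in place of the simple regular expression).

```python
# Sketch of top-K keyword extraction by word importance.
from collections import Counter
from typing import Dict, List
import math
import re

def extract_keywords(target_text: str, idf: Dict[str, float], k: int = 5) -> List[str]:
    """Score every word by importance (here TF-IDF), rank the words, and return
    the words at the first K positions as the keywords of the target text."""
    words = re.findall(r"\w+", target_text.lower())  # for Chinese text, use a segmenter instead
    tf = Counter(words)
    importance = {w: tf[w] * idf.get(w, math.log(2.0)) for w in tf}
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]

# "Processing the extracted keywords" into keyword information could then mean,
# for example, looking each keyword up in a pretrained word-embedding table.
```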
In some embodiments, the obtaining unit 801 is configured to extract a title from the target text according to the text components indicated by a text format template, and to take the extracted title as a keyword of the target text.
In some embodiments, the generating unit 803 is configured to perform any of the following:
splicing the plurality of recommended resources to obtain the target multimedia resource;
generating the target multimedia resource based on any multimedia resource template, where the multimedia resource template is multimedia data with editable segments.
In some embodiments, the generating unit 803 is configured to determine at least one data group based on the plurality of recommended resources, where each data group contains at least one type of data, among image data and video data, whose content similarity meets a condition, to splice the data within each data group, and to splice the spliced data groups in sequence.
In some embodiments, the generating unit 803 is configured to, if the recommended resources include audio data, take the audio data as the background audio of the generated multimedia resource.
In some embodiments, the apparatus further comprises:
an audio processing unit configured to insert the recommended resource into the editable segment of the multimedia resource template to generate the target multimedia resource.
In some embodiments, the generating unit 803 is configured to insert the recommended resource of the corresponding type into the corresponding position of the multimedia resource template according to the data type corresponding to each editable segment in the multimedia resource template, so as to obtain the target multimedia resource.
In some embodiments, the obtaining unit 801 is configured to obtain the user click counts of multiple videos of any object to be promoted, where the multiple videos are generated based on different material resources, and to take the material resource corresponding to a video whose user click count meets the second determination condition as a candidate resource.
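A minimal sketch of this click-feedback filter follows; the record fields and the use of a fixed minimum click count as the second determination condition are assumptions of the sketch, since the disclosure leaves the condition open.

```python
# Sketch of selecting candidate material resources from user click feedback.
from typing import Dict, List

MIN_CLICKS = 1000  # assumed form of the "second determination condition"

def candidates_from_click_feedback(videos: List[Dict]) -> List[str]:
    """videos: records like {"material_id": "...", "clicks": 1234} for the multiple
    videos of one object to be promoted, each generated from a different material."""
    return [v["material_id"] for v in videos if v["clicks"] >= MIN_CLICKS]
```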
The above embodiment is described by taking an electronic device as an example, and the configuration of the electronic device is described below. Fig. 9 is a block diagram of an electronic device according to an exemplary embodiment. The electronic device 900 may vary considerably in configuration or performance, and may include one or more processors (CPU) 901 and one or more memories 902, where the one or more memories 902 store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors 901 to implement the processes executed by the electronic device in the multimedia resource generation method provided by the foregoing method embodiments. Of course, the electronic device 900 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the electronic device 900 may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as the memory 902, comprising program code executable by the processor 901 of the electronic device 900 to perform the above multimedia resource generation method. Alternatively, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the multimedia resource generation method described above.
In some embodiments, the computer program according to the embodiments of the present disclosure may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network; the multiple computer devices distributed at multiple sites and interconnected by the communication network may constitute a blockchain system.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for generating a multimedia resource, the method comprising:
in response to receiving a multimedia resource generation request, acquiring a target text carried by the multimedia resource generation request;
acquiring a plurality of candidate resources based on the label of the target text, wherein the candidate resources are material resources of which the similarity between the label and the label of the target text meets a first judgment condition;
determining at least one recommended resource from the candidate resources based on keyword information of the target text, wherein the keyword information is used for representing keywords in the target text, and the recommended resource is a candidate resource of which the matching parameters with the target text meet matching conditions;
and generating the target multimedia resource corresponding to the target text based on the at least one recommended resource.
2. The method of claim 1, wherein the determining at least one recommended resource from the plurality of candidate resources based on the keyword information of the target text comprises:
processing the characteristics of the candidate resources and the keyword information of the target text through a multi-modal matching model to obtain matching parameters of the candidate resources and the target text, wherein the multi-modal matching model is obtained by training based on sample materials, the sample text and the matching parameters of the sample materials and the sample text;
determining at least one recommended resource from the plurality of candidate resources based on the matching parameters of the candidate resources and the target text.
3. The method according to claim 2, wherein the multi-modal matching model comprises: a first fully-connected neural network, a second fully-connected neural network, a bidirectional attention module, a third fully-connected neural network, a feature fusion layer and a multilayer perceptron;
the first fully-connected neural network and the second fully-connected neural network are used for acquiring semantic mapping information and keyword mapping information based on the features of the candidate resources and the keyword information;
the bidirectional attention module is used for acquiring a first feature based on the semantic mapping information and the keyword mapping information;
the third fully-connected neural network is used for acquiring a second feature and a third feature based on the features of the candidate resources and the keyword information;
the feature fusion layer is used for acquiring a fourth feature based on the first feature, the second feature and the third feature;
the multilayer perceptron is used for acquiring matching parameters of the candidate resources and the target text based on the fourth feature.
4. The method according to claim 3, wherein the processing the features of the candidate resources and the keyword information of the target text through a multi-modal matching model to obtain the matching parameters of the candidate resources and the target text comprises:
mapping each semantic information in the characteristics of the candidate resources and each keyword information of the target text to a first semantic space through the first fully-connected neural network and the second fully-connected neural network respectively, and outputting semantic mapping information corresponding to each semantic information and keyword mapping information corresponding to each keyword information respectively, wherein the semantic mapping information is used for representing single semantics of the candidate resources, and the keyword mapping information is used for representing a keyword of the target text;
mapping the semantic mapping information and the keyword mapping information through a bidirectional attention module to obtain the first characteristic, wherein the first characteristic is matching information of the candidate resource and the target text;
mapping the features of the candidate resources as a whole and all the keyword information as a whole to a second semantic space through the third fully-connected neural network respectively, and outputting the second feature and the third feature, wherein the second feature is used for representing all the semantics of the candidate resources, and the third feature is used for representing all the keywords of the target text;
fusing the first feature, the second feature and the third feature through a feature fusion layer to obtain the fourth feature;
and processing the fourth feature through a multilayer perceptron, and outputting a matching parameter of the candidate resource and the target text.
5. The method of claim 4, wherein the mapping the semantic mapping information and the keyword mapping information to obtain the first feature comprises:
acquiring first matching information of the single semantic mapping information and all the keyword mapping information;
acquiring second matching information of the single keyword mapping information and all the semantic mapping information;
and splicing the first matching information and the second matching information to obtain the first characteristic.
6. The method of claim 1, wherein the determining at least one recommended resource from the plurality of candidate resources comprises:
acquiring a lyric feature from the audio data, matching the lyric feature with the keyword information of the target text through a deep learning model to obtain a matching degree between the audio data and the target text, and determining the audio data whose matching degree meets an audio matching condition as the recommended resource.
7. The method according to claim 1, wherein the obtaining of the keyword information of the target text comprises:
calculating the importance degrees of all words contained in the target text, sequencing the words in the target text based on the importance degrees, and taking the words at the first K positions as the keywords of the target text, wherein K is an integer which is greater than 0 and less than the number of the words contained in the target text;
and processing the extracted keywords to obtain the keyword information.
8. An apparatus for generating a multimedia resource, the apparatus comprising:
an acquisition unit, configured to, in response to receiving a multimedia resource generation request, acquire a target text carried by the multimedia resource generation request;
the acquisition unit is further configured to acquire a plurality of candidate resources based on the label of the target text, wherein the candidate resources are material resources of which the similarity between the label and the label of the target text meets a first judgment condition;
a determining unit, configured to determine at least one recommended resource from the plurality of candidate resources based on keyword information of the target text, wherein the keyword information is used for representing a keyword in the target text, and the recommended resource is a candidate resource whose matching parameter with the target text meets a matching condition;
the generating unit is configured to generate the target multimedia resource corresponding to the target text based on the at least one recommended resource.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the multimedia resource generation method of any of claims 1 to 7.
10. A computer-readable storage medium, wherein program code in the computer-readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the multimedia resource generation method of any of claims 1 to 7.
CN202110598129.7A 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium Active CN113377971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110598129.7A CN113377971B (en) 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110598129.7A CN113377971B (en) 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113377971A true CN113377971A (en) 2021-09-10
CN113377971B CN113377971B (en) 2024-02-27

Family

ID=77574982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110598129.7A Active CN113377971B (en) 2021-05-31 2021-05-31 Multimedia resource generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113377971B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114295A (en) * 2007-08-11 2008-01-30 腾讯科技(深圳)有限公司 Method for searching on-line advertisement resource and device thereof
WO2018049960A1 (en) * 2016-09-14 2018-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resource for text information
CN106997243A (en) * 2017-03-28 2017-08-01 北京光年无限科技有限公司 Speech scene monitoring method and device based on intelligent robot
CN107562847A (en) * 2017-08-25 2018-01-09 广东欧珀移动通信有限公司 Information processing method and related product
CN109165283A (en) * 2018-08-20 2019-01-08 北京智能管家科技有限公司 Resource recommendation method, device, equipment and storage medium
CN111125086A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Method, device, storage medium and processor for acquiring data resources
CN110267067A (en) * 2019-06-28 2019-09-20 广州酷狗计算机科技有限公司 Live streaming room recommendation method, apparatus, device, and storage medium
CN112182281A (en) * 2019-07-05 2021-01-05 腾讯科技(深圳)有限公司 Audio recommendation method and device and storage medium
CN110516110A (en) * 2019-07-22 2019-11-29 平安科技(深圳)有限公司 Song generation method, device, computer equipment and storage medium
CN110930969A (en) * 2019-10-14 2020-03-27 科大讯飞股份有限公司 Background music determination method and related equipment
CN110852047A (en) * 2019-11-08 2020-02-28 腾讯科技(深圳)有限公司 Text score method, device and computer storage medium
CN111767738A (en) * 2020-03-30 2020-10-13 北京沃东天骏信息技术有限公司 Label checking method, device, equipment and storage medium
CN111611490A (en) * 2020-05-25 2020-09-01 北京达佳互联信息技术有限公司 Resource searching method, device, equipment and storage medium
CN111935537A (en) * 2020-06-30 2020-11-13 百度在线网络技术(北京)有限公司 Music video generation method and device, electronic equipment and storage medium
CN111931041A (en) * 2020-07-03 2020-11-13 武汉卓尔数字传媒科技有限公司 Label recommendation method and device, electronic equipment and storage medium
CN112035748A (en) * 2020-09-04 2020-12-04 腾讯科技(深圳)有限公司 Information recommendation method and device, electronic equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023077642A1 (en) * 2021-11-05 2023-05-11 海南大学 Dikw resource transmission method and apparatus for intention calculation and reasoning
CN113821690A (en) * 2021-11-23 2021-12-21 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and device, electronic equipment and storage medium
CN113821690B (en) * 2021-11-23 2022-03-08 阿里巴巴达摩院(杭州)科技有限公司 Data processing method and device, electronic equipment and storage medium
CN114419514A (en) * 2022-01-26 2022-04-29 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN114419514B (en) * 2022-01-26 2024-04-19 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN114610429A (en) * 2022-03-14 2022-06-10 北京达佳互联信息技术有限公司 Multimedia interface display method and device, electronic equipment and storage medium
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium
CN116992111A (en) * 2023-09-28 2023-11-03 中国科学技术信息研究所 Data processing method, device, electronic equipment and computer storage medium
CN116992111B (en) * 2023-09-28 2023-12-26 中国科学技术信息研究所 Data processing method, device, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
CN113377971B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN113377971B (en) Multimedia resource generation method and device, electronic equipment and storage medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN105144164B (en) Scoring concept terms using a deep network
CN106973244A (en) Using weak supervision for image captioning
CN111258995B (en) Data processing method, device, storage medium and equipment
CN112015949A (en) Video generation method and device, storage medium and electronic equipment
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN112948708A (en) Short video recommendation method
CN111737559A (en) Resource sorting method, method for training sorting model and corresponding device
CN106407381A (en) Method and device for pushing information based on artificial intelligence
CN112131345B (en) Text quality recognition method, device, equipment and storage medium
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN111144936B (en) Similar crowd expansion method and device based on user labels
CN113919360A (en) Semantic understanding method, voice interaction method, device, equipment and storage medium
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113704509A (en) Multimedia recommendation method and device, electronic equipment and storage medium
CN114281935A (en) Training method, device, medium and equipment for search result classification model
CN113379449A (en) Multimedia resource recall method and device, electronic equipment and storage medium
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116978028A (en) Video processing method, device, electronic equipment and storage medium
CN111222011B (en) Video vector determining method and device
CN112749553B (en) Text information processing method and device for video file and server

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant