CN116561350B - Resource generation method and related device

Resource generation method and related device

Info

Publication number
CN116561350B
CN116561350B (application number CN202310831428.XA)
Authority
CN
China
Prior art keywords
text
basic
candidate
resource
candidate text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310831428.XA
Other languages
Chinese (zh)
Other versions
CN116561350A (en)
Inventor
冯鑫 (Feng Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310831428.XA
Publication of CN116561350A
Application granted
Publication of CN116561350B

Classifications

    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/334: Query execution (information retrieval of unstructured textual data; querying; query processing)
    • G06F 16/483: Retrieval characterised by using metadata automatically derived from the content
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a resource generation method and a related device in the field of artificial intelligence. The method includes the following steps: determining a basic emotion label corresponding to a basic resource to be matched and a basic rhyme foot corresponding to the basic text in the basic resource; searching a candidate text library for a reference candidate text matching the basic resource based on the basic emotion label and the basic rhyme foot, where the candidate text library stores a plurality of candidate texts together with the emotion label and rhyme foot corresponding to each candidate text, the emotion label corresponding to the reference candidate text and the basic emotion label satisfy a preset emotion matching condition, and the rhyme foot corresponding to the reference candidate text and the basic rhyme foot satisfy a preset rhyming condition; and generating a target media resource from the basic resource and the reference candidate text. The method can automatically generate new media resources from a basic resource and a text of a different source, improving both the efficiency of media resource generation and the quality of the generated media resources.

Description

Resource generation method and related device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a resource generating method and a related device.
Background
Splicing together two texts from different sources to form a rhyming passage is a popular form of media-resource creation today. For example, on some short-video platforms, a video producer may pick a line from a film or television drama, select a piece of classical poetry that rhymes with it, and then clip the line together with the film fragment it belongs to, obtaining a corresponding short-video resource.
In the related art, the two texts are typically selected manually by a resource producer, who then produces the corresponding media resource from them. This approach depends heavily on the producer's personal knowledge: in many cases the manually selected texts do not match well, so the quality of the produced media resources is poor; moreover, producing media resources this way usually takes a long time and is inefficient.
Disclosure of Invention
The embodiments of the present application provide a resource generation method and a related device, which can automatically generate new media resources from a basic resource and a text of a different source, improving both the efficiency of media resource generation and the quality of the generated media resources.
In view of this, a first aspect of the present application provides a resource generating method, the method including:
for a basic resource to be matched, determining a basic emotion label corresponding to the basic resource and a basic rhyme foot corresponding to the basic text in the basic resource;
searching a candidate text library for a reference candidate text matching the basic resource based on the basic emotion label and the basic rhyme foot; the candidate text library stores a plurality of candidate texts, the emotion label corresponding to each candidate text, and the rhyme foot corresponding to each candidate text; a preset emotion matching condition is satisfied between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyming condition is satisfied between the rhyme foot corresponding to the reference candidate text and the basic rhyme foot;
and generating a target media resource from the basic resource and the reference candidate text.
A second aspect of the present application provides a resource generating apparatus, the apparatus comprising:
a basic information determining module, configured to determine, for a basic resource to be matched, a basic emotion label corresponding to the basic resource and a basic rhyme foot corresponding to the basic text in the basic resource;
a matching text searching module, configured to search a candidate text library for a reference candidate text matching the basic resource based on the basic emotion label and the basic rhyme foot; the candidate text library stores a plurality of candidate texts, the emotion label corresponding to each candidate text, and the rhyme foot corresponding to each candidate text; a preset emotion matching condition is satisfied between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyming condition is satisfied between the rhyme foot corresponding to the reference candidate text and the basic rhyme foot;
and a resource generation module, configured to generate a target media resource from the basic resource and the reference candidate text.
A third aspect of the present application provides a computer device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to execute the steps of the resource generating method according to the first aspect described above according to the computer program.
A fourth aspect of the present application provides a computer readable storage medium for storing a computer program for executing the steps of the resource generating method according to the first aspect described above.
A fifth aspect of the present application provides a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the steps of the resource generating method described in the first aspect.
From the above technical solutions, the embodiments of the present application have the following advantages:
The embodiments of the present application provide a resource generation method. In this method, for a basic resource to be matched, a basic emotion label corresponding to the basic resource and a basic rhyme foot corresponding to the basic text in the basic resource are determined first; then, a reference candidate text matching the basic resource is searched for in a candidate text library based on the basic emotion label and the basic rhyme foot, where the candidate text library stores a plurality of candidate texts together with the emotion label and rhyme foot corresponding to each candidate text, a preset emotion matching condition is satisfied between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyming condition is satisfied between the rhyme foot corresponding to the reference candidate text and the basic rhyme foot; finally, a new target media resource is generated from the basic resource and the reference candidate text. Compared with the related-art approach in which a resource producer manually selects two resources and synthesizes a new media resource from them, the solution provided by the embodiments of the present application can search a candidate text library containing a large number of candidate texts, from the two angles of emotion matching and rhyming, for a reference candidate text whose emotion matches the basic emotion label of the basic resource and which rhymes with the basic text in the basic resource. On the one hand, the search is no longer limited by the scope of any individual's knowledge, so the reference candidate text matching the basic resource can be selected from a much larger range of texts; on the other hand, because the reference candidate text is searched for based on information of two dimensions, namely the emotion label and the rhyme foot, the found reference candidate text matches the basic resource better, which in turn ensures that the target media resource generated from the basic resource and the reference candidate text is of higher quality. In addition, the target media resource generation flow provided by the embodiments of the present application is fully automatic; compared with manual production, it greatly shortens the time needed to generate a media resource and improves generation efficiency.
Drawings
Fig. 1 is an application scenario schematic diagram of a resource generating method provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a resource generating method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a processing architecture of a basic video resource according to an embodiment of the present application;
fig. 4 is a schematic diagram of the working principle of a Swin Transformer structure according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a working principle of a Patch Partition module provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of Swin Transformer Block provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the initial features input to the Bert structure provided in an embodiment of the present application;
fig. 8 is a schematic diagram of the attention mechanism in the Bert structure provided in the embodiment of the present application;
fig. 9 is a schematic diagram of a construction flow of a candidate text library according to an embodiment of the present application;
fig. 10 is a schematic diagram of a construction architecture of a candidate text library according to an embodiment of the present application;
fig. 11 is a schematic diagram of a mapping relationship between finals and rhyme feet according to an embodiment of the present application;
fig. 12 is a schematic diagram of a mapping relationship between rhyme feet and candidate texts according to an embodiment of the present application;
fig. 13 is a schematic diagram of a search dictionary corresponding to a candidate text library according to an embodiment of the present application;
Fig. 14 is a schematic diagram of a search flow of a reference candidate text according to an embodiment of the present application;
fig. 15 is a schematic diagram of an implementation architecture for searching for a reference candidate text according to an embodiment of the present application;
fig. 16 is a schematic diagram of an overall implementation architecture of a resource generating method according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a resource generating device according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 19 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving and intelligent transportation.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers, in place of human eyes, to identify and measure targets and perform other machine-vision tasks, and further performs graphic processing so that the computer produces images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theory and technology in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
The scheme provided by the embodiment of the application relates to the technologies of computer vision, natural language processing and the like of artificial intelligence, and is specifically described by the following embodiments:
the resource generating method provided by the embodiment of the application can be executed by computer equipment, and the computer equipment can be terminal equipment or a server. The terminal equipment comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals, aircrafts and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server.
It should be noted that the information (including but not limited to information related to the basic resources and the candidate texts), data and signals involved in the embodiments of the present application are all authorized by the relevant objects or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions.
In order to facilitate understanding of the resource generating method provided in the embodiments of the present application, an application scenario of the method is described below, taking a server as the execution body of the method as an example.
Referring to fig. 1, fig. 1 is an application scenario schematic diagram of a resource generating method provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes a server 110, a terminal device 120, and a database 130; the server 110 and the terminal device 120 may communicate directly or indirectly through wired or wireless communication; the server 110 may access the database 130 through a network, and the database 130 may be provided separately or may be integrated into the server 110 or other devices. The server 110 is configured to execute the resource generating method provided in the embodiment of the present application, the terminal device 120 is configured to provide a basic resource to the server 110, and the database 130 is a candidate text library in the embodiment of the present application.
In practical applications, the terminal device 120 may transmit a basic resource to the server 110, where the basic resource may be a basic video resource (e.g., a film or television clip, an originally created video clip, etc.) or a basic text resource (e.g., an originally created text passage, etc.).
After receiving the basic resource, the server 110 may determine a basic emotion label corresponding to the basic resource and a basic rhyme foot corresponding to the basic text in the basic resource; the basic emotion label is used to characterize the emotion category to which the content expressed by the basic resource belongs, and the basic rhyme foot may be the last character or the last two characters of the basic text. Then, the server 110 may access the database 130 serving as the candidate text library and search it for a reference candidate text matching the basic resource based on the basic emotion label and the basic rhyme foot; the candidate text library stores a large number of candidate texts together with the emotion label and rhyme foot corresponding to each candidate text, a preset emotion matching condition is required to be satisfied between the emotion label corresponding to the reference candidate text to be found and the basic emotion label, and a preset rhyming condition is required to be satisfied between the rhyme foot corresponding to the reference candidate text and the basic rhyme foot. Finally, the server 110 may generate a new target media resource from the basic resource provided by the terminal device 120 and the reference candidate text acquired from the database 130, that is, synthesize the basic resource and the reference candidate text to obtain the target media resource. Optionally, the server 110 may transmit the generated target media resource to the terminal device 120 so that the target media resource is displayed through the terminal device 120.
Compared with the related-art approach in which a resource producer manually selects two resources and synthesizes a new media resource from them, the embodiments of the present application can search a candidate text library containing a large number of candidate texts for a reference candidate text whose emotion matches the basic emotion label of the basic resource and which rhymes with the basic text in the basic resource. On the one hand, the search is no longer limited by the scope of any individual's knowledge, so the reference candidate text matching the basic resource can be selected from a much larger range of texts; on the other hand, because the reference candidate text is searched for based on both the emotion label and the rhyme foot, the found reference candidate text matches the basic resource better, which in turn ensures that the target media resource generated from the basic resource and the reference candidate text is of higher quality. In addition, the target media resource generation flow provided by the embodiments of the present application is fully automatic; compared with manual production, it greatly shortens the time needed to generate a media resource and improves generation efficiency.
It should be understood that the application scenario shown in fig. 1 is merely an example, and in practical application, the resource generating method provided in the embodiment of the present application may also be applied to other scenarios, for example, the server 110 may also obtain the base resource from other databases, etc., and the application scenario of the resource generating method provided in the embodiment of the present application is not limited in any way.
The resource generating method provided by the application is described in detail below through a method embodiment.
Referring to fig. 2, fig. 2 is a flow chart of a resource generating method according to an embodiment of the present application. For convenience of description, the following embodiments will be described by taking an execution subject of the resource generating method as a server as an example. As shown in fig. 2, the resource generating method includes the steps of:
step 201: and aiming at the basic resources to be matched, determining basic emotion labels corresponding to the basic resources and basic finals corresponding to basic texts in the basic resources.
In the embodiment of the application, the server may acquire the basic resource first, then determine the emotion tag corresponding to the basic resource as the basic emotion tag, and determine the final corresponding to the basic text in the basic resource as the basic final.
It should be noted that, the basic resource in the embodiment of the present application may be any media resource including text, and the basic resource may specifically be a basic video resource, such as a video clip in a movie and television play, a video clip created by a user, etc., and the basic resource may specifically also be a basic text resource, such as a speech in a movie and television play, a text clip in a book, lyrics, a web hottext clip, a user created text clip, etc., where the embodiment of the present application does not limit the form and source of the basic resource.
In practical application, the server may acquire the above basic resource from the terminal device; illustratively, the user of the terminal device may upload the base resource to the server through a target application (e.g., a short video application, a resource creation application, etc.) running on the terminal device, for example, uploading a locally stored video resource of a movie and play episode, a user-created video resource, a user-created text episode, etc. to the server. Alternatively, the server may obtain the above-mentioned basic resources from the relevant database; illustratively, the server may obtain the base asset from a database for storing movie and television video assets. The embodiment of the present application does not limit any way to obtain the basic resource.
It should be noted that, the basic emotion label in the embodiment of the present application is an emotion label corresponding to a basic resource, which is used to represent an emotion category to which a content expressed by the basic resource belongs; the base emotion tags may be one or more of the emotion tags of the candidate emotion tags, where the candidate emotion tags include, but are not limited to, feeling wounded, thinking, happy, fun, anxiety, passion, anecdotal, no emotion, etc. In practical application, the basic emotion labels corresponding to the basic resources can be determined according to the basic resources through a pre-trained neural network model with an emotion classification function.
It should be noted that, in the embodiment of the present application, the base vowel is a vowel corresponding to the base text in the base resource. The base text here is text related to the base resource, for example, when the base resource is a base video resource, the base text may be text in the base video resource (such as a line of a movie and a line of a video, etc.), and when the base resource is a base text resource, the base text is the base text resource itself. The base vowels may include at least one of a base single-press vowels, which are the last words in the base text, and a base double-press vowels, which are the last two words in the base text.
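To make the rhyme-foot definition above concrete, the following is a minimal Python sketch that extracts the single and double rhyme feet of a basic text and their finals. It assumes the text is non-empty Chinese and uses the third-party pypinyin library, which is an illustrative choice rather than a tool named in this application.

```python
# Minimal sketch of rhyme-foot extraction, assuming non-empty Chinese text and
# the third-party pypinyin library (the application names no specific tool).
from pypinyin import pinyin, Style

def extract_rhyme_feet(basic_text: str) -> dict:
    """Return the single rhyme foot (last character) and the double rhyme
    foot (last two characters) of a text, together with their finals."""
    chars = [c for c in basic_text if '\u4e00' <= c <= '\u9fff']  # keep Chinese characters only
    single = chars[-1]            # single-rhyme foot: last character
    double = ''.join(chars[-2:])  # double-rhyme foot: last two characters
    finals = lambda s: [p[0] for p in pinyin(s, style=Style.FINALS)]
    return {
        'single_foot': single, 'single_final': finals(single),
        'double_foot': double, 'double_finals': finals(double),
    }

print(extract_rhyme_feet('床前明月光'))
# e.g. {'single_foot': '光', 'single_final': ['uang'], 'double_foot': '月光', ...}
```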
Step 202: Search a candidate text library for a reference candidate text matching the basic resource based on the basic emotion label and the basic rhyme foot; the candidate text library stores a plurality of candidate texts, the emotion label corresponding to each candidate text, and the rhyme foot corresponding to each candidate text; a preset emotion matching condition is satisfied between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyming condition is satisfied between the rhyme foot corresponding to the reference candidate text and the basic rhyme foot.
After the server determines the basic emotion label and the basic rhyme foot, it may search the candidate text library, based on them, for a reference candidate text that matches the emotion of the basic resource and rhymes with the basic text.
It should be noted that the candidate text library in the embodiment of the present application is a pre-constructed database storing a large number of candidate texts, which may include, but are not limited to, poetry, popular web texts, and the like. In addition, the candidate text library also stores the emotion label and rhyme foot corresponding to each candidate text: the emotion label corresponding to a candidate text is used to characterize the emotion category to which the content expressed by the candidate text belongs, and the rhyme foot corresponding to a candidate text may be determined from at least one of the last character and the last two characters of the candidate text. Optionally, the candidate text library may further store the heat corresponding to each candidate text, which characterizes the search popularity of the candidate text on a search platform; the higher the heat corresponding to a candidate text, the more attention the candidate text currently receives. Optionally, the candidate text library may further store the text length corresponding to each candidate text, which may be determined from the number of characters the candidate text includes.
It should be understood that in practical applications the candidate text library may be expanded and updated according to actual requirements. For example, new candidate texts may be crawled from the network and added, together with their corresponding emotion labels and rhyme feet, to the candidate text library to expand it. As another example, for each candidate text whose heat is stored in the library, the heat may be updated periodically according to the search volume of the candidate text on the search platform, thereby updating the candidate text library.
It should be noted that, in the embodiment of the present application, the reference candidate text is the candidate text found in the candidate text library that matches the basic resource. That the reference candidate text matches the basic resource may be understood as follows: the emotion label corresponding to the reference candidate text and the basic emotion label corresponding to the basic resource satisfy the preset emotion matching condition, and the rhyme foot corresponding to the reference candidate text and the basic rhyme foot corresponding to the basic resource satisfy the preset rhyming condition. The preset emotion matching condition is a condition for measuring whether emotion labels match; for example, if at least one emotion label corresponding to the reference candidate text is the same as or similar to the basic emotion label, the emotion label corresponding to the reference candidate text and the basic emotion label may be considered to satisfy the preset emotion matching condition. The preset rhyming condition is a condition for measuring whether two rhyme feet rhyme; for example, the finals of the two rhyme feet may be required to be the same or similar, i.e., if the final of the rhyme foot corresponding to the reference candidate text is the same as or similar to the final of the basic rhyme foot, the rhyme foot corresponding to the reference candidate text and the basic rhyme foot may be considered to satisfy the preset rhyming condition.
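As an illustration of the matching just described, the following Python sketch shows one possible in-memory search dictionary for a candidate text library, keyed by emotion label and by the final of the rhyme foot. The field names, the exact-match conditions and the heat-based ranking are assumptions for demonstration, not the application's prescribed implementation.

```python
# Sketch of one possible in-memory search dictionary for the candidate text
# library; field names and matching rules are illustrative assumptions.
from collections import defaultdict

# index: emotion label -> final of the rhyme foot -> list of candidate records
index = defaultdict(lambda: defaultdict(list))

def add_candidate(text, emotion_labels, foot_final, heat):
    for label in emotion_labels:
        index[label][foot_final].append({'text': text, 'heat': heat})

def find_reference_candidates(basic_emotion_label, basic_foot_final, top_n=5):
    """Preset emotion matching condition here: identical emotion label.
    Preset rhyming condition here: identical final of the rhyme foot.
    Ties are broken by heat (search popularity), highest first."""
    hits = index[basic_emotion_label][basic_foot_final]
    return sorted(hits, key=lambda r: r['heat'], reverse=True)[:top_n]

add_candidate('举头望明月，低头思故乡', ['longing'], 'ang', heat=0.92)
print(find_reference_candidates('longing', 'ang'))
```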
Step 203: and generating a target media resource according to the basic resource and the reference candidate text.
After the server finds the reference candidate text matching the base resource in the candidate text library, the base resource and the reference candidate text may be synthesized to generate a new target media resource.
It should be noted that, in the embodiment of the present application, the target media resource is a media resource synthesized by the basic resource and the reference candidate text, and the target media resource may be any one of a video resource, an audio resource, an image resource, and a text resource, and the embodiment of the present application does not limit the form of the target media resource at all.
For example, when the base resource is a base video resource, the server may generate an audio resource based on the reference candidate text that is found, and then clip the audio resource to the base video resource to obtain the target media resource in the form of a video. For example, assuming that the base video asset is a movie fragment, the server may convert the reference candidate text into an audio asset having the audio feature according to the audio feature of the movie fragment, and then clip the audio asset into the movie fragment, thereby obtaining the target media asset.
For example, when the base resource is a base text resource, the server may splice the found reference candidate text with the base text resource to obtain the target media resource in text form. Or the server can splice the searched reference candidate text and the basic text resource, and further convert the spliced text resource into a corresponding audio resource based on preset audio template characteristics, so as to obtain the target media resource in the audio form. Or the server can splice the searched reference candidate text and the basic text resource, then, based on the preset audio template characteristics, the spliced text resource is converted into the corresponding audio resource, and then, the audio resource is clipped into the preset video template, so that the target media resource in the video form is obtained.
It should be understood that the manner of generating the target media resource described above is merely an example, and in practical applications, the server may also generate the target media resource automatically based on the basic resource and the reference candidate text in other manners, and the embodiment of the present application does not limit the generation manner of the target media resource at all.
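Purely as an illustration of the three assembly paths for a basic text resource described above, the following Python sketch strings them together; text_to_speech and clip_into_video are hypothetical stand-ins for a text-to-speech engine and a video editor, not APIs named in this application.

```python
# Illustrative sketch of the three assembly paths for a basic text resource;
# the helpers below are hypothetical stubs, not APIs named in the application.
def text_to_speech(text: str, template: str) -> str:
    return f'<audio: "{text}" rendered with {template}>'   # hypothetical TTS stub

def clip_into_video(audio: str, template: str) -> str:
    return f'<video: {audio} clipped into {template}>'     # hypothetical editing stub

def generate_target_media(basic_text: str, reference_text: str,
                          output_form: str = 'text') -> str:
    spliced = basic_text + '，' + reference_text            # splice the two texts
    if output_form == 'text':
        return spliced                                      # text-form target resource
    audio = text_to_speech(spliced, 'preset audio template')
    if output_form == 'audio':
        return audio                                        # audio-form target resource
    return clip_into_video(audio, 'preset video template')  # video-form target resource

print(generate_target_media('长风破浪会有时', '直挂云帆济沧海', 'video'))
```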
Compared with the related-art approach in which a resource producer manually selects two resources and synthesizes a new media resource from them, the resource generation method provided by the embodiments of the present application can search a candidate text library containing a large number of candidate texts, from the two angles of emotion matching and rhyming, for a reference candidate text whose emotion matches the basic emotion label of the basic resource and which rhymes with the basic text in the basic resource. On the one hand, the search is no longer limited by the scope of any individual's knowledge, so the reference candidate text matching the basic resource can be selected from a much larger range of texts; on the other hand, because the reference candidate text is searched for based on information of two dimensions, namely the emotion label and the rhyme foot, the found reference candidate text matches the basic resource better, which in turn ensures that the target media resource generated from the basic resource and the reference candidate text is of higher quality. In addition, the target media resource generation flow provided by the embodiments of the present application is fully automatic; compared with manual production, it greatly shortens the time needed to generate a media resource and improves generation efficiency.
In one possible implementation, in addition to the basic emotion label corresponding to the basic resource, the server may determine a resource creation label corresponding to the basic resource, where the resource creation label is used to characterize whether the basic resource is suitable for being combined with other text to generate a new media resource. If the resource creation label characterizes the basic resource as suitable for being combined with other text to generate a new media resource, step 202 is executed, i.e., a reference candidate text matching the basic resource is searched for in the candidate text library based on the basic emotion label and the basic rhyme foot corresponding to the basic resource. If the resource creation label characterizes the basic resource as unsuitable for being combined with other text to generate a new media resource, step 202 and the subsequent steps are not executed, i.e., no reference candidate text matching the basic resource is searched for, and no target media resource is generated based on the basic resource.
That is, before searching for the reference candidate text matching the basic resource, the server may make a preliminary judgment on the basic resource, namely judging whether the basic resource is suitable for secondary creation by combining it with other text to generate a new media resource. In specific implementation, the server may determine, through a pre-trained neural network model, the resource creation label corresponding to the basic resource, which characterizes whether the basic resource is suitable for being combined with other text to generate a new media resource. When the resource creation label characterizes the basic resource as suitable, the server may continue with step 202 to find a reference candidate text suitable for being combined with the basic resource; when the resource creation label characterizes the basic resource as unsuitable, the server may skip steps 202 and 203, i.e., forgo secondary creation of a target media resource based on the basic resource.
In this way, before the reference candidate text matching the basic resource is searched for, it is first judged whether the basic resource is suitable for being combined with other text to generate a new media resource, and whether to continue the subsequent operations is decided according to the judgment result. On the one hand, basic resources unsuitable for secondary creation, such as basic resources whose expressed content has no substantive meaning, can be screened out, ensuring that target media resources are generated through secondary creation only from higher-quality basic resources, which correspondingly improves the quality of the target media resources; on the other hand, reference-candidate-text searches and target-media-resource generation based on poor-quality basic resources are avoided, reducing the waste of the related processing resources.
In practical applications, in order to improve the efficiency of target media resource generation, the server may use the same neural network model to determine both the basic emotion label and the resource creation label corresponding to the basic resource.
In one possible implementation, when the basic resource is a basic video resource, the server may determine the basic emotion label and the resource creation label corresponding to the basic video resource as follows:
determining, through a text classification model, a first emotion label and a first creation label corresponding to the basic video resource according to the basic text in the basic video resource; determining, through an image classification model, a second emotion label and a second creation label corresponding to the basic video resource according to the video frames in the basic video resource; determining the basic emotion label corresponding to the basic video resource according to the first emotion label and the second emotion label; and determining the resource creation label corresponding to the basic video resource according to the first creation label and the second creation label.
That is, for a basic video resource, the server may use corresponding neural network models to perform creation-suitability classification and emotion classification of the basic video resource based on multi-modal data (i.e., the text data and image data related to the basic video resource), thereby determining whether the basic video resource is suitable for being combined with other text to generate a new media resource and the emotion category of the content it expresses. In this way, the basic emotion label and the resource creation label corresponding to the basic video resource are determined with full reference to data of several different modalities, improving the accuracy of the determined labels.
As an example, the server may determine the first emotion label and the first creation label corresponding to the basic video resource as follows:
determining, through the text encoding structure in the text classification model, the overall text feature corresponding to the basic text and the unit text feature corresponding to each text unit in the basic text; determining a comprehensive text feature corresponding to the basic text according to the overall text feature and the unit text features; and determining the first emotion label and the first creation label according to the comprehensive text feature through the classification structures in the text classification model.
The server may determine the second emotion label and the second creation label corresponding to the basic video resource as follows:
determining, through the image encoding structure in the image classification model, the image feature corresponding to the basic video resource according to the video frames in the basic video resource; splicing the image feature with the comprehensive text feature to obtain a comprehensive image feature; and determining the second emotion label and the second creation label according to the comprehensive image feature through the classification structures in the image classification model.
Fig. 3 is a schematic diagram of a processing architecture for a basic video resource according to an embodiment of the present application. As shown in fig. 3, the text classification model in this processing architecture is a neural network model based on the Bidirectional Encoder Representations from Transformers (Bert) structure, and the image classification model is a neural network model based on the Swin Transformer structure.
In specific application, the initial features corresponding to the basic text in the basic video resource may be input into the Bert structure, which analyzes and processes them to obtain the feature vector corresponding to each node of the input; the feature vector corresponding to the last node is used to represent the overall text feature of the basic text, and the feature vectors corresponding to the other nodes are used to represent the unit text features corresponding to the individual text units in the basic text. In order to obtain a richer text feature that reflects the information expressed by the basic text more accurately, in the embodiment of the present application the overall text feature may be fused with the unit text features corresponding to the text units to obtain the comprehensive text feature; specifically, the unit text features corresponding to the text units may be averaged to obtain an average text feature, which is then spliced with the overall text feature to obtain the comprehensive text feature. Further, the first emotion label corresponding to the basic video resource may be determined from the comprehensive text feature through a first classification structure (softmax1) in the text classification model, and the first creation label corresponding to the basic video resource may be determined from the comprehensive text feature through a second classification structure (softmax2) in the text classification model.
Meanwhile, the video frames in the basic video resource may be input into the Swin Transformer structure in sequence, and the Swin Transformer structure analyzes and processes the input video frames to obtain the image feature of the basic video resource. This image feature helps to screen out video resources whose frames are empty shots or contain no people; such video resources are generally unsuitable for secondary creation to generate new media resources. Considering that it is difficult to accurately determine the resource creation label and the basic emotion label of the basic video resource from the image feature alone, the comprehensive text feature determined by the text classification model is introduced into the processing of the image classification model; that is, the image feature output by the Swin Transformer structure is spliced with the comprehensive text feature to obtain the comprehensive image feature. Further, the second emotion label corresponding to the basic video resource is determined from the comprehensive image feature through a first classification structure (softmax1) in the image classification model, and the second creation label is determined from the comprehensive image feature through a second classification structure (softmax2) in the image classification model.
After the first emotion label and the second emotion label are obtained, the server may determine the basic emotion label from them. Specifically, the first emotion label is essentially the set of probabilities, determined by the text classification model, that the basic video resource belongs to each class of candidate emotion label, i.e., it includes a probability value for each candidate emotion label; similarly, the second emotion label includes a probability value, determined by the image classification model, for each candidate emotion label. On this basis, for each class of candidate emotion label, the server may compute the average of its probability value in the first emotion label and its probability value in the second emotion label as the target probability value of that candidate emotion label. Finally, the server may select the candidate emotion labels whose target probability values exceed a preset probability threshold as the basic emotion labels corresponding to the basic video resource; alternatively, the server may sort the candidate emotion labels in descending order of target probability value and select the top k (k being an integer greater than or equal to 1) candidate emotion labels as the basic emotion labels corresponding to the basic video resource.
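The following Python sketch illustrates the label fusion just described: the two probability distributions are averaged and basic emotion labels are selected by threshold. The application presents threshold and top-k selection as alternative strategies; this sketch falls back to top-k only when nothing exceeds the threshold, and label names and values are illustrative.

```python
# Sketch of the emotion-label fusion described above: average the class
# probabilities from the text and image classifiers, then keep labels by
# threshold, falling back to top-k. Label names and values are illustrative.
import numpy as np

CANDIDATE_LABELS = ['sad', 'longing', 'happy', 'funny', 'anxious', 'passionate']

def fuse_emotion_labels(p_text, p_image, threshold=0.5, k=2):
    p = (np.asarray(p_text) + np.asarray(p_image)) / 2      # target probability values
    picked = [l for l, v in zip(CANDIDATE_LABELS, p) if v > threshold]
    if not picked:                                          # fall back to top-k labels
        picked = [CANDIDATE_LABELS[i] for i in np.argsort(p)[::-1][:k]]
    return picked

print(fuse_emotion_labels([0.7, 0.1, 0.05, 0.05, 0.05, 0.05],
                          [0.6, 0.2, 0.05, 0.05, 0.05, 0.05]))  # -> ['sad']
```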
After the first creation label and the second creation label are obtained, the server may determine the resource creation label corresponding to the basic resource from them. For example, if at least one of the first creation label and the second creation label characterizes the basic resource as suitable for being combined with other text to generate a new media resource, the server may determine that the resource creation label characterizes the basic resource as suitable; conversely, if both the first creation label and the second creation label characterize the basic resource as unsuitable, the server may determine that the resource creation label characterizes the basic resource as unsuitable for being combined with other text to generate a new media resource.
In order to further explain how the basic emotion label and the resource creation label are determined, the Swin Transformer structure and the Bert structure are each described in detail below.
For the Swin Transformer structure, fig. 4 is a schematic diagram of its working principle according to the embodiment of the present application. As shown in fig. 4, the input data of the Swin Transformer structure is a video frame of the basic video resource, typically of size H×W×3, where H and W are the height and width of the video frame and 3 denotes the three Red-Green-Blue (RGB) channel values of each pixel. After a video frame is input into the Swin Transformer structure, it first undergoes blocking by the Patch Partition module: every 4×4 adjacent pixels form one patch, which is then flattened in the channel direction. For an RGB three-channel picture, each patch includes 16 pixels, and since each pixel has R, G and B channel values, flattening yields 16×3=48 values; accordingly, after the Patch Partition module the feature map changes from H×W×3 to H/4×W/4×48. A Linear Embedding layer then applies a linear transformation to the channel data of each pixel, changing the feature map from H/4×W/4×48 to H/4×W/4×C. In practice, the Patch Partition module and the Linear Embedding layer can be implemented together by a single convolution layer. Fig. 5 is a schematic diagram of the working principle of the Patch Partition module provided in an embodiment of the present application.
Next, feature maps of different sizes are constructed through multiple stages: except for Stage 1, which first passes through the Linear Embedding layer, the remaining three stages each first pass through a Patch Merging layer for downsampling. Each stage includes Swin Transformer blocks (Swin Transformer Block); the structure of each block is shown in fig. 6, and within each block a window multi-head self-attention (W-MSA) structure and a shifted-window multi-head self-attention (SW-MSA) structure are used in pairs. In addition, each block includes a Layer Normalization (LN) structure and a Multi-Layer Perceptron (MLP) structure that assist the W-MSA structure, and an LN structure and an MLP structure that assist the SW-MSA structure. The Patch Merging layer in the last three stages performs the downsampling, reducing resolution and adjusting the number of channels; this yields a hierarchical design while saving a certain amount of computation. In convolutional neural networks (Convolutional Neural Networks, CNN), a convolution/pooling layer with stride=2 is typically used to reduce resolution before each stage starts; the operation performed by the Patch Merging layer is similar to pooling but more elaborate, because pooling loses information whereas Patch Merging does not. Each Patch Merging performs a 2× downsampling: elements are selected at interval 2 along the row and column directions and spliced into new patches, all the patches are concatenated into one tensor, and the tensor is then expanded. The width and height of the feature map are each halved, so the channel dimension becomes 4 times the original; a fully connected layer then adjusts the channel dimension to 2 times the original, and in this way the final image feature is obtained.
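A minimal PyTorch sketch of the two operations described above follows: the stride-4 convolution implements Patch Partition plus Linear Embedding as noted in the text, and the slicing-and-concatenation function shows the 2× downsampling of Patch Merging. Layer sizes are illustrative, and this is not the official Swin Transformer implementation.

```python
# Minimal PyTorch sketch of Patch Partition + Linear Embedding and of Patch
# Merging; layer sizes follow the text, not an official implementation.
import torch
import torch.nn as nn

class PatchPartitionAndEmbed(nn.Module):
    """H x W x 3 -> H/4 x W/4 x C: 4x4 patch split + linear embedding,
    implemented as a single stride-4 convolution as the text notes."""
    def __init__(self, c=96):
        super().__init__()
        self.proj = nn.Conv2d(3, c, kernel_size=4, stride=4)

    def forward(self, x):             # x: (B, 3, H, W)
        return self.proj(x)           # (B, C, H/4, W/4)

def patch_merging(x):
    """2x downsampling: select elements at interval 2 along rows and columns,
    concatenate on the channel axis (C -> 4C); a linear layer would then
    reduce 4C -> 2C as described in the text."""
    x0, x1 = x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2]
    x2, x3 = x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]
    return torch.cat([x0, x1, x2, x3], dim=1)    # (B, 4C, H/2, W/2)

feat = PatchPartitionAndEmbed()(torch.randn(1, 3, 224, 224))
print(feat.shape, patch_merging(feat).shape)     # (1,96,56,56) (1,384,28,28)
```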
In the embodiment of the present application, the video frames of the basic video resource are input into the Swin Transformer structure in their order of arrangement in the basic video resource, yielding an encoded sequence of image features; the average of the image features in this sequence is then computed to obtain an image feature that represents the semantics of the basic video resource. This image feature is then spliced with the comprehensive text feature to obtain the comprehensive image feature, which is fed into the two softmax structures separately to obtain the second creation label and the second emotion label corresponding to the basic video resource.
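The fusion step just described can be sketched as follows in PyTorch: per-frame features are mean-pooled, spliced with the comprehensive text feature, and fed to two softmax heads. All dimensions here are illustrative assumptions.

```python
# Sketch of the fusion step described above: mean-pool the per-frame Swin
# features, concatenate with the comprehensive text feature, and feed two
# softmax heads. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    def __init__(self, img_dim=768, txt_dim=1536, n_emotions=8):
        super().__init__()
        self.emotion_head = nn.Linear(img_dim + txt_dim, n_emotions)  # softmax1
        self.creation_head = nn.Linear(img_dim + txt_dim, 2)          # softmax2

    def forward(self, frame_feats, text_feat):
        img_feat = frame_feats.mean(dim=1)             # (B,T,D) -> (B,D): average over frames
        fused = torch.cat([img_feat, text_feat], -1)   # comprehensive image feature
        return (self.emotion_head(fused).softmax(-1),  # second emotion label probabilities
                self.creation_head(fused).softmax(-1)) # second creation label probabilities

probs_e, probs_c = DualHeadClassifier()(torch.randn(2, 16, 768), torch.randn(2, 1536))
print(probs_e.shape, probs_c.shape)   # (2, 8) (2, 2)
```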
As for the Bert structure, it is a Transformer-based bidirectional encoding representation and a pre-training model structure. In practical applications, the Bert structure is usually trained on two training tasks: one is predicting words masked out of a sentence, and the other is judging whether two input sentences are adjacent in context. After pre-training, corresponding network structures can be added to the Bert structure according to the business requirements to complete downstream natural language processing tasks such as text classification and machine translation.
In the Bert structure, the initial input feature is the sum of three different features: the initial feature of the word itself (wordpiece token embedding), the segment feature (segment embedding) and the location feature (position embedding). The wordpiece token embedding may be obtained by looking up a word feature dictionary, and the segment embedding indicates whether the word belongs to the first sentence or the second sentence in the next-sentence prediction task. That is, the initial feature input to the Bert structure is wordpiece token embedding + segment embedding + position embedding. Fig. 7 is a schematic diagram of exemplary initial features input to the Bert structure provided in the embodiment of the present application. As shown in fig. 7, the input text is [CLS] my dog is cute [SEP] he likes playing [SEP], where [CLS] is the text initiator and [SEP] is the text separator; the initial feature corresponding to the input text is obtained by summing its wordpiece token embedding, segment embedding and position embedding parts.
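The summation of the three input features can be sketched as follows; the vocabulary size, maximum sequence length and hidden size shown are customary BERT defaults, assumed here for illustration only.

```python
import torch
import torch.nn as nn

class BertInput(nn.Module):
    """Initial feature = wordpiece token embedding + segment embedding
    + position embedding."""
    def __init__(self, vocab=30522, max_len=512, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab, hidden)    # word feature dictionary lookup
        self.seg = nn.Embedding(2, hidden)        # first or second sentence
        self.pos = nn.Embedding(max_len, hidden)  # position of each word

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (B, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```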
The Bert structure uses only the encoder portion of the Transformer. The overall framework of the Bert structure is formed by stacking multiple Transformer encoder layers, each consisting of a multi-head attention mechanism, a layer normalization (Layer Normalization) layer, a feed-forward layer and another layer normalization layer. Typically, the large-scale Bert structure includes 24 encoder layers, each with 16 attention heads, and a word vector dimension of 1024; the small-scale Bert structure includes 12 encoder layers, each with 12 attention heads, and a word vector dimension of 768. Furthermore, the size of the feed-forward layer is typically set to 4H, where H is the dimension of the word vector.
The main function of each attention head is to re-encode the target word by calculating the relevance between the target word and all words in the sentence; the computation of each attention head includes three steps: calculating the relevance between words, normalizing the relevance, and weighting the initial features of all words based on the normalized relevance to obtain the coding feature of the target word. When calculating the relevance between words through the attention mechanism, the input sequence vectors may first be linearly transformed by three weight matrices to generate three new sequence vectors: query, key and value. The query vector of each word is multiplied by the key vectors of all words in the sequence to obtain the relevance between words; the relevance is then normalized by softmax; the normalized relevance is used as a weight and summed with the value vectors to obtain the new coding feature of each word. The various parameters produced during this attention computation are collectively referred to as intermediate features. Fig. 8 is a schematic diagram of the attention mechanism in the Bert structure in the embodiment of the present application.
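The three-step computation described above corresponds to standard scaled dot-product attention. A minimal sketch follows; the scaling by the square root of the key dimension is the usual Transformer convention and is assumed here.

```python
import math
import torch

def attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor):
    """query/key/value: (seq_len, d_k) sequences produced by linearly
    transforming the input with three weight matrices."""
    # Step 1: relevance between each pair of words
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    # Step 2: normalize the relevance with softmax
    weights = torch.softmax(scores, dim=-1)
    # Step 3: weighted sum of the value vectors re-encodes each word
    return weights @ value
```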
In the embodiment of the application, the initial feature input to the Bert structure can be determined according to the basic text in the basic video resource; this initial feature is then processed by the Bert structure to obtain the coding feature corresponding to each node of the input. In general, the coding feature corresponding to the last node can globally represent the overall text feature of the basic text, while the coding features corresponding to the other nodes represent the unit text features of the individual text units in the basic text. Considering that the coding feature of the last node alone may carry too little information to fully express the content of the whole basic text, in the embodiment of the application the unit text features of the text units are averaged, and the text feature obtained by this averaging is fused with the coding feature of the last node, so as to obtain a comprehensive text feature that can fully express the content of the basic text. The comprehensive text feature is then respectively input into two softmax structures to obtain the first creation tag and the first emotion tag corresponding to the basic video resource.
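The averaging-and-fusion step can be sketched as follows; concatenation is assumed as the fusion operation, since the application does not fix a particular one.

```python
import torch

def comprehensive_text_feature(unit_feats: torch.Tensor,
                               last_feat: torch.Tensor) -> torch.Tensor:
    """unit_feats: (n_units, hidden) unit text features;
    last_feat: (hidden,) coding feature of the last node."""
    mean_feat = unit_feats.mean(dim=0)                # average the unit features
    return torch.cat([mean_feat, last_feat], dim=-1)  # fuse with the last node
```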
In this way, the text classification model based on the Bert structure determines the first emotion tag and the first creation tag according to the basic text in the basic video resource, and the image classification model based on the swin-transformer structure determines the second emotion tag and the second creation tag according to the video frames in the basic video resource. Both the text information and the image information in the basic video resource are thus fully considered when determining the emotion tags and creation tags, which ensures that the determined tags have higher accuracy.
It should be understood that in practical application, the text classification model may have other structures besides the Bert-based structure; the embodiment of the present application does not limit the structure of the text classification model in any way. Similarly, the image classification model may have other structures besides the swin-transformer-based structure, and the embodiment of the present application does not limit the structure of the image classification model in any way.
In one possible implementation, when the basic resource is a basic text resource, the server may determine a basic emotion tag and a resource creation tag corresponding to the basic text resource by:
And determining a basic emotion label and a resource creation label corresponding to the basic text resource according to the basic text in the basic text resource through the text classification model.

For the basic text resource, the server can adopt a corresponding neural network model to realize creation-type classification and emotion classification of the basic text resource based on single-mode data (namely the text data of the basic text resource), determining whether the basic text resource is suitable for being combined with other texts to generate a new media resource as well as the emotion category of the content expressed by the basic text resource. In this way, the basic emotion label and the resource creation label corresponding to the basic text resource are accurately determined through the neural network model.
As an example, the server may determine the base emotion tags and resource authoring tags corresponding to the base text resources by:
determining the integral text characteristics corresponding to the basic text and the unit text characteristics corresponding to each text unit in the basic text according to the basic text through a text coding structure in the text classification model; determining comprehensive text features corresponding to the basic text according to the overall text features and the text features of each unit; and determining a basic emotion label and a resource creation label according to the comprehensive text characteristics through a classification structure in the text classification model.
The manner of determining the basic emotion tag and resource creation tag corresponding to the basic text resource is similar to the manner, described above, of determining the first emotion tag and first creation tag for the basic video resource according to its basic text; that is, the text classification model based on the Bert structure can be used to determine the basic emotion tag and resource creation tag according to the basic text resource. For details, refer to the related description above; they are not repeated here.
In a possible implementation manner, the candidate text library in the embodiment of the present application may be constructed through the process shown in fig. 9. The method for constructing the candidate text library may be performed by a computer device, which may be a terminal device or a server; in this embodiment, the server is taken as the execution body for description. As shown in fig. 9, the construction process of the candidate text library includes the following steps:
step 901: acquiring an original candidate text library; the original candidate text library includes a plurality of original candidate texts.
In the embodiment of the application, the original candidate text library consists of a large number of original candidate texts, and the original candidate texts are text screening bases when the candidate text library is built.
Fig. 10 is a schematic diagram of a construction architecture of a candidate text library provided in an embodiment of the present application. As shown in fig. 10, in order to ensure that the reference candidate text used in the finally generated target media resource has a certain audience, that is, has been learned or is known by most resource viewers, ancient poems in relevant teaching texts can be used as original candidate texts and incorporated into the original candidate text library in the embodiment of the application; in addition, poems trending on the network can be used as original candidate texts and incorporated into the original candidate text library. On the one hand, the ancient poems in teaching textbooks are likely to have been learned by resource viewers, so that resource viewers are familiar with and fond of them; on the other hand, the network-trending poems enjoy a certain degree of awareness among many resource viewers.
Of course, in practical application, the server may acquire other types of original candidate texts in other manners to form an original candidate text library, and the embodiments of the present application do not limit the source and types of the original candidate texts in any way.
Step 902: determining a text creation tag and an emotion tag corresponding to each original candidate text through a text classification model; the text authoring tag is used to characterize whether the original candidate text is suitable for being combined with other resources to generate a new media resource.
For each original candidate text in the original candidate text library, the server can input the original candidate text into a text classification model, and the text classification model can output the text creation tag and emotion tag corresponding to the original candidate text by analyzing and processing it. The text creation tag corresponding to the original candidate text here is used to characterize whether the original candidate text is suitable for being combined with other resources to generate new media resources, similar to the resource creation tag above. The emotion tag corresponding to the original candidate text is used to characterize the emotion type of the content expressed by the original candidate text.
It should be noted that, in order to ensure that the configuration standards of the text creation tag and emotion tag corresponding to the original candidate text are consistent with those of the resource creation tag and basic emotion tag corresponding to the basic resource, so that whether the basic resource and a candidate text match can be determined more accurately when the reference candidate text is later searched for, the text classification model used for determining the text creation tag and emotion tag of the original candidate text can be the same model as the text classification model, described above, used for determining the resource creation tag and emotion tag of the basic resource; for example, the text classification model can be based on the Bert structure. For the specific working principle of the text classification model, refer to the related content above; it is not repeated here.
Step 903: and aiming at each original candidate text, if the text creation label corresponding to the original candidate text characterizes that the original candidate text is suitable for being combined with other resources to generate a new media resource, taking the original candidate text as the candidate text, and adding the candidate text, the emotion label corresponding to the candidate text and the finals into the candidate text library.
As shown in fig. 10, for each original candidate text, the server may determine, according to the text creation tag corresponding to the original candidate text, whether the original candidate text is suitable for being combined with other resources to generate a new media resource. If the text creation tag characterizes that the original candidate text is suitable for such combination, the original candidate text is used as a candidate text and added into the candidate text library, and the emotion tag and finals corresponding to the candidate text are also added into the candidate text library. The finals of the candidate text may be obtained by using a related final disassembly algorithm; the finals may include at least one of a single-press final and a double-press final, where the single-press final is the last word in the candidate text and the double-press final is the last two words in the candidate text. If the text creation tag characterizes that the original candidate text is not suitable for being combined with other resources to generate a new media resource, the original candidate text is not added into the candidate text library.
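By way of illustration, the final disassembly can be implemented with the open-source pypinyin library; this is an assumption for the sketch, as the application does not name a specific algorithm.

```python
from pypinyin import pinyin, Style  # assumed final-disassembly tool

def disassemble_finals(text: str):
    """Return the single-press final (last word) and double-press final
    (last two words) of a candidate text, together with their vowels."""
    single_final = text[-1]
    double_final = text[-2:]
    single_vowel = pinyin(single_final, style=Style.FINALS)[0][0]
    double_vowel = tuple(v[0] for v in pinyin(double_final, style=Style.FINALS))
    return single_final, single_vowel, double_final, double_vowel
```

For example, disassemble_finals("小明") yields the single-press final "明" with vowel "ing" and the double-press final "小明" with the vowel pair ("iao", "ing"), following pypinyin's default segmentation.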
Optionally, for the candidate text in the candidate text library, the server may further determine a heat corresponding to the candidate text based on a search amount corresponding to the candidate text on the search platform, and determine a text length corresponding to the candidate text; further, the heat and the text length corresponding to the candidate text are also added to the candidate text library.
Specifically, as shown in fig. 10, for each candidate text in the candidate text library, the server may search the candidate text on the search platform to obtain a search amount of the candidate text on the search platform, and further determine a heat corresponding to the candidate text based on the search amount; it should be appreciated that the higher the search volume of a candidate text, the higher the corresponding heat of that candidate text. In addition, the server may determine, according to the number of characters included in the candidate text, a text length corresponding to the candidate text. Further, the corresponding hotness and text length of the candidate text are added to the candidate text library accordingly.
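A minimal sketch of these two attributes follows; the application only states that heat grows with search volume, so the logarithmic scaling below is an illustrative assumption.

```python
import math

def heat(search_volume: int) -> float:
    # Higher search volume -> higher heat; log scaling is assumed.
    return math.log1p(search_volume)

def text_length(candidate: str) -> int:
    # Text length is the number of characters in the candidate text.
    return len(candidate)
```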
In this way, storing the heat corresponding to each candidate text in the candidate text library facilitates selecting reference candidate texts of higher heat when the reference candidate text is later determined, so that the target media resource is generated based on a reference candidate text of higher heat and is guaranteed a larger audience, that is, the attention and affection of more resource viewers. In addition, storing the text length corresponding to each candidate text facilitates selecting a reference candidate text whose length matches that of the basic text in the basic resource, so that the target media resource is generated based on a reference candidate text matched in length with the basic resource, which can further improve the quality of the target media resource.
In one possible implementation manner, the candidate text library in the embodiment of the application stores a mapping relation between vowels and finals, a mapping relation between finals and candidate texts, and the tag content corresponding to each candidate text; the tag content here includes the emotion tag, heat and text length corresponding to the candidate text.
Specifically, the vowels are divided into single vowels and double vowels: a single vowel is the vowel of one word, and a double vowel is the vowel pair of two words; the vowels stored in the candidate text library are obtained by collating the vowels contained in the finals of the candidate texts in the candidate text library. The finals are divided into single-press finals and double-press finals: a single-press final is the last word of a candidate text, and a double-press final is the last two words of a candidate text; the finals stored in the candidate text library are obtained by collating the finals of the candidate texts in the candidate text library. The mapping relation between vowels and finals is determined according to the belonging relation between them, namely, if a certain vowel belongs to a certain final, a mapping relation exists between that vowel and that final; it should be understood that single vowels map to single-press finals, and double vowels map to double-press finals. Fig. 11 is a schematic diagram of the mapping relationship between vowels and finals according to an embodiment of the present application, where one vowel may correspond to at least one final.
The mapping relation between finals and candidate texts is determined according to the belonging relation between them, namely, if a certain final belongs to a certain candidate text, a mapping relation exists between that final and that candidate text. Fig. 12 is a schematic diagram of an exemplary mapping relationship between finals and candidate texts according to an embodiment of the present application, where one final may correspond to at least one candidate text.
The tag content corresponding to the candidate text may include the emotion tag corresponding to the candidate text determined by the text classification model, the heat corresponding to the candidate text determined based on the search amount on the search platform, and the text length corresponding to the candidate text determined based on the number of characters included in the candidate text. Of course, in practical applications, other types of content may be added as tag content corresponding to the candidate text, which is not limited in the embodiments of the present application.
Fig. 13 is a schematic diagram of a search dictionary corresponding to the candidate text library according to an embodiment of the present application. As shown in fig. 13, the primary index of the candidate text library consists of vowels, including single vowels and double vowels; the primary index points to the secondary index, which consists of the finals that have a mapping relation with each vowel, a single vowel pointing to single-press finals and a double vowel pointing to double-press finals; the secondary index points to the candidate texts that have a mapping relation with each final; and each candidate text is associated with its corresponding tag content, such as the emotion tag, text length and heat.
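The three-level search dictionary of fig. 13 can be sketched as nested mappings; every entry below is a hypothetical placeholder rather than actual library content.

```python
# vowel (primary index) -> final (secondary index) -> candidate texts
candidate_library = {
    "ing": {                       # a single vowel
        "明": [                    # a single-press final with that vowel
            {
                "text": "...",     # candidate text (placeholder)
                "emotion": "...",  # emotion tag from the text classifier
                "length": 7,       # number of characters
                "heat": 12.3,      # derived from search volume
            },
        ],
    },
    ("iao", "ing"): {              # a double vowel
        "小明": [],                # double-press finals with that vowel pair
    },
}
```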
Constructing the candidate text library with this structure allows the reference candidate text matched with the basic resource to be found quickly in the candidate text library; that is, it improves the search efficiency of the reference candidate text and thereby the generation efficiency of the target media resource.
According to the construction method of the candidate text library described above, candidate texts suitable for being combined with other resources to generate new media resources can be selected from a large number of original candidate texts, which guarantees the reliability of the candidate texts stored in the candidate text library, prevents reference candidate texts unsuitable for secondary creation from being retrieved in later searches, and thus indirectly guarantees the quality of the target media resources. In addition, the candidate texts, their corresponding tag content, finals and vowels are stored in the candidate text library according to specific mapping relations, which standardizes the data storage structure of the candidate text library, facilitates fast lookup of reference candidate texts in the candidate text library, and improves the generation efficiency of target media resources.
In one possible implementation manner, the server may search the candidate text library for the reference candidate text matching the basic resource through the reference candidate text searching process shown in fig. 14, where the reference candidate text searching process includes the following steps:
Step 1401: determining an emotion label which meets the preset emotion matching condition between the emotion label and the basic emotion label as a reference emotion label; and searching candidate texts corresponding to the reference emotion labels in the candidate text library to form a primary candidate text set.
In the embodiment of the application, the server can determine an emotion tag that satisfies the preset emotion matching condition with the basic emotion tag corresponding to the basic resource, and use it as the reference emotion tag; for example, the server may directly determine the basic emotion tag itself as the reference emotion tag, or the server may determine an emotion tag matching the basic emotion tag according to an emotion tag matching rule table, which records the matching relationships between various emotion tags. Further, the server may search the candidate text library for the candidate texts corresponding to the reference emotion tag, and use them to form a primary candidate text set.
Fig. 15 is a schematic diagram of an implementation architecture for searching for reference candidate text according to an embodiment of the present application. As shown in fig. 15, for the basic resource whose basic text is "my name is Xiaoming", a reference emotion tag satisfying the preset emotion matching condition can be determined according to the basic emotion tag "no obvious emotion" corresponding to the basic resource; candidate texts corresponding to the reference emotion tag are then searched for in the candidate text library to form the primary candidate text set.
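Step 1401 can be sketched as follows, reusing the nested dictionary structure assumed earlier; exact tag equality stands in for the preset emotion matching condition when no rule table is supplied.

```python
def primary_candidate_set(library, base_emotion, rule_table=None):
    """Collect candidate texts whose emotion tag is a reference emotion
    tag for the given basic emotion tag."""
    reference = {base_emotion} if rule_table is None else set(rule_table[base_emotion])
    return [c for finals in library.values()
              for texts in finals.values()
              for c in texts
              if c["emotion"] in reference]
```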
Step 1402: determining the vowels meeting the preset rhyming conditions between the base vowels and the base vowels as reference vowels; and searching candidate texts corresponding to the reference vowels in the primary candidate text set to form a secondary candidate text set.
The server can determine a final that satisfies the preset rhyming condition with the basic final corresponding to the basic resource, and use it as a reference final; specifically, according to the vowel of the basic final, the server may determine finals having that vowel, or a similar vowel, as reference finals. The server may then continue to search the primary candidate text set for the candidate texts corresponding to the reference finals, and use them to form a secondary candidate text set.
As an example, the basic finals may include a basic double-pressing final and a basic single-pressing final, where the basic double-pressing final is the last two words in the basic text and the basic single-pressing final is the last word in the basic text. In this case, the server may determine the above-described secondary candidate text set by:
Determining reference double-pressing finals based on the double vowel corresponding to the basic double-pressing final; searching the primary candidate text set for candidate texts corresponding to the reference double-pressing finals; if candidate texts corresponding to the reference double-pressing finals are found, forming the secondary candidate text set from them; if no candidate text corresponding to the reference double-pressing finals is found, determining reference single-pressing finals based on the single vowel corresponding to the basic single-pressing final, and searching the primary candidate text set for candidate texts corresponding to the reference single-pressing finals to form the secondary candidate text set.
As shown in fig. 15, for the basic text "my name is Xiaoming", the corresponding basic double-pressing final is "Xiaoming" and the basic single-pressing final is "ming". When searching for the secondary candidate text set for this basic text, the basic double-pressing final is used first: the double vowel "ao" corresponding to the basic double-pressing final "Xiaoming" is determined, finals having this double vowel are then determined as reference double-pressing finals, and the primary candidate text set is searched for candidate texts corresponding to the reference double-pressing finals. If candidate texts corresponding to the reference double-pressing finals are found in the primary candidate text set, the found candidate texts directly form the secondary candidate text set; if not, the secondary candidate text set is searched for based on the basic single-pressing final. In that case, the single vowel "ing" corresponding to the basic single-pressing final "ming" is first determined, finals having this single vowel are then determined as reference single-pressing finals, and the candidate texts corresponding to the reference single-pressing finals in the primary candidate text set form the secondary candidate text set.
Therefore, following the rule of double-press first and single-press second in the above mode, candidate texts that double-press rhyme with the basic text are preferentially searched for in the primary candidate text set. When such candidate texts can be found, a higher rhyme match between the subsequently determined reference candidate text and the basic text is ensured and the pairing is neater, which facilitates generating higher-quality target media resources. If no candidate text that double-presses with the basic text is found in the primary candidate text set, candidate texts that single-press with the basic text are searched for instead, which provides a basic guarantee for the subsequent determination of the reference candidate text and ensures the yield of target media resources.
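The double-press-first, single-press-second rule of step 1402 can be sketched as follows; each candidate is assumed to carry its own vowels in the fields shown.

```python
def secondary_candidate_set(primary_set, base_double_vowel, base_single_vowel):
    """Prefer candidates that double-press with the basic text; fall
    back to single-press candidates only if none exist."""
    double_hits = [c for c in primary_set
                   if c.get("double_vowel") == base_double_vowel]
    if double_hits:
        return double_hits
    return [c for c in primary_set
            if c.get("single_vowel") == base_single_vowel]
```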
Step 1403: the reference candidate text is determined based on candidate text included in the secondary candidate text set.
After determining the secondary candidate text set, the server may select any candidate text from the secondary candidate text set as a reference candidate text for use in generating the target media asset in conjunction with the underlying asset.
As an example, in the case that the candidate text library further includes respective corresponding hotness of each candidate text, the server may select, as the reference candidate text, a candidate text whose corresponding hotness satisfies a preset hotness condition based on the respective corresponding hotness of each candidate text in the secondary candidate text set.
Specifically, the heat corresponding to each candidate text stored in the candidate text library is determined according to that candidate text's search volume on the search platform: the higher the heat corresponding to a candidate text, the larger its current search volume on the search platform and the higher its current degree of attention. Based on this, when selecting the reference candidate text from the secondary candidate text set, the server may select the first n (n is an integer greater than or equal to 1) candidate texts with the highest heat, and thereby determine the reference candidate text; alternatively, candidate texts whose heat is higher than a preset heat threshold may be selected, and the reference candidate text determined accordingly.
Thus, based on the corresponding heat degree of the candidate texts, the candidate texts with higher corresponding heat degrees are selected to serve as reference candidate texts for generating the target media resources by combining with the basic resources, so that the quality of the generated target media resources can be improved, and the generated target media resources have higher audience degree.
As an example, in the case that the candidate text library further includes text lengths corresponding to the respective candidate texts, the server may select, as the reference candidate text, a candidate text satisfying a preset length matching condition between the corresponding text length and the text length of the base text based on the text length corresponding to the respective candidate texts in the secondary candidate text set.
Specifically, the text length corresponding to each candidate text stored in the candidate text library is determined according to the number of characters included in that candidate text. When selecting the reference candidate text from the secondary candidate text set, the server can select candidate texts whose text length is the same as that of the basic text, or whose text length differs from that of the basic text by no more than a preset length difference, and thereby determine the reference candidate text.
Therefore, based on the text length corresponding to the candidate text, the candidate text with the same or similar text length as the basic text is selected to be used as the reference candidate text for generating the target media resource by combining with the basic resource, so that better matching performance between the reference candidate text and the basic text can be ensured, and the matching is more neat, thereby improving the quality of the target media resource generated subsequently.
It should be understood that, in practical application, the server may refer to only one of the heat and the text length when selecting the reference candidate text in the secondary candidate text set, or may refer to both; the embodiment of the present application does not limit this in any way.
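Step 1403, with both optional criteria applied together, can be sketched as follows; the thresholds top_n and max_len_diff are illustrative assumptions.

```python
def pick_reference_texts(secondary_set, base_length, top_n=3, max_len_diff=1):
    """Keep candidates whose text length matches the basic text within
    max_len_diff, then take the top_n by heat."""
    matched = [c for c in secondary_set
               if abs(c["length"] - base_length) <= max_len_diff]
    matched.sort(key=lambda c: c["heat"], reverse=True)
    return matched[:top_n]
```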
According to the reference candidate text searching method, a primary candidate text set is determined based on basic emotion labels corresponding to basic resources, a secondary candidate text set is determined based on basic finals corresponding to the basic resources, and finally the reference candidate text is selected from the secondary candidate text set based on at least one of heat and text length information. Therefore, according to the sequence of emotion matching, rhyming and matching quality improving, the reference candidate texts are determined step by step, so that the reference candidate texts can be efficiently found in the candidate text library, and the found reference candidate texts can be well matched with the basic resources, thereby ensuring the generation efficiency and the generation quality of the target media resources.
In order to facilitate further understanding of the resource generating method provided in the embodiments of the present application, the resource generating method provided in the embodiments of the present application is described in the following by way of example in conjunction with fig. 16, and fig. 16 is a schematic diagram of an overall implementation architecture of the resource generating method provided in the embodiments of the present application.
As shown in fig. 16, the overall implementation of the embodiment of the present application is divided into three parts, namely, label configuration of the basic resource, construction of the candidate text library, and search and resource integration of the reference candidate text.
The label configuration of the basic resource is used for determining, through a neural network model, the basic emotion label and the resource creation label corresponding to the basic resource (the latter representing whether the basic resource is suitable for being combined with other texts to generate a new media resource). When the basic resource is a basic video resource, the embodiment of the present application innovatively uses bimodal data to determine these labels; that is, the basic emotion label and resource creation label corresponding to the basic video resource are determined based on both the image data and the text data in the basic video resource.
The construction of the candidate text library screens candidate texts and builds the candidate text library from a large number of collected original candidate texts. Specifically, a large number of ancient poetry sentences can be collected as original candidate texts; the neural network model used when configuring labels for basic resources is then reused to determine the emotion label and text creation label (representing whether the original candidate text is suitable for being combined with other resources to generate new media resources) corresponding to each original candidate text; the original candidate texts suitable for such combination are then kept as candidate texts, and each candidate text, its emotion label, its finals (including double-press and single-press finals), its heat (determined according to its search volume on a search platform) and its text length are correspondingly stored in the candidate text library.
The searching and resource integration of the reference candidate text is used for searching the candidate text library for a reference candidate text matched with the basic resource and synthesizing the basic resource and the reference candidate text into the target media resource. Specifically, candidate texts matched with the basic emotion label corresponding to the basic resource are first selected from the candidate text library; then, following the rule of double-press first and single-press second, candidate texts rhyming with the basic text in the basic resource are selected from them; further, based on the heat and text length corresponding to the selected candidate texts, a candidate text with higher heat and a text length equal or similar to that of the basic text is determined as the reference candidate text; finally, the determined reference candidate text is integrated with the basic resource, for example by clipping the reference candidate text into the basic video resource, to obtain the target media resource.
The resource generation mode provided by the embodiment of the application can be applied to the following scenes in an exemplary manner:
1) In post-production of a film or television drama, the method provided by the embodiment of the application can be adopted to evaluate each segment in the drama being produced, judging whether the segment is suitable for being combined with other texts to generate new media resources and determining the emotion tag corresponding to the segment; several segments suitable for such combination are then selected from the drama, and for each selected segment a reference candidate text (such as an ancient poem) matching it is searched for in the candidate text library according to the segment's emotion tag and finals; finally, a new video clip is generated from the segment and the reference candidate text through secondary creation and used as a promotion resource for the drama.
2) On a short video platform, movie fragments of higher interest (that is, fragments suitable for being combined with other texts to generate new media resources) can be provided to users, together with reference candidate texts matched with those fragments (found in the candidate text library by the method provided in the embodiment of the application); a user is then supported in triggering a selection operation on a movie fragment and a reference candidate text, whereby a corresponding new target media resource is generated, improving the efficiency of resource creation on the short video platform.
In addition, the user can upload the self-created video segment or text segment by himself, and the method provided by the embodiment of the application is adopted to search the reference candidate text matched with the video segment or text segment in the candidate text library, and automatically generate new target media resources based on the video segment or text segment and the reference candidate text, thereby completing automatic resource generation.
The application further provides a corresponding resource generating device aiming at the resource generating method, so that the resource generating method is practically applied and realized.
Referring to fig. 17, fig. 17 is a schematic diagram of a configuration of a resource generating apparatus 1700 corresponding to the resource generating method shown in fig. 2 above. As shown in fig. 17, the resource generating apparatus 1700 includes:
A basic information determining module 1701, configured to determine, for a basic resource to be matched, a basic emotion tag corresponding to the basic resource and a basic final corresponding to a basic text in the basic resource;
a matching text searching module 1702 configured to search, based on the basic emotion tag and the basic final, a candidate text library for a reference candidate text that matches the basic resource; the candidate text library stores a plurality of candidate texts, emotion labels corresponding to the candidate texts and vowels corresponding to the candidate texts; a preset emotion matching condition is met between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyme condition is met between the vowel corresponding to the reference candidate text and the basic vowel;
a resource generating module 1703, configured to generate a target media resource according to the base resource and the reference candidate text.
Optionally, when the base resource is a base video resource, the base information determining module 1701 is specifically configured to:
determining a first emotion label corresponding to the basic video resource according to the basic text in the basic video resource through a text classification model;
Determining a second emotion label corresponding to the basic video resource according to the video frames in the basic video resource through an image classification model;
and determining the basic emotion label according to the first emotion label and the second emotion label.
Optionally, the basic information determining module 1701 is specifically configured to:
determining integral text characteristics corresponding to the basic text and unit text characteristics corresponding to each text unit in the basic text according to the basic text through a text coding structure in the text classification model; determining comprehensive text features corresponding to the basic text according to the integral text features and the unit text features; determining the first emotion label according to the comprehensive text characteristics through a classification structure in the text classification model;
determining image features corresponding to the basic video resources according to the video frames through an image coding structure in the image classification model; splicing the image features with the comprehensive text features to obtain comprehensive image features; and determining the second emotion label according to the comprehensive image characteristics through a classification structure in the image classification model.
Optionally, when the basic resource is a basic text resource, the basic information determining module 1701 is specifically configured to:
and determining the basic emotion label corresponding to the basic text resource according to the basic text corresponding to the basic text resource through a text classification model.
Optionally, the basic information determining module 1701 is further configured to:
determining a resource creation tag corresponding to the basic resource; the resource creation tag is used for representing whether the basic resource is suitable for being combined with other texts to generate a new media resource;
if the resource creation tag characterizes that the basic resource is suitable for being combined with other texts to generate a new media resource, executing the step of searching, based on the basic emotion tag and the basic final, the candidate text library for a reference candidate text matched with the basic resource;

and if the resource creation tag characterizes that the basic resource is not suitable for being combined with other texts to generate a new media resource, stopping executing the step of searching, based on the basic emotion tag and the basic final, the candidate text library for a reference candidate text matched with the basic resource.
Optionally, the basic information determining module 1701 is specifically configured to:
when the basic resource is a basic video resource, determining a first creation tag corresponding to the basic video resource according to the basic text in the basic video resource through a text classification model; determining a second creation tag corresponding to the basic video resource according to the video frames in the basic video resource through an image classification model; determining the resource creation tag according to the first creation tag and the second creation tag;
and when the basic resource is a basic text resource, determining the resource creation tag according to the basic text corresponding to the basic text resource through the text classification model.
Optionally, the matching text searching module 1702 includes:
the primary searching unit is used for determining an emotion label which meets the preset emotion matching condition between the primary searching unit and the basic emotion label and is used as a reference emotion label; searching candidate texts corresponding to the reference emotion labels in the candidate text library to form a first-level candidate text set;
the second-level searching unit is used for determining the vowels meeting the preset rhyming conditions between the second-level searching unit and the basic vowels and taking the vowels as reference vowels; searching candidate texts corresponding to the reference vowels in the primary candidate text set to form a secondary candidate text set;
And the tertiary searching unit is used for determining the reference candidate text based on the candidate texts included in the secondary candidate text set.
Optionally, the basic finals include basic double-pressing finals and basic single-pressing finals, the basic double-pressing finals are the last two words in the basic text, and the basic single-pressing finals are the last words in the basic text; the secondary searching unit is specifically configured to:
determining a reference double-pressing final based on the double-pressing final corresponding to the basic double-pressing final; searching candidate texts corresponding to the reference double-pressing vowels in the primary candidate text set;
if the candidate text corresponding to the reference double-pressing final is found, the candidate text corresponding to the reference double-pressing final is utilized to form the secondary candidate text set;
if the candidate text corresponding to the reference double-pressing vowel is not found, determining the reference single-pressing vowel based on the single-pressing vowel corresponding to the basic single-pressing vowel; and searching candidate texts corresponding to the reference single-pressing vowel in the primary candidate text set to form the secondary candidate text set.
Optionally, the candidate text library further includes a plurality of hotness corresponding to each of the candidate texts; the three-stage searching unit is specifically configured to:
And selecting the candidate texts with the corresponding heat degrees meeting the preset heat degree condition as the reference candidate text based on the heat degrees corresponding to the candidate texts in the secondary candidate text set.
Optionally, the candidate text library further includes text lengths corresponding to a plurality of candidate texts; the three-stage searching unit is specifically configured to:
and selecting a candidate text which meets a preset length matching condition between the corresponding text length and the text length of the basic text as the reference candidate text based on the text length corresponding to each candidate text in the secondary candidate text set.
Optionally, the apparatus further comprises a text library construction module 1704; the text library construction module 1704 is configured to:
acquiring an original candidate text library; the original candidate text library comprises a plurality of original candidate texts;
determining a text creation tag and an emotion tag corresponding to each original candidate text through a text classification model; the text creation tag is used for representing whether the original candidate text is suitable for being combined with other resources to generate new media resources;
And aiming at each original candidate text, if the text creation label corresponding to the original candidate text characterizes that the original candidate text is suitable for being combined with other resources to generate a new media resource, taking the original candidate text as the candidate text, and adding the candidate text, the emotion label corresponding to the candidate text and the finals into the candidate text library.
Optionally, the text library construction module 1704 is further configured to:
determining the corresponding heat degree of the candidate texts based on the search quantity corresponding to the candidate texts on a search platform aiming at each candidate text; determining the text length corresponding to the candidate text;
and adding the corresponding heat and text length of the candidate text into the candidate text library.
Optionally, the candidate text library stores a mapping relation between vowels and vowels, a mapping relation between the vowels and the candidate text, and label content corresponding to the candidate text; and the label content comprises emotion labels, heat and text length corresponding to the candidate texts.
Compared with the mode that a resource producer manually selects two resources and synthesizes new media resources according to the two resources, the resource generating device provided by the embodiment of the application can search reference candidate texts which are matched with basic emotion labels of basic resources and are conquered with basic texts in the basic resources from two angles of emotion matching and conquering in a candidate text library comprising a large number of candidate texts; on one hand, the method is not limited by the range of the manual knowledge surface any more, the reference candidate text matched with the basic resource can be selected in a larger text selection range, and on the other hand, the reference candidate text matched with the basic resource can be searched based on information of two dimensions, namely emotion labels and vowels, so that the searched reference candidate text can be better matched with the basic resource, and further, the target media resource generated based on the basic resource and the reference candidate text is ensured to have higher quality. In addition, the generation flow of the target media resource provided by the embodiment of the application is an automatic media resource generation flow, and compared with manual production of the media resource, the generation time of the media resource can be greatly shortened, and the generation efficiency of the media resource is improved.
The embodiment of the application also provides a computer device for generating the media resource, which can be a terminal device or a server, and the terminal device and the server provided by the embodiment of the application are described below from the aspect of hardware materialization.
Referring to fig. 18, fig. 18 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 18, for convenience of explanation, only the portions related to the embodiments of the present application are shown, and specific technical details are not disclosed, please refer to the method portions of the embodiments of the present application. Taking a terminal device as a computer as an example:
fig. 18 is a block diagram showing a part of the structure of a computer related to a terminal provided in an embodiment of the present application. Referring to fig. 18, a computer includes: radio Frequency (RF) circuitry 1810, memory 1820, input units 1830 including touch panel 1831 and other input devices 1832, display unit 1840 including display panel 1841, sensor 1850, audio circuitry 1860 (which may connect speaker 1861 and microphone 1862), wireless fidelity (wireless fidelity, wiFi) module 1870, processor 1880, and power supply 1890. Those skilled in the art will appreciate that the computer architecture shown in fig. 18 is not limiting and that more or fewer components than shown may be included, or that certain components may be combined, or that different arrangements of components may be utilized.
The memory 1820 may be used to store software programs and modules, and the processor 1880 may execute the various functional applications and data processing of the computer by executing the software programs and modules stored in the memory 1820. The memory 1820 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data created according to the use of the computer (such as audio data, phonebooks, etc.), and the like. In addition, memory 1820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 1880 is the control center of the computer, connects the various parts of the overall computer using various interfaces and lines, and performs various functions of the computer and processes data by running or executing software programs and/or modules stored in the memory 1820, and invoking data stored in the memory 1820. In the alternative, processor 1880 may include one or more processing units; preferably, the processor 1880 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., and a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1880.
In the embodiment of the present application, the processor 1880 included in the terminal is further configured to perform the steps of any implementation manner of the resource generating method provided in the embodiment of the present application.
Referring to fig. 19, fig. 19 is a schematic structural diagram of a server 1900 according to an embodiment of the present application. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (central processing units, CPU) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Wherein the memory 1932 and storage medium 1930 may be transitory or persistent. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, a central processor 1922 may be provided in communication with a storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 19.
The CPU 1922 may be further configured to perform the steps of any implementation of the resource generating method provided in the embodiments of the present application.
The embodiments of the present application also provide a computer readable storage medium storing a computer program for executing any one of the foregoing implementation manners of the resource generating method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the computer device to execute any one of the resource generating methods described in the foregoing respective embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing a computer program.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be singular or plural.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (24)

1. A method of resource generation, the method comprising:
aiming at a basic resource to be matched, determining a basic emotion label corresponding to the basic resource and basic finals corresponding to a basic text in the basic resource; the basic finals comprise a basic double-rhyme final and a basic single-rhyme final, the basic double-rhyme final corresponding to the last two characters in the basic text and the basic single-rhyme final corresponding to the last character in the basic text;
searching a candidate text library for a reference candidate text matched with the basic resource based on the basic emotion label and the basic finals; the candidate text library stores a plurality of candidate texts, emotion labels corresponding to the candidate texts, and finals corresponding to the candidate texts; a preset emotion matching condition is met between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyming condition is met between the finals corresponding to the reference candidate text and the basic finals;
generating a target media resource according to the basic resource and the reference candidate text;
wherein the searching the candidate text library for the reference candidate text matched with the basic resource based on the basic emotion label and the basic finals comprises:
determining an emotion label that meets the preset emotion matching condition with the basic emotion label as a reference emotion label; searching the candidate text library for candidate texts corresponding to the reference emotion label to form a primary candidate text set;
determining finals that meet the preset rhyming condition with the basic finals as reference finals; searching the primary candidate text set for candidate texts corresponding to the reference finals to form a secondary candidate text set;
determining the reference candidate text based on the candidate texts included in the secondary candidate text set;
wherein the determining finals that meet the preset rhyming condition with the basic finals as reference finals, and the searching the primary candidate text set for candidate texts corresponding to the reference finals to form the secondary candidate text set, comprise:
determining a reference double-rhyme final based on the basic double-rhyme final; searching the primary candidate text set for candidate texts corresponding to the reference double-rhyme final;
if candidate texts corresponding to the reference double-rhyme final are found, forming the secondary candidate text set from the candidate texts corresponding to the reference double-rhyme final;
and if no candidate text corresponding to the reference double-rhyme final is found, determining a reference single-rhyme final based on the basic single-rhyme final, and searching the primary candidate text set for candidate texts corresponding to the reference single-rhyme final to form the secondary candidate text set.
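A minimal Python sketch of the two-stage lookup in claim 1 follows. All identifiers are hypothetical, and exact string equality stands in for the preset emotion matching and rhyming conditions, which the claim does not fix:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    emotion: str        # emotion label, e.g. "joy"
    double_final: str   # finals of the last two characters, e.g. "ang-i"
    single_final: str   # final of the last character, e.g. "i"

def find_reference_candidates(base_emotion, base_double, base_single,
                              library, emotion_match, rhyme_match):
    # Stage 1 (primary set): keep candidates whose emotion label meets the
    # preset emotion matching condition with the basic emotion label.
    primary = [c for c in library if emotion_match(c.emotion, base_emotion)]

    # Stage 2a: prefer double-rhyme matches on the last two characters' finals.
    secondary = [c for c in primary if rhyme_match(c.double_final, base_double)]

    # Stage 2b: fall back to single-rhyme matches on the last character's final
    # only when no double-rhyme candidate was found.
    if not secondary:
        secondary = [c for c in primary if rhyme_match(c.single_final, base_single)]
    return secondary

# Minimal usage, with exact equality standing in for the preset conditions.
library = [Candidate("text A", "joy", "ang-i", "i"),
           Candidate("text B", "joy", "ao-u", "u")]
refs = find_reference_candidates("joy", "ang-i", "i", library,
                                 emotion_match=lambda a, b: a == b,
                                 rhyme_match=lambda a, b: a == b)
print([c.text for c in refs])  # -> ['text A']
```

The double-rhyme branch is tried first and the single-rhyme branch only on a miss, mirroring the fallback order of the claim.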
2. The method of claim 1, wherein when the basic resource is a basic video resource, the determining the basic emotion label corresponding to the basic resource comprises:
determining, through a text classification model, a first emotion label corresponding to the basic video resource according to the basic text in the basic video resource;
determining, through an image classification model, a second emotion label corresponding to the basic video resource according to video frames in the basic video resource;
and determining the basic emotion label according to the first emotion label and the second emotion label.
3. The method according to claim 2, wherein the determining, through a text classification model, the first emotion label corresponding to the basic video resource according to the basic text in the basic video resource comprises:
determining, through a text coding structure in the text classification model, whole-text features corresponding to the basic text and unit text features corresponding to each text unit in the basic text according to the basic text; determining comprehensive text features corresponding to the basic text according to the whole-text features and the unit text features; and determining the first emotion label according to the comprehensive text features through a classification structure in the text classification model;
and the determining, through an image classification model, the second emotion label corresponding to the basic video resource according to the video frames in the basic video resource comprises:
determining, through an image coding structure in the image classification model, image features corresponding to the basic video resource according to the video frames; splicing the image features with the comprehensive text features to obtain comprehensive image features; and determining the second emotion label according to the comprehensive image features through a classification structure in the image classification model.
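A minimal sketch of the feature fusion described in claims 2 and 3, assuming PyTorch; the feature dimensions, the mean-based combination of whole-text and unit features, and the linear classification heads are illustrative assumptions, not the patent's specification:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, text_dim=256, img_dim=512, n_labels=8):
        super().__init__()
        self.text_head = nn.Linear(text_dim, n_labels)              # classification structure of the text model
        self.image_head = nn.Linear(text_dim + img_dim, n_labels)   # classification structure of the image model

    def forward(self, whole_text_feat, unit_text_feats, image_feat):
        # Comprehensive text feature: whole-text feature combined with the
        # mean of the per-unit features (one plausible combination rule).
        comp_text = whole_text_feat + unit_text_feats.mean(dim=1)
        first_label = self.text_head(comp_text).argmax(dim=-1)      # first emotion label

        # Splice (concatenate) the image feature with the comprehensive
        # text feature to obtain the comprehensive image feature.
        comp_image = torch.cat([image_feat, comp_text], dim=-1)
        second_label = self.image_head(comp_image).argmax(dim=-1)   # second emotion label
        return first_label, second_label

model = FusionClassifier()
whole = torch.randn(1, 256)      # whole-text feature from the text coding structure
units = torch.randn(1, 10, 256)  # per-text-unit features (10 text units)
img = torch.randn(1, 512)        # pooled video-frame feature from the image coding structure
first, second = model(whole, units, img)
```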
4. The method of claim 1, wherein when the basic resource is a basic text resource, the determining the basic emotion label corresponding to the basic resource comprises:
and determining the basic emotion label corresponding to the basic text resource according to the basic text corresponding to the basic text resource through a text classification model.
5. The method according to any one of claims 1 to 4, further comprising:
determining a resource creation tag corresponding to the basic resource; the resource creation tag is used for representing whether the basic resource is suitable for being combined with other texts to generate a new media resource;
if the resource creation tag represents that the basic resource is suitable for being combined with other texts to generate a new media resource, executing the step of searching the candidate text library for the reference candidate text matched with the basic resource based on the basic emotion label and the basic finals;
and if the resource creation tag represents that the basic resource is not suitable for being combined with other texts to generate a new media resource, skipping the step of searching the candidate text library for the reference candidate text matched with the basic resource based on the basic emotion label and the basic finals.
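The gating by the resource creation tag in claim 5 can be sketched as follows; classify_creation_tag, search and generate are hypothetical stand-ins for the classification model, claim 1's two-stage lookup, and the media-generation step:

```python
from types import SimpleNamespace

def maybe_generate(base_resource, classify_creation_tag, search, generate):
    # The resource creation tag gates the whole pipeline (claim 5).
    tag = classify_creation_tag(base_resource)
    if not tag.suitable:
        return None                        # unsuitable: skip search and generation
    reference_text = search(base_resource)  # claim 1's two-stage lookup
    return generate(base_resource, reference_text)

# Toy stand-ins for the classifier, search, and generation steps.
result = maybe_generate(
    "some base text",
    classify_creation_tag=lambda r: SimpleNamespace(suitable=True),
    search=lambda r: "a rhyming candidate",
    generate=lambda r, t: f"{r} | {t}",
)
```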
6. The method of claim 5, wherein the determining the resource creation tag corresponding to the basic resource comprises:
when the basic resource is a basic video resource, determining a first creation tag corresponding to the basic video resource according to the basic text in the basic video resource through a text classification model; determining a second creation tag corresponding to the basic video resource according to the video frames in the basic video resource through an image classification model; determining the resource creation tag according to the first creation tag and the second creation tag;
and when the basic resource is a basic text resource, determining the resource creation tag according to the basic text corresponding to the basic text resource through the text classification model.
7. The method according to claim 1, wherein the candidate text library further comprises heat corresponding to each of the plurality of candidate texts; and the determining the reference candidate text based on the candidate texts included in the secondary candidate text set comprises:
selecting, based on the heat corresponding to each candidate text in the secondary candidate text set, a candidate text whose corresponding heat meets a preset heat condition as the reference candidate text.
8. The method of claim 1, wherein the candidate text library further comprises a text length corresponding to each of the plurality of candidate texts; and the determining the reference candidate text based on the candidate texts included in the secondary candidate text set comprises:
selecting, based on the text length corresponding to each candidate text in the secondary candidate text set, a candidate text whose corresponding text length meets a preset length matching condition with the text length of the basic text as the reference candidate text.
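Claims 7 and 8 narrow the secondary candidate text set by heat and by text length. A sketch under assumed threshold-style conditions (the actual preset heat and length-matching conditions are not fixed by the claims, and the Cand record is a hypothetical stand-in):

```python
from collections import namedtuple

# Hypothetical candidate record carrying the heat metadata of claims 7-8.
Cand = namedtuple("Cand", "text heat")

def select_reference(secondary_set, base_len, min_heat=1000, max_len_gap=2):
    # Claim 7: keep candidates whose heat meets the preset heat condition,
    # modelled here as a simple minimum-heat threshold.
    hot = [c for c in secondary_set if c.heat >= min_heat]

    # Claim 8: keep candidates whose text length meets the preset length
    # matching condition with the basic text, modelled here as a maximum
    # allowed difference in character count.
    matched = [c for c in hot if abs(len(c.text) - base_len) <= max_len_gap]

    # Return the hottest survivor; fall back if a filter empties the set.
    pool = matched or hot or list(secondary_set)
    return max(pool, key=lambda c: c.heat) if pool else None

best = select_reference(
    [Cand("short text", 5200), Cand("a much longer candidate", 9800)],
    base_len=10)
print(best.text)  # -> 'short text'
```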
9. The method of claim 1, wherein the candidate text library is constructed by:
acquiring an original candidate text library; the original candidate text library comprises a plurality of original candidate texts;
determining a text creation tag and an emotion tag corresponding to each original candidate text through a text classification model; the text creation tag is used for representing whether the original candidate text is suitable for being combined with other resources to generate new media resources;
and for each original candidate text, if the text creation tag corresponding to the original candidate text represents that the original candidate text is suitable for being combined with other resources to generate a new media resource, taking the original candidate text as a candidate text, and adding the candidate text, the emotion label corresponding to the candidate text, and the finals corresponding to the candidate text into the candidate text library.
10. The method according to claim 9, wherein the method further comprises:
for each candidate text, determining the heat corresponding to the candidate text based on the search volume corresponding to the candidate text on a search platform, and determining the text length corresponding to the candidate text;
and adding the heat and the text length corresponding to the candidate text into the candidate text library.
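A sketch of the library construction in claims 9 and 10; classify, finals_of and search_volume are assumed stand-ins for the text classification model, a pinyin-final extractor, and a search-platform statistics source:

```python
def build_candidate_library(original_texts, classify, finals_of, search_volume):
    library = []
    for text in original_texts:
        suitable, emotion = classify(text)  # text creation tag + emotion label
        if not suitable:                    # unsuitable texts are filtered out (claim 9)
            continue
        library.append({
            "text": text,
            "emotion": emotion,
            "finals": finals_of(text),       # double- and single-rhyme finals
            "heat": search_volume(text),     # heat derived from search volume (claim 10)
            "length": len(text),             # text length (claim 10)
        })
    return library

# Toy usage with constant stand-ins for the three helpers.
lib = build_candidate_library(
    ["candidate one", "candidate two"],
    classify=lambda t: (True, "joy"),
    finals_of=lambda t: ("an-i", "i"),
    search_volume=lambda t: 4200,
)
```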
11. The method according to any one of claims 1, 9 and 10, wherein the candidate text library stores mapping relations between finals and finals, mapping relations between finals and candidate texts, and tag contents corresponding to the candidate texts; and the tag contents comprise the emotion label, the heat and the text length corresponding to each candidate text.
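One plausible in-memory layout for the mappings of claim 11; the concrete storage scheme is not specified by the claim, so this nested-dictionary index is an illustration only:

```python
# Hypothetical index: final -> rhyming finals, and final -> candidate texts
# with their tag contents (emotion label, heat, text length).
candidate_index = {
    "ang": {
        "rhymes_with": ["ang", "iang", "uang"],   # final-to-final mapping
        "texts": {                                 # final-to-candidate-text mapping
            "some candidate text": {               # candidate text -> tag contents
                "emotion": "joy",
                "heat": 15320,
                "length": 9,
            },
        },
    },
}

def lookup(final):
    # Resolve a basic final to candidate texts via the final-to-final mapping.
    entry = candidate_index.get(final, {})
    hits = {}
    for rhyme in entry.get("rhymes_with", []):
        hits.update(candidate_index.get(rhyme, {}).get("texts", {}))
    return hits

print(lookup("ang"))  # -> tag contents for 'some candidate text'
```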
12. A resource generating apparatus, the apparatus comprising:
the basic information determining module is used for determining, for a basic resource to be matched, a basic emotion label corresponding to the basic resource and basic finals corresponding to a basic text in the basic resource; the basic finals comprise a basic double-rhyme final and a basic single-rhyme final, the basic double-rhyme final corresponding to the last two characters in the basic text and the basic single-rhyme final corresponding to the last character in the basic text;
the matching text searching module is used for searching a candidate text library for a reference candidate text matched with the basic resource based on the basic emotion label and the basic finals; the candidate text library stores a plurality of candidate texts, emotion labels corresponding to the candidate texts, and finals corresponding to the candidate texts; a preset emotion matching condition is met between the emotion label corresponding to the reference candidate text and the basic emotion label, and a preset rhyming condition is met between the finals corresponding to the reference candidate text and the basic finals;
the resource generation module is used for generating a target media resource according to the basic resource and the reference candidate text;
the matching text searching module comprises:
a primary searching unit, used for determining an emotion label that meets the preset emotion matching condition with the basic emotion label as a reference emotion label, and searching the candidate text library for candidate texts corresponding to the reference emotion label to form a primary candidate text set;
a secondary searching unit, used for determining finals that meet the preset rhyming condition with the basic finals as reference finals, and searching the primary candidate text set for candidate texts corresponding to the reference finals to form a secondary candidate text set;
a tertiary searching unit, used for determining the reference candidate text based on the candidate texts included in the secondary candidate text set;
wherein the secondary searching unit is specifically configured to:
determine a reference double-rhyme final based on the basic double-rhyme final, and search the primary candidate text set for candidate texts corresponding to the reference double-rhyme final;
if candidate texts corresponding to the reference double-rhyme final are found, form the secondary candidate text set from the candidate texts corresponding to the reference double-rhyme final;
and if no candidate text corresponding to the reference double-rhyme final is found, determine a reference single-rhyme final based on the basic single-rhyme final, and search the primary candidate text set for candidate texts corresponding to the reference single-rhyme final to form the secondary candidate text set.
13. The apparatus of claim 12, wherein when the basic resource is a basic video resource, the basic information determining module is specifically configured to:
determine, through a text classification model, a first emotion label corresponding to the basic video resource according to the basic text in the basic video resource;
determine, through an image classification model, a second emotion label corresponding to the basic video resource according to video frames in the basic video resource;
and determine the basic emotion label according to the first emotion label and the second emotion label.
14. The apparatus of claim 13, wherein the basic information determining module is specifically configured to:
determine, through a text coding structure in the text classification model, whole-text features corresponding to the basic text and unit text features corresponding to each text unit in the basic text according to the basic text; determine comprehensive text features corresponding to the basic text according to the whole-text features and the unit text features; and determine the first emotion label according to the comprehensive text features through a classification structure in the text classification model;
and determine, through an image coding structure in the image classification model, image features corresponding to the basic video resource according to the video frames; splice the image features with the comprehensive text features to obtain comprehensive image features; and determine the second emotion label according to the comprehensive image features through a classification structure in the image classification model.
15. The apparatus of claim 12, wherein when the basic resource is a basic text resource, the basic information determining module is specifically configured to:
and determining the basic emotion label corresponding to the basic text resource according to the basic text corresponding to the basic text resource through a text classification model.
16. The apparatus of any one of claims 12 to 14, wherein the basic information determining module is further configured to:
determining a resource creation tag corresponding to the basic resource; the resource creation tag is used for representing whether the basic resource is suitable for being combined with other texts to generate a new media resource;
if the resource creation tag represents that the basic resource is suitable for being combined with other texts to generate a new media resource, execute the step of searching the candidate text library for the reference candidate text matched with the basic resource based on the basic emotion label and the basic finals;
and if the resource creation tag represents that the basic resource is not suitable for being combined with other texts to generate a new media resource, skip the step of searching the candidate text library for the reference candidate text matched with the basic resource based on the basic emotion label and the basic finals.
17. The apparatus of claim 16, wherein the basic information determining module is specifically configured to:
when the basic resource is a basic video resource, determining a first creation tag corresponding to the basic video resource according to the basic text in the basic video resource through a text classification model; determining a second creation tag corresponding to the basic video resource according to the video frames in the basic video resource through an image classification model; determining the resource creation tag according to the first creation tag and the second creation tag;
and when the basic resource is a basic text resource, determining the resource creation tag according to the basic text corresponding to the basic text resource through the text classification model.
18. The apparatus of claim 12, wherein the candidate text library further comprises heat corresponding to each of the plurality of candidate texts; and the tertiary searching unit is specifically configured to:
select, based on the heat corresponding to each candidate text in the secondary candidate text set, a candidate text whose corresponding heat meets a preset heat condition as the reference candidate text.
19. The apparatus of claim 12, wherein the candidate text library further comprises a text length corresponding to each of the plurality of candidate texts; and the tertiary searching unit is specifically configured to:
select, based on the text length corresponding to each candidate text in the secondary candidate text set, a candidate text whose corresponding text length meets a preset length matching condition with the text length of the basic text as the reference candidate text.
20. The apparatus of claim 12, further comprising a text library construction module; the text library construction module is used for:
acquiring an original candidate text library; the original candidate text library comprises a plurality of original candidate texts;
determining a text creation tag and an emotion tag corresponding to each original candidate text through a text classification model; the text creation tag is used for representing whether the original candidate text is suitable for being combined with other resources to generate new media resources;
and for each original candidate text, if the text creation tag corresponding to the original candidate text represents that the original candidate text is suitable for being combined with other resources to generate a new media resource, take the original candidate text as a candidate text, and add the candidate text, the emotion label corresponding to the candidate text, and the finals corresponding to the candidate text into the candidate text library.
21. The apparatus of claim 20, wherein the text library construction module is further configured to:
for each candidate text, determine the heat corresponding to the candidate text based on the search volume corresponding to the candidate text on a search platform, and determine the text length corresponding to the candidate text;
and add the heat and the text length corresponding to the candidate text into the candidate text library.
22. The apparatus according to any one of claims 12, 20 and 21, wherein the candidate text library stores mapping relations between finals and finals, mapping relations between finals and candidate texts, and tag contents corresponding to the candidate texts; and the tag contents comprise the emotion label, the heat and the text length corresponding to each candidate text.
23. A computer device, the device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to perform the resource generating method of any one of claims 1 to 11 according to the computer program.
24. A computer-readable storage medium storing a computer program for executing the resource generating method according to any one of claims 1 to 11.