CN110909182A - Multimedia resource searching method and device, computer equipment and storage medium

Info

Publication number: CN110909182A
Application number: CN201911204886.0A
Authority: CN
Inventors: 张志伟; 林靖; 刘鹏
Original assignee: Reach Best Technology Co Ltd
Current assignee: Reach Best Technology Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-03-24
Anticipated expiration: 2039-11-29
Also published as: CN110909182B

Abstract

The disclosure relates to a multimedia resource searching method and apparatus, a computer device, and a storage medium, and belongs to the field of computer technologies. In the method, a search request is received; the text features of the keyword carried by the search request are fused with the multimedia features of each of a plurality of multimedia resources to obtain a plurality of fusion features with stronger feature expression capability; the fusion features are input into a click rate estimation model, which performs convolution processing on them and outputs the estimated click rates of the multimedia resources; and a search result is generated based on those estimated click rates, such that the estimated click rates of the multimedia resources in the search result meet a target condition. This improves the accuracy with which a server searches for multimedia resources and improves the user's search experience.

Description

Multimedia resource searching method and device, computer equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a multimedia resource searching method and apparatus, a computer device, and a storage medium.

Background

With the development of computer technology, a user can browse multimedia resources based on a terminal, for example, the user triggers the terminal to send a search request to a server, the server returns the multimedia resources obtained by searching based on the search request to the terminal, and the terminal receives and displays the multimedia resources.

In the related art, when searching for some multimedia resources such as audio and video, a server generally needs to label respective text information (such as a title, an introduction, etc.) for each multimedia resource, after parsing a search request of a terminal to obtain a keyword, the server retrieves the text information of each multimedia resource in a keyword retrieval manner, and if the keyword sent by the terminal can hit the text information of a certain multimedia resource, the server returns the multimedia resource to the terminal.

In the above process, the multimedia resources must be converted into text information before keyword retrieval can be performed, an approach that may be referred to as "text search". A large amount of detailed information is inevitably lost when a multimedia resource is converted into text information. As a result, even when the text information of a multimedia resource returned by the server matches the keyword, the multimedia resource may not be one that the user wants to browse. The server therefore searches multimedia resources with poor accuracy, which results in a poor search experience for the user.

Disclosure of Invention

The present disclosure provides a method, an apparatus, a computer device and a storage medium for searching multimedia resources, so as to at least solve the problem of poor accuracy in searching multimedia resources in the related art. The technical scheme of the disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided a multimedia resource searching method, including:

receiving a search request, wherein the search request carries a keyword to be searched;

respectively fusing the text features of the keywords with the multimedia features of a plurality of multimedia resources to obtain a plurality of fusion features;

inputting the fusion characteristics into a click rate estimation model, carrying out convolution processing on the fusion characteristics through the click rate estimation model, and outputting estimated click rates of the multimedia resources;

and generating a search result based on the estimated click rates of the plurality of multimedia resources, wherein the estimated click rates of the multimedia resources in the search result accord with a target condition.

In a possible implementation manner, the fusing the text features of the keyword with the multimedia features of the plurality of multimedia resources, respectively, to obtain a plurality of fused features includes:

and splicing the multimedia characteristics of any multimedia resource with the text characteristics to obtain a fusion characteristic.

In a possible implementation manner, before the fusing the text features of the keyword with the multimedia features of the plurality of multimedia resources respectively to obtain a plurality of fused features, the method further includes:

inputting the keywords into a word vector model, and embedding the keywords through the word vector model to obtain text characteristics of the keywords;

and inputting the multimedia resources into a convolutional neural network, and performing convolutional processing on the multimedia resources through the convolutional neural network to obtain multimedia characteristics of the multimedia resources.

In one possible embodiment, the generating search results based on the estimated click through rates of the plurality of multimedia resources comprises:

sorting the plurality of multimedia resources in descending order of estimated click rate, and packaging the multimedia resources ranked in the top target positions as the search result; or

sorting the plurality of multimedia resources in descending order of estimated click rate, determining the multimedia resources ranked within the top first target proportion, and packaging at least one of the multimedia resources ranked within the top first target proportion as the search result.
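The two ranking options above can be illustrated with a minimal sketch. The function names `top_k` and `top_proportion`, the resource ids, and the click rate values are hypothetical and chosen only for illustration; the disclosure does not prescribe a particular data structure.

```python
from typing import Dict, List


def top_k(ctr_by_resource: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k resources with the highest estimated click rate."""
    ranked = sorted(ctr_by_resource, key=ctr_by_resource.get, reverse=True)
    return ranked[:k]


def top_proportion(ctr_by_resource: Dict[str, float], proportion: float) -> List[str]:
    """Return the ids ranked within the leading `proportion` of the sorted list."""
    ranked = sorted(ctr_by_resource, key=ctr_by_resource.get, reverse=True)
    # Assumption: keep at least one resource when the list is non-empty.
    cutoff = max(1, int(len(ranked) * proportion))
    return ranked[:cutoff]


# Hypothetical estimated click rates for four resources.
estimated = {"video_a": 0.12, "video_b": 0.40, "video_c": 0.05, "video_d": 0.31}
print(top_k(estimated, 2))            # ['video_b', 'video_d']
print(top_proportion(estimated, 0.25))  # ['video_b']
```

In the second option the disclosure only requires that at least one resource from the leading proportion be packaged, so a caller could further sample from the returned list.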

In one possible implementation, before inputting the plurality of fusion features into a click-through rate prediction model, performing convolution processing on the plurality of fusion features through the click-through rate prediction model, and outputting the predicted click-through rates of the plurality of multimedia resources, the method further includes:

obtaining a plurality of sample training groups, wherein each sample training group comprises a sample keyword, a sample multimedia resource and an average click rate;

and performing iterative training on the initial click rate estimation model according to the plurality of sample training sets, and stopping training until a convergence condition is met to obtain the click rate estimation model.

In one possible embodiment, the obtaining a plurality of sample training sets comprises:

acquiring a plurality of average click rates according to click logs of a plurality of users, wherein one average click rate is used for expressing the average probability of clicking one sample multimedia resource by the plurality of users under the condition of searching one sample keyword;

constructing a plurality of tuples based on a plurality of sample keywords, a plurality of sample multimedia resources and the average click rates, wherein each tuple comprises a sample keyword, a sample multimedia resource and the average click rate of the sample keyword relative to the sample multimedia resource;

and screening the multiple tuples according to the average click rate in the multiple tuples to obtain the multiple sample training groups.

In a possible implementation manner, the screening the tuples according to the average click rate in the tuples to obtain the sample training sets includes:

sorting all tuples corresponding to any sample keyword in descending order of average click rate, and determining the tuples ranked within the top second target proportion and the tuples ranked within the bottom third target proportion as sample training groups.
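As a rough sketch of this screening step, the tuples can be grouped by keyword, sorted, and trimmed to their head and tail. The function name `screen_tuples`, the parameter names `head` and `tail` (standing in for the second and third target proportions), and the overlap guard are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (keyword, resource id, average click rate)


def screen_tuples(tuples: List[Sample], head: float, tail: float) -> List[Sample]:
    """Keep, per keyword, the top `head` and bottom `tail` fraction by click rate."""
    by_keyword: Dict[str, List[Sample]] = defaultdict(list)
    for item in tuples:
        by_keyword[item[0]].append(item)

    kept: List[Sample] = []
    for group in by_keyword.values():
        ranked = sorted(group, key=lambda t: t[2], reverse=True)
        n_head = int(len(ranked) * head)
        n_tail = int(len(ranked) * tail)
        # Assumption: never select the same tuple twice if head + tail overlap.
        n_tail = min(n_tail, len(ranked) - n_head)
        kept.extend(ranked[:n_head])
        if n_tail > 0:
            kept.extend(ranked[len(ranked) - n_tail:])
    return kept


# Hypothetical tuples for one keyword.
samples = [
    ("cat", "r1", 0.9), ("cat", "r2", 0.7), ("cat", "r3", 0.5),
    ("cat", "r4", 0.3), ("cat", "r5", 0.1),
]
print(screen_tuples(samples, 0.2, 0.2))  # keeps r1 (highest) and r5 (lowest)
```

Keeping only the extremes discards the ambiguous middle of the ranking, so the retained groups act as comparatively clear positive and negative examples.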

In a possible implementation manner, any click log of each user comprises a user identification, a sample keyword, a sample multimedia resource and click behavior, wherein the click behavior is used for indicating whether the user clicks one sample multimedia resource under the condition of searching one sample keyword;

the obtaining a plurality of average click rates according to the click logs of the plurality of users comprises:

determining a first target number and a second target number for at least one click log having the same sample keyword and the same sample multimedia resource, wherein the first target number is the number of click logs, among the at least one click log, whose click behavior is clicked, and the second target number is the total number of the at least one click log;

and dividing the first target number by the second target number to obtain a numerical value, and determining the numerical value as the average click rate of the sample keyword relative to the sample multimedia resource.
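The computation above amounts to a grouped ratio of clicked logs to all logs. A minimal sketch follows; the tuple layout of `ClickLog` and the function name `average_click_rates` are hypothetical, and the log entries are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ClickLog = Tuple[str, str, str, bool]  # (user id, keyword, resource id, clicked)


def average_click_rates(logs: Iterable[ClickLog]) -> Dict[Tuple[str, str], float]:
    """Average click rate per (keyword, resource) pair."""
    clicked = defaultdict(int)  # first target number: logs with a click
    total = defaultdict(int)    # second target number: all logs for the pair
    for _user, keyword, resource, was_clicked in logs:
        key = (keyword, resource)
        total[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {key: clicked[key] / total[key] for key in total}


# Hypothetical click logs from several users.
logs = [
    ("u1", "cat", "v1", True),
    ("u2", "cat", "v1", True),
    ("u3", "cat", "v1", False),
    ("u4", "cat", "v1", True),
    ("u1", "cat", "v2", False),
]
rates = average_click_rates(logs)
print(rates[("cat", "v1")])  # 0.75
print(rates[("cat", "v2")])  # 0.0
```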

In one possible embodiment, the convergence condition is that relative entropy values between estimated click rates and average click rates of the plurality of sample training sets are less than or equal to a target threshold.
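The disclosure does not spell out how the relative entropy is computed. One possible reading, sketched below, treats each click rate as the parameter of a Bernoulli distribution and averages the Kullback-Leibler divergence over the sample training groups. The Bernoulli treatment, the averaging, the clipping constant `eps`, and the function names are all assumptions.

```python
import math
from typing import Sequence


def bernoulli_kl(p: float, q: float, eps: float = 1e-7) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def has_converged(
    average: Sequence[float], estimated: Sequence[float], threshold: float
) -> bool:
    """True when the mean relative entropy is at or below the target threshold."""
    if not average or len(average) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_kl = sum(bernoulli_kl(p, q) for p, q in zip(average, estimated)) / len(average)
    return mean_kl <= threshold


# Hypothetical average (label) and estimated (model output) click rates.
print(has_converged([0.2, 0.5], [0.2, 0.5], 1e-6))  # True
print(has_converged([0.9], [0.1], 0.01))            # False
```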

According to a second aspect of the embodiments of the present disclosure, there is provided a multimedia resource searching apparatus, including:

a receiving unit configured to receive a search request, wherein the search request carries a keyword to be searched;

the fusion unit is configured to perform fusion of the text features of the keywords and the multimedia features of the multimedia resources respectively to obtain a plurality of fusion features;

the convolution unit is configured to input the plurality of fusion characteristics into a click rate estimation model, perform convolution processing on the plurality of fusion characteristics through the click rate estimation model, and output estimated click rates of the plurality of multimedia resources;

and a generating unit configured to generate a search result based on the estimated click rates of the plurality of multimedia resources, wherein the estimated click rates of the multimedia resources in the search result meet the target condition.

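In one possible embodiment, the fusion unit is configured to perform: splicing the multimedia features of any multimedia resource with the text features to obtain a fusion feature.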

In one possible embodiment, the apparatus is further configured to:

In a possible implementation, the generating unit is configured to perform:

In one possible embodiment, the apparatus further comprises:

the acquisition unit is configured to acquire a plurality of sample training sets, wherein each sample training set comprises a sample keyword, a sample multimedia resource and an average click rate;

and the training unit is configured to execute iterative training on the initial click rate estimation model according to the plurality of sample training sets, and stop training until a convergence condition is met to obtain the click rate estimation model.

In one possible implementation, the obtaining unit includes:

the obtaining subunit is configured to perform obtaining, according to the click logs of the multiple users, multiple average click rates, where one average click rate is used to represent an average probability that the multiple users click on a sample multimedia resource in a case of searching for a sample keyword;

a construction subunit configured to perform construction of a plurality of tuples based on a plurality of sample keywords, a plurality of sample multimedia resources, and the plurality of average click rates, each tuple including one sample keyword, one sample multimedia resource, and an average click rate of the sample keyword with respect to the sample multimedia resource;

and the screening subunit is configured to perform screening on the multiple tuples according to the average click rate in the multiple tuples to obtain the multiple sample training sets.

In one possible embodiment, the screening subunit is configured to perform:

the acquisition subunit is configured to perform:

According to a third aspect of embodiments of the present disclosure, there is provided a computer device comprising:

one or more processors;

one or more memories for storing the one or more processor-executable instructions;

wherein the one or more processors are configured to perform:

According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having at least one instruction which, when executed by one or more processors of a computer device, enables the computer device to perform a multimedia resource search method, the method comprising:

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising one or more instructions which, when executed by one or more processors of a computer device, enable the computer device to perform a multimedia resource search method, the method comprising:

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:

In the method, a search request is received, and the text features of the keyword carried by the search request are fused with the multimedia features of each of a plurality of multimedia resources to obtain a plurality of fusion features with stronger feature expression capability. The fusion features are input into a click rate estimation model, which performs convolution processing on them and outputs the estimated click rates of the multimedia resources. A search result is then generated based on the estimated click rates, such that the estimated click rates of the multimedia resources in the search result meet a target condition. Because the text features and the multimedia features are fused, a large loss of the detailed information in the multimedia features can be avoided; and because the estimated click rate of each multimedia resource is obtained through the click rate estimation model, the multimedia resources whose estimated click rates meet the target condition can be packaged into the search result. This improves the accuracy with which the server searches for multimedia resources and improves the user's search experience.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a diagram illustrating an implementation environment of a multimedia resource searching method according to an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method of multimedia resource searching in accordance with an exemplary embodiment;

FIG. 3 is a flow chart illustrating a method of multimedia resource searching in accordance with an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating an embodiment of the present disclosure for obtaining an estimated click rate;

FIG. 5 is a flowchart illustrating a training process of a click rate estimation model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram illustrating a logical structure of a multimedia resource searching apparatus according to an exemplary embodiment;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

The user information to which the present disclosure relates may be information authorized by the user or sufficiently authorized by each party.

Fig. 1 is a schematic diagram of an implementation environment of a multimedia resource searching method according to an exemplary embodiment. Referring to Fig. 1, the implementation environment may include at least one terminal 101 and a server 102, which are described in detail below.

An application program supporting a search function is installed and runs on the at least one terminal 101. The application program may be at least one of a browser, a social application, a live streaming application, a shopping application, and a payment application, and the category of the application program is not specifically limited in the embodiments of the present disclosure.

The server 102 may include at least one of a server, a plurality of servers, a cloud computing platform, or a virtualization center, and the server 102 is configured to provide a background service for an application supporting a search function. Alternatively, the server 102 may undertake primary computational tasks and at least one terminal 101 undertakes secondary computational tasks; alternatively, the server 102 may undertake secondary computing tasks, with at least one terminal 101 undertaking primary computing tasks; alternatively, at least one terminal 101 and the server 102 perform cooperative computing by using a distributed computing architecture.

The at least one terminal 101 and the server 102 may be connected to each other through a wired network or a wireless network.

In an exemplary scenario, a user may start an application on any one of the at least one terminal 101, the application may display a user interface carrying a search box, the user inputs a keyword to be searched in the search box, when the terminal detects a trigger operation of the user on a search option, the terminal generates a search request carrying the keyword, and sends the search request to the server 102. The server 102 receives a search request from a terminal, generates a search result based on the multimedia resource search method provided by the embodiment of the present disclosure, and sends the search result to the terminal, which will be described in detail in the following embodiment.

The applications installed on the terminals of the at least one terminal 101 may be the same application, or may be the same type of application on different operating system platforms. The device types of the terminals may be the same or different, and may include at least one of a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (MPEG-4 Part 14) player, a laptop computer, or a desktop computer. The following embodiments are described with the terminal being a smartphone as an example.

Those skilled in the art will appreciate that there may be only one terminal, or there may be tens, hundreds, or more terminals; the number and the device type of the at least one terminal 101 are not specifically limited in the embodiments of the present disclosure.

Fig. 2 is a flowchart illustrating a multimedia resource searching method according to an exemplary embodiment. Referring to Fig. 2, the multimedia resource searching method may be applied to a computer device. The following description takes as an example the case where the computer device is the server 102 in the above implementation environment.

In step 201, the server receives a search request, where the search request carries a keyword to be searched.

In step 202, the server fuses the text features of the keyword with the multimedia features of the multimedia resources, respectively, to obtain a plurality of fused features.

In step 203, the server inputs the fusion features into a click-through rate estimation model, performs convolution processing on the fusion features through the click-through rate estimation model, and outputs estimated click-through rates of the multimedia resources.

In step 204, the server generates a search result based on the estimated click rates of the plurality of multimedia resources, wherein the estimated click rates of the multimedia resources in the search result meet the target condition.
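Steps 201 to 204 can be pictured as one small pipeline. The sketch below is illustrative only: `embed_keyword` and `estimate_ctr` are hypothetical callables standing in for the word vector model and the click rate estimation model, fusion is done by concatenation, and the target condition is assumed to be "ranked in the top `top_n`".

```python
from typing import Callable, Dict, List

Vector = List[float]


def search(
    keyword: str,
    resource_features: Dict[str, Vector],
    embed_keyword: Callable[[str], Vector],
    estimate_ctr: Callable[[Vector], float],
    top_n: int,
) -> List[str]:
    """Rank resources for a keyword by estimated click rate."""
    # Step 201 has already yielded `keyword`; obtain its text feature.
    text_feature = embed_keyword(keyword)
    # Step 202: fuse the text feature with each resource's multimedia feature.
    fused = {rid: text_feature + feat for rid, feat in resource_features.items()}
    # Step 203: estimate a click rate for every fused feature.
    ctr = {rid: estimate_ctr(vec) for rid, vec in fused.items()}
    # Step 204: keep the resources whose estimated click rate ranks highest.
    ranked = sorted(ctr, key=ctr.get, reverse=True)
    return ranked[:top_n]


# Toy stand-ins; a real system would use trained models here.
features = {"a": [0.1], "b": [0.9], "c": [0.5]}
result = search(
    "cat",
    features,
    embed_keyword=lambda kw: [float(len(kw))],
    estimate_ctr=lambda vec: vec[-1],
    top_n=2,
)
print(result)  # ['b', 'c']
```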

In the method provided by the embodiments of the present disclosure, a search request is received, and the text features of the keyword carried by the search request are fused with the multimedia features of each of a plurality of multimedia resources to obtain a plurality of fusion features with stronger feature expression capability. The fusion features are input into a click rate estimation model, which performs convolution processing on them and outputs the estimated click rates of the multimedia resources. A search result is then generated based on the estimated click rates, such that the estimated click rates of the multimedia resources in the search result meet a target condition. Because the text features and the multimedia features are fused, a large loss of the detailed information in the multimedia features can be avoided; and because the estimated click rate of each multimedia resource is obtained through the click rate estimation model, the multimedia resources whose estimated click rates meet the target condition can be packaged into the search result. This improves the accuracy with which the server searches for multimedia resources and improves the user's search experience.

In one possible embodiment, fusing the text features of the keyword with the multimedia features of the plurality of multimedia resources, respectively, to obtain a plurality of fused features includes:

In a possible implementation manner, before the text features of the keyword are respectively fused with the multimedia features of the plurality of multimedia resources to obtain a plurality of fused features, the method further includes:

inputting the keyword into a word vector model, and embedding the keyword through the word vector model to obtain the text characteristics of the keyword;

and inputting the multimedia resources into a convolutional neural network, and performing convolutional processing on the multimedia resources through the convolutional neural network to obtain the multimedia characteristics of the multimedia resources.

In one possible embodiment, generating the search result based on the estimated click through rates of the plurality of multimedia resources comprises:

In one possible embodiment, before inputting the fusion features into a click-through rate estimation model, performing convolution processing on the fusion features through the click-through rate estimation model, and outputting the estimated click-through rates of the multimedia resources, the method further includes:

and performing iterative training on the initial click rate estimation model according to the plurality of sample training sets until the initial click rate estimation model meets the convergence condition, and obtaining the click rate estimation model.

In one possible embodiment, obtaining a plurality of sample training sets comprises:

and screening the multiple tuples according to the average click rate in the multiple tuples to obtain the multiple sample training sets.

In one possible embodiment, the screening the tuples according to the average click rate of the tuples to obtain the sample training sets includes:

In one possible implementation, any click log of each user comprises a user identifier, a sample keyword, a sample multimedia resource and click behavior, wherein the click behavior is used for indicating whether the user clicks one sample multimedia resource under the condition of searching one sample keyword;

according to the click logs of a plurality of users, acquiring a plurality of average click rates comprises the following steps:

determining a first target number and a second target number for at least one click log with the same sample keyword and sample multimedia resources, wherein the first target number is the number of the clicked click logs in the click behavior, and the second target number is the number of the at least one click logs;

In one possible embodiment, the convergence condition is that relative entropy values between the estimated click rates and the average click rates of the plurality of sample training sets are less than or equal to a target threshold.

All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

Fig. 3 is a flowchart illustrating a multimedia resource searching method according to an exemplary embodiment. Referring to Fig. 3, the multimedia resource searching method may be applied to a computer device. The following description takes as an example the case where the computer device is the server 102 in the above implementation environment.

In step 301, the server receives a search request of the terminal, where the search request carries a keyword to be searched.

In the above process, when the server receives any request, the server may analyze a header field of the request, determine the request as a search request when the header field carries a search identifier, and analyze the search request to obtain a keyword to be searched. The keywords may include texts, numbers, letters, special symbols, and the like, and the content of the keywords is not specifically limited in the embodiment of the present disclosure.
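A minimal sketch of this request check follows. The header name `X-Request-Type`, the identifier value `search`, and the body field `keyword` are hypothetical; the disclosure only states that a header field carries a search identifier.

```python
from typing import Mapping, Optional

SEARCH_FLAG = "search"  # hypothetical search identifier value


def extract_keyword(
    headers: Mapping[str, str], body: Mapping[str, str]
) -> Optional[str]:
    """Return the keyword if the request is a search request, else None."""
    # Hypothetical header field carrying the search identifier.
    if headers.get("X-Request-Type") != SEARCH_FLAG:
        return None
    return body.get("keyword")


print(extract_keyword({"X-Request-Type": "search"}, {"keyword": "cat video"}))
print(extract_keyword({"X-Request-Type": "upload"}, {"keyword": "cat video"}))
```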

In some embodiments, the terminal may send the search request to the server as follows. The user starts an application program on the terminal, and the application program displays a user interface, which may be a search interface that provides the search function or the home page of the application program. The user interface may include a search box and a search option, and the search option may be located near the search box. The user clicks the search box and inputs the keyword to be searched: the user may type the keyword as text through an on-screen keyboard, or may speak the keyword, in which case the terminal converts the speech into text. When the terminal detects that the user has triggered the search option, the terminal generates a search request carrying the keyword and sends the search request to the server.

In step 302, the server inputs the keyword into a word vector model, and embeds the keyword through the word vector model to obtain the text feature of the keyword.

After parsing the search request to obtain the keyword, the server may input the keyword into a pre-stored word vector model, which embeds the keyword to obtain a word vector of the keyword; the server determines this word vector as the text feature of the keyword. In other words, the word vector model obtains the word vector of the keyword by word embedding, so that the keyword is converted from text form into a vector form that a computer can process, which improves the processability and expression capability of the text feature of the keyword.
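At its simplest, word embedding is a table lookup from tokens to vectors. The sketch below assumes whitespace tokenization, a tiny invented embedding table, and averaging of token vectors for multi-word keywords; none of these choices is specified by the disclosure, and a trained word vector model would supply the table in practice.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embedding table.
EMBEDDINGS: Dict[str, List[float]] = {
    "cat": [0.2, 0.8, 0.1],
    "video": [0.7, 0.1, 0.5],
}
UNKNOWN: List[float] = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary tokens


def embed_keyword(keyword: str) -> List[float]:
    """Map a keyword to a text feature by averaging its token vectors."""
    tokens = keyword.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
    dim = len(UNKNOWN)
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


print(embed_keyword("cat"))  # [0.2, 0.8, 0.1]
```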

The word vector model may be a Chinese word vector model (Chinese Word2Vector) or a word vector model for another language; word vector models for different languages may be adopted according to the language of the keyword, and the type of the word vector model is not specifically limited in the embodiments of the present disclosure. Optionally, the word vector model may be trained synchronously with the click rate estimation model; the training process is detailed in the next embodiment.

In some embodiments, instead of obtaining the text feature of the keyword through the word vector model in step 302, the server may perform one-hot encoding on the keyword to obtain a feature vector of the keyword, and determine that feature vector as the text feature of the keyword.
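One-hot encoding can be sketched as follows, assuming a fixed vocabulary of known keywords. The vocabulary contents and the all-zeros fallback for unknown keywords are assumptions for illustration.

```python
from typing import List, Sequence


def one_hot(keyword: str, vocabulary: Sequence[str]) -> List[int]:
    """Return a vector with a single 1 at the keyword's vocabulary index."""
    vector = [0] * len(vocabulary)
    if keyword in vocabulary:
        vector[list(vocabulary).index(keyword)] = 1
    # Assumption: an out-of-vocabulary keyword maps to the all-zeros vector.
    return vector


vocabulary = ["cat", "dog", "bird"]  # hypothetical vocabulary
print(one_hot("dog", vocabulary))  # [0, 1, 0]
```

The resulting vector is as long as the vocabulary, which is one reason a learned embedding is often preferred when the vocabulary is large.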

In step 303, the server inputs the multimedia resources into a convolutional neural network, and performs convolutional processing on the multimedia resources through the convolutional neural network to obtain multimedia features of the multimedia resources.

The plurality of multimedia resources may be part or all of multimedia resources pre-stored in each database, and the multimedia resources may be at least one of text resources, video resources, audio resources, picture resources, or web resources.

Optionally, if a search scope is specified in the search request, the server may screen the multimedia resources that meet the search scope from the respective databases. For example, if the search request specifies a search for video resources, the server may determine all video resources pre-stored in the video database as the plurality of multimedia resources. As another example, if the search request specifies a search for picture resources, the server may determine all picture resources pre-stored in the picture database as the plurality of multimedia resources. As a further example, if the search request specifies a search for multimedia resources from before a certain historical time point, the server may determine the multimedia resources stored in each database before that time point as the plurality of multimedia resources.
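The scope screening described above can be sketched as a simple filter. The `Resource` record, its field names, and the sample data are hypothetical; a production system would more likely push these conditions into the database query.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Resource:
    resource_id: str
    kind: str            # e.g. "video" or "picture" (hypothetical labels)
    stored_at: datetime  # time the resource was stored in the database


def candidates(
    resources: List[Resource],
    kind: Optional[str] = None,
    before: Optional[datetime] = None,
) -> List[Resource]:
    """Keep the resources that satisfy the optional type and time scope."""
    selected = []
    for resource in resources:
        if kind is not None and resource.kind != kind:
            continue
        if before is not None and not resource.stored_at < before:
            continue
        selected.append(resource)
    return selected


library = [
    Resource("v1", "video", datetime(2019, 1, 5)),
    Resource("p1", "picture", datetime(2019, 3, 1)),
    Resource("v2", "video", datetime(2019, 9, 20)),
]
videos = candidates(library, kind="video")
early = candidates(library, before=datetime(2019, 6, 1))
print([r.resource_id for r in videos])  # ['v1', 'v2']
print([r.resource_id for r in early])   # ['v1', 'p1']
```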

In the above process, the server obtains the plurality of multimedia resources from the database. For any one of the multimedia resources, the server may input the multimedia resource into a convolutional neural network (CNN). The CNN may include at least one convolutional layer, and adjacent convolutional layers are connected in series; that is, the output map of any convolutional layer serves as the input map of the next convolutional layer. The multimedia resource is convolved by the at least one convolutional layer in the CNN, and the output map of the last convolutional layer is obtained as the multimedia feature of the multimedia resource. The above steps are repeated until the multimedia feature of each multimedia resource is obtained.
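The serial connection of convolutional layers can be illustrated in one dimension. The sketch below is a toy: it uses a 1-D signal instead of an image or video, fixed kernels instead of trained weights, and a ReLU activation, all of which are assumptions made to keep the example short.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid 1-D cross-correlation followed by ReLU."""
    width = len(kernel)
    output = []
    for start in range(len(signal) - width + 1):
        acc = sum(signal[start + i] * kernel[i] for i in range(width))
        output.append(max(0.0, acc))
    return output


def extract_feature(
    signal: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[float]:
    """Feed each layer's output map into the next layer; return the last map."""
    feature_map = list(signal)
    for kernel in kernels:
        feature_map = conv1d(feature_map, kernel)
    return feature_map


# Two layers in series: the first maps 5 values to 4, the second 4 to 3.
feature = extract_feature([1.0, 2.0, 3.0, 4.0, 5.0], [[1.0, 1.0], [0.5, 0.5]])
print(feature)  # [4.0, 6.0, 8.0]
```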

In some embodiments, instead of obtaining the multimedia features of every multimedia resource through the CNN in step 303, the server may use different models for different types of multimedia resources. For example, the server may obtain the multimedia features of a picture resource by using the CNN, the multimedia features of a video resource by using an LSTM (Long Short-Term Memory) network, the multimedia features of an audio resource by using VGG (Visual Geometry Group), and the like. In this way, the server can perform targeted feature extraction on the different types of multimedia resources, which can improve the expression capability of the multimedia features.

In step 304, the server fuses the text features of the keyword with the multimedia features of the multimedia resources, respectively, to obtain a plurality of fused features.

Optionally, for the multimedia features of any multimedia resource, the server may splice the multimedia features and the text features to obtain a fusion feature, and repeatedly execute the above process, so that the text features may be respectively fused with the multimedia features to obtain the fusion features. The feature information of the text feature and the multimedia feature can be kept as much as possible in a splicing mode, and a part of feature information is prevented from being lost in a feature fusion process.

When the concatenation is performed, for example, a 16-dimensional text feature is obtained through the word vector model, and a 32-dimensional multimedia feature is obtained through the CNN, then a 48-dimensional fusion feature can be obtained through the concatenation process.
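The splicing in this example can be sketched as follows. The helper name `concat_fusion` and the placeholder feature values are assumptions made for illustration; only the dimensions (16, 32, and 48) come from the example above.

```python
from typing import List


def concat_fusion(text_feature: List[float], multimedia_feature: List[float]) -> List[float]:
    """Splice the two feature vectors end to end, keeping every element of both."""
    return list(text_feature) + list(multimedia_feature)


# One 16-dimensional text feature is fused with each 32-dimensional multimedia feature.
text_feature = [0.1] * 16
multimedia_features = [[0.2] * 32, [0.3] * 32]
fusion_features = [concat_fusion(text_feature, m) for m in multimedia_features]
```

Each resulting fusion feature has 48 dimensions, and the same text feature is reused for every multimedia resource.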

In some embodiments, for the multimedia feature of any multimedia resource, the server may also fuse the text feature with the multimedia feature in the following manner: the server performs dimension transformation on the multimedia feature to obtain a transformed multimedia feature whose dimension is the same as that of the text feature, and then adds each element of the text feature to the element at the corresponding position of the transformed multimedia feature to obtain a fusion feature. In this way, the text feature and the multimedia feature can be fused more compactly.
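A minimal sketch of this additive fusion is given below. The disclosure does not state how the dimension transformation is performed; a linear projection with a weight matrix is assumed here, and the names `project` and `additive_fusion` are hypothetical.

```python
from typing import List


def project(feature: List[float], weights: List[List[float]]) -> List[float]:
    """Assumed dimension transformation: a linear map with one weight row per output dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def additive_fusion(
    text_feature: List[float],
    multimedia_feature: List[float],
    weights: List[List[float]],
) -> List[float]:
    """Transform the multimedia feature to the text dimension, then add element by element."""
    transformed = project(multimedia_feature, weights)
    if len(transformed) != len(text_feature):
        raise ValueError("transformed multimedia feature must match the text feature dimension")
    return [t + m for t, m in zip(text_feature, transformed)]


# Toy example: a 4-dimensional multimedia feature is reduced to 2 dimensions.
weights = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
fused = additive_fusion([1.0, -1.0], [2.0, 4.0, 6.0, 8.0], weights)
```

Unlike splicing, the fused vector keeps the dimension of the text feature, at the cost of mixing the two sources of information in each element.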

In step 305, the server inputs the fusion features into a click-through rate estimation model, performs convolution processing on the fusion features through the click-through rate estimation model, and outputs estimated click-through rates of the multimedia resources.

The Click Through Rate (CTR) estimation model is used for predicting the probability that a user clicks a certain multimedia resource when searching for a certain keyword. The click rate estimation model may be pre-stored locally by the server, and may be obtained through the training steps described in the following embodiments, which are not described here.

Optionally, the click rate estimation model may be a Deep Neural Network (DNN). The DNN may include at least one hidden layer and a normalization layer, and adjacent hidden layers are connected in series, that is, the output map of any hidden layer serves as the input map of the next hidden layer. For any fusion feature, the server may input the fusion feature into the at least one hidden layer in the DNN, perform convolution processing on the fusion feature through the at least one hidden layer, and input the output map of the last hidden layer into the normalization layer. The normalization layer performs exponential normalization (softmax) processing on the output map of the last hidden layer to obtain the estimated click rate of the multimedia resource corresponding to the fusion feature. The server repeats these steps until the estimated click rate of each multimedia resource is obtained.
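The forward pass through the hidden layers and the normalization layer can be sketched as follows. Several details are assumptions made for illustration: the hidden layers are modelled as fully connected layers with ReLU, the last hidden layer is linear and has two outputs (for "click" and "no click"), and the names `dense`, `softmax`, and `estimate_ctr` are hypothetical.

```python
import math
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Exponential normalization, shifted by the maximum for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_ctr(fusion_feature: List[float], hidden_layers: List[Layer]) -> float:
    """Run serially connected hidden layers, then softmax; return P(click)."""
    h = fusion_feature
    for index, (weights, bias) in enumerate(hidden_layers):
        h = dense(h, weights, bias)
        if index < len(hidden_layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on all but the last hidden layer
    return softmax(h)[0]  # index 0 is taken to be the "click" label


# Toy network with two hidden layers; the weights are arbitrary.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
ctr = estimate_ctr([1.0, 2.0], layers)
```

Because the softmax outputs sum to 1, the value returned for the "click" label can be read directly as an estimated click rate between 0 and 1.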

Fig. 4 is a schematic diagram of obtaining an estimated click rate according to an embodiment of the disclosure. Referring to fig. 4, an end-to-end model is shown, taking video resources as an example of the multimedia resources. The keywords carried in the search request may be called "Query words". On one hand, the server inputs the Query words into a Word2Vec word vector model, performs embedding processing on the Query words through the word vector model, and outputs the text features of the Query words. On the other hand, the server inputs the video resources (namely "Video videos" in the figure) into a CNN, performs convolution processing on the video resources through the CNN, and outputs the multimedia features of the video resources. The server then fuses the text features of the Query words with the multimedia features of the video resources to obtain fusion features, inputs the fusion features into a DNN, performs convolution processing on the fusion features through the DNN, and outputs the estimated click rates of the video resources.

In some embodiments, in addition to the DNN, the click rate estimation model may also be a Wide & Deep network (a network combining a wide component and a deep component), a GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting), and the like. The type of the click rate estimation model is not specifically limited in the embodiments of the present disclosure.

In step 306, the server generates a search result based on the estimated click rates of the plurality of multimedia resources, wherein the estimated click rates of the multimedia resources in the search result meet the target condition.

The target condition may be that the estimated click rate of the multimedia resource ranks within the top target number of positions, or the target condition may be that the estimated click rate of the multimedia resource ranks within the top first target proportion. The first target proportion is any value greater than or equal to 0 and less than or equal to 1.

In some embodiments, the server may sort the plurality of multimedia resources in descending order of estimated click rate and encapsulate the multimedia resources ranked within the top target number of positions as the search result, so that the estimated click rates of the multimedia resources in the search result are as high as possible. Optionally, since some multimedia resources (e.g., video resources) usually occupy a large amount of space, the titles, thumbnails, and jump links of the multimedia resources ranked within the top target number of positions may be encapsulated as the search result instead, which reduces the time consumed by resource transmission.
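The ranking and lightweight packaging described above can be sketched as follows. The `Resource` class, its field names, and the function `top_n_results` are hypothetical; the disclosure only states that titles, thumbnails, and jump links may be packaged in place of the resources themselves.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resource:
    title: str
    thumbnail: str
    link: str
    estimated_ctr: float


def top_n_results(resources: List[Resource], n: int) -> List[Dict[str, str]]:
    """Sort by estimated click rate, highest first, and package only lightweight metadata."""
    ranked = sorted(resources, key=lambda r: r.estimated_ctr, reverse=True)
    return [
        {"title": r.title, "thumbnail": r.thumbnail, "link": r.link}
        for r in ranked[:n]
    ]


# Toy candidates with made-up estimated click rates.
candidates = [
    Resource("video a", "a.jpg", "/video/a", 0.10),
    Resource("video b", "b.jpg", "/video/b", 0.90),
    Resource("video c", "c.jpg", "/video/c", 0.50),
]
search_result = top_n_results(candidates, 2)
```

The full resource is then fetched only when the user follows a jump link, which keeps the search response small.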

In some embodiments, the server may sort the plurality of multimedia resources in descending order of estimated click rate, determine the multimedia resources ranked within the top first target proportion, and encapsulate at least one of those multimedia resources as the search result. In the above process, the server may arbitrarily select at least one multimedia resource from the multimedia resources ranked within the top first target proportion and encapsulate the at least one multimedia resource as the search result, so that over-fitting of the search result can be avoided and the generalization and randomness of the search result are increased. In some embodiments, since some multimedia resources (e.g., video resources) usually occupy a large amount of space, the title, thumbnail, and jump link of the at least one multimedia resource may be encapsulated as the search result instead, which reduces the time consumed by resource transmission.
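A minimal sketch of selecting from the top first target proportion is given below. The function name `sample_from_top_proportion`, the use of uniform random sampling, and rounding the pool size up are assumptions; the disclosure only says that at least one resource is selected from the top-ranked portion.

```python
import math
import random
from typing import List, Optional, Tuple

Scored = Tuple[str, float]  # (resource identifier, estimated click rate)


def sample_from_top_proportion(
    scored: List[Scored],
    proportion: float,
    count: int,
    rng: Optional[random.Random] = None,
) -> List[Scored]:
    """Rank by estimated click rate, keep the top proportion, then pick `count` items from it."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = rng or random.Random()
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    pool = ranked[: math.ceil(len(ranked) * proportion)]
    return rng.sample(pool, min(count, len(pool)))


# Toy scores; a seeded generator keeps the example repeatable.
scored = [("v1", 0.9), ("v2", 0.2), ("v3", 0.7), ("v4", 0.4)]
picked = sample_from_top_proportion(scored, 0.5, 1, random.Random(0))
```

Sampling from the pool, rather than always returning its first entries, is what introduces the randomness mentioned above.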

In step 307, the server transmits the search result to the terminal.

In the above process, after the server generates the search result, the server sends the search result to the terminal, and when the terminal receives the search result, each multimedia resource in the search result may be displayed in a user interface provided by the application program, for example, a display area of the search result may be located below the search box.

In some embodiments, if the search result carries the title, thumbnail, and jump link of each multimedia resource, the terminal may display the title, thumbnail, and jump link of each multimedia resource in the user interface. When a click operation by the user on the jump link of any multimedia resource is detected, the terminal sends a resource request to the server, where the resource request is used for requesting access to the multimedia resource. The server responds to the resource request by sending the multimedia resource to the terminal, and when the terminal receives the multimedia resource, the terminal may display the multimedia resource in the application program.

In the above scenario, the server can issue each multimedia resource with higher accuracy to the terminal during the search, that is, the matching degree between each multimedia resource displayed by the terminal and the liking of the user is higher, so that the user is more willing to click the searched multimedia resource, the click rate and the conversion rate of the multimedia resource are improved, and the search experience of the user is improved.

The above embodiment shows how the server performs a multimedia resource search based on the click rate estimation model. The following embodiment of the present disclosure describes in detail how the server trains the click rate estimation model.

Fig. 5 is a training flowchart of a click-through rate estimation model provided in an embodiment of the present disclosure, and referring to fig. 5, the embodiment may be applied to a model training device, and the following description takes the model training device as a server as an example.

In step 501, the server obtains a plurality of average click rates according to the click logs of a plurality of users, where one average click rate is used to represent an average probability that the plurality of users click on a sample multimedia resource in the case of searching for a sample keyword.

In the above process, the server may extract the click logs of the multiple users from the background system of the application program, and obtain each average click rate according to the click logs of the users, where the multiple users may be some or all users registered on the application program, and the selection range of the multiple users is not specifically limited in the embodiment of the present disclosure.

In some embodiments, any of the click logs of each user includes a user identification, a sample keyword, a sample multimedia asset, and a click behavior indicating whether the user clicked on a sample multimedia asset in case of searching for a sample keyword.

Taking video resources as an example of the multimedia resources, a click log of a certain user may be stored as a {user, query, video, click} quadruple. Here, "user" is the user identifier, which may be at least one of a user name, a user identification code, a user phone number, or a user mailbox, that is, information that can uniquely identify a certain user. "query" is the sample keyword and refers to a search keyword that the user has input during a search. "video" is the sample multimedia resource and refers to any multimedia resource that the terminal displayed when the user searched for the sample keyword in the quadruple. "click" is the click behavior and indicates whether the user clicked the sample multimedia resource in the quadruple when searching for the sample keyword in the quadruple. Optionally, "click" may be a binary value: a value of 1 indicates that the user clicked the sample multimedia resource in the quadruple when searching for the sample keyword in the quadruple, and a value of 0 indicates that the user did not. For example, the click log {user A, game X, VideoID, 1} indicates that user A searched for game X and clicked on the video resource with the given VideoID in a search interface (also referred to as a search system) provided by the application.

In some embodiments, the server may cluster the click logs of the multiple users according to whether the sample keywords and the sample multimedia resources are the same, may cluster the click logs having the same sample keywords and the same sample multimedia resources into the same set, and then calculate the average click rate by using the set as a unit, and may obtain each average click rate of each sample keyword relative to each sample multimedia resource by repeating the above process.

Optionally, for at least one click log with the same sample keyword and the same sample multimedia resource (i.e., each click log in a set), the server may obtain the average click rate as follows: the server determines a first target number and a second target number, where the first target number is the number of click logs whose click behavior is "clicked", and the second target number is the total number of the at least one click log; the server then determines the value obtained by dividing the first target number by the second target number as the average click rate of the sample keyword relative to the sample multimedia resource. The server may repeat the above steps for each set, thereby obtaining the average click rate of each set.

Based on the above example, after obtaining the click logs in the form of {user, query, video, click} quadruples, the server may divide all click logs into sets by query and video, so that click logs with the same query and the same video fall into the same set. Each click log in a set represents the click behavior of one user when the search system showed the video to that user after the user searched for the query. Therefore, an average click rate of the query relative to the video can be obtained within the set: the number of click logs with a "click" value of 1 in the set is determined as the first target number, the number of all click logs in the set is determined as the second target number, and the value obtained by dividing the first target number by the second target number is determined as the average click rate of the query relative to the video. For example, suppose the set includes 10 click logs corresponding to 10 users, and the search system presented the same video to these 10 users when they searched for the same query. If 4 users clicked on the video and 6 did not, then the first target number is 4, the second target number is 10, and the average click rate of the query relative to the video is 0.4.
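The grouping and division in this example can be sketched as follows. The function name `average_click_rates` and the tuple layout of a click log are assumptions made for illustration; the example data reproduce the case of 10 users of whom 4 clicked.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ClickLog = Tuple[str, str, str, int]  # (user, query, video, click), click is 0 or 1


def average_click_rates(click_logs: Iterable[ClickLog]) -> Dict[Tuple[str, str], float]:
    """Group logs by (query, video) and divide the clicked count by the total count."""
    clicked = defaultdict(int)  # first target number per set
    shown = defaultdict(int)    # second target number per set
    for _user, query, video, click in click_logs:
        key = (query, video)
        shown[key] += 1
        clicked[key] += click
    return {key: clicked[key] / shown[key] for key in shown}


# 10 users searched for the same query and were shown the same video; 4 of them clicked.
logs = [(f"user{i}", "game X", "video1", 1 if i < 4 else 0) for i in range(10)]
rates = average_click_rates(logs)
```

A single pass over the logs is enough, since only two counters per (query, video) set need to be maintained.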

In the process, the average click rate of at least one click log with the same sample keywords and sample multimedia resources is calculated through the first target number and the second target number, massive click logs can be quickly sorted and clustered, so that each average click rate is quickly calculated, and the processing logic of the click logs by the server is optimized.

In step 502, the server constructs a plurality of tuples based on a plurality of sample keywords, a plurality of sample multimedia resources, and the average click rates, wherein each tuple includes a sample keyword, a sample multimedia resource, and the average click rate of the sample keyword relative to the sample multimedia resource.

After the server obtains a plurality of average click rates through step 501, each average click rate uniquely corresponds to one sample keyword and one sample multimedia resource, so that a tuple can be constructed for each average click rate, the tuple at least includes three elements, namely the average click rate, the sample keyword and the sample multimedia resource, and each element in each tuple has a corresponding relationship.

Based on the above example, the server may construct a plurality of {query, video, click_ratio} triples based on a plurality of sample keywords, a plurality of sample multimedia resources, and a plurality of average click rates, where "click_ratio" is the average click rate, which may also be referred to as the "soft" voting result of "query" relative to "video".

In step 503, the server screens the tuples according to the average click rate of the tuples to obtain a plurality of sample training sets.

In some embodiments, for the tuples corresponding to any sample keyword, the server may sort the tuples in descending order of average click rate, and determine the tuples ranked within the top second target proportion and the tuples ranked within the bottom third target proportion as sample training groups. When searching for the sample keyword, a user has a high probability of clicking the sample multimedia resources in the tuples ranked within the top second target proportion and a low probability of clicking the sample multimedia resources in the tuples ranked within the bottom third target proportion. In other words, both the top-ranked tuples and the bottom-ranked tuples carry a clear-cut signal. Therefore, the server may take the tuples ranked within the top second target proportion as positive samples during training, take the tuples ranked within the bottom third target proportion as negative samples during training, and discard the remaining tuples, thereby completing data cleaning for the massive tuples and selecting more valuable training data. Each of the second target proportion and the third target proportion is any value greater than or equal to 0 and less than or equal to 1.

Based on the above example, after the server obtains a plurality of {query, video, click_ratio} triples, the triples with the same query are regarded as one set, so that each set contains several triples that share the same query. The server may sort the triples in each set by click_ratio. Taking the second target proportion and the third target proportion both as 30%, for each set with the same query, the server selects only the triples whose click_ratio is in the top 30% or the bottom 30% as sample training groups and discards the triples whose click_ratio is in the middle 40%. By repeating this process for every set, the server obtains a plurality of {query, video, click_ratio} sample training groups after data cleaning.
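The screening in this example can be sketched as follows. The function name `screen_triples` and the rounding of the group sizes are assumptions; the disclosure gives only the 30% / 40% / 30% split.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, float]  # (query, video, click_ratio)


def screen_triples(triples: List[Triple], top: float = 0.3, bottom: float = 0.3) -> List[Triple]:
    """Per query, keep the top and bottom shares by click_ratio and discard the middle."""
    by_query: Dict[str, List[Triple]] = defaultdict(list)
    for triple in triples:
        by_query[triple[0]].append(triple)

    kept: List[Triple] = []
    for group in by_query.values():
        ranked = sorted(group, key=lambda t: t[2], reverse=True)
        n = len(ranked)
        n_top = round(n * top)                          # rounding rule is an assumption
        n_bottom = min(round(n * bottom), n - n_top)    # never overlap the top share
        kept.extend(ranked[:n_top])                     # used as positive samples
        if n_bottom > 0:
            kept.extend(ranked[n - n_bottom:])          # used as negative samples
    return kept


# Ten videos for one query, with click ratios 0.0, 0.1, ..., 0.9.
triples = [("game X", f"video{i}", i / 10) for i in range(10)]
training_groups = screen_triples(triples)
```

With ten triples per query, three are kept as positive samples, three as negative samples, and the four in the middle are dropped.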

In the above steps 501 to 503, the server obtains a plurality of sample training groups, each of which includes a sample keyword, a sample multimedia resource, and an average click rate, so that model training can be performed based on the plurality of sample training groups, as described in the following steps.

In step 504, the server inputs a plurality of sample keywords in a plurality of sample training sets into an initial word vector model, and performs embedding processing on the plurality of sample keywords through the initial word vector model to obtain text features of the plurality of sample keywords.

Step 504 is similar to step 302 and will not be described herein.

In step 505, the server inputs a plurality of sample multimedia resources in a plurality of sample training sets into an initial CNN, and performs convolution processing on the plurality of sample multimedia resources through the initial CNN to obtain multimedia features of the plurality of sample multimedia resources.

Step 505 is similar to step 303 and will not be described herein.

In step 506, the server fuses the text features of the sample keywords in each sample training set with the multimedia features of the sample multimedia resources to obtain a plurality of sample fusion features.

Step 506 is similar to step 304 and will not be described herein.

In step 507, the server inputs the sample fusion features into an initial click rate estimation model, performs convolution processing on the sample fusion features through the initial click rate estimation model, and outputs estimated click rates of the sample multimedia resources.

Step 507 is similar to step 305, and is not described herein again.

In step 508, the server obtains the loss function value of the training process according to the estimated click rate and the average click rate of the plurality of sample training sets.

In some embodiments, the server may use the relative entropy between the estimated click rates and the average click rates of the plurality of sample training groups as the loss function. The relative entropy is also referred to as KL divergence (Kullback-Leibler divergence), which is an asymmetric measure of the difference between two probability distributions.

Assume that P(x) and Q(x) are two probability distributions over a random variable X. The KL divergence, denoted D(P || Q), represents the difference between the probability distributions P(x) and Q(x), that is, the information loss incurred when the probability distribution Q(x) is used to fit the true distribution P(x), where P(x) is the true distribution and Q(x) is the fitted distribution of P(x). Then, in the case of a discrete random variable, the KL divergence is defined as shown in the following equation:

$$D(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Further, on the basis of the KL divergence definition, when the KL divergence is adopted as the loss function, the loss function may have the following expression:

$$loss = \sum_{x \in X} \sum_{k \in N} p(k \mid x) \log \frac{p(k \mid x)}{q(k \mid x)}$$

In the above formula, loss is the loss function, x is a sample training group, X is the sample set formed by the plurality of sample training groups, k is a category label, and N is the label set formed by all category labels. In this embodiment of the present disclosure, the label set N contains 2 category labels, namely "click" and "no click". q(k | x) is the probability, predicted by the initial click rate estimation model, that the sample training group x belongs to the category label k; when the category label k is "click", q(k | x) is the estimated click rate of the initial click rate estimation model for the sample training group x. p(k | x) is the true probability that the sample training group x belongs to the category label k; when the category label k is "click", p(k | x) is the average click rate (commonly referred to as the "soft" voting probability) included in the sample training group x.

In the above process, the KL divergence is taken as the loss function of the training process, so that the difference between the estimated click rate and the average click rate of each sample training group can be measured more accurately. In some embodiments, the cross entropy or the mean square error between the estimated click rate and the average click rate of each sample training group may also be used as the loss function, and the form of the loss function is not specifically limited in the embodiments of the present disclosure.
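A minimal sketch of a KL-divergence loss over the two category labels "click" and "no click" is given below. The function name `kl_loss`, summing (rather than averaging) over the sample training groups, and the small constant `eps` that guards against division by zero are assumptions, not details taken from the disclosure.

```python
import math
from typing import Sequence


def kl_loss(
    average_click_rates: Sequence[float],
    estimated_click_rates: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Sum of KL(p || q) over samples, where p and q are two-label distributions.

    p = (average click rate, 1 - average click rate) is the "soft" target;
    q = (estimated click rate, 1 - estimated click rate) is the model output.
    """
    total = 0.0
    for p_click, q_click in zip(average_click_rates, estimated_click_rates):
        for p, q in ((p_click, q_click), (1.0 - p_click, 1.0 - q_click)):
            if p > 0.0:  # a term with p == 0 contributes nothing
                total += p * math.log(p / max(q, eps))
    return total


perfect = kl_loss([0.4], [0.4])  # prediction equals the soft target
poor = kl_loss([0.4], [0.9])     # prediction far from the soft target
```

The loss is zero only when every estimated click rate equals the corresponding average click rate, and it grows as the two diverge.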

In step 509, if the loss function value does not satisfy the convergence condition, the server adjusts a model parameter of at least one of the initial word vector model, the initial CNN, or the initial click rate estimation model, and iteratively executes the above steps 504 to 509 until the loss function value satisfies the convergence condition, and then executes the following step 510.

Optionally, when the relative entropy is adopted as the loss function, the convergence condition may be that the relative entropy between the estimated click rates and the average click rates of the plurality of sample training groups is less than or equal to a target threshold. The target threshold may be any value greater than or equal to 0. In this case, when the relative entropy between the estimated click rates and the average click rates of the plurality of sample training groups is greater than the target threshold, the server determines that the loss function value does not meet the convergence condition. The server may then adjust a model parameter of at least one of the initial word vector model, the initial CNN, or the initial click rate estimation model based on a back propagation algorithm, iteratively execute the above steps 504 to 509 to obtain another loss function value, and determine again whether that loss function value meets the convergence condition. This continues until, at some iteration, the relative entropy between the estimated click rates and the average click rates of the plurality of sample training groups is less than or equal to the target threshold, at which point the server determines that the loss function value meets the convergence condition and executes the following step 510.

It should be noted that when the loss function is different, the convergence condition is also different. For example, when the loss function is the cross entropy between the estimated click rates and the average click rates of the plurality of sample training groups, the convergence condition may be that the cross entropy is less than or equal to a cross entropy threshold. Optionally, the convergence condition may further include the number of iterations reaching a target number, where the target number is any number greater than or equal to 1. The content of the convergence condition is not specifically limited in the embodiments of the present disclosure.
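The iterate-until-convergence logic can be sketched as follows. To keep the example self-contained, a single logistic unit stands in for the word vector model, the CNN, and the click rate estimation model, and plain gradient descent stands in for back propagation through those networks; the names `train` and `binary_kl`, the learning rate, and the thresholds are all hypothetical.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (fusion feature, average click rate)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL divergence between the two-label distributions (p, 1 - p) and (q, 1 - q)."""
    q = min(max(q, eps), 1.0 - eps)
    loss = 0.0
    if p > 0.0:
        loss += p * math.log(p / q)
    if p < 1.0:
        loss += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return loss


def train(samples: List[Sample], threshold: float = 1e-3,
          max_iterations: int = 10000, lr: float = 0.5):
    """Repeat forward pass, loss, and parameter update until a convergence condition holds."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    loss = float("inf")
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        preds = [sigmoid(sum(w * x for w, x in zip(weights, f)) + bias) for f, _ in samples]
        loss = sum(binary_kl(p, q) for (_, p), q in zip(samples, preds))
        if loss <= threshold:  # convergence condition: loss at or below the target threshold
            break
        for (feature, p), q in zip(samples, preds):
            grad = q - p  # derivative of the KL loss with respect to the pre-sigmoid output
            weights = [w - lr * grad * x for w, x in zip(weights, feature)]
            bias -= lr * grad
    return weights, bias, loss, iteration


samples = [([1.0], 0.8), ([-1.0], 0.2)]
weights, bias, final_loss, iterations = train(samples)
```

The loop stops either when the loss falls to the target threshold or when the iteration count reaches `max_iterations`, mirroring the two kinds of convergence condition mentioned above.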

In step 510, if the loss function value meets the convergence condition, the server stops training to obtain a word vector model, a CNN, and a click rate estimation model.

In step 504-.

In the process, if the loss function value meets the convergence condition, the server is considered to finish training the initial word vector model to obtain a word vector model, finish training the initial CNN to obtain CNN, and finish training the initial click rate estimation model to obtain a click rate estimation model.

The method provided by the embodiment of the disclosure constructs a plurality of tuples by processing the click logs of a plurality of users, and performs data cleaning on the tuples based on the average click rate to obtain a plurality of sample training groups. The initial word vector model, the initial CNN, and the initial click rate estimation model are then trained based on the plurality of sample training groups to obtain a word vector model, a CNN, and a click rate estimation model, respectively, which can improve the accuracy and intelligence of these models. The trained click rate estimation model can thus predict, in a search scenario, which multimedia resources users are more likely to click. When the click rate estimation model is put into the multimedia resource search process, the accuracy of the search is improved, the click rate of users on the searched multimedia resources and the conversion rate of the multimedia resources are improved, and the search experience of the user is therefore optimized.

Furthermore, through historical search data of users (namely click logs of a plurality of users), a 'soft' voting data cleaning method is provided, sample training groups with stronger expression ability are cleaned from massive tuples according to the average click rate, the screening ability aiming at the massive tuple data is improved, and therefore the sample training groups with higher training value can be screened efficiently. In addition, differences between the estimated click rate and the average click rate of each sample training set can be better weighed by training each model with KL divergence as a loss function.

Fig. 6 is a block diagram illustrating a logical structure of a multimedia asset searching apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus includes a receiving unit 601, a fusing unit 602, a convolution unit 603, and a generating unit 604.

A receiving unit 601 configured to perform receiving a search request, where the search request carries a keyword to be searched;

a fusion unit 602 configured to perform fusion of the text features of the keyword with the multimedia features of the multiple multimedia resources, respectively, to obtain multiple fusion features;

a convolution unit 603 configured to perform input of the fusion features into a click rate estimation model, perform convolution processing on the fusion features through the click rate estimation model, and output estimated click rates of the multimedia resources;

the generating unit 604 is configured to perform generating a search result based on the estimated click rates of the plurality of multimedia resources, wherein the estimated click rates of the multimedia resources in the search result meet the target condition.

The device provided by the embodiment of the disclosure receives a search request, respectively fuses text features of keywords carried by the search request with multimedia features of a plurality of multimedia resources to obtain a plurality of fusion features with stronger feature expression capability, inputs the plurality of fusion features into a click rate estimation model, performs convolution processing on the plurality of fusion features through the click rate estimation model to output estimated click rates of the plurality of multimedia resources, generates a search result based on the estimated click rates of the plurality of multimedia resources, the estimated click rates of the multimedia resources in the search result conform to a target condition, can avoid causing a large loss of detailed information of the multimedia features due to the fusion of the text features and the multimedia features, obtains the estimated click rates of the multimedia resources through the click rate estimation model, can package the multimedia resources of which the estimated click rates conform to the target condition into the search result, the accuracy of the server in searching the multimedia resources is improved, and the searching experience of the user is improved.

In one possible embodiment, the fusion unit 602 is configured to perform:

In one possible embodiment, the apparatus is further configured to:

In one possible implementation, the generating unit 604 is configured to perform:

In a possible embodiment, based on the apparatus composition of fig. 6, the apparatus further comprises:

In a possible implementation, based on the apparatus composition of fig. 6, the obtaining unit includes:

the obtaining subunit is configured to perform obtaining, according to the click logs of the plurality of users, a plurality of average click rates, one average click rate being used for representing an average probability that the plurality of users click a sample multimedia resource in a case of searching for a sample keyword;

a construction subunit, configured to perform construction of a plurality of tuples based on a plurality of sample keywords, a plurality of sample multimedia resources, and the plurality of average click rates, each tuple including one sample keyword, one sample multimedia resource, and an average click rate of the sample keyword with respect to the sample multimedia resource;

and the screening subunit is configured to perform screening on the tuples according to the average click rate in the tuples to obtain the sample training sets.

In one possible embodiment, the screening subunit is configured to perform:

the obtaining subunit is configured to perform:

With regard to the apparatus in the above-mentioned embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the multimedia resource searching method, and will not be elaborated herein.

Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. The computer device 700 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 701 and one or more memories 702. The memory 702 stores at least one instruction, which is loaded and executed by the processor 701 to implement the multimedia resource searching method provided by the above method embodiments. The computer device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing the functions of the device, which are not described in detail here.

In an exemplary embodiment, a storage medium comprising instructions is also provided, for example a memory comprising instructions, where the instructions are executable by a processor of a computer device to perform the multimedia resource searching method provided by the method embodiments described above. Optionally, the storage medium may be a non-transitory computer-readable storage medium, for example a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.

In an exemplary embodiment, a computer program product is also provided, which includes one or more instructions that can be executed by a processor of a terminal to implement the multimedia resource searching method provided by the above-mentioned method embodiments.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A multimedia resource searching method is characterized by comprising the following steps:

2. The method of claim 1, wherein the fusing the text features of the keywords with the multimedia features of the multimedia resources to obtain a plurality of fused features comprises:

3. The method as claimed in claim 1, wherein before the step of fusing the text features of the keywords with the multimedia features of the multimedia resources to obtain the fused features, the method further comprises:

4. The method as claimed in claim 1, wherein before the inputting the plurality of fusion features into a click-through rate estimation model, performing convolution processing on the plurality of fusion features through the click-through rate estimation model, and outputting the estimated click-through rates of the plurality of multimedia resources, the method further comprises:

5. The method of claim 4, wherein the obtaining the training set of samples comprises:

6. The method of claim 5, wherein the filtering the tuples according to the average click-through rate of the tuples to obtain the sample training sets comprises:

7. The multimedia resource searching method as claimed in claim 5, wherein any one of the click logs of each user includes a user identifier, a sample keyword, a sample multimedia resource, and a click behavior for indicating whether the user clicks a sample multimedia resource in case of searching for a sample keyword;

8. A multimedia resource search apparatus, comprising:

9. A computer device, comprising:

one or more processors;

wherein the one or more processors are configured to execute the instructions to implement the multimedia resource searching method of any one of claims 1 to 7.

10. A storage medium, wherein at least one instruction of the storage medium, when executed by one or more processors of a computer device, enables the computer device to perform the multimedia resource searching method of any one of claims 1 to 7.