CN113919446A - Method and device for model training and similarity determination of multimedia resources - Google Patents

Publication number: CN113919446A
Application number: CN202111339230.7A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 张水发
Current and original assignee: Beijing Dajia Internet Information Technology Co Ltd
Legal status: Pending
Prior art keywords: multimedia resource, multimedia, sample pair, feature, feature extraction

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/43 — Information retrieval of multimedia data; Querying
    • G06F 16/45 — Information retrieval of multimedia data; Clustering; Classification
    • G06F 16/48 — Information retrieval of multimedia data; Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 18/22 — Pattern recognition; Matching criteria, e.g. proximity measures
    • G06F 18/214 — Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415 — Pattern recognition; Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — Neural networks; Combinations of networks
    • G06N 3/047 — Neural networks; Probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; Learning methods


Abstract

The application relates to the technical field of artificial intelligence, and provides a method and device for model training and similarity determination of multimedia resources, used to solve the problem that training a convolutional neural network model in the related art is cumbersome. Because similar query requests correspond to similar multimedia resources, the embodiments of the application construct positive sample pairs by screening, through a time window, the multimedia resources that correspond to similar query requests. A time window can be used because two access operations that are close in time, under the same or similar query requests, are usually accesses to very similar multimedia resources. Positive sample pairs screened with the time window are therefore highly practical and accurate.

Description

Method and device for model training and similarity determination of multimedia resources
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for model training and similarity determination of multimedia resources.
Background
Video resources on the network are abundant, and as time passes and the number of users grows, the number of videos becomes enormous. Video-based applications typically rely on a feature representation of the video; one such application is video recommendation, which computes the similarity between videos.
Computing the similarity between videos generally requires converting each video into a feature expression, such as an embedding feature. The distance between the feature expressions of two videos is then computed as the similarity of the two videos.
In the related art, a convolutional neural network (CNN) is usually used to extract the feature expression of a video. However, training a convolutional neural network requires positive and negative sample pairs, and annotating such pairs is complex, difficult, and low-yield. Training a convolutional neural network model in the related art is therefore cumbersome.
Disclosure of Invention
The embodiments of the application provide a method and device for model training and similarity determination of multimedia resources, to solve the problem that training a convolutional neural network model in the related art is cumbersome.
In a first aspect, the present application provides a method for training a feature extraction model of a multimedia resource, the method including:
constructing a positive sample pair from multimedia resources accessed by the same account object in a multimedia resource set, wherein the access times of the two samples in the positive sample pair fall within the same time window; and
constructing a negative sample pair from un-accessed multimedia resources and accessed multimedia resources in the multimedia resource set; wherein the multimedia resource set consists of the multimedia resources matched by similar query requests, and the similar query requests are query requests whose similarity is higher than a similarity threshold;
and training a feature extraction model with the positive sample pair and the negative sample pair, wherein the feature extraction model is used for determining the similarity between multimedia resources.
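The pair construction described above can be sketched in Python. This is a minimal sketch under assumed inputs: the names `build_sample_pairs` and `access_log`, the record layout `(account, resource, time)`, and the 60-second default window are all illustrative assumptions, not specifics of the application.

```python
from itertools import combinations

def build_sample_pairs(access_log, resource_set, window=60):
    """Build training pairs from an access log over one multimedia resource set.

    Positive pairs: two different resources accessed by the same account
    object, whose access times fall within the same time window.
    Negative pairs: an un-accessed resource paired with an accessed one.
    """
    accessed = {resource for _, resource, _ in access_log}

    # Group access events per account object.
    by_account = {}
    for account, resource, t in access_log:
        by_account.setdefault(account, []).append((t, resource))

    positives = []
    for events in by_account.values():
        events.sort()
        for (t1, r1), (t2, r2) in combinations(events, 2):
            if r1 != r2 and abs(t2 - t1) <= window:
                positives.append((r1, r2))

    negatives = [(u, a) for u in resource_set - accessed for a in accessed]
    return positives, negatives
```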
Optionally, the feature extraction model is a double-tower network structure, where each tower includes a convolutional neural network and a first full connection layer;
wherein the convolutional neural network of a first tower network structure of the two tower network structures is used to extract a first feature of a first multimedia resource; the convolutional neural network of a second tower network structure in the double-tower network structure is used for extracting a first feature of a second multimedia resource;
the first full connection layer of the first tower network structure is used to perform feature extraction on the first feature and the second feature of the first multimedia resource, to obtain the feature expression of the first multimedia resource;
the first full connection layer of the second tower network structure is used to perform feature extraction on the first feature and the second feature of the second multimedia resource, to obtain the feature expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
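The double-tower structure above can be sketched with NumPy. This is an illustrative sketch, not the application's implementation: the `Tower` class, the random projections standing in for a trained convolutional neural network, and all dimensions are assumptions, and cosine similarity is used here as one common choice of similarity between feature expressions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Tower:
    """One tower: a stand-in 'CNN' encoder producing the first feature,
    followed by the first full connection layer, which fuses the first
    feature with the second (attention-degree) feature into one
    feature expression."""
    def __init__(self, content_dim, attn_dim, out_dim):
        self.w_cnn = rng.standard_normal((content_dim, 16)) * 0.1
        self.w_fc1 = rng.standard_normal((16 + attn_dim, out_dim)) * 0.1

    def __call__(self, content, attn_params):
        first_feature = np.tanh(content @ self.w_cnn)        # CNN output
        fused = np.concatenate([first_feature, attn_params]) # FC-1 input
        return fused @ self.w_fc1                            # feature expression

def similarity(va, vb):
    # Cosine similarity between two feature expressions.
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))

# One tower per resource, as in the double-tower structure.
tower_a, tower_b = Tower(32, 4, 8), Tower(32, 4, 8)
v1 = tower_a(rng.standard_normal(32), rng.standard_normal(4))
v2 = tower_b(rng.standard_normal(32), rng.standard_normal(4))
score = similarity(v1, v2)
```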
Optionally, the second feature is an attention-degree parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction either on the content of the first multimedia resource together with the attention-degree parameter of the first multimedia resource, or on the content of the first multimedia resource alone, to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower network structure is specifically configured to perform feature extraction either on the content of the second multimedia resource together with the attention-degree parameter of the second multimedia resource, or on the content of the second multimedia resource alone, to obtain the first feature of the second multimedia resource.
Optionally, the training of the feature extraction model by using the positive sample pair and the negative sample pair includes:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification features;
classifying the classification features to obtain a predicted class of the sample pair, wherein the classes used for classification comprise positive sample pair and negative sample pair;
and determining a loss value from the predicted class and the annotated class, and adjusting the model parameters of the feature extraction model based on the loss value.
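The classification-based training objective above can be sketched as follows. The names `classify_pair` and `bce_loss`, the sigmoid output, and binary cross-entropy are illustrative assumptions; the application only specifies that a second full connection layer produces classification features, the pair is classified as positive or negative, and a loss value is computed from the predicted and annotated classes.

```python
import numpy as np

def classify_pair(v1, v2, w_fc2):
    """Second full connection layer over the concatenated feature expressions
    of the two samples in a pair, followed by a sigmoid: the predicted
    probability that the pair is a positive sample pair."""
    classification_feature = np.concatenate([v1, v2]) @ w_fc2
    return 1.0 / (1.0 + np.exp(-classification_feature))

def bce_loss(p, label):
    # Cross-entropy between the predicted class and the annotated class
    # (label 1 = positive sample pair, 0 = negative sample pair).
    return -(label * np.log(p + 1e-9) + (1 - label) * np.log(1.0 - p + 1e-9))
```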
Optionally, the method further includes:
acquiring the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
screening, from the candidate resource set, the multimedia resources that have specified operation records, to obtain a positive sample resource set;
the method for constructing the positive sample pair by adopting the multimedia resources accessed by the same account object in the multimedia resource set specifically comprises the following steps:
and for the positive sample resource set, adopting the multimedia resources accessed by the same account object to construct a positive sample pair.
Optionally, the loss function used to train the feature extraction model includes the feature similarity of the two samples in the positive sample pair and the degree of difference between the two samples in the negative sample pair, wherein the expression for the feature similarity of the positive sample pair is the same as the expression for the degree of difference of the negative sample pair, differing only in sign.
Optionally, the loss function is used to minimize the distance between the two samples in the positive sample pair and maximize the difference between the two samples in the negative sample pair.
Optionally, the loss function is used to minimize the value of the following equation:
L = Σ_{(l,c)∈Dp} ||Vl − Vc||² − Σ_{(l,c)∈Dn} ||Vl − Vc||²
wherein Dp represents that the sample l and the sample c are a positive sample pair, Dn represents that the sample l and the sample c are a negative sample pair, Vc is the feature expression of the sample c, and Vl is the feature expression of the sample l.
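Under the reading that the positive-pair and negative-pair terms are sums of the same expression with opposite signs, the loss can be sketched as below. The function name `pair_loss` and the squared-distance form of the shared expression are assumptions consistent with the description, not the exact formula of the application.

```python
import numpy as np

def pair_loss(positive_pairs, negative_pairs):
    """The positive-pair term and the negative-pair term share the same
    expression (here, squared distance between feature expressions) and
    differ only in sign, so minimising the loss pulls positive pairs
    together and pushes negative pairs apart."""
    def term(pairs):
        return sum(float(np.sum((v_l - v_c) ** 2)) for v_l, v_c in pairs)
    return term(positive_pairs) - term(negative_pairs)
```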
Optionally, the method further includes:
constructing the set of multimedia resources based on:
determining the similarity of each query request in a plurality of query requests;
acquiring at least two query requests with similarity higher than a corresponding similarity threshold value to obtain a query request set;
and constructing the multimedia resource set by adopting the multimedia resources corresponding to the query requests in the query request set.
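The three steps above can be sketched as a small helper. The names `build_resource_set`, `similarity`, and `matched_resources` are hypothetical stand-ins for the query-similarity function and the query-to-matched-resources mapping, neither of which the application specifies.

```python
def build_resource_set(queries, similarity, threshold, matched_resources):
    """Collect the query requests whose pairwise similarity exceeds the
    threshold, then take the union of the multimedia resources matched
    by those query requests."""
    # Step 1-2: query requests similar to at least one other query request.
    query_set = {q1 for q1 in queries for q2 in queries
                 if q1 != q2 and similarity(q1, q2) > threshold}
    # Step 3: union of the resources matched by the retained query requests.
    resource_set = set()
    for q in query_set:
        resource_set |= set(matched_resources.get(q, ()))
    return query_set, resource_set
```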
Optionally, the method further includes:
and for each query request in the query request set, if the multimedia resource last accessed under the query request has a specified operation record, marking that multimedia resource and any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
the query time interval is less than a first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time is longer than the specified time;
liking, forwarding, favoriting, following, or commenting.
Optionally, the attention degree parameter includes at least one of the following parameters:
a click-through-rate indicator parameter, a follow-rate indicator parameter, a like-rate indicator parameter, a comment-rate indicator parameter, and a forwarding-rate indicator parameter.
In a second aspect, the present application further provides a method for determining similarity of multimedia resources, where the method includes:
acquiring a first multimedia resource and a second multimedia resource;
respectively extracting feature expressions of the first multimedia resource and the second multimedia resource by adopting a feature extraction model;
determining similarity of the first multimedia resource and the second multimedia resource based on respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
query requests with similarity higher than the similarity threshold form the similar query requests; the multimedia resources matched by the similar query requests form the multimedia resource set;
the positive sample pair is constructed by adopting the multimedia resources accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative sample pair is constructed by using the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
Optionally, the feature extraction model is a double-tower network structure, where each tower includes a convolutional neural network and a first full connection layer;
wherein the convolutional neural network of a first tower network structure of the two tower network structures is used to extract a first feature of a first multimedia resource; the convolutional neural network of a second tower network structure in the double-tower network structure is used for extracting a first feature of a second multimedia resource;
the first full connection layer of the first tower network structure is used to perform feature extraction on the first feature and the second feature of the first multimedia resource, to obtain the feature expression of the first multimedia resource;
the first full connection layer of the second tower network structure is used to perform feature extraction on the first feature and the second feature of the second multimedia resource, to obtain the feature expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is an attention-degree parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction either on the content of the first multimedia resource together with the attention-degree parameter of the first multimedia resource, or on the content of the first multimedia resource alone, to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower network structure is specifically configured to perform feature extraction either on the content of the second multimedia resource together with the attention-degree parameter of the second multimedia resource, or on the content of the second multimedia resource alone, to obtain the first feature of the second multimedia resource.
Optionally, training a feature extraction model by using the positive sample pair and the negative sample pair, including:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification features;
classifying the classification features to obtain a predicted class of the sample pair, wherein the classes used for classification comprise positive sample pair and negative sample pair;
and determining a loss value from the predicted class and the annotated class, and adjusting the model parameters of the feature extraction model based on the loss value.
Optionally, the loss function used to train the feature extraction model includes the feature similarity of the two samples in the positive sample pair and the degree of difference between the two samples in the negative sample pair, wherein the expression for the feature similarity of the positive sample pair is the same as the expression for the degree of difference of the negative sample pair, differing only in sign.
Optionally, the method further includes:
constructing the set of multimedia resources based on:
determining the similarity of each query request in a plurality of query requests;
acquiring at least two query requests with similarity higher than a corresponding similarity threshold value to obtain a query request set;
and constructing the multimedia resource set by adopting the multimedia resources corresponding to the query requests in the query request set.
Optionally, the method further includes:
and for each query request in the query request set, if the multimedia resource last accessed under the query request has a specified operation record, marking that multimedia resource and any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
the query time interval is less than a first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the method further includes:
acquiring the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
screening, from the candidate resource set, the multimedia resources that have specified operation records, to obtain a positive sample resource set;
the method for constructing the positive sample pair by adopting the multimedia resources accessed by the same account object in the multimedia resource set specifically comprises the following steps:
and for the positive sample resource set, adopting the multimedia resources accessed by the same account object to construct a positive sample pair.
Optionally, the loss function is used to minimize the distance between the two samples in the positive sample pair and maximize the difference between the two samples in the negative sample pair.
Optionally, the loss function is used to minimize the value of the following equation:
L = Σ_{(l,c)∈Dp} ||Vl − Vc||² − Σ_{(l,c)∈Dn} ||Vl − Vc||²
wherein Dp represents that the sample l and the sample c are a positive sample pair, Dn represents that the sample l and the sample c are a negative sample pair, Vc is the feature expression of the sample c, and Vl is the feature expression of the sample l.
Optionally, the specified operation record includes at least one of the following:
the playing time is longer than the specified time;
liking, forwarding, favoriting, following, or commenting.
Optionally, the attention degree parameter includes at least one of the following parameters:
a click-through-rate indicator parameter, a follow-rate indicator parameter, a like-rate indicator parameter, a comment-rate indicator parameter, and a forwarding-rate indicator parameter.
In a third aspect, the present application further provides a device for training a feature extraction model of a multimedia resource, where the device includes:
the system comprises a sample pair construction module, a sample pair construction module and a data processing module, wherein the sample pair construction module is configured to adopt multimedia resources accessed by the same account object in a multimedia resource set to construct a positive sample pair, and the access time of two samples in the positive sample pair is taken from the same time window; constructing a negative sample pair by adopting the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set; the multimedia resources matched with the similar query requests form the multimedia resource set, and the query requests with the similarity higher than the similarity threshold form the similar query requests;
a training module configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used for extracting similarity between multimedia resources.
Optionally, the feature extraction model is a double-tower network structure, where each tower includes a convolutional neural network and a first full connection layer;
wherein the convolutional neural network of a first tower network structure of the two tower network structures is used to extract a first feature of a first multimedia resource; the convolutional neural network of a second tower network structure in the double-tower network structure is used for extracting a first feature of a second multimedia resource;
the first full connection layer of the first tower network structure is used to perform feature extraction on the first feature and the second feature of the first multimedia resource, to obtain the feature expression of the first multimedia resource;
the first full connection layer of the second tower network structure is used to perform feature extraction on the first feature and the second feature of the second multimedia resource, to obtain the feature expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is an attention-degree parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction either on the content of the first multimedia resource together with the attention-degree parameter of the first multimedia resource, or on the content of the first multimedia resource alone, to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower network structure is specifically configured to perform feature extraction either on the content of the second multimedia resource together with the attention-degree parameter of the second multimedia resource, or on the content of the second multimedia resource alone, to obtain the first feature of the second multimedia resource.
Optionally, in training the feature extraction model using the positive sample pair and the negative sample pair, the training module is specifically configured to:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification features;
classifying the classification features to obtain a predicted class of the sample pair, wherein the classes used for classification comprise positive sample pair and negative sample pair;
and determining a loss value from the predicted class and the annotated class, and adjusting the model parameters of the feature extraction model based on the loss value.
Optionally, the apparatus further comprises:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
a positive sample resource set determining module configured to screen, from the candidate resource set, the multimedia resources that have specified operation records, to obtain a positive sample resource set;
In constructing a positive sample pair from the multimedia resources accessed by the same account object in the multimedia resource set, the sample pair construction module is specifically configured to:
and for the positive sample resource set, adopting the multimedia resources accessed by the same account object to construct a positive sample pair.
Optionally, the loss function used to train the feature extraction model includes the feature similarity of the two samples in the positive sample pair and the degree of difference between the two samples in the negative sample pair, wherein the expression for the feature similarity of the positive sample pair is the same as the expression for the degree of difference of the negative sample pair, differing only in sign.
Optionally, the loss function is used to minimize the distance between the two samples in the positive sample pair and maximize the difference between the two samples in the negative sample pair.
Optionally, the loss function is used to minimize the value of the following equation:
L = Σ_{(l,c)∈Dp} ||Vl − Vc||² − Σ_{(l,c)∈Dn} ||Vl − Vc||²
wherein Dp represents that the sample l and the sample c are a positive sample pair, Dn represents that the sample l and the sample c are a negative sample pair, Vc is the feature expression of the sample c, and Vl is the feature expression of the sample l.
Optionally, the apparatus further comprises:
a multimedia asset set construction module configured to construct the multimedia asset set based on:
determining the similarity of each query request in a plurality of query requests;
acquiring at least two query requests with similarity higher than a corresponding similarity threshold value to obtain a query request set;
and constructing the multimedia resource set by adopting the multimedia resources corresponding to the query requests in the query request set.
Optionally, the sample pair construction module is further configured to: for each query request in the query request set, if the multimedia resource last accessed under the query request has a specified operation record, mark that multimedia resource and any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
the query time interval is less than a first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time is longer than the specified time;
like, forward, collect, follow, comment.
Optionally, the attention degree parameter includes at least one of the following parameters:
click rate index parameters, attention rate index parameters, like rate index parameters, comment rate index parameters and forwarding rate index parameters.
In a fourth aspect, the present application further provides an apparatus for determining similarity of multimedia resources, where the apparatus includes:
an acquisition module configured to acquire a first multimedia resource and a second multimedia resource;
a feature expression extraction module configured to extract feature expressions of the first multimedia resource and the second multimedia resource respectively using a feature extraction model;
a similarity determination module configured to determine a similarity of the first multimedia resource and the second multimedia resource based on the respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is obtained by training in advance based on positive sample pairs and negative sample pairs, wherein:
the query requests with similarity higher than the similarity threshold form homogeneous query requests; the multimedia resources matched with the homogeneous query requests form the multimedia resource set;
the positive sample pair is constructed by adopting the multimedia resources accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative sample pair is constructed by using the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
Optionally, the feature extraction model is a double-tower network structure, where each tower includes a convolutional neural network and a first full connection layer;
wherein the convolutional neural network of a first tower network structure of the two tower network structures is used to extract a first feature of a first multimedia resource; the convolutional neural network of a second tower network structure in the double-tower network structure is used for extracting a first feature of a second multimedia resource;
the first full connection layer of the first tower network structure is used for extracting the first characteristic of the first multimedia resource and the second characteristic of the first multimedia resource to obtain the characteristic expression of the first multimedia resource;
the first full connection layer of the second tower network structure is used for extracting the first characteristic of the second multimedia resource and the second characteristic of the second multimedia resource to obtain the characteristic expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a parameter of attention degree of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the parameter of attention degree of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and the attention degree parameter of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the apparatus further comprises: a training module configured to train a feature extraction model with the pair of positive samples and the pair of negative samples based on:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification features;
classifying the classification features to obtain a prediction class of any sample pair, wherein the classification classes for classification comprise a positive sample pair and a negative sample pair;
and determining a loss value by adopting the prediction type and the labeling type, and adjusting the model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted for training the feature extraction model includes the feature similarity of the two samples in the positive sample pair and the degree of difference of the two samples in the negative sample pair, where the expression for the feature similarity of the two samples in the positive sample pair is the same as the expression for the degree of difference of the two samples in the negative sample pair, except for a difference in sign.
Optionally, the apparatus further comprises:
a multimedia asset set construction module configured to construct the multimedia asset set based on:
determining the similarity of each query request in a plurality of query requests;
acquiring at least two query requests with similarity higher than a corresponding similarity threshold value to obtain a query request set;
and constructing the multimedia resource set by adopting the multimedia resources corresponding to the query requests in the query request set.
Optionally, the apparatus further comprises:
a sample pair construction module configured to, for each query request in the query request set, if the last accessed multimedia resource associated with the query request has a specified operation record, mark that multimedia resource and any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
the query time interval is less than a first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests from the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the apparatus further comprises:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen out multimedia resources with specified operation records from the candidate resource sets to obtain a positive sample resource set;
the sample pair construction module adopts multimedia resources accessed by the same account object in a multimedia resource set to construct a positive sample pair based on the following method:
and for the positive sample resource set, adopting the multimedia resources accessed by the same account object to construct a positive sample pair.
Optionally, the loss function is used to minimize the distance between the two samples in the positive sample pair and maximize the difference between the two samples in the negative sample pair.
Optionally, the loss function is used to minimize the value of the following equation:
L = \sum_{(l,c)\in D_p} \|V_l - V_c\|^2 \; - \; \sum_{(l,c)\in D_n} \|V_l - V_c\|^2
wherein Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
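As a concrete illustration of the described behavior, the per-pair term of such a loss can be sketched as follows. This is a minimal Python sketch of the sign convention only (positive pairs contribute a distance to be minimized, negative pairs the same term with opposite sign); it is not the patent's exact formula.

```python
def pair_loss(v_l, v_c, is_positive):
    """Per-pair loss term: positive pairs contribute the squared distance
    ||Vl - Vc||^2 (to be minimized); negative pairs contribute the same
    expression with opposite sign, so minimizing the total sum pushes
    negative pairs apart. A sketch, not the patent's exact formula."""
    d = sum((a - b) ** 2 for a, b in zip(v_l, v_c))
    return d if is_positive else -d
```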
Optionally, the specified operation record includes at least one of the following:
the playing time is longer than the specified time;
like, forward, collect, follow, comment.
Optionally, the attention degree parameter includes at least one of the following parameters:
click rate index parameters, attention rate index parameters, like rate index parameters, comment rate index parameters and forwarding rate index parameters.
In a fifth aspect, the present application further provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement any of the methods as provided in the first and/or second aspects of the present application.
In a sixth aspect, an embodiment of the present application further provides a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the electronic device to perform any one of the methods as provided in the first and/or second aspects of the present application.
In a seventh aspect, an embodiment of the present application provides a computer program product comprising a computer program that, when executed by a processor, implements any of the methods as provided in the first and/or second aspects of the present application.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects: the same or similar multimedia assets can be searched by different query requests. Therefore, the query requests of the same or similar multimedia resources have certain similarity. In other words, similar query requests correspond to similar multimedia assets. In order to better construct the positive sample pair, the multimedia resources corresponding to the similar query requests are screened by adopting a time window in the embodiment of the application. The time window can be used because two access operations with similar access times are usually accesses to very similar multimedia resources under the same or similar query requests. Therefore, the positive sample pair screened based on the time window has high practicability and accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a sample construction and training method of a feature extraction model of a multimedia resource according to an embodiment of the present application;
fig. 3 is a second flowchart illustrating a method for training a feature extraction model of a multimedia resource according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a feature extraction model of a multimedia resource according to an embodiment of the present application;
fig. 5 is a third schematic flowchart illustrating a method for training a feature extraction model of a multimedia resource according to an embodiment of the present application;
fig. 6 is a flowchart illustrating a method for determining similarity of multimedia resources according to an embodiment of the present disclosure;
fig. 7 is a block diagram illustrating a structure of a training apparatus for a feature extraction model of a multimedia resource according to an embodiment of the present disclosure;
fig. 8 is a block diagram illustrating a similarity determination apparatus for multimedia resources according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present application better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Hereinafter, some terms in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
(1) In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.
(2) "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
(3) A server: serves the terminal by, for example, providing resources to the terminal and storing terminal data; the server corresponds to the application program installed on the terminal and runs in cooperation with that application program.
(4) The terminal device may refer to a software APP (Application), or may refer to a client. It has a visual display interface and can interact with a user; it corresponds to the server and provides local services for the client. Except for some applications that run only locally, software applications are generally installed on a common client terminal and need to run in cooperation with a server terminal. With the development of the internet, common application programs include, for example, short video applications, email clients for receiving and sending emails, and instant messaging clients. Such applications require a corresponding server and service program in the network to provide services such as database services and configuration parameter services, so a specific communication connection needs to be established between the client terminal and the server terminal to ensure normal operation of the application program.
(5) The multimedia resource refers to various resources that can be accessed in a network, such as an audio resource, a video resource, a web page resource, and the like.
(6) The feature expression refers to information describing the features of a multimedia resource; high-level features can be extracted from the multimedia resource for subsequent application.
In view of the problems in the related art that labeling positive and negative sample pairs is complex, labeling is difficult, and the low labeling yield makes training a convolutional neural network model laborious, the embodiment of the application provides a solution.
The method provided by the embodiment of the application is not only suitable for video, but also suitable for feature extraction of any multimedia resource such as audio resources and webpage resources.
The inventive concept of the embodiment of the application can be summarized as that an access request set is constructed by obtaining the same or similar access requests based on the access records of the users to the multimedia resources, then the multimedia resources accessed in the same time window are selected from the multimedia resource set corresponding to the access request set to construct a positive sample pair, and the multimedia resources which are not accessed and the accessed multimedia resources are selected to construct a negative sample pair. Therefore, the same or similar multimedia resources can be screened out based on the same or similar access requests, and then the accuracy of the construction of the positive sample pair can be ensured by screening the multimedia resources in the same time window from the same or similar multimedia resources based on the access behavior of the user. In addition, based on the un-accessed multimedia resources and the accessed multimedia resources, automatic mining and labeling of negative sample pairs are realized. Thus, the present application may simplify labeling of positive and negative sample pairs.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs. It should be noted that the user information referred to in the present application is obtained based on user authorization.
Fig. 1 is a schematic view of an application scenario of the model training and similarity determination methods for multimedia resources according to an embodiment of the present application. The application scenario includes a plurality of terminal devices 101 (including terminal device 101-1, terminal device 101-2, … … terminal device 101-n), and further includes server 102. The terminal device 101 and the server 102 are connected via a wireless or wired network, and the terminal device 101 includes but is not limited to a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, a smart wearable device, a smart television, and other electronic devices. The server 102 may be a server, a server cluster composed of several servers, or a cloud computing center. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
Of course, the method provided in the embodiment of the present application is not limited to the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
The user may send a query request to the server 102 based on the terminal device 101, and the server 102 may search for a relevant multimedia resource for the user based on the query request and then push the multimedia resource to the terminal device 101 for presentation. The user can screen out the multimedia resources which accord with the search intention from the displayed results to access. Therefore, the server 102 may dig out a sample corresponding to the user search intention corresponding to the user query request based on the access operation of the user to the own query request, and further, the unaccessed multimedia resource is a sample not conforming to the user search intention. Therefore, the same or similar search intents can be analyzed, samples which are mined out of the same or similar search intents can construct positive sample pairs, and then samples which do not belong to the search intents and samples which belong to the search intents can construct negative sample pairs, so that automatic mining and labeling of the positive and negative sample pairs are realized. The constructed positive and negative sample pairs accord with the search intention of the user and can be suitable for different query requests without being limited to the cognitive condition of the annotating personnel, so that the categories of the constructed positive and negative sample pairs can be richer, the training of the feature extraction model is more complete, more features can be learned by the feature extraction model, and the features capable of expressing deep semantics can be extracted.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide the method operation steps as shown in the following embodiments or figures, more or less operation steps may be included in the method based on the conventional or non-inventive labor. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application.
For convenience of understanding, the training method of the feature extraction model of a multimedia resource and the method for determining similarity between multimedia resources will be described below, respectively.
Training of feature extraction model
(1) Labeling of positive and negative sample pairs
The same or similar multimedia assets can be searched by different query requests. Therefore, the query requests of the same or similar multimedia resources have certain similarity. In other words, similar query requests correspond to similar multimedia assets. In order to better construct the positive sample pair, the multimedia resources corresponding to the similar query requests are screened by adopting a time window in the embodiment of the application. The time window can be used because two access operations with similar access times are usually accesses to very similar multimedia resources under the same or similar query requests. Therefore, the positive sample pair screened based on the time window has high practicability and accuracy.
In view of this, in the embodiment of the present application, query requests with similarity higher than the similarity threshold form homogeneous query requests, and multimedia resources matched with the homogeneous query requests form a multimedia resource set. In practice, the multimedia resource set may be constructed based on the following method, as shown in fig. 2, including the following steps:
in step 201, a similarity of each query request in the plurality of query requests is determined.
In the embodiment of the application, the query requests can be subjected to feature extraction to obtain semantic features of the query requests, and then the similarity between the query requests is determined based on the distance between the semantic features.
The query request may be triggered based on the search keyword, and the query request for extracting semantic features may include the search keyword and may further include a query category. The query category is user-specified. For example, a user may query a major category of television shows for specific categories, such as ancient shows, suspense shows, year of production, actors, etc.
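The similarity computation of step 201 can be sketched as follows. This is a minimal illustration that assumes cosine similarity over query embeddings; the actual semantic features and distance metric are implementation choices not fixed by the embodiment.

```python
import math

def query_similarity(emb_a, emb_b):
    """Cosine similarity between two query-request embedding vectors
    (semantic features extracted from the search keyword and, optionally,
    the query category)."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    denom = (math.sqrt(sum(x * x for x in emb_a))
             * math.sqrt(sum(y * y for y in emb_b)))
    return dot / denom if denom else 0.0
```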
In step 202, at least two query requests with similarity higher than a corresponding similarity threshold are obtained to obtain a query request set;
in implementation, in order to better mine the positive sample pairs, the query requests used for constructing the query request set need to satisfy at least one of the following conditions:
condition 1: the query time interval is less than the first duration.
The query request set constructed by the query requests meeting the condition and with the similarity higher than the similarity threshold can realize short-term interest modeling of the user.
For example, if the embedding feature similarity of two query requests is high and their query time interval is less than 30 minutes, the two query requests are combined into one query request set (session).
Condition 2: query requests for the same account object. This condition is used to model the long-term interest of the user. The condition can realize modeling on the long-term interest of the user to obtain a positive sample pair.
In implementation, the similarity threshold used for short-term interest modeling in condition 1 and the similarity threshold used for long-term interest modeling in condition 2 may be different. For example, the condition for short-term interest modeling may be relaxed, with a similarity threshold lower than that used for long-term interest modeling.
In specific implementation, because short-term interest modeling in condition 1 already requires the query requests in the same query request set to be close in time, its similarity threshold can be lower; the higher similarity threshold used in long-term interest modeling ensures that the query requests gathered in the same set reflect the same search intent of the same user.
In step 203, the multimedia resource set is constructed by using the multimedia resources corresponding to each query request in the query request set.
For example, the multimedia resources corresponding to the query request 1 include a and b, the multimedia resources corresponding to the query request 2 include b and c, and the multimedia resources corresponding to the query request 3 include d and e. The set of multimedia resources corresponding to the query request 1-3 includes a, b, c, d, e.
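Step 203 can be sketched as follows, using the a–e example above; the order-preserving union is an assumption made for illustration.

```python
def build_resource_set(query_results):
    """Union of the multimedia resources returned for each query request in
    a query request set, preserving first-seen order and dropping
    duplicates (e.g. resource b returned by both query 1 and query 2)."""
    resources = []
    for result in query_results:
        for r in result:
            if r not in resources:
                resources.append(r)
    return resources
```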
On the basis of obtaining the multimedia resource set, a positive and negative sample pair complete model training can be constructed, and as shown in fig. 2, the method comprises the following steps:
in step 204, a positive sample pair is constructed using the multimedia resources in the set of multimedia resources accessed by the same account object, wherein the access times of two samples in the positive sample pair are taken from the same time window.
For example, both long-term and short-term interest modeling can yield a corresponding query request set. Each query request in the query request set obtains a corresponding query result, i.e., the multimedia resources matched to that query request; the query results of the same query request set together form a multimedia resource set, and which multimedia resources were accessed in each query result is determinable information. Therefore, among the accessed multimedia resources in the set, i.e., similar multimedia resources, the multimedia resources accessed in the same time window are used to construct a positive sample pair, so as to ensure higher similarity.
For example, if a sequence of the multimedia resources is obtained by sorting according to access time, any two accessed multimedia resources within the same time window are acquired according to the time window size to construct a positive sample pair.
For example, the query time interval between the query request a and the query request B is less than the first duration, and the similarity is higher than the similarity threshold. The multimedia resources searched and presented based on the query request a and the multimedia resources searched and presented based on the query request B are shown in table 1:
Table 1: multimedia resources searched and presented under query request A (a1, a2, a3) and query request B (b1, b2), listed in order of access time.
Then the sequence (a1, a2, b1, a3, b2) is obtained by sorting according to access time. With a time window size of 2, for a1 the constituted positive sample pairs include [a1, a2] and [a1, b1]. For a2, the constituted positive sample pairs include [a1, a2], [a2, b1] and [a2, a3], and so on.
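The sliding-time-window pairing illustrated by the a1..b2 example can be sketched as follows; here the window is treated as a count of neighboring accesses, an assumption consistent with the example.

```python
def positive_pairs(access_sequence, window=2):
    """Pair each accessed resource with the resources that follow it inside
    the same time window over the access-time-sorted sequence."""
    pairs = []
    n = len(access_sequence)
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            pairs.append((access_sequence[i], access_sequence[j]))
    return pairs
```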
In some embodiments, in order to further ensure the accuracy of the constructed positive sample pairs, the multimedia resource set corresponding to the query request set may be screened, and the effectively accessed multimedia resources are selected to construct positive sample pairs. This may be implemented as the following steps, as shown in fig. 3:
in step 301, the accessed multimedia resources in the multimedia resource set are obtained to obtain a candidate resource set.
For example, if the multimedia resources corresponding to query request A are a1, a2, a3 and a4, of which the first three were accessed, then a1, a2 and a3 are used to construct the candidate resource set, and a4 is treated as a non-accessed multimedia resource.
In step 302, multimedia resources with specified operation records are screened from the candidate resource set, resulting in a positive sample resource set.
The specified operation record is used to screen out effectively accessed multimedia resources, which may also be called effective clicks. The specified operation record may include at least one of the following: the playing duration is longer than a specified duration, or the access has at least one of the following operation attributes: like, forward, collect, follow, comment, etc.
For example, if a short video a1 is accessed and its playing duration is longer than the specified duration, then a1 has the specified operation record. Or, if a1 is accessed and then liked, forwarded, collected, followed or commented on, a1 also has the specified operation record; a1 is then an effectively accessed multimedia resource.
The multimedia resources effectively accessed in the candidate resource set are screened out to obtain a positive sample resource set, and then in step 303, the multimedia resources accessed by the same account object are adopted to construct a positive sample pair for the positive sample resource set.
Therefore, by screening the positive samples with the specified operation records, the noise data of invalid accesses can be screened out, and the accuracy of the constructed positive sample pairs is improved.
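The screening of steps 301–302 can be sketched as follows. The dictionary field names are illustrative assumptions, not a data model specified by the embodiment.

```python
def effective_accesses(accessed, min_play_seconds=10):
    """Keep only accessed resources with a specified operation record:
    playback longer than a threshold, or at least one of
    like/forward/collect/follow/comment. Field names are illustrative."""
    keep = []
    for r in accessed:
        if r.get("play_seconds", 0) > min_play_seconds or any(
            r.get(op) for op in ("like", "forward", "collect", "follow", "comment")
        ):
            keep.append(r)
    return keep
```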
In another embodiment, in order to increase the diversity and number of positive sample pairs, multimedia resources that are satisfactorily accessed are defined in the embodiments of the present application. A satisfactorily accessed multimedia resource is the last multimedia resource accessed under a query request, provided that it has a specified operation record. A satisfactorily accessed multimedia resource and any multimedia resource accessed (or effectively accessed) by the same account object then construct a positive sample pair.
In step 205, negative sample pairs are constructed from the non-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
It should be noted that the execution order of constructing positive sample pairs in step 204 and constructing negative sample pairs in step 205 is not limited: step 204 may be executed before step 205, step 205 may be executed before step 204, or the two steps may be executed simultaneously. All three orders are applicable to the embodiments of the present application.
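A matching sketch for step 205 pairs each non-accessed resource with each accessed resource. The patent does not specify a negative-sampling strategy, so this exhaustive pairing is an illustrative assumption:

```python
def build_negative_pairs(resource_set, accessed):
    # Each non-accessed resource in the multimedia resource set is paired
    # with each accessed resource (a direct reading of step 205).
    not_accessed = resource_set - accessed
    return {(u, a) for u in not_accessed for a in accessed}

resource_set = {"a1", "a2", "a3"}
accessed = {"a1", "a2"}
print(sorted(build_negative_pairs(resource_set, accessed)))
# [('a3', 'a1'), ('a3', 'a2')]
```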
(2) Parameter training process for feature extraction model
After the positive and negative sample pairs are obtained, a feature extraction model may be trained using the positive and negative sample pairs in step 206.
The structure of the feature extraction model to be trained is shown in fig. 4: the model is a two-tower network structure, in which each tower comprises a convolutional neural network and a first fully connected layer. A second fully connected layer is connected after the first fully connected layers of the two towers, and a logistic regression layer is connected after the second fully connected layer to predict whether a sample pair is a positive sample pair or a negative sample pair. Wherein:
the convolutional neural network of the first tower in the two-tower network structure is used to extract a first feature of a first multimedia resource, and the convolutional neural network of the second tower is used to extract a first feature of a second multimedia resource;
the first fully connected layer of the first tower performs feature extraction on the first feature and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first fully connected layer of the second tower performs feature extraction on the first feature and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
The convolutional neural network performs feature extraction on the content of the multimedia resource together with the attention degree parameter (Ctr) of the multimedia resource to obtain the first feature, or performs feature extraction on the content of the multimedia resource alone to obtain the first feature. The attention degree parameter, hereinafter also referred to as the second feature, may include at least one of the following parameters: a click indicator parameter, such as click count or click-through rate; a follow indicator parameter, such as follow count or follow rate; a like indicator parameter, such as like count or like rate; a comment indicator parameter, such as total comment count or comment rate, where the comment rate can be understood as the ratio of commenting users to accessing users; and a forwarding indicator parameter, such as forwarding count or forwarding rate, where the forwarding rate can be understood as the ratio of forwarding users to accessing users.
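The two ratio-style indicators the text defines explicitly can be computed directly from raw counts. The field names below are illustrative placeholders, not from the patent:

```python
def attention_parameters(stats):
    # comment rate = commenting users / accessing users,
    # forwarding rate = forwarding users / accessing users (per the text above)
    visitors = stats["accessing_users"]
    return {
        "comment_rate": stats["commenting_users"] / visitors,
        "forward_rate": stats["forwarding_users"] / visitors,
    }

print(attention_parameters(
    {"accessing_users": 200, "commenting_users": 10, "forwarding_users": 30}))
# {'comment_rate': 0.05, 'forward_rate': 0.15}
```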
The first fully connected layer performs feature extraction on the first feature and the attention degree parameter to obtain the feature expression of the multimedia resource.
The feature extraction model to be trained, with its two-tower structure, is shown in fig. 4, where the first fully connected layer and the second fully connected layer may each comprise one or more fully connected layers. With reference to fig. 4, the training process comprises the following steps, as shown in fig. 5:
in step 501, the feature expressions of the first multimedia resource and the second multimedia resource in any sample pair are obtained, i.e. the outputs of the first fully connected layers.
In step 502, the second fully connected layer performs feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource to obtain a classification feature, i.e. the feature input to the logistic regression (softmax) layer.
In step 503, the classification feature is classified to obtain a predicted class for the sample pair, where the candidate classes are positive sample pair and negative sample pair.
Softmax performs classification based on the classification feature to determine whether the two input videos form a positive sample pair or a negative sample pair. The predicted class can then be compared with the labeled class: in step 504, a loss value is determined from the predicted class and the labeled class, and the model parameters of the feature extraction model are adjusted based on the loss value.
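Structurally, the two-tower forward pass of steps 501–503 can be sketched with toy stand-ins for the convolutional network and the fully connected layers. The stand-in computations, the shared tower weights, and the distance-based second layer are assumptions for illustration only, not the patent's networks:

```python
import math

def conv_stub(content):                       # stands in for the CNN: "first feature"
    return [sum(content) / len(content), max(content)]

def first_fc(first_feature, ctr):             # fuse with the attention parameter Ctr
    return first_feature + [ctr]              # "feature expression"

def second_fc(v1, v2):                        # classification feature -> two logits
    d = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return [d, 1.0 - d]                       # [negative-pair score, positive-pair score]

def softmax(logits):                          # logistic regression layer
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    return [x / sum(e) for x in e]

# Two identical videos should be classified as a positive pair.
v1 = first_fc(conv_stub([0.2, 0.4, 0.6]), ctr=0.10)
v2 = first_fc(conv_stub([0.2, 0.4, 0.6]), ctr=0.10)
probs = softmax(second_fc(v1, v2))            # [P(negative pair), P(positive pair)]
print(probs[1] > probs[0])  # True
```

In a real model, the loss from step 504 would be backpropagated through both towers to update shared parameters.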
In implementation, the loss function adopted for training the feature extraction model covers both the feature similarity of the two samples in a positive sample pair and the degree of difference between the two samples in a negative sample pair. The expression for the feature similarity of a positive sample pair is the same as the expression for the difference of a negative sample pair, differing only in sign, so a simple formula can compute a loss value that covers the positive and negative sample pairs simultaneously. The loss function minimizes the distance between the two samples in a positive sample pair and maximizes the degree of difference between the two samples in a negative sample pair. As a result, similar multimedia resources are grouped together and dissimilar multimedia resources are separated as much as possible during model training, improving the accuracy with which the model extracts features of multimedia resources.
In practice, the loss function shown in equation (1) is used, and the training objective is to make the result of equation (1) as small as possible.
Loss = Σ_{(l,c)∈Dp} ‖Vl − Vc‖² − Σ_{(l,c)∈Dn} ‖Vl − Vc‖²    (1)
Wherein Dp denotes that sample l and sample c form a positive sample pair, Dn denotes that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
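Under one reading of equation (1) — squared Euclidean distance, positive-pair terms with a plus sign and negative-pair terms with a minus sign — the loss can be computed as follows. The concrete distance function is an assumption:

```python
def pair_loss(positive_pairs, negative_pairs):
    # Sum of squared distances over positive pairs, minus the same sum over
    # negative pairs: the same expression with opposite sign, as described.
    def sq_dist(v, w):
        return sum((a - b) ** 2 for a, b in zip(v, w))
    pos = sum(sq_dist(vl, vc) for vl, vc in positive_pairs)
    neg = sum(sq_dist(vl, vc) for vl, vc in negative_pairs)
    return pos - neg

loss = pair_loss(
    positive_pairs=[([1.0, 0.0], [1.0, 0.0])],   # identical -> distance 0
    negative_pairs=[([1.0, 0.0], [0.0, 1.0])],   # far apart -> distance 2
)
print(loss)  # -2.0
```

Minimizing this value pulls positive pairs together while pushing negative pairs apart, matching the stated training objective.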
For example, a video may be sampled and the resulting image sequence input to the feature extraction model, or single frames of the video may be input individually. When an image sequence is input, the final feature expression is that of the entire sequence. When single frames are input, the feature expression of each frame can be obtained and then spliced into the feature expression of the image sequence; alternatively, each sampled frame of the first video and each sampled frame of the other video can be classified separately, and the predicted class of each frame compared with the corresponding labeled class to obtain a loss value.
Second, determining similarity based on the feature extraction model
After the feature extraction model is trained, the similarity of multimedia resources can be determined based on the model. Note that the feature extraction model only needs to be trained once and can then be reused for subsequent similarity determination. Fig. 6 is a schematic flow chart of a method for determining similarity using the aforementioned feature extraction model in the embodiment of the present application, comprising the following steps:
in step 601, a first multimedia resource and a second multimedia resource are obtained;
in step 602, extracting feature expressions of the first multimedia resource and the second multimedia resource respectively by using a feature extraction model;
in step 603, the similarity between the first multimedia resource and the second multimedia resource is determined based on the feature expression of each of the first multimedia resource and the second multimedia resource.
In implementation, the cosine distance between the two embedding features is computed to obtain the similarity of the two multimedia resources.
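Step 603 thus reduces to a cosine computation between the two embedding feature expressions, for example:

```python
import math

def cosine_similarity(v, w):
    # Similarity between two embedding feature expressions.
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm

emb1 = [0.6, 0.8]   # feature expression of the first multimedia resource
emb2 = [1.0, 0.0]   # feature expression of the second multimedia resource
print(round(cosine_similarity(emb1, emb2), 4))  # 0.6
```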
Based on the same inventive concept, the present application provides a training apparatus for a feature extraction model of a multimedia resource, as shown in fig. 7, the apparatus 700 includes:
a sample pair construction module 701 configured to construct positive sample pairs using multimedia resources accessed by the same account object in a multimedia resource set, wherein the access times of the two samples in a positive sample pair fall within the same time window, and to construct negative sample pairs using the non-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set; the multimedia resource set is formed by the multimedia resources matched by same-type query requests, where query requests whose mutual similarity is higher than a similarity threshold constitute same-type query requests;
a training module 702 configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used for extracting similarity between multimedia resources.
Optionally, the feature extraction model is a two-tower network structure, where each tower comprises a convolutional neural network and a first fully connected layer;
wherein the convolutional neural network of the first tower in the two-tower network structure is used to extract a first feature of a first multimedia resource, and the convolutional neural network of the second tower is used to extract a first feature of a second multimedia resource;
the first fully connected layer of the first tower performs feature extraction on the first feature and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first fully connected layer of the second tower performs feature extraction on the first feature and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is an attention degree parameter of a multimedia resource, and the convolutional neural network of the first tower is specifically configured to perform feature extraction on the content of the first multimedia resource together with its attention degree parameter, or on the content of the first multimedia resource alone, to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower is specifically configured to perform feature extraction on the content of the second multimedia resource together with its attention degree parameter, or on the content of the second multimedia resource alone, to obtain the first feature of the second multimedia resource.
Optionally, when training the feature extraction model using the positive sample pairs and the negative sample pairs, the training module is specifically configured to:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
perform feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource using a second fully connected layer to obtain a classification feature;
classify the classification feature to obtain a predicted class for the sample pair, where the candidate classes are positive sample pair and negative sample pair;
and determine a loss value using the predicted class and the labeled class, and adjust the model parameters of the feature extraction model based on the loss value.
Optionally, the apparatus further comprises:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen out multimedia resources with specified operation records from the candidate resource sets to obtain a positive sample resource set;
When constructing positive sample pairs using multimedia resources accessed by the same account object in the multimedia resource set, the sample pair construction module is specifically configured to:
construct positive sample pairs for the positive sample resource set using the multimedia resources accessed by the same account object.
Optionally, the loss function adopted for training the feature extraction model covers the feature similarity of the two samples in a positive sample pair and the degree of difference between the two samples in a negative sample pair, where the expression for the feature similarity of a positive sample pair is the same as the expression for the difference of a negative sample pair, differing only in sign.
Optionally, the loss function is used to minimize the distance between the two samples in the positive sample pair and maximize the difference between the two samples in the negative sample pair.
Optionally, the loss function is used to minimize the value of the following equation:
Loss = Σ_{(l,c)∈Dp} ‖Vl − Vc‖² − Σ_{(l,c)∈Dn} ‖Vl − Vc‖²
wherein Dp denotes that sample l and sample c form a positive sample pair, Dn denotes that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the apparatus further comprises:
a multimedia asset set construction module configured to construct the multimedia asset set based on:
determining the similarity of each query request in a plurality of query requests;
acquiring at least two query requests with similarity higher than a corresponding similarity threshold value to obtain a query request set;
and constructing the multimedia resource set by adopting the multimedia resources corresponding to the query requests in the query request set.
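The three construction steps above can be sketched as follows. The Jaccard word-overlap similarity, the catalog lookup, and the seeding on the first query are illustrative assumptions, since the patent does not specify how query similarity is computed:

```python
def build_resource_set(queries, similarity, threshold, resources_for):
    # Group query requests whose similarity to a seed query exceeds the
    # threshold, then union the multimedia resources each of them matched.
    seed = queries[0]
    same_type = [q for q in queries if similarity(seed, q) > threshold]
    resource_set = set()
    for q in same_type:
        resource_set |= resources_for(q)
    return same_type, resource_set

queries = ["funny cats", "funny cat videos", "stock prices"]
# Word-overlap (Jaccard) similarity as a stand-in for the patent's measure.
sim = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
catalog = {"funny cats": {"v1", "v2"}, "funny cat videos": {"v2", "v3"}, "stock prices": {"v9"}}
same_type, resource_set = build_resource_set(queries, sim, 0.2, catalog.get)
print(sorted(resource_set))  # ['v1', 'v2', 'v3']
```

"stock prices" falls below the threshold, so its resources are excluded from the multimedia resource set.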
Optionally, the sample pair construction module is further configured to: for each query request in the query request set, if the last accessed multimedia resource associated with the query request has a specified operation record, mark that multimedia resource together with any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
the query time interval is less than a first duration;
query requests for the same account object.
Optionally, if the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests from the same account object, the similarity threshold adopted for the former is smaller than the similarity threshold adopted for the latter.
Optionally, the specified operation record includes at least one of the following:
the playing time is longer than the specified time;
like, forward, collect, pay attention to, comment.
Optionally, the attention degree parameter includes at least one of the following parameters:
click rate index parameters, attention rate index parameters, like rate index parameters, comment rate index parameters and forwarding rate index parameters.
Based on the same inventive concept, an embodiment of the present application further provides an apparatus for determining similarity of multimedia resources, as shown in fig. 8, where the apparatus 800 includes:
an obtaining module 801 configured to obtain a first multimedia resource and a second multimedia resource;
a feature expression extraction module 802 configured to extract feature expressions of the first multimedia resource and the second multimedia resource respectively by using a feature extraction model;
a similarity determination module 803 configured to determine similarity of the first multimedia resource and the second multimedia resource based on the feature expression of each of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
query requests whose mutual similarity is higher than a similarity threshold form same-type query requests, and the multimedia resources matched by the same-type query requests form the multimedia resource set;
the positive sample pair is constructed by adopting the multimedia resources accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative sample pair is constructed by using the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
Optionally, the feature extraction model is a two-tower network structure, where each tower comprises a convolutional neural network and a first fully connected layer;
wherein the convolutional neural network of the first tower in the two-tower network structure is used to extract a first feature of a first multimedia resource, and the convolutional neural network of the second tower is used to extract a first feature of a second multimedia resource;
the first fully connected layer of the first tower performs feature extraction on the first feature and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first fully connected layer of the second tower performs feature extraction on the first feature and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is an attention degree parameter of a multimedia resource, and the convolutional neural network of the first tower is specifically configured to perform feature extraction on the content of the first multimedia resource together with its attention degree parameter, or on the content of the first multimedia resource alone, to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower is specifically configured to perform feature extraction on the content of the second multimedia resource together with its attention degree parameter, or on the content of the second multimedia resource alone, to obtain the first feature of the second multimedia resource.
Optionally, the apparatus further comprises: a training module configured to train a feature extraction model with the pair of positive samples and the pair of negative samples based on:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
perform feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource using a second fully connected layer to obtain a classification feature;
classify the classification feature to obtain a predicted class for the sample pair, where the candidate classes are positive sample pair and negative sample pair;
and determine a loss value using the predicted class and the labeled class, and adjust the model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted for training the feature extraction model covers the feature similarity of the two samples in a positive sample pair and the degree of difference between the two samples in a negative sample pair, where the expression for the feature similarity of a positive sample pair is the same as the expression for the difference of a negative sample pair, differing only in sign.
Optionally, the apparatus further comprises:
a multimedia asset set construction module configured to construct the multimedia asset set based on:
determining the similarity of each query request in a plurality of query requests;
acquiring at least two query requests with similarity higher than a corresponding similarity threshold value to obtain a query request set;
and constructing the multimedia resource set by adopting the multimedia resources corresponding to the query requests in the query request set.
Optionally, the apparatus further comprises:
and a sample pair construction module configured to, for each query request in the query request set, mark the last accessed multimedia resource associated with the query request together with any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair, if that last accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
the query time interval is less than a first duration;
query requests for the same account object.
Optionally, if the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests from the same account object, the similarity threshold adopted for the former is smaller than the similarity threshold adopted for the latter.
Optionally, the apparatus further comprises:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen out multimedia resources with specified operation records from the candidate resource sets to obtain a positive sample resource set;
the sample pair construction module constructs positive sample pairs using multimedia resources accessed by the same account object in the multimedia resource set as follows:
construct positive sample pairs for the positive sample resource set using the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between the two samples in the positive sample pair and maximize the difference between the two samples in the negative sample pair.
Optionally, the loss function is used to minimize the value of the following equation:
Loss = Σ_{(l,c)∈Dp} ‖Vl − Vc‖² − Σ_{(l,c)∈Dn} ‖Vl − Vc‖²
wherein Dp denotes that sample l and sample c form a positive sample pair, Dn denotes that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the specified operation record includes at least one of the following:
the playing time is longer than the specified time;
like, forward, collect, pay attention to, comment.
Optionally, the attention degree parameter includes at least one of the following parameters:
click rate index parameters, attention rate index parameters, like rate index parameters, comment rate index parameters and forwarding rate index parameters.
Having described the model training method and similarity determination method for multimedia resources and related apparatus according to exemplary embodiments of the present application, an electronic device according to another exemplary embodiment of the present application is described next.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method, or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
In some possible implementations, an electronic device according to the present application may include at least one processor and at least one memory. The memory stores program code which, when executed by the processor, causes the processor to perform the steps of the model training method and the similarity determination method for multimedia resources according to various exemplary embodiments of the present application described above in this specification.
The electronic device 130 according to this embodiment of the present application is described below with reference to fig. 9. The electronic device 130 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 9, the electronic device 130 is represented in the form of a general electronic device. The components of the electronic device 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132, and a bus 133 that connects the various system components (including the memory 132 and the processor 131).
Bus 133 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, or a local bus using any of a variety of bus architectures.
The memory 132 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 130, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur via input/output (I/O) interfaces 135. Also, the electronic device 130 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 136. As shown, network adapter 136 communicates with other modules for electronic device 130 over bus 133. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 130, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 132 comprising instructions, is also provided, the instructions being executable by the processor 131 of the apparatus 700 or the apparatus 800 to perform the above-described model training method and similarity determination method for multimedia resources. Alternatively, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by the processor 131, implements any of the method for model training and the method for similarity determination of multimedia assets as provided herein.
In an exemplary embodiment, the aspects of the model training method and the similarity determination method for a multimedia resource provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the model training method and the similarity determination method for a multimedia resource according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program products of the model training method and the similarity determination method for multimedia resources according to the embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be executed on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of a remote electronic device, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (for example, through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided and embodied by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
While preferred embodiments of the present application have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concept. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for training a feature extraction model of a multimedia resource, the method comprising:
constructing a positive sample pair by adopting multimedia resources accessed by the same account object in a multimedia resource set, wherein the access time of two samples in the positive sample pair is taken from the same time window; and,
constructing a negative sample pair by adopting the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set; wherein query requests with similarity higher than a similarity threshold form similar query requests, and the multimedia resources matched with the similar query requests form the multimedia resource set;
and training a feature extraction model by adopting the positive sample pair and the negative sample pair, wherein the feature extraction model is used for extracting the similarity between multimedia resources.
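By way of illustration only, the sample-pair construction of claim 1 can be sketched as follows. The access-log format (account, resource, timestamp), the window length, and the function name are assumptions of this sketch, not part of the claim:

```python
import itertools
from collections import defaultdict

def build_sample_pairs(access_log, resource_set, window_seconds=3600):
    """Construct positive pairs from resources accessed by the same account
    within one time window, and negative pairs from accessed vs. un-accessed
    resources in the same multimedia resource set.

    access_log: iterable of (account_id, resource_id, access_time) tuples,
    where every resource_id belongs to resource_set (resources matched by
    similar query requests).
    """
    accessed = {rid for _, rid, _ in access_log}
    by_account = defaultdict(list)
    for account, rid, t in access_log:
        by_account[account].append((rid, t))

    # Positive pairs: two distinct resources accessed by the same account,
    # with access times falling inside the same time window.
    positive_pairs = set()
    for visits in by_account.values():
        for (r1, t1), (r2, t2) in itertools.combinations(visits, 2):
            if r1 != r2 and abs(t1 - t2) <= window_seconds:
                positive_pairs.add(tuple(sorted((r1, r2))))

    # Negative pairs: an accessed resource paired with an un-accessed one.
    unaccessed = set(resource_set) - accessed
    negative_pairs = {(a, u) for a in accessed for u in unaccessed}
    return positive_pairs, negative_pairs
```

The resulting positive and negative pairs would then feed the training step of the feature extraction model.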
2. The method of claim 1, wherein the feature extraction model is a two-tower network structure, and each tower comprises a convolutional neural network and a first fully-connected layer;
wherein the convolutional neural network of a first tower network structure of the two-tower network structure is used to extract a first feature of a first multimedia resource, and the convolutional neural network of a second tower network structure of the two-tower network structure is used to extract a first feature of a second multimedia resource;
the first full connection layer of the first tower network structure is used for extracting the first characteristic of the first multimedia resource and the second characteristic of the first multimedia resource to obtain the characteristic expression of the first multimedia resource;
the first full connection layer of the second tower network structure is used for extracting the first characteristic of the second multimedia resource and the second characteristic of the second multimedia resource to obtain the characteristic expression of the second multimedia resource;
the feature expression of the first multimedia asset and the feature expression of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
3. The method according to claim 2, wherein the second feature is an attention degree parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the attention degree parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or to perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
the convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and the attention degree parameter of the second multimedia resource to obtain the first feature of the second multimedia resource, or to perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
4. The method of claim 2 or 3, wherein the training of the feature extraction model using the pair of positive samples and the pair of negative samples comprises:
acquiring respective feature expressions of a first multimedia resource and a second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource by adopting a second fully-connected layer to obtain a classification feature;
classifying the classification feature to obtain a prediction category of the sample pair, wherein the classification categories comprise positive sample pair and negative sample pair;
and determining a loss value by adopting the prediction category and the annotated category, and adjusting model parameters of the feature extraction model based on the loss value.
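The training objective of claim 4 can be sketched as a binary positive-pair/negative-pair classification. The claim leaves the loss and optimizer unspecified, so the logistic loss, the reduction of the second fully-connected layer to a single weight vector, and the plain gradient step below are assumptions of this sketch:

```python
import math

def pair_loss_and_grad(expr_a, expr_b, weights, label):
    """Score a sample pair as positive (label=1) or negative (label=0) from
    the two feature expressions, returning the loss and its gradient."""
    x = expr_a + expr_b  # concatenated feature expressions of the pair
    logit = sum(w * xi for w, xi in zip(weights, x))
    prob = 1.0 / (1.0 + math.exp(-logit))  # predicted P(positive pair)
    # Binary cross-entropy between prediction and annotated category.
    loss = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    grad = [(prob - label) * xi for xi in x]  # dL/dw per weight
    return loss, grad

def sgd_step(weights, grad, lr=0.1):
    """Adjust model parameters based on the loss gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]
```

Iterating this step over the constructed positive and negative pairs drives the feature expressions of co-accessed resources together and those of accessed/un-accessed pairs apart.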
5. A method for determining similarity of multimedia resources, the method comprising:
acquiring a first multimedia resource and a second multimedia resource;
respectively extracting feature expressions of the first multimedia resource and the second multimedia resource by adopting a feature extraction model;
determining similarity of the first multimedia resource and the second multimedia resource based on respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on a positive sample pair and a negative sample pair, wherein:
query requests with similarity higher than a similarity threshold form the same type of query requests; multimedia resources matched with the same type of query requests form the multimedia resource set;
the positive sample pair is constructed by adopting the multimedia resources accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative sample pair is constructed by using the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
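Claim 5 leaves the similarity metric over the two feature expressions open; cosine similarity is one common choice and is used here purely as an illustrative assumption:

```python
import math

def cosine_similarity(expr_a, expr_b):
    """Similarity between two feature expressions as the cosine of the angle
    between them: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(expr_a, expr_b))
    norm_a = math.sqrt(sum(a * a for a in expr_a))
    norm_b = math.sqrt(sum(b * b for b in expr_b))
    # Guard against zero vectors to avoid division by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In deployment, the two feature expressions would come from the trained two-tower feature extraction model, one tower per resource.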
6. An apparatus for training a feature extraction model of a multimedia resource, the apparatus comprising:
the system comprises a sample pair construction module, a sample pair construction module and a data processing module, wherein the sample pair construction module is configured to adopt multimedia resources accessed by the same account object in a multimedia resource set to construct a positive sample pair, and the access time of two samples in the positive sample pair is taken from the same time window; constructing a negative sample pair by adopting the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set; the multimedia resources matched with the similar query requests form the multimedia resource set, and the query requests with the similarity higher than the similarity threshold form the similar query requests;
a training module configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used for extracting similarity between multimedia resources.
7. An apparatus for determining similarity of multimedia resources, the apparatus comprising:
an acquisition module configured to acquire a first multimedia resource and a second multimedia resource;
a feature expression extraction module configured to extract feature expressions of the first multimedia resource and the second multimedia resource respectively using a feature extraction model;
a similarity determination module configured to determine a similarity of the first multimedia resource and the second multimedia resource based on the respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on a positive sample pair and a negative sample pair, wherein:
query requests with similarity higher than a similarity threshold form the same type of query requests; multimedia resources matched with the same type of query requests form the multimedia resource set;
the positive sample pair is constructed by adopting the multimedia resources accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative sample pair is constructed by using the un-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1-5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-5.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the method of any of claims 1-5 when executed by a processor.
CN202111339230.7A 2021-11-12 2021-11-12 Method and device for model training and similarity determination of multimedia resources Pending CN113919446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111339230.7A CN113919446A (en) 2021-11-12 2021-11-12 Method and device for model training and similarity determination of multimedia resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111339230.7A CN113919446A (en) 2021-11-12 2021-11-12 Method and device for model training and similarity determination of multimedia resources

Publications (1)

Publication Number Publication Date
CN113919446A true CN113919446A (en) 2022-01-11

Family

ID=79246167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111339230.7A Pending CN113919446A (en) 2021-11-12 2021-11-12 Method and device for model training and similarity determination of multimedia resources

Country Status (1)

Country Link
CN (1) CN113919446A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708964A (en) * 2020-05-27 2020-09-25 北京百度网讯科技有限公司 Multimedia resource recommendation method and device, electronic equipment and storage medium
KR20210053823A (en) * 2020-05-27 2021-05-12 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Multimedia resource recommendation method and device, electronic equipment and storage medium
CN112258285A (en) * 2020-10-26 2021-01-22 北京沃东天骏信息技术有限公司 Content recommendation method and device, equipment and storage medium
CN113051368A (en) * 2021-03-24 2021-06-29 北京百度网讯科技有限公司 Double-tower model training method, double-tower model searching device and electronic equipment
CN113032589A (en) * 2021-03-29 2021-06-25 北京奇艺世纪科技有限公司 Multimedia file recommendation method and device, electronic equipment and readable storage medium
CN113469298A (en) * 2021-09-03 2021-10-01 北京达佳互联信息技术有限公司 Model training method and resource recommendation method
CN114782719A (en) * 2022-04-26 2022-07-22 北京百度网讯科技有限公司 Training method of feature extraction model, object retrieval method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579869A (en) * 2022-05-05 2022-06-03 腾讯科技(深圳)有限公司 Model training method and related product
CN114579869B (en) * 2022-05-05 2022-07-22 腾讯科技(深圳)有限公司 Model training method and related product

Similar Documents

Publication Publication Date Title
US12094230B2 (en) Cross-modal weak supervision for media classification
US9923860B2 (en) Annotating content with contextually relevant comments
CN113486833B (en) Multi-modal feature extraction model training method and device and electronic equipment
JP2022537170A (en) Cognitive video and voice search aggregation
CN112765373B (en) Resource recommendation method and device, electronic equipment and storage medium
US20140074866A1 (en) System and method for enhancing metadata in a video processing environment
US10878020B2 (en) Automated extraction tools and their use in social content tagging systems
US11126682B1 (en) Hyperlink based multimedia processing
CN113705299A (en) Video identification method and device and storage medium
CN113806588B (en) Method and device for searching video
CN116977701A (en) Video classification model training method, video classification method and device
WO2024099171A1 (en) Video generation method and apparatus
CN110147482B (en) Method and device for acquiring burst hotspot theme
US11095953B2 (en) Hierarchical video concept tagging and indexing system for learning content orchestration
CN113919446A (en) Method and device for model training and similarity determination of multimedia resources
CN114328947A (en) Knowledge graph-based question and answer method and device
US20220189472A1 (en) Recognition and restructuring of previously presented materials
US11804245B2 (en) Video data size reduction
US11748436B2 (en) Web smart exploration and management in browser
US20230252980A1 (en) Multi-channel conversation processing
CN113076254A (en) Test case set generation method and device
KR20230059364A (en) Public opinion poll system using language model and method thereof
CN114363664A (en) Method and device for generating video collection title
KR102046224B1 (en) Apparatus for providing personalized contents
CN115630170B (en) Document recommendation method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination