CN116030375A - Video feature extraction and model training method, device, equipment and storage medium - Google Patents

Video feature extraction and model training method, device, equipment and storage medium

Info

Publication number: CN116030375A
Application number: CN202211215322.9A
Authority: CN (China)
Prior art keywords: video, features, feature extraction, feature, resource
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 沈栋 (Shen Dong), 吴翔宇 (Wu Xiangyu)
Current Assignee / Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd


Abstract

The disclosure relates to a method, a device, equipment and a storage medium for video feature extraction and model training, and relates to the technical field of computers. The model training method comprises the following steps: acquiring video features and tag information of a first video resource; determining a classification result of the video features of the first video resource based on a classification model, and determining a classification loss value according to difference information between the classification result and the tag information; acquiring search word features of a second video resource; performing contrast learning on the video features of the first video resource and the search word features of the second video resource based on a contrast learning model to obtain a contrast loss value; and training a video feature extraction model to be trained based on the classification loss value and the contrast loss value to obtain the video feature extraction model.

Description

Video feature extraction and model training method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a method, a device, equipment and a storage medium for video feature extraction and model training.
Background
Video feature extraction techniques refer to mapping a sequence of video images into a high-dimensional feature vector and expressing the video picture content through the high-dimensional feature vector. Video feature extraction techniques can be applied in many scenarios, such as video recommendation scenarios or video search scenarios.
Currently, a video feature extraction model is typically trained in an unsupervised or supervised manner, thereby obtaining a video feature extraction model capable of extracting video features. However, the accuracy of a video feature extraction model trained in an unsupervised manner is low, while a video feature extraction model trained in a supervised manner needs to rely on supervision information. The supervision information usually requires manual annotation, the amount of annotation is large, and the annotation is time-consuming, so the training efficiency of the video feature extraction model is low.
Therefore, how to improve the training efficiency of the video feature extraction model under the condition of ensuring the accuracy of the video feature extraction model is a technical problem to be solved in the prior art.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for video feature extraction and model training, which not only can ensure accuracy of a video feature extraction model, but also can improve training efficiency of the video feature extraction model.
The technical scheme of the embodiment of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, a video feature extraction model training method is provided, which can be applied to an electronic device. The method may include: acquiring video characteristics and tag information of a first video resource;
determining a classification result of the video features of the first video resource based on the classification model, and determining a classification loss value according to difference information of the classification result and the tag information;
acquiring search word characteristics of a second video resource;
based on the comparison learning model, comparing and learning the video characteristics of the first video resource and the search word characteristics of the second video resource to obtain a comparison loss value;
based on the classification loss value and the comparison loss value, training the video feature extraction model to be trained to obtain the video feature extraction model.
Optionally, when the number of first video resources is plural and the number of second video resources is plural, the plurality of first video resources and the plurality of second video resources include the same video resources and different video resources; based on the contrast learning model, performing contrast learning on the video features of the first video resource and the search word features of the second video resource to obtain a contrast loss value, including:
Determining the dot product of the video features of the first type of video resources and the search word features as a first classification target of the comparison learning model; the first type of video assets are used for representing the same video assets in the plurality of first video assets and the plurality of second video assets;
determining the dot product of the video features of the second-class video resources and the search word features as a second classification target of the comparison learning model; the second type of video assets are used for representing different video assets in the plurality of first video assets and the plurality of second video assets;
and based on the first classification target and the second classification target, comparing and learning the video characteristics of the first video resource and the search word characteristics of the second video resource to obtain a comparison loss value.
Optionally, based on the first classification target and the second classification target, performing contrast learning on the video feature of the first video resource and the search word feature of the second video resource to obtain a contrast loss value, including:
determining a first set of class features based on the first classification target; the first type feature set is used for representing video features and search word features of the first type video resources;
determining a second class of feature sets based on the second classification target; the second type feature set is used for representing video features and search word features of the second type video resources;
And determining a contrast loss value according to the difference information of the first type of feature set and the second type of feature set.
Optionally, acquiring the video feature of the first video resource includes:
acquiring text features and image features of a first video resource;
and carrying out feature fusion on the text features and the image features based on a multi-modal algorithm to obtain video features.
Optionally, acquiring the text feature and the image feature of the first video resource includes:
based on an image feature extraction algorithm, extracting features of video images of the first video resource to obtain image features;
text detection is carried out on the first video resource based on a voice recognition algorithm and a text detection algorithm so as to obtain text information;
and extracting the characteristics of the text information based on a text characteristic extraction algorithm to obtain text characteristics.
Optionally, the video feature extraction model training method further includes:
acquiring initial video features of a first video resource and initial search word features of a second video resource;
regularizing the initial video features and the initial search word features to obtain processed video features and processed search word features;
the processed video feature is determined to be a video feature of the first video asset and the processed search term feature is determined to be a search term feature of the second video asset.
According to a second aspect of embodiments of the present disclosure, a video feature extraction method is provided, which may be applied to an electronic device. The method may include:
acquiring video resources to be processed;
inputting the video resources to be processed into a video feature extraction model to obtain video features of the video resources to be processed; the video feature extraction model is trained according to the video feature extraction model training method of any one of the first aspects.
According to a third aspect of embodiments of the present disclosure, a video feature extraction model training apparatus is provided, which may be applied to an electronic device. The apparatus may include: an acquisition unit and a processing unit;
the acquisition unit is used for acquiring the video characteristics and the tag information of the first video resource;
the processing unit is used for determining a classification result of the video features of the first video resource based on the classification model and determining a classification loss value according to difference information of the classification result and the tag information;
the acquisition unit is also used for acquiring the search word characteristics of the second video resource;
the processing unit is further used for carrying out contrast learning on the video characteristics of the first video resource and the search word characteristics of the second video resource based on the contrast learning model so as to obtain a contrast loss value;
The processing unit is further used for training the video feature extraction model to be trained based on the classification loss value and the comparison loss value so as to obtain the video feature extraction model.
Optionally, when the number of first video resources is plural and the number of second video resources is plural, the plurality of first video resources and the plurality of second video resources include the same video resources and different video resources; the processing unit is specifically used for:
determining the dot product of the video features of the first type of video resources and the search word features as a first classification target of the comparison learning model; the first type of video assets are used for representing the same video assets in the plurality of first video assets and the plurality of second video assets;
determining the dot product of the video features of the second-class video resources and the search word features as a second classification target of the comparison learning model; the second type of video assets are used for representing different video assets in the plurality of first video assets and the plurality of second video assets;
and based on the first classification target and the second classification target, comparing and learning the video characteristics of the first video resource and the search word characteristics of the second video resource to obtain a comparison loss value.
Optionally, the processing unit is specifically configured to:
determining a first set of class features based on the first classification target; the first type feature set is used for representing video features and search word features of the first type video resources;
determining a second class of feature sets based on the second classification target; the second type feature set is used for representing video features and search word features of the second type video resources;
and determining a contrast loss value according to the difference information of the first type of feature set and the second type of feature set.
Optionally, the acquiring unit is specifically configured to:
acquiring text features and image features of a first video resource;
and carrying out feature fusion on the text features and the image features based on a multi-modal algorithm to obtain video features.
Optionally, the acquiring unit is specifically configured to:
based on an image feature extraction algorithm, extracting features of video images of the first video resource to obtain image features;
text detection is carried out on the first video resource based on a voice recognition algorithm and a text detection algorithm so as to obtain text information;
and extracting the characteristics of the text information based on a text characteristic extraction algorithm to obtain text characteristics.
Optionally, the acquiring unit is further configured to acquire an initial video feature of the first video resource and an initial search term feature of the second video resource;
The processing unit is also used for regularizing the initial video features and the initial search word features to obtain processed video features and processed search word features;
the processing unit is further configured to determine the processed video feature as a video feature of the first video resource and determine the processed search term feature as a search term feature of the second video resource.
According to a fourth aspect of embodiments of the present disclosure, there is provided a video feature extraction apparatus, which may be applied to an electronic device. The apparatus may include: an acquisition unit and a processing unit;
the acquisition unit is used for acquiring video resources to be processed;
the processing unit is used for inputting the video resources to be processed into the video feature extraction model so as to obtain the video features of the video resources to be processed; the video feature extraction model is trained according to the video feature extraction model training method of any one of the first aspects.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement any of the optional video feature extraction model training methods of the first aspect above, or the video feature extraction method of the second aspect above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having instructions stored thereon, which when executed by a processor of an electronic device, enable the electronic device to perform any one of the above-described optional video feature extraction model training methods of the first aspect, or the video feature extraction methods of the above-described second aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the video feature extraction model training method according to any of the optional implementations of the first aspect, or the video feature extraction method of the second aspect described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
based on any one of the above aspects, an embodiment of the present disclosure provides a training method for a video feature extraction model, where after obtaining video features and tag information of a first video resource, an electronic device may determine a classification result of the video features of the first video resource based on a classification model, and determine a classification loss value according to difference information between the classification result and the tag information. Then, after obtaining the search term feature of the second video resource, the electronic device may perform contrast learning on the video feature of the first video resource and the search term feature of the second video resource based on the contrast learning model, so as to obtain a contrast loss value. Subsequently, the electronic device may train the video feature extraction model to be trained based on the classification loss value and the comparison loss value to obtain the video feature extraction model.
Since the classification loss value is obtained from the difference information between the classification result and the tag information, and the contrast loss value is obtained from the video features and the search word features, the video feature extraction model trained based on the classification loss value and the contrast loss value can be regarded as a model trained with supervision information (the tag information and the search word features) that does not require manual annotation. Therefore, when the video feature extraction model extracts the video features of a video to be processed, the video features of the video to be processed can be fully mined, which improves the accuracy of the target feature extraction model and, at the same time, improves the training efficiency of the video feature extraction model.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 shows a schematic diagram of a video feature extraction model training system provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a video feature extraction model training method according to an embodiment of the disclosure;
FIG. 3 illustrates a flow diagram of yet another video feature extraction model training method provided by embodiments of the present disclosure;
FIG. 4 is a flow chart of yet another video feature extraction model training method provided by an embodiment of the present disclosure;
FIG. 5 illustrates a flow diagram of yet another video feature extraction model training method provided by embodiments of the present disclosure;
FIG. 6 illustrates a flow diagram of yet another video feature extraction model training method provided by an embodiment of the present disclosure;
FIG. 7 is a flow chart of yet another video feature extraction model training method provided by an embodiment of the present disclosure;
FIG. 8 illustrates a flow diagram of yet another video feature extraction model training method provided by an embodiment of the present disclosure;
fig. 9 is a schematic flow chart of a video feature extraction method according to an embodiment of the disclosure;
fig. 10 is a schematic structural diagram of a video feature extraction model training device according to an embodiment of the disclosure;
fig. 11 is a schematic structural diagram of a video feature extraction apparatus according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure;
fig. 13 shows a schematic structural diagram of a server according to an embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or components.
The data referred to in this disclosure may be data authorized by the user or sufficiently authorized by the parties.
In the related art, a video feature extraction model is usually trained in an unsupervised or supervised manner, so as to obtain a video feature extraction model capable of extracting video features. However, the accuracy of a video feature extraction model trained in an unsupervised manner is low, while a video feature extraction model trained in a supervised manner needs to rely on supervision information. The supervision information usually requires manual annotation, the amount of annotation is large, and the annotation is time-consuming, so the training efficiency of the video feature extraction model is low.
Based on this, the embodiment of the disclosure provides a training method for a video feature extraction model, and the electronic device may determine a classification result of the video feature of the first video resource based on the classification model after obtaining the video feature and the tag information of the first video resource, and determine a classification loss value according to difference information of the classification result and the tag information. Then, after obtaining the search term feature of the second video resource, the electronic device may perform contrast learning on the video feature of the first video resource and the search term feature of the second video resource based on the contrast learning model, so as to obtain a contrast loss value. Subsequently, the electronic device may train the video feature extraction model to be trained based on the classification loss value and the comparison loss value to obtain the video feature extraction model.
Since the classification loss value is obtained from the difference information between the classification result and the tag information, and the contrast loss value is obtained from the video features and the search word features, the video feature extraction model trained based on the classification loss value and the contrast loss value can be regarded as a model trained with supervision information (the tag information and the search word features) that does not require manual annotation. Therefore, when the video feature extraction model extracts the video features of a video to be processed, the video features of the video to be processed can be fully mined, which improves the accuracy of the target feature extraction model and, at the same time, improves the training efficiency of the video feature extraction model.
Fig. 1 is a schematic diagram of a video feature extraction model training system according to an embodiment of the disclosure, where, as shown in fig. 1, the video feature extraction model training system may include: server 110 and electronic device 120, server 110 may establish a connection with electronic device 120 over a wired or wireless network.
The server 110 may be a data server of some multimedia resource service platform, and may be used to store and process multimedia resources. For example, the multimedia asset service platform may be a short video application service platform, a news service platform, a live broadcast service platform, a shopping service platform, a take-away service platform, a sharing service platform, a functional website, and the like. The multimedia resources provided by the short video application service platform can be some short video works, the multimedia resources provided by the news service platform can be some news information, the multimedia resources provided by the live broadcast service platform can be live broadcast works and the like, and the rest is not described in detail. The present disclosure is not limited to a particular type of multimedia asset service platform.
In this disclosure, the server 110 is mainly used to store data required for training the video feature extraction model, for example: a first video asset, video features and tag information for the first video asset, a second video asset, search term features for the second video asset, and so forth. The server 110 may transmit corresponding data to the electronic device 120 upon receiving a data acquisition request transmitted by the electronic device 120.
In some embodiments, the server 110 may further include or be connected to a database, and the multimedia resources of the multimedia resource service platform may be stored in the database. The electronic device 120 may implement access operations to multimedia resources in the database through the server 110.
The electronic device 120 may be a server, a terminal, or other electronic devices for training a video feature extraction model, which is not limited in this disclosure.
When the electronic device 120 is a server, the electronic device 120 and the server 110 may be two independent servers or may be integrated in the same server, which is not limited in this application.
It is easy to understand that when the electronic device 120 and the server 110 are integrated in the same server, the communication between the electronic device 120 and the server 110 becomes communication between modules inside that server. In this case, the communication flow between them is the same as in the case where the electronic device 120 and the server 110 are independent of each other.
For ease of understanding, the present application will be described primarily with the electronic device 120 and the server 110 being provided separately.
The server may be a single server or may be a server cluster including a plurality of servers. In some implementations, the server cluster may also be a distributed cluster. The present disclosure is not limited to a specific implementation of the server.
When the electronic device 120 is a terminal, the electronic device 120 may be a mobile phone, a tablet computer, a desktop computer, a laptop, a handheld computer, a notebook, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR)/virtual reality (VR) device, or the like, which may install and use a content community application (e.g., Kuaishou), and the specific form of the terminal is not particularly limited in the present disclosure. The terminal can perform man-machine interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, handwriting equipment, and the like.
Alternatively, in the video feature extraction model training system shown in fig. 1, the electronic device 120 may be connected to at least one server 110. The present disclosure is not limited in the number and type of servers 110.
The video feature extraction model training method provided by the embodiment of the present disclosure may be applied to the electronic device 120 in the application scenario shown in fig. 1.
The video feature extraction model training method provided by the embodiment of the disclosure is described in detail below with reference to the accompanying drawings.
As shown in fig. 2, when the video feature extraction model training method is applied to an electronic device, the video feature extraction model training method may include:
s201, the electronic equipment acquires video characteristics and tag information of the first video resource.
Wherein the tag information may be a topic tag of the first video asset.
Specifically, when the video feature extraction model is obtained by training, in order to improve the accuracy of the video feature extraction model, the electronic device may perform video feature extraction model training based on a supervised manner (the accuracy of the model obtained by training based on a supervised manner is higher than the accuracy of the model obtained by training based on an unsupervised manner). And when the video feature extraction model is trained based on a supervised mode, the method is required to depend on supervision information. In this case, the electronic device may obtain the video characteristics and tag information of the first video asset.
Since the tag information is tag information matched with the first video asset, the electronic device may use the tag information as the supervision information. Therefore, the electronic equipment can train to obtain the video feature extraction model with higher accuracy based on the supervision information with higher video resource association degree.
For example, when the first video asset is basketball video, the topic tag of the first video asset may be "# sports".
Still further exemplary, when the first video asset is a music video, the topic tag of the first video asset may be "# music".
In one implementation, the present disclosure is not limited to the number of first video assets, as the video feature extraction model training requires a large amount of training data as a basis. In practical applications, the number of the first video resources may be 1000 or 10000. Correspondingly, the number of the video features and the number of the tag information are also multiple, that is, the electronic device can acquire the video features and the tag information corresponding to each first video resource.
In one implementation, the electronic device may obtain the video feature and tag information of the first video asset from a server (e.g., server 110 in fig. 1) that stores data required for the video feature extraction model training, or may obtain the first video asset and tag information from a database that stores data required for the video feature extraction model training, and then determine the video feature of the first video asset based on a feature extraction algorithm, which is not limited by the present disclosure.
The database may be a database of the electronic device, or may be a database in another storage device or a storage system (e.g., a distributed storage system), which is not limited in this disclosure.
S202, the electronic equipment determines a classification result of the video features of the first video resource based on the classification model, and determines a classification loss value according to difference information of the classification result and the tag information.
Specifically, after obtaining the video feature and the tag information of the first video resource, the electronic device may determine difference information between the video feature and the tag information in order to obtain a classification loss value for training the video feature extraction model. However, because the video features are feature vectors, the electronic device can determine a classification result of the video features of the first video asset based on the classification model.
The classification model may be obtained by the electronic device through learning and training on already-classified data, so that a classification result of the video features can be obtained.
In practical applications, common classification models include: logistic regression models, decision tree models, support vector machine models, naive bayes models, and the like.
Alternatively, the classification model may be a model trained in advance for determining classification results for video features.
Optionally, the electronic device may also determine a classification result of the video feature of the first video asset based on a classification algorithm (e.g., a k-nearest neighbor classification algorithm, a decision tree classification algorithm, etc.).
After determining the classification result of the video feature of the first video resource, the electronic device may determine a classification loss value according to the difference information of the classification result and the tag information.
Alternatively, the classification loss value may be a cross entropy function value.
For example, when the first video asset is basketball video, the topic tag of the first video asset may be "# sports". The electronic equipment determines the classification result of the video characteristics of the basketball video as basketball based on the classification model. The electronic device may then determine a classification loss value based on the difference information of "basketball" and "sports".
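By way of illustration only (not part of the original disclosure), the classification loss value could be computed as a cross-entropy between the classification result and the tag information, assuming a PyTorch linear classification head over the video features; the dimensions, tag vocabulary size, and the linear head itself are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tags = 1000      # assumed size of the topic-tag vocabulary (e.g. "#sports", "#music", ...)
feature_dim = 512    # assumed dimension of the video feature vector

classifier = nn.Linear(feature_dim, num_tags)              # classification model (here: a linear head)

video_features = torch.randn(32, feature_dim)              # video features of 32 first video resources
tag_labels = torch.randint(0, num_tags, (32,))             # indices of the topic tags (tag information)

logits = classifier(video_features)                        # classification result
classification_loss = F.cross_entropy(logits, tag_labels)  # classification loss value
```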
S203, the electronic equipment acquires the search word characteristics of the second video resource.
Wherein the search term feature may be a feature of a search term when a search is performed on the second video asset.
Specifically, when the video feature extraction model is obtained by training, in order to improve the accuracy of the video feature extraction model, the electronic device may perform video feature extraction model training based on a supervised manner (the accuracy of the model obtained by training based on a supervised manner is higher than the accuracy of the model obtained by training based on an unsupervised manner). And when the video feature extraction model is trained based on a supervised mode, the method is required to depend on supervision information. In this case, the electronic device can obtain the search term feature of the second video asset.
Since the search term feature is a feature of a search term when a search is performed on the second video asset, the electronic device may use the search term feature as the supervision information. Therefore, the electronic equipment can train to obtain the video feature extraction model with higher accuracy based on the supervision information with higher association degree with the video resources.
For example, when the second video asset is basketball video, the search term for the second video asset may be "basketball".
Still further exemplary, when the second video asset is a music video, the search term of the second video asset may be "music".
In one implementation, the present disclosure is not limited to the number of second video assets, as the video feature extraction model training requires a large amount of training data as a basis. In practical applications, the number of second video resources may be 1000 or 10000. Accordingly, the number of the search word features of the second video resources is also multiple, that is, the electronic device may acquire the search word feature corresponding to each second video resource.
In one implementation, the electronic device may obtain the search term feature of the second video asset from a server (e.g., server 110 in fig. 1) that stores data required for the video feature extraction model training, or may obtain the second video asset from a database that stores data required for the video feature extraction model training, and then determine the search term feature of the second video asset based on a feature extraction algorithm, which is not limited in this disclosure.
The database may be a database of the electronic device, or may be a database in another storage device or a storage system (e.g., a distributed storage system), which is not limited in this disclosure.
S204, the electronic equipment performs contrast learning on the video features of the first video resource and the search word features of the second video resource based on the contrast learning model so as to obtain a contrast loss value.
Specifically, after obtaining the video feature of the first video resource and the search term feature of the second video resource, in order to obtain a comparison loss value for training the video feature extraction model, the electronic device may perform comparison learning on the video feature of the first video resource and the search term feature of the second video resource in the comparison learning model to obtain the comparison loss value.
Alternatively, the contrast learning model may be trained in advance for contrast learning of different features to obtain a model of loss values.
Optionally, the electronic device may further perform contrast learning on the video feature of the first video resource and the search term feature of the second video resource based on a contrast learning algorithm, so as to obtain a contrast loss value.
Alternatively, the contrast loss value may be a cross entropy function value.
S205, the electronic equipment trains the video feature extraction model to be trained based on the classification loss value and the comparison loss value so as to obtain the video feature extraction model.
Optionally, the electronic device may add the comparison loss value and the classification loss value to obtain a corresponding joint loss, and train the video feature extraction model to be trained based on the joint loss to obtain the video feature extraction model. From the above, since the classification loss value is obtained according to the difference information between the classification result and the tag information, and the comparison loss value is obtained according to the video feature and the search word feature, the video feature extraction model trained based on the classification loss value and the comparison loss value can be regarded as a video feature extraction model obtained based on the supervised information (without manually labeling the supervision information) including the tag information and the search word feature. Therefore, when the video feature extraction model extracts the video feature of the video to be processed, the video feature of the video to be processed can be fully mined, the accuracy of the target feature extraction model is improved, and meanwhile, the training efficiency of the video feature extraction model is also improved.
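By way of illustration only, a single joint training step combining the two loss values might be sketched as follows; the placeholder feature extraction model, the optimizer, and the plain sum of the two losses are assumptions, not the patent's prescribed implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the video feature extraction model to be trained,
# the classification model, and the already-extracted search term features.
feature_dim, num_tags, batch_size = 512, 1000, 32
feature_extractor = nn.Linear(2048, feature_dim)        # placeholder for the real extraction model
classifier = nn.Linear(feature_dim, num_tags)           # classification model head

optimizer = torch.optim.Adam(
    list(feature_extractor.parameters()) + list(classifier.parameters()), lr=1e-4)

raw_inputs = torch.randn(batch_size, 2048)               # stand-in for the first video resources
tag_labels = torch.randint(0, num_tags, (batch_size,))   # tag information
query_features = F.normalize(torch.randn(batch_size, feature_dim), dim=-1)  # search term features

video_features = F.normalize(feature_extractor(raw_inputs), dim=-1)

# Classification loss value: difference between classification result and tag information.
classification_loss = F.cross_entropy(classifier(video_features), tag_labels)

# Contrast loss value: matching video/search-term pairs (same index here) should have
# larger dot products than non-matching pairs (see S301-S303 for the classification targets).
similarity = video_features @ query_features.t()
contrast_loss = F.cross_entropy(similarity, torch.arange(batch_size))

joint_loss = classification_loss + contrast_loss         # sum of the two loss values
optimizer.zero_grad()
joint_loss.backward()
optimizer.step()
```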
In one implementation, when the number of first video assets is a plurality and the number of second video assets is a plurality, the plurality of first video assets and the plurality of second video assets include the same video assets and different video assets. In this case, referring to fig. 2, as shown in fig. 3, in S204, the electronic device performs contrast learning on the video feature of the first video resource and the search word feature of the second video resource based on the contrast learning model, so as to obtain a contrast loss value, where the method specifically includes:
S301, the electronic device determines dot products of video features and search word features of the first type of video resources as first classification targets of the comparison learning model.
The first type of video resources are used for representing the same video resources in the plurality of first video resources and the plurality of second video resources.
S302, the electronic equipment determines dot products of video features and search word features of the second-class video resources as second classification targets of the comparison learning model.
Wherein the second type of video assets is used to represent different ones of the plurality of first video assets and the plurality of second video assets.
Specifically, contrast learning is a machine learning technique; the electronic device may determine the contrast loss value by learning which data in the plurality of first video resources and the plurality of second video resources are similar, identical, or different.
In the contrast learning model, a learning optimization target needs to be set in advance, so that training data can approach the optimization target, and a final high-accuracy feature extraction model is obtained.
For a first type of video asset, i.e., the same video asset of the plurality of first video assets and the plurality of second video assets, the electronic device may determine that the video features and search term features of the same video asset are similar or identical. In this case, the dot product of the video feature and the search term feature of the same video asset is maximized. Thus, the electronic device can determine a dot product of the video features and the search term features of the first type of video asset as a first classification target of the contrast learning model.
Accordingly, for a second type of video asset, i.e., different ones of the plurality of first video assets and the plurality of second video assets, the electronic device may determine that the video features and search term features of the different video assets are dissimilar. In this case, the dot product of the video feature and the search term feature of the different video asset is less than the dot product of the video feature and the search term feature of the same video asset. Thus, the electronic device can determine a dot product of the video features and the search term features of the second type of video asset as a second classification target for the comparison learning model.
For example, for a plurality of first video assets and a plurality of second video assets, an ith first video asset of the plurality of first video assets and an ith second video asset of the plurality of second video assets are the same. Accordingly, a j-th first video asset of the plurality of first video assets and a j-th second video asset of the plurality of second video assets are also the same.
Wherein i is a positive integer; j is a positive integer.
However, an ith one of the plurality of first video assets and a jth one of the plurality of second video assets are different. Accordingly, the j-th first video asset of the plurality of first video assets and the i-th second video asset of the plurality of second video assets are also different.
In this case, the dot product S_{i,i} of the video feature of the i-th first video resource and the search term feature of the i-th second video resource, and the dot product S_{i,j} of the video feature of the i-th first video resource and the search term feature of the j-th second video resource, satisfy the following formula:
S_{i,i} > S_{i,j}, i ≠ j;
based on the same reasoning, the dot product S_{j,j} of the video feature of the j-th first video resource and the search term feature of the j-th second video resource, and the dot product S_{j,i} of the video feature of the j-th first video resource and the search term feature of the i-th second video resource, satisfy the following formula:
S_{j,j} > S_{j,i}, i ≠ j.
In this case, the electronic device can set the dot products S_{i,i} and S_{j,j} (pairs corresponding to the same video resource) to 1, and set the dot products S_{i,j} and S_{j,i} (pairs corresponding to different video resources) to 0, so that the first classification target and the second classification target are distinguished by different labels.
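As an illustration only (not part of the original disclosure), the classification targets above can be organized as a matrix of dot products with binary labels; the tensor shapes, the use of PyTorch, and the same-index pairing convention are assumptions:

```python
import torch
import torch.nn.functional as F

# Row i of `video_feats` is the video feature of the i-th first video resource;
# row j of `query_feats` is the search term feature of the j-th second video resource.
# Pairs with the same index are assumed to correspond to the same video resource.
video_feats = F.normalize(torch.randn(8, 256), dim=-1)
query_feats = F.normalize(torch.randn(8, 256), dim=-1)

S = video_feats @ query_feats.t()   # S[i, j]: dot product of pair (i, j)

# First classification target: same video resource -> label 1 (diagonal);
# second classification target: different video resources -> label 0 (off-diagonal).
targets = torch.eye(8)
```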
S303, the electronic equipment performs comparison learning on the video features of the first video resource and the search word features of the second video resource based on the first classification target and the second classification target so as to obtain a comparison loss value.
Specifically, after the first classification target and the second classification target are determined, the electronic device can perform contrast learning on the video feature of the first video resource and the search word feature of the second video resource based on the first classification target and the second classification target to obtain a contrast loss value, and a specific implementation manner for determining the contrast loss value is provided, so that the video feature extraction model obtained by subsequent training according to the contrast loss value is convenient, the accuracy of the target feature extraction model is improved, and meanwhile, the training efficiency of the video feature extraction model is also improved.
In an implementation manner, referring to fig. 3, as shown in fig. 4, in S303, the method for performing, by using the electronic device, comparison learning on a video feature of a first video resource and a search word feature of a second video resource based on a first classification target and a second classification target to obtain a comparison loss value specifically includes:
s401, the electronic device determines a first type of feature set based on a first classification target.
Wherein the first set of features is used to represent video features and search term features of the first type of video asset.
Specifically, after determining the first classification target, the electronic device may classify the video features of the plurality of first video resources and the search term features of the plurality of second video resources to obtain the video features and the search term features of the first video resources.
Because the first classification target is used to represent a dot product of the video feature and the search term feature of the same video asset of the plurality of first video assets and the plurality of second video assets, the electronic device may determine a dot product of the video feature of each first video asset and the search term feature of each second video asset to obtain a plurality of dot products. The electronic device can then determine a dot product of the same video asset video feature and the search term feature from the plurality of dot products and determine a set of the same video asset video feature and the search term feature as a first type of feature set.
S402, the electronic device determines a second class feature set based on the second class target.
Wherein the second set of features is used to represent video features and search term features of the second type of video asset.
Specifically, after determining the second classification target, the electronic device may classify the video features of the plurality of first video resources and the search term features of the plurality of second video resources to obtain the video features and the search term features of the second video resources.
Because the second classification target is used to represent a dot product of the video feature and the search term feature of different video assets in the plurality of first video assets and the plurality of second video assets, the electronic device may determine a dot product of the video feature of each first video asset and the search term feature of each second video asset to obtain a plurality of dot products. The electronic device may then determine a dot product of the different video asset video features and the search term features from the plurality of dot products and determine a set of the different video asset video features and the search term features as a second set of features.
S403, the electronic equipment determines a contrast loss value according to the difference information of the first type of feature set and the second type of feature set.
Illustratively, the plurality of first video resources includes a first video resource A and a first video resource B, and the plurality of second video resources includes a second video resource 1 and a second video resource 2. The first video resource A and the second video resource 1 are the same video resource, and the first video resource B and the second video resource 2 are also the same video resource. The first video resource A and the second video resource 2 are different video resources, and the first video resource B and the second video resource 1 are also different video resources.
In this case, the electronic device may obtain 2 first-type feature sets based on the first classification target. One first-type feature set includes: the video features of the first video resource A and the search term features of the second video resource 1. The other first-type feature set includes: the video features of the first video resource B and the search term features of the second video resource 2.
Accordingly, the electronic device may obtain 2 second-type feature sets based on the second classification target. One second-type feature set includes: the video features of the first video resource A and the search term features of the second video resource 2. The other second-type feature set includes: the video features of the first video resource B and the search term features of the second video resource 1.
The electronic device may then determine difference information for each first type of feature set and each second type of feature set, and determine a contrast loss value based on the obtained difference information.
The difference information may be obtained, for example, by weighting and summing the feature vectors in each first-type feature set to obtain a first vector corresponding to that first-type feature set, and by weighting and summing the feature vectors in each second-type feature set to obtain a second vector corresponding to that second-type feature set. The electronic device may then determine the difference value between each first vector and each second vector as the difference information.
After the first type feature set and the second type feature set are determined, the electronic device can determine the contrast loss value according to the difference information of the first type feature set and the second type feature set, and a specific implementation mode for determining the contrast loss value is provided, so that a video feature extraction model obtained by training according to the contrast loss value is convenient to follow, accuracy of the target feature extraction model is improved, and training efficiency of the video feature extraction model is improved.
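Continuing the illustration, one possible realization (an assumption, not a formula mandated by the disclosure) is to turn the first-type and second-type feature sets into a contrast loss value via a cross-entropy over the dot-product matrix against the binary labels above:

```python
import torch
import torch.nn.functional as F

S = torch.randn(8, 8)        # dot products of video features and search term features
targets = torch.eye(8)       # 1 for first-type (same video) pairs, 0 for second-type pairs

# Push the dot products of first-type feature sets towards 1 and those of
# second-type feature sets towards 0; the resulting value is the contrast loss value.
contrast_loss = F.binary_cross_entropy_with_logits(S, targets)
```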
In one implementation manner, referring to fig. 4, as shown in fig. 5, in S201, a method for acquiring a video feature of a first video resource by an electronic device specifically includes:
s501, the electronic equipment acquires text features and image features of the first video resource.
Specifically, the electronic device may obtain text features and image features of the first video resource, so as to obtain the video features of the first video resource according to the text features and the image features.
The video feature may be a feature vector representing the asset content of the first video asset.
The text feature may also be a feature vector for text content of the first video asset, such as a video title, a subtitle in a video, a text after speech conversion, etc.
The image feature may also be a feature vector for representing the image content of the first video asset, such as an image of a video cover, video frame, etc.
S502, the electronic equipment performs feature fusion on the text features and the image features based on a multi-mode algorithm to obtain video features.
After the text feature and the image feature of the first video resource are acquired, since the number of first video resources is huge, directly using the two separate features to determine the classification loss value may reduce the efficiency of training the video feature extraction model. In this case, the electronic device may perform feature fusion on the text feature and the image feature based on a multi-modal algorithm to obtain the video feature. Thus, the electronic device can determine the classification loss value based on one multi-modal feature (namely, the video feature), which improves the training efficiency of the video feature extraction model.
Wherein the multi-modal algorithm includes: a multi-head self-attention (Multi-head Self-Attention) algorithm.
In practical applications, the electronic device may also perform feature fusion on the text feature and the image feature through other multi-modal algorithms to obtain the video feature, which is not limited in this disclosure.
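By way of illustration only, a minimal multi-head self-attention fusion of a text feature and an image feature might look as follows; the dimensions, the single attention layer, and the mean pooling are assumptions rather than the patent's concrete fusion network:

```python
import torch
import torch.nn as nn

dim = 512
text_feat = torch.randn(1, 1, dim)    # text feature of the first video resource (one token)
image_feat = torch.randn(1, 1, dim)   # image feature of the first video resource (one token)

tokens = torch.cat([text_feat, image_feat], dim=1)   # (batch, 2 tokens, dim)
fusion = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

fused, _ = fusion(tokens, tokens, tokens)             # multi-head self-attention over both modalities
video_feature = fused.mean(dim=1)                     # pooled multi-modal video feature, (batch, dim)
```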
According to the method, after the text features and the image features of the first video resource are acquired, the electronic equipment can perform feature fusion on the text features and the image features based on a multi-mode algorithm to obtain video features, a specific implementation mode for determining the video features is provided, so that the classification loss value is determined according to the video features later, the obtained video feature extraction model is trained according to the classification loss value, the accuracy of the target feature extraction model is improved, and meanwhile the training efficiency of the video feature extraction model is also improved.
In one implementation manner, referring to fig. 5, as shown in fig. 6, in S501, a method for acquiring text features and image features of a first video resource by an electronic device specifically includes:
s601, the electronic equipment performs feature extraction on the video image of the first video resource based on an image feature extraction algorithm to obtain image features.
Specifically, in order to quickly obtain the video features, the electronic device may perform feature extraction on the video image of the first video resource based on an image feature extraction algorithm, so as to obtain the image features.
The image features may be feature vectors. The feature vector is used to represent image content of a video image of the first video asset.
In one implementation, the video image of the first video asset may be a cover picture, a video frame, etc. in the first video asset.
In one embodiment, the electronic device may perform frame extraction on the first video resource and obtain a cover picture of the first video resource, thereby obtaining the video images of the first video resource.
For example, when the first video asset is a basketball video, the video image of the basketball video may be a cover image of the basketball video: "Picture of a basketball".
The electronic device may then extract image features of the first video asset from the video images of the basketball video using an image feature extraction algorithm.
Alternatively, the image feature extraction algorithm may be a feature extraction algorithm based on a residual network (ResNet-50).
In practical applications, the electronic device may further perform feature extraction on the video image of the first video resource through other image feature extraction technologies (such as an image feature extraction model, etc.), so as to obtain an image feature, which is not limited in this disclosure.
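For reference, a minimal sketch of this step using torchvision's ResNet-50 is given below; the weight choice, the ImageNet preprocessing values and the file path are assumptions, not details fixed by this disclosure.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load ResNet-50 and drop the classification head to keep the 2048-d pooled feature.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

# Standard ImageNet preprocessing (assumed here).
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "cover.jpg" is a hypothetical cover picture or extracted video frame.
image = Image.open("cover.jpg").convert("RGB")
with torch.no_grad():
    image_feature = resnet(preprocess(image).unsqueeze(0))   # shape: (1, 2048)
```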
S602, the electronic equipment performs text detection on the first video resource based on a voice recognition algorithm and a text detection algorithm to obtain text information.
S603, the electronic equipment performs feature extraction on the text information based on a text feature extraction algorithm to obtain text features.
Specifically, in order to quickly obtain the video feature, the electronic device may perform text detection on the first video resource based on a speech recognition algorithm and a text detection algorithm to obtain text information, and perform feature extraction on the text information based on a text feature extraction algorithm to obtain text features.
The text feature may be a feature vector. The feature vector is used to represent text content of the text information in the first video asset.
In one implementation, the text information of the first video asset may be text, titles, voice content, etc. in the first video asset.
For example, when the first video asset is a basketball video, the text information of the basketball video may be a title of the basketball video: "how to play basketball".
Alternatively, the text feature extraction algorithm may be a bidirectional-encoder-based feature extraction algorithm such as BERT (Bidirectional Encoder Representations from Transformers).
In practical applications, the electronic device may perform feature extraction on the text information of the first video resource through other text feature extraction technologies (such as text feature extraction models, etc.), so as to obtain text features, which is not limited in this disclosure.
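As an illustration, the sketch below extracts a text feature from a video title with a BERT encoder via the Hugging Face transformers library; the checkpoint name and the use of the [CLS] representation are assumptions for illustration only.

```python
import torch
from transformers import BertTokenizer, BertModel

# "bert-base-chinese" is an assumed checkpoint; any BERT-style encoder could be used.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

# Text information of the video, e.g. its title.
inputs = tokenizer("how to play basketball", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    outputs = bert(**inputs)

text_feature = outputs.last_hidden_state[:, 0]   # [CLS] vector, shape: (1, 768)
```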
From the above, the electronic device may obtain the text feature and the image feature of the first video resource based on various feature extraction algorithms. This provides a specific implementation for obtaining the text feature and the image feature of the first video resource, so that the video feature can be determined from them, the classification loss value can be determined from the video feature, and the video feature extraction model to be trained can be trained according to the classification loss value, which improves both the accuracy of the target feature extraction model and the training efficiency of the video feature extraction model.
In one implementation, as shown in fig. 7, the video feature extraction model training method further includes:
s701, the electronic device acquires initial video features of the first video resource and initial search word features of the second video resource.
S702, the electronic equipment performs regularization processing on the initial video features and the initial search word features to obtain processed video features and processed search word features.
S703, the electronic device determines the processed video feature as the video feature of the first video resource and determines the processed search term feature as the search term feature of the second video resource.
Specifically, since the initial video feature is a video feature of the first video resource and the initial search term feature is a search term feature of the second video resource, in order to ensure that the video feature and the search term feature used to determine the contrast loss value are on the same scale, the electronic device may perform regularization processing on the initial video feature and the initial search term feature to obtain a processed video feature and a processed search term feature, determine the processed video feature as the video feature of the first video resource, and determine the processed search term feature as the search term feature of the second video resource.
In practical applications, the regularization may be L2 regularization.
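A minimal sketch of this L2 regularization step, assuming both feature sets are already batched tensors of the same width, is:

```python
import torch
import torch.nn.functional as F

# Placeholder initial features; in practice these come from the encoders above.
initial_video_features = torch.randn(32, 768)
initial_query_features = torch.randn(32, 768)

# L2-normalize so every vector has unit length and dot products are directly comparable.
video_features = F.normalize(initial_video_features, p=2, dim=-1)
query_features = F.normalize(initial_query_features, p=2, dim=-1)
```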
From the above, the electronic device may determine the processed video feature as the video feature of the first video resource and the processed search term feature as the search term feature of the second video resource. This provides a specific implementation for obtaining the video feature of the first video resource and the search term feature of the second video resource, so that the two can be contrastively learned to obtain a contrast loss value and the video feature extraction model to be trained can then be trained according to the contrast loss value, which improves both the accuracy of the target feature extraction model and the training efficiency of the video feature extraction model.
Fig. 8 is a schematic flow chart of a video feature extraction model training method according to an embodiment of the present disclosure. As shown in fig. 8, the electronic device 120 may obtain video images and text information of the first video asset from the server 110.
Accordingly, the electronic device 120 may also obtain the search term for the second video asset from the server 110.
The electronic device 120 may then determine image features of the video image of the first video asset based on the image feature extraction algorithm (or image encoder).
Accordingly, the electronic device 120 may determine text features of the text information of the first video asset based on a text feature extraction algorithm (or text encoder).
Accordingly, the electronic device 120 may determine search term features of the search terms of the second video asset based on a text feature extraction algorithm (or text encoder).
The electronic device 120 may then perform feature fusion on the text features and the image features based on the multimodal algorithm to obtain video features.
Then, the electronic device 120 may determine a classification result of the video feature of the first video resource based on the classification model, and determine a classification loss value according to difference information of the classification result and the tag information.
Then, the electronic device 120 may perform contrast learning on the video feature of the first video resource and the search word feature of the second video resource based on the contrast learning model to obtain a contrast loss value;
the electronic device 120 may then train the video feature extraction model to be trained based on the classification loss value and the contrast loss value to obtain a video feature extraction model.
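Putting the steps of fig. 8 together, one training step might look like the following sketch; the loss weighting, the temperature and the fusion/classifier modules named here are assumptions rather than values taken from this disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(text_feat, image_feat, query_feat, labels,
               fusion, classifier, optimizer, temperature=0.07):
    # Multi-modal fusion of text and image features into the video feature.
    video_feat = fusion(text_feat, image_feat)

    # Classification loss against the tag information.
    cls_loss = F.cross_entropy(classifier(video_feat), labels)

    # Contrastive loss between video features and search-term features.
    v = F.normalize(video_feat, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    logits = v @ q.t() / temperature                      # pairwise dot products
    targets = torch.arange(v.size(0), device=v.device)    # matched pairs lie on the diagonal
    ctr_loss = F.cross_entropy(logits, targets)

    loss = cls_loss + ctr_loss                            # equal weighting assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return cls_loss.item(), ctr_loss.item()
```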
Fig. 9 shows a flowchart of a video feature extraction method according to an embodiment of the present disclosure. As shown in fig. 9, the video feature extraction method includes:
S901, the electronic equipment acquires the video resource to be processed.
In one implementation, the electronic device may obtain the video resources to be processed from a server corresponding to each demanding party (a user or platform that needs to determine the features of the resources).
S902, the electronic equipment inputs the video resources to be processed into a video feature extraction model so as to obtain video features of the video resources to be processed.
The video feature extraction model is trained according to the video feature extraction model training method of any one of fig. 2-8.
The video feature may be a feature vector. The feature vector is used to represent video content of the video asset to be processed.
The technical scheme provided by the embodiment at least brings the following beneficial effects: as can be seen from S901-S902, a usage scenario is presented in which an electronic device determines video features of a video asset to be processed using a target feature extraction model. The video characteristics of the video resources to be processed can be obtained rapidly and accurately through the target characteristic extraction model.
In one embodiment, when the video resources to be processed include a third video resource and a fourth video resource, after inputting the video resources to be processed into the target feature extraction model to obtain their video features, the electronic device may further determine the similarity between the third video resource and the fourth video resource according to the video features of the third video resource and the video features of the fourth video resource, thereby providing an important basis for subsequent associated search of video resources.
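As an illustration of this comparison, a minimal sketch is given below; `extract_video_feature` is a hypothetical helper standing in for the trained extraction model and is not an interface defined by this disclosure.

```python
import torch
import torch.nn.functional as F

def video_similarity(feature_a: torch.Tensor, feature_b: torch.Tensor) -> float:
    # Cosine similarity of two video feature vectors, e.g. of the third and
    # fourth video resources.
    return F.cosine_similarity(feature_a, feature_b, dim=-1).item()

# feature_third = extract_video_feature(third_video)     # hypothetical helper
# feature_fourth = extract_video_feature(fourth_video)
# similarity = video_similarity(feature_third, feature_fourth)
```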
It will be appreciated that, in actual implementation, the terminal/server of the embodiments of the present disclosure may include one or more hardware structures and/or software modules for implementing the foregoing corresponding video feature extraction model training method, where the executing hardware structures and/or software modules may constitute an electronic device. Those of skill in the art will readily appreciate that the algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Based on such understanding, the embodiment of the disclosure correspondingly provides a video feature extraction model training device, which can be applied to electronic equipment. Fig. 10 shows a schematic structural diagram of a video feature extraction model training apparatus provided in an embodiment of the present disclosure. As shown in fig. 10, the video feature extraction model training apparatus may include: an acquisition unit 1001 and a processing unit 1002;
an obtaining unit 1001, configured to obtain video features and tag information of a first video resource;
a processing unit 1002, configured to determine a classification result of the video feature of the first video resource based on the classification model, and determine a classification loss value according to difference information between the classification result and the tag information;
the obtaining unit 1001 is further configured to obtain a search term feature of the second video resource;
the processing unit 1002 is further configured to perform contrast learning on the video feature of the first video resource and the search term feature of the second video resource based on the contrast learning model, so as to obtain a contrast loss value;
the processing unit 1002 is further configured to train the video feature extraction model to be trained based on the classification loss value and the comparison loss value, so as to obtain the video feature extraction model.
Optionally, when the number of first video resources is plural and the number of second video resources is plural, the plurality of first video resources and the plurality of second video resources include the same video resources and different video resources; the processing unit 1002 is specifically configured to:
Determining the dot product of the video features of the first type of video resources and the search word features as a first classification target of the comparison learning model; the first type of video assets are used for representing the same video assets in the plurality of first video assets and the plurality of second video assets;
determining the dot product of the video features of the second-class video resources and the search word features as a second classification target of the comparison learning model; the second type of video assets are used for representing different video assets in the plurality of first video assets and the plurality of second video assets;
and based on the first classification target and the second classification target, comparing and learning the video characteristics of the first video resource and the search word characteristics of the second video resource to obtain a comparison loss value.
Optionally, the processing unit 1002 is specifically configured to:
determining a first set of class features based on the first classification target; the first type feature set is used for representing video features and search word features of the first type video resources;
determining a second class of feature sets based on the second classification target; the second type feature set is used for representing video features and search word features of the second type video resources;
and determining a contrast loss value according to the difference information of the first type of feature set and the second type of feature set.
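To illustrate the configuration above, the following sketch builds the first-class set (dot products of matched video/search-term pairs) and the second-class set (dot products of mismatched pairs) and derives a contrast loss from the difference between them; the temperature and the softmax cross-entropy form are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrast_loss(video_feats, query_feats, temperature=0.07):
    # video_feats, query_feats: (batch, dim), L2-normalized; row i of both
    # tensors comes from the same video resource.
    scores = video_feats @ query_feats.t() / temperature

    pos = scores.diag().unsqueeze(1)                        # first-class set: same video resources
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(scores.size(0), -1)             # second-class set: different video resources

    # The loss shrinks as the matched dot products exceed the mismatched ones,
    # i.e. as the gap between the two feature sets widens in the right direction.
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(logits, labels)
```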
Optionally, the obtaining unit 1001 is specifically configured to:
acquiring text features and image features of a first video resource;
and carrying out feature fusion on the text features and the image features based on a multi-modal algorithm to obtain video features.
Optionally, the obtaining unit 1001 is specifically configured to:
based on an image feature extraction algorithm, extracting features of video images of the first video resource to obtain image features;
text detection is carried out on the first video resource based on a voice recognition algorithm and a text detection algorithm so as to obtain text information;
and extracting the characteristics of the text information based on a text characteristic extraction algorithm to obtain text characteristics.
Optionally, the obtaining unit 1001 is further configured to obtain an initial video feature of the first video resource and an initial search term feature of the second video resource;
the processing unit 1002 is further configured to regularize the initial video feature and the initial search term feature to obtain a processed video feature and a processed search term feature;
the processing unit 1002 is further configured to determine the processed video feature as a video feature of the first video asset and determine the processed search term feature as a search term feature of the second video asset.
Fig. 11 shows a schematic structural diagram of a video feature extraction apparatus provided by an embodiment of the present disclosure. As shown in fig. 11, the video feature extraction apparatus may include: an acquisition unit 1101 and a processing unit 1102;
the obtaining unit 1101 is configured to obtain a video resource to be processed;
the processing unit 1102 is configured to input a video resource to be processed into the video feature extraction model to obtain a video feature of the video resource to be processed; the video feature extraction model is trained according to the video feature extraction model training method in any one of fig. 2-8.
As described above, the embodiments of the present disclosure may divide functional modules of an electronic device according to the above-described method examples. The integrated modules may be implemented in hardware or in software functional modules. In addition, it should be further noted that the division of the modules in the embodiments of the present disclosure is merely a logic function division, and other division manners may be implemented in practice. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated in one processing module.
The specific manner in which each module performs the operation and the beneficial effects of the video feature extraction model training device or the video feature extraction device in the foregoing embodiment are described in detail in the foregoing method embodiment, and are not described herein again.
The embodiment of the disclosure also provides a terminal, which can be a user terminal such as a mobile phone, a computer and the like. Fig. 12 shows a schematic structural diagram of a terminal provided by an embodiment of the present disclosure. The terminal may be a video feature extraction model training device or a video feature extraction device. The apparatus may include at least one processor 61, a communication bus 62, a memory 63, and at least one communication interface 64.
The processor 61 may be a central processing unit (CPU), a micro-processing unit, an ASIC, or one or more integrated circuits for controlling execution of the programs of the present disclosure. As an example, in connection with fig. 10, the processing unit 1002 in the electronic device performs the same functions as the processor 61 in fig. 12.
Communication bus 62 may include a path to transfer information between the aforementioned components.
The communication interface 64 uses any transceiver-type apparatus for communicating with other devices or communication networks, such as a server, an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 63 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be stand-alone and connected to the processor through a bus, or may be integrated with the processor.
Wherein the memory 63 is used for storing the application program code for executing the scheme of the present disclosure, and its execution is controlled by the processor 61. The processor 61 is configured to execute the application program code stored in the memory 63, thereby implementing the functions in the methods of the present disclosure.
In a particular implementation, processor 61 may include one or more CPUs, such as CPU0 and CPU1 of FIG. 12, as an example.
In a specific implementation, as an embodiment, the terminal may include multiple processors, such as processor 61 and processor 65 in fig. 12. Each of these processors may be a single-core (single-CPU) processor or may be a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a specific implementation, as an embodiment, the terminal may also include an input device 66 and an output device 67. The input device 66 communicates with the processor 61 and may accept user input in a variety of ways; for example, the input device 66 may be a mouse, a keyboard, a touch screen device, or a sensing device. The output device 67 communicates with the processor 61 and may display information in a variety of ways; for example, the output device 67 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, or the like.
Those skilled in the art will appreciate that the structure shown in fig. 12 is not limiting of the terminal and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
The embodiment of the disclosure also provides a server. Fig. 13 shows a schematic structural diagram of a server provided by an embodiment of the present disclosure. The server may be a video feature extraction model training device or a video feature extraction device. The server may vary considerably in configuration or performance and may include one or more processors 71 and one or more memories 72. The memory 72 stores at least one instruction, where the at least one instruction is loaded and executed by the processor 71 to implement the video feature extraction model training method or the video feature extraction method provided in the above-described method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
The present disclosure also provides a computer-readable storage medium including instructions stored thereon, which when executed by a processor of a computer device, enable the computer to perform the video feature extraction model training method or the video feature extraction method provided by the above-described illustrated embodiments. For example, the computer readable storage medium may be a memory 63 comprising instructions executable by the processor 61 of the terminal to perform the above-described method. For another example, the computer readable storage medium may be a memory 72 comprising instructions executable by the processor 71 of the server to perform the above-described method. Alternatively, the computer readable storage medium may be a non-transitory computer readable storage medium, for example, a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
The present disclosure also provides a computer program product comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the video feature extraction model training method shown in any of the above figures 2-8, or the video feature extraction method shown in the above figure 9.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method for training a video feature extraction model, comprising:
Acquiring video characteristics and tag information of a first video resource;
determining a classification result of the video features of the first video resource based on a classification model, and determining a classification loss value according to difference information of the classification result and the tag information;
acquiring search word characteristics of a second video resource;
based on a comparison learning model, comparing and learning the video features of the first video resource and the search word features of the second video resource to obtain a comparison loss value;
and training the video feature extraction model to be trained based on the classification loss value and the comparison loss value to obtain a video feature extraction model.
2. The method according to claim 1, wherein when the number of the first video assets is plural and the number of the second video assets is plural, the plurality of first video assets and the plurality of second video assets include the same video assets and different video assets; and wherein the performing contrast learning on the video features of the first video resource and the search word features of the second video resource based on the contrast learning model to obtain the contrast loss value comprises:
Determining the dot product of the video features and the search word features of the first video resource as a first classification target of the comparison learning model; the first type of video resources are used for representing the same video resources in the plurality of first video resources and the plurality of second video resources;
determining the dot product of the video features and the search word features of the second class of video resources as a second classification target of the comparison learning model; the second type of video resources are used for representing different video resources in the plurality of first video resources and the plurality of second video resources;
and based on the first classification target and the second classification target, comparing and learning the video features of the first video resource and the search word features of the second video resource to obtain the comparison loss value.
3. The method according to claim 2, wherein the performing contrast learning on the video feature of the first video resource and the search word feature of the second video resource based on the first classification target and the second classification target to obtain the contrast loss value includes:
determining a first set of class features based on the first classification target; the first type feature set is used for representing video features and search word features of the first type video resources;
Determining a second class of feature sets based on the second classification target; the second type feature set is used for representing video features and search word features of the second type video resources;
and determining the contrast loss value according to the difference information of the first type of feature set and the second type of feature set.
4. The method for training the video feature extraction model according to claim 1, wherein the acquiring the video feature of the first video asset comprises:
acquiring text features and image features of the first video resource;
and carrying out feature fusion on the text features and the image features based on a multi-modal algorithm to obtain the video features.
5. The method of claim 4, wherein the obtaining text features and image features of the first video asset comprises:
based on an image feature extraction algorithm, extracting features of the video images of the first video resource to obtain the image features;
text detection is carried out on the first video resource based on a voice recognition algorithm and a text detection algorithm so as to obtain text information;
and carrying out feature extraction on the text information based on a text feature extraction algorithm to obtain the text features.
6. The video feature extraction model training method of any one of claims 1-5, further comprising:
acquiring initial video features of the first video resource and initial search word features of the second video resource;
regularizing the initial video features and the initial search word features to obtain processed video features and processed search word features;
determining the processed video feature as a video feature of the first video asset and the processed search term feature as a search term feature of the second video asset.
7. A method for extracting video features, comprising:
acquiring video resources to be processed;
inputting the video resources to be processed into a video feature extraction model to obtain video features of the video resources to be processed; the video feature extraction model is trained by the video feature extraction model training method according to any one of claims 1 to 6.
8. A video feature extraction model training device, comprising: an acquisition unit and a processing unit;
the acquisition unit is used for acquiring the video characteristics and the tag information of the first video resource;
The processing unit is used for determining a classification result of the video features of the first video resource based on the classification model and determining a classification loss value according to the difference information of the classification result and the tag information;
the acquisition unit is further used for acquiring the search word characteristics of the second video resource;
the processing unit is further used for performing contrast learning on the video features of the first video resource and the search word features of the second video resource based on a contrast learning model so as to obtain a contrast loss value;
the processing unit is further configured to train the video feature extraction model to be trained based on the classification loss value and the comparison loss value, so as to obtain a video feature extraction model.
9. A video feature extraction apparatus, comprising: an acquisition unit and a processing unit;
the acquisition unit is used for acquiring video resources to be processed;
the processing unit is used for inputting the video resources to be processed into a video feature extraction model so as to obtain video features of the video resources to be processed; the video feature extraction model is trained by the video feature extraction model training method according to any one of claims 1 to 6.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video feature extraction model training method of any one of claims 1-6 or to implement the video feature extraction method of claim 7.
11. A computer readable storage medium having instructions stored thereon, which when executed by a processor of an electronic device, cause the electronic device to perform the video feature extraction model training method of any of claims 1-6, or to implement the video feature extraction method of claim 7.
CN202211215322.9A 2022-09-30 2022-09-30 Video feature extraction and model training method, device, equipment and storage medium Pending CN116030375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211215322.9A CN116030375A (en) 2022-09-30 2022-09-30 Video feature extraction and model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211215322.9A CN116030375A (en) 2022-09-30 2022-09-30 Video feature extraction and model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116030375A true CN116030375A (en) 2023-04-28

Family

ID=86071135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211215322.9A Pending CN116030375A (en) 2022-09-30 2022-09-30 Video feature extraction and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116030375A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116308556A (en) * 2023-05-25 2023-06-23 北京吉欣科技有限公司 Advertisement pushing method and system based on Internet of things


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination