CN112905840A - Video processing method, device, storage medium and equipment - Google Patents

Video processing method, device, storage medium and equipment Download PDF

Info

Publication number
CN112905840A
Authority
CN
China
Prior art keywords
video
creative
data
target
feature extraction
Prior art date
Legal status
Pending
Application number
CN202110182078.XA
Other languages
Chinese (zh)
Inventor
宋治勋 (Song Zhixun)
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110182078.XA priority Critical patent/CN112905840A/en
Publication of CN112905840A publication Critical patent/CN112905840A/en
Priority to PCT/CN2022/075469 priority patent/WO2022171067A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 - Commerce
    • G06Q 30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0241 - Advertisements
    • G06Q 30/0251 - Targeted advertisements

Abstract

The embodiments of the disclosure disclose a video processing method, apparatus, storage medium and device. The method includes: acquiring video data corresponding to a target creative video; inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, and a training sample corresponding to the video feature extraction model includes a sample pair formed by at least two creative videos having the same content theme and target object; and determining the video features corresponding to the target creative video according to the output result of the video feature extraction model. With this technical solution, the latent representation information of creative videos can be fully mined, and video features with richer characterization capability are produced.

Description

Video processing method, device, storage medium and equipment
Technical Field
The embodiments of the disclosure relate to the field of computer technology, and in particular to a video processing method, apparatus, storage medium and device.
Background
With the dramatic increase in the number of Internet users, the amount of online video keeps growing, and so do services built on it such as video search and personalized recommendation. At present, to attract viewers, a video producer usually designs the video content elaborately; such videos embody the producer's creative thinking and can be called creative videos. To apply creative videos more accurately, their features need to be extracted accurately.
A typical creative video is the advertising creative. Taking advertising creatives as an example, with the rapid development of Internet technology, online advertising is applied more and more widely. Online advertising, also called network advertising or Internet advertising, includes advertisements published through the Internet, and its carriers include, for example, web pages and client applications. The party responsible for promoting an online advertisement may be called the platform, and the party commissioning the promotion may be called the advertiser.
In the promotion of advertised objects, the importance of the advertising creative is increasingly prominent. An advertising creative generally refers to an advertising work that embodies creative thinking: through a unique technique or a clever script, it highlights the product characteristics and brand connotation and achieves the effect of publicity and promotion. The advertising creative is the form ultimately presented to the user; it gives the advertisement vitality and turns it into a message of value to the user.
At present, more and more downstream services need to access and use advertising creative features, but because advertising creatives are complex and abstract, existing schemes cannot fully mine their latent features.
Disclosure of Invention
The embodiments of the disclosure provide a video processing method, apparatus, storage medium and device, which can obtain video features that better characterize creative videos.
In a first aspect, an embodiment of the present disclosure provides a video processing method, including:
acquiring video data corresponding to a target creative video;
inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and shared weights, a training sample corresponding to the video feature extraction model comprises a sample pair formed by at least two creative videos, and the at least two creative videos have the same content theme and target object;
and determining the target video characteristics corresponding to the target creative video according to the output result of the video characteristic extraction model.
In a second aspect, an embodiment of the present disclosure provides a video processing apparatus, including:
the video data acquisition module is used for acquiring video data corresponding to the target creative video;
the video feature extraction module is used for inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and sharing weight, training samples corresponding to the video feature extraction model comprise sample pairs formed by at least two creative videos, and the at least two creative videos have the same content theme and target objects;
and the video characteristic determining module is used for determining the target video characteristics corresponding to the target creative video according to the output result of the video characteristic extracting model.
In a third aspect, the present disclosure provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a video processing method as provided by embodiments of the present disclosure.
In a fourth aspect, the present disclosure provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video processing method provided by the embodiments of the present disclosure when executing the computer program.
The video processing scheme provided in the embodiments of the disclosure extracts video features of a target creative video with a video feature extraction model obtained by training a preset neural network model containing a twin network on sample pairs formed by different creative videos that share the same content theme and target object. By exploiting the supervision signal that such sample pairs have high creative similarity, the training samples can be determined more reasonably and the training effect is improved. Inputting the video data of the target creative video whose features are to be extracted into the video feature extraction model allows the latent representation information of the creative video to be fully mined, and the video features determined from the output result have richer characterization capability, which in turn provides higher-quality input features for downstream tasks such as video recommendation or video retrieval.
Drawings
Fig. 1 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of a model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a preset neural network model according to an embodiment of the present disclosure;
Fig. 4 is a schematic flow chart of another video processing method according to an embodiment of the present disclosure;
fig. 5 is a block diagram of a video processing apparatus according to an embodiment of the disclosure;
fig. 6 is a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an" and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Each of the following embodiments provides optional features and examples, and the features described in different embodiments may be combined to form multiple alternative solutions; each numbered embodiment should not be regarded as defining only a single technical solution.
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present disclosure. The method is applicable to scenarios in which video features are extracted from a creative video, and can be executed by a video processing apparatus, where the apparatus can be implemented by software and/or hardware and can generally be integrated in a computer device. As shown in fig. 1, the method includes:
and 101, acquiring video data corresponding to the target creative video.
In the embodiments of the present disclosure, for ease of understanding the technical solution, a typical kind of creative video, the advertising creative video (hereinafter referred to as an advertising creative), is taken as the example in the following description. It should be understood that creative videos are not limited to advertising creatives and may also include creative videos in other forms such as short videos, micro-films or trailers.
In the embodiments of the present disclosure, the target advertising creative can be understood as an advertising creative whose creative features need to be extracted, which may be determined by the current application scenario. At present, more and more downstream services need to access and use advertising creative features, and the related application scenarios are numerous; they generally require content understanding of the advertising creative, and the advertising creative features output by the model of the disclosed embodiments can be used for that content understanding and the subsequent application. Exemplary application scenarios include advertisement click-through-rate prediction, advertisement recommendation and advertisement search, among others.
The video data corresponding to the target creative video may include data present in the target creative video in various forms, for example video text data and video image data. The video text data includes the video title, the video introduction, text obtained by converting the speech in the video, and the like; the video image data may include a video cover image and video frames, where the video frames may be selected according to a certain frame-extraction rule, such as extracting a certain number of frames at equal intervals.
When the target creative video is a target advertising creative, the video data corresponding to the target advertising creative may be referred to as advertising creative data. Illustratively, the advertising creative data may include any one or more of the text data, picture data and video data contained in the advertising creative. The text data may include, for example, the creative title, the creative copy, and speech text obtained by speech recognition of the sound in the creative; the picture data may include, for example, the creative cover picture, design drawings in the creative, and the like; the video data may include, for example, the creative video itself. Optionally, the text data, picture data or video data may be preprocessed to obtain the corresponding advertising creative data, for example by keyword extraction, image scaling, video frame extraction or audio extraction, which is not specifically limited. In addition, the advertising creative data may also include other data related to the advertising creative, such as industry information and information on the advertising target object.
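For illustration, a minimal sketch of the kind of frame-extraction and image-scaling preprocessing mentioned above, assuming OpenCV is used for decoding; the equal-interval rule, the 10-frame count and the 224x224 target size are illustrative assumptions rather than values fixed by the disclosure.

```python
import cv2  # assumption: OpenCV is used for decoding; any video decoder would do


def sample_frames(video_path, num_frames=10, size=(224, 224)):
    """Extract `num_frames` frames at equal intervals and scale them."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Equally spaced frame indices across the whole creative video.
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))  # H x W x 3 pixel matrix
    cap.release()
    return frames
```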
Step 102, inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model including a twin network, the twin network includes two networks with the same structure and shared weights, a training sample corresponding to the video feature extraction model includes a sample pair formed by at least two creative videos, and the at least two creative videos have the same content theme and target object.
Illustratively, a creative video generally has a content theme and a target object. The content theme can be understood as the central idea or main content to be expressed in the creative video, such as the protagonist, the promoted item or the core storyline; the target object can be understood as the intended viewers or audience of the creative video. Two creative videos with the same content theme and target object can be considered to have similar creative features and can therefore be used as a training sample of the video feature extraction model. When selecting training samples, the selection can be made along these two dimensions: content theme and target object.
Taking advertising creatives as an example, an advertiser can create multiple ad groups, an ad group has multiple ad plans, and an ad plan contains multiple ad creatives; what is finally presented to the user for direct viewing is an ad creative. An ad group generally refers to a one-off delivery contract made between an advertiser and a promotion platform, and may include the promotion purpose, the budget and the like. The promotion purpose may include, for example, the advertising theme and the advertising target object. The advertising theme is an important component of advertisement positioning, namely "what is advertised"; it may include the promoted object such as a product or service, it is the center of the advertising campaign, and the advertising work at each stage revolves closely around it and generally cannot deviate from or shift it at will. The advertising target object is the object of the advertising appeal, i.e. the target public of the advertising campaign; it can be understood as advertising targeting, the question of "to whom the advertisement is shown", and it must likewise be grounded in the advertising theme as the core. The platform delivers advertisements on client software such as short-video applications (APPs) or other network platforms with the ad plan as the unit of delivery, and an ad plan can also be understood as a specific delivery strategy. The ad plan is the core of advertisement delivery, and different promotion purposes may correspond to different ad plans. For example, when advertisements need to target users in Beijing and users in Shanghai, different ad plans can be selected respectively; likewise, different ad plans may be selected when advertisements target men and women respectively.
An ad plan is generally agreed upon by the advertiser and the platform, and ad creatives within the same ad plan generally need to be designed around the same advertising theme and target object as the core. Different people generally have different sensitivities to different ad creatives; for example, for the same women's cosmetics advertisement, some people respond to "spend xx, get xx off" while others respond to "limited edition xx, don't miss it". When the platform needs to show an ad creative under a certain ad plan to the current user, the advertising system generally uses a preset screening strategy to select, in a personalized manner for the current user, one ad creative under that plan and show it to the user; the preset screening strategy can be set according to actual requirements and is not specifically limited.
In the embodiments of the present disclosure, an ad plan can be understood as a collection of multiple ad creatives with the same advertising theme and the same advertising target object. From the perspective of material type, an ad creative may take one or more of the forms of text, pictures and video, and may also include other manifestations such as sound or smell. For an online delivery platform, the creative-thinking information contained in an ad creative is abstract and complex, and the features of the ad creative are difficult to describe or characterize accurately, so it is difficult to obtain accurate and effective training samples for training a model. From the characteristics of ad plans described above, it can be seen that ad creatives within the same ad plan have the same content theme and target object; that is, different ad creatives under the same ad plan can be considered to have high similarity.
Specifically, the ad creatives in a sample pair may be referred to as a first ad creative and a second ad creative, both of which belong to the same ad plan, where the ad plan may be selected arbitrarily according to actual needs. A selected ad plan may contain multiple ad creatives, and the first ad creative and the second ad creative may be any two different ad creatives in that plan. Training a model usually requires a large number of training samples; for example, a preset number of ad plans can be selected, and for each ad plan all of its ad creatives are traversed to obtain pairwise combinations, with the two creatives of each combination recorded as the first ad creative and the second ad creative respectively.
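As a sketch of how such pairwise combinations could be enumerated, assuming the creatives are grouped by ad plan in a simple dictionary (an assumed layout, not one specified by the disclosure):

```python
from itertools import combinations


def build_positive_pairs(plans):
    """plans: dict mapping plan_id -> list of creative ids (assumed layout).

    Every pairwise combination of creatives within one ad plan becomes a
    positive sample pair (same content theme and target object)."""
    pairs = []
    for plan_id, creatives in plans.items():
        for first, second in combinations(creatives, 2):
            pairs.append((first, second, plan_id))
    return pairs


# Example: two plans, each with a few creatives.
pairs = build_positive_pairs({"planA": ["a1", "a2", "a3"], "planB": ["b1", "b2"]})
# -> [('a1','a2','planA'), ('a1','a3','planA'), ('a2','a3','planA'), ('b1','b2','planB')]
```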
In the embodiments of the disclosure, the preset neural network model to be trained can be understood as an initial model designed according to actual requirements, and it contains a twin network. A twin network generally means two networks that are identical in structure and share parameters, i.e. their weights are identical. The two networks can be recorded as a first feature extraction network and a second feature extraction network; their specific structures are not limited, they are used to perform feature extraction on advertising creative data, and they can be configured according to actual requirements. Optionally, the preset neural network model may further include other network structures, which is not limited in the embodiments of the present disclosure.
Illustratively, when training the preset neural network model, first advertising creative sample data derived from a first ad creative can be input into the first feature extraction network and second advertising creative sample data into the second feature extraction network; training proceeds through means such as back-propagation of the network parameters, and the video feature extraction model (for advertising creatives, referred to as the advertising creative feature extraction model) is determined according to the trained first or second feature extraction network.
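A minimal training-step sketch under the assumption that the twin network is realized as a single shared PyTorch module, so both branches have identical structure and weights by construction; `encoder` and `loss_fn` are placeholders for the feature extraction network and the loss described further below.

```python
import torch


def train_step(encoder, loss_fn, optimizer, first_batch, second_batch, labels):
    """One back-propagation step for the twin (Siamese) network.

    `encoder` is a single torch.nn.Module reused for both branches, which is
    exactly what weight sharing between the two networks means."""
    emb1 = encoder(first_batch)    # features of the first creative samples
    emb2 = encoder(second_batch)   # features of the second creative samples
    loss = loss_fn(emb1, emb2, labels)  # e.g. a contrastive or hinge loss
    optimizer.zero_grad()
    loss.backward()                # gradients flow through both branches
    optimizer.step()               # one update keeps the two branches identical
    return loss.item()
```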
Step 103, determining the target video features corresponding to the target creative video according to the output result of the video feature extraction model.
Illustratively, after the target video features corresponding to the target creative video are obtained, they can be input into other models and the relevant application performed according to those models' output, or the target video features can be used directly in the relevant application. Taking advertising creatives as an example, after the target advertising creative features are obtained, they can likewise be input into other models and applied according to their output, or used directly in the relevant application.
For example, in a creative-video click prediction scenario, the target video features can be input into a click-through-rate prediction model to predict the probability that the target creative video will be clicked by a user. In a creative-video recommendation scenario, the features can be used in the creative-video recall step: assuming a creative video matching certain video features is currently to be recommended to a user, each creative video in the creative-video library can in turn be taken as the target creative video and its target video features output by the model provided in the embodiments of the disclosure; each set of target video features is then compared with the features to be recommended, and the recall sequence is determined from the comparison results. For example, the comparison may compute the similarity between the target video features and the features to be recommended, take the top-ranked target number of target video features as the features of the creative videos to be recalled, and thereby obtain the sequence of creative videos to recall.
Taking an advertisement click-through-rate prediction scenario as an example, the target advertising creative features can be input into a click-through-rate prediction model to predict the probability that the target ad creative will be clicked by the user. Taking an advertisement recommendation scenario as an example, the features can be used in the ad-creative recall step: assuming an ad creative matching certain creative features is currently to be recommended to a user, each ad creative in the ad-creative library can in turn be taken as the target ad creative and its target creative features output by the model provided in the embodiments of the disclosure; each set of target creative features is then compared with the creative features to be recommended, and the recall sequence is determined from the comparison results, for example by computing similarities, taking the top-ranked target number of creative features as those of the ad creatives to be recalled, and thereby obtaining the sequence of ad creatives to recall.
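A minimal sketch of the similarity-based recall step described above, assuming the creative features are plain vectors and cosine similarity is the comparison measure (one possible choice, not mandated by the disclosure):

```python
import numpy as np


def recall_top_k(target_feature, library_features, k=10):
    """Rank library creative features by cosine similarity to the feature that
    should be recommended and return the indices of the top-k candidates."""
    target = target_feature / np.linalg.norm(target_feature)
    library = library_features / np.linalg.norm(library_features, axis=1, keepdims=True)
    similarity = library @ target          # one score per creative in the library
    return np.argsort(-similarity)[:k]     # indices forming the recall sequence
```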
The video processing method provided by the embodiments of the disclosure extracts video features of the target creative video with a video feature extraction model obtained by training a preset neural network model containing a twin network on sample pairs formed by different creative videos that share the same content theme and target object. By exploiting the supervision signal that such sample pairs have high creative similarity, the training samples can be determined more reasonably and the training effect is improved. Inputting the video data corresponding to the target creative video into the video feature extraction model allows the latent representation information of the creative video to be fully mined; the video features determined from the output result have richer characterization capability and can provide higher-quality input features for downstream tasks such as video recommendation or video retrieval. Taking advertising creatives as an example, the advertising creative data corresponding to the target ad creative whose creative features are to be extracted is input into the advertising creative feature extraction model provided by the embodiments of the disclosure, and the corresponding target creative features are determined according to the output result. Because this model is trained using the high similarity between different ad creatives under the same ad plan, it can fully mine the latent representation information of ad creatives and output creative features with richer characterization capability, so that more accurate results can be obtained in concrete application scenarios.
In some embodiments, the video data includes video text data including a video title and video image data including a video cover image and a video frame.
In some embodiments, the advertisement creative feature extraction model is obtained by adopting the following model training method: inputting first advertisement creative sample data to a first feature extraction network in a preset neural network model to be trained, and inputting second advertisement creative sample data to a second feature extraction network in the preset neural network model, wherein the first advertisement creative sample data is derived from a first advertisement creative, the second advertisement creative sample data is derived from a second advertisement creative, the first advertisement creative and the second advertisement creative exist in the same advertisement plan, and the first feature extraction network and the second feature extraction network form a twin network; calculating a first loss function according to first feature data output by the first feature extraction network and second feature data output by the second feature extraction network; training the preset neural network model based on the first loss function to obtain a target neural network model; and determining an advertisement creative feature extraction model according to the trained first feature extraction network or second feature extraction network in the target neural network model.
Illustratively, the advertising creative sample data may include any one or more of text data, picture data, and video data included in the advertising creative. The text data can include, for example, advertising creative title data, advertising creative copy data, and voice text data obtained by voice recognition of sound data in the advertising creative; the picture data may include, for example, ad creative cover picture data, design drawing data in an ad creative, and the like; the video data may include, for example, ad creative video data. Optionally, the text data, the picture data, or the video data may be preprocessed to obtain corresponding sample data of the ad creative idea, such as keyword extraction processing, image scaling processing, video frame extraction processing, audio information extraction, and the like, which is not limited specifically. In addition, the sample data of the advertising creative may also include other data related to the advertising creative, such as industry related information and advertising target object information.
For example, when the preset neural network model is trained, the first advertising creative sample data may be input into the first feature extraction network to obtain the first feature data output by that network. The form of the first feature data is not limited; it may be, for example, a first feature vector, which may be referred to as an embedding vector of the advertising creative features. The second advertising creative sample data is fed into the second feature extraction network to obtain the second feature data output by that network. Similarly, the form of the second feature data is not limited but needs to be consistent with the first feature data; it may be, for example, a second feature vector.
Illustratively, since the first ad creative and the second ad creative belong to the same ad plan, their advertising creative features should have high similarity, and the two sets of feature data output by the first and second feature extraction networks should likewise be similar or be treated as similar creative features. The first loss function can be designed on this basis; the specific way the loss function is computed is not limited in the embodiments of the disclosure.
Illustratively, the value of the first loss function is continuously optimized through training means such as network parameter back propagation and the like, and then the preset neural network model is continuously optimized until a certain training cutoff condition is met. The specific training cutoff condition may be set according to actual requirements, and the embodiment of the present disclosure is not limited.
After training of the preset neural network model is completed, the network parameters in the trained first and second feature extraction networks have been optimized and adjusted synchronously, so the two networks end up completely consistent. Finally, the advertising creative feature extraction model is determined according to the trained first or second feature extraction network in the target neural network model; if other network structures are connected after the first or second feature extraction network, the trained feature extraction network together with those subsequently trained network structures is taken as the advertising creative feature extraction model.
In some embodiments, the first loss function comprises a similarity loss function. Training the preset neural network model based on the first loss function includes: based on the first loss function, training the preset neural network model with the training target that the similarity between the first feature data and the second feature data meets a preset requirement (for ease of distinction from the following text, this may be called the first preset requirement). Optionally, a first similarity between the first feature vector output by the first feature extraction network and the second feature vector output by the second feature extraction network is computed in a first preset way to obtain the first loss function. The first preset loss computation may include, for example, a contrastive loss, the KL divergence (Kullback-Leibler divergence), or a noise-contrastive estimation (NCE) loss. When the preset neural network model is trained in this way, the model can largely ignore differences between the first and second advertising creative sample data in features unrelated to the advertising creative, extract the features shared by the two as the advertising creative features, and thereby fully mine the latent representation information of the ad creative. The preset requirement can be set according to actual needs; for example, a first similarity threshold (e.g. 95%) may be set, and when the computed first similarity is greater than the first similarity threshold, the second feature data can be considered highly similar to the first feature data and the training target can be considered reached.
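A sketch of one of the options named above, the margin-based contrastive loss, assuming PyTorch; the margin value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb1, emb2, label, margin=1.0):
    """Classic contrastive loss over a batch of pairs.

    label = 1 for a positive pair (same ad plan), 0 for a negative pair."""
    dist = F.pairwise_distance(emb1, emb2)
    positive_term = label * dist.pow(2)                         # pull similar pairs together
    negative_term = (1 - label) * F.relu(margin - dist).pow(2)  # push dissimilar pairs apart
    return (positive_term + negative_term).mean()
```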
In some embodiments, the first advertising creative sample data and the second advertising creative sample data form a positive sample pair, and the first advertising creative sample data and third advertising creative sample data form a negative sample pair, where the third advertising creative sample data is derived from a third ad creative and the first and third ad creatives belong to different ad plans. The advantage of this setting is that, by constructing negative sample pairs reasonably, the model is made more suitable for the real application environment and the training effect of the model is improved. For a negative sample pair, when training with the similarity loss function, the preset neural network model may be trained based on the first loss function with the training target that the similarity between the two feature data of the pair meets a second preset requirement, where the second preset requirement can be set according to the actual situation. Specifically, a second similarity threshold (e.g. 5%) may be set; when the computed similarity is smaller than the second similarity threshold, the similarity between the two feature data can be considered to meet the second preset requirement. The second similarity threshold is smaller than the first similarity threshold, and the difference between the first and second similarity thresholds is greater than or equal to a preset difference threshold (e.g. 90%).
For example, when the number of training samples is large, or when, among the selected ad creatives, there are many creative pairs that belong to different ad plans, using all of them as negative sample pairs would seriously hurt training efficiency, so they can be screened. Optionally, in the embodiments of the disclosure, negative sample pairs may be constructed automatically by hard negative mining. Hard negative mining can be understood as mining as many hard negatives as possible during training and adding them to the negative sample set, which works better than a negative sample set composed only of easy negatives. In the embodiments of the disclosure, negative sample pairs may be constructed within the same batch using hard negative mining. For example, consider ad creatives a1 and a2 in ad plan A, ad creatives b1 and b2 in ad plan B, and ad creatives c1 and c2 in ad plan C. If inputting a1 and b1 into the preset neural network model as a candidate negative pair yields a first similarity, inputting a1 and c1 as a candidate negative pair yields a second similarity, and inputting a1 and a2 as a positive pair yields a third similarity, and the second similarity is much larger than the first and even close to the third, then a1 and c1 can be chosen as a negative sample pair for training.
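An in-batch hard-negative-mining sketch along the lines of the example above, assuming PyTorch embeddings and a plan identifier per sample; the use of cosine similarity is an assumption.

```python
import torch
import torch.nn.functional as F


def mine_hard_negatives(anchor_emb, candidate_emb, anchor_plan, candidate_plan):
    """Within one batch, pick for each anchor the most similar embedding that
    comes from a *different* ad plan, i.e. the hard negative described above.

    anchor_emb:    (B, D) embeddings of the anchor creatives
    candidate_emb: (B, D) embeddings of all creatives in the batch
    *_plan:        (B,)   plan ids used to mask out same-plan candidates"""
    sim = F.cosine_similarity(anchor_emb.unsqueeze(1), candidate_emb.unsqueeze(0), dim=-1)
    same_plan = anchor_plan.unsqueeze(1) == candidate_plan.unsqueeze(0)
    sim = sim.masked_fill(same_plan, float("-inf"))  # never pick a positive as a negative
    return sim.argmax(dim=1)                         # index of the hardest negative per anchor
```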
In some embodiments, the first loss function comprises a loss function for a binary classification problem. The method further includes: inputting the first advertising creative sample data into the first feature extraction network and the third advertising creative sample data into the second feature extraction network; and computing a second loss function according to the third feature data output by the first feature extraction network and the fourth feature data output by the second feature extraction network, where the second loss function is computed in the same way as the first loss function. Correspondingly, training the preset neural network model based on the first loss function includes: based on the first and second loss functions, training the preset neural network model with the training target that the first and second feature data belong to the same category while the third and fourth feature data belong to different categories. The advantage of this setting is that judging the similarity of the creative features of two sets of advertising creative data is turned into a binary classification problem, so the trained model can more accurately recognize the features shared by different ad creatives under the same ad plan. For example, the loss function for the binary classification problem may be the hinge loss, which maximizes the classification margin and reduces the number of misclassified samples. Loss functions for other binary classification formulations may also be used, such as the logistic loss, the cross-entropy loss or the modified Huber loss, which is not specifically limited.
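A sketch of the hinge loss over the pair-similarity score under the binary "same plan / different plan" formulation, assuming PyTorch and a cosine-similarity score (an assumed choice of score):

```python
import torch
import torch.nn.functional as F


def hinge_loss(emb1, emb2, label):
    """Hinge loss treating "same ad plan or not" as binary classification.

    label is +1 for a positive pair and -1 for a negative pair."""
    score = F.cosine_similarity(emb1, emb2)     # higher score = more similar
    return F.relu(1.0 - label * score).mean()   # max(0, 1 - y * s), averaged over the batch
```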
In some embodiments, the advertising creative sample data includes advertising creative text sample data and advertising creative image sample data. The first feature extraction network comprises a first network structure for text feature extraction, a second network structure for image feature extraction and a third network structure for feature fusion. Inputting the first advertising creative sample data into the first feature extraction network in the preset neural network model to be trained comprises: inputting the first advertising creative text sample data into the first network structure to output corresponding text extraction features; inputting the first advertising creative image sample data into the second network structure to output corresponding image extraction features; passing the text extraction features and the image extraction features through the third network structure to output corresponding fusion extraction features, where the third network structure is constructed based on a self-attention mechanism, i.e. it performs feature fusion based on the internal correlations of the text extraction features, the internal correlations of the image extraction features and the correlations between the text and image extraction features; and determining the first feature data output by the first feature extraction network according to the fusion extraction features. The advantage of this setting is that the multimodal data in the ad creative is reasonably selected as model input and the multimodal information is fused based on self-attention, so the overall features of the ad creative can be extracted and the creative features mined more comprehensively and deeply. Note that the first and second feature extraction networks have the same structure, and the second and third advertising creative sample data are input in the same way as the first advertising creative sample data.
Illustratively, the advertising creative text sample data may include creative title data, and may also include data obtained by processing the title, such as word segmentation or keyword extraction. The advertising creative image sample data may include cover picture data, video frame data and the like. The first network structure is used for text feature extraction and may be, for example, a bag-of-words model, a term frequency-inverse document frequency (TF-IDF) model, or a Bidirectional Encoder Representations from Transformers (BERT) model. The second network structure is used for image feature extraction and may be, for example, a residual network (ResNet) model or a Visual Geometry Group (VGG) model. The third network structure is used for cross-modal information fusion and may be, for example, a BERT model whose basic encoder adopts the Transformer structure; the Transformer mainly uses a self-attention mechanism to encode the input data in parallel, improving the global characterization capability of each element. Self-attention can learn the weight of each feature in the input sequence and produce the higher-level feature expression through weighted summation. Specifically, by means of the Transformer structure in BERT, the third network structure fuses the text extraction features and the image extraction features rather than simply concatenating them, so that the fusion extraction features obtained after fusion express the advertising creative features better.
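To make the weighted-summation idea concrete, a stripped-down self-attention sketch in PyTorch; unlike a real BERT/Transformer layer it omits the learned query/key/value projections and multiple heads, and is meant only to show how intra-text, intra-image and text-image correlations all enter through a single attention matrix.

```python
import torch
import torch.nn.functional as F


def self_attention(tokens):
    """Scaled dot-product self-attention over a fused token sequence.

    tokens: (seq_len, dim), e.g. text-token features followed by image or
    frame features, so the attention weights cover both modalities at once."""
    dim = tokens.size(-1)
    weights = F.softmax(tokens @ tokens.t() / dim ** 0.5, dim=-1)  # (seq, seq)
    return weights @ tokens  # each output token is a weighted sum of all inputs
```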
In some embodiments, the ad creative data includes ad creative text data and ad creative image data. Inputting the ad creative data into the advertising creative feature extraction model comprises: inputting the ad creative text data into the first network structure to output corresponding target text extraction features; inputting the ad creative image data into the second network structure to output corresponding target image extraction features; and passing the target text extraction features and the target image extraction features through the third network structure to output corresponding target fusion extraction features, so as to obtain the output result of the advertising creative feature extraction model.
In some embodiments, the advertising creative includes a video advertising creative. Before inputting the first advertising creative sample data into the first feature extraction network in the preset neural network model to be trained, the method further comprises: acquiring a cover image of the first ad creative; performing frame extraction on the creative video of the first ad creative to obtain a preset number of temporally ordered creative video frames; and determining the first advertising creative image sample data according to the cover image and the creative video frames. The advantage of this setting is that the image sample data is selected reasonably: while ensuring that the feature data carrying the creative information is extracted accurately, the size of a single sample is effectively controlled and model training efficiency is improved; in addition, at the model application stage, the amount of input data is reduced, which speeds up the extraction of the advertising creative features.
Illustratively, when performing frame extraction on the creative video, the frame-extraction rule can be set according to actual requirements, such as extraction at equal or unequal intervals. For unequal intervals, the video content of the creative video can be analyzed and the extraction frequency appropriately increased for video segments in which the advertising theme appears. For example, video content analysis is performed on the creative video and, according to the analysis result, the video is segmented into one or more first video segments containing the advertising theme and one or more second video segments not containing it; the first segments are sampled at a first frame-extraction frequency and the second segments at a second frame-extraction frequency, with the first frequency greater than the second. The specific frequencies can be set freely, or configured adaptively according to the preset number of frames. In addition, after the cover image and/or the creative video frames are acquired, the images can be preprocessed, for example resized, so that the processed images fit the preset neural network model. Moreover, determining the first advertising creative image sample data with the temporal relationships among the creative video frames taken into account expresses the content of the video more accurately.
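A sketch of the unequal-interval sampling described above, assuming an upstream content-analysis step has already labelled which segments contain the advertising theme; the specific frame rates are illustrative assumptions.

```python
def sample_by_segment(segments, first_rate=2.0, second_rate=0.5):
    """Return frame timestamps, sampling theme segments more densely.

    segments: list of (start_sec, end_sec, contains_theme) tuples, assumed to
    come from an upstream video-content analysis step.
    first_rate / second_rate: frames per second for segments with / without
    the advertising theme (first_rate > second_rate, as described above)."""
    timestamps = []
    for start, end, contains_theme in segments:
        step = 1.0 / (first_rate if contains_theme else second_rate)
        t = start
        while t < end:
            timestamps.append(round(t, 3))
            t += step
    return timestamps


# Example: a 30-second creative where the theme appears between 5s and 15s.
ts = sample_by_segment([(0, 5, False), (5, 15, True), (15, 30, False)])
```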
In some embodiments, the obtaining ad creative data corresponding to the target ad creative includes: acquiring a target title text of the target advertisement creative, and determining the text data of the advertisement creative according to the target title text; acquiring a target cover image of the target advertisement creative; performing frame extraction processing on the advertisement creative video of the target advertisement creative to obtain a preset number of target advertisement creative video frames with time sequence relevance; the ad creative image data is determined from the target cover image and the target ad creative video frame.
The following further describes the training process for the ad creative feature extraction model.
Fig. 2 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure, optimized on the basis of the alternatives in the above embodiments and described with a video advertising creative as the example. The method specifically includes the following steps:
Step 201, obtaining sample pair data.
The sample pair data includes positive sample pairs and negative sample pairs. The two samples in a positive pair come from the same ad plan, and the two samples in a negative pair come from different ad plans. In addition, negative pairs can be constructed within the same batch by hard negative mining; see the related description above.
Illustratively, any piece of sample data may be obtained by:
and acquiring the title, cover picture and creative video of the advertisement creative corresponding to the sample data. The method comprises the steps of preprocessing a title, a cover picture and a creative video, for example, performing word segmentation processing on the title in a double Byte Encoding (BPE) mode, performing frame extraction processing on the creative video, performing length and width preprocessing on the cover picture and the picture after the video is extracted, and finally processing the picture into a picture size matched with a model. The pre-processing result can be expressed as: ad creative title: the [ BPE word 1, BPE word 2,. ·, BPE word n ], may specifically be n vectors; advertisement cover picture: a matrix of pixel values of size H x W x 3 (H for length, W for width, 3 for number of RGB channels) convertible into a vector; advertisement creative video: [ decimated frame fig. 1, decimated frame fig. 2, decimated frame fig. 10], assuming 10 frames are decimated, the size of each decimated frame may also be H × W × 3 pixel value matrix, which may be converted into 10 vectors, and each map corresponds to one vector. And generating corresponding sample data according to the preprocessing result.
Fig. 3 is a schematic structural diagram of a preset neural network model according to an embodiment of the present disclosure. Sample pair data is constructed from advertising creative 1 and advertising creative 2. Taking advertising creative 1 as an example, ad title 1 is preprocessed to obtain text sample data; ad video 1 is frame-extracted (assume 10 frames) to obtain 10 video frame pictures, which are preprocessed together with ad cover picture 1 to obtain image sample data; the text sample data and the image sample data together form the first sample data corresponding to advertising creative 1. The construction process for advertising creative 2 is similar and is not repeated.
Step 202, inputting first sample data in the sample pair data into a first feature extraction network in a preset neural network model to be trained, and inputting second sample data into a second feature extraction network in the preset neural network model.
The preset neural network model contains a twin network; the first feature extraction network and the second feature extraction network are identical, and only the specific structure of the first is shown in fig. 3. The first feature extraction network comprises a first network structure for text feature extraction, a second network structure for image feature extraction and a third network structure for feature fusion. Illustratively, the first network structure comprises a BERT model structure; the second network structure comprises a ResNet model structure; and the third network structure comprises a multimodal BERT model structure (MultiModal_Bert) implemented with the self-attention mechanism.
The input process is essentially the same for the first sample data and the second sample data; the input of the sample data corresponding to advertising creative 1 is described as an example with reference to fig. 3. The text sample data is input into the BERT model structure to output the corresponding text extraction features; the image sample data is input into the ResNet model structure and passed through a dense image (Dense_Image) layer to output the corresponding image extraction features; the text extraction features and the image extraction features are concatenated (concat) and passed through the MultiModal_Bert model structure to output the corresponding fusion extraction features, which are then processed by a max pooling (Max Pooling) layer and a dense output (Dense_Output) layer to obtain the first feature data (Creative1_emb) output by the first feature extraction network. The specific structures and numbers of weight parameters of each module and neural network layer can be set according to actual requirements. Similarly, the sample data corresponding to advertising creative 2 yields the second feature data (Creative2_emb) via the second feature extraction network.
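A condensed PyTorch sketch of one twin branch following Fig. 3; the BERT and MultiModal_Bert structures are stood in for by lightweight Transformer encoders, ResNet by torchvision's resnet18, and all layer sizes are assumptions rather than the configuration actually used.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # stand-in for the ResNet branch


class CreativeEncoder(nn.Module):
    """One twin branch: text encoder, image encoder, self-attention fusion,
    max pooling, output embedding (Creative_emb)."""

    def __init__(self, vocab_size=30000, dim=256, out_dim=128):
        super().__init__()
        # Text branch (stands in for the BERT structure).
        self.token_emb = nn.Embedding(vocab_size, dim)
        text_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(text_layer, num_layers=2)
        # Image branch: ResNet backbone followed by the Dense_Image projection.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                  # expose 512-d image features
        self.image_encoder = backbone
        self.dense_image = nn.Linear(512, dim)
        # Fusion branch (stands in for MultiModal_Bert, self-attention based).
        fusion_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.dense_output = nn.Linear(dim, out_dim)  # Dense_Output -> Creative_emb

    def forward(self, title_ids, images):
        # title_ids: (B, n_tokens); images: (B, n_images, 3, H, W), cover + frames.
        text_feat = self.text_encoder(self.token_emb(title_ids))      # (B, n_tokens, dim)
        b, n, c, h, w = images.shape
        img_feat = self.dense_image(self.image_encoder(images.reshape(b * n, c, h, w)))
        img_feat = img_feat.reshape(b, n, -1)                         # (B, n_images, dim)
        fused = self.fusion(torch.cat([text_feat, img_feat], dim=1))  # concat then fuse
        pooled = fused.max(dim=1).values                              # Max Pooling over tokens
        return self.dense_output(pooled)                              # Creative_emb


# Shapes for a single sample: 8 title tokens, 1 cover + 10 frames of 224x224.
emb = CreativeEncoder()(torch.randint(0, 30000, (1, 8)),
                        torch.randn(1, 11, 3, 224, 224))
```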
Step 203, calculating a hinge loss function according to the first feature data output by the first feature extraction network and the second feature data output by the second feature extraction network.
As shown in fig. 3, the hinge loss is calculated from Creative1_emb and Creative2_emb.
Step 204, when the first feature data and the second feature data correspond to a positive example sample pair, training the preset neural network model with the training target that the first feature data and the second feature data belong to the same category; and when the first feature data and the second feature data correspond to a negative example sample pair, training the preset neural network model with the training target that the first feature data and the second feature data belong to different categories, so as to obtain a target neural network model.
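A hedged sketch of the hinge-loss objective in steps 203–204; the use of cosine similarity and the margin value are assumptions, since the disclosure only states that a hinge loss is computed from the two branch outputs:

```python
import torch
import torch.nn.functional as F


def pair_hinge_loss(emb1, emb2, label, margin=0.2):
    """label = 1 for a positive example sample pair, -1 for a negative one.

    Cosine similarity and the margin are illustrative choices; the disclosure
    only specifies that a hinge loss is computed from Creative1_emb and
    Creative2_emb.
    """
    sim = F.cosine_similarity(emb1, emb2)  # in [-1, 1]
    # Positive pairs are trained toward the same category (high similarity),
    # negative pairs toward different categories (low similarity).
    return torch.clamp(margin - label * sim, min=0).mean()
```

In use, `emb1` and `emb2` would be the outputs of the two weight-sharing branches for one sampled pair.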
Step 205, determining an advertisement creative feature extraction model according to the trained first feature extraction network or second feature extraction network in the target neural network model.
The model training method provided by the embodiment of the disclosure performs model training by using the supervised information that different advertisement creatives under the same advertisement plan have higher similarity. When determining the input data, the multi-modal information in the advertisement creative is fully considered, and a training sample is generated from the title, the cover picture and the frames extracted from the advertisement video. In the model, a feature fusion network structure based on an attention mechanism fuses the multi-modal information and simultaneously completes cross-modal and same-modal feature interaction. During training, a hinge loss function is used, and negative example samples are constructed by a negative example mining method, which can improve the training efficiency and the training effect. As a result, the trained advertisement creative feature extraction model can more fully mine the latent representation information of advertisement creatives and produce advertisement creative features with richer representation capability, thereby providing higher-quality input features for downstream tasks such as advertisement recommendation or advertisement retrieval.
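One possible way to construct positive pairs and mined negative example pairs from advertisement plans, as described above; the grouping data structure and the sampling strategy are illustrative assumptions:

```python
import random


def build_pairs(creatives_by_plan):
    """creatives_by_plan: dict mapping an advertisement plan id to the list of
    creative ids under that plan (an assumed data layout, for illustration).

    Positive example pairs: two different creatives under the same plan.
    Negative example pairs: creatives drawn from two different plans.
    """
    pairs = []
    plan_ids = list(creatives_by_plan)
    for plan_id, creatives in creatives_by_plan.items():
        # Positive pairs within the same plan (same content theme / target object).
        for i in range(len(creatives) - 1):
            pairs.append((creatives[i], creatives[i + 1], 1))
        # Mined negative pair: pair one creative with a creative from another plan.
        others = [p for p in plan_ids if p != plan_id and creatives_by_plan[p]]
        if creatives and others:
            other_plan = random.choice(others)
            pairs.append((creatives[0],
                          random.choice(creatives_by_plan[other_plan]), -1))
    return pairs
```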
Fig. 4 is a flowchart of another video processing method provided by an embodiment of the disclosure, which is suitable for extracting features of an advertisement creative of the video type and can be executed by a model using apparatus, wherein the apparatus can be implemented by software and/or hardware and can generally be integrated in a computer device. As shown in fig. 4, the method includes:
Step 401, a target title text of the target advertisement creative is obtained, and advertisement creative text data is determined according to the target title text.
Step 402, a target cover image of a target ad creative is obtained.
Step 403, performing frame extraction processing on the advertisement creative video of the target advertisement creative to obtain a preset number of target advertisement creative video frames with time sequence relevance.
Step 404, inputting the ad creative text data, the target cover image and the target ad creative video frame as the ad creative data corresponding to the target ad creative into the ad creative feature extraction model.
The relevant technical details of the ad creative feature extraction model may refer to the above relevant description, and are not repeated herein.
Step 405, determining the target advertisement creative characteristic corresponding to the target advertisement creative according to the output result of the advertisement creative characteristic extraction model.
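Putting steps 401–405 together, the inference flow might look like the following sketch, which reuses the assumed preprocessing helpers and `CreativeBranch` from the earlier sketches; the tokenizer choice is also an assumption:

```python
import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed tokenizer


@torch.no_grad()
def extract_creative_feature(branch, title, cover_image, video_path):
    # Steps 401-403: title text, target cover image and extracted video frames
    # (preprocess_image / extract_frames come from the earlier preprocessing sketch).
    enc = tokenizer(title, return_tensors="pt", truncation=True)
    frames = [preprocess_image(cover_image)] + extract_frames(video_path)
    images = torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2).unsqueeze(0)
    # Steps 404-405: feed the advertisement creative data into one trained
    # branch of the twin network; its output is the target creative feature.
    branch.eval()
    return branch(enc["input_ids"], enc["attention_mask"], images)
```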
The video processing method provided by the embodiment of the disclosure fully considers the multi-modal information in the video advertisement creative: the data input into the model is formed from the title, the cover picture and the frames extracted from the advertisement video, and a feature fusion network structure based on an attention mechanism in the model fuses the multi-modal information and completes cross-modal and same-modal feature interaction. The method can therefore fully mine the potential representation information of the video advertisement creative and generate advertisement creative features with richer representation capability, thereby obtaining more accurate results in specific application scenarios.
It should be noted that the above description takes advertisement creatives as an example of creative videos; for other types of creative videos, those skilled in the art can reasonably and adaptively replace the relevant technical features, which the embodiments of the present disclosure do not enumerate one by one.
Fig. 5 is a block diagram of a video processing apparatus according to an embodiment of the present disclosure. The apparatus may be implemented by software and/or hardware, may generally be integrated in a computer device, and can perform creative video feature extraction by executing the video processing method. As shown in fig. 5, the apparatus includes:
a video data obtaining module 501, configured to obtain video data corresponding to a target creative video;
a video data input module 502, configured to input the video data into a video feature extraction model, where the video feature extraction model is obtained by training a preset neural network model including a twin network, the twin network includes two networks with the same structure and sharing weights, a training sample corresponding to the video feature extraction model includes a sample pair formed by at least two creative videos, and the at least two creative videos have the same content theme and target object;
and a video feature determination module 503, configured to determine, according to an output result of the video feature extraction model, a target video feature corresponding to the target creative video.
The video processing device provided by the embodiment of the disclosure extracts video features of a target creative video by using a video feature extraction model obtained by training a preset neural network model containing a twin network with sample pairs formed by different creative videos having the same content theme and target object. The supervised information that the creatives in a sample pair have higher similarity is skillfully used to train the model, so the training samples can be determined more reasonably and the training effect is improved. The video data corresponding to the target creative video requiring feature extraction is input into the video feature extraction model, the potential representation information of the creative video can be fully mined, and the corresponding video features are determined according to the output result; the representation capability of the video features is richer, and higher-quality input features can be provided for downstream tasks such as video recommendation or video retrieval.
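For downstream tasks such as video recommendation or video retrieval, the extracted target video features can be compared directly; a minimal sketch, assuming cosine similarity as the ranking score (not mandated by the disclosure):

```python
import torch
import torch.nn.functional as F


def rank_by_similarity(query_feature, candidate_features):
    """Rank candidate creative videos by similarity to a query creative.

    query_feature: tensor of shape (emb_dim,)
    candidate_features: tensor of shape (num_candidates, emb_dim)
    Cosine similarity is an assumed scoring choice for illustration.
    """
    scores = F.cosine_similarity(query_feature.unsqueeze(0), candidate_features)
    return torch.argsort(scores, descending=True)  # indices, best match first
```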
Optionally, the video data includes video text data and video image data, the video text data includes a video title, and the video image data includes a video cover image and a video frame.
Optionally, the video feature extraction model is obtained by using the following model training method: inputting first video sample data to a first feature extraction network in a preset neural network model to be trained, and inputting second video sample data to a second feature extraction network in the preset neural network model, wherein the first video sample data is derived from a first creative video, the second video sample data is derived from a second creative video, the first creative video and the second creative video form a sample pair, and the first feature extraction network and the second feature extraction network form a twin network; calculating a first loss function according to first feature data output by the first feature extraction network and second feature data output by the second feature extraction network; training the preset neural network model based on the first loss function to obtain a target neural network model; and determining a video feature extraction model according to the trained first feature extraction network or second feature extraction network in the target neural network model.
Optionally, the first loss function comprises a similarity loss function. The training the preset neural network model based on the first loss function includes: and training the preset neural network model by taking the similarity of the first characteristic data and the second characteristic data meeting a preset requirement as a training target based on the first loss function.
Optionally, the first loss function includes a loss function corresponding to a two-class problem; the first video sample data and the second video sample data form a positive example sample pair; the first video sample data and third video sample data form a negative example sample pair, wherein the third video sample data is derived from a third creative video, the first creative video and the third creative video having different content themes and/or target objects.
The model training method further comprises the following steps: inputting the first video sample data to the first feature extraction network, and inputting the third video sample data to the second feature extraction network; calculating a second loss function according to third feature data output by the first feature extraction network and fourth feature data output by the second feature extraction network, wherein the second loss function is calculated in the same way as the first loss function;
optionally, the training the preset neural network model based on the first loss function includes: and training the preset neural network model by taking the first characteristic data and the second characteristic data with the same category and the third characteristic data and the fourth characteristic data with different categories as training targets based on the first loss function and the second loss function.
Optionally, the video sample data includes video text sample data and video image sample data; the first feature extraction network comprises a first network structure for text feature extraction, a second network structure for image feature extraction and a third network structure for feature fusion. The inputting of the first video sample data into the first feature extraction network in the preset neural network model to be trained includes: inputting first video text sample data into the first network structure to output corresponding text extraction features; inputting first video image sample data into the second network structure to output corresponding image extraction features; outputting corresponding fusion extraction features by the text extraction features and the image extraction features through the third network structure, wherein the third network structure performs feature fusion based on the internal correlation of the text extraction features, the internal correlation of the image extraction features and the correlation between the text extraction features and the image extraction features; and determining first feature data output by the first feature extraction network according to the fusion extraction features.
Optionally, the inputting the video data into a video feature extraction model includes: inputting video text data into the first network structure to output corresponding target text extraction features; inputting video image data into the second network structure to output corresponding target image extraction features; and outputting corresponding target fusion extraction features by the target text extraction features and the target image extraction features through the third network structure to obtain an output result of the video feature extraction model.
Optionally, before the inputting the first video sample data to the first feature extraction network in the preset neural network model to be trained, the method further includes: acquiring a cover image of the first creative video; performing frame extraction processing on the first creative video to obtain a preset number of video frames with time sequence relevance; and determining the first video image sample data according to the cover image and the video frame.
Optionally, the obtaining of the video data corresponding to the target creative video includes: acquiring a target title text of the target creative video, and determining the video text data according to the target title text; acquiring a target cover image of the target creative video; performing frame extraction processing on the target creative video to obtain a preset number of target video frames with time sequence relevance; determining the video data according to the target cover image and the target video frame.
Optionally, the advertisement creative feature extraction model is obtained by adopting the following model training method: inputting first advertisement creative sample data to a first feature extraction network in a preset neural network model to be trained, and inputting second advertisement creative sample data to a second feature extraction network in the preset neural network model, wherein the first advertisement creative sample data is derived from a first advertisement creative, the second advertisement creative sample data is derived from a second advertisement creative, the first advertisement creative and the second advertisement creative exist in the same advertisement plan, and the first feature extraction network and the second feature extraction network form a twin network; calculating a first loss function according to first feature data output by the first feature extraction network and second feature data output by the second feature extraction network; training the preset neural network model based on the first loss function to obtain a target neural network model; and determining an advertisement creative feature extraction model according to the trained first feature extraction network or second feature extraction network in the target neural network model.
Optionally, the first loss function comprises a similarity loss function; the training the preset neural network model based on the first loss function includes: and training the preset neural network model by taking the similarity of the first characteristic data and the second characteristic data meeting a preset requirement as a training target based on the first loss function.
Optionally, the first loss function includes a loss function corresponding to a two-class problem; the first advertisement creative sample data and the second advertisement creative sample data form a positive example sample pair; the first advertisement creative sample data and third advertisement creative sample data form a negative example sample pair, wherein the third advertisement creative sample data is derived from a third advertisement creative, and the first advertisement creative and the third advertisement creative exist in different advertisement plans;
the model training method further comprises the following steps: inputting the first advertisement creative sample data into the first feature extraction network, and inputting the third advertisement creative sample data into the second feature extraction network; calculating a second loss function according to third feature data output by the first feature extraction network and fourth feature data output by the second feature extraction network, wherein the second loss function is calculated in the same way as the first loss function;
correspondingly, the training the preset neural network model based on the first loss function includes: and training the preset neural network model by taking the first characteristic data and the second characteristic data with the same category and the third characteristic data and the fourth characteristic data with different categories as training targets based on the first loss function and the second loss function.
Optionally, the advertisement creative sample data includes advertisement creative text sample data and advertisement creative image sample data; the first feature extraction network comprises a first network structure for text feature extraction, a second network structure for image feature extraction and a third network structure for feature fusion; the inputting of the first advertisement creative sample data into the first feature extraction network in the preset neural network model to be trained comprises: inputting first advertisement creative text sample data into the first network structure to output corresponding text extraction features; inputting first advertisement creative image sample data into the second network structure to output corresponding image extraction features; outputting corresponding fusion extraction features by the text extraction features and the image extraction features through the third network structure, wherein the third network structure performs feature fusion based on the internal correlation of the text extraction features, the internal correlation of the image extraction features and the correlation between the text extraction features and the image extraction features; and determining first feature data output by the first feature extraction network according to the fusion extraction features.
Optionally, the advertisement creative data includes advertisement creative text data and advertisement creative image data; the inputting the advertisement creative data into an advertisement creative feature extraction model comprises:
inputting ad creative text data into the first network structure to output corresponding target text extraction features; inputting ad creative image data into the second network structure to output corresponding target image extraction features; and outputting corresponding target fusion extraction features by the target text extraction features and the target image extraction features through the third network structure to obtain an output result of the advertisement creative feature extraction model.
Optionally, the ad creative includes a video ad creative; before the inputting the first advertisement creative sample data into the first feature extraction network in the preset neural network model to be trained, the method further comprises the following steps: acquiring a cover image of the first advertising creative; performing frame extraction processing on the advertisement creative video of the first advertisement creative to obtain a preset number of advertisement creative video frames with time sequence relevance; and determining the first advertisement creative image sample data according to the cover image and the advertisement creative video frame.
Optionally, the obtaining of advertisement creative data corresponding to the target advertisement creative includes: acquiring a target title text of the target advertisement creative, and determining the text data of the advertisement creative according to the target title text; acquiring a target cover image of the target advertisement creative; performing frame extraction processing on the advertisement creative video of the target advertisement creative to obtain a preset number of target advertisement creative video frames with time sequence relevance; the ad creative image data is determined from the target cover image and the target ad creative video frame.
Referring now to FIG. 6, shown is a schematic block diagram of a computer device 600 suitable for use in implementing embodiments of the present disclosure. The computer device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The computer device shown in fig. 6 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the computer device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the computer device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the computer device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates a computer device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the computer device; or may exist separately and not be incorporated into the computer device.
The computer readable medium carries one or more programs which, when executed by the computer device, cause the computer device to: acquire video data corresponding to a target creative video; input the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and sharing weights, a training sample corresponding to the video feature extraction model comprises a sample pair formed by at least two creative videos, and the at least two creative videos have the same content theme and target object; and determine target video features corresponding to the target creative video according to the output result of the video feature extraction model.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation on the module itself, and for example, a video data acquisition module may also be described as a "module that acquires video data corresponding to a target creative video".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a video processing method including:
acquiring video data corresponding to a target creative video;
inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and sharing weights, a training sample corresponding to the video feature extraction model comprises a sample pair formed by at least two creative videos, and the at least two creative videos have the same content theme and target object;
and determining the target video characteristics corresponding to the target creative video according to the output result of the video characteristic extraction model.
Optionally, the video data includes video text data and video image data, the video text data includes a video title, and the video image data includes a video cover image and a video frame.
Optionally, the video feature extraction model is obtained by using the following model training method:
inputting first video sample data to a first feature extraction network in a preset neural network model to be trained, and inputting second video sample data to a second feature extraction network in the preset neural network model, wherein the first video sample data is derived from a first creative video, the second video sample data is derived from a second creative video, the first creative video and the second creative video form a sample pair, and the first feature extraction network and the second feature extraction network form a twin network;
calculating a first loss function according to first feature data output by the first feature extraction network and second feature data output by the second feature extraction network;
training the preset neural network model based on the first loss function to obtain a target neural network model;
and determining a video feature extraction model according to the trained first feature extraction network or second feature extraction network in the target neural network model.
Optionally, the first loss function comprises a similarity loss function;
the training the preset neural network model based on the first loss function includes:
and training the preset neural network model by taking the similarity of the first characteristic data and the second characteristic data meeting a preset requirement as a training target based on the first loss function.
Optionally, the first loss function includes a loss function corresponding to a two-class problem; the first video sample data and the second video sample data form a positive example sample pair; the first video sample data and third video sample data form a negative example sample pair, wherein the third video sample data is derived from a third creative video, the first creative video and the third creative video having different content themes and/or target objects;
the model training method further comprises the following steps:
inputting the first video sample data to the first feature extraction network, and inputting the third video sample data to the second feature extraction network;
calculating a second loss function according to third feature data output by the first feature extraction network and fourth feature data output by the second feature extraction network, wherein the second loss function is calculated in the same way as the first loss function;
correspondingly, the training the preset neural network model based on the first loss function includes:
and training the preset neural network model by taking the first characteristic data and the second characteristic data with the same category and the third characteristic data and the fourth characteristic data with different categories as training targets based on the first loss function and the second loss function.
Optionally, the video sample data includes video text sample data and video image sample data; the first feature extraction network comprises a first network structure for text feature extraction, a second network structure for image feature extraction and a third network structure for feature fusion;
the inputting of the first video sample data into the first feature extraction network in the preset neural network model to be trained includes:
inputting first video text sample data into the first network structure to output corresponding text extraction features;
inputting first video image sample data into the second network structure to output corresponding image extraction features;
outputting corresponding fusion extraction features by the text extraction features and the image extraction features through the third network structure, wherein the third network structure performs feature fusion based on the internal correlation of the text extraction features, the internal correlation of the image extraction features and the correlation between the text extraction features and the image extraction features;
determining first feature data output by the first feature extraction network according to the fusion extraction features;
optionally, the inputting the video data into a video feature extraction model includes:
inputting video text data into the first network structure to output corresponding target text extraction features;
inputting video image data into the second network structure to output corresponding target image extraction features;
and outputting corresponding target fusion extraction features by the target text extraction features and the target image extraction features through the third network structure to obtain an output result of the video feature extraction model.
Optionally, before the inputting the first video sample data to the first feature extraction network in the preset neural network model to be trained, the method further includes:
acquiring a cover image of the first creative video;
performing frame extraction processing on the first creative video to obtain a preset number of video frames with time sequence relevance;
determining the first video image sample data according to the cover image and the video frame;
optionally, the obtaining of the video data corresponding to the target creative video includes:
acquiring a target title text of the target creative video, and determining the video text data according to the target title text;
acquiring a target cover image of the target creative video;
performing frame extraction processing on the target creative video to obtain a preset number of target video frames with time sequence relevance;
determining the video data according to the target cover image and the target video frame.
According to one or more embodiments of the present disclosure, there is provided a video processing apparatus including:
the video data acquisition module is used for acquiring video data corresponding to the target creative video;
the video feature extraction module is used for inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and sharing weight, training samples corresponding to the video feature extraction model comprise sample pairs formed by at least two creative videos, and the at least two creative videos have the same content theme and target objects;
and the video characteristic determining module is used for determining the target video characteristics corresponding to the target creative video according to the output result of the video characteristic extracting model.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A video processing method, comprising:
acquiring video data corresponding to a target creative video;
inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and sharing weights, a training sample corresponding to the video feature extraction model comprises a sample pair formed by at least two creative videos, and the at least two creative videos have the same content theme and target object;
and determining the target video characteristics corresponding to the target creative video according to the output result of the video characteristic extraction model.
2. The video processing method of claim 1, wherein the video data comprises video text data and video image data, the video text data comprising a video title, the video image data comprising a video cover image and a video frame.
3. The video processing method according to claim 2, wherein the video feature extraction model is obtained by using a model training method comprising:
inputting first video sample data to a first feature extraction network in a preset neural network model to be trained, and inputting second video sample data to a second feature extraction network in the preset neural network model, wherein the first video sample data is derived from a first creative video, the second video sample data is derived from a second creative video, the first creative video and the second creative video form a sample pair, and the first feature extraction network and the second feature extraction network form a twin network;
calculating a first loss function according to first feature data output by the first feature extraction network and second feature data output by the second feature extraction network;
training the preset neural network model based on the first loss function to obtain a target neural network model;
and determining a video feature extraction model according to the trained first feature extraction network or second feature extraction network in the target neural network model.
4. The video processing method of claim 3, wherein the first loss function comprises a similarity loss function;
the training the preset neural network model based on the first loss function includes:
and training the preset neural network model by taking the similarity of the first characteristic data and the second characteristic data meeting a preset requirement as a training target based on the first loss function.
5. The video processing method according to claim 3, wherein the first loss function comprises a loss function corresponding to a two-class problem; the first video sample data and the second video sample data form a positive example sample pair; the first video sample data and third video sample data form a negative example sample pair, wherein the third video sample data is derived from a third creative video, the first creative video and the third creative video having different content themes and/or target objects;
the model training method further comprises the following steps:
inputting the first video sample data to the first feature extraction network, and inputting the third video sample data to the second feature extraction network;
calculating a second loss function according to third feature data output by the first feature extraction network and fourth feature data output by the second feature extraction network, wherein the second loss function is calculated in the same way as the first loss function;
correspondingly, the training the preset neural network model based on the first loss function includes:
and training the preset neural network model by taking the first characteristic data and the second characteristic data with the same category and the third characteristic data and the fourth characteristic data with different categories as training targets based on the first loss function and the second loss function.
6. The video processing method according to any one of claims 2-5, wherein the video sample data includes video text sample data and video image sample data; the first feature extraction network comprises a first network structure for text feature extraction, a second network structure for image feature extraction and a third network structure for feature fusion;
the inputting of the first video sample data into the first feature extraction network in the preset neural network model to be trained includes:
inputting first video text sample data into the first network structure to output corresponding text extraction features;
inputting first video image sample data into the second network structure to output corresponding image extraction features;
outputting corresponding fusion extraction features by the text extraction features and the image extraction features through the third network structure, wherein the third network structure performs feature fusion based on the internal correlation of the text extraction features, the internal correlation of the image extraction features and the correlation between the text extraction features and the image extraction features;
determining first feature data output by the first feature extraction network according to the fusion extraction features;
correspondingly, the inputting the video data into a video feature extraction model includes:
inputting video text data into the first network structure to output corresponding target text extraction features;
inputting video image data into the second network structure to output corresponding target image extraction features;
and outputting corresponding target fusion extraction features by the target text extraction features and the target image extraction features through the third network structure to obtain an output result of the video feature extraction model.
7. The video processing method according to claim 6, wherein before said inputting the first video sample data into the first feature extraction network in the preset neural network model to be trained, further comprising:
acquiring a cover image of the first creative video;
performing frame extraction processing on the first creative video to obtain a preset number of video frames with time sequence relevance;
determining the first video image sample data according to the cover image and the video frame;
correspondingly, the acquiring of the video data corresponding to the target creative video includes:
acquiring a target title text of the target creative video, and determining the video text data according to the target title text;
acquiring a target cover image of the target creative video;
performing frame extraction processing on the target creative video to obtain a preset number of target video frames with time sequence relevance;
determining the video data according to the target cover image and the target video frame.
8. A video processing apparatus, comprising:
the video data acquisition module is used for acquiring video data corresponding to the target creative video;
the video feature extraction module is used for inputting the video data into a video feature extraction model, wherein the video feature extraction model is obtained by training a preset neural network model containing a twin network, the twin network comprises two networks with the same structure and sharing weight, training samples corresponding to the video feature extraction model comprise sample pairs formed by at least two creative videos, and the at least two creative videos have the same content theme and target objects;
and the video characteristic determining module is used for determining the target video characteristics corresponding to the target creative video according to the output result of the video characteristic extracting model.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the computer program.
CN202110182078.XA 2021-02-09 2021-02-09 Video processing method, device, storage medium and equipment Pending CN112905840A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110182078.XA CN112905840A (en) 2021-02-09 2021-02-09 Video processing method, device, storage medium and equipment
PCT/CN2022/075469 WO2022171067A1 (en) 2021-02-09 2022-02-08 Video processing method and apparatus, and storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110182078.XA CN112905840A (en) 2021-02-09 2021-02-09 Video processing method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112905840A true CN112905840A (en) 2021-06-04

Family

ID=76123317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110182078.XA Pending CN112905840A (en) 2021-02-09 2021-02-09 Video processing method, device, storage medium and equipment

Country Status (2)

Country Link
CN (1) CN112905840A (en)
WO (1) WO2022171067A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309080B (en) * 2023-05-11 2023-08-11 武汉纺织大学 Unmanned aerial vehicle video stitching method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324769A (en) * 2020-01-20 2020-06-23 腾讯科技(北京)有限公司 Training method of video information processing model, video information processing method and device
CN111354017A (en) * 2020-03-04 2020-06-30 江南大学 Target tracking method based on twin neural network and parallel attention module
CN111506773A (en) * 2020-03-24 2020-08-07 中国科学院大学 Video duplicate removal method based on unsupervised depth twin network
CN112101329A (en) * 2020-11-19 2020-12-18 腾讯科技(深圳)有限公司 Video-based text recognition method, model training method and model training device
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309359B (en) * 2019-05-20 2021-06-15 北京大学 Video correlation prediction method, device, equipment and storage medium
CN110807400A (en) * 2019-10-29 2020-02-18 北京师范大学 Twin network-based collapse hidden danger characteristic information extraction method
CN111027069B (en) * 2019-11-29 2022-04-08 暨南大学 Malicious software family detection method, storage medium and computing device
CN112905840A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video processing method, device, storage medium and equipment


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022171067A1 (en) * 2021-02-09 2022-08-18 北京有竹居网络技术有限公司 Video processing method and apparatus, and storage medium and device
CN113378781A (en) * 2021-06-30 2021-09-10 北京百度网讯科技有限公司 Training method and device of video feature extraction model and electronic equipment
CN113378781B (en) * 2021-06-30 2022-08-05 北京百度网讯科技有限公司 Training method and device of video feature extraction model and electronic equipment
CN113420733A (en) * 2021-08-23 2021-09-21 北京黑马企服科技有限公司 Efficient distributed big data acquisition implementation method and system
CN113420733B (en) * 2021-08-23 2021-12-31 北京黑马企服科技有限公司 Efficient distributed big data acquisition implementation method and system
WO2023134549A1 (en) * 2022-01-14 2023-07-20 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
WO2024051609A1 (en) * 2022-09-09 2024-03-14 北京沃东天骏信息技术有限公司 Advertisement creative data selection method and apparatus, model training method and apparatus, and device and storage medium
CN117690064A (en) * 2024-02-04 2024-03-12 广东电网有限责任公司广州供电局 Transmission line detection method, transmission line detection device, electronic equipment and computer readable medium
CN117690064B (en) * 2024-02-04 2024-04-16 广东电网有限责任公司广州供电局 Transmission line detection method, transmission line detection device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
WO2022171067A1 (en) 2022-08-18

Similar Documents

Publication Publication Date Title
CN112905840A (en) Video processing method, device, storage medium and equipment
US20190392487A1 (en) System, Device, and Method of Automatic Construction of Digital Advertisements
CN110162670B (en) Method and device for generating expression package
CN110134931B (en) Medium title generation method, medium title generation device, electronic equipment and readable medium
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN102884538A (en) Enriching online videos by content detection, searching, and information aggregation
CN111277892B (en) Method, apparatus, server and medium for selecting video clip
CN112905839A (en) Model training method, model using device, storage medium and equipment
CN111897950A (en) Method and apparatus for generating information
CN111178056A (en) Deep learning based file generation method and device and electronic equipment
CN115203539B (en) Media content recommendation method, device, equipment and storage medium
CN109816023B (en) Method and device for generating picture label model
CN112954453B (en) Video dubbing method and device, storage medium and electronic equipment
WO2023179308A1 (en) Image description generation method and apparatus, device, medium, and product
US20230367972A1 (en) Method and apparatus for processing model data, electronic device, and computer readable medium
CN114786069A (en) Video generation method, device, medium and electronic equipment
CN117121406A (en) Broadcasting contextual information by modifying audio and video interfaces
CN111897951A (en) Method and apparatus for generating information
CN110837560B (en) Label mining method, device, equipment and storage medium
CN112287173A (en) Method and apparatus for generating information
CN112004116A (en) Method, device, electronic equipment and medium for determining object adding mode
CN112214695A (en) Information processing method and device and electronic equipment
CN110322290B (en) Method and device for acquiring article display information, server and storage medium
CN112766285B (en) Image sample generation method and device and electronic equipment
CN114697760B (en) Processing method, processing device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination