CN116844018A - Training method and device for video characterization model, electronic equipment and storage medium

Training method and device for video characterization model, electronic equipment and storage medium

Info

Publication number
CN116844018A
Authority
CN
China
Prior art keywords
video
video data
tag
decoder
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310729467.9A
Other languages
Chinese (zh)
Inventor
沈栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310729467.9A priority Critical patent/CN116844018A/en
Publication of CN116844018A publication Critical patent/CN116844018A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 Detecting features for summarising video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide a training method and apparatus for a video characterization model, an electronic device, and a storage medium. The method includes the following steps: acquiring first video data, a first tag prompt template and a first tag, and acquiring second video data, a second tag prompt template and a second tag, wherein the first video data and the second video data are different types of video data; inputting the first video data and the second video data into an encoder to obtain a first video representation and a second video representation; inputting the first video representation and the first tag prompt template into a decoder to obtain a first prediction tag; inputting the second video representation and the second tag prompt template into the decoder to obtain a second prediction tag; and performing stage training on the encoder and the decoder according to the first tag and the first prediction tag, performing stage training on the encoder and the decoder again according to the second tag and the second prediction tag when the first stage training is finished, and determining the encoder obtained after the further stage training as the video characterization model.

Description

Training method and device for video characterization model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a training method for a video characterization model, a video characterization method, a training apparatus for a video characterization model, a video characterization apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of video technology, the number of users watching various types of video is increasing.
In the related art, characterizing different types of video requires training separate characterization models, that is, one video characterization model needs to be trained for each type of video data.
Disclosure of Invention
Embodiments of the present disclosure provide a training method for a video characterization model, a video characterization method, a training apparatus for a video characterization model, a video characterization apparatus, an electronic device, and a computer-readable storage medium. On the one hand, the method processes different types of video data with a single-stream model, so cross-domain characterization of different types of video data can be achieved; on the other hand, different prompt templates are used to distinguish different training tasks, which improves model training efficiency.
Embodiments of the present disclosure provide a training method for a video characterization model, including: acquiring first video data, a first tag prompt template and a first tag; acquiring second video data, a second tag prompt template and a second tag; the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting a decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag; inputting the first video data and the second video data into an encoder respectively to obtain a first video representation of the first video data and a second video representation of the second video data; inputting the first video representation and the first tag prompt template into the decoder to obtain the first prediction tag; inputting the second video representation and the second tag prompt template into the decoder to obtain the second prediction tag; and performing stage training on the encoder and the decoder according to the first tag and the first prediction tag, performing stage training on the encoder and the decoder again according to the second tag and the second prediction tag when the first stage training is completed, and determining the encoder obtained after the further stage training as the video characterization model.
In some exemplary embodiments of the present disclosure, the first video data includes a first video and text of the first video, and the second video data includes a second video and text of the second video; the inputting the first video data and the second video data into an encoder respectively includes: extracting video features of the first video to obtain visual information of the first video; text segmentation is carried out on the text of the first video, and text information of the first video is obtained; splicing the start mark, the visual information of the first video, the separation mark, the text information of the first video and the end mark to obtain a first mark sequence of the first video, and inputting the first mark sequence into the encoder; extracting video features of the second video to obtain visual information of the second video; text segmentation is carried out on the text of the second video, and text information of the second video is obtained; and performing splicing processing on the start mark, the visual information of the second video, the separation mark, the text information of the second video and the end mark to obtain a second mark sequence of the second video, and inputting the second mark sequence into the encoder.
In some exemplary embodiments of the present disclosure, the acquiring the first video data includes: acquiring the first video; performing voice recognition and text detection on the first video to obtain a text of the first video; the acquiring the second video data includes: acquiring the second video; and performing voice recognition and text detection on the second video to obtain the text of the second video.
In some exemplary embodiments of the present disclosure, the stage training of the encoder and the decoder according to the first tag and the first predictive tag includes: determining a first loss from the first tag and the first predictive tag; and if the first loss is larger than a first preset value, adjusting parameters of the encoder and parameters of the decoder, training the adjusted encoder and the adjusted decoder again according to the first video data and the first label prompting template until the first loss between the first predicted label output by the adjusted decoder and the first label is smaller than or equal to the first preset value, and finishing the stage training.
In some exemplary embodiments of the present disclosure, the retraining of the encoder and the decoder according to the second label and the second predictive label includes: determining a second loss from the second tag and the second predictive tag; and if the second loss is larger than a second preset value, adjusting parameters of the encoder and parameters of the decoder, training the adjusted encoder and the adjusted decoder again according to the second video data and the second label prompting template until the second loss between the second predicted label output by the adjusted decoder and the second label is smaller than or equal to the second preset value, and completing the training again.
In some exemplary embodiments of the present disclosure, the first video data is short video data, the first tag hint template is a short video tag hint template, and the first tag is a short video tag; the second video data is live video data, the second tag prompt template is a live video tag prompt template, and the second tag is a live video tag.
The embodiment of the disclosure provides a video characterization method, which comprises the following steps: acquiring first video data to be processed and second video data to be processed, wherein the first video data to be processed and the second video data to be processed are different types of video data; and respectively inputting the first video data to be processed and the second video data to be processed into a video representation model obtained by training by any one of the methods, and obtaining a first video representation of the first video data to be processed and a second video representation of the second video data to be processed.
In some exemplary embodiments of the present disclosure, the method further comprises: and determining the similarity of the first video data and the second video data according to the first video representation and the second video representation, wherein the similarity is used for recommending the second video data according to the first video data or recommending the first video data according to the second video data.
The embodiment of the disclosure provides a training device for a video characterization model, comprising: an acquisition module configured to perform acquisition of the first video data, the first tag hint template, and the first tag; acquiring second video data, a second tag prompt template and a second tag; the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting a decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag; an obtaining module configured to perform inputting the first video data and the second video data into an encoder, respectively, obtaining a first video representation of the first video data and a second video representation of the second video data; the obtaining module is further configured to perform inputting the first video representation and the first label hint template into a decoder to obtain a first predictive label; the obtaining module is further configured to perform inputting the second video representation and the second label hint template into the decoder to obtain a second predictive label; and the training module is configured to perform stage training on the encoder and the decoder according to the first label and the first prediction label, perform stage training on the encoder and the decoder again according to the second label and the second prediction label when the stage training is completed, and determine the encoder after the stage training again as a video characterization model.
Embodiments of the present disclosure provide a video characterization apparatus, comprising: an acquisition module configured to perform acquisition of first video data to be processed and second video data to be processed, wherein the first video data to be processed and the second video data to be processed are different types of video data; the obtaining module is configured to perform inputting the first video data to be processed and the second video data to be processed into the video characterization model obtained by training any one of the methods respectively, and obtain a first video characterization of the first video data to be processed and a second video characterization of the second video data to be processed.
An embodiment of the present disclosure provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the executable instructions to implement a training method of the video characterization model as any one of the above or a video characterization method as above.
Embodiments of the present disclosure provide a computer readable storage medium, which when executed by a processor of an electronic device, enables the electronic device to perform a training method of a video characterization model as any one of the above or a video characterization method as the above.
Embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements a training method of a video characterization model of any one of the above or a video characterization method as described above.
According to the training method of the video characterization model provided by the embodiments of the present disclosure, on the one hand, a single-stream model (encoder-decoder) is used to process different types of video data, so cross-domain characterization of different types of video data can be achieved; on the other hand, when the first tag prompt template is input to the decoder, the decoder generates the first prediction tag according to the input data, and when the second tag prompt template is input to the decoder, the decoder generates the second prediction tag according to the input data; that is, different prompt templates are used to distinguish different training tasks, which improves model training efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture for a training method or video characterization method to which embodiments of the present disclosure may be applied.
FIG. 2 is a flowchart illustrating a method of training a video characterization model, according to an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a training process for a video characterization model, according to an example embodiment.
Fig. 4 is a flow chart illustrating a video characterization method according to an exemplary embodiment.
FIG. 5 is a block diagram of a training apparatus for a video characterization model, according to an example embodiment.
Fig. 6 is a block diagram of a video characterization device, according to an example embodiment.
Fig. 7 is a schematic diagram illustrating a structure of an electronic device suitable for use in implementing exemplary embodiments of the present disclosure, according to an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced with one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The drawings are merely schematic illustrations of the present disclosure, in which like reference numerals denote like or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in at least one hardware module or integrated circuit or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and not necessarily all of the elements or steps are included or performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the present specification, the terms "a," "an," "the," "said" and "at least one" are used to indicate the presence of at least one element/component/etc.; the terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements/components/etc., in addition to the listed elements/components/etc.; the terms "first," "second," and "third," etc. are used merely as labels, and do not limit the number of their objects.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture for a training method or video characterization method to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture may include a server 101, a network 102, a terminal device 103, a terminal device 104, and a terminal device 105. Network 102 is the medium used to provide communication links between terminal device 103, terminal device 104, or terminal device 105 and server 101. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
The server 101 may be a server providing various services, such as a background management server providing support for devices operated by a user with the terminal device 103, the terminal device 104, or the terminal device 105. The background management server may perform analysis and other processing on the received data such as the request, and feed back the processing result to the terminal device 103, the terminal device 104, or the terminal device 105.
The terminal device 103, the terminal device 104, and the terminal device 105 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a wearable smart device, a virtual reality device, an augmented reality device, and the like.
In the embodiment of the present disclosure, the server 101 may: acquiring first video data, a first tag prompt template and a first tag; acquiring second video data, a second tag prompt template and a second tag; the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting the decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag; respectively inputting the first video data and the second video data into an encoder to obtain a first video representation of the first video data and a second video representation of the second video data; inputting the first video representation and the first tag prompt template into a decoder to obtain a first prediction tag; inputting the second video representation and the second tag prompt template into a decoder to obtain a second prediction tag; and performing stage training on the encoder and the decoder according to the first label and the first predictive label, performing stage training on the encoder and the decoder again according to the second label and the second predictive label when the stage training is finished, and determining the encoder after the stage training again as a video characterization model.
In the embodiment of the present disclosure, the server 101 may acquire first to-be-processed video data and second to-be-processed video data from a terminal device, where the first to-be-processed video data and the second to-be-processed video data are different types of video data; and respectively inputting the first video data to be processed and the second video data to be processed into the trained video representation model to obtain a first video representation of the first video data to be processed and a second video representation of the second video data to be processed.
In the embodiment of the present disclosure, the server 101 may determine, according to the first video representation and the second video representation, a similarity of the first video data and the second video data, where the similarity is used to recommend the second video data according to the first video data or recommend the first video data according to the second video data.
It should be understood that the numbers of the terminal device 103, the terminal device 104, the terminal device 105, the network 102 and the server 101 in fig. 1 are only illustrative, and the server 101 may be a server of one entity, may be a server cluster formed by a plurality of servers, may be a cloud server, and may have any number of terminal devices, networks and servers according to actual needs.
The steps of the training method of the video characterization model in the exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings and embodiments. The method provided by the embodiments of the present disclosure may be performed by any electronic device, for example, the server and/or the terminal device in fig. 1 described above, but the present disclosure is not limited thereto.
FIG. 2 is a flowchart illustrating a method of training a video characterization model, according to an exemplary embodiment. As shown in fig. 2, the training method of the video characterization model provided by the embodiment of the disclosure may include the following steps.
In step S210, first video data, a first tag hint template, and a first tag are acquired; acquiring second video data, a second tag prompt template and a second tag; the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting the decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag.
In an exemplary embodiment, the first video data is short video data, the first tag hint template is a short video tag hint template, and the first tag is a short video tag; the second video data are live video data, the second tag prompt template is a live video tag prompt template, and the second tag is a live video tag.
In an exemplary embodiment, the first video data includes a first video and text of the first video, and the second video data includes a second video and text of the second video.
In the following examples, the first video data is taken as short video data and the second video data is taken as live video data as examples, but the disclosure is not limited thereto.
In the embodiment of the disclosure, the short video data may include a plurality of short videos and text of each short video; the live video data may include a plurality of live videos and text for each live video.
In the embodiment of the disclosure, a short video refers to a video whose playing duration is shorter than that of conventional video content (e.g., a film), for example, about ten seconds to one minute; typically, a short video is played in portrait mode, but the disclosure is not limited thereto. A live video refers to a video file generated in real time from a live broadcast, for example, a video file recorded in real time during the live broadcast process; the live video may be the complete video corresponding to one live broadcast session or a part of that complete video.
In the embodiment of the present disclosure, the text of the short video may be obtained according to the short video extraction, and the text of the live video may be obtained according to the live video extraction, but the present disclosure is not limited thereto.
In an exemplary embodiment, acquiring first video data includes: acquiring a first video; performing voice recognition and text detection on the first video to obtain a text of the first video; acquiring second video data, comprising: acquiring a second video; and performing voice recognition and text detection on the second video to obtain the text of the second video.
For example, a short video may be acquired, and voice recognition and text detection may be performed on the short video to obtain the text of the short video; a live video may be acquired, and voice recognition may be performed on the live video to obtain the text of the live video.
Specifically, for each obtained short video, a cover picture may be selected and the short video uniformly sampled to obtain a preset number (for example, 4) of pictures; at the same time, voice recognition and text detection are performed on the short video, and the results are concatenated to obtain the corresponding text input. For each obtained live video, the live video is uniformly sampled to obtain (the preset number + 1) pictures; at the same time, voice recognition is performed on the live video segments, and the results are concatenated to obtain the corresponding text input.
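As an illustration of the pre-processing described above, the following sketch uniformly samples a preset number of pictures from a video file and concatenates the speech-recognition and text-detection results into a single text input. The use of OpenCV, the function names, and the default of 4 frames are assumptions made for illustration only and are not specified by the disclosure.

```python
# Illustrative sketch only; the sampling count and OpenCV decoding are assumptions.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 4) -> list:
    """Uniformly sample a preset number of frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def build_text_input(asr_text: str, ocr_text: str) -> str:
    """Concatenate speech-recognition and text-detection results into one text input."""
    return " ".join(t for t in (asr_text, ocr_text) if t)
```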
In an exemplary embodiment, the first tag prompt template is used to prompt the decoder to output the first prediction tag; for example, the short video tag prompt template is used to prompt the decoder to output a short video prediction tag. The second tag prompt template is used to prompt the decoder to output the second prediction tag; for example, the live tag prompt template is used to prompt the decoder to output a live prediction tag.
For example, referring to fig. 3, the short video tag prompt template 305 can be "The topic tag of the video is", and the live tag prompt template 306 can be "The search term of the live clip is".
In the embodiment of the disclosure, the short video tag may be a topic tag (hashtag) of a short video, and the topic tag of the short video may be marked when a video producer uploads a video, or may be automatically generated by a video playing platform according to the content of the short video; the live tag may be a live search term, such as a keyword entered by a user in a search bar of a video playback platform when the user wants to watch a live broadcast.
In step S220, the first video data and the second video data are input to an encoder, respectively, to obtain a first video representation of the first video data and a second video representation of the second video data.
In the embodiment of the disclosure, a single-stream Transformer encoder-decoder model can be used to model video features and fuse information from multiple modalities.
In the embodiment of the disclosure, the first video data and the second video data are respectively input to the encoder. When the first video data is input to the encoder, the encoder extracts features of the first video data to obtain the first video representation corresponding to the first video data; when the second video data is input to the encoder, the encoder extracts features of the second video data to obtain the second video representation corresponding to the second video data.
For example, the short video data and the live video data are respectively input to the encoder. When the short video data is input to the encoder, the encoder extracts features of the short video data to obtain the short video representation corresponding to the short video data; when the live video data is input to the encoder, the encoder extracts features of the live video data to obtain the live video representation corresponding to the live video data.
In an exemplary embodiment, inputting first video data and second video data into an encoder, respectively, includes: extracting video features of the first video to obtain visual information of the first video; text segmentation is carried out on the text of the first video, so that text information of the first video is obtained; splicing the start mark, the visual information of the first video, the separation mark, the text information of the first video and the end mark to obtain a first mark sequence of the first video, and inputting the first mark sequence into an encoder; extracting video features of the second video to obtain visual information of the second video; text segmentation is carried out on the text of the second video, so that text information of the second video is obtained; and splicing the start mark, the visual information of the second video, the separation mark, the text information of the second video and the end mark to obtain a second mark sequence of the second video, and inputting the second mark sequence into an encoder.
For example, referring to fig. 3, after the short video 301 and the text 302 of the short video are acquired, feature extraction may be performed on the short video 301 using a ResNet-50 model, and the feature map of the last layer of the ResNet-50 model is used as the visual information of the short video 301 (represented by Image Tokens); at the same time, the text 302 of the short video is tokenized, i.e., the text 302 is divided into individual pieces of text information (represented by Text Tokens); the visual information and the text information are then concatenated using a start marker (Cls Token), a separation marker (Sep Token) and an end marker (Eos Token) to obtain the marker sequence of the short video, which can be input into the encoder for feature extraction to obtain the short video representation.
Similarly, after the live video 303 and the text 304 of the live video are acquired, feature extraction may be performed on the live video 303 using a ResNet-50 model, and the feature map of the last layer of the ResNet-50 model is used as the visual information of the live video 303 (represented by Image Tokens); at the same time, the text 304 of the live video is tokenized, i.e., the text 304 is divided into individual pieces of text information (represented by Text Tokens); the visual information and the text information are then concatenated using a start marker (Cls Token), a separation marker (Sep Token) and an end marker (Eos Token) to obtain the marker sequence of the live video, which can be input into the encoder for feature extraction to obtain the live video representation.
In the embodiment of the disclosure, before the short video data and the live video data are respectively input to the encoder, video feature extraction is performed on the short video data and the live video data, and the text of the short video and the text of the live video is tokenized; the start marker, the separation marker and the end marker are used to obtain the marker sequences of the short video and the live video. In this way, the information of each modality can be fused before the video data is input to the encoder and can interact after the video data is input to the encoder, so that the encoder can better extract short video features and live video features, which improves model training efficiency.
Specifically, referring to fig. 3, the marker sequence of the short video may be input into an encoder with a multi-layer Transformer network structure that uses a bidirectional attention mechanism, and the output feature at the position corresponding to the <cls> token (the cls token embedding) is used as the video representation of the short video; similarly, the marker sequence of the live video may be input into the encoder, and the output feature at the position corresponding to the <cls> token is used as the video representation of the live video.
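The following minimal sketch illustrates the encoder-side processing described above: ResNet-50 feature maps supply the image tokens, the text is tokenized into text tokens, the sequence Cls Token + image tokens + Sep Token + text tokens + Eos Token is fed to a multi-layer Transformer encoder with bidirectional self-attention, and the output at the <cls> position is taken as the video representation. The dimensions, layer counts, tokenizer, and special-token ids are assumptions for illustration and are not fixed by the disclosure.

```python
# Illustrative sketch; not the disclosure's exact architecture or hyperparameters.
import torch
import torch.nn as nn
import torchvision.models as models

class VideoEncoder(nn.Module):
    def __init__(self, dim=768, vocab_size=30000, num_layers=6, num_heads=12):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Keep everything up to the last conv stage; its feature map supplies the image tokens.
        self.visual = nn.Sequential(*list(backbone.children())[:-2])
        self.visual_proj = nn.Linear(2048, dim)         # project the 7x7x2048 map to token dim
        self.token_emb = nn.Embedding(vocab_size, dim)  # text and special tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_id, self.sep_id, self.eos_id = 0, 1, 2  # assumed special-token ids

    def forward(self, frames, text_ids):
        # frames: (B, T, 3, 224, 224) uniformly sampled pictures; text_ids: (B, L) tokenized text
        b, t = frames.shape[:2]
        feat = self.visual(frames.flatten(0, 1))                         # (B*T, 2048, 7, 7)
        img_tokens = self.visual_proj(feat.flatten(2).transpose(1, 2))   # (B*T, 49, dim)
        img_tokens = img_tokens.reshape(b, -1, img_tokens.size(-1))      # (B, T*49, dim)
        def special(idx):
            return self.token_emb(torch.full((b, 1), idx, device=frames.device))
        seq = torch.cat([special(self.cls_id), img_tokens, special(self.sep_id),
                         self.token_emb(text_ids), special(self.eos_id)], dim=1)
        out = self.encoder(seq)   # bidirectional self-attention fuses both modalities
        return out[:, 0]          # output at the <cls> position = video representation
```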
In step S230, the first video representation and the first label hint template are input into a decoder to obtain a first predictive label.
For example, the short video representation and the short video label prompting template are input into a decoder to obtain a short video prediction label.
In step S240, the second video representation and the second label hint template are input into a decoder to obtain a second predictive label.
For example, the live video characterization and the live tag prompt template are input into a decoder to obtain a live prediction tag.
In the embodiment of the disclosure, the decoder can generate the corresponding prediction tag according to the different input prompt templates, wherein the short video prediction tag can be a topic tag and the live prediction tag can be a search term. For example, referring to fig. 3, when the short video tag prompt template 305 is input to the decoder, the decoder generates a short video prediction tag 307 from the input data (the short video representation): "The topic tag of the video is XXX"; when the live tag prompt template 306 is input to the decoder, the decoder generates a live prediction tag 308 from the input data (the live video representation): "The search term of the live clip is XXX".
In the embodiment of the disclosure, different tasks adopt different prompts; for example, the short video task uses the topic tag of the video, and the live task uses the search term of the live clip. In practical applications, a person skilled in the art may also adopt other prompts according to the actual task, and the disclosure is not limited thereto.
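The following sketch illustrates how a prompt template can steer the decoder toward a task-specific prediction tag. The English prompt strings, the greedy decoding loop, and the decoder/tokenizer interfaces are assumptions for illustration; the disclosure only requires that different prompt templates distinguish the different training tasks.

```python
# Illustrative sketch; the decoder and tokenizer interfaces are assumed.
import torch

SHORT_VIDEO_PROMPT = "The topic tag of the video is"      # assumed English rendering
LIVE_PROMPT = "The search term of the live clip is"       # assumed English rendering

@torch.no_grad()
def generate_label(decoder, tokenizer, video_repr, prompt, max_len=20, eos_id=2):
    """Autoregressively generate a prediction tag conditioned on the video representation."""
    ids = tokenizer.encode(prompt)                          # prompt tokens start the sequence
    for _ in range(max_len):
        logits = decoder(video_repr, torch.tensor([ids]))   # cross-attends to the video representation
        next_id = int(logits[0, -1].argmax())               # greedy choice of the next word
        if next_id == eos_id:
            break
        ids.append(next_id)
    return tokenizer.decode(ids)
```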
In step S250, the encoder and the decoder are subjected to stage training according to the first label and the first predictive label, and when the stage training is completed, the encoder and the decoder are subjected to stage training again according to the second label and the second predictive label, and the encoder after the stage training again is determined as a video characterization model.
For example, the encoder and decoder are trained according to the short video tags and the short video predictive tags, and after the encoder and decoder are trained according to the short video tags and the short video predictive tags, the encoder and decoder are trained according to the live broadcast tags and the live broadcast predictive tags, and the trained encoder is determined to be a video characterization model.
In the embodiment of the disclosure, the encoder and the decoder may first be stage trained using the first video data and the first tag prompt template, and then stage trained using the second video data and the second tag prompt template; alternatively, the encoder and the decoder may first be stage trained using the second video data and the second tag prompt template, and then stage trained using the first video data and the first tag prompt template; the two training stages may also be alternated, and the disclosure is not limited in this respect.
For example, the encoder and the decoder may first be trained using the short video data and the short video tag prompt template, and then trained using the live video data and the live video tag prompt template; alternatively, the encoder and the decoder may first be trained using the live video data and the live video tag prompt template, and then trained using the short video data and the short video tag prompt template; the two training stages may also be alternated.
When the encoder and the decoder are trained using the first video data and the first tag prompt template, the first video data is input to the encoder and the decoder to obtain the first prediction tag, the first prediction tag is compared with the first tag, and the model parameters are adjusted until the first prediction tag output by the model and the first tag are substantially the same.
For example, when the encoder and the decoder are trained using the short video data and the short video tag prompt template, the short video data is input to the encoder and the decoder to obtain a short video prediction tag, the short video prediction tag is compared with the short video tag, and the model parameters are adjusted until the short video prediction tag output by the model and the short video tag are substantially the same.
Similarly, when the encoder and the decoder are trained using the second video data and the second tag prompt template, the second video data is input to the encoder and the decoder to obtain the second prediction tag, the second prediction tag is compared with the second tag, and the model parameters are adjusted until the second prediction tag output by the model and the second tag are substantially the same.
For example, when the encoder and the decoder are trained by using the live video data and the live video tag prompt template, the live video data is input to the encoder and the decoder to obtain a live video prediction tag, the live video prediction tag and the live video tag are compared, and model parameters are adjusted until the live video prediction tag and the live video tag output by the model are basically the same.
For example, if the short video tag is "having a birthday at a restaurant", the decoder will generate the corresponding words one by one after the short video prompt template, i.e., the short video prediction tag; a loss is then computed between the generated tag and "having a birthday at a restaurant", thereby guiding model optimization.
In an exemplary embodiment, stage training of the encoder and decoder according to the first tag and the first predictive tag comprises: determining a first loss from the first tag and the first predictive tag; and if the first loss is larger than a first preset value, adjusting parameters of the encoder and parameters of the decoder, and training the adjusted encoder and the adjusted decoder again according to the first video data and the first label prompting template until the first loss between the first prediction label and the first label output by the adjusted decoder is smaller than or equal to the first preset value, wherein the training is completed.
Specifically, inputting a short video tag and a short video prediction tag into a loss function, and calculating to obtain a first loss, wherein the form of the loss function is not limited in the disclosure; judging whether the first loss is larger than a first preset value or not, wherein the first preset value can be set according to actual conditions; if the first loss is larger than the first preset value, adjusting parameters of the encoder and parameters of the decoder, training the adjusted encoder and the adjusted decoder again according to the short video data and the short video label prompting template, continuously calculating the first loss, judging whether the first loss is larger than the first preset value, and if the first loss is still larger than the first preset value, continuously adjusting the parameters of the encoder and the decoder until the first loss between the short video predictive label and the short video label output by the adjusted decoder is smaller than or equal to the first preset value, wherein the training is completed.
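A minimal sketch of one such training stage is shown below: the loss between the prediction tag and the ground-truth tag is computed, and the encoder and decoder parameters are adjusted until the loss falls to the preset value. The token-level cross-entropy loss, the AdamW optimizer, and the stopping criterion on the average epoch loss are assumptions for illustration; the disclosure does not limit the form of the loss function.

```python
# Illustrative sketch of one training stage; loss form and optimizer are assumed.
import torch
import torch.nn.functional as F

def train_stage(encoder, decoder, dataloader, prompt_ids, threshold, lr=1e-4, max_epochs=50):
    # prompt_ids: (1, P) token ids of the task's tag prompt template
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for frames, text_ids, label_ids in dataloader:
            video_repr = encoder(frames, text_ids)
            # Decoder input: prompt tokens followed by the label tokens (teacher forcing).
            dec_in = torch.cat([prompt_ids.expand(label_ids.size(0), -1), label_ids], dim=1)
            logits = decoder(video_repr, dec_in[:, :-1])
            # Loss is computed only on the label positions, not on the prompt.
            loss = F.cross_entropy(
                logits[:, prompt_ids.size(1) - 1:].reshape(-1, logits.size(-1)),
                label_ids.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(dataloader) <= threshold:   # the first/second preset value
            break
```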
In an exemplary embodiment, the encoder and decoder are again stage trained from the second label and the second predictive label, comprising: determining a second loss based on the second tag and the second predictive tag; and if the second loss is larger than a second preset value, adjusting parameters of the encoder and parameters of the decoder, training the adjusted encoder and the adjusted decoder again according to the second video data and the second label prompting template until the second loss between the second predicted label and the second label output by the adjusted decoder is smaller than or equal to the second preset value, and finishing the training again.
In the embodiment of the present disclosure, a specific process of training an encoder and a decoder according to a live broadcast tag and a live broadcast prediction tag is similar to a specific process of training an encoder and a decoder according to a short video tag and a short video prediction tag, and the disclosure is not repeated here.
In the embodiment of the disclosure, after the retraining of the encoder and the decoder is completed, the retrained encoder is determined as a video characterization model, and when the encoder is actually applied, the trained encoder can be used for video characterization of the input video data.
According to the training method of the video characterization model provided by the embodiments of the present disclosure, on the one hand, a single-stream model (encoder-decoder) is used to process different types of video data, so cross-domain characterization of different types of video data can be achieved; on the other hand, when the first tag prompt template is input to the decoder, the decoder generates the first prediction tag according to the input data, and when the second tag prompt template is input to the decoder, the decoder generates the second prediction tag according to the input data; that is, different prompt templates are used to distinguish different training tasks, which improves model training efficiency.
Fig. 4 is a flow chart illustrating a video characterization method according to an exemplary embodiment. Fig. 4 shows the application of the video characterization model after it has been trained using the method provided by the above embodiments.
As shown in fig. 4, the video characterization method provided by the embodiment of the present disclosure may include the following steps.
In step S410, first to-be-processed video data and second to-be-processed video data are acquired, wherein the first to-be-processed video data and the second to-be-processed video data are different types of video data.
For example, short video data and live video data may be acquired.
In step S420, the first video data to be processed and the second video data to be processed are respectively input into the video characterization model obtained by training according to any one of the above methods, and the first video characterization of the first video data to be processed and the second video characterization of the second video data to be processed are obtained.
For example, the short video data and the live video data may be respectively input into a video characterization model obtained by training according to any one of the methods, so as to obtain a short video characterization and a live video characterization.
In the embodiment of the disclosure, a short video representation is obtained by inputting the short video data into a trained video representation model; inputting live video data into a trained video characterization model to obtain live video characterization; that is, the video characterization method provided by the embodiment of the disclosure can realize cross-domain characterization of the short video and the live video, thereby improving the processing efficiency of video data.
In an exemplary embodiment, the above method may further include: and determining the similarity of the first video data and the second video data according to the first video representation and the second video representation, wherein the similarity is used for recommending the second video data according to the first video data or recommending the first video data according to the second video data.
For example, from the short video characterization and the live video characterization, a similarity of the short video data and the live video data is determined, the similarity being used to recommend the live video data from the short video data or to recommend the short video data from the live video data.
Specifically, the similarity between each video representation (including a short video representation and a live video representation) can be calculated, and after a user watches or searches a certain short video, the short video or the live video with higher similarity with the short video is recommended to the user; after a user watches or searches a certain live video, recommending short video or live video with high similarity with the live video to the user.
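A minimal sketch of the similarity computation used for such cross-domain recommendation is shown below. Cosine similarity over the encoder outputs is an assumed choice for illustration; the disclosure does not limit the specific similarity measure.

```python
# Illustrative sketch; cosine similarity is an assumed measure.
import torch
import torch.nn.functional as F

def video_similarity(repr_a: torch.Tensor, repr_b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between two batches of video representations."""
    a = F.normalize(repr_a, dim=-1)   # (N, dim) e.g. short video representations
    b = F.normalize(repr_b, dim=-1)   # (M, dim) e.g. live video representations
    return a @ b.T                    # (N, M) similarity matrix

# Example: recommend the most similar live video for each short video.
# top_live = video_similarity(short_reprs, live_reprs).argmax(dim=1)
```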
The video characterization method provided by the embodiments of the present disclosure can realize cross-domain retrieval and cross-domain recommendation between short videos and live videos, and improve the viewing rate of different types of video data.
It should also be understood that the above is only intended to help those skilled in the art better understand the embodiments of the present disclosure, and is not intended to limit the scope of the embodiments of the present disclosure. It will be apparent to those skilled in the art from the foregoing examples that various equivalent modifications or variations can be made; for example, some steps of the methods described above may not be necessary, some steps may be newly added, or any two or more of the above may be combined. Such modifications, variations, or combinations are also within the scope of the embodiments of the present disclosure.
It should also be understood that the foregoing description of the embodiments of the present disclosure focuses on highlighting differences between the various embodiments and that the same or similar elements not mentioned may be referred to each other and are not repeated here for brevity.
It should also be understood that the sequence numbers of the above processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
It is also to be understood that in the various embodiments of the disclosure, terms and/or descriptions of the various embodiments are consistent and may be referenced to one another in the absence of a particular explanation or logic conflict, and that the features of the various embodiments may be combined to form new embodiments in accordance with their inherent logic relationships.
Examples of training methods for video characterization models provided by the present disclosure are described in detail above. It will be appreciated that the computer device, in order to carry out the functions described above, comprises corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
FIG. 5 is a block diagram of a training apparatus for a video characterization model, according to an example embodiment. Referring to fig. 5, the apparatus 500 may include an acquisition module 510, an acquisition module 520, and a training module 530.
Wherein the obtaining module 510 is configured to perform obtaining the first video data, the first tag hint template, and the first tag; acquiring second video data, a second tag prompt template and a second tag; the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting a decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag; the obtaining module 520 is configured to perform inputting the first video data and the second video data into an encoder, respectively, obtaining a first video representation of the first video data and a second video representation of the second video data; the obtaining module 520 is further configured to perform inputting the first video representation and the first label hint template into a decoder to obtain a first predictive label; the obtaining module 520 is further configured to perform inputting the second video representation and the second label hint template into the decoder to obtain a second predictive label; the training module 530 is configured to perform a phase training of the encoder and the decoder according to the first label and the first predictive label, and upon completion of the phase training, perform a re-phase training of the encoder and the decoder according to the second label and the second predictive label, and determine the encoder after completion of the re-phase training as a video characterization model.
In some exemplary embodiments of the present disclosure, the first video data includes a first video and text of the first video, and the second video data includes a second video and text of the second video; the obtaining module 520 is configured to perform: extracting video features of the first video to obtain visual information of the first video; text segmentation is carried out on the text of the first video, and text information of the first video is obtained; splicing the start mark, the visual information of the first video, the separation mark, the text information of the first video and the end mark to obtain a first mark sequence of the first video, and inputting the first mark sequence into the encoder; extracting video features of the second video to obtain visual information of the second video; text segmentation is carried out on the text of the second video, and text information of the second video is obtained; and performing splicing processing on the start mark, the visual information of the second video, the separation mark, the text information of the second video and the end mark to obtain a second mark sequence of the second video, and inputting the second mark sequence into the encoder.
In some exemplary embodiments of the present disclosure, the acquisition module 510 is configured to perform: acquiring the first video; performing voice recognition and text detection on the first video to obtain a text of the first video; acquiring the second video; and performing voice recognition and text detection on the second video to obtain the text of the second video.
In some exemplary embodiments of the present disclosure, the training module 530 is configured to perform: determining a first loss from the first tag and the first predictive tag; and if the first loss is larger than a first preset value, adjusting parameters of the encoder and parameters of the decoder, training the adjusted encoder and the adjusted decoder again according to the first video data and the first label prompting template until the first loss between the first predicted label output by the adjusted decoder and the first label is smaller than or equal to the first preset value, and finishing the stage training.
In some exemplary embodiments of the present disclosure, the training module 530 is configured to perform: determining a second loss from the second tag and the second prediction tag; and if the second loss is larger than a second preset value, adjusting parameters of the encoder and parameters of the decoder, and training the adjusted encoder and the adjusted decoder again according to the second video data and the second tag prompt template until the second loss between the second prediction tag output by the adjusted decoder and the second tag is smaller than or equal to the second preset value, at which point the repeated stage training is completed.
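Both training stages described above follow the same loss-threshold pattern. The sketch below illustrates that pattern under stated assumptions: a cross-entropy loss, an Adam optimizer and an illustrative step cap, none of which are fixed by the embodiments; the commented usage reuses the hypothetical encoder/decoder classes from the earlier sketch.

```python
import torch
import torch.nn as nn

def train_stage(encoder, decoder, tokens, prompt, tag, threshold, max_steps=100):
    """Repeat forward/backward passes until the loss drops to the preset value."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    loss = None
    for _ in range(max_steps):
        logits = decoder(encoder(tokens), prompt)
        loss = loss_fn(logits, tag)
        if loss.item() <= threshold:       # stage finishes once the loss is small enough
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                   # adjust encoder and decoder parameters
    return loss

# stage one with the first video data / first tag prompt template / first tag, e.g.:
# train_stage(encoder, decoder, first_tokens, first_prompt, first_tags, threshold=0.5)
# stage two then repeats the call with the second video data, second tag prompt
# template and second tag; the encoder after that stage is the video characterization model.
```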
In some exemplary embodiments of the present disclosure, the first video data is short video data, the first tag prompt template is a short video tag prompt template, and the first tag is a short video tag; the second video data is live video data, the second tag prompt template is a live video tag prompt template, and the second tag is a live video tag.
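Purely illustrative example of type-specific tag prompt templates; the exact wording of the templates is not specified by the embodiments and is assumed here only to show that the two video types can use different prompts.

```python
# hypothetical prompt wording, chosen only for illustration
TAG_PROMPT_TEMPLATES = {
    "short_video": "The tag of this short video is:",   # first tag prompt template
    "live_video": "The tag of this live video is:",     # second tag prompt template
}
```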
Fig. 6 is a block diagram of a video characterization device according to an exemplary embodiment. Referring to fig. 6, the apparatus 600 may include an acquisition module 610 and an obtaining module 620.
The acquisition module 610 is configured to acquire first video data to be processed and second video data to be processed, wherein the first video data to be processed and the second video data to be processed are different types of video data. The obtaining module 620 is configured to input the first video data to be processed and the second video data to be processed respectively into the video characterization model obtained by training according to any one of the above methods, so as to obtain a first video representation of the first video data to be processed and a second video representation of the second video data to be processed.
In some exemplary embodiments of the present disclosure, the apparatus further comprises: a determining module configured to determine a similarity between the first video data to be processed and the second video data to be processed according to the first video representation and the second video representation, the similarity being used to recommend the second video data to be processed according to the first video data to be processed, or to recommend the first video data to be processed according to the second video data to be processed.
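As an illustration of how the trained encoder could support such cross-type recommendation, the sketch below scores two video representations with a cosine similarity; the helper names and the similarity threshold are assumptions, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def characterize(model, tokens):
    """Run the trained encoder (video characterization model) on a token sequence."""
    with torch.no_grad():
        return model(tokens)

def video_similarity(repr_a, repr_b):
    """Cosine similarity between two video representations."""
    return F.cosine_similarity(repr_a, repr_b, dim=-1)

# e.g. recommend a live stream based on a short video the user watched:
# short_repr = characterize(video_characterization_model, short_video_tokens)
# live_repr  = characterize(video_characterization_model, live_video_tokens)
# if video_similarity(short_repr, live_repr) > 0.8:   # threshold is an assumption
#     recommend(live_video)
```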
It should be noted that the block diagrams shown in the above figures are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be repeated here.
An electronic device 700 according to such an embodiment of the present disclosure is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 7, the electronic device 700 is embodied in the form of a general purpose computing device. Components of the electronic device 700 may include, but are not limited to: at least one processing unit 710, at least one storage unit 720, a bus 730 connecting different system components (including the storage unit 720 and the processing unit 710), and a display unit 740.
Wherein the storage unit stores program code that is executable by the processing unit 710 such that the processing unit 710 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 710 may perform various steps as shown in fig. 2.
The storage unit 720 may include readable media in the form of volatile memory units, such as a Random Access Memory (RAM) 721 and/or a cache memory 722, and may further include a Read Only Memory (ROM) 723.
The storage unit 720 may also include a program/utility 724 having a set (at least one) of program modules 725, such program modules 725 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 730 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 770 (e.g., keyboard, pointing device, Bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 700, and/or any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 750. Also, the electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet through a network adapter 760. As shown, the network adapter 760 communicates with the other modules of the electronic device 700 over the bus 730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment, a computer readable storage medium is also provided, for example, a memory including instructions executable by a processor of the apparatus to perform the above method. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction which, when executed by a processor, implements the training method of the video characterization model in the above embodiment.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method for training a video characterization model, comprising:
acquiring first video data, a first tag prompt template and a first tag; acquiring second video data, a second tag prompt template and a second tag; the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting a decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag;
inputting the first video data and the second video data into an encoder respectively to obtain a first video representation of the first video data and a second video representation of the second video data;
inputting the first video representation and the first tag prompt template into a decoder to obtain the first prediction tag;
inputting the second video representation and the second tag prompt template into the decoder to obtain the second prediction tag;
and performing stage training on the encoder and the decoder according to the first tag and the first prediction tag, performing stage training on the encoder and the decoder again according to the second tag and the second prediction tag when the stage training is completed, and determining the encoder after the repeated stage training as a video characterization model.
2. The method of training a video characterization model of claim 1 wherein the first video data comprises a first video and text of the first video and the second video data comprises a second video and text of the second video;
the inputting the first video data and the second video data into an encoder respectively includes:
extracting video features of the first video to obtain visual information of the first video; performing text segmentation on the text of the first video to obtain text information of the first video; performing splicing processing on a start mark, the visual information of the first video, a separation mark, the text information of the first video and an end mark to obtain a first mark sequence of the first video, and inputting the first mark sequence into the encoder;
extracting video features of the second video to obtain visual information of the second video; performing text segmentation on the text of the second video to obtain text information of the second video; and performing splicing processing on the start mark, the visual information of the second video, the separation mark, the text information of the second video and the end mark to obtain a second mark sequence of the second video, and inputting the second mark sequence into the encoder.
3. The method of training a video characterization model of claim 2 wherein the acquiring the first video data comprises:
acquiring the first video; performing voice recognition and text detection on the first video to obtain a text of the first video;
the acquiring the second video data includes:
acquiring the second video; and performing voice recognition and text detection on the second video to obtain the text of the second video.
4. The method of training a video characterization model according to claim 1, wherein the stage training of the encoder and the decoder according to the first tag and the first prediction tag comprises:
determining a first loss from the first tag and the first prediction tag;
and if the first loss is larger than a first preset value, adjusting parameters of the encoder and parameters of the decoder, and training the adjusted encoder and the adjusted decoder again according to the first video data and the first tag prompt template until the first loss between the first prediction tag output by the adjusted decoder and the first tag is smaller than or equal to the first preset value, at which point the stage training is completed.
5. The method of training a video characterization model according to claim 1 or 4, wherein the stage training of the encoder and the decoder again according to the second tag and the second prediction tag comprises:
determining a second loss from the second tag and the second prediction tag;
and if the second loss is larger than a second preset value, adjusting parameters of the encoder and parameters of the decoder, and training the adjusted encoder and the adjusted decoder again according to the second video data and the second tag prompt template until the second loss between the second prediction tag output by the adjusted decoder and the second tag is smaller than or equal to the second preset value, at which point the repeated stage training is completed.
6. The method of training a video characterization model according to any one of claims 1 to 5, wherein the first video data is short video data, the first tag prompt template is a short video tag prompt template, and the first tag is a short video tag; the second video data is live video data, the second tag prompt template is a live video tag prompt template, and the second tag is a live video tag.
7. A method of video characterization, comprising:
acquiring first video data to be processed and second video data to be processed, wherein the first video data to be processed and the second video data to be processed are different types of video data;
inputting the first video data to be processed and the second video data to be processed respectively into a video characterization model obtained by training according to the method of any one of claims 1 to 6, to obtain a first video representation of the first video data to be processed and a second video representation of the second video data to be processed.
8. The video characterization method of claim 7, further comprising:
and determining a similarity between the first video data to be processed and the second video data to be processed according to the first video representation and the second video representation, wherein the similarity is used for recommending the second video data to be processed according to the first video data to be processed or recommending the first video data to be processed according to the second video data to be processed.
9. A training device for a video characterization model, comprising:
an acquisition module configured to acquire first video data, a first tag prompt template and a first tag, and to acquire second video data, a second tag prompt template and a second tag; wherein the first video data and the second video data are different types of video data, the first tag prompt template is used for prompting a decoder to output a first prediction tag, and the second tag prompt template is used for prompting the decoder to output a second prediction tag;
an obtaining module configured to input the first video data and the second video data respectively into an encoder to obtain a first video representation of the first video data and a second video representation of the second video data;
the obtaining module being further configured to input the first video representation and the first tag prompt template into the decoder to obtain the first prediction tag;
the obtaining module being further configured to input the second video representation and the second tag prompt template into the decoder to obtain the second prediction tag;
and a training module configured to perform stage training on the encoder and the decoder according to the first tag and the first prediction tag, perform stage training on the encoder and the decoder again according to the second tag and the second prediction tag when the stage training is completed, and determine the encoder after the repeated stage training as a video characterization model.
10. A video characterization device, comprising:
an acquisition module configured to perform acquisition of first video data to be processed and second video data to be processed, wherein the first video data to be processed and the second video data to be processed are different types of video data;
an obtaining module configured to input the first video data to be processed and the second video data to be processed respectively into a video characterization model obtained by training according to the method of any one of claims 1 to 6, to obtain a first video representation of the first video data to be processed and a second video representation of the second video data to be processed.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the training method of the video characterization model of any one of claims 1 to 6 or the video characterization method of claim 7 or 8.
12. A computer readable storage medium having stored thereon instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the training method of the video characterization model of any one of claims 1 to 6 or the video characterization method of claim 7 or 8.
CN202310729467.9A 2023-06-19 2023-06-19 Training method and device for video characterization model, electronic equipment and storage medium Pending CN116844018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310729467.9A CN116844018A (en) 2023-06-19 2023-06-19 Training method and device for video characterization model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310729467.9A CN116844018A (en) 2023-06-19 2023-06-19 Training method and device for video characterization model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116844018A true CN116844018A (en) 2023-10-03

Family

ID=88162658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310729467.9A Pending CN116844018A (en) 2023-06-19 2023-06-19 Training method and device for video characterization model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116844018A (en)

Similar Documents

Publication Publication Date Title
CN110971969B (en) Video dubbing method and device, electronic equipment and computer readable storage medium
KR101909807B1 (en) Method and apparatus for inputting information
CN111259215A (en) Multi-modal-based topic classification method, device, equipment and storage medium
WO2019021088A1 (en) Navigating video scenes using cognitive insights
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN112733654B (en) Method and device for splitting video
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN112929746A (en) Video generation method and device, storage medium and electronic equipment
CN115269913A (en) Video retrieval method based on attention fragment prompt
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN111986655A (en) Audio content identification method, device, equipment and computer readable medium
CN114445754A (en) Video processing method and device, readable medium and electronic equipment
CN113038175B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN111859970B (en) Method, apparatus, device and medium for processing information
CN114501064A (en) Video generation method, device, equipment, medium and product
CN116844018A (en) Training method and device for video characterization model, electronic equipment and storage medium
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
CN113420723A (en) Method and device for acquiring video hotspot, readable medium and electronic equipment
CN111866609B (en) Method and apparatus for generating video
CN115967833A (en) Video generation method, device and equipment meter storage medium
CN113486212A (en) Search recommendation information generation and display method, device, equipment and storage medium
CN112905838A (en) Information retrieval method and device, storage medium and electronic equipment
CN112699687A (en) Content cataloging method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination