CN110781345B - Video description generation model obtaining method, video description generation method and device - Google Patents

Video description generation model obtaining method, video description generation method and device

Info

Publication number
CN110781345B
CN110781345B
Authority
CN
China
Prior art keywords
video
description
layer
video frame
frames
Prior art date
Legal status
Active
Application number
CN201911051111.4A
Other languages
Chinese (zh)
Other versions
CN110781345A (en)
Inventor
张水发
李岩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201911051111.4A
Publication of CN110781345A
Application granted
Publication of CN110781345B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The present disclosure provides a method for acquiring a video description generation model, a video description generation method and apparatus, an electronic device, and a computer-readable storage medium. The method for acquiring a video description generation model includes: acquiring a plurality of videos from a preset video library; for each video, identifying each video frame in the video to extract characters in the video frame; combining the characters corresponding to the video frames of each video to serve as the video description of that video; and training with the video frames and video descriptions corresponding to the plurality of videos as training samples to obtain a video description generation model. The embodiments of the disclosure can effectively reduce the manual labeling cost.

Description

Video description generation model obtaining method, video description generation method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method for obtaining a video description generation model, a method and an apparatus for generating a video description, an electronic device, and a computer-readable storage medium.
Background
Against the background of the steady development of the internet and big data, the demand for multimedia information has grown explosively, and traditional information processing technology can no longer meet the needs of tasks such as labeling and describing multimedia data. For example, with the explosive growth in the number of internet videos, the demand for video description keeps increasing. Video description (video captioning) is a technique for generating content description information for a video. In the field of artificial intelligence, a video description generation model is generally used to automatically generate a video description for a video.
In the course of arriving at the present disclosure, the inventors found that: in the training stage of a video description generation model, training samples are difficult to obtain and require a large amount of manual labeling, and labeling by a small number of annotators leads to a homogeneous labeling style, so that the generated descriptions do not meet the expectations of the general public.
Disclosure of Invention
In view of this, the present disclosure provides a method for acquiring a video description generation model, a method for generating a video description, an apparatus for generating a video description, an electronic device, and a computer-readable storage medium.
A first aspect of the present disclosure provides a method for acquiring a video description generative model, where the method specifically includes:
acquiring a plurality of videos from a preset video library;
for each video, identifying each video frame in the video to extract characters in the video frame;
combining characters corresponding to the video frames of each video to serve as video description of the videos;
and training the video frames and the video descriptions corresponding to the videos respectively as training samples to obtain a video description generation model.
Optionally, after the identifying each video frame in the video to extract the text in the video frame, the method further includes:
and matching the characters corresponding to each video frame with the pre-stored slogan text, and deleting the characters which are matched with each other.
Optionally, after the identifying each video frame in the video to extract the text in the video frame, the method further includes:
performing word segmentation on characters corresponding to all video frames in the video to obtain a plurality of word sequences;
and deleting the word sequences with the occurrence frequency not less than the set value.
Optionally, after the identifying each video frame in the video to extract the text in the video frame, the method further includes:
for each video frame in each video, comparing the video frame with other video frames in the video one by one to determine whether the video frame is similar to any one of the other video frames;
and if so, deleting one of the video frames, and combining the characters corresponding to the two video frames respectively to be used as the characters corresponding to the video frames which are not deleted.
Optionally, the method further comprises:
segmenting words corresponding to the undeleted video frames to obtain a plurality of word sequences;
deleting word sequences whose frequency of occurrence is not less than the first specified value or not more than the second specified value.
Optionally, determining whether the video frame is similar to any other video frame through a pre-established classification network;
the classification network comprises an input layer, a difference layer, a splicing layer, a convolution layer and an output layer;
the input layer is used for acquiring two input video frames;
the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image;
the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image;
the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector;
and the output layer is used for outputting a similar result according to the feature vector.
Optionally, the video description generation model comprises an encoder network and a decoder network;
the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of the video;
the decoder network is used for sequentially generating decoding words according to the visual characteristics and combining the generated decoding words into a video description.
Optionally, the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer;
the input layer is used for acquiring a plurality of input video frames;
the plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames;
the splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
Optionally, the decoder network is a long-short term memory network.
Optionally, the training with the video frames and the video descriptions corresponding to the multiple videos as training samples to obtain a video description generation model includes:
inputting the video frame into a specified video description generation model to obtain a prediction description;
and adjusting parameters of the video description generation model according to the difference between the prediction description and the video description corresponding to the video frame to obtain the trained model.
Optionally, the adjusting parameters of the video description generation model according to the difference between the prediction description and the video description corresponding to the video frame includes:
respectively obtaining the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame;
and adjusting parameters of the video description generation model according to the difference between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame.
Optionally, the adjusting parameters of the video description generation model according to the difference between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame includes:
determining whether the prediction description is similar to the video description corresponding to the video frame according to the distance between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame;
and adjusting parameters of the video description generation model according to the similar result.
Optionally, the feature vector is a word vector.
Optionally, the distance is a cosine distance.
According to a second aspect of the embodiments of the present disclosure, there is provided a video description generation method, including:
acquiring a target video;
taking the video frame of the target video as the input of a pre-established video description generation model so as to obtain the video description corresponding to the target video from the video description generation model; the video description generation model is obtained based on video frames and video description training corresponding to a plurality of videos respectively, and the generation of the video description of each video comprises the following steps: and identifying each video frame in the video to extract characters in the video frame, and combining the characters corresponding to the video frame of the video to be used as the video description of the video.
Optionally, after the acquiring the target video, the method further includes:
for each video frame in the target video, comparing the video frame with other video frames in the target video one by one to determine whether the video frame is similar to any one of the other video frames;
and if so, deleting one of the video frames.
Optionally, determining whether the video frame is similar to any other video frame through a pre-established classification network;
the classification network comprises an input layer, a difference layer, a splicing layer, a convolution layer and an output layer;
the input layer is used for acquiring two input video frames;
the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image;
the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image;
the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector;
and the output layer is used for outputting a similar result according to the feature vector.
Optionally, the video description generation model comprises an encoder network and a decoder network;
the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of a target video;
the decoder network is used for sequentially generating decoding words according to the visual features and combining the generated decoding words into a video description.
Optionally, the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer;
the input layer is used for acquiring a plurality of input video frames;
the plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames;
the splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
Optionally, the decoder network is a long-short term memory network.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for obtaining a video description generative model, the apparatus including:
the video acquisition module is used for acquiring a plurality of videos from a preset video library;
the character extraction module is used for identifying each video frame in the videos so as to extract characters in the video frame;
the video description acquisition module is used for combining characters corresponding to video frames of each video to serve as video description of the video;
and the model training module is used for training the video frames and the video descriptions corresponding to the videos respectively as training samples to obtain a video description generation model.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a video description generation apparatus, the apparatus including:
the target video acquisition module is used for acquiring a target video;
the video description generation module is used for taking a video frame of the target video as the input of a video description generation model so as to obtain a video description corresponding to the target video from the video description generation model; the video description generation model is obtained based on video frames and video description training corresponding to a plurality of videos respectively, and the generation of the video description of each video comprises the following steps: and identifying each video frame in the video to extract characters in the video frame, and combining the characters corresponding to the video frame of the video to be used as the video description of the video.
According to a fifth aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of the first and second aspects.
According to a sixth aspect of embodiments of the present disclosure, there is also provided a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any one of the first and second aspects.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the method comprises the steps of obtaining a plurality of videos from a preset video library, identifying each video frame in the videos to extract characters in the video frame for each video, combining the characters corresponding to the video frame of each video as video description of the video, and finally training the video frames and the video description corresponding to the videos respectively as training samples to obtain a video description generation model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
FIG. 1 is a flowchart illustrating a method for obtaining a video description generative model according to an exemplary embodiment of the present disclosure;
FIG. 2A is an architecture diagram illustrating a video description generation model according to an exemplary embodiment of the present disclosure;
FIG. 2B is an architecture diagram of another video description generative model illustrated in the present disclosure according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a second method for obtaining a video description generative model according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a third method for obtaining a video description generative model according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a fourth method for obtaining a video description generative model according to an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram of a classification network shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a video description generation method according to an exemplary embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating an embodiment of an apparatus for obtaining a video description generative model according to an exemplary embodiment of the present disclosure;
fig. 9 is a block diagram of an embodiment of a video description generation apparatus according to an exemplary embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device provided in accordance with an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when..." or "upon..." or "in response to determining," depending on the context.
To address the problems that, in the training phase of a video description generation model, training samples are difficult to obtain and labeling by a small number of annotators makes the labeling style homogeneous, an embodiment of the present disclosure provides a method for obtaining a video description generation model. The method can be executed by an electronic device, which may be a computing device such as a computer, a smart phone, a tablet, a personal digital assistant, or a server.
Referring to fig. 1, a flowchart of a method for acquiring a video description generative model according to an exemplary embodiment of the present disclosure is shown, where the method includes:
in step S101, a plurality of videos are acquired from a preset video library.
In step S102, for each video, each video frame in the video is identified to extract the text in the video frame.
In step S103, the text corresponding to the video frame of each video is merged as the video description of the video.
In step S104, the video frames and the video descriptions corresponding to the multiple videos are used as training samples to be trained, and a video description generation model is obtained.
It can be understood that, the source of the video library in the embodiment of the present disclosure is not limited at all, and may be specifically selected according to an actual application scenario, for example, the video library may be disposed on the electronic device, or the video library may also be disposed on a server, and the electronic device obtains a video from the server.
In an embodiment, the video library stores a plurality of videos, from which the electronic device can obtain a plurality of videos. The number of videos obtained by the electronic device from the video library can be set according to the actual situation, which is not limited by the present disclosure; for example, all videos may be obtained, or a specified proportion of the videos (such as 50% or 60% of all videos) may be obtained.
In one embodiment, for each video, the electronic device identifies each video frame in the video to extract the characters in the video frame, and determines the corresponding relationship between each video frame and the characters thereof; as an example, the electronic device may recognize each video frame through an OCR (Optical Character Recognition) technique to extract text in the video frame.
In a possible implementation manner, for each video frame, the electronic device performs image processing and character recognition on the video frame to obtain the characters corresponding to the video frame. The image processing includes, but is not limited to, operations such as graying, binarization, and image noise reduction: the electronic device can convert the color image to grayscale by a component method, a maximum-value method, an averaging method, or a weighted-average method, can binarize it by a bimodal method, a P-parameter method, or an iterative method, and can denoise it with a mean filter, an adaptive Wiener filter, a median filter, a morphological noise filter, or wavelet denoising. The character recognition can be completed by a pre-established character recognition model: after the processed image is further pre-processed (e.g., tilt correction and character segmentation), it is used as the input of the character recognition model, and the recognition result is obtained from the model. The character recognition model can be trained based on a machine learning or deep learning algorithm.
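For illustration only, the following Python sketch shows one possible form of this preprocessing pipeline using OpenCV; choosing Otsu binarization and a median filter is just one combination of the options listed above, and the character recognition model itself is left as a hypothetical placeholder.

```python
import cv2

def preprocess_frame(frame_bgr):
    """Grayscale, binarize, and denoise one video frame before character recognition."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)              # weighted-average graying
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # automatic binarization
    return cv2.medianBlur(binary, 3)                                # median-filter noise reduction

# The character recognition model is not specified here; a placeholder call might be:
# text = character_recognition_model.predict(preprocess_frame(frame))
```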
In an embodiment, after obtaining the texts corresponding to all the video frames of each video, the electronic device may combine the texts corresponding to all the video frames of the video to serve as the video description of the video, and then the electronic device trains the video frames and the video descriptions corresponding to the multiple videos respectively as training samples to obtain a video description generation model; referring to fig. 2A, the video description generation model includes an encoder network 11 and a decoder network 12; the encoder network 11 is configured to extract features of multiple input video frames, and generate visual features of a video; the decoder network 12 is configured to sequentially generate decoded words according to the visual features, and combine the generated decoded words into a video description.
In one embodiment, referring to fig. 2B, the encoder network 11 includes an input layer 111, a plurality of convolutional layers 112, and a splicing layer 113; the input layer 111 is configured to obtain a plurality of input video frames; the plurality of convolutional layers 112 are respectively used for extracting the features of a plurality of video frames; the splicing layer 113 is used for splicing the features of a plurality of video frames to generate visual features; the decoder network 12 is a Long Short-Term Memory network 121 (LSTM).
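A minimal PyTorch sketch of this encoder-decoder structure is given below. The branch architecture, feature dimensions, and the choice to feed the concatenated visual feature to the LSTM at every decoding step are illustrative assumptions; the description above only fixes the layer types and their roles.

```python
import torch
import torch.nn as nn

class VideoCaptionModel(nn.Module):
    """Sketch of the encoder-decoder structure: one convolutional branch per frame,
    a splicing (concatenation) layer, and an LSTM decoder that emits word logits."""
    def __init__(self, num_frames, vocab_size, feat_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: one small convolutional branch per input video frame
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, feat_dim))
            for _ in range(num_frames)])
        # Decoder: LSTM that generates one decoded word per time step
        self.decoder = nn.LSTM(feat_dim * num_frames, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, max_len=20):
        # frames: (batch, num_frames, 3, H, W)
        feats = [branch(frames[:, i]) for i, branch in enumerate(self.branches)]
        visual = torch.cat(feats, dim=1)                    # splicing layer: concatenated visual feature
        steps = visual.unsqueeze(1).repeat(1, max_len, 1)   # feed the visual feature at every decode step
        hidden, _ = self.decoder(steps)
        return self.word_head(hidden)                       # (batch, max_len, vocab_size) word logits
```

Feeding the same visual feature at each step is merely the simplest way to let the LSTM generate decoded words sequentially; more elaborate decoding schemes could equally be used.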
In an embodiment, in a training process, the electronic device inputs the video frame into a specified video description generation model to obtain a prediction description, and then adjusts parameters of the video description generation model according to a difference between the prediction description and a video description corresponding to the video frame to obtain a trained model.
In an implementation manner, the electronic device may respectively obtain the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame, and then adjust the parameter of the video description generation model according to the difference between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame, so as to obtain the trained model.
It can be understood that the embodiment of the present disclosure imposes no limitation on the feature vector, which may be selected according to the actual situation. For example, the feature vector may be a word vector: the electronic device may obtain the word vector of the prediction description and the word vector of the video description corresponding to the video frame through a preset word vector generation model, by inputting the prediction description or the video description corresponding to the video frame into the model and obtaining the corresponding word vector from it. The word vector generation model is used to generate a word vector from any input description and may be, for example, a Word2vec model, a GloVe model, or an ELMo model; for the establishment of such models, reference may be made to implementations in the related art, which are not described in detail here.
Specifically, the electronic device may determine whether the prediction description is similar to the video description corresponding to the video frame according to the distance between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame, and then adjust the parameters of the video description generation model according to the similarity result to obtain the trained model. The distance may be a cosine distance: if the distance is smaller than a specified value, the prediction description is similar to the video description corresponding to the video frame; otherwise, it is not similar. The specified value can be set according to the actual situation, and the embodiment of the present disclosure imposes no limitation on it. In summary, a plurality of videos are obtained from a preset video library; for each video, each video frame is identified to extract the characters in the video frame; the characters corresponding to the video frames of each video are combined as the video description of the video; and finally the video frames and video descriptions corresponding to the plurality of videos are trained as training samples to obtain a video description generation model.
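A minimal sketch of this word-vector comparison follows. Mean-pooling the word vectors into a description-level vector and the threshold value of 0.3 are assumptions; the text only specifies a feature vector per description, a cosine distance, and a configurable specified value.

```python
import numpy as np

def description_vector(word_vectors):
    """Pool a description's word vectors into one description-level feature vector
    (mean pooling is an assumption; the method only requires some feature vector per description)."""
    return np.mean(np.asarray(word_vectors), axis=0)

def cosine_distance(u, v):
    """Cosine distance between two description-level feature vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def descriptions_similar(pred_vec, ref_vec, specified_value=0.3):
    """Similar when the cosine distance is smaller than the specified value (0.3 is illustrative)."""
    return cosine_distance(pred_vec, ref_vec) < specified_value
```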
Referring to fig. 3, a flowchart of a second method for acquiring a video description generative model according to an exemplary embodiment of the present disclosure is shown, where the method includes:
in step S201, a plurality of videos are acquired from a preset video library. Similar to step S101, the description is omitted here.
In step S202, for each video, each video frame in the video is identified to extract the text in the video frame. Similar to step S102, the description is omitted here.
In step S203, for the corresponding characters of each video frame, matching the corresponding characters with the pre-stored slogan text, and deleting the characters that are matched with each other.
In step S204, the text corresponding to the video frame of each video is merged as the video description of the video. Similar to step S103, the description is omitted here.
In step S205, the video frames and the video descriptions corresponding to the multiple videos are used as training samples to be trained, and a video description generation model is obtained. Similar to step S104, the description is omitted here.
The slogan text is content that has no strong correlation with the video frame. As an example, the slogan text may be some widely used wording, such as a television station logo (e.g., CCTV), an app logo or watermark (e.g., the Kuaishou logo), or a slogan (e.g., an advertising slogan or other promotional slogan). It is understood that the present disclosure does not limit the specific setting of the slogan text, which can be set according to the actual scene.
In addition, the slogan text can be stored in the electronic device before the matching step, and the specific storage time of the slogan text is not limited in the embodiment of the disclosure, and can be specifically set according to actual conditions.
In this embodiment, after extracting the characters of all the video frames, the electronic device matches the characters corresponding to each video frame with the pre-stored slogan text. If the matching is consistent, the matched characters are considered to have no strong correlation with the video frame, and the electronic device deletes them as noise, thereby avoiding their influence on the model training result and improving the accuracy of model prediction.
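A minimal sketch of this slogan-matching deletion, assuming the pre-stored slogan text is kept as a simple set of strings; the entries shown are placeholders, not actual slogans used by the method.

```python
# Hypothetical pre-stored slogan text; real entries would be station logos, watermarks, slogans, etc.
SLOGAN_TEXTS = {"CCTV", "example watermark", "example promotional slogan"}

def remove_slogans(frame_text: str) -> str:
    """Delete any characters that match the pre-stored slogan text from a frame's extracted text."""
    for slogan in SLOGAN_TEXTS:
        frame_text = frame_text.replace(slogan, "")
    return frame_text.strip()
```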
Then, after the characters matching the slogan text are deleted, the electronic device combines the characters corresponding to the video frames of each video to serve as the video description of the video, and then trains with the video frames and video descriptions corresponding to the plurality of videos as training samples to obtain a video description generation model.
Referring to fig. 4, a flowchart of a method for acquiring a third video description generative model according to an exemplary embodiment of the present disclosure is shown, where the method includes:
in step S301, a plurality of videos are acquired from a preset video library. Similar to step S101, the description is omitted here.
In step S302, for each video, each video frame in the video is identified to extract the text in the video frame. Similar to step S102, the description is omitted here.
In step S303, performing word segmentation on the characters corresponding to all the video frames in the video, obtaining a plurality of word sequences, and deleting the word sequences with an occurrence frequency not less than a set value, to obtain the characters corresponding to each video frame after deleting the word sequences.
In step S304, the text corresponding to the video frame of each video is merged as the video description of the video. Similar to step S103, the description is omitted here.
In step S305, the video frames and the video descriptions corresponding to the plurality of videos are used as training samples to be trained, and a video description generation model is obtained. Similar to step S104, the description is omitted here.
In this embodiment, it is considered that vocabulary appearing too frequently may affect the result of model training, because such vocabulary may not have a strong correlation with the video content; for example, it may be a logo or a promotional slogan, or several similar or identical words that recur throughout the video. Therefore, after extracting the characters of the video frames, the electronic device segments the characters corresponding to all the video frames in each video to obtain a plurality of word sequences, counts the frequency of occurrence of each word sequence, and deletes a word sequence if its frequency of occurrence is not less than a set value.
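A minimal sketch of this segmentation and frequency filtering, assuming the jieba segmenter and an illustrative set value of 3; any tokenizer and threshold could be substituted.

```python
from collections import Counter
import jieba  # assumed word segmenter; any tokenizer could be used instead

def drop_frequent_sequences(frame_texts, set_value=3):
    """Segment the characters of all frames of one video and delete word sequences whose
    occurrence count is not less than the set value (set_value=3 is illustrative)."""
    tokenized = [list(jieba.cut(text)) for text in frame_texts]
    counts = Counter(tok for toks in tokenized for tok in toks)
    return ["".join(tok for tok in toks if counts[tok] < set_value) for toks in tokenized]
```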
It can be understood that, in the embodiment of the present disclosure, specific values of the setting values are not limited at all, and may be specifically set according to actual situations.
In another embodiment, the electronic device may further compare the word sequences with a pre-stored useless vocabulary and delete the word sequences that match. It is understood that the present disclosure does not limit the specific selection of the useless vocabulary, which may be set according to the actual situation; for example, the useless vocabulary may be selected by a user (such as certain particles or interjections), or selected based on a preset semantic rule (such as treating prepositions or exclamations as useless vocabulary). In this embodiment, the word sequences are further de-noised, which improves the prediction accuracy of the model.
Then, for each video, after obtaining the characters corresponding to all the video frames from which the word sequences are deleted, the electronic device merges the characters corresponding to the video frames of the videos to serve as video descriptions of the videos, and then the electronic device takes the video frames and the video descriptions corresponding to the multiple videos as training samples to train, so that a video description generation model is obtained.
Referring to fig. 5, a flowchart of a fourth method for acquiring a video description generative model according to an exemplary embodiment of the present disclosure is shown, where the method includes:
in step S401, a plurality of videos are acquired from a preset video library. Similar to step S101, the description is omitted here.
In step S402, for each video, each video frame in the video is identified to extract the text in the video frame. Similar to step S102, the description is omitted here.
In step S403, for each video frame in each video, comparing the video frame with other video frames in the video one by one to determine whether the video frame is similar to any of the other video frames, if so, deleting one of the video frames, and merging the characters corresponding to the two video frames respectively to obtain the characters corresponding to the video frame that is not deleted.
In step S404, the text corresponding to the video frame of each video is merged as the video description of the video. Similar to step S103, the description is omitted here.
In step S405, the video frames and the video descriptions corresponding to the multiple videos are used as training samples to be trained, and a video description generation model is obtained. Similar to step S104, the description is omitted here.
In this embodiment, after extracting the characters of the video frames, the electronic device performs a similar-picture judgment: each video frame in each video is compared with the other video frames in the video one by one, and in each comparison it is determined whether the video frame is similar to any one of the other video frames; if so, one of the two video frames is deleted and the characters corresponding to the two video frames are merged to serve as the characters corresponding to the video frame that is not deleted. In this embodiment, the video description model is trained with the dissimilar video frames of the video, which helps improve the accuracy of the model prediction result, enables accurate description of the complete content of the video, and avoids errors in the video description caused by similar content.
For example, if picture A and picture B are similar pictures, the characters corresponding to picture A are "hello", and the characters corresponding to picture B are "I play ball", then one of the pictures is deleted (for example, picture A), and the characters of picture A and picture B are merged, so that picture B is kept with the corresponding characters "hello I play ball".
It can be understood that, for two video frames determined to be similar, which one of the two video frames is specifically selected by the electronic device for deletion is not limited in any way, and may be specifically set according to actual situations.
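A minimal sketch of this deduplication-and-merge step; the pairwise similarity judgment is left as a callable standing in for the classification network described next, and dropping the later frame of each similar pair is an arbitrary choice, consistent with the statement that either frame may be deleted.

```python
def merge_similar_frames(frames, texts, is_similar):
    """Keep only mutually dissimilar frames; when a frame is similar to an already kept frame,
    drop it and append its characters to the kept frame's characters."""
    kept_frames, kept_texts = [], []
    for frame, text in zip(frames, texts):
        for i, kept in enumerate(kept_frames):
            if is_similar(frame, kept):
                kept_texts[i] += text      # merge the two frames' characters onto the kept frame
                break
        else:                              # no similar kept frame found: keep this one
            kept_frames.append(frame)
            kept_texts.append(text)
    return kept_frames, kept_texts
```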
In a possible implementation manner, the electronic device may determine whether the video frame is similar to any other video frame through a pre-established classification network, please refer to fig. 6, which is a structural diagram of a classification network shown in the present disclosure according to an exemplary embodiment, where the classification network includes an input layer 21, a difference layer 22, a splicing layer 23, a convolution layer 24, and an output layer 25; the input layer 21 is used for acquiring two input video frames; the difference layer 22 is used for performing subtraction operation on the two video frames to obtain a difference image; the splicing layer 23 is configured to splice the difference image and the two video frames to obtain a spliced image; the convolution layer 24 is used for performing feature extraction on the spliced image to generate a feature vector; the output layer 25 is configured to output a similar result according to the feature vector.
Specifically, the electronic device inputs the video frame and any one of the other video frames into the classification network, performs subtraction operation on the two video frames through the classification network to obtain a difference image, then splices the difference image with the two video frames to obtain a spliced image, then performs feature extraction on the spliced image to generate a feature vector, and obtains a similar result according to the feature vector.
The classification network can be obtained by training based on a plurality of training samples, which include positive samples and negative samples: each positive sample comprises two similar pictures together with a "similar" label, and each negative sample comprises two dissimilar pictures together with a "dissimilar" label.
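A minimal PyTorch sketch of such a classification network; the channel counts and the single convolution stage are illustrative, since the text only fixes the layer types (difference, splicing, convolution, output) and their order.

```python
import torch
import torch.nn as nn

class FrameSimilarityNet(nn.Module):
    """Sketch of the classification network: difference layer, splicing layer,
    convolution layer, and a two-class output (similar / not similar)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(9, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())             # convolution layer -> feature vector
        self.out = nn.Linear(32, 2)                            # output layer: similarity logits

    def forward(self, frame_a, frame_b):
        diff = frame_a - frame_b                               # difference layer: subtract the two frames
        stitched = torch.cat([diff, frame_a, frame_b], dim=1)  # splicing layer: 3 + 3 + 3 = 9 channels
        return self.out(self.conv(stitched))
```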
In another embodiment, the electronic device may further segment the characters corresponding to the undeleted video frames to obtain a plurality of word sequences, count the frequency of occurrence of each word sequence, and delete a word sequence if its frequency of occurrence is not less than a first specified value or not greater than a second specified value, where the first specified value is greater than the second specified value. Word frequency is thus counted at the frame level, and word sequences that appear too often or too rarely are removed, which effectively prevents the model from falling into a local optimum during training.
It is understood that, the specific values of the first specified value and the second specified value in the embodiments of the present disclosure are not limited in any way, and may be specifically set according to actual situations.
For example, suppose the characters corresponding to a picture C are segmented into word sequences, and the occurrence counts of these word sequences are: {"what": 4, "good children": 1, "whose family": 2, "sleep": 2}. If the first specified value is 4 and the second specified value is 1, the word sequences "what" and "good children" are deleted, and the characters "whose family sleeps" corresponding to picture C are obtained.
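The dual-threshold filtering can be sketched as follows; the threshold values mirror the picture-C example and would otherwise be set according to the actual situation.

```python
from collections import Counter

def filter_by_frequency(tokens_per_frame, first_value=4, second_value=1):
    """Delete word sequences whose count is not less than first_value (too frequent)
    or not greater than second_value (too rare); the values mirror the picture-C example."""
    counts = Counter(tok for toks in tokens_per_frame for tok in toks)
    return [[tok for tok in toks if second_value < counts[tok] < first_value]
            for toks in tokens_per_frame]
```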
Then, for each video, after the electronic device obtains the characters corresponding to the remaining video frames from which the similar video frames are deleted, the characters corresponding to the remaining video frames of the video are combined to serve as the video description of the video, and then the electronic device takes the video frames and the video description corresponding to the multiple videos as training samples to train, so that a video description generation model is obtained.
Referring to fig. 7, a flowchart of a video description generation method according to an exemplary embodiment of the present disclosure is shown, where the method may be performed by an electronic device, and the electronic device may be a computing device such as a computer, a smart phone, a tablet, a personal digital assistant, or a server, and the method includes:
in step S501, a target video is acquired.
In step S502, a video frame of the target video is used as an input of a video description generation model to obtain a video description corresponding to the target video from the video description generation model; the video description generation model is obtained based on video frames and video description training corresponding to a plurality of videos respectively, and the generation of the video description of each video comprises the following steps: and identifying each video frame in the video to extract characters in the video frame, and combining the characters corresponding to the video frame of the video to be used as the video description of the video.
In this embodiment, after acquiring a target video, the electronic device takes a video frame of the target video as an input of a video description generation model to acquire a video description corresponding to the target video from the video description generation model; it is to be understood that, as for the source of the target video, the embodiment of the present disclosure does not limit this, and may be specifically configured according to the actual situation, for example, the target video may be uploaded by the user to the electronic device, or downloaded by the electronic device from a specified server.
Wherein the video description generative model comprises a decoder network and an encoder network; the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of the video; the decoder network is used for sequentially generating decoding words according to the visual characteristics and combining the generated decoding words into a video description.
In one embodiment, the encoder network includes an input layer, a plurality of convolutional layers, and a stitching layer; the input layer is used for acquiring a plurality of input video frames; the plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames; the splicing layer is used for splicing the characteristics of a plurality of video frames to generate visual characteristics; the decoder network is a Long Short-Term Memory network (LSTM).
In an embodiment, after acquiring a target video, the electronic device compares each video frame in the target video with other video frames in the target video one by one, determines whether the video frame is similar to any one of the other video frames in each comparison process, and deletes one of the video frames if yes; in the embodiment, the process of generating the video description is realized by using the dissimilar video frames in the target video, which is beneficial to realizing accurate description of the complete content of the video and avoiding the error of the video description caused by the similar content.
It can be understood that, for two video frames that are determined to be similar, which one of the two video frames is specifically selected for deletion by the electronic device is not limited in any way, and may be specifically set according to actual situations.
In a possible implementation manner, the electronic device may determine whether the video frame is similar to any other video frame through a pre-established classification network, where the classification network includes an input layer, a difference layer, a concatenation layer, a convolution layer, and an output layer; the input layer is used for acquiring two input video frames; the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image; the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image; the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector; and the output layer is used for outputting a similar result according to the feature vector.
Optionally, the generating of the video description of each video specifically includes: and identifying each video frame in the video to extract characters in the video frame, matching the characters corresponding to each video frame with a pre-stored slogan text, deleting the characters which are consistent in matching, and finally combining the characters corresponding to the video frames of the video to serve as the video description of the video.
Optionally, the generating of the video description of each video specifically includes: identifying each video frame in the video to extract characters in the video frame, performing word segmentation on characters corresponding to all the video frames in the video to obtain a plurality of word sequences, deleting the word sequences with the occurrence frequency not less than a set value to obtain the characters corresponding to each video frame after the word sequences are deleted, and finally combining the characters corresponding to the video frames of the video to serve as the video description of the video.
Optionally, the generating of the video description of each video specifically includes: identifying each video frame in the video to extract characters in the video frame, comparing the video frame with other video frames in the video one by one for each video frame in the video to determine whether the video frame is similar to any one of the other video frames, if so, deleting one of the video frames, merging characters corresponding to the two video frames respectively to be used as characters corresponding to the video frames which are not deleted, and finally merging the characters corresponding to the video frames of the video to be used as video description of the video.
Optionally, the generating of the video description of each video specifically includes: identifying each video frame in the video to extract characters in the video frame, comparing the video frame with other video frames in the video one by one for each video frame in the video to determine whether the video frame is similar to any one of the other video frames, if so, deleting one of the video frames, merging characters corresponding to the two video frames respectively to be used as characters corresponding to undeleted video frames, segmenting the characters corresponding to the undeleted video frames to obtain a plurality of word sequences, deleting the word sequences with the occurrence frequency not less than a first specified value or not more than a second specified value to obtain the characters corresponding to the undeleted video frames after deleting the word sequences, and finally merging the characters corresponding to the video frames of the video to be used as the video description of the video.
Accordingly, referring to fig. 8, a block diagram of an embodiment of an apparatus for obtaining a video description generation model according to an embodiment of the present disclosure is shown, where the apparatus includes:
the video obtaining module 601 is configured to obtain multiple videos from a preset video library.
A text extraction module 602, configured to, for each video, identify each video frame in the video to extract text in the video frame.
The video description obtaining module 603 is configured to combine texts corresponding to video frames of each video as video descriptions of the video.
The model training module 604 is configured to train video frames and video descriptions corresponding to the multiple videos as training samples to obtain a video description generation model.
Optionally, after the text extraction module 602, the method further includes:
and the character deleting module is used for matching the characters corresponding to each video frame with the pre-stored slogan text and deleting the characters which are matched with each other.
Optionally, after the text extraction module 602, the method further includes:
and the first word sequence acquisition module is used for segmenting words corresponding to all video frames in the video to acquire a plurality of word sequences.
And the first word sequence deleting module is used for deleting the word sequences with the frequency of occurrence not less than a set value.
Optionally, after the text extraction module 602, the method further includes:
and the video frame comparison module is used for comparing each video frame in each video with other video frames in the video one by one so as to determine whether the video frame is similar to any one of the other video frames.
And the video frame deleting module is used for, if so, deleting one of the video frames, and combining the characters corresponding to the two video frames respectively to serve as the characters corresponding to the video frame that is not deleted.
Optionally, the method further comprises:
and the second word sequence acquisition module is used for segmenting words corresponding to the undeleted video frame to acquire a plurality of word sequences.
And the second word sequence deleting module is used for deleting the word sequences of which the occurrence frequency is not less than the first specified value or not more than the second specified value.
Optionally, it is determined whether the video frame is similar to any other video frame through a pre-established classification network.
The classification network includes an input layer, a differential layer, a splice layer, a convolutional layer, and an output layer.
The input layer is used for acquiring two input video frames.
And the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image.
The splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image.
And the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector.
The output layer is used for outputting similar results according to the feature vectors.
Optionally, the video description generation model comprises a decoder network and an encoder network.
The encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of the video.
The decoder network is used for sequentially generating decoding words according to the visual characteristics and combining the generated decoding words into a video description.
Optionally, the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer.
The input layer is used for acquiring a plurality of input video frames.
The plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames.
The splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
Optionally, the decoder network is a long-short term memory network.
Optionally, the model training module 604 includes:
the prediction description acquisition unit is used for inputting the video frame into a specified video description generation model to obtain prediction description;
and the parameter adjusting unit is used for adjusting parameters of the video description generation model according to the difference between the prediction description and the video description corresponding to the video frame to obtain the trained model.
Optionally, the parameter adjusting unit includes:
a feature vector obtaining subunit, configured to obtain a feature vector of the prediction description and a feature vector of a video description corresponding to the video frame, respectively;
and the parameter adjusting subunit is used for adjusting the parameters of the video description generation model according to the difference between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame.
Optionally, the parameter adjusting subunit includes:
and determining whether the prediction description is similar to the video description corresponding to the video frame or not according to the distance between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame, and adjusting the parameters of the video description generation model according to the similar result.
Optionally, the feature vector is a word vector.
Optionally, the distance is a cosine distance.
Accordingly, referring to fig. 9, a block diagram of an embodiment of a video description generating apparatus according to an embodiment of the present disclosure is shown, where the apparatus includes:
a target video obtaining module 701, configured to obtain a target video.
A video description generation module 702, configured to use a video frame of the target video as an input of a video description generation model, so as to obtain a video description corresponding to the target video from the video description generation model; the video description generation model is obtained based on video frames and video description training corresponding to a plurality of videos respectively, and the generation of the video description of each video comprises the following steps: and identifying each video frame in the video to extract characters in the video frame, and combining the characters corresponding to the video frame of the video to be used as the video description of the video.
Optionally, after the obtaining the target video, the method further includes:
and the similarity judgment module is used for comparing each video frame in the target video with other video frames in the target video one by one so as to determine whether the video frame is similar to any one of the other video frames.
And the video frame deleting module is used for, if so, deleting one of the video frames.
Optionally, it is determined whether the video frame is similar to any other video frame through a pre-established classification network.
The classification network includes an input layer, a differential layer, a splice layer, a convolutional layer, and an output layer.
The input layer is used for acquiring two input video frames.
And the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image.
The splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image.
And the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector.
And the output layer is used for outputting a similar result according to the feature vector.
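The classification network described above can be sketched roughly as follows in Python (PyTorch); the channel counts, the single convolution stage and the two-way output are assumptions for illustration and do not limit the structure actually used.

    import torch
    import torch.nn as nn

    class FrameSimilarityNet(nn.Module):
        """Rough sketch of the similarity classification network (illustrative sizes)."""

        def __init__(self):
            super().__init__()
            # convolution layer: feature extraction on the 9-channel spliced image
            self.conv = nn.Sequential(
                nn.Conv2d(9, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            # output layer: similar / not similar
            self.fc = nn.Linear(32, 2)

        def forward(self, frame_a, frame_b):
            # input layer: two video frames, each of shape N x 3 x H x W
            diff = frame_a - frame_b                              # difference layer
            spliced = torch.cat([diff, frame_a, frame_b], dim=1)  # splicing layer
            feature = self.conv(spliced).flatten(1)               # feature vector
            return self.fc(feature)                               # similarity logits

A frame pair would then be treated as similar when the "similar" output dominates, and one of the two frames could be discarded as described above.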
Optionally, the video description generation model comprises a decoder network and an encoder network.
The encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of the target video.
The decoder network is used for sequentially generating decoding words according to the visual features and combining the generated decoding words into a video description.
Optionally, the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer.
The input layer is used for acquiring a plurality of input video frames.
The plurality of convolution layers are respectively used for extracting the characteristics of a plurality of video frames.
The splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
Optionally, the decoder network is a long short-term memory network.
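Purely as a sketch of the encoder-decoder structure described above, the following Python (PyTorch) code concatenates per-frame convolutional features into a visual feature and feeds it to a long short-term memory decoder; the shared frame convolution, the dimensions and the teacher-forcing style forward pass are assumptions made for brevity.

    import torch
    import torch.nn as nn

    class VideoCaptioner(nn.Module):
        """Rough sketch: CNN encoder over several frames + LSTM decoder (illustrative sizes)."""

        def __init__(self, vocab_size, num_frames=8, feat_dim=256, hidden=512):
            super().__init__()
            # encoder: per-frame feature extraction (one shared convolution branch for brevity)
            self.frame_conv = nn.Sequential(
                nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.visual_proj = nn.Linear(num_frames * feat_dim, hidden)
            # decoder: long short-term memory network generating decoding words in sequence
            self.embed = nn.Embedding(vocab_size, hidden)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, frames, captions):
            # frames: N x T x 3 x H x W (T must equal num_frames); captions: N x L token ids
            t = frames.shape[1]
            feats = [self.frame_conv(frames[:, i]).flatten(1) for i in range(t)]
            visual = self.visual_proj(torch.cat(feats, dim=1))  # splicing layer -> visual feature
            h0 = visual.unsqueeze(0)                            # initialise the decoder state
            c0 = torch.zeros_like(h0)
            decoded, _ = self.lstm(self.embed(captions), (h0, c0))
            return self.out(decoded)                            # per-step word logits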
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present disclosure. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
Accordingly, as shown in fig. 10, the present disclosure further provides an electronic device 80, which includes a processor 81; a memory 82 for storing executable instructions, said memory 82 comprising a computer program 83; wherein the processor 81 is configured to perform any of the methods described above.
The processor 81 executes the computer program 83 stored in the memory 82. The processor 81 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 82 stores the computer program implementing the above method and may include at least one type of storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. The device may also cooperate, over a network connection, with a network storage device that performs the storage function of the memory. The memory 82 may be an internal storage unit of the device 80, such as a hard disk or an internal memory of the device 80, or an external storage device of the device 80, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the device 80. Further, the memory 82 may include both an internal storage unit and an external storage device of the device 80. The memory 82 is used for storing the computer program 83 as well as other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The various embodiments described herein may be implemented using a computer-readable medium, such as computer software, hardware, or any combination thereof. For a hardware implementation, the embodiments described herein may be implemented using at least one of Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, or electronic units designed to perform the functions described herein. For a software implementation, a procedure or a function may be implemented as a separate software module that performs at least one function or operation. The software code may be implemented as a software application (or program) written in any suitable programming language, stored in the memory, and executed by the controller.
The electronic device 80 includes, but is not limited to, the following forms: (1) a mobile terminal: such devices have mobile communication capabilities and mainly aim to provide voice and data communications, and include smart phones (e.g., iPhone), multimedia phones, feature phones, low-end phones, and the like; (2) an ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access, and include PDA, MID, and UMPC devices such as the iPad; (3) a server: a device providing computing services, comprising a processor, a hard disk, a memory, a system bus, and the like; its architecture is similar to that of a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, and manageability are higher; and (4) other electronic devices with computing functions. The device may include, but is not limited to, the processor 81 and the memory 82. Those skilled in the art will appreciate that fig. 10 is merely an example of the electronic device 80 and does not constitute a limitation on the electronic device 80, which may include more or fewer components than shown, combine certain components, or use different components; for example, the device may also include input-output devices, network access devices, buses, camera devices, and the like.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an apparatus to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium, instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform the above-described method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
The foregoing descriptions are merely preferred embodiments of the present disclosure and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (40)

1. A method for acquiring a video description generation model is characterized by comprising the following steps:
acquiring a plurality of videos from a preset video library;
for each video, identifying each video frame in the video to extract characters in the video frame, matching the characters corresponding to each video frame with a pre-stored slogan text, and deleting the characters that are successfully matched; the slogan text comprises content which has no strong correlation with the video frames;
combining characters corresponding to the video frames of each video to serve as video description of the videos;
training video frames and video descriptions corresponding to the videos respectively as training samples to obtain a video description generation model; the video description generation model is used for outputting a prediction video description according to an input video frame.
2. The method of claim 1, wherein after identifying each video frame in the video to extract text in the video frame, the method further comprises:
performing word segmentation on characters corresponding to all video frames in the video to obtain a plurality of word sequences;
and deleting the word sequences with the occurrence frequency not less than the set value.
3. The method of claim 1, wherein after identifying each video frame in the video to extract text in the video frame, the method further comprises:
for each video frame in each video, comparing the video frame with other video frames in the video one by one to determine whether the video frame is similar to any one of the other video frames;
and if so, deleting one of the video frames, and combining the characters corresponding to the two video frames respectively to be used as the characters corresponding to the video frames which are not deleted.
4. The method of claim 3, further comprising:
performing word segmentation on characters corresponding to the undeleted video frames to obtain a plurality of word sequences;
deleting word sequences whose frequency of occurrence is not less than the first specified value or not more than the second specified value.
5. The method of claim 3, wherein determining whether the video frame is similar to any other video frame is performed through a pre-established classification network;
the classification network comprises an input layer, a difference layer, a splicing layer, a convolution layer and an output layer;
the input layer is used for acquiring two input video frames;
the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image;
the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image;
the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector;
and the output layer is used for outputting a similar result according to the feature vector.
6. The method of claim 1, wherein the video description generative model comprises a network of encoders and a network of decoders;
the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of the video;
the decoder network is used for sequentially generating decoding words according to the visual features and combining the generated decoding words into a video description.
7. The method of claim 6, wherein the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer;
the input layer is used for acquiring a plurality of input video frames;
the plurality of convolution layers are respectively used for extracting the characteristics of a plurality of video frames;
the splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
8. The method of claim 6, wherein the decoder network is a long short term memory network.
9. The method according to claim 6, wherein the training with the video frames and the video descriptions corresponding to the videos as training samples to obtain the video description generation model comprises:
inputting the video frame into a specified video description generation model to obtain a prediction description;
and adjusting parameters of the video description generation model according to the difference between the prediction description and the video description corresponding to the video frame to obtain the trained model.
10. The method of claim 9, wherein the adjusting parameters of the video description generation model according to the difference between the prediction description and the video description corresponding to the video frame comprises:
respectively obtaining the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame;
and adjusting parameters of the video description generation model according to the difference between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame.
11. The method of claim 10, wherein adjusting parameters of the video description generation model according to a difference between the feature vector of the prediction description and a feature vector of a video description corresponding to the video frame comprises:
determining whether the prediction description is similar to the video description corresponding to the video frame according to the distance between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame;
and adjusting parameters of the video description generation model according to the similar result.
12. The method of claim 10, wherein the feature vector is a word vector.
13. The method of claim 11, wherein the distance is a cosine distance.
14. A method for generating a video description, comprising:
acquiring a target video;
taking the video frame of the target video as the input of a video description generation model so as to obtain the output video description corresponding to the target video from the video description generation model; the video description generation model is obtained based on video frames and video description training corresponding to a plurality of videos respectively, and the generation of the video description of each video comprises the following steps: identifying each video frame in the video to extract characters in the video frame, matching the characters corresponding to each video frame with a prestored slogan text, and deleting the characters which are matched with each other; the slogan text comprises contents which have no strong correlation with video frames; and combining characters corresponding to the video frames of the video to serve as the video description of the video.
15. The method of claim 14, further comprising, after said obtaining the target video:
for each video frame in the target video, comparing the video frame with other video frames in the target video one by one to determine whether the video frame is similar to any one of the other video frames;
and if so, deleting one of the video frames.
16. The method of claim 15, wherein determining whether the video frame is similar to any other video frame is performed through a pre-established classification network;
the classification network comprises an input layer, a difference layer, a splicing layer, a convolution layer and an output layer;
the input layer is used for acquiring two input video frames;
the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image;
the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image;
the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector;
the output layer is used for outputting similar results according to the feature vectors.
17. The method of claim 14, wherein the video description generative model comprises a network of encoders and a network of decoders;
the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of a target video;
the decoder network is used for sequentially generating decoding words according to the visual characteristics and combining the generated decoding words into a video description.
18. The method of claim 17, wherein the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer;
the input layer is used for acquiring a plurality of input video frames;
the plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames;
the splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
19. The method of claim 17, wherein the decoder network is a long short term memory network.
20. An apparatus for obtaining a video description generative model, comprising:
the video acquisition module is used for acquiring a plurality of videos from a preset video library;
the character extraction module is used for identifying each video frame in the videos so as to extract characters in the video frame;
the character deleting module is used for matching the characters corresponding to each video frame with the prestored slogan text and deleting the characters which are matched with each other; the slogan text comprises content which has no strong correlation with the video frame;
the video description acquisition module is used for combining characters corresponding to video frames of each video to serve as video description of the video;
the model training module is used for training video frames and video descriptions corresponding to the videos respectively as training samples to obtain a video description generation model; the video description generation model is used for outputting a prediction video description according to an input video frame.
21. The apparatus of claim 20, further comprising, after the text extraction module:
the first word sequence acquisition module is used for segmenting words corresponding to all video frames in the video to acquire a plurality of word sequences;
and the first word sequence deleting module is used for deleting the word sequences with the frequency of occurrence not less than a set value.
22. The apparatus of claim 20, further comprising, after the text extraction module:
the video frame comparison module is used for comparing each video frame in each video with other video frames in the video one by one so as to determine whether the video frame is similar to any one of the other video frames;
and the video frame deleting module is used for, if so, deleting one of the video frames, and combining the characters respectively corresponding to the two video frames as the characters corresponding to the video frame which is not deleted.
23. The apparatus of claim 22, further comprising:
the second word sequence acquisition module is used for segmenting words corresponding to the undeleted video frame to acquire a plurality of word sequences;
and the second word sequence deleting module is used for deleting the word sequences with the frequency of occurrence not less than the first specified value or not more than the second specified value.
24. The apparatus of claim 22, wherein the video frame is determined to be similar to any other video frame through a pre-established classification network;
the classification network comprises an input layer, a difference layer, a splicing layer, a convolution layer and an output layer;
the input layer is used for acquiring two input video frames;
the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image;
the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image;
the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector;
and the output layer is used for outputting a similar result according to the feature vector.
25. The apparatus of claim 20, wherein the video description generation model comprises a decoder network and an encoder network;
the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of the video;
the decoder network is used for sequentially generating decoding words according to the visual characteristics and combining the generated decoding words into a video description.
26. The apparatus of claim 25, wherein the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer;
the input layer is used for acquiring a plurality of input video frames;
the plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames;
the splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
27. The apparatus of claim 25 wherein the decoder network is a long short term memory network.
28. The apparatus of claim 27, wherein the model training module comprises:
the prediction description acquisition unit is used for inputting the video frame into a specified video description generation model to obtain a prediction description;
and the parameter adjusting unit is used for adjusting parameters of the video description generation model according to the difference between the prediction description and the video description corresponding to the video frame to obtain the trained model.
29. The apparatus of claim 28, wherein the parameter adjusting unit comprises:
a feature vector obtaining subunit, configured to obtain a feature vector of the prediction description and a feature vector of a video description corresponding to the video frame, respectively;
and the parameter adjusting subunit is used for adjusting the parameters of the video description generation model according to the difference between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame.
30. The apparatus of claim 29, wherein the parameter adjustment subunit comprises:
and determining whether the prediction description is similar to the video description corresponding to the video frame or not according to the distance between the feature vector of the prediction description and the feature vector of the video description corresponding to the video frame, and adjusting the parameters of the video description generation model according to the similar result.
31. The apparatus of claim 29, wherein the feature vector is a word vector.
32. The apparatus of claim 30, wherein the distance is a cosine distance.
33. A video description generation apparatus, comprising:
the target video acquisition module is used for acquiring a target video;
the video description generation module is used for taking a video frame of the target video as the input of a video description generation model so as to obtain the output video description corresponding to the target video from the video description generation model; the video description generation model is obtained based on video frames and video description training corresponding to a plurality of videos respectively, and the generation of the video description of each video comprises the following steps: identifying each video frame in the video to extract characters in the video frame, matching the characters corresponding to each video frame with a prestored slogan text, and deleting the characters which are matched with each other; the slogan text comprises content which has no strong correlation with the video frame; and combining characters corresponding to the video frames of the video to serve as the video description of the video.
34. The apparatus of claim 33, further comprising, after said obtaining the target video:
the similarity judgment module is used for comparing each video frame in the target video with other video frames in the target video one by one so as to determine whether the video frame is similar to any one of the other video frames;
and the video frame deleting module is used for, if so, deleting one of the video frames.
35. The apparatus of claim 34, wherein the video frame is determined to be similar to any other video frame through a pre-established classification network;
the classification network comprises an input layer, a difference layer, a splicing layer, a convolution layer and an output layer;
the input layer is used for acquiring two input video frames;
the difference layer is used for carrying out subtraction operation on the two video frames to obtain a difference image;
the splicing layer is used for splicing the difference image and the two video frames to obtain a spliced image;
the convolution layer is used for carrying out feature extraction on the spliced image to generate a feature vector;
and the output layer is used for outputting a similar result according to the feature vector.
36. The apparatus of claim 33, wherein the video description generation model comprises a decoder network and an encoder network;
the encoder network is used for extracting the characteristics of a plurality of input video frames and generating the visual characteristics of a target video;
the decoder network is used for sequentially generating decoding words according to the visual characteristics and combining the generated decoding words into a video description.
37. The apparatus of claim 36, wherein the encoder network comprises an input layer, a plurality of convolutional layers, and a stitching layer;
the input layer is used for acquiring a plurality of input video frames;
the plurality of convolutional layers are respectively used for extracting the characteristics of a plurality of video frames;
the splicing layer is used for splicing the characteristics of the video frames to generate visual characteristics.
38. The apparatus of claim 36 wherein the decoder network is a long short term memory network.
39. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 19.
40. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 19.
CN201911051111.4A 2019-10-31 2019-10-31 Video description generation model obtaining method, video description generation method and device Active CN110781345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911051111.4A CN110781345B (en) 2019-10-31 2019-10-31 Video description generation model obtaining method, video description generation method and device

Publications (2)

Publication Number Publication Date
CN110781345A CN110781345A (en) 2020-02-11
CN110781345B (en) 2022-12-27

Family

ID=69387914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911051111.4A Active CN110781345B (en) 2019-10-31 2019-10-31 Video description generation model obtaining method, video description generation method and device

Country Status (1)

Country Link
CN (1) CN110781345B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792166B (en) * 2021-08-18 2023-04-07 北京达佳互联信息技术有限公司 Information acquisition method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN108259991A (en) * 2018-03-14 2018-07-06 优酷网络技术(北京)有限公司 Method for processing video frequency and device
CN110119735A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 The character detecting method and device of video

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4158449B2 (en) * 2002-08-07 2008-10-01 ソニー株式会社 Image display device, image display method, program, recording medium, and image display system
JP2004282402A (en) * 2003-03-14 2004-10-07 Toshiba Corp Content processing device and program
CA2528506A1 (en) * 2004-11-30 2006-05-30 Oculus Info Inc. System and method for interactive multi-dimensional visual representation of information content and properties
US9064174B2 (en) * 2012-10-18 2015-06-23 Microsoft Technology Licensing, Llc Simultaneous tracking and text recognition in video frames
US9760970B2 (en) * 2015-03-18 2017-09-12 Hitachi, Ltd. Video analysis and post processing of multiple video streams
CN106202349B (en) * 2016-06-29 2020-08-21 新华三技术有限公司 Webpage classification dictionary generation method and device
US20180146223A1 (en) * 2016-11-22 2018-05-24 Facebook, Inc. Enhancing a live video
US11004209B2 (en) * 2017-10-26 2021-05-11 Qualcomm Incorporated Methods and systems for applying complex object detection in a video analytics system
CN108683924B (en) * 2018-05-30 2021-12-28 北京奇艺世纪科技有限公司 Video processing method and device
CN110163051B (en) * 2018-07-31 2023-03-10 腾讯科技(深圳)有限公司 Text extraction method, device and storage medium
CN109409221A (en) * 2018-09-20 2019-03-01 中国科学院计算技术研究所 Video content description method and system based on frame selection
CN110019817A (en) * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 A kind of detection method, device and the electronic equipment of text in video information
CN109740152B (en) * 2018-12-25 2023-02-17 腾讯科技(深圳)有限公司 Text category determination method and device, storage medium and computer equipment
CN109918509B (en) * 2019-03-12 2021-07-23 明白四达(海南经济特区)科技有限公司 Scene generation method based on information extraction and storage medium of scene generation system
CN110097096B (en) * 2019-04-16 2023-04-25 天津大学 Text classification method based on TF-IDF matrix and capsule network
CN110147745B (en) * 2019-05-09 2024-03-29 深圳市腾讯计算机系统有限公司 Video key frame detection method and device
CN110377787B (en) * 2019-06-21 2022-03-25 北京奇艺世纪科技有限公司 Video classification method and device and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant