CN110324706B - Video cover generation method and device and computer storage medium

Video cover generation method and device and computer storage medium

Info

Publication number
CN110324706B
CN110324706B (application CN201810286238.3A)
Authority
CN
China
Prior art keywords
video
data
processed
frame
processing queue
Prior art date
Legal status
Active
Application number
CN201810286238.3A
Other languages
Chinese (zh)
Other versions
CN110324706A (en)
Inventor
盛骁杰
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN201810286238.3A
Publication of CN110324706A
Application granted
Publication of CN110324706B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/835 Generation of protective data, e.g. certificates
    • H04N21/8352 Generation of protective data involving content or source identification data, e.g. Unique Material Identifier [UMID]
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the application discloses a method and a device for generating a video cover and a computer storage medium, wherein the method comprises the following steps: acquiring data to be processed, wherein the data to be processed comprises image data or video data; identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue; and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue. The technical scheme provided by the application can improve the generation efficiency of the video cover.

Description

Video cover generation method and device and computer storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for generating a video cover, and a computer storage medium.
Background
With the continuous development of internet technology, the number of videos in a video playing platform is increasing. Currently, in order to make a user quickly know the subject of the video content, a corresponding video cover is usually generated for the video. In order to save the manpower and material resources consumed by manually generating the video cover, image processing technology is generally adopted to automatically generate the video cover at present.
At present, video covers are typically generated automatically by cover generation devices that support image recognition. Specifically, the cover generation device may analyze image frames of the video based on OpenGL, thereby generating a video cover.
Since OpenGL can generally only process images, the input to the cover generation device in the prior art is usually an image. If a video is to be analyzed, the video data first needs to be preprocessed by another device. Specifically, referring to fig. 1, when generating a video cover based on video data, the prior art may use two independent devices. A preprocessing device decodes the video data and then extracts a certain number of image frames from the decoded video frames. To facilitate storage of the image frames, the preprocessing device typically encodes the extracted frames into images in a format such as jpeg, bmp, or png. These images may then be loaded by the cover generation device, where the loaded images need to be decoded again before they are processed to produce a video cover.
As can be seen from the above, when generating a video cover at present, the cover generation device can generally only process input images. If only video data is available, the video data must be processed by an independent preprocessing device and then by the cover generation device before a final video cover can be generated. Such a generation method results in low efficiency of video cover generation.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method and an apparatus for generating a video cover, and a computer storage medium, which can improve the generation efficiency of the video cover.
In order to achieve the above object, an embodiment of the present application provides a method for generating a video cover, where the method includes: acquiring data to be processed, wherein the data to be processed comprises image data or video data; identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue; and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
In order to achieve the above object, an apparatus for generating a video cover page according to an embodiment of the present invention includes a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, implements the following steps: acquiring data to be processed, wherein the data to be processed comprises image data or video data; identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue; and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
To achieve the above object, an embodiment of the present invention further provides a computer storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the following steps: acquiring data to be processed, wherein the data to be processed comprises image data or video data; identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue; and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
Therefore, the video cover generation apparatus of the present application extends the functions of the video cover generation device in the prior art: it can process not only input image data but also input video data. The apparatus can identify the type of the input data; when the current data is identified as video data, the video data can be decoded to obtain the video frames it contains. A certain number of video frames can then be extracted from the decoded video frames and sent directly to a processing queue without an image encoding step, and the video frames in the processing queue can be used to subsequently produce video covers. Compared with the prior art, the technical solution provided by the application therefore expands the types of data that can be processed. Moreover, after a video frame is extracted from the decoded video data, the extracted video frame does not need to be encoded and can instead be sent directly into the processing queue. This saves the steps of encoding the video frame and subsequently decoding the encoded frame, which simplifies the video data processing flow and improves the efficiency of video cover generation.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a video cover generation process in the prior art;
FIG. 2 is a flowchart of a method for generating a video cover according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an index list in an embodiment of the present application;
FIG. 4 is a diagram illustrating the processing of multiple threads in an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating the processing of data to be processed according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of decoding performed by a CPU and a GPU according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a video cover generation apparatus according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application shall fall within the scope of protection of the present application.
The application provides a method for generating a video cover, which can be applied to a business server of a video playing website. After receiving the video uploaded by the user or the administrator, the business server can generate a video cover of the video.
Referring to fig. 2, the method for generating a video cover according to the present application may include the following steps.
S1: acquiring data to be processed, wherein the data to be processed comprises image data or video data.
In this embodiment, when a video cover needs to be generated for a certain video, data related to the video may be acquired, and the acquired data may be used as the above-mentioned data to be processed. The data to be processed may be video data or image data. Specifically, when a video cover needs to be generated for a video, the data of the video may be preprocessed in advance so as to obtain a series of images capable of representing the content of the video. The preprocessing may consist of decoding the video data, extracting a certain number of video frames from the decoded video data, and converting the extracted video frames into images with a certain encoding format by image encoding. The encoded images can then be used as the data to be processed. Alternatively, the video data of the video can be input directly into the video cover generation device as the data to be processed, so that the video data is processed in the generation device.
In this embodiment, the data to be processed that is input into the video cover generation device may be actively loaded by the generation device or passively received by it. In the latter case, the data to be processed can be sent to the video cover generation device by another device, so that the generation device receives the data to be processed. Alternatively, the data to be processed may be stored in a resource server; the video cover generation device may hold a storage address of the data to be processed and, by accessing that address, may initiate a data download request to the resource server so as to download the data to be processed.
In one embodiment, in order to improve the downloading efficiency of the data, the data to be processed may be divided into a plurality of data blocks, and stored in the resource server. In the resource server, an index list associated with the data to be processed divided into a plurality of data blocks may be further stored, and the index list may be used to indicate storage locations of the respective data blocks. Specifically, the index list may include storage identifiers of data blocks in the to-be-processed data. For example, referring to fig. 3, the index list may be represented as an array, where two columns of data may be included in the array, one column of data is a storage identifier of a data block, and the other column of data may be a name of the data block. The form of the storage identity may depend on the way the data block is stored. Specifically, if each data block has its own storage address, the storage identifier of the data block may be the storage address of the data block. For example, the storage address may point to a URL (Uniform Resource Locator) of the data block. If each data block is located at the same storage address, but each data block has its own storage number, the storage identifier of the data block may be the storage number of the data block. Of course, in practical applications, a part of the data blocks in the data to be processed may be stored at one storage address, and another part of the data blocks may be stored at another storage address, so that the storage identifier of the data block may be a combination of the storage address and the storage number of the data block.
In this embodiment, referring to fig. 4, when the device for generating a video cover needs to download the data to be processed from the resource server, a plurality of data blocks can be downloaded simultaneously in a multi-thread parallel downloading manner, so as to improve the efficiency of data downloading. Specifically, the video cover generation device may download the index list of the to-be-processed data from the resource server, and parse the content of the index list. In the index list, the storage identifier of each data block may be noted, and the data amount of each data block and the total data amount of the to-be-processed data may also be noted. In practical applications, if the total data amount of the to-be-processed data is small, the video cover generation device may sequentially download each data block in the to-be-processed data through only one thread. And if the total data volume of the data to be processed is large, at least two processing threads can be opened to download each data block in the data to be processed in parallel. Specifically, the video cover generation device may allocate each storage identifier indicated in the index list to a plurality of processing threads that are opened. Each processing thread can establish a respective download task, and the download task may include a storage identifier of a data block to be downloaded. Therefore, the data block pointed by the storage identifier to be processed can be downloaded in parallel through the at least two processing threads, and the acquisition efficiency of the data to be processed is improved.
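As an illustration of this parallel download, a minimal sketch follows. It assumes the index list has already been parsed into (block name, URL) pairs and that the blocks are served over plain HTTP; the patent does not prescribe a concrete storage protocol, library, or thread count, so all of those are placeholders here.

```python
import concurrent.futures
import urllib.request

def download_block(entry):
    """Download one data block pointed to by its storage identifier (here assumed to be a URL)."""
    name, url = entry
    with urllib.request.urlopen(url) as response:
        return name, response.read()

def download_to_be_processed(index_list, max_threads=4):
    """Download all data blocks of the to-be-processed data in parallel.

    index_list: list of (block_name, storage_identifier) pairs, e.g. parsed from
    the index array of Fig. 3 (assumed format).
    """
    blocks = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as pool:
        for name, data in pool.map(download_block, index_list):
            blocks[name] = data
    return blocks
```

For a small total data amount, a single thread could be used instead, mirroring the size check described above.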
S3: identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue.
In this embodiment, after acquiring the data to be processed, the video cover generation device may identify the type of the data to be processed, so as to adopt different processing modes for different data types. Specifically, video data and image data generally have different suffixes, and the type of the data to be processed can be identified from its name suffix. For example, data with the suffix avi or mp4 may be video data, while data with the suffix jpg or png may be image data.
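A suffix-based type check of the kind described here could look like the following sketch; the suffix sets are illustrative only and would need to cover whatever formats the platform accepts.

```python
VIDEO_SUFFIXES = {"avi", "mp4", "mkv", "flv"}   # illustrative, not exhaustive
IMAGE_SUFFIXES = {"jpg", "jpeg", "png", "bmp"}  # illustrative, not exhaustive

def identify_type(filename: str) -> str:
    """Identify the type of the data to be processed from its name suffix."""
    suffix = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if suffix in VIDEO_SUFFIXES:
        return "video"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    return "unknown"
```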
As shown in fig. 5, in one embodiment, if the data to be processed is video data, the video data may first be decoded. The video data can be decoded with a decoding mode corresponding to its coding mode. For example, commonly used codecs include H.261, H.263, H.264, MPEG, and the like. Since the decoded video data contains a large number of video frames, processing all of them would consume a large amount of computing resources and reduce the efficiency of generating the video cover. Therefore, in the present embodiment, a certain number of video frames can be extracted from the decoded video data, and only the extracted video frames are processed subsequently.
In the present embodiment, the number of video frames to extract may be specified in advance, and that number of video frames may then be extracted at random from the video data. Alternatively, video frames may be extracted sequentially from the decoded video data at a specified frame interval. For example, if the specified interval is 200 frames, one video frame may be extracted every 200 frames.
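For concreteness, fixed-interval extraction might be sketched as below; OpenCV is used as the decoder purely by way of example, since the patent does not name one, and interval=200 mirrors the example above.

```python
import cv2  # assumed decoder; the patent does not prescribe a library

def extract_frames_by_interval(video_path: str, interval: int = 200):
    """Decode the video and keep one frame every `interval` frames."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % interval == 0:
            frames.append(frame)  # decoded frame, passed on without re-encoding
        index += 1
    capture.release()
    return frames
```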
In one embodiment, in order to enable the extracted video frame to cover the content of the video more completely, a scene change frame may be determined in the decoded video data, and the scene change frame may be used as the video frame extracted from the decoded video data. The scene cut frame may be a video frame between two adjacent different scenes in the video. In order to obtain scene change frames corresponding to respective scenes of video data, the scene change frames may be extracted by frame-by-frame comparison in the present embodiment. Specifically, a reference frame may be determined in the video data first, and the similarity between each video frame subsequent to the reference frame and the reference frame may be calculated sequentially.
In this embodiment, the reference frame may be a frame of a picture randomly designated within a certain range. For example, the reference frame may be a frame of picture randomly selected within 2 minutes of the beginning of the video data. Of course, in order not to miss a scene in the video data, the first frame of the video data may be used as the reference frame.
In this embodiment, after the reference frame is determined, each frame picture after the reference frame may be sequentially compared with the reference frame from the reference frame to calculate the similarity between each subsequent frame picture and the reference frame. Specifically, when calculating the similarity between each video frame and the reference frame, the first feature vector and the second feature vector of the reference frame and the current video frame may be extracted, respectively.
In this embodiment, the first feature vector and the second feature vector may take various forms. The feature vector of each frame can be constructed from the pixel values of the pixel points in that frame. Each frame is usually formed by a plurality of pixel points arranged in a certain order, each corresponding to its own pixel value, which together form the picture. A pixel value may be a numerical value within a specified interval. For example, the pixel value may be a gray-scale value, which may be any value from 0 to 255, with the magnitude of the value representing the shade of gray. Of course, the pixel value may also be the respective values of several color components in another color space. For example, in an RGB (Red, Green, Blue) color space, the pixel values may include an R component value, a G component value, and a B component value.
In this embodiment, the pixel values of the pixel points in each frame can be obtained, and the feature vector of that frame is formed from the obtained pixel values. For example, for a current video frame with 9 × 9 = 81 pixels, the pixel values may be obtained sequentially and then arranged from left to right and from top to bottom to form an 81-dimensional vector. This 81-dimensional vector can be used as the feature vector of the current video frame.
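A pixel-value feature vector of this kind can be built by flattening the gray-scale values in row-major order (left to right, top to bottom), as in this sketch; NumPy and OpenCV are assumed for illustration.

```python
import cv2
import numpy as np

def pixel_feature_vector(frame: np.ndarray) -> np.ndarray:
    """Arrange the pixel values of a frame into a feature vector.

    For a 9 x 9 frame this yields an 81-dimensional vector, matching the example above.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # pixel values as gray levels 0-255
    return gray.astype(np.float32).flatten()        # row-major: left to right, top to bottom
```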
In this embodiment, the feature vector may be a CNN (Convolutional Neural Network) feature of each frame. Specifically, the reference frame and each frame picture after the reference frame may be input into a convolutional neural network, and then the convolutional neural network may output the feature vectors corresponding to the reference frame and each other frame picture.
In this embodiment, in order to accurately represent the contents shown in the reference frame and the current video frame, the first feature vector and the second feature vector may also represent scale-invariant features of the reference frame and the current video frame, respectively. In this way, even if the rotation angle, image brightness, or shooting angle of view changes, the extracted first and second feature vectors can still capture the contents of the reference frame and the current video frame well. Specifically, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded-Up Robust Features) features, color histogram features, or the like.
In this embodiment, after the first feature vector and the second feature vector are determined, the similarity between the first feature vector and the second feature vector may be calculated. In particular, the similarity may be expressed in vector space as a distance between two vectors. The closer the distance, the more similar the two vectors are represented, and thus the higher the similarity. The further the distance, the greater the difference between the two vectors and hence the lower the similarity. Therefore, in calculating the similarity between the reference frame and the current video frame, the spatial distance between the first feature vector and the second feature vector may be calculated, and the inverse of the spatial distance may be taken as the similarity between the reference frame and the current video frame. Thus, the smaller the spatial distance, the greater the corresponding similarity, indicating that the reference frame and the current video frame are more similar. Conversely, the greater the spatial distance, the less similarity it corresponds, indicating a greater dissimilarity between the reference frame and the current video frame.
In this embodiment, the similarity between each video frame subsequent to the reference frame and the reference frame may be sequentially calculated in the above manner. In order to identify different scenes in the video data, when the similarity between the reference frame and the current video frame is less than or equal to a specified threshold, the current video frame may be determined as a scene change frame. The specified threshold may be a preset value and may be flexibly adjusted according to actual conditions. For example, when the number of scene change frames screened out according to the specified threshold is too large, the specified threshold may be appropriately reduced. Conversely, when the number of scene change frames screened out is too small, the specified threshold may be appropriately increased. In this embodiment, a similarity less than or equal to the specified threshold may indicate that the contents of the two frames are significantly different, and therefore the scene shown in the current video frame may be considered to have changed from the scene shown in the reference frame. At this time, the current video frame can be retained as a frame at which the scene switches.
In this embodiment, when the current video frame is determined as one scene change frame, subsequent other scene change frames may be continuously determined. Specifically, from the reference frame to the current video frame, it can be considered that a scene has changed, and thus the current scene is the content displayed by the current video frame. Based on this, the current video frame may be used as a new reference frame, and the similarity between each video frame after the new reference frame and the new reference frame is sequentially calculated, so as to determine a next scene change frame according to the calculated similarity. Similarly, when determining the next scene switching frame, the similarity between two frames of pictures can still be determined by extracting the feature vector and calculating the spatial distance, and the determined similarity can still be compared with the specified threshold, so as to determine the next scene switching frame in which the scene changes again after the new reference frame. Thus, after each scene cut frame is determined, the scene cut frames can be used as video frames extracted from the decoded video data.
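Putting the above steps together, the frame-by-frame scene-change detection might be sketched as follows. The feature extractor and the threshold are placeholders, and similarity is taken as the reciprocal of the Euclidean distance, as described above; none of these specific choices are mandated by the text.

```python
import numpy as np

def similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Similarity as the reciprocal of the spatial (Euclidean) distance."""
    distance = float(np.linalg.norm(vec_a - vec_b))
    return 1.0 / (distance + 1e-6)  # small epsilon avoids division by zero

def scene_change_frames(frames, extract_feature, threshold: float):
    """Return the frames at which the scene changes.

    frames: decoded video frames in playback order (the first frame is used as the
    initial reference frame); extract_feature: any of the extractors described
    above (pixel vector, CNN feature, SIFT/SURF-based descriptor, ...).
    """
    cuts = []
    reference_vec = extract_feature(frames[0])
    for frame in frames[1:]:
        current_vec = extract_feature(frame)
        if similarity(reference_vec, current_vec) <= threshold:
            cuts.append(frame)           # scene has changed relative to the reference
            reference_vec = current_vec  # the current frame becomes the new reference frame
    return cuts
```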
S5: and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
Referring to fig. 5, in the present embodiment, if the data to be processed is image data, the image data may be decoded. In particular, the suffix of the image data indicates its encoding format, and the image data can be decoded with the corresponding decoding method. After the image data is decoded, the original image represented by the image data can be restored.
In this embodiment, after a video frame is extracted or an image is decoded, the video frame or image may be sent to a processing queue. The processing queue may be a queue in a buffer or in video memory, and the video frames/images in the processing queue may be processed sequentially according to a first-in first-out mechanism so as to generate a video cover. Specifically, when the video cover is generated, the contents displayed in the video frames/images corresponding to the same video may be integrated into one image. During content integration, key objects in the video frames/images can be extracted and then combined according to a certain arrangement format and stacking order, so that a video cover is formed. For example, if there are currently 10 video frames, the facial expressions of the persons in those 10 frames can be extracted and used as the key objects. The extracted facial expressions can then be integrated into one image to obtain the final video cover.
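Taken together, steps S1, S3 and S5 amount to a dispatch into a shared first-in first-out queue, sketched below. The queue capacity is illustrative, and the decoding/extraction callables stand for the routines discussed earlier.

```python
import queue

processing_queue: queue.Queue = queue.Queue(maxsize=256)  # FIFO processing queue; capacity is illustrative

def dispatch(data_type: str, path: str, decode_image, decode_video_and_extract):
    """Feed the processing queue according to the identified data type.

    decode_image / decode_video_and_extract are passed in as callables and stand
    for the decoding and frame-extraction steps described above.
    """
    if data_type == "video":
        for frame in decode_video_and_extract(path):  # decoded frames, never re-encoded
            processing_queue.put(frame)
    elif data_type == "image":
        processing_queue.put(decode_image(path))      # decoded image enters the same queue
```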
In one embodiment, the to-be-processed data acquired in step S1 may also include textual description information. The textual description information may be a title of the video or a brief description of the video. The title and the brief description may be edited in advance by the video producer or uploader, or may be added by a worker reviewing videos, which is not limited in this application. Of course, in practical applications, in addition to the title and the brief description, the textual description information may include text labels of the video or descriptive phrases extracted from the bullet-screen (danmaku) comments of the video.
In this embodiment, the textual description information can indicate the subject of the video relatively accurately. Therefore, a corresponding theme tag of the video can be extracted from the textual description information. Specifically, the video playing website can collect a large amount of textual description information of videos, screen out the various text labels that may serve as video themes, and construct a text label library from the screened labels. The content of the text label library can be updated continuously. In this way, when a theme tag is extracted from the textual description information, the textual description information can be matched with each text label in the text label library, and the matched text label is used as the theme tag of the video. For example, if the textual description information of the video is "at the moment of the Infinity War, who will go and who will stay", then matching it against the text labels in the library may yield the result "super hero". Thus, "super hero" may be used as the theme tag of the video.
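As a concrete illustration of this matching step, a minimal sketch is given below. The label library contents are hypothetical and a literal substring match is used for simplicity; the "Infinity War" example above suggests that in practice the matching may be a looser semantic association, which the patent leaves open.

```python
TEXT_LABEL_LIBRARY = ["super hero", "martial arts", "romance", "documentary"]  # hypothetical library

def extract_theme_tags(text_description: str, label_library=TEXT_LABEL_LIBRARY):
    """Match the textual description against the label library and return the matched labels."""
    text = text_description.lower()
    return [label for label in label_library if label.lower() in text]
```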
In this embodiment, a scene tag may be set for each extracted video frame or image, where the scene tag may be a text tag representing the content shown in that video frame or image. For example, if two people are fighting each other in a video frame, the scene tag corresponding to that video frame may be "martial arts", "fighting", or "kung fu". Specifically, a target object contained in the video frame or image may be identified through an image recognition technique, and a word or phrase representing the target object is used as the scene tag of the video frame or image.
In this embodiment, it is recognized that the content shown by a video frame or image may not be closely related to the theme of the video. In order for the generated video cover to accurately reflect the theme of the video, target frames/target images can be screened from the plurality of video frames/images according to the association between each scene tag and the theme tag.
Taking video frames as an example, in this embodiment the association between a scene tag and the theme tag may refer to the degree of similarity between them. The more similar the scene tag and the theme tag, the more relevant the content presented by the video frame is to the theme of the video. Specifically, determining the association between the scene tag and the theme tag may include calculating the similarity between the scene tag of each video frame and the theme tag. In practical applications, both the scene tag and the theme tag may consist of words, and when calculating their similarity, the scene tag and the theme tag may each be represented as a word vector. In this way, the similarity between the scene tag and the theme tag can be represented by the spatial distance between the two word vectors. The closer the spatial distance between the two word vectors, the higher the similarity between the scene tag and the theme tag; conversely, a greater spatial distance indicates a lower similarity. Thus, in an actual application scenario, the inverse of the spatial distance between the two word vectors may be used as the similarity between the scene tag and the theme tag.
In this embodiment, after the similarity between the scene tag and the topic tag is calculated, a video frame of which the calculated similarity is greater than or equal to a specified similarity threshold may be determined as the target frame. The specified similarity threshold may be used as a threshold for measuring whether the video frame is sufficiently associated with the theme, and when the similarity is greater than or equal to the specified similarity threshold, it may be indicated that the current video frame is sufficiently associated with the theme of the video, and the content displayed by the video frame may accurately reflect the theme of the video, so that the video frame may be determined as the target frame.
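Screening target frames by tag similarity can be sketched as follows; `embed` stands for any word-vector model (for example a pre-trained word2vec lookup), which the patent does not specify, and similarity is again taken as the reciprocal of the vector distance.

```python
import numpy as np

def tag_similarity(tag_a: str, tag_b: str, embed) -> float:
    """Similarity between two tags as the reciprocal of their word-vector distance."""
    distance = float(np.linalg.norm(embed(tag_a) - embed(tag_b)))
    return 1.0 / (distance + 1e-6)

def select_target_frames(frames_with_tags, theme_tag: str, embed, min_similarity: float):
    """Keep frames whose scene tag is sufficiently associated with the theme tag.

    frames_with_tags: iterable of (frame, scene_tag) pairs.
    """
    return [frame for frame, scene_tag in frames_with_tags
            if tag_similarity(scene_tag, theme_tag, embed) >= min_similarity]
```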
In addition, in practical applications, the video frame corresponding to the scene tag with the maximum similarity may also be determined as the target frame. After the target frames are screened from the video frames, a video cover can be generated based on the screened target frames. Specifically, if at least two target frames are screened out, their display contents can be integrated into a single video cover, so as to obtain a video cover matching the theme of the video. During content integration, key objects in the target frames can be extracted and then combined according to a certain arrangement format and stacking order, so that a video cover is formed. For example, if there are 10 target frames in total, the facial expressions of the persons can be extracted from those 10 frames and used as the key objects. The extracted facial expressions can then be integrated into one image to obtain the final video cover. Of course, if only one target frame is screened out, that target frame can be used directly as the video cover, which simplifies the process of generating the video cover.
It should be noted that after the data to be processed is downloaded by multiple threads, each thread may continue to perform type identification and subsequent processing on the data block it downloaded. For example, in fig. 4, after the current thread finishes downloading the current data block, the data type of that block may be identified; if the block is video data, it may be decoded and video frames may be extracted; if the block contains images, the images may be decoded sequentially and the decoded images sent to the processing queue.
In one embodiment, considering that the video data input into the video cover generation device may come from different videos, in order to avoid confusion when generating video covers, after a video frame is extracted from the decoded video data, an identifier characterizing that decoded video data may be added to the video frame, and the video frame carrying the identifier is sent to the processing queue, so that a video cover can be generated based on the video frames with the same identifier in the processing queue. The identifier may be the background number of the video or a character string obtained by hashing the background number. The specific form of the identifier is not limited in the present application, as long as one video can be distinguished from other videos.
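Carrying a per-video identifier with each frame, as described above, might look like the following sketch; the hash of a background number is one of the identifier forms the text mentions, and the queue is assumed to expose put(), like Python's queue.Queue.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class QueuedFrame:
    video_id: str   # identifier of the decoded video data this frame came from
    frame: object   # the decoded frame itself

def video_identifier(background_number: str) -> str:
    """Derive an identifier by hashing the video's background number."""
    return hashlib.sha256(background_number.encode("utf-8")).hexdigest()

def enqueue_frames(processing_queue, background_number: str, frames):
    """Attach the identifier to each extracted frame before it enters the processing queue."""
    vid = video_identifier(background_number)
    for frame in frames:
        processing_queue.put(QueuedFrame(video_id=vid, frame=frame))
```

Downstream, frames sharing the same video_id would be grouped when the cover for that video is composed.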
In one embodiment, referring to fig. 6, the video data decoding process and the image data decoding process can be performed by a CPU (Central Processing Unit) and/or a GPU (Graphics Processing Unit). In practical applications, if the decoding speed is too fast while the processing speed of the video frames/images in the processing queue is too slow, the processing queue overflows, some of the extracted video frames and decoded images are discarded, and the generated video cover ultimately cannot accurately represent the content of the video. Therefore, in this embodiment, the decoding speed of the CPU/GPU and the processing speed of the video frames/images in the processing queue need to be balanced. In particular, the remaining space of the processing queue not filled with video frames/images at the current time may be detected; the larger the remaining space, the faster decoding may proceed. In this way, the current decoding speed can be determined based on the remaining space, so that after the data to be processed is decoded according to the current decoding speed, there is always remaining space in the processing queue to accommodate the video frames/images being sent, avoiding the situation where the processing queue has no room for incoming frames because decoding was too fast.
In this embodiment, when determining the current decoding speed based on the residual space, the speed at which the video frames/images in the processing queue are processed may be obtained in advance, and the speed may be used as a reference speed for decoding. Meanwhile, because a certain amount of residual space exists in the processing queue, the decoding speed can be properly increased on the basis of the reference speed, so that the video frames/images to be processed always exist in the processing queue. In this embodiment, a preset correlation between the remaining space in the processing queue and the gain decoding speed may be established, and the gain decoding speed may be an amount of speed additionally increased based on the reference speed. The preset correlation between the residual space and the gain decoding speed may be expressed in that as the residual space decreases, the gain decoding speed also gradually decreases until the residual space decreases to 0. Thus, according to the current remaining space in the processing queue, the target gain decoding speed associated with the current remaining space can be determined, and the sum of the speed at which the video frame/image is processed and the target gain decoding speed can be used as the current decoding speed. For example, if the video frames/pictures are processed at a rate of 50 pictures per second and the gain decoding rate is 10 pictures per second, then the current decoding rate may be 60 pictures per second. The adjustment of the decoding speed can be realized by controlling the computing resources of the CPU or the GPU, and the faster the decoding speed is, the more computing resources are required.
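The speed-balancing rule described above can be expressed as a small function; the mapping from remaining space to gain decoding speed is a placeholder (linear here), since the text only requires that the gain shrink toward zero as the queue fills.

```python
def current_decoding_speed(processing_speed: float,
                           remaining_slots: int,
                           queue_capacity: int,
                           max_gain: float = 10.0) -> float:
    """Decoding speed = processing speed + a gain that shrinks with the remaining space.

    Example from the text: processing at 50 frames/s plus a 10 frames/s gain
    gives a current decoding speed of 60 frames/s.
    """
    gain = max_gain * (remaining_slots / queue_capacity)  # assumed linear association
    return processing_speed + gain
```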
Referring to fig. 7, the present application further provides an apparatus for generating a video cover, the apparatus includes a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, implements the following steps:
s1: acquiring data to be processed, wherein the data to be processed comprises image data or video data;
s3: identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue;
s5: and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
In one embodiment, the computer program, when executed by the processor, further implements the steps of:
and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
In one embodiment, the computer program, when executed by the processor, further implements the steps of:
extracting video frames from the decoded video data according to the specified interval frame number; or
And determining a scene switching frame in the decoded video data, and using the scene switching frame as a video frame extracted from the decoded video data.
In one embodiment, the computer program, when executed by the processor, further implements the steps of:
detecting the residual space of the processing queue not filled with the video frames/images at the current moment, and determining the current decoding speed based on the residual space, so that the residual space for containing the sent video frames/images exists in the processing queue after the data to be processed is decoded according to the current decoding speed.
In one embodiment, the computer program, when executed by the processor, further implements the steps of:
adding an identifier for representing the decoded video data to the video frame, and sending the video frame carrying the identifier to the processing queue, so that a video cover is generated based on the video frames with the same identifier in the processing queue.
In this embodiment, the memory may include a physical device for storing information, and typically, the information is digitized and then stored in a medium using an electrical, magnetic, or optical method. The memory according to this embodiment may further include: devices that store information using electrical energy, such as RAM, ROM, etc.; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories, usb disks; devices for storing information optically, such as CDs or DVDs. Of course, there are other ways of memory, such as quantum memory, graphene memory, and so forth.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
The specific functions implemented by the memory and the processor of the apparatus for generating a video cover provided in the embodiment of this specification can be explained in comparison with the foregoing embodiments in this specification, and can achieve the technical effects of the foregoing embodiments, and thus, no further description is provided here.
The present application further provides a computer storage medium having a computer program stored therein, which when executed by a processor, performs the steps of:
s1: acquiring data to be processed, wherein the data to be processed comprises image data or video data;
s3: identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue;
s5: and if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue.
Therefore, the video cover generation apparatus of the present application extends the functions of the video cover generation device in the prior art: it can process not only input image data but also input video data. The apparatus can identify the type of the input data; when the current data is identified as video data, the video data can be decoded to obtain the video frames it contains. A certain number of video frames can then be extracted from the decoded video frames and sent directly to a processing queue without an image encoding step, and the video frames in the processing queue can be used to subsequently produce video covers. Compared with the prior art, the technical solution provided by the application therefore expands the types of data that can be processed. Moreover, after a video frame is extracted from the decoded video data, the extracted video frame does not need to be encoded and can instead be sent directly into the processing queue. This saves the steps of encoding the video frame and subsequently decoding the encoded frame, which simplifies the video data processing flow and improves the efficiency of video cover generation.
In the 1990s, it was clear whether an improvement to a technology was an improvement in hardware (e.g., an improvement in circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement in a method flow). However, as technology develops, many of today's improvements to method flows can already be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually making integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled also has to be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by slightly programming the method flow into an integrated circuit using the above hardware description languages.
Those skilled in the art will also appreciate that, in addition to implementing the apparatus and computer storage medium as pure computer readable program code means, the apparatus and computer storage medium may well be implemented by logically programming method steps to perform the same functions, in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such apparatus and computer storage media may thus be considered to be a hardware component, and the means for performing the various functions included therein may also be considered to be structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for embodiments of the apparatus and the computer storage medium, reference may be made to the preceding description of embodiments of the method, as opposed to the explanation.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Although the present application has been described by way of embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations.

Claims (14)

1. A method for generating a video cover, the method comprising:
acquiring data to be processed, wherein the data to be processed comprises image data or video data;
identifying the type of the data to be processed, if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue; the video frames in the processing queue come from different videos, and the video frames and/or images in the processing queue are sequentially processed according to a first-in first-out mechanism;
if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue;
after extracting the video frame from the decoded video data, the method further comprises:
adding an identifier for representing the decoded video data to the video frame, wherein the identifier of the video data is used for distinguishing that the video data come from different videos, and sending the video frame carrying the identifier to the processing queue, so that a video cover is generated based on the video frames with the same identifier in the processing queue.
2. The method according to claim 1, wherein the data to be processed further includes textual description information; accordingly, generating a video cover based on the video frames in the processing queue comprises:
setting a scene label for the video frame, and extracting a theme label from the text description information;
and screening target frames from the video frames according to the relevance between the scene tags and the theme tags, and generating a video cover based on the display content of the target frames.
3. The method of claim 2, wherein the filtering out the target frame from the video frames comprises:
calculating the similarity between the scene label and the theme label, and determining the video frame corresponding to the scene label with the calculated similarity being greater than or equal to a specified similarity threshold as the target frame; or determining the video frame corresponding to the scene label with the maximum similarity as the target frame.
4. The method of claim 1, wherein extracting the video frame from the decoded video data comprises:
extracting video frames from the decoded video data according to the specified interval frame number;
or
And determining a scene switching frame in the decoded video data, and using the scene switching frame as a video frame extracted from the decoded video data.
5. The method of claim 4, wherein determining a scene cut frame in the decoded video data comprises:
determining a reference frame in the decoded video data, and sequentially calculating the similarity between a video frame after the reference frame and the reference frame;
if the similarity between the current video frame and the reference frame in the decoded video data is less than or equal to a specified threshold, determining the current video frame as a scene switching frame;
and taking the current video frame as a new reference frame, and sequentially calculating the similarity between the video frame after the new reference frame and the new reference frame so as to determine the next scene switching frame according to the calculation result.
6. The method according to claim 1, wherein the data to be processed is downloaded from a resource server, and the data to be processed is divided into a plurality of data blocks for storage in the resource server; accordingly, acquiring the data to be processed includes:
downloading an index list of the data to be processed from the resource server, wherein the index list comprises storage identifiers of data blocks in the data to be processed;
opening at least two processing threads, and assigning, according to the index list, storage identifiers to be processed to each of the at least two processing threads;
and downloading, in parallel through the at least two processing threads, the data blocks pointed to by the storage identifiers to be processed.
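The block-parallel download in claim 6 can be sketched with a small thread pool. Here fetch_block stands in for whatever call actually retrieves one data block from the resource server by its storage identifier; it, and the two-thread default, are assumptions made for the example.

    from concurrent.futures import ThreadPoolExecutor

    def download_blocks(index_list, fetch_block, num_threads=2):
        # index_list: storage identifiers of the data blocks, in order
        # fetch_block: callable that downloads one block (as bytes) by identifier
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # map keeps the results in index-list order even though the
            # downloads themselves run in parallel across the worker threads
            blocks = list(pool.map(fetch_block, index_list))
        return b"".join(blocks)
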
7. The method of claim 1, wherein in decoding the data to be processed, the method further comprises:
detecting the remaining space in the processing queue that is not filled with video frames/images at the current moment, and determining the current decoding speed based on the remaining space, so that after the data to be processed is decoded at the current decoding speed, the processing queue still has remaining space to hold the video frames/images being sent.
8. The method of claim 7, wherein determining a current decoding speed based on the remaining space comprises:
acquiring the speed at which the video frames/images in the processing queue are processed;
determining a target gain decoding speed associated with the current remaining space in the processing queue according to a preset association between remaining space and gain decoding speed;
and taking the sum of the speed at which the video frames/images are processed and the target gain decoding speed as the current decoding speed.
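Claims 7 and 8 describe a decode-rate control loop: the decoder runs at the consumer's processing speed plus a gain looked up from how much space is still free in the queue. The sketch below uses an invented gain table purely to show the lookup-plus-sum structure; the actual mapping is a preset association the patent leaves open.

    # illustrative mapping from free queue slots to an extra ("gain") decoding
    # speed in frames per second; the values are placeholders, not from the patent
    GAIN_BY_FREE_SLOTS = [
        (32, 20.0),
        (16, 10.0),
        (4, 2.0),
        (0, 0.0),
    ]

    def current_decoding_speed(queue_capacity, queue_length, processing_speed):
        free_slots = queue_capacity - queue_length
        gain = next((g for min_free, g in GAIN_BY_FREE_SLOTS if free_slots >= min_free), 0.0)
        # decoding at the processing speed plus the gain keeps the queue from
        # overflowing while still leaving room for the frames being sent
        return processing_speed + gain
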
9. An apparatus for generating a video cover, the apparatus comprising a memory and a processor, the memory having stored therein a computer program, the computer program when executed by the processor implementing the steps of:
acquiring data to be processed, wherein the data to be processed comprises image data or video data;
identifying the type of the data to be processed; if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data; sending the extracted video frames to a processing queue to generate a video cover based on the video frames in the processing queue; wherein the video frames in the processing queue come from different videos, and the video frames and/or images in the processing queue are processed sequentially according to a first-in first-out mechanism;
if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue;
the computer program, when executed by the processor, further implements the steps of:
adding, to the video frame, an identifier for representing the decoded video data, wherein the identifier is used for distinguishing video data coming from different videos, and sending the video frame carrying the identifier to the processing queue, so that a video cover is generated based on the video frames with the same identifier in the processing queue.
10. The apparatus according to claim 9, wherein the data to be processed further includes textual description information; accordingly, the computer program, when executed by the processor, further implements the steps of:
setting a scene label for the video frame, and extracting a theme label from the textual description information;
and screening target frames from the video frames according to the relevance between the scene labels and the theme labels, and generating a video cover based on the display content of the target frames.
11. The apparatus of claim 10, wherein the computer program, when executed by the processor, further performs the steps of:
calculating the similarity between the scene label and the theme label, and determining, as the target frame, the video frame corresponding to a scene label whose calculated similarity is greater than or equal to a specified similarity threshold; or determining, as the target frame, the video frame corresponding to the scene label with the maximum similarity.
12. The apparatus of claim 9, wherein the computer program, when executed by the processor, further performs the steps of:
extracting video frames from the decoded video data at a specified frame interval; or
determining a scene switching frame in the decoded video data, and using the scene switching frame as the video frame extracted from the decoded video data.
13. The apparatus of claim 9, wherein the computer program, when executed by the processor, further performs the steps of:
detecting the remaining space in the processing queue that is not filled with video frames/images at the current moment, and determining the current decoding speed based on the remaining space, so that after the data to be processed is decoded at the current decoding speed, the processing queue still has remaining space to hold the video frames/images being sent.
14. A computer storage medium, in which a computer program is stored, which computer program, when executed by a processor, performs the steps of:
acquiring data to be processed, wherein the data to be processed comprises image data or video data;
identifying the type of the data to be processed; if the data to be processed is video data, decoding the data to be processed, and extracting a video frame from the decoded video data;
adding an identifier for characterizing the decoded video data to the video frame;
sending the extracted video frames to a processing queue so as to generate a video cover based on the video frames with the same identifier in the processing queue; wherein the video frames in the processing queue come from different videos, and the video frames and/or images in the processing queue are processed sequentially according to a first-in first-out mechanism;
if the data to be processed is image data, decoding the data to be processed, and sending the decoded image to the processing queue so as to generate a video cover based on the image in the processing queue;
after extracting the video frame from the decoded video data, the computer program, when executed by the processor, further performs the steps of:
adding, to the video frame, an identifier for representing the decoded video data, wherein the identifier is used for distinguishing video data coming from different videos, and sending the video frame carrying the identifier to the processing queue, so that a video cover is generated based on the video frames with the same identifier in the processing queue.
CN201810286238.3A 2018-03-30 2018-03-30 Video cover generation method and device and computer storage medium Active CN110324706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810286238.3A CN110324706B (en) 2018-03-30 2018-03-30 Video cover generation method and device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810286238.3A CN110324706B (en) 2018-03-30 2018-03-30 Video cover generation method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN110324706A CN110324706A (en) 2019-10-11
CN110324706B true CN110324706B (en) 2022-03-04

Family

ID=68112027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810286238.3A Active CN110324706B (en) 2018-03-30 2018-03-30 Video cover generation method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN110324706B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110856037B (en) * 2019-11-22 2021-06-22 北京金山云网络技术有限公司 Video cover determination method and device, electronic equipment and readable storage medium
CN111491182B (en) * 2020-04-23 2022-03-29 百度在线网络技术(北京)有限公司 Method and device for video cover storage and analysis
CN112434234B (en) * 2020-05-15 2023-09-01 上海哔哩哔哩科技有限公司 Frame extraction method and system based on browser
CN111654673B (en) * 2020-06-01 2021-11-23 杭州海康威视系统技术有限公司 Video cover updating method and device and storage medium
CN111918025A (en) * 2020-06-29 2020-11-10 北京大学 Scene video processing method and device, storage medium and terminal
CN111901679A (en) * 2020-08-10 2020-11-06 广州繁星互娱信息科技有限公司 Method and device for determining cover image, computer equipment and readable storage medium
CN112911337B (en) * 2021-01-28 2023-06-20 北京达佳互联信息技术有限公司 Method and device for configuring video cover pictures of terminal equipment
CN113051236B (en) * 2021-03-09 2022-06-07 北京沃东天骏信息技术有限公司 Method and device for auditing video and computer-readable storage medium
CN113301422B (en) * 2021-05-24 2023-05-02 腾讯音乐娱乐科技(深圳)有限公司 Method, terminal and storage medium for acquiring video cover
CN113067989B (en) * 2021-06-01 2021-09-24 神威超算(北京)科技有限公司 Data processing method and chip
CN116777914B (en) * 2023-08-22 2023-11-07 腾讯科技(深圳)有限公司 Data processing method, device, equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628009B2 (en) * 2015-06-26 2020-04-21 Rovi Guides, Inc. Systems and methods for automatic formatting of images for media assets based on user profile

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105094513A (en) * 2014-05-23 2015-11-25 腾讯科技(北京)有限公司 User avatar setting method and apparatus as well as electronic device
CN106572380A (en) * 2016-10-19 2017-04-19 上海传英信息技术有限公司 User terminal and video dynamic thumbnail generating method
CN106713964A (en) * 2016-12-05 2017-05-24 乐视控股(北京)有限公司 Method of generating video abstract viewpoint graph and apparatus thereof
CN107832724A (en) * 2017-11-17 2018-03-23 北京奇虎科技有限公司 The method and device of personage's key frame is extracted from video file

Also Published As

Publication number Publication date
CN110324706A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110324706B (en) Video cover generation method and device and computer storage medium
US10885100B2 (en) Thumbnail-based image sharing method and terminal
EP2517470B1 (en) Systems and methods for video-aware screen capture and compression
US8006201B2 (en) Method and system for generating thumbnails for video files
CN104540000A (en) Method for generating dynamic thumbnail and terminal
CN110692251B (en) Method and system for combining digital video content
JP2022501891A (en) Devices and methods for artificial intelligence
US20230186452A1 (en) Method and system for generating video cover based on browser
US10771792B2 (en) Encoding data arrays
US9053526B2 (en) Method and apparatus for encoding cloud display screen by using application programming interface information
EP2977987A1 (en) Method and apparatus for displaying video
CN113705300A (en) Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
CN106557529B (en) Display method and device for jump page in page jump
KR20210064587A (en) High speed split device and method for video section
WO2021227532A1 (en) Browser-based frame extraction method and system
Wong et al. Complete quality preserving data hiding in animated GIF with reversibility and scalable capacity functionalities
CN113821677A (en) Method, device and equipment for generating cover image and storage medium
CN112188215B (en) Video decoding method, device, equipment and storage medium
CN111339367A (en) Video processing method and device, electronic equipment and computer readable storage medium
US9066071B2 (en) Method and apparatus for providing screen data
Vasudevan et al. Video Summarization on E-Sport
CN116977888A (en) Video processing method, apparatus, device, storage medium, and computer program product
CN117834915A (en) Image processing method, device, computer equipment and storage medium
CN114745600A (en) Video label labeling method and device based on SEI
CN118018750A (en) Video processing method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200512

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Applicant before: Youku network technology (Beijing) Co., Ltd

GR01 Patent grant