CN109388721B - Method and device for determining cover video frame

Method and device for determining cover video frame

Info

Publication number: CN109388721B
Application number: CN201811217665.2A
Authority: CN (China)
Prior art keywords: video frame, article, text, cover, determining
Inventors: 赵翔, 李鑫, 刘霄, 李旭斌, 孙昊, 文石磊, 丁二锐
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority/filing date: 2018-10-18
Publication of CN109388721A: 2019-02-26
Publication of CN109388721B (grant): 2021-05-28
Other languages: Chinese (zh)
Legal status: Active
Abstract

The invention provides a method and a device for determining a cover video frame. The method comprises the following steps: extracting keywords of an article text and acquiring a first vector corresponding to each keyword; extracting a subject word of each video frame within a preset time period in the article video and acquiring a second vector corresponding to each subject word; calculating the similarity between each video frame and the article text according to the second vectors corresponding to the subject words and the first vectors corresponding to the keywords; and determining a target video frame as the cover video frame according to the similarity between each video frame and the article text. The video frame used as the cover is thus consistent with the article content, automatic adaptation between the cover video frame and the article content is achieved, and the efficiency of cover determination as well as the user's click-through rate and browsing experience are improved.

Description

Method and device for determining cover video frame
Technical Field
The invention relates to the technical field of multimedia information, in particular to a method and a device for determining a cover video frame.
Background
With the rapid development of the mobile Internet, more and more videos appear in articles. For example, a pushed article on a social network may contain many video clips to make the article more engaging, and a video inserted into an article is displayed with a video cover so that users can better understand the video content. In the related art, however, the video frame used as the cover is a default frame or is selected at random. As a result, the cover frame often does not match the article content, fails to arouse the user's interest in clicking, and leads to a low click-through rate and browsing rate for the video.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a method for determining a cover video frame, so as to automatically adapt the video frame used as a cover to the content of the article.
A second object of the present invention is to propose a device for determining a cover video frame.
A third object of the invention is to propose a computer program product.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
In order to achieve the above objects, an embodiment of a first aspect of the present invention provides a method for determining a cover video frame, comprising the following steps: extracting keywords of an article text, and acquiring a first vector corresponding to each keyword; extracting a subject word of each video frame within a preset time period in the article video, and acquiring a second vector corresponding to each subject word; calculating the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword; and determining a target video frame as the cover video frame according to the similarity between each video frame and the article text.
In addition, the method for determining a cover video frame according to the embodiment of the present invention has the following additional technical features:
Optionally, the cover video frame is determined to be an article cover frame, and/or the cover video frame is determined to be a video cover frame.
Optionally, the extracting a subject word of each video frame within a preset time period in the article video comprises: detecting whether each video frame contains a face, and if a video frame contains a face, extracting face features; and querying a preset face database to obtain the subject word corresponding to the face features.
Optionally, the extracting a subject word of each video frame within a preset time period in the article video comprises: detecting whether each video frame contains an object of a preset category, and if so, extracting object features; and querying a preset object database to obtain the subject word corresponding to the object features.
Optionally, the calculating the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword comprises: calculating a sub-distance between the second vector corresponding to each subject word in each video frame and the first vector corresponding to each keyword; adding all the sub-distances corresponding to each video frame to obtain a corresponding total distance; and taking the reciprocal of the total distance of each video frame as the similarity between that video frame and the article text.
Optionally, the determining, according to the similarity between each video frame and the article text, a target video frame as the cover video frame comprises: comparing the similarities between the video frames and the article text, and taking the target video frame corresponding to the maximum similarity as the cover video frame.
Optionally, the method further comprises: acquiring one or more image quality indexes of each video frame; and the determining a target video frame as the cover video frame according to the similarity between each video frame and the article text comprises: acquiring a weight corresponding to each image quality index and a weight corresponding to the similarity; calculating score data of each video frame according to each image quality index of the video frame and its corresponding weight, together with the similarity between the video frame and the article text and its corresponding weight; and determining, according to the score data of the video frames, the target video frame corresponding to the maximum score data as the cover video frame.
An embodiment of a second aspect of the present invention provides a device for determining a cover video frame, comprising: a first acquisition module, configured to extract keywords of the article text and acquire a first vector corresponding to each keyword; a second acquisition module, configured to extract a subject word of each video frame within a preset time period in the article video and acquire a second vector corresponding to each subject word; a calculation module, configured to calculate the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword; and a cover determination module, configured to determine a target video frame as the cover video frame according to the similarity between each video frame and the article text.
An embodiment of a third aspect of the present invention provides a computer program product which, when its instructions are executed by a processor, implements the method for determining a cover video frame according to the foregoing method embodiments.
An embodiment of a fourth aspect of the present invention provides a non-transitory computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the method for determining a cover video frame according to the foregoing method embodiments.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects:
Keywords of an article text are extracted and a first vector corresponding to each keyword is acquired; a subject word of each video frame within a preset time period in the article video is extracted and a second vector corresponding to each subject word is acquired; the similarity between each video frame and the article text is calculated according to the second vectors corresponding to the subject words and the first vectors corresponding to the keywords; and a target video frame is then determined as the cover video frame according to the similarity between each video frame and the article text. The video frame used as the cover is thus consistent with the article content, automatic adaptation between the cover video frame and the article content is achieved, and the efficiency of cover determination as well as the user's click-through rate and browsing experience are improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1-1 is a schematic diagram of a scene of cover video frame determination results according to an embodiment of the present invention;
FIG. 1-2 is a schematic diagram of a scene of cover video frame determination results according to another embodiment of the present invention;
FIG. 2 is a flowchart of a method for determining a cover video frame according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for determining a cover video frame according to another embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining a cover video frame according to yet another embodiment of the present invention;
FIG. 5 is a flowchart of a method for determining a cover video frame according to yet another embodiment of the present invention;
FIG. 6 is a schematic diagram of an application scenario of the method for determining a cover video frame according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a device for determining a cover video frame according to an embodiment of the present invention; and
FIG. 8 is a schematic structural diagram of a device for determining a cover video frame according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
A method and device for determining a cover video frame according to embodiments of the present invention will now be described with reference to the drawings. As noted in the background, the related art lacks a method for selecting the cover video frame and does not consider the gain in user traffic, such as click-through rate, that matching imagery to text can bring. In the embodiments of the present invention, different video covers are displayed for the same inserted video depending on the content of the article, so that the cover adapts to the article in which the video is located, improving the click-through rate of the video and the reading experience of the article.
The cover video frame can serve as the cover of a video inserted into an article, where the video can be inserted at any position of the article as required. It can also serve as the cover of the article itself, for example as the cover of an article pushed by a WeChat official account, or as the link thumbnail when the article link is shared on a social platform such as WeChat Moments.
For example, when the cover video frame is used as a video cover: for the same inserted video A, when it is inserted into article 1, which describes a star, the displayed video cover is a video frame containing the star, as shown in the left diagram of FIG. 1-1; when it is inserted into article 2, which describes a building, the displayed video cover is a video frame showing the building, as shown in the right diagram of FIG. 1-1. In this example the video is inserted in the middle of the article.
For another example, when the cover video frame is used as the cover of an article pushed by an official account: for the same inserted video A, when it is inserted into article 1, which describes a star, the displayed article cover is a video frame containing the star, as shown in the left diagram of FIG. 1-2; when it is inserted into article 2, which describes a building, the displayed article cover is a video frame containing the building, as shown in the right diagram of FIG. 1-2.
FIG. 2 is a flowchart of a method for determining a cover video frame according to an embodiment of the present invention. As shown in FIG. 2, the method comprises:
Step 101: keywords of the article text are extracted, and a first vector corresponding to each keyword is acquired.
The first vector represents the features of a keyword as a multi-dimensional embedding reflecting the distributional statistics of word sequences. Methods for generating the first vector include neural networks, dimensionality reduction of a word co-occurrence matrix, probabilistic models, and the like.
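As an illustration of how such first vectors might be generated, the sketch below trains a small neural word-embedding model and looks up the vector of a keyword. The patent does not prescribe a library or dimensionality; gensim's Word2Vec, the toy corpus, and the 100-dimensional size are assumptions made purely for illustration.

```python
# Minimal sketch: neural word embeddings as "first vectors" (gensim assumed).
from gensim.models import Word2Vec

# Hypothetical pre-segmented corpus of article sentences (lists of tokens).
corpus = [
    ["star", "attends", "movie", "premiere"],
    ["building", "design", "wins", "architecture", "award"],
]

# Train a small skip-gram model; in practice a model pre-trained on a large
# corpus would be used so that every keyword has a reliable vector.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

def first_vector(keyword):
    """Look up the first vector corresponding to a keyword."""
    return model.wv[keyword]  # a 100-dimensional numpy array

print(first_vector("star").shape)  # -> (100,)
```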
Specifically, in the embodiment of the present invention, keywords of the article text are extracted; the keywords represent the main ideas embodied by the article and are processed into first vectors so that the main ideas of the article can later be compared by similarity.
It should be noted that the way of extracting keywords from the article text differs across application scenarios. As one possible implementation, after part-of-speech analysis and word segmentation are performed on the article text, the frequency of occurrence of each word segment is counted, and words with a higher frequency of occurrence are taken as keywords. As another possible implementation, the article text is input into a preset learning model whose input is the article text and whose output is the main idea of the text; after the main idea output by the model is obtained, part-of-speech analysis and word segmentation are performed on the article text, the correlation between each word segment and the main idea is calculated, and words whose correlation is greater than a certain value are taken as keywords.
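The first implementation above might be sketched as follows: segment the text, analyze parts of speech, count word frequencies, and keep the most frequent content words as keywords. The jieba segmenter, the noun/verb filter, and the top_k parameter are assumptions; the patent fixes neither the segmenter nor a frequency threshold.

```python
# Sketch of frequency-based keyword extraction (jieba assumed for Chinese).
from collections import Counter
import jieba.posseg as pseg

def extract_keywords(article_text, top_k=5):
    # Part-of-speech analysis and word segmentation; keep nouns and verbs
    # (flags starting with 'n' or 'v'), dropping short function words.
    words = [w.word for w in pseg.cut(article_text)
             if w.flag[:1] in ("n", "v") and len(w.word) > 1]
    # Words with a higher frequency of occurrence are taken as keywords.
    return [word for word, _ in Counter(words).most_common(top_k)]
```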
Step 102: a subject word of each video frame within a preset time period in the article video is extracted, and a second vector corresponding to each subject word is acquired.
The article video refers to the video inserted into the article.
Specifically, to facilitate identifying video frames consistent with the article keywords, each video frame is processed into a second vector having the same dimensionality as the first vector; the second vector represents the main content of the video frame. Of course, when the inserted video is a complete video, the video within a preset time period is selected in order to improve processing efficiency. As one possible implementation, considering that the climax of a video, that is, the part that best reflects its content, is located in the middle of the video, the second vectors are obtained based on the video frames between thirty percent and seventy percent of the video duration; in this implementation, the preset time period corresponds to the presumed climax. As another possible implementation, based on labels previously attached to different video segments according to their content, a user screens out the rough time period in which the segments likely to be related to the article are located, and uses it as the preset time period.
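The first implementation, sampling candidate frames from the presumed climax between thirty and seventy percent of the duration, might look like the sketch below. OpenCV is assumed for decoding, and the sampling stride is an arbitrary illustrative choice.

```python
# Sketch: sample frames from 30%-70% of the video (OpenCV assumed).
import cv2

def candidate_frames(video_path, start_ratio=0.3, end_ratio=0.7, stride=30):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    # Jump to every stride-th frame inside the preset time period.
    for idx in range(int(total * start_ratio), int(total * end_ratio), stride):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append((idx, frame))
    cap.release()
    return frames
```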
Specifically, a subject word of each video frame within the preset time period in the article video is extracted. The subject word represents the main content of the current video frame; it may be the main idea embodied by the bullet-screen comments of the current video frame, the main idea embodied by its subtitle content, or the character content and general object content (such as buildings, daily necessities, cosmetics, or representative elements of a scene) contained in the video frame.
It should be noted that the way of extracting the subject word of each video frame within the preset time period differs across application scenarios, as exemplified below:
the first example:
In this example, the subject words cover character content, such as stars, scholars, and animated characters. As shown in FIG. 3, extracting the subject word of each video frame within the preset time period in the article video comprises:
Step 201: whether each video frame contains a human face is detected, and if a video frame contains a human face, face features are extracted.
Specifically, whether each video frame contains a human face may be detected based on whether the frame contains facial features such as eyes and a nose. If a face is found, then, in order to determine the specific person to whom the face belongs, features that uniquely identify the face, such as the shape and size of the facial features, may be extracted.
Step 202: a preset face database is queried to obtain the subject word corresponding to the face features.
It can be understood that a face database containing the correspondence between face features and the subject words of the corresponding persons is preset; after the face features are obtained, the preset face database is queried to obtain the subject word corresponding to the face features.
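Steps 201 and 202 might be sketched as below. The face_recognition library, its 128-dimensional face encodings, and the preset database mapping known encodings to subject words are assumptions; the 0.6 tolerance is that library's conventional default, not a value fixed by the patent.

```python
# Sketch of steps 201-202: face detection, feature extraction, database query.
import face_recognition
import numpy as np

# Hypothetical preset face database: subject word -> known feature vector.
FACE_DB = {
    "star_a": np.load("star_a_encoding.npy"),  # hypothetical 128-d encoding
}

def face_subject_words(frame_rgb, tolerance=0.6):
    subject_words = []
    # One 128-d encoding per face detected in the frame.
    for enc in face_recognition.face_encodings(frame_rgb):
        for word, known in FACE_DB.items():
            # Match when the feature distance is within the tolerance.
            if np.linalg.norm(known - enc) <= tolerance:
                subject_words.append(word)
    return subject_words
```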
The second example:
In this example, the subject word covers object content. As shown in FIG. 4, extracting the subject word of each video frame within the preset time period in the article video comprises:
Step 301: whether each video frame contains an object of a preset category is detected, and if so, object features are extracted.
Specifically, whether each video frame contains objects of a preset category is detected based on the color, shape, and the like of the connected components in the frame. The preset categories may cover general objects such as cosmetics and daily necessities, or a specific object category may be set according to the article content; for example, if the main content of the current article is an introduction to cosmetics, the preset categories may correspond to finer-grained categories under cosmetics, such as lipstick, blush, and mascara. Once a video frame is found to contain an object of a preset category, features that show the object's uniqueness, such as its color and shape, are extracted.
Step 302: a preset object database is queried to obtain the subject word corresponding to the object features.
It can be understood that an object database containing the correspondence between object features and the subject words of the corresponding objects is preset; after the object features are obtained, the preset object database is queried to obtain the subject word corresponding to the object features.
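Steps 301 and 302 could be sketched with a pretrained detector standing in for the color/shape-based detection described above. torchvision's Faster R-CNN, the COCO label ids, and the score threshold are all illustrative assumptions.

```python
# Sketch of steps 301-302 with a pretrained detector (torchvision assumed).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Hypothetical preset object database: detector class id -> subject word.
OBJECT_DB = {44: "bottle", 64: "potted plant", 84: "book"}

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def object_subject_words(frame_tensor, score_threshold=0.8):
    """frame_tensor: float image tensor of shape (3, H, W), values in [0, 1]."""
    with torch.no_grad():
        detections = model([frame_tensor])[0]
    words = []
    for label, score in zip(detections["labels"], detections["scores"]):
        if score >= score_threshold and int(label) in OBJECT_DB:
            words.append(OBJECT_DB[int(label)])
    return words
```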
Step 103: the similarity between each video frame and the article text is calculated according to the second vector corresponding to each subject word and the first vector corresponding to each keyword.
Step 104: a target video frame is determined as the cover video frame according to the similarity between each video frame and the article text.
Specifically, in order to determine a cover video frame that is more consistent with the article text, the similarity between each video frame and the article text is calculated according to the second vector corresponding to each subject word and the first vector corresponding to each keyword, and the most similar video frame is determined as the video cover, thereby achieving consistency between imagery and text.
In an embodiment of the present invention, the similarity is measured by the distance between vectors. In this embodiment, as shown in FIG. 5, calculating the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword comprises:
Step 401: a sub-distance between the second vector corresponding to each subject word in each video frame and the first vector corresponding to each keyword is calculated.
Specifically, the sub-distance between the second vector corresponding to each subject word in a video frame and the first vector corresponding to each keyword is calculated so as to measure the similarity of each subject word to each keyword of the article text.
Step 402: all the sub-distances corresponding to each video frame are added to obtain a corresponding total distance.
Specifically, in this embodiment, the sub-distances corresponding to each video frame are added to obtain a corresponding total distance, which characterizes how far, overall, the subject words in the video frame are from the keywords of the article.
Step 403: the reciprocal of the total distance of each video frame is taken as the similarity between the video frame and the article text.
It can be understood that, by the nature of vector distances, a greater distance means a lower similarity between vectors. Therefore, in this embodiment, the reciprocal of the total distance of each video frame is taken as the similarity between the video frame and the article text. The similarities of the video frames are then compared, and the target video frame corresponding to the maximum similarity can be taken as the cover video frame.
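Steps 401 to 403 reduce to a few lines of arithmetic, as in the sketch below. Euclidean distance is used for the sub-distances as one natural reading, since the patent does not name a distance metric.

```python
# Sketch of steps 401-403: similarity = 1 / (sum of pairwise distances).
import numpy as np

def frame_similarity(subject_vectors, keyword_vectors):
    # Steps 401/402: sum the sub-distances between every subject-word vector
    # of the frame and every keyword vector of the article text.
    total = sum(np.linalg.norm(s - k)
                for s in subject_vectors for k in keyword_vectors)
    # Step 403: the reciprocal of the total distance is the similarity
    # (a zero distance means a perfect match).
    return 1.0 / total if total > 0 else float("inf")

def pick_cover(frames_subject_vectors, keyword_vectors):
    sims = [frame_similarity(sv, keyword_vectors)
            for sv in frames_subject_vectors]
    # The target video frame is the one with the maximum similarity.
    return int(np.argmax(sims))
```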
In actual implementation, as mentioned above, the cover video frame may be determined as an article cover, where the article cover may be the cover of a pushed article as shown in FIG. 1-2 or the link thumbnail of a pushed article as shown in FIG. 6; of course, the cover video frame may also be determined as the video cover of the inserted video as shown in FIG. 1-2.
In an embodiment of the present invention, in order to further improve the user's click-through rate, the video frame serving as the cover may additionally be selected based on image quality, such as sharpness and aesthetic quality.
Specifically, in this embodiment, one or more image quality indexes of each video frame are acquired, for example the sharpness and the aesthetic measure of the image (the aesthetic measure may be obtained from a pre-established deep learning model or the like). When determining the cover video frame, a weight is preset for each image quality index and for the similarity. The weights may be set according to the attributes of the article: for example, when the article is an entertainment article, the weight of the similarity is greater than the weight of each image quality index; when the article is a national defense article, the weight of the similarity is smaller than the weight of each image quality index.
Furthermore, the score data of each video frame is calculated according to each image quality index of the video frame and its corresponding weight, together with the similarity between the video frame and the article text and its corresponding weight. For example, each image quality index is normalized, each normalized value is multiplied by its corresponding weight, the product of the similarity and its corresponding weight is calculated, and the sum of these products is taken as the score data of the video frame. The target video frame corresponding to the maximum score data is then determined as the cover. The score data may be normalized to a ten-point or five-point scale, among others, without limitation.
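A minimal sketch of this weighted scoring follows. The min-max normalization of the quality indexes is one possible choice; the patent only requires that each index and the similarity be weighted and combined.

```python
# Sketch: weighted combination of image quality indexes and similarity.
import numpy as np

def score_frames(quality_indexes, similarities, quality_weights, sim_weight):
    """quality_indexes: array of shape (num_frames, num_indexes)."""
    q = np.asarray(quality_indexes, dtype=float)
    # Normalize each quality index column to [0, 1] across all frames.
    q = (q - q.min(axis=0)) / (np.ptp(q, axis=0) + 1e-9)
    # Score data: weighted quality indexes plus weighted similarity.
    scores = (q @ np.asarray(quality_weights, dtype=float)
              + sim_weight * np.asarray(similarities, dtype=float))
    # The frame with the maximum score data becomes the cover video frame.
    return int(np.argmax(scores))
```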
To sum up, the method for determining a cover video frame according to the embodiment of the present invention extracts keywords of an article text, acquires the first vector corresponding to each keyword, extracts the subject word of each video frame within a preset time period in the article video, acquires the second vector corresponding to each subject word, calculates the similarity between each video frame and the article text according to the second vectors corresponding to the subject words and the first vectors corresponding to the keywords, and then determines a target video frame as the cover video frame according to the similarity between each video frame and the article text. The video frame used as the cover is thus consistent with the article content, automatic adaptation between the cover video frame and the article content is achieved, and the efficiency of cover determination as well as the user's click-through rate and browsing experience are improved.
In order to implement the foregoing embodiments, the present invention further provides a device for determining a cover video frame. FIG. 7 is a schematic structural diagram of the device according to an embodiment of the present invention. As shown in FIG. 7, the device comprises: a first acquisition module 10, a second acquisition module 20, a calculation module 30, and a cover determination module 40.
The first acquisition module 10 is configured to extract keywords of the article text and acquire a first vector corresponding to each keyword.
The second acquisition module 20 is configured to extract a subject word of each video frame within a preset time period in the article video and acquire a second vector corresponding to each subject word.
The calculation module 30 is configured to calculate the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword.
The cover determination module 40 is configured to determine a target video frame as the cover video frame according to the similarity between each video frame and the article text.
In an embodiment of the present invention, as shown in FIG. 8, on the basis of the embodiment shown in FIG. 7, the second acquisition module 20 comprises an extracting unit 11 and an obtaining unit 12, where the extracting unit 11 is configured to detect whether each video frame contains a face and to extract face features when a face is found.
The obtaining unit 12 is configured to query a preset face database to obtain the subject word corresponding to the face features.
It should be noted that the foregoing explanation of the embodiment of the method for determining a cover video frame is also applicable to the device for determining a cover video frame of this embodiment, and is not repeated here.
To sum up, the device for determining a cover video frame according to the embodiment of the present invention extracts keywords of an article text, acquires the first vector corresponding to each keyword, extracts the subject word of each video frame within a preset time period in the article video, acquires the second vector corresponding to each subject word, calculates the similarity between each video frame and the article text according to the second vectors corresponding to the subject words and the first vectors corresponding to the keywords, and then determines a target video frame as the cover video frame according to the similarity between each video frame and the article text. The video frame used as the cover is thus consistent with the article content, automatic adaptation between the cover video frame and the article content is achieved, and the efficiency of cover determination as well as the user's click-through rate and browsing experience are improved.
In order to implement the above embodiments, the present invention further provides a computer program product which, when its instructions are executed by a processor, implements the method for determining a cover video frame as described in the foregoing method embodiments.
In order to implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of determining a cover video frame as described in the aforementioned method embodiments.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art can make variations, modifications, substitutions, and alterations to the above embodiments within the scope of the present invention.

Claims (10)

1. A method for determining a cover video frame, comprising the following steps:
inputting an article text in an article into a pre-constructed learning model to acquire topic information of the text;
performing part-of-speech analysis and word segmentation on the article text to obtain a plurality of text word segments;
calculating the correlation between each of the plurality of text word segments and the topic information, and determining the text word segments whose correlation is greater than a preset threshold as keywords;
acquiring a first vector corresponding to each keyword;
extracting a subject word of each video frame within a preset time period in a video inserted into the article, and acquiring a second vector corresponding to each subject word, wherein the subject word represents a main idea embodied by the subtitle content of each video frame, the character content contained therein, and general object content;
calculating the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword; and
determining a target video frame as the cover video frame according to the similarity between each video frame and the article text.
2. The method of claim 1, further comprising:
and determining the cover video frame as an article cover frame, and/or determining the cover video frame as a video cover frame.
3. The method of claim 1, wherein the extracting a subject word of each video frame within a preset time period in the article video comprises:
detecting whether each video frame contains a face, and if a video frame contains a face, extracting face features; and
querying a preset face database to obtain the subject word corresponding to the face features.
4. The method of claim 1, wherein the extracting a subject word of each video frame within a preset time period in the article video comprises:
detecting whether each video frame contains an object of a preset category, and if so, extracting object features; and
querying a preset object database to obtain the subject word corresponding to the object features.
5. The method of claim 1, wherein the calculating the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword comprises:
calculating a sub-distance between the second vector corresponding to each subject word in each video frame and the first vector corresponding to each keyword;
adding all the sub-distances corresponding to each video frame to obtain a corresponding total distance; and
taking the reciprocal of the total distance of each video frame as the similarity between that video frame and the article text.
6. The method of claim 5, wherein the determining a target video frame as the cover video frame according to the similarity between each video frame and the article text comprises:
comparing the similarities between the video frames and the article text, and taking the target video frame corresponding to the maximum similarity as the cover video frame.
7. The method of any of claims 1-6, further comprising:
acquiring one or more image quality indexes of each video frame;
wherein the determining a target video frame as the cover video frame according to the similarity between each video frame and the article text comprises:
acquiring a weight corresponding to each image quality index and a weight corresponding to the similarity;
calculating score data of each video frame according to each image quality index of the video frame and its corresponding weight, together with the similarity between the video frame and the article text and its corresponding weight; and
determining, according to the score data of the video frames, the target video frame corresponding to the maximum score data as the cover video frame.
8. A cover video frame determination apparatus, comprising:
a third acquisition module, configured to input an article text in an article into a pre-constructed learning model and acquire topic information of the text;
a fourth acquisition module, configured to perform part-of-speech analysis and word segmentation on the article text to obtain a plurality of text word segments;
a determination module, configured to calculate the correlation between each of the plurality of text word segments and the topic information, and determine the text word segments whose correlation is greater than a preset threshold as keywords;
a first acquisition module, configured to acquire a first vector corresponding to each of the keywords;
a second acquisition module, configured to extract a subject word of each video frame within a preset time period in the article video and acquire a second vector corresponding to each subject word, wherein the subject word represents a main idea embodied by the subtitle content of each video frame, the character content contained therein, and general object content;
a calculation module, configured to calculate the similarity between each video frame and the article text according to the second vector corresponding to each subject word and the first vector corresponding to each keyword; and
a cover determination module, configured to determine a target video frame as the cover video frame according to the similarity between each video frame and the article text.
9. A computer program product, wherein the instructions in the computer program product, when executed by a processor, implement the method for determining a cover video frame of any one of claims 1-7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for determining a cover video frame of any one of claims 1-7.
CN201811217665.2A 2018-10-18 2018-10-18 Method and device for determining cover video frame Active CN109388721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811217665.2A CN109388721B (en) 2018-10-18 2018-10-18 Method and device for determining cover video frame


Publications (2)

Publication Number Publication Date
CN109388721A CN109388721A (en) 2019-02-26
CN109388721B true CN109388721B (en) 2021-05-28

Family

ID=65426810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811217665.2A Active CN109388721B (en) 2018-10-18 2018-10-18 Method and device for determining cover video frame

Country Status (1)

Country Link
CN (1) CN109388721B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949819A (en) * 2019-05-15 2020-11-17 北京字节跳动网络技术有限公司 Method and device for pushing video
CN110191357A (en) * 2019-06-28 2019-08-30 北京奇艺世纪科技有限公司 The excellent degree assessment of video clip, dynamic seal face generate method and device
CN110572711B (en) * 2019-09-27 2023-03-24 北京达佳互联信息技术有限公司 Video cover generation method and device, computer equipment and storage medium
CN110956037B (en) * 2019-10-16 2022-07-08 厦门美柚股份有限公司 Multimedia content repeated judgment method and device
CN110909205B (en) * 2019-11-22 2023-04-07 北京金山云网络技术有限公司 Video cover determination method and device, electronic equipment and readable storage medium
CN111581510B (en) * 2020-05-07 2024-02-09 腾讯科技(深圳)有限公司 Shared content processing method, device, computer equipment and storage medium
CN112752121B (en) * 2020-05-26 2023-06-09 腾讯科技(深圳)有限公司 Video cover generation method and device
CN114915831A (en) * 2022-04-19 2022-08-16 秦皇岛泰和安科技有限公司 Preview determination method, device, terminal equipment and storage medium
CN116777914B (en) * 2023-08-22 2023-11-07 腾讯科技(深圳)有限公司 Data processing method, device, equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521293A (en) * 2011-11-30 2012-06-27 江苏奇异点网络有限公司 Video reconstruction method facing video frame content
CN106161873A (en) * 2015-04-28 2016-11-23 天脉聚源(北京)科技有限公司 A kind of video information extracts method for pushing and system
CN107025312A (en) * 2017-05-19 2017-08-08 北京金山安全软件有限公司 Information providing method and device based on video content
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160335493A1 (en) * 2015-05-15 2016-11-17 Jichuan Zheng Method, apparatus, and non-transitory computer-readable storage medium for matching text to images


Also Published As

Publication number Publication date
CN109388721A (en) 2019-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant