CN117573925A - Method and device for determining predicted playing time, electronic equipment and storage medium - Google Patents

Method and device for determining predicted playing time, electronic equipment and storage medium

Info

Publication number
CN117573925A
CN117573925A (application number CN202410054227.8A)
Authority
CN
China
Prior art keywords
video
processed
sample
text
time length
Prior art date
Legal status
Pending
Application number
CN202410054227.8A
Other languages
Chinese (zh)
Inventor
黄剑辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410054227.8A priority Critical patent/CN117573925A/en
Publication of CN117573925A publication Critical patent/CN117573925A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for determining a predicted playing duration, an electronic device, and a storage medium. The method includes: acquiring a video to be processed and text content associated with the video to be processed; extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracting text features corresponding to the video to be processed based on the text content; and then performing regression prediction processing based on the fusion feature of the video features and the text features to obtain the estimated playing duration corresponding to the video to be processed. In this way, by relying on text content and video image content that can differentially characterize the video, targeted prediction of the estimated playing duration of the video to be processed can be achieved, which improves both the interpretability of the process of determining the estimated playing duration and the accuracy of the determined estimated playing duration.

Description

Method and device for determining predicted playing time, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for determining a predicted playing duration, an electronic device, and a storage medium.
Background
With the development of information technology, videos matching a video search requirement can be recalled in response to the search requirement of a related object, the videos that can be presented and their presentation order are determined by ranking the recalled videos, and video pushing is carried out according to the corresponding presentation order. When videos are ranked, video scores are generally calculated from statistical information such as the historical pushing situation, clicked situation, and historical playing duration of each video, and the video ranking result is then determined according to the video scores of the videos.
Currently, in order to determine a corresponding estimated playing duration for a video that has not been pushed during video scoring, a parameter estimation approach is generally adopted: based on the historical playing durations of videos that have already been pushed, estimation parameters for determining the estimated playing duration of a video are obtained, so that the estimated playing durations of different videos can be determined according to the estimation parameters.
However, because the estimated playing durations of both pushed and un-pushed videos are determined from estimation parameters obtained by analyzing the statistical relationship between the historical playing durations of other videos and their total video durations, the estimated playing duration determined for a video is not video-specific, and the accuracy of the determined estimated playing duration is difficult to guarantee.
Disclosure of Invention
The embodiments of the present application provide a method and apparatus for determining a predicted playing duration of a video, an electronic device, and a storage medium, which are used to improve the accuracy of the determined estimated playing duration.
In a first aspect, a method for determining a predicted playing duration is provided, including:
acquiring a video to be processed and acquiring text content associated with the video to be processed;
extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracting text features corresponding to the video to be processed based on the text content;
and carrying out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
In a second aspect, a device for determining a predicted playing duration is provided, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be processed and acquiring text content associated with the video to be processed;
the extraction unit is used for extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracting text features corresponding to the video to be processed based on the text content;
And the prediction unit is used for carrying out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
Optionally, after the obtaining the estimated playing duration corresponding to the video to be processed, the prediction unit is further configured to:
acquiring an estimated playing time length duty ratio corresponding to the video to be processed according to the estimated playing time length, and extracting keywords aiming at the text content when the estimated playing time length duty ratio is determined to exceed a preset duty ratio threshold value to obtain description keywords corresponding to the video to be processed;
and adding the description keywords to a preset keyword set, and guiding to generate description texts of other videos according to the description keywords included in the keyword set.
Optionally, when extracting the video feature corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, the extracting unit is configured to:
performing frame extraction processing on the video to be processed to obtain at least one key video frame, and performing feature extraction on the at least one key video frame to obtain image features corresponding to the at least one key video frame respectively;
And based on at least one image feature, fusing to obtain video features corresponding to the video to be processed.
Optionally, when the frame extracting process is performed on the video to be processed to obtain at least one key video frame, the extracting unit is configured to:
dividing the video to be processed into a plurality of video segments to be processed according to the preset proportion of each segment duration and the total video duration of the video to be processed;
for each video segment to be processed, the following operations are performed: obtaining a corresponding set frame extraction proportion according to the corresponding segment duration proportion of the video segment to be processed, determining a corresponding key video frame number according to the frame extraction proportion and a preset frame extraction total number, and taking the key video frame number video frames uniformly extracted from the video segment to be processed as obtained key video frames.
Optionally, when the frame extracting process is performed on the video to be processed to obtain at least one key video frame, the extracting unit is configured to:
starting from the video starting position of the video to be processed, performing video frame extraction operation once every set time period until the residual time period of the video to be processed is lower than the set time period;
And taking the extracted at least one video frame as the obtained at least one key video frame.
Optionally, when the obtaining the video to be processed and the text content associated with the video to be processed, the obtaining unit is configured to perform any one of the following operations:
selecting one candidate video with associated click watching times lower than a set threshold value from the recalled candidate videos, determining the one candidate video as a video to be processed, and determining the video title of the one candidate video as text content associated with the video to be processed;
and determining the acquired video to be distributed as a video to be processed, and determining the video title of the video to be distributed as text content associated with the video to be processed.
Optionally, when the video to be processed is included in each candidate video recalled, after obtaining the estimated playing duration corresponding to the video to be processed, the obtaining unit is further configured to:
determining a video score corresponding to the video to be processed by combining a historical pushing condition and a historical clicking condition corresponding to the video to be processed based on the estimated playing time length of the video to be processed;
And when the video score exceeds a preset score threshold, determining the video to be processed as the video to be pushed.
Optionally, the function of the device is implemented based on a trained target duration estimation model, where the target duration estimation model is obtained by training a training unit in the device in the following manner:
obtaining a training sample set, wherein one training sample comprises: a piece of sample video, a sample playing result of the piece of sample video, and a sample text associated with the piece of sample video;
and carrying out multiple rounds of iterative training on the constructed initial duration estimation model by adopting the training sample set to obtain a trained target duration estimation model.
Optionally, during a round of iterative training, the training unit is configured to perform the following operations:
adopting the initial duration estimation model, extracting sample video features corresponding to sample videos based on at least one sample video frame extracted from the sample videos, extracting sample text features corresponding to the sample videos based on sample texts associated with the sample videos, and carrying out regression prediction processing based on fusion features of the sample video features and the sample text features to obtain predicted play results corresponding to the sample videos;
And adjusting model parameters of the initial duration estimation model based on the numerical difference between the predicted playing result and the sample playing result of the sample video.
Optionally, the initial duration estimation model is constructed by the training unit based on any one of the following structures:
an image coding network, a text coding network, and a prediction network comprising a fully connected layer;
an image coding network, a text coding network, and a prediction network including a fully connected layer and a normalization layer.
In a third aspect, an electronic device is presented comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the computer program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, implements the above method.
In a fifth aspect, a computer program product is proposed, comprising a computer program which, when executed by a processor, implements the above method.
The beneficial effects of the application are as follows:
the method, the device, the electronic equipment and the storage medium for determining the predicted playing time are provided, a video to be processed is obtained, and text content associated with the video to be processed is obtained; extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracting text features corresponding to the video to be processed based on the text content; and then, carrying out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
In this way, when the corresponding estimated playing time length is determined for the video to be processed, the multi-mode feature can be obtained by extracting at least one extracted key video frame and associated text content, and further the estimated playing time length corresponding to the video to be processed can be predicted based on the obtained image feature and text feature by means of regression prediction; in the prediction process of the predicted playing time, the influence of the text content corresponding to the video to be processed and the image content in the video to be processed is comprehensively considered, which is equivalent to the comprehensive modeling by combining the text content and the image content, and the predicted playing time can be finally obtained through fitting; moreover, by means of the text content and the video image content which can be used for differentially representing the video, the targeted prediction of the estimated playing time length of the video to be processed can be realized, the interpretability of the determining process of the estimated playing time length is improved, and the accuracy of the determined estimated playing time length is also improved.
Drawings
Fig. 1 is a schematic diagram of a video pushing process according to an embodiment of the present application;
fig. 2 is a schematic diagram of a possible application scenario in the embodiment of the present application;
Fig. 3 is a schematic diagram of a determining process of a predicted playing duration in an embodiment of the present application;
FIG. 4 is a schematic diagram of a process for determining candidate videos according to an embodiment of the present application;
FIG. 5A is a schematic diagram of a video segmentation method according to an embodiment of the present application;
FIG. 5B is a schematic diagram of a segment frame extraction method according to an embodiment of the present application;
FIG. 5C is a schematic diagram of another frame extraction manner according to an embodiment of the present application;
FIG. 6A is a schematic structural diagram of an initial duration estimation model according to an embodiment of the present application;
FIG. 6B is a schematic diagram of another initial duration estimation model according to an embodiment of the present application;
FIG. 6C is a schematic diagram of a process of training to obtain a target duration estimation model in an embodiment of the present application;
fig. 7 is a schematic process diagram of determining a predicted playing duration of a video according to an embodiment of the present application;
fig. 8 is a schematic logic structure diagram of a determining device for predicting a play duration in the embodiment of the present application;
fig. 9 is a schematic diagram of a hardware composition structure of an electronic device to which the embodiments of the present application are applied;
fig. 10 is a schematic diagram of a hardware composition structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, which can be made by a person of ordinary skill in the art without any inventive effort, based on the embodiments described in the present application are intended to be within the scope of the technical solutions of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be capable of operation in sequences other than those illustrated or otherwise described.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Some of the terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Predicted playing duration: in the embodiments of the present application, this refers to the estimated duration for which a video will be watched after it is pushed to users; it should be understood that the estimated playing duration is an estimate determined with respect to the user population; the estimated playing duration can, to a certain extent, measure how attractive the video is to users, or how interested users are in the video.
Pushing: it should be noted that, in the embodiments of the present application, after a video is pushed to a user, one video exposure is regarded as completed regardless of whether the video is clicked and watched by the user.
Key video frames: video frames extracted from the frames contained in a video that have greater analysis value from the viewpoint of the visual images.
Regression prediction: in the embodiments of the present application, a numerical result is obtained by fitting with a regression analysis technique, so as to realize prediction of the estimated playing duration.
The following briefly describes the design concept of the embodiment of the present application:
with the development of information technology, each video matched with the video searching requirement can be recalled in response to the video searching requirement of the related object, the video which can be presented and the presentation sequence of the video are determined by sequencing the recalled videos, and video pushing is carried out according to the corresponding presentation sequence.
Referring to fig. 1, which is a schematic diagram of a related video pushing process in the embodiment of the present application: after the content of a user Query is obtained, videos are recalled from an index library through a recall method; then, through coarse ranking and fine ranking, the video sequence presented during final pushing is determined by ranking. As can be seen from the content illustrated in fig. 1, assuming that the content of the user Query is "skip X-hop", after recall and ranking are performed based on "skip X-hop", the videos found for "skip X-hop" can be pushed to the user according to the video presentation order illustrated in fig. 1, where recall means searching different index libraries based on the same Query content to obtain corresponding search results and then recalling the intersection of the different search results.
It can be seen that the two rounds of video ranking, coarse ranking and fine ranking, finally determine a video ranking result, and the video ranking result determines which videos can be presented and in what order they are presented.
Under the related technology, when video push ordering is performed, video scores are generally calculated according to statistical information such as historical push conditions, clicked conditions and historical playing time length of videos, and then video ordering results are determined according to the video scores of all videos.
During the conception process, the applicant found that statistical information such as a video's playing duration, historical exposure situation, and clicked situation are important features in the coarse-ranking and fine-ranking processes and play a key role, the playing duration feature in particular. Generally, the longer a video is played after being clicked, the better that video satisfies the user's current search. Based on this, in order to keep the video ranking result usable as a reference, the playing duration of a video needs to be estimated for videos that have not yet been pushed or have been pushed only a few times.
At present, when determining the estimated playing time length of the video, a parameter estimation mode is generally adopted, and based on the historical playing time length of each video which is pushed, an estimated parameter for determining the estimated playing time length of the video is obtained, so that the estimated playing time lengths of different videos can be determined according to the estimated parameter.
Specifically, when determining the estimated play time in a parameter estimation manner, the following formula may be adopted for calculation:
r = (click + α) / (exp + β), L = r × exp
wherein r represents the click play-duration ratio, namely the ratio of the estimated playing duration determined for one video (assumed to be video X) to the total video duration of video X; click represents the historical viewing duration of video X; exp is the total duration of video X; α and β are estimation parameters determined from statistics; L is the estimated playing duration determined for video X.
However, when determining the estimated playing duration of a video in this way, the same set of estimation parameters is adopted for pushed and un-pushed videos alike, so the playing situation of a video can only be estimated from the statistical relationship, obtained by analyzing historical data, between historical playing duration and total video duration; the estimated playing duration determined for a video is therefore not video-specific, and the accuracy of the determined estimated playing duration is difficult to guarantee.
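For illustration, a minimal Python sketch of this parameter-estimation baseline is given below. It follows the formula as reconstructed above; the exact smoothing form and the parameter names alpha and beta are assumptions derived from the surrounding definitions, and the function name is illustrative.

```python
def baseline_estimated_play_duration(click: float, exp: float,
                                     alpha: float, beta: float) -> float:
    """Parameter-estimation baseline for the estimated playing duration.

    click: historical viewing duration of the video (seconds)
    exp:   total duration of the video (seconds)
    alpha, beta: estimation parameters determined from statistics over
                 already-pushed videos and shared by all videos
    """
    r = (click + alpha) / (exp + beta)   # click play-duration ratio
    return r * exp                       # estimated playing duration L

# The same (alpha, beta) pair is applied to every video, which is why this
# baseline cannot give video-specific estimates for un-pushed videos.
print(baseline_estimated_play_duration(click=12.0, exp=60.0, alpha=5.0, beta=30.0))
```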
In view of this, in the embodiment of the present application, a method, an apparatus, an electronic device, and a storage medium for determining a predicted playing duration are provided, a video to be processed is obtained, and text content associated with the video to be processed is obtained; extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracting text features corresponding to the video to be processed based on text content; and then, carrying out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
In this way, when the corresponding estimated playing time length is determined for the video to be processed, the multi-mode feature can be obtained by extracting at least one extracted key video frame and associated text content, and further the estimated playing time length corresponding to the video to be processed can be predicted based on the obtained image feature and text feature by means of regression prediction; in the prediction process of the predicted playing time, the influence of the text content corresponding to the video to be processed and the image content in the video to be processed is comprehensively considered, which is equivalent to the comprehensive modeling by combining the text content and the image content, and the predicted playing time can be finally obtained through fitting; moreover, by means of the text content and the video image content which can be used for differentially representing the video, the targeted prediction of the estimated playing time length of the video to be processed can be realized, the interpretability of the determining process of the estimated playing time length is improved, and the accuracy of the determined estimated playing time length is also improved.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
Fig. 2 is a schematic diagram of a possible application scenario in the embodiment of the present application. The application scenario diagram includes a client device 210, and a processing device 220.
In some possible embodiments of the present application, the processing device 220 may obtain an original video sent by a related object on the client device 210, and determine an estimated playing duration corresponding to the original video according to a title configured by the related object for the original video and a key video frame extracted from the original video; and then in the video sequencing process possibly related, calculating video scores by combining the determined estimated playing time length, and sequencing and pushing the videos according to the video scores.
In other possible embodiments of the present application, the processing device 220 may provide a video searching and pushing function, and in response to a searching operation of a related object on the client device 210, push a video meeting a requirement to the client device 210, where the searching operation triggered by the related object may be initiated on any one of an applet application, a client application, and a web application, which is not limited in this application specifically. Based on this, in order to meet the video push requirement, before the video is released, the processing device 220 may determine a corresponding estimated playing duration for the video, so that the processing is directly performed according to the predetermined estimated playing duration when the video score is calculated subsequently; in this case, optionally, only the video with the number of times of being clicked and watched being lower than the first set value in the recalled videos may be processed according to the pre-determined estimated playing duration, where the value of the first set value is set according to the actual processing requirement.
In other possible embodiments of the present application, the processing device 220 may provide a video searching and pushing function, and in response to a searching operation of a related object on the client device 210, push a video meeting a requirement to the client device 210, where the searching operation triggered by the related object may be initiated on any one of an applet application, a client application, and a web application, which is not specifically limited in this application. Based on this, to meet the video pushing requirement, after completing video recall, the processing device 220 may determine, for each video that is clicked for a number of times lower than the first set value, a corresponding estimated play duration; and further, determining a corresponding video score by combining the estimated playing time length, and sequencing all recalled videos according to the video score, wherein the value of the first set value is set according to the actual processing requirement.
Client devices 210 include, but are not limited to, cell phones, tablet computers, notebooks, electronic book readers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like.
The processing device 220 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms; in a possible implementation manner, the processing device may be a terminal device with a certain processing capability, such as a tablet computer or a notebook computer.
In the embodiment of the present application, communication between the client device 210 and the processing device 220 may be performed through a wired network or a wireless network. In the following description, the determination process of the estimated play time length will be described only from the viewpoint of the processing device 220.
The following describes schematically the determination process of the estimated playing time length in combination with possible application scenarios:
and (3) applying the first scene, and responding to video searching operation triggered by the user to determine the estimated playing time length of the video.
In the service scene corresponding to the application scene, in order to meet the searching requirement of various types of videos, the processing equipment can recall the matched videos after determining the video content searched by the user, and determine the video which can be pushed and the presentation sequence when the video is pushed by sequencing the recalled videos.
Specifically, for a newly released video, or a video that has been pushed only a few times, among the recalled videos, historical play data of the video may not exist, or even if it exists, it may be insufficient to meet the requirement for evaluating the video playing duration. In this case, the corresponding estimated playing duration may be determined for such newly released or rarely clicked-and-played videos.
Then, according to the determined estimated playing time length and other information related to the video, video scores of all the videos can be calculated, and further push ordering of the videos can be achieved according to the obtained video scores.
And when the second application scene actively recommends the video to the user, determining the estimated playing time length of the video.
In the service scene corresponding to the application scene II, after the video content possibly interested by the user is determined, the video can be actively pushed to the user when the user is determined to access the application. Specifically, after the video types of interest of the user are determined in advance according to the interest content actively configured by the user, each video of possible interest of the user can be recalled when the user accesses the application, and the video which can be pushed and the presentation sequence when the video is pushed can be determined by sequencing each video.
For newly released videos, or videos that have been pushed only a few times, among the recalled videos, historical play data may not exist, or even if it exists, it may be insufficient to meet the requirement for evaluating the video playing duration. In this case, the corresponding estimated playing duration may be determined for such newly released or rarely clicked-and-played videos.
Then, according to the determined estimated playing time length and other information related to the video, video scores of all the videos can be calculated, and further push ordering of the videos can be achieved according to the obtained video scores.
In addition, it should be understood that in the specific embodiments of the present application, the determination of the estimated playing time period is involved, and when the embodiments described in the present application are applied to specific products or technologies, the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The following describes a determination process of the estimated playing time length from the point of view of the processing device with reference to the accompanying drawings:
referring to fig. 3, which is a schematic diagram of a process for determining the estimated play duration in the embodiment of the present application, the process for determining the estimated play duration is described below with reference to fig. 3:
step 301: the processing device obtains a video to be processed and obtains text content associated with the video to be processed.
In the embodiments of the present application, according to actual processing requirements, after completing video recall the processing device may, for recalled videos whose viewing duration cannot be effectively measured from historical data, determine the corresponding estimated playing duration using the processing manner claimed in the present application; alternatively, the processing device may determine the estimated playing duration of a video before the video is released, using the processing manner claimed in the present application, where the watched duration, also referred to as the played duration, indicates the playing situation of a video after it has been pushed.
The process of obtaining the video to be processed and the text content associated with the video to be processed is described below in connection with two possible processing scenarios:
and processing the first scene, and selecting a video to be processed from all the recalled candidate videos.
Specifically, in the data acquisition process corresponding to processing scene one, one candidate video whose associated click-view count is lower than a set threshold is selected from the recalled candidate videos and determined as the video to be processed, and the video title of that candidate video is determined as the text content associated with the video to be processed.
It should be noted that, the number of times of clicking and watching refers to the total number of times of clicking and watching the video by the user after pushing the video to the user; the value of the set threshold is set according to the actual processing requirement, which is not particularly limited in this application, wherein the number of click-to-view is determined by analyzing the historical data of the corresponding video.
In addition, for the video recall corresponding to processing scene one, the processing device may respond to a video search request triggered by a related object, or the processing device may actively perform video recall by analyzing the video browsing requirements of the related object, where the analysis of the video browsing requirements of the related object is performed only with the related object's authorization and knowledge; moreover, the processing scheme claimed in the present application is independent of how videos are recalled, so the present application does not specifically limit the recall manner adopted in the video recall process.
For example, referring to fig. 4, which is a schematic diagram illustrating a process for determining candidate videos in an embodiment of the present application, it is assumed that a user sends a video search request based on "polar bear" on a used client device, and a processing device recalls 6 candidate videos based on "polar bear"; further, assuming that the preset threshold value for the number of click plays is 800, from the recalled 6 candidate videos, 3 candidate videos with the number of click plays lower than the set threshold value can be determined: candidate videos 1, 2, 5; furthermore, the candidate videos 1, 2 and 5 can be respectively used as videos to be processed, and the corresponding estimated playing time length can be determined.
In this way, for each recalled candidate video, by analyzing the click-view count associated with the candidate video it can be judged, based on the candidate video's historical viewing situation, whether the viewing duration of the user population for that candidate video can be effectively determined; this is equivalent to screening the candidate videos by their associated click-view counts, so that the videos to be processed, whose viewing duration by the user population cannot be effectively determined from historical viewing data, can be screened out.
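As a minimal sketch of this screening step (the function name, data layout, and the toy click-view counts are illustrative, not part of the embodiment; only the 800 threshold and the result that candidates 1, 2, and 5 fall below it come from the Fig. 4 example):

```python
def screen_videos_to_process(candidates: list[dict], threshold: int = 800) -> list[dict]:
    """Return recalled candidate videos whose click-view count is below the
    set threshold; these lack enough historical viewing data, so their
    playing duration is predicted rather than measured."""
    return [c for c in candidates if c["click_views"] < threshold]

# Toy counts; the embodiment only states that candidates 1, 2 and 5 fall below 800.
candidates = [
    {"id": 1, "click_views": 120},
    {"id": 2, "click_views": 530},
    {"id": 3, "click_views": 4200},
    {"id": 4, "click_views": 1500},
    {"id": 5, "click_views": 60},
    {"id": 6, "click_views": 980},
]
print([c["id"] for c in screen_videos_to_process(candidates)])  # -> [1, 2, 5]
```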
And processing the second scene, and selecting the video to be distributed as the video to be processed.
In the data acquisition process corresponding to processing scene two, the processing device may determine an acquired video to be distributed as the video to be processed, and determine the video title of the video to be distributed as the text content associated with the video to be processed.
Specifically, in the processing corresponding to processing scene two, the processing device determines the estimated playing duration at the video release stage, so that each released video is associated with a pre-determined estimated playing duration, which provides a referenceable basis for calculating the video score of a new video in any video ranking that may be performed later.
Therefore, by taking the video to be distributed as the video to be processed, each video after distribution can be ensured to be associated with the estimated playing time length, and a referenceable playing time length basis is provided for the new video.
Step 302: the processing equipment extracts video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracts text features corresponding to the video to be processed based on text content.
In the embodiment of the application, when the processing equipment extracts the image features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, frame extraction processing is performed on the video to be processed to obtain at least one key video frame, and feature extraction is performed on the at least one key video frame to obtain the image features corresponding to the at least one key video frame respectively; and fusing to obtain video features corresponding to the video to be processed based on at least one image feature.
Specifically, in the process of extracting frames from the video to be processed to obtain at least one key video frame, in some possible implementations the processing device may directly select the cover image of the video to be processed as the key video frame; in other possible implementations, frames may be extracted differentially in different content areas, taking into account that different content areas of a video differ in content importance; alternatively, frames may be extracted indiscriminately across the video without considering such differences, where a content area refers to the area covered by one sub-video segment when the video is divided into a plurality of segments, and the content importance level describes the extent to which the video content of an area reflects the video theme.
For example, for a video, the beginning will generally describe content such as the shooting background of the video, the end is used to summarize the video content, and the middle part is used to show the core content of the video; based on this, the importance of the content area corresponding to the shooting background and of the content area corresponding to the end of the video is obviously far lower than that of the core content.
The following describes a process of extracting at least one key video frame from a video to be processed, taking two possible frame extraction modes as examples:
and in the first frame extraction mode, different content areas in the video to be processed are subjected to differential frame extraction.
In the processing process corresponding to the frame extraction mode one, the processing equipment can divide the video to be processed into a plurality of video segments to be processed according to the preset time length proportion of each segment and the total video time length of the video to be processed; and then, aiming at each video segment to be processed, executing the following operations: and obtaining a correspondingly set frame extraction proportion according to the corresponding segment duration ratio of the video segment to be processed, determining the corresponding key video frame number according to the frame extraction proportion and the preset frame extraction total number, and taking the key video frame number video frames uniformly extracted from the video segment to be processed as the obtained key video frames.
It should be noted that, in the embodiment of the present application, a preset video segmentation mode is used to indicate that a plurality of video segments to be processed are partitioned from a video to be processed; in addition, the preset video segmentation mode can limit the video duration corresponding to each video segment to be processed by restricting the proportion of each segmentation duration under the condition of not distinguishing the video content types, so that a plurality of video segments to be processed are partitioned from the video to be processed; meanwhile, corresponding frame extraction ratios can be set according to the proportion of each segment duration, and the total number of video frames extracted from each video segment to be processed is determined based on the frame extraction ratios and the preset total number of frame extraction.
In addition, in the embodiment of the present application, the uniform extraction of video frames from a video segment to be processed refers to that when video frames are extracted from one video segment to be processed, different video frames are extracted at the same time interval.
For example, referring to fig. 5A, which is a schematic diagram of a video segmentation method according to an embodiment of the present application, it is assumed that the preset video segmentation method is: divide the video into three video segments to be processed, whose segment-duration proportions are 0.2, 0.6, and 0.2 respectively; assuming one video to be processed is 20 seconds long, the first video segment to be processed, video segment 1, corresponds to the video content at 0-4S of the video to be processed; the second video segment to be processed, video segment 2, corresponds to the video content at 4S-16S of the video to be processed; and the third video segment to be processed, video segment 3, corresponds to the video content at 16S-20S of the video to be processed.
For another example, referring to fig. 5B, which is a schematic diagram of a segment frame extraction manner in the embodiment of the present application, it is assumed that the segment-duration proportions are 0.2, 0.6, and 0.2; for the first segment-duration proportion of 0.2, the configured frame extraction proportion is 0.2; for the second segment-duration proportion of 0.6, the configured frame extraction proportion is 0.5; for the third segment-duration proportion of 0.2, the configured frame extraction proportion is 0.3; and the set total number of extracted frames is 10. Then 2 video frames need to be extracted from video segment 1 to be processed, which is constrained by the first segment-duration proportion; 5 video frames need to be extracted from video segment 2 to be processed, constrained by the second segment-duration proportion; and 3 video frames need to be extracted from video segment 3 to be processed, constrained by the third segment-duration proportion.
Continuing with fig. 5B, assuming that when frames are extracted from the different video segments to be processed the corresponding number of video frames is extracted uniformly, then in video segment 1 to be processed one video frame may be extracted about every 1.3 seconds; in video segment 2 to be processed one video frame may be extracted every 2 seconds; and in video segment 3 to be processed one video frame may be extracted every 1 second; the frame extraction situation illustrated in fig. 5B can then be obtained.
Therefore, under the condition of considering the difference of the content importance degree of different content areas in the video to be processed, the segmentation duration proportion of different video segments can be determined by combining with practical processing experience, so that different numbers of video frames can be extracted from different video segments differently under the condition that the total number of extracted frames is fixed, and each extracted video frame can cover different content areas of the video in a balanced mode.
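A minimal sketch of this segmented frame extraction is given below, using the segment-duration proportions (0.2, 0.6, 0.2), the frame extraction proportions (0.2, 0.5, 0.3), and the total of 10 frames from the Fig. 5A/5B example; the helper name and the exact uniform-sampling rule inside each segment are assumptions.

```python
def segmented_key_frame_times(total_seconds: float,
                              segment_ratios=(0.2, 0.6, 0.2),
                              extraction_ratios=(0.2, 0.5, 0.3),
                              total_frames=10):
    """Timestamps (seconds) of key frames extracted per video segment.

    Each segment spans segment_ratio * total_seconds of the video and
    contributes round(extraction_ratio * total_frames) frames, sampled
    uniformly (equal spacing) inside the segment.
    """
    times, start = [], 0.0
    for seg_ratio, ext_ratio in zip(segment_ratios, extraction_ratios):
        seg_len = seg_ratio * total_seconds
        n = round(ext_ratio * total_frames)
        step = seg_len / (n + 1)                      # equal spacing inside the segment
        times.extend(start + step * (i + 1) for i in range(n))
        start += seg_len
    return times

# 20-second video as in Fig. 5A/5B: 2 + 5 + 3 = 10 key-frame positions
print([round(t, 1) for t in segmented_key_frame_times(20.0)])
# -> [1.3, 2.7, 6.0, 8.0, 10.0, 12.0, 14.0, 17.0, 18.0, 19.0]
```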
And extracting video frames in the second frame extracting mode by taking the set time length as an interval in the video to be processed.
In the processing process corresponding to the frame extraction mode II, the processing equipment can start from the video starting position of the video to be processed, and perform video frame extraction operation once every set time length until the residual time length of the video to be processed is lower than the set time length; and taking the extracted at least one video frame as the obtained at least one key video frame, wherein the value of the set duration is set according to the actual processing requirement, and the application is not particularly limited to the value.
For example, referring to fig. 5C, which is a schematic diagram of another frame extraction manner in the embodiment of the present application, as can be seen from the content illustrated in fig. 5C, assuming that the set duration is 3S and the video to be processed is 20 seconds, during the frame extraction process, one video frame is extracted every 3S, and then video frame extraction can be performed at the 3S position, the 6S position, the 9S position, the 12S position, the 15S position, and the 18S position in sequence, that is, 6 frame extraction positions illustrated in fig. 5C are determined in the video to be processed of 20S.
In this way, by performing the frame extraction operation at intervals of a set duration in the video to be processed, a representative key video frame can be obtained from the video to be processed.
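A minimal sketch of this fixed-interval extraction follows; the 3-second interval and 20-second duration mirror the Fig. 5C example, and the function name is illustrative.

```python
def fixed_interval_key_frame_times(total_seconds: float, interval: float = 3.0):
    """Extract one frame every `interval` seconds from the start of the video,
    stopping once the remaining duration is shorter than the interval."""
    times, t = [], 0.0
    while total_seconds - t >= interval:   # remaining duration still covers one interval
        t += interval
        times.append(t)
    return times

print(fixed_interval_key_frame_times(20.0))  # -> [3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```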
And extracting at least one key video frame from the video to be processed by the processing equipment, extracting the image characteristics of each key video frame, and further carrying out characteristic fusion on the image characteristics to obtain the image characteristics corresponding to the video to be processed.
Specifically, in the embodiment of the present application, the process of obtaining the video feature based on at least one key video frame, obtaining the text feature based on the text content associated with the video to be processed, and obtaining the predicted playing duration based on the video feature and the text feature by the subsequent regression prediction may be implemented based on the trained target duration estimation model.
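As a sketch of the feature-fusion step (mean pooling of the per-frame features and the feature dimensions shown are assumptions; the embodiment only specifies that per-frame image features are fused into one video feature, which is then spliced with the text feature for regression prediction):

```python
import numpy as np

def fuse_frame_features(frame_features: np.ndarray) -> np.ndarray:
    """Fuse per-key-frame image features of shape [num_frames, dim] into a
    single video feature; mean pooling is one simple choice."""
    return frame_features.mean(axis=0)

def build_fusion_feature(video_feature: np.ndarray, text_feature: np.ndarray) -> np.ndarray:
    """Splice (concatenate) the video feature and the text feature into the
    fusion feature on which regression prediction is performed."""
    return np.concatenate([video_feature, text_feature], axis=-1)

# Toy dimensions: 10 key frames with 2048-d image features, 768-d title feature
frame_feats = np.random.randn(10, 2048).astype(np.float32)
title_feat = np.random.randn(768).astype(np.float32)
print(build_fusion_feature(fuse_frame_features(frame_feats), title_feat).shape)  # (2816,)
```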
In a feasible implementation of the present method, all users may be regarded as one whole user population, and the viewing duration of that user population for a video is analyzed through model modeling; alternatively, all users may be divided into user groups according to attribute information disclosed by the users, such as age, so that different set thresholds for the click-view counts of videos can be preset in order to select videos to be processed differently for different user groups, and model modeling can be performed separately for each user group.
For example, the processing device may consider all users of the service as a user group, and in the determination process of the estimated playing duration, obtain a target duration estimation model through training, and determine the estimated playing duration of the video.
For another example, the processing device may divide all users of the service into a plurality of user groups according to age information disclosed by the users, and perform the following operations for each user group: based on historical operation data of each user in a user group for a video, a training sample is constructed, and a corresponding model is trained according to the training sample. Furthermore, in a specific processing process, a proper model can be selected according to the age of the user of the pushed video, and the estimated playing time length can be determined according to the selected model.
In the following description, only taking all users as one user group as an example, the training process of the target duration estimation model will be described:
referring to fig. 6A, which is a schematic structural diagram of an initial duration estimation model according to an embodiment of the present application, referring to fig. 6A, the initial duration estimation model is constructed to include an image coding network, a text coding network, and a prediction network including a full connection layer, where,
image coding network: in the training process, it is used for encoding the sample video frames and extracting sample image features; in the embodiment of the application, after a ResNet network is pre-trained with a large-scale classified image data set (ImageNet), the obtained pre-trained ResNet network is determined as the image coding network, for example, ResNet152 is adopted as the image coding network to implement image coding; alternatively, other image encoders implementing the image encoding function may be adopted as the image coding network.
Text encoding network: in the training process, it is used for semantically encoding the sample text to obtain the semantic vector of the sample text (namely the sample text features); in the embodiment of the application, a pre-trained BERT network can be used as the text encoding network; alternatively, a Long Short-Term Memory network (LSTM) may be used as the text encoding network, or other networks capable of achieving semantic encoding, such as a convolutional neural network (Convolutional Neural Networks, CNN), may be used as the text encoding network.
Prediction network: in the training process, it is used for splicing the obtained sample video features and sample text features to realize image-text fusion, inputting the fusion result into a fully connected layer to predict a floating point value, and obtaining the output predicted playing duration.
Referring to fig. 6B, which is a schematic structural diagram of another initial duration estimation model according to an embodiment of the present application, as shown in fig. 6B, the initial duration estimation model is constructed to include an image coding network, a text coding network, and a prediction network including a fully connected layer and a normalization layer, where,
image coding network: in the training process, it is used for encoding the sample video frames and extracting sample image features; in the embodiment of the application, after a ResNet network is pre-trained with a large-scale classified image data set (ImageNet), the obtained pre-trained ResNet network is determined as the image coding network, for example, ResNet152 is adopted as the image coding network to implement image coding; alternatively, other image encoders implementing the image encoding function may be adopted as the image coding network.
Text encoding network: in the training process, it is used for semantically encoding the sample text to obtain the semantic vector of the sample text (namely the sample text features); in the embodiment of the application, a pre-trained BERT network can be used as the text encoding network; alternatively, an LSTM may be used as the text encoding network, or other networks capable of achieving semantic encoding, such as a CNN, may be used as the text encoding network.
Prediction network: in the training process, it is used for splicing the obtained sample video features and sample text features to realize image-text fusion, inputting the fusion result into a fully connected layer to predict a floating point value, and performing normalization processing based on a normalization layer (denoted as a sigmoid layer) to obtain the output predicted playing duration duty ratio.
In this way, by means of the different model structures illustrated in fig. 6A and 6B, an initial duration estimation model capable of obtaining different types of estimated play results can be constructed according to actual processing requirements, so that the requirement of obtaining different results can be met, and the usability of the estimated play results is improved.
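For ease of understanding, a minimal sketch of the two model structures illustrated in fig. 6A and fig. 6B is given below; PyTorch with torchvision's ResNet152 and a Hugging Face BERT encoder are assumed, and the class name, checkpoint name and feature dimensions are illustrative assumptions rather than limitations of the present application:

import torch
import torch.nn as nn
from torchvision.models import resnet152
from transformers import BertModel

class DurationEstimationModel(nn.Module):
    """Image coding network + text coding network + prediction network.
    If use_sigmoid is True, the output is the predicted playing duration duty ratio (fig. 6B);
    otherwise it is the predicted playing duration itself (fig. 6A)."""
    def __init__(self, use_sigmoid: bool = True):
        super().__init__()
        self.image_encoder = resnet152(weights="IMAGENET1K_V1")  # pre-trained on ImageNet
        self.image_encoder.fc = nn.Identity()                    # keep the 2048-d pooled feature
        self.text_encoder = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        self.head = nn.Linear(2048 + 768, 1)                     # fully connected prediction layer
        self.use_sigmoid = use_sigmoid

    def forward(self, frames: torch.Tensor, input_ids, attention_mask):
        # frames: (num_frames, 3, H, W), the key video frames of one video
        img_feats = self.image_encoder(frames)          # (num_frames, 2048)
        img_emb = img_feats.mean(dim=0)                 # simple equal-weight fusion of frame features
        cls_emb = self.text_encoder(input_ids=input_ids,
                                    attention_mask=attention_mask).pooler_output[0]  # (768,)
        fusion = torch.cat([cls_emb, img_emb], dim=-1)  # Fusion = [Cls_emb : Img_emb]
        out = self.head(fusion)
        return torch.sigmoid(out) if self.use_sigmoid else out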
Further, referring to fig. 6C, which is a schematic diagram of a process of training to obtain a target duration estimation model in the embodiment of the present application, the training process of the target duration estimation model is described below with reference to fig. 6C:
step 601: the processing device obtains a training sample set, wherein one training sample comprises: a piece of sample video, a sample playing result corresponding to the sample video, and a sample text associated with the sample video.
Specifically, when the processing device acquires a training sample set, the processing device can select sample videos according to historical operated data of each video and construct the training sample set; alternatively, the processing device may obtain the training sample set directly from the other device.
In the process of self-constructing the training sample set, the processing device can select, according to the historical operated data of each video, the videos whose clicked playing times exceed a second set value as the sample videos, or can directly use the exposure-clicked videos (or called clicked videos) within a specified historical time period as the sample videos, where the value of the second set value is set according to actual processing requirements and is not specifically limited in this application.
Furthermore, when determining a corresponding sample playing result for each selected sample video, considering that two feasible model structures of fig. 6A and 6B exist in the application, when selecting the model structure illustrated in fig. 6A to construct an initial duration estimation model, the sample playing result refers to a sample playing duration, or may be a sample playing duration duty ratio; when the model structure illustrated in fig. 6B is selected to construct the initial duration estimation model, the sample play result refers to the sample play duration ratio, where the sample play duration ratio refers to the ratio between the sample play duration of a sample video and the total video duration of the sample video.
When determining the sample playing duration of one sample video, the processing device acquires each single watched duration associated with the sample video according to the historical operated data of the sample video, and determines the sample playing duration corresponding to the sample video based on each single watched duration, where a single watched duration refers to the duration for which the video is played after a user clicks the video for playing once, and the historical operated data of a video records how that video has been operated.
It should be noted that, when determining the corresponding sample playing duration for one sample video, the average of the single watched durations associated with that sample video may be calculated, and this average may be determined as the corresponding sample playing duration; alternatively, the single watched durations may be sorted in ascending or descending order, and the median of the sorted single watched durations may be determined as the sample playing duration; alternatively, in the case that the sample video is an exposure-clicked video within the specified historical time period, the single watched duration corresponding to the sample video may be directly determined as the sample playing duration.
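As a simple illustration of the above aggregation options, the following sketch derives a sample playing duration from the single watched durations of one sample video; the function and mode names are illustrative assumptions only:

from statistics import mean, median

def sample_play_duration(single_watch_durations, mode="mean"):
    """Aggregate the single watched durations of one sample video into its sample playing duration."""
    if len(single_watch_durations) == 1:
        # e.g. an exposure-clicked video in the specified historical period with a single watch record
        return single_watch_durations[0]
    if mode == "mean":
        return mean(single_watch_durations)
    if mode == "median":
        return median(single_watch_durations)  # intermediate value of the sorted durations
    raise ValueError("unknown aggregation mode: " + mode)

# example: watch records of 5 s, 12 s and 30 s give a sample playing duration of about 15.7 s
print(round(sample_play_duration([5.0, 12.0, 30.0]), 1))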
In addition, the processing device, upon selecting a sample video, may determine the title text of the sample video as the sample text associated with the sample video.
Step 602: the processing equipment adopts a training sample set to carry out multi-round iterative training on the constructed initial duration estimation model, and a trained target duration estimation model is obtained.
After the processing equipment acquires a training sample set, performing multiple rounds of iterative training on the constructed initial duration estimation model until a preset convergence condition is met, so as to obtain a trained target duration estimation model, wherein the preset convergence condition can be that the number of model training rounds reaches a preset third set value or the number of times that a model loss value is continuously lower than a fourth set value reaches a preset fifth set value; the specific values of the third setting value, the fourth setting value, and the fifth setting value are set according to actual processing requirements, which are not particularly limited in this application.
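As a sketch of the preset convergence condition described above (the concrete set values used here are illustrative assumptions, not values prescribed by the present application):

def training_converged(round_idx, recent_losses,
                       third_set_value=50,     # maximum number of training rounds (assumed value)
                       fourth_set_value=0.01,  # loss threshold (assumed value)
                       fifth_set_value=5):     # required number of consecutive low-loss rounds (assumed value)
    """Return True when the number of training rounds reaches the third set value, or when the model
    loss value has been continuously lower than the fourth set value for fifth_set_value rounds."""
    if round_idx >= third_set_value:
        return True
    recent = recent_losses[-fifth_set_value:]
    return len(recent) == fifth_set_value and all(loss < fourth_set_value for loss in recent)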
The following describes the processing performed in the model training process, taking as an example the processing steps performed in the initial round of iterative training process:
the processing equipment adopts an initial duration estimation model, extracts sample video features corresponding to sample videos based on at least one sample video frame extracted from the sample videos, extracts sample text features corresponding to the sample videos based on sample texts associated with the sample videos, and carries out regression prediction processing based on fusion features of the sample video features and the sample text features to obtain predicted play results corresponding to the sample videos; and adjusting model parameters of the initial duration estimation model based on the numerical difference between the predicted playing result and the sample playing result of the sample video.
Specifically, in one round of iterative training, at least one sample video frame is extracted from the read sample video, where the frame extraction manner for the sample video frames is the same as the manner of extracting at least one key video frame from the video to be processed described in step 302, and is not repeated here.
The training process will be specifically described below by taking the image encoding network being a pre-trained ResNet network and the text encoding network being a pre-trained BERT network as an example.
After the processing device inputs the sample text into the BERT network, semantic encoding is performed on the sample text based on the BERT network to obtain the overall semantic vector of the sample text, namely the sample text features, denoted as Cls_emb = BERT(title), where title represents the sample text and Cls_emb represents the sample text features extracted based on the sample text;
at the same time, the processing device inputs the at least one sample video frame into the ResNet network, and finally obtains the sample video features corresponding to the sample video. Specifically, in the case that the number of sample video frames is 1, ResNet can be directly adopted to perform image encoding on the sample video frame, and the obtained image features are the sample video features; in the case that the number of sample video frames is multiple, ResNet can be adopted to perform image encoding on each sample video frame to obtain the corresponding image features, and then the obtained image features are weighted and fused to obtain the corresponding sample video features, where the weighting coefficients adopted in the weighted fusion are set according to actual processing requirements. The resulting sample video features are denoted as Img_emb = ResNet(Img), where Img represents the at least one input sample video frame.
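The weighted fusion of the per-frame image features mentioned above may be sketched as follows (PyTorch assumed; equal weights are used by default and the coefficients are illustrative):

import torch

def fuse_frame_features(frame_features, weights=None):
    """Weighted fusion of per-frame image features of shape (num_frames, feat_dim)
    into one sample video feature Img_emb of shape (feat_dim,)."""
    num_frames = frame_features.shape[0]
    if num_frames == 1:
        return frame_features[0]                   # a single frame's features are used directly
    if weights is None:
        weights = [1.0 / num_frames] * num_frames  # equal weights unless configured otherwise
    w = torch.tensor(weights, dtype=frame_features.dtype).unsqueeze(1)
    return (w * frame_features).sum(dim=0)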
Further, the sample video features and the sample text features are input into the prediction network, and feature fusion of the sample video features and the sample text features is performed in the prediction network to obtain a feature fusion result; in the case that dimension splicing is performed on the sample video features and the sample text features in a vector splicing manner to obtain the feature fusion result, the obtained fusion result is denoted as Fusion = [Cls_emb : Img_emb], where [:] denotes vector splicing; in addition, in the embodiment of the application, image-text information fusion can also be realized in a manner such as vector connection or dot product to obtain the feature fusion result.
Then, in the case of training with the model structure illustrated in fig. 6A, regression prediction continues to be performed based on the feature fusion result by means of the fully connected layer in the prediction network, so as to obtain the predicted playing duration, denoted as FNN(Fusion); in the case of training with the model structure illustrated in fig. 6B, regression prediction and normalization processing are performed based on the feature fusion result by means of the fully connected layer and the normalization layer in the prediction network, so as to obtain the predicted playing duration duty ratio, denoted as sigmoid(FNN(Fusion)).
Further, after the predicted playing result is obtained, the model loss value may be calculated using a loss function commonly used in regression models, such as mean square error (Mean Square Error, MSE), mean absolute error (Mean Absolute Error, MAE), or smoothed mean absolute error (Huber loss).
Taking the example of calculating model loss using the MSE loss function, the processing device calculates model loss values as follows:
Loss = (1/N) × Σ_{i=1}^{N} (y_i − ŷ_i)²

wherein Loss represents the calculated model loss value; N represents the total number of training samples simultaneously input in one training batch (batch); y_i represents the real playing duration duty ratio, which specifically refers to the ratio of the sample playing duration of the sample video to its total video duration; and ŷ_i represents the predicted playing duration duty ratio calculated based on the predicted playing duration, which specifically refers to the ratio of the predicted playing duration to the total video duration of the corresponding sample video.
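A minimal sketch of the above MSE loss over the playing duration duty ratios of one batch (PyTorch assumed):

import torch

def duration_ratio_mse_loss(pred_ratio, true_ratio):
    """Mean square error between the predicted and real playing duration duty ratios
    of the N training samples in one batch; both tensors have shape (N,)."""
    return ((true_ratio - pred_ratio) ** 2).mean()

# equivalent to torch.nn.functional.mse_loss(pred_ratio, true_ratio)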
Furthermore, the processing device adjusts the text coding network, the image coding network and the prediction network in the initial duration estimation model according to the obtained model loss value.
In this way, in one round of iteration, the processing device can, by means of the image-text information related to the sample video and the sample play result, guide the initial duration estimation model to learn the capability of extracting the image-text fusion features of the sample video and outputting the predicted play result; in addition, although modeling is performed in a supervised training manner, in some implementations sample construction can be completed solely from historical exposure-click data (or historical operated data), without any additional labeled data. Moreover, in the technical scheme provided by the application, the length and the context of the sample text are not limited when constructing the training samples, so the generalization capability is strong and the method can be applied to various application scenarios.
In summary, the method amounts to a modeling manner of estimating the playing duration based on image-text multi-modal content: in a feasible implementation, a pre-trained ResNet is directly adopted to encode the sample video frames, a BERT network is adopted to encode the sample text content, and the historical click playing duration ratio (or historical playing duration ratio) of the sample video is fitted in a regression manner.
Similarly, the processing device may iteratively execute the training process for the initial duration estimation model until a preset model convergence condition is satisfied, thereby finally obtaining a trained target duration estimation model.
In this way, through multiple rounds of iterative training of the initial duration estimation model, a target duration estimation model that predicts the estimated playing duration of a video based on the video frames in the video and the text content associated with the video can be obtained; therefore, the influence of the image-text information of the video is integrated in the determination process of the estimated playing duration, and targeted prediction can be realized for different videos to obtain a reasonable estimated playing duration, so that the interpretability of the determination process of the estimated playing duration is improved, and the accuracy of the determined estimated playing duration is also improved.
Step 303: and the processing equipment carries out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
Specifically, the processing device may use the trained target duration estimation model to perform regression prediction processing based on the fusion feature of the video feature and the text feature of the video to be processed, so as to obtain the estimated playing duration corresponding to the video to be processed, where the manner of fusing the video feature and the text feature is identical to the manner of fusing the sample video feature and the sample text feature in step 602, and is not repeated here.
It should be understood that, in the embodiment of the present application, the processing device adopts the prediction network in the target duration estimation model, and can perform regression prediction based on the fusion feature to obtain the predicted play result, where the predicted play result may be the predicted play duration, or the predicted play duration duty ratio. Under the condition that the obtained estimated playing result is the estimated playing duration duty ratio, the processing equipment needs to calculate the estimated playing duration based on the total duration of the video to be processed and the estimated playing duration duty ratio.
For example, assuming that the estimated playing duration duty ratio output by the target duration estimation model is 0.8 and the total duration of the video to be processed is 20 seconds, the estimated playing duration corresponding to the video to be processed is 0.8 × 20 = 16 seconds.
Further, after obtaining the estimated playing duration corresponding to the video to be processed, in some possible implementations, the processing device may obtain the estimated playing duration duty ratio corresponding to the video to be processed according to the estimated playing duration, and, when determining that the estimated playing duration duty ratio exceeds a preset duty ratio threshold, extract keywords from the corresponding text content to obtain description keywords corresponding to the video to be processed; furthermore, the description keywords are added to a preset keyword set, and the generation of description texts for other videos is guided according to the description keywords included in the keyword set.
Specifically, because the influence of text features is introduced in the model prediction process, an implicit relationship between text semantics and the predicted playing duration can be established in the target duration estimation model. By analyzing the text content corresponding to a high estimated playing duration, the processing device can capture the description keywords in the text content that are strongly related to the estimated playing duration; in other words, the description keywords that are likely to obtain a high estimated playing duration can be captured by analyzing the text content. When extracting the description keywords, the stop words in the text content can be removed, the remaining content can be segmented into words, and each word obtained by the segmentation is used as a description keyword.
Optionally, in other possible implementations, the processing device may determine whether to perform keyword extraction according to the predicted playing duration, specifically, perform keyword extraction on the corresponding text content to obtain the corresponding description keyword when determining that the predicted playing duration exceeds the preset duration threshold, where the relevant extraction mode is the same as the above processing procedure, which is not described in detail in this application.
Meanwhile, the processing device may construct a keyword set to save the extracted descriptive keywords; and further, when the description text is generated for other videos later, the description text can be generated according to the description keywords in the keyword set.
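The keyword capture step described above may be sketched as follows; jieba is assumed for word segmentation, and the stop-word list, threshold value and function name are illustrative assumptions only:

import jieba  # assumption: jieba is used for word segmentation of the text content

STOP_WORDS = {"的", "了", "吗", "呢", "吧", "啊"}  # illustrative stop-word list
keyword_set = set()                                # preset keyword set

def capture_description_keywords(text_content, predicted_ratio, ratio_threshold=0.8):
    """When the estimated playing duration duty ratio exceeds the preset threshold, segment the text
    content, drop stop words, and add the remaining words to the keyword set as description keywords."""
    if predicted_ratio <= ratio_threshold:
        return
    for word in jieba.lcut(text_content):
        if word.strip() and word not in STOP_WORDS:
            keyword_set.add(word)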
Therefore, by capturing the words in the text content that are strongly related to the playing duration, words that express the video theme well and are more attractive to users can be accumulated, so as to provide a reference basis for generating description texts of other videos and improve the possibility that those other videos will be clicked and played.
In addition, in the case that the video to be processed is determined from the recalled candidate videos, after obtaining the estimated playing duration corresponding to the video to be processed, the processing device can determine the video score corresponding to the video to be processed based on the estimated playing duration of the video to be processed in combination with the historical pushing condition and the historical clicking condition corresponding to the video to be processed; and when the video score exceeds a preset score threshold, determine the video to be processed as a video to be pushed.
Specifically, after obtaining the estimated playing duration, the processing device may obtain a video score by weighted calculation over the estimated playing duration, the historical pushing total and the historical clicking total; furthermore, when determining that the video score exceeds the preset score threshold, it can be determined that the video to be processed has a high probability of receiving good user feedback after being pushed, so that the video to be processed can be determined as a video to be pushed, where the weight values respectively preset for the estimated playing duration, the historical pushing total and the historical clicking total, as well as the value of the score threshold, are set according to actual processing requirements and are not specifically limited here.
In this way, the determined estimated playing duration is equivalent to supplementing the playing-duration-type features associated with the video to be processed, so that a video score can be calculated for the recalled video to be processed, thereby providing a reference basis for the video recommendation process; when videos are pushed according to the video scores, more exposure opportunities can be given to videos that have not been pushed or that have been pushed only a few times.
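The weighted video scoring described above may be sketched as follows; the weight values and the score threshold are illustrative assumptions set according to actual processing requirements:

def video_score(estimated_play_duration, total_push_count, total_click_count,
                w_duration=0.5, w_push=0.2, w_click=0.3):
    """Weighted combination of the estimated playing duration, the historical pushing total
    and the historical clicking total of one video to be processed."""
    return (w_duration * estimated_play_duration
            + w_push * total_push_count
            + w_click * total_click_count)

def should_push(score, score_threshold=10.0):
    # the score threshold is also set according to actual processing requirements
    return score > score_threshold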
The following describes, with reference to a specific example, a process of determining an estimated playing duration of a video:
Referring to fig. 7, which is a schematic diagram of a process of determining the estimated playing duration of a video in an embodiment of the present application, in response to a user initiating a video query request based on "braised pork", the processing device can recall candidate videos related to "braised pork"; furthermore, according to actual processing requirements, the trained target duration estimation model is adopted, all the recalled candidate videos are taken as videos to be processed, and the estimated playing duration corresponding to each video to be processed is predicted; alternatively, according to the processing requirements, candidate videos whose clicked playing times are lower than a set threshold are first determined from the candidate videos and taken as the videos to be processed, and the corresponding estimated playing duration is then predicted for these videos to be processed.
Continuing to describe with reference to fig. 7, assuming that the number of the determined videos to be processed is 4, a target duration estimation model may be adopted, and for each video to be processed, a corresponding estimated playing duration duty ratio is determined based on the key video frames extracted from the video to be processed and the video titles of the video to be processed; and further determining the estimated playing time according to the estimated playing time duty ratio and the corresponding total time length of the video to be processed.
In summary, when predicting the estimated playing duration of a video in a regression manner, on the one hand, the text semantics of the text content associated with the video can be understood, the positive or negative correlation between words and the playing duration can be captured, and the implicit association between text semantics and playing duration can be established, so that the association between key words in the text content and information such as the playing duration of the video can be enhanced; on the other hand, the visual impact brought by the image information in the video can be considered; in the determination process of the estimated playing duration, the text and image information can thus be combined for modeling, effectively capturing the relation between the content semantics of the video and the playing duration of the video.
Based on the above technical scheme, the present application can integrate the image content of a video and the text content of the video, reasonably predict the playing duration for high-quality videos that lack history or have no exposure, and supplement the playing-duration-type features of such videos, thereby improving the probability of the videos being exposed, optimizing the experience of the whole search system and the online video click probability, and being applicable to various scenarios in which video content or image-text content needs to be searched and recommended; in addition, it should be understood that, because the present application combines the video key frames and the video title information to perform multi-modal information modeling, the estimated playing duration of a video can be determined reasonably, so that for a video whose video content is inconsistent with its text content, a very short estimated playing duration can be predicted, making such a video difficult to push and reducing its influence. Furthermore, the target duration estimation model trained in this application not only can reasonably give the playing duration estimate (or estimated playing duration) of an unexposed video, but can also, by means of the model's prediction results, assist in capturing the words in the text that are strongly related to the playing duration.
Based on the same inventive concept, referring to fig. 8, which is a schematic logic structure diagram of a device for determining a predicted playing time length in the embodiment of the present application, the device 800 for determining a predicted playing time length includes an obtaining unit 801, an extracting unit 802, and a predicting unit 803, where,
an obtaining unit 801, configured to obtain a video to be processed, and obtain text content associated with the video to be processed;
an extracting unit 802, configured to extract video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extract text features corresponding to the video to be processed based on text content;
and the prediction unit 803 is used for carrying out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
Optionally, after obtaining the estimated playing duration corresponding to the video to be processed, the prediction unit 803 is further configured to:
acquiring an estimated playing time length duty ratio corresponding to the video to be processed according to the estimated playing time length, and extracting keywords aiming at text content when the estimated playing time length duty ratio is determined to exceed a preset duty ratio threshold value to obtain description keywords corresponding to the video to be processed;
And adding the description keywords to a preset keyword set, and guiding to generate description texts of other videos according to each description keyword included in the keyword set.
Optionally, when extracting the video feature corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, the extracting unit 802 is configured to:
performing frame extraction processing on the video to be processed to obtain at least one key video frame, and performing feature extraction on the at least one key video frame to obtain image features corresponding to the at least one key video frame respectively;
and based on at least one image feature, fusing to obtain video features corresponding to the video to be processed.
Optionally, when performing frame extraction processing on the video to be processed to obtain at least one key video frame, the extracting unit 802 is configured to:
dividing the video to be processed into a plurality of video segments to be processed according to the preset proportion of each segment duration and the total video duration of the video to be processed;
for each video segment to be processed, the following operations are performed: and obtaining a correspondingly set frame extraction proportion according to the corresponding segment duration ratio of the video segment to be processed, determining the corresponding key video frame number according to the frame extraction proportion and the preset frame extraction total number, and taking the key video frame number video frames uniformly extracted from the video segment to be processed as the obtained key video frames.
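A minimal sketch of this segment-proportional determination of key video frame numbers (the helper name and example values are illustrative assumptions):

def key_frame_counts_per_segment(segment_duration_ratios, total_frames_to_extract):
    """Given the preset duration ratio of each video segment to be processed and the preset total
    number of frames to extract, determine the number of key video frames for each segment."""
    return [round(ratio * total_frames_to_extract) for ratio in segment_duration_ratios]

# example: segments covering 20%, 30% and 50% of the total video duration, 10 frames in total -> [2, 3, 5]
print(key_frame_counts_per_segment([0.2, 0.3, 0.5], 10))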
Optionally, when performing frame extraction processing on the video to be processed to obtain at least one key video frame, the extracting unit 802 is configured to:
starting from the video starting position of the video to be processed, performing video frame extraction operation once every set time period until the residual time period of the video to be processed is lower than the set time period;
and taking the extracted at least one video frame as the obtained at least one key video frame.
Optionally, when acquiring the video to be processed and the text content associated with the video to be processed, the acquiring unit 801 is configured to perform any one of the following operations:
selecting one candidate video with associated click watching times lower than a set threshold value from the recalled candidate videos, determining the one candidate video as a video to be processed, and determining a video title of the one candidate video as text content associated with the video to be processed;
and determining the acquired video to be distributed as a video to be processed, and determining the video title of the video to be distributed as text content associated with the video to be processed.
Optionally, in the case that the video to be processed is included in the recalled candidate videos, after obtaining the estimated playing duration corresponding to the video to be processed, the obtaining unit 801 is further configured to:
Determining a video score corresponding to the video to be processed by combining a historical pushing condition and a historical clicking condition corresponding to the video to be processed based on the estimated playing time length of the video to be processed;
and when the video score exceeds a preset score threshold, determining the video to be processed as the video to be pushed.
Optionally, the function of the device is implemented based on a trained target duration estimation model, where the target duration estimation model is obtained by training by the training unit 804 in the device in the following manner:
obtaining a training sample set, wherein one training sample comprises: a sample video, a sample play result of the sample video, and a sample text associated with the sample video;
and carrying out multiple rounds of iterative training on the constructed initial time length estimation model by adopting a training sample set to obtain a trained target time length estimation model.
Optionally, during a round of iterative training, the training unit 804 is configured to perform the following operations:
extracting sample video features corresponding to the sample video based on at least one sample video frame extracted from the sample video by adopting an initial duration estimation model, extracting sample text features corresponding to the sample video based on sample texts associated with the sample video, and carrying out regression prediction processing based on fusion features of the sample video features and the sample text features to obtain a prediction playing result corresponding to the sample video;
And adjusting model parameters of the initial duration estimation model based on the numerical difference between the predicted playing result and the sample playing result of the sample video.
Optionally, the initial duration estimation model is built by the training unit 804 based on any one of the following structures:
an image coding network, a text coding network, and a prediction network comprising a fully connected layer;
an image coding network, a text coding network, and a prediction network including a fully connected layer and a normalization layer.
For convenience of description, the above parts are described as being functionally divided into modules (or units) respectively. Of course, the functions of each module (or unit) may be implemented in the same piece or pieces of software or hardware when implementing the present application.
Having introduced the method and apparatus for determining the estimated play time of the exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the present application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module" or "system."
The embodiment of the application also provides electronic equipment based on the same inventive concept as the embodiment of the method. Referring to fig. 9, a schematic diagram of a hardware component of an electronic device to which embodiments of the present application are applied, where in an embodiment, the electronic device may be the processing device 220 shown in fig. 2. In this embodiment, the electronic device may be configured as shown in fig. 9, including a memory 901, a communication module 903, and one or more processors 902.
A memory 901 for storing a computer program executed by the processor 902. The memory 901 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 901 may be a volatile memory, such as a random-access memory (RAM); the memory 901 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 901 is any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 901 may also be a combination of the above memories.
The processor 902 may include one or more central processing units (central processing unit, CPU) or digital processing units, or the like. The processor 902 is configured to implement the above-mentioned method for determining the estimated playing duration when calling the computer program stored in the memory 901.
The communication module 903 is used to communicate with the client device and the server.
The specific connection medium between the memory 901, the communication module 903, and the processor 902 is not limited in the embodiments of the present application. In the embodiment of the present application, the memory 901 and the processor 902 are connected through the bus 904 in fig. 9, and the bus 904 is depicted with a bold line in fig. 9; the connection manner between other components is only schematically illustrated and is not limited thereto. The bus 904 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 9, but this does not mean that there is only one bus or only one type of bus.
The memory 901 stores a computer storage medium, in which computer executable instructions are stored, for implementing the method for determining the estimated playing time length according to the embodiments of the present application. The processor 902 is configured to execute the above-described method for determining the estimated playing duration, as shown in fig. 3.
In another embodiment, the electronic device may be another electronic device, and referring to fig. 10, a schematic diagram of a hardware composition of another electronic device to which the embodiment of the present application is applied, where the electronic device may specifically be the client device 210 shown in fig. 2. In this embodiment, the structure of the electronic device may include, as shown in fig. 10: communication component 1010, memory 1020, display unit 1030, camera 1040, sensor 1050, audio circuit 1060, bluetooth module 1070, processor 1080 and the like.
The communication component 1010 is used for communicating with a server. In some embodiments, it may include a wireless fidelity (Wireless Fidelity, WiFi) module, where the WiFi module belongs to short-range wireless transmission technology, and the electronic device may help the user send and receive information through the WiFi module.
Memory 1020 may be used to store software programs and data. Processor 1080 performs various functions and data processing of client device 210 by executing software programs or data stored in memory 1020. The memory 1020 in the present application may store an operating system and various application programs, and may also store a computer program for executing the method for determining the estimated play duration in the embodiment of the present application.
The display unit 1030 may also be used to display information entered by a user or provided to a user as well as a graphical user interface (graphical user interface, GUI) of various menus of the client device 210. In particular, the display unit 1030 may include a display screen 1032 disposed on the front of the client device 210. The display unit 1030 may be used to display a page or the like that triggers a video search operation in the embodiment of the present application.
The display unit 1030 may also be used to receive input numeric or character information and generate signal inputs related to user settings and function control of the client device 210. In particular, the display unit 1030 may include a touch screen 1031 disposed on the front of the client device 210 and may collect touch operations thereon or thereabout by a user.
The touch screen 1031 may be covered on the display screen 1032, or the touch screen 1031 may be integrated with the display screen 1032 to implement the input and output functions of the client device 210, and after integration, the touch screen may be simply referred to as a touch screen. The display unit 1030 may display an application program and corresponding operation steps.
The camera 1040 may be used to capture still images, and the user may comment the image captured by the camera 1040 through the application. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to a processor 1080 for conversion into a digital image signal.
The client device may also include at least one sensor 1050, such as an acceleration sensor 1051, a distance sensor 1052, a fingerprint sensor 1053, and a temperature sensor 1054. The client device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
Audio circuitry 1060, speakers 1061, microphone 1062 may provide an audio interface between a user and the client device 210. Audio circuit 1060 may transmit the received electrical signal after conversion of the audio data to speaker 1061 for conversion by speaker 1061 into an audio signal output. On the other hand, microphone 1062 converts the collected sound signals into electrical signals, which are received by audio circuitry 1060 and converted into audio data, which are output to communications component 1010 for transmission to, for example, another client device 210, or to memory 1020 for further processing.
The bluetooth module 1070 is used for exchanging information with other bluetooth devices having a bluetooth module through a bluetooth protocol.
Processor 1080 is a control center of the client device and connects the various parts of the overall terminal using various interfaces and lines, performs various functions of the client device and processes data by running or executing software programs stored in memory 1020 and invoking data stored in memory 1020. In some embodiments, processor 1080 may include at least one processing unit; processor 1080 may also integrate the application processor and the baseband processor. Processor 1080 in the present application may run an operating system, an application program, a user interface display, a touch response, and a method for determining an estimated play duration in the embodiments of the present application. In addition, a processor 1080 is coupled to the display unit 1030.
In some possible embodiments, aspects of the method for determining a predicted playing time period provided in the present application may also be implemented as a program product, which includes a computer program for causing an electronic device to perform the steps in the method for determining a predicted playing time period according to the various exemplary embodiments of the present application described above when the program product is run on the electronic device, for example, the electronic device may perform the steps as shown in fig. 3.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and comprise a computer program and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required to or suggested that these operations must be performed in this particular order or that all of the illustrated operations must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having a computer-usable computer program embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program commands may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the commands executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (18)

1. A method for determining a predicted playing time is characterized by comprising the following steps:
acquiring a video to be processed and acquiring text content associated with the video to be processed;
extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and extracting text features corresponding to the video to be processed based on the text content;
and carrying out regression prediction processing based on the fusion characteristics of the video characteristics and the text characteristics to obtain the estimated playing time length corresponding to the video to be processed.
2. The method of claim 1, wherein after obtaining the estimated play duration corresponding to the video to be processed, further comprises:
acquiring an estimated playing time length duty ratio corresponding to the video to be processed according to the estimated playing time length, and extracting keywords aiming at the text content when the estimated playing time length duty ratio is determined to exceed a preset duty ratio threshold value to obtain description keywords corresponding to the video to be processed;
and adding the description keywords to a preset keyword set, and guiding to generate description texts of other videos according to the description keywords included in the keyword set.
3. The method of claim 1, wherein the extracting video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed comprises:
performing frame extraction processing on the video to be processed to obtain at least one key video frame, and performing feature extraction on the at least one key video frame to obtain image features corresponding to the at least one key video frame respectively;
and based on at least one image feature, fusing to obtain video features corresponding to the video to be processed.
4. The method of claim 3, wherein the performing frame extraction processing on the video to be processed to obtain at least one key video frame comprises:
dividing the video to be processed into a plurality of video segments to be processed according to the preset proportion of each segment duration and the total video duration of the video to be processed;
for each video segment to be processed, the following operations are performed: obtaining a corresponding set frame extraction proportion according to the corresponding segment duration proportion of the video segment to be processed, determining a corresponding key video frame number according to the frame extraction proportion and a preset frame extraction total number, and taking the key video frame number video frames uniformly extracted from the video segment to be processed as obtained key video frames.
5. The method of claim 3, wherein the performing frame extraction processing on the video to be processed to obtain at least one key video frame comprises:
starting from the video starting position of the video to be processed, performing video frame extraction operation once every set time period until the residual time period of the video to be processed is lower than the set time period;
and taking the extracted at least one video frame as the obtained at least one key video frame.
6. The method of claim 1, wherein the obtaining the video to be processed, and the text content associated with the video to be processed, perform any one of:
selecting one candidate video with associated click watching times lower than a set threshold value from the recalled candidate videos, determining the one candidate video as a video to be processed, and determining the video title of the one candidate video as text content associated with the video to be processed;
and determining the acquired video to be distributed as a video to be processed, and determining the video title of the video to be distributed as text content associated with the video to be processed.
7. The method of claim 6, wherein, in the case that the video to be processed is included in each candidate video recalled, obtaining the estimated play time corresponding to the video to be processed further comprises:
determining a video score corresponding to the video to be processed by combining a historical pushing condition and a historical clicking condition corresponding to the video to be processed based on the estimated playing time length of the video to be processed;
and when the video score exceeds a preset score threshold, determining the video to be processed as the video to be pushed.
8. The method according to any one of claims 1-7, wherein the method is implemented based on a trained target duration estimation model, the target duration estimation model being trained in the following manner:
obtaining a training sample set, wherein one training sample comprises: a piece of sample video, a sample playing result of the piece of sample video, and a sample text associated with the piece of sample video;
and carrying out multiple rounds of iterative training on the constructed initial duration estimation model by adopting the training sample set to obtain a trained target duration estimation model.
9. The method of claim 8, wherein during a round of iterative training, the following operations are performed:
adopting the initial duration estimation model, extracting sample video features corresponding to sample videos based on at least one sample video frame extracted from the sample videos, extracting sample text features corresponding to the sample videos based on sample texts associated with the sample videos, and carrying out regression prediction processing based on fusion features of the sample video features and the sample text features to obtain predicted play results corresponding to the sample videos;
And adjusting model parameters of the initial duration estimation model based on the numerical difference between the predicted playing result and the sample playing result of the sample video.
10. The method of claim 8, wherein the initial duration estimation model is constructed based on any one of the following structures:
an image coding network, a text coding network, and a prediction network comprising a fully connected layer;
an image coding network, a text coding network, and a prediction network including a fully connected layer and a normalization layer.
11. An apparatus for determining an estimated playing duration, comprising:
an acquisition unit, configured to acquire a video to be processed and to acquire text content associated with the video to be processed;
an extraction unit, configured to extract video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, and to extract text features corresponding to the video to be processed based on the text content;
and a prediction unit, configured to perform regression prediction processing based on the fusion feature of the video features and the text features to obtain the estimated playing duration corresponding to the video to be processed.
12. The apparatus of claim 11, wherein after obtaining the estimated playing duration corresponding to the video to be processed, the prediction unit is further configured to:
acquire an estimated playing duration ratio corresponding to the video to be processed according to the estimated playing duration, and, when the estimated playing duration ratio is determined to exceed a preset ratio threshold, extract keywords from the text content to obtain description keywords corresponding to the video to be processed;
and add the description keywords to a preset keyword set, the description keywords included in the keyword set being used to guide generation of description texts for other videos.
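An illustrative sketch of this duration-ratio check and keyword collection, assuming jieba for keyword extraction; the threshold and number of keywords are not fixed by the claim:

```python
import jieba.analyse

def maybe_collect_keywords(est_play_seconds: float, total_seconds: float, text: str,
                           keyword_set: set, ratio_threshold: float = 0.6) -> set:
    """If the estimated playing duration ratio exceeds the threshold, extract
    description keywords from the associated text and add them to the set."""
    ratio = est_play_seconds / total_seconds if total_seconds > 0 else 0.0
    if ratio > ratio_threshold:
        keywords = jieba.analyse.extract_tags(text, topK=5)
        keyword_set.update(keywords)
    return keyword_set
```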
13. The apparatus of claim 11, wherein, when extracting the video features corresponding to the video to be processed based on at least one key video frame extracted from the video to be processed, the extraction unit is configured to:
perform frame extraction processing on the video to be processed to obtain the at least one key video frame, and perform feature extraction on the at least one key video frame to obtain image features respectively corresponding to the at least one key video frame;
and fuse the at least one image feature to obtain the video features corresponding to the video to be processed.
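The claim only requires that the per-frame image features be fused; mean pooling, as sketched below, is one simple choice:

```python
import numpy as np

def fuse_frame_features(frame_features: list) -> np.ndarray:
    """Mean-pool per-frame image features into a single video-level feature."""
    return np.mean(np.stack(frame_features, axis=0), axis=0)
```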
14. The apparatus of claim 13, wherein, when performing frame extraction processing on the video to be processed to obtain the at least one key video frame, the extraction unit is configured to:
divide the video to be processed into a plurality of video segments to be processed according to preset proportions of each segment duration to the total video duration of the video to be processed;
for each video segment to be processed, perform the following operations: obtain a corresponding set frame extraction proportion according to the segment duration proportion of the video segment to be processed, determine a corresponding number of key video frames according to the frame extraction proportion and a preset total number of extracted frames, and take that number of video frames, uniformly extracted from the video segment to be processed, as the obtained key video frames.
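An illustrative sketch of this segment-proportional frame extraction, again assuming OpenCV for decoding; the rounding strategy is an assumption:

```python
import cv2

def extract_by_segment_proportion(video_path: str, segment_ratios, frame_budget: int):
    """Split the video by preset duration proportions, give each segment a share of
    the total frame budget matching its proportion, and sample uniformly within it."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    key_frames, start = [], 0
    for ratio in segment_ratios:                          # e.g. [0.2, 0.5, 0.3]
        seg_len = int(round(total_frames * ratio))
        n_frames = max(1, int(round(frame_budget * ratio)))
        step = max(1, seg_len // n_frames)
        indices = list(range(start, min(start + seg_len, total_frames), step))[:n_frames]
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                key_frames.append(frame)
        start += seg_len
    cap.release()
    return key_frames
```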
15. The apparatus according to any one of claims 11-14, wherein the functionality of the apparatus is implemented based on a trained target duration estimation model, the target duration estimation model being trained by a training unit in the apparatus by:
obtaining a training sample set, wherein one training sample comprises: one sample video, a sample playing result of the sample video, and a sample text associated with the sample video;
and performing multiple rounds of iterative training on the constructed initial duration estimation model using the training sample set to obtain the trained target duration estimation model.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-10 when executing the computer program.
17. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-10.
18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
CN202410054227.8A 2024-01-15 2024-01-15 Method and device for determining predicted playing time, electronic equipment and storage medium Pending CN117573925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410054227.8A CN117573925A (en) 2024-01-15 2024-01-15 Method and device for determining predicted playing time, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117573925A true CN117573925A (en) 2024-02-20

Family

ID=89864706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410054227.8A Pending CN117573925A (en) 2024-01-15 2024-01-15 Method and device for determining predicted playing time, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117573925A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010740A (en) * 2021-03-09 2021-06-22 腾讯科技(深圳)有限公司 Word weight generation method, device, equipment and medium
CN114419515A (en) * 2022-01-26 2022-04-29 腾讯科技(深圳)有限公司 Video processing method, machine learning model training method, related device and equipment
CN116975615A (en) * 2022-11-14 2023-10-31 腾讯科技(深圳)有限公司 Task prediction method and device based on video multi-mode information
CN116980665A (en) * 2023-02-08 2023-10-31 腾讯科技(深圳)有限公司 Video processing method, device, computer equipment, medium and product
CN116956183A (en) * 2023-06-01 2023-10-27 腾讯科技(深圳)有限公司 Multimedia resource recommendation method, model training method, device and storage medium

Similar Documents

Publication Publication Date Title
JP6967059B2 (en) Methods, devices, servers, computer-readable storage media and computer programs for producing video
CN110209922B (en) Object recommendation method and device, storage medium and computer equipment
CN111522996B (en) Video clip retrieval method and device
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN112597395A (en) Object recommendation method, device, equipment and storage medium
CN110990598B (en) Resource retrieval method and device, electronic equipment and computer-readable storage medium
US20240211512A1 (en) Search result reordering method and apparatus, device, storage medium, and program product
CN113255625B (en) Video detection method and device, electronic equipment and storage medium
CN112040339A (en) Method and device for making video data, computer equipment and storage medium
CN115080836A (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN112182281B (en) Audio recommendation method, device and storage medium
CN114417174B (en) Content recommendation method, device, equipment and computer storage medium
CN112328889A (en) Method and device for determining recommended search terms, readable medium and electronic equipment
CN116567351B (en) Video processing method, device, equipment and medium
CN114816180A (en) Content browsing guiding method and device, electronic equipment and storage medium
CN116980665A (en) Video processing method, device, computer equipment, medium and product
CN117573925A (en) Method and device for determining predicted playing time, electronic equipment and storage medium
CN113495966B (en) Interactive operation information determining method and device and video recommendation system
CN116781965B (en) Virtual article synthesis method, apparatus, electronic device, and computer-readable medium
CN117473156A (en) Content recommendation method, device and equipment and computer storage medium
CN117216361A (en) Recommendation method, recommendation device, electronic equipment and computer readable storage medium
CN116320636A (en) Video recall model training method, video recommending method and related devices
CN116975428A (en) Associated word recommendation method and device, electronic equipment and storage medium
CN117523445A (en) Resource recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination