CN114245232B - Video abstract generation method and device, storage medium and electronic equipment

Info

Publication number: CN114245232B
Authority: CN (China)
Prior art keywords: video, frame, processed, key, key frame
Legal status: Active (granted)
Application number: CN202111531817.8A
Other languages: Chinese (zh)
Other versions: CN114245232A
Inventors: 于朋鑫, 王少康, 陈宽
Assignee (current and original): Infervision Medical Technology Co Ltd
Application filed by Infervision Medical Technology Co Ltd, with priority to CN202111531817.8A
Publication of application CN114245232A; application granted; publication of grant CN114245232B


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 - Assembly of content; Generation of multimedia applications
    • H04N21/854 - Content authoring
    • H04N21/8549 - Creating video summaries, e.g. movie trailer
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a video summary generation method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring a video to be processed and extracting the key frames in it; performing preset processing on each key frame to obtain a processing result for each key frame, displaying those results, and acquiring, during display, the user's interaction information on the key frames, the non-key frames, and the key-frame processing results in the video to be processed; and determining, based on the key-frame processing results and the interaction information, the target video frames for generating a video summary, then generating the summary of the video to be processed from those target frames. Because only the key frames receive the preset processing, the processing time and the user's waiting time are reduced, and the processing results can be displayed sooner. The video summary is formed from the interaction information and the processing result of each video frame, so the user's degree of attention to each frame is fused with the processing results, improving the accuracy of the summary.

Description

Video abstract generation method and device, storage medium and electronic equipment
Technical Field
Embodiments of the invention relate to the technical field of image processing, and in particular to a video summary generation method and apparatus, a storage medium, and an electronic device.
Background
Cloud platforms are of great value for popularizing artificial intelligence applications. On the one hand, AI applications generally depend heavily on computing hardware whose cost is difficult to bear in less developed regions; a cloud platform greatly reduces the requirements on local equipment, making AI applications easier to popularize. On the other hand, thanks to advances in communication technology, a cloud platform can more easily coordinate work across multiple centers and assist the development of the industry.
At present, medical video analysis is handled by feeding the video data uploaded by users into an artificial intelligence computing module sequentially or in batches and feeding back the computed results for the user to view; data still in the computing queue cannot be viewed. When a large amount of data is uploaded, many unprocessed items accumulate in the queue, and the user can only inspect the finished items while waiting for the rest. Moreover, when a result report is generated for video data, the large number of video frames and the redundancy of their content mean the generated report also contains a great deal of redundant information.
Disclosure of Invention
Embodiments of the invention provide a video summary generation method and apparatus, a storage medium, and an electronic device, which improve the efficiency of video summary generation and reduce information redundancy.
In a first aspect, an embodiment of the present invention provides a method for generating a video summary, including:
acquiring a video to be processed, and extracting key frames in the video to be processed;
performing preset processing on each key frame to obtain a processing result of each key frame, displaying the processing result of each key frame, and obtaining interaction information of a user on the key frame, a non-key frame and the processing result of the key frame in the video to be processed in the display process;
and determining, based on the processing result of the key frame and the interaction information, a target video frame for generating a video summary in the video to be processed, and generating the video summary of the video to be processed based on the target video frame.
In a second aspect, an embodiment of the present invention further provides a video summary generating apparatus, including:
the key frame extraction module is used for acquiring a video to be processed and extracting key frames in the video to be processed;
the key frame processing module is used for carrying out preset processing on each key frame to obtain a processing result of each key frame, and displaying the processing result of each key frame;
The interactive information acquisition module is used for acquiring interactive information of a user on a key frame, a non-key frame and a processing result of the key frame in the video to be processed in the display process;
and the video summary generating module is used for determining, based on the processing result of the key frame and the interaction information, a target video frame for generating a video summary in the video to be processed, and generating the video summary of the video to be processed based on the target video frame.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the video summary generating method as provided in any embodiment of the present invention when executing the program.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a video summary generating method as provided in any embodiment of the present invention.
According to the technical scheme of this embodiment, the key frames of the video to be processed are extracted and given the preset processing, which reduces the processing time and the user's waiting time, and displaying the key-frame processing results lets results be presented sooner. During display, the interaction information of each video frame is recorded, and the video summary is formed from this interaction information together with the per-frame processing results, fusing the user's degree of attention to each frame with those results and improving the accuracy of the summary.
Drawings
Fig. 1 is a schematic flow chart of a video summary generation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a determination of a target video frame provided by an embodiment of the present invention;
fig. 3 is a schematic diagram of a video summary generation flow provided in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a video summary generating apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a schematic flow chart of a video summary generation method according to an embodiment of the present invention. This embodiment is applicable to processing a video quickly while avoiding long waits. The method may be performed by the video summary generating apparatus provided by embodiments of the invention; the apparatus may be implemented in software and/or hardware and may be configured on an electronic computing device. The method specifically comprises the following steps:
S110, acquiring a video to be processed, and extracting key frames in the video to be processed.
S120, carrying out preset processing on each key frame to obtain a processing result of each key frame, displaying the processing result of each key frame, and obtaining interaction information of a user on the key frame, the non-key frame and the processing result of the key frame in the video to be processed in the display process.
S130, determining a target video frame for generating a video abstract in the video to be processed based on the processing result of the key frame and the interaction information, and generating the video abstract of the video to be processed based on the target video frame.
The video to be processed in this embodiment may include, but is not limited to, medical imaging video, surveillance video, and the like. The processing applied to the video may include, without limitation, identifying a region of interest in each video frame, segmenting a region of interest in a video frame, classifying video frames, performing face recognition on video frames, and the like.
When a processing instruction for the video to be processed is received, the key frames in it are identified. The key frames are a subset of the video's frames, and processing only them reduces the amount of frame data handled and speeds up processing. In some alternative embodiments, extracting the key frames from the video to be processed comprises: extracting them at a preset frame interval. The interval may be fixed, for example 10 frames; alternatively, it may be determined from the total frame count of the video together with a preset proportion, for example 10%, the resulting interval being a positive integer.
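By way of illustration only, such interval-based extraction might be sketched as follows; the function name and the OpenCV-based decoding are assumptions made for the example, not components specified by this embodiment.

```python
import cv2

def extract_keyframes_by_interval(video_path, interval=10, ratio=None):
    """Extract key frames at a preset frame interval; when `ratio` is given
    (e.g. 0.10), derive the interval from the total frame count instead."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if ratio is not None:
        interval = max(1, int(total * ratio))  # preset interval as a positive integer
    keyframes = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            keyframes.append((idx, frame))  # keep the frame number alongside the image
        idx += 1
    cap.release()
    return keyframes
```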
In some alternative embodiments, the purpose of extracting key frames is to eliminate duplicate video frames and so avoid the time spent processing them. Accordingly, extracting the key frames comprises: determining the video difference between adjacent frames of the video to be processed and, when that difference satisfies a preset condition, determining a key frame among those adjacent frames. The difference may be derived from a similarity computed over the adjacent frames: the similarity may be based on distance information between them (for example, Euclidean distance), or the pixel values of corresponding pixels may be compared, the similarity then being the proportion of pixel positions whose value difference falls within a preset range.
Video difference is inversely related to similarity: the more similar two adjacent frames are, the smaller their difference, and the less similar, the larger. Specifically, when the similarity of adjacent frames is below a preset threshold, their video difference satisfies the difference condition, meaning the frames are not duplicates, and at least one key frame can be determined among them. Optionally, the first frame of the video to be processed is always a key frame, and for each pair of adjacent frames satisfying the difference condition, the later of the two is determined as a key frame.
Optionally, a target number of key frames may be set, either entered in advance by the user or determined from the total frame count and a preset proportion. In that case, after the video difference of every pair of adjacent frames is determined, the pairs may be ranked by difference, for example from largest to smallest, and key frames determined only within a preset ranking range, reducing the number of key frames.
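A minimal sketch of this difference-based variant, assuming frames arrive as grayscale NumPy arrays; the pixel tolerance and similarity threshold are illustrative values, not values fixed by this embodiment.

```python
import numpy as np

def pixel_similarity(frame_a, frame_b, tol=10):
    """Proportion of co-located pixels whose value difference lies within `tol`."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return float((diff <= tol).mean())

def keyframes_by_difference(frames, sim_threshold=0.95, top_k=None):
    """The first frame is always a key frame; for each adjacent pair whose
    similarity falls below the threshold (video difference satisfies the
    preset condition), the later frame becomes a key frame. `top_k` applies
    the optional ranking by video difference."""
    sims = [pixel_similarity(frames[i], frames[i + 1])
            for i in range(len(frames) - 1)]
    candidates = [(1.0 - s, i + 1) for i, s in enumerate(sims) if s < sim_threshold]
    if top_k is not None:
        candidates = sorted(candidates, reverse=True)[:top_k]  # largest differences first
    return [0] + sorted(idx for _, idx in candidates)
```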
In some embodiments, the purpose of key frame extraction includes, without limitation, removing duplicate frames or extracting frames that contain a preset object. Accordingly, several key frame extraction models may be prepared, one per extraction purpose; the model matching the current purpose is invoked, the video to be processed is fed into it, and the key frames, or key frame information such as timestamps or frame numbers, are obtained from its output. Each model may be trained on sample videos and their standard key frames. For the purpose of eliminating duplicates, the standard key frames of a sample video are its non-duplicate frames, that is, the key frames determined by video difference. For the purpose of extracting frames containing a preset object, the standard key frames are the sample frames that contain the object, which may be, for example, a preset face or a preset region of interest such as a preset lesion.
Correspondingly, extracting the key frames in the video to be processed comprises: extracting them based on a preset key frame extraction model. Optionally, a target number of key frames may be set and that many frames drawn at random from those the model determines.
The extracted key frames are then given the preset processing, which is chosen according to the processing requirement and includes, without limitation, identifying a region of interest in each frame, segmenting a region of interest, classifying frames, performing face recognition, and the like. Optionally, processing models are prepared in advance: the model matching the requirement is invoked and the extracted key frames are fed into it to obtain each frame's processing result. The preset processing models accordingly include, without limitation, a region-of-interest extraction model, a region-of-interest segmentation model, a video frame classification model, a face recognition model, and so on; and the processing results of a key frame include, without limitation, its classification result, the probability that it contains a region of interest, the pixel positions of the region of interest, its recognition result and recognition probability, and the like.
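A sketch of such a requirement-to-model dispatch; the requirement names and the stub callables are hypothetical placeholders for whichever preset processing models are deployed.

```python
from typing import Callable, Dict

def detect_roi(frame):
    """Placeholder: would return the probability that a region of interest is present."""
    ...

def segment_roi(frame):
    """Placeholder: would return the pixel positions of the region of interest."""
    ...

# Registry mapping a processing requirement to its preset processing model.
PRESET_MODELS: Dict[str, Callable] = {
    "roi_detection": detect_roi,
    "roi_segmentation": segment_roi,
}

def process_keyframes(keyframes, requirement: str):
    """Invoke the model matching the requirement on every extracted key frame."""
    model = PRESET_MODELS[requirement]
    return {idx: model(frame) for idx, frame in keyframes}
```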
Processing only the key frames extracted from the video to be processed, rather than all of its frames, and visually displaying their processing results reduces both the processing time of the video and the user's waiting time.
In this embodiment, the electronic device is equipped with a display unit, or is electrically or communicatively connected to one, and the key-frame processing results are displayed visually through it; the display unit may be, for example, a display or a display screen. The display format of the processing results is not limited here. For example, the processing result of each key frame may form a display page containing one or more of the processing result, an identifier of the key frame (such as a timestamp or serial number), and the key frame image; each page then carries a switching control for moving between it and adjacent pages.
An index may also be formed from the positions of the key frames in the video to be processed, in list form, progress-bar form, or the like, without limitation. A list-form index contains the identifiers of the key frames, each associated with the display page of the corresponding frame; when an identifier in the index is selected, its page is shown so the frame's processing result can be viewed. In some embodiments, the index may also include identifiers of non-key frames, each associated with the corresponding video frame and displaying it when selected; in the index, key-frame and non-key-frame identifiers are displayed differently so they can be told apart. A progress-bar index is related to every frame of the video by timestamp or frame order, and the index positions corresponding to key frames are displayed distinctly, for example in a different color or highlighted, so the user can quickly locate them; each position is associated one-to-one with a video frame, the associated frame being shown when its position is selected, and, when a frame has a processing result, that result may be shown at a preset position of the display page or within the frame itself. For textual results such as recognition probabilities or classification results, the display interface may include a result display area; for a region-of-interest segmentation result, the corresponding pixels may be displayed distinctly within the frame, for example highlighted.
Visually displaying the key-frame processing results makes them convenient to inspect and enables interaction with the user.
While the processing results are displayed, the user's interactions with the results and/or with each video frame of the video to be processed are monitored in real time; this interaction information reflects the user's degree of attention to a frame. For example, a long browsing time on a frame indicates high attention, as do viewing, annotating, zooming, or sharing it.
In some embodiments, the interaction information comprises interaction information of a time dimension and of an operation dimension. The time dimension includes one or more of: dwell time, dwell-time ratio, and number of accesses. Whenever any video frame (key or non-key) is selected for display, that is, whenever the display switches to it from another frame, its access count is incremented by one. For each access, the moment of clicking into the frame is the start point and the moment of the ending decision (such as generating the summary report or leaving the frame's display interface) is the end point; their difference is the dwell time of that single access, and when a frame is accessed several times the dwell times accumulate into a total dwell time. The dwell-time ratio of a frame is its dwell time divided by the total dwell time over all frames of the video to be processed.
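One way these time-dimension statistics could be accumulated is sketched below; the class and method names are assumptions, not components of this embodiment.

```python
import time
from collections import defaultdict

class DwellTracker:
    """Accumulates per-frame dwell time and access counts during display."""
    def __init__(self):
        self.dwell = defaultdict(float)  # frame index -> total dwell seconds
        self.visits = defaultdict(int)   # frame index -> number of accesses
        self._current = None             # (frame index, entry timestamp)

    def enter(self, frame_idx):
        self.leave()                     # close out the previously shown frame
        self.visits[frame_idx] += 1      # switching to a frame counts as one access
        self._current = (frame_idx, time.monotonic())

    def leave(self):
        """Called on the ending decision or when switching away from a frame."""
        if self._current is not None:
            idx, t0 = self._current
            self.dwell[idx] += time.monotonic() - t0
            self._current = None

    def dwell_ratio(self, frame_idx):
        """Dwell time of one frame over the total dwell time of all frames."""
        total = sum(self.dwell.values())
        return self.dwell[frame_idx] / total if total else 0.0
```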
The interaction information of the operation dimension includes, without limitation, zoom operations, adjustments of the frame's parameters, annotation operations, sharing operations, and the like. Adjustable frame parameters include, without limitation, contrast and brightness. An operation on a video frame may be triggered by a preset adjustment gesture or a preset adjustment control; when the gesture is input or the control is triggered, the interaction information of the corresponding operation is recorded.
In some optional embodiments, acquiring the user's interaction information on the key frames, the non-key frames, and the key-frame processing results during display comprises: recording, when a selection of any key frame, non-key frame, or key-frame processing result is detected, its display time information and display count information; and recording, when a preset operation is detected while any of these is displayed, the corresponding operation information. Concretely, whenever any key or non-key frame is selected for display, or any frame's processing result is selected, the frame's access count is incremented and the current dwell time of the frame or result is recorded. While a frame is displayed, preset operations are monitored, for example the user's touch gestures and selections of the display interface's controls; when a preset operation is observed, its interaction value is recorded as 1, and when it is never observed during the frame's display, it is recorded as 0.
Illustratively, a zoom operation may be performed through a zoom control or through a two-finger pinch or spread gesture; when the control is triggered or the gesture detected while a frame is displayed, a zoom operation is recorded for the frame, for example its zoom interaction value is set to 1. Likewise, an annotation operation may be performed by selecting an annotation control or by adding text, graphics, and so on to the frame; when the control is triggered or annotation content appears during display, an annotation operation is recorded, for example its annotation interaction value is set to 1.
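A sketch of an operation-dimension recorder under the same caveat; the operation names mirror the examples in the text, but the set is illustrative, not exhaustive.

```python
from collections import defaultdict

# Illustrative set of monitored preset operations.
OPERATIONS = ("zoom", "annotate", "share", "adjust_contrast", "adjust_brightness")

class OperationRecorder:
    def __init__(self):
        # frame index -> {operation name: 0 or 1}; every flag starts at 0.
        self.flags = defaultdict(lambda: dict.fromkeys(OPERATIONS, 0))

    def record(self, frame_idx, operation):
        """Set a frame's flag to 1 once the operation is observed during display."""
        if operation in OPERATIONS:
            self.flags[frame_idx][operation] = 1
```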
When a video summary generation instruction is received, the summary is formed from the interaction information of each video frame gathered during display together with the key-frame processing results, so it weighs both the processing results and the user's attention to each frame, improving the summary's usefulness. Target video frames for generating the summary are determined among the frames of the video to be processed, and the summary is generated from them. The target frames may be determined among the key frames alone, or among key frames and non-key frames; the non-key frames considered may be all of those in the video, or a subset obtained by random sampling or by a preset rule, their number either preset or derived from the number of key frames, for example n times it, without limitation. The frames within this screening range are the summary candidate frames; that is, the candidates comprise the key frames, or the key frames together with non-key frames.
In some embodiments, determining the target frames for generating the summary based on the key-frame processing results and the interaction information comprises: determining each frame's user attention from its interaction information, determining each frame's importance index from its user attention and processing result, and determining the target frames from the importance indices. Weight information is preset for each kind of interaction information, and weighting the interaction values accordingly yields the user attention. The processing result of a frame is converted into a processing value: the recognition probability may serve directly; a classification result may be mapped to a numeric code; the presence or absence of a region of interest may be encoded as 1 or 0; and a frame without any processing result may be given a sentinel value such as -1. User attention and processing value each carry their own weights, and the importance index is generated from their weighted combination. All frames are then sorted by importance index and the target frames determined from the ranking. Optionally, the number of target frames is fixed, for example 10 frames or 20% of the video's total frame count; the first n frames of the importance ranking become the target frames, n being that number, and the summary is generated from them.
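A sketch of the weighted combination just described; the weight values are illustrative, while the sentinel -1 for frames without a processing result follows the text.

```python
def user_attention(interactions, weights):
    """Weighted sum of a frame's interaction values, e.g.
    interactions = {"dwell_ratio": 0.2, "visits": 3, "zoom": 1}."""
    return sum(weights.get(name, 0.0) * value for name, value in interactions.items())

def importance_index(attention, processing_value=-1.0,
                     w_attention=0.5, w_processing=0.5):
    """Combine user attention with the numeric processing value;
    -1 marks a frame that has no processing result."""
    return w_attention * attention + w_processing * processing_value

def select_targets(frame_scores, n):
    """Rank frames by importance index and keep the first n as target frames."""
    ranked = sorted(frame_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [idx for idx, _ in ranked[:n]]
```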
In some embodiments, determining the target video frames based on the key-frame processing results and the interaction information comprises: generating video feature information for each summary candidate frame from the candidate's image features and its corresponding interaction information; then feeding each candidate's video feature information into a video frame screening model to obtain its importance index, and determining the target frames among the candidates based on the importance indices.
In this embodiment, fusing a candidate frame's image features with its interaction information into its video feature information makes the feature information more comprehensive and thereby improves the accuracy with which the target frames are determined.
Optionally, generating the video feature information comprises: extracting each candidate frame's image features with a preset feature extraction model; converting each candidate's interaction information into corresponding numeric values; and, for each candidate, forming a feature vector from its image features and those values to serve as its video feature information.
In some embodiments, the feature extraction model may be a convolutional neural network. Each summary candidate frame is fed into it to obtain the frame's image features, which may be a vector or a matrix. The interaction information of each candidate is converted into values whose type may differ between kinds of information: operation-dimension information may be 1 or 0 (1 when the preset operation occurred, 0 otherwise), while time-dimension information may be the monitored duration or ratio. The values of a frame's interaction information may be assembled into a feature vector whose entries follow a preset order, and concatenating the image features with this interaction vector yields the feature vector of the video feature information. In some embodiments, the feature vector also includes a value for the processing result, which may be a recognition probability, a numeric matrix for a segmentation result (for example, region-of-interest pixels as 1 and the rest as 0), a numeric class identifier for a classification result, and so on.
Referring to fig. 2, a flow chart of target video frame determination according to an embodiment of the invention, the feature vector there comprises the image features, the frame's position identifier, a key frame flag, the processing result (prediction probability), the value of the zoom operation, the dwell-time ratio, and the repeat-access count. The prediction probabilities of non-key frames are all set to a first value, for example 0.001, and the dwell ratio of frames never dwelt on to a second value, for example 0.01. Concatenating these features yields the feature vector.
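A sketch of the per-frame feature assembly shown in fig. 2, using the sentinel values named above; the argument names are assumptions.

```python
import numpy as np

def build_feature_vector(image_feat, frame_pos, is_keyframe, pred_prob,
                         zoomed, dwell_ratio, revisit_count,
                         default_prob=0.001, default_dwell=0.01):
    """Concatenate image features with the per-frame scalars of fig. 2.
    Non-key frames receive the sentinel prediction probability; frames
    never dwelt on receive the sentinel dwell ratio."""
    if not is_keyframe:
        pred_prob = default_prob
    if dwell_ratio == 0.0:
        dwell_ratio = default_dwell
    extras = np.array([frame_pos, float(is_keyframe), pred_prob,
                       float(zoomed), dwell_ratio, revisit_count],
                      dtype=np.float32)
    # ravel() lets matrix-shaped image features concatenate as well
    return np.concatenate([np.asarray(image_feat, dtype=np.float32).ravel(), extras])
```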
The feature vectors are fed into the video frame screening model to obtain each frame's importance index. The screening model comprises a Transformer network and a multi-layer perceptron network. The Transformer network may contain several encoding modules connected in sequence that perform feature extraction on the input feature vectors, and the multi-layer perceptron processes the extracted features into the importance index of each frame. Because each encoding module extracts a frame's own features together with those of its neighboring frames, the model avoids selecting several similar frames as target frames at the same time.
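A minimal PyTorch sketch of a screening model of this shape, a Transformer encoder followed by a multi-layer perceptron; the layer sizes and head counts are assumptions, not a configuration taken from this embodiment.

```python
import torch
import torch.nn as nn

class FrameScreeningModel(nn.Module):
    """Transformer encoder over per-frame feature vectors, then an MLP
    that emits one importance index per frame."""
    def __init__(self, feat_dim, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, x):                 # x: (batch, n_frames, feat_dim)
        h = self.encoder(self.proj(x))    # self-attention sees neighboring frames
        return self.mlp(h).squeeze(-1)    # (batch, n_frames) importance indices
```

Because self-attention mixes information across frames, near-duplicate frames can suppress one another's scores, which is what discourages selecting similar frames simultaneously.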
Generating the video summary from the target frames may mean forming it from one or more of the target frame images and the processing results of the target frames. With a summary formed, the user can browse it instead of the whole video to be processed, which simplifies analysis of the video, reduces its distracting information, and makes its analysis more targeted and effective.
On the basis of the above embodiment, generating the summary of the video to be processed from the target frames comprises: sorting the target frames by their importance indices, and generating the summary from the sorted frames and/or their processing results. Sorting the target frames, and with them the summary information they contribute, lets the user browse the summary in order of importance.
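A sketch of ordering the target frames by importance and pairing each with its processing result; the report structure here is an assumption.

```python
def generate_summary(target_frames, results, scores):
    """Return summary entries ordered by importance index, descending."""
    ordered = sorted(target_frames, key=lambda idx: scores[idx], reverse=True)
    return [{"frame": idx,
             "importance": scores[idx],
             "result": results.get(idx)}   # processing result, if the frame has one
            for idx in ordered]
```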
On the basis of the above embodiment, the key-frame processing results are displayed through an interactive interface that also carries several processing controls, including, without limitation, a control for processing non-key frames and a control for generating the video summary. When the non-key-frame processing control is selected, the preset processing is applied to non-key frames of the video to be processed. Optionally, the control allows selecting which non-key frames to process: all of them, or frames picked by the user, for example by selecting frame identifiers or timestamps in the index, the selected non-key frames becoming the frames to process. This makes the processing more targeted and avoids the time and computation wasted on processing every frame, while processing non-key frames also compensates for any omissions in the key frame extraction.
When the summary generation control is selected, the video summary is generated from the processing results of the processed frames and the interaction information, the processed frames being the processed key frames alone, or the processed key frames together with the processed non-key frames.
Correspondingly, the method of this embodiment further comprises: during display, receiving a processing instruction for non-key frames of the video to be processed, applying the preset processing to the non-key frames named by the instruction in response, and displaying the results. Letting the user decide, after the key frames are processed, whether to process the non-key frames avoids both the long delay of processing all frames directly and the missing results of processing only the key frames; the video is processed according to the user's needs, so different videos can be treated differently and user demands are met.
According to the technical scheme of this embodiment, the key frames of the video to be processed are extracted and given the preset processing, which reduces the processing time and the user's waiting time, and displaying the key-frame processing results lets results be presented sooner. During display, the interaction information of each video frame is recorded, and the video summary is formed from this interaction information together with the per-frame processing results, fusing the user's degree of attention to each frame with those results and improving the accuracy of the summary.
On the basis of the above embodiments, a preferred example is given with reference to fig. 3, a schematic diagram of a video summary generation flow according to an embodiment of the invention. The flow is as follows. Key frames are extracted from the input video data (the video to be processed of the above embodiments), which serves two ends: it optimizes task scheduling, improving the user experience and lightening the platform's computing load; and it supplies the frames used for model prediction, generating preliminary reference information for the user from which a video summary report can be output directly. The key frames may be extracted according to the difference fluctuation between successive frames of the video, at fixed intervals, or by a deep learning algorithm.
Key frame predictions (the processing results of the above embodiments) are obtained with an artificial intelligence prediction model (the processing model of the above embodiments) and returned to the user interaction page. The model's functions may include, without limitation, lesion detection, segmentation, quantification, and the like.
On the interaction page the user decides, based on the predicted key frame results: (1) whether to predict all video frames, the interaction ending if not; (2) whether to output the summary report, the interaction likewise ending if not. Operations on the interaction page are detected to determine whether all frames are to be predicted and whether the report is to be output. While the key frame predictions are displayed, the user's interaction information on the page is recorded. It comprises time-dimension features and per-frame multi-dimensional features (such as the operation-dimension information). For the time dimension, the moment the user clicks into certain data is the start point and the moment of the ending decision (generating the report or leaving the data interface) the end point; since at each instant the user can dwell on only one of the whole video's N frames, the record yields time-related features of length N, including the user's dwell time on a frame, the frame's dwell-time ratio, the number of times the frame was repeatedly accessed, and so on. The per-frame multi-dimensional features consist of several vectors of length N, each recording either whether a certain behavior occurred on each frame, for example whether it was locally zoomed or its contrast adjusted, or a result of the preceding algorithm or model, for example whether the frame is a key frame and its prediction result.
When an operation instruction to predict all video frames is detected, predictions for the whole video are obtained with the artificial intelligence prediction model, returned to the interaction page for the user to inspect, and the user's interactions on the page are recorded. The user may then again decide whether to output the summary report, the interaction ending if not. Requesting prediction of all frames indicates the user judged the preliminary key frames insufficient to summarize the video's information; this can be treated as prior information for constructing new features, or applied as an extra operation on certain frames (such as key frames) in the frame ranking, for example down-weighting them.
When an operation instruction to output the summary report is detected, features are constructed from the input video, the key frames, the key frame predictions, and the recorded user interaction information; the importance index of each frame is predicted by a multi-modal deep learning model built on the Transformer structure (see fig. 2); and the video summary report is generated from the predictions and returned to the user.
Fig. 4 is a schematic structural diagram of a video summary generating apparatus according to an embodiment of the present invention, where the apparatus includes:
A key frame extraction module 210, configured to obtain a video to be processed, and extract a key frame in the video to be processed;
the key frame processing module 220 is configured to perform preset processing on each key frame to obtain a processing result of each key frame, and display the processing result of each key frame;
the interactive information acquisition module 230 is configured to acquire interactive information of a user on a key frame, a non-key frame and a processing result of the key frame in the video to be processed in a display process;
the video summary generating module 240 is configured to determine a target video frame for generating a video summary in the video to be processed based on the processing result of the key frame and the interaction information, and generate a video summary of the video to be processed based on the target video frame.
Optionally, the interaction information includes interaction information of a time dimension and interaction information of an operation dimension;
the interaction information collection module 230 is configured to:
recording, when a selection operation on any one of a key frame, a non-key frame, or a key-frame processing result in the video to be processed is detected, the display time information and display count information of that item, and recording, when a preset operation is detected during the display of any such item, the operation information corresponding to that preset operation.
Optionally, the video summary generating module 240 includes:
the video feature information generating unit is used for generating video feature information of each abstract candidate video frame based on image features of the abstract candidate video frame in the video to be processed and interaction information corresponding to each abstract candidate video frame;
and the target video frame determining unit is used for inputting the video characteristic information of each abstract candidate video frame into a video frame screening model to obtain an importance index of each abstract candidate video frame, and determining the target video frame in the abstract candidate video frame based on the importance index.
Optionally, the summary candidate video frame in the video to be processed includes a key frame, or the summary candidate video frame in the video to be processed includes a key frame and a non-key frame;
the video characteristic information generating unit is used for:
respectively extracting image features of each abstract candidate video frame based on a preset feature extraction model;
converting the interaction information corresponding to each abstract candidate video frame into corresponding numerical values;
and for each abstract candidate video frame, forming a feature vector serving as video feature information by using the image feature of the abstract candidate video frame and the numerical value corresponding to the interaction information.
Optionally, the video frame screening model includes a transformer network and a multi-layer perceptron network.
Optionally, the video summary generating module 240 is configured to:
sorting the target video frames based on an importance index of the target video frames; and generating a video abstract based on the ordered target video frames and/or the processing results of the target video frames.
Optionally, the key frame extraction module 210 is configured to:
extracting key frames from the video to be processed based on a preset interval frame number; or,
determining video differences of adjacent video frames in the video to be processed, and determining key frames among the adjacent video frames when the video differences meet preset conditions; or,
extracting key frames from the video to be processed based on a preset key frame extraction model.
Optionally, the apparatus further comprises:
and the non-key frame processing module is used for receiving a processing instruction of the non-key frames in the video to be processed in the display process, responding to the processing instruction, carrying out preset processing on the non-key frames corresponding to the processing instruction, and displaying a processing result.
The video abstract generating device provided by the embodiment of the invention can execute the video abstract generating method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the video abstract generating method.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, showing a block diagram of an electronic device 12 suitable for implementing embodiments of the invention. The electronic device 12 shown in fig. 5 is merely an example and should not limit the functionality or scope of use of embodiments of the invention. The device 12 is typically an electronic device that undertakes image classification functions.
As shown in fig. 5, the electronic device 12 is in the form of a general purpose computing device. Components of the electronic device 12 may include, but are not limited to: one or more processors 16, a memory device 28, and a bus 18 connecting the various system components, including the memory device 28 and the processors 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include industry standard architecture (Industry Standard Architecture, ISA) bus, micro channel architecture (Micro Channel Architecture, MCA) bus, enhanced ISA bus, video electronics standards association (Video Electronics Standards Association, VESA) local bus, and peripheral component interconnect (Peripheral Component Interconnect, PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The storage 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory, RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from and writing to a removable nonvolatile optical disk (e.g., a Compact Disc-Read Only Memory (CD-ROM), digital versatile Disc (Digital Video Disc-Read Only Memory, DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The storage device 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program 36 having a set (at least one) of program modules 26 may be stored, for example, in the storage device 28. Such program modules 26 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 26 generally carry out the functions and/or methods of the embodiments described herein.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a camera, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card or a modem) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. The electronic device 12 may also communicate with one or more networks (e.g., a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), and/or a public network such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 12, including, without limitation: microcode, device drivers, redundant processing units, external disk drive arrays, disk array (Redundant Arrays of Independent Disks, RAID) systems, tape drives, data backup storage systems, and the like.
The processor 16 executes various functional applications and data processing by running a program stored in the storage device 28, for example, implementing the video summary generating method provided by the above-described embodiment of the present invention.
The embodiment of the invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the video summary generating method as provided by the embodiment of the invention.
Of course, the computer-readable storage medium provided by the embodiments of the present invention, on which the computer program stored, is not limited to the method operations described above, but may also perform the video summary generating method provided by any of the embodiments of the present invention.
The computer storage medium of the embodiments of the present invention may take the form of any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable source code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
The source code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, and the like, or any suitable combination of the foregoing.
Computer source code for carrying out operations of the present invention may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The source code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, it is not limited to them; it may be embodied in many other equivalent forms without departing from the spirit of the invention, the scope of which is defined by the appended claims.

Claims (10)

1. A video summary generation method, comprising:
acquiring a video to be processed, and extracting key frames in the video to be processed;
performing preset processing on each key frame to obtain a processing result of each key frame, displaying the processing result of each key frame, and acquiring, during the display process, interaction information of a user on key frames, non-key frames, and the processing results of the key frames in the video to be processed; wherein the interaction information comprises interaction information of a time dimension and interaction information of an operation dimension, and the interaction information of the operation dimension comprises a scaling operation, a parameter adjustment of a video frame, a labeling operation, and a sharing operation;
determining a target video frame for generating a video summary in the video to be processed based on the processing result of the key frame and the interaction information, and generating the video summary of the video to be processed based on the target video frame;
wherein acquiring, during the display process, the interaction information of the user on the key frames, the non-key frames, and the processing results of the key frames in the video to be processed comprises:
for any one of a key frame, a non-key frame, and a processing result of a key frame in the video to be processed, recording display time information and display count information when a selection operation on that item is detected, and recording, when a preset operation is detected during the display of that item, operation information corresponding to the preset operation.
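For illustration only (this sketch is not part of the claims), the recording step above might be realized as the Python logger below. All names here (FrameInteraction, InteractionLogger, the operation labels) are hypothetical, and the record layout is an assumption rather than anything prescribed by the patent.

```python
import time
from dataclasses import dataclass, field

@dataclass
class FrameInteraction:
    """Interaction record for one displayed item: a key frame, a non-key
    frame, or a key frame's processing result."""
    display_count: int = 0        # number of times the item was selected for display
    display_seconds: float = 0.0  # cumulative on-screen time (time dimension)
    operations: list = field(default_factory=list)  # operation dimension, e.g. "zoom"

class InteractionLogger:
    """Records the interaction information described in claim 1."""
    def __init__(self):
        self.records = {}    # item id -> FrameInteraction
        self._shown_at = {}  # item id -> timestamp when display began

    def on_select(self, item_id):
        # A selection operation was detected: count the display and start timing.
        rec = self.records.setdefault(item_id, FrameInteraction())
        rec.display_count += 1
        self._shown_at[item_id] = time.monotonic()

    def on_hide(self, item_id):
        # The item left the screen: accumulate its display time.
        start = self._shown_at.pop(item_id, None)
        if start is not None:
            self.records[item_id].display_seconds += time.monotonic() - start

    def on_operation(self, item_id, op_name):
        # A preset operation ("zoom", "adjust", "annotate", "share") was detected.
        self.records.setdefault(item_id, FrameInteraction()).operations.append(op_name)
```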
2. The method according to claim 1, wherein determining a target video frame for generating a video summary in the video to be processed based on the processing result of the key frame and the interaction information comprises:
generating video feature information of each summary candidate video frame based on the image features of the summary candidate video frames in the video to be processed and the interaction information corresponding to each summary candidate video frame;
and inputting the video feature information of each summary candidate video frame into a video frame screening model to obtain an importance index of each summary candidate video frame, and determining the target video frame among the summary candidate video frames based on the importance indexes.
3. The method of claim 2, wherein the summary candidate video frames in the video to be processed comprise key frames, or the summary candidate video frames in the video to be processed comprise key frames and non-key frames;
and wherein generating the video feature information of each summary candidate video frame based on the image features of the summary candidate video frames in the video to be processed and the interaction information corresponding to each summary candidate video frame comprises:
extracting the image features of each summary candidate video frame based on a preset feature extraction model;
converting the interaction information corresponding to each summary candidate video frame into corresponding numerical values;
and for each summary candidate video frame, forming a feature vector, serving as its video feature information, from the image feature of the summary candidate video frame and the numerical values corresponding to its interaction information.
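As an illustrative reading of claim 3 (not part of the claims), the sketch below concatenates a candidate frame's image feature, e.g. a pooled CNN embedding, with numeric values derived from its interaction information. The OP_VALUES weighting scheme is an assumption, since the claim only requires conversion to corresponding numerical values.

```python
import numpy as np

# Hypothetical weights for the operation-dimension interactions; the claim
# only requires that interaction information map to numerical values.
OP_VALUES = {"zoom": 1.0, "adjust": 1.0, "annotate": 2.0, "share": 3.0}

def frame_feature_vector(image_feature, display_count, display_seconds, operations):
    """Form the video feature information of one summary candidate frame:
    its image feature concatenated with numeric interaction values."""
    interaction_values = np.array(
        [display_count,                                      # how often it was shown
         display_seconds,                                    # how long it was shown
         sum(OP_VALUES.get(op, 0.0) for op in operations)],  # weighted operations
        dtype=np.float32,
    )
    return np.concatenate([np.asarray(image_feature, dtype=np.float32),
                           interaction_values])
```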
4. The method of claim 3, wherein the video frame screening model comprises a transformer network and a multi-layer perceptron network.
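Claim 4 names a transformer network followed by a multi-layer perceptron. A minimal PyTorch sketch of such a screening model follows; the input projection, layer sizes, and head shape are assumptions, as the patent does not fix the architecture in this detail.

```python
import torch
import torch.nn as nn

class FrameScreeningModel(nn.Module):
    """Transformer encoder over the sequence of candidate-frame feature
    vectors, followed by an MLP head that outputs one importance index
    per frame."""
    def __init__(self, feature_dim, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feature_dim, d_model)  # lift features to d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Linear(d_model // 2, 1),
        )

    def forward(self, features):
        # features: (batch, num_candidates, feature_dim)
        hidden = self.encoder(self.proj(features))
        return self.mlp(hidden).squeeze(-1)  # (batch, num_candidates) importance indexes
```

For example, FrameScreeningModel(515)(torch.randn(1, 120, 515)) scores 120 candidate frames whose vectors pair a 512-dimensional image feature with the three interaction values from the previous sketch; target frames can then be taken as the top-scoring candidates, per claim 2.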
5. The method of claim 1, wherein generating the video summary of the video to be processed based on the target video frame comprises:
sorting the target video frames based on their importance indexes;
and generating the video summary based on the ordered target video frames and/or the processing results of the target video frames.
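A short sketch of claim 5's ordering-and-assembly step, under the assumption that the summary is written out as a clip with OpenCV; build_summary, fps, and top_k are illustrative names and defaults, not values from the patent.

```python
import cv2  # OpenCV; frames are assumed to be equally sized BGR numpy arrays

def build_summary(target_frames, importance, out_path, fps=5, top_k=None):
    """Sort the target video frames by descending importance index and write
    the ordered sequence out as the video summary (claim 5)."""
    order = sorted(range(len(target_frames)), key=lambda i: importance[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # optionally keep only the best frames
    ordered = [target_frames[i] for i in order]
    height, width = ordered[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    for frame in ordered:
        writer.write(frame)
    writer.release()
    return ordered
```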
6. The method of claim 1, wherein extracting the key frames in the video to be processed comprises:
extracting key frames from the video to be processed based on a preset interval frame number; or
determining video differences of adjacent video frames in the video to be processed, and determining key frames among the adjacent video frames when the video differences meet preset conditions; or
extracting key frames from the video to be processed based on a preset key frame extraction model.
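The first two extraction variants of claim 6 (fixed frame interval, and inter-frame difference against a threshold) can be sketched as follows; the model-based third variant is omitted, and interval and diff_threshold are illustrative defaults, not values from the patent.

```python
import cv2
import numpy as np

def extract_key_frames(video_path, interval=30, diff_threshold=None):
    """Extract key frames either every `interval`-th frame (default) or, when
    `diff_threshold` is given, whenever the mean absolute grayscale difference
    from the previous frame exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if diff_threshold is None:
            if idx % interval == 0:
                key_frames.append((idx, frame))
        else:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None and \
                    np.abs(gray.astype(np.int16) - prev_gray).mean() > diff_threshold:
                key_frames.append((idx, frame))
            prev_gray = gray
        idx += 1
    cap.release()
    return key_frames  # list of (frame index, frame) pairs
```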
7. The method according to claim 1, wherein the method further comprises:
during the display process, receiving a processing instruction for a non-key frame in the video to be processed, performing, in response to the processing instruction, the preset processing on the non-key frame corresponding to the processing instruction, and displaying the processing result.
8. A video summary generation apparatus, comprising:
the key frame extraction module is used for acquiring a video to be processed and extracting key frames in the video to be processed;
the key frame processing module is used for carrying out preset processing on each key frame to obtain a processing result of each key frame, and displaying the processing result of each key frame;
the interaction information acquisition module is used for acquiring, during the display process, interaction information of a user on key frames, non-key frames, and the processing results of the key frames in the video to be processed; wherein the interaction information comprises interaction information of a time dimension and interaction information of an operation dimension, and the interaction information of the operation dimension comprises a scaling operation, a parameter adjustment of a video frame, a labeling operation, and a sharing operation;
the video summary generating module is used for determining a target video frame for generating a video summary in the video to be processed based on the processing result of the key frame and the interaction information, and generating the video summary of the video to be processed based on the target video frame;
and the interaction information acquisition module is further configured to record, for any one of a key frame, a non-key frame, and a processing result of a key frame in the video to be processed, display time information and display count information when a selection operation on that item is detected, and to record, when a preset operation is detected during the display of that item, operation information corresponding to the preset operation.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the video summary generation method of any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the video summary generation method of any one of claims 1-7.
CN202111531817.8A 2021-12-14 2021-12-14 Video abstract generation method and device, storage medium and electronic equipment Active CN114245232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111531817.8A CN114245232B (en) 2021-12-14 2021-12-14 Video abstract generation method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN114245232A CN114245232A (en) 2022-03-25
CN114245232B true CN114245232B (en) 2023-10-31

Family

ID=80756242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111531817.8A Active CN114245232B (en) 2021-12-14 2021-12-14 Video abstract generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114245232B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114786052A (en) * 2022-04-29 2022-07-22 同方知网数字出版技术股份有限公司 Academic live video fast browsing method based on key frame extraction

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108882057A (en) * 2017-05-09 2018-11-23 北京小度互娱科技有限公司 Video abstraction generating method and device
CN108966042A (en) * 2018-09-10 2018-12-07 合肥工业大学 A kind of video abstraction generating method and device based on shortest path
WO2019085941A1 (en) * 2017-10-31 2019-05-09 腾讯科技(深圳)有限公司 Key frame extraction method and apparatus, and storage medium
CN109996091A (en) * 2019-03-28 2019-07-09 苏州八叉树智能科技有限公司 Generate method, apparatus, electronic equipment and the computer readable storage medium of video cover
CN110381392A (en) * 2019-06-06 2019-10-25 五邑大学 A kind of video abstraction extraction method and its system, device, storage medium
CN110505534A (en) * 2019-08-26 2019-11-26 腾讯科技(深圳)有限公司 Monitor video processing method, device and storage medium
CN110933520A (en) * 2019-12-10 2020-03-27 中国科学院软件研究所 Monitoring video display method based on spiral abstract and storage medium
CN111160191A (en) * 2019-12-23 2020-05-15 腾讯科技(深圳)有限公司 Video key frame extraction method and device and storage medium
CN113727200A (en) * 2021-08-27 2021-11-30 游艺星际(北京)科技有限公司 Video abstract information determination method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105100688B (en) * 2014-05-12 2019-08-20 索尼公司 Image processing method, image processing apparatus and monitoring system
US10595086B2 (en) * 2015-06-10 2020-03-17 International Business Machines Corporation Selection and display of differentiating key frames for similar videos
WO2017142143A1 (en) * 2016-02-19 2017-08-24 Samsung Electronics Co., Ltd. Method and apparatus for providing summary information of a video

Also Published As

Publication number Publication date
CN114245232A (en) 2022-03-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant