CN116916060A - Video processing method and related equipment

Video processing method and related equipment

Info

Publication number
CN116916060A
CN116916060A
Authority
CN
China
Prior art keywords
video
scene
text
fragment
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311022128.3A
Other languages
Chinese (zh)
Inventor
裴森
常仲翰
陈奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202311022128.3A priority Critical patent/CN116916060A/en
Publication of CN116916060A publication Critical patent/CN116916060A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23412 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides a video processing method and related equipment. The method includes: acquiring a video to be processed; dividing the video to be processed into a plurality of scene video segments based on scene switching positions; determining a text description corresponding to each scene video segment; and removing scene video segments with repeated semantics from the video to be processed based on the semantic relevance between the text descriptions, to obtain a target video.

Description

Video processing method and related equipment
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a video processing method, apparatus, device, medium, and program product.
Background
In video creation, high-quality, long footage occupies a large amount of storage space, and importing such footage into video editing software consumes considerable computing resources. This places demands on the storage space and computing capacity of the editing device, increases the user's editing workload, and reduces video processing efficiency.
Disclosure of Invention
The present disclosure proposes a video processing method, apparatus, device, storage medium, and program product to address, at least to some extent, the technical problem of inefficient video editing.
In a first aspect of the present disclosure, a video processing method is provided, including:
acquiring a video to be processed;
dividing the video to be processed into a plurality of scene video segments based on scene switching positions;
determining a text description corresponding to each scene video segment;
and removing scene video segments with repeated semantics from the video to be processed based on the semantic relevance between the text descriptions, to obtain a target video.
In a second aspect of the present disclosure, there is provided a video processing apparatus, including:
an acquisition module configured to acquire a video to be processed;
a segmentation module configured to divide the video to be processed into a plurality of scene video segments based on scene switching positions;
a text module configured to determine a text description corresponding to each scene video segment;
and a de-duplication module configured to remove scene video segments with repeated semantics from the video to be processed based on the semantic relevance between the text descriptions, to obtain a target video.
In a third aspect of the present disclosure, an electronic device is provided, including one or more processors and a memory; and one or more programs, wherein the one or more programs are stored in the memory and executed by the one or more processors, the programs including instructions for performing the method according to the first or second aspect.
In a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium containing a computer program which, when executed by one or more processors, causes the processors to perform the method of the first or second aspect.
In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
As can be seen from the foregoing, the video processing method and related equipment provided by the present disclosure split a video to be processed into a plurality of scene video segments based on scenes and remove the scene video segments with repeated semantics based on the semantic relevance between the segments. This reduces the occurrence of repeated pictures or content while still reflecting the complete content of the video to be processed, which not only saves storage and computing resources and improves the efficiency of video editing, but also takes the presentation effect of the edited video into account.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram of a video processing architecture according to an embodiment of the disclosure.
Fig. 2 is a schematic hardware architecture diagram of an exemplary electronic device according to an embodiment of the disclosure.
Fig. 3 is a schematic flow chart of a video processing method of an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a video processing method according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of a video processing apparatus according to an embodiment of the disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present disclosure have the ordinary meaning understood by those of ordinary skill in the art to which the present disclosure pertains. The terms "first," "second," and the like used in the embodiments of the present disclosure do not denote any order, quantity, or importance, but are merely used to distinguish one element from another. The word "comprising," "comprises," or the like means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected," "coupled," and the like are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," and the like are used merely to indicate relative positional relationships, which may change when the absolute position of the described object changes.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed of the type, usage range, usage scenarios, and the like of the personal information involved in the present disclosure, and the user's authorization should be obtained, in an appropriate manner in accordance with the relevant laws and regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation he or she requests to perform will require obtaining and using the user's personal information. The user can thus, according to the prompt information, autonomously choose whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user in the form of, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control with which the user chooses to "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
Fig. 1 shows a schematic diagram of a video processing architecture of an embodiment of the present disclosure. Referring to fig. 1, the video processing architecture 100 may include a server 110, a terminal 120, and a network 130 providing a communication link. The server 110 and the terminal 120 may be connected through a wired or wireless network 130. The server 110 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, security services, CDNs, and the like.
The terminal 120 may be implemented in hardware or in software. For example, when the terminal 120 is implemented in hardware, it may be any of various electronic devices having a display screen and supporting page display, including but not limited to smartphones, tablets, e-book readers, laptop computers, and desktop computers. When the terminal 120 is implemented in software, it may be installed in any of the electronic devices listed above, and may be implemented as a plurality of pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module, which is not limited herein.
It should be noted that the video processing method provided in the embodiments of the present disclosure may be executed by the terminal 120 or by the server 110. It should also be understood that the numbers of terminals, networks, and servers in fig. 1 are merely illustrative; there may be any number of terminals, networks, and servers, as required by the implementation.
Fig. 2 shows a schematic hardware structure of an exemplary electronic device 200 provided by an embodiment of the present disclosure. As shown in fig. 2, the electronic device 200 may include: a processor 202, a memory 204, a network module 206, a peripheral interface 208, and a bus 210. The processor 202, the memory 204, the network module 206, and the peripheral interface 208 are communicatively connected to one another within the electronic device 200 via the bus 210.
The processor 202 may be a central processing unit (CPU), a video processor, a neural network processing unit (NPU), a microcontroller unit (MCU), a programmable logic device, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or one or more integrated circuits. The processor 202 may be used to perform functions related to the techniques described in this disclosure. In some embodiments, the processor 202 may also include multiple processors integrated as a single logic component. For example, as shown in fig. 2, the processor 202 may include a plurality of processors 202a, 202b, and 202c.
The memory 204 may be configured to store data (e.g., instructions, computer code, etc.). As shown in fig. 2, the data stored by the memory 204 may include program instructions (e.g., program instructions for implementing the video processing method of the embodiments of the present disclosure) as well as data to be processed (e.g., the memory may store configuration files of other modules, etc.). The processor 202 may access the program instructions and data stored in the memory 204 and execute the program instructions to operate on the data to be processed. The memory 204 may include volatile or non-volatile storage. In some embodiments, the memory 204 may include random access memory (RAM), read-only memory (ROM), optical discs, magnetic disks, hard disks, solid-state drives (SSD), flash memory, memory sticks, and the like.
The network module 206 may be configured to provide the electronic device 200 with communication with other external devices via a network. The network may be any wired or wireless network capable of transmitting and receiving data. For example, the network may be a wired network, a local wireless network (e.g., Bluetooth, Wi-Fi, near-field communication (NFC), etc.), a cellular network, the Internet, or a combination of the foregoing. It will be appreciated that the type of network is not limited to the specific examples above. In some embodiments, the network module 206 may include any combination of any number of network interface controllers (NICs), radio-frequency modules, receivers, modems, routers, gateways, adapters, cellular network chips, and the like.
Peripheral interface 208 may be configured to connect electronic device 200 with one or more peripheral devices to enable information input and output. For example, the peripheral devices may include input devices such as keyboards, mice, touchpads, touch screens, microphones, various types of sensors, and output devices such as displays, speakers, vibrators, and indicators.
The bus 210 may be configured to transfer information between the various components of the electronic device 200 (e.g., the processor 202, the memory 204, the network module 206, and the peripheral interface 208), and may be, for example, an internal bus (e.g., a processor-memory bus) or an external bus (e.g., a USB port or PCI-E bus).
It should be noted that, although the architecture of the electronic device 200 described above only shows the processor 202, the memory 204, the network module 206, the peripheral interface 208, and the bus 210, in a specific implementation, the architecture of the electronic device 200 may also include other components necessary to achieve normal operation. Furthermore, those skilled in the art will appreciate that the architecture of the electronic device 200 may also include only the components necessary to implement the embodiments of the present disclosure, and not all of the components shown in the figures.
The development of shooting devices allows people to capture video material anytime and anywhere, and the development of mass storage devices allows the duration of the captured material to grow longer and longer. Such high-quality, storage-heavy, long-duration material clearly creates significant obstacles and difficulties for video creators when editing, and importing the material into video editing software also requires more computing resources, which places higher requirements and challenges on the editing device. Therefore, how to save the storage and computing resources involved in video editing, improve video editing efficiency, and at the same time take the presentation effect of the edited video into account is a technical problem to be solved.
In view of this, embodiments of the present disclosure provide a video processing method and related equipment. A video to be processed is split into a plurality of scene video segments based on scenes, and the scene video segments with repeated semantics are removed based on the semantic relevance between the scene video segments. This reduces the occurrence of repeated pictures or content while ensuring that the complete content of the video to be processed is still reflected, thereby saving storage and computing resources, improving video editing efficiency, and also taking the presentation effect of the edited video into account.
Specifically, the transition positions of the video to be processed may be detected to determine one or more scene switching positions, and the video to be processed is cut at these positions into a plurality of different scene video segments video_scene_i, i = 1, 2, ..., n, where n is a positive integer. A text description text_scene_i (for example, a summary corresponding to the scene video segment) may be determined for each scene video segment video_scene_i, and feature extraction may be performed on the text description text_scene_i to obtain the corresponding text feature feature_scene_i. Scene video segments with repeated semantics may then be removed from the video to be processed based on the semantic relevance between the text features feature_scene_i (i.e., the semantic relevance between the corresponding scene video segments) to obtain the final target video. In this way, repeated pictures or repeated content can be reduced, storage and computing resources can be saved, and the efficiency of video editing can be improved.
Referring to fig. 3, fig. 3 shows a schematic flowchart of a video processing method according to an embodiment of the present disclosure. In fig. 3, the video processing method 300 may include the following steps.
In step S310, a video to be processed is acquired.
The video to be processed may include video frames of different scenes, and adjacent scenes may be connected by transition effects between the two scenes, such as fade-in/fade-out, hard cuts, rotation transforms, or stretch transforms. The video to be processed may be uploaded locally or acquired over a network.
In some embodiments, the method 300 may further include: performing frame extraction on the video to be processed at a preset sampling frequency to obtain a video frame sequence to be processed. The sampling frequency may be expressed as a frame rate FPS (frames per second), where the frame rate FPS denotes the number of images contained in one second of video data; for example, if one second of video data contains 30 images, the FPS of the video data is 30. The frame rate of video data is typically between 30 and 60. The sampling frequency, expressed as a sampling frame rate FPS, denotes the number of images sampled per second. Specifically, if the sampling frequency is a frame rate FPS = a, then a frames are extracted from the video to be processed every second. For example, for video data with a frame rate of 30, each second contains 30 frames of images; sampling at a sampling frequency of FPS = 2 means that 2 frames are uniformly sampled from every 30 frames. Because the frame rate of video data is generally high and video processing is time-consuming, the video data can be down-sampled in this way, which reduces computational complexity and improves video processing efficiency and response speed.
In some embodiments, the video frame sequence to be processed can be used in place of the video to be processed for subsequent processing, which reduces the amount of data to be processed, saves computing overhead, and improves video processing efficiency. Specifically, for a video to be processed with a duration of b seconds, frames may be extracted at the frame rate a, i.e., a frames are extracted per second in chronological order, to obtain a video frame sequence X_a_b containing a × b frames. In the subsequent processing of the embodiments of the present disclosure, this video frame sequence may serve as the data basis, in place of the video to be processed, for removing semantically repeated segments.
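As a concrete illustration of the frame-extraction step above, the following Python sketch samples frames at a preset frequency. It is only one possible implementation (the disclosure does not prescribe one), and the use of OpenCV as well as the names extract_frames and sample_fps are assumptions made for illustration.

```python
# Illustrative sketch only; the disclosure does not specify an implementation.
# OpenCV (cv2) and the names extract_frames / sample_fps are assumptions.
import cv2

def extract_frames(video_path: str, sample_fps: float = 2.0):
    """Uniformly sample frames from a video at `sample_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0       # fall back if metadata is missing
    step = max(int(round(native_fps / sample_fps)), 1)    # e.g. 30 fps / 2 fps -> every 15th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                          # H x W x 3 BGR array
        index += 1
    cap.release()
    return frames                                         # the video frame sequence X_a_b
```

For a b-second video sampled at a frames per second this yields roughly a × b frames, which then stand in for the full video in the subsequent steps.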
In step S320, the video to be processed is divided into a plurality of scene video segments based on the scene switching positions.
Here, a scene switching position may refer to a transition position in the video. Each scene video segment corresponds to a single scene; the pictures within a scene video segment have no jitter, and the video content remains substantially unchanged.
In some embodiments, dividing the video to be processed into a plurality of scene video segments based on scene switching positions includes:
determining a scene switching probability for the video frames in the video to be processed;
determining a video frame whose scene switching probability is greater than or equal to a switching threshold as a scene switching position;
and cutting the video to be processed at the scene switching positions to obtain the plurality of scene video segments.
Specifically, the video to be processed may be input into a transition detection network to determine the scene switching positions, and the video is truncated at those positions to obtain the plurality of scene video segments. For example, referring to fig. 4, fig. 4 shows a schematic diagram of a video processing method according to an embodiment of the present disclosure. In fig. 4, frames are extracted from a video to be processed 410 to obtain a video frame sequence to be processed, and the m (m is a positive integer) video frames in the sequence are input in order into a transition detection network 420 to obtain n scene video segments video_scene_1, ..., video_scene_n. The transition detection network 420 performs transition detection on the m video frames; for example, the transition score is s_k = f(x_k), where x_k is the k-th input video frame and f is the transition scoring function. From this, the transition scores S = [s_1, s_2, s_3, ..., s_m] of the m video frames are obtained, representing the probability that each video frame is a transition. A switching threshold threshold1 may be set, and any video frame whose score in S = [s_1, s_2, s_3, ..., s_m] is greater than or equal to threshold1 is a scene switching position; for example, if s_20 >= threshold1, the video frame frame_20 corresponding to s_20 is a scene switching position. The video frame sequence to be processed may then be truncated at the scene switching positions to obtain the plurality of scene video segments video_scene_1, ..., video_scene_n.
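A minimal sketch of the thresholding and cutting logic just described is given below, assuming the per-frame transition scores s_1, ..., s_m have already been produced by a transition detection network f (not reproduced here); split_by_transitions and threshold1 are illustrative names, not part of the disclosure.

```python
# Sketch: cut a frame sequence at detected scene-switch positions.
# `scores[k]` is assumed to be f(frames[k]) from a trained transition detection network.
from typing import List, Sequence

def split_by_transitions(frames: Sequence, scores: Sequence[float],
                         threshold1: float = 0.5) -> List[list]:
    """Split `frames` into scene segments wherever the transition score reaches threshold1."""
    segments, current = [], []
    for frame, score in zip(frames, scores):
        if score >= threshold1 and current:
            segments.append(current)    # close the previous scene at the switch position
            current = []
        current.append(frame)
    if current:
        segments.append(current)
    return segments                      # video_scene_1, ..., video_scene_n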
In some embodiments, the transition detection network may be obtained by training an initial neural network on transition detection training data. Further, the transition detection training data may include training images and corresponding ground-truth transition scores, where a higher ground-truth transition score indicates a greater probability that the training image is a transition. Specifically, the training images serve as input-layer data and the corresponding ground-truth transition scores serve as output-layer data for training the initial neural network. An estimated transition score of a training image is obtained from the network being trained, and the difference (or squared error) between the estimated transition score and the ground-truth transition score is computed to obtain a loss function. The weights of the initial neural network are adjusted based on the loss function so as to minimize it, yielding the trained transition detection network.
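The training procedure described above could be sketched as follows, assuming a PyTorch regression setup with a mean-squared-error loss between estimated and ground-truth transition scores; the model architecture and data loader are placeholders, since the disclosure does not specify them.

```python
# Hypothetical training sketch for the transition detection network; the actual
# architecture and data pipeline are not specified in the disclosure.
import torch
import torch.nn as nn

def train_transition_net(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                            # squared error between estimated and true scores
    model.train()
    for _ in range(epochs):
        for images, true_scores in loader:            # transition detection training data
            pred_scores = model(images).squeeze(-1)   # estimated transition scores
            loss = loss_fn(pred_scores, true_scores)
            optimizer.zero_grad()
            loss.backward()                           # adjust weights to minimize the loss
            optimizer.step()
    return model
```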
In step S330, a text description corresponding to each scene video segment is determined.
The text description corresponding to a scene video segment may refer to a representative text description, such as a scene video summary, that describes the main image content of the scene video segment.
In some embodiments, determining the text description corresponding to each scene video segment includes:
determining, based on the image content of each video frame in the scene video segment, a plurality of video frame text descriptions corresponding to the video frames;
performing feature extraction on the video frame text descriptions to obtain a plurality of frame text features;
for each frame text feature, calculating the frame text similarity between the frame text feature and the other frame text features;
and determining the video frame text description corresponding to the frame text feature with the maximum frame text similarity as the text description of the scene video segment.
For each scene video segment, the video frame text description having the highest similarity to the other video frame text descriptions may be selected from the video frame text descriptions corresponding to the video frames in the segment, and this description is then used as the video summary of the scene video segment. Specifically, a scene video segment is input into an image description network; if the network is denoted h and the output text description t, then t = h(x), where x is the input image. For example, if the input image x contains a dog and grass, the text description t may be "a dog playing on the lawn".
The image description network 430 performs image description on the video frames in each scene video segment and outputs the video frame text description corresponding to each video frame. Specifically, as shown in fig. 4, the text descriptions text_scene_1, ..., text_scene_n corresponding to the n scene video segments video_scene_1, ..., video_scene_n may be obtained. For example, the scene video segment video_scene_1 contains K1 video frames, where K1 is a positive integer, and the image description network 430 may output K1 video frame text descriptions corresponding to the K1 video frames. The K1 video frame text descriptions are passed through a text feature network (which converts input text information into fixed-length text features, for example feature vectors of length 512, i.e., floating-point vectors) to obtain K1 frame text features by feature extraction. For each frame text feature, the similarity between that frame text feature and the other frame text features is calculated, giving K1-1 similarities (e.g., cosine similarities), and the video frame text description corresponding to the frame text feature with the largest similarity is determined as the text description text_scene_1 of the scene video segment video_scene_1. Text descriptions of the other scene video segments can be obtained in the same way and are not described in detail here.
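The selection of a representative caption per scene segment might look like the following sketch, where caption() stands in for the image description network h and embed() for the text feature network; both are assumptions, not APIs defined by the disclosure. One reading of "the frame text feature with the maximum frame text similarity" is the caption whose average similarity to the other captions is highest, which is what the sketch implements.

```python
# Sketch: pick the most representative frame caption for one scene segment.
# caption(frame) -> str and embed(text) -> fixed-length vector (e.g. 512-d) are assumed.
import numpy as np

def representative_caption(frames, caption, embed) -> str:
    texts = [caption(f) for f in frames]                  # one text description per sampled frame
    if len(texts) == 1:
        return texts[0]
    feats = np.stack([np.asarray(embed(t), dtype=float) for t in texts])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T                                 # cosine similarity matrix
    np.fill_diagonal(sims, 0.0)                            # ignore self-similarity
    avg_sim = sims.sum(axis=1) / (len(texts) - 1)          # mean similarity to the K-1 other captions
    return texts[int(np.argmax(avg_sim))]                  # text_scene_i for this segment
```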
In step S340, the scene video segments with repeated semantics in the video to be processed are removed based on the semantic relevance between the text descriptions, to obtain a target video.
Here, semantic relevance may refer to the degree of similarity between features, such as image features or text features, typically measured by cosine similarity. The semantic relevance between the text descriptions corresponding to the scene video segments can reflect both the overall relevance between each scene video segment and the main content of the entire video to be processed and the segment relevance between scene video segments.
In some embodiments, removing the scene video segments with repeated semantics in the video to be processed based on the semantic relevance between the text descriptions to obtain a target video includes:
determining, based on the semantic relevance between the text descriptions, the overall relevance between each scene video segment and the video to be processed and the segment relevance between scene video segments;
determining retained segments and segments to be removed among the scene video segments based on the overall relevance and the segment relevance, wherein each segment to be removed has repeated semantics with at least one retained segment;
and removing the segments to be removed from the video to be processed to obtain the target video.
Whether repeated semantics exist between scene video segments can be accurately determined through the segment relevance, while the overall relevance ensures the content integrity of the entire video to be processed. Combining the two makes it possible to determine the semantically repeated segments to be removed while still presenting the complete content; removing these segments from the video to be processed yields the target video. In this way, repeated pictures or repeated content can be reduced, storage and computing resources can be saved, and the efficiency of video editing can be improved.
In some embodiments, determining retained segments and segments to be removed among the scene video segments based on the overall relevance and the segment relevance, each segment to be removed having repeated semantics with at least one retained segment, includes:
recording the scene video segment with the minimum overall relevance as a retained segment in a retained segment set, and recording the scene video segments whose overall relevance is not the minimum as segments to be removed in a to-be-removed segment set;
repeating the following steps until a preset condition is met:
calculating, based on the segment relevance, the semantic similarity between each segment to be removed in the to-be-removed segment set and the retained segment set;
when the minimum value among the semantic similarities is smaller than a similarity threshold, moving the segment to be removed corresponding to that minimum value into the retained segment set;
wherein the preset condition includes: the to-be-removed segment set is empty, and/or the semantic similarities between all segments to be removed in the to-be-removed segment set and the retained segment set are not smaller than the similarity threshold.
In some embodiments, determining the overall relevance between each scene video segment and the video to be processed and the segment relevance between scene video segments based on the semantic relevance between the text descriptions includes:
performing feature extraction on each text description to obtain a corresponding text feature;
for each text feature, calculating a plurality of text similarities between that text feature and the other text features to obtain the segment relevance between the scene video segments;
and obtaining the overall relevance between the scene video segment corresponding to each text feature and the video to be processed based on the average of the plurality of text similarities.
In some embodiments, calculating, based on the segment relevance, the semantic similarity between each segment to be removed in the to-be-removed segment set and the retained segment set includes:
determining the maximum value among the segment relevances between the segment to be removed and the retained segments in the retained segment set as the semantic similarity between the segment to be removed and the retained segment set.
Specifically, as shown in fig. 4, semantic de-duplication may be performed on the n scene video segments video_scene_1, ..., video_scene_n based on an iterative de-duplication strategy. For example, the text descriptions text_scene_1, ..., text_scene_n of the scene video segments output by the image description network 430 may be input into the text feature network 440. The text feature network 440 performs feature extraction on each text description text_scene_1, ..., text_scene_n to obtain the n corresponding text features feature_1, ..., feature_n.
For each text feature feature_i (i = 1, ..., n), the text similarities between feature_i and the other text features feature_j (j ≠ i) are calculated, yielding n-1 text similarities k_1, ..., k_(n-1). The average of these n-1 text similarities, (k_1 + ... + k_(n-1)) / (n-1), gives the overall relevance similarity_i between the scene video segment video_scene_i and the entire video frame sequence to be processed (or the video to be processed). Since there are n text features, n overall relevances are obtained. On this basis, the scene video segment video_scene_p corresponding to the minimum value similarity_p = min(similarity_i) of the n overall relevances is taken as the starting segment, i.e., segment K1, and placed into the retained segment set K = [K1]; the other scene video segments are denoted R1, ..., R(n-1) and placed into the to-be-removed set R = [R1, ..., R(n-1)].
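The computation of segment relevance and overall relevance, and the choice of the starting segment, can be sketched as below; embed() again stands in for the text feature network 440 and is an assumption, as is the function name overall_relevance.

```python
# Sketch of the overall-relevance computation; embed(text) -> fixed-length vector is assumed.
import numpy as np

def overall_relevance(scene_texts, embed):
    feats = np.stack([np.asarray(embed(t), dtype=float) for t in scene_texts])  # feature_1..feature_n
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                          # segment-to-segment relevance (cosine similarities)
    n = len(scene_texts)
    overall = (sim.sum(axis=1) - 1.0) / (n - 1)    # mean similarity to the other n-1 segments
    start = int(np.argmin(overall))                # segment with minimum overall relevance seeds K
    return sim, overall, start
```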
The final retained segments and segments to be removed may be determined based on the following iterative de-duplication strategy: for each element Ry in the to-be-removed set R, y = 1, ..., n-1, the text similarity similarity_text between Ry and each element of the retained segment set K is calculated, and the maximum of these text similarities is determined as the semantic similarity similarity_Ry_K between the element Ry and the retained segment set K. In this way, the n-1 semantic similarities between all n-1 elements in the to-be-removed set R and the retained segment set K are obtained. The minimum of the semantic similarities similarity_Ry_K is compared with a preset similarity threshold threshold2; if that minimum is smaller than the similarity threshold threshold2 (for example, threshold2 = 0.8), the element corresponding to the minimum is moved into the retained segment set K as a retained segment. The above iterative de-duplication strategy is then applied repeatedly to the elements remaining in the to-be-removed set R until a preset condition is met.
Specifically, the preset condition may include that the retained segment set K contains all n segments, i.e., the to-be-removed set is empty. The preset condition may further include that the semantic similarities between all segments to be removed in the to-be-removed segment set and the retained segment set are not smaller than the similarity threshold threshold2, i.e., the similarity between the elements remaining in the to-be-removed set R and the elements in the retained segment set K is greater than or equal to the similarity threshold threshold2, and no scene video segment with low similarity remains. After the preset condition is reached, the segments remaining in the to-be-removed segment set are scene video segments that have repeated semantics with at least one retained segment in the retained segment set. Further, other video processing steps may be performed on the target video obtained after removing the semantically repeated scene video segments; for example, highlight scenes of a specified duration may be extracted from the target video to obtain a video result in which each scene has a uniform duration.
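Putting the iterative de-duplication strategy together, a minimal sketch (using the sim and overall arrays from the previous sketch, and the example value threshold2 = 0.8 from the text) could look like this; deduplicate is an illustrative name, not defined by the disclosure.

```python
# Sketch of the iterative de-duplication strategy over scene-segment indices.
import numpy as np

def deduplicate(sim: np.ndarray, overall: np.ndarray, threshold2: float = 0.8):
    """Return (retained, removed) index lists given pairwise segment relevances."""
    start = int(np.argmin(overall))                  # least redundant segment seeds the retained set K
    retained = [start]
    to_remove = [i for i in range(len(overall)) if i != start]
    while to_remove:
        # similarity of a candidate to the retained set = max relevance to any retained segment
        set_sim = {i: max(sim[i][j] for j in retained) for i in to_remove}
        i_min = min(set_sim, key=set_sim.get)
        if set_sim[i_min] >= threshold2:
            break                                    # everything left repeats what is already kept
        retained.append(i_min)                       # move the least similar candidate into K
        to_remove.remove(i_min)
    return sorted(retained), to_remove               # to_remove holds the semantically repeated segments
```

The segments left in to_remove would then be cut from the video to be processed to produce the target video.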
It can therefore be seen that the video processing method of the embodiments of the present disclosure reduces the occurrence of repeated pictures or content while ensuring that the complete content of the video to be processed is reflected, thereby saving storage and computing resources, improving the efficiency of video editing, and also taking the presentation effect of the edited video into account.
It should be noted that the method of the embodiments of the present disclosure may be performed by a single device, such as a computer or a server. The method of the embodiments may also be applied in a distributed scenario and performed by a plurality of devices cooperating with one another. In such a distributed scenario, one of the devices may perform only one or more steps of the method of the embodiments of the present disclosure, and the devices interact with one another to complete the method.
It should be noted that the foregoing describes some embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same technical concept and corresponding to the method of any of the above embodiments, the present disclosure further provides a video processing apparatus. Referring to fig. 5, the apparatus includes:
an acquisition module configured to acquire a video to be processed;
a segmentation module configured to divide the video to be processed into a plurality of scene video segments based on scene switching positions;
a text module configured to determine a text description corresponding to each scene video segment;
and a de-duplication module configured to remove scene video segments with repeated semantics from the video to be processed based on the semantic relevance between the text descriptions, to obtain a target video.
For convenience of description, the above apparatus is described as being divided into various modules by function. Of course, when implementing the present disclosure, the functions of the various modules may be implemented in the same piece, or in multiple pieces, of software and/or hardware.
The apparatus of the foregoing embodiment is configured to implement the corresponding video processing method of any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described again here.
Based on the same technical concept and corresponding to the method of any of the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the video processing method according to any of the above embodiments.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the foregoing embodiment stores computer instructions for causing a computer to perform the video processing method according to any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described again here.
Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present disclosure, the technical features of the above embodiments or of different embodiments may be combined, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present disclosure. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present disclosure, and this also accounts for the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present disclosure are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the embodiments of the disclosure, are intended to be included within the scope of the disclosure.

Claims (10)

1. A video processing method, the method comprising:
acquiring a video to be processed;
dividing the video to be processed into a plurality of scene video segments based on scene switching positions;
determining a text description corresponding to each scene video segment;
and removing scene video segments with repeated semantics from the video to be processed based on the semantic relevance between the text descriptions, to obtain a target video.
2. The method of claim 1, wherein removing the scene video segments with repeated semantics in the video to be processed based on the semantic relevance between the text descriptions to obtain a target video comprises:
determining, based on the semantic relevance between the text descriptions, the overall relevance between each scene video segment and the video to be processed and the segment relevance between scene video segments;
determining retained segments and segments to be removed among the scene video segments based on the overall relevance and the segment relevance, wherein each segment to be removed has repeated semantics with at least one retained segment;
and removing the segments to be removed from the video to be processed to obtain the target video.
3. The method of claim 2, wherein determining retained segments and segments to be removed among the scene video segments based on the overall relevance and the segment relevance, each segment to be removed having repeated semantics with at least one retained segment, comprises:
recording the scene video segment with the minimum overall relevance as a retained segment in a retained segment set, and recording the scene video segments whose overall relevance is not the minimum as segments to be removed in a to-be-removed segment set;
repeating the following steps until a preset condition is met:
calculating, based on the segment relevance, the semantic similarity between each segment to be removed in the to-be-removed segment set and the retained segment set;
moving the segments to be removed whose semantic similarity is smaller than a similarity threshold into the retained segment set;
wherein the preset condition comprises: the to-be-removed segment set is empty, and/or the semantic similarities between all segments to be removed in the to-be-removed segment set and the retained segment set are not smaller than the similarity threshold.
4. The method of claim 3, wherein determining the overall relevance between each scene video segment and the video to be processed and the segment relevance between scene video segments based on the semantic relevance between the text descriptions comprises:
performing feature extraction on each text description to obtain a corresponding text feature;
for each text feature, calculating a plurality of text similarities between that text feature and the other text features to obtain the segment relevance between the scene video segments;
and obtaining the overall relevance between the scene video segment corresponding to each text feature and the video to be processed based on the average of the plurality of text similarities.
5. The method of claim 4, wherein calculating, based on the segment relevance, the semantic similarity between each segment to be removed in the to-be-removed segment set and the retained segment set comprises:
determining the maximum value among the segment relevances between the segment to be removed and the retained segments in the retained segment set as the semantic similarity between the segment to be removed and the retained segment set.
6. The method of claim 1, wherein dividing the video to be processed into a plurality of scene video segments based on scene switching positions comprises:
determining a scene switching probability for the video frames in the video to be processed;
determining a video frame whose scene switching probability is greater than or equal to a switching threshold as a scene switching position;
and cutting the video to be processed at the scene switching positions to obtain the plurality of scene video segments.
7. The method of claim 1, wherein determining the text description corresponding to each scene video segment comprises:
determining, based on the image content of each video frame in the scene video segment, a plurality of video frame text descriptions corresponding to the video frames;
performing feature extraction on the video frame text descriptions to obtain a plurality of frame text features;
for each frame text feature, calculating the frame text similarity between the frame text feature and the other frame text features;
and determining the video frame text description corresponding to the frame text feature with the maximum frame text similarity as the text description of the scene video segment.
8. A video processing apparatus, comprising:
an acquisition module configured to acquire a video to be processed;
a segmentation module configured to divide the video to be processed into a plurality of scene video segments based on scene switching positions;
a text module configured to determine a text description corresponding to each scene video segment;
and a de-duplication module configured to remove scene video segments with repeated semantics from the video to be processed based on the semantic relevance between the text descriptions, to obtain a target video.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202311022128.3A 2023-08-14 2023-08-14 Video processing method and related equipment Pending CN116916060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311022128.3A CN116916060A (en) 2023-08-14 2023-08-14 Video processing method and related equipment

Publications (1)

Publication Number Publication Date
CN116916060A (en) 2023-10-20

Family

ID=88351119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311022128.3A Pending CN116916060A (en) 2023-08-14 2023-08-14 Video processing method and related equipment

Country Status (1)

Country Link
CN (1) CN116916060A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination