CN117061815A - Video processing method, video processing device, computer readable medium and electronic equipment - Google Patents

Video processing method, video processing device, computer readable medium and electronic equipment

Info

Publication number
CN117061815A
CN117061815A (application number CN202210488001.XA)
Authority
CN
China
Prior art keywords
video
level
scene
shot
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210488001.XA
Other languages
Chinese (zh)
Inventor
闻长远
张振铎
赵丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210488001.XA priority Critical patent/CN117061815A/en
Publication of CN117061815A publication Critical patent/CN117061815A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the application provides a video processing method, a video processing device, a computer readable medium and electronic equipment. The video processing method comprises the following steps: performing shot splitting processing on a video to be processed to obtain a plurality of shot-level video clips; performing scene recognition processing based on the plurality of shot-level video clips to generate scene-level video clips; calculating the correlation between the scene-level video clips according to the features of the scene-level video clips; and performing fusion processing on associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips. The technical scheme of the embodiment of the application can improve the generation efficiency of video clips and, through a multi-level processing mode, ensure the accuracy, integrity and continuity of the generated scenario-level video clips.

Description

Video processing method, video processing device, computer readable medium and electronic equipment
Technical Field
The present application relates to the field of computers and communication technologies, and in particular, to a video processing method, a video processing device, a computer readable medium, and an electronic apparatus.
Background
Currently, when video files are clipped, the clipping is usually performed manually by an editor according to his or her own understanding of the video file, which results in low video clipping efficiency. Moreover, the video processing schemes proposed in the related art generally perform only simple boundary recognition, which results in inaccurate final video segments and makes it difficult to ensure the integrity and continuity of the scenario.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device, a computer readable medium and electronic equipment, which can improve the generation efficiency of video clips and ensure the accuracy, integrity and continuity of generated scenario-level video clips through a multi-level processing mode.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a video processing method including: performing shot splitting processing on a video to be processed to obtain a plurality of shot-level video clips; performing scene recognition processing based on the plurality of shot-level video clips to generate scene-level video clips; calculating the correlation between the scene-level video clips according to the features of the scene-level video clips; and performing fusion processing on associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips.
According to an aspect of an embodiment of the present application, there is provided a video processing apparatus including: a splitting unit configured to perform shot splitting processing on a video to be processed to obtain a plurality of shot-level video clips; a first processing unit configured to perform scene recognition processing based on the plurality of shot-level video clips to generate scene-level video clips; a computing unit configured to compute correlations between the scene-level video clips according to features of the scene-level video clips; and a second processing unit configured to fuse associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips.
In some embodiments of the application, based on the foregoing scheme, the splitting unit is configured to: performing frame extraction processing on the video to be processed to obtain a plurality of video image frames; detecting whether shot boundaries exist between adjacent video image frames in the plurality of video image frames to obtain shot boundary detection results; and carrying out shot splitting processing on the video to be processed according to the shot boundary detection result.
In some embodiments of the application, based on the foregoing scheme, the splitting unit includes: the model processing unit is configured to sequentially input the plurality of video image frames into a shot boundary recognition model according to a set sliding window length to obtain a shot boundary prediction result output by the shot boundary recognition model, wherein the shot boundary prediction result is used for representing whether shot boundaries exist between adjacent video image frames or not; and the generating unit is configured to generate the shot boundary detection result according to the shot boundary prediction result.
In some embodiments of the application, based on the foregoing scheme, the generating unit is configured to: acquiring color detection results of each of the plurality of video image frames, wherein the color detection results represent whether each video image frame is a solid-color image with a specified color; and generating the shot boundary detection result according to the color detection result and the shot boundary prediction result.
In some embodiments of the application, based on the foregoing scheme, the generating unit is configured to: if it is determined according to the color detection result that a target video image frame which is a solid-color image exists among the video image frames, identify a shot boundary between the target video image frame and the video image frame preceding the target video image frame to obtain a first identification result; and take, as the shot boundary detection result, the first identification result together with the shot boundary prediction results between the adjacent frames not covered by the first identification result.
In some embodiments of the application, based on the foregoing scheme, the generating unit is configured to: counting values in a color histogram of each video image frame; and determining whether each video image frame is a solid-color image with a specified color according to the value in the color histogram of each video image frame.
In some embodiments of the present application, based on the foregoing, the shot boundary prediction result includes a probability value of a shot boundary between adjacent video image frames; the generation unit is configured to: identifying a shot boundary between two frames of video image frames with the corresponding probability value being greater than or equal to a set threshold value, and identifying a non-shot boundary between two frames of video image frames with the corresponding probability value being less than the set threshold value, so as to obtain a second identification result; and generating the shot boundary detection result according to the second identification result.
In some embodiments of the application, based on the foregoing, the first processing unit is configured to: extract features of at least one dimension contained in each shot-level video segment to obtain content features of each shot-level video segment; sequentially input the content features of the at least one shot-level video segment into a scene boundary recognition model according to a sliding window taking the shot-level video segment as a unit to obtain a scene boundary prediction result output by the scene boundary recognition model, wherein the scene boundary prediction result is used for indicating whether a scene boundary exists between adjacent shot-level video segments; and generate the scene-level video clips according to the scene boundary prediction result.
In some embodiments of the present application, based on the foregoing scheme, the scene boundary prediction result includes a probability value of a scene boundary between adjacent shot-level video clips; the first processing unit is configured to: identifying a scene boundary between two shot-level video clips with corresponding probability values larger than or equal to a set threshold value, and identifying a non-scene boundary between two shot-level video clips with corresponding probability values smaller than the set threshold value; and dividing the plurality of shot-level video clips according to the recognition results of the scene boundary and the non-scene boundary to obtain the scene-level video clips.
In some embodiments of the present application, based on the foregoing, the video processing apparatus further includes: an audio extraction unit configured to extract audio of the video to be processed; a dividing unit configured to divide audio of the video to be processed into at least one audio-level segment according to correlation between audio contents; and the adjusting unit is configured to adjust the scenario boundary of the scenario-level video clip according to the at least one audio-level clip.
In some embodiments of the present application, based on the foregoing scheme, the dividing unit is configured to: dividing the audio of the video to be processed into a plurality of audio segments with set lengths; extracting audio features corresponding to each of the plurality of audio segments; determining associated audio segments according to the audio characteristics corresponding to the audio segments; and carrying out fusion processing on the associated audio segments to obtain the at least one audio-level segment.
In some embodiments of the application, based on the foregoing, the adjusting unit is configured to: detect, according to the boundary information of the at least one audio-level segment and the time point of the scenario boundary of the scenario-level video segment, whether a boundary time stamp of an audio-level segment falls within a set duration before or after the time point; and if the boundary time stamp of a specified audio-level segment falls within the set duration before or after the time point, take the boundary time stamp of the specified audio-level segment as the boundary time stamp of the scenario-level video segment so as to adjust the scenario boundary of the scenario-level video segment.
In some embodiments of the application, based on the foregoing, the adjusting unit is configured to: if no audio-level segment has a boundary time stamp within the set duration before or after the time point, take the boundary time stamp of the scenario-level video segment as the scenario boundary of the scenario-level video segment.
In some embodiments of the present application, based on the foregoing, the video processing apparatus further includes: the third processing unit is configured to generate a video tag of at least one of the following video clips, and store the generated video tag in association with the corresponding video clip: the shot-level video clip, the scene-level video clip, and the scenario-level video clip.
According to an aspect of an embodiment of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a video processing method as described in the above embodiment.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: one or more processors; and storage means for storing one or more computer programs which, when executed by the one or more processors, cause the electronic device to implement the video processing method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the electronic device reads and executes the computer program from the computer-readable storage medium, so that the electronic device performs the video processing method provided in the above-described various alternative embodiments.
In the technical schemes provided by some embodiments of the present application, after a video to be processed is subjected to shot splitting processing, scene recognition processing is performed based on a plurality of shot-level video clips to generate scene-level video clips, and then, according to the correlation of the scene-level video clips, fusion processing is performed on the associated scene-level video clips to generate scenario-level video clips. In this way, scenario-level video clips can be generated by performing shot splitting, scene fusion and other processing on the video to be processed, automatic generation of scenario-level video clips is realized, and the generation efficiency of video clips is improved. Meanwhile, in the technical scheme of the embodiment of the application, shots are split first, then scenes are split, and finally the scenario-level video clips are generated through fusion processing according to the correlation of the scene-level video clips, so that the accuracy of the generated scenario-level video clips can be ensured through this multi-level processing mode, and the integrity and continuity of the generated scenario-level video clips can be further improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the application may be applied;
fig. 2 shows a schematic diagram of a display interface of a terminal device according to an embodiment of the present application;
FIG. 3 shows a flow chart of a video processing method according to one embodiment of the application;
FIG. 4 shows a flow chart of a video processing method according to one embodiment of the application;
FIG. 5 shows a flow chart of a video processing method according to one embodiment of the application;
FIG. 6 shows a flow chart of a video processing method according to one embodiment of the application;
FIG. 7 shows a schematic diagram of the structure of a video;
FIG. 8 shows a system architecture schematic of a video processing scheme applied to an embodiment of the present application, according to one embodiment of the present application;
FIG. 9 shows a flowchart of deriving a shot-level video clip according to one embodiment of the application;
FIG. 10 illustrates a flowchart of deriving a scene-level video clip according to one embodiment of the application;
FIG. 11 illustrates a flowchart of obtaining scenario-level video clips according to one embodiment of the present application;
FIG. 12 shows a schematic diagram of generating a video tag according to one embodiment of the application;
fig. 13 shows a block diagram of a video processing apparatus according to an embodiment of the application;
fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics of the application may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the application. However, it will be recognized by one skilled in the art that the present inventive arrangements may be practiced without all of the specific details of the embodiments, that one or more specific details may be omitted, or that other methods, elements, devices, steps, etc. may be used.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It should be noted that: references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In the field of video processing, there is often a need to understand video (e.g., film and television source material) content for editing. In the related art, different scenes of a video are identified based on special marks such as simple subtitles and station logos, which cannot ensure accurate analysis of the story logic of the video content. Some schemes also attempt to understand the storyline in a video through fixed scene templates or feature similarity calculation, but they are difficult to adapt to different types of application scenarios, such as matching the duration of the commentary when constructing a commentary video, or beat-matching to the progress of the background music when mixing and cutting videos.
Based on the above, the technical scheme of the embodiment of the application provides a new video processing scheme. Specifically, as shown in fig. 1, a system architecture 100 to which the video processing scheme according to an embodiment of the present application is applied may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include a smart phone, a tablet, a notebook computer, a smart voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, and the like. The server 130 may be a server providing various services, which may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network) services, and basic cloud computing services such as big data and artificial intelligence platforms. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular.
For example, a video clipping party may initiate a clip request for a certain video to the server 130 through the terminal device 110. Specifically, if the file of the video to be processed is stored in the server 130, the terminal device 110 may transmit identification information of the video to be processed, such as a video name or a video URL (Uniform Resource Locator), to the server 130 when initiating the clip request. If the server 130 does not store the file of the video to be processed, the terminal device 110 may transmit the video file of the video to be processed to the server 130 when initiating the clip request.
In one embodiment of the present application, after receiving a clip request for a video to be processed initiated by the terminal device 110, the server 130 may perform shot splitting processing on the video to be processed to obtain a plurality of shot-level video segments, and then perform scene recognition processing based on the plurality of shot-level video segments to generate a scene-level video segment. After the scene-level video clips are obtained, the correlation between the scene-level video clips can be calculated according to the characteristics of the scene-level video clips, and then fusion processing is carried out on the associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips.
Therefore, the technical scheme of the embodiment of the application can generate scenario-level video clips by performing shot splitting, scene segmentation, scene fusion and other processing on the video to be processed, thereby realizing automatic generation of scenario-level video clips and improving the generation efficiency of video clips. Meanwhile, in the technical scheme of the embodiment of the application, shots are split first, then scenes are split, and finally the scenario-level video clips are generated through fusion processing according to the correlation of the scene-level video clips, so that the accuracy of the generated scenario-level video clips can be ensured through this multi-level processing mode, and the integrity and continuity of the generated scenario-level video clips can be further improved.
Alternatively, the server 130 may also use artificial intelligence (AI) technology when processing the video to be processed. Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, automatic driving, and intelligent transportation.
Specifically, the server 130 may process the video to be processed by using Computer Vision (CV) technology. Computer vision is a science that studies how to make machines "see"; it refers to using cameras and computers in place of human eyes to perform machine vision tasks such as recognition, detection and measurement on targets, and to further perform graphics processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, etc., as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
In a specific application scenario of the present application, as shown in fig. 2, a video clipping settings interface may be displayed on the display interface of a terminal device, where the video to be processed may be selected by first triggering a "select file …" button. Then, clip requirements may be selected, such as an "output format" representing the format of the clipped video file and an "output location" representing the storage location of the clipped video file. After selecting the clip requirements, an "auto clip" button may be triggered to submit the clip request. The server side can then perform shot splitting processing on the video to be processed to obtain a plurality of shot-level video clips, and then perform scene recognition processing based on the plurality of shot-level video clips to generate scene-level video clips. After the scene-level video clips are obtained, the correlation between the scene-level video clips can be calculated according to the features of the scene-level video clips, and then fusion processing is performed on the associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips. The generated scenario-level video clips are stored, in the set format, at the output location selected by the user.
The implementation details of the technical scheme of the embodiment of the application are described in detail below:
fig. 3 shows a flow chart of a video processing method according to an embodiment of the application, which may be performed by a terminal device, or which may be performed by a server, or by both the terminal device and the server. Referring to fig. 3, the video processing method at least includes steps S310 to S340, and is described in detail as follows:
In step S310, shot splitting processing is performed on a video to be processed, so as to obtain a plurality of shot-level video clips.
In the embodiment of the application, the shot splitting processing performed on the video to be processed is mainly intended to identify each shot picture contained in the video. Specifically, a shot picture is the basic unit that constitutes the entire video and is the basis of narrative and ideographic work; in a video, a shot refers to a complete segment between one optical transition and the next. The shot-level video clips obtained after the shot splitting processing are video clips in units of shots, and each shot-level video clip comprises a group of video image frames.
In step S320, scene recognition processing is performed based on the plurality of shot-level video clips to generate scene-level video clips.
In embodiments of the present application, a scene-level video clip may include one or more shot-level video clips and has a single, relatively complete meaning, such as representing a course of action, a relationship, or an idea. A scene-level video clip is a complete narrative level in a video, similar to an act in a drama, and multiple scene-level video clips connected together form a complete video, like chapters in a novel. A scene-level video clip can thus be understood as a structural form of video; a transition or switch between scenes, referred to as a transition, is essentially also a transition or switch between two scene-level video clips.
In step S330, the correlation between the scene-level video clips is calculated from the features of the scene-level video clips.
In one embodiment of the present application, since a scene-level video clip contains one or more shot-level video clips, the features of the scene-level video clip may be obtained by integrating the features of the one or more shot-level video clips it contains. For example, taking the feature of one shot-level video segment as one dimension, the features of the one or more shot-level video segments contained in one scene-level video segment can be combined into a one-dimensional or multi-dimensional feature matrix, and the feature matrix is taken as the feature of the scene-level video segment. Alternatively, the features of the one or more shot-level video clips contained in one scene-level video clip can be averaged, and the obtained average feature is used as the feature of the scene-level video clip.
Optionally, when calculating the correlation between the scene-level video clips, a distance, such as a Euclidean distance, between the features of the scene-level video clips may be calculated. If the distance between the features of two scene-level video clips is less than a set distance threshold, the two scene-level video clips are correlated; if the distance between the features of two scene-level video clips is greater than or equal to the set distance threshold, the two scene-level video clips are uncorrelated.
Optionally, when calculating the correlation between the scene-level video clips, the cosine similarity between the features of the scene-level video clips may also be calculated. If the cosine similarity between the features of two scene-level video clips is greater than a set similarity threshold, the two scene-level video clips are correlated; if the cosine similarity between the features of two scene-level video clips is less than or equal to the set similarity threshold, the two scene-level video clips are uncorrelated.
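The following minimal Python sketch (an illustrative addition, not part of the original disclosure) shows one way the above could be realized: the features of the shot-level clips contained in a scene-level clip are averaged, and two scene-level features are then compared by Euclidean distance or cosine similarity against a set threshold; the function names and threshold values are assumptions chosen for illustration.

```python
import numpy as np

def scene_feature(shot_features: np.ndarray) -> np.ndarray:
    """Aggregate the features of the shot-level clips contained in one
    scene-level clip by averaging them (one of the options described above)."""
    return shot_features.mean(axis=0)

def scenes_are_correlated(feat_a: np.ndarray, feat_b: np.ndarray,
                          use_cosine: bool = True,
                          sim_threshold: float = 0.8,
                          dist_threshold: float = 1.0) -> bool:
    """Decide whether two scene-level clips are correlated, either by cosine
    similarity (greater than a set threshold) or by Euclidean distance
    (less than a set threshold)."""
    if use_cosine:
        sim = float(np.dot(feat_a, feat_b) /
                    (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8))
        return sim > sim_threshold
    return float(np.linalg.norm(feat_a - feat_b)) < dist_threshold
```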
In step S340, fusion processing is performed on the associated scene-level video clips according to the correlation of the scene-level video clips, so as to generate scenario-level video clips.
In an embodiment of the present application, a scenario-level video clip refers to a video clip formed of a plurality of correlated scene-level video clips. For example, in some film and television dramas, other content is often inserted during the narration; for instance, recollection fragments of the protagonist may be inserted into a video describing a science-fiction adventure, and these recollection fragments interrupt the video describing the adventure. Therefore, the correlated scene-level video fragments can be integrated according to their correlation to obtain complete scenario-level video fragments.
In one embodiment of the present application, a video tag of at least one of the following video clips may be generated, and the generated video tag and the corresponding video clip may be stored in association with each other: shot-level video clips, scene-level video clips, and scenario-level video clips. The technical scheme of the embodiment enables the corresponding video clips to be conveniently searched according to the video tags, so that the convenience of video clip searching is improved.
Alternatively, when generating the video tag of a video clip, the identification may be performed by a machine learning model; for example, the machine learning model may generate the tag of the video clip by identifying the location, object features, actions, highlight level and the like in the video clip.
The technical scheme of the embodiment shown in fig. 3 can generate scenario-level video clips by performing shot splitting, scene segmentation, scene fusion and other processing on the video to be processed, so that scenario-level video clips are generated automatically and the generation efficiency of video clips is improved. Meanwhile, in the technical scheme of the embodiment of the application, shots are split first, then scenes are split, and finally the scenario-level video clips are generated through fusion processing according to the correlation of the scene-level video clips, so that the accuracy of the generated scenario-level video clips can be ensured through this multi-level processing mode, and the integrity and continuity of the generated scenario-level video clips can be further improved.
Fig. 4 shows a flow chart of a video processing method according to an embodiment of the application, which may be performed by a terminal device, or by a server, or by both the terminal device and the server. Referring to fig. 4, the video processing method at least includes steps S410 to S430 as well as steps S320 to S340 shown in fig. 3; steps S410 to S430 are described in detail below:
In step S410, frame extraction processing is performed on the video to be processed, so as to obtain a plurality of video image frames.
In one embodiment of the present application, the frame extraction processing may be performed on the video to be processed according to a set frame rate; for example, the frame extraction may be performed at the native frame rate of the video to be processed.
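As a hedged illustration of this frame extraction step (the concrete library and parameters are assumptions, not part of the original disclosure), the following sketch uses OpenCV to decode the video and keep frames at a set sampling rate, falling back to the native frame rate when no rate is specified.

```python
from typing import List, Optional
import cv2
import numpy as np

def extract_frames(video_path: str, sample_fps: Optional[float] = None) -> List[np.ndarray]:
    """Decode the video to be processed and keep frames at a set frame rate;
    when sample_fps is None, every decoded frame (native frame rate) is kept."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = 1 if sample_fps is None else max(1, round(native_fps / sample_fps))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```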
In step S420, whether a shot boundary exists between adjacent video image frames in the plurality of video image frames is detected, so as to obtain a shot boundary detection result.
In one embodiment of the present application, a machine learning model may be used to detect whether a shot boundary exists between adjacent video image frames; in particular, the machine learning model may be a shot boundary recognition model obtained through training on samples. During the recognition processing, the plurality of video image frames can be sequentially input into the shot boundary recognition model according to a set sliding window length to obtain a shot boundary prediction result output by the shot boundary recognition model, wherein the shot boundary prediction result is used for indicating whether a shot boundary exists between adjacent video image frames; the shot boundary detection result is then generated according to the shot boundary prediction result.
In one embodiment of the present application, if the shot boundary prediction result includes a probability value of a shot boundary between adjacent video image frames, the shot boundary detection result may be generated by identifying a shot boundary between two video image frames whose corresponding probability value is greater than or equal to a set threshold, identifying a non-shot boundary between two video image frames whose corresponding probability value is less than the set threshold, and using these identification results as the shot boundary detection result.
If the shot boundary prediction result includes a classification result of shot boundaries between adjacent video image frames, whether a shot boundary exists between two video image frames may be determined directly from the shot boundary prediction result.
Optionally, when training the machine learning model to obtain the shot boundary recognition model, one or more groups of video image frames annotated with shot boundaries can be used as training samples and sequentially input into the machine learning model to be trained according to the set sliding window length to obtain a prediction result output by the machine learning model. A model loss value is then calculated according to the prediction result and the shot boundaries annotated in the training samples, and the parameters of the machine learning model are adjusted according to the loss value. This process is repeated until the machine learning model converges, so that the shot boundary recognition model is obtained.
In one embodiment of the present application, after obtaining the shot boundary prediction result output by the shot boundary recognition model, division may be performed between video image frames recognized as the shot boundary according to the shot boundary prediction result, so as to obtain each shot-level video clip.
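The following sketch (illustrative only; boundary_model, the window length and the threshold are assumptions) combines the sliding-window prediction and the division described above: each adjacent-frame transition receives an averaged boundary probability, and the frame sequence is split into shot-level clips wherever the probability reaches the threshold.

```python
from typing import Callable, List, Sequence
import numpy as np

def split_into_shots(frames: Sequence, boundary_model: Callable,
                     window_len: int = 8, threshold: float = 0.5) -> List[list]:
    """Slide a fixed-length window over the extracted frames, let the shot
    boundary recognition model score each adjacent-frame transition, mark a
    shot boundary wherever the averaged probability reaches the threshold,
    and split the frame sequence at those boundaries into shot-level clips.
    boundary_model is an assumed callable returning window_len - 1
    probabilities, one per adjacent-frame transition inside the window."""
    n = len(frames)
    probs = np.zeros(max(n - 1, 0))
    counts = np.zeros(max(n - 1, 0))
    for start in range(0, n - window_len + 1):
        window_probs = np.asarray(boundary_model(frames[start:start + window_len]))
        probs[start:start + window_len - 1] += window_probs
        counts[start:start + window_len - 1] += 1
    probs = probs / np.maximum(counts, 1)          # average overlapping windows
    shots, current = [], [frames[0]] if n else []
    for i in range(n - 1):
        if probs[i] >= threshold:                  # shot boundary between frame i and i + 1
            shots.append(current)
            current = []
        current.append(frames[i + 1])
    if current:
        shots.append(current)
    return shots
```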
In other embodiments of the present application, after the shot boundary prediction result output by the shot boundary recognition model is obtained, the shot boundary may be determined in combination with the color detection results of the video image frames. Specifically, the color detection result of each of the plurality of video image frames may be obtained, where the color detection result indicates whether each video image frame is a solid-color image of a specified color, and then the shot boundary detection result is generated according to the color detection results and the shot boundary prediction result. In this embodiment, a transition frame, i.e. an image frame at which the shot switches, may exist at a shot boundary; such image frames are typically solid-color images, such as solid black images, and if such an image frame is present, a shot boundary may be identified between it and the preceding image frame.
Specifically, if it is determined according to the color detection result that a target video image frame which is a solid-color image exists among the plurality of video image frames, a shot boundary is identified between the target video image frame and the video image frame preceding it, and a first identification result is obtained. The first identification result, together with the shot boundary prediction results between the adjacent frames not covered by the first identification result, is then taken as the shot boundary detection result. In other words, if a target video image frame that is a solid-color image exists among the video image frames, a shot boundary is directly identified between such an image frame and the preceding image frame, while for the other video image frames the shot boundary prediction result output by the shot boundary recognition model continues to be used as the shot boundary detection result.
In one embodiment of the present application, the color detection result of a video image frame may be determined from the color histogram of the video image frame. Specifically, the values in the color histogram of each video image frame may be counted, and then whether each video image frame is a solid-color image of a specified color is determined according to the values in its color histogram. The color histogram describes the proportions of different colors in the whole image, so whether each video image frame is a solid-color image of a specified color can be determined from the color histogram.
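A minimal sketch of such a color-histogram check, assuming the specified color is solid black and using OpenCV (the bin range and ratio threshold are illustrative assumptions):

```python
import cv2
import numpy as np

def is_black_frame(frame: np.ndarray, ratio_threshold: float = 0.98) -> bool:
    """Count the values in the grayscale histogram of a frame and treat the
    frame as a solid black transition image when almost all pixels fall into
    the darkest bins."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).flatten()
    return float(hist[:16].sum() / hist.sum()) >= ratio_threshold
```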
With continued reference to fig. 4, in step S430, shot splitting is performed on the video to be processed according to the shot boundary detection result, so as to obtain a plurality of shot-level video clips.
Alternatively, the plurality of video image frames may be split at the positions identified as shot boundaries to obtain the plurality of shot-level video clips.
The technical scheme of the embodiment shown in fig. 4 enables shot boundaries to be accurately identified, ensures the accuracy of shot boundary division, and is further beneficial to improving the integrity and continuity of generated scenario-level video clips. The implementation details of step S320 to step S340 shown in fig. 4 may refer to the technical solutions of the foregoing embodiments, and will not be described in detail.
Fig. 5 shows a flow chart of a video processing method according to an embodiment of the application, which may be performed by a terminal device, or by a server, or by both the terminal device and the server. Referring to fig. 5, the video processing method includes steps S310, S330 and S340 shown in fig. 3, with steps S510 to S530 performed between step S310 and step S330; implementation details of steps S510 to S530 are described below:
In step S510, at least one dimension of the features included in each shot-level video clip is extracted, so as to obtain content features of each shot-level video clip.
In one embodiment of the present application, the feature of at least one dimension contained in the shot-level video clip may be, for example, at least one of the following features: action features in a shot-level video clip, object features (e.g., character features) in a shot-level video clip, location features in a shot-level video clip, image features in a shot-level video clip, etc.
In step S520, the content features of at least one shot-level video clip are sequentially input into the scene boundary recognition model according to the sliding window using the shot-level video clip as a unit, so as to obtain a scene boundary prediction result output by the scene boundary recognition model, where the scene boundary prediction result is used to represent whether a scene boundary exists between adjacent shot-level video clips.
In one embodiment of the present application, if the scene boundary prediction result includes a probability value of a scene boundary between adjacent shot-level video clips, the scene-level video clips may be generated from the scene boundary prediction result by identifying a scene boundary between two shot-level video clips whose corresponding probability value is greater than or equal to a set threshold, identifying a non-scene boundary between two shot-level video clips whose corresponding probability value is less than the set threshold, and then dividing the plurality of shot-level video clips according to the identification results of scene boundaries and non-scene boundaries to obtain the scene-level video clips.
If the scene boundary prediction result includes a classification result of scene boundaries between adjacent shot-level video clips, whether a scene boundary exists between two adjacent shot-level video clips may be determined directly from the scene boundary prediction result.
Optionally, when training the scene boundary recognition model, one or more groups of shot-level video clips annotated with scene boundaries can be used as training samples and sequentially input into the scene boundary recognition model according to a set sliding window length to obtain a prediction result output by the scene boundary recognition model. A model loss value is then calculated according to the prediction result and the scene boundaries annotated in the training samples, and the parameters of the scene boundary recognition model are adjusted according to the loss value. This process is repeated until the scene boundary recognition model converges.
In step S530, a scene-level video clip is generated according to the scene boundary prediction result.
In one embodiment of the present application, after obtaining the scene boundary prediction result output by the scene boundary recognition model, division may be performed between shot-level video clips recognized as scene boundaries according to the scene boundary prediction result, so as to obtain each scene-level video clip.
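A minimal sketch of this division step, assuming the scene boundary recognition model has already produced one probability per adjacent pair of shot-level clips (the threshold and names are illustrative):

```python
from typing import List, Sequence

def group_shots_into_scenes(shot_clips: Sequence, scene_boundary_probs: Sequence[float],
                            threshold: float = 0.5) -> List[list]:
    """Divide consecutive shot-level clips into scene-level clips.
    scene_boundary_probs[i] is the predicted probability of a scene boundary
    between shot_clips[i] and shot_clips[i + 1]."""
    if not shot_clips:
        return []
    scenes, current = [], [shot_clips[0]]
    for i, prob in enumerate(scene_boundary_probs):
        if prob >= threshold:          # scene boundary: close the current scene
            scenes.append(current)
            current = []
        current.append(shot_clips[i + 1])
    scenes.append(current)
    return scenes
```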
The technical scheme of the embodiment shown in fig. 5 enables scene boundaries to be accurately identified, ensures the accuracy of scene boundary division, and is further beneficial to improving the integrity and continuity of generated scenario-level video clips. The implementation details of step S310, step S330 and step S340 shown in fig. 5 may refer to the technical solutions of the foregoing embodiments, and will not be described again.
Fig. 6 shows a flowchart of a video processing method according to an embodiment of the present application, which further includes the following steps S610 to S630 on the basis of the respective steps shown in fig. 3:
in step S610, the audio of the video to be processed is extracted.
In the embodiment of the application, because a video contains both video images and audio, the audio can be extracted from the video to be processed, and the time stamps of the audio correspond one-to-one to the time stamps of the video images.
In step S620, the audio of the video to be processed is divided into at least one audio-level segment according to the correlation between the audio contents.
In one embodiment of the present application, when dividing the audio of the video to be processed into at least one audio-level segment, the audio of the video to be processed may first be divided into a plurality of audio segments of a set length, and the audio features corresponding to each of the plurality of audio segments may then be extracted. After the audio features corresponding to the audio segments are extracted, the correlation between the audio segments can be calculated according to the audio features, the associated audio segments are determined according to the correlation between the audio segments, and fusion processing is then performed on the associated audio segments to obtain the at least one audio-level segment.
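As an illustrative sketch of the fixed-length division and per-segment feature extraction (the use of librosa and MFCC features is an assumption; the scheme only requires some audio feature per segment):

```python
from typing import List, Tuple
import librosa
import numpy as np

def split_audio_features(audio_path: str, segment_sec: float = 2.0
                         ) -> Tuple[np.ndarray, List[Tuple[float, float]]]:
    """Divide the extracted audio track into fixed-length segments and compute
    a mean MFCC vector for each segment; trailing audio shorter than one
    segment is dropped for simplicity."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    seg_len = int(segment_sec * sr)
    feats, bounds = [], []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20)
        feats.append(mfcc.mean(axis=1))                    # average over time frames
        bounds.append((start / sr, (start + seg_len) / sr))
    return np.array(feats), bounds
```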
Optionally, when calculating the correlation between the audio segments, a distance, such as a Euclidean distance, between the audio features corresponding to the audio segments may be calculated. If the distance between the audio features corresponding to two audio segments is less than a set distance threshold, the two audio segments are correlated; if the distance is greater than or equal to the set distance threshold, the two audio segments are uncorrelated.
Optionally, when calculating the correlation between the audio segments, the cosine similarity between the audio features corresponding to the audio segments may also be calculated. If the cosine similarity between the audio features corresponding to two audio segments is greater than a set similarity threshold, the two audio segments are correlated; if the cosine similarity is less than or equal to the set similarity threshold, the two audio segments are uncorrelated.
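The fusion of correlated adjacent audio segments described above could then look like the following sketch (cosine similarity and the threshold value are illustrative assumptions; it consumes the features and boundaries produced by the previous sketch):

```python
from typing import List, Sequence, Tuple
import numpy as np

def merge_audio_segments(segment_features: Sequence[np.ndarray],
                         segment_bounds: Sequence[Tuple[float, float]],
                         sim_threshold: float = 0.8) -> List[Tuple[float, float]]:
    """Fuse adjacent fixed-length audio segments whose features are correlated
    (cosine similarity above a set threshold) into longer audio-level segments."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    if len(segment_features) == 0:
        return []
    merged = [list(segment_bounds[0])]
    for i in range(1, len(segment_features)):
        if cos(segment_features[i - 1], segment_features[i]) > sim_threshold:
            merged[-1][1] = segment_bounds[i][1]      # extend the current audio-level segment
        else:
            merged.append(list(segment_bounds[i]))    # start a new audio-level segment
    return [tuple(b) for b in merged]
```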
In step S630, scenario boundaries of scenario-level video clips are adjusted according to at least one audio-level clip.
In one embodiment of the present application, the process of adjusting the scenario boundary of a scenario-level video clip according to at least one audio-level clip may be as follows: according to the boundary information of the at least one audio-level segment and the time point of the scenario boundary of the scenario-level video segment, detect whether a boundary time stamp of an audio-level segment falls within a set duration before or after that time point. If the boundary time stamp of a specified audio-level segment falls within the set duration before or after the time point of the scenario boundary, the boundary time stamp of the specified audio-level segment is used as the boundary time stamp of the scenario-level video segment so as to adjust the scenario boundary of the scenario-level video segment; if no audio-level segment has a boundary time stamp within the set duration before or after the scenario boundary of the scenario-level video segment, the boundary time stamp of the scenario-level video segment itself is used as the scenario boundary.
Specifically, the set duration may be, for example, 1 second. Assuming that the scenario boundary of one scenario-level video clip is at the 30th second (with the starting time of the scenario-level video clip being the 0th second), if the boundary time stamp of an audio-level clip is at 29.5 seconds, 29.5 seconds may be used as the scenario boundary of the scenario-level video clip, i.e. the adjusted time span of the scenario-level video clip is 0-29.5 seconds.
If the boundary time stamp of the audio-level segment is at the 28th second, the boundary time stamp of the scenario-level video segment itself is used as the scenario boundary, i.e. the time span of the scenario-level video segment is not adjusted, because the boundary time stamp of the audio-level segment is too far from the scenario boundary of the scenario-level video segment.
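A minimal sketch of this boundary adjustment, assuming a 1-second window as in the example above (names are illustrative):

```python
from typing import Sequence

def adjust_scenario_boundary(scenario_boundary_ts: float,
                             audio_boundary_ts: Sequence[float],
                             max_offset: float = 1.0) -> float:
    """Snap a scenario boundary time stamp to the nearest audio-level segment
    boundary lying within max_offset seconds before or after it (e.g. 29.5 s
    replaces 30 s); otherwise keep the original scenario boundary (28 s is
    too far away, so 30 s is kept)."""
    candidates = [ts for ts in audio_boundary_ts
                  if abs(ts - scenario_boundary_ts) <= max_offset]
    if not candidates:
        return scenario_boundary_ts
    return min(candidates, key=lambda ts: abs(ts - scenario_boundary_ts))
```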
The technical solution of the embodiment shown in fig. 6 enables the scenario boundaries of scenario-level video clips to be adjusted according to the boundaries of audio-level clips, so as to ensure that the obtained scenario-level video clips are more accurate through matching between the boundaries of audio-level clips and the scenario boundaries of scenario-level video clips.
It should be noted that fig. 6 takes the case where step S610 is performed after step S340 as an example; in other embodiments of the present application, the execution order between step S610 and steps S310 to S340 may be arbitrary. For example, they may be executed in parallel, i.e. the process of generating scenario-level video clips in steps S310 to S340 and the process of generating audio-level clips in steps S610 to S620 may be executed in parallel.
The implementation details of the technical solution of the embodiment of the present application are described in detail below with reference to fig. 7 to 12:
As shown in fig. 7, a Video includes a plurality of video image frames (Frames), and a group of video image frames may further form a shot (Shot); a shot is the basic unit that constitutes the entire video, and in a video, a shot refers to a complete segment between one optical transition and the next. A shot may further be divided into sub-shots (Sub-shots), and a scene (Scene) may include one or more shots. Specifically, in the example shown in fig. 7, the video image frames of one video constitute shot 1, shot 2, shot 3, and shot 4. Shot 1 may be divided into sub-shot 1-1 and sub-shot 1-2; shot 2 may be divided into sub-shot 2-1; shot 3 may be divided into sub-shot 3-1, sub-shot 3-2, and sub-shot 3-3; shot 4 may be divided into sub-shot 4-1 and sub-shot 4-2. Meanwhile, scene A includes shot 1 and shot 3, and scene B includes shot 2 and shot 4. It should be noted that, herein, a shot refers to a shot-level video clip and a scene refers to a scene-level video clip.
The technical solution of the embodiments of the present application can automatically disassemble videos (such as films and TV dramas) to build a short-video material library, outputting multi-level video clip material that fits different duration requirements. Specifically, key frame detection can be used to split shots and output shot-level video clips; on the basis of the shot boundaries, a deep neural network analyzes information such as location, objects and actions and outputs scene-level boundary decisions, so as to divide scene-level video clips; on the basis of the scene-level video clips, the analysis results of the audio clips are further fused, and scenario-level video clips are output through global dynamic programming. Finally, a multi-level video material library (shot level, scene level and scenario level) containing tag information such as actions, objects, locations and highlights from video understanding is formed, meeting the needs of content production such as video commentary and video mash-up editing for clips of different durations.
In one embodiment of the present application, as shown in fig. 8, the system architecture to which the video processing scheme of the embodiment of the present application is applied may include four functional modules: shot-level boundary detection, scene-level boundary detection, scenario-level boundary detection, and video content understanding.
Optionally, shot-level boundary detection mainly includes two sub-modules: a CNN (Convolutional Neural Network) video frame jump detection module and a CV (Computer Vision) statistical frame feature module. Scene-level boundary detection mainly includes a feature extraction module for actions, objects and the like, together with a scene boundary prediction model. Scenario-level boundary detection mainly includes a visual clip global optimization module, an audio clip global optimization module and an audio-video clip fusion module. Video content understanding performs action recognition, object recognition, location recognition and highlight recognition on clips of the different levels of granularity, and stores the results in the corresponding material library format.
The main flow of the video processing scheme in the embodiment of the application comprises the following steps:
1. A video to be processed is input; shot splitting is performed on the video to be processed to obtain the start and end time information of shots; black screen detection is performed on each video image frame of the video to be processed to obtain CV statistical features; and the shot detection and black screen detection results are fused to obtain shot boundary information and the corresponding shot-level video clips;
2. Deep learning feature extraction is performed on each of the shot-level video clips, which may include action features, location features, conventional image features, object features and the like; the features belonging to the same shot-level video clip are fed into a scene boundary detection model for scene boundary detection, so as to obtain scene boundary information and the corresponding scene-level video clips;
3. Global dynamic programming fusion is performed on the scene boundaries to obtain scenario-level visual clips; at the same time, audio clip detection is performed on the video to be processed to obtain clip results in the audio modality; finally, scenario-level clip boundary information and the corresponding scenario-level video clips are obtained by fusing the audio and visual clip results.
4. In addition, the video clips of different levels can be respectively sent to modules for action recognition, location recognition, object recognition, highlight analysis and the like, so as to obtain the relevant tag information of each clip.
The following describes each of the above-described processing flows in detail:
fig. 9 shows a process of obtaining a shot-level video clip according to an embodiment of the present application, and the process shown in fig. 9 may be performed by a terminal device, or may be performed by a server, or may be performed by both the terminal device and the server. As shown in fig. 9, a process of obtaining a shot-level video clip according to an embodiment of the present application includes the steps of:
In step S901, dense frame extraction is performed on the video according to its frame rate.
Optionally, dense frame extraction may be performed on a video of duration t0 seconds at a frame rate of fps (frames/s), to obtain a total of M = fps × t0 video image frames [f_1, f_2, f_3, ..., f_M].
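A rough sketch of the dense frame extraction step, assuming OpenCV is available; reading every decoded frame corresponds to sampling at the native frame rate fps, and the function below is an assumption for illustration rather than the disclosed implementation.

    import cv2

    def dense_frame_extraction(video_path):
        """Read all frames of the video, giving M = fps * t0 frames [f_1, ..., f_M]."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        cap.release()
        return frames, fps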
Step S902, the densely-extracted frame images are grouped into windows of N frames each, and the window is slid over the frame sequence.
Specifically, sliding window processing can be performed on the M video image frames with a window length of N, i.e. N video image frames are fed into the CNN deep learning model (i.e. the video frame jump detection model) at a time to determine whether a shot boundary exists between adjacent frames.
In step S903, the probability of the shot boundary between two adjacent frames is predicted by the CNN deep learning model.
In one embodiment of the application, the CNN deep learning model outputs, for the N video image frames in a window, the prediction probability of whether a shot boundary lies between each pair of adjacent frames; through the sliding window control, the model finally outputs M-1 inter-frame prediction probability results [O_1, O_2, ..., O_{M-1}].
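A hedged sketch of the sliding-window control is given below; the interface cnn_model(window), assumed to return the N-1 inter-frame boundary probabilities of a window, is hypothetical, and averaging overlapping predictions is one possible way (an assumption) to merge the window outputs into [O_1, ..., O_{M-1}].

    import numpy as np

    def sliding_window_boundary_probs(frames, cnn_model, window_len):
        """Collect M-1 inter-frame boundary probabilities by feeding window_len
        frames at a time to the CNN model; overlapping predictions are averaged."""
        m = len(frames)
        probs = np.zeros(m - 1)
        counts = np.zeros(m - 1)
        for start in range(0, m - window_len + 1):
            window = frames[start:start + window_len]
            p = cnn_model(window)                       # hypothetical model interface
            probs[start:start + window_len - 1] += p
            counts[start:start + window_len - 1] += 1
        return probs / np.maximum(counts, 1)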
Step S904, carrying out color histogram statistics on each frame of image to obtain a statistic value b.
Specifically, the dense frame extraction result can be sent to a color histogram statistics module to count the color histogram value of the image of each frame.
Step S905, the color histogram statistic b is compared with a threshold value to determine whether the frame is a black screen transition picture.
Specifically, whether a video image frame is purely black can be determined from its color histogram values: for example, if the color histogram statistic b indicates that the proportion of black pixels exceeds a set threshold, the video image frame can be regarded as a black screen transition picture. In this way the M black screen detection results [b_1, b_2, b_3, ..., b_M] are obtained for the M frames.
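A minimal sketch of the black-screen check, assuming OpenCV and NumPy; the dark-pixel level and ratio threshold below are illustrative assumptions rather than values from the disclosure.

    import cv2
    import numpy as np

    def is_black_transition(frame, dark_level=16, ratio_threshold=0.98):
        """Treat a frame as a black screen transition picture when nearly all pixels are dark."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
        dark_ratio = hist[:dark_level].sum() / hist.sum()
        return dark_ratio >= ratio_threshold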
Step S906, on the basis of the prediction probabilities, it is judged whether a frame is a black screen; if so, the corresponding probability is set to 1 and the frame is considered a transition shot.
Specifically, if a certain frame is determined to be a black screen transition picture according to the color histogram, the frame is considered to be a transition frame, and the probability of a shot boundary between that frame and the previous frame, as predicted by the CNN deep learning model in step S903, is set to 1; otherwise, the detection result of step S903 is kept. Specifically, the fused probability O'_i of whether a shot boundary exists between a pair of adjacent frames can be expressed as:

    O'_i = 1,   if b_i = True
    O'_i = O_i, otherwise

where b_i = True indicates that the corresponding frame is determined to be a black screen transition picture based on the color histogram statistic.
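Expressed as a small code sketch (the 0-based index convention, in which probs[i] is the probability between frame i and frame i+1 and the later frame of the pair is tested, is an interpretation of the text above, not a detail from the disclosure):

    def fuse_black_screen(probs, is_black):
        """probs: M-1 boundary probabilities; is_black: M booleans, one per frame.
        Force the probability to 1 when the later frame of the pair is a black
        screen transition picture; otherwise keep the CNN prediction."""
        return [1.0 if is_black[i + 1] else probs[i] for i in range(len(probs))]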
In step S907, the prediction probability of a shot boundary between two adjacent frames is compared with a threshold value to determine whether it is a shot boundary.
Specifically, the prediction probability of a shot boundary between two frames is compared with a threshold Ths: if it exceeds Ths, the pair is considered a shot boundary and the value is set to 1; if it does not exceed Ths, it is considered a non-boundary and the value is set to 0. In this way a 0-1 sequence of length M-1, [s_1, s_2, s_3, ..., s_{M-1}], is obtained.
In step S908, the shots are split, so as to obtain the shot-level video clips.
Specifically, from the sequence [s_1, s_2, s_3, ..., s_{M-1}] obtained above, the runs of consecutive 0s between boundaries are identified, and the video to be processed is thereby decomposed into N shots. For example, if 15 video image frames are input and the corresponding 14-element prediction result is 00000100001000, the frames can be split into 3 shots: frames 1-6 form one shot, frames 7-11 form one shot, and frames 12-15 form one shot.
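A minimal sketch of this splitting step; the 0/1 convention follows the example above, with s_i = 1 meaning a shot boundary between frame i and frame i+1, and the function name is an assumption for illustration.

    def split_into_shots(boundary_flags):
        """boundary_flags: 0/1 list of length M-1.  Returns 1-based (start, end)
        frame ranges, one pair per shot."""
        shots, start = [], 1
        for i, flag in enumerate(boundary_flags, start=1):
            if flag == 1:
                shots.append((start, i))
                start = i + 1
        shots.append((start, len(boundary_flags) + 1))
        return shots

    # 15 frames with boundaries after frames 6 and 11:
    flags = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
    print(split_into_shots(flags))   # [(1, 6), (7, 11), (12, 15)]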
In the embodiment shown in fig. 9, shot boundary recognition is performed jointly by the deep learning model and the color histogram statistics. In other embodiments of the present application, shot boundary recognition may also be performed by the deep learning model shown in fig. 9 alone, i.e. whether a shot boundary exists between two adjacent frames is determined directly from the probability results [O_1, O_2, ..., O_{M-1}] output by the deep learning model.
In addition, in other embodiments of the present application, shot boundary recognition may also be performed by optical flow. Optical flow is the projection of the motion of objects in three-dimensional space onto the two-dimensional image plane; it is produced by the relative velocity between the objects and the camera, and reflects, over a very small time interval, the motion direction and speed of the image pixels corresponding to those objects. Specifically, the motion of objects in the video can be detected through optical flow, so that shot boundaries can be recognized from changes in the motion direction and speed of the objects.
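As an assumed illustration of the optical-flow alternative (the Farneback parameters and any jump threshold are arbitrary choices, not values from the disclosure), dense optical flow between consecutive frames could be computed with OpenCV, and a sudden jump in mean flow magnitude treated as a candidate shot boundary:

    import cv2
    import numpy as np

    def flow_magnitude(prev_gray, cur_gray):
        """Mean magnitude of dense optical flow between two grayscale frames."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        return float(np.mean(mag))

    # A frame pair whose flow magnitude jumps far above the recent running average
    # (e.g. several times the recent mean) could be flagged as a shot boundary.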
Fig. 10 shows a flow of obtaining a scene-level video clip according to an embodiment of the present application, and the flow shown in fig. 10 may be performed by a terminal device, or may be performed by a server, or may be performed by both the terminal device and the server. As shown in fig. 10, a process of obtaining a scene-level video clip according to one embodiment of the present application includes the steps of:
in step S1001, feature extraction is performed on SN shot-level video clips.
Specifically, the SN shot-level video clips [S_1, S_2, S_3, ..., S_SN] obtained from the shot boundary detection result are each passed through m feature extraction modules, including an action feature extraction module M_action, an object feature extraction module M_object, a location feature extraction module M_place and an image feature extraction module M_image, so that each shot-level video clip obtains m-dimensional features, e.g. [S_1_a, S_1_p, S_1_f, ..., S_1_m] for the first clip.
Step S1002, scene boundary detection is performed based on the feature extraction result.
Specifically, through a sliding window operation with D clips per group, the SN shot-level video clips and their SN×m features are sequentially fed into the scene boundary prediction model for prediction, and SN-1 prediction probability results [SE_1, SE_2, SE_3, ..., SE_{SN-1}] are obtained.
In step S1003, whether a scene boundary exists is determined by the scene boundary detection threshold.
Specifically, the SN-1 prediction results are compared with a scene boundary threshold Thss: if a result exceeds Thss, a scene boundary is considered to lie between the corresponding shot-level video clips; if it does not exceed Thss, no scene boundary is considered to lie between them. The sequence SO obtained by this judgment is:

    SO_i = 1, if SE_i > Thss
    SO_i = 0, otherwise          (i = 1, ..., SN-1)

where SO_i = 1 indicates that there is a scene boundary between the i-th shot-level video clip and the (i+1)-th shot-level video clip. After the sequence SO is obtained, the SN shot-level video clips can be divided by the scene boundaries into SS scene-level video clips.
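A minimal sketch of this thresholding and grouping step, under the same notation as above; the function name and the strict ">" comparison are assumptions for illustration.

    def group_shots_into_scenes(scene_probs, thss):
        """scene_probs: SN-1 probabilities [SE_1, ..., SE_{SN-1}].
        Returns lists of 1-based shot indices, one list per scene-level clip."""
        so = [1 if p > thss else 0 for p in scene_probs]
        scenes, current = [], [1]
        for i, flag in enumerate(so, start=1):
            if flag == 1:                 # boundary between shot i and shot i+1
                scenes.append(current)
                current = []
            current.append(i + 1)
        scenes.append(current)
        return scenes

    # e.g. 5 shots with boundaries after shots 2 and 4 -> [[1, 2], [3, 4], [5]]
    print(group_shots_into_scenes([0.1, 0.9, 0.2, 0.8], thss=0.5))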
In the embodiment shown in fig. 10, feature extraction is first performed on the shot-level video clips, and the extracted features are then input into the scene boundary prediction model for prediction. In other embodiments of the present application, a feature extraction network may instead be built into the scene boundary prediction model; the SN shot-level video clips are then sequentially fed into the scene boundary prediction model for prediction through the sliding window operation with D clips per group, to obtain the SN-1 prediction probability results [SE_1, SE_2, SE_3, ..., SE_{SN-1}].
Fig. 11 illustrates a flow of obtaining scenario-level video clips according to an embodiment of the present application, and the flow illustrated in fig. 11 may be performed by a terminal device, or may be performed by a server, or may be performed by both the terminal device and the server. As shown in fig. 11, a flow of obtaining scenario-level video clips according to one embodiment of the present application includes the steps of:
In step S1101, the SS scene-level video clips are processed by the visual clip global optimization module.
Specifically, the SS scene-level video clips are used as the input of the visual clip global optimization module, the correlations between the scene-level video clips are calculated, and the SS scene-level video clips are then fused according to those correlations into SM visual clips [SM_1, SM_2, SM_3, ..., SM_SM].
In step S1102, the audio of the video to be processed is split into segments of a seconds each, and the audio is divided into SA audio-level segments by the audio clip global optimization module.
Specifically, the audio of the video to be processed is divided by the audio clip global optimization module into audio segments of a seconds each; the audio features of each audio segment are then extracted, the correlation between audio segments is calculated from these audio features, and the segments are fused according to the calculated correlations to obtain the SA audio-level segments [SA_1, SA_2, SA_3, ..., SA_SA].
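As an illustrative sketch only (the cosine-similarity measure and its threshold are assumptions, not the disclosed audio clip global optimization), fixed-length audio segments could be fused greedily when adjacent segments are sufficiently correlated:

    import numpy as np

    def merge_correlated_segments(segment_features, boundaries, sim_threshold=0.8):
        """segment_features: one feature vector per a-second audio segment.
        boundaries: end timestamps of those segments.  Adjacent segments whose
        cosine similarity exceeds sim_threshold are fused into one audio-level segment."""
        merged_ends = []
        for i in range(len(segment_features) - 1):
            a, b = segment_features[i], segment_features[i + 1]
            sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            if sim < sim_threshold:            # low correlation -> keep a boundary here
                merged_ends.append(boundaries[i])
        merged_ends.append(boundaries[-1])     # the final segment always ends the audio
        return merged_ends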
In step S1103, the visual clip global optimization result and the audio clip global optimization result are fused by the audio-video clip fusion module to obtain the final scenario-level video clips.
Specifically, when the visual clip global optimization result and the audio clip global optimization result are fused, the visual clips take precedence, and the boundary of the final scenario-level video clip is determined by judging the distance between the audio segmentation boundary and the visual clip boundary.
For example, suppose the visual clip global optimization result outputs a scenario-level video clip from the 0th second to the 30th second. If the audio clip global optimization result outputs an audio-level segment from the 0th second to the 29.5th second, then, since the difference between the two boundaries is 0.5 seconds and less than the set distance threshold of 1 second, 29.5 seconds can be used as the scenario boundary of the final scenario-level video clip, i.e. the adjusted time span of the final scenario-level video clip is 0-29.5 seconds.
If the audio clip global optimization result outputs an audio-level segment from the 0th second to the 28th second, the difference between the audio boundary and the visual clip boundary is 2 seconds, which is larger than the set distance threshold of 1 second, so the visual clip global optimization result prevails, i.e. the time span of the final scenario-level video clip remains 0-30 seconds.
It should be noted that, in the embodiment shown in fig. 11, the scenario-level video clips are generated by fusing the processing result of the visual clip global optimization module with the result of the audio clip global optimization module. In other embodiments of the present application, an end-to-end approach may be adopted instead: a machine learning model is trained, the scene-level video clips are input into that machine learning model, and the scenario-level video clip division result output by the machine learning model is obtained directly.
In one embodiment of the present application, after the shot-level video clips, scene-level video clips and scenario-level video clips are obtained, as shown in fig. 12, they are input to the video content understanding module. The action tag recognition module, location tag recognition module, object tag recognition module, highlight recognition module and the like within the video content understanding module output the tags corresponding to each video clip, such as action, location, object and highlight tags; each video clip and its corresponding tags are then stored in association in the material database, so that the corresponding video clips can later be retrieved by tag.
Optionally, each module in the video content understanding module may be implemented by a machine learning model. For example, the action tag recognition module may be implemented by an action tag recognition model, which can be obtained by training on sample videos carrying tags. Specifically, each sample video may carry an action tag, such as "running", "talking" or "martial arts"; the sample video is input into the action tag recognition model, a model loss value is calculated from the prediction result output by the action tag recognition model and the action tag corresponding to the sample video, the parameters of the action tag recognition model are adjusted using the model loss value, and the above process is repeated until the model converges, thereby completing the training of the action tag recognition model. Optionally, the action tag recognition model may be, for example, a CNN model, a deep learning model, or the like.
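A hedged PyTorch-style sketch of the training loop described above; the model interface, dataloader format and cross-entropy loss are all assumptions for illustration, not the disclosed implementation.

    import torch
    import torch.nn as nn

    def train_action_tag_model(model, dataloader, num_epochs=10, lr=1e-4):
        """Supervised training of an action tag recognition model on labeled sample videos."""
        criterion = nn.CrossEntropyLoss()                  # action labels as class indices
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.train()
        for epoch in range(num_epochs):
            for clip, action_label in dataloader:          # clip: video tensor; label: e.g. "running" -> index
                optimizer.zero_grad()
                logits = model(clip)
                loss = criterion(logits, action_label)     # model loss value
                loss.backward()
                optimizer.step()                           # adjust model parameters
        return model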
According to the technical solution provided by the embodiments of the present application, a large amount of film and television material (i.e. videos to be processed) can be disassembled to generate multi-level video clips (i.e. shot-level, scene-level and scenario-level video clips) for producing film and television commentary and edited videos, which improves the efficiency of commentary and editing. Meanwhile, because video clips of multiple different levels exist, retrieval time can be reduced and retrieval accuracy improved, and picture jumps in the middle of consecutive commentary sentences can be reduced.
The following describes an embodiment of the apparatus of the present application, which may be used to perform the video processing method of the above embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the video processing method of the present application.
Fig. 13 shows a block diagram of a video processing apparatus according to an embodiment of the application. Referring to fig. 13, a video processing apparatus 1300 according to an embodiment of the present application includes: a splitting unit 1302, a first processing unit 1304, a computing unit 1306, and a second processing unit 1308.
The splitting unit 1302 is configured to perform shot splitting processing on the video to be processed to obtain a plurality of shot-level video clips; a first processing unit 1304 configured to perform scene recognition processing based on the plurality of shot-level video clips to generate a scene-level video clip; a calculating unit 1306 configured to calculate correlation between the scene-level video clips according to the features of the scene-level video clips; a second processing unit 1308 is configured to perform fusion processing on the associated scene-level video segments according to the correlation of the scene-level video segments, and generate scenario-level video segments.
In some embodiments of the present application, based on the foregoing scheme, the splitting unit 1302 is configured to: performing frame extraction processing on the video to be processed to obtain a plurality of video image frames; detecting whether shot boundaries exist between adjacent video image frames in the plurality of video image frames to obtain shot boundary detection results; and carrying out shot splitting processing on the video to be processed according to the shot boundary detection result.
In some embodiments of the present application, based on the foregoing scheme, the splitting unit 1302 includes: the model processing unit is configured to sequentially input the plurality of video image frames into a shot boundary recognition model according to a set sliding window length to obtain a shot boundary prediction result output by the shot boundary recognition model, wherein the shot boundary prediction result is used for representing whether shot boundaries exist between adjacent video image frames or not; and the generating unit is configured to generate the shot boundary detection result according to the shot boundary prediction result.
In some embodiments of the application, based on the foregoing scheme, the generating unit is configured to: acquiring color detection results of each of the plurality of video image frames, wherein the color detection results represent whether each video image frame is a solid-color image with a specified color; and generating the shot boundary detection result according to the color detection result and the shot boundary prediction result.
In some embodiments of the application, based on the foregoing scheme, the generating unit is configured to: if the target video image frame which is the solid-color image exists in the video image frames according to the color detection result, identifying the target video image frame and the video image frame before the target video image frame as a shot boundary, and obtaining a first identification result; and taking the shot boundary prediction result between adjacent frames except the first recognition result in the shot boundary prediction results and the first recognition result as the shot boundary detection result.
In some embodiments of the application, based on the foregoing scheme, the generating unit is configured to: counting values in a color histogram of each video image frame; and determining whether each video image frame is a solid-color image with a specified color according to the value in the color histogram of each video image frame.
In some embodiments of the present application, based on the foregoing, the shot boundary prediction result includes a probability value of a shot boundary between adjacent video image frames; the generation unit is configured to: identifying a shot boundary between two frames of video image frames with the corresponding probability value being greater than or equal to a set threshold value, and identifying a non-shot boundary between two frames of video image frames with the corresponding probability value being less than the set threshold value, so as to obtain a second identification result; and generating the shot boundary detection result according to the second identification result.
In some embodiments of the present application, based on the foregoing scheme, the first processing unit 1304 is configured to: extracting at least one dimension feature contained in each shot-level video segment to obtain content features of each shot-level video segment; sequentially inputting the content characteristics of the at least one shot-level video segment into a scene boundary recognition model according to a sliding window taking the shot-level video segment as a unit to obtain a scene boundary prediction result output by the scene boundary recognition model, wherein the scene boundary prediction result is used for indicating whether scene boundaries exist between adjacent shot-level video segments or not; and generating the scene-level video clip according to the scene boundary prediction result.
In some embodiments of the present application, based on the foregoing scheme, the scene boundary prediction result includes a probability value of a scene boundary between adjacent shot-level video clips; the first processing unit 1304 is configured to: identifying a scene boundary between two shot-level video clips with corresponding probability values larger than or equal to a set threshold value, and identifying a non-scene boundary between two shot-level video clips with corresponding probability values smaller than the set threshold value; and dividing the plurality of shot-level video clips according to the recognition results of the scene boundary and the non-scene boundary to obtain the scene-level video clips.
In some embodiments of the present application, based on the foregoing, the video processing apparatus further includes: an audio extraction unit configured to extract audio of the video to be processed; a dividing unit configured to divide audio of the video to be processed into at least one audio-level segment according to correlation between audio contents; and the adjusting unit is configured to adjust the scenario boundary of the scenario-level video clip according to the at least one audio-level clip.
In some embodiments of the present application, based on the foregoing scheme, the dividing unit is configured to: dividing the audio of the video to be processed into a plurality of audio segments with set lengths; extracting audio features corresponding to each of the plurality of audio segments; determining associated audio segments according to the audio characteristics corresponding to the audio segments; and carrying out fusion processing on the associated audio segments to obtain the at least one audio-level segment.
In some embodiments of the application, based on the foregoing, the adjusting unit is configured to: detecting whether the boundary time stamp of the audio-level segment is within the set time before and after the time point according to the boundary information of the at least one audio-level segment and the time point of the scenario boundary of the scenario-level video segment; and if the boundary time stamp of the specified audio-level segment is within the set time length before and after the time point, taking the boundary time stamp of the specified audio-level segment as the boundary time stamp of the scenario-level video segment so as to adjust the scenario boundary of the scenario-level video segment.
In some embodiments of the application, based on the foregoing, the adjusting unit is configured to: and if the audio-level fragments with the boundary time stamps within the set time periods before and after the time point do not exist, taking the boundary time stamp of the scenario-level video fragment as the scenario boundary of the scenario-level video fragment.
In some embodiments of the present application, based on the foregoing, the video processing apparatus further includes: the third processing unit is configured to generate a video tag of at least one of the following video clips, and store the generated video tag in association with the corresponding video clip: the shot-level video clip, the scene-level video clip, and the scenario-level video clip.
Fig. 14 shows a schematic diagram of a computer system suitable for use in implementing an electronic device, which may be the terminal device 110 or the server 130 shown in fig. 1, according to an embodiment of the application.
It should be noted that, the computer system 1400 of the electronic device shown in fig. 14 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 14, the computer system 1400 includes a central processing unit (Central Processing Unit, CPU) 1401, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1402 or a program loaded from a storage section 1408 into a random access Memory (Random Access Memory, RAM) 1403, for example, performing the methods described in the above embodiments. In the RAM 1403, various programs and data required for system operation are also stored. The CPU 1401, ROM 1402, and RAM 1403 are connected to each other through a bus 1404. An Input/Output (I/O) interface 1405 is also connected to bus 1404.
The following components are connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1408 including a hard disk or the like; and a communication section 1409 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1409 performs communication processing via a network such as the internet. The drive 1410 is also connected to the I/O interface 1405 as needed. Removable media 1411, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, and the like, is installed as needed on drive 1410 so that a computer program read therefrom is installed as needed into storage portion 1408.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. When executed by a Central Processing Unit (CPU) 1401, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer programs.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more computer programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the video to be processed may be subjected to shot splitting processing to obtain a plurality of shot-level video clips, and then scene recognition processing is performed based on the plurality of shot-level video clips, so as to generate scene-level video clips. After the scene-level video clips are obtained, the correlation between the scene-level video clips can be calculated according to the characteristics of the scene-level video clips, and then fusion processing is carried out on the associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips.
It should be noted that, when the above embodiments of the present application are applied to specific products or technologies, if related data of a video to be processed needs to be acquired, corresponding permissions or agreements need to be obtained, and the collection, use and processing of the related data need to comply with related laws and regulations and standards of related countries and regions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, comprising several instructions for causing an electronic device to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (18)

1. A video processing method, comprising:
performing shot splitting processing on the video to be processed to obtain a plurality of shot-level video clips;
performing scene recognition processing based on the plurality of shot-level video clips to generate scene-level video clips;
calculating the correlation between the scene-level video clips according to the characteristics of the scene-level video clips;
and according to the correlation of the scene-level video clips, carrying out fusion processing on the associated scene-level video clips to generate scenario-level video clips.
2. The video processing method according to claim 1, wherein performing shot splitting processing on the video to be processed comprises:
performing frame extraction processing on the video to be processed to obtain a plurality of video image frames;
detecting whether shot boundaries exist between adjacent video image frames in the plurality of video image frames to obtain shot boundary detection results;
and carrying out shot splitting processing on the video to be processed according to the shot boundary detection result.
3. The method according to claim 2, wherein detecting whether a shot boundary exists between adjacent video image frames in the plurality of video image frames, to obtain a shot boundary detection result, comprises:
sequentially inputting the video image frames into a shot boundary recognition model according to a set sliding window length to obtain a shot boundary prediction result output by the shot boundary recognition model, wherein the shot boundary prediction result is used for representing whether shot boundaries exist between adjacent video image frames or not;
and generating the shot boundary detection result according to the shot boundary prediction result.
4. The video processing method according to claim 3, wherein generating the shot boundary detection result from the shot boundary prediction result comprises:
acquiring color detection results of each of the plurality of video image frames, wherein the color detection results represent whether each video image frame is a solid-color image with a specified color;
and generating the shot boundary detection result according to the color detection result and the shot boundary prediction result.
5. The video processing method of claim 4, wherein generating the shot boundary detection result from the color detection result and the shot boundary prediction result comprises:
If the target video image frame which is the solid-color image exists in the video image frames according to the color detection result, identifying the target video image frame and the video image frame before the target video image frame as a shot boundary, and obtaining a first identification result;
and taking the shot boundary prediction result between adjacent frames except the first recognition result in the shot boundary prediction results and the first recognition result as the shot boundary detection result.
6. The method of video processing according to claim 4, wherein acquiring the color detection result of each of the plurality of video image frames comprises:
counting values in a color histogram of each video image frame;
and determining whether each video image frame is a solid-color image with a specified color according to the value in the color histogram of each video image frame.
7. The video processing method according to claim 3, wherein the shot boundary prediction result includes a probability value of a shot boundary between adjacent video image frames;
generating the shot boundary detection result according to the shot boundary prediction result, including:
Identifying a shot boundary between two frames of video image frames with the corresponding probability value being greater than or equal to a set threshold value, and identifying a non-shot boundary between two frames of video image frames with the corresponding probability value being less than the set threshold value, so as to obtain a second identification result;
and generating the shot boundary detection result according to the second identification result.
8. The video processing method according to claim 1, wherein performing scene recognition processing based on the plurality of shot-level video clips to generate a scene-level video clip comprises:
extracting at least one dimension feature contained in each shot-level video segment to obtain content features of each shot-level video segment;
sequentially inputting the content characteristics of the at least one shot-level video segment into a scene boundary recognition model according to a sliding window taking the shot-level video segment as a unit to obtain a scene boundary prediction result output by the scene boundary recognition model, wherein the scene boundary prediction result is used for indicating whether scene boundaries exist between adjacent shot-level video segments or not;
and generating the scene-level video clip according to the scene boundary prediction result.
9. The video processing method according to claim 8, wherein the scene boundary prediction result includes a probability value of a scene boundary between adjacent shot-level video clips;
generating the scene-level video clip according to the scene boundary prediction result, including:
identifying a scene boundary between two shot-level video clips with corresponding probability values larger than or equal to a set threshold value, and identifying a non-scene boundary between two shot-level video clips with corresponding probability values smaller than the set threshold value;
and dividing the plurality of shot-level video clips according to the recognition results of the scene boundary and the non-scene boundary to obtain the scene-level video clips.
10. The video processing method according to claim 1, characterized in that the video processing method further comprises:
extracting the audio of the video to be processed;
dividing the audio of the video to be processed into at least one audio-level segment according to the correlation between the audio contents;
and adjusting the scenario boundary of the scenario-level video clip according to the at least one audio-level clip.
11. The video processing method according to claim 10, wherein dividing the audio of the video to be processed into at least one audio-level segment according to the correlation between audio contents, comprises:
Dividing the audio of the video to be processed into a plurality of audio segments with set lengths;
extracting audio features corresponding to each of the plurality of audio segments;
determining associated audio segments according to the audio characteristics corresponding to the audio segments;
and carrying out fusion processing on the associated audio segments to obtain the at least one audio-level segment.
12. The video processing method of claim 10, wherein adjusting scenario boundaries of the scenario-level video clip according to the at least one audio-level clip comprises:
detecting whether the boundary time stamp of the audio-level segment is within the set time before and after the time point according to the boundary information of the at least one audio-level segment and the time point of the scenario boundary of the scenario-level video segment;
and if the boundary time stamp of the specified audio-level segment is within the set time length before and after the time point, taking the boundary time stamp of the specified audio-level segment as the boundary time stamp of the scenario-level video segment so as to adjust the scenario boundary of the scenario-level video segment.
13. The video processing method of claim 12, wherein adjusting scenario boundaries of the scenario-level video clip according to the at least one audio-level clip further comprises:
And if the audio-level fragments with the boundary time stamps within the set time periods before and after the time point do not exist, taking the boundary time stamp of the scenario-level video fragment as the scenario boundary of the scenario-level video fragment.
14. The video processing method according to any one of claims 1 to 13, characterized in that the video processing method further comprises:
generating a video tag of at least one of the following video clips, and storing the generated video tag and the corresponding video clip in an associated manner: the shot-level video clip, the scene-level video clip, and the scenario-level video clip.
15. A video processing apparatus, comprising:
the splitting unit is configured to carry out shot splitting processing on the video to be processed to obtain a plurality of shot-level video clips;
a first processing unit configured to perform scene recognition processing based on the plurality of shot-level video clips to generate a scene-level video clip;
a computing unit configured to compute correlations between the scene-level video clips according to features of the scene-level video clips;
and the second processing unit is configured to fuse the associated scene-level video clips according to the correlation of the scene-level video clips to generate scenario-level video clips.
16. A computer readable medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the video processing method according to any one of claims 1 to 14.
17. An electronic device, comprising:
one or more processors;
storage means for storing one or more computer programs that, when executed by the one or more processors, cause the electronic device to implement the video processing method of any of claims 1-14.
18. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer readable storage medium, from which computer readable storage medium a processor of an electronic device reads and executes the computer program, causing the electronic device to perform the video processing method according to any one of claims 1 to 14.