CN113497977A - Video processing method, model training method, device, equipment and storage medium


Info

Publication number
CN113497977A
Authority
CN
China
Prior art keywords
video
processed
model
interest
sample
Prior art date
Legal status
Pending
Application number
CN202010191630.7A
Other languages
Chinese (zh)
Inventor
杨攸奕
武元琪
李名杨
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010191630.7A
Publication of CN113497977A
Legal status: Pending

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/45 - Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/4508 - Management of client data or end-user data
    • H04N 21/4532 - Management of client data or end-user data involving end-user characteristics, e.g. viewer profile, preferences
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 - Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

Embodiments of the invention provide a video processing method, a model training method, a device, equipment and a storage medium. The method comprises the following steps: acquiring a video to be processed containing a target video picture, where the target video picture may correspond to at least one of commodity information of a commodity, a use process of the commodity and logistics information of the commodity; inputting the video to be processed into a first model, so that the first model outputs interestingness prediction information corresponding to each video segment in the video to be processed; intercepting, from the whole video to be processed according to the interestingness prediction information, a target video segment whose interestingness meets the requirement; and finally outputting the target video segment to the user. The scheme therefore automatically intercepts highlight segments from the video to be processed, and the intercepted video segments contain the content the user is interested in. Because the interception is automatic, the user can obtain the highlight video segment conveniently and efficiently.

Description

Video processing method, model training method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a video processing method, a model training method, a device, equipment and a storage medium.
Background
With the continuous development of the internet, many fields rely on video to realize certain functions. For example, commodity recommendation can be carried out through live video; videos of the freight transportation process can be shot so that freight can be tracked according to the videos; and users can also shoot their own daily life and share the videos with other people.
Taking a video sharing scenario as an example, a user may record the behavior of a pet at home over a long period of time. The user then has to determine which parts of the long video show the pet's amusing behavior, manually clip out the corresponding video segments, and upload them to a video sharing platform.
Precisely because manual video editing is complex and demands much of the user, how to efficiently and conveniently screen out the video segments that interest the user has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a video processing method, a model training method, a device, equipment and a storage medium, which are used for efficiently and conveniently screening out interesting video segments.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of commodity information of a commodity, a use process of the commodity and logistics information of the commodity;
inputting the video to be processed into a first model so that the first model outputs interest prediction information corresponding to the video to be processed, wherein the interest prediction information is used for reflecting interest corresponding to different video segments in the video to be processed and whether the different video segments contain the target video picture;
intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information;
and outputting the target video clip.
In a second aspect, an embodiment of the present invention provides a video processing apparatus, including:
an acquisition module, which is used for acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of commodity information of a commodity, a use process of the commodity and logistics information of the commodity;
the input module is used for inputting the video to be processed into a first model so as to enable the first model to output interestingness prediction information corresponding to the video to be processed, wherein the interestingness prediction information is used for reflecting interestingness corresponding to different video clips in the video to be processed and whether the different video clips contain the target video picture;
the intercepting module is used for intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information;
and the output module is used for outputting the target video clip.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory is configured to store one or more computer instructions, where the one or more computer instructions, when executed by the processor, implement the video processing method in the first aspect. The electronic device may also include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the video processing method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of detail information, a use process, a motion posture and logistics information of a target object;
inputting the video to be processed into a first model so that the first model outputs an interestingness curve corresponding to the video to be processed, wherein the interestingness curve is used for reflecting interestingness corresponding to different video clips in the video to be processed and whether the different video clips contain the target video picture;
and displaying the interestingness curve.
In a sixth aspect, an embodiment of the present invention provides a video processing apparatus, including:
an acquisition module, which is used for acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of detail information, a use process, a motion posture and logistics information of a target object;
the input module is used for inputting the video to be processed into a first model so as to enable the first model to output an interest degree curve corresponding to the video to be processed, and the interest degree curve is used for reflecting interest degrees corresponding to different video clips in the video to be processed and whether the different video clips contain the target video pictures;
and the display module is used for displaying the interestingness curve.
In a seventh aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory is used to store one or more computer instructions, and when the one or more computer instructions are executed by the processor, the electronic device implements the video processing method in the fifth aspect. The electronic device may also include a communication interface for communicating with other devices or a communication network.
In an eighth aspect, the present invention provides a non-transitory machine-readable storage medium, on which an executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is enabled to implement at least the video processing method according to the fifth aspect.
In a ninth aspect, an embodiment of the present invention provides a model training method, including:
acquiring an image sample and a video sample for training a first model, wherein the first model comprises a first backbone network and a first output network;
acquiring object type marking information corresponding to the image sample;
obtaining interest degree marking information corresponding to the video sample, wherein the interest degree marking information is used for reflecting the interest degree of a user on different video clips in the video sample;
inputting the image sample into the first backbone network, and training the first backbone network by taking the object class marking information as supervision information;
and inputting the video sample into a first model formed by a trained first backbone network and the first output network, and training the first output network by taking the interestingness marking information as supervision information.
In a tenth aspect, an embodiment of the present invention provides a model training apparatus, including:
a sample acquisition module, which is used for acquiring an image sample and a video sample for training a first model, wherein the first model comprises a first backbone network and a first output network;
the information acquisition module is used for acquiring object type marking information corresponding to the image sample; obtaining interest degree marking information corresponding to the video sample, wherein the interest degree marking information is used for reflecting the interest degree of a user on different video clips in the video sample;
the sample input module is used for inputting the image sample into the first backbone network, and training the first backbone network by taking the object class marking information as supervision information; and inputting the video sample into a first model formed by a trained first backbone network and the first output network, and training the first output network by taking the interestingness marking information as supervision information.
In an eleventh aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory is used to store one or more computer instructions, and the one or more computer instructions, when executed by the processor, implement the model training method in the ninth aspect. The electronic device may also include a communication interface for communicating with other devices or a communication network.
In a twelfth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the model training method according to the ninth aspect.
The video processing method provided by the invention acquires a video to be processed containing a target video picture, where the target video picture may correspond to at least one of commodity information of a commodity, the use process of the commodity and logistics information of the commodity. The video to be processed is then input into the first model, and the first model outputs interestingness prediction information corresponding to each video segment in the video to be processed. The interestingness prediction information is used to indicate whether each video segment contains the target video picture and the interestingness corresponding to each video segment. A target video segment whose interestingness meets the requirement is then intercepted from the whole video to be processed according to the interestingness prediction information, and the target video segment is finally output to the user.
This is therefore a scheme that automatically intercepts highlight segments from the video to be processed, and the intercepted video segments contain the content the user is interested in. Because the interception is automatic, the user does not need to perform any operation and can therefore obtain the highlight video clip conveniently and efficiently.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present invention;
fig. 2 is an interestingness curve corresponding to a video to be processed according to an embodiment of the present invention;
fig. 3 is a flowchart of another video processing method according to an embodiment of the present invention;
fig. 4 is a flowchart of another video processing method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a model training method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a specific implementation manner of step 403 according to an embodiment of the present invention;
fig. 7 is an interestingness curve corresponding to a video sample provided in an embodiment of the present invention;
fig. 8 is a flowchart of a method for modifying interestingness marking information according to an embodiment of the present invention;
FIG. 9 is a flow chart of another model training method provided by embodiments of the present invention;
fig. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 10;
fig. 12 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of an electronic device corresponding to the video processing apparatus provided in the embodiment shown in fig. 12;
FIG. 14 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an electronic device corresponding to the model training apparatus provided in the embodiment shown in FIG. 14;
FIG. 16 is a schematic diagram of a video processing method and a model training method applied in a home entertainment scene according to an embodiment of the present invention;
fig. 17 is a schematic view of a video processing method and a model training method applied in a live video scene according to an embodiment of the present invention;
fig. 18 is a schematic diagram of a video processing method and a model training method applied in a payment scenario according to an embodiment of the present invention;
fig. 19 is a schematic view of a video processing method and a model training method applied in a transportation scene according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well. "plurality" generally includes at least two unless the context clearly dictates otherwise.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element.
The background above mentions that, in a home entertainment scenario, there is a need to screen out video segments of interest that are about pets. A similar need to screen out video segments of interest also exists in live video scenarios.
Specifically, an anchor may shoot a live video through a shooting device in order to recommend commodities to the audience. In practical applications, a live broadcast usually lasts tens of minutes or even one or two hours, so a user has to watch the complete video to learn about all the commodities recommended in the live broadcast. It is easy to understand that, in addition to the commodity recommendation content, the live video may also contain content in which the anchor interacts with the audience. For users who simply want to know quickly which commodities are recommended in the live video, only the video related to the commodities constitutes the video segments they are interested in.
In this case, the interactive parts of the whole live video can be removed according to the video processing method provided in the following embodiments, and only the commodity recommendation content is retained. Of course, the present invention does not limit the usage scenarios; the video processing method and the model training method provided by the present invention are equally applicable to other scenarios that require video segment screening, although the definition of the video segment of interest may differ between scenarios.
The following takes a live video scene as an example, and details a video processing method provided by the present invention are described in detail with reference to the following embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present invention, where the video processing method according to the embodiment of the present invention may be executed by a video processing device. It will be appreciated that the video processing device may be implemented as software, or a combination of software and hardware. The video processing device in this embodiment and each of the embodiments described below may specifically be an electronic device that is provided with data processing capabilities. As shown in fig. 1, the method comprises the steps of:
s101, acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of commodity information of a commodity, a use process of the commodity and logistics information of the commodity.
According to the above description, the live video obtained by the anchor can serve as the video to be processed. However, since the duration of a live video is usually long, in order to reduce the processing pressure on the first model, the video processing device may optionally, after obtaining the live video, crop the live video into several segments of shorter duration; each cropped segment can then be regarded as one video to be processed, and its duration may be, for example, 10 minutes.
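Purely as an illustration of this cropping step, a long recording could be split into fixed-length to-be-processed clips as sketched below. The use of the ffmpeg command-line tool, the helper name and the 10-minute clip length are assumptions for the example, not part of the described scheme.

```python
import subprocess

def split_video(input_path, total_seconds, clip_seconds=600, prefix="clip"):
    """Cut a long video into consecutive clips of at most `clip_seconds` each.

    Hypothetical helper; relies on the ffmpeg command-line tool being installed.
    """
    outputs = []
    start = 0
    index = 0
    while start < total_seconds:
        out_path = f"{prefix}_{index:03d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(clip_seconds),
             "-i", input_path, "-c", "copy", out_path],
            check=True,
        )
        outputs.append(out_path)
        start += clip_seconds
        index += 1
    return outputs

# Example: split a two-hour live recording into 10-minute to-be-processed clips.
# clips = split_video("live_broadcast.mp4", total_seconds=7200)
```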
Since the anchor, when recommending a commodity, shows commodity information such as the commodity's style and model and demonstrates how the commodity is used, the video to be processed in the live broadcast scenario may contain target video pictures corresponding to the commodity information and the use process of the commodity.
It should be noted that, in a payment scenario, the target video picture may correspond to the commodity information, while in a transportation and logistics scenario, the target video picture may correspond to the commodity information and the logistics information.
S102, inputting the video to be processed into a first model so that the first model outputs interest prediction information corresponding to the video to be processed, wherein the interest prediction information is used for reflecting interest corresponding to different video segments in the video to be processed and whether the different video segments contain target video pictures.
And then, inputting the video to be processed into the first model, and acquiring interest prediction information which is output by the first model and corresponds to the video to be processed. The output interest prediction information can reflect the interest of the user corresponding to each video clip in the video to be processed, and can also reflect whether each video clip contains a target video picture which is interested by the audience, namely a video picture corresponding to the commodity information and the use process of the commodity.
A video segment in the video to be processed may be a segment of short duration in the video, optionally a video segment lasting several hundred milliseconds; of course, one frame of image in the video to be processed may also be regarded as one video segment.
S103, intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information.
And S104, outputting the target video clip.
Then, according to the interestingness prediction information corresponding to the different video segments output by the first model, a target video segment meeting the interestingness requirement is intercepted from the video to be processed, and the intercepted segment is output to the viewer. The screened target video segment naturally contains the commodity information and the use process of the commodity.
Optionally, the interestingness prediction information corresponding to the video to be processed may be embodied as an interest score corresponding to each video segment. Then, for the determination of the target video segment, in an optional manner, if the interest scores of a plurality of consecutive video segments, such as consecutive multi-frame images, are all greater than a set threshold, those consecutive video segments are determined to be the target video segment.
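A minimal sketch of this thresholding rule is given below; the function name, the threshold value and the minimum run length are illustrative assumptions rather than values fixed by the embodiment.

```python
def find_target_segments(scores, threshold=0.5, min_len=30):
    """Return (start, end) index pairs of runs of consecutive video segments
    (e.g. consecutive frames) whose interest scores all exceed `threshold`.

    `min_len` discards runs that are too short to form a useful highlight;
    both defaults are illustrative only.
    """
    segments = []
    run_start = None
    for i, s in enumerate(scores):
        if s > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                segments.append((run_start, i))
            run_start = None
    if run_start is not None and len(scores) - run_start >= min_len:
        segments.append((run_start, len(scores)))
    return segments

# Example: per-frame interest scores predicted by the first model.
# highlights = find_target_segments(per_frame_scores, threshold=0.6, min_len=25)
```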
Optionally, the interest prediction information corresponding to the video to be processed may also be embodied as an interest curve corresponding to the video, and a labeling point for labeling the start-stop position of the target video segment also exists on the interest curve.
For the start and end positions of the target video segment, there may be points on the interestingness curve with the following characteristics: the interestingness curve corresponding to the video to be processed may have a start annotation point and an end annotation point, where the slope value at the start annotation point is greater than a first threshold, the slope value at the end annotation point is less than a second threshold, and the first threshold is greater than the second threshold; the first threshold is usually a positive number and the second threshold a negative number, which can be understood with reference to fig. 2.
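The slope rule described above could be realized, for example, with a numerical gradient over the per-segment scores; the concrete threshold values and the function name below are placeholders introduced only for illustration.

```python
import numpy as np

def locate_start_stop(scores, first_threshold=0.05, second_threshold=-0.05):
    """Find a start annotation point (slope > first_threshold, a positive number)
    and the first later end annotation point (slope < second_threshold, a negative
    number) on an interestingness curve given as per-segment interest scores.
    Threshold values are illustrative assumptions.
    """
    slope = np.gradient(np.asarray(scores, dtype=float))
    start = next((i for i, v in enumerate(slope) if v > first_threshold), None)
    if start is None:
        return None
    end = next((i for i in range(start + 1, len(slope))
                if slope[i] < second_threshold), None)
    if end is None:
        return None
    return start, end  # indices of the start and end annotation points
```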
In this embodiment, a to-be-processed video including a target video picture is acquired. Then, the video to be processed is input to the first model, and the first model can output the interest prediction information corresponding to each video segment in the video to be processed. And then, according to the interest degree information, a target video clip with the interest degree meeting the requirement is intercepted from the whole video to be processed, and finally the target video clip is output to the user. Therefore, the method is a scheme for automatically intercepting the highlight segments in the video to be processed, and the intercepted video segments contain the content which is interested by the user. Due to the fact that the interception is automatic, the user does not need to carry out any operation, and therefore the user can obtain the highlight video clip conveniently and efficiently.
On the basis of the above embodiments, it is easy to understand that a live video generally includes several recommended commodities, so there may be multiple intercepted target video segments. Optionally, the multiple target video segments may be merged according to their chronological order, and the merged result is output to the viewing user.
In addition, in practical applications, the output video clips can be set to background music to make them more entertaining. The music may be chosen at random, or the content of the video clip may be identified so that music is matched to the content; for example, cheerful music is matched with a video recommending cosmetics or electronic products, while soothing music is matched with a video recommending bedding, and so on.
Fig. 3 is a flowchart of another video processing method according to an embodiment of the present invention, as shown in fig. 3, after step 101, the method includes the following steps:
s201, inputting the video to be processed into a second model.
And S202, obtaining a classification result output by the second model and corresponding to the video to be processed.
S203, if the classification result indicates that the video to be processed may be of interest to the user, inputting the video to be processed into the first model for processing.
Specifically, as mentioned in the embodiment shown in fig. 1, a long live video can be cut into multiple to-be-processed videos of shorter duration. In order to reduce the processing pressure on the first model, these to-be-processed videos can be input into the second model before being input into the first model. The second model outputs a classification result corresponding to each video to be processed, and the classification result indicates whether or not the video to be processed may be of interest to the user. If the classification result indicates that the video to be processed may be of interest to the user, the video is input into the first model, and the first model continues to screen out the target video segment in the manner shown in fig. 1.
It can be seen that the second model is in fact a coarse-screening model. Among the multiple to-be-processed videos obtained by cutting, those not of interest to the user can be filtered out by the second model, and the remaining to-be-processed videos are further input into the first model, which screens out the target video segments.
Alternatively, the second model may be deployed in the video processing device together with the first model, or the second model may be deployed in another device with lower data processing capability; after the second model determines that a video to be processed may be of interest to the user, that video is sent to the video processing device, where the first model screens out the video segments. In practical applications, the second model may be deployed in the terminal device used by the user, and the first model in a cloud server.
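The coarse-then-fine pipeline could look roughly like the sketch below; `second_model`, `first_model` and the filtering logic are hypothetical stand-ins for the terminal-side classifier and the cloud-side model, neither of which is specified in detail by this embodiment.

```python
def coarse_then_fine(videos, second_model, first_model):
    """Filter to-be-processed videos with a coarse second model, then let the
    first model predict per-segment interestingness for the survivors.

    Both models are hypothetical callables: `second_model(video)` returns True
    when the video may be of interest to the user, and `first_model(video)`
    returns the interestingness prediction information for that video.
    """
    results = {}
    for video in videos:
        if not second_model(video):          # coarse screening, e.g. on the terminal device
            continue
        results[video] = first_model(video)  # fine-grained prediction, e.g. in the cloud
    return results
```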
It should be noted that the classification result output by the second model can reflect whether the video to be processed contains a target video picture of interest to the user; at this stage, however, the video to be processed still contains many non-target video pictures.
In this embodiment, since the number of videos to be processed is usually large, the videos to be processed may first be input into the second model for preliminary screening, and the videos that pass the preliminary screening are then input into the first model, thereby reducing the processing pressure on the first model.
In addition, for other scenarios, such as a transportation scenario, a payment scenario, a home entertainment scenario, and so on, the above and following embodiments provided by the present invention are also applicable, and specific processes may be described with reference to the following embodiments shown in fig. 16 to 19.
Fig. 4 is a flowchart of another video processing method according to an embodiment of the present invention, where the video processing method according to an embodiment of the present invention may be executed by a video processing device. As shown in fig. 4, the method includes the steps of:
s301, a to-be-processed video containing a target video picture is obtained, wherein the target video picture corresponds to at least one item of detail information, use process, motion posture and logistics information of a target object.
Step 301 has been described in detail in the embodiment shown in fig. 1, taking a live video scenario as an example; related content may refer to the description of that embodiment and is not repeated here. As can be seen from the above description, the target object in that scenario may be a recommended commodity, and the target video picture corresponds to the detail information and the use process of the commodity.
For other scenarios, such as a home entertainment scenario, the target object may be a pet or a child of the household, and the target video picture corresponds to a motion posture of the target object, such as a jumping motion of the pet. In a payment scenario, the target object may be a commodity purchased by the user at a store, and the target video picture corresponds to the detail information of the commodity. In a transportation scenario, the target object may be goods, and the target video picture corresponds to the detail information and logistics information of the goods.
S302, inputting the video to be processed into the first model, so that the first model outputs an interestingness curve corresponding to the video to be processed, wherein the interestingness curve is used for reflecting the interestingness corresponding to different video segments in the video to be processed and whether the different video segments contain the target video picture.
S303, displaying the interestingness curve.
The manner of obtaining the interestingness curve may refer to the related description in the embodiment shown in fig. 1 and is not repeated here. Once the interestingness curve is obtained, it can be displayed. Because the interestingness curve can be composed of the interest scores corresponding to the video segments in the video to be processed, the user can intuitively learn from the curve how interesting each video segment is, so as to further process the video. For example, if the interestingness of every video segment in the video to be processed is low, the user may choose to ignore the video directly and not screen out a highlight video.
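Displaying the curve could be as simple as the following sketch; the use of matplotlib, the frame rate and the axis labels are assumptions made only for this example.

```python
import matplotlib.pyplot as plt

def show_interest_curve(scores, fps=25.0):
    """Plot per-frame interest scores against time so the user can judge
    which parts of the to-be-processed video are worth keeping.
    The frame rate is an illustrative assumption."""
    times = [i / fps for i in range(len(scores))]
    plt.plot(times, scores)
    plt.xlabel("time (s)")
    plt.ylabel("interest score")
    plt.title("Interestingness curve of the video to be processed")
    plt.show()
```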
In this embodiment, the video to be processed is input to the first model, so that the first model outputs the interestingness curve corresponding to the video, and the user can intuitively know whether the video really has the content of interest of the user according to the curve, so as to further determine how to process the video.
Based on the embodiment shown in fig. 4, while the video processing device displays the interestingness curve, it may also determine, according to the interest scores on the interestingness curve, the start and end positions of the target video segment that meets the interestingness requirement in the video to be processed. The start and end positions actually correspond to two annotation points on the interestingness curve, whose characteristics can be referred to in the related description of fig. 1 and are not repeated here.
Then, the video processing device broadcasts the start and end positions corresponding to the target video segment; for example, the broadcast content may be: "the Xth second to the Yth second are the segment of interest". The user can issue a corresponding voice instruction based on the broadcast content. For example, when the user considers that the duration of the segment of interest in the video to be processed is short, the user usually issues a voice instruction of "do not generate the video"; when the duration of the segment of interest is long, the user can issue a voice instruction of "generate the video". If the voice instruction issued by the user is to generate the video, the video processing device intercepts the target video segment according to the interestingness curve and outputs it to the user.
In this embodiment, based on the interestingness curve output by the model, the video processing device can broadcast the curve so that the user issues a corresponding voice instruction according to the broadcast information. This avoids the situation in which the video to be processed contains only a few moderately interesting segments and, after those segments are output, the user finds that the output video is not actually a highlight; in other words, the highlight quality of the output target video is ensured.
The first model is used in the process of intercepting the target video segment according to the video processing method provided by the embodiments. The following embodiments are provided to describe the model training method in detail. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 5 is a flowchart of a model training method according to an embodiment of the present invention, where the model training method according to an embodiment of the present invention may be executed by a video processing apparatus, and a trained model may be considered as a functional module in the video processing apparatus. It will be appreciated that the video processing device may be implemented as software, or a combination of software and hardware. The video processing device in this embodiment and each of the embodiments described below may specifically be an electronic device that is provided with data processing capabilities. The first model trained by the embodiments of the present invention is used for realizing the screening of the video segment of interest.
The following describes the training process of the first model in detail by taking a home entertainment scene in the background art as an example, so that the trained first model can accurately capture the highlight segment containing the pet in the video. As shown in fig. 5, the method includes the steps of:
s401, obtaining an image sample and a video sample for training a first model, wherein the first model comprises a first backbone network and a first output network.
Wherein, the first model to be trained can be composed of a first backbone network and a first output network. The first backbone network may be configured to predict whether the video content is of interest to the user, and the first output network may be configured to implement video screening and output the screened video segments to the user.
Based on the above description, the video processing device may obtain different types of training samples, i.e. obtain image samples and video samples, through the internet. The image sample and the video sample may include a target object or a specific motion of the target object, and the target object or the specific motion of the target object also corresponds to the usage scenario of the first model. For example, in a home entertainment scene, the target objects included in the image sample and the video sample may be animals, and the specific motion of the target object may be a jumping, rolling, or the like motion of the animals. Meanwhile, parameters such as the size of the image sample, the frame rate of the video sample and the like can be set according to different scenes in which the first model is used.
In addition, in consideration of the balance between positive and negative samples, some image samples and video samples may not contain the target object or the specific motion of the target object.
And S402, acquiring object type marking information corresponding to the image sample.
And S403, obtaining interestingness marking information corresponding to the video sample, wherein the interestingness marking information is used for reflecting the interestingness of the user in different video segments in the video sample.
For the obtained image sample, the object type included in the image sample may be labeled manually, for example, the object types of animals, people, backgrounds, and the like included in the image sample may be labeled, so that the video processing device obtains object type labeling information corresponding to the image sample.
For the obtained video sample, the interest level of the video sample can also be labeled manually, so that the video processing device obtains corresponding interest level labeling information. The labeling of the video sample can be understood as labeling the interest degree of the user in different video segments in the sample video.
A video segment may be a segment of short duration in the sample video, optionally a video segment lasting several hundred milliseconds; of course, one frame of image in the video sample may also be regarded as one video segment. In that case, the labeling of the video sample can be understood as labeling the user's interestingness for each frame of image in the video sample. The interestingness labeling information may be embodied as interest scores given by the user to the video segments; for example, a video segment the user is interested in is scored 1, and a segment the user is not interested in is scored 0. After labeling, the interest score corresponding to each frame of image can reflect which parts of the video sample are the video segments of interest to the user that need to be screened out.
It should be noted that although the video samples are labeled in units of the smallest unit of the video, i.e. the image frame, in practical applications such fine-grained labeling can be achieved in the following simple manner: the user only needs to mark that the Mth to Nth seconds in the video sample are of interest and the Kth to Pth seconds are not of interest; the image frames corresponding to the Mth to Nth seconds then all receive the same interest score, and the image frames corresponding to the Kth to Pth seconds likewise receive the same interest score, thereby realizing the labeling of the interestingness of each frame of image.
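By way of example only, the coarse second-level annotation described above could be expanded into per-frame labels as sketched below; the frame rate, the interval format and the helper name are assumptions for the illustration.

```python
def expand_interval_labels(num_frames, interesting_intervals, fps=25.0):
    """Turn coarse annotations such as "seconds M to N are interesting" into a
    per-frame 0/1 interest label, which is the granularity used for training.

    `interesting_intervals` is a list of (start_second, end_second) pairs; frames
    outside every interval get label 0. The frame rate is an assumed parameter.
    """
    labels = [0] * num_frames
    for start_s, end_s in interesting_intervals:
        first = int(start_s * fps)
        last = min(num_frames, int(end_s * fps))
        for i in range(first, last):
            labels[i] = 1
    return labels

# Example: a 60-second sample at 25 fps where seconds 10-20 and 42-50 were
# marked as interesting by the annotator.
# labels = expand_interval_labels(1500, [(10, 20), (42, 50)])
```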
It should also be noted that the above labeling manner is in fact labeling without a comparison process, that is, the user directly labels the interestingness of the image frames according to his or her own judgment. Compared with a relative labeling manner, this greatly reduces the time cost of labeling. In the relative labeling manner, the image frames in the video sample are compared with image frames in other samples in order to determine their interestingness; labeling one video sample in this way requires a large amount of comparison work, which greatly increases the time cost of labeling.
The image sample and the video sample which are labeled can be obtained by the training equipment and input into the first model at different stages so as to complete model training.
S404, inputting the image sample into the first backbone network, and taking the object type labeling information as supervision information to train the first backbone network.
S405, inputting the video sample into a first model formed by the trained first backbone network and a first output network, and training the first output network by taking the interest level marking information as supervision information.
Based on the obtained content, in the first training stage the video processing device inputs the image samples into the first backbone network of the first model and trains the first backbone network with the object class labeling information as supervision information. The trained first backbone network then has a classification capability, that is, it can predict whether an image contains the target object or a specific motion of the target object. In the second training stage, the video samples are input into the first model obtained after the first training stage, and the first output network is trained with the interestingness labeling information of the video samples as supervision information.
After the two-stage training, the training of the whole first model is completed, so that the first model can realize the function of screening out the interested part of the video.
The model training method provided by the embodiment obtains the image sample and the video sample for training the first model. And acquiring object category marking information corresponding to the image sample and interest level marking information corresponding to the video sample. Then, a first backbone network in the first model is trained according to the image samples and the labeling information of the object classes. And training the output network in the first model according to the video sample and the interestingness marking information based on the trained first backbone network, thereby completing the training of the whole first model.
In a typical model training process, training is usually not staged; instead, samples of the type matching the model's intended use are used to train the whole model in one pass, which can result in a poor training effect. The training method here is in fact a staged model training manner that uses different types of samples at different stages. Image samples can be acquired from a wide range of sources and in large quantities, which ensures that the first backbone network is well trained. The video samples are then combined with the well-trained first backbone network to continue training the first output network, so that the whole model achieves a good training effect.
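A compressed sketch of the two-stage procedure is given below, using PyTorch as an assumed framework. The network sizes, loss functions, optimizers, data-loader formats and the choice to freeze the backbone in stage 2 are all illustrative assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn

class FirstBackbone(nn.Module):
    """Toy per-frame feature extractor standing in for the first backbone network."""
    def __init__(self, num_classes, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)  # used only in stage 1

    def forward(self, images):                 # images: (batch, 3, H, W)
        return self.features(images)

class FirstOutputNetwork(nn.Module):
    """Maps per-frame features to a per-frame interest score (stage 2)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frame_features):         # (batch, frames, feat_dim)
        return self.head(frame_features).squeeze(-1)

def train_stage1(backbone, image_loader, epochs=1, lr=1e-3):
    """Stage 1: supervise the backbone with object-class labels of image samples."""
    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, class_labels in image_loader:
            logits = backbone.classifier(backbone(images))
            loss = ce(logits, class_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_stage2(backbone, output_net, video_loader, epochs=1, lr=1e-3):
    """Stage 2: keep the trained backbone fixed (an assumption) and supervise the
    output network with per-frame interestingness labels of video samples."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(output_net.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for frames, interest_labels in video_loader:   # frames: (batch, T, 3, H, W)
            b, t = frames.shape[:2]
            feats = backbone(frames.flatten(0, 1)).view(b, t, -1)
            pred = output_net(feats)
            loss = mse(pred, interest_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```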
In order to ensure the accuracy of the interestingness marking information corresponding to the video sample, as shown in fig. 6, the interestingness marking information may be obtained in the following manner, that is, an optional implementation manner of the step 403 described above:
s4031, video clips of interest respectively labeled on the video samples by a plurality of users are obtained.
S4032, performing statistics on the interesting video segments corresponding to the multiple users to determine interest level annotation information corresponding to the video samples according to the statistical results.
Although it has been mentioned in the above embodiments that a user can label the video sample, that is, mark which video segments in the video sample are of interest and which are not, the number of labeling users was not limited there. In order to ensure the accuracy of the labeling, multiple users may each label the same video sample. After the video processing device obtains the labeling results of the multiple users, it can perform statistics on the labeling results, and the statistical result is finally determined as the interestingness labeling information corresponding to the video sample, which is then used for training the first output network.
Specifically, taking a video sample A as an example, multiple users may each label video sample A to indicate which video segments they find interesting and which they do not, and these labeling results can be obtained by the video processing device. On this basis, an optional statistical manner is: for each video segment of video sample A, count the number of users who labeled that segment as a video segment of interest; a larger number indicates a higher user interest in the segment, and the interestingness labeling information corresponding to the video sample can finally be determined according to the statistical result.
As mentioned in the embodiment shown in fig. 5, the interestingness labeling information may be embodied as an interest score corresponding to each video segment in video sample A. The number of users who labeled a video segment as a video segment of interest can be determined as the interest score of that segment. In this case, the labeling information of video sample A may be the interest score of each video segment.
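The counting described above might be computed as in the sketch below; the interval representation of each user's labels and the optional normalization (mentioned later in this description) are assumptions for the example.

```python
def interest_scores_from_annotations(num_segments, per_user_intervals, normalize=False):
    """For each video segment (e.g. each frame), count how many users marked it
    as part of a video clip of interest; that count serves as the segment's
    interest score. `per_user_intervals` is a list with one entry per user, each
    entry being a list of (start_index, end_index) pairs -- an assumed format.
    """
    scores = [0] * num_segments
    for intervals in per_user_intervals:
        for start, end in intervals:
            for i in range(start, min(end, num_segments)):
                scores[i] += 1
    if normalize and max(scores) > 0:          # optional normalization step
        peak = max(scores)
        scores = [s / peak for s in scores]
    return scores

# Example: three annotators labeling a 10-segment sample.
# scores = interest_scores_from_annotations(
#     10, [[(2, 6)], [(3, 7)], [(2, 5), (8, 10)]], normalize=True)
```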
The interest scores of the video segments form an interestingness curve corresponding to video sample A. On this basis, optionally, the interestingness curve may also be annotated, that is, annotated to indicate which highlight segments need to be screened out; these highlight segments are the target video segments mentioned in the subsequent embodiments. In this case, for video sample A, the corresponding interestingness labeling information is specifically embodied as an interestingness curve with annotation points.
Specifically, according to the labels given by the multiple users to the different video segments in video sample A, the interestingness curve corresponding to video sample A can be obtained. Then, a first annotation point whose slope value is greater than a first threshold is determined on the interestingness curve, and a second annotation point is determined as the first point after the first annotation point whose slope value is smaller than a second threshold. The first threshold is greater than the second threshold; the first threshold is generally a positive number and the second threshold a negative number. The curve segment between the first annotation point and the second annotation point therefore reflects the user's interest gradually rising and then falling, such as the thickened curve segment shown in fig. 7.
Since each point in the interest curve corresponds to a video segment in the video sample, and each video segment may be an image frame, the curve segment composed of the first annotation point and the second annotation point corresponds to a video segment in the video sample, and the video segment is a highlight segment.
Optionally, the interest scores of the video segments in video sample A may also be normalized to obtain the interestingness labeling information; the normalized interest scores can likewise form the interestingness curve corresponding to video sample A, that is, the interestingness labeling information corresponding to video sample A is the normalization result.
Optionally, in practical applications, video sample A may be content that has already been uploaded to a video sharing platform. In that case, the users' labeling may be replaced by the favoriting, liking or viewing operations that video-viewing users perform on video sample A. That is, the interestingness labeling information corresponding to video sample A may be obtained according to how often video sample A is favorited, viewed, and so on.
Of course, the user information of the video-viewing users may also be taken into consideration; for example, only the favoriting, liking and viewing operations triggered by users of a certain age or a certain level are counted, so that the interestingness labeling information of the video sample is determined only from the operations triggered by those users.
In this embodiment, multiple users label the video sample, producing multiple labeling results. The video processing device then performs statistics on these labeling results to finally obtain the interestingness labeling information corresponding to the video sample, thereby avoiding inaccurate labeling caused by a single user labeling the video sample arbitrarily and indirectly ensuring the training effect of the first model.
In the embodiment shown in fig. 6, the accuracy of the interestingness annotation information can be guaranteed to some extent by annotating the same video sample by multiple users. On this basis, optionally, the interest level annotation information may be further corrected, and a specific correction process may be as shown in fig. 8, where the process includes the following steps:
s501, inputting the video sample into a preset interestingness prediction model.
And S502, obtaining interest prediction information which is output by the interest prediction model and corresponds to the video sample.
And S503, correcting the interest degree marking information according to the interest degree prediction information.
The video sample is input into an interestingness prediction model trained in advance, and the prediction model outputs interestingness prediction information corresponding to the video sample. The prediction information may likewise be expressed as an interest score for each video segment in the video sample, and these interest scores can also form an interestingness curve.
Because the interestingness prediction information output by the prediction model and the interestingness labeling information manually labeled by users usually differ, the interestingness prediction information output by the model can be used to correct the manually labeled interestingness labeling information.
The correction may be performed one video segment at a time. In an optional correction manner, for a video segment a in video sample A, there is one piece of interestingness prediction information and one piece of interestingness labeling information; the two pieces of information may each be corrected according to a preset correction function, so as to obtain corrected interestingness prediction information and corrected interestingness labeling information. The interestingness labeling information finally corresponding to video segment a is then determined from the two corrected values.
For example, each corrected value may be multiplied by a corresponding weight coefficient; if the sum of the two products is greater than zero, that sum is finally determined as the interestingness labeling information corresponding to video segment a. The finally determined interestingness labeling information corresponding to video segment a can be used for training the first output network.
After the above correction is performed on each video segment in the video sample, the correction of the interestingness labeling information corresponding to the whole video sample is completed.
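Purely as an illustration of the weighted combination, the per-segment correction could be written as below; the weight coefficients, the use of the identity as the preset correction function, and the fallback when the weighted sum is not positive are all assumptions, since the embodiment does not fix them.

```python
def correct_labels(predicted, annotated, w_pred=0.3, w_label=0.7):
    """Correct manually annotated interest scores with the scores predicted by a
    pre-trained interestingness prediction model, one video segment at a time.

    For simplicity the preset correction function is taken here as the identity.
    Each value is multiplied by a weight coefficient; if the weighted sum is
    positive it becomes the segment's final labeling information, otherwise the
    original annotation is kept (a fallback assumed for this sketch).
    """
    corrected = []
    for p, a in zip(predicted, annotated):
        combined = w_pred * p + w_label * a
        corrected.append(combined if combined > 0 else a)
    return corrected
```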
In this embodiment, correcting the interestingness labeling information further improves its accuracy, which avoids inaccurate labeling caused by a single user labeling the video sample arbitrarily and indirectly ensures the training effect of the first model.
It should be noted that the correction process in this embodiment may be executed on the basis of the embodiment shown in fig. 6, or may be executed separately, and both of them can ensure the accuracy of the interestingness labeling information, thereby indirectly ensuring the training effect of the first model.
After the first model is trained in the manner provided by the embodiments, the first model can be formally used to process the video to be processed, so as to screen out the target video segment in the video to be processed, which is interested by the user.
For the home entertainment scene mentioned above, the user may place a shooting device, such as a mobile phone used by the user, at home. The device can record the pet's behavior over a long period, such as several hours. In this case, similar to the embodiment shown in fig. 1, and optionally, to reduce the processing pressure on the first model, the video processing device may, after obtaining the long video captured by the shooting device, first split it into several shorter videos to be processed.
The screening of target video segments by the first model proceeds as follows: first, a video to be processed is input into the first model, and the interestingness prediction information output by the first model for that video is obtained; this information reflects the interestingness of the different video segments in the video to be processed. Then, according to the interestingness of the different segments output by the first model, a target video segment meeting the interestingness requirement is intercepted from the video to be processed and output to the viewing user. The screened target video segment contains the target object, or a specific action of the target object.
It is easy to understand that, when there are multiple target video segments, they may be merged in chronological order and the merged result output to the viewing user. In practical applications, the background music for the merged video may be chosen at random, or the content of the video segments may be recognized so that music is matched to the content, for example fast music for footage shot during the day and slow music for footage shot at night.
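A toy sketch of this merging and music-matching step is shown below, assuming each target segment is a (start_seconds, end_seconds) pair; the day/night boundary and the track names are purely illustrative.

```python
from datetime import datetime

def merge_segments(segments):
    """Order (start_sec, end_sec) target segments chronologically before concatenation."""
    return sorted(segments, key=lambda seg: seg[0])

def pick_background_music(shoot_time: datetime) -> str:
    """Toy content-based choice: a fast track for daytime footage, a slow one at night."""
    return "fast_track.mp3" if 7 <= shoot_time.hour < 19 else "slow_track.mp3"

clips = merge_segments([(120.0, 135.5), (30.0, 42.0)])
music = pick_background_music(datetime(2020, 3, 18, 21, 0))
# clips -> [(30.0, 42.0), (120.0, 135.5)]; music -> "slow_track.mp3"
```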
Optionally, the video processing device may start screening after the user triggers the screening operation; alternatively, the user may set a working schedule so that the device performs the screening at preset intervals, allowing the user to view the target video segments of interest at regular times.
The user who shoots the video is different from the user who performs annotation in the above embodiment, and the user who performs annotation can be regarded as a worker corresponding to the video processing device.
It is easy to understand that the type of content carried by the interestingness prediction information that the first model outputs for the video to be processed matches the interestingness annotation information of the video samples used to train the first model. Optionally, therefore, the interestingness prediction information for the video to be processed may take the form of an interest score for each video segment. For determining the target video segment, in one optional manner, if the interest scores of consecutive video segments, such as consecutive multi-frame images, are all greater than a set threshold, or if the average of the interest scores of those consecutive segments is greater than the set threshold, those video segments are determined to be a target video segment.
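A minimal sketch of the first criterion (all consecutive scores above the threshold) is given below; the threshold, the minimum run length, and the indexing convention are assumptions, and the average-score variant could be obtained by replacing the per-score test with a sliding-window mean.

```python
def find_target_segments(scores, threshold=0.6, min_len=5):
    """Return (start_index, end_index) pairs (end exclusive) of runs whose
    per-segment interest scores are all above the threshold; runs shorter
    than min_len segments are discarded."""
    targets, run_start = [], None
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i                      # a qualifying run begins
        else:
            if run_start is not None and i - run_start >= min_len:
                targets.append((run_start, i))     # close the current run
            run_start = None
    if run_start is not None and len(scores) - run_start >= min_len:
        targets.append((run_start, len(scores)))   # run extends to the end
    return targets
```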
Optionally, the interestingness prediction information for the video to be processed may instead take the form of an interestingness curve for the video, on which marking points indicate the start and end positions of a target video segment. For the start and end positions of a target video segment, the curve may contain points with the following characteristics:
A third annotation point and a fourth annotation point may exist in the interestingness curve of the video to be processed, where the third annotation point indicates the starting position of the target video segment, similar to the starting annotation point in the embodiment shown in fig. 1, and the fourth annotation point indicates the ending position, similar to the ending annotation point there. For the relationship between these annotation points and the threshold, reference may be made to the related content of the embodiment shown in fig. 1, which is not repeated here.
It should be noted that which of the above methods is used to determine the target video segment may be chosen according to the form of the interestingness annotation information used for the video samples during training.
In practical applications, the first backbone network in the first model may be an Inception V4 or ResNet151 network, among others, and the first output network may be a Boundary-Sensitive Network (BSN) or the like.
Since the second model is deployed on the terminal device used by the user, and considering the processing capability of the terminal device, the second backbone network may be a lightweight, fast MobileNetV1, MobileNetV2, or ShuffleNet network, and the second output network may be a LAFNet network or the like.
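Purely as an illustration of how such a lightweight second model might be assembled, the sketch below pairs a MobileNetV2 backbone from torchvision with a simple averaged-frame linear head; the head is a stand-in for the output network named above (LAFNet), whose exact structure is not specified here, and the input layout (B, T, 3, H, W) is an assumption.

```python
import torch.nn as nn
from torchvision import models

class SecondModel(nn.Module):
    """Lightweight on-device video classifier: MobileNetV2 features plus a linear head."""
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.mobilenet_v2()            # pretrained weights could be loaded here
        self.backbone = backbone.features           # frame-level feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(1280, num_classes)    # 1280 = MobileNetV2 feature width

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)) # (B*T, 1280, h, w)
        feats = self.pool(feats).flatten(1)         # (B*T, 1280)
        feats = feats.view(b, t, -1).mean(dim=1)    # average over the frame axis
        return self.head(feats)                     # "possibly interesting" vs. not
```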
For the training mode of the second model, similar to the first model, fig. 9 is a flowchart of another model training method provided in the embodiment of the present invention, and as shown in fig. 9, the method may include the following steps:
s601, obtaining an image sample and a video sample for training a second model, wherein the second model comprises a second backbone network and a second output network.
And S602, acquiring object type marking information corresponding to the image sample.
S603, obtaining interestingness marking information corresponding to the video sample, wherein the interestingness marking information is used for reflecting whether the user is interested in the video sample.
S604, inputting the image sample into a second backbone network, and taking the object type labeling information as supervision information to train the second backbone network.
S605, inputting the video sample into a second model formed by a trained second backbone network and a second output network, and training the second output network by taking the interest level marking information as supervision information.
The method for training the second model is basically the same as that for training the first model, except that the second model is a classification model, so the interestingness annotation information of a video sample indicates whether the user is interested in the video sample as a whole, which is the point of difference from training the first model.
In addition, the specific process of the second model training may refer to the related description in the embodiment shown in fig. 5, and is not described herein again.
This embodiment is likewise a staged training approach in which different types of training samples are used at different stages. That is, the second backbone network is first trained using the image samples, and the second output network is then trained further on the basis of the trained second backbone network and the video samples. Image samples can be gathered from a wider range of sources and in larger numbers, which helps ensure that the second backbone network is trained well; combining the video samples with this well-trained second backbone network to continue training the second output network then gives the whole model a better training effect.
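The following is a minimal two-stage training sketch under the structure of the SecondModel sketch above; the temporary image-classification head, the number of object classes, the optimizer choice, the decision to freeze the backbone in stage two, and the data-loader formats are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def train_second_model(model, image_loader, video_loader,
                       epochs_backbone=5, epochs_head=5, lr=1e-3):
    ce = nn.CrossEntropyLoss()

    # Stage 1: object-category labels on image samples supervise the backbone.
    image_head = nn.Linear(1280, 1000)              # temporary head; 1000 classes assumed
    params = list(model.backbone.parameters()) + list(image_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs_backbone):
        for images, class_labels in image_loader:   # images: (B, 3, H, W)
            feats = model.pool(model.backbone(images)).flatten(1)
            loss = ce(image_head(feats), class_labels)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: keep the trained backbone fixed and train the output network on
    # video samples labeled only with whether the user is interested in them.
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    for _ in range(epochs_head):
        for frames, interest_labels in video_loader: # frames: (B, T, 3, H, W)
            loss = ce(model(frames), interest_labels)
            opt.zero_grad(); loss.backward(); opt.step()
```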
The video processing apparatus of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these video processing devices can each be constructed using commercially available hardware components configured through the steps taught by the present scheme.
Fig. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention, as shown in fig. 10, the apparatus includes:
the first obtaining module 11 is configured to obtain a to-be-processed video including a target video picture, where the target video picture corresponds to at least one of commodity information of a commodity, a use process of the commodity, a transaction process of the commodity, and a transportation process of the commodity.
An input module 12, configured to input the video to be processed into a first model, so that the first model outputs interestingness prediction information corresponding to the video to be processed, where the interestingness prediction information is used to reflect interestingness corresponding to different video segments in the video to be processed and whether the different video segments include the target video picture.
And the intercepting module 13 is configured to intercept, from the video to be processed, a target video segment meeting the interest level requirement according to the interest level prediction information.
And the output module 14 is used for outputting the target video clip.
Optionally, the apparatus further comprises: a second acquisition module 21.
The input module 12 is further configured to input the video to be processed to the second model.
The second obtaining module 21 is configured to obtain a classification result output by the second model and corresponding to the video to be processed.
The input module 12 is further configured to input the video to be processed to the first model for processing if the classification result indicates that the video to be processed is a video that may be interested by the user.
The video processing apparatus shown in fig. 10 can execute the video processing method provided in the embodiments shown in fig. 1 to fig. 3; for parts not described in detail in this embodiment, reference may be made to the related descriptions of the embodiments shown in fig. 1 to fig. 3, which are not repeated here.
Having described the internal functions and structure of the video processing apparatus, in one possible design, the structure of the video processing apparatus may be implemented as an electronic device, as shown in fig. 11, which may include: a processor 31 and a memory 32. Wherein the memory 32 is used for storing a program for supporting the electronic device to execute the video processing method provided in the foregoing embodiments shown in fig. 1 to 3, and the processor 31 is configured to execute the program stored in the memory 32.
The program comprises one or more computer instructions which, when executed by the processor 31, are capable of performing the steps of:
acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of commodity information of a commodity, a using process of the commodity, a transaction process of the commodity and a transportation process of the commodity;
inputting the video to be processed into a first model so that the first model outputs interest prediction information corresponding to the video to be processed, wherein the interest prediction information is used for reflecting interest corresponding to different video segments in the video to be processed and whether the different video segments contain the target video picture;
intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information;
and outputting the target video clip.
The electronic device may further include a communication interface 33 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for the electronic device, which includes a program for executing the video processing method in the method embodiments shown in fig. 1 to 3.
Fig. 12 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present invention, as shown in fig. 12, the apparatus includes:
an obtaining module 41, configured to obtain a to-be-processed video including a target video picture, where the target video picture corresponds to at least one of detail information, a use process, a motion posture, and logistics information of a target object.
An input module 42, configured to input the video to be processed into a first model, so that the first model outputs an interestingness curve corresponding to the video to be processed, where the interestingness curve is used to reflect interestingness corresponding to different video segments in the video to be processed and whether the different video segments include the target video picture.
And a display module 43 for displaying the interestingness curve.
Optionally, the apparatus further comprises:
a determining module 51, configured to determine, according to interest scores corresponding to different video segments reflected in the interest degree curve, start and end positions of a target video segment meeting the interest degree requirement in the video to be processed.
And the broadcasting module 52 is used for broadcasting the start-stop position.
And the intercepting module 53 is configured to receive a voice instruction generated by the user according to the broadcast content, and intercept, from the video to be processed, the target video segment meeting the interestingness requirement.
And an output module 54, configured to output the target video segment.
The video processing apparatus shown in fig. 12 can execute the video processing method provided in the embodiment shown in fig. 4; for parts not described in detail in this embodiment, reference may be made to the related description of the embodiment shown in fig. 4, which is not repeated here.
Having described the internal functions and structure of the video processing apparatus, in one possible design, the structure of the video processing apparatus may be implemented as an electronic device, as shown in fig. 13, which may include: a processor 61 and a memory 62. Wherein the memory 62 is used for storing a program for supporting the electronic device to execute the video processing method provided in the foregoing embodiment shown in fig. 4, and the processor 61 is configured to execute the program stored in the memory 62.
The program comprises one or more computer instructions which, when executed by the processor 61, are capable of performing the steps of:
acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of detail information, a use process, a motion attitude and logistics information of a target object;
inputting the video to be processed into a first model so that the first model outputs an interestingness curve corresponding to the video to be processed, wherein the interestingness curve is used for reflecting interestingness corresponding to different video clips in the video to be processed and whether the different video clips contain the target video picture;
and displaying the interestingness curve.
The electronic device may further include a communication interface 63 for communicating with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for the electronic device, which includes a program for executing the video processing method in the embodiment of the method shown in fig. 4.
The model training apparatus of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these model training devices can each be constructed using commercially available hardware components configured through the steps taught in the present scheme.
Fig. 14 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention, and as shown in fig. 14, the apparatus includes:
a sample obtaining module 71, configured to obtain an image sample and a video sample for training a first model, where the first model includes a first backbone network and a first output network;
an information obtaining module 72, configured to obtain object category labeling information corresponding to the image sample; and acquiring interest degree marking information corresponding to the video sample, wherein the interest degree marking information is used for reflecting the interest degree of a user on different video clips in the video sample.
A sample input module 73, configured to input the image sample into the first backbone network, and train the first backbone network by using the object class label information as supervision information; and inputting the video sample into a first model formed by a trained first backbone network and the first output network, and training the first output network by taking the interestingness marking information as supervision information.
Optionally, the information obtaining module 72 specifically includes:
an obtaining unit 721 is configured to obtain the video segments of interest respectively labeled to the video samples by a plurality of users.
The counting unit 722 is configured to count the interested video segments corresponding to the multiple users, so as to determine interest level annotation information corresponding to the video sample according to a counting result.
Optionally, the statistical unit 722 is specifically configured to: according to the interesting video segments marked by the users respectively, determining the number of interested persons corresponding to different video segments in the video sample; and determining interest degree marking information corresponding to the video sample according to the number of interested persons corresponding to different video clips in the video sample.
Optionally, the statistical unit 722 is specifically configured to: determining an interest degree curve corresponding to the video sample according to the interest people numbers corresponding to different video clips in the video sample; determining a first annotation point of which the corresponding slope value is greater than a first threshold value in the interestingness curve; determining a second labeling point with a corresponding slope value smaller than a second threshold value in the interestingness curve, wherein the second labeling point is a point with a first slope value smaller than the second threshold value after the first labeling point in the interestingness curve; and determining interest level marking information corresponding to the video sample according to the first marking point and the second marking point.
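As an illustration of the slope rules just described, the sketch below derives annotation-point pairs from a curve of per-segment interested-user counts using simple finite differences; the slope thresholds are illustrative values, not values fixed by this apparatus.

```python
import numpy as np

def find_annotation_points(interest_counts, slope_up=2.0, slope_down=-2.0):
    """Return (start_index, end_index) pairs taken from an interested-user-count
    curve: a first annotation point where the slope exceeds slope_up, and a
    second annotation point at the first subsequent slope below slope_down."""
    counts = np.asarray(interest_counts, dtype=float)
    slopes = np.diff(counts)                  # finite-difference slope between segments
    points, start = [], None
    for i, s in enumerate(slopes):
        if start is None and s > slope_up:
            start = i                         # first annotation point: steep rise
        elif start is not None and s < slope_down:
            points.append((start, i + 1))     # second annotation point: first steep drop after it
            start = None
    return points
```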
Optionally, the apparatus further comprises: and a correction module 81.
The sample input module 73 is further configured to input the video sample into a preset interestingness prediction model.
The information obtaining module 72 is further configured to obtain interestingness prediction information corresponding to the video sample, which is output by the interestingness prediction model.
And the correcting module 81 is configured to correct the interest level marking information according to the interest level prediction information.
Optionally, the apparatus further comprises: a video input module 82, a truncation module 83, and an output module 84.
And a video input module 82, configured to input a video to be processed to the first model.
The information obtaining module 72 is configured to obtain interest prediction information output by the first model and corresponding to the video to be processed, where the interest prediction information is used to reflect interest corresponding to different video segments in the video to be processed.
The intercepting module 83 is configured to intercept, from the video to be processed, a target video segment meeting the interest level requirement according to the interest level prediction information.
The output module 84 is configured to output the target video segment.
Optionally, the intercepting module 83 is specifically configured to: if the interest degrees corresponding to the continuous video segments are all larger than a set threshold value, or if the average value of the interest degrees corresponding to the continuous video segments is larger than the set threshold value, determining the video segments as target video segments.
Optionally, the output module 84 is specifically configured to: and if the number of the target video clips is multiple, merging the multiple target video clips according to the time sequence of the multiple target video clips.
Optionally, the apparatus further comprises: and a result acquisition module 85.
The video input module 82 is further configured to input the video to be processed to the second model.
The result obtaining module 85 is configured to obtain a classification result output by the second model and corresponding to the video to be processed.
The video input module 82 is further configured to input the video to be processed to the first model for processing if the classification result indicates that the video to be processed is a video that may be of interest to the user.
Optionally, the sample obtaining module 71 is configured to obtain an image sample and a video sample for training the second model, where the second model includes a second backbone network and a second output network;
the information obtaining module 72 is configured to obtain object category labeling information corresponding to the image sample; and acquiring interest degree annotation information corresponding to the video sample, wherein the interest degree annotation information is used for reflecting whether a user is interested in the video sample.
The sample input module 73 is configured to input the image sample into the second backbone network, and train the second backbone network by using the object class label information as supervision information; and inputting the video sample into a second model formed by a trained second backbone network and the second output network, and training the second output network by taking the interestingness marking information as supervision information.
The model training apparatus shown in fig. 14 may execute the model training method provided in the embodiments shown in fig. 5 to 9, and for parts not described in detail in this embodiment, reference may be made to the related descriptions of the embodiments shown in fig. 5 to 9, which are not described again here.
Having described the internal functions and structure of the model training apparatus, in one possible design, the structure of the model training apparatus may be implemented as an electronic device, as shown in FIG. 15, which may include: a processor 91 and a memory 92. Wherein the memory 92 is used for storing a program for supporting the electronic device to execute the model training method provided in the foregoing embodiments shown in fig. 5 to 9, and the processor 91 is configured to execute the program stored in the memory 92.
The program comprises one or more computer instructions which, when executed by the processor 91, are capable of performing the steps of:
acquiring an image sample and a video sample for training a first model, wherein the first model comprises a first backbone network and a first output network;
acquiring object type marking information corresponding to the image sample;
obtaining interest degree marking information corresponding to the video sample, wherein the interest degree marking information is used for reflecting the interest degree of a user on different video clips in the video sample;
inputting the image sample into the first backbone network, and training the first backbone network by taking the object class marking information as supervision information;
and inputting the video sample into a first model formed by a trained first backbone network and the first output network, and training the first output network by taking the interestingness marking information as supervision information.
The electronic device may further include a communication interface 93 configured to communicate with other devices or a communication network.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions for the electronic device, which includes a program for executing the model training method in the method embodiments shown in fig. 5 to 9.
For understanding, the specific implementation processes of the video processing method and the model training method provided above are exemplarily described in conjunction with the following application scenarios.
In a home entertainment scene, a user can shoot a video of a pet using a terminal device; this pet video is a video to be processed. Since the user generally fixes the terminal device in one position and then shoots, the video contains the pet only while the pet moves within the shooting range. The video to be processed may therefore not contain the pet at all, and even when it does, the time the pet appears on camera is usually shorter than the duration of the video.
Based on this, after the video to be processed is obtained, it is first input into the second model on the terminal device so that the second model outputs a classification result for it. If the classification result shows that the video may be of interest to the user, i.e., that it contains the pet and the pet is moving, the video to be processed is further sent to the video processing device in the cloud. The second model thus acts as a preliminary filter.
Then the preliminarily screened video to be processed is input into the first model on the video processing device, so that the first model outputs interestingness prediction information for it; this prediction information indicates the user's interest in each video segment of the video to be processed and whether each segment contains a video picture of the pet. Each frame of the video to be processed can be regarded as one video segment, and the prediction information can be expressed concretely as an interest score per segment, from which an interestingness curve can be formed. The video processing device can intercept target video segments of interest to the user from the video according to this curve. For example, when the interestingness curve of the video to be processed is as shown in fig. 16, the video segment corresponding to the curve section between point A and point B, where the interest score stays above the threshold, may be determined as the target video segment and cut out.
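A minimal end-to-end sketch of this two-stage flow is shown below; the predict_is_interesting, predict_segment_scores, and clip helpers are hypothetical wrappers around the second model, the first model, and the video object, and find_target_segments is the helper sketched earlier.

```python
def highlight_pipeline(video, second_model, first_model,
                       interest_threshold=0.6, min_len=5):
    """Pre-screen with the second model, then score and clip with the first model."""
    if not second_model.predict_is_interesting(video):      # assumed classifier wrapper
        return []                                            # drop footage without the pet or motion

    scores = first_model.predict_segment_scores(video)       # assumed per-segment interest scores
    spans = find_target_segments(scores, interest_threshold, min_len)
    return [video.clip(start, end) for start, end in spans]  # assumed clipping helper
```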
Finally, the target video clip is sent to the terminal device so that the user can watch the highlight clip of the pet. In other words, the first model removes the redundant parts of the video to be processed, such as the parts where the pet is out of frame or not moving, and keeps only the highlight segments.
Optionally, when there are multiple target video segments, such as target video segment 1 corresponding to the curve section from point A to point B and target video segment 2 corresponding to the curve section from point C to point D in fig. 16, the segments may be merged in chronological order, i.e., target video segment 1 followed by target video segment 2, and the merged result presented to the user. This process can also be understood with reference to fig. 16.
Of course, if the processing pressure of the video processing apparatus is not taken into consideration, the video to be processed may be directly input to the first model.
In this scene, in one manner, the video processing device may automatically produce the highlight video segments as described above. In another manner, after the video processing device obtains the interestingness curve, it may display the curve and broadcast it, for example announcing that "the xth second to the xth second are a highlight", so that the user decides, according to the broadcast content, whether to further intercept the highlight of interest.
The first model used in both manners includes a first backbone network and a first output network, and its training process is as follows. The video processing device first obtains image samples and video samples for training the first model, together with the object-category annotation information users have given the image samples and the interestingness annotation information given to the video samples. The interestingness annotation information of a video sample reflects the user's interest in the different video segments of that sample. The image samples are then input into the first backbone network, which is trained with the object-category annotation information as supervision. The video samples are then input into the first model formed by the trained first backbone network and the first output network, and the first output network is trained with the interestingness annotation information as supervision, completing the training of the whole model.
Optionally, in the model training process, the accuracy of the interestingness marking information can be improved in a statistical or correction mode, so that the training effect of the first model is indirectly guaranteed. For specific content, reference may be made to related descriptions in the foregoing embodiments, and details are not described herein again.
The second model comprises a second backbone network and a second output network, and its training process is similar to the above: the object-category annotation information for the image samples and the interestingness annotation information for the video samples are obtained, except that here the interestingness annotation information reflects whether the user is interested in the video sample as a whole, which differs from the training of the first model. The image samples are then input into the second backbone network, which is trained with the object-category annotation information as supervision; the video samples are input into the second model formed by the trained second backbone network and the second output network, and the second output network is trained with the interestingness annotation information as supervision. For details, reference may be made to the related descriptions in the foregoing embodiments, which are not repeated here.
In a live video scene, an anchor can shoot a live video through a shooting device in order to recommend commodities to the audience. The long live video can be divided into shorter videos of preset duration, namely the videos to be processed, which may contain video pictures corresponding to the commodity information of the commodities recommended by the anchor, the use process of the commodities, and/or the interaction between the anchor and the audience.
At this time, the live video may first be input into the second model so that the second model outputs a classification result for it. If the classification result shows that the live video may be of interest to the user, i.e., that it contains commodity-related content, the video to be processed is further sent to the video processing device in the cloud. The preliminarily screened live video is then input into the first model on the video processing device, so that the first model outputs interestingness prediction information for the video to be processed; this prediction information indicates the user's interest in each video segment and whether each segment contains a commodity-related target video picture. The prediction information can be expressed as an interest score per segment, and these scores can form an interestingness curve. The video processing device can intercept the target video segments of interest from the video according to this curve. For example, when the interestingness curve of the live video is as shown in fig. 14, the video segment corresponding to the curve section between point A and point B, where the interest score stays above the threshold, may be determined as the target video segment and intercepted.
Finally, the target video clip is sent to the terminal device, so that the user watches only the segments of interest to the audience, i.e., only the video segments corresponding to the commodity information and the commodity use process. In other words, the first model removes the parts of the live video in which the anchor merely interacts with the audience, keeping only the segments the audience is interested in.
Of course, if the processing pressure of the video processing apparatus is not taken into consideration, the video to be processed may be directly input to the first model. In addition, for the training process of the first model and the second model, refer to the related description in the above embodiments, and are not described herein again.
In a payment scene, after a user selects commodities to purchase in a store, a settlement operation can be triggered on a settlement device in the store, which opens the camera on the settlement device. The user then shows the purchased commodities to the camera one by one, and the camera captures the whole display video, which is the video to be processed. The display video may include video pictures corresponding to the commodity information and/or redundant video pictures, where the redundant video pictures may result from unskilled or non-standard user operations.
At this time, the display video may first be input into the second model so that the second model outputs a classification result for it. If the classification result shows that the display video may be of interest to the user, i.e., that it contains video pictures of commodity information, the video to be processed is further sent to the video processing device in the cloud.
The preliminarily screened display video is then input into the first model on the video processing device, so that the first model outputs interestingness prediction information for the video to be processed; this prediction information indicates the user's interest in each video segment and whether each segment contains a commodity-related target video picture. For example, when the interestingness curve of the display video is as shown in fig. 15, the video segment corresponding to the curve section between point A and point B, where the interest score stays above the threshold, may be determined as the target video segment and intercepted; this target video segment contains no redundant video pictures.
Finally, the settlement device can recognize the commodities contained in the target video segment to calculate the transaction amount from the recognition result, and the user can then settle according to the transaction amount displayed on the settlement device.
Of course, if the processing pressure of the video processing apparatus is not taken into consideration, the video to be processed may be directly input to the first model. In addition, for the training process of the first model and the second model, refer to the related description in the above embodiments, and are not described herein again.
In a transportation scene, each goods transfer station can be provided with a shooting device that captures a transportation video, and the transportation videos captured at all transfer stations the goods pass through, from mailing to receipt, can be combined into one complete transportation video, namely the video to be processed. The transportation video may include video pictures corresponding to the commodity information and the logistics information, and/or redundant video pictures. The logistics information may be the location information of the transfer stations, and the redundant video pictures may result from unskilled or non-standard operator actions.
At this time, the transportation video may first be input into the second model so that the second model outputs a classification result for it. If the classification result shows that the transportation video may be of interest to the user, i.e., that it contains video pictures of commodity information and logistics information, the video to be processed is further sent to the video processing device in the cloud.
The preliminarily screened transportation video is then input into the first model on the video processing device, so that the first model outputs interestingness prediction information for the video to be processed; this prediction information indicates the user's interest in each video segment and whether each segment contains video pictures corresponding to the commodity information and the logistics information. For example, when the interestingness curve of the transportation video is as shown in fig. 16, the video segment corresponding to the curve section between point A and point B, where the interest score stays above the threshold, may be determined as the target video segment and intercepted; this target video segment contains no redundant video pictures.
Finally, the target video clip can be sent to the goods management platform, so that the platform can monitor and trace the transportation condition of the goods.
Of course, if the processing pressure of the video processing apparatus is not taken into consideration, the video to be processed may be directly input to the first model. In addition, for the training process of the first model and the second model, refer to the related description in the above embodiments, and are not described herein again.
Of course, in the live video, payment, and transportation scenes, if there is a need for broadcasting as in the home entertainment scene, the broadcasting approach can likewise be applied, so that the user can decide according to the broadcast content whether to intercept a segment of interest from the video; the details are not repeated here. The broadcast function in these scenarios is not shown in the drawings.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (23)

1. A video processing method, comprising:
acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of commodity information of a commodity, a using process of the commodity and logistics information of the commodity;
inputting the video to be processed into a first model so that the first model outputs interest prediction information corresponding to the video to be processed, wherein the interest prediction information is used for reflecting interest corresponding to different video segments in the video to be processed and whether the different video segments contain the target video picture;
intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information;
and outputting the target video clip.
2. The method of claim 1, comprising:
inputting the video to be processed into a second model;
obtaining a classification result output by the second model and corresponding to the video to be processed;
and if the classification result shows that the video to be processed is the video which is possibly interested by the user, inputting the video to be processed into the first model for processing.
3. A video processing method, comprising:
acquiring a video to be processed containing a target video picture, wherein the target video picture corresponds to at least one of detail information, a use process, a motion attitude and logistics information of a target object;
inputting the video to be processed into a first model so that the first model outputs an interestingness curve corresponding to the video to be processed, wherein the interestingness curve is used for reflecting interestingness corresponding to different video clips in the video to be processed and whether the different video clips contain the target video picture;
and displaying the interestingness curve.
4. The method of claim 3, further comprising:
determining the position of a target video clip meeting the interest degree requirement in the video to be processed according to the interest scores corresponding to different video clips reflected in the interest degree curve;
broadcasting the corresponding position of the target video clip;
receiving a voice instruction generated by a user according to broadcast content, and intercepting a target video clip meeting the interest degree requirement from the video to be processed;
and outputting the target video clip.
5. A method of model training, comprising:
acquiring an image sample and a video sample for training a first model, wherein the first model comprises a first backbone network and a first output network;
acquiring object type marking information corresponding to the image sample;
obtaining interest degree marking information corresponding to the video sample, wherein the interest degree marking information is used for reflecting the interest degree of a user on different video clips in the video sample;
inputting the image sample into the first backbone network, and training the first backbone network by taking the object class marking information as supervision information;
and inputting the video sample into a first model formed by a trained first backbone network and the first output network, and training the first output network by taking the interestingness marking information as supervision information.
6. The method according to claim 5, wherein the obtaining interest level annotation information corresponding to the video sample comprises:
obtaining interesting video segments respectively labeled on the video samples by a plurality of users;
and counting the interesting video segments corresponding to the users to determine interest degree annotation information corresponding to the video samples according to the counting result.
7. The method according to claim 6, wherein the performing statistics on the video segments of interest corresponding to the plurality of users to determine the interest level annotation information corresponding to the video sample according to the statistical result comprises:
according to the interesting video segments marked by the users respectively, determining the number of interested persons corresponding to different video segments in the video sample;
and determining interest degree marking information corresponding to the video sample according to the number of interested persons corresponding to different video clips in the video sample.
8. The method according to claim 7, wherein the determining interest degree annotation information corresponding to the video sample according to the number of people of interest corresponding to each of different video clips in the video sample comprises:
determining an interest degree curve corresponding to the video sample according to the interest people numbers corresponding to different video clips in the video sample;
determining a first annotation point of which the corresponding slope value is greater than a first threshold value in the interestingness curve;
determining a second labeling point with a corresponding slope value smaller than a second threshold value in the interestingness curve, wherein the second labeling point is a point with a first slope value smaller than the second threshold value after the first labeling point in the interestingness curve;
and determining interest level marking information corresponding to the video sample according to the first marking point and the second marking point.
9. The method of claim 5, further comprising:
inputting the video sample into a preset interestingness prediction model;
obtaining interest prediction information which is output by the interest prediction model and corresponds to the video sample;
and correcting the interest degree marking information according to the interest degree prediction information.
10. The method of claim 5, further comprising:
inputting a video to be processed to the first model; obtaining interest degree prediction information which is output by the first model and corresponds to the video to be processed, wherein the interest degree prediction information is used for reflecting interest degrees corresponding to different video segments in the video to be processed;
intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information;
and outputting the target video clip.
11. The method according to claim 10, wherein the intercepting a target video segment satisfying an interest level requirement from the video to be processed according to the interest level prediction information comprises:
if the interest degrees corresponding to the continuous video segments are all larger than a set threshold value, or if the average value of the interest degrees corresponding to the continuous video segments is larger than the set threshold value, determining the video segments as target video segments.
12. The method of claim 10, wherein outputting the target video segment comprises:
and if the number of the target video clips is multiple, merging the multiple target video clips according to the time sequence of the multiple target video clips.
13. The method of claim 10, further comprising:
inputting the video to be processed into a second model;
obtaining a classification result output by the second model and corresponding to the video to be processed;
and if the classification result shows that the video to be processed is the video which is possibly interested by the user, inputting the video to be processed into the first model for processing.
14. The method of claim 13, further comprising:
acquiring an image sample and a video sample for training the second model, wherein the second model comprises a second backbone network and a second output network;
acquiring object type marking information corresponding to the image sample;
obtaining interest level marking information corresponding to the video sample, wherein the interest level marking information is used for reflecting whether a user is interested in the video sample;
inputting the image sample into the second backbone network, and training the second backbone network by taking the object class marking information as supervision information;
and inputting the video sample into a second model formed by a trained second backbone network and the second output network, and training the second output network by taking the interestingness marking information as supervision information.
15. A video processing apparatus, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a video to be processed containing a target video picture, and the target video picture corresponds to at least one of commodity information of a commodity, a using process of the commodity and logistics information of the commodity;
the input module is used for inputting the video to be processed into a first model so as to enable the first model to output interestingness prediction information corresponding to the video to be processed, wherein the interestingness prediction information is used for reflecting interestingness corresponding to different video clips in the video to be processed and whether the different video clips contain the target video picture;
the intercepting module is used for intercepting a target video segment meeting the interest degree requirement from the video to be processed according to the interest degree prediction information;
and the output module is used for outputting the target video clip.
16. A video processing apparatus, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a video to be processed containing a target video picture, and the target video picture corresponds to at least one of detail information, a use process, a motion posture and logistics information of a target object;
the input module is used for inputting the video to be processed into a first model so as to enable the first model to output an interest degree curve corresponding to the video to be processed, and the interest degree curve is used for reflecting interest degrees corresponding to different video clips in the video to be processed and whether the different video clips contain the target video pictures;
and the display module is used for displaying the interestingness curve.
17. A model training apparatus, comprising:
the system comprises a sample acquisition module, a first analysis module and a second analysis module, wherein the sample acquisition module is used for acquiring an image sample and a video sample for training a first model, and the first model comprises a first trunk network and a first output network;
the information acquisition module is used for acquiring object type marking information corresponding to the image sample; obtaining interest degree marking information corresponding to the video sample, wherein the interest degree marking information is used for reflecting the interest degree of a user on different video clips in the video sample;
the sample input module is used for inputting the image sample into the first backbone network, and training the first backbone network by taking the object class marking information as supervision information; and inputting the video sample into a first model formed by a trained first backbone network and the first output network, and training the first output network by taking the interestingness marking information as supervision information.
18. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the video processing method of claim 1 or 2.
19. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the video processing method of claim 3 or 4.
20. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the model training method of any one of claims 5 to 14.
21. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the video processing method of claim 1 or 2.
22. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the video processing method of claim 3 or 4.
23. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the model training method of any one of claims 5 to 14.
CN202010191630.7A 2020-03-18 2020-03-18 Video processing method, model training method, device, equipment and storage medium Pending CN113497977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010191630.7A CN113497977A (en) 2020-03-18 2020-03-18 Video processing method, model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010191630.7A CN113497977A (en) 2020-03-18 2020-03-18 Video processing method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113497977A true CN113497977A (en) 2021-10-12

Family

ID=77993397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010191630.7A Pending CN113497977A (en) 2020-03-18 2020-03-18 Video processing method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113497977A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018149175A1 (en) * 2017-02-20 2018-08-23 北京金山安全软件有限公司 Video-recording method and apparatus, and electronic device
CN107566907A (en) * 2017-09-20 2018-01-09 广东欧珀移动通信有限公司 video clipping method, device, storage medium and terminal
CN108288475A (en) * 2018-02-12 2018-07-17 成都睿码科技有限责任公司 A kind of sports video collection of choice specimens clipping method based on deep learning
CN108419145A (en) * 2018-05-04 2018-08-17 腾讯科技(深圳)有限公司 The generation method and device and computer readable storage medium of a kind of video frequency abstract
CN109462751A (en) * 2018-10-19 2019-03-12 北京奇艺世纪科技有限公司 The appraisal procedure and device of prediction model

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113992975A (en) * 2021-10-13 2022-01-28 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN113992975B (en) * 2021-10-13 2023-10-17 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN113923364A (en) * 2021-11-12 2022-01-11 成都唐米科技有限公司 Camera video recording method and camera
CN114915848A (en) * 2022-05-07 2022-08-16 上海哔哩哔哩科技有限公司 Live broadcast interaction method and device, anchor terminal, audience terminal and server terminal
CN114915848B (en) * 2022-05-07 2023-12-08 上海哔哩哔哩科技有限公司 Live interaction method, device and equipment

Similar Documents

Publication Publication Date Title
AU2017330571B2 (en) Machine learning models for identifying objects depicted in image or video data
US10073905B2 (en) Remote control and modification of live presentation
CN112055225B (en) Live broadcast video interception, commodity information generation and object information generation methods and devices
US11218777B2 (en) Method, device and system for processing bullet screen
EP4080794A1 (en) Systems and methods for assessing viewer engagement
US8605958B2 (en) Method and apparatus for generating meta data of content
CN109391834B (en) Playing processing method, device, equipment and storage medium
EP3629587A1 (en) Live video streaming services
CN113497977A (en) Video processing method, model training method, device, equipment and storage medium
CN111861572B (en) Advertisement putting method and device, electronic equipment and computer readable storage medium
CN112653902B (en) Speaker recognition method and device and electronic equipment
US10104429B2 (en) Methods and systems of dynamic content analysis
CN112351348A (en) Live broadcast interaction method and device, electronic equipment and storage medium
US20130185157A1 (en) Systems and methods for presentation and analysis of media content
CN107480265A (en) Data recommendation method, device, equipment and storage medium
CN105847976A (en) Method and device for skipping advertisement according to facial features
EP3026604A1 (en) Device and method of providing an advertising service
US20180242027A1 (en) System and method for perspective switching during video access
JP5574556B1 (en) Viewing program identification system, method and program
CN109168043B (en) Method, equipment and system for displaying recommendation information
CN109327736B (en) Program recommendation method based on program viewing expression and user information
CN112437332B (en) Playing method and device of target multimedia information
US20180307900A1 (en) Display Systems Using Facial Recognition for Viewership Monitoring Purposes
US10009572B2 (en) Method for enhancing media content of a picture
CN112995722A (en) Scene flow advertisement putting method and platform thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211012