CN115909122A - Video processing method and device, electronic equipment and readable storage medium

Video processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN115909122A
CN115909122A
Authority
CN
China
Prior art keywords
video
person
type
target
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211103425.6A
Other languages
Chinese (zh)
Inventor
吴庆双
周效军
李琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202211103425.6A
Publication of CN115909122A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video processing method and apparatus, an electronic device, and a readable storage medium, belonging to the field of computer technology. The video processing method of the embodiments of the application comprises the following steps: determining at least one person and at least one type of video object in a video clip, where each type of video object includes at least one object; determining a first relevance between each person and each object under each type of video object; determining a target person and a target object in the video clip according to the first relevance, where the target object comprises an object under each type of video object that is strongly associated with the target person; and extracting a key image from the video clip according to the target person and the target object. In this way, video content identification efficiency can be improved.

Description

Video processing method and device, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of computers, and particularly relates to a video processing method and device, an electronic device and a readable storage medium.
Background
In the prior art, video content identification mainly relies on a human annotator first watching the video content and then screening it to select the highlight/key content. Existing video content identification is therefore inefficient.
Disclosure of Invention
An embodiment of the present application provides a video processing method and apparatus, an electronic device, and a readable storage medium, so as to solve the problem of low efficiency in existing video content identification.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, a video processing method is provided, which is applied to an electronic device, and includes:
determining at least one person and at least one type of video object in the video clip, wherein there is at least one object under each type of video object;
determining a first association of each person with each type of object under each type of video object;
determining a target person and a target object in the video segment according to the first relevance, wherein the target object comprises an object which is under each type of video object and is strongly related to the target person;
and extracting key images from the video clips according to the target person and the target object.
In a second aspect, a video processing apparatus is provided, which is applied to an electronic device, and includes:
a first determining module, configured to determine at least one person and at least one type of video object in a video clip, wherein there is at least one object under each type of video object;
a second determining module for determining a first association of each person with each object under each type of video object;
a third determining module, configured to determine, according to the first relevance, a target person and a target object in the video segment, where the target object includes an object that is under each type of video object and is strongly associated with the target person;
and the extraction module is used for extracting key images from the video clips according to the target person and the target object.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In the embodiments of the application, at least one person and at least one type of video object in a video clip can be determined, where there is at least one object under each type of video object; a first relevance between each person and each object under each type of video object is determined; a target person and a target object in the video clip are determined according to the first relevance, the target object comprising an object under each type of video object that is strongly associated with the target person; and a key image is extracted from the video clip according to the target person and the target object. In this way, a key image capable of representing the video content can be extracted from the video clip by means of the target person and the objects strongly associated with that person, so that, compared with manual identification of video content, video content identification efficiency can be improved.
Drawings
Fig. 1 is a flowchart of a video processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an analysis of a video clip in an embodiment of the present application;
FIG. 3 is a schematic diagram of a display interface in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It should be appreciated that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be practiced in sequences other than those illustrated or described herein. Moreover, the terms "first", "second" and the like do not limit the number of elements; for example, a first element may be one element or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
The video processing method, the video processing apparatus, the electronic device, and the readable storage medium provided in the embodiments of the present application are described in detail with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Referring to fig. 1, fig. 1 is a flowchart of a video processing method provided in an embodiment of the present application, where the method is applied to an electronic device, and as shown in fig. 1, the method includes the following steps:
step 11: at least one person and at least one type of video object in the video segment are determined.
In this embodiment, there is at least one object under each type of video object. The at least one type of video object does not include people and may include, but is not limited to, scenes, actions, emotions, and the like. Taking the action type of video object as an example, there may be a plurality of objects under it, such as action 1, action 2, and action 3.
In some embodiments, the video segments may be obtained by splitting a complete video. For example, a video can be identified and split by a shot edge detection algorithm, where one shot corresponds to one video clip and a shot cut indicates that one video clip ends and the next begins; a set of segment start and end times for the video is finally generated. A video segment whose number of frames is lower than a preset threshold, such as 500 frames, may be excluded from the final segment set. Further, if the people, scenes, actions and emotions in a video segment are used as identification objects, then after a segment set {segment 1, segment 2, segment 3, ..., segment N} is obtained, a segment entity can be generated by traversing each segment with the person, action, scene and emotion recognition algorithms. As shown in fig. 2, the person, scene, action and emotion information contained in one video segment together form the corresponding segment entity, so the complete video is finally split into a set of segment entities, with multiple objects under each class of object. For example, video segment 1 contains persons X11~Xx1, actions X11~Xy1, scenes X11~Xz1 and emotions X11~Xh1, and video segment 3 contains persons X31~Xx3, actions X31~Xy3, scenes X31~Xz3 and emotions X31~Xh3. Because persons with very short screen time are generally not important, if a person appears in the video segment for a relatively short time, for example less than one thirtieth of the segment duration, that person can be ignored and excluded from subsequent analysis.
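As an illustrative sketch of such splitting (the OpenCV histogram-correlation method and the 0.5 threshold are assumptions, not the patent's prescribed shot edge detection algorithm), segments shorter than the frame threshold are dropped as described above:

import cv2

def split_into_segments(video_path, corr_threshold=0.5, min_frames=500):
    # Detect shot cuts by comparing grayscale histograms of consecutive
    # frames; a low correlation indicates the next segment has started.
    cap = cv2.VideoCapture(video_path)
    segments, start, prev_hist, idx = [], 0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and \
           cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < corr_threshold:
            if idx - start >= min_frames:  # drop too-short segments
                segments.append((start, idx - 1))
            start = idx
        prev_hist = hist
        idx += 1
    if idx - start >= min_frames:
        segments.append((start, idx - 1))
    cap.release()
    return segments  # list of (start_frame, end_frame) pairs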
In some embodiments, each video frame of a video clip may be identified to determine at least one person and at least one type of video object in the video clip.
Step 12: a first association of each person with each object under each type of video object is determined.
In this step, each object under each type of video object may be traversed in sequence for each person, and a relevance calculation performed, so as to obtain the relevance of each person to each object under each type of video object.
Step 13: and determining a target person and a target object in the video segment according to the first relevance, wherein the target object comprises an object which is under each type of video object and is strongly related to the target person.
Note that the target person is a main person in the video segment that satisfies a certain condition. Strong association may be understood as a correlation that exceeds a certain threshold or satisfies a certain condition.
Step 14: and extracting key images from the video clips according to the target person and the target object.
In this embodiment, the key image may represent video content of a corresponding video clip.
In some embodiments, a video frame containing both a target person and a target object in a video segment may be determined as a key image of the video segment, and the key image may be extracted from the video segment.
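For instance, a minimal sketch of this extraction, assuming each person and object has already been mapped to the set of frame numbers in which it was detected (the set representation is an assumption for illustration):

def key_frames(person_frames, object_frame_sets):
    # Key images are frames containing both the target person and
    # every target object: the intersection of all frame-number sets.
    frames = set(person_frames)
    for obj_frames in object_frame_sets:
        frames &= set(obj_frames)
    return sorted(frames)

For example, key_frames({600, 700, 800}, [{700, 800}, {800, 900}]) returns [800], so frame 800 would be extracted as a key image.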
The video processing method of the embodiment of the application can determine at least one person and at least one type of video object in a video clip, wherein each type of video object has at least one type of object, determine a first relevance between each person and each type of object in each type of video object, determine a target person and a target object in the video clip according to the first relevance, wherein the target object comprises an object which is in strong association with the target person in each type of video object, and extract a key image from the video clip according to the target person and the target object. Therefore, the key images capable of representing the video content can be extracted from the video clips by means of the target person and the object strongly associated with the target person, and therefore compared with the manual identification of the video content, the video content identification efficiency can be improved.
In the embodiment of the present application, the relevance calculation may be performed according to the content of the video frames in the video clip. The determining of the first relevance of each person to each object under each type of video object may include: first, for each person, extracting from the video clip a plurality of first video frames containing that person; then, for each object under each type of video object, extracting from the video clip a plurality of second video frames containing that object; and finally, determining the first relevance of each person to each object based on the extracted first video frames and second video frames. In this way, the first relevance of each person to each object is determined based on video frame content, so the association can be determined accurately.
It should be noted that the number of the first video frames and the number of the second video frames may be the same or different. The first video frame may be understood as a video frame containing a corresponding person. The second video frame may be understood as a video frame containing the corresponding object. For a video frame in a video clip, if the video frame contains a corresponding person and object, the video frame can be both a first video frame and a second video frame.
In some embodiments, n first video frames may be extracted from the video segment, where the value of n is determined based on actual requirements, for example, n is greater than 1/30 of the total number of video frames in the video segment. If the number of the video frames containing the corresponding characters in the video clip is larger than n, n first video frames can be extracted from the video clip randomly or according to a preset rule.
In other embodiments, n second video frames may be extracted from the video segment, where the value of n is determined based on actual requirements, for example, n is greater than 1/30 of the total number of video frames in the video segment. If the number of the video frames containing the corresponding objects in the video clip is larger than n, n second video frames can be extracted from the video clip randomly or according to a preset rule.
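A small sketch of one possible selection rule (the evenly-spaced rule is an assumption; the patent leaves the preset rule open):

import random

def sample_frames(frame_numbers, n, randomly=False):
    # Return all frames when n or fewer are available; otherwise pick n
    # of them either randomly or by an assumed evenly-spaced rule.
    frames = sorted(frame_numbers)
    if len(frames) <= n:
        return frames
    if randomly:
        return sorted(random.sample(frames, n))
    step = len(frames) / n
    return [frames[int(i * step)] for i in range(n)]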
Optionally, since there is a high possibility that the person and the object have a strong association when they are in the same video frame, the first association between each person and each object may be determined according to the frame numbers of the plurality of first video frames and the plurality of second video frames. The above process of determining the first association of each person with each object may comprise: firstly, calculating the relevance value between the frame numbers of a plurality of first video frames and the frame numbers of a plurality of second video frames; then, a first association of each person with each object is determined based on the association value. The smaller the relevance value between the frame numbers is, the closer the frame numbers of the corresponding first video frame and the second video frame are, and the greater the relevance between the corresponding person and the object is.
In some embodiments, the relevance value AV may be calculated using the following equation (1):
AV = (1/n) * Σ(i=1..n) |F_pi - F_qi|    (1)

where n represents the number of first video frames and of second video frames, that is, the same number of first and second video frames are extracted from the video clip; F_pi represents the frame number of the i-th first video frame of a first person, the first person being any one of the at least one person in the video segment; and F_qi represents the frame number of the i-th second video frame of a first object, the first object being any object under the at least one type of video object in the video clip.
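A direct sketch of equation (1), assuming the two frame-number lists have been sampled to the same length n as described above:

def association_value(f_p, f_q):
    # AV of equation (1): mean absolute frame-number difference.
    # A smaller AV means the person and the object tend to appear in
    # nearby frames, i.e. a stronger first relevance.
    assert len(f_p) == len(f_q), "equation (1) assumes equal counts"
    n = len(f_p)
    return sum(abs(p - q) for p, q in zip(f_p, f_q)) / n

For example, association_value([600, 700, 800], [610, 690, 805]) returns about 8.3, a small value indicating a strong person-object association.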
Optionally, when determining the target person and the target object in the video segment according to the first relevance, first, for each type of video object, according to the first relevance, selecting multiple groups of association pairs of persons and objects with the largest first relevance; then, according to the multiple groups of association pairs of the characters and the objects, the target characters and the target objects are determined. The number of groups of the selected association pairs may be determined based on actual requirements, and is not limited thereto. In this way, the target person and the target object can be determined based on the persons and the objects with the greater relevance, so that the accuracy of the determined target person and the determined target object is improved.
In some embodiments, a second person may be determined as the target person, the second person being a person associated with each type of video object in the associated pairs of groups of persons and objects, and a second object associated with the target person in the associated pairs of groups of persons and objects may be determined as the target object. That is, a person associated with each type of video object in the associated pairs of groups of persons and objects is determined as a target person, and an object associated with the target person in the associated pairs of groups of persons and objects is determined as a target object.
Optionally, in order to extract the key image in the video segment more accurately, in a case that at least one type of video objects in the video segment is multiple types of video objects, before determining a target character and a target object in the video segment according to the first relevance, a second relevance between each type of object under the first type of video object and each type of object under the second type of video object may be determined, where the first type of video object is one type of video objects in the multiple types of video objects, and the second type of video object is one type of video objects except the first type of video objects in the multiple types of video objects; then, the target person and the target object in the video clip are determined according to the first relevance and the second relevance. Therefore, the target person and the target object in the video clip can be determined based on the relevance between the person and the object and the relevance between different types of objects, so that the target person and the target object can be determined more accurately, and the key image in the video clip can be extracted more accurately.
It should be noted that for the calculation of the association between the objects of different types, the association between the objects of different types may be determined in the same manner as the above calculation of the association between the people and the objects, for example, according to the frame numbers of the video frames containing the objects of different types. When determining the target person and the target object in the video clip according to the first relevance and the second relevance, the association pair of the multiple groups of persons and objects with the largest first relevance is selected for each type of video object according to the first relevance, then the persons associated with each type of video object in the association pairs of the multiple groups of persons and objects are determined as the target person, and the objects associated with the target person and mutually associated in the association pairs of the multiple groups of persons and objects are determined as the target object according to the second relevance. The determination may be based on actual demand, and is not limited thereto.
Optionally, the determining at least one person and at least one type of video object in the video segment may include: first a summary of the video segment is obtained, which can be understood as the main content of the video segment, and then at least one person and at least one type of video object in the video segment are determined based on the summary. Therefore, important objects in the video clip can be determined based on the outline of the video clip, so that correlation calculation of unimportant people and objects in the video clip is avoided, and key images are extracted efficiently.
In some embodiments, after the full video is obtained, the video may be split and a summary of each video segment obtained. For example, the scenes, persons, actions and emotions contained in a video clip can be detected by scene, person, action and emotion detection algorithms, and a summary of the video clip, that is, its main and core content, can be generated based on syntax by natural language processing (NLP). If multiple sentences are generated from the detection results of a video clip, the summary of the video clip can be screened out from the multiple sentences through the following steps:
s1: the complete subtitles of the video or the audio are extracted into characters, and the extracted content is used as the basis for subsequent judgment, which is hereinafter referred to as a reference data set. For example, if the video is a movie, the lines of the entire movie are extracted; if the video is a television series, the lines of all episodes are extracted.
S2: since a sentence serving as a summary mainly involves four kinds of objects, namely persons, scenes, actions and emotions, the probability of two adjacent words appearing together in the original video is determined empirically. Take the following sentences as examples: (1) Xiaoming plays football on the playground; (2) Xiaoming laughs on the playground. Let P(Xiaoming) denote the probability of the word "Xiaoming" appearing in the reference data set, P(on the playground | Xiaoming) the probability that "on the playground" follows "Xiaoming", and P(plays football | on the playground) the probability that "plays football" follows "on the playground". The probability of sentence (1) appearing in the reference data set is then: P((1)) = P(Xiaoming) × P(on the playground | Xiaoming) × P(plays football | on the playground). The probability of sentence (2) appearing in the reference data set is computed in the same way. The rationality probability P(R) of a candidate sentence with respect to the original video is therefore:
P(R) = P(1) × P(2|1) × P(3|2) × P(4|3)
where P(1) is the probability of the first word of the sentence, P(2|1) is the probability of the second word following the first word, P(3|2) is the probability of the third word following the second word, and so on; multiplying these together yields the rationality probability P(R).
If the rationality probabilities of the multiple sentences are close to one another (for example, all exceed 90%), they cannot be used as a screening criterion; if the rationality probabilities differ greatly, the sentence with the largest probability can be used directly as the summary of the video clip.
S3: if the rationality probabilities of the multiple sentences are close (for example, all exceed 90%), the action duration and emotion duration in the sentences can be compared; for example, if "playing football" lasts 20 seconds while "laughing happily" lasts only 3 seconds, the sentence describing the longer-lasting content, i.e., playing football, is selected as the summary of the video clip. In this way, the summaries of the video clips are generated in combination with natural language processing, using the video's dialogue text as the judgment corpus, so that the generated summaries fit the video content better and user experience is improved.
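As a minimal sketch of this rationality computation (the bigram-count estimation and the tokenized word lists are assumptions; the patent only specifies the chain of word probabilities over the reference data set):

from collections import Counter

def rationality_probability(sentence, reference):
    # P(R) = P(w1) * P(w2|w1) * P(w3|w2) * ..., estimated from unigram
    # and bigram counts over the reference data set (subtitle/line text).
    unigrams = Counter(reference)
    bigrams = Counter(zip(reference, reference[1:]))
    if unigrams[sentence[0]] == 0:
        return 0.0
    p = unigrams[sentence[0]] / len(reference)
    for w1, w2 in zip(sentence, sentence[1:]):
        if unigrams[w1] == 0:
            return 0.0
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

Candidate sentences are tokenized into word lists, e.g. rationality_probability(["Xiaoming", "on the playground", "plays football"], reference_words), and the sentence with the clearly largest P(R) is taken as the summary.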
Further, after the summaries of all the video segments are obtained, the start and end times of the video segment corresponding to each summary can be determined.
For example, if the number of shots in a 90-minute movie reaches about 1000, summaries of about 1000 video segments can be generated, and a content number can be set for each summary, so as to obtain a data set [ content number, summary, start time, and end time ], where a corresponding summary can be determined according to the content number, and a corresponding content number can be determined according to the summary.
For example, let the start and end frame numbers of a certain video segment be SF_ID and EF_ID. Traversing the segment with the person, scene, action and emotion recognition algorithms yields the person detection result set DS_P = {F_p1, F_p2, ..., F_pn}, i.e., the set of first video frames; the scene detection result set DS_S = {F_s1, F_s2, ..., F_sn}, i.e., the set of scene-related second video frames; the action detection result set DS_A = {F_a1, F_a2, ..., F_an}, i.e., the set of action-related second video frames; and the emotion detection result set DS_E = {F_e1, F_e2, ..., F_en}, i.e., the set of emotion-related second video frames. Each frame number F in these sets satisfies:

SF ≤ F ≤ EF

where EF denotes the end frame number of the video segment and SF denotes its start frame number. For example, if the frame number range of the video segment is [100, 1000], then SF = 100 and EF = 1000; and if person 1 is detected in frames 600, 700 and 800 of the video clip, then DS_P1 = {600, 700, 800}. Finally, all detection result sets are summarized:
If the video clip contains x persons, the person detection result sets are:

DS_P1 = {F_p11, ..., F_p1n}, DS_P2 = {F_p21, ..., F_p2n}, ..., DS_Px = {F_px1, ..., F_pxn}

If the video clip contains z scenes, the scene detection result sets are:

DS_S1 = {F_s11, ..., F_s1n}, ..., DS_Sz = {F_sz1, ..., F_szn}

If the video clip contains y actions, the action detection result sets are:

DS_A1 = {F_a11, ..., F_a1n}, ..., DS_Ay = {F_ay1, ..., F_ayn}

If the video clip contains h emotions, the emotion detection result sets are:

DS_E1 = {F_e11, ..., F_e1n}, ..., DS_Eh = {F_eh1, ..., F_ehn}
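As an illustrative sketch of this accumulation step (recognize_frame is a hypothetical stand-in for the person, scene, action and emotion recognition algorithms; the patent does not prescribe this interface), the per-label result sets above can be built in a single traversal of the segment:

def build_detection_sets(recognize_frame, sf, ef):
    # recognize_frame(f) returns the labels detected in frame f,
    # e.g. ["P1", "A3", "S1"]; f ranges over SF..EF of the segment.
    sets = {}
    for f in range(sf, ef + 1):
        for label in recognize_frame(f):
            sets.setdefault(label, set()).add(f)
    return sets  # e.g. {"P1": {600, 700, 800}, "A3": {...}, ...}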
further, the relevance calculation may be performed separately for each detection result set of the person and the detection result set of each object under each type of video object. For example, with the detection result set DS _ P1 of the person 1: { F p1 ,F p2 ,……,F pn And detection result set DS _ A1 of action 1: { F a1 ,F a2 ,……,F an For example, the following formula (2) can be used to calculate the association value AV of the person 1 and the action 1:
Figure BDA0003840308700000095
further, for x persons in the video segment, for each person's detection result set, performing relevance calculation with the detection result sets of all actions respectively to obtain the following relevance value sets:
Figure BDA0003840308700000101
it is understood that the above-mentioned correlation calculation process is also applicable to the correlation calculation of the character and the scene or the emotion, and will not be described herein again.
Further, the relevance values of persons and actions, persons and scenes, and persons and emotions may each be sorted from small to large, and the first three association pairs taken, that is, the three pairs with the strongest relevance. For example, suppose the three strongest person-action pairs are AV(P1, A1), AV(P1, A2) and AV(P2, A3); the three strongest person-scene pairs are AV(P2, S1), AV(P3, S1) and AV(P3, S2); and the three strongest person-emotion pairs are AV(P1, E2), AV(P2, E2) and AV(P3, E3). In the obtained association pairs, P2 is associated with each type of object, that is, strongly associated with A3, S1 and E2, so the person represented by P2 can be determined as the target person, and the action represented by A3, the scene represented by S1 and the emotion represented by E2 can be determined as the target objects. The frame numbers contained simultaneously in the four sets DS_P2, DS_A3, DS_S1 and DS_E2 can then be extracted, and the corresponding frames taken as key images. In addition, the relevance (AV) calculation can be carried out pairwise among DS_A3, DS_S1 and DS_E2, and the target objects can be selected based on that relevance result.
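A sketch of this selection step, assuming AV tables keyed by (person, object) pairs and using the convention above that a smaller AV means stronger relevance:

def strongest_pairs(av_table, k=3):
    # av_table: {(person, obj): AV}; ascending sort puts the smallest
    # AV values, i.e. the strongest associations, first.
    return sorted(av_table.items(), key=lambda item: item[1])[:k]

def pick_target_person(*pair_lists):
    # The target person is one appearing in the strongest pairs of every
    # object type (actions, scenes, emotions) -- P2 in the example above.
    per_type = [{person for (person, _), _ in pairs} for pairs in pair_lists]
    common = set.intersection(*per_type)
    return next(iter(common), None)

For the example above, pick_target_person(strongest_pairs(person_action_av), strongest_pairs(person_scene_av), strongest_pairs(person_emotion_av)) would return P2, and P2's partner objects in each list (A3, S1, E2) are then the target objects.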
Further, if the person, scene, action and emotion in the video segment are extracted based on the summary of the video segment, after the target person P2 and the target objects A3, S1 and E2 are determined, the summary of the video segment may be verified, that is, whether the summary includes P2, A3, S1 and E2 may be verified, and if so, the summary is determined to be valid. In this way, the summary of the video clip is verified by the relevance calculation result, and the reliability of the summary result can be improved.
In the embodiment of the application, after the key images are extracted from the video clips, the association relationship between the key images and the video clips can be established, so that the corresponding video clips are determined based on the key images.
Optionally, after the association relationship between the key image and the video clip is established, the key image may be displayed on a video playing interface, and when an operation of a user on the key image is received, the video clip associated with the key image is played. Therefore, when browsing the video frames, the user can click a frame of interest to jump to the corresponding video clip for playing, and can adjust the playing progress to watch the preceding and following segments in the original video, so that the user's need is located accurately and the viewing experience is improved.
In some embodiments, when the key image is displayed on the video playing interface, information such as the summary, content number, and start time of the corresponding video clip may also be displayed for the user to view, so that the video content of interest to the user is quickly located, ensuring the user's viewing experience.
In some embodiments, the front-end presentation interface may be as shown in FIG. 3. The summaries of the obtained video clips are arranged from top to bottom in time order and displayed below the video, and each summary line carries a display image (i.e., a key image) of the corresponding video clip. This display mode may be called a "quick view field", an interface for quickly browsing a film or television work. When the user enters the playing page of a video, the quick view interface gradually attracts the user to continue watching. Each display image is a highlight picture of a video clip screened out by the key image selection algorithm; together they can showcase the production quality of the video in all respects, including costumes, makeup, props, special effects, cast and the like, quickly establishing a good impression of the production in the user's mind and increasing the likelihood of continued viewing. When the user looks at a display image and wants to understand the video content further, reading the summaries from top to bottom gives a simple picture of how the plot develops. With fast-paced short video on the rise, this presentation lets users get into the plot faster and better meets their fast-paced needs. If the user is interested in a certain video clip, the user can directly click the corresponding display image or summary to jump the video to the start time of that clip and begin playing.
In addition, since most users follow a series or a movie after learning about it through publicity elsewhere, and such publicity introduces the key characters or key content, an input box can be provided on the quick view field for users to enter search words and obtain the corresponding videos. For example, when a user enters XXX, all data containing the keyword "XXX" can be searched and presented from top to bottom in time order. The user can then click the content or display image in the quick view field to jump directly to the video clip the user wants to see. Further, to improve the user experience, animated images, short video clips and the like can also be displayed on the quick view field.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application, where the apparatus is applied to an electronic device, and as shown in fig. 4, a video processing apparatus 40 includes:
a first determining module 41, configured to determine at least one person and at least one type of video object in a video segment, where each type of video object has at least one type of object below it;
a second determining module 42 for determining a first association of each person with each object under each class of video objects;
a third determining module 43, configured to determine, according to the first relevance, a target person and a target object in the video segment, where the target object includes an object under each type of video object that is strongly associated with the target person;
and the extraction module 44 is configured to extract a key image from the video clip according to the target person and the target object.
Optionally, the second determining module 42 includes:
a first extraction unit configured to extract, for each person, a plurality of first video frames containing the person from the video clip;
a second extracting unit, configured to extract, for each object under each type of video object, a plurality of second video frames containing each object from the video clips;
a first determining unit configured to determine a first association of the each person with the each object according to the plurality of first video frames and the plurality of second video frames.
Optionally, the first determining unit is specifically configured to: calculate an association value between the frame numbers of the plurality of first video frames and the frame numbers of the plurality of second video frames; and determine the first relevance of each person to each object according to the association value.
Optionally, the first determining unit is specifically configured to:
calculate the relevance value AV using the following formula:

AV = (1/n) * Σ(i=1..n) |F_pi - F_qi|

wherein n represents the number of the first video frames and of the second video frames, F_pi represents the frame number of the i-th first video frame of a first person, the first person being any one of the at least one person, and F_qi represents the frame number of the i-th second video frame of a first object, the first object being any object under the at least one type of video object.
Optionally, the third determining module 43 includes:
a selecting unit, configured to select, for each type of video object according to the first relevance, multiple groups of person-object association pairs with the largest first relevance;
and the second determining unit is used for determining the target person and the target object according to the plurality of groups of association pairs of persons and objects.
Optionally, the second determining unit is specifically configured to: determining a second person as the target person and a second object as the target object; the second person is a person in the associated pair of the plurality of groups of persons and objects associated with each type of video object, and the second object is an object in the associated pair of the plurality of groups of persons and objects associated with the target person.
Optionally, in a case that the at least one type of video object is a plurality of types of video objects, the third determining module 43 is further configured to: determining a second relevance between each object under a first type of video object and each object under a second type of video object, wherein the first type of video object is one type of video objects in the multiple types of video objects, and the second type of video object is one type of video objects except the first type of video objects in the multiple types of video objects; and determining a target person and a target object in the video clip according to the first relevance and the second relevance.
Optionally, the extracting module 44 is specifically configured to: and determining the video frames which simultaneously contain the target person and the target object in the video clips as the key images, and extracting the key images.
Optionally, the first determining module 41 is specifically configured to: obtaining a summary of the video clip; determining the at least one person and the at least one class of video objects based on the summary.
Optionally, the video processing apparatus 40 further includes:
and the establishing module is used for establishing an association relation between the key image and the video clip.
Optionally, the video processing apparatus 40 further includes:
the display module is used for displaying the key image on a video playing interface;
and the playing module is used for playing the video clip when receiving the operation of the user on the key image.
It can be understood that the video processing apparatus 40 according to the embodiment of the present application can implement the processes of the video processing method embodiment shown in fig. 1, and can achieve the same technical effects, and for avoiding repetition, the details are not repeated here.
It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a processor-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In addition, referring to fig. 5, an electronic device 50 is further provided in the embodiment of the present application, which includes a processor 51, a memory 52, and a program or an instruction stored in the memory 52 and executable on the processor 51, and when the program or the instruction is executed by the processor 51, the process of the embodiment of the video processing method is implemented, and the same technical effect can be achieved, and details are not repeated here to avoid repetition.
The embodiments of the present application further provide a readable storage medium, on which a program or an instruction is stored, where the program or the instruction, when executed by a processor, can implement each process of the above-mentioned video processing method embodiments and achieve the same technical effect, and in order to avoid repetition, the description is omitted here.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present application.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (13)

1. A video processing method applied to an electronic device is characterized by comprising the following steps:
determining at least one person and at least one type of video object in the video clip, wherein at least one object is arranged under each type of video object;
determining a first association of each person with each type of object under each type of video object;
determining a target person and a target object in the video clip according to the first relevance, wherein the target object comprises an object which is under each type of video object and is strongly related to the target person;
and extracting key images from the video clips according to the target person and the target object.
2. The method of claim 1, wherein determining a first association of each person with each type of object under each type of video object comprises:
extracting, for each of the persons, a plurality of first video frames containing the each of the persons from the video clips;
for each object under each type of video object, extracting a plurality of second video frames containing each object from the video clips;
determining a first association of the each person with the each object based on the plurality of first video frames and the plurality of second video frames.
3. The method of claim 2, wherein said determining a first association of said each person with said each object based on said plurality of first video frames and said plurality of second video frames comprises:
calculating correlation values between the frame numbers of the plurality of first video frames and the frame numbers of the plurality of second video frames;
and determining the first relevance of each person and each object according to the relevance value.
4. The method of claim 3, wherein the calculating the correlation value between the frame numbers of the first plurality of video frames and the frame numbers of the second plurality of video frames comprises:
calculating the relevance value AV by adopting the following formula:
AV = (1/n) * Σ(i=1..n) |F_pi - F_qi|
wherein n represents the number of the first video frame and the second video frame, F pi A frame number of an i-th first video frame representing a first person, the first person being any one of the at least one person, F qi And representing the frame number of the ith second video frame of the first object, wherein the first object is any object under the at least one type of video object.
5. The method of any of claims 1 to 4, wherein determining the target person and the target object in the video segment according to the first relevance comprises:
according to the first relevance, aiming at each type of video object, selecting a plurality of groups of people and object relevance pairs with the highest first relevance;
and determining the target person and the target object according to the plurality of groups of association pairs of persons and objects.
6. The method of claim 5, wherein determining the target person and the target object based on the plurality of associated pairs of persons and objects comprises:
determining a second person as the target person and a second object as the target object; the second person is a person in the associated pair of the plurality of groups of persons and objects associated with each type of video object, and the second object is an object in the associated pair of the plurality of groups of persons and objects associated with the target person.
7. The method of claim 1, wherein in the case that the at least one type of video object is a plurality of types of video objects, before the determining the target person and the target object in the video segment, the method further comprises:
determining a second relevance between each object under a first type of video object and each object under a second type of video object, wherein the first type of video object is one type of video objects in the multiple types of video objects, and the second type of video object is one type of video objects except the first type of video objects in the multiple types of video objects;
wherein the determining the target person and the target object in the video segment according to the first relevance comprises:
and determining a target person and a target object in the video clip according to the first relevance and the second relevance.
8. The method of claim 1, wherein the extracting key images from the video clips according to the target person and the target object comprises:
and determining the video frames which simultaneously contain the target person and the target object in the video clip as the key images, and extracting the key images.
9. The method of claim 1, wherein determining at least one person and at least one type of video object in the video segment comprises:
obtaining a summary of the video clip;
determining the at least one person and the at least one class of video objects based on the summary.
10. The method of claim 1, further comprising:
establishing an association relation between the key image and the video clip;
displaying the key image on a video playing interface; and
when receiving an operation of a user on the key image, playing the video clip.
11. A video processing apparatus applied to an electronic device, comprising:
a first determining module, configured to determine at least one person and at least one type of video object in a video clip, wherein there is at least one object under each type of video object;
a second determining module for determining a first association of each person with each object under each type of video object;
a third determining module, configured to determine a target person and a target object in the video segment according to the first relevance, where the target object includes an object under each type of video object that is strongly associated with the target person;
and the extraction module is used for extracting key images from the video clips according to the target person and the target object.
12. An electronic device, comprising a processor, a memory and a program or instructions stored on the memory and executable on the processor, which program or instructions, when executed by the processor, implement the steps of the video processing method according to any one of claims 1 to 10.
13. A readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the video processing method according to any one of claims 1 to 10.
CN202211103425.6A 2022-09-09 2022-09-09 Video processing method and device, electronic equipment and readable storage medium Pending CN115909122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211103425.6A CN115909122A (en) 2022-09-09 2022-09-09 Video processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211103425.6A CN115909122A (en) 2022-09-09 2022-09-09 Video processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115909122A 2023-04-04

Family

ID=86469872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211103425.6A Pending CN115909122A (en) 2022-09-09 2022-09-09 Video processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115909122A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination