Disclosure of Invention
In order to overcome the above technical problems, the present invention aims to provide a video annotation method, a device thereof, and a server, which solve the technical problem that the prior art cannot automatically annotate video segments in a specific type of video scene.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a video annotation method, which includes the following steps:
acquiring video transformation identifiers among different video scenes;
performing scene clustering on each video clip according to the video transformation identifier;
video segments belonging to the same video scene are labeled.
Optionally, the video transformation identifier is a video transformation time point, then: the acquiring of the video transformation time points between different video scenes specifically includes:
segmenting the video frame by frame;
judging whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene or not;
if the two frames belong to video clips of different scenes, replacing the video frame in the buffer area with the current video frame, and storing the time point of the video scene change; or,
if the video clips belong to the same scene, the current state of the video frames in the buffer area is maintained.
Optionally, the determining whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene specifically includes:
respectively extracting a first luminance histogram of the current video frame and a second luminance histogram of the video frame in the buffer area;
judging whether the correlation between the first luminance histogram and the second luminance histogram is greater than a first preset threshold;
if the correlation is greater than the first preset threshold, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or,
if the correlation is less than the first preset threshold, the current video frame and the video frame in the buffer area belong to video clips of different scenes.
Optionally, the determining whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene specifically includes:
respectively extracting a first discrete cosine transform component of a current video frame and a second discrete cosine transform component of the video frame in a buffer area;
judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene or not according to the relation between the first discrete cosine transform component and the second discrete cosine transform component;
if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a high-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or,
if the second discrete cosine transform component is a high-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
and if the second discrete cosine transform component is a high-frequency component and the first discrete cosine transform component is a high-frequency component, the current video frame and the video frame in the buffer area belong to a video clip of the same scene.
Optionally, the determining whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene specifically includes:
respectively extracting a first motion vector distribution of a current video frame and a second motion vector distribution of the video frame in a buffer area;
comparing the first motion vector distribution and the second motion vector distribution;
judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene or not according to the comparison result;
if the first motion vector distribution is different from the second motion vector distribution, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
and if the first motion vector distribution is the same as the second motion vector distribution, the current video frame and the video frame in the buffer area belong to the video segment of the same scene.
Optionally, the determining whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene specifically includes:
calculating the pixel difference value between the current video frame and the video frame in the buffer area;
judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene or not according to the pixel difference value;
if the pixel difference value is larger than a second preset threshold value, the current video frame and the video frame in the buffer area belong to video clips of different scenes; or,
and if the pixel difference value is smaller than a second preset threshold value, the current video frame and the video frame in the buffer area belong to the video clip of the same scene.
Optionally, the video transformation identifier is a video transformation time point, then: the scene clustering performed on each video clip according to the video transformation time point specifically includes:
dividing an input video into a plurality of video segments according to the video transformation time point;
performing time domain alignment processing on the plurality of video segments;
composing a four-dimensional tensor from the plurality of video segments; wherein the four dimensions respectively represent the height, width and length of a video clip and the index of the video clip;
performing high-order singular value decomposition processing on the four-dimensional tensor to obtain the feature vector of each video clip;
and performing sparse subspace clustering processing on the video clips according to the feature vectors to obtain the same video scene after scene clustering.
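As a sketch, the clustering steps above can be written in tensor notation (assuming, for illustration, that after time-domain alignment every clip shares a common height H, width W and length L, and that N is the number of clips):

```latex
% N aligned clips stacked into a four-dimensional tensor
\mathcal{T} \in \mathbb{R}^{H \times W \times L \times N}

% higher-order singular value decomposition (Tucker form)
\mathcal{T} \approx \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)}
```

Under this reading, the n-th row of the mode-4 factor matrix U^{(4)} is a natural candidate for the feature vector of the n-th clip that is then passed to sparse subspace clustering; the text does not fix the exact construction, so this correspondence is an assumption.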
Optionally, the same video scene comprises a player close-up scene, then: the labeling of video segments belonging to the player close-up scene specifically includes:
when facial features of the player can be extracted, detecting the face of the player by using a cascade detector based on Haar features;
matching the name of the player by using a convolutional network based on a deep learning architecture, according to the detected face of the player;
marking the basic information of the player in a video segment of the player close-up scene;
or,
when facial features of the player cannot be extracted, detecting the shirt number of the player by using an optical character recognition system;
matching the name of the player by using a convolutional network based on a deep learning architecture, according to the detected shirt number of the player;
and marking the basic information of the player in the video segment of the player close-up scene.
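The face-first, shirt-number-fallback logic above can be sketched as plain control flow. This is a minimal illustration only: the callables (`detect_face`, `detect_shirt_number`, `name_from_face`, `name_from_number`) are hypothetical stand-ins for the Haar cascade detector, the OCR system, and the deep-learning matching network named in the text.

```python
from typing import Callable, Optional


def annotate_player_closeup(
    frame,
    detect_face: Callable,          # hypothetical: returns face features or None
    detect_shirt_number: Callable,  # hypothetical: OCR on the jersey, returns str or None
    name_from_face: Callable,       # hypothetical: matches a name from face features
    name_from_number: Callable,     # hypothetical: matches a name from a shirt number
) -> Optional[str]:
    """Try the face first; fall back to the shirt number when no face is found."""
    face = detect_face(frame)
    if face is not None:
        return name_from_face(face)
    number = detect_shirt_number(frame)
    if number is not None:
        return name_from_number(number)
    return None  # neither face nor shirt number could be extracted
```

In practice each callable would wrap a trained model; here they are injected as parameters so that the fallback order itself can be exercised in isolation.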
Optionally, the same video scene comprises a court panorama or player tracking scene, then: the labeling of video segments belonging to the court panorama or player tracking scene specifically includes:
tracking the player using a gradient tracker;
storing the movement track of the player;
and marking the motion trail of the player.
Optionally, the labeling of video segments belonging to the same video scene further includes:
detecting information in the statistics box by a feature extractor based on local binary patterns;
extracting the information in the statistics box;
identifying the information in the statistics box using an optical character recognition system;
and labeling the information in the statistics box on a fast-forward prompt bar of the display screen playing the video clip.
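As background for the feature extractor mentioned above, here is a minimal sketch of the basic 8-neighbour local binary pattern code; the text does not specify which LBP variant is used, so this is the textbook formulation and the bit ordering is an assumption.

```python
def lbp_code(img, x, y):
    """8-neighbour local binary pattern code of pixel (x, y) in a grayscale image.

    `img` is a list of rows of pixel intensities. Each neighbour at least as
    bright as the centre contributes a 1 bit; the 8 bits form a code in 0..255.
    """
    center = img[y][x]
    # clockwise neighbour offsets, starting top-left
    neighbours = [(-1, -1), (0, -1), (1, -1), (1, 0),
                  (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(neighbours):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code
```

A real extractor would histogram these codes over the statistics-box region and feed the histogram to a classifier.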
Optionally, after the labeling of the video segments belonging to the same video scene, the method further includes:
according to a known video segment, video segments similar to the known video segment are matched in the playing video.
In a second aspect, an embodiment of the present invention provides a video annotation device, which includes:
the acquisition module is used for acquiring video transformation identifiers among different video scenes;
the scene clustering module is used for performing scene clustering on each video clip according to the video transformation identifier;
and the marking module is used for marking the video clips belonging to the same video scene.
Optionally, the obtaining module includes:
a first segmentation unit for segmenting the video frame by frame; wherein the video transformation identifier is a video transformation time point;
the judging unit is used for judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene or not;
if the two frames belong to video clips of different scenes, replacing the video frame in the buffer area with the current video frame, and storing the time point of the video scene change; or,
if the video clips belong to the same scene, the current state of the video frames in the buffer area is maintained.
Optionally, the determining unit includes:
the first extraction subunit is used for respectively extracting a first luminance histogram of the current video frame and a second luminance histogram of the video frame in the buffer area;
the first judgment subunit is configured to judge whether the correlation between the first luminance histogram and the second luminance histogram is greater than a first preset threshold;
if the correlation is greater than the first preset threshold, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or,
if the correlation is less than the first preset threshold, the current video frame and the video frame in the buffer area belong to video clips of different scenes.
Optionally, the determining unit includes:
the second extraction subunit is used for respectively extracting the first discrete cosine transform component of the current video frame and the second discrete cosine transform component of the video frame in the buffer area;
the second judgment subunit is configured to judge whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene according to a relationship between the first discrete cosine transform component and the second discrete cosine transform component;
if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a high-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or,
if the second discrete cosine transform component is a high-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
and if the second discrete cosine transform component is a high-frequency component and the first discrete cosine transform component is a high-frequency component, the current video frame and the video frame in the buffer area belong to a video clip of the same scene.
Optionally, the determining unit includes:
the third extraction subunit is used for respectively extracting the first motion vector distribution of the current video frame and the second motion vector distribution of the video frame in the buffer area;
a comparison subunit configured to compare the first motion vector distribution and the second motion vector distribution;
the third judging subunit is used for judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene or not according to the comparison result;
if the first motion vector distribution is different from the second motion vector distribution, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
and if the first motion vector distribution is the same as the second motion vector distribution, the current video frame and the video frame in the buffer area belong to the video segment of the same scene.
Optionally, the determining unit includes:
the calculating subunit is used for calculating the pixel difference value between the current video frame and the video frame in the buffer area;
the fourth judging subunit is configured to judge whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene according to the pixel difference value;
if the pixel difference value is larger than a second preset threshold value, the current video frame and the video frame in the buffer area belong to video clips of different scenes; or,
and if the pixel difference value is smaller than a second preset threshold value, the current video frame and the video frame in the buffer area belong to the video clip of the same scene.
Optionally, the scene clustering module includes:
the second segmentation unit is used for dividing the input video into a plurality of video segments according to the video transformation time point; wherein the video transformation identifier is a video transformation time point;
the time domain alignment unit is used for performing time domain alignment processing on the plurality of video segments;
the composing unit is used for composing a four-dimensional tensor from the video clips; wherein the four dimensions respectively represent the height, width and length of a video clip and the index of the video clip;
the high-order singular value decomposition unit is used for performing high-order singular value decomposition processing on the four-dimensional tensor to obtain the feature vector of each video segment;
and the sparse subspace clustering unit is used for performing sparse subspace clustering processing on each video clip according to the feature vectors to obtain the same video scene after scene clustering.
Optionally, the same video scene comprises a player close-up scene, then: the labeling module comprises:
a first detection unit for detecting the face of the player using a cascade detector based on Haar features when facial features of the player can be extracted;
a first matching unit for matching the name of the player using a convolutional network based on a deep learning architecture, according to the detected face of the player;
a first labeling unit for labeling the basic information of the player in a video segment of the player close-up scene;
or,
the second detection unit is used for detecting the shirt number of the player using an optical character recognition system when facial features of the player cannot be extracted;
the second matching unit is used for matching the name of the player using a convolutional network based on a deep learning architecture, according to the detected shirt number of the player;
and the second labeling unit is used for labeling the basic information of the player in the video segment of the player close-up scene.
Optionally, the same video scene comprises a court panorama or player tracking scene, then: the labeling module comprises:
a tracking unit for tracking the player using a gradient tracker;
the storage unit is used for storing the motion trail of the player;
and the third marking unit is used for marking the motion trail of the player.
Optionally, the labeling module includes:
a third detection unit for detecting information in the statistics box by a feature extractor based on local binary patterns;
a fourth extraction unit for extracting the information in the statistics box;
a recognition unit for identifying the information in the statistics box using an optical character recognition system;
and a fourth labeling unit for labeling the information in the statistics box on a fast-forward prompt bar of the display screen playing the video clip.
Optionally, the apparatus further includes a third matching unit, configured to match a video segment similar to a known video segment in the playing video according to the known video segment.
In a third aspect, an embodiment of the present invention provides a server, including:
the communication unit is used for communicating with the intelligent terminal;
and the processor is used for acquiring video transformation identifiers among different video scenes, performing scene clustering on each video clip according to the video transformation identifiers, and labeling the video clips belonging to the same video scene.
In the embodiment of the invention, the video transformation identifiers among different video scenes are acquired, scene clustering is performed on each video clip according to the video transformation identifiers, and the video clips belonging to the same video scene are labeled, thereby solving the technical problem that the prior art cannot automatically annotate video clips in a specific type of video scene.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The video annotation method of this embodiment can be applied in many fields. For example, it can be applied to broadcast videos of game events in the field of sports. With slight modification, or with a certain step equivalently replaced by a person of ordinary skill in the art, the method can also be applied to variety-show videos, or, depending on the implementing subject, to documentary videos. In this embodiment, however, the specific implementation of the method is described as applied to the field of sports. Further, the method of this embodiment is described for ball games, which include baseball, football, basketball, table tennis, volleyball, and other ball games.
Referring to fig. 1, fig. 1 is a schematic view of an implementation scenario of a video annotation method according to an embodiment of the present invention. As shown in fig. 1, the implementation scenario includes a server 11 and an intelligent terminal 12. The server 11 transmits the processed video stream to the intelligent terminal 12 through the network, and the annotated video or video clips are played at the intelligent terminal 12. Here, the intelligent terminal 12 may be a PDA, a desktop computer, a tablet computer, an MP4 player, a smart phone, an electronic book reader, or another electronic device. The server may be a local server or a cloud server. There may be one server 11 or multiple servers 11, and the operator or another user may set the number according to actual needs; the servers may communicate with each other through a wireless or wired network. The server 11 likewise communicates with the intelligent terminal 12 through a wired or wireless network. The user requests a video at the intelligent terminal 12 and can watch each annotated and clustered video clip, where the annotated information is related to the content of the clip. For example, if the video clip is a defense scene of basketball player A, the user can see, on the terminal side, all the defense scenes annotated for player A, where the annotated information is player A's statistics for offense, blocks and defense in the season; of course, other event information about player A may also be included here. Annotating the video clips of a certain scene in this way helps raise the user's interest in watching the video and thereby improves the user experience.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video annotation method according to an embodiment of the invention. As shown in fig. 2, the method comprises the steps of:
S21, acquiring video transformation identifiers among different video scenes;
In this step S21, the video scenes include a court panorama, a player close-up, an auditorium sweep, a coach seat close-up, and other scenes. Further, each video scene may be divided into sub-video scenes; for example, the player close-up scenes include a player defense scene, a player attack scene, a player hitting scene, and the like. One skilled in the art will recognize that video scenes not described herein should fall within the scope of the present invention if they are similar or equivalent in concept to the video scenes provided in this embodiment.
In step S21, the server may access the resource server to obtain the playing video resource, or may obtain the playing video resource by actively transmitting the playing video resource by the resource server, or may achieve the purpose of obtaining the playing video resource by the user adding the video resource by himself. After the played video resource is obtained, the server divides the played video resource, and different video scenes are obtained according to a preset algorithm model.
In this step S21, the video transformation identifier is an identifier for determining video scene switching. For example, in basketball, a video transformation identifier marks the scene switch between a shooting scene and a blocking scene of a certain player. The video transformation identifier has various expressions; in this embodiment, it is the video transformation time point between different video scenes.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an embodiment of obtaining video transformation time points between different video scenes. As shown in fig. 3, the process includes:
S31, segmenting the video frame by frame;
in this step S31, each frame of video is divided to be applied to each input video stream. Referring to fig. 3a to 3c together, fig. 3a is a first video segment of a first video scene obtained after segmentation according to an embodiment of the present invention, fig. 3b is a second video segment of a second video scene obtained after segmentation according to an embodiment of the present invention, and fig. 3c is a third video segment of a third video scene obtained after segmentation according to an embodiment of the present invention. Further, as can be seen from fig. 3a to fig. 3c, the video scenes of each frame of the divided video are different. The video scene shown in fig. 3a is a game scene in which both players hit and shoot, the video scene shown in fig. 3b is a game scene of one player, and the video scene shown in fig. 3c is an auditorium scene.
In this embodiment, an input video stream is analyzed, and each frame of video is segmented according to a preset algorithm model.
S32, judging whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene.
In step S32, the current video frame is the video frame to be parsed and compared with the previous video frame, and the buffer area is a storage area opened up by the system to hold the video frame against which the next video frame is judged.
In this embodiment, if the two frames belong to video clips of different scenes, the current video frame replaces the video frame in the buffer area, and the time point of the video scene change is stored; or, if they belong to the same scene, the current state of the video frame in the buffer area is maintained.
Referring to fig. 3a and fig. 3b, assume that the current video frame is fig. 3a and the video frame in the buffer is fig. 3b. Since fig. 3a and fig. 3b belong to different video scenes, the video frame of fig. 3a replaces the video frame of fig. 3b in the buffer. If fig. 3a instead showed the scene of fig. 3b at a different time period, the current state of the video frame in the buffer would be maintained.
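The replace-or-maintain buffer logic of steps S31 and S32 can be sketched as a single pass over the frames, with the same-scene test left as a pluggable predicate (any of the four ways described below would fit). The frame representation and the predicate used here are illustrative assumptions:

```python
from typing import Callable, List


def find_scene_changes(
    frames: List[list],
    timestamps: List[float],
    same_scene: Callable[[list, list], bool],
) -> List[float]:
    """Return the time points at which the video scene changes.

    `frames` is any per-frame representation; `same_scene` compares the
    current frame with the buffered frame (e.g. by histogram correlation,
    DCT components, motion vectors, or pixel differencing).
    """
    change_points = []
    buffer_frame = None
    for frame, t in zip(frames, timestamps):
        if buffer_frame is None or not same_scene(frame, buffer_frame):
            buffer_frame = frame            # replace the frame in the buffer
            if t != timestamps[0]:
                change_points.append(t)     # store the transformation time point
        # same scene: maintain the current state of the buffered frame
    return change_points
```

Splitting the input video at the returned time points yields the video segments that are later clustered by scene.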
In step S32, there are various ways to determine whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene. This embodiment provides four such ways. Those skilled in the art will understand that there are many ways to judge, from the video frames, whether video clips belong to the same video scene, and that the four ways of this embodiment may be freely combined, slightly modified, or equivalently replaced. Any modification made using the inventive concept should therefore fall within the scope of the present invention.
Referring to fig. 4a, fig. 4a is a flowchart illustrating a first method for determining whether a current video frame and a video frame in a buffer belong to a video clip of the same scene according to an embodiment of the present invention. As shown in fig. 4a, the process includes:
S4a1, respectively extracting a first luminance histogram of the current video frame and a second luminance histogram of the video frame in the buffer area;
in this step S4a1, a histogram image processing algorithm is used to extract a luminance histogram of each frame of video.
S4a2, judging whether the correlation between the first luminance histogram and the second luminance histogram is greater than a first preset threshold.
In this step S4a2, the correlation takes values in the range [-1, 1], where -1 indicates that the current video frame and the video frame in the buffer area belong to video clips of completely different scenes, and 1 indicates that they belong to a video clip of exactly the same scene; values between -1 and 1 measure how similar the two scenes are. Here, the first preset threshold is 0.7. If the correlation is greater than 0.7, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; if the correlation is less than 0.7, they belong to video clips of different scenes.
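A minimal sketch of this correlation test, using the Pearson correlation coefficient (which lies in [-1, 1] as the text requires; the text does not name the exact correlation measure, so Pearson is an assumption) together with the first preset threshold of 0.7:

```python
import math


def histogram_correlation(h1, h2):
    """Pearson correlation between two luminance histograms, in [-1, 1]."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1)
                    * sum((b - m2) ** 2 for b in h2))
    return num / den if den else 0.0


def same_scene_by_histogram(h1, h2, threshold=0.7):
    """Same scene iff the histogram correlation exceeds the first preset threshold."""
    return histogram_correlation(h1, h2) > threshold
```

In practice the histograms would be the per-frame luminance histograms extracted in step S4a1.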
Referring to fig. 4b, fig. 4b is a flowchart illustrating a second method for determining whether a current video frame and a video frame in a buffer belong to a video clip of the same scene according to an embodiment of the present invention. As shown in fig. 4b, the process includes:
S4b1, respectively extracting a first discrete cosine transform component of the current video frame and a second discrete cosine transform component of the video frame in the buffer area;
In this step S4b1, a discrete cosine transform (DCT) image processing algorithm is used to extract the discrete cosine transform components of each video frame.
S4b2, judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene according to the relation between the first discrete cosine transform component and the second discrete cosine transform component.
In step S4b2, it is determined whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene according to the energy distribution between the first discrete cosine transform component and the second discrete cosine transform component. Specifically, if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a high-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or if the second discrete cosine transform component is a high-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or, if the second discrete cosine transform component is a high frequency component and the first discrete cosine transform component is a high frequency component, the current video frame and the video frame in the buffer area belong to a video segment of the same scene.
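A toy sketch of the DCT-based test above. For brevity it uses a naive 1-D DCT-II over a flattened pixel signal and labels a frame "low" or "high" according to where its non-DC energy concentrates; a real implementation would use a 2-D block DCT, so treat these details as illustrative assumptions:

```python
import math


def dct_1d(x):
    """Naive DCT-II of a 1-D signal (O(N^2), for illustration only)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]


def frequency_label(signal):
    """Label a flattened frame 'low' or 'high' by where its DCT energy sits."""
    coeffs = dct_1d(signal)
    energy = [c * c for c in coeffs[1:]]  # skip the DC coefficient
    half = len(energy) // 2
    low, high = sum(energy[:half]), sum(energy[half:])
    return "high" if high > low else "low"


def same_scene_by_dct(sig1, sig2):
    """Per the text: same scene iff both frames carry the same frequency label."""
    return frequency_label(sig1) == frequency_label(sig2)
```

A smooth gradient frame concentrates energy in low frequencies; a rapidly alternating frame concentrates it in high frequencies.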
Referring to fig. 4c, fig. 4c is a flowchart illustrating a third method for determining whether a current video frame and a video frame in a buffer belong to a video clip of the same scene according to an embodiment of the present invention. As shown in fig. 4c, the process includes:
S4c1, respectively extracting the first motion vector distribution of the current video frame and the second motion vector distribution of the video frame in the buffer area;
In this step S4c1, a motion vector image processing algorithm is used to extract the first motion vector distribution of the current video frame and the second motion vector distribution of the video frame in the buffer area. The motion vector distribution is obtained by minimum-variance matching of the image blocks of the video frame.
S4c2, comparing the first motion vector distribution with the second motion vector distribution;
and S4c3, judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene according to the comparison result.
With reference to steps S4c2 and S4c3, the minimum variance of the first motion vector distribution and the minimum variance of the second motion vector distribution are compared, and whether the current video frame and the video frame in the buffer area belong to the video segment of the same scene is determined according to a preset minimum variance threshold. Here, it is assumed that the preset minimum variance threshold is 0.5. If the minimum variance of the first motion vector distribution differs from the minimum variance of the second motion vector distribution by more than 0.5, the two distributions are different, and the current video frame and the video frame in the buffer area therefore belong to video segments of different scenes; if the difference does not exceed 0.5, the two distributions are the same, and the current video frame and the video frame in the buffer area therefore belong to a video segment of the same scene.
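A minimal sketch of the least-variance block matching described above is given below. The block size, search range, and the use of the mean per-block minimum variance against the 0.5 threshold are illustrative simplifications.

```python
import numpy as np

def min_variance_distribution(prev, curr, block=8, search=4):
    """For each block of `curr`, find the displacement into `prev`
    minimizing the mean squared difference ("least variance matching")
    and return the per-block minimum variances."""
    h, w = curr.shape
    out = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            tgt = curr[y:y + block, x:x + block]
            best = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        ref = prev[yy:yy + block, xx:xx + block]
                        best = min(best, float(np.mean((tgt - ref) ** 2)))
            out.append(best)
    return np.array(out)

def same_scene_by_motion(prev, curr, threshold=0.5):
    """Same scene when the overall minimum matching variance stays
    below the preset threshold (0.5 in the example above)."""
    return float(np.mean(min_variance_distribution(prev, curr))) <= threshold
```

A small camera translation within one scene yields near-zero minimum variances, while an unrelated frame from a different scene cannot be matched and exceeds the threshold.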
Referring to fig. 4d, fig. 4d is a flowchart illustrating a fourth method for determining whether a current video frame and a video frame in a buffer belong to a video clip of the same scene according to an embodiment of the present invention. As shown in fig. 4d, the process includes:
s4d1, calculating the pixel difference between the current video frame and the video frame in the buffer area;
in this step S4d1, a frame difference algorithm is used to calculate the pixel difference between the current video frame and the video frame in the buffer.
And S4d2, judging whether the current video frame and the video frame in the buffer area belong to the video clip of the same scene or not according to the pixel difference value.
In this step S4d2, the second preset threshold is 0.5. If the pixel difference value is greater than a second preset threshold value of 0.5, the current video frame and the video frame in the buffer area belong to video clips of different scenes; and if the pixel difference value is less than a second preset threshold value of 0.5, the current video frame and the video frame in the buffer area belong to the video clip of the same scene.
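The frame-difference test of steps S4d1 and S4d2 can be sketched as follows; using the mean absolute difference of intensities normalized to [0, 1] is an assumption, since the embodiment does not fix a particular frame difference algorithm.

```python
import numpy as np

def pixel_difference(frame_a, frame_b):
    """Mean absolute pixel difference, in [0, 1] for frames whose
    intensities lie in [0, 1] (a simple frame-difference measure)."""
    return float(np.mean(np.abs(frame_a - frame_b)))

def same_scene_by_diff(frame_a, frame_b, threshold=0.5):
    """Same scene when the difference stays below the second preset
    threshold (0.5 in this embodiment)."""
    return pixel_difference(frame_a, frame_b) <= threshold
```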
In this embodiment, in the process of determining whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene, the above four modes may be executed in parallel or in combination. Further, the first mode and the second mode are executed only when the third mode and the fourth mode cannot conclusively determine whether the current video frame and the video frame in the buffer area belong to a video clip of the same scene. The first mode and the second mode are better suited to detecting special-effect scene changes such as fade-in and fade-out.
S22, carrying out scene clustering on each video clip according to the video transformation identifier;
In this step S22, the video transformation identifier is a video transformation time point between different video scenes. The server retrieves the stored video transformation time points from the memory and performs scene clustering on the respective video segments according to these time points. Scene clustering classifies each video segment according to its scene type; the classification establishes the associated characteristics of the video segments belonging to the same video scene.
Referring to fig. 5, fig. 5 is a schematic view illustrating a process of performing scene clustering on each video segment according to the video transformation time point according to an embodiment of the present invention. As shown in fig. 5, the process includes:
s51, dividing the input video into a plurality of video segments according to the video conversion time points;
s52, performing time domain alignment processing on the video clips;
In this step S52, the server performs time domain alignment on each of the divided video segments, adjusting video segments with different frame counts to the same frame count. Specifically, depending on whether a segment is shorter or longer than the target frame count, the server selects positions in the video clip at random and either interpolates additional frames or deletes surplus frames.
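The alignment step can be sketched as below. Duplicating a randomly chosen frame stands in for true interpolation here, which is a simplification of the embodiment's description.

```python
import numpy as np

def align_frame_count(clip, target, rng=None):
    """Pad or trim `clip` (an array of frames, shape [n, h, w]) to
    `target` frames by duplicating or deleting frames at random
    positions. Frame duplication stands in for interpolation."""
    if rng is None:
        rng = np.random.default_rng()
    frames = list(clip)
    while len(frames) < target:
        i = int(rng.integers(0, len(frames)))
        frames.insert(i, frames[i])               # duplicate a random frame
    while len(frames) > target:
        frames.pop(int(rng.integers(0, len(frames))))  # drop a random frame
    return np.stack(frames)
```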
S53, forming a four-dimensional tensor for the video clips;
In this step S53, the four dimensions of the tensor respectively represent the frame height, the frame width, the clip length (number of frames), and the clip index.
S54, performing high-order singular value decomposition processing on the four-dimensional tensor to obtain the feature vector of each video clip;
In this step S54, the four-dimensional tensor is processed by Tucker decomposition to obtain a reduced feature vector for each video segment. During processing, the rank of each dimension of the four-dimensional tensor is set to a level that retains 90% of the original information. The transformation bases of the Tucker decomposition are recorded so that all subsequent video segments can be processed consistently.
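A minimal numpy sketch of this step is given below, using a sequentially truncated higher-order SVD (a special case of the Tucker decomposition): each mode is unfolded, the smallest rank retaining 90% of the squared singular values is kept, and the reduced core yields one feature vector per clip. The choice to leave the clip mode unreduced is an implementation assumption.

```python
import numpy as np

def mode_unfold(t, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd_features(tensor, energy=0.90):
    """Truncated HOSVD of a 4-D tensor [height, width, frames, clips].
    Each reduced mode keeps the smallest rank retaining `energy` of the
    squared singular values; returns one feature vector per clip."""
    core = tensor
    for mode in range(3):  # reduce height/width/frames; keep the clip mode
        u, s, _ = np.linalg.svd(mode_unfold(core, mode), full_matrices=False)
        ratio = np.cumsum(s ** 2) / (np.sum(s ** 2) + 1e-12)
        keep = int(np.searchsorted(ratio, energy)) + 1
        core = np.moveaxis(
            np.tensordot(u[:, :keep].T, np.moveaxis(core, mode, 0), axes=1),
            0, mode)
    return mode_unfold(core, 3)  # one reduced feature vector per clip
```

In practice the kept factor matrices would be stored, matching the embodiment's note that the transformation bases are recorded for later segments.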
And S55, performing sparse subspace clustering processing on the video clips according to the feature vectors to obtain the same video scene after scene clustering.
In this step S55, the reduced feature vectors obtained from the Tucker decomposition are processed with a sparse subspace clustering algorithm to obtain the video scene corresponding to each video segment. Sparse subspace clustering expresses the feature vector of a given video segment as a linear combination of the feature vectors of the other video segments in the same video scene; the representation coefficients are used to build the adjacency matrix for spectral clustering.
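A heavily simplified sketch of sparse subspace clustering follows: an ISTA-style lasso solver computes the sparse self-representation, and a two-way spectral split of the resulting adjacency matrix stands in for full spectral clustering. The solver parameters and the two-cluster Fiedler split are illustrative assumptions, not the embodiment's exact procedure.

```python
import numpy as np

def sparse_codes(X, lam=0.01, iters=500):
    """Represent each column of X as a sparse combination of the other
    columns (ISTA for the lasso), zeroing the self-coefficient."""
    n = X.shape[1]
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    C = np.zeros((n, n))
    for _ in range(iters):
        grad = X.T @ (X @ C - X)
        C = C - step * grad
        C = np.sign(C) * np.maximum(np.abs(C) - step * lam, 0.0)  # soft-threshold
        np.fill_diagonal(C, 0.0)
    return C

def ssc_two_clusters(X):
    """Adjacency from |C| + |C|^T, then a two-way spectral split using
    the Fiedler vector of the normalized Laplacian."""
    C = sparse_codes(X)
    W = np.abs(C) + np.abs(C).T + 1e-8   # tiny coupling keeps the graph connected
    d = W.sum(axis=1)
    L = np.eye(len(d)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)  # sign of the Fiedler vector
```

Feature vectors drawn from two different subspaces (two video scenes) are represented almost entirely by members of their own subspace, so the adjacency matrix is nearly block-diagonal and the spectral split recovers the scenes.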
Referring to fig. 5a and 5b together, fig. 5a is a panoramic view of a court provided by the embodiment of the invention, and fig. 5b is a close-up schematic view of players provided by the embodiment of the invention. In fig. 5a, each video image corresponds to a court-panorama video scene, which, as is apparent, includes a plurality of court video images of different perspectives. In fig. 5b, each video image corresponds to a player close-up video scene, which, as is apparent, includes a plurality of player video images at different perspectives.
And S23, marking video clips belonging to the same video scene.
In this step S23, the video scenes include a player close-up scene, an auditorium sweep, a coach seat close-up, and other scenes. Further, each video scene may be divided into sub-video scenes; for example, the player close-up scenes include a player defense scene, a player attack scene, a player hitting scene, and the like. The server marks video clips belonging to the same video scene, and the annotation information is determined according to the content of each video clip.
Please refer to fig. 3 b. On the right side of fig. 3b, the server annotates the video segment in the video scene, as enclosed by oval 3b1, with annotation information including the names of the first and second base players in the current video scene. On the left side of fig. 3b, the label information, as enclosed by oval 3b2, includes current game data statistics.
Please refer to fig. 3 a. On the right side of fig. 3a, the server annotates the video segment in the video scene, as enclosed by oval 3a1, with annotation information including the name of the current pitcher, the season data for the pitcher, the name of the current batter, and the season data for the batter. On the left side of fig. 3a, the label information, as enclosed by the oval 3a2, includes current game data statistics.
Therefore, by labeling the video segments belonging to the same video scene, the method solves the technical problem that the prior art cannot automatically label video segments in a specific type of video scene. On one hand, the method enables a user to further understand the specific plot, content, and roles in the video; on the other hand, it improves the user experience.
Referring to fig. 6a, fig. 6a is a schematic flow chart illustrating a video clip for marking a close-up scene belonging to a player when the same video scene is a close-up scene of the player according to an embodiment of the present invention. As shown in fig. 6a, the process includes:
s6a1, when facial features of a player can be extracted, detecting the face of the player by using a cascade detector based on haar features;
s6a2, matching the names of the players by using a convolution network based on a deep learning framework according to the faces of the players;
s6a3, marking the basic information of the player in the video segment of the player close-up scene.
By adopting the method, the marking information can be accurately matched with the corresponding player.
Referring to fig. 6b, fig. 6b is a schematic flow chart illustrating another video segment labeled to a close-up scene of a player when the same video scene is the close-up scene of the player according to the embodiment of the present invention. As shown in fig. 6b, the process includes:
s6b1, when facial features of a player cannot be extracted, detecting the jersey number of the player by using an optical character recognition system;
s6b2, matching the names of the players by using a convolution network based on a deep learning framework according to the detected jersey numbers of the players;
s6b3, marking the basic information of the player in the video segment of the player close-up scene.
By adopting the method, annotation can still be carried out when facial features of a player cannot be extracted: other accessory features of the player, such as the jersey number, are identified for auxiliary recognition, thereby improving the reliability of the system.
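The fall-back logic of figs. 6a and 6b can be sketched as below. The haar-cascade face detector with CNN matcher and the OCR jersey-number reader are replaced by hypothetical callables, and all names are illustrative; only the dispatch logic is shown.

```python
from typing import Callable, Optional

def annotate_player(frame,
                    detect_face: Callable[[object], Optional[str]],
                    detect_jersey: Callable[[object], Optional[str]],
                    roster: dict) -> Optional[str]:
    """Return annotation text for a player close-up frame.

    `detect_face` stands in for haar-cascade detection plus CNN name
    matching (fig. 6a); `detect_jersey` stands in for OCR jersey-number
    reading (fig. 6b). Both return a result or None. `roster` maps
    jersey numbers to player names."""
    name = detect_face(frame)
    if name is None:                       # fig. 6b: no usable facial features
        number = detect_jersey(frame)
        name = roster.get(number) if number is not None else None
    return f"Player: {name}" if name else None
```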
Referring to fig. 6c, fig. 6c is a schematic flow chart illustrating a process of marking a video clip belonging to a court panorama or a player tracking scene when the same video scene is the court panorama or the player tracking scene according to the embodiment of the present invention. As shown in fig. 6c, the process includes:
s6c1, tracking the player by using a gradient tracker;
s6c2, storing the motion trail of the player;
and S6c3, marking the motion trail of the player.
In a court panorama or player tracking scene, the server detects players using a player detector based on haar features. For each detected player, the server tracks the player using a gradient tracker (KLT tracker). The motion trail of each player is recorded and output. These trajectories are useful to ball-game professionals for tactical design and analysis.
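A minimal trajectory store for steps S6c1 to S6c3 might look as follows; in practice the per-frame `(x, y)` updates would come from the KLT tracker, which is stubbed out here.

```python
from collections import defaultdict

class TrajectoryStore:
    """Records each detected player's per-frame positions; a KLT
    tracker would supply the (player_id, x, y) updates in practice."""
    def __init__(self):
        self.tracks = defaultdict(list)

    def update(self, player_id, x, y):
        self.tracks[player_id].append((x, y))

    def trajectory(self, player_id):
        return list(self.tracks[player_id])

    def displacement(self, player_id):
        """Net displacement, one quantity tactical analysis might use."""
        t = self.tracks[player_id]
        if len(t) < 2:
            return (0.0, 0.0)
        return (t[-1][0] - t[0][0], t[-1][1] - t[0][1])
```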
Referring to fig. 6d, fig. 6d is a schematic flow chart illustrating the process of labeling the statistical scores of the game according to the embodiment of the present invention. As shown in fig. 6d, the process includes:
s6d1, detecting information in the statistical frame by the feature extractor based on the local binary pattern;
In this step S6d1, a feature extractor based on the local binary pattern (Local Binary Pattern, LBP) is used to detect information in the statistical box.
S6d2, extracting information in the statistical box;
s6d3, identifying the information in the statistical box by using an optical character recognition system;
and S6d4, marking the information in the statistical box on a fast forward prompt bar of a display screen for playing the video clip.
Here, for rebroadcast video containing a statistics box, the method may also extract the information in the statistics box, recognize it with an optical character recognition (OCR) system, and display the recognized information on the fast-forward prompt bar of the display screen. By adopting the method, the user can greatly improve his or her understanding of the background of the game and the current game situation while watching the video, thereby improving the user experience.
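The basic LBP computation such a feature extractor builds on can be sketched as below; the statistics-box detection and OCR stages are outside this sketch.

```python
import numpy as np

def lbp_codes(img):
    """8-bit local binary pattern for each interior pixel: compare the
    8 neighbors to the center (1 if neighbor > center) and pack the
    bits clockwise starting from the top-left neighbor."""
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += (n > c).astype(int) << (7 - bit)
    return code
```

Histograms of these codes over image blocks form the texture features that a statistics-box detector could be trained on.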
The method provided by this embodiment further comprises the following step: according to a known video segment, matching video segments similar to the known video segment in the playing video. For example, a user stores a highlight clip of a famous player on an intelligent terminal, and the server processes the clip through segmentation, identification, and scene clustering to obtain a video segment of a known video scene. If the user is then watching a game involving that player, the server automatically matches the video segments related to the player in the game according to the video segment of the known video scene. The method can therefore greatly enhance the user's interest in watching videos.
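One plausible realization of this matching, sketched here as an assumption rather than the embodiment's exact procedure, compares the known segment's feature vector (such as the reduced feature vectors produced in step S54) against those of the playing video by cosine similarity; the 0.9 threshold is illustrative.

```python
import numpy as np

def match_similar_segments(known_vec, segment_vecs, threshold=0.9):
    """Indices of segments whose feature vectors are cosine-similar to
    the known segment's feature vector."""
    k = known_vec / (np.linalg.norm(known_vec) + 1e-12)
    out = []
    for i, v in enumerate(segment_vecs):
        sim = float(v @ k / (np.linalg.norm(v) + 1e-12))
        if sim >= threshold:
            out.append(i)
    return out
```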
In the present embodiment, the technical features described in the embodiments of the present invention may be combined with each other as long as they do not conflict with each other.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a video annotation apparatus according to an embodiment of the invention. As shown in fig. 7, the apparatus includes:
an obtaining module 71, configured to obtain video transformation identifiers between different video scenes;
a scene clustering module 72, configured to perform scene clustering on each video segment according to the video transformation identifier;
and the labeling module 73 is used for labeling the video segments belonging to the same video scene.
Referring to fig. 7a, fig. 7a is a schematic structural diagram of an acquisition module according to an embodiment of the present invention. As shown in fig. 7a, the obtaining module 71 includes:
a first division unit 711 for dividing the video of each frame; wherein the video transformation identifier is a video transformation time point;
a judging unit 712, configured to judge whether the current video frame and the video frame in the buffer belong to a video clip of the same scene;
if the video clips belong to different scenes, replacing the video frames in the buffer area by the current video frame, and storing the time points of different video scene changes; or,
if the video clips belong to the same scene, the current state of the video frames in the buffer area is maintained.
Referring to fig. 7b, fig. 7b is a schematic structural diagram of a determining unit according to an embodiment of the present invention. As shown in fig. 7b, the determining unit 712 includes:
a first extracting subunit 7121, configured to extract a first luminance histogram of the current video frame and a second luminance histogram of the video frame in the buffer, respectively;
a first judging subunit 7122, configured to judge, according to a correlation between the first luminance histogram and the second luminance histogram, whether a threshold of the correlation is greater than a first preset threshold;
if the threshold value of the correlation is larger than a first preset threshold value, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or,
and if the threshold value of the correlation is smaller than a first preset threshold value, the current video frame and the video frame in the buffer area belong to video clips of different scenes.
As shown in fig. 7b, the determining unit 712 includes:
a second extracting subunit 7123, configured to extract a first discrete cosine transform component of the current video frame and a second discrete cosine transform component of the video frame in the buffer area, respectively;
a second determining subunit 7124, configured to determine, according to a relationship between the first discrete cosine transform component and the second discrete cosine transform component, whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene;
if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a high-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or,
if the second discrete cosine transform component is a low-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to a video clip of the same scene; or,
if the second discrete cosine transform component is a high-frequency component and the first discrete cosine transform component is a low-frequency component, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or, if the second discrete cosine transform component is a high frequency component and the first discrete cosine transform component is a high frequency component, the current video frame and the video frame in the buffer area belong to a video segment of the same scene.
As shown in fig. 7b, the determining unit 712 includes:
a third extracting subunit 7125, configured to extract the first motion vector distribution of the current video frame and the second motion vector distribution of the video frame in the buffer, respectively;
a comparison subunit 7126, configured to compare the first motion vector distribution and the second motion vector distribution;
a third judging subunit 7127, configured to judge, according to the comparison result, whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene;
if the first motion vector distribution is different from the second motion vector distribution, the current video frame and the video frame in the buffer area belong to video segments of different scenes; or if the first motion vector distribution and the second motion vector distribution are the same, the current video frame and the video frame in the buffer area belong to the video segment of the same scene.
As shown in fig. 7b, the determining unit 712 includes:
a computing subunit 7128, configured to compute a pixel difference between the current video frame and the video frame in the buffer area;
a fourth judging subunit 7129, configured to judge, according to the pixel difference, whether the current video frame and the video frame in the buffer area belong to a video segment of the same scene;
if the pixel difference value is larger than a second preset threshold value, the current video frame and the video frame in the buffer area belong to video clips of different scenes; or if the pixel difference value is smaller than a second preset threshold value, the current video frame and the video frame in the buffer area belong to the video clip of the same scene.
Referring to fig. 7c, fig. 7c is a schematic structural diagram of a scene clustering module according to an embodiment of the present invention. As shown in fig. 7c, the scene clustering module 72 includes:
a second dividing unit 721, configured to divide the input video into a plurality of video segments according to the video transformation time point; wherein the video transformation identifier is a video transformation time point;
a temporal alignment unit 722, configured to perform temporal alignment on the video segments;
a forming unit 723, configured to form a four-dimensional tensor for the plurality of video segments; wherein each dimension respectively represents the height, width and length of a video clip and the number of the video clip;
a high-order singular value decomposition unit 724, configured to perform high-order singular value decomposition processing on the four-dimensional tensor to obtain a feature vector of each video segment;
and the sparse subspace clustering unit 725 is configured to perform sparse subspace clustering on the video segments according to the feature vectors, so as to obtain the same video scene after scene clustering.
Referring to fig. 7d, fig. 7d is a schematic structural diagram of a labeling module according to an embodiment of the present invention. As shown in fig. 7d, the labeling module 73 includes:
a first detection unit 731, configured to detect the face of a player using a cascade detector based on haar features when the facial features of the player can be extracted;
a first matching unit 732 for matching names of players using a convolutional network based on a deep learning architecture according to the detected faces of the players;
a first annotation unit 733 for annotating the player's basic information at a video segment of the player's close-up scene;
alternatively, the second detection unit 734 is configured to detect a jersey number of the player using an optical character recognition system when facial features of the player cannot be extracted;
a second matching unit 735 for matching the names of players using a convolutional network based on a deep learning architecture according to the detected jersey numbers of the players;
a second labeling unit 736 for labeling the player's basic information at a video segment of the player's close-up scene.
As shown in fig. 7d, the same video scene includes a court panorama or a player tracking scene, then: the labeling module 73 includes:
a tracking unit 737 for tracking the player using a gradient tracker;
a storage unit 738 for storing the movement locus of the player;
a third labeling unit 739 for labeling the motion trail of the player.
As shown in fig. 7d, the labeling module 73 includes:
a third detection unit 740, configured to detect information within the statistical box using a feature extractor based on the local binary pattern;
a fourth extraction unit 741, configured to extract information in the statistics box;
a recognition unit 742 for recognizing information within the statistical box using an optical character recognition system;
a fourth labeling unit 743, configured to label the information in the statistics box on a fast forward prompt bar of a display screen on which the video clip is played.
As shown in fig. 7d, the apparatus further comprises a third matching unit 745 for matching a video segment similar to a known video segment in the playing video according to the known video segment.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in fig. 8, the server includes:
a communication unit 81 for communicating with the intelligent terminal;
and the processor 82 is used for acquiring video transformation identifiers among different video scenes, carrying out scene clustering on each video clip according to the video transformation identifiers, and labeling the video clips belonging to the same video scene.
The processor 82 is a control center of the server, connects various parts of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device and/or processes data by operating or executing software programs and/or modules stored in the storage unit and calling data stored in the storage unit. The processor can be composed of an integrated circuit or a plurality of connected integrated chips with the same function or different functions. That is, the processor may be a combination of a GPU, a digital signal processor, and a control chip in the communication unit.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. The computer software may be stored in a computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.