CN112203115B - Video identification method and related device - Google Patents

Video identification method and related device

Info

Publication number
CN112203115B
Authority
CN
China
Prior art keywords
video
video frame
identified
frame segments
spatio
Legal status
Active
Application number
CN202011078362.4A
Other languages
Chinese (zh)
Other versions
CN112203115A
Inventor
蔡聪怀
刘振华
饶峰云
赵教生
林炯
刘叶青
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011078362.4A
Publication of CN112203115A
Application granted
Publication of CN112203115B
Status: Active
Anticipated expiration

Classifications

    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The embodiments of the present application disclose a video identification method and a related apparatus. A video frame segment of a video to be identified is acquired, the spatio-temporal features of the video frame segment are extracted and matched against the spatio-temporal features in a search library, the successfully matched spatio-temporal feature in the search library is obtained, the target video containing the video frame segment that corresponds to the matched feature is determined, and the name of the target video is determined as the name of the video to be identified. When the name of the video to be identified is recognized through technologies such as computer vision and machine learning, features are no longer extracted from individual video frames but from video frame segments of the video to be identified, yielding the spatio-temporal features of the segments. Extracting the spatio-temporal features of video frame segments improves the ability to analyse the video to be identified, raises the probability that the video is named correctly, and improves the user experience.

Description

Video identification method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video identification method and a related apparatus.
Background
With the development of internet technology, users can watch video resources through internet platforms. To attract users, internet platforms carry a large number of videos that clip highlights, or portions likely to interest the user, out of a complete video, so that users can watch them in fragmented time.
When a user is interested in such a video and wants to watch the corresponding complete video, if the name of the complete video was not published together with the clip on the internet platform, it is difficult for the user to learn that name, and the user experience is poor.
Disclosure of Invention
In order to solve the above technical problem, the present application provides a video identification method and a related apparatus for identifying the name of a video and improving the user experience.
The embodiment of the application discloses the following technical scheme:
in one aspect, the present application provides a video identification method, including:
acquiring a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of consecutive video frames;
extracting spatio-temporal features of the video frame segment, the spatio-temporal features being fusion features of spatial features of the video frame segment and temporal features of the video frame segment and representing action information of objects involved in the video frame segment, the spatial features identifying appearance information of the objects involved in each video frame of the video frame segment, and the temporal features identifying motion information of the objects involved in the video frame segment;
matching the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching succeeds, obtaining the successfully matched spatio-temporal feature in the search library, determining the target video containing the video frame segment corresponding to the matched feature, and determining the name of the target video as the name of the video to be identified; the search library comprises spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
In another aspect, the present application provides a video recognition apparatus, comprising: the device comprises an acquisition unit, an extraction unit and a processing unit;
the acquisition unit is configured to acquire a video frame segment of a video to be identified, the video frame segment comprising a plurality of consecutive video frames;
the extraction unit is configured to extract spatio-temporal features of the video frame segment, the spatio-temporal features being fusion features of spatial features of the video frame segment and temporal features of the video frame segment and representing action information of objects involved in the video frame segment, the spatial features identifying appearance information of the objects involved in each video frame of the video frame segment, and the temporal features identifying motion information of the objects involved in the video frame segment;
the processing unit is configured to match the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching succeeds, obtain the successfully matched spatio-temporal feature in the search library, determine the target video represented by the matched feature, and determine the name of the target video as the name of the video to be identified; the search library comprises spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
In another aspect, an embodiment of the present application provides an apparatus for video identification, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the methods provided in the various alternative implementations of the aspects described above.
According to the above technical solution, a video frame segment of the video to be identified is acquired, the spatio-temporal features of the video frame segment are extracted and matched against the spatio-temporal features in the search library, the successfully matched spatio-temporal feature in the search library is obtained, the target video containing the video frame segment corresponding to the matched feature is determined, and the name of the target video is determined as the name of the video to be identified. When the name of the video to be identified is recognized, features are no longer extracted from individual video frames but from video frame segments of the video to be identified, yielding spatio-temporal features: not only the spatial features of the video frame segment are extracted, but also its temporal features are extracted from the plurality of consecutive video frames, which improves the analysis of the video to be identified and the probability that it is named correctly.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 2 is a flowchart of a video recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of feature extraction provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a video identification method according to an embodiment of the present application;
fig. 5 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 6 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 7 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
fig. 8 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a video recognition apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
When a video does not identify the name of the drama it belongs to, the user can hardly know the name of the video; the user therefore cannot find the complete video corresponding to the clip when interested in it, and the user experience is poor. To identify the drama name of a video, the related art generally extracts every video frame of the video to be identified, extracts the spatial feature of each frame picture through a pre-trained model, and matches it against a pre-established search library to obtain the name of the video to be identified. However, a video is data that changes over time: adjacent video frames are highly similar and have strong temporal and spatial correlation. Extracting only the spatial feature of each individual frame picture cannot effectively capture the correlation between consecutive frames, so one video may yield many matching results and the correct name of the video cannot be obtained.
Based on the above, the application provides a video identification method and a related device, which are used for identifying the name of a video and improving the experience of a user.
The video identification method provided by the embodiments of the present application can be applied to a device with video identification capability, such as a terminal device or a server with a video identification function. The method may be executed independently by the terminal device or the server, or may be applied to a network scenario in which the terminal device and the server communicate and be executed by the terminal device and the server in cooperation. The terminal device may be a smart phone, a notebook computer, a desktop computer, a Personal Digital Assistant (PDA for short), a tablet computer, or the like. The server may be an application server or a Web server; in actual deployment, the server may be an independent server or a server cluster.
The video identification method provided by the embodiments of the present application is implemented based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the above-mentioned computer vision technology and machine learning technology.
Computer Vision (CV) technology is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize the existing knowledge structure so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
For convenience of describing the scheme, in the embodiments of the present application, a server is mainly used as a video identification device, and the video identification method provided by the embodiments of the present application is independently performed for explanation.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a video identification method according to an embodiment of the present application. In this scenario, the aforementioned processing device is a server 100, and a search library 200 is also involved. As shown in fig. 1, the embodiment is described with the search library 200 located in the server 100; in practice the search library 200 may be located in the server 100 or be independent of the server 100.
The video 101 to be identified is composed of a plurality of video frames. After the server 100 acquires the video 101 to be identified, it extracts one or more video frame segments from the plurality of video frames of the video 101 to be identified, each video frame segment including a plurality of consecutive video frames. As shown in fig. 1, a video frame segment including 5 consecutive video frames is acquired from the video 101 to be identified.
Since a video is data that changes over time and has strong temporal and spatial correlation, in order to improve the accuracy of the identification result of the video 101 to be identified, the server 100 extracts the spatio-temporal features of the video frame segment. The spatio-temporal features of a video frame segment are fusion features of its spatial features and its temporal features; that is, the server 100 extracts not only the spatial features of the segment but also its temporal features. Extracting features of multiple dimensions improves the analysis of the video 101 to be identified and hence the accuracy of its name recognition.
After extracting the spatio-temporal features of the video frame segment, the server 100 searches the search library 200 for spatio-temporal features matching those of the video frame segment. The search library 200 includes a plurality of spatio-temporal features, each extracted from a video frame segment of a complete video. As shown in fig. 1, the search library 200 includes N spatio-temporal features corresponding to different video frame segments, which may come from the same complete video or from multiple complete videos. Among them, the spatio-temporal feature II is extracted from a video frame segment of the complete video 102. If the spatio-temporal features of the video frame segment of the video 101 to be identified are successfully matched with the spatio-temporal feature II in the search library 200, the spatio-temporal feature II is obtained, the video frame segment corresponding to the spatio-temporal feature II is determined to come from the target video 102, and the name of the target video 102 is determined as the name of the video 101 to be identified.
In the above technical solution, a video frame segment is extracted from the video to be identified and features are extracted from the segment, rather than from every single video frame of the video to be identified. Both the spatial features and the temporal features of the video frame segment can therefore be extracted, giving the spatio-temporal features of the segment. Matching these spatio-temporal features against the spatio-temporal features in the search library yields the name of the video to be identified, which raises the probability of obtaining a correct result for the video to be identified and improves the user experience.
The video identification method provided by the embodiment of the present application is described below with reference to the application scenario shown in fig. 1. Referring to fig. 2, the figure is a flowchart of a video identification method according to an embodiment of the present application. In the method shown in fig. 2, the following steps are included:
s201: and acquiring a video frame segment of the video to be identified.
The video to be identified is composed of a plurality of video frames, each video frame being one picture; because the human eye briefly retains an image (persistence of vision), rapidly switched video frames are perceived as a continuously playing video. To be able to identify the name of the video to be identified, a video frame segment of the video to be identified may be acquired, the video frame segment comprising a plurality of consecutive video frames, so that the temporal correlation between the video frames can subsequently be obtained from the consecutive frames.
The manner of acquiring the video frame segments is not particularly limited. For example, every video frame of the video to be identified may be extracted, and consecutive video frames may then be treated as one video frame segment. For another example, the video to be identified may be divided into multiple groups of video frame segments according to a preset number of frames. A video frame segment may, for example, consist of 5 video frames; those skilled in the art can set the length of a video frame segment according to actual needs.
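As an illustration of step S201, the following sketch groups decoded frames into non-overlapping segments of consecutive frames. The segment length of 5, the helper name and the use of NumPy arrays are assumptions made for the example, not details taken from the patent.

```python
from typing import List
import numpy as np

def split_into_segments(frames: List[np.ndarray], segment_len: int = 5) -> List[List[np.ndarray]]:
    """Group decoded video frames into consecutive, non-overlapping segments."""
    segments = []
    for start in range(0, len(frames) - segment_len + 1, segment_len):
        segments.append(frames[start:start + segment_len])
    return segments

# usage: frames decoded from the video to be identified, e.g. as HxWx3 arrays
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(17)]
segments = split_into_segments(frames)   # 3 segments of 5 frames; the last 2 frames are dropped
```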
S202: and extracting the space-time characteristics of the video frame segments.
Since a video frame segment is data that changes over time, any pixel is highly similar to its neighboring pixels, and there is strong temporal correlation and spatial correlation. In the related art, only the spatial features of video frames are extracted while the temporal features between video frames are ignored, so the probability of obtaining the correct name of the video to be identified is not high.
Based on this, the present application extracts not only spatial features but also temporal features. The object of feature extraction is no longer an isolated video frame but a video frame segment comprising a plurality of consecutive video frames, so that the temporal features of the segment can be extracted. Compared with extracting only the spatial features of video frames, extracting the spatio-temporal features of video frame segments can better represent the action information of the objects in the segment, thereby improving the probability of obtaining the correct name of the video to be identified. The spatio-temporal features of a video frame segment are fusion features of the spatial features of the segment and the temporal features of the segment; the spatial features identify the appearance information of the objects involved in each frame of the segment, and the temporal features identify the motion information of the objects involved in the segment.
Meanwhile, because adjacent frames are highly similar, extracting a feature vector (such as a spatial feature) for every frame produces a large number of redundant feature vectors, whereas extracting one feature vector (such as a spatio-temporal feature) per video frame segment reduces this redundancy. For example, if a video frame segment has 3 video frames, the technical solution of the embodiments of the present application extracts 1 feature vector, while extracting the feature vector of each video frame yields 3 feature vectors; 1 feature vector carries less redundancy than 3.
S203: and matching the spatiotemporal characteristics of the video frame segments with spatiotemporal characteristics in a search library, if the matching is successful, acquiring the spatiotemporal characteristics in the search library which is successfully matched, determining a target video where the video frame segment corresponding to the spatiotemporal characteristics in the search library which is successfully matched is located, and determining the name of the target video as the name of the video to be identified.
After the spatio-temporal features of the video frame segment are obtained, they are matched with the spatio-temporal features in the search library, so as to obtain the spatio-temporal features in the search library that match those of the video frame segment.
The search library is established in advance: a complete video is divided into a plurality of video frame segments, the spatio-temporal features of each segment are extracted, and the extracted features are put into the search library. For example, the search library may be a similarity search library (Faiss) that clusters the spatio-temporal features densely to construct an inverted index. The search library holds the spatio-temporal features of many video frame segments from many complete videos, and these features can be continuously updated as complete video resources are updated. In this way, spatio-temporal features matching those of the video frame segment of the video to be identified can be retrieved from the library, the corresponding video frame segment is determined from the successfully matched feature, and the name of the target video containing that segment is determined as the name of the video to be identified; the target video is the complete video corresponding to the video to be identified.
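The description names Faiss and an inverted index built by clustering the library features. The following sketch shows one way such a library could be built and queried with Faiss's IVF index; the feature dimension, number of clusters, similarity metric and the random placeholder data are assumptions for illustration, not values from the patent.

```python
import faiss
import numpy as np

d, nlist = 512, 256                        # assumed feature dimension and number of IVF clusters
rng = np.random.default_rng(0)

# Placeholder library features; in practice these are the spatio-temporal features of
# video frame segments cut from complete videos, stored alongside metadata
# (target video name and the segment's positioning interval).
library_feats = rng.standard_normal((20000, d)).astype("float32")
faiss.normalize_L2(library_feats)          # cosine similarity via normalized inner product

quantizer = faiss.IndexFlatIP(d)           # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(library_feats)                 # cluster the features to build the inverted lists
index.add(library_feats)

# Query with the spatio-temporal feature of one segment of the video to be identified.
query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=3)     # top-3 most similar library segments
```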
According to the above technical solution, when the name of the video to be identified is recognized, features are no longer extracted from individual video frames of the video to be identified but from video frame segments, yielding spatio-temporal features: not only the spatial features of a video frame segment are extracted, but also its temporal features are extracted from the plurality of consecutive video frames. Compared with extracting only the spatial features of video frames, extracting the spatio-temporal features of video frame segments improves the ability to analyse the video to be identified, thereby raising the probability that the video is named correctly and improving the user experience.
When extracting the spatio-temporal features of a video frame segment, a feature extraction model can be used. Referring to fig. 3, the figure is a schematic diagram of feature extraction provided in an embodiment of the present application. The feature extraction model includes at least a spatial convolution layer, a first fusion layer, and a second fusion layer. Each layer of the feature extraction model is described below.
After the video frame segment of the video to be identified is obtained, each video frame is input into a spatial convolution layer of the feature extraction model, thereby obtaining the spatial feature of each frame. As shown in fig. 3, the video frame segment includes 3 consecutive video frames, the feature extraction model has 3 spatial convolution layers, and the 3 consecutive video frames are input into the 3 spatial convolution layers respectively to obtain 3 spatial features.
After the spatial feature of each video frame is obtained through the spatial convolution layers, the spatial features of the frames are input into a first fusion layer of the feature extraction model, which fuses them into the spatial feature of the video frame segment. As shown in fig. 3, the 3 spatial features are fused into one spatial feature, i.e. the spatial feature of the video frame segment. Meanwhile, the video frame segment itself is also input into the first fusion layer; because of the temporal correlation among the frames of the segment, the temporal feature of the video frame segment can be obtained through the first fusion layer. As shown in fig. 3, the temporal feature of a video frame segment including 3 video frames is obtained through the first fusion layer.
Finally, the spatial features and the temporal features of the video frame segment are input into a second fusion layer of the feature extraction model to obtain the spatio-temporal features of the video frame segment.
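A minimal PyTorch sketch of the three-stage structure in fig. 3 is given below: a spatial convolution applied to each frame, a first fusion layer that yields the segment's spatial feature and temporal feature, and a second fusion layer that merges the two into one spatio-temporal feature. The layer sizes and the concrete fusion operators (mean pooling, a 3D convolution over the stacked frames, concatenation followed by a linear layer) are assumptions for illustration; the patent does not fix them.

```python
import torch
import torch.nn as nn

class SpatioTemporalExtractor(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.spatial_conv = nn.Sequential(               # per-frame spatial convolution (shared weights)
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.temporal_fuse = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3))
        self.second_fusion = nn.Linear(64 + 64, feat_dim)

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, frames, channels, height, width)
        b, t, c, h, w = segment.shape
        per_frame = self.spatial_conv(segment.reshape(b * t, c, h, w))       # spatial feature of each frame
        spatial_feat = per_frame.reshape(b, t, -1).mean(dim=1)               # first fusion: segment spatial feature
        temporal = self.temporal_fuse(segment.permute(0, 2, 1, 3, 4))        # first fusion: temporal feature from the stacked frames
        temporal_feat = temporal.mean(dim=(2, 3, 4))
        # second fusion: merge spatial and temporal features into one spatio-temporal feature
        return self.second_fusion(torch.cat([spatial_feat, temporal_feat], dim=1))

segment = torch.randn(1, 5, 3, 224, 224)                 # one segment of 5 consecutive RGB frames
feature = SpatioTemporalExtractor()(segment)             # spatio-temporal feature, shape (1, 512)
```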
The feature extraction model is not specifically limited in the present application; for example, a model based on Temporal Segment Networks (TSN) may be used. In the related art, obtaining the temporal feature of video frames requires first converting the video frames into optical flow pictures and then extracting the temporal feature from them, which requires a large amount of computation. Compared with converting every frame of the video to be identified into an optical flow picture before extracting temporal features, the embodiments of the present application input a video frame segment and extract the temporal feature directly from it, which reduces the amount of computation and hence the resource consumption. Moreover, because optical flow pictures require a temporal convolution layer to extract temporal features, and the embodiments of the present application no longer use optical flow pictures as the object of temporal feature extraction, the temporal feature can be extracted by the existing first fusion layer; the temporal convolution layer is removed, which simplifies the feature extraction model and further reduces the computation it would have introduced.
Due to the different editing settings of video producers, video frames of the same complete video may differ in resolution, aspect ratio and so on. To improve the edit resistance of the feature extraction model, multiple variants of each sample video frame segment may be constructed during training, i.e. sample video frame segments with different resolutions and/or different aspect ratios, so as to increase the diversity of the sample segments. Inputting the multiple variants constructed for a sample video frame segment into the feature extraction model as training data improves the model's resistance to editing, thereby raising the probability that the video to be identified is named correctly and improving the user experience.
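The sketch below illustrates this training-data construction: every sample segment is expanded into several resized variants before being fed to the model. The concrete target sizes and the use of OpenCV are illustrative assumptions.

```python
import cv2

def build_segment_variants(segment, sizes=((224, 224), (160, 288), (288, 160))):
    """Return one resized copy of the sample segment per target (width, height)."""
    variants = []
    for width, height in sizes:
        variants.append([cv2.resize(frame, (width, height)) for frame in segment])
    return variants
```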
Also due to the different editing settings of video producers, video frames of the same complete video may not only differ in resolution and aspect ratio, but also in whether an occlusion such as a lace border is overlaid on the picture. Since such occlusions are static in the video, the dynamic area and the static area in the video to be identified can be identified, the static area being the area where the occlusion is located; the static area can then be removed from the video to be identified. This improves the edit resistance of the feature extraction model, raises the probability that the video to be identified is named correctly, and improves the user experience.
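A minimal sketch of separating the dynamic and static areas follows, assuming that the static area (for example a lace border or watermark) shows almost no pixel change across the frames of a segment. The per-pixel variance test and its threshold are assumptions for illustration.

```python
import numpy as np

def static_area_mask(segment_frames: np.ndarray, threshold: float = 2.0) -> np.ndarray:
    """segment_frames: (num_frames, height, width) grayscale; True where the picture never changes."""
    variance = segment_frames.astype(np.float32).var(axis=0)   # per-pixel variance over time
    return variance < threshold

def remove_static_area(segment_frames: np.ndarray) -> np.ndarray:
    """Zero out the static area so occlusions do not contribute to feature extraction."""
    mask = static_area_mask(segment_frames)
    cleaned = segment_frames.copy()
    cleaned[:, mask] = 0
    return cleaned
```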
Not only can the name of the video to be identified be obtained through the spatio-temporal features in the search library; the positioning interval of the video to be identified in the target video can also be obtained. It should be noted that, if the positioning interval of the video to be identified in the target video is required, then when the search library is established, not only must a correspondence be built between each complete video and its spatio-temporal features, but the positioning interval in the complete video of the video frame segment corresponding to each feature must also be recorded. After the successfully matched spatio-temporal feature in the search library is obtained, the video frame segment corresponding to that feature is determined, and the positioning interval of the video to be identified in the target video is obtained from the positioning interval of that segment in the target video.
If multiple videos to be identified all correspond to one target video, they can be aggregated: the positioning intervals of the videos to be identified are obtained, the videos are sorted according to their positioning intervals, and they are displayed in chronological order. This provides the user with a continuous viewing experience, improves the user experience, and can increase the time the user spends on the internet platform (such as an APP or web page).
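The aggregation step can be sketched as follows: clips recognized as belonging to the same target video are grouped and sorted by the start of their positioning interval so they can be shown in order. The field names are illustrative assumptions.

```python
from collections import defaultdict

def aggregate_by_target(clips):
    """clips: iterable of dicts with 'target_name' and 'interval' = (start_sec, end_sec)."""
    grouped = defaultdict(list)
    for clip in clips:
        grouped[clip["target_name"]].append(clip)
    for name in grouped:
        grouped[name].sort(key=lambda c: c["interval"][0])   # chronological order within one target video
    return grouped
```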
To further improve the accuracy of the positioning interval of the video to be identified, the interval matching can be performed multiple times; this is described below by way of example. After the first positioning interval of the video to be identified in the target video is obtained, the first positioning interval can be expanded, for example by extending it by a preset time period after it, or by a preset time period both before and after it, to obtain a second positioning interval. For example, if the first positioning interval is [50 s, 100 s] and 30 s are added before and after it, the second positioning interval is [20 s, 130 s]. The spatio-temporal features of the video frame segments are then matched again, based on the second positioning interval, with the spatio-temporal features of the target video within the second positioning interval in the search library. If the matching succeeds, the video frame segment corresponding to the successfully matched feature of the target video is obtained, and the positioning interval of the video to be identified in the target video is obtained from the positioning interval of that segment in the target video, which further improves the accuracy of the positioning interval of the video to be identified in the target video.
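A minimal sketch of this second localization pass follows: the first interval is widened by a preset margin and the query feature is rematched only against library segments whose intervals fall inside the widened window. The margin of 30 seconds mirrors the example above; the helper names and the scoring callback are assumptions.

```python
def expand_interval(first_interval, margin_sec=30.0, video_length=None):
    """Widen the first positioning interval by margin_sec on both sides."""
    start, end = first_interval
    start = max(0.0, start - margin_sec)
    end = end + margin_sec if video_length is None else min(video_length, end + margin_sec)
    return (start, end)

def rematch_within(query_feat, library, second_interval, match_fn):
    """library: list of (feature, (start_sec, end_sec)); match_fn scores query_feat against one feature."""
    lo, hi = second_interval
    candidates = [(feat, ival) for feat, ival in library if lo <= ival[0] and ival[1] <= hi]
    scored = [(match_fn(query_feat, feat), ival) for feat, ival in candidates]
    return max(scored, default=None)   # (best score, interval of the best-matching segment), or None
```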
In order to better understand the video identification method provided by the above embodiment, an application scenario of the video identification method provided by the embodiment of the present application is described below with reference to fig. 4 to 8.
Referring to fig. 4, the figure is a schematic view of a video identification method according to an embodiment of the present application. Building the search library and performing video identification obtain the spatio-temporal features in the same way; the search library may be built offline and the video identified online, and this is not specifically limited. The process of building the search library is described first, followed by the process of video identification.
The process of building the search library is as follows: a video frame segment is extracted from the complete video, for example a segment of 5 video frames is extracted from the video 402. If the picture of the video frame segment contains an occlusion, occlusion recognition may first be performed on the segment: the dynamic area and the static area in the segment are identified and the static area is removed. Then, the spatio-temporal features of the video frame segment are extracted and stored in the Faiss library 404 for subsequent matching.
The video identification process is as follows: a video frame segment is extracted from the video to be identified, for example a segment of 5 video frames is extracted from the video 401 to be identified. If the picture of the video frame segment contains an occlusion, occlusion recognition may first be performed on the segment: the dynamic area and the static area are identified and the static area is removed. Then, the spatio-temporal features of the video frame segment are extracted and compared with the spatio-temporal features in the Faiss library 404 to obtain the top-N features most similar to the features of the segment, for example the top three. The video frame segments corresponding to these similar features are obtained, together with the names of the target videos containing them, from which the name of the video to be identified is obtained. A score histogram is constructed for each similar spatio-temporal feature; its abscissa is the positioning interval, in the target video, of the video frame segment corresponding to the similar feature, and its ordinate is the score of the similar feature. As can be seen from fig. 4, the first similar spatio-temporal feature has the highest score, so the video 401 to be identified matches it best; the name of the target video 403 containing the corresponding video frame segment is determined from the first similar feature, the name of the target video 403 is determined as the name of the video 401 to be identified, and the positioning interval of the video 401 to be identified in the target video 403 is [time A, time B].
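The voting over the top-N matches can be sketched as follows: every matched library segment contributes its similarity score to a histogram keyed by target video and positioning interval, and the highest-scoring entry both names and localizes the clip. The binning by exact interval is an illustrative choice.

```python
from collections import Counter

def vote_on_matches(matches):
    """matches: iterable of (target_name, (start_sec, end_sec), similarity_score)."""
    histogram = Counter()
    for name, interval, score in matches:
        histogram[(name, interval)] += score
    (best_name, best_interval), best_score = histogram.most_common(1)[0]
    return best_name, best_interval, best_score
```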
As shown in fig. 5, if a user watches the video 501 through a mobile phone client and the related information 1 of the video 501 does not include its name, the name of the video 501 can be identified by the video identification method of the embodiments of the present application and displayed on the page. Not only the name of the video 501 but also a link to the full collection of the video 501 can be displayed on the page; as shown in fig. 6, when the user clicks to view the full collection, the user can directly watch the complete video corresponding to the video 501. Further, if multiple videos to be identified correspond to the same target video, they can be sorted by their positioning intervals. As shown in fig. 7, if the target video corresponding to the video 502 is the same as the target video corresponding to the video 501, the video 502 can be displayed below the video 501 with its introduction shown in the related information 2; if the user wants to watch more segments related to the video 501 or the complete video corresponding to the video 502, the user can click the episode selection to watch other related videos. As shown in fig. 8, after the user clicks the episode selection, related videos corresponding to the same target video can be displayed below the video 501; the video 502 and the video 503 both correspond to the same target video as the video 501, and the positioning interval of the video 502 is ranked before that of the video 503, for example the positioning interval of the video 502 starts at 03:44 and that of the video 503 starts at 06. By sorting the related videos in this way, a continuous viewing experience is presented to the user, which improves the user experience and can increase the time the user spends on the internet platform.
Corresponding to the video identification method provided by the above embodiments, an embodiment of the present application further provides a video identification apparatus.
Referring to fig. 9, the figure is a schematic view of a video identification apparatus according to an embodiment of the present application. The device comprises: an acquisition unit 901, an extraction unit 902, and a processing unit 903;
the acquiring unit 901 is configured to acquire a video frame segment in a video to be identified, where the video frame segment includes a plurality of consecutive video frames;
the extracting unit 902 is configured to extract spatio-temporal features of the video frame segment, the spatio-temporal features being fusion features of spatial features of the video frame segment and temporal features of the video frame segment and representing action information of objects involved in the video frame segment, the spatial features identifying appearance information of the objects involved in each video frame of the video frame segment, and the temporal features identifying motion information of the objects involved in the video frame segment;
the processing unit 903 is configured to match the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching succeeds, obtain the successfully matched spatio-temporal feature in the search library, determine the target video represented by the matched feature, and determine the name of the target video as the name of the video to be identified; the search library comprises spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
As a possible implementation manner, the extracting unit 902 is further configured to input the video frame segment into a spatial convolution layer of a feature extraction model, and obtain a spatial feature of each frame of video frame in the video frame segment;
inputting the spatial features of each video frame in the video frame segments into a first fusion layer of the feature extraction model to obtain the spatial features of the video frame segments, and inputting the video frame segments into the first fusion layer of the feature extraction model to obtain the temporal features of the video frame segments;
and inputting the spatial characteristics of the video frame segments and the temporal characteristics of the video frame segments into a second fusion layer of the characteristic extraction model to obtain the spatio-temporal characteristics of the video frame segments.
As a possible implementation manner, the apparatus further includes a training unit, configured to construct, when training the feature extraction model, a plurality of segments for a sample video frame segment, where the plurality of segments includes video frame segments with different resolutions and/or different aspect ratios;
and inputting a plurality of fragments of the sample video frame fragments as sample data into a feature extraction model for training.
As a possible implementation manner, the apparatus further includes a removing unit, configured to identify a dynamic area and a static area in the video to be identified when the video to be identified has a blocking object, where the static area is an area where the blocking object is located in the video to be identified;
and removing the static area in the video to be identified.
As a possible implementation manner, the processing unit 903 is further configured to obtain the successfully matched spatio-temporal feature in the search library and determine the video frame segment corresponding to that feature;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
As a possible implementation manner, the apparatus further includes a sorting unit, configured to obtain, if multiple videos to be identified correspond to the same target video, the positioning intervals of the multiple videos to be identified;
and sequencing the videos to be identified according to the positioning intervals of the videos to be identified.
As a possible implementation manner, the processing unit 903 is further configured to obtain a first positioning interval of the video to be identified in the target video;
adding a preset time period for the first positioning interval to obtain a second positioning interval;
matching the spatiotemporal features of the video frame segments with the spatiotemporal features of the target video based on the second positioning interval;
if the matching is successful, acquiring a video frame segment corresponding to the space-time characteristics of the successfully matched target video;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
The video identification apparatus provided by the above embodiments acquires a video frame segment of the video to be identified, extracts the spatio-temporal features of the segment, matches them with the spatio-temporal features in the search library, obtains the successfully matched spatio-temporal feature in the search library, determines the target video containing the video frame segment corresponding to the matched feature, and determines the name of the target video as the name of the video to be identified. When the name of the video to be identified is recognized, features are no longer extracted from individual video frames but from video frame segments of the video to be identified, yielding spatio-temporal features: not only the spatial features of a video frame segment are extracted, but also its temporal features are extracted from the plurality of consecutive video frames, which improves the analysis of the video to be identified and the probability that it is named correctly.
The embodiment of the present application further provides a device for video identification, and the device for video identification provided in the embodiment of the present application will be described below from the perspective of hardware implementation.
Referring to fig. 10, fig. 10 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably in configuration or performance and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transient or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 1422 may be configured to communicate with the storage medium 1430 and execute, on the server 1400, the series of instruction operations in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input-output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 10.
The CPU 1422 is configured to perform the following steps:
acquiring a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of continuous video frames;
extracting spatio-temporal features of the video frame segment, the spatio-temporal features being fusion features of spatial features of the video frame segment and temporal features of the video frame segment and representing action information of objects involved in the video frame segment, the spatial features identifying appearance information of the objects involved in each video frame of the video frame segment, and the temporal features identifying motion information of the objects involved in the video frame segment;
matching the spatio-temporal features of the video frame segment with spatio-temporal features in a search library; if the matching succeeds, obtaining the successfully matched spatio-temporal feature in the search library, determining the target video containing the video frame segment corresponding to the matched feature, and determining the name of the target video as the name of the video to be identified; the search library comprises spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
Optionally, the CPU 1422 may further perform method steps of any specific implementation manner of the video identification method in the embodiment of the present application.
For the video identification method described above, the embodiment of the present application further provides a terminal device for video identification, so that the video identification method described above is implemented and applied in practice.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown, and specific technical details are not disclosed. The terminal device may be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA for short), and the like; the following takes a mobile phone as an example:
fig. 11 is a block diagram illustrating a partial structure of a mobile phone related to the terminal device provided in the embodiment of the present application. Referring to fig. 11, the mobile phone includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the handset structure shown in fig. 11 is not limiting; the handset may include more or fewer components than those shown, some components may be combined, or the components may be arranged differently.
The following describes each component of the mobile phone in detail with reference to fig. 11:
the RF circuit 1510 may be configured to receive and transmit signals during information transmission and reception or during a call. In particular, it receives downlink information from a base station and forwards it to the processor 1580 for processing, and it transmits uplink data to the base station. In general, the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created during use of the mobile phone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, can collect touch operations of a user on or near it (e.g., operations performed on or near the touch panel 1531 with a finger, a stylus, or any other suitable object or accessory) and drive corresponding connection devices according to a preset program. Optionally, the touch panel 1531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects a signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1580, and it can also receive and execute commands sent by the processor 1580. In addition, the touch panel 1531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1531, the input unit 1530 may include other input devices 1532. In particular, the other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power switch key), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The Display unit 1540 may include a Display panel 1541, and optionally, the Display panel 1541 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1531 may cover the display panel 1541, and when the touch panel 1531 detects a touch operation on or near the touch panel 1531, the touch operation is transmitted to the processor 1580 to determine the type of the touch event, and then the processor 1580 provides a corresponding visual output on the display panel 1541 according to the type of the touch event. Although in fig. 11, the touch panel 1531 and the display panel 1541 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 1541 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 1541 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition related functions (such as a pedometer and tapping); other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the mobile phone, and are not described in detail here.
The audio circuit 1560, a speaker 1561, and a microphone 1562 may provide an audio interface between the user and the mobile phone. The audio circuit 1560 may transmit an electrical signal, converted from received audio data, to the speaker 1561, and the speaker 1561 converts the electrical signal into a sound signal for output; on the other hand, the microphone 1562 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1560 and converted into audio data; the audio data is then processed by the processor 1580 and sent via the RF circuit 1510 to, for example, another mobile phone, or output to the memory 1520 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1570, the mobile phone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing wireless broadband internet access for the user. Although fig. 11 shows the WiFi module 1570, it is understood that it is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 1580 is the control center of the mobile phone. It connects the various parts of the entire mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the mobile phone as a whole. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It is to be appreciated that the modem processor may also not be integrated into the processor 1580.
The mobile phone also includes a power supply 1590 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 1580 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described here.
In an embodiment of the present application, the mobile phone includes the memory 1520, which can store program code and transmit the program code to the processor.
The processor 1580 included in the mobile phone can execute the video identification method provided by the above embodiments according to the instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is configured to execute the video identification method provided in the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video recognition method provided in the various alternative implementations of the above aspect.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware instructed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium may be any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for video recognition, the method comprising:
acquiring a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of consecutive video frames;
extracting spatio-temporal features of the video frame segment, wherein the spatio-temporal features are fusion features of spatial features of the video frame segment and temporal features of the video frame segment and represent motion information of objects involved in the video frame segment; the spatial features identify appearance information of the objects involved in each video frame of the video frame segment, and the temporal features identify motion information of the objects involved in the video frame segment;
matching the spatio-temporal features of the video frame segment against spatio-temporal features in a search library; if the matching is successful, acquiring the successfully matched spatio-temporal features in the search library, determining the target video in which the video frame segment corresponding to those spatio-temporal features is located, and determining the name of the target video as the name of the video to be identified; the search library comprises the spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
2. The method of claim 1, wherein the extracting spatio-temporal features of the video frame segment comprises:
inputting the video frame segment into a spatial convolution layer of a feature extraction model, and acquiring the spatial feature of each video frame in the video frame segment;
inputting the spatial features of each video frame in the video frame segment into a first fusion layer of the feature extraction model to obtain the spatial features of the video frame segment, and inputting the video frame segment into the first fusion layer of the feature extraction model to obtain the temporal features of the video frame segment;
and inputting the spatial features of the video frame segment and the temporal features of the video frame segment into a second fusion layer of the feature extraction model to obtain the spatio-temporal features of the video frame segment.
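By way of illustration only, the layered structure recited in claim 2 — a spatial convolution applied per frame, a first fusion stage producing segment-level spatial and temporal features, and a second fusion stage combining them — could be arranged along the lines of the following sketch. It assumes PyTorch, and the layer sizes, pooling choices, and branch design are assumptions, not the model of this application.

import torch
import torch.nn as nn

class SpatioTemporalExtractor(nn.Module):
    def __init__(self, channels=3, dim=64):
        super().__init__()
        # Spatial convolution layer applied to each frame independently.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # First fusion stage: a 3D convolution over the whole segment serves as
        # the temporal branch (the spatial branch is pooled over frames below).
        self.temporal_conv = nn.Sequential(
            nn.Conv3d(channels, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Second fusion stage: combine segment-level spatial and temporal features.
        self.second_fusion = nn.Linear(2 * dim, dim)

    def forward(self, segment):
        # segment: (batch, frames, channels, height, width)
        b, t, c, h, w = segment.shape
        per_frame = self.spatial_conv(segment.reshape(b * t, c, h, w)).reshape(b, t, -1)
        spatial = per_frame.mean(dim=1)                                # segment-level spatial feature
        temporal = self.temporal_conv(segment.permute(0, 2, 1, 3, 4)).flatten(1)
        return self.second_fusion(torch.cat([spatial, temporal], dim=1))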
3. The method of claim 2, further comprising:
when training the feature extraction model, constructing a plurality of variants of a sample video frame segment, the plurality of variants including video frame segments of different resolutions and/or different aspect ratios;
and inputting the plurality of variants of the sample video frame segment into the feature extraction model as sample data for training.
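One plausible way to build the multi-resolution, multi-aspect-ratio training variants of claim 3 is to resample each sample segment to several target sizes before feeding it to the model; the concrete sizes below are arbitrary examples chosen for this sketch, and PyTorch is again assumed.

import torch.nn.functional as F

def make_sample_variants(segment, sizes=((224, 224), (160, 288), (112, 112))):
    # segment: (frames, channels, height, width) float tensor for one sample segment.
    # Returns resampled copies at different resolutions and aspect ratios.
    return [F.interpolate(segment, size=size, mode="bilinear", align_corners=False)
            for size in sizes]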
4. The method according to any one of claims 1-3, wherein when the video to be identified has a blocking object, the method further comprises:
identifying a dynamic area and a static area in the video to be identified, wherein the static area is an area where the blocking object in the video to be identified is located;
and removing the static area in the video to be identified.
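A simple, illustrative reading of claim 4 is to treat pixels that barely change over the segment as the static area occupied by the blocking object and to keep only the bounding box of the dynamic area; the per-pixel standard-deviation threshold used here is an assumption, not the procedure of this application.

import numpy as np

def remove_static_area(frames, motion_threshold=2.0):
    # frames: (T, H, W) grayscale frames of the video to be identified.
    per_pixel_variation = frames.std(axis=0)          # low values indicate the static area
    dynamic = per_pixel_variation > motion_threshold
    ys, xs = np.nonzero(dynamic)
    if ys.size == 0:                                  # nothing moves; leave the frames untouched
        return frames
    return frames[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]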
5. The method according to any one of claims 1-3, wherein the acquiring the successfully matched spatio-temporal features in the search library comprises:
acquiring the successfully matched spatio-temporal features in the search library, and determining the video frame segment corresponding to those spatio-temporal features;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
6. The method of claim 5, further comprising:
if a plurality of videos to be identified correspond to the same target video, acquiring positioning intervals of the plurality of videos to be identified;
and sequencing the videos to be identified according to the positioning intervals of the videos to be identified.
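Claim 6 amounts to ordering the identified clips by where they fall in the target video. Assuming each result carries the positioning interval obtained above as a (start, end) pair, this is a one-line sort; the field names are illustrative.

def order_by_position(results):
    # results: list of dicts such as {"name": ..., "interval": (start, end)}.
    return sorted(results, key=lambda clip: clip["interval"][0])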
7. The method according to claim 5, wherein the obtaining a location interval of the video to be identified in the target video comprises:
obtaining a first positioning interval of the video to be identified in the target video;
adding a preset time period for the first positioning interval to obtain a second positioning interval;
matching the spatio-temporal features of the video frame segments with the spatio-temporal features of the target video based on the second positioning interval;
if the matching is successful, acquiring the video frame segment corresponding to the successfully matched spatio-temporal features of the target video;
and obtaining the positioning interval of the video to be identified in the target video according to the positioning interval of the corresponding video frame segment in the target video.
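The refinement in claim 7 — widening the first positioning interval by a preset period into a second interval and re-matching within it — might look like the following sketch. The data layout, the dot-product scoring, and the threshold are assumptions for illustration.

def refine_interval(first_interval, pad, query_features, target_segments, threshold=0.9):
    # first_interval: (start, end) of the initial location in the target video.
    # pad: the preset time period added on both sides to form the second interval.
    # target_segments: list of ((start, end), feature) entries of the target video.
    start, end = first_interval
    second_interval = (max(0, start - pad), end + pad)
    matched = []
    for query in query_features:
        for (seg_start, seg_end), feature in target_segments:
            inside = seg_start >= second_interval[0] and seg_end <= second_interval[1]
            score = sum(q * f for q, f in zip(query, feature))
            if inside and score >= threshold:
                matched.append((seg_start, seg_end))
    if not matched:
        return first_interval                         # fall back to the first positioning interval
    return (min(s for s, _ in matched), max(e for _, e in matched))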
8. A video recognition apparatus, comprising: an acquisition unit, an extraction unit, and a processing unit;
the acquisition unit is configured to acquire a video frame segment of a video to be identified, wherein the video frame segment comprises a plurality of consecutive video frames;
the extraction unit is configured to extract spatio-temporal features of the video frame segment, wherein the spatio-temporal features are fusion features of spatial features of the video frame segment and temporal features of the video frame segment and represent motion information of objects involved in the video frame segment; the spatial features identify appearance information of the objects involved in each video frame of the video frame segment, and the temporal features identify motion information of the objects involved in the video frame segment;
the processing unit is configured to match the spatio-temporal features of the video frame segment against spatio-temporal features in a search library; if the matching is successful, acquire the successfully matched spatio-temporal features in the search library, determine the target video represented by those spatio-temporal features, and determine the name of the target video as the name of the video to be identified; the search library comprises the spatio-temporal features of a plurality of video frame segments of the target video, and the target video is the complete video corresponding to the video to be identified.
9. An apparatus for video recognition, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-7 according to instructions in the program code.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any one of claims 1-7.
CN202011078362.4A 2020-10-10 2020-10-10 Video identification method and related device Active CN112203115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011078362.4A CN112203115B (en) 2020-10-10 2020-10-10 Video identification method and related device

Publications (2)

Publication Number Publication Date
CN112203115A CN112203115A (en) 2021-01-08
CN112203115B true CN112203115B (en) 2023-03-10

Family

ID=74013941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011078362.4A Active CN112203115B (en) 2020-10-10 2020-10-10 Video identification method and related device

Country Status (1)

Country Link
CN (1) CN112203115B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113038271B (en) * 2021-03-25 2023-09-08 深圳市人工智能与机器人研究院 Video automatic editing method, device and computer storage medium
CN113033458B (en) * 2021-04-09 2023-11-07 京东科技控股股份有限公司 Action recognition method and device
CN113627365A (en) * 2021-08-16 2021-11-09 南通大学 Group movement identification and time sequence analysis method
CN113742519A (en) * 2021-08-31 2021-12-03 杭州登虹科技有限公司 Multi-object storage cloud video Timeline storage method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002041569A (en) * 2000-05-19 2002-02-08 Nippon Telegr & Teleph Corp <Ntt> Method and system for distributing retrieval service, method and device for retrieving information, information retrieving server, retrieval service providing method, program therefor, and recording medium the program recorded thereon
JP2005025770A (en) * 2000-05-19 2005-01-27 Nippon Telegr & Teleph Corp <Ntt> Method and system for distributing search service, method and apparatus for searching information, information search server, method for providing search service, its program, and recording medium with program recorded thereon
CN101442641A (en) * 2008-11-21 2009-05-27 清华大学 Method and system for monitoring video copy based on content
CN102222103A (en) * 2011-06-22 2011-10-19 央视国际网络有限公司 Method and device for processing matching relationship of video content
CN103336957A (en) * 2013-07-18 2013-10-02 中国科学院自动化研究所 Network coderivative video detection method based on spatial-temporal characteristics
CN106231356A (en) * 2016-08-17 2016-12-14 腾讯科技(深圳)有限公司 The treating method and apparatus of video
KR20190088688A (en) * 2018-01-19 2019-07-29 한국기술교육대학교 산학협력단 Method of searching crime using video synopsis
CN108985165A (en) * 2018-06-12 2018-12-11 东南大学 A kind of video copy detection system and method based on convolution and Recognition with Recurrent Neural Network
CN110740389A (en) * 2019-10-30 2020-01-31 腾讯科技(深圳)有限公司 Video positioning method and device, computer readable medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Copyright Protection in Short Video Content Dissemination; Nie Jing, Cheng Haiyan; China Publishing; full text *
The Limits of the Right of Integrity of Works in the Network Environment; Yu Peishi; Journal of Qiqihar University (Philosophy and Social Sciences Edition); full text *

Also Published As

Publication number Publication date
CN112203115A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN112203115B (en) Video identification method and related device
EP3944147A1 (en) Target detection method, model training method, device, apparatus and storage medium
CN111556278B (en) Video processing method, video display device and storage medium
CN111209423B (en) Image management method and device based on electronic album and storage medium
CN112990390B (en) Training method of image recognition model, and image recognition method and device
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN111491123A (en) Video background processing method and device and electronic equipment
CN113821720A (en) Behavior prediction method and device and related product
CN113822427A (en) Model training method, image matching device and storage medium
CN110347858B (en) Picture generation method and related device
CN112995757B (en) Video clipping method and device
CN113269279B (en) Multimedia content classification method and related device
CN114722937A (en) Abnormal data detection method and device, electronic equipment and storage medium
CN113723159A (en) Scene recognition model training method, scene recognition method and model training device
CN112270238A (en) Video content identification method and related device
CN116958715A (en) Method and device for detecting hand key points and storage medium
CN116071614A (en) Sample data processing method, related device and storage medium
CN112256976B (en) Matching method and related device
CN116453005A (en) Video cover extraction method and related device
CN110750193B (en) Scene topology determination method and device based on artificial intelligence
CN113536876A (en) Image recognition method and related device
CN113723168A (en) Artificial intelligence-based subject identification method, related device and storage medium
CN111723783A (en) Content identification method and related device
CN111681255B (en) Object identification method and related device
CN116152289A (en) Target object tracking method, related device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40037430; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant