CN112487242A - Method and device for identifying video, electronic equipment and readable storage medium - Google Patents

Method and device for identifying video, electronic equipment and readable storage medium


Publication number
CN112487242A
Authority
CN
China
Prior art keywords
image
video
key frame
identified
video key
Prior art date
Legal status
Pending
Application number
CN202011359490.6A
Other languages
Chinese (zh)
Inventor
代江
付程晗
范学峰
李国洪
高菲
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011359490.6A priority Critical patent/CN112487242A/en
Publication of CN112487242A publication Critical patent/CN112487242A/en
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/73: Querying
    • G06F16/738: Presentation of query results
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06F16/7837: Retrieval using objects detected or recognised in the video content
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

Embodiments of the present application disclose a method and apparatus for identifying videos, an electronic device, and a computer-readable storage medium, relate to the technical fields of computer vision, cloud services, and deep learning, and can be applied to video search scenarios. One embodiment of the method comprises: acquiring an image to be identified; screening out a set of video key frames similar to the image to be identified according to image similarity, wherein the video key frames in the set are ordered by their image similarity to the image to be identified; determining the content category to which the image content of the image to be identified belongs, and adjusting the current ordering of the video key frames in the set according to their proximity to that content category, to obtain an adjusted video key frame ordering; and determining, for each video key frame in the adjusted ordering, the matching video to which it belongs, to obtain a matching video sequence. This embodiment improves the degree of matching between the determined videos and the image to be recognized.

Description

Method and device for identifying video, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to the technical field of computer vision, cloud services, and deep learning, and more particularly, to a method and an apparatus for identifying a video, an electronic device, and a computer-readable storage medium.
Background
Visual search is a technique that takes content such as images and videos as the input source of a search, identifies and retrieves the input visual content using visual recognition techniques, and returns results in various forms, such as related images and text. With the continuous development of visual search technology, the search results returned to the user have evolved from text to images and from images to videos, satisfying the search needs of different users through continuous updates and iterations.
Conventionally, word guessing is performed on the image to be identified provided by the user: search keywords are guessed from the image, a corresponding video is then retrieved according to the guessed keywords, and the retrieved video is fed back to the user as the identification result.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying a video, electronic equipment and a computer-readable storage medium.
In a first aspect, an embodiment of the present application provides a method for identifying a video, including: acquiring an image to be identified; screening out a set of video key frames similar to the image to be identified according to image similarity, wherein the video key frames in the set are ordered by their image similarity to the image to be identified; determining the image content category according to the image content of the image to be identified, and adjusting the current ordering of the video key frames in the set according to their proximity to that category, to obtain the adjusted video key frame ordering; and determining, for each video key frame in the adjusted ordering, the matching video to which it belongs, to obtain the matching video sequence.
In a second aspect, an embodiment of the present application provides an apparatus for identifying a video, including:
an image-to-be-recognized acquisition unit configured to acquire an image to be recognized; a video key frame set determining unit configured to screen out a set of video key frames similar to the image to be identified according to image similarity, the video key frames in the set being ordered by their image similarity to the image to be identified; an ordering adjustment unit configured to determine the image content category according to the image content of the image to be identified and to adjust the current ordering of the video key frames in the set according to their proximity to that category, obtaining the adjusted video key frame ordering; and a matching video determining unit configured to determine, for each video key frame in the adjusted ordering, the matching video to which it belongs, obtaining the matching video sequence.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for identifying video as described in any one of the implementations of the first aspect when executed.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions which, when executed, cause a computer to perform the method for identifying a video as described in any implementation of the first aspect.
According to the method, apparatus, electronic device, and computer-readable storage medium for identifying videos provided by the embodiments of the present application, an image to be identified is first acquired; then a set of video key frames similar to the image to be identified is screened out according to image similarity, the video key frames in the set being ordered by their image similarity to the image to be identified; next, the image content category is determined according to the image content of the image to be identified, and the current ordering of the video key frames in the set is adjusted according to their proximity to that category, yielding the adjusted video key frame ordering; finally, the matching video to which each video key frame in the adjusted ordering belongs is determined, yielding the matching video sequence.
The accuracy of the screened video key frames, and of their similarity ordering, is progressively improved using both visual image similarity and the category of the image content; that is, features from two different angles are jointly used to judge which video key frames better match the image to be recognized, so that the matching videos determined from the adjusted key frame ordering match the image to be recognized more closely.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
fig. 2 is a flowchart of a method for identifying a video according to an embodiment of the present application;
fig. 3 is a flowchart of another method for identifying a video according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for identifying a video in an application scenario according to an embodiment of the present application;
fig. 5 is a block diagram illustrating a structure of an apparatus for identifying a video according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device adapted to execute a method for identifying a video according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method, apparatus, electronic device, and computer-readable storage medium for identifying videos of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 and the server 105 may be installed with various applications for implementing information communication between the two devices, such as a video search application, a picture search application, an instant messaging application, and the like.
The terminal apparatuses 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like; when the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above, and they may be implemented as multiple software or software modules, or may be implemented as a single software or software module, and are not limited in this respect. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or may be implemented as a single server; when the server is software, the server may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not limited herein.
The server 105 may provide various services through various built-in applications. Taking as an example a video search application that searches videos for a user-given image on the search-by-image principle, the server 105 may achieve the following effects when running the video search application: first, receiving the image to be identified uploaded by the user from the terminal devices 101, 102, 103 through the network 104; then, screening out a set of video key frames similar to the image to be identified according to image similarity, the video key frames in the set being ordered by their image similarity to the image to be identified; next, determining the image content category according to the image content of the image to be identified, and adjusting the current ordering of the video key frames in the set according to their proximity to that category, obtaining the adjusted video key frame ordering; and finally, determining the matching video to which each video key frame in the adjusted ordering belongs, obtaining the matching video sequence.
It should be noted that the image to be recognized may be acquired from the terminal devices 101, 102, 103 through the network 104, or may be stored locally on the server 105 in advance in various ways. Thus, when the server 105 detects that such data is already stored locally (for example, a pending video search task left over before processing starts), it may choose to retrieve the data directly from local storage, in which case the exemplary system architecture 100 may omit the terminal devices 101, 102, 103 and the network 104.
Since determining matching videos requires a large set of images extracted from videos as data support, and therefore more computing resources and stronger computing power, the method for identifying videos provided in the following embodiments of the present application is generally performed by the server 105, which has stronger computing power and more computing resources; accordingly, the apparatus for identifying videos is generally also disposed in the server 105. However, when the terminal devices 101, 102, 103 also have computing capabilities and resources meeting the requirements, they may complete, through the application installed on them, the operations otherwise delegated to the server 105 and output the same result as the server 105. In particular, when multiple types of terminal devices with different computing capabilities coexist, and the application determines that the terminal device on which it runs has strong computing capability and ample idle computing resources, the terminal device may perform the computation itself, appropriately reducing the computing load of the server 105; accordingly, the apparatus for identifying videos may be disposed in the terminal devices 101, 102, 103. In such a case, the exemplary system architecture 100 may omit the server 105 and the network 104.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 is a flowchart of a method for identifying a video according to an embodiment of the present application, wherein the process 200 includes the following steps:
step 201: acquiring an image to be identified;
This step is intended for the execution body of the method for identifying a video (for example, the server 105 shown in fig. 1) to acquire the image to be identified. The image may be received in real time from a terminal device controlled by the user (for example, the terminal devices 101, 102, 103 shown in fig. 1), downloaded from a preset network storage space, or intercepted from data transmitted between two users (with the users' prior authorization), so that the images contained in the intercepted data can be used to further determine the content of the information exchanged between the two users, for example a certain user's preferences.
Specifically, the image to be recognized may be a single frame image or a moving image composed of multiple frames (e.g., a GIF file). When only a multi-frame moving image is available, the user may further be required to provide additional information indicating which frame should serve as the image to be recognized, for example, that the frame displayed by the GIF file at the 2nd second is the image to be recognized.
Step 202: screening out a video key frame set similar to the image to be identified according to the image similarity;
On the basis of step 201, this step is intended for the execution body to find, among a set of candidate video key frames (i.e., a set of key frames extracted from different videos), the video key frames that are visually similar to the image to be recognized. When there are many similar video key frames, they form a video key frame set whose members are ordered by their image similarity to the image to be identified, for example arranged from left to right or from top to bottom by image similarity.
The video key frames may include video cover frames and video frames extracted from the video at fixed time points, or may be representative video frames; they can be chosen flexibly according to the actual situation and are not specifically limited here.
The image similarity of two images refers to their similarity in pixel-level color features; intuitively, two images of the same scene captured in quick succession should have very high image similarity. The image similarity can be calculated in various ways: the two images can be placed in the same color space and their Euclidean distance, cosine distance, Hamming distance, and so on computed; or a deep learning model that outputs the similarity of an input image pair can be trained from sample images annotated with pairwise similarities, and can further be required to output a preset number of the most similar images, implemented for example with a convolutional neural network or a residual network. These options are not described here one by one.
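As a minimal sketch of the cosine-distance option mentioned above (the patent does not fix a specific metric, and the feature vectors here are hypothetical placeholders for real image features):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_keyframes(query_feat, keyframe_feats, top_k=3):
    """Return indices of the top_k key frames most similar to the query,
    most similar first, mirroring the ordered video key frame set."""
    sims = [cosine_similarity(query_feat, f) for f in keyframe_feats]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
```

A real system would compute these vectors with the trained network; the ordering logic, however, is exactly this descending sort by similarity.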
Step 203: determining a content category to which image content in an image to be identified belongs, and adjusting the current arrangement sequence of a plurality of video key frames in a video key frame set according to the degree of proximity to the content category to obtain an adjusted video key frame sequence;
On the basis of step 202, this step is intended for the execution body to reorder the similar video key frames in the set, previously ordered only by visual feature similarity, according to how close each key frame is to the content category to which the image content of the image to be recognized belongs, obtaining the adjusted video key frame ordering. The content category is the category to which the image content belongs: for example, a landscape image belongs to a preset landscape category, and an image related to a movie ticket belongs to the movie ticket category. The number of categories and their hierarchy can be set and divided according to the actual situation and are not detailed here.
That is, the image to be recognized undergoes, in turn, the visual feature similarity ranking of step 202 and a content category proximity ranking, finally yielding a video key frame ordering that is similar to the image to be recognized in both visual features and content category, so that the similarity between the key frames and the image to be recognized is improved through multiple features that influence similarity.
The content category to which the image content belongs can be recognized in various ways: by extracting the image features of the image to be recognized, matching them against preset template image features of each category, and deciding category membership from the degree of matching; by performing semantic recognition on the image to determine the types of objects it contains and then deciding the category from the number and area of objects of each type; or by training a deep learning model that outputs the category of an input image, such as a convolutional neural network or a Senet network (a deep learning network for image processing), using sample images annotated with their categories.
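Of the recognition routes listed above, the template-matching route can be sketched as follows (a hedged illustration: the category names and template features are invented placeholders; a real system would learn or precompute these from data):

```python
import numpy as np

# Hypothetical per-category template features; the patent leaves the
# actual feature extractor and template construction unspecified.
templates = {
    "landscape": np.array([0.9, 0.1, 0.0]),
    "movie_ticket": np.array([0.1, 0.8, 0.1]),
    "lottery_ticket": np.array([0.0, 0.2, 0.8]),
}

def classify_by_template(image_feat: np.ndarray) -> str:
    """Assign the category whose template feature best matches the image
    feature, using cosine similarity as the degree of matching."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(templates, key=lambda c: cos(image_feat, templates[c]))
```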
Step 204: and respectively determining the matched video to which each video key frame belongs in the adjusted video key frame sequence to obtain the matched video sequence.
On the basis of step 203, in this step the execution body determines, for each video key frame in the adjusted key frame ordering, the matching video to which it belongs, obtaining the matching video sequence; that is, the matching video corresponding to each key frame inherits that frame's position in the key frame ordering.
Furthermore, after the matching video sequence corresponding to the image to be recognized is determined, it can be returned to the terminal that provided the image, so that the terminal can further screen out the videos that meet expectations from the received matching video sequence.
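The order inheritance described in this step can be sketched as follows (a simplification: the frame identifiers and the frame-to-video mapping are hypothetical, and a video appearing under several key frames is kept at its best-ranked position):

```python
def match_videos(ordered_keyframes, frame_to_video):
    """Map each key frame (in adjusted order) to its source video,
    keeping only the first occurrence of each video so the matching
    video sequence inherits the key-frame ordering."""
    seen, videos = set(), []
    for frame in ordered_keyframes:
        vid = frame_to_video[frame]
        if vid not in seen:
            seen.add(vid)
            videos.append(vid)
    return videos
```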
According to the method for identifying videos provided by this embodiment, the accuracy of the screened video key frames and of their similarity ordering is progressively improved using both visual image similarity and the category of the image content; that is, features from two different angles are jointly used to judge which video key frames better match the image to be recognized, so that the matching videos determined from the adjusted key frame ordering match the image to be recognized more closely.
On the basis of the previous embodiment, when what the user transmits is not directly an image to be recognized but, for example, a video to be recognized, a key frame can be extracted from that video and used as the image to be recognized for subsequent processing; if time indication information is received along with the video, the target video frame corresponding to that time can be used as the image to be recognized instead.
Referring to fig. 3, fig. 3 is a flowchart of another method for identifying a video according to an embodiment of the present application, wherein the process 300 includes the following steps:
step 301: acquiring an image to be identified;
this step is consistent with step 201 of the process 200, and please refer to step 201 for corresponding explanation, which is not described herein again.
Step 302: inputting an image to be recognized into a preset image similarity calculation model;
step 303: receiving image similarity between each video key frame in a preset video key frame set output by an image similarity calculation model and an image to be identified;
step 304: the method comprises the steps of obtaining video key frames with the preset number of image similarity to generate a video key frame set;
Steps 302 to 304 aim to have the execution body output, through a pre-trained image similarity calculation model, the video key frames whose visual features are most similar to those of the image to be recognized, finally obtaining a video key frame set ordered by similarity.
The image similarity calculation model may be embodied as a residual network, trained on sample images and similarity annotations between pairs of sample images. The number of outputs can further be set to 30, 50, 70, and so on; that is, the resulting video key frame set contains the key frames whose similarity ranks in the top 30, 50, or 70.
Step 305: performing semantic recognition operation aiming at image content on an image to be recognized by using a preset image classification model, and determining a content type according to an obtained semantic recognition result;
the execution main body performs semantic recognition operation aiming at image content on the image to be recognized by using a preset image classification model, and determines the content category according to the obtained semantic recognition result.
The image classification model may specifically be a Senet network trained on sample images and annotations of the image content categories they contain; since the Senet network structure is designed for image processing, this characteristic improves the accuracy of the processing result.
Step 306: adjusting the current arrangement sequence of a plurality of video key frames in the video key frame set according to the degree of proximity to the content category to obtain the adjusted video key frame sequence;
step 307: and respectively determining the matched video to which each video key frame belongs in the adjusted video key frame sequence to obtain the matched video sequence.
Steps 306 and 307 above are the same as the second half of step 203 and step 204 shown in fig. 2; for the identical parts, refer to the corresponding parts of the previous embodiment, which are not repeated here.
On the basis of the previous embodiment, steps 302 to 304 provide a specific implementation of step 202 in the process 200: a set number of video key frames visually most similar to the image to be recognized are output by an image similarity calculation model trained on the definition of image similarity in the actual scenario, making the output more accurate. Step 305 provides a specific implementation of the category identification in step 203 of the process 200, realized by an image classification model trained on which category given image content should belong to in the actual scenario, improving the accuracy of the category recognition result as much as possible.
For further understanding, the present application also provides a specific implementation scheme in combination with a specific application scenario, please refer to a process 400 shown in fig. 4, which includes the following steps:
step 401: acquiring an image to be identified uploaded by a user;
in practice, a user may construct a query (query) sentence containing or enabling a server to obtain an image to be identified, for example, the query sentence contains a network address or a network link where the image to be identified is located.
Step 402: obtaining the 128-dimensional visual characteristics of the image to be recognized by using the trained residual error network;
the residual error network used in the embodiment is improved in a standard residual error network structure, so that the improved residual error network outputs visual features with fewer dimensionalities, and comparison of subsequent similar images is performed based on the features with fewer dimensionalities and more visual difference. Specifically, the standard residual error network may output a 1024-dimensional feature, and since some features irrelevant to the similarity of the determination image may be included in the 1024-dimensional feature, and since the difficulty in comparing the 1024-dimensional feature is high, two convolution layers and a Linear rectification Unit (reduce) layer are added after the full connection layer of the standard residual error network in this embodiment, so as to perform dimension reduction processing on the original 1024-dimensional feature.
Step 403: obtaining the 256-dimensional category features of the image to be recognized by using the trained SENet;
Similar to the residual network in step 402, the SENet used in this embodiment is modified in the same way, so that it finally outputs a single 256-dimensional feature characterizing the category.
As an example, a Chinese welfare lottery ticket and a Maoyan movie ticket are similar in visual characteristics: they are substantially the same size, both carry a large red title followed by similar small black text, and both have a white background. On the purely vision-based 128-dimensional features alone, the two might therefore be taken as similar images; but once their categories are identified, they are found to belong to the lottery-ticket class and the movie-ticket class respectively, which helps improve the accuracy of the similar images found for the image to be identified.
Step 404: splicing the 128-dimensional visual features and the 256-dimensional category features to obtain 384-dimensional comprehensive features;
On the basis of steps 402 and 403, this step aims to splice the 128-dimensional visual features and the 256-dimensional category features end to end, obtaining a 384-dimensional comprehensive feature that contains both the visual features and the category features.
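The splicing in this step is a plain end-to-end concatenation, which can be sketched as follows (the `splice_features` helper name is illustrative, not from the patent):

```python
import numpy as np

def splice_features(visual, category):
    """Splice (concatenate) the visual and category features end to end."""
    combined = np.concatenate([visual, category])
    # The result keeps the visual part first, then the category part.
    return combined

rng = np.random.default_rng(0)
visual_128 = rng.standard_normal(128)
category_256 = rng.standard_normal(256)

combined_384 = splice_features(visual_128, category_256)
print(combined_384.shape)  # (384,)
```

Because concatenation preserves order, a similarity measure over the 384-d vector implicitly weights the two parts by their dimensionality; any re-weighting scheme would be an addition beyond what the patent states.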
Step 405: searching, in the video cover set by using the GNOIMI retrieval model, for the top 40 similar video covers ranked by similarity to the 384-dimensional comprehensive feature of the image to be identified;
On the basis of step 404, this step aims for the above-described executing subject to find, specifically using the GNOIMI retrieval model, the top 40 similar video covers in the video cover set ranked by similarity to the 384-dimensional comprehensive feature of the image to be recognized.
Compared with conventional retrieval models, the GNOIMI retrieval model provides an Approximate Nearest Neighbor (ANN) retrieval method: the whole space is divided into many small subspaces; during a search, one or a few subspaces are quickly located in some manner and traversal is performed only within them, achieving sub-linear computational complexity and significantly improving retrieval efficiency.
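The subspace idea behind such ANN retrieval can be sketched with a simple single-level inverted-file index in NumPy. This is a toy coarse quantizer, not GNOIMI itself (GNOIMI uses a generalized non-orthogonal inverted multi-index); the cell count, `nprobe`, and all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_index(database, centroids):
    """Assign each database vector to its nearest centroid (one cell per centroid)."""
    d2 = ((database[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    return {c: np.where(assign == c)[0] for c in range(len(centroids))}

def ann_search(query, database, centroids, index, nprobe=2, topk=5):
    """Probe only the nprobe nearest cells, then scan those cells exhaustively."""
    d2c = ((centroids - query) ** 2).sum(-1)
    cells = d2c.argsort()[:nprobe]                   # quickly lock the subspaces
    cand = np.concatenate([index[c] for c in cells]) # traverse only inside them
    d2 = ((database[cand] - query) ** 2).sum(-1)
    return cand[d2.argsort()[:topk]]

db = rng.standard_normal((1000, 384))
cents = db[rng.choice(1000, 16, replace=False)]      # 16 cells as an example
idx = build_index(db, cents)
q = db[42] + 0.01 * rng.standard_normal(384)         # query near item 42
print(ann_search(q, db, cents, idx))
```

Only the probed cells are scanned, which is where the sub-linear behavior comes from; the trade-off is that a true nearest neighbor falling in an unprobed cell is missed, hence "approximate".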
The video cover set is composed of the covers of all videos stored in the video data, including both horizontal and vertical videos. Meanwhile, to strengthen the association between a video cover and its video, the cover may be required to be an image of some frame in the video; when a cover is not an image of any frame in its video, another image extracted at some time position of the video may be added as an alternative reference image for that cover.
Step 406: returning the videos corresponding to the top 40 similar video covers to the user as the recognition result.
The scheme given by the process 400 splices the visual features and the category features first and then matches similar images based on the spliced comprehensive features. In practice, another implementation may also be adopted: first perform a preliminary match using the visual features to obtain a video key frame set, and then correct the similarity ranking of that set by combining the content category features. For example, when the image similarity is obtained by comparing preset first-dimension features, the content category of the image to be identified is expressed as a preset second-dimension feature; the first-dimension feature of each video key frame in the video key frame set is then spliced with its second-dimension feature to obtain a spliced feature; finally, a comprehensive feature similarity is calculated from the spliced features of the video key frames and the spliced feature of the image to be identified, yielding all video key frames arranged in descending order of comprehensive feature similarity.
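The two-stage alternative described above (a preliminary match on visual features, then re-ranking on spliced features) can be sketched as follows. The data are random stand-ins, the candidate-set size of 10 is arbitrary, and cosine similarity is an assumption, since the patent does not fix a specific similarity function:

```python
import numpy as np

rng = np.random.default_rng(3)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Hypothetical key-frame features: 128-d visual + 256-d category per frame.
vis = rng.standard_normal((50, 128))
cat = rng.standard_normal((50, 256))
q_vis = rng.standard_normal(128)
q_cat = rng.standard_normal(256)

# Stage 1: preliminary match on visual features only -> candidate key frames.
stage1 = np.argsort([-cosine(v, q_vis) for v in vis])[:10]

# Stage 2: re-rank the candidates by similarity of the spliced 384-d features.
q_full = np.concatenate([q_vis, q_cat])
full = np.concatenate([vis[stage1], cat[stage1]], axis=1)
order = np.argsort([-cosine(f, q_full) for f in full])
reranked = stage1[order]
print(reranked)
```

Note that stage 2 only reorders the candidate set; any frame pruned in stage 1 cannot reappear, which is the cost paid for not computing spliced features over the whole corpus.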
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for identifying a video, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for identifying a video of the present embodiment may include: an image to be recognized acquisition unit 501, a video key frame set determination unit 502, a ranking adjustment unit 503, and a matching video determination unit 504. The image to be recognized acquiring unit 501 is configured to acquire an image to be recognized; a video key frame set determining unit 502 configured to screen out a video key frame set similar to the image to be identified according to the image similarity, wherein a plurality of video key frames in the video key frame set are arranged in order of the image similarity with the image to be identified; the sorting adjustment unit 503 is configured to determine a content category to which image content in the image to be identified belongs, and adjust the current arrangement order of the plurality of video key frames in the video key frame set according to the proximity degree with the content category to obtain an adjusted video key frame sorting; the matching video determining unit 504 is configured to determine a matching video to which each video key frame in the adjusted video key frame sequence belongs, respectively, to obtain a matching video sequence.
In the apparatus 500 for identifying a video of the present embodiment, for the specific processing and technical effects of the image to be recognized acquiring unit 501, the video key frame set determining unit 502, the sorting adjustment unit 503 and the matching video determining unit 504, reference may be made to the related descriptions of steps 201 to 204 in the embodiment corresponding to fig. 2; details are not repeated here.
In some optional implementations of this embodiment, the video key frame set determination unit 502 may be further configured to:
inputting an image to be recognized into a preset image similarity calculation model;
receiving image similarity between each video key frame in a preset video key frame set output by an image similarity calculation model and an image to be identified;
and taking a set number of video key frames ranked highest by image similarity to generate the video key frame set.
In some optional implementation manners of this embodiment, the image similarity calculation model is additionally provided with 2 convolution layers and 1 linear rectifying layer which are sequentially connected after the fully-connected layer.
In some optional implementations of the present embodiment, the sorting adjustment unit 503 may include a category identification subunit configured to determine a content category to which the image content in the image to be identified belongs, and the category identification subunit may be further configured to:
and performing semantic recognition operation aiming at image content on the image to be recognized by using a preset image classification model, and determining the content type according to the obtained semantic recognition result.
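The mapping from the classification model's output to a content category can be sketched as a softmax-and-argmax step. The category labels and logit values here are hypothetical, echoing the lottery-ticket/movie-ticket example given earlier; the patent does not specify the label set or the model's output format:

```python
import numpy as np

CATEGORIES = ["lottery ticket", "movie ticket", "poster"]  # hypothetical labels

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def content_category(logits):
    """Map classifier logits to the content category with the highest probability."""
    probs = softmax(np.asarray(logits, dtype=float))
    return CATEGORIES[int(probs.argmax())]

print(content_category([0.2, 2.5, -1.0]))  # "movie ticket"
```

The argmax alone would suffice for picking a category; the softmax is kept because a probability score is useful if a "degree of proximity to the content category" is needed downstream, as in the ranking-adjustment step.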
In some optional implementations of this embodiment, the image classification model is additionally provided with 2 convolutional layers and 1 linear rectifying layer which are connected in sequence after the fully-connected layer.
In some optional implementations of the present embodiment, the image to be recognized acquiring unit 501 may be further configured to:
and in response to receiving the incoming video to be identified, extracting a key frame to be identified from the video to be identified, and taking the key frame to be identified as an image to be identified.
In some optional implementations of the present embodiment, the image to be recognized acquiring unit 501 may be further configured to:
and in response to the received video to be recognized and the time indication information, taking a target video frame corresponding to the time indication information in the video to be recognized as an image to be recognized.
In some optional implementations of this embodiment, the apparatus 500 for identifying a video may further include:
The feature expression unit is configured to, in response to the image similarity being obtained by comparison of a preset first-dimension feature, express the content category as a preset second-dimension feature; and
the sorting adjustment unit comprises a sorting adjustment subunit configured to adjust the current arrangement order of the plurality of video key frames in the video key frame set according to the degree of proximity to the content category to obtain the adjusted video key frame sorting, and the sorting adjustment subunit comprises:
the characteristic splicing module is configured to splice the first dimensional degree characteristic and the second dimensional degree characteristic of each video key frame in the video key frame set to obtain a spliced characteristic;
and the comprehensive feature similarity calculation and sorting module is configured to calculate the comprehensive feature similarity from the spliced features of the video key frames and the spliced feature of the image to be recognized, and to obtain the adjusted video key frame ranking arranged in descending order of comprehensive feature similarity.
In some optional implementations of this embodiment, the integrated feature similarity calculation and sorting module includes an integrated feature similarity calculation operator module configured to calculate an integrated feature similarity according to the post-stitching feature of the video key frame and the post-stitching feature of the image to be recognized, and the integrated feature similarity calculation operator module is further configured to:
and calculating to obtain the feature similarity between the spliced features of the video key frame and the image to be recognized by using a preset retrieval model, so as to obtain the comprehensive feature similarity.
This embodiment exists as the apparatus counterpart of the above method embodiment. The apparatus for identifying a video provided in this embodiment of the present application progressively improves the accuracy of the screened video key frames and their similarity ranking, using first the visual image similarity and then the category to which the image content belongs; that is, features from two different angles are jointly used to judge which video key frames better match the image to be identified, so that the matching videos determined from the adjusted video key frame ranking match the image to be identified more closely.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Fig. 6 shows a block diagram of an electronic device suitable for implementing the method for identifying videos of the embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for identifying videos provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for identifying videos provided by the present application.
The memory 602, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for identifying a video in the embodiment of the present application (for example, the image to be identified acquiring unit 501, the video key frame set determining unit 502, the ranking adjusting unit 503, and the matching video determining unit 504 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing, i.e., implements the method for identifying a video in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store various types of data created by the electronic device in performing the method for recognizing the video, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to an electronic device adapted to perform the method for identifying videos. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device adapted to perform the method for recognizing a video may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device adapted to perform the method for identifying videos; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in conventional physical host and Virtual Private Server (VPS) services.
According to the method and device of the present application, the accuracy of the screened video key frames and their similarity ranking is progressively improved using first the visual image similarity and then the category to which the image content belongs; that is, features from two different angles are jointly used to judge which video key frames better match the image to be recognized, so that the matching videos determined from the adjusted video key frame ranking match the image to be recognized more closely.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A method for identifying videos, comprising:
acquiring an image to be identified;
screening out a video key frame set similar to the image to be identified according to the image similarity, wherein a plurality of video key frames in the video key frame set are arranged in sequence according to the image similarity of the image to be identified;
determining the content category to which the image content in the image to be identified belongs, and adjusting the current arrangement sequence of a plurality of video key frames in the video key frame set according to the proximity degree of the content category to obtain the adjusted video key frame sequence;
and respectively determining the matched video to which each video key frame in the adjusted video key frame sequence belongs to obtain the matched video sequence.
2. The method according to claim 1, wherein the screening out the video key frame set similar to the image to be identified according to the image similarity comprises:
inputting the image to be recognized into a preset image similarity calculation model;
receiving image similarity between each video key frame in a preset video key frame set output by the image similarity calculation model and the image to be identified;
and taking a preset number of video key frames ranked highest by image similarity to generate the video key frame set.
3. The method according to claim 2, wherein the image similarity calculation model is added with 2 convolutional layers and 1 linear rectifying layer which are connected in sequence after the fully connected layer.
4. The method of claim 1, wherein the determining a content category to which image content in the image to be recognized belongs comprises:
and performing semantic recognition operation aiming at image content on the image to be recognized by utilizing a preset image classification model, and determining the content category according to the obtained semantic recognition result.
5. The method of claim 4, wherein the image classification model is added with 2 convolutional layers and 1 linear rectifying layer connected in sequence after the fully connected layer.
6. The method of claim 1, wherein the acquiring an image to be identified comprises:
in response to receiving an incoming video to be identified, extracting a key frame to be identified from the video to be identified, and taking the key frame to be identified as the image to be identified.
7. The method of claim 1, wherein the acquiring an image to be identified comprises:
in response to receiving an incoming video to be identified and time indication information, taking a target video frame corresponding to the time indication information in the video to be identified as the image to be identified.
8. The method of any of claims 1 to 7, wherein, in response to the image similarity being obtained by comparison of a preset first-dimension feature, the method further comprises:
expressing the content category as a preset second dimension degree characteristic; and
the adjusting the current arrangement order of the plurality of video key frames in the video key frame set according to the degree of closeness to the content category to obtain the adjusted video key frame order comprises:
splicing the first dimensional degree feature and the second dimensional degree feature of each video key frame in the video key frame set to obtain spliced features;
and calculating to obtain comprehensive feature similarity according to the spliced features of the video key frames and the spliced features of the images to be recognized, and obtaining the adjusted video key frame sequence which is arranged from large to small according to the comprehensive similarity.
9. The method of claim 8, wherein the calculating a comprehensive feature similarity according to the stitched features of the video keyframes and the stitched features of the images to be identified comprises:
and calculating to obtain the feature similarity between the spliced features of the video key frame and the image to be recognized by using a preset retrieval model, so as to obtain the comprehensive feature similarity.
10. An apparatus for identifying video, comprising:
an image to be recognized acquisition unit configured to acquire an image to be recognized;
the video key frame set determining unit is configured to screen out a video key frame set similar to the image to be identified according to image similarity, and a plurality of video key frames in the video key frame set are arranged in the order of the image similarity with the image to be identified;
the ordering adjusting unit is configured to determine a content category to which image content in the image to be identified belongs, and adjust the current arrangement order of a plurality of video key frames in the video key frame set according to the proximity degree of the content category to obtain an adjusted video key frame ordering;
and the matching video determining unit is configured to respectively determine the matching video to which each video key frame in the adjusted video key frame sequence belongs, so as to obtain the matching video sequence.
11. The apparatus of claim 10, wherein the video keyframe set determination unit is further configured to:
inputting the image to be recognized into a preset image similarity calculation model;
receiving image similarity between each video key frame in a preset video key frame set output by the image similarity calculation model and the image to be identified;
and taking a preset number of video key frames ranked highest by image similarity to generate the video key frame set.
12. The apparatus according to claim 11, wherein the image similarity calculation model adds 2 convolutional layers and 1 linear rectifying layer connected in sequence after the fully connected layer.
13. The apparatus of claim 10, wherein the ordering adjustment unit comprises a category identification subunit configured to determine a category of content to which image content in the image to be identified belongs, the category identification subunit being further configured to:
and performing semantic recognition operation aiming at image content on the image to be recognized by utilizing a preset image classification model, and determining the content category according to the obtained semantic recognition result.
14. The apparatus of claim 13, wherein the image classification model is added with 2 convolutional layers and 1 linear rectifying layer connected in sequence after the fully connected layer.
15. The apparatus of claim 10, wherein the image to be recognized acquiring unit is further configured to:
in response to receiving an incoming video to be identified, extracting a key frame to be identified from the video to be identified, and taking the key frame to be identified as the image to be identified.
16. The apparatus of claim 10, wherein the image to be recognized acquiring unit is further configured to:
in response to receiving an incoming video to be identified and time indication information, taking a target video frame corresponding to the time indication information in the video to be identified as the image to be identified.
17. The apparatus of any of claims 10 to 16, further comprising:
the feature expression unit is configured to, in response to the image similarity being obtained by comparison of a preset first-dimension feature, express the content category as a preset second-dimension feature; and
the sorting adjustment unit includes a sorting adjustment subunit configured to adjust a current arrangement order of a plurality of video key frames in the video key frame set according to a degree of proximity to the content category to obtain an adjusted video key frame sorting, and the sorting adjustment subunit includes:
the feature splicing module is configured to splice the first dimensional degree feature and the second dimensional degree feature of each video key frame in the video key frame set to obtain a spliced feature;
and the comprehensive characteristic similarity calculation and sequencing module is configured to calculate to obtain comprehensive characteristic similarity according to the spliced characteristics of the video key frames and the spliced characteristics of the images to be identified, and obtain the adjusted video key frame sequencing which is arranged from large to small according to the comprehensive similarity.
18. The apparatus of claim 17, wherein the integrated feature similarity calculation and ranking module comprises an integrated feature similarity operator module configured to calculate an integrated feature similarity from the post-stitching features of the video keyframes and the post-stitching features of the images to be identified, the integrated feature similarity operator module further configured to:
and calculating to obtain the feature similarity between the spliced features of the video key frame and the image to be recognized by using a preset retrieval model, so as to obtain the comprehensive feature similarity.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for identifying videos of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for identifying videos of any one of claims 1-9.
CN202011359490.6A 2020-11-27 2020-11-27 Method and device for identifying video, electronic equipment and readable storage medium Pending CN112487242A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011359490.6A CN112487242A (en) 2020-11-27 2020-11-27 Method and device for identifying video, electronic equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN112487242A true CN112487242A (en) 2021-03-12

Family

ID=74936077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011359490.6A Pending CN112487242A (en) 2020-11-27 2020-11-27 Method and device for identifying video, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112487242A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893423A (en) * 2015-12-03 2016-08-24 乐视网信息技术(北京)股份有限公司 Video recommendation method and system, terminal and server
CN107368614A (en) * 2017-09-12 2017-11-21 重庆猪八戒网络有限公司 Image search method and device based on deep learning
CN110909205A (en) * 2019-11-22 2020-03-24 北京金山云网络技术有限公司 Video cover determination method and device, electronic equipment and readable storage medium
CN111339369A (en) * 2020-02-25 2020-06-26 佛山科学技术学院 Video retrieval method, system, computer equipment and storage medium based on depth features
CN111581435A (en) * 2020-05-25 2020-08-25 北京达佳互联信息技术有限公司 Video cover image generation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shuang Kai: "Computer Vision (《计算机视觉》)", Beijing University of Posts and Telecommunications Press, 30 January 2020 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139093A (en) * 2021-05-06 2021-07-20 北京百度网讯科技有限公司 Video search method and apparatus, computer device, and medium
CN113361344A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Video event identification method, device, equipment and storage medium
CN113361344B (en) * 2021-05-21 2023-10-03 北京百度网讯科技有限公司 Video event identification method, device, equipment and storage medium
CN113779303A (en) * 2021-11-12 2021-12-10 腾讯科技(深圳)有限公司 Video set indexing method and device, storage medium and electronic equipment
CN113779303B (en) * 2021-11-12 2022-02-25 腾讯科技(深圳)有限公司 Video set indexing method and device, storage medium and electronic equipment
CN115690615A (en) * 2022-10-11 2023-02-03 杭州视图智航科技有限公司 Deep learning target identification method and system for video stream
CN115690615B (en) * 2022-10-11 2023-11-03 杭州视图智航科技有限公司 Video stream-oriented deep learning target recognition method and system
CN117177006A (en) * 2023-09-01 2023-12-05 湖南广播影视集团有限公司 CNN algorithm-based short video intelligent manufacturing method
CN117541764A (en) * 2024-01-09 2024-02-09 北京大学 Image stitching method, electronic equipment and storage medium
CN117541764B (en) * 2024-01-09 2024-04-05 北京大学 Image stitching method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11341366B2 (en) Cross-modality processing method and apparatus, and computer storage medium
CN112487242A (en) Method and device for identifying video, electronic equipment and readable storage medium
US20210216561A1 (en) Information search method and apparatus, device and storage medium
CN111125435B (en) Video tag determination method and device and computer equipment
CN108776676B (en) Information recommendation method and device, computer readable medium and electronic device
US20150339348A1 (en) Search method and device
US10380461B1 (en) Object recognition
CN104298429A (en) Information presentation method based on input and input method system
CN111639228B (en) Video retrieval method, device, equipment and storage medium
CN111949814A (en) Searching method, searching device, electronic equipment and storage medium
CN111967302A (en) Video tag generation method and device and electronic equipment
CN113704507B (en) Data processing method, computer device and readable storage medium
US11789997B2 (en) Image recognition method and apparatus, electronic device, and medium
CN111611990A (en) Method and device for identifying table in image
CN110825928A (en) Searching method and device
CN112000834A (en) Document processing method, device, system, electronic equipment and storage medium
EP2947584A1 (en) Multimodal search method and device
CN111309200B (en) Method, device, equipment and storage medium for determining extended reading content
CN112100530B (en) Webpage classification method and device, electronic equipment and storage medium
CN111523019B (en) Method, apparatus, device and storage medium for outputting information
CN110738261B (en) Image classification and model training method and device, electronic equipment and storage medium
KR102408256B1 (en) Method for Searching and Device Thereof
KR20150101846A (en) Image classification service system based on a sketch user equipment, service equipment, service method based on sketch and computer readable medium having computer program recorded therefor
CN112579868A (en) Multi-modal graph recognition searching method, device, equipment and storage medium
CN112052352B (en) Video ordering method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination