CN113642392A - Target searching method and device - Google Patents

Target searching method and device

Info

Publication number
CN113642392A
CN113642392A
Authority
CN
China
Prior art keywords
target
graph
candidate
searched
targets
Prior art date
Legal status
Granted
Application number
CN202110767455.6A
Other languages
Chinese (zh)
Other versions
CN113642392B (en)
Inventor
Yang Hua (杨华)
Liu Chuang (刘创)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110767455.6A
Publication of CN113642392A
Application granted
Publication of CN113642392B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention discloses a target searching method and device. The method comprises: obtaining a feature expression for each target in a target video frame, and constructing, based on these feature expressions, a source graph in which the target to be searched is the central node, the other targets are context nodes, and edges point from the context nodes to the central node; obtaining a feature expression for each candidate target in a candidate video frame, determining the candidate target corresponding to each target based on the feature expressions of the candidate targets and of the targets, and constructing a target graph in which the candidate target corresponding to the target to be searched is the central node, the candidate targets corresponding to the other targets are context nodes, and edges point from the context nodes to the central node; obtaining graph embedding vectors of the source graph and the target graph with a twin residual graph convolutional neural network; and determining the target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph. The scheme of the invention improves the accuracy of target search and has strong robustness.

Description

Target searching method and device
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to a target search method and apparatus, a computer device, and a computer-readable storage medium.
Background
Generally, target searching may be defined as the process of locating a specific target among a vast number of surveillance video frames. Target search is an automatic detection and identification technology that can quickly locate a target of interest in a surveillance network. It combines target detection with target re-identification, is a key technology in intelligent video surveillance, and has attracted wide attention in the computer vision field in recent years. Taking pedestrians as an example, pedestrian search has important application value in security scenarios such as locating persons of interest, as well as in human behavior analysis.
Currently, the individual similarity between the target to be searched and each candidate target is usually computed from their features, and the candidate targets are then ranked by this similarity to carry out the search. However, in surveillance video a target's appearance usually changes greatly with viewing angle, illumination, and occlusion. For example, the appearance of a pedestrian may differ greatly across video frames depending on posture, viewing angle, illumination, and whether the pedestrian is occluded. As a result, current target search methods achieve low accuracy.
Therefore, how to search for a target in video while improving search accuracy has become one of the problems to be solved.
Disclosure of Invention
The invention provides a target searching method, a target searching device, computer equipment and a computer readable storage medium, which are used for accurately searching targets in different video frames.
The invention provides a target searching method, which comprises the following steps:
acquiring a feature expression for each target in a target video frame, and constructing, based on the feature expressions, a source graph in which the target to be searched is a central node, the other targets are context nodes, and edges point from the context nodes to the central node;
acquiring a feature expression for each candidate target in a candidate video frame, determining the candidate target corresponding to each target based on the feature expressions of the candidate targets and the feature expressions of the targets, and constructing a target graph in which the candidate target corresponding to the target to be searched is a central node, the candidate targets corresponding to the other targets are context nodes, and edges point from the context nodes to the central node, wherein the candidate target corresponding to each target is the candidate target most similar to that target;
obtaining a graph embedding vector of the source graph and a graph embedding vector of the target graph based on a twin residual graph convolutional neural network;
determining a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph.
Optionally, the determining a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph includes:
calculating graph similarity of a target to be searched and a candidate target corresponding to the target to be searched based on the graph embedding vector of the source graph and the graph embedding vector of the target graph;
and determining the target to be searched in the candidate video frame based on the image similarity of the target to be searched and the candidate target corresponding to the target to be searched.
Optionally, the determining a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph includes:
calculating graph similarity of a target to be searched and a candidate target corresponding to the target to be searched based on the graph embedding vector of the source graph and the graph embedding vector of the target graph;
correcting the similarity between the target to be searched and the candidate target corresponding to the target to be searched based on the graph similarity between the target to be searched and the candidate target corresponding to the target to be searched so as to obtain the similarity between the target to be searched and the candidate target corresponding to the target to be searched after correction;
and determining the target to be searched in the candidate video frame based on the similarity between the corrected target to be searched and the candidate target corresponding to the target to be searched.
Optionally, the adjacency matrix of the source graph and the target graph is defined by the following formula:
A_ij = (formula presented as an image in the original document)
where A_ij is the element in the i-th row and j-th column of the adjacency matrix, q_1 is the feature vector of the central node in the source graph, g_1 is the feature vector of the central node in the target graph, q_j is the feature vector of node j in the source graph, and g_j is the feature vector of node j in the target graph.
Optionally, the twin residual graph convolutional neural network includes at least two graph convolutional layers, each graph convolutional layer defined by the following formula:
Z_{l+1} = σ(A Z_l W_l)
where σ(·) is a nonlinear activation operation, A is the adjacency matrix, Z_l is the input feature matrix of the l-th layer, and W_l contains the learnable parameters of the l-th layer.
Optionally, the obtaining of the feature expression of each target in the target video frame includes:
detecting each target in the target video frame based on the detection network;
acquiring feature expression of each target based on a neural network;
the acquiring the feature expression of each candidate target in the candidate video frame comprises the following steps:
detecting each candidate target in the candidate video frame based on the detection network;
and acquiring the feature expression of each candidate target based on the neural network.
The present invention also provides a target search apparatus, comprising:
the first processing unit is used for acquiring a feature expression for each target in the target video frame, and constructing, based on the feature expressions, a source graph in which the target to be searched is a central node, the other targets are context nodes, and edges point from the context nodes to the central node;
the second processing unit is used for acquiring a feature expression for each candidate target in a candidate video frame, determining the candidate target corresponding to each target based on the feature expressions of the candidate targets and the feature expressions of the targets, and constructing a target graph in which the candidate target corresponding to the target to be searched is a central node, the candidate targets corresponding to the other targets are context nodes, and edges point from the context nodes to the central node, wherein the candidate target corresponding to each target is the candidate target most similar to that target;
the acquisition unit is used for acquiring a graph embedding vector of the source graph and a graph embedding vector of the target graph based on the twin residual graph convolution neural network;
a determining unit, configured to determine a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph.
The invention also provides a computer device comprising at least one processor and at least one memory, wherein the memory stores a computer program which, when executed by the processor, enables the processor to perform the above object search method.
The present invention also provides a computer-readable storage medium in which instructions, when executed by a processor in a device, enable the device to perform the above-described object search method.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the method comprises the steps of firstly obtaining feature expressions of targets in a target video frame, and constructing a source graph which takes the target to be searched in the target video frame as a central node and other targets as context nodes and points to the central node by the context nodes based on the feature expressions of the targets. And then acquiring feature expressions of all candidate targets in the candidate video frame, determining the candidate targets corresponding to all targets based on the feature expressions of all the candidate targets and the feature expressions of all the targets, and constructing a target graph which takes the candidate target corresponding to the target to be searched as a central node, takes the candidate target corresponding to other targets as a context node and points to the central node from the context node, wherein the candidate target corresponding to all the targets is the candidate target most similar to each target. Then, a graph embedding vector of the source graph and a graph embedding vector of the target graph are respectively obtained based on the twin residual graph convolution neural network. Finally, a target to be searched in the candidate video frame is determined based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph. In the process of searching the target, a source graph reflecting the relation between the target to be searched and the contextual target thereof is constructed, and a target graph reflecting the relation between the candidate target corresponding to the target to be searched and the contextual target thereof is constructed. By constructing the source graph and the target graph, the context target in the video frame is taken into consideration, so that the target searching result has higher robustness to target change, and the accuracy of target searching is improved. In addition, the graph embedding vectors of the source graph and the target graph are obtained by adopting the twin residual graph convolution neural network, so that the information of the upper and lower targets in the source graph and the target graph can be effectively integrated, the attenuation of the characteristics can be effectively reduced by adopting the twin residual graph convolution neural network, and the accuracy of target searching is further improved. In addition, the target searching method of the technical scheme of the invention is easy to realize and has strong universality.
Further, after the graph similarity between the target to be searched and its corresponding candidate target is computed from the graph embedding vectors of the source graph and the target graph, this graph similarity can be used to correct the individual similarity between the two, and the target to be searched is then determined in the candidate video based on the corrected similarity. The search thus considers both the individual feature similarity between the target to be searched and its corresponding candidate target and their context-based graph similarity, which improves the accuracy of target search to a great extent.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of a target searching method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating how graph similarity is obtained with the twin residual graph convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process of searching for a target by the target searching method according to the embodiment of the present invention;
FIG. 4 is a diagram illustrating a comparison of search results of a target search method and other target search methods according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a comparison between the search results of the target search method according to the embodiment of the present invention and those of a baseline method;
FIG. 6 is a schematic diagram illustrating the ranking of the search results of the target search method and other target search methods according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background, it is currently common to search for the target to be searched in video frames based only on the individual feature similarities between the target to be searched and the candidate targets, and this yields low search accuracy. Therefore, an embodiment of the invention provides a target searching method. Referring to FIG. 1, a schematic flow chart of the target search method according to an embodiment of the present invention, the method includes:
s101: the method comprises the steps of obtaining feature expressions of targets in a target video frame, and constructing a source graph which takes the target to be searched as a central node, other targets as context nodes and points to the central node by the context nodes based on the feature expressions of the targets.
S102: the method comprises the steps of obtaining feature expressions of candidate targets in a candidate video frame, determining the candidate targets corresponding to the targets based on the feature expressions of the candidate targets and the feature expressions of the targets, and constructing a target graph which takes the candidate targets corresponding to the targets to be searched as central nodes, takes the candidate targets corresponding to other targets as context nodes and points to the central nodes through the context nodes, wherein the candidate targets corresponding to the targets are the candidate targets most similar to each target.
S103: and respectively obtaining a graph embedding vector of the source graph and a graph embedding vector of the target graph based on the twin residual graph convolutional neural network.
S104: determining a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph.
S101 is executed. In this embodiment, the target video frame may be a video frame containing the target to be searched. The target to be searched is given and may be, for example, a pedestrian or a vehicle; taking pedestrians as an example, the pedestrian to be searched may be a particular person in the target video frame. To obtain the feature expression of each target in the target video frame, each target may first be detected with a detection network, which may be a neural network such as R-CNN (Region-CNN) or Faster R-CNN. Feature expressions of the targets are then obtained with a neural network; in this embodiment, the feature expression of a target may be its feature vector. Specifically, a Convolutional Neural Network (CNN) such as Se-ResNet-50 may perform feature extraction on each target detected by Faster R-CNN to obtain its feature vector, which in this embodiment may be a 512-dimensional vector. After the feature vectors are obtained, each target in the target video frame is reduced to a node; the target to be searched (e.g., a particular pedestrian) is taken as the central node, the other targets as context nodes, the central node and the context nodes are connected, and the edges are directed from the context nodes to the central node to establish the source graph.
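As an illustration of this step, the following minimal Python/PyTorch sketch builds such a star-shaped graph from detected feature vectors; the function name and the assumption that features arrive as an (m, 512) tensor are illustrative, not taken from the patent:

import torch

def build_star_graph(features, center_idx):
    # features: (m, 512) tensor, one 512-dimensional feature vector per detected target.
    # Reorder rows so the target to be searched becomes node 0 (the central node);
    # all remaining targets become context nodes.
    order = [center_idx] + [i for i in range(features.size(0)) if i != center_idx]
    nodes = features[order]
    # Directed edges: every context node j points to the central node 0.
    edges = [(j, 0) for j in range(1, nodes.size(0))]
    return nodes, edges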
S102 is executed to acquire the feature expression of each candidate target in the candidate video frame. Similarly, each candidate target may be detected with a detection network, which may be a neural network such as R-CNN (Region-CNN) or Faster R-CNN, and feature expressions of the candidate targets are then obtained with a neural network; in this embodiment, the feature expression of a candidate target may be its feature vector. Specifically, a Convolutional Neural Network (CNN) such as Se-ResNet-50 may perform feature extraction on each candidate target detected by Faster R-CNN to obtain its feature vector. After the feature vectors of the candidate targets are determined, the candidate target corresponding to each target is determined; in this embodiment, the candidate target corresponding to a target is the candidate target most similar to it. With targets and candidate targets reduced to nodes, the correspondence can be determined from the distance, or the similarity, between nodes. In this embodiment the most similar candidate may be found by computing the cosine similarity between each target and each candidate target; in other embodiments the Manhattan distance or the Euclidean distance may be used instead. After the correspondences are determined, the candidate target corresponding to the target to be searched is taken as the central node, the candidate targets corresponding to the other targets as context nodes, the central node and the context nodes are connected, and the edges are directed from the context nodes to the central node to construct the target graph.
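The nearest-candidate matching described here can be sketched as follows, again under the same assumed shapes; cosine similarity is used, matching the embodiment above:

import torch
import torch.nn.functional as F

def match_candidates(target_feats, cand_feats):
    # target_feats: (m, 512); cand_feats: (n, 512).
    # Cosine similarity between every target and every candidate target.
    sim = F.normalize(target_feats, dim=1) @ F.normalize(cand_feats, dim=1).t()
    # For each target, the index of its most similar candidate target.
    best = sim.argmax(dim=1)
    return best, sim

The target graph is then built exactly like the source graph above, with cand_feats[best] supplying the node features.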
Executing S103: graph embedding vectors of the source graph and the target graph are obtained based on the twin residual graph convolutional neural network. After the source graph and the target graph are constructed through S101 and S102, they are input into the twin residual graph convolutional neural network, which includes at least two graph convolutional layers, each defined by the following formula:
Z_{l+1} = σ(A Z_l W_l)
where σ(·) is a nonlinear activation operation, A ∈ R^{m×m} is the adjacency matrix, Z_l ∈ R^{m×d_in} is the input feature matrix of the l-th layer, W_l ∈ R^{d_in×d_out} contains the learnable parameters of the l-th layer, m is the number of nodes in the source graph, d_in is the dimension of the input feature vectors, and d_out is the dimension of the output feature vectors.
In this embodiment, d_in = d_out = 512, and m varies with the source graph.
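Read literally, one such layer might be implemented as below (a sketch only; ReLU is assumed as the nonlinear activation σ, which the patent does not specify):

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    # One layer computing Z_{l+1} = sigma(A Z_l W_l).
    def __init__(self, d_in=512, d_out=512):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)  # the learnable W_l

    def forward(self, A, Z):
        # A: (m, m) adjacency matrix; Z: (m, d_in) node feature matrix.
        return torch.relu(self.W(A @ Z))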
In this embodiment, the twin residual graph convolutional neural network may include three layers; see FIG. 2, a schematic diagram of obtaining graph similarity with the twin residual graph convolutional neural network in the embodiment of the present invention. The network in FIG. 2 comprises GCN-Layer-1, GCN-Layer-2, and GCN-Layer-3. The adjacency matrix A of the source graph and the target graph is defined by the following formula:
A_ij = (formula presented as an image in the original document)
where A_ij is the element in the i-th row and j-th column of the adjacency matrix A, q_1 is the feature vector of the central node in the source graph, g_1 is the feature vector of the central node in the target graph, q_j is the feature vector of node j in the source graph, and g_j is the feature vector of node j in the target graph.
When the graph embedding vector of the source graph is computed with the twin residual graph convolutional neural network, the input feature matrix Z_1 of layer 1 is the feature matrix of the source graph, i.e., the matrix formed by the feature vectors of the nodes in the source graph. Likewise, when the graph embedding vector of the target graph is computed, the input feature matrix Z_1 of layer 1 is the feature matrix of the target graph, i.e., the matrix formed by the feature vectors of the nodes in the target graph.
It should be noted that, in this embodiment, for convenience of description, a target in the target video frame, its feature vector, its graph embedding vector, and the central node formed from it are all denoted by the same symbol: for example, target q_1, the feature vector q_1 of target q_1, the graph embedding vector q_1 of target q_1, and the central node q_1 when target q_1 serves as the central node. Similarly, a candidate target in the candidate video frame, its feature vector, its graph embedding vector, and the central node formed from it are all denoted by the same symbol, such as candidate target g_1, the feature vector g_1 of candidate target g_1, the graph embedding vector g_1 of candidate target g_1, and the central node g_1 when candidate target g_1 serves as the central node.
With continued reference to FIG. 2, the central node of the source graph in FIG. 2 is q_1, i.e., q_1 is the target to be searched; the other targets q_2 to q_5 are context nodes, and each context node q_2 to q_5 is connected to the central node q_1 by an edge pointing from the context node to q_1. In the target graph, g_1 is the candidate target corresponding to the target to be searched q_1, i.e., the candidate target most similar to q_1; g_2 to g_5 are the candidate targets corresponding to q_2 to q_5, respectively, each being the candidate target most similar to its target. In the target graph, the central node is g_1, the other candidate targets g_2 to g_5 are context nodes, and each context node g_2 to g_5 is connected to the central node g_1 by an edge pointing from the context node to g_1.
In FIG. 2, the source graph is input to layer 1 of the GCN (GCN-Layer-1) and is also part of the input to GCN-Layer-3. After the source graph passes through GCN-Layer-1 and GCN-Layer-2, the output of GCN-Layer-2 together with the source graph serves as the input to GCN-Layer-3, which finally outputs the graph embedding vector of the source graph; the graph embedding vector q_1 of the target to be searched q_1 is determined from it.
Similarly, the target graph is input to GCN-Layer-1 and is also part of the input to GCN-Layer-3. After the target graph passes through GCN-Layer-1 and GCN-Layer-2, the output of GCN-Layer-2 together with the target graph serves as the input to GCN-Layer-3, which finally outputs the graph embedding vector of the target graph; the graph embedding vector g_1 of the candidate target g_1 corresponding to the target to be searched q_1 is determined from it.
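The three-layer flow just described might be sketched as follows, reusing GraphConvLayer from the earlier sketch. How the raw graph re-enters at GCN-Layer-3 is not spelled out in the text, so feature concatenation is assumed here (element-wise addition would be an equally plausible reading); the "twin" aspect means the same weights process both graphs:

import torch
import torch.nn as nn

class SRGCN(nn.Module):
    # Three-layer residual GCN sketch; layer names follow FIG. 2.
    def __init__(self, d=512):
        super().__init__()
        self.layer1 = GraphConvLayer(d, d)
        self.layer2 = GraphConvLayer(d, d)
        # GCN-Layer-3 consumes the GCN-Layer-2 output concatenated with the raw graph.
        self.layer3 = GraphConvLayer(2 * d, d)

    def forward(self, A, Z):
        h = self.layer2(A, self.layer1(A, Z))
        h = self.layer3(A, torch.cat([h, Z], dim=1))  # residual: raw Z re-enters here
        return h  # (m, d); row 0 is the central node's graph embedding vector

# Twin usage: the same network (same weights) embeds both graphs.
# net = SRGCN(); q_emb = net(A, Z_source); g_emb = net(A, Z_target)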
In this embodiment, the twin residual graph convolutional neural network is trained with the following cosine embedding loss function, where the embeddings are normalized so that the cosine similarity cos(q_1, g_1) equals the dot product q_1 · g_1:
L(q_1, g_1, y) = 1 - q_1 · g_1, if y = 1; L(q_1, g_1, y) = max(0, q_1 · g_1 - α), if y = -1
where q_1 is the normalized graph embedding vector of the target to be searched, g_1 is the normalized graph embedding vector of the candidate target, α is a threshold, and y = 1 indicates that the target q_1 and the candidate target g_1 are the same object. In this embodiment α may be 0.5.
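This matches the cosine embedding loss available in PyTorch, so training on a graph pair might be sketched as follows (the random tensors are stand-ins for the two central-node embeddings):

import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.5)  # margin plays the role of the threshold alpha

# q_emb, g_emb: graph embedding vectors of the two central nodes;
# y = +1 if they are the same object, -1 otherwise.
q_emb = torch.randn(1, 512, requires_grad=True)
g_emb = torch.randn(1, 512, requires_grad=True)
y = torch.tensor([1.0])
loss = loss_fn(q_emb, g_emb, y)
loss.backward()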
S104 is executed. After the graph embedding vectors of the source graph and the target graph are obtained through the twin residual graph convolutional neural network, the graph similarity between the target to be searched and its corresponding candidate target is first calculated from the graph embedding vector of the source graph and the graph embedding vector of the target graph. Specifically:
The graph embedding vector of the target to be searched is determined from the position of the central node of the source graph within the graph embedding vector of the source graph; likewise, the graph embedding vector of the corresponding candidate target is determined from the position of the central node of the target graph within the graph embedding vector of the target graph. The graph similarity between these two embedding vectors is then calculated: both vectors are normalized, and the cosine similarity between the normalized vectors gives the graph similarity. With continued reference to FIG. 2, after the graph embedding vector q_1 of the target to be searched q_1 and the graph embedding vector g_1 of its corresponding candidate target g_1 are obtained through the twin residual graph convolutional neural network, the two are normalized and their cosine similarity is calculated, giving the graph similarity S_g(q_1, g_1).
In this embodiment, in order to better improve the accuracy of target search, after the graph similarity between the target to be searched and the candidate target corresponding to the target to be searched is determined, the graph similarity is used to correct the similarity between the target to be searched and the candidate target corresponding to the target to be searched so as to obtain the similarity between the target to be searched and the candidate target corresponding to the target to be searched after correction. The similarity between the target to be searched and the candidate target corresponding to the target to be searched can be obtained by normalizing the feature vector of the target to be searched and the feature vector of the candidate target corresponding to the target to be searched, and then calculating the cosine similarity between the normalized feature vector of the target to be searched and the normalized feature vector of the candidate target corresponding to the target to be searched.
In this embodiment, the similarity between the target to be searched and the candidate target corresponding to the target to be searched may be corrected by the following formula:
S(q_1, g_1) = (1 - λ) S_0(q_1, g_1) + λ S_g(q_1, g_1), λ ∈ (0, 1)
where q_1 is the target to be searched, g_1 is its corresponding candidate target, S(q_1, g_1) is the corrected similarity between q_1 and g_1, S_0(q_1, g_1) is the individual similarity between q_1 and g_1, S_g(q_1, g_1) is the graph similarity between q_1 and g_1, and λ is a correction coefficient; in this embodiment λ may be 0.5.
After the corrected similarity between the target to be searched and its corresponding candidate target is obtained, whether the candidate target is the target to be searched can be judged by whether the similarity exceeds a preset threshold. In practical applications, the candidate targets corresponding to the target to be searched in multiple candidate video frames need to be judged: the corrected similarity is calculated for each candidate video frame according to the above formula, and the candidate video frames are then ranked from high to low by this similarity so as to search for the target to be searched across the frames.
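The correction and ranking across candidate frames might be sketched as follows (plain Python; the list names are illustrative):

def rank_candidate_frames(s0_list, sg_list, lam=0.5):
    # s0_list: individual-feature similarities S_0, one per candidate video frame;
    # sg_list: graph similarities S_g, aligned with s0_list.
    corrected = [(1 - lam) * s0 + lam * sg for s0, sg in zip(s0_list, sg_list)]
    # Frame indices ranked from most to least similar.
    ranking = sorted(range(len(corrected)), key=lambda i: corrected[i], reverse=True)
    return corrected, ranking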
In the embodiment, the similarity between the target to be searched and the candidate target corresponding to the target to be searched is corrected according to the graph similarity between the target to be searched and the candidate target corresponding to the target to be searched, so that on one hand, the individual feature similarity between the target to be searched and the candidate target corresponding to the target to be searched is considered, and on the other hand, the graph similarity between the target to be searched and the candidate target corresponding to the target to be searched based on the context target is also considered, and therefore the accuracy of target search is improved to a great extent.
In other embodiments, the target search may also be performed directly according to the graph similarity between the target to be searched and its corresponding candidate target; that is, whether the candidate target is the target to be searched is determined directly from the graph similarity, specifically by judging whether the graph similarity exceeds a preset threshold.
Thus, the search for the target in the video frame is realized through the above-mentioned S101 to S104.
FIG. 3 is a schematic diagram of the process of searching for a target with the target searching method according to the embodiment of the present invention. The technical solution is briefly described below with reference to FIG. 3, taking pedestrian search as an example. Target pedestrians in the target pedestrian video frame and candidate pedestrians in the candidate pedestrian video frame are detected by Faster R-CNN, yielding target pedestrians q_1, q_2, q_3 and candidate pedestrians g_1, g_2, g_3. A CNN such as Se-ResNet-50 then performs feature extraction (CNN Feature Extraction) on the target pedestrians q_1, q_2, q_3 and candidate pedestrians g_1, g_2, g_3, yielding feature vectors q_1, q_2, q_3 and g_1, g_2, g_3, respectively. The similarity between each target pedestrian and each candidate pedestrian is calculated to find the candidate pedestrian most similar to each target pedestrian; in FIG. 3, the candidate pedestrians most similar to q_1, q_2, q_3 are g_1, g_2, g_3, respectively. If the target pedestrian to be searched is q_1, a Source Graph is constructed with q_1 as the central node and q_2, q_3 as context nodes, and a Target Graph is constructed with g_1 as the central node and g_2, g_3 as context nodes. The graph pair composed of the source graph and the target graph (Construction of Graph Pair) is input into the twin residual graph convolutional network (SR-GCN) to obtain the graph similarity S_g(q_1, g_1) between the target pedestrian q_1 and its corresponding candidate pedestrian g_1. The graph similarity S_g(q_1, g_1) is used to correct the similarity S_0(q_1, g_1) between q_1 and g_1 (Graph Similarity Refinement), yielding the corrected similarity S(q_1, g_1), based on which the search is performed, i.e., it is determined whether the candidate pedestrian g_1 is the target pedestrian q_1.
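Tying the earlier sketches together, the end-to-end flow of FIG. 3 might read roughly as below; build_adjacency is a hypothetical helper standing in for the patent's adjacency formula (given only as an image above), and detection plus feature extraction are assumed to have already produced target_feats and cand_feats:

# Hypothetical glue code combining the sketches above.
best, sim = match_candidates(target_feats, cand_feats)       # most similar candidate per target
src_nodes, _ = build_star_graph(target_feats, center_idx=0)  # q_1 as the central node
tgt_nodes, _ = build_star_graph(cand_feats[best], center_idx=0)
A = build_adjacency(src_nodes, tgt_nodes)                    # per the patent's (image) formula
net = SRGCN()
q_emb, g_emb = net(A, src_nodes), net(A, tgt_nodes)          # twin: shared weights
s_g = graph_similarity(q_emb, g_emb)                         # S_g(q_1, g_1)
s_0 = sim[0, best[0]].item()                                 # S_0(q_1, g_1)
s = 0.5 * s_0 + 0.5 * s_g                                    # corrected similarity, lambda = 0.5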
In practical applications, the performance of the target search method of this embodiment can be evaluated by top-1 accuracy and mean Average Precision (mAP). The performance is evaluated below with pedestrians as the targets.
FIG. 4 is a schematic diagram comparing the search results of the target search method according to the embodiment of the present invention with those of other target search methods. In FIG. 4, the two databases PRW and CUHK-SYSU, used for the performance evaluation of the target search method, are provided by (L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian, Person re-identification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1367-1376) and (T. Xiao, S. Li, B. Wang, L. Lin, X. Wang, Joint detection and identification feature learning for person search, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3415-3424), respectively.
In FIG. 4, Ours denotes the top-1 and mAP values obtained when searching for the target pedestrian with the target searching method of the present embodiment. As can be seen from FIG. 4, across the different datasets, the accuracy of the method of the embodiment of the invention in searching for the target pedestrian is significantly higher than that of the other target searching methods.
FIG. 5 is a schematic diagram comparing the search results of the target search method according to the embodiment of the present invention with those of a baseline method. In FIG. 5, Baseline denotes the top-1 and mAP values obtained when the target pedestrian is searched in candidate video frames directly according to the individual-feature similarity computed from the CNN features, and Baseline + SR-GCN denotes the top-1 and mAP values obtained with the target searching method of the embodiment of the invention. As can be seen from FIG. 5, by using the context pedestrians in the video frames as auxiliary information, the method of the embodiment improves the accuracy of target pedestrian search to a large extent.
FIG. 6 is a schematic diagram illustrating the ranking of the search results of the target search method according to the embodiment of the present invention and of other target search methods. In FIG. 6, for each target pedestrian (query), the first row shows the candidate pedestrians ranked by another target search method, and the second row shows the candidate pedestrians ranked by the target search method according to the embodiment of the present invention. Candidate pedestrians in green boxes are the target pedestrian to be searched; candidate pedestrians in blue boxes are incorrect search results. As can be seen from FIG. 6, by using context pedestrians as auxiliary information, the method of the embodiment ranks as many correct results as possible at the top positions even when the appearance of the target pedestrian changes greatly, improving search accuracy. The target search method of the embodiment of the invention thus has high robustness to target changes.
The present invention also provides an object search apparatus, the apparatus comprising:
the first processing unit is used for acquiring the feature expression of each target in the target video frame, and constructing a source graph which takes the target to be searched as a central node and other targets as context nodes and points to the central node by the context nodes based on the feature expression of each target.
And the second processing unit is used for acquiring the feature expression of each candidate target in the candidate video frame, determining the candidate target corresponding to each target based on the feature expression of each candidate target and the feature expression of each target, and constructing a target graph which takes the candidate target corresponding to the target to be searched as a central node, takes the candidate target corresponding to the other target as a context node, and points to the central node from the context node, wherein the candidate target corresponding to each target is a candidate target most similar to each target.
And the acquisition unit is used for acquiring the graph embedding vector of the source graph and the graph embedding vector of the target graph based on the twin residual graph convolution neural network.
A determining unit, configured to determine a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph.
The implementation of the target search apparatus of this embodiment may refer to the implementation of the target search method described above, and details are not described here.
Based on the same technical concept, embodiments of the present invention provide a computer device, which includes at least one processor and at least one memory, wherein the memory stores a computer program, and when the program is executed by the processor, the processor is enabled to execute the above object search method.
Based on the same technical concept, embodiments of the present invention provide a computer-readable storage medium in which instructions, when executed by a processor within a device, enable the device to perform the above-described object search method.
It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (9)

1. A method of searching for an object, comprising:
acquiring a feature expression for each target in a target video frame, and constructing, based on the feature expressions, a source graph in which the target to be searched is a central node, the other targets are context nodes, and edges point from the context nodes to the central node;
acquiring a feature expression for each candidate target in a candidate video frame, determining the candidate target corresponding to each target based on the feature expressions of the candidate targets and the feature expressions of the targets, and constructing a target graph in which the candidate target corresponding to the target to be searched is a central node, the candidate targets corresponding to the other targets are context nodes, and edges point from the context nodes to the central node, wherein the candidate target corresponding to each target is the candidate target most similar to that target;
obtaining a graph embedding vector of a source graph and a graph embedding vector of a target graph based on the twin residual graph convolution neural network;
determining a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph.
2. The method of claim 1, wherein the determining a target to search for in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph comprises:
calculating graph similarity of a target to be searched and a candidate target corresponding to the target to be searched based on the graph embedding vector of the source graph and the graph embedding vector of the target graph;
and determining the target to be searched in the candidate video frame based on the image similarity of the target to be searched and the candidate target corresponding to the target to be searched.
3. The method of claim 1, wherein the determining a target to search for in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph comprises:
calculating graph similarity of a target to be searched and a candidate target corresponding to the target to be searched based on the graph embedding vector of the source graph and the graph embedding vector of the target graph;
correcting the similarity between the target to be searched and the candidate target corresponding to the target to be searched based on the graph similarity between the target to be searched and the candidate target corresponding to the target to be searched so as to obtain the similarity between the target to be searched and the candidate target corresponding to the target to be searched after correction;
and determining the target to be searched in the candidate video frame based on the similarity between the corrected target to be searched and the candidate target corresponding to the target to be searched.
4. The method of claim 1, wherein the adjacency matrix of the source graph and the target graph is defined by the following equation:
A_ij = (formula presented as an image in the original document)
wherein A_ij is the element in the i-th row and j-th column of the adjacency matrix, q_1 is the feature vector of the central node in the source graph, g_1 is the feature vector of the central node in the target graph, q_j is the feature vector of node j in the source graph, and g_j is the feature vector of node j in the target graph.
5. The method of claim 1, wherein the twin residual graph convolutional neural network comprises at least two graph convolutional layers, each graph convolutional layer defined by the formula:
Z_{l+1} = σ(A Z_l W_l)
where σ(·) is a nonlinear activation operation, A is the adjacency matrix, Z_l is the input feature matrix of the l-th layer, and W_l contains the learnable parameters of the l-th layer.
6. The method of claim 1, wherein obtaining the feature representation for each object in the target video frame comprises:
detecting each target in the target video frame based on the detection network;
acquiring feature expression of each target based on a neural network;
the acquiring the feature expression of each candidate target in the candidate video frame comprises the following steps:
detecting each candidate target in the candidate video frame based on the detection network;
and acquiring the feature expression of each candidate target based on the neural network.
7. An object search apparatus, comprising:
the first processing unit is used for acquiring a feature expression for each target in the target video frame, and constructing, based on the feature expressions, a source graph in which the target to be searched is a central node, the other targets are context nodes, and edges point from the context nodes to the central node;
the second processing unit is used for acquiring a feature expression for each candidate target in a candidate video frame, determining the candidate target corresponding to each target based on the feature expressions of the candidate targets and the feature expressions of the targets, and constructing a target graph in which the candidate target corresponding to the target to be searched is a central node, the candidate targets corresponding to the other targets are context nodes, and edges point from the context nodes to the central node, wherein the candidate target corresponding to each target is the candidate target most similar to that target;
the acquisition unit is used for acquiring a graph embedding vector of the source graph and a graph embedding vector of the target graph based on the twin residual graph convolution neural network;
a determining unit, configured to determine a target to be searched in the candidate video frame based on at least the graph embedding vector of the source graph and the graph embedding vector of the target graph.
8. A computer device comprising at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, enables the processor to perform the object search method of any of claims 1-6.
9. A computer readable storage medium having instructions which, when executed by a processor within a device, enable the device to perform the object search method of any of claims 1 to 6.
CN202110767455.6A 2021-07-07 2021-07-07 Target searching method and device Active CN113642392B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767455.6A CN113642392B (en) 2021-07-07 2021-07-07 Target searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110767455.6A CN113642392B (en) 2021-07-07 2021-07-07 Target searching method and device

Publications (2)

Publication Number Publication Date
CN113642392A 2021-11-12
CN113642392B (en) 2023-11-28

Family

ID=78416809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767455.6A Active CN113642392B (en) 2021-07-07 2021-07-07 Target searching method and device

Country Status (1)

Country Link
CN (1) CN113642392B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170024461A1 (en) * 2015-07-23 2017-01-26 International Business Machines Corporation Context sensitive query expansion
CN109408652A (en) * 2018-09-30 2019-03-01 北京搜狗科技发展有限公司 A kind of image searching method, device and equipment
CN110472065A (en) * 2019-07-25 2019-11-19 电子科技大学 Across linguistry map entity alignment schemes based on the twin network of GCN
CN112528624A (en) * 2019-09-03 2021-03-19 阿里巴巴集团控股有限公司 Text processing method and device, search method and processor
CN111260688A (en) * 2020-01-13 2020-06-09 深圳大学 Twin double-path target tracking method
CN112685573A (en) * 2021-01-06 2021-04-20 中山大学 Knowledge graph embedding training method and related device
CN112990295A (en) * 2021-03-10 2021-06-18 中国互联网络信息中心 Semi-supervised graph representation learning method and device based on migration learning and deep learning fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LING CHUNYANG et al.: "Source code retrieval method for software projects based on graph embedding", Journal of Software (《软件学报》), pages 1481-1497 *

Also Published As

Publication number Publication date
CN113642392B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
Li et al. Dual-resolution correspondence networks
CN108960211B (en) Multi-target human body posture detection method and system
CN107577990B (en) Large-scale face recognition method based on GPU (graphics processing Unit) accelerated retrieval
Kejriwal et al. High performance loop closure detection using bag of word pairs
CN110728263A (en) Pedestrian re-identification method based on strong discrimination feature learning of distance selection
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN107329962B (en) Image retrieval database generation method, and method and device for enhancing reality
CN110991321B (en) Video pedestrian re-identification method based on tag correction and weighting feature fusion
CN109919084B (en) Pedestrian re-identification method based on depth multi-index hash
CN109472191A (en) A kind of pedestrian based on space-time context identifies again and method for tracing
CN111125397A (en) Cloth image retrieval method based on convolutional neural network
CN112101156A (en) Target identification method and device and electronic equipment
CN113128518B (en) Sift mismatch detection method based on twin convolution network and feature mixing
CN110992404A (en) Target tracking method, device and system and storage medium
CN116266387A (en) YOLOV4 image recognition algorithm and system based on re-parameterized residual error structure and coordinate attention mechanism
Darmon et al. Learning to guide local feature matches
CN111241326B (en) Image visual relationship indication positioning method based on attention pyramid graph network
CN111582057B (en) Face verification method based on local receptive field
CN113642392B (en) Target searching method and device
Huang et al. Improving keypoint matching using a landmark-based image representation
CN111126436A (en) Visual matching method and device
WO2022252519A1 (en) Image processing method and apparatus, terminal, medium, and program
CN115527050A (en) Image feature matching method, computer device and readable storage medium
CN113706580B (en) Target tracking method, system, equipment and medium based on relevant filtering tracker
CN107341151B (en) Image retrieval database generation method, and method and device for enhancing reality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant