CN107480178A - Pedestrian re-identification method based on cross-modal comparison of image and video - Google Patents
- Publication number
- CN107480178A CN107480178A CN201710536118.XA CN201710536118A CN107480178A CN 107480178 A CN107480178 A CN 107480178A CN 201710536118 A CN201710536118 A CN 201710536118A CN 107480178 A CN107480178 A CN 107480178A
- Authority
- CN
- China
- Prior art keywords
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/738—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7837—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
- G06F16/784—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Library & Information Science (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a pedestrian re-identification method based on cross-modal comparison of image and video, for retrieving, from multiple videos, the videos containing the person who appears in an input query image. The method comprises the following steps: S1, build a configurable depth model; S2, obtain training samples and feed them into the depth model to train it, learning the parameters of each part of the model with the forward and backward algorithms; S3, initialize the depth model with the parameters learned in S2, input the query image to be matched and the multiple videos into the model, and compute a similarity measure between each video and the query image through the model; S4, list the videos whose similarity measure with the query image exceeds a threshold, ranked by the magnitude of the similarity measure. The invention achieves pedestrian re-identification based on cross-modal comparison of image and video while maintaining high accuracy.
Description
Technical field
The present invention relates to the fields of computer vision and pattern recognition, and in particular to a pedestrian re-identification method based on cross-modal comparison of image and video.
Background technology
Pedestrian re-identification is an important basic research problem in computer vision. It originates from person tracking in video: when a tracked person leaves the field of view of one camera and later re-enters a camera's field of view, he or she must be identified again and assigned the same ID as before. With the wide deployment of video surveillance, research on pedestrian re-identification has attracted increasing attention. Today, pedestrian re-identification is no longer limited to recognizing the same person under a single viewpoint; the more common setting is to re-identify a person across different times and different viewpoints.
Most existing research on pedestrian re-identification is confined to similarity comparison between images: the input query is an image, and the search database likewise consists of images. Although this line of research has a long history and has made significant progress, the problem remains a very challenging research topic. The main reason is that illumination, viewing angle, and background differ between cameras; combined with changes in body pose, the same person can look very different in photos captured by different cameras.
With the accelerating pace of smart-city construction, surveillance video containing person information is easy to obtain. In criminal investigation and security applications, a new demand has emerged: retrieving the surveillance videos that contain a person, given an input image of that person. Compared with a static image, a video carries richer information about a person and can characterize the person better. A video also contains multiple frames, so it can cope with occlusion better than a single static image. Moreover, because a video is a sequence of consecutive frames, the continuous spatio-temporal information it contains can assist person recognition. Using video information to perform pedestrian re-identification is therefore a more natural approach.
However, in image-to-video pedestrian re-identification, one difficulty is how to extract and exploit the video information effectively, since a video contains a large amount of information that is redundant with respect to a single image; handled improperly, it can actually reduce recognition accuracy. Another difficulty is how to carry out the cross-modal comparison reasonably, since comparing an image with a video involves two different modalities.
Content of the invention
In view of the above, there is a need to address the problems in the prior art by providing a pedestrian re-identification method based on cross-modal comparison of image and video. By computing the similarity between an image and each video in a video database, the method retrieves all the videos that contain the person corresponding to the input image.
To achieve the above object, the present invention adopts the following technical scheme:
A pedestrian re-identification method based on cross-modal comparison of image and video, for retrieving from multiple videos the videos containing the person in an input query image, comprising the following steps:
S1, build a configurable depth model;
the depth model comprises a convolutional neural network, a long short-term memory (LSTM) network, and a similarity-learning network; the convolutional neural network extracts the image feature of the query image and the frame features of the video; the LSTM network embeds the spatio-temporal information of the video into the extracted frame features and outputs a spatio-temporal video feature containing that information; the similarity-learning network maps the image feature and the spatio-temporal video feature to the same dimension and learns a similarity measure between the two;
S2, obtain training samples and feed them into the depth model to train it; learn the parameters of each part of the model using the forward and backward algorithms;
S3, initialize the depth model with the parameters learned in S2; input the query image to be matched and the multiple videos into the depth model, and compute the similarity measure between each video and the query image through the model;
S4, list the videos whose similarity measure with the query image exceeds a threshold, ranked by the magnitude of the similarity measure.
Further, in S2, before the training samples are input, the parameters of the depth model are initialized randomly.
Further, in S2, each group of training samples comprises one query image, one video, and a precomputed similarity measure between the two.
Further, in the convolutional neural network, let x denote the input query image and Y denote a video, Y = {y_t | t = 1, ..., N}, where y_t is the t-th frame of video Y and N is the total number of frames. Let Cnn denote the function computed by the convolutional neural network; the image feature obtained for the query image x is then:
f_x = Cnn(x);
for the video Y, the feature of each frame y_t is obtained with the convolutional neural network, i.e. the video features are expressed as:
{f(y_t) = Cnn(y_t) | t = 1, ..., N}.
Further, for the t-th frame of the video, the feature obtained through the convolutional neural network is f(y_t), which serves as the input to the LSTM network; the corresponding output of the LSTM network is the state h_t, the feature of that frame after passing through the LSTM. This is done for every frame of the video, and finally all states h_t are combined into the spatio-temporal video feature:
f_y = {h_t | t = 1, ..., N}.
Further, in the similarity-learning network, let S(x, y) denote the similarity measure between the query image x and the video Y; then:
S(x, y) = [f_x  f_y  1] [A  C  d; Cᵀ  B  e; dᵀ  eᵀ  f] [f_x; f_y; 1];
where A, B, C, d, e, f are the parameters of the similarity measure S(x, y); A and B are positive definite matrices, and C is a positive semidefinite matrix;
letting A = L_Aᵀ L_A, B = L_Bᵀ L_B, and C = −(L_C^x)ᵀ L_C^y, and dropping the constant f, which does not affect the ranking, this becomes:
S(x, y) = ||L_A f_x||² + ||L_B f_y||² + 2dᵀ f_x − 2(L_C^x f_x)ᵀ(L_C^y f_y) + 2eᵀ f_y;
where L_A, L_B, L_C are the parameters of the similarity-learning network, learned by training the depth model.
Further, in S2, a hinge-like loss function is used, and the depth model is trained with the standard stochastic gradient descent method;
letting W denote the parameters of the network, the loss function is defined as:
W = argmin_W Σ_{i=1..N} (1 − l_i · S(x_i, y_i))_+ + λ||W||²;
where (z)_+ = max(z, 0), and l_i is the indicator function: l_i = 1 if the query image x_i and the video y_i contain the same person, and l_i = −1 otherwise.
The present invention automatically compares an input query image containing a person with the videos in a database, returns the videos containing the same person as the query image, and ranks them by the similarity between the person in each video and the person in the input image. By embedding similarity learning into a deep neural network, the invention fuses deep feature learning and similarity learning into a single network architecture, so that the two can be learned and optimized jointly; this overcomes the defect of conventional methods, in which feature learning and similarity learning are performed in isolation and cannot be optimized end to end. The invention uses a deep neural network to adaptively learn the features of images and videos, thereby achieving pedestrian re-identification based on cross-modal comparison of image and video and solving a major difficulty in image-to-video pedestrian re-identification.
Brief description of the drawings
Fig. 1 is a flow diagram of the pedestrian re-identification method based on cross-modal comparison of image and video provided by the present invention.
Fig. 2 is a structural diagram of the depth model used in the present invention.
Embodiments
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a pedestrian re-identification method based on cross-modal comparison of image and video, for retrieving from multiple videos the videos containing the person in an input query image. It should be noted that the multiple videos described in the present invention may be stored together in a single video database or stored in scattered locations.
As shown in Fig. 1, the pedestrian re-identification method based on cross-modal comparison of image and video provided by the invention comprises the following steps:
S1, build a configurable depth model;
the depth model comprises a convolutional neural network, a long short-term memory (LSTM) network, and a similarity-learning network; the convolutional neural network extracts the image feature of the query image and the frame features of the video; the LSTM network embeds the spatio-temporal information of the video into the extracted frame features and outputs a spatio-temporal video feature containing that information; the similarity-learning network maps the image feature and the spatio-temporal video feature to the same dimension and learns a similarity measure between the two;
S2, obtain training samples and feed them into the depth model to train it; learn the parameters of each part of the model using the forward and backward algorithms;
S3, initialize the depth model with the parameters learned in S2; input the query image to be matched and the multiple videos into the depth model, and compute the similarity measure between each video and the query image through the model;
S4, list the videos whose similarity measure with the query image exceeds a threshold, ranked by the magnitude of the similarity measure.
As shown in Fig. 2, the depth model used in the present invention is structured as follows. The model comprises three parts: a convolutional neural network, an LSTM network, and a similarity-learning network. The model has two inputs, the query image and the video, and accordingly the data flow of the model is divided into two branches. One branch extracts the feature of the query image with the convolutional neural network and feeds its output to the first input of the similarity-learning network; the other branch extracts the feature of the video with the convolutional neural network and the LSTM network and feeds its output to the second input of the similarity-learning network.
The function of each network is described in detail below.
Convolutional neural network: In the invention, the convolutional neural network is responsible for extracting the features of the query image and of the video. The network uses the classical GoogLeNet architecture. For the input query image, its feature is obtained through the convolutional neural network and passed to the similarity-learning network for learning the similarity measure.
Let x denote the input query image and Y denote a video, Y = {y_t | t = 1, ..., N}, where y_t is the t-th frame of video Y and N is the total number of frames. Let Cnn denote the function computed by the convolutional neural network; the image feature obtained for the query image x is then:
f_x = Cnn(x);
for the video Y, the feature of each frame y_t is obtained with the convolutional neural network, i.e. the video features are expressed as:
{f(y_t) = Cnn(y_t) | t = 1, ..., N}.
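The feature-extraction step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `cnn` function below is a hypothetical stand-in (a fixed random projection) for the real GoogLeNet, and the frame size and feature dimension are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for GoogLeNet: a fixed random projection that maps
# a flattened 64x64x3 frame to a 128-dimensional feature vector.
W_cnn = rng.standard_normal((128, 64 * 64 * 3)) * 0.01

def cnn(frame):
    """f = Cnn(frame): extract a feature vector from one image/frame."""
    return W_cnn @ frame.reshape(-1)

# Query image x and a video Y with N frames y_t.
x = rng.random((64, 64, 3))
Y = [rng.random((64, 64, 3)) for _ in range(10)]  # N = 10 frames

f_x = cnn(x)                           # image feature f_x = Cnn(x)
frame_feats = [cnn(y_t) for y_t in Y]  # {f(y_t) = Cnn(y_t) | t = 1..N}

print(f_x.shape, len(frame_feats))
```

The same `cnn` function is applied to the query image and to every video frame, matching the shared convolutional stage of the two branches.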
After each frame of the video has obtained its feature through the convolutional neural network, the LSTM network embeds the spatio-temporal information of the video.
LSTM network: As a kind of recurrent neural network, the LSTM can process video data of arbitrary length and output video features. In the present invention, the LSTM further encodes the per-frame information obtained from the convolutional neural network. The characteristic of the LSTM is that its current state depends not only on the current input but also on its previous state.
For the t-th frame of the video, the feature obtained through the convolutional neural network is f(y_t), which serves as the input to the LSTM network; the corresponding output of the LSTM is the state h_t, the feature of that frame after passing through the LSTM. This is done for every frame of the video, and finally all states h_t are combined into the spatio-temporal video feature:
f_y = {h_t | t = 1, ..., N}.
Similarity-learning network: After the feature of the query image and the feature of the video have been extracted, the similarity-learning network learns the similarity between the query image and the video.
Specifically, let S(x, y) denote the similarity measure between the query image x and the video Y; then:
S(x, y) = [f_x  f_y  1] [A  C  d; Cᵀ  B  e; dᵀ  eᵀ  f] [f_x; f_y; 1];
where A, B, C, d, e, f are the parameters of the similarity measure S(x, y); A and B are positive definite matrices, and C is a positive semidefinite matrix. Letting A = L_Aᵀ L_A, B = L_Bᵀ L_B, and C = −(L_C^x)ᵀ L_C^y, and dropping the constant f, which does not affect the ranking, this becomes:
S(x, y) = ||L_A f_x||² + ||L_B f_y||² + 2dᵀ f_x − 2(L_C^x f_x)ᵀ(L_C^y f_y) + 2eᵀ f_y;
where L_A, L_B, L_C are the parameters of the similarity-learning network, learned by training the depth model.
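The two forms of the similarity measure can be checked against each other numerically. The sketch below builds A, B, C from L_A, L_B, L_C^x, L_C^y; the factorization sign convention and the dropped constant term f are assumptions inferred from the expanded formula, and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
dx, dy, r = 5, 7, 3  # image-feature dim, video-feature dim, factor rank (assumed)

L_A  = rng.standard_normal((r, dx))
L_B  = rng.standard_normal((r, dy))
L_Cx = rng.standard_normal((r, dx))
L_Cy = rng.standard_normal((r, dy))
d = rng.standard_normal(dx)
e = rng.standard_normal(dy)
f_const = 0.0  # constant term, dropped in the expanded formula

# Factorized parameters -> block-matrix parameters.
A = L_A.T @ L_A      # symmetric positive semidefinite by construction
B = L_B.T @ L_B
C = -L_Cx.T @ L_Cy   # sign chosen to reproduce the -2(...) cross term

f_x = rng.standard_normal(dx)
f_y = rng.standard_normal(dy)

# Block-matrix form: [f_x f_y 1] M [f_x; f_y; 1].
M = np.block([[A,          C,          d[:, None]],
              [C.T,        B,          e[:, None]],
              [d[None, :], e[None, :], np.array([[f_const]])]])
v = np.concatenate([f_x, f_y, [1.0]])
S_block = v @ M @ v

# Expanded form from the patent.
S_expand = (np.sum((L_A @ f_x) ** 2) + np.sum((L_B @ f_y) ** 2)
            + 2 * d @ f_x
            - 2 * (L_Cx @ f_x) @ (L_Cy @ f_y)
            + 2 * e @ f_y)

print(abs(S_block - S_expand) < 1e-9)
```

Expanding v·M·v gives f_xᵀA f_x + 2f_xᵀC f_y + f_yᵀB f_y + 2dᵀf_x + 2eᵀf_y + f, which reduces term by term to the expanded formula once A, B, C are substituted.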
During the model training described in S2, each group of training samples comprises one query image, one video, and a precomputed similarity measure between the two. Before the training samples are input, the parameters of the depth model are initialized randomly.
In the present invention, a hinge-like loss function is used, and the depth model is trained with the standard stochastic gradient descent method. Letting W denote the parameters of the network, the loss function is defined as:
W = argmin_W Σ_{i=1..N} (1 − l_i · S(x_i, y_i))_+ + λ||W||²;
where (z)_+ = max(z, 0), and l_i is the indicator function: l_i = 1 if the query image x_i and the video y_i contain the same person, and l_i = −1 otherwise.
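The hinge-like loss can be sketched as follows. The indicator convention (l_i = 1 for a matching image-video pair, −1 otherwise) is the standard reading of the claim, and the similarity scores here are dummy values, not model outputs:

```python
import numpy as np

def hinge_like_loss(scores, labels, W, lam=1e-4):
    """sum_i (1 - l_i * S(x_i, y_i))_+ + lam * ||W||^2.

    scores: S(x_i, y_i) for each training pair
    labels: l_i = +1 if the pair shows the same person, -1 otherwise
    W:      flattened network parameters (for the regularizer)
    """
    margins = np.maximum(0.0, 1.0 - labels * scores)  # (.)_+ = max(., 0)
    return margins.sum() + lam * np.sum(W ** 2)

scores = np.array([1.5, 0.2, -2.0, 0.5])  # dummy similarity scores
labels = np.array([1,   1,   -1,  -1])    # matching / non-matching pairs
W = np.zeros(10)

loss = hinge_like_loss(scores, labels, W)
print(loss)
```

A pair contributes zero loss only once its score clears the margin in the right direction (S ≥ 1 for matches, S ≤ −1 for non-matches), which is what drives the similarity measure toward separated scores.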
After the parameters of each part of the depth model have been learned in S2, the depth model is initialized again with these parameters, after which pedestrian re-identification can formally be carried out.
In S3 and S4, the trained depth model computes the similarity between the input query image and each of the multiple videos in the video database. The videos in the database are then sorted by similarity in descending order, and the ranking result is returned.
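The retrieval step in S3 and S4 — score every video against the query, keep those above the threshold, and sort in descending order — can be sketched as follows; the scoring dictionary is a dummy stand-in for the trained model's similarity outputs:

```python
def retrieve(query_scores, threshold):
    """Return (video_id, score) pairs above threshold, best match first."""
    kept = [(vid, s) for vid, s in query_scores.items() if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity scores S(x, Y_i) for one query against five videos.
scores = {"video_1": 0.9, "video_2": -1.2, "video_3": 1.7,
          "video_4": 0.1, "video_5": 1.1}

ranking = retrieve(scores, threshold=0.5)
print(ranking)  # highest-similarity videos first
```

Videos below the threshold are discarded, and the remainder are returned in order of decreasing similarity, matching steps S3 and S4.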
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (7)
1. A pedestrian re-identification method based on cross-modal comparison of image and video, for retrieving from multiple videos the videos containing the person in an input query image, characterized by comprising the following steps:
S1, build a configurable depth model;
the depth model comprises a convolutional neural network, a long short-term memory (LSTM) network, and a similarity-learning network; the convolutional neural network extracts the image feature of the query image and the frame features of the video; the LSTM network embeds the spatio-temporal information of the video into the extracted frame features and outputs a spatio-temporal video feature containing that information; the similarity-learning network maps the image feature and the spatio-temporal video feature to the same dimension and learns a similarity measure between the two;
S2, obtain training samples and feed them into the depth model to train it; learn the parameters of each part of the model using the forward and backward algorithms;
S3, initialize the depth model with the parameters learned in S2; input the query image to be matched and the multiple videos into the depth model, and compute the similarity measure between each video and the query image through the model;
S4, list the videos whose similarity measure with the query image exceeds a threshold, ranked by the magnitude of the similarity measure.
2. The method according to claim 1, characterized in that, in S2, before the training samples are input, the parameters of the depth model are initialized randomly.
3. The method according to claim 1, characterized in that, in S2, each group of training samples comprises one query image, one video, and a precomputed similarity measure between the two.
4. The method according to claim 1, characterized in that, in the convolutional neural network, x denotes the input query image and Y denotes a video, Y = {y_t | t = 1, ..., N}, where y_t is the t-th frame of video Y and N is the total number of frames; Cnn denotes the function computed by the convolutional neural network; the image feature obtained for the query image x is then:
f_x = Cnn(x);
for the video Y, the feature of each frame y_t is obtained with the convolutional neural network, i.e. the video features are expressed as:
{f(y_t) = Cnn(y_t) | t = 1, ..., N}.
5. The method according to claim 4, characterized in that, for the t-th frame of the video, the feature obtained through the convolutional neural network is f(y_t), which serves as the input to the LSTM network; the corresponding output of the LSTM network is the state h_t, the feature of that frame after passing through the LSTM; this is done for every frame of the video, and finally all states h_t are combined into the spatio-temporal video feature:
f_y = {h_t | t = 1, ..., N}.
6. The method according to claim 5, characterized in that, in the similarity-learning network, S(x, y) denotes the similarity measure between the query image x and the video Y, given by:
S(x, y) = [f_x  f_y  1] [A  C  d; Cᵀ  B  e; dᵀ  eᵀ  f] [f_x; f_y; 1];
where A, B, C, d, e, f are the parameters of the similarity measure S(x, y); A and B are positive definite matrices, and C is a positive semidefinite matrix;
letting A = L_Aᵀ L_A, B = L_Bᵀ L_B, and C = −(L_C^x)ᵀ L_C^y, and dropping the constant f, which does not affect the ranking, this becomes:
S(x, y) = ||L_A f_x||² + ||L_B f_y||² + 2dᵀ f_x − 2(L_C^x f_x)ᵀ(L_C^y f_y) + 2eᵀ f_y;
where L_A, L_B, L_C are the parameters of the similarity-learning network, learned by training the depth model.
7. The method according to claim 1, characterized in that, in S2, a hinge-like loss function is used and the depth model is trained with the standard stochastic gradient descent method;
letting W denote the parameters of the network, the loss function is defined as:
W = argmin_W Σ_{i=1..N} (1 − l_i · S(x_i, y_i))_+ + λ||W||²;
where (z)_+ = max(z, 0), which encourages the margin:
S(x_i, y_i) < −1, if l_i = −1;
S(x_i, y_i) ≥ 1, otherwise;
and l_i is the indicator function: l_i = 1 if the query image x_i and the video y_i contain the same person, and l_i = −1 otherwise.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710536118.XA CN107480178B (en) | 2017-07-01 | 2017-07-01 | Pedestrian re-identification method based on cross-modal comparison of image and video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710536118.XA CN107480178B (en) | 2017-07-01 | 2017-07-01 | Pedestrian re-identification method based on cross-modal comparison of image and video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480178A true CN107480178A (en) | 2017-12-15 |
CN107480178B CN107480178B (en) | 2020-07-07 |
Family
ID=60595188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710536118.XA Active CN107480178B (en) | 2017-07-01 | 2017-07-01 | Pedestrian re-identification method based on cross-modal comparison of image and video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480178B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN108932509A (en) * | 2018-08-16 | 2018-12-04 | 新智数字科技有限公司 | A kind of across scene objects search methods and device based on video tracking |
CN108960141A (en) * | 2018-07-04 | 2018-12-07 | 国家新闻出版广电总局广播科学研究院 | Pedestrian's recognition methods again based on enhanced depth convolutional neural networks |
CN109063589A (en) * | 2018-07-12 | 2018-12-21 | 杭州电子科技大学 | Instrument and equipment on-line monitoring method neural network based and system |
CN109165563A (en) * | 2018-07-27 | 2019-01-08 | 北京市商汤科技开发有限公司 | Pedestrian recognition methods and device, electronic equipment, storage medium, program product again |
CN109635676A (en) * | 2018-11-23 | 2019-04-16 | 清华大学 | A method of positioning source of sound from video |
CN110084979A (en) * | 2019-04-23 | 2019-08-02 | 暗物智能科技(广州)有限公司 | Man-machine interaction method, device and controller and interactive device |
CN110245267A (en) * | 2019-05-17 | 2019-09-17 | 天津大学 | Multi-user's video flowing deep learning is shared to calculate multiplexing method |
CN110334743A (en) * | 2019-06-10 | 2019-10-15 | 浙江大学 | A kind of progressive transfer learning method based on the long memory network in short-term of convolution |
CN111050219A (en) * | 2018-10-12 | 2020-04-21 | 奥多比公司 | Spatio-temporal memory network for locating target objects in video content |
CN111931637A (en) * | 2020-08-07 | 2020-11-13 | 华南理工大学 | Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network |
CN112651262A (en) * | 2019-10-09 | 2021-04-13 | 四川大学 | Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment |
CN113283362A (en) * | 2021-06-04 | 2021-08-20 | 中国矿业大学 | Cross-modal pedestrian re-identification method |
CN113761995A (en) * | 2020-08-13 | 2021-12-07 | 四川大学 | Cross-mode pedestrian re-identification method based on double-transformation alignment and blocking |
- 2017-07-01: Application CN201710536118.XA filed in China; granted as CN107480178B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110182469A1 (en) * | 2010-01-28 | 2011-07-28 | Nec Laboratories America, Inc. | 3D convolutional neural networks for automatic human action recognition |
US20110222724A1 (en) * | 2010-03-15 | 2011-09-15 | Nec Laboratories America, Inc. | Systems and methods for determining personal characteristics |
CN104915643A (en) * | 2015-05-26 | 2015-09-16 | Sun Yat-sen University | Deep-learning-based pedestrian re-identification method |
CN106096568A (en) * | 2016-06-21 | 2016-11-09 | Tongji University | Pedestrian re-identification method based on CNN and convolutional LSTM networks |
Non-Patent Citations (4)
Title |
---|
Dong Yi et al.: "Deep Metric Learning for Person Re-identification", 2014 22nd International Conference on Pattern Recognition * |
Liang Zheng et al.: "MARS: A Video Benchmark for Large-Scale Person Re-identification", Computer Vision – ECCV 2016 * |
Yichao Yan et al.: "Person Re-identification via Recurrent Feature Aggregation", Computer Vision – ECCV 2016 * |
Ma Lianyang: "Research on Cross-Camera Pedestrian Re-identification", China Doctoral Dissertations Full-text Database, Information Science and Technology series * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | South China University of Technology | Video retrieval method based on deep learning |
CN108960141B (en) * | 2018-07-04 | 2021-04-23 | Academy of Broadcasting Science, SAPPRFT | Pedestrian re-identification method based on an enhanced deep convolutional neural network |
CN108960141A (en) * | 2018-07-04 | 2018-12-07 | Academy of Broadcasting Science, SAPPRFT | Pedestrian re-identification method based on an enhanced deep convolutional neural network |
CN109063589A (en) * | 2018-07-12 | 2018-12-21 | Hangzhou Dianzi University | Neural-network-based online monitoring method and system for instruments and equipment |
CN109165563A (en) * | 2018-07-27 | 2019-01-08 | Beijing SenseTime Technology Development Co., Ltd. | Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product |
CN109165563B (en) * | 2018-07-27 | 2021-03-23 | Beijing SenseTime Technology Development Co., Ltd. | Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product |
CN108932509A (en) * | 2018-08-16 | 2018-12-04 | Xinzhi Digital Technology Co., Ltd. | Video-tracking-based cross-scene object retrieval method and apparatus |
CN111050219A (en) * | 2018-10-12 | 2020-04-21 | Adobe Inc. | Spatio-temporal memory network for locating target objects in video content |
CN109635676A (en) * | 2018-11-23 | 2019-04-16 | Tsinghua University | Method for localizing sound sources from video |
CN110084979B (en) * | 2019-04-23 | 2022-05-10 | DMAI (Guangzhou) Co., Ltd. | Human-computer interaction method and apparatus, controller, and interaction device |
CN110084979A (en) * | 2019-04-23 | 2019-08-02 | DMAI (Guangzhou) Co., Ltd. | Human-computer interaction method and apparatus, controller, and interaction device |
CN110245267A (en) * | 2019-05-17 | 2019-09-17 | Tianjin University | Multi-user video stream deep learning shared-computation multiplexing method |
CN110245267B (en) * | 2019-05-17 | 2023-08-11 | Tianjin University | Multi-user video stream deep learning shared-computation multiplexing method |
CN110334743A (en) * | 2019-06-10 | 2019-10-15 | Zhejiang University | Progressive transfer learning method based on a convolutional long short-term memory network |
CN110334743B (en) * | 2019-06-10 | 2021-05-04 | Zhejiang University | Progressive transfer learning method based on a convolutional long short-term memory network |
CN112651262A (en) * | 2019-10-09 | 2021-04-13 | Sichuan University | Cross-modal pedestrian re-identification method based on adaptive pedestrian alignment |
CN112651262B (en) * | 2019-10-09 | 2022-10-14 | Sichuan University | Cross-modal pedestrian re-identification method based on adaptive pedestrian alignment |
CN111931637A (en) * | 2020-08-07 | 2020-11-13 | South China University of Technology | Cross-modal pedestrian re-identification method and system based on a dual-stream convolutional neural network |
CN111931637B (en) * | 2020-08-07 | 2023-09-15 | South China University of Technology | Cross-modal pedestrian re-identification method and system based on a dual-stream convolutional neural network |
CN113761995A (en) * | 2020-08-13 | 2021-12-07 | Sichuan University | Cross-modal pedestrian re-identification method based on dual-transform alignment and partitioning |
CN113283362A (en) * | 2021-06-04 | 2021-08-20 | China University of Mining and Technology | Cross-modal pedestrian re-identification method |
CN113283362B (en) * | 2021-06-04 | 2024-03-22 | China University of Mining and Technology | Cross-modal pedestrian re-identification method |
Also Published As
Publication number | Publication date |
---|---|
CN107480178B (en) | 2020-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107480178A (en) | Pedestrian re-identification method based on image-video cross-modal comparison | |
CN107358257B (en) | Incrementally learnable image classification training method for big-data scenarios | |
CN108875674B (en) | Driver behavior identification method based on a multi-column fusion convolutional neural network | |
Li et al. | Building-a-nets: Robust building extraction from high-resolution remote sensing images with adversarial networks |
CN106204449B (en) | Single-image super-resolution reconstruction method based on a symmetric deep network | |
CN106503687B (en) | Surveillance video person identification system and method fusing multi-angle facial features | |
CN104361363B (en) | Deep deconvolutional feature learning network, generation method, and image classification method | |
Zhou et al. | Learning deep features for scene recognition using places database |
CN108062574B (en) | Weakly supervised object detection method based on category-specific spatial constraints | |
CN110008842A (en) | Pedestrian re-identification method based on a deep multi-loss fusion model | |
CN109543602A (en) | Pedestrian re-identification method based on multi-view image feature decomposition | |
CN106897714A (en) | Video action detection method based on convolutional neural networks | |
CN109815826A (en) | Method and apparatus for generating facial attribute models | |
CN107220604A (en) | Video-based fall detection method | |
CN105574510A (en) | Gait identification method and device | |
CN110503076B (en) | Artificial-intelligence-based video classification method, apparatus, device, and medium | |
CN109410190B (en) | Training method for a pole-breakage detection model based on high-resolution remote sensing satellite imagery | |
CN104063721B (en) | Human behavior recognition method based on automatic semantic feature learning and selection | |
CN105718960A (en) | Image ranking model based on convolutional neural networks and spatial pyramid matching | |
CN106682628B (en) | Face attribute classification method based on multilayer deep feature information | |
CN109784288B (en) | Pedestrian re-identification method based on discrimination-aware fusion | |
CN104462494A (en) | Remote sensing image retrieval method and system based on unsupervised feature learning | |
CN105005798B (en) | Target recognition method based on local structure statistics similarity matching | |
CN107633226A (en) | Human action tracking and recognition method and system | |
CN104281572B (en) | Mutual-information-based target matching method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 2020-06-12; Address after: 16th Floor, No. 37 Jinlong Road, Nansha District, Guangzhou, Guangdong 511400 (office only); Applicant after: DMAI (GUANGZHOU) Co., Ltd.; Address before: 210-5, Chongkai Building No. 1, No. 63 Genkai Road, Shilou Town, Panyu District, Guangzhou, Guangdong 510000; Applicant before: GUANGZHOU SHENYU INFORMATION TECHNOLOGY Co., Ltd.
GR01 | Patent grant | ||