CN108765394B - Target identification method based on quality evaluation - Google Patents
Target identification method based on quality evaluation
- Publication number
- CN108765394B (application number CN201810487252.XA)
- Authority
- CN
- China
- Prior art keywords
- target
- features
- network
- video
- quality
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30168—Image quality inspection
Landscapes
- Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a target identification method based on quality evaluation, which comprises the following steps: constructing a target recognition model, wherein the target recognition model comprises a quality evaluation network, a feature extraction network and a feature aggregation network, and is used for extracting target features from a video so as to represent the overall structure information and the local information of a target; training the target recognition model, and adjusting the parameters of the quality evaluation network and the feature extraction network in the training process so that the target recognition model outputs target features meeting preset requirements; and carrying out target recognition on the video through the trained target recognition model. The method thereby solves the problem of target identification caused by variable appearance and uneven image quality in a video sequence; inter-frame correlation information is added to the quality evaluation, more effective target information is obtained, the representation of the target is more accurate, and the identification precision is improved.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a target identification method based on quality evaluation.
Background
The rise of a series of applications such as face recognition and behavior analysis shows that target recognition plays an increasingly important role in real life. In target identification tasks, the same target often has to be identified from cameras at different angles and in different scenes. In the cross-camera case, the appearance gap of the target is often large, which poses a great challenge to the robustness of the recognition algorithm. In recent years, although existing recognition algorithms have achieved good results in experimental environments, they are still unsatisfactory in real, uncontrollable scenes. This is because data collected in an experimental environment is usually of good quality: when images are captured deliberately, few factors degrade image quality; for example, the experimental data may contain variations such as motion and expression, but lacks uncontrollable factors such as illumination changes and occlusion. In real life these uncontrollable factors have a complex effect on image quality. This makes image quality an important factor affecting the performance of target recognition, and makes target recognition based on quality evaluation an important subject for in-depth study.
Currently, video object recognition methods focus on how to integrate more information. For example, in "Face Recognition by Multi-Frame Fusion of Rotating Heads in Videos", published by Canavan et al. in the 2007 IEEE International Conference on Biometrics: Theory, Applications, and Systems, seven frames with different poses are selected from a video sequence and fused into one image so as to utilize more information. Wheeler et al., in "Face recognition in unconstrained videos with matched background similarity", published at the 2011 IEEE Conference on Computer Vision and Pattern Recognition, propose to combine multiple face images into a super-resolution face image, thereby improving face recognition performance.
However, these methods exploit the advantage of having multiple video frames by integrating the information of multiple frames to extract features, while neglecting the effectiveness of that information, and are therefore limited. Researchers then began to focus on the effect of quality on target recognition. Antiharajah et al., in "Quality based frame selection for video face recognition", published at the 2012 Signal Processing and Communication Systems conference, treated a video sequence as a set of independent images and screened out the "good quality" images for target recognition. However, due to changes in the motion, expression and environment of the target in the video, each frame of the video often contains different information, so discarding the other frames wastes information. "Quality Aware Network for Set to Set Recognition", published by Liu et al. at the 2017 IEEE Conference on Computer Vision and Pattern Recognition, considers the effectiveness of the information in each frame, proposes a quality-aware network that uses the quality of each frame to measure the effectiveness of its information, and finally aggregates the information of all frames into the final feature representation. However, it treats the video frames as separate individuals and ignores the connections between them, which limits the performance of object recognition.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a target identification method based on quality evaluation.
The invention provides a target identification method based on quality evaluation, which comprises the following steps:
constructing a target recognition model, wherein the target recognition model comprises: the system comprises a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target identification model is used for extracting effective target features from a video so as to represent the whole structure information and the local information of a target;
training the target recognition model, and adjusting parameters of a quality evaluation network and a feature extraction network in the training process so as to enable the target recognition model to output target features meeting preset requirements;
and carrying out target recognition on the video through the trained target recognition model.
Optionally, the constructing the target recognition model includes:
acquiring image data with known quality standards, and training a quality evaluation network through the image data to obtain a trained quality evaluation network;
extracting the characteristics of the single-frame image through a characteristic extraction network to obtain the local characteristics of the target; forming global characteristics according to the extracted characteristics of the context information of the target;
performing quality evaluation on the local features and the global features of the target through a trained quality evaluation network to obtain corresponding quality scores;
according to the quality scores of the local features and the global features, respectively aggregating the local features and the global features of each frame of the target through a feature aggregation network, and aggregating the local features and the global features of the target;
and constructing the target identification model from the trained quality evaluation network, the trained feature extraction network and the trained feature aggregation network.
Optionally, the acquiring image data with known quality standard includes:
acquiring a first video and a second video from two cameras at different angles and different positions from a database with known quality standards, wherein the first video and the second video both comprise targets;
selecting N first video samples with the frame number larger than 21 frames from the first video, and selecting N second video samples with the frame number larger than 21 frames from the second video; wherein N is a natural number greater than or equal to 2;
and selecting a training set and a testing set from the first video sample and the second video sample, wherein the training set is used for training the quality evaluation network, and the testing set is used for testing the quality evaluation network.
Optionally, the acquiring image data with known quality standard includes:
taking a video containing a target as an input of a face recognition system, and taking an output result of the face recognition system as a data image with a known quality standard; the last layer of the face recognition system is a softmax layer, and the probability that a person with an identity i is recognized as an identity i is used as a quality label;
assume that the training set consists of m labeled samples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y_i \in \{1, 2, \ldots, N\}$; the probability that sample i belongs to category j is:

$$p_j^{(i)} = \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}}$$

the denominator $\sum_{k=1}^{N} e^{z_k^{(i)}}$ normalizes the probability distribution over the classes so that all the probabilities sum to 1, and the probability for the category j equal to the identity i is taken as the quality standard of the image, namely $p_i^{(i)}$;

In the formula: $(x_1, y_1)$ denotes the sample numbered 1, $(x_m, y_m)$ denotes the sample numbered m, $x_i \in \mathbb{R}^n$ denotes the feature representation of the ith sample, the value range of i is 1 to m, $\mathbb{R}$ denotes real space, n is the output dimension of the fully connected layer preceding the softmax layer, $y_i$ denotes the label of the ith sample, $p_j^{(i)}$ denotes the probability that sample i belongs to category j, $z_j^{(i)}$ denotes the raw output of the jth neuron after the ith sample passes through the softmax layer, $z_k^{(i)}$ denotes the raw output of the kth neuron after the ith sample passes through the softmax layer, N denotes the number of classes, and k denotes a counting variable.
Optionally, the quality evaluation network comprises: an AlexNet feature extractor and a bidirectional long short-term memory network LSTM, wherein the AlexNet feature extractor is used for performing quality evaluation on the single-frame image features of the target, thereby generating quality evaluations of the local features, and the bidirectional long short-term memory network LSTM is used for performing quality evaluation on the global features.
Optionally, the single-frame image features are extracted through a feature extraction network to obtain local features of the target; and forming global features according to the extracted features of the context information of the target, including:
using a GoogleNet network as a feature extractor to extract the features of a single-frame image to obtain the local features of a target, and using a bidirectional long-short term memory network (LSTM) to extract the features of the context information of the target to form global features;
the extracting of the single-frame image features and the local features of the target comprises the following steps:
selecting the features of the Inception-5b layer through the GoogleNet network;

the image is input into the image input layer at a size of 224 × 224, and after the 5-stage Inception network structure, the output of the inception_5b layer is selected as the single-frame image feature, wherein the Inception structure executes 1x1, 3x3 and 5x5 convolution layers and a 3x3 pooling layer in parallel, and finally takes the combined parallel outputs as the result of one Inception module.
Optionally, after using the GoogleNet network as a feature extractor to extract a single-frame image feature and generate a local feature of the target, the method further includes:
and inputting the extracted single-frame image features into the time sequence feature extraction network to obtain the time sequence features corresponding to the single-frame image features.
Optionally, performing quality evaluation on the local features and the global features of each frame of the target through a trained quality evaluation network to obtain corresponding quality scores, including:
predicting the quality score of the target single-frame image through the AlexNet feature extractor, wherein the prediction formula is as follows:

$$X_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\} \xrightarrow{A} P(X_i) = \{\mu_1^{(i)}, \mu_2^{(i)}, \ldots, \mu_T^{(i)}\}$$

in the formula: $x_T^{(i)}$ represents the image at the T-th moment of the ith video sample, → represents the neural network operation, A represents AlexNet, T represents the length of the video sample, $\mu_t^{(i)}$ represents the quality score at time t of the ith sample, $P(X_i)$ represents the set of quality scores of each frame image of the ith sample, and $X_i$ represents the ith video sample; the context-based quality score $q_t^{(i)}$ is then calculated as follows:

$$G_i \xrightarrow{H'} Q(G_i) = \{q_1^{(i)}, q_2^{(i)}, \ldots, q_T^{(i)}\}$$

in the formula: H' denotes the LSTM network structure, $Q(G_i)$ denotes the set of context-based quality scores of the frames of the ith video sample, $q_T^{(i)}$ denotes the context-based quality score of the ith sample at time T, $G_i$ denotes the GoogleNet feature representation of the ith video sample, and $q_t^{(i)}$ denotes the context-based quality score at time t of the ith video sample.
Optionally, according to the quality scores of the local features and the global features, respectively aggregating the local features and the global features of each frame of the target through a feature aggregation network, and aggregating the local features and the global features of the target, including:
a fixed-dimension feature is extracted from an image set $S = \{I_1, I_2, \ldots, I_N\}$ to represent the feature of the whole video sample; let $R_a(S)$ and $R_a(I_i)$ respectively represent the feature (local or global feature) of the image set S and of the ith frame image $I_i$, where $R_a(S)$ depends on all frames in S:

$$R_a(S) = \mathcal{F}\big(R_a(I_1), R_a(I_2), \ldots, R_a(I_N)\big) = \frac{\sum_{i=1}^{N} \mu_i\, R_a(I_i)}{\sum_{i=1}^{N} \mu_i}$$

in the formula: $R_a(I_i)$ represents the feature of the ith frame image extracted by GoogleNet, $\mathcal{F}$ represents an aggregation function that maps variable-length video features to fixed-dimension features, and N represents the number of frames in the image set; wherein:

$$\mu_i = Q(I_i)$$

in the formula: $Q(I_i)$ represents the prediction function of the quality score $\mu_i$ of the ith frame image $I_i$;

let $X_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$ represent a video sequence, where $x_t^{(i)}$ represents the t-th frame image in the video sequence, then:

$$S(X_i) = \left\{ \frac{\sum_{t=1}^{T} \mu_t^{(i)}\, f_t^{(i)}}{\sum_{t=1}^{T} \mu_t^{(i)}},\ \frac{\sum_{t=1}^{T} q_t^{(i)}\, r_t^{(i)}}{\sum_{t=1}^{T} q_t^{(i)}} \right\}$$

in the formula: T denotes the number of frames contained in the video sequence, $\mu_t^{(i)}$ represents the quality score of the t-th frame image, $q_t^{(i)}$ represents the quality score of the aggregated context feature of the t-th frame image, $\{\cdot,\cdot\}$ represents the cascade (concatenation), $f_t^{(i)}$ represents the feature of the t-th frame image, $\cdot$ represents the multiplication operation, $r_t^{(i)}$ represents the temporal feature of the t-th frame image, and $S(X_i)$ represents the feature of the video sequence $X_i$.
Compared with the prior art, the invention has the following beneficial effects:
the target identification method based on quality evaluation solves the problem of target identification caused by variable appearance and uneven image quality in a video sequence, increases the correlation information between frames in the quality evaluation, synthesizes video characteristics by aggregating the extracted characteristics and quality scores and utilizing the information of all the frames, and enables the extracted video characteristics to describe corresponding video samples more effectively. And more complete target representation can be given by combining the global features and the local features, so that more effective target information is obtained, the representation of the target is more accurate, and the identification precision is improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram illustrating a method for identifying a target based on quality evaluation according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a network of a target identification method based on quality evaluation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an internal structure of a long short term memory network LSTM according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a bidirectional LSTM network according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a quality evaluation network based on context information according to an embodiment of the present invention;
FIG. 6 is a block diagram of a combination of global features and local features provided by an embodiment of the present invention;
fig. 7 is a schematic diagram of a pedestrian re-identification result provided in an embodiment of the present invention, where (a) is a target sample, (b) is the matching result obtained by the method of the present invention, and (c) is the matching result of a comparison method.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 is a schematic diagram illustrating a principle of a target identification method based on quality evaluation according to an embodiment of the present invention, and as shown in fig. 1, the target identification method based on quality evaluation according to the present invention includes:
s1: constructing a target recognition model, wherein the target recognition model comprises: the system comprises a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target recognition model is used for extracting target features from a video so as to represent the whole structure information and the local information of a target.
In this embodiment, the quality evaluation network includes: the system comprises an AlexNet feature extractor and a bidirectional Long-Short Term Memory (LSTM), wherein the AlexNet feature extractor is used for carrying out quality evaluation on single-frame image features and local features of a target, and the LSTM is used for carrying out quality evaluation on global features.
S2: and training the target recognition model, and adjusting parameters of a quality evaluation network and a feature extraction network in the training process so as to enable the target recognition model to output target features meeting preset requirements.
In this embodiment, the target recognition model is trained with only the identity information as supervision. In the training process, the quality evaluation network and the feature extraction network promote each other, the quality evaluation network that considers the time sequence features and the feature extraction network promote each other, and the global features and the local features promote each other. The network is also reasonably initialized: initialization is performed using the public GoogleNet model and a pre-trained quality evaluation model.
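Purely as an illustrative sketch of this training setup (identity labels as the only supervision, initialization from the public GoogleNet model), the following PyTorch fragment shows the idea; the quality network here is a small placeholder standing in for the pre-trained quality evaluation model, and all module names, dimensions and hyper-parameters are assumptions rather than values given in the patent:

```python
# Sketch: identity-supervised training with pretrained initialization (assumption: recent
# torchvision; the quality network below is a placeholder, not the pre-trained model itself).
import torch
import torch.nn as nn
from torchvision.models import googlenet

feature_net = googlenet(weights="DEFAULT")      # initialize from the public GoogleNet model
feature_net.aux_logits = False                  # keep only the main branch
feature_net.aux1 = None
feature_net.aux2 = None
feature_net.fc = nn.Identity()                  # expose the 1024-d feature instead of class logits
quality_net = nn.Linear(1024, 1)                # placeholder for the pre-trained quality network
classifier = nn.Linear(1024, 300)               # identity classifier (e.g. 300 identities)

params = (list(feature_net.parameters()) + list(quality_net.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()               # identity information is the only supervision

video = torch.randn(16, 3, 224, 224)            # one training sample: 16 frames
identity = torch.tensor([7])                    # its identity label

feats = feature_net(video)                                   # (16, 1024) per-frame features
w = torch.softmax(quality_net(feats).squeeze(-1), dim=0)     # per-frame quality weights
video_feat = (w.unsqueeze(1) * feats).sum(0, keepdim=True)   # quality-weighted video feature
loss = criterion(classifier(video_feat), identity)
loss.backward()
optimizer.step()
```

During training the gradient of the identity loss flows back through both the feature extractor and the quality branch, which is what allows the two networks to promote each other.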
S3: and carrying out target recognition on the video through the trained target recognition model.
In this embodiment, changes in the motion and expression of the target and the influence of environmental factors such as illumination changes and occlusion are considered, since they cause the features and the effectiveness of each frame of the video to vary greatly. Context information is extracted using a bidirectional long short-term memory network, and the effectiveness of the features of each frame is evaluated on the basis of this context information. Finally, the two kinds of features are combined so that the frame features are aggregated on the basis of reasonable effectiveness to obtain effective video features, thereby effectively improving the identification precision of the target.
Optionally, step S1 includes:
s11: acquiring image data with known quality standards, and training a quality evaluation network through the image data to obtain a trained quality evaluation network;
in this embodiment, optionally, the training set may be obtained through a database with known quality standards, specifically:
acquiring a first video and a second video from two cameras at different angles and different positions from a database with known quality standards, wherein the first video and the second video both comprise targets; selecting N first video samples with the frame number larger than 21 frames from the first video, and selecting N second video samples with the frame number larger than 21 frames from the second video; wherein N is a natural number greater than or equal to 2; and selecting a training set and a testing set from the first video sample and the second video sample, wherein the training set is used for training the quality evaluation network, and the testing set is used for testing the quality evaluation network.
Alternatively, when no database with known quality standards is available, the video containing the target may be input into a face recognition system, and the output result of the face recognition system is taken as image data with a known quality standard; the last layer of the face recognition system is a softmax layer, and the probability that a person with identity i is recognized as identity i is used as the quality label. Assume that the training set consists of m labeled samples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y_i \in \{1, 2, \ldots, N\}$; the probability that sample i belongs to category j is:

$$p_j^{(i)} = \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}}$$

the denominator $\sum_{k=1}^{N} e^{z_k^{(i)}}$ normalizes the probability distribution over the classes so that all the probabilities sum to 1, and the probability for the category j equal to the identity i is taken as the quality standard of the image, namely $p_i^{(i)}$.

In the formula: $(x_1, y_1)$ denotes the sample numbered 1, $(x_m, y_m)$ denotes the sample numbered m, $x_i \in \mathbb{R}^n$ denotes the feature representation of the ith sample, the value range of i is 1 to m, $\mathbb{R}$ denotes real space, n is the output dimension of the fully connected layer preceding the softmax layer, $y_i$ denotes the label of the ith sample, $p_j^{(i)}$ denotes the probability that sample i belongs to category j, $z_j^{(i)}$ denotes the raw output of the jth neuron after the ith sample passes through the softmax layer, $z_k^{(i)}$ denotes the raw output of the kth neuron after the ith sample passes through the softmax layer, N denotes the number of classes, and k denotes a counting variable.
S12: extracting the characteristics of the single-frame image through a characteristic extraction network to obtain the local characteristics of the target; forming global features according to the extracted features of the context information of the target;
in the embodiment, a GoogleNet network is used as a feature extractor to extract the single-frame image features and the local features of the target, and a bidirectional long-short term memory network LSTM is used to extract the features of the context information of the target to form global features;
the extracting of the single-frame image features and the local features of the target comprises the following steps:
selecting the features of the Inception-5b layer through the GoogleNet network;

the image is input into the image input layer at a size of 224 × 224, and after the 5-stage Inception network structure, the output of the inception_5b layer is selected as the single-frame image feature, wherein the Inception structure executes 1x1, 3x3 and 5x5 convolution layers and a 3x3 pooling layer in parallel, and finally takes the combined parallel outputs as the output of one Inception structure.
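As an illustrative sketch of this single-frame feature extraction (assuming torchvision's GoogLeNet as a stand-in for the GoogleNet feature extractor; in that implementation the Inception-5b block is named `inception5b`):

```python
# Sketch: take the output of the inception_5b block as the single-frame image feature.
import torch
from torchvision.models import googlenet

model = googlenet(weights="DEFAULT")       # public pretrained GoogleNet weights (recent torchvision)
model.eval()

features = {}
def save_output(module, inputs, output):
    features["inception5b"] = output       # (N, 1024, 7, 7) for a 224x224 input

model.inception5b.register_forward_hook(save_output)

frame = torch.randn(1, 3, 224, 224)        # one frame scaled to 224x224
with torch.no_grad():
    model(frame)
print(features["inception5b"].shape)       # torch.Size([1, 1024, 7, 7])
```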
S13: performing quality evaluation on the single-frame image characteristics of the target through the trained quality evaluation network, and performing quality evaluation on the local characteristics and the global characteristics extracted by the characteristic extraction network to obtain corresponding quality scores;
In this embodiment, the quality evaluation network generates a reasonable evaluation of the effectiveness of the extracted features: the context-aware quality score is used to measure the effectiveness of the global features, and the quality score of the single-frame image is used to measure the effectiveness of the local features of each body part.
Let $g_t^{(i)}$ denote the appearance feature of the image at time t of the ith sample extracted by GoogleNet; the extraction process can be expressed as:

$$G(X_i) = \{g_1^{(i)}, g_2^{(i)}, \ldots, g_T^{(i)}\}, \qquad g_t^{(i)} = G(x_t^{(i)})$$

Let $h_t^{(i)}$ denote the context-information-based feature of the image at time t of the ith sample extracted through the LSTM network, which can be expressed as:

$$H(G_i) = \{h_1^{(i)}, h_2^{(i)}, \ldots, h_T^{(i)}\}, \qquad h_t^{(i)} = H(g_t^{(i)})$$

in the formula: $x_t^{(i)}$ represents the image of the ith sample at time t, G represents GoogleNet, $G(X_i)$ represents the feature set of each frame extracted after the ith sample passes through GoogleNet, H represents the LSTM network, and $H(G_i)$ represents the feature set based on the context information at each moment of the ith sample.
The quality score of the target single-frame image is predicted through the AlexNet feature extractor, and the prediction formula is as follows:

$$X_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\} \xrightarrow{A} P(X_i) = \{\mu_1^{(i)}, \mu_2^{(i)}, \ldots, \mu_T^{(i)}\}$$

in the formula: → represents the neural network operation, A represents AlexNet, T represents the sequence length of the video sample, $\mu_t^{(i)}$ represents the independently evaluated quality score of the image at the t-th time of the ith sample, $P(X_i)$ represents the set of independently evaluated quality scores of each frame of the ith sample, $X_i$ represents the ith video sample, and $\mu_T^{(i)}$ represents the independently evaluated image quality score at the T-th time of the ith sample; the context-based quality score $q_t^{(i)}$ is then calculated as follows:

$$G_i \xrightarrow{H} Q(G_i) = \{q_1^{(i)}, q_2^{(i)}, \ldots, q_T^{(i)}\}$$

in the formula: H denotes the LSTM network, $Q(G_i)$ denotes the set of context-based quality scores of the frames of the ith sample, $G_i$ denotes the GoogleNet feature of the ith sample, $q_t^{(i)}$ denotes the quality score of the image of the ith sample at time t based on the context information, and $q_T^{(i)}$ denotes the quality score of the image at the T-th time of the ith sample based on the context information.
$g_t^{(i)}$ denotes the feature of the t-th frame image extracted by GoogleNet; it is further input into the time sequence feature extraction network to extract the corresponding temporal feature. A recurrent network is used, in which each node of the layer is connected to the previous node, so that within the layer information can flow from the first node to the last node. Let $r_{t \to t+1}^{(i)}$ denote the information of the ith sample that passes through the module from time t to time t+1; the extracted temporal features can then be represented as:

$$H(X_i) = \{r_1^{(i)}, r_2^{(i)}, \ldots, r_T^{(i)}\}$$

where $H(X_i)$ represents the set of temporal features of the video sample and $r_t^{(i)}$ represents the temporal feature of the t-th frame image of the ith sample, which contains both the information extracted from the current frame and the information flowing in from the preceding frames. Suppose the initial information $r_0$ represents the gait of the target person; after training, similar information will flow between all frames, so the extracted features contain temporal features. Extracting such cross-frame temporal features makes the final video features robust.
Specifically, performing quality evaluation on the target single-frame image using AlexNet includes: each input image is scaled to 227 × 227 and fed to the input layer. The image then passes through five convolution modules in sequence for feature extraction, each convolution module comprising a group of structures: a convolutional layer, a ReLU layer and a max-pooling layer. The features then pass through three fully connected layers, of which the first two have 4096 neurons each. Since the goal is to generate a quality score for each image, the number of neurons in the last layer is set to 1.
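A compact sketch of such a per-frame quality scorer, assuming torchvision's AlexNet as the backbone with only the last fully connected layer replaced by a single output neuron (the dimensions follow the description above; everything else is an assumption):

```python
# Sketch: AlexNet-based per-frame quality score network with a single output neuron.
import torch
import torch.nn as nn
from torchvision.models import alexnet

quality_net = alexnet(weights=None)             # five conv blocks + three fully connected layers
quality_net.classifier[6] = nn.Linear(4096, 1)  # last layer: one neuron -> one quality score per image

frames = torch.randn(8, 3, 227, 227)            # eight frames scaled to 227x227
scores = quality_net(frames).squeeze(-1)        # one scalar quality score per frame, shape (8,)
print(scores.shape)
```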
Specifically, the method for acquiring the quality score based on the context information comprises the following steps: a timing feature learning module is attached after the penultimate fully connected layer to generate a quality score evaluation with known context information. The timing feature learning module is constructed using a modified LSTM unit. In this network, the number of units per LSTM layer is equal to the number of frames making up each video. Each LSTM unit is connected to the next LSTM unit so that information can flow from the first LSTM unit to the last LSTM node. Each LSTM node consists of an input node, a hidden node and an output node. Through the LSTM node, useful information is retained while useless information is forgotten. The output of the LSTM layer is a feature vector for each frame, except that these feature vectors contain both features from the current frame and features from previous frames. The feature $g_t^{(i)}$ of each frame image is obtained by the AlexNet feature extractor. The LSTM cell has two inputs, one being the feature $g_t^{(i)}$ of each frame image and the other being the hidden state $h_{t-1}^{(i)}$ from the previous unit. $g_t^{(i)}$ and $h_{t-1}^{(i)}$ first pass through the forget gate. The forget gate determines the degree to which information is forgotten, and the process of forgetting and retaining information can be expressed as:

$$f_t^{(i)} = \sigma\big(W_f \cdot [\,h_{t-1}^{(i)}, g_t^{(i)}\,] + b_f\big)$$

in the formula: $f_t^{(i)}$ denotes the information of the ith sample after passing through the forget gate at the t-th moment, $g_t^{(i)}$ denotes the GoogleNet feature of the ith sample at the t-th time, σ denotes the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$, which compresses the input nonlinearly to between 0 and 1, $W_f$ denotes the convolution parameters of the forget gate, $b_f$ denotes the offset parameter of the forget gate, and $h_{t-1}^{(i)}$ denotes the output feature of the memory unit at the (t-1)-th moment of the ith sample.

At the same time, the input gate processes the current input and decides which information will be used to update the current state; this update process can be expressed as:

$$j_t^{(i)} = \sigma\big(W_j \cdot [\,h_{t-1}^{(i)}, g_t^{(i)}\,] + b_j\big), \qquad \tilde{C}_t^{(i)} = \tanh\big(W_C \cdot [\,h_{t-1}^{(i)}, g_t^{(i)}\,] + b_C\big), \qquad C_t^{(i)} = f_t^{(i)} \cdot C_{t-1}^{(i)} + j_t^{(i)} \cdot \tilde{C}_t^{(i)}$$

in the formula: $j_t^{(i)}$ represents the information of the ith sample processed by the input gate at time t, $W_j$ represents the convolution parameters of the input gate, $b_j$ represents the offset parameter of the input gate, $\tilde{C}_t^{(i)}$ represents the candidate for updating the information, tanh() represents the tanh function layer that nonlinearly compresses the input to between -1 and 1, $W_C$ represents the convolution parameters of the update information, $b_C$ represents the offset parameter of the update information, $C_t^{(i)}$ represents the state of the neuron at the t-th time of the ith sample, and $C_{t-1}^{(i)}$ represents the state of the neuron at time t-1 of the ith sample.

Finally, the hidden state is updated to produce an output. $o_t^{(i)}$ is the output gate that decides which part of the information will be output; this process can be expressed as:

$$o_t^{(i)} = \sigma\big(W_o \cdot [\,h_{t-1}^{(i)}, g_t^{(i)}\,] + b_o\big), \qquad h_t^{(i)} = o_t^{(i)} \cdot \tanh\big(C_t^{(i)}\big)$$

wherein: $W_o$ represents the convolution parameters of the output gate, $b_o$ represents the offset parameter of the output gate, and $h_t^{(i)}$ represents the output feature of the memory unit at the t-th time of the ith sample.
Through the LSTM unit, the information of each frame can be affected by the previous frame, so that the current frame can obtain the above information. Since the nature of the object recognition task is not causal, context information is equally important for non-causal tasks. Thus the present invention uses a bidirectional long-short term memory network. The network structure concatenates the information of the two LSTM layers, and can analyze the characteristics and relationships of the input sequence in the forward and reverse directions simultaneously, while learning the information from the context.
The bidirectional long short-term memory network predicts the quality of each frame while taking the influence of the preceding and following frames into account. With the bidirectional LSTM network added, frames that carry useful temporal characteristics obtain larger quality scores, so that the features of the corresponding frames play a reasonable role in forming the final video sample feature.
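As a sketch of this context-aware quality branch (assuming PyTorch's built-in bidirectional LSTM; the hidden size and input dimension are illustrative assumptions):

```python
# Sketch: bidirectional LSTM that turns per-frame features into context-aware quality scores.
import torch
import torch.nn as nn

class ContextQuality(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=128):
        super().__init__()
        # forward and backward passes over the frame sequence are concatenated
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)   # one quality score per frame

    def forward(self, frame_feats):                 # (batch, T, feat_dim)
        context, _ = self.blstm(frame_feats)        # each step sees both past and future frames
        return self.score(context).squeeze(-1)      # (batch, T) context-based quality scores

q = ContextQuality()
print(q(torch.randn(2, 16, 4096)).shape)            # torch.Size([2, 16])
```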
S14: according to the quality scores of the single-frame image features, the local features and the global features, aggregating the single-frame image features of the target through a feature aggregation network, and aggregating the local features and the global features of the target;
In the present embodiment, a fixed-dimension feature is extracted from an image set $S = \{I_1, I_2, \ldots, I_N\}$ to represent the feature of the whole video sample; let $R_a(S)$ and $R_a(I_i)$ respectively represent the feature of the image set S and of the ith frame image $I_i$, where $R_a(S)$ depends on all frames in S:

$$R_a(S) = \mathcal{F}\big(R_a(I_1), R_a(I_2), \ldots, R_a(I_N)\big) = \frac{\sum_{i=1}^{N} \mu_i\, R_a(I_i)}{\sum_{i=1}^{N} \mu_i}$$

in the formula: $R_a(I_i)$ represents the feature of the ith frame image extracted by GoogleNet, $\mathcal{F}$ represents an aggregation function that maps variable-length video features to fixed-dimension features, and N represents the total number of frames in the image set; wherein:

$$\mu_i = Q(I_i)$$

in the formula: $Q(I_i)$ represents the prediction function of the quality score $\mu_i$ of the ith frame image $I_i$. The features of the individual body parts are characterized by dividing the output features of GoogleNet into three parts. Corresponding quality scores are predicted for the features of each part, and the features and quality scores of each part are aggregated as above; the body features of the three parts are then concatenated as the final feature representation of the video sample.

Let $X_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$ represent a video sequence, where $x_t^{(i)}$ represents the t-th frame image in the video sequence, then:

$$S(X_i) = \left\{ \frac{\sum_{t=1}^{T} \mu_t^{(i)}\, f_t^{(i)}}{\sum_{t=1}^{T} \mu_t^{(i)}},\ \frac{\sum_{t=1}^{T} q_t^{(i)}\, r_t^{(i)}}{\sum_{t=1}^{T} q_t^{(i)}} \right\}$$

in the formula: T denotes the number of frames contained in the video sequence, $\mu_t^{(i)}$ represents the quality score of the t-th frame image, $q_t^{(i)}$ represents the quality score of the aggregated context feature of the t-th frame image, $\{\cdot,\cdot\}$ represents the cascade (concatenation), $f_t^{(i)}$ represents the feature of the t-th frame image, $\cdot$ represents the multiplication operation, $r_t^{(i)}$ represents the temporal feature of the t-th frame image, and $S(X_i)$ represents the feature of the video sequence $X_i$.
S15: constructing the target recognition model from the trained quality evaluation network, the trained feature extraction network and the trained feature aggregation network.
This embodiment solves the problem of identifying targets with variable appearance and uneven image quality in video sequences. A recurrent network module is added to the quality evaluation module to mine the correlation information between frames and obtain more effective target information, so that the quality evaluation becomes more reasonable by taking the temporal information into account. Meanwhile, a recurrent network is added to the feature extraction module so that it contains context information, which addresses the problem of variable target appearance. Through the feature aggregation scheme, the extracted features and quality information are aggregated, and the video sample feature is synthesized using the information of all frames, so that the extracted feature describes the video sample more effectively. In addition, the combination of global features and local features gives a more complete target characterization (including both overall structure information and body part features). For the pedestrian re-identification task, experiments on the iLIDS-VID and PRID2011 datasets show that the top-1 matching rate improves by approximately 3% over the previous algorithms on average (the evaluation criterion is the matching probability taken from the Cumulative Match Characteristic (CMC) curve).
Specifically, pedestrian re-identification (person re-identification) addresses the fact that, under fixed-position surveillance cameras, each camera can only track one segment of a pedestrian's trajectory; during a long walk a person appears in several cameras, and if a person is to be further analyzed, for example tracked or subjected to motion analysis, the target must first be identified across cameras. Given a surveillance image of a pedestrian, the goal of pedestrian re-identification is to find the image of the same pedestrian in another device or view. In real life, images obtained from different capture devices often suffer from severe pose and appearance changes, as well as obvious illumination changes and occlusion. The image quality is therefore uneven, and poor-quality images greatly affect target identification. Pedestrian re-identification is little affected by facial features; it is mainly affected by clothing, scale, occlusion, posture, viewing angle and the like. Another important characteristic of pedestrian re-identification is that the person is in motion, and the motion of a pedestrian includes both rigid and non-rigid motion, which further increases the differences in appearance.
The present invention proposes a target identification method based on quality evaluation using quality differences due to these factors. First, a quality assessment network of known context is pre-trained using a database of monitoring images with quality metrics, which may be generated using existing well-behaved target recognition systems if not available. And then embedding the quality evaluation network into a feature extraction network to construct an integral network structure. And finally, training the whole network, so that the feature extraction network and the quality evaluation network are mutually promoted in the training, and a target recognition model based on quality evaluation is obtained.
Specifically, the target identification database used in this embodiment is constructed as follows: video data of a certain length are screened, and it is ensured that each sample contains two videos from two cameras at different angles and different positions. During dataset construction, only samples whose two videos both contain more than 21 frames are selected, and samples that do not meet the requirement are discarded. The screened data are divided into a training set and a testing set at a 1:1 ratio of sample numbers. The dataset is randomly divided several times, and the average over several experiments is taken to obtain an accurate result. All image data are scaled to the same size.
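A small sketch of this screening and splitting step (assuming each sample is a pair of frame lists; the helper name and random seed are arbitrary):

```python
# Sketch: keep only samples whose two videos both have more than 21 frames, then split 1:1.
import random

def build_split(samples, min_frames=21, seed=0):
    kept = [s for s in samples if len(s[0]) > min_frames and len(s[1]) > min_frames]
    random.Random(seed).shuffle(kept)        # random division, repeated over several runs in practice
    half = len(kept) // 2
    return kept[:half], kept[half:]          # training set, testing set

train_set, test_set = build_split([([0] * 30, [0] * 25), ([0] * 10, [0] * 40), ([0] * 50, [0] * 60)])
print(len(train_set), len(test_set))         # 1 1
```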
In the specific example, the iLIDS-VID and PRID2011 databases are selected for the experiments, and the matching effect of the method is observed. The method is compared with the best existing methods, and the experimental results are analyzed.
The PRID2011 database is built specifically for pedestrian re-identification. The dataset consists of videos recorded by two different static surveillance cameras. There are significant variations in viewing angle, and significant differences in illumination, background and camera characteristics between different samples of the same identity. Since the images are extracted from video, the motion process of the pedestrian is included. Camera 1 recorded videos of 385 pedestrians and camera 2 recorded videos of 749 pedestrians; severely occluded samples were already removed from the public database, and to meet our research requirements the samples were further screened. In the experiments, only video samples with more than 21 valid frames are selected, and samples that do not meet this condition are deleted.
The data in the iLIDS-VID database were collected by a multi-camera surveillance network in an airport arrival hall. The dataset uses two non-overlapping camera views and consists of 300 pedestrians with different identities. Each pedestrian sample includes a pair of video samples from different cameras. The number of frames per video sample varies from 23 to 192, with an average length of 73 frames. The iLIDS-VID dataset is very challenging due to clothing similarities between people, illumination and viewpoint variations across camera views, cluttered backgrounds and random occlusion. The biggest characteristic of this dataset is that occlusion is common in the crowded airport. Also, because the samples may not have been collected on the same day, the clothing and appearance of the same person may differ between views.
Effects of the implementation
Table 1 shows the experimental comparison on the iLIDS-VID dataset; Table 2 shows the experimental comparison on the PRID2011 dataset. Here BLSTM+PQAN (Bidirectional Long Short-Term Memory Network + Partial Quality Aware Network) denotes the bidirectional LSTM combined with the target identification network based on the quality of each target part, LSTM+PQAN (Long Short-Term Memory Network + Partial Quality Aware Network) denotes the unidirectional LSTM combined with the same network, PQAN (Partial Quality Aware Network) denotes the target identification network based on the quality of each target part, QAN (Quality Aware Network) denotes the target identification network based on target quality, CNN+RNN (Convolutional Neural Network + Recurrent Neural Network) denotes a convolutional network combined with a recurrent network, STFV3D denotes a spatio-temporal Fisher vector appearance representation, used either on its own or with a learned distance metric, together with a temporally aligned pooling representation method, GOG-KISSME-SRID denotes the Gaussian-of-Gaussian descriptor combined with KISSME (Keep It Simple and Straightforward MEtric) metric learning for re-identification, LADF (Locally-Adaptive Decision Functions) denotes a locally adaptive decision method, Spindle Net (Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion) denotes a re-identification network combining local and global features, PAM-LOMO+KISSME denotes a method combining a part appearance model with the LOMO (Local Maximal Occurrence) descriptor and KISSME metric learning, CNN+XQDA (Convolutional Neural Network + Cross-view Quadratic Discriminant Analysis) denotes a convolutional network combined with cross-view quadratic discriminant analysis, and GOG+XQDA (Gaussian-of-Gaussian descriptor + Cross-view Quadratic Discriminant Analysis) denotes the Gaussian-of-Gaussian descriptor combined with cross-view quadratic discriminant analysis. In Tables 1 and 2, R1 (Top-1 Matching Rate) denotes the top-1 matching rate, R5 denotes the top-5 matching rate, R10 denotes the top-10 matching rate, and R20 denotes the top-20 matching rate.
TABLE 1
TABLE 2
First, the effectiveness of combining the global and local features is verified: experiments show that when the quality-evaluated global features and local features are used together, the performance improves by 3.3%, which confirms the important role of quality evaluation in assessing the effectiveness of the global and local features. On this basis, an LSTM structure is added to extract temporal features, and this timing feature module further improves the accuracy of pedestrian re-identification. When only one LSTM layer is added to the PQAN framework, the top-1 matching rate improves by 0.3% compared with the baseline method; when the bidirectional cascaded LSTM module is embedded in the framework, the matching rate improves by 2.3%, i.e. by 4.2% compared with the baseline algorithm PQAN. Moreover, the present invention outperforms most existing methods on all of the criteria listed in the table. Although the PAM-LOMO+KISSME method has a slightly higher top-1 matching rate than the present invention, the present invention performs better than this method on top-5, top-10 and top-20. In addition, that method uses multiple appearance models and a complex framework structure to extract local features, which increases the complexity of the network. Overall, the proposed method is superior to the PAM-LOMO+KISSME method.
First, the comparative experiments show that the improvements of the invention increase the matching rate. Second, compared with other methods, the baseline methods QAN and PQAN already outperform the other results, and the performance of the proposed method is further improved: the single-layer LSTM and the bidirectional cascaded LSTM improve the baseline by 0.7% and 2.1% respectively. It is worth noting that both the top-5 and top-10 matching rates of PAM-LOMO+KISSME are higher than those of the present invention.
In summary, the effectiveness of the present invention is demonstrated by comparison with the baseline methods QAN and PQAN. Compared with existing methods with good performance, at least one index of the proposed method is superior to the other methods, and its performance on the remaining indexes is not inferior to that of the compared methods.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (7)
1. A target identification method based on quality evaluation is characterized by comprising the following steps:
constructing a target recognition model, wherein the target recognition model comprises: the system comprises a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target identification model is used for extracting effective target features from a video so as to represent the whole structure information and the local information of a target;
training the target recognition model, and adjusting parameters of a quality evaluation network and a feature extraction network in the training process so as to enable the target recognition model to output target features meeting preset requirements;
carrying out target recognition on the video through the trained target recognition model;
the constructing of the target recognition model comprises the following steps:
acquiring image data with known quality standards, and training a quality evaluation network through the image data to obtain a trained quality evaluation network;
extracting the characteristics of the single-frame image through a characteristic extraction network to obtain the local characteristics of the target; forming global characteristics according to the extracted characteristics of the context information of the target; specifically, a GoogleNet network is used as a feature extractor to extract the features of a single-frame image so as to obtain the local features of a target, and a bidirectional long-short term memory network (LSTM) is used to extract the features of the context information of the target so as to form global features;
performing quality evaluation on the local features and the global features of the target through a trained quality evaluation network to obtain corresponding quality scores;
according to the quality scores of the local features and the global features, respectively aggregating the local features and the global features of each frame of the target through a feature aggregation network, and aggregating the local features and the global features of the target;
constructing the target recognition model from the trained quality evaluation network, feature extraction network and feature aggregation network;
the quality evaluation network comprises: an AlexNet feature extractor and a bidirectional long short-term memory network LSTM, wherein the AlexNet feature extractor is used for performing quality evaluation on the single-frame image features of the target, thereby generating quality evaluations of the local features, and the bidirectional long short-term memory network LSTM is used for performing quality evaluation on the global features.
2. The method for identifying an object based on quality evaluation according to claim 1, wherein the acquiring image data with known quality standard comprises:
acquiring a first video and a second video from two cameras at different angles and different positions from a database with known quality standards, wherein the first video and the second video both comprise targets;
selecting N first video samples with the frame number larger than 21 frames from the first video, and selecting N second video samples with the frame number larger than 21 frames from the second video; wherein N is a natural number greater than or equal to 2;
and selecting a training set and a testing set from the first video sample and the second video sample, wherein the training set is used for training the quality evaluation network, and the testing set is used for testing the quality evaluation network.
3. The method for identifying an object based on quality evaluation according to claim 1, wherein the acquiring image data with known quality standard comprises:
the method comprises the following steps of taking a video containing a target as the input of a face recognition system, and taking the output result of the face recognition system as a data image with a known quality standard; the last layer of the face recognition system is a softmax layer, and the probability that a person with an identity i is recognized as the identity i is used as a quality label;
assume that the training set consists of m labeled samples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y_i \in \{1, 2, \ldots, N\}$; then the probability that sample i belongs to category j is:

$$p_j^{(i)} = \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}}$$

the denominator $\sum_{k=1}^{N} e^{z_k^{(i)}}$ normalizes the probability distribution over the classes so that all the probabilities sum to 1, and the probability for the category j equal to the identity i is taken as the quality standard of the image, namely $p_i^{(i)}$;

In the formula: $(x_1, y_1)$ denotes the sample numbered 1, $(x_m, y_m)$ denotes the sample numbered m, $x_i \in \mathbb{R}^n$ denotes the feature representation of the ith sample, the value range of i is 1 to m, $\mathbb{R}$ denotes real space, n is the output dimension of the fully connected layer preceding the softmax layer, $y_i$ denotes the label of the ith sample, $p_j^{(i)}$ denotes the probability that sample i belongs to category j, $z_j^{(i)}$ denotes the raw output of the jth neuron after the ith sample passes through the softmax layer, $z_k^{(i)}$ denotes the raw output of the kth neuron after the ith sample passes through the softmax layer, N denotes the number of classes, and k denotes a counting variable.
4. The method for identifying the target based on the quality evaluation as claimed in claim 1, wherein the extracting the single-frame image features and the local features of the target comprises:
selecting the features of the Inception-5b layer through the GoogleNet network;

the image is scaled to 224 × 224 and input into the image input layer, and after the 5-stage Inception network structure, the output of the inception_5b layer is selected as the single-frame image feature, wherein the Inception structure executes 1x1, 3x3 and 5x5 convolution layers and a 3x3 pooling layer in parallel, and finally takes the combined parallel outputs as the result of one Inception module.
5. The method for identifying an object based on quality evaluation according to claim 1, after extracting features of a single frame image and generating local features of the object using a GoogleNet network as a feature extractor, further comprising:
and inputting the extracted single-frame image features into the time sequence feature extraction network to obtain the time sequence features corresponding to the single-frame image features.
6. The method for identifying the target based on the quality evaluation as claimed in claim 1, wherein the quality evaluation is performed on the local feature and the global feature of each frame of the target through a trained quality evaluation network to obtain the corresponding quality score, and the method comprises the following steps:
predicting the quality score of the target single-frame image through the AlexNet feature extractor, wherein the prediction formula is as follows:

$$X_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\} \xrightarrow{A} P(X_i) = \{\mu_1^{(i)}, \mu_2^{(i)}, \ldots, \mu_T^{(i)}\}$$

in the formula: $x_T^{(i)}$ represents the image at the T-th moment of the ith video sample, → represents the neural network operation, A represents AlexNet, T represents the length of the video sample, $\mu_t^{(i)}$ represents the quality score at time t of the ith sample, $P(X_i)$ represents the set of quality scores of each frame image of the ith sample, and $X_i$ represents the ith video sample; the context-based quality score $q_t^{(i)}$ is then calculated as follows:

$$G_i \xrightarrow{H'} Q(G_i) = \{q_1^{(i)}, q_2^{(i)}, \ldots, q_T^{(i)}\}$$

in the formula: H' denotes the LSTM network structure, $Q(G_i)$ denotes the set of context-based quality scores of the frames of the ith video sample, $q_T^{(i)}$ denotes the context-based quality score of the ith sample at time T, $G_i$ denotes the GoogleNet feature representation of the ith video sample, and $q_t^{(i)}$ denotes the context-based quality score at time t of the ith video sample.
7. The method for identifying the target based on the quality evaluation according to claim 1, wherein the steps of aggregating the local features and the global features of each frame of the target and aggregating the local features and the global features of the target through a feature aggregation network according to the quality scores of the local features and the global features comprise:
from one image set S ═ { I }1,I2,…,INExtracting fixed dimension features to represent the features of the whole video sample; let Ra(S) andrespectively representing the image set S and the ith frame image IiFeature of (2) (local/global feature), Ra(S) depends on all frames in S, where:
in the formula:representing the characteristics of the i frame image extracted by GoogleNet,representing an aggregation function that maps variable-length video features to fixed-dimension features, N representing a number of frames in an image set; wherein:
μi=Q(Ii)
in the formula: q (I)i) Representing the ith frame image IiMass fraction of (D) < u >iThe prediction function of (2);
order toRepresents a video sequence in whichRepresenting the ith frame of image in the video sequence, then:
in the formula: t denotes the number of frames contained in the video sequence,represents the quality score of the ith frame image,represents the quality score of the aggregate feature of the ith frame image, {, } represents the cascade,features representing the image of the ith frame, a multiplication operation,representing the temporal characteristics of the image of the ith frame, S (X)i) Representing a video sequence XiThe characteristics of (1).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810487252.XA CN108765394B (en) | 2018-05-21 | 2018-05-21 | Target identification method based on quality evaluation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108765394A CN108765394A (en) | 2018-11-06 |
CN108765394B true CN108765394B (en) | 2021-02-05 |
Family
ID=64008435
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810487252.XA Active CN108765394B (en) | 2018-05-21 | 2018-05-21 | Target identification method based on quality evaluation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108765394B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110020581B (en) | 2018-12-03 | 2020-06-09 | 阿里巴巴集团控股有限公司 | Comparison method and device based on multi-frame face images and electronic equipment |
CN111435431A (en) * | 2019-01-15 | 2020-07-21 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN109871780B (en) * | 2019-01-28 | 2023-02-10 | 中国科学院重庆绿色智能技术研究院 | Face quality judgment method and system and face identification method and system |
US11176654B2 (en) * | 2019-03-27 | 2021-11-16 | Sharif University Of Technology | Quality assessment of a video |
CN110121110B (en) * | 2019-05-07 | 2021-05-25 | 北京奇艺世纪科技有限公司 | Video quality evaluation method, video quality evaluation apparatus, video processing apparatus, and medium |
CN110210126B (en) * | 2019-05-31 | 2023-03-24 | 重庆大学 | LSTMPP-based gear residual life prediction method |
CN110502665B (en) * | 2019-08-27 | 2022-04-01 | 北京百度网讯科技有限公司 | Video processing method and device |
CN111222487B (en) * | 2020-01-15 | 2021-09-28 | 浙江大学 | Video target behavior identification method and electronic equipment |
CN111666823B (en) * | 2020-05-14 | 2022-06-14 | 武汉大学 | Pedestrian re-identification method based on individual walking motion space-time law collaborative identification |
CN111914613B (en) * | 2020-05-21 | 2024-03-01 | 淮阴工学院 | Multi-target tracking and facial feature information recognition method |
CN111814567A (en) * | 2020-06-11 | 2020-10-23 | 上海果通通信科技股份有限公司 | Method, device and equipment for detecting living human face and storage medium |
CN112330613B (en) * | 2020-10-27 | 2024-04-12 | 深思考人工智能科技(上海)有限公司 | Evaluation method and system for cytopathology digital image quality |
CN113160050B (en) * | 2021-03-25 | 2023-08-25 | 哈尔滨工业大学 | Small target identification method and system based on space-time neural network |
CN113837107A (en) * | 2021-09-26 | 2021-12-24 | 腾讯音乐娱乐科技(深圳)有限公司 | Model training method, video processing method, electronic device and readable storage medium |
CN115908280B (en) * | 2022-11-03 | 2023-07-18 | 广东科力新材料有限公司 | Method and system for determining performance of PVC (polyvinyl chloride) calcium zinc stabilizer based on data processing |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10681391B2 (en) * | 2016-07-13 | 2020-06-09 | Oath Inc. | Computerized system and method for automatic highlight detection from live streaming media and rendering within a specialized media player |
2018
- 2018-05-21 CN CN201810487252.XA patent/CN108765394B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104023226A (en) * | 2014-05-28 | 2014-09-03 | 北京邮电大学 | HVS-based novel video quality evaluation method |
CN105046277A (en) * | 2015-07-15 | 2015-11-11 | 华南农业大学 | Robust mechanism research method of characteristic significance in image quality evaluation |
CN107341463A (en) * | 2017-06-28 | 2017-11-10 | 北京飞搜科技有限公司 | A kind of face characteristic recognition methods of combination image quality analysis and metric learning |
Non-Patent Citations (3)
Title |
---|
No-Reference quality assessment for multiply distorted images based on deep learning; Qingbing Sang et al.; 2017 International Smart Cities Conference (ISC2); 2017-12-31; pp. 1-2 *
Quality Aware Network for Set to Set Recognition; Yu Liu et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-12-31; pp. 4694-4703 *
Implementation of intrinsic image decomposition based on convolutional neural networks; Sun Xing et al.; Journal of Beijing Electronic Science and Technology Institute; 2017-12-31; Vol. 25, No. 4; pp. 74-80 *
Also Published As
Publication number | Publication date |
---|---|
CN108765394A (en) | 2018-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108765394B (en) | Target identification method based on quality evaluation | |
Liu et al. | Video-based person re-identification with accumulative motion context | |
Zhang et al. | Facial expression recognition based on deep evolutional spatial-temporal networks | |
Misra et al. | Shuffle and learn: unsupervised learning using temporal order verification | |
Wang et al. | Unsupervised learning of visual representations using videos | |
CN107609460B (en) | Human body behavior recognition method integrating space-time dual network flow and attention mechanism | |
Song et al. | Multimodal multi-stream deep learning for egocentric activity recognition | |
Xu et al. | Deepmot: A differentiable framework for training multiple object trackers | |
Deep et al. | Leveraging CNN and transfer learning for vision-based human activity recognition | |
Jalalvand et al. | Real-time reservoir computing network-based systems for detection tasks on visual contents | |
Kollias et al. | Training deep neural networks with different datasets in-the-wild: The emotion recognition paradigm | |
CN110188637A (en) | A kind of Activity recognition technical method based on deep learning | |
Zhang et al. | Image-to-video person re-identification with temporally memorized similarity learning | |
Ma et al. | Video saliency forecasting transformer | |
CN110503053A (en) | Human motion recognition method based on cyclic convolution neural network | |
Zhang et al. | A multi-scale spatial-temporal attention model for person re-identification in videos | |
CN111126223B (en) | Video pedestrian re-identification method based on optical flow guide features | |
CN111339908B (en) | Group behavior identification method based on multi-mode information fusion and decision optimization | |
CN113239801B (en) | Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment | |
Snoun et al. | Towards a deep human activity recognition approach based on video to image transformation with skeleton data | |
CN113642482B (en) | Video character relation analysis method based on video space-time context | |
CN111967433A (en) | Action identification method based on self-supervision learning network | |
Jin et al. | Real-time action detection in video surveillance using a sub-action descriptor with multi-convolutional neural networks | |
Behera et al. | Person re-identification: A taxonomic survey and the path ahead | |
Wang et al. | Pose-based two-stream relational networks for action recognition in videos |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||