CN108765394B - Target identification method based on quality evaluation - Google Patents

Target identification method based on quality evaluation

Info

Publication number
CN108765394B
CN108765394B (application CN201810487252.XA)
Authority
CN
China
Prior art keywords
target
features
network
video
quality
Prior art date
Legal status
Active
Application number
CN201810487252.XA
Other languages
Chinese (zh)
Other versions
CN108765394A (en)
Inventor
徐奕
倪冰冰
刘桂荣
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810487252.XA
Publication of CN108765394A
Application granted
Publication of CN108765394B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target identification method based on quality evaluation, which comprises the following steps: constructing a target recognition model comprising a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target recognition model is used for extracting target features from a video so as to represent the overall structure information and the local information of the target; training the target recognition model, and adjusting the parameters of the quality evaluation network and the feature extraction network during training so that the target recognition model outputs target features meeting preset requirements; and carrying out target recognition on the video through the trained target recognition model. The method thereby solves the problem of identifying targets with variable appearance and uneven image quality in video sequences, adds inter-frame correlation information to the quality evaluation, obtains more effective target information, represents the target more accurately, and improves the recognition accuracy.

Description

Target identification method based on quality evaluation
Technical Field
The invention relates to the technical field of image processing, in particular to a target identification method based on quality evaluation.
Background
The rise of a series of applications such as face recognition and behavior analysis shows that target recognition plays an increasingly important role in real life. In target recognition tasks, the same target often needs to be identified from cameras at different angles and in different scenes. Across cameras, the appearance gap of the target is often large, which poses a great challenge to the robustness of recognition algorithms. In recent years, although existing recognition algorithms have achieved good results in experimental environments, they remain unsatisfactory in real, uncontrolled scenes. This is because data collected in experimental environments is usually of good quality: in deliberately arranged shooting, few factors affect the image quality, and while the experimental data may contain variations such as motion and expression, it lacks uncontrollable factors such as illumination and occlusion. In real life, these uncontrollable factors have a complex effect on image quality. This makes image quality an important factor affecting the performance of target recognition, and makes target recognition based on quality evaluation an important subject worthy of in-depth study.
Current video target recognition methods focus on how to integrate more information. For example, in "Face Recognition by Multi-Frame Fusion of Rotating Heads in Videos", published by Canavan et al. at the 2007 IEEE International Conference on Biometrics: Theory, Applications, and Systems, seven frames with different poses are selected from a video sequence and fused into one image so as to utilize more information. Wheeler et al., in "Face recognition in unconstrained videos with matched background similarity", published at the 2011 IEEE Conference on Computer Vision and Pattern Recognition, propose to combine multiple face images into a super-resolution face image, thereby improving face recognition performance.
However, these methods exploit the advantage of multiple video frames by integrating the information of many frames to extract features while neglecting the effectiveness of that information, and are therefore limited. Researchers then began to focus on the effect of quality on target recognition. Anantharajah et al., in "Quality based frame selection for video face recognition", published at the 2012 Signal Processing and Communication Systems conference, treat an image sequence as a set of independent images and screen out "good quality" images for target recognition. However, due to changes in the target's motion, expression and environment in a video, each frame often contains different information, so discarding the other frames in this way wastes information. "Quality Aware Network for Set to Set Recognition", published by Liu et al. at the 2017 IEEE Conference on Computer Vision and Pattern Recognition, considers the validity of each frame's information, proposes a quality-aware network that uses the quality of each frame to measure the validity of its information, and finally aggregates all frame information to form the final feature representation. However, it treats the video frames as separate individuals and ignores the connections between frames, which limits target recognition performance.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a target identification method based on quality evaluation.
The invention provides a target identification method based on quality evaluation, which comprises the following steps:
constructing a target recognition model, wherein the target recognition model comprises: the system comprises a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target identification model is used for extracting effective target features from a video so as to represent the whole structure information and the local information of a target;
training the target recognition model, and adjusting parameters of a quality evaluation network and a feature extraction network in the training process so as to enable the target recognition model to output target features meeting preset requirements;
and carrying out target recognition on the video through the trained target recognition model.
Optionally, the constructing the target recognition model includes:
acquiring image data with known quality standards, and training a quality evaluation network through the image data to obtain a trained quality evaluation network;
extracting the characteristics of the single-frame image through a characteristic extraction network to obtain the local characteristics of the target; forming global characteristics according to the extracted characteristics of the context information of the target;
performing quality evaluation on the local features and the global features of the target through a trained quality evaluation network to obtain corresponding quality scores;
according to the quality scores of the local features and the global features, respectively aggregating the local features and the global features of each frame of the target through a feature aggregation network, and aggregating the local features and the global features of the target;
and constructing the target recognition model through a trained quality evaluation network, a trained feature extraction network and a trained feature aggregation network.
Optionally, the acquiring image data with known quality standard includes:
acquiring a first video and a second video from two cameras at different angles and different positions from a database with known quality standards, wherein the first video and the second video both comprise targets;
selecting N first video samples with the frame number larger than 21 frames from the first video, and selecting N second video samples with the frame number larger than 21 frames from the second video; wherein N is a natural number greater than or equal to 2;
and selecting a training set and a testing set from the first video sample and the second video sample, wherein the training set is used for training the quality evaluation network, and the testing set is used for testing the quality evaluation network.
Optionally, the acquiring image data with known quality standard includes:
taking a video containing a target as an input of a face recognition system, and taking an output result of the face recognition system as a data image with a known quality standard; the last layer of the face recognition system is a softmax layer, and the probability that a person with an identity i is recognized as an identity i is used as a quality label;
assume that the training set consists of m labeled samples $\{(x_1, y_1), \dots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{1, 2, \dots, N\}$; then the probability that sample i belongs to category j is:

$p_j^{(i)} = \dfrac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}}$

The denominator $\sum_{k=1}^{N} e^{z_k^{(i)}}$ normalizes the probability distribution over the classes so that all the probabilities sum to 1, and the probability when i equals j (i.e. the probability $p_{y_i}^{(i)}$ assigned to the sample's own identity) is taken as the quality standard of the image.

In the formula: $(x_1, y_1)$ denotes the sample numbered 1, $(x_m, y_m)$ denotes the sample numbered m, $x_i$ denotes the feature representation of the i-th sample with i ranging from 1 to m, $\mathbb{R}$ denotes the real space and n is the output dimension of the fully connected layer preceding the softmax layer, $y_i$ denotes the label of the i-th sample, $p_j^{(i)}$ denotes the probability that sample i belongs to category j, $z_j^{(i)}$ denotes the raw output of the j-th neuron after the i-th sample passes through the softmax layer, $z_k^{(i)}$ denotes the raw output of the k-th neuron after the i-th sample passes through the softmax layer, N denotes the number of classes, and k denotes a counting variable.
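As a rough illustration of this labeling scheme, the sketch below (an illustrative assumption, not part of the patent text) takes the softmax output of a pretrained recognition network and uses the probability assigned to each sample's own identity as its quality label; the tensor names and shapes are hypothetical.

```python
import torch

def quality_labels_from_softmax(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Turn recognition logits into per-image quality labels.

    logits: (m, N) raw outputs of the fully connected layer before softmax,
            one row per labeled training image (hypothetical shapes).
    labels: (m,) integer identity labels y_i in {0, ..., N-1}.
    Returns a (m,) tensor whose i-th entry is the softmax probability that
    sample i is recognized as its own identity, used here as the quality label.
    """
    probs = torch.softmax(logits, dim=1)              # normalize so each row sums to 1
    return probs[torch.arange(labels.size(0)), labels]

# toy usage with random numbers standing in for a real recognition network
logits = torch.randn(4, 10)                           # 4 images, 10 identities
labels = torch.tensor([3, 3, 7, 1])
print(quality_labels_from_softmax(logits, labels))
```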
Optionally, the quality evaluation network comprises: an AlexNet feature extractor and a bidirectional long short-term memory network LSTM, wherein the AlexNet feature extractor is used to evaluate the quality of the single-frame image features of the target and to generate quality scores for the local features, and the bidirectional long short-term memory network LSTM is used to evaluate the quality of the global features.
Optionally, the single-frame image features are extracted through a feature extraction network to obtain local features of the target; and forming global features according to the extracted features of the context information of the target, including:
using a GoogleNet network as a feature extractor to extract the features of a single-frame image to obtain the local features of a target, and using a bidirectional long-short term memory network (LSTM) to extract the features of the context information of the target to form global features;
the extracting of the single-frame image features and the local features of the target comprises the following steps:
selecting the features of the inception_5b layer of the GoogleNet network;
the image is scaled to 224 × 224 and input into the image input layer, and after the 5-stage Inception network structure the output of the inception_5b layer is selected as the single-frame image feature, wherein the Inception structure executes 1x1, 3x3 and 5x5 convolution layers and a 3x3 pooling layer in parallel, and finally the parallel outputs are combined as the result of one Inception module.
Optionally, after using the GoogleNet network as a feature extractor to extract a single-frame image feature and generate a local feature of the target, the method further includes:
and inputting the extracted single-frame image features into the time sequence feature extraction network to obtain the time sequence features corresponding to the single-frame image features.
Optionally, performing quality evaluation on the local features and the global features of each frame of the target through a trained quality evaluation network to obtain corresponding quality scores, including:
predicting the quality score of the target single-frame image through the AlexNet feature extractor, wherein the prediction formula is as follows:

$P(X_i) = \{p_1^i, p_2^i, \dots, p_T^i\}, \qquad x_t^i \xrightarrow{A} p_t^i$

In the formula: $x_T^i$ represents the image at the T-th moment of the i-th video sample, → represents the neural network operation, A represents AlexNet, T represents the length of the video sample, $p_t^i$ represents the quality score at time t of the i-th sample, and $P(X_i)$ represents the set of quality scores of each frame image of the i-th sample, where $X_i$ represents the i-th video sample; then $Q(G_i)$ is calculated as follows:

$Q(G_i) = \{q_1^i, q_2^i, \dots, q_T^i\}, \qquad G_i \xrightarrow{H'} q_t^i$

In the formula: H' denotes the LSTM network structure, $Q(G_i)$ denotes the set of context-information-based quality scores of the frames of the i-th video sample, $q_T^i$ denotes the context-based quality score of the i-th sample at time T, $G_i$ denotes the GoogleNet feature representation of the i-th video sample, and $q_t^i$ denotes the context-based quality score at time t of the i-th video sample.
Optionally, according to the quality scores of the local features and the global features, respectively aggregating the local features and the global features of each frame of the target through a feature aggregation network, and aggregating the local features and the global features of the target, including:
from an image set $S = \{I_1, I_2, \dots, I_N\}$, extracting features of a fixed dimension to represent the features of the whole video sample; let $R_a(S)$ and $r_a(I_i)$ respectively represent the (local/global) feature of the image set S and of the i-th frame image $I_i$, where $R_a(S)$ depends on all frames in S:

$R_a(S) = \mathcal{F}\big(r_a(I_1), r_a(I_2), \dots, r_a(I_N)\big)$

In the formula: $r_a(I_i)$ represents the feature of the i-th frame image extracted by GoogleNet, $\mathcal{F}(\cdot)$ represents the aggregation function that maps variable-length video features to fixed-dimension features, and N represents the number of frames in the image set; wherein:

$R_a(S) = \dfrac{\sum_{i=1}^{N} \mu_i \, r_a(I_i)}{\sum_{i=1}^{N} \mu_i}, \qquad \mu_i = Q(I_i)$

In the formula: $Q(I_i)$ represents the prediction function of the quality score $\mu_i$ of the i-th frame image $I_i$;

letting $X_i = \{x_1^i, x_2^i, \dots, x_T^i\}$ represent a video sequence, where $x_t^i$ represents the t-th frame image in the video sequence, then:

$S(X_i) = \left\{ \dfrac{\sum_{t=1}^{T} p_t^i \cdot f_t^i}{\sum_{t=1}^{T} p_t^i},\ \dfrac{\sum_{t=1}^{T} q_t^i \cdot h_t^i}{\sum_{t=1}^{T} q_t^i} \right\}$

In the formula: T denotes the number of frames contained in the video sequence, $p_t^i$ denotes the quality score of the t-th frame image, $q_t^i$ denotes the quality score of the aggregated context-based feature of the t-th frame image, $\{\cdot,\cdot\}$ denotes concatenation, $f_t^i$ denotes the feature of the t-th frame image, · denotes multiplication, $h_t^i$ denotes the time sequence feature of the t-th frame image, and $S(X_i)$ denotes the feature of the video sequence $X_i$.
Compared with the prior art, the invention has the following beneficial effects:
the target identification method based on quality evaluation solves the problem of target identification caused by variable appearance and uneven image quality in a video sequence, increases the correlation information between frames in the quality evaluation, synthesizes video characteristics by aggregating the extracted characteristics and quality scores and utilizing the information of all the frames, and enables the extracted video characteristics to describe corresponding video samples more effectively. And more complete target representation can be given by combining the global features and the local features, so that more effective target information is obtained, the representation of the target is more accurate, and the identification precision is improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram illustrating a method for identifying a target based on quality evaluation according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a network of a target identification method based on quality evaluation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an internal structure of a long short term memory network LSTM according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a bidirectional LSTM network according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a quality evaluation network based on context information according to an embodiment of the present invention;
FIG. 6 is a block diagram of a combination of global features and local features provided by an embodiment of the present invention;
fig. 7 is a schematic diagram of a pedestrian re-identification result provided by an embodiment of the present invention, where (a) is the target sample, (b) is the matching result obtained by the method of the present invention, and (c) is the matching result of the comparison method.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications that would be obvious to those skilled in the art can be made without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
Fig. 1 is a schematic diagram illustrating a principle of a target identification method based on quality evaluation according to an embodiment of the present invention, and as shown in fig. 1, the target identification method based on quality evaluation according to the present invention includes:
s1: constructing a target recognition model, wherein the target recognition model comprises: the system comprises a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target recognition model is used for extracting target features from a video so as to represent the whole structure information and the local information of a target.
In this embodiment, the quality evaluation network includes: the system comprises an AlexNet feature extractor and a bidirectional Long-Short Term Memory (LSTM), wherein the AlexNet feature extractor is used for carrying out quality evaluation on single-frame image features and local features of a target, and the LSTM is used for carrying out quality evaluation on global features.
S2: and training the target recognition model, and adjusting parameters of a quality evaluation network and a feature extraction network in the training process so as to enable the target recognition model to output target features meeting preset requirements.
In this embodiment, the target recognition model is trained using only the identity information for supervision. During training, the quality evaluation network and the feature extraction network promote each other, the quality evaluation network that considers the time sequence features and the feature extraction network promote each other, and the global features and the local features promote each other. The network is also initialized reasonably: initialization is performed using the public GoogleNet model and a pre-trained quality evaluation model.
S3: and carrying out target recognition on the video through the trained target recognition model.
In the embodiment, the change of the motion, the expression and the like of the target and the influence of environmental factors such as illumination change, shielding and the like are considered, so that the feature and the effectiveness of each frame of the video are changed greatly. And extracting context information by using a bidirectional long-short term memory network, and evaluating the effectiveness of the characteristics of each frame on the basis of considering the context information. And finally, combining the two characteristics to enable the frame characteristics to be aggregated on the basis of reasonable effectiveness to obtain effective video characteristics, thereby effectively improving the identification precision of the target.
Optionally, step S1 includes:
s11: acquiring image data with known quality standards, and training a quality evaluation network through the image data to obtain a trained quality evaluation network;
in this embodiment, optionally, the training set may be obtained through a database with known quality standards, specifically:
acquiring a first video and a second video from two cameras at different angles and different positions from a database with known quality standards, wherein the first video and the second video both comprise targets; selecting N first video samples with the frame number larger than 21 frames from the first video, and selecting N second video samples with the frame number larger than 21 frames from the second video; wherein N is a natural number greater than or equal to 2; and selecting a training set and a testing set from the first video sample and the second video sample, wherein the training set is used for training the quality evaluation network, and the testing set is used for testing the quality evaluation network.
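As a rough illustration of this screening and splitting procedure, the following sketch keeps only identities whose two videos both exceed 21 frames and divides the identities 1:1 into training and test sets; the sample layout (a pair of frame lists per identity) and every name in the code are illustrative assumptions.

```python
import random

def build_splits(samples, min_frames=21, seed=0):
    """samples: dict mapping identity -> (frames_cam1, frames_cam2),
    where each value is a list of frame paths (hypothetical layout).
    Keeps only identities whose two videos both exceed min_frames,
    then splits the kept identities 1:1 into training and test sets."""
    kept = [pid for pid, (v1, v2) in samples.items()
            if len(v1) > min_frames and len(v2) > min_frames]
    rng = random.Random(seed)
    rng.shuffle(kept)                 # random division; repeat with other seeds and average the results
    half = len(kept) // 2
    return kept[:half], kept[half:]   # training identities, test identities
```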
Alternatively, when a database with known quality standards is not available, the video containing the target may be input into a face recognition system, and the output result of the face recognition system is taken as image data with a known quality standard; the last layer of the face recognition system is a softmax layer, and the probability that a person with identity i is recognized as identity i is used as the quality label. Assume that the training set consists of m labeled samples $\{(x_1, y_1), \dots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{1, 2, \dots, N\}$; then the probability that sample i belongs to category j is:

$p_j^{(i)} = \dfrac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}}$

The denominator $\sum_{k=1}^{N} e^{z_k^{(i)}}$ normalizes the probability distribution over the classes so that all the probabilities sum to 1, and the probability when i equals j (i.e. the probability $p_{y_i}^{(i)}$ assigned to the sample's own identity) is taken as the quality standard of the image.

In the formula: $(x_1, y_1)$ denotes the sample numbered 1, $(x_m, y_m)$ denotes the sample numbered m, $x_i$ denotes the feature representation of the i-th sample with i ranging from 1 to m, $\mathbb{R}$ denotes the real space and n is the output dimension of the fully connected layer preceding the softmax layer, $y_i$ denotes the label of the i-th sample, $p_j^{(i)}$ denotes the probability that sample i belongs to category j, $z_j^{(i)}$ denotes the raw output of the j-th neuron after the i-th sample passes through the softmax layer, $z_k^{(i)}$ denotes the raw output of the k-th neuron after the i-th sample passes through the softmax layer, N denotes the number of classes, and k denotes a counting variable.
S12: extracting the characteristics of the single-frame image through a characteristic extraction network to obtain the local characteristics of the target; forming global features according to the extracted features of the context information of the target;
in the embodiment, a GoogleNet network is used as a feature extractor to extract the single-frame image features and the local features of the target, and a bidirectional long-short term memory network LSTM is used to extract the features of the context information of the target to form global features;
the extracting of the single-frame image features and the local features of the target comprises the following steps:
selecting the features of the inception_5b layer of the GoogleNet network;
the image is scaled to 224 × 224 and input into the image input layer, and after the 5-stage Inception network structure the output of the inception_5b layer is selected as the single-frame image feature, wherein the Inception structure executes 1x1, 3x3 and 5x5 convolution layers and a 3x3 pooling layer in parallel, and finally the parallel outputs are combined as the output of one Inception structure.
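A minimal sketch of extracting the inception_5b output as the single-frame feature is given below, assuming torchvision's GoogLeNet implementation (where the module is exposed as the attribute `inception5b`); the attribute name and the feature shape are assumptions about that particular implementation, not statements of the patent.

```python
import torch
import torchvision

model = torchvision.models.googlenet()                # load pretrained weights in practice
model.eval()

features = {}
def save_output(module, inputs, output):
    features["inception5b"] = output                  # roughly (N, 1024, 7, 7) for 224x224 inputs

# 'inception5b' is the attribute name used by torchvision's GoogLeNet (assumption)
hook = model.inception5b.register_forward_hook(save_output)

frames = torch.randn(8, 3, 224, 224)                  # a video clip scaled to 224x224
with torch.no_grad():
    model(frames)
single_frame_features = features["inception5b"]       # per-frame convolutional features
hook.remove()
```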
S13: performing quality evaluation on the single-frame image characteristics of the target through the trained quality evaluation network, and performing quality evaluation on the local characteristics and the global characteristics extracted by the characteristic extraction network to obtain corresponding quality scores;
in this embodiment, the quality evaluation network can generate a reasonable evaluation of the effectiveness of the extracted features. The quality score of the known context is used for measuring the effectiveness of global features, and the quality score of the single-frame image is used for measuring the effectiveness of local features of all parts of the body.
$f_t^i$ represents the appearance feature of the image at time t extracted by GoogleNet for the i-th sample; the extraction process can be expressed as:

$G(X_i) = \{f_1^i, f_2^i, \dots, f_T^i\}, \qquad x_t^i \xrightarrow{G} f_t^i$

$g_t^i$ represents the context-information-based feature of the image at time t extracted by the i-th sample through the LSTM network, which may be expressed as:

$H(G_i) = \{g_1^i, g_2^i, \dots, g_T^i\}, \qquad G_i \xrightarrow{H} g_t^i$

In the formula: $x_t^i$ represents the image of the i-th sample at time t, G represents GoogleNet, $G(X_i)$ represents the set of per-frame features extracted after the i-th sample passes through GoogleNet, H represents the LSTM network, and $H(G_i)$ represents the set of context-information-based features of the i-th sample at each moment.
Predicting the quality score of the target single-frame image through the AlexNet feature extractor, the prediction formula is:

$P(X_i) = \{p_1^i, p_2^i, \dots, p_T^i\}, \qquad x_t^i \xrightarrow{A} p_t^i$

In the formula: → represents the neural network operation, A represents AlexNet, T represents the sequence length of the video sample, $p_t^i$ represents the independently evaluated quality score of the image at time t of the i-th sample, $P(X_i)$ represents the set of independently evaluated quality scores of each frame of the i-th sample, $X_i$ represents the i-th video sample, and $p_T^i$ represents the independently evaluated quality score at the T-th time of the i-th sample; then $Q(G_i)$ is calculated as follows:

$Q(G_i) = \{q_1^i, q_2^i, \dots, q_T^i\}, \qquad G_i \xrightarrow{H} q_t^i$

In the formula: H denotes the LSTM network, $Q(G_i)$ denotes the set of context-based quality scores of the frames of the i-th sample, $G_i$ denotes the GoogleNet feature of the i-th sample, $q_t^i$ denotes the context-information-based quality score of the image of the i-th sample at time t, and $q_T^i$ denotes the context-information-based quality score of the image at the T-th time of the i-th sample.
For the features extracted by GoogleNet, $f_t^i$ represents the feature of the t-th frame image; these features are further input into the time sequence feature extraction network to extract their time sequence features. We use a recurrent network in which each node of the layer is connected to the previous node, so that within the layer information can flow from the first node to the last node. Let $m_{t \to t+1}^i$ denote the information of the i-th sample passed from node t to node t+1 through this module; the extracted time sequence features can then be represented as:

$H(X_i) = \{h_1^i, h_2^i, \dots, h_T^i\}$

where $H(X_i)$ denotes the set of time sequence features of the video sample and $h_t^i$ denotes the time sequence feature of the t-th frame image of the i-th sample, which contains both the information $r_t^i$ extracted from the current frame and the information passed on from previous frames. Suppose the information $r_0$ represents the gait of the target person; after training, similar information will flow between all frames. In this case, the extracted features contain timing information. Because cross-frame time sequence features are extracted, the final video features are robust.
Specifically, the quality evaluation of the target single-frame image using AlexNet is as follows: each input image is scaled to 227 × 227 and fed into the input layer. The image then passes sequentially through five convolution modules for feature extraction, each convolution module comprising a group of structures: convolution layers, ReLU layers and max-pooling layers. The features then pass through three fully connected layers, of which the first two have 4096 neurons each. Since the goal is to generate a quality score for each image, the number of neurons in the last layer is set to 1.
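A rough sketch of such a single-frame quality scorer, assuming torchvision's AlexNet and simply replacing the last fully connected layer with a single output neuron; the layer index and the sigmoid squashing are assumptions, not part of the patent.

```python
import torch
import torch.nn as nn
import torchvision

quality_net = torchvision.models.alexnet()              # five conv blocks + three fully connected layers
# the last classifier layer in torchvision's AlexNet is classifier[6]: Linear(4096, 1000) (assumption)
quality_net.classifier[6] = nn.Linear(4096, 1)           # one neuron -> one quality score per image
quality_net.eval()

frames = torch.rand(8, 3, 227, 227)                      # frames scaled to 227x227 as described above
with torch.no_grad():
    scores = torch.sigmoid(quality_net(frames)).squeeze(1)   # squash to (0, 1); sigmoid is an assumption
print(scores.shape)                                       # torch.Size([8])
```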
Specifically, the method for acquiring the quality score based on the context information is as follows: a timing feature learning module is attached after the penultimate fully connected layer to generate a quality score evaluation that is aware of the context information. The timing feature learning module is constructed using modified LSTM units. In this network, the number of units in each LSTM layer is equal to the number of frames making up each video. Each LSTM unit is connected to the next, so that information can flow from the first LSTM unit to the last LSTM node. Each LSTM node consists of an input node, a hidden node and an output node; through the LSTM node, useful information is retained while useless information is forgotten. The output of the LSTM layer is a feature vector for each frame, except that these feature vectors contain both features from the current frame and features from previous frames. The feature $g_t^i$ of each frame image is obtained by the AlexNet feature extractor. The LSTM cell has two inputs: one is the feature $g_t^i$ of each frame image, and the other is the hidden state $h_{t-1}^i$ from the previous unit. $g_t^i$ and $h_{t-1}^i$ first pass through the forget gate. The forget gate determines the degree to which information is forgotten; the process of forgetting and retaining information can be expressed as:

$f_t^i = \sigma\left(W_f \cdot [h_{t-1}^i, g_t^i] + b_f\right)$

In the formula: $f_t^i$ denotes the information of the i-th sample that passes through the forget gate at time t, $g_t^i$ denotes the GoogleNet feature of the i-th sample at time t, σ denotes the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$, which nonlinearly compresses the input to between 0 and 1, $W_f$ denotes the convolution parameters of the forget gate, $b_f$ denotes the offset parameter of the forget gate, and $h_{t-1}^i$ denotes the output feature of the memory unit of the i-th sample at time t-1.

At the same time, the input gate processes the current input and decides which information will be used to update the current state; this update process can be expressed as:

$j_t^i = \sigma\left(W_j \cdot [h_{t-1}^i, g_t^i] + b_j\right)$

$\tilde{C}_t^i = \tanh\left(W_C \cdot [h_{t-1}^i, g_t^i] + b_C\right)$

$C_t^i = f_t^i \ast C_{t-1}^i + j_t^i \ast \tilde{C}_t^i$

In the formula: $j_t^i$ denotes the information of the i-th sample processed by the input gate at time t, $W_j$ denotes the convolution parameters of the input gate, $b_j$ denotes the offset parameter of the input gate, $\tilde{C}_t^i$ denotes the candidate used to update the information, tanh(·) denotes the tanh function $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, which nonlinearly compresses the input to between -1 and 1, $W_C$ denotes the convolution parameters of the update information, $b_C$ denotes the offset parameter of the update information, $C_t^i$ denotes the state of the neuron of the i-th sample at time t, and $C_{t-1}^i$ denotes the state of the neuron of the i-th sample at time t-1.

Finally, the hidden state is updated to produce an output. $o_t^i$ is the output gate, which decides which part of the information will be output; this process can be expressed as:

$o_t^i = \sigma\left(W_o \cdot [h_{t-1}^i, g_t^i] + b_o\right)$

$h_t^i = o_t^i \ast \tanh\left(C_t^i\right)$

In the formula: $W_o$ denotes the convolution parameters of the output gate, $b_o$ denotes the offset parameter of the output gate, and $h_t^i$ denotes the output feature of the memory unit of the i-th sample at time t.
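The gate equations above can be written out directly; the following is a plain re-implementation of a single LSTM step for illustration only (the weight shapes and dictionary names are assumptions), structurally equivalent to what a standard LSTM cell computes.

```python
import torch

def lstm_step(g_t, h_prev, c_prev, W, b):
    """One LSTM step following the forget/input/output gate equations above.
    g_t:    (B, D)  frame feature at time t
    h_prev: (B, H)  hidden state from the previous unit
    c_prev: (B, H)  cell (neuron) state from the previous unit
    W: dict of weight matrices of shape (H, H + D); b: dict of biases of shape (H,)
    """
    z = torch.cat([h_prev, g_t], dim=1)
    f_t = torch.sigmoid(z @ W["f"].T + b["f"])        # forget gate f_t^i
    j_t = torch.sigmoid(z @ W["j"].T + b["j"])        # input gate j_t^i
    c_hat = torch.tanh(z @ W["C"].T + b["C"])         # candidate update
    c_t = f_t * c_prev + j_t * c_hat                  # new neuron state C_t^i
    o_t = torch.sigmoid(z @ W["o"].T + b["o"])        # output gate o_t^i
    h_t = o_t * torch.tanh(c_t)                       # output of the memory unit h_t^i
    return h_t, c_t
```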
Through the LSTM unit, the information of each frame is affected by the previous frames, so that the current frame obtains information from the preceding context. Since the target recognition task is not causal in nature, the following context is equally important; the present invention therefore uses a bidirectional long short-term memory network. This network structure concatenates the information of two LSTM layers and can analyze the features and relationships of the input sequence in the forward and reverse directions simultaneously, learning information from both directions of the context.
The bidirectional long short-term network predicts the quality of each frame while taking the influence of the preceding and following frames into account. By adding the bidirectional LSTM network, the frames carrying the time sequence characteristics obtain larger quality scores, so that the features of the corresponding frames play a reasonable role in forming the final video sample feature.
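A minimal sketch of such a context-aware quality head built from a bidirectional LSTM, in the spirit of the description above; the hidden size, the linear scoring layer and the sigmoid squashing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextQualityHead(nn.Module):
    """Scores each frame using information from both past and future frames."""
    def __init__(self, feat_dim=4096, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)             # concatenated forward/backward states -> score

    def forward(self, frame_feats):                       # (B, T, feat_dim) per-frame features
        ctx, _ = self.bilstm(frame_feats)                 # (B, T, 2*hidden), context in both directions
        return torch.sigmoid(self.score(ctx)).squeeze(-1) # (B, T) quality scores in (0, 1)

scores = ContextQualityHead()(torch.randn(2, 16, 4096))   # 2 clips of 16 frames
print(scores.shape)                                        # torch.Size([2, 16])
```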
S14: according to the quality scores of the single-frame image features, the local features and the global features, aggregating the single-frame image features of the target through a feature aggregation network, and aggregating the local features and the global features of the target;
in the present embodiment, from an image set $S = \{I_1, I_2, \dots, I_N\}$, features of a fixed dimension are extracted to represent the whole video sample; let $R_a(S)$ and $r_a(I_i)$ respectively denote the feature of the image set S and of the i-th frame image $I_i$, where $R_a(S)$ depends on all frames in S:

$R_a(S) = \mathcal{F}\big(r_a(I_1), r_a(I_2), \dots, r_a(I_N)\big)$

In the formula: $r_a(I_i)$ represents the feature of the i-th frame image extracted by GoogleNet, $\mathcal{F}(\cdot)$ represents the aggregation function that maps variable-length video features to fixed-dimension features, and N represents the total number of frames in the image set; wherein:

$R_a(S) = \dfrac{\sum_{i=1}^{N} \mu_i \, r_a(I_i)}{\sum_{i=1}^{N} \mu_i}, \qquad \mu_i = Q(I_i)$

In the formula: $Q(I_i)$ represents the prediction function of the quality score $\mu_i$ of the i-th frame image $I_i$. The features of each part of the body are characterized by dividing the output feature of GoogleNet into three parts. Quality scores are predicted separately for the features of each part, and after the features and quality scores $\mu_i$ of each part are aggregated, the body features of the three parts are concatenated together as the final feature representation of the video sample.
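The three-part body split and per-part weighting can be sketched as below, assuming the GoogleNet output is a (C, H, W) feature map per frame that is cut into three horizontal strips (upper, middle and lower body); the split points and the pooling are assumptions used only for illustration.

```python
import torch

def part_features(feat_map):
    """feat_map: (T, C, H, W) per-frame convolutional features.
    Splits each map into three horizontal strips and average-pools each strip,
    giving one feature vector per body part and per frame: (T, 3, C)."""
    strips = torch.chunk(feat_map, 3, dim=2)              # split along the height axis
    return torch.stack([s.mean(dim=(2, 3)) for s in strips], dim=1)

def aggregate_parts(part_feats, part_scores):
    """part_feats: (T, 3, C) per-part features, part_scores: (T, 3) per-part quality scores.
    Quality-weighted average over the frames, then the three parts are concatenated."""
    w = part_scores / part_scores.sum(dim=0, keepdim=True).clamp_min(1e-8)
    pooled = (w.unsqueeze(-1) * part_feats).sum(dim=0)     # (3, C)
    return pooled.flatten()                                # final (3*C,) video representation
```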
Let $X_i = \{x_1^i, x_2^i, \dots, x_T^i\}$ represent a video sequence, where $x_t^i$ represents the t-th frame image in the video sequence; then:

$S(X_i) = \left\{ \dfrac{\sum_{t=1}^{T} p_t^i \cdot f_t^i}{\sum_{t=1}^{T} p_t^i},\ \dfrac{\sum_{t=1}^{T} q_t^i \cdot h_t^i}{\sum_{t=1}^{T} q_t^i} \right\}$

In the formula: T denotes the number of frames contained in the video sequence, $p_t^i$ denotes the quality score of the t-th frame image, $q_t^i$ denotes the quality score of the aggregated context-based feature of the t-th frame image, $\{\cdot,\cdot\}$ denotes concatenation, $f_t^i$ denotes the feature of the t-th frame image, · denotes multiplication, $h_t^i$ denotes the time sequence feature of the t-th frame image, and $S(X_i)$ denotes the feature of the video sequence $X_i$.
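A compact sketch of this aggregation: local frame features weighted by the single-frame quality scores and temporal features weighted by the context-based scores, then concatenated into the video feature. Variable names mirror the formula; the shapes are assumptions.

```python
import torch

def aggregate_video(f, h, p, q):
    """f: (T, Dl) local per-frame features, h: (T, Dg) temporal features,
    p: (T,) single-frame quality scores, q: (T,) context-based quality scores.
    Returns the fixed-length video feature S(X_i) of size Dl + Dg."""
    local_agg = (p.unsqueeze(1) * f).sum(0) / p.sum().clamp_min(1e-8)
    global_agg = (q.unsqueeze(1) * h).sum(0) / q.sum().clamp_min(1e-8)
    return torch.cat([local_agg, global_agg], dim=0)       # the {.,.} cascade in the formula

T = 16
S = aggregate_video(torch.randn(T, 1024), torch.randn(T, 256),
                    torch.rand(T), torch.rand(T))
print(S.shape)                                              # torch.Size([1280])
```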
S15: and constructing the target recognition model through the trained quality evaluation network, the trained feature extraction network and the trained feature aggregation network.
This embodiment solves the problem of identifying targets with variable appearance and uneven image quality in video sequences. A recurrent network module is added to the quality evaluation module to mine the correlation information between frames and obtain more effective target information, so that the quality evaluation becomes more reasonable because the time sequence information is taken into account. At the same time, a recurrent network is added to the feature extraction module so that it contains context information, which alleviates the problem of variable target appearance. Through the feature aggregation scheme, the extracted features and the quality information are aggregated, and the features of the video sample are synthesized using the information of all frames, so that the extracted features describe the video sample more effectively. In addition, the combination of global features and local features gives a more complete target characterization (including both overall structure information and body part features). For the pedestrian re-identification task, experiments on the iLIDS-VID and PRID2011 datasets show that the top-1 matching rate is improved by approximately 3% over the previous algorithms on average (the evaluation criterion is taken from the Cumulative Match Characteristic (CMC) curve).
Specifically, pedestrian re-identification (Person re-identification) aims at recognizing the same pedestrian across cameras: under fixed-position surveillance, each camera can only capture one segment of a pedestrian's trajectory, and a person appears in several cameras during a long walk; if a person is to be further analyzed, for example tracked or subjected to motion analysis, the target must first be identified across cameras. Given a monitored pedestrian image, the goal of pedestrian re-identification is to find the same pedestrian in the images of another device or view. In real life, images obtained from various capture devices often suffer from severe pose and appearance changes, as well as obvious illumination changes and occlusion. The image quality is therefore uneven, and poor-quality images greatly affect target identification. Pedestrian re-identification is little affected by facial features and is mainly affected by clothing, scale, occlusion, posture, viewing angle and so on. Another important characteristic of pedestrian re-identification is that the person is in motion, and the motion includes both rigid and non-rigid movement, which further increases the differences in appearance.
The present invention proposes a target identification method based on quality evaluation using quality differences due to these factors. First, a quality assessment network of known context is pre-trained using a database of monitoring images with quality metrics, which may be generated using existing well-behaved target recognition systems if not available. And then embedding the quality evaluation network into a feature extraction network to construct an integral network structure. And finally, training the whole network, so that the feature extraction network and the quality evaluation network are mutually promoted in the training, and a target recognition model based on quality evaluation is obtained.
Specifically, the target identification database used by the invention is constructed as follows: video data of a certain length is screened, and it is ensured that each sample contains two videos from two cameras at different angles and different positions. During dataset construction, only samples whose two videos both contain more than 21 frames are selected, and samples that do not meet the requirement are discarded. The screened data are divided into a training set and a test set at a ratio of 1:1 by number of samples. The dataset is divided randomly several times, and the results of multiple experiments are averaged to obtain an accurate result. All image data are scaled to the same size.
In the specific example, the two databases iLIDS-VID and PRID2011 are selected for the experiments, the matching performance of the method is observed, and the results are compared with the best existing methods and analyzed.
The PRID2011 database was built specifically for pedestrian re-identification. The dataset consists of video recorded by two different static surveillance cameras. There are significant variations in viewing angle, and significant differences in illumination, background and camera characteristics between different samples of the same identity. Since the images are extracted from video, they include the motion of the pedestrians. Camera 1 recorded motion videos of 385 pedestrians and camera 2 recorded motion videos of 749 pedestrians; severely occluded samples were already removed from the public database, and to meet our research requirements the samples were further screened. In the experiments, only video samples with more than 21 valid frames are selected, and samples that do not meet this condition are discarded.
The data in the iLIDS-VID database were collected by a multi-camera surveillance network in an airport arrival hall. The dataset uses two non-overlapping camera views and consists of 300 pedestrians of different identities. Each pedestrian sample includes a pair of video samples from different cameras. The number of frames per video sample varies from 23 to 192, with an average length of 73 frames. The iLIDS-VID dataset is very challenging due to clothing similarities between people, illumination and viewpoint variations across camera views, cluttered backgrounds and random occlusion. The most distinctive characteristic of this dataset is that, in a crowded airport, occlusion is common. Moreover, because the samples may not have been collected on the same day, the clothing and appearance of the same person may differ.
Effects of the implementation
Table 1 shows the experimental comparison on the iLIDS-VID dataset; Table 2 shows the experimental comparison on the PRID2011 dataset. In the tables, BLSTM+PQAN (Bidirectional Long Short-Term Memory network + Partial Quality Aware Network) denotes the bidirectional LSTM combined with the target identification network based on the quality of each part of the target; LSTM+PQAN (Long Short-Term Memory network + Partial Quality Aware Network) denotes the single-direction LSTM combined with the part-quality-aware target identification network; PQAN (Partial Quality Aware Network) denotes the target identification network based on the quality of each part of the target; QAN (Quality Aware Network) denotes the target identification network based on overall target quality; CNN+RNN (Convolutional Neural Network + Recurrent Neural Network) denotes the method combining a convolutional neural network with a recurrent neural network; STFV3D denotes a spatio-temporal Fisher vector appearance representation combined with distance metric learning; a temporally aligned pooling representation method is also compared; GOG+KISSME+SRID denotes the Gaussian of Gaussian descriptor combined with the KISSME (Keep It Simple and Straightforward MEtric) metric and SRID; LADF (Locally-Adaptive Decision Functions) denotes the locally adaptive decision function method; Spindle Net (Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion) denotes the person re-identification network combining local and global features under human body region guidance; PAM-LOMO+KISSME denotes the method combining an appearance model with the LOMO (Local Maximal Occurrence) descriptor and KISSME metric learning; CNN+XQDA (Convolutional Neural Network + Cross-view Quadratic Discriminant Analysis) denotes the method combining a convolutional neural network with cross-view quadratic discriminant analysis; and GOG+XQDA (Gaussian of Gaussian descriptor + Cross-view Quadratic Discriminant Analysis) denotes the method combining the Gaussian of Gaussian descriptor with cross-view quadratic discriminant analysis. In Tables 1 and 2, R1 (Top-1 Matching Rate) denotes the top-1 matching rate, R5 denotes the top-5 matching rate, R10 denotes the top-10 matching rate, and R20 denotes the top-20 matching rate.
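The reported R1/R5/R10/R20 values come from the cumulative match characteristic curve; a rough sketch of computing top-k matching rates from a query-to-gallery distance matrix is given below, assuming each query identity appears in the gallery.

```python
import numpy as np

def cmc_topk(dist, query_ids, gallery_ids, ks=(1, 5, 10, 20)):
    """dist: (num_query, num_gallery) distance matrix between video features.
    Returns the fraction of queries whose correct identity appears within the
    top-k closest gallery entries, for each k (the CMC values R1/R5/R10/R20)."""
    order = np.argsort(dist, axis=1)                   # gallery indices sorted by distance
    ranked_ids = np.asarray(gallery_ids)[order]
    hits = ranked_ids == np.asarray(query_ids)[:, None]
    first_hit = hits.argmax(axis=1)                    # rank of the correct match per query
    return {f"R{k}": float((first_hit < k).mean()) for k in ks}
```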
TABLE 1: Experimental comparison on the iLIDS-VID dataset (R1/R5/R10/R20 matching rates; table provided as an image in the original).
TABLE 2: Experimental comparison on the PRID2011 dataset (R1/R5/R10/R20 matching rates; table provided as an image in the original).
First, the effectiveness of combining global and local features is verified: experiments show that when quality-evaluated global and local features are used together, performance is improved by 3.3%, which verifies the important role of quality evaluation in assessing the effectiveness of the global and local features. On this basis, an LSTM structure is added to extract time sequence features, and this timing module further improves the accuracy of pedestrian re-identification. When only one LSTM layer is added to the PQAN framework, the top-1 matching rate is improved by 0.3% compared with the baseline method; when the bidirectional cascaded LSTM module is embedded in the framework, the matching rate is improved by 2.3%, which is 4.2% higher than the baseline algorithm PQAN. Moreover, the present invention outperforms most existing methods on all criteria listed in the table. Although the top-1 matching rate of the PAM-LOMO+KISSME method is slightly higher than that of the present invention, the present invention performs better on top-5, top-10 and top-20. In addition, that method uses multiple appearance models and a complex framework structure to extract local features, which increases the complexity of the network. Overall, the proposed method is superior to the PAM-LOMO+KISSME method.
First, it can be seen from the comparative experiments that the improvements of the invention increase the matching rate. Second, compared with other methods, the baseline methods QAN and PQAN already outperform the other results, and the proposed method improves further upon them: the single-layer LSTM and the bidirectional cascaded LSTM improve the baseline method by 0.7% and 2.1%, respectively. It is worth noting that both the top-5 and top-10 matching rates of PAM-LOMO+KISSME are higher than those of the present invention.
In summary, the effectiveness of the present invention is demonstrated by comparison with the baseline methods QAN and PQAN. Compared with existing methods of better performance, at least one index of the proposed method is superior to the other methods, and its performance on the remaining indices is not inferior to the comparison methods.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (7)

1. A target identification method based on quality evaluation is characterized by comprising the following steps:
constructing a target recognition model, wherein the target recognition model comprises: the system comprises a quality evaluation network, a feature extraction network and a feature aggregation network, wherein the target identification model is used for extracting effective target features from a video so as to represent the whole structure information and the local information of a target;
training the target recognition model, and adjusting parameters of a quality evaluation network and a feature extraction network in the training process so as to enable the target recognition model to output target features meeting preset requirements;
carrying out target recognition on the video through the trained target recognition model;
the constructing of the target recognition model comprises the following steps:
acquiring image data with known quality standards, and training a quality evaluation network through the image data to obtain a trained quality evaluation network;
extracting the characteristics of the single-frame image through a characteristic extraction network to obtain the local characteristics of the target; forming global characteristics according to the extracted characteristics of the context information of the target; specifically, a GoogleNet network is used as a feature extractor to extract the features of a single-frame image so as to obtain the local features of a target, and a bidirectional long-short term memory network (LSTM) is used to extract the features of the context information of the target so as to form global features;
performing quality evaluation on the local features and the global features of the target through a trained quality evaluation network to obtain corresponding quality scores;
according to the quality scores of the local features and the global features, respectively aggregating the local features and the global features of each frame of the target through a feature aggregation network, and aggregating the local features and the global features of the target;
constructing the target recognition model through a trained quality evaluation network, a feature extraction network and a feature aggregation network;
the quality evaluation network comprises: an AlexNet feature extractor and a bidirectional long short-term memory network LSTM, wherein the AlexNet feature extractor is used to evaluate the quality of the single-frame image features of the target and to generate quality scores for the local features, and the bidirectional long short-term memory network LSTM is used to evaluate the quality of the global features.
2. The method for identifying an object based on quality evaluation according to claim 1, wherein the acquiring image data with known quality standard comprises:
acquiring a first video and a second video from two cameras at different angles and different positions from a database with known quality standards, wherein the first video and the second video both comprise targets;
selecting N first video samples with the frame number larger than 21 frames from the first video, and selecting N second video samples with the frame number larger than 21 frames from the second video; wherein N is a natural number greater than or equal to 2;
and selecting a training set and a testing set from the first video sample and the second video sample, wherein the training set is used for training the quality evaluation network, and the testing set is used for testing the quality evaluation network.
3. The method for identifying an object based on quality evaluation according to claim 1, wherein the acquiring image data with known quality standard comprises:
the method comprises the following steps of taking a video containing a target as the input of a face recognition system, and taking the output result of the face recognition system as a data image with a known quality standard; the last layer of the face recognition system is a softmax layer, and the probability that a person with an identity i is recognized as the identity i is used as a quality label;
assuming that the training set consists of m labelled samples {(x_1, y_1), …, (x_m, y_m)}, where x_i ∈ R^n and y_i ∈ {1, 2, …, N}, the probability that sample i belongs to category j is

p(y_i = j | x_i) = exp(z_j^(i)) / Σ_{k=1}^{N} exp(z_k^(i))

dividing by Σ_{k=1}^{N} exp(z_k^(i)) normalizes the probability distribution over the classes so that all the probabilities sum to 1; the probability obtained when j equals i, i.e. the probability that the person with identity i is recognized as identity i, is taken as the quality standard of the image, that is, the quality standard is p(y_i = i | x_i);

in the formula: (x_1, y_1) denotes the sample numbered 1 and (x_m, y_m) the sample numbered m; x_i denotes the feature representation of the i-th sample, with i ranging from 1 to m; R^n denotes the real space whose dimension n equals the output dimension of the fully connected layer preceding the softmax layer; y_i denotes the label of the i-th sample; p(y_i = j | x_i) denotes the probability that sample i belongs to category j; z_j^(i) denotes the raw output of the j-th neuron when the i-th sample reaches the softmax layer; z_k^(i) denotes the raw output of the k-th neuron when the i-th sample reaches the softmax layer; N denotes the number of classes; and k denotes a counting variable.
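The quality-label construction of claim 3 amounts to reading off the softmax probability of the true identity. A minimal sketch, assuming PyTorch tensors for the pre-softmax outputs (logits) and the identity labels; the function name quality_labels is hypothetical:

```python
# Sketch of claim 3's quality labels: run each sample through the classifier
# and use the softmax probability of its true identity as the quality label.
import torch
import torch.nn.functional as F

def quality_labels(logits, labels):
    """logits: (m, N) raw outputs of the layer before softmax
       labels: (m,) true identity index of each sample
       returns (m,) quality labels p(y_i = label_i | x_i)."""
    probs = F.softmax(logits, dim=1)                 # normalize so each row sums to 1
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

logits = torch.randn(4, 10)                          # 4 samples, 10 identities
labels = torch.tensor([3, 1, 7, 3])
print(quality_labels(logits, labels))                # values in (0, 1)
```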
4. The method for identifying a target based on quality evaluation according to claim 1, wherein extracting the single-frame image features as the local features of the target comprises:
selecting the features of the inception_5b layer of the GoogleNet network: the image is scaled to 224 × 224 and fed into the image input layer, and after passing through the five-stage Inception network structure the output of the inception_5b layer is taken as the single-frame image feature, wherein an Inception module executes 1x1, 3x3 and 5x5 convolution layers and a 3x3 pooling layer in parallel and concatenates the parallel outputs as the result of the module.
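A short sketch of pulling the inception_5b output from the torchvision GoogLeNet with a forward hook; the hook-based extraction is an assumption about tooling, not the patent's own implementation:

```python
# Sketch of claim 4's feature selection: capture the inception_5b activation
# for one 224x224 frame via a forward hook on a torchvision GoogLeNet.
import torch
import torchvision.models as models

model = models.googlenet(aux_logits=False, init_weights=True).eval()
captured = {}
model.inception5b.register_forward_hook(
    lambda module, inp, out: captured.update(feat=out))   # grab inception_5b output

frame = torch.randn(1, 3, 224, 224)                       # image scaled to 224 x 224
with torch.no_grad():
    model(frame)
print(captured["feat"].shape)                             # (1, 1024, 7, 7)
```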
5. The method for identifying a target based on quality evaluation according to claim 1, further comprising, after extracting the features of the single-frame images with the GoogleNet network as the feature extractor to generate the local features of the target:
inputting the extracted single-frame image features into the time-sequence network (the bidirectional LSTM) to obtain the time-sequence features corresponding to the single-frame image features.
6. The method for identifying a target based on quality evaluation according to claim 1, wherein performing quality evaluation on the local features and the global features of each frame of the target through the trained quality evaluation network to obtain the corresponding quality scores comprises:
predicting the quality score of each single-frame image of the target through the AlexNet feature extractor, wherein the prediction formula is:

x_i^t —A→ p_i^t,  t = 1, …, T

P(X_i) = {p_i^1, p_i^2, …, p_i^T}

in the formula: x_i^t denotes the image at the t-th moment of the i-th video sample, → denotes the neural network operation, A denotes AlexNet, T denotes the length of the video sample, p_i^t denotes the quality score at time t of the i-th sample, P(X_i) denotes the set of quality scores of the frame images of the i-th sample, and X_i denotes the i-th video sample;

the set of context-based quality scores Q(G_i) is calculated as follows:

g_i^t —H'→ q_i^t,  t = 1, …, T

Q(G_i) = {q_i^1, q_i^2, …, q_i^T}

in the formula: H' denotes the LSTM network structure, Q(G_i) denotes the set of context-based quality scores of the frames of the i-th video sample, q_i^T denotes the context-based quality score of the i-th sample at time T, G_i denotes the GoogleNet feature representation of the i-th video sample, g_i^t denotes the GoogleNet feature of the i-th sample at time t, and q_i^t denotes the context-based quality score at time t of the i-th video sample.
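A minimal sketch of the two score sets of claim 6, assuming an AlexNet-based frame scorer for P(X_i) and a bidirectional LSTM standing in for H' over the GoogleNet features for Q(G_i); the sigmoid heads and layer sizes are assumptions:

```python
# Sketch of claim 6: per-frame quality scores p_i^t from an AlexNet branch and
# context-based quality scores q_i^t from a BiLSTM over the GoogleNet features.
import torch
import torch.nn as nn
import torchvision.models as models

T = 8                                                  # length of the video sample
frames = torch.randn(T, 3, 224, 224)                   # x_i^1 ... x_i^T
g_feats = torch.randn(1, T, 1024)                      # GoogleNet features G_i

alexnet = models.alexnet()
frame_scorer = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten(),
                             nn.Linear(256 * 6 * 6, 1), nn.Sigmoid())
p = frame_scorer(frames).squeeze(1)                    # P(X_i) = {p_i^1, ..., p_i^T}

context_lstm = nn.LSTM(1024, 128, batch_first=True, bidirectional=True)   # H'
score_head = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())
q = score_head(context_lstm(g_feats)[0]).squeeze(-1).squeeze(0)           # Q(G_i)
print(p.shape, q.shape)                                # torch.Size([8]) torch.Size([8])
```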
7. The method for identifying a target based on quality evaluation according to claim 1, wherein aggregating, through the feature aggregation network and according to the quality scores of the local features and the global features, the local features and the global features of each frame of the target into the local and global features of the target comprises:
from an image set S = {I_1, I_2, …, I_N}, features of fixed dimension are extracted to represent the whole video sample; let R_a(S) and r_{a_i} respectively denote the feature (local/global feature) of the image set S and of the i-th frame image I_i; R_a(S) depends on all the frames in S:

R_a(S) = F(r_{a_1}, r_{a_2}, …, r_{a_N})

in the formula: r_{a_i} denotes the feature of the i-th frame image extracted by GoogleNet, F(·) denotes the aggregation function that maps the variable-length video features to features of fixed dimension, and N denotes the number of frames in the image set; the aggregation is weighted by the quality scores:

R_a(S) = Σ_{i=1}^{N} μ_i r_{a_i} / Σ_{i=1}^{N} μ_i,  with μ_i = Q(I_i)

in the formula: Q(I_i) denotes the prediction function of the quality score μ_i of the i-th frame image I_i;

let X_i = {x_i^1, x_i^2, …, x_i^T} denote a video sequence, where x_i^t denotes the t-th frame image of the video sequence; then:

S(X_i) = { Σ_{t=1}^{T} p_i^t · f_i^t / Σ_{t=1}^{T} p_i^t ,  Σ_{t=1}^{T} q_i^t · g_i^t / Σ_{t=1}^{T} q_i^t }

in the formula: T denotes the number of frames contained in the video sequence, p_i^t denotes the quality score of the t-th frame image, q_i^t denotes the quality score of the aggregated (context-based) feature of the t-th frame image, {·, ·} denotes the cascade (concatenation), f_i^t denotes the feature of the t-th frame image, · denotes the multiplication operation, g_i^t denotes the temporal feature of the t-th frame image, and S(X_i) denotes the feature of the video sequence X_i.
CN201810487252.XA 2018-05-21 2018-05-21 Target identification method based on quality evaluation Active CN108765394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810487252.XA CN108765394B (en) 2018-05-21 2018-05-21 Target identification method based on quality evaluation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810487252.XA CN108765394B (en) 2018-05-21 2018-05-21 Target identification method based on quality evaluation

Publications (2)

Publication Number Publication Date
CN108765394A CN108765394A (en) 2018-11-06
CN108765394B true CN108765394B (en) 2021-02-05

Family

ID=64008435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810487252.XA Active CN108765394B (en) 2018-05-21 2018-05-21 Target identification method based on quality evaluation

Country Status (1)

Country Link
CN (1) CN108765394B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020581B (en) 2018-12-03 2020-06-09 阿里巴巴集团控股有限公司 Comparison method and device based on multi-frame face images and electronic equipment
CN111435431A (en) * 2019-01-15 2020-07-21 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN109871780B (en) * 2019-01-28 2023-02-10 中国科学院重庆绿色智能技术研究院 Face quality judgment method and system and face identification method and system
US11176654B2 (en) * 2019-03-27 2021-11-16 Sharif University Of Technology Quality assessment of a video
CN110121110B (en) * 2019-05-07 2021-05-25 北京奇艺世纪科技有限公司 Video quality evaluation method, video quality evaluation apparatus, video processing apparatus, and medium
CN110210126B (en) * 2019-05-31 2023-03-24 重庆大学 LSTMPP-based gear residual life prediction method
CN110502665B (en) * 2019-08-27 2022-04-01 北京百度网讯科技有限公司 Video processing method and device
CN111222487B (en) * 2020-01-15 2021-09-28 浙江大学 Video target behavior identification method and electronic equipment
CN111666823B (en) * 2020-05-14 2022-06-14 武汉大学 Pedestrian re-identification method based on individual walking motion space-time law collaborative identification
CN111914613B (en) * 2020-05-21 2024-03-01 淮阴工学院 Multi-target tracking and facial feature information recognition method
CN111814567A (en) * 2020-06-11 2020-10-23 上海果通通信科技股份有限公司 Method, device and equipment for detecting living human face and storage medium
CN112330613B (en) * 2020-10-27 2024-04-12 深思考人工智能科技(上海)有限公司 Evaluation method and system for cytopathology digital image quality
CN113160050B (en) * 2021-03-25 2023-08-25 哈尔滨工业大学 Small target identification method and system based on space-time neural network
CN113837107A (en) * 2021-09-26 2021-12-24 腾讯音乐娱乐科技(深圳)有限公司 Model training method, video processing method, electronic device and readable storage medium
CN115908280B (en) * 2022-11-03 2023-07-18 广东科力新材料有限公司 Method and system for determining performance of PVC (polyvinyl chloride) calcium zinc stabilizer based on data processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10681391B2 (en) * 2016-07-13 2020-06-09 Oath Inc. Computerized system and method for automatic highlight detection from live streaming media and rendering within a specialized media player

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023226A (en) * 2014-05-28 2014-09-03 北京邮电大学 HVS-based novel video quality evaluation method
CN105046277A (en) * 2015-07-15 2015-11-11 华南农业大学 Robust mechanism research method of characteristic significance in image quality evaluation
CN107341463A (en) * 2017-06-28 2017-11-10 北京飞搜科技有限公司 A kind of face characteristic recognition methods of combination image quality analysis and metric learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
No-Reference quality assessment for multiply distorted images based on deep learning; Qingbing Sang et al.; 2017 International Smart Cities Conference (ISC2); 20171231; pp. 1-2 *
Quality Aware Network for Set to Set Recognition; Yu Liu et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 20171231; pp. 4694-4703 *
Implementation of intrinsic image decomposition based on convolutional neural network (基于卷积神经网络的本征图像分解的实现); Sun Xing et al.; Journal of Beijing Electronic Science and Technology Institute (北京电子科技学院学报); 20171231; Vol. 25, No. 4; pp. 74-80 *

Also Published As

Publication number Publication date
CN108765394A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108765394B (en) Target identification method based on quality evaluation
Liu et al. Video-based person re-identification with accumulative motion context
Zhang et al. Facial expression recognition based on deep evolutional spatial-temporal networks
Misra et al. Shuffle and learn: unsupervised learning using temporal order verification
Wang et al. Unsupervised learning of visual representations using videos
CN107609460B (en) Human body behavior recognition method integrating space-time dual network flow and attention mechanism
Song et al. Multimodal multi-stream deep learning for egocentric activity recognition
Xu et al. Deepmot: A differentiable framework for training multiple object trackers
Deep et al. Leveraging CNN and transfer learning for vision-based human activity recognition
Jalalvand et al. Real-time reservoir computing network-based systems for detection tasks on visual contents
Kollias et al. Training deep neural networks with different datasets in-the-wild: The emotion recognition paradigm
CN110188637A (en) A kind of Activity recognition technical method based on deep learning
Zhang et al. Image-to-video person re-identification with temporally memorized similarity learning
Ma et al. Video saliency forecasting transformer
CN110503053A (en) Human motion recognition method based on cyclic convolution neural network
Zhang et al. A multi-scale spatial-temporal attention model for person re-identification in videos
CN111126223B (en) Video pedestrian re-identification method based on optical flow guide features
CN111339908B (en) Group behavior identification method based on multi-mode information fusion and decision optimization
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
Snoun et al. Towards a deep human activity recognition approach based on video to image transformation with skeleton data
CN113642482B (en) Video character relation analysis method based on video space-time context
CN111967433A (en) Action identification method based on self-supervision learning network
Jin et al. Real-time action detection in video surveillance using a sub-action descriptor with multi-convolutional neural networks
Behera et al. Person re-identification: A taxonomic survey and the path ahead
Wang et al. Pose-based two-stream relational networks for action recognition in videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant