CN108416632B - Dynamic video identification method - Google Patents

Dynamic video identification method

Info

Publication number
CN108416632B
CN108416632B (application CN201810253312.1A)
Authority
CN
China
Prior art keywords
user
pixel
point
target
value
Prior art date
Legal status
Active
Application number
CN201810253312.1A
Other languages
Chinese (zh)
Other versions
CN108416632A (en)
Inventor
李仁超
Current Assignee
Jiangsu Wenwen Network Technology Co ltd
Original Assignee
Shi Yongbing
Priority date
Filing date
Publication date
Application filed by Shi Yongbing
Priority to CN201810253312.1A
Publication of CN108416632A
Application granted
Publication of CN108416632B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0283 Price estimation or determination
    • G06Q 30/0284 Time or distance, e.g. usage of parking meters or taximeters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 20/00 Payment architectures, schemes or protocols
    • G06Q 20/38 Payment protocols; Details thereof
    • G06Q 20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q 20/401 Transaction verification
    • G06Q 20/4014 Identity check for transactions
    • G06Q 20/40145 Biometric identity checks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Abstract

The invention provides a dynamic video identification method, which comprises the following steps: a passenger registers personal information with the identity authentication cloud in advance, submits the user's facial image information and associates a personal account; after successful registration, the user obtains a unique ID, and the unique ID and the corresponding user information are stored in a database of the identity authentication cloud; the facial frame of the user is captured according to the user identification information at the card swiping site, and the user identifier is recognized from the facial frame; the user ID recorded at the card swiping site and settlement data describing the payment time and place of the user are transmitted to the transaction cloud; and the transaction cloud calculates the charge owed by the passenger for entering and leaving the station according to the user ID, and sends the charge value and the ticket purchasing mode to the passenger terminal. The method requires no additional IC card equipment for users, saves a large amount of equipment cost, and improves computation efficiency and passenger throughput.

Description

Dynamic video identification method
Technical Field
The invention relates to video identification, in particular to a dynamic video identification method.
Background
In modern cities, the subway is increasingly widely used as a convenient, fast, stable and high-capacity means of transport. With large numbers of passengers entering and leaving a subway station, ensuring station throughput and preventing crowding are very important. For example, swiping a card into or out of a station requires holding the card against the sensing area, and long queues often form at the gates during rush hours, giving a poor user experience. Ticket purchasing systems based on face recognition have been developed in the prior art, using pre-installed photographing devices to collect and recognize passenger images. However, when applied to an indoor multi-target scene with a complex background, low image quality and variable appearance, simple manually selected features can hardly separate the user from the crowd background, and the accuracy of segmentation and identification is low.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for identifying a dynamic video, which comprises the following steps:
the passenger registers personal information in the identity authentication cloud in advance, submits the facial image information of the user and associates a personal account; after the registration is successful, the user obtains a unique ID, and the unique ID and the corresponding user information are stored in a database of the identity authentication cloud;
capturing the facial frame of the user according to the user identification information of the card swiping site, and identifying a user identifier according to the facial frame of the user; transmitting the user ID passing through the card swiping site and settlement data of the payment time and place of the user to a transaction cloud;
and the transaction cloud calculates the charging required by the passengers entering and leaving the station according to the user ID, and sends the charging value and the ticket buying mode to the passenger terminal.
Preferably, when the user registers the personal information, the user logs in the identity authentication cloud through the passenger terminal to register the personal information.
Preferably, the transaction cloud acquires the user ID and the user information through the identity authentication cloud, calculates the charge to be paid by the passenger according to a preset ticket purchasing mode, and sends the charge value and the ticket purchasing mode to the passenger terminal.
Preferably, the passenger terminal and the settlement client are each in communication connection with the transaction cloud; the settlement client comprises a triggering unit, a face recognition unit, an access control system for controlling the passage of users, and a control unit; the triggering unit is used for recognizing that a user has arrived at the card swiping station and sending user identification information to the control unit; the face recognition unit is used for capturing the facial frame of the user according to the face video frame acquisition control instruction sent by the control unit and transmitting the face video frame to the control unit; the control unit is used for receiving the user identification information sent by the triggering unit and sending the face video frame acquisition control instruction to the face recognition unit, so as to control the face recognition unit to capture the facial frame of the user and recognize the user identifier, and for transmitting the user ID recorded at the card swiping site, the payment time and place of the user and the settlement data related to the user's ticket purchase to the transaction cloud; the identity authentication cloud is used for the user to register a personal ID; the settlement data may include the mileage charge calculated according to the user's entry time and exit time.
Compared with the prior art, the invention has the following advantages:
the invention provides a dynamic video identification method, which does not need to increase IC equipment for users, saves a large amount of equipment cost, and improves the calculation efficiency and the passenger flow passing efficiency.
Drawings
Fig. 1 is a flowchart of a method for identifying a motion video according to an embodiment of the present invention.
Detailed Description
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details.
One aspect of the present invention provides a method for identifying a motion video. Fig. 1 is a flowchart of a method for identifying a motion video according to an embodiment of the present invention.
The subway ticket card settlement system comprises settlement clients arranged at all subway gates, an identity authentication cloud, a transaction cloud and passenger terminals. The passenger terminal and the settlement client are each in communication connection with the transaction cloud. The settlement client comprises a triggering unit, a face recognition unit, an access control system for controlling the passage of users, and a control unit. The triggering unit is used for recognizing that a user has arrived at the card swiping station and sending user identification information to the control unit. The face recognition unit is used for capturing the facial frame of the user according to the face video frame acquisition control instruction sent by the control unit and transmitting the face video frame to the control unit. The control unit is used for receiving the user identification information sent by the triggering unit and sending the face video frame acquisition control instruction to the face recognition unit, so as to control the face recognition unit to capture the facial frame of the user and recognize the user identifier, and for transmitting the user ID recorded at the card swiping site, the payment time and place of the user and the settlement data related to the user's ticket purchase to the transaction cloud. The identity authentication cloud is used for the user to register a personal ID. The settlement data may include the mileage charge calculated according to the user's entry time and exit time.
Before a passenger settles through a gate based on the system, the passenger registers personal information in the identity authentication cloud in advance, submits the facial image information of the user and associates the personal account. After the registration is successful, the user obtains the unique ID. The unique ID and the corresponding user information are stored in a database of the identity authentication cloud. When the user registers the personal information, the user can log in the identity authentication cloud through the passenger terminal to register the personal information. The transaction cloud is used for acquiring the user ID and the user information through the identity authentication cloud, the transaction cloud is provided with a ticket purchasing module, the ticket purchasing module calculates the charging required by the passengers entering and leaving the station according to a preset ticket purchasing mode, and sends the charging value and the ticket purchasing mode to the passenger terminal.
The passenger terminal enables a user to obtain the charging information and the ticket buying mode information, the passenger terminal automatically pays, and the payment completion information is uploaded to the transaction cloud. The control unit is also used for receiving confirmation information and authorization information sent by the transaction cloud, controlling the gate to be opened according to the confirmation information and the authorization information, and releasing passengers.
In this embodiment, data transmission between the transaction cloud and the identity authentication cloud is carried out as follows: the transaction cloud may access and acquire only the user ID and the user information stored in the identity authentication cloud, while the identity authentication cloud cannot access or acquire the data stored in the transaction cloud. The transaction cloud periodically pulls the user ID and the user facial frame information from the identity authentication cloud and updates the transaction cloud database.
The transaction cloud can be used as an independent server of a service provider, is connected with the identity authentication cloud through the Internet, and synchronously updates the user ID and the user information in the database. The transaction cloud includes a ticketing module. And the ticket buying module calculates the amount according to the requirements of the ticket buying mode and the access site.
Correspondingly, the invention also provides a subway ticket card settlement method based on the passenger terminal, which comprises the following steps:
Step 1: the triggering unit recognizes that the user has arrived at the card swiping station and sends the user identification information to the control unit;
Step 2: the control unit receives the user identification information sent by the triggering unit and sends a face video frame acquisition control instruction to the face recognition unit;
Step 3: the face recognition unit captures the facial frame of the user according to the face video frame acquisition control instruction sent by the control unit and transmits the face video frame to the control unit;
Step 4: the control unit acquires the facial frame of the user captured by the face recognition unit and recognizes the user identifier; the user ID recorded at the card swiping site and the settlement data of the user's payment time and place are transmitted to the transaction cloud;
Step 5: the transaction cloud retrieves the user information according to the user ID, calculates the charge owed by the passenger for entering and leaving the station according to the preset ticket purchasing mode and the settlement data, and sends the charge value and the ticket purchasing mode to the passenger terminal;
Step 6: the passenger terminal automatically pays the charge, and the payment completion information is uploaded to the transaction cloud;
Step 7: the transaction cloud sends the payment completion information to the control unit, and the control unit controls the barrier gate mechanism to open and release the passenger who has completed payment.
Prior to step 1, the method further comprises: the user accesses the identity authentication cloud, registers personal information and user information, and associates a personal account; after the registration is successful, the user obtains a unique identity ID; the identity ID, the personal information submitted by the user and the corresponding user information are stored in a database of the identity authentication cloud.
After the photographing device is started to scan the video frame to be identified, the face frame of the video frame to be identified, which is located in the scanning area, is acquired. And extracting the characteristic pixel points of the frame to generate a characteristic set to be identified. Specifically, a corresponding scale space is generated according to the face frame, then local extreme points in the scale space are detected, and then the local extreme points are accurately positioned by removing points with contrast lower than a threshold and edge response points, so that feature pixel points capable of reflecting features of the face frame are finally obtained.
When describing the feature pixel points, the main direction of each extreme point is calculated, gradient-direction histogram statistics are computed over the area centered on the extreme point, and a feature descriptor is generated. The feature pixel points then form the feature set to be identified.
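By way of illustration only, the following sketch shows this kind of scale-space extremum detection and gradient-orientation descriptor generation, assuming an OpenCV-style SIFT implementation; the library choice, function names and threshold values are assumptions of this sketch, not part of the method described above.

```python
import cv2
import numpy as np

def extract_feature_set(face_frame_gray: np.ndarray):
    """Detect scale-space extrema on a face frame and describe each feature
    pixel point with a gradient-orientation histogram descriptor.

    contrastThreshold / edgeThreshold are illustrative stand-ins for the
    removal of low-contrast points and edge response points mentioned above.
    """
    sift = cv2.SIFT_create(contrastThreshold=0.04, edgeThreshold=10)
    keypoints, descriptors = sift.detectAndCompute(face_frame_gray, None)
    # keypoints: refined feature pixel points of the face frame
    # descriptors: one orientation-histogram descriptor per point
    return keypoints, descriptors
```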
A sample feature set is acquired from the identity authentication cloud, and feature matching is performed between the feature set to be identified and the sample feature set. Specifically, the feature set to be identified can be matched against the sample feature set as follows: the number of feature pixel point pairs successfully matched between the feature set to be identified and the sample feature set is counted as the first matching pair number; the number of target feature pixel points in the sample feature set is obtained as the first number; and the ratio of the first matching pair number to the first number is calculated as the similarity. Finally, the similarity is compared with a second threshold, and if the similarity is greater than the second threshold, the sample feature set is judged to be successfully matched.
And then, if the matching is successful, performing feature matching on the feature set to be identified and a verification feature set corresponding to the successfully matched sample feature set to calculate the identification similarity. Then, the number of the feature pixel points successfully matched is counted to serve as a second matching pair number, the number of the feature pixel points in the feature set to be identified is obtained to serve as a second number, and the number of the verification feature pixel points in the verification feature set is obtained to serve as a third number. And finally, calculating the ratio of the second matching pair quantity to the smaller value of the second quantity and the third quantity to serve as the identification similarity.
Finally, if the recognition similarity exceeds a first threshold, it is determined that the video frame to be identified contains the target recognition user corresponding to the sample feature set. Specifically, this determination can be made as follows. First, it is judged whether the recognition similarity exceeds the first threshold; if so, the number of verification feature sets whose recognition similarity exceeds the first threshold is counted. Then, it is judged whether this number is greater than 1; if so, the sample feature set associated with the verification feature set having the highest recognition similarity is taken. Further, if no verification feature set has a recognition similarity exceeding the first threshold, it is determined that the target recognition user corresponding to the sample feature set does not appear in the video sequence.
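The two-stage matching and thresholding above can be sketched as follows, assuming descriptor sets produced as in the previous sketch; the nearest-neighbour ratio test, the threshold values and all helper names are illustrative assumptions rather than values from the patent.

```python
import cv2

def count_matches(desc_a, desc_b, ratio=0.75):
    """Count descriptor pairs passing a nearest-neighbour ratio test."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_a, desc_b, k=2)
    return sum(1 for m, n in knn if m.distance < ratio * n.distance)

def identify(desc_query, sample_sets, first_threshold=0.5, second_threshold=0.4):
    """sample_sets: list of (user_id, sample_desc, verify_desc)."""
    best = None
    for user_id, sample_desc, verify_desc in sample_sets:
        # similarity = matched pairs / number of target feature pixel points
        sim = count_matches(desc_query, sample_desc) / len(sample_desc)
        if sim <= second_threshold:
            continue                      # sample feature set not matched
        # recognition similarity against the associated verification set
        pairs = count_matches(desc_query, verify_desc)
        rec_sim = pairs / min(len(desc_query), len(verify_desc))
        if rec_sim > first_threshold and (best is None or rec_sim > best[1]):
            best = (user_id, rec_sim)
    return best   # None: no target recognition user found in this frame
```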
In the above video identification process, the sample feature sets used for feature matching need to be generated in advance. First, a facial frame to be processed is obtained; it contains a target recognition user, and the target recognition user comprises a target feature object and at least one verification feature object. Feature pixel points of the target feature object are extracted as target feature pixel points, which form the sample feature set; feature pixel points of the verification feature object in the picture to be processed are extracted as verification feature pixel points, which form the verification feature set of the verification feature object. Finally, the sample feature set is associated with the verification feature set, and the associated sets correspond to the target recognition user. After all facial frames to be processed have been preprocessed to generate their corresponding sample feature sets, all the sample feature sets are stored in the identity authentication cloud.
In the process of capturing the facial frame of the user by the face recognition unit, in order to reconstruct the background in a motion scene and effectively avoid the mixing phenomenon of a target and the background, the following method is adopted in the target positioning process:
(1) Establish a video grayscale two-dimensional vector.
(2) Determine current-frame and background pixel points using the symmetric adjacent-frame difference.
(3) Count and update the two-dimensional vector according to the determined background pixel points.
(4) Construct the whole initial background.
Let the size of the input video frame be M × N. A two-dimensional vector LM is created, where the value of each element LM(p, l) represents the total number of occurrences of the pixel value l (0 < l < 255) at the pixel p in the video frames. Let the video sequence be (I_0, I_1, I_2, …, I_{T+1}), and let I(p, t−1), I(p, t), I(p, t+1) represent the pixel values at point p in the (t−1)-th, t-th and (t+1)-th frames of the T+2 frames. The forward and backward mask maps of the t-th frame are then:

D_{-1}(p, t) = 1 if |I(p, t) − I(p, t−1)| > Th_{-1}(t), and 0 otherwise

D_{+1}(p, t) = 1 if |I(p, t+1) − I(p, t)| > Th_{+1}(t), and 0 otherwise

wherein t = 1, 2, …, T, and Th_{-1}(t), Th_{+1}(t) are the thresholds used to judge whether the pixel value at point p has changed.

A logical AND operation is performed on D_{+1}(p, t) and D_{-1}(p, t) to obtain the mask map of the moving pixel points:

OB(p, t) = D_{+1}(p, t) AND D_{-1}(p, t)

For any point p, if OB(p, t) = 1, that is, the values of D_{+1}(p, t) and D_{-1}(p, t) are both 1, the current point p is a pixel point of the recognized foreground; otherwise, the current point p is a background pixel point.

Then, the two-dimensional vector LM is counted and updated: if OB(p, t) at point p is 0, the number of occurrences of the pixel value observed at p is increased by 1; otherwise, no processing is performed.

Steps 2 and 3 are repeated over the selected T+2 frames. The two-dimensional vector LM is then examined by pixel value, and for each pixel point p the pixel value with the largest number of occurrences is taken as its initial background pixel value, which completes the whole initial background B(p), namely

B(p) = arg max_l LM(p, l)
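A minimal sketch of this symmetric-frame-difference background initialization is given below, assuming grayscale frames in a NumPy array and a single fixed threshold in place of the per-frame thresholds Th_{-1}(t) and Th_{+1}(t); array names and the threshold value are illustrative.

```python
import numpy as np

def initial_background(frames: np.ndarray, th: int = 25) -> np.ndarray:
    """frames: (T+2, M, N) uint8 grayscale sequence.
    Returns the initial background B(p), i.e. for every pixel the value that
    occurred most often while the pixel was judged static."""
    t2, m, n = frames.shape
    lm = np.zeros((m, n, 256), dtype=np.int32)            # two-dimensional vector LM
    for t in range(1, t2 - 1):
        d_prev = np.abs(frames[t].astype(int) - frames[t - 1]) > th   # D_{-1}
        d_next = np.abs(frames[t + 1].astype(int) - frames[t]) > th   # D_{+1}
        ob = d_prev & d_next                               # moving-pixel mask OB
        ys, xs = np.where(~ob)                             # background pixels only
        lm[ys, xs, frames[t][ys, xs]] += 1                 # count occurrences of each value
    return lm.argmax(axis=2).astype(np.uint8)              # B(p) = arg max_l LM(p, l)
```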
After the initialization of the current background is completed, the background is adaptively updated as each new frame arrives. The background is updated according to the target detection and tracking information, using the following three labels.

(a) Background pixel label gs, which indicates the number of times a pixel has been a background pixel in the previous N frames:

gs(p) is increased by 1 each time the pixel p is classified as a background pixel, and is otherwise unchanged;

(b) Recognition target label ms, which indicates the number of times a pixel has been classified as a moving pixel:

ms(p) is increased by 1 each time the pixel p is classified as a moving pixel, and is otherwise unchanged;

(c) Change history label hs, which represents the number of frames that have elapsed since pixel x was last marked as a foreground pixel:

hs(p) is reset to 0 when the pixel p is marked as a foreground pixel, and is increased by 1 in every other frame.

Let I_M^t(p) denote the pixels of the recognition target, I_B^t(p) the pixels of the background, I_cBK(p) the background pixel currently in use, and I_BK(p) the new background pixel. The judgment criterion is as follows:

if gs(p) > k × N, then I_BK(p) = I_B^t(p)

if gs(p) < k × N and ms(p) < r × N, then I_BK(p) = I_M^t(p)

otherwise, I_BK(p) = I_cBK(p)
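The criterion above can be sketched as follows, under the assumption that the per-pixel counters gs and ms are accumulated over a sliding window of N frames by the caller and that k and r are fractions in (0, 1); the hs label is omitted, and both I_B^t and I_M^t are taken from the current frame. All names and default values are illustrative.

```python
import numpy as np

def update_background(bg, frame, gs, ms, k=0.8, r=0.8, n_frames=50):
    """bg: background currently in use (I_cBK); frame: current grayscale frame;
    gs, ms: per-pixel counts of background / moving classifications over the
    last n_frames frames (maintained by the caller).  Returns the new I_BK."""
    new_bg = bg.copy()                                  # default: keep current background
    steady_bg = gs > k * n_frames                       # mostly background pixels
    mixed = (gs < k * n_frames) & (ms < r * n_frames)   # neither steadily bg nor fg
    new_bg[steady_bg] = frame[steady_bg]                # I_BK = I_B^t
    new_bg[mixed] = frame[mixed]                        # I_BK = I_M^t
    return new_bg
```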
The identification target area is extracted on the real target mask image B. A searching two-dimensional vector DB, a connected-domain two-dimensional vector DF and a marking two-dimensional vector flag_{W×H}, all of the same size as the real image B, are created, with DB and DF initialized to 0 and the connected-domain flag value L initialized to 1. Each row and column of B is scanned, each scanned pixel point is marked with DB = 1, and when the first seed point p1 with B = 1 and DB = 0 is found, flag_{W×H}(p) = L is set (L = 1, 2, …, the connected-domain flag value). An eight-neighborhood search is carried out from this point, and every point with B = 1 and DB = 0 is marked until the whole area has been marked. The points meeting the requirement are marked in the connected-domain two-dimensional vector DF by setting DF = 1, the flag value of every point in the connected area is set to L, and finally L is set to L + 1.
After the marking of the first area is completed in the previous step, scanning continues over the remaining points of the image to search for the next point with B = 1 and DB = 0, while checking whether the last point has been reached; if not, scanning of each row and column of B continues.
After the connected-domain marking is completed, the position and area information is obtained at the same time, to facilitate subsequent feature extraction and motion-area calculation.
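An illustrative sketch of this seed-fill connected-domain marking with eight-neighborhood search is shown below; it is a simplification under assumed data structures (a breadth-first queue instead of repeated row/column scanning), not the patent's exact procedure.

```python
import numpy as np
from collections import deque

def label_connected_domains(B: np.ndarray):
    """B: binary target mask.  Returns (flag, regions): flag holds a connected-
    domain label per pixel; regions maps label -> (pixel count, bounding box),
    for later feature extraction and motion-area processing."""
    h, w = B.shape
    DB = np.zeros_like(B, dtype=bool)          # searched marker
    flag = np.zeros_like(B, dtype=np.int32)    # per-pixel connected-domain label
    regions, L = {}, 1
    for y in range(h):
        for x in range(w):
            if B[y, x] == 1 and not DB[y, x]:  # seed point
                q, pixels = deque([(y, x)]), []
                DB[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    flag[cy, cx] = L
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):      # eight-neighborhood search
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if 0 <= ny < h and 0 <= nx < w and B[ny, nx] == 1 and not DB[ny, nx]:
                                DB[ny, nx] = True
                                q.append((ny, nx))
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                regions[L] = (len(pixels), (min(xs), min(ys), max(xs), max(ys)))
                L += 1                          # next connected-domain flag value
    return flag, regions
```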
For the object recognition of a complex scene, preferably, a preprocessing step of a video frame is further included before the recognition, which mainly includes the detection of a target edge, specifically as follows:
inputting a video frame subjected to gray processing, presetting an integral attenuation parameter and an attenuation coefficient, presetting a short-time FFT filter group of a plurality of direction parameters uniformly distributed along the circumference, and performing short-time FFT filtering on each pixel point in the video frame according to each direction parameter to obtain a short-time FFT energy value of each pixel point in each direction; selecting the maximum value in the short-time FFT energy values of all directions of each pixel point;
for each pixel point, carrying out segmentation processing on the maximum value in the short-time FFT energy values of each direction of each pixel point;
constructing a group of temporary windows by using a Gaussian difference template, wherein each temporary window has different deviation angles relative to a video picture window; for each pixel point, integrating and regularizing the temporary window response and a Gaussian difference template to obtain a group of regularized weight functions;
for each pixel point, under different deflection angles, multiplying the regularized weight function by the maximum value in the segmented short-time FFT energy values in each direction in the Gaussian difference template, and then summing to obtain the short-time FFT energy maximum value approximation result of each pixel point under each deflection angle; solving a standard deviation of a short-time FFT energy maximum value approximation result of each pixel point at each deflection angle;
for each pixel point, calculating by combining the standard deviation of the short-time FFT energy maximum value approximation result under each deflection angle and the integral attenuation parameter to obtain a standard deviation weight; multiplying the standard deviation weight value with the minimum value of the short-time FFT energy maximum value approximation result under each deflection angle to obtain the final result of the short-time FFT energy maximum value of the pixel point;
and for each pixel point, the edge identification value is calculated from the maximum value of the short-time FFT energy values in each direction and the final result of the short-time FFT energy maximum value, in combination with the attenuation coefficient; non-maximum suppression and binarization are then carried out on the edge identification values of all pixel points of the video frame to obtain the edge identification image of the video frame.
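A heavily simplified sketch of this preprocessing is given below: it only filters the frame with an oriented filter bank uniformly distributed over the circle, keeps the maximum energy over directions, and binarizes the result. cv2.getGaborKernel is used here as an assumed stand-in for the "short-time FFT" filter bank, and the Gaussian-difference temporary windows, deflection-angle weighting and non-maximum suppression described above are omitted; kernel sizes and thresholds are illustrative.

```python
import cv2
import numpy as np

def edge_response(gray: np.ndarray, n_dirs: int = 12, sigma: float = 2.0,
                  lam: float = 6.0, gamma: float = 0.5) -> np.ndarray:
    """Maximum oriented filter energy per pixel, binarized into a rough edge map."""
    energies = []
    for i in range(n_dirs):
        theta = i * np.pi / n_dirs                           # direction parameter
        kern = cv2.getGaborKernel((31, 31), sigma, theta, lam, gamma)
        e = cv2.filter2D(gray.astype(np.float32), -1, kern)
        energies.append(np.abs(e))                           # per-direction energy
    E = np.max(np.stack(energies), axis=0)                   # max over all directions
    E = E / (E.max() + 1e-9)
    return (E > 0.3).astype(np.uint8) * 255                  # illustrative binarization
```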
The calculation of the maximum value in the short-time FFT energy values of each direction specifically includes:
defining a two-dimensional short-time FFT function expression:
f(x, y) = exp(−(x′² + γ²·y′²) / (2σ²)) · cos(2π·x′/λ + φ)

where x′ = x·cos θ + y·sin θ and y′ = −x·sin θ + y·cos θ;

γ is a constant representing the ratio of the long axis to the short axis of the elliptical field, λ is the wavelength, σ is the standard deviation of the short-time FFT function and the bandwidth of the Gaussian difference template window, 1/λ is the spatial frequency of the cosine function, σ/λ is the bandwidth of the spatial frequency, φ is a phase angle parameter, and θ is the angle parameter of the short-time FFT filtering;

e(x, y) = I(x, y) * f(x, y) is calculated, where I(x, y) is the video frame and * is the convolution operator;

E(x, y; σ) = max{ e_i(x, y) | i ∈ [1, N_θ] }

where E(x, y; σ) is the maximum of the short-time FFT filtering energy values over all angles at the pixel point (x, y), and N_θ is the number of angles θ.

The maximum value in the segmented short-time FFT energy values in each direction is calculated as follows:

E(x, y; σ) is segmented using an upper limit proportion and a lower limit proportion: the E(x, y; σ) values of all pixel points are ordered from small to large, the portion corresponding to the upper limit proportion is selected and its maximum value is set as Q_H; likewise, the portion corresponding to the lower limit proportion is selected and its maximum value is set as Q_L. The maximum value in the segmented short-time FFT energy values of each direction is then:

[formula image]

The expression of the Gaussian difference template is as follows:

[formula image]

wherein k is a parameter for controlling the size of the Gaussian difference template;

The expression of the temporary window response is as follows:

[formula image]

wherein d represents the distance from the center of the video picture to the temporary window;

The integration and regularization process for each pixel point yields the regularized weight function, namely:

[formula image]

The calculation process of the short-time FFT energy maximum value approximation result at each deflection angle of each pixel point is as follows:

[formula image]

wherein −3kσ < x′ < 3kσ and −3kσ < y′ < 3kσ represent the range of the Gaussian difference template;

The average Ave(x, y) and the standard deviation STD(x, y) of the short-time FFT energy maximum value approximation results of each pixel point over the deflection angles are calculated as:

Ave(x, y) = (1/N_d) · Σ_j E_j(x, y)

STD(x, y) = sqrt( (1/N_d) · Σ_j ( E_j(x, y) − Ave(x, y) )² )

where E_j(x, y) denotes the short-time FFT energy maximum value approximation result of the pixel point at the j-th deflection angle and N_d is the number of deflection angles.
when the collected video frame information is analyzed based on the content, the method adopts the deep neural network to extract the crowd characteristics in the scene in real time, associates the crowd characteristics with the corresponding time information labels, and calculates the projection vector according to the position and the angle of the shooting equipment calibrated in advance so as to realize the conversion from a plurality of pixel coordinates to a uniform three-dimensional coordinate and associate the pixel coordinates with the three-dimensional coordinate labels. The method comprises two training steps: firstly, training a human body detector, then carrying out network compression to reduce the number of layers and channels and weight aggregation, and retraining according to the previous detection result to obtain a detector suitable for the current visual angle; specific feature detection is added on the basis of a crowd detection algorithm, and local features are described to serve as supplementary features of the overall features. Then, for each photographing device, a lightweight DNN based on the perspective is trained. And calibrating corresponding time information according to each target detection result, and calculating a projection vector by means of the position and the angle of the shooting equipment calibrated in advance, so that mapping from pixel coordinates to a three-dimensional position is realized, and the mapping is related to a three-dimensional coordinate label. Then, mapping of the target from a pixel space to a three-dimensional space is realized through the three-dimensional position and the projection vector of the photographing device, and conversion from a plurality of pixel coordinates to unified three-dimensional coordinates is realized.
And according to the crowd characteristics, carrying out single-lens tracking on the corresponding human body target to generate a human body target tracking path, and converting the human body target tracking path into a coordinate path of a three-dimensional space through coordinate mapping.
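The mapping from a per-camera pixel detection to the unified three-dimensional coordinate system can be sketched as follows, under the assumption that each photographing device's pre-calibrated position and angle have been reduced to a homography from its image plane to a shared ground plane; the calibration format and helper names are illustrative, not the patent's own interfaces.

```python
import numpy as np

def pixel_to_world(pixel_xy, h_ground: np.ndarray):
    """pixel_xy: (u, v) detection point in one camera (e.g. the feet of a
    tracked person); h_ground: 3x3 homography from this camera's image plane
    to the shared ground plane, derived from the pre-calibrated pose."""
    u, v = pixel_xy
    p = h_ground @ np.array([u, v, 1.0])
    return p[:2] / p[2]                       # (X, Y) in the unified coordinate frame

def project_track(track_pixels, h_ground, timestamps):
    """Convert a single-lens tracking path into a time-stamped coordinate path
    in the unified space, ready for multi-camera aggregation."""
    return [(t, *pixel_to_world(p, h_ground)) for t, p in zip(timestamps, track_pixels)]
```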
The identity authentication cloud receives a human body target tracking path returned by the settlement client, and aggregates the human body target tracking path to obtain an aggregated path, wherein the aggregated path specifically comprises the following steps:
(1) processing target path discontinuity caused by shielding and illumination problems, and realizing continuous path depiction through feature comparison;
(2) according to the motion direction information of the target projection, surrounding photographing equipment coverage is searched in the three-dimensional space, weight values are given to the photographing equipment according to the maximum possibility, and target aggregation is carried out based on the weight values.
And the identity authentication cloud respectively samples the human body target tracking path under each single lens according to the aggregation path obtained in the last step to serve as a characteristic basic library of the human body target, and corresponds the multi-lens aggregated target to the same library ID.
Wherein, sample the human target tracking path under every single-lens, include: the sequence is sampled through the target path. And sets a multi-shot object unified library ID management method.
The identity authentication cloud receives the crowd image to be retrieved, the features of the crowd image are extracted through DNN to serve as retrieval features, the retrieval features are compared with the stored feature base libraries, the successfully compared human body target paths are searched, the human body target paths are ranked according to the matching degree, and the retrieval result is returned.
Preferably, searching the successfully compared human body target paths, and sorting according to the matching degree comprises: according to the input crowd image to be retrieved, a two-stage retrieval mechanism is adopted, the target position with the highest matching degree is obtained firstly, and then retrieval is performed on the basis of the periphery of the target preferentially.
In the process of constructing the DNN, the whole DNN network is divided into a convolutional layer, a positioning layer and a matching layer, analysed concretely as follows:
The convolution layers adopt a five-layer structure, with ReLU activation functions between layers and a max pooling layer added after each of the first two convolution layers. A series of image feature maps can be extracted through the convolution layers, and the pooling layer following the last layer is modified so that the resulting feature maps have a uniform size: if the required final feature size is {W_0, H_0} and the size of the current feature map is {w, h}, a sliding window of size {W_0/w, H_0/h} is used to perform the max pooling.
The positioning layer applies a sliding window to each feature map obtained above, and a low-dimensional feature can be extracted for each sliding window. The invention performs multi-scale sampling on the feature map to extract features of objects of different scales: K possible candidate sliding windows are extracted at the center point of each sliding window, so at most W × H × K candidate sliding windows are extracted from a feature map of size W × H. The K possibilities comprise a area scales and b aspect ratios, i.e. K = a × b. The extracted low-dimensional features are then input into a sliding window regression layer and a sliding window scoring layer respectively, to obtain the position corrections of the K candidate sliding windows extracted at the center point of the sliding window and the scores of whether the candidate sliding windows belong to a foreground target; this can be realized by two parallel 1×1 fully-connected convolution layers. The sliding window regression layer further corrects the position of each candidate sliding window and outputs the corrected upper-left corner and the length and width correction values; different regressors are constructed for the K different candidate sliding windows, i.e. the K regressors do not share weights, so that candidate areas of different sizes can be predicted for each 3×3 sliding window. The sliding window scoring layer judges whether each candidate sliding window belongs to the target detection area and outputs the scores of the candidate sliding window belonging to the foreground and to the background respectively. Finally, non-maximum suppression is carried out on all the extracted candidate sliding windows, regions with a high degree of repetition are removed, and the N candidate sliding windows with the highest scores are finally taken as candidate region proposals to enter the final target classification.
The matching layer carries out classification judgment on the candidate regions obtained by the positioning layer and further obtains positioning position correction, and firstly, the characteristics of the candidate regions need to be extracted. The feature map of the candidate region can be extracted by calculating the position of the candidate region in the feature map, so that the network only needs to calculate the feature map of the whole face frame once, and the positioning layer and the matching layer can share the feature map extracted by the convolutional layer. And respectively inputting the feature graph to a clustering layer and a position adjusting layer after passing through two full-connection layers, and respectively obtaining the category score and the position correction of the candidate region.
After the whole DNN network framework is constructed, a regression attenuation function of a positioning layer and a classification attenuation function of a matching layer are defined, so that an overall target function of the whole network is obtained, and global end-to-end training of the whole network is realized; when supervised training is carried out, the training set needs to be labeled, and the labeling content comprises the category and the position of the object. And for K candidate sliding windows extracted from each 3-by-3 sliding window, defining that the intersection degree of the candidate sliding windows with the actually marked sliding window is more than 0.8 as a positive sample, defining that the intersection degree is less than 0.3 as a negative sample, and discarding the rest.
The definition of the degree of intersection is:
Cm = area(ML ∩ CD) / area(ML ∪ CD)

wherein ML is the labeled window and CD is the candidate sliding window. Cm is the ratio of the area of their overlapping part to the total area they jointly occupy; Cm is 1 when the candidate sliding window and the label completely overlap, and Cm is 0 when they do not overlap.
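The intersection degree is a standard intersection-over-union ratio; a direct sketch for axis-aligned windows given as (x1, y1, x2, y2) tuples follows.

```python
def intersection_degree(ml, cd):
    """Cm = area(ML ∩ CD) / area(ML ∪ CD); 1 for identical windows, 0 when disjoint."""
    ix1, iy1 = max(ml[0], cd[0]), max(ml[1], cd[1])
    ix2, iy2 = min(ml[2], cd[2]), min(ml[3], cd[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(ml) + area(cd) - inter
    return inter / union if union > 0 else 0.0
```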
The classification attenuation function is defined as:

L_p(p_i, p_i*) = −log[ p_i*·p_i + (1 − p_i*)·(1 − p_i) ]

wherein p_i represents the score of the i-th candidate sliding window being predicted as the target, i.e. the probability that it belongs to the target, and p_i* represents the training label, which is 1 when the candidate sliding window is a positive sample and 0 when it is a negative sample.
The regression attenuation function of the sliding window regression network is defined as:

L_r(t_i, t_i*) = p_i*·R(t_i − t_i*)

wherein t_i = {t_x, t_y, t_w, t_h} denotes the regressed position coordinate information of the i-th candidate sliding window, and t_i* = {t_x*, t_y*, t_w*, t_h*} denotes the position coordinate information of the positive sample window.

In training, the term p_i* is introduced into the attenuation function to ensure that the regression attenuation function is computed only when the sliding window is a positive sample.
The function R takes the following form:

[formula image]

Given the classification attenuation function and the regression attenuation function, the attenuation function of the positioning layer can be defined as:

[formula image]

where p ∈ {p_i}, t ∈ {t_i}, and the parameter λ is the weighting parameter between the two sub-attenuation functions.
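A sketch of these attenuation (loss) terms is given below. The classification term follows the formula above; the robust function R is left to a formula image in the original, so a smooth-L1-style function is assumed here, and the weighted combination simply sums the two terms with weight λ without any normalization — all of these are assumptions of the sketch.

```python
import numpy as np

def classification_loss(p, p_star):
    """L_p(p_i, p_i*) = -log[ p_i* * p_i + (1 - p_i*) * (1 - p_i) ]."""
    return -np.log(p_star * p + (1.0 - p_star) * (1.0 - p))

def regression_loss(t, t_star, p_star):
    """L_r(t_i, t_i*) = p_i* * R(t_i - t_i*), with R assumed smooth-L1-like."""
    d = np.abs(np.asarray(t, dtype=float) - np.asarray(t_star, dtype=float))
    r = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return p_star * r.sum()

def localization_loss(ps, p_stars, ts, t_stars, lam=1.0):
    """Weighted sum of the two sub-losses over all candidate sliding windows."""
    cls = sum(classification_loss(p, ps_) for p, ps_ in zip(ps, p_stars))
    reg = sum(regression_loss(t, ts_, ps_) for t, ts_, ps_ in zip(ts, t_stars, p_stars))
    return cls + lam * reg
```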
The matching layer also comprises a candidate region classification part and a region regression part. If the network needs to construct a classifier distinguishing M classes, then after each candidate region passes through the matching layer, a score for each of the M classes and a score for the background are obtained, so the classifier produces M+1 score values in total, the sum of the score values is 1, and each score value represents the probability that the candidate region belongs to the corresponding class, where c = {c_0, c_1, …, c_M}.
And training the network by adopting a training set of the calibrated facial feature categories and the position information, thereby obtaining a network model for positioning and identifying the facial features. In training, if the candidate sliding windows are from the same face frame, the computation of the previous convolutional layer may be shared. Because the network mainly comprises three parts of networks, a layer-by-layer progressive training mode is adopted, and the method specifically comprises the following steps:
1) the convolutional layer is trained first. Migration initialization is performed for convolutional layers. 2) And adding a positioning layer on the basis of the trained convolutional layer for training, fixing parameters of the convolutional layer, initializing the parameters of the positioning layer by adopting a random initialization mode, and adjusting the parameters of the positioning layer according to the defined attenuation function of the positioning layer. 3) Then, a matching layer is added, the convolution layer and the positioning layer parameters are fixed, the parameters of the matching layer are initialized in a random initialization mode, and the parameters of the matching layer are learned and adjusted according to the defined attenuation function of the matching layer. 4) And finally, carrying out end-to-end fine adjustment on the whole network according to the defined global network attenuation function to obtain a final training result.
After learning and training the network by the calibrated training set of facial feature categories and position information, a result of a network model can be obtained, and the model comprises numerical values of weights of each layer in the DNN. When the method is applied to actual application, the collected facial feature images are input to a network for forward transmission, and the output of the network is the N candidate regions with corrected positions and the category scores thereof.
Subsequent processing is carried out on the N candidate regions to obtain the final accurate recognition result, as follows: 1) each candidate region is scored over the M+1 categories and the highest-scoring category is selected as the category of the candidate region; 2) candidate regions of the same category are de-overlapped: the repetition value Cm is calculated pairwise, and when Cm is greater than 0.7 only the candidate region with the higher score is kept; 3) since in facial feature recognition the facial features do not overlap one another, the remaining candidate regions are subjected to full-class de-duplication to obtain the final positioning and recognition result of the network.
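The post-processing just described can be sketched as follows; for brevity the per-class and full-class de-duplication passes are merged into a single overlap check, the background class is assumed to be index 0, and all other names and structures are illustrative.

```python
def _cm(a, b):
    # intersection degree (Cm) of two (x1, y1, x2, y2) windows
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(candidates, cm_threshold=0.7):
    """candidates: list of (box, scores) with M+1 class scores per region.
    Returns the final (box, class_id, score) detections."""
    labelled = []
    for box, scores in candidates:
        cls = max(range(len(scores)), key=lambda c: scores[c])  # highest-scoring class
        if cls != 0:                                            # index 0 assumed background
            labelled.append((box, cls, scores[cls]))
    labelled.sort(key=lambda d: d[2], reverse=True)             # prefer higher scores
    kept = []
    for box, cls, score in labelled:
        # facial features do not overlap, so drop any candidate that overlaps a kept one
        if all(_cm(box, k[0]) <= cm_threshold for k in kept):
            kept.append((box, cls, score))
    return kept
```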
In summary, the present invention provides a method for identifying a dynamic video, which does not require an additional IC device for a user, saves a lot of device costs, and improves the computation efficiency and the passenger traffic efficiency.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented in a general purpose computing system, centralized on a single computing system, or distributed across a network of computing systems, and optionally implemented in program code that is executable by the computing system, such that the program code is stored in a storage system and executed by the computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (4)

1. A method for identifying a motion video, comprising:
the passenger registers personal information in the identity authentication cloud in advance, submits the facial image information of the user and associates a personal account; after the registration is successful, the user obtains a unique ID, and the unique ID and the corresponding user information are stored in a database of the identity authentication cloud;
capturing the facial frame of the user according to the user identification information of the card swiping site, and identifying a user identifier according to the facial frame of the user; transmitting the user ID passing through the card swiping site and settlement data of the payment time and place of the user to a transaction cloud;
the transaction cloud calculates the charging required by the passengers entering and leaving the station according to the user ID, and sends the charging value and the ticket buying mode to the passenger terminal;
the capturing of the facial frame of the user further comprises the following target positioning process:
(1) establishing a video gray scale two-dimensional vector; (2) determining current frame and background pixel points by using symmetrical adjacent frame difference; (3) counting and updating the two-dimensional vector according to the determined background pixel points; (4) constructing the whole initial background;
the size of an input video frame is M x N, a two-dimensional vector LM is established, the value of each element LM (p, l) represents the total occurrence times of a pixel value l of a pixel at p in the video frame, and the value range of the pixel value l is as follows: l is more than 0 and less than 255; let video sequence be (I) 0 ,I 1 ,I 2 ,…,I T+1 ) I (p, t-1), I (p, t +1) represent the pixel value at the point p in the t-1, t, t +1 frame in the N +2 frame, then the forward and backward mask maps of the ith frame are:
Figure FDA0003720992350000011
Figure FDA0003720992350000012
wherein t is 1,2, …, N; th -1 (t),Th +1 (t) are threshold values for determining whether or not the pixel value at the point p has changed, respectively;
to D +1 (p, t) and D -1 (p, t) performing logic AND operation to obtain a mask map of the moving pixel point:
Figure FDA0003720992350000021
if OB (p, t) ═ 1 for arbitrary point p, at D +1 (p, t) and D -1 (p, t) the median values are all 1, and the current point p is a pixel point of the identified foreground; otherwise, the current point p is a background pixel point;
then, the two-dimensional vector LM is counted and updated: if OB (p, t) at point p is 0, adding 1 to the number of occurrences of pixel value l at p; otherwise, no processing is carried out;
repeating the steps 2 and 3 on the selected T +2 frame; counting a two-dimensional vector LM according to pixel values, and taking the pixel value with the most occurrence times as an initial background pixel value of each pixel point p, thus finishing the whole initial background B (p), namely the two-dimensional vector LM
B(p)=max(LM(p,l))
After the initialization of the current background is finished, automatically replacing the background in a self-adaptive mode along with the arrival of the next frame of image; updating the background according to the information of target detection and tracking, and utilizing the following three labels;
(a) background pixel label gs, which indicates the number of times a pixel has been a background pixel in the previous N frames:

gs(p) is increased by 1 each time the pixel p is classified as a background pixel, and is otherwise unchanged;

(b) recognition target label ms, which indicates the number of times a pixel has been classified as a moving pixel:

ms(p) is increased by 1 each time the pixel p is classified as a moving pixel, and is otherwise unchanged;

(c) change history label hs, which represents the number of frames that have elapsed since pixel x was last marked as a foreground pixel:

hs(p) is reset to 0 when the pixel p is marked as a foreground pixel, and is increased by 1 in every other frame;

let I_M^t(p) denote the pixels of the recognition target, I_B^t(p) the pixels of the background, I_cBK(p) the background pixel currently in use, and I_BK(p) the new background pixel; the judgment criterion is as follows:

if gs(p) > k × N, then I_BK(p) = I_B^t(p)

if gs(p) < k × N and ms(p) < r × N, then I_BK(p) = I_M^t(p)

otherwise, I_BK(p) = I_cBK(p)

the identification target area is extracted on the real target mask image B; a searching two-dimensional vector DB, a connected-domain two-dimensional vector DF and a marking two-dimensional vector flag_{W×H}, all of the same size as the real image B, are created, with DB and DF initialized to 0 and the connected-domain flag value L initialized to 1; each row and column of B is scanned, each scanned pixel point is marked with DB = 1, and when the first seed point p1 with B = 1 and DB = 0 is found, flag_{W×H}(p) = L is set, L = 1, 2, … being the connected-domain flag value; an eight-neighborhood search is carried out from the point p, and every point with B = 1 and DB = 0 is marked until the whole area has been marked; the points meeting the requirement are marked in the connected-domain two-dimensional vector DF by setting DF = 1; the flag value of every point in the connected area is set to L, and finally L is set to L + 1;

after the marking of the first area is completed in the previous step, scanning continues over the points of the image to search for the next point with B = 1 and DB = 0, while checking whether the last point has been reached; if not, scanning of each row and column of B continues;
completing the marking of the connected domain, and simultaneously acquiring position and area information so as to facilitate subsequent feature extraction and motion area calculation processing;
after video frame information is collected, extracting crowd characteristics in a scene in real time by adopting a deep neural network, associating the crowd characteristics with corresponding time information labels, and calculating projection vectors according to the positions and angles of shooting equipment calibrated in advance so as to realize the conversion of a plurality of pixel coordinates to uniform three-dimensional coordinates and associate the pixel coordinates with the three-dimensional coordinate labels;
the method comprises two training steps: firstly, training a human body detector, then carrying out network compression to reduce the number of layers and channels and weight aggregation, and retraining according to a detection result to obtain a detector suitable for a current visual angle; adding specific feature detection into a crowd detection algorithm, and depicting local features as complementary features of the overall features; then, for each photographing device, training a lightweight DNN based on the perspective; calibrating corresponding time information for each target detection result, calculating a projection vector by means of the position and the angle of the shooting equipment calibrated in advance, realizing the mapping from the pixel coordinate to the three-dimensional position, and associating with a three-dimensional coordinate label; mapping of a target from a pixel space to a three-dimensional space is realized through the three-dimensional position and the projection vector of the photographing device, and conversion from a plurality of pixel coordinates to a unified three-dimensional coordinate is realized;
according to the crowd characteristics, carrying out single-lens tracking on a corresponding human body target to generate a human body target tracking path, and converting the human body target tracking path into a coordinate path of a three-dimensional space through coordinate mapping;
the identity authentication cloud receives a human body target tracking path returned by the settlement client, and aggregates the human body target tracking path to obtain an aggregated path, wherein the aggregated path specifically comprises the following steps:
(1) processing target path discontinuity caused by shielding and illumination, and realizing continuous path depiction through feature comparison;
(2) according to the motion direction information of the target projection, searching surrounding photographing equipment coverage in a three-dimensional space, endowing weight values to the photographing equipment according to the maximum possibility, and performing target aggregation based on the weight values;
according to the aggregated path obtained in the previous step, the identity authentication cloud samples the human-body target tracking path under each single lens to serve as the feature base library of the human-body target, and the targets aggregated across multiple lenses correspond to the same library ID;
wherein sampling the human-body target tracking path under each single lens comprises:
performing sequence sampling along the target path, and establishing a unified library-ID management scheme for multi-camera targets;
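A minimal sketch of such sequence sampling under a unified library ID follows; the fixed sampling stride and the dictionary keyed by library ID are assumptions of this sketch.

```python
import numpy as np

def build_feature_base(aggregated_paths, stride=10):
    """Sample each single-lens tracking path of an aggregated target at a fixed
    stride and store the sampled features under one unified library ID.
    `aggregated_paths` maps library_id -> {camera_id: [feature vectors along the path]}."""
    base = {}
    for library_id, per_camera in aggregated_paths.items():
        samples = []
        for camera_id, features in per_camera.items():
            samples.extend(features[::stride])   # take every `stride`-th feature on the path
        base[library_id] = np.stack(samples) if samples else np.empty((0,))
    return base
```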
the identity authentication cloud receives a crowd image to be retrieved; the features of the crowd image are extracted by the DNN as retrieval features and compared with the stored feature base libraries; the human-body target paths that are successfully matched are retrieved, ranked according to matching degree, and returned as the retrieval result;
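The comparison against the stored feature base libraries and the ranking by matching degree might, for example, be realized with cosine similarity as in the sketch below; the similarity measure and the top-k cut-off are assumptions, since the claim only requires ranking by matching degree. The feature_base structure is the one produced by the previous sketch.

```python
import numpy as np

def retrieve(query_feature, feature_base, top_k=5):
    """Compare a DNN retrieval feature with the stored feature base libraries
    and rank the library IDs by their best cosine-similarity match."""
    q = query_feature / (np.linalg.norm(query_feature) + 1e-9)
    scores = []
    for library_id, features in feature_base.items():
        if features.size == 0:
            continue
        f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
        scores.append((library_id, float(np.max(f @ q))))   # best match within the library
    scores.sort(key=lambda item: item[1], reverse=True)     # rank by matching degree
    return scores[:top_k]
```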
for object identification in a complex scene, the method further comprises a preprocessing step applied to the video frame before identification, including detection of the target edges, specifically as follows:
a gray-scale video frame is input; an integral attenuation parameter and an attenuation coefficient are preset, together with a short-time FFT filter bank with a plurality of direction parameters uniformly distributed around the circle; short-time FFT filtering is applied to every pixel of the video frame for each direction parameter to obtain the short-time FFT energy value of each pixel in each direction, and the maximum of the short-time FFT energy values over the directions is selected for each pixel;
for each pixel, segmentation processing is applied to the maximum of the short-time FFT energy values over the directions;
a group of temporary windows is constructed using a Gaussian difference template, each temporary window having a different deflection angle relative to the video frame window; for each pixel, the temporary-window responses and the Gaussian difference template are integrated and normalized to obtain a group of normalized weight functions;
for each pixel and each deflection angle, the normalized weight function is multiplied by the segmented maximum of the short-time FFT energy values over the directions within the Gaussian difference template, and the products are summed to obtain the approximation of the pixel's short-time FFT energy maximum at that deflection angle; the standard deviation of these approximations over the deflection angles is then computed for each pixel;
for each pixel, a standard-deviation weight is obtained by combining this standard deviation with the integral attenuation parameter; the standard-deviation weight is multiplied by the minimum of the short-time FFT energy-maximum approximations over the deflection angles to obtain the final short-time FFT energy maximum of the pixel;
and for each pixel, the final short-time FFT energy maximum and the maximum of the short-time FFT energy values over the directions are combined with the attenuation coefficient to obtain the edge identification value of the pixel; non-maximum attenuation and binarization are applied to the edge identification values of all pixels of the video frame to obtain the edge identification image of the video frame.
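A greatly simplified sketch of this preprocessing is given below; Gabor filters stand in for the short-time FFT filter bank, and the Gaussian-difference temporary windows, deflection angles, standard-deviation weighting and non-maximum attenuation of the claim are deliberately omitted, so the sketch only illustrates the oriented filter bank, the per-pixel maximum energy over the directions, the attenuation coefficient and the binarization.

```python
import cv2
import numpy as np

def edge_map(gray, n_directions=8, attenuation=0.5, threshold=40.0):
    """Simplified oriented-filter edge map: per-pixel maximum energy over a bank
    of filters with directions uniformly distributed over the circle, scaled by
    an attenuation coefficient and binarized."""
    gray = gray.astype(np.float32)
    energies = []
    for k in range(n_directions):
        theta = k * np.pi / n_directions                      # direction parameter
        kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0.0)
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        energies.append(np.abs(response))                     # per-direction energy value
    max_energy = np.max(np.stack(energies), axis=0)           # maximum over the directions
    edge_value = attenuation * max_energy                     # combine with attenuation coefficient
    _, binary = cv2.threshold(edge_value, threshold, 255, cv2.THRESH_BINARY)
    return binary.astype(np.uint8)
```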
2. The method according to claim 1, wherein the user registers personal information by logging in to the identity authentication cloud through the passenger terminal.
3. The method according to claim 2, wherein the transaction cloud obtains the user ID and the user information through the identity authentication cloud, calculates the fare to be paid by the passenger according to a preset ticket-purchase mode, and sends the fare value and the ticket-purchase mode to the passenger terminal.
4. The method of claim 3, wherein the passenger terminal and the settlement client are each communicatively connected to the transaction cloud; the settlement client comprises a trigger unit, a face recognition unit, an access control system for controlling the passage of users, and a control unit; the trigger unit is used to detect that a user has arrived at the card-swiping site and to send user identification information to the control unit; the face recognition unit is used to act on the face-video-frame control instruction sent by the control unit, capture the face frame of the user, and transmit the face video frame to the control unit; the control unit is used to receive the user identification information sent by the trigger unit and to send a face-video-frame acquisition control instruction to the face recognition unit, so as to control the face recognition unit to capture the face frame of the user and recognize the user identity, and to transmit the user ID at the card-swiping site, the payment time and place of the user, and the settlement data related to the user's ticket purchase to the transaction cloud; the identity authentication cloud is used for the user to register a personal ID; the settlement data may include mileage-based charging calculated from the user's inbound time and outbound time.
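The mileage-based settlement data of claim 4 could, for example, be represented as below; the field names, the station distance table and the linear tariff are illustrative assumptions rather than part of the claims.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical distance table between stations, in kilometres.
STATION_KM = {("A", "B"): 3.2, ("A", "C"): 7.5, ("B", "C"): 4.3}

@dataclass
class SettlementRecord:
    """Settlement data sent by the control unit to the transaction cloud."""
    user_id: str
    inbound_station: str
    inbound_time: datetime
    outbound_station: str
    outbound_time: datetime

def mileage_charge(rec: SettlementRecord, base_fare: float = 2.0,
                   rate_per_km: float = 0.4) -> float:
    """Mileage-based charging derived from the inbound/outbound records; the
    linear tariff and the station distance table are illustrative assumptions."""
    key = tuple(sorted((rec.inbound_station, rec.outbound_station)))
    km = STATION_KM.get(key, 0.0)
    return round(base_fare + rate_per_km * km, 2)
```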
CN201810253312.1A 2018-03-26 2018-03-26 Dynamic video identification method Active CN108416632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810253312.1A CN108416632B (en) 2018-03-26 2018-03-26 Dynamic video identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810253312.1A CN108416632B (en) 2018-03-26 2018-03-26 Dynamic video identification method

Publications (2)

Publication Number Publication Date
CN108416632A (en) 2018-08-17
CN108416632B (en) 2022-09-13

Family

ID=63132490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810253312.1A Active CN108416632B (en) 2018-03-26 2018-03-26 Dynamic video identification method

Country Status (1)

Country Link
CN (1) CN108416632B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784316B (en) * 2019-02-25 2024-02-02 平安科技(深圳)有限公司 Method, device and storage medium for tracing subway gate ticket evasion
CN110580741B (en) * 2019-09-05 2023-04-07 深圳市赛为智能股份有限公司 Infant milk vomiting behavior detection method and device, computer equipment and storage medium
CN112434678B (en) * 2021-01-27 2021-06-04 成都无糖信息技术有限公司 Face measurement feature space searching system and method based on artificial neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101266646A (en) * 2008-04-25 2008-09-17 重庆大学 Ear image recognition method amalgamating wavelet analysis and matrix feature
CN106296176A (en) * 2016-08-17 2017-01-04 成都仁通融合信息技术有限公司 A kind of system and method obtaining each site information of subway line
CN106780548A (en) * 2016-11-16 2017-05-31 南宁市浩发科技有限公司 moving vehicle detection method based on traffic video

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5483899B2 (en) * 2009-02-19 2014-05-07 株式会社ソニー・コンピュータエンタテインメント Information processing apparatus and information processing method
CN201993816U (en) * 2011-01-30 2011-09-28 南京东大智能化系统有限公司 Automatic railway ticket selling and checking system based on two-dimensional bar code multimedia message
CN106780792A (en) * 2015-11-18 2017-05-31 航天信息股份有限公司 A kind of subway quick payment method and system
CN105787722A (en) * 2016-02-29 2016-07-20 陕西科技大学 System and method for realizing subway charging by mileage based on advanced online payment
CN107704824B (en) * 2017-09-30 2020-05-29 北京正安维视科技股份有限公司 Pedestrian re-identification method and equipment based on space constraint
CN107798307A (en) * 2017-10-31 2018-03-13 努比亚技术有限公司 A kind of public transport expense quick payment method, apparatus and computer-readable recording medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101266646A (en) * 2008-04-25 2008-09-17 重庆大学 Ear image recognition method amalgamating wavelet analysis and matrix feature
CN106296176A (en) * 2016-08-17 2017-01-04 成都仁通融合信息技术有限公司 A kind of system and method obtaining each site information of subway line
CN106780548A (en) * 2016-11-16 2017-05-31 南宁市浩发科技有限公司 moving vehicle detection method based on traffic video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Face recognition based on Thiessen polygon feature decomposition; Luo Hao et al.; 《电视技术》; 2018-03-05 (No. 03); pp. 120-125 *
Research on spectral edge detection technology using wavelet transform; Liu Ning et al.; 《电路与系统学报》; 2012-08-15; Vol. 17 (No. 4); pp. 133-136 *
Case analysis of edge detection methods for remote sensing images based on phase information; Zhang Xuexia; 《中国锰业》; 2018-02-28 (No. 01); pp. 51-54 *

Also Published As

Publication number Publication date
CN108416632A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
US11216690B2 (en) System and method for performing image processing based on a damage assessment image judgement model
US11393256B2 (en) Method and device for liveness detection, and storage medium
CN111476827B (en) Target tracking method, system, electronic device and storage medium
CN108416632B (en) Dynamic video identification method
CN110569878A (en) Photograph background similarity clustering method based on convolutional neural network and computer
US20090010499A1 (en) Advertising impact measuring system and method
Farley et al. Real time IP camera parking occupancy detection using deep learning
KR102261880B1 (en) Method, appratus and system for providing deep learning based facial recognition service
CN110348331A (en) Face identification method and electronic equipment
CN111241932A (en) Automobile exhibition room passenger flow detection and analysis system, method and storage medium
CN109657580B (en) Urban rail transit gate traffic control method
CN109902681B (en) User group relation determining method, device, equipment and storage medium
JP6789876B2 (en) Devices, programs and methods for tracking objects using pixel change processed images
CN108470392B (en) Video data processing method
CN108416880B (en) Video-based identification method
CN113570530A (en) Image fusion method and device, computer readable storage medium and electronic equipment
CN110751226A (en) Crowd counting model training method and device and storage medium
CN112633222B (en) Gait recognition method, device, equipment and medium based on countermeasure network
CN111310751A (en) License plate recognition method and device, electronic equipment and storage medium
Voronov et al. Software Complex of Biometric Identification Based on Neural Network Face Recognition
CN115767424A (en) Video positioning method based on RSS and CSI fusion
CN114927236A (en) Detection method and system for multiple target images
CN114627493A (en) Gait feature-based identity recognition method and system
CN110717544A (en) Pedestrian attribute analysis method and system under vertical fisheye lens
Luo et al. Object detection based on binocular vision with convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220825

Address after: No. 64, Hegeng Team, Shibuqiao Village, Qixia District, Nanjing City, Jiangsu Province, 210000

Applicant after: Shi Yongbing

Address before: No.11, 10th floor, building 1, NO.666, Jitai Road, high tech Zone, Chengdu, Sichuan 610000

Applicant before: CHENGDU CINDA OUTWIT TECHNOLOGY CO.,LTD.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221101

Address after: Room 205-2, Floor 2, Chuangzhi Building, No. 1, Xichun Road, Yuhuatai District, Nanjing, Jiangsu Province, 210000

Patentee after: Jiangsu Wenwen Network Technology Co.,Ltd.

Address before: No. 64, Hegeng Team, Shibuqiao Village, Qixia District, Nanjing City, Jiangsu Province, 210000

Patentee before: Shi Yongbing