CN101419670A

CN101419670A - Video monitoring method and system based on advanced audio/video encoding standard

Info

Publication number: CN101419670A
Application number: CNA2008102032020A
Authority: CN
Inventors: 王新; 路红; 宋元征; 陈桂财
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2008-11-21
Filing date: 2008-11-21
Publication date: 2009-04-29
Anticipated expiration: 2028-11-21
Also published as: CN101419670B

Abstract

The invention belongs to the video monitoring technical field and in particular relates to a video monitoring method based on AVS (advanced video-audio coding standard) and an implementation system thereof. Complying with the development trend of the video monitoring, the invention introduces automatic processing and AVS standard into video monitoring and is combined with techniques of background/non-background classification, face detection and recognition, etc. The monitoring video is automatically processed in advance by the computer system. Under the premise of ensuring the effectiveness of returned contents, information quantity feedback to operators is far smaller than that of a traditional monitoring system. Therefore, human resources are saved greatly, and the reliability of the video monitoring system is improved simultaneously. The invention is the first to use the advantages of AVS in video monitoring technology and patent application. With the great support from national and local governments for the application and the expansion of AVS, The video monitoring method based on AVS and the implementation system thereof have certain application value in the application fields of digital monitoring, access control, personal identification, etc.

Description

Video frequency monitoring method and system based on advanced audio/video encoding standard

Technical field

The invention belongs to technical field of video monitoring, be specially a kind of video frequency monitoring method and realization system thereof based on AVS (advanced audio/video encoding standard).

Background technology

Nowadays safety problem has been subjected to extensive concern, has emerged in large numbers increasing video monitoring system, as gate control system, attendance checking system and identification system or the like.Video monitoring system can allow managerial personnel observe front end in the pulpit and take precautions against in the zone all personnel's active situation and keep a record, for security system provides real-time image, acoustic information.But, traditional video monitoring system needs the great amount of manpower resource overhead, detection, identification and understanding to the monitor video content rely on manually fully, reduced the work efficiency of video monitoring system, security and accuracy also lack assurance, and also do not have special-purpose digital video monitor system video compression standard at present as the video compression standard of video monitoring system core technology, on Network Transmission and system's versatility, caused bigger problem.

Summary of the invention

The objective of the invention is to propose a kind of high efficiency, video frequency monitoring method and system that security is good.

The present invention complies with the video monitoring trend, robotization is handled and AVS standard introducing video monitoring, in conjunction with technology such as background/non-background class, the detection of people's face and identifications, in advance to the automatic processing of monitor video by computer system, under the prerequisite of the validity that guarantees returned content, the quantity of information that feeds back to operating personnel will be much smaller than traditional supervisory system, thereby has saved human resources greatly, has also improved the reliability of video monitoring system simultaneously.Initiative utilizes AVS at video monitoring technical elements and patent application advantage, and along with country and local government support application to AVS energetically, the present invention controls and application such as identification has certain application value digital supervision, gate inhibition.

The present invention at first gathers according to the AVS code stream by the AVS web camera, uses the compressed domain in the AVS code stream decoding process to carry out the classification of background and non-background.When classification results shows that current frame is not background, carry out people's face and detect.When detecting people's face, carry out recognition of face, be about to people's face data and carry out comparing with training data after the conversion.Before recognition result is fed to the user, calculate degree of confidence t earlier, t shows the credibility of current recognition result.(t_min is obtained by the empirical data statistics during less than threshold value t_min as degree of confidence t, t_min is high more, and then accuracy rate is high more, t_min is low more, and then recall ratio is high more, set a suitable t_min by balance according to system's actual conditions), we think that this people's face does not belong to the data in the Current Library, regard as the stranger, and this result is fed back to the user, new people's face adds in the storehouse with this after the user confirms.When degree of confidence during more than or equal to threshold value t_min, show that recognition result has higher confidence level, write down recognition result then and video is marked.Fig. 1 is the process flow diagram of this video monitoring system, has wherein embodied two characteristics of the present invention, and AVS uses and robotization is handled.

The system of specific implementation mainly forms training module, labeling module and retrieval module by three parts.

Training module comprises the training module of monitoring environment background and the training module of face database, implements respectively to import to people's face sample storehouse and background sample storehouse to the environmental background training with to the training of people's face, is output as each face characteristic and background characteristics.

Labeling module comprises that background detection module, people's face detection module, face recognition module and index structure set up part, and the monitor video of input is marked automatically.Be input as background characteristics, face characteristic and monitor video to be marked that training module obtains, be output as the search index of monitor video to be marked.

Retrieval module is to specifying monitor video to retrieve, comprising picture query, text query and query video.Be input as the index of specifying monitor video, picture, text or segment video that the user submits to obtain content that the user submits to corresponding picture material in monitor video.Figure 2 shows that the logical relation between main composition module, workflow and each module of system.As shown in the figure, the initial input of system is face database and background sample, through obtaining background model and face characteristic transformation matrix and face characteristic storehouse after the training.Then monitor video is marked, the process of mark at first is that background detects, and to not being that the image of background carries out people's face and detects, the people's face that wherein occurs is carried out eigentransformation and creates index under the index structure.The final user submits text to by user interface, picture or video, and system submits to the difference of content to handle respectively according to the user, and what finally feed back to the user is the position that relevant information occurs in monitor data.

Be the design of system's main modular below:

1) background training module: the background video sample to input calculates, and obtains background model.Adopt algorithm to be based on the hsv color space, calculate the span that each pixel belongs to background.

Input: background video sample.

Output: background model is used for the comparison of background.

2) people's face training module: the people's face in the face database is handled.Adopting algorithm is fisher-face.

Input: face database.

Output: by the transformation matrix that people's face data computation in the face database obtains, the purpose of this matrix is that the conversion of input people face is obtained one-dimensional vector, in order to identification.When obtaining transformation matrix, export the center of each one face, in order to identification.

3) background detection module: incoming frame image and background model are compared, and purpose is to know whether this incoming frame is background, if not background, those zones belong to the prospect scope.

Input: background model, two field picture.

Output: know whether this incoming frame is background, if not background, those zones belong to the prospect scope.

4) people's face detection module:, detect people's face therein for the two field picture of non-background.

Input: two field picture.

Output: detected facial image.

5) face recognition module: for detected facial image, the transformation matrix that uses training to obtain obtains a bit vector, adopts the similarity at Euclidean distance calculating and each center, to realize the purpose of identification.

Input: facial image, transformation matrix.

Output: recognition result.

6) index structure module: input video is marked, and the result according to recognition of face obtains video index, and index structure set up in index.

Input: monitor video.

Output: video index.

7) retrieval module: the user is by user interface input inquiry content, and retrieval module submits to the difference of content format to retrieve according to the user, and by the user interface feedback information.

Input: the inquiry that the user submits to.

Output: the information such as video clips that feed back to the user.

The present invention has special pre-service at the AVS video flowing, no matter be at the gate inhibition's monitoring in real time or the video of processed offline storage, the AVS code stream is not decoded completely, and the compressed domain that is to use AVS is carried out background/non-background class, judge whether present image is background, if just do not carry out follow-up work, improve the treatment effeciency of system with this for background.In using in real time, can also add and use hardware handles to quicken this process.

In the middle of the compression domain of AVS, the motion vector of macro block can reflecting video in the middle of the motion of object.In the background segment, image is static relatively, can make when the people occurs and introduce more movable information in the video.Propose in the document [1] to use motion estimation technique H.264 to carry out the classification of background/non-background.The present invention is used for the AVS code stream with similar algorithms.If Be the motion vector of a macro block in the present image,

0≤i≤N-1.N is a macro block sum in the present image.Calculate exercise intensity in the present image with following formula:

Formula (1)

Wherein, size _iThe area of representing i macro block.

The simple motion state of using exercise intensity can not characterize object in the present image fully, therefore introduce the scope of moving in another parameter MS presentation video:

MS = Σ_{i = 0}^{N - 1} b_s_{i},

b_s_{i} = \{\begin{matrix} {size}_{i}, \overset{&RightArrow;}{m_{i}} &NotEqual; 0 \\ 0, else \end{matrix}

Formula (2)

In the background image sequence, there is not violent motion in the image, exercise intensity and range of movement all are limited in less numerical value.If the threshold value of MV is mv_min, the threshold value of ms is ms_min, mv_min and ms_min are obtained by the empirical data statistics, it is high more that the more little then background of mv_min and ms_min is differentiated accuracy rate, mv_min and ms_min are big more, and then recall ratio is high more, sets a suitable mv_min and ms_min by balance according to system's actual conditions.When satisfying following condition, judge that present image belongs to background:

MV＜mv_min and MS＜ms_min.

The meaning of carrying out background and non-background class not only is to have improved the efficient of system, also collects the statistical information of each control point on the other hand, thereby infers the environmental information of control point.For example by the distribution of the non-background frames of statistics in the middle of supervisory sequence, just can learn when section is in crowded state in this control point, thereby further suitable deployment is made in this control point, for example the intensive relatively time period improve the frame per second of recording, and reduce frame per second of recording or the like in the time period of stream of people's rareness the stream of people.

Detect through background, do not detect for the image of background carries out people's face judging.People's face detects and adopts AdaBoost algorithm [2].But in order to improve the treatment effeciency of system, we do not carry out global detection, but carry out local detection.

From people's face detected, detected facial image carried out according to from left to right, being scanned into sample vector from top to bottom after size unifies convergent-divergent, then sample vector is carried out dimensionality reduction.The Fisher-Face algorithm that we adopt classical PCA to combine with LDA carries out the extraction of people's face projection properties ^[3](PCA:Principal Components Analysis is in conjunction with pivot analysis; LDA:Linear Discriminant Analysis, linear discriminant analysis).Use LDA on the space after using the PCA dimensionality reduction, obtain the proper vector of the people's face that detects.Adopt the people's face in minimum distance classifier and the storehouse to compare and identification after the feature extraction.

If the sample vector after people's face f process Fi sher-Face feature extraction is f ', f '=(u0, u1 ... uk), calculate the distance of itself and training sample then:

d (f', f_{i}^{'}) = \sqrt{Σ_{i = 0}^{k} [{(u_{i} - v_{i})}^{2}]}

Formula (3)

Fi '=(v0, v1 wherein ... vk) i training sample in the library representation, k is the sample dimension.The distance of i training sample in d (f ', fi ') current sample to be identified of expression and the storehouse.

Calculated in f ' and the storehouse behind all samples, found out minimum preceding 5 samples of distance, fi1 ', fi2 ' ... fi5 '.Wherein most samples belong to class c, and class c appoints the sample class that refers to belong to same individual, the more the sort of c class that is of quantity.If 5 samples respectively belong to a class, then with the minimum sample fi1 ' of f ' distance under class as c.We calculate the degree of confidence t of identification with following formula:

t = \frac{Σd (f', f_{ij}^{'} | f_{ij}^{'} &Element; c)}{Σ_{j = 1}^{5} d (f', f_{ij}^{'})}

Formula (4)

As degree of confidence t during less than threshold value t_min, illustrate that people's face is the stranger, f is as a result fed back to the user, new people's face adds in the storehouse with this after the user confirms, otherwise the expression recognition result is reliable and write down the result.T_min is obtained by the empirical data statistics, and t_min is high more, and then accuracy rate is high more, and t_min is low more, and then recall ratio is high more, sets a suitable t_min by balance according to system's actual conditions.

According to foregoing, what summarize the present invention's proposition based on the video monitoring system of AVS and the step of its implementation is: 1, utilize the AVS video camera to obtain the AVS code stream; 2, the AVS code stream is carried out background class, the detection of people's face, background training, the training of people's face; 3, to the identification of comparing of people's face; 4, obtain Query Result.

Description of drawings

Fig. 1 is the core process flow diagram of this video monitoring system.

Fig. 2 is system's main modular and workflow.

Number in the figure: 1 training module; 2 labeling module; 3 retrieval modules; 4 face databases; 5 background sample storehouses; 6 background training modules; 7 people's face training modules; 8 background models; 9 face characteristic transformation matrixs; 10 background detection modules; 11 people's face detection modules; 12 face recognition module; 13 index structure modules; 14 monitor videos; 15 search indexs; 16 retrieval modules.

Embodiment

For example, the present invention is in the application of gate control system, and system can be divided into five parts: front-end camera, AVS video database, Video processing and comparison identification, face database, enter information inquiry.In gate control system, camera position is more fixing, and the angle and the figure viewed from behind of shooting are all fixed, and the variation of light neither be very violent in this indoor environment of office building.Because segmentation and remote storage are not supported in the driving that video camera carries, so will carry at camera on the basis of driving according to application requirements and write driver, in the shooting process, automatically realize the segmentation of video, and the AVS segmentation video storage that will take gained is in the data designated storehouse.Simultaneously, real-time order is handled the AVS code stream of segmentation.At first carry out background class,, then be not for further processing if the segment video is the figure viewed from behind.Detect through background, do not detect for the image of background carries out people's face judging.But in order to improve the treatment effeciency of system, we do not carry out global detection, but carry out local detection, and detection method has detailed elaboration in preamble, just do not repeat at this.Detect as degree of confidence t (the computing method preamble is stated) during by people's face less than threshold value t_min (preamble is stated), t_min can be made as 0.85 in the actual realization of system, give the user less than this value feedback sort like the information in " this person's face is not in the storehouse; be the stranger ", remind the user, can also be after the user confirms new people's face adds in the storehouse with this, the result can be existed in the face database.If greater than t_min, the expression recognition result reliably and in protoplast's face database has this person, inquires about and report this person's name automatically, writes down the time that it enters.This is the present invention's a kind of application in practice.

List of references:

[1] Hui H., Liu H., Wu Y., Liang Y.Video surveillance method based on is standard[J H.264] .Computer Applications, 2005,25 (11), 131-133.[Hui Marsha, Liu Han, Wu Yali, Liang Yanming. a kind of based on video encoding standard intelligent video monitoring technology [J] H.264. " computer utility ", 2005,25 (11), 131-133]

[2]Freund?Y.，Schapire?R.E.A?Decision-Theoretic?Generalization?of?Online?Learning?and?anApplication?to?Boosting.Journal?of?Computer?and?System?Sciences，1997，55(1)：119-139

[3]Belhumeur?P.，Hespanha?J.Eigenfaces?vs?Fisherfaces：recognition?using?class?specific?linearprojection[C]，1997，IEEE?Transactions?on?Pattern?Analysis?and?Machine?Intelligence，20(7)，711-720

Claims

1, a kind of video frequency monitoring method based on AVS is characterized in that concrete steps are as follows: at first gather according to the AVS code stream by the AVS web camera, use the compressed domain in the AVS code stream decoding process to carry out the classification of background and non-background.When classification results shows that current frame is not background, carry out people's face and detect.When detecting people's face, carry out recognition of face, be about to people's face data and carry out comparing with training data after the conversion.Before recognition result is fed to the user, calculate degree of confidence t earlier, t shows the credibility of current recognition result.As degree of confidence t during less than threshold value t_min, think that this people's face does not belong to the data in the Current Library, regard as the stranger, and this result is fed back to the user, new people's face adds in the storehouse with this after the user confirms.When degree of confidence during more than or equal to threshold value t_min, show that recognition result has higher confidence level, write down recognition result then and video is marked; Here AVS is meant advanced audio/video encoding standard.2, method according to claim 1, the method that it is characterized in that described background class is for establishing

Be the motion vector of a macro block in the present image,

0≤i≤N-1; N is a macro block sum in the present image.Calculate exercise intensity in the present image with following formula:

Formula (1)

Wherein, size _iThe area of representing i macro block.

The scope of moving in the parameter MS presentation video:

MS = Σ_{i = 0}^{N - 1} b_s_{i},

b_s_{i} = \{\begin{matrix} {size}_{i}, \overset{&RightArrow;}{m_{i}} &NotEqual; 0 \\ 0, else \end{matrix}

Formula (2)

When satisfying following condition, judge that present image belongs to background:

MV＜mv_min and MS＜ms_min.

3, method according to claim 1, the method that it is characterized in that described recognition of face is as follows: from people's face detected, detected facial image carried out after size unifies convergent-divergent, according to from left to right, be scanned into sample vector from top to bottom, then sample vector carried out dimensionality reduction; The Fisher-Face algorithm that we adopt classical PCA to combine with LDA carries out the extraction of people's face projection properties;

If the sample vector after people's face f process Fisher-Face feature extraction is f ', f '=(u0, u1 ... uk), calculate the distance of itself and training sample then:

d (f', f_{i}^{'}) = \sqrt{Σ_{i = 0}^{k} [{(u_{i} - v_{i})}^{2}]}

Formula (3)

Fi '=(v0, v1 wherein ... vk) i training sample in the library representation, k is the sample dimension; The distance of i training sample in d (f ', fi ') current sample to be identified of expression and the storehouse;

Calculated in f ' and the storehouse behind all samples, found out minimum preceding 5 samples of distance, fi1 ', fi2 ' ... fi5 '; Wherein most samples belong to class c, and class c appoints the sample class that refers to belong to same individual, the more the sort of c class that is of quantity; If 5 samples respectively belong to a class, then with the minimum sample fi1 ' of f ' distance under class as c; We calculate the degree of confidence t of identification with following formula:

t = \frac{Σd (f', f_{ij}^{'} | f_{ij}^{'} &Element; c)}{Σ_{j = 1}^{5} d (f', f_{ij}^{'})}

Formula (4)

As degree of confidence t during less than threshold value t_min, illustrate that people's face is the stranger, f is as a result fed back to the user, new people's face adds in the storehouse with this after the user confirms, otherwise the expression recognition result is reliable and write down the result.

4, a kind of video monitoring system based on AVS is characterized in that system is mainly by training module, labeling module and retrieval module:

Training module comprises the training module of monitoring environment background and the training module of face database, implements respectively to import to people's face sample storehouse and background sample storehouse to the environmental background training with to the training of people's face, is output as each face characteristic and background characteristics;

Labeling module comprises that background detection module, people's face detection module, face recognition module and index structure set up part, and the monitor video of input is marked automatically; Be input as background characteristics, face characteristic and monitor video to be marked that training module obtains, be output as the search index of monitor video to be marked;

Retrieval module is to specifying monitor video to retrieve, comprising picture query, text query and query video; Be input as the index of specifying monitor video, picture, text or segment video that the user submits to obtain content that the user submits to corresponding picture material in monitor video;

The design of system's main modular is as follows:

1) background training module: the background video sample to input calculates, and obtains background model; Adopt algorithm to be based on the hsv color space, calculate the span that each pixel belongs to background;

Input: background video sample;

Output: background model is used for the comparison of background;

2) people's face training module: the people's face in the face database is handled; Adopting algorithm is fisher-face;

Input: face database;

Output: by the transformation matrix that people's face data computation in the face database obtains, the purpose of this matrix is that the conversion of input people face is obtained one-dimensional vector, in order to identification; When obtaining transformation matrix, export the center of each one face, in order to identification;

3) background detection module: incoming frame image and background model are compared, and purpose is to know whether this incoming frame is background, if not background, those zones belong to the prospect scope;

Input: background model, two field picture;

Output: know whether this incoming frame is background, if not background, those zones belong to the prospect scope;

4) people's face detection module:, detect people's face therein for the two field picture of non-background;

Input: two field picture;

Output: detected facial image;

5) face recognition module: for detected facial image, the transformation matrix that uses training to obtain obtains a bit vector, adopts the similarity at Euclidean distance calculating and each center, to realize the purpose of identification;

Input: facial image, transformation matrix;

Output: recognition result;

6) index structure module: input video is marked, and the result according to recognition of face obtains video index, and index structure set up in index;

Input: monitor video;

Output: video index;

7) retrieval module: the user is by user interface input inquiry content, and retrieval module submits to the difference of content format to retrieve according to the user, and by the user interface feedback information;

Input: the inquiry that the user submits to;

Output: the video clips information that feeds back to the user.