CN107624189A - Method and apparatus for generating forecast model - Google Patents
- Publication number
- CN107624189A (application CN201580080145.XA)
- Authority
- CN
- China
- Prior art keywords
- image
- cnn
- training
- true value
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Abstract
Disclosed is a method for generating a prediction model to predict the crowd density distribution and person count in an image frame, comprising: training a CNN by inputting one or more crowd image patches from frames in a training set, each input crowd image patch having a predetermined ground-truth density distribution and person count; sampling frames from a target scene image set and receiving, from the training set, training images with determined ground-truth density distributions and counts/numbers; retrieving similar image data from the received training frames for each sampled target image frame, to overcome the scene gap between the target scene image set and the training images; and fine-tuning the CNN by inputting the similar image data into it, to determine the prediction model for predicting the crowd density map and person count in an image frame.
Description
Technical field
The present application relates to an apparatus and method for generating a prediction model to predict the crowd density distribution and person count in an image frame.
Background
Counting pedestrian crowds in videos is in strong demand in video surveillance and has therefore attracted much attention. Crowd counting is a challenging task owing to severe occlusion, scene perspective distortion, and diverse crowd distributions. Because pedestrian detection and tracking are difficult in crowded scenes, most state-of-the-art methods are regression-based, with the goal of learning a mapping between low-level features and crowd counts. However, these works are scene-specific, i.e., a crowd-counting model learned for a particular scene can be applied only to that same scene. Given an unseen scene, or a changed scene layout, the model must be retrained with new annotations.
Many works count pedestrians by detection or trajectory clustering. For the crowd-counting problem, however, these methods are limited by severe occlusion between people. A variety of methods instead attempt to predict a global count using a regressor trained on low-level features. These methods are better suited to crowded environments and are computationally more efficient.
Counting by global regression ignores the spatial information of the pedestrians. Lempitsky et al. describe an object-counting method based on pixel-level object density map regression. Following that work, Fiaschi et al. use random forests to regress object density and to improve training efficiency. Besides taking spatial information into account, another advantage of density-regression-based methods is that they can estimate the object count in any region of an image. Exploiting this advantage, interactive object-counting systems have been introduced that visualize region counts to help users determine relevant feedback efficiently. Rodriguez et al. use density map estimation to improve head-detection results. These methods, however, are designed for specific scenes and are unsuitable for cross-scene counting.
Many works have applied deep learning to various surveillance applications, such as person re-identification, pedestrian detection, tracking, crowd behavior analysis, and crowd segmentation. Their success benefits from the discriminative power of deep models. Sermanet et al. showed that features extracted from deep models are more effective than hand-crafted features in many respects. However, no deep model has yet been developed for crowd counting.
As many large-scale, well-labeled datasets have been published, non-parametric, data-driven approaches have been proposed. Such methods scale up easily because they require no training. They transfer labels from training images to a test image by retrieving the training images most similar to the test image and matching them against it. Liu et al. propose a non-parametric image-parsing method that seeks a dense deformation field between images.
Summary of the invention
The present disclosure addresses the problem of crowd density and count estimation, whose goal is to automatically estimate the density map and/or the number/count of persons in a given surveillance video frame.
The present application proposes a cross-scene density and count estimation system. Even if the target scene is absent from the training set, the system is still able to estimate the density map and person count of that scene.
In one aspect, an apparatus for generating a prediction model to predict a crowd density map and count is disclosed. It comprises a density map creation unit, a CNN generation unit, a similar-data retrieval unit, and a model fine-tuning unit. The density map creation unit is configured to approximate a perspective map for each training scene from the training set (which carries pedestrian head annotations indicating each person's head position within a region of interest (ROI)), and to create ground-truth density maps and counts based on the annotations and perspective maps of the training set. The density map represents the crowd distribution of each frame, and the integral of the density map equals the total number of pedestrians. The CNN generation unit is configured to construct and initialize a crowd convolutional neural network (CNN), and to train the CNN by inputting crowd image patches sampled from the training set together with the corresponding ground-truth density maps and counts. The similar-data retrieval unit is configured to: receive sample frames from the target scene, and receive from the training set samples carrying the ground-truth density maps and counts created by the CNN generation unit; and retrieve similar data from the training set for each target scene, to overcome the scene gap. The model fine-tuning unit is configured to receive the retrieved similar data and a constructed second CNN, the second CNN being initialized with the first, trained CNN; it is further configured to fine-tune the initialized second CNN with the similar data, so that the second CNN can predict the density map and the pedestrian count within the region of interest of a video frame to be analyzed.
In further aspects of the present application, a method for generating a prediction model to predict the crowd density distribution and person count in an image frame is disclosed, the method comprising:
training a CNN by inputting one or more crowd image patches from frames in a training set, each input crowd image patch having a predetermined ground-truth density distribution and person count;
sampling frames from a target scene image set and receiving, from the training set, training images with determined ground-truth density distributions and counts/numbers;
retrieving similar image data from the received training frames for each sampled target image frame, to overcome the scene gap between the target scene image set and the training images; and
fine-tuning the CNN by inputting the similar image data into it, to determine the prediction model for predicting the crowd density map and person count in an image frame.
In other aspects of the present application, an apparatus for generating a prediction model to predict the crowd density distribution and person count in an image frame is disclosed, the apparatus comprising:
a CNN training unit, which trains a CNN by inputting one or more crowd image patches from frames in a training set, each input crowd image patch having a predetermined ground-truth density distribution and person count;
a similar-data retrieval unit, which samples frames from a target scene image set, receives from the training set training images with determined ground-truth density distributions and counts/numbers, and retrieves similar image data from the received training frames for each sampled target image frame, to overcome the scene gap between the target scene image set and the training images; and
a model fine-tuning unit, which fine-tunes the CNN by inputting the similar image data into it, to determine the prediction model for predicting the crowd density map and person count in an image frame.
In other aspects of the present application, a system for generating a prediction model to predict the crowd density distribution and person count in an image frame is disclosed, the system comprising:
a memory, which stores executable components; and
a processor, electrically coupled to the memory, which executes the executable components to perform the operations of the system, wherein the executable components include:
a CNN training component, used to train a CNN by inputting one or more crowd image patches from frames in a training set, each input crowd image patch having a predetermined ground-truth density distribution and person count;
a similar-data retrieval component, which samples frames from a target scene image set, receives from the training set training images with determined ground-truth density distributions and counts/numbers, and retrieves similar image data from the received training frames for each sampled target image frame, to overcome the scene gap between the target scene image set and the training images; and
a model fine-tuning component, which fine-tunes the CNN by inputting the similar image data into it, to determine the prediction model for predicting the crowd density map and person count in an image frame.
The claimed solution has at least one of the following advantages:
Multi-task system: it estimates the crowd density map and the count together. The count can be computed by integrating the density map. The two related tasks can also assist each other, so that a better solution is obtained for our trained model.
Cross-scene capability: in the cross-scene counting framework, the target scene needs no additional pedestrian labels.
No crowd segmentation needed: it does not rely on crowd foreground-segmentation pre-processing. Whether the crowd is moving or not, the crowd texture will be captured by our model, and the system can obtain reasonable estimation results.
The following description and drawings set forth certain illustrative aspects of the disclosure. These aspects are indicative, however, of but a few of the various ways in which the disclosed principles may be employed. Other aspects of the disclosure will become apparent from the following detailed description when considered in conjunction with the drawings.
Brief description of the drawings
Exemplary, non-limiting embodiments of the present invention are described below with reference to the accompanying drawings. The drawings are illustrative and generally not drawn to exact scale. The same or similar elements in different figures are referenced with identical reference numerals.
Fig. 1 is a schematic block diagram illustrating an apparatus 1000 according to one embodiment of the present application, the apparatus being used to generate a prediction model to predict a crowd density map and count.
Fig. 2 is a schematic flow diagram illustrating how the apparatus 1000 according to one embodiment of the present application generates a prediction model to predict the crowd density distribution and person count in an image frame.
Fig. 3 is a schematic diagram illustrating the flowchart process of the density map creation unit 10 according to one embodiment of the present application.
Fig. 4 is a schematic diagram illustrating the flowchart process of the CNN training unit according to one embodiment of the present application.
Fig. 5 is a schematic overview of the crowd CNN model according to one embodiment of the present application, the crowd CNN model shown having a switchable objective.
Fig. 6 is a schematic diagram illustrating the flow of similar-data retrieval according to another embodiment herein.
Fig. 7 is a schematic diagram illustrating a system for generating a prediction model according to one embodiment of the present application, in which the functions of the present invention are implemented by software.
Detailed description
Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that the invention is not limited to the described embodiments. On the contrary, it is intended to cover all alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail so as not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the term "comprising," when used in this specification, specifies the presence of stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is a schematic block diagram illustrating an apparatus 1000 according to one embodiment of the present application, the apparatus being used to generate a prediction model to predict a crowd density map and count. As shown, the apparatus 1000 may comprise a density map creation unit 10, a CNN generation unit 20, a similar-data retrieval unit 30, and a model fine-tuning unit 40.
Fig. 2 is a general schematic diagram illustrating the flowchart process 2000 of the apparatus 1000 according to one embodiment of the present application. In step s201, the density map creation unit 10 operates to select image patches from one or more training image frames in the training set, and to determine the ground-truth crowd distribution in each selected patch and the total ground-truth number of pedestrians in that patch. In step s202, the CNN training unit 20 operates to train a CNN by inputting one or more crowd image patches from frames in the training set, wherein each input crowd image patch has a predetermined ground-truth density distribution and person count. In step s203, the similar-data retrieval unit 30 operates to sample frames from the target scene image set while receiving, from the training set, training images with determined ground-truth density distributions and counts/numbers, and to retrieve similar image data from the received training frames for each sampled target image frame, to overcome the scene gap between the target scene image set and the training images. In step s204, the model fine-tuning unit 40 operates to fine-tune the CNN by inputting the similar image data into it, to determine the prediction model for predicting the crowd density map and person count in an image frame. The cooperation of the density map creation unit 10, the CNN generation unit 20, the similar-data retrieval unit 30, and the model fine-tuning unit 40 is discussed in detail below.
1) Density map creation unit 10
The initial input to the apparatus 1000 (i.e., the input to the density map creation unit 10) is a training set, which comprises a certain number of video frames captured from various surveillance cameras and carrying pedestrian head annotations. The density map creation unit 10 operates to derive the density map and count of each video frame based on the input training set.
Fig. 3 is a schematic diagram illustrating the flowchart process of the density map creation unit 10 according to one embodiment of the present application. In step s301, the density map creation unit 10 operates to approximate a perspective map/distribution for each training scene/frame from the training set. Pedestrian heads are annotated to indicate each person's head position within the region of interest of each training frame. With the head positions annotated, the spatial location and body shape of each pedestrian can be located in each frame. In step s302, a ground-truth density map/distribution is created based on the spatial locations of the pedestrians, the human body shape, and the perspective distortion of the image, to determine the ground-truth density of the pedestrians/crowd in each frame and to estimate the person count of the crowd in each frame of the training set. Specifically, the ground-truth density map/distribution represents the crowd distribution in each frame, and the total number of pedestrians equals the integral of the density map/distribution.
Specifically, the main objective of the crowd CNN model, to be discussed later, is to learn a mapping F: X → D, where X is a set of low-level features extracted from the training images and D is the crowd density map/distribution of the image. Assuming that the position of each pedestrian is annotated, the density map/distribution is created based on the pedestrians' spatial locations, the human body shape, and the perspective distortion of the image. Image patches randomly selected from the training images are treated as training samples, and the density maps/distributions of the corresponding patches are treated as the ground truth for the crowd CNN model, which will be further discussed later. As an auxiliary objective, the total crowd number in a selected training patch is calculated by integrating the density map/distribution. It should be noted that the total is a decimal, not an integer.
In the prior art, the ground truth for density map regression is defined as a sum of Gaussian kernels centered on the positions of the objects. Such a density map/distribution is suitable for characterizing the density distribution of circle-shaped objects such as cells and bacteria. However, this assumption may fail for general pedestrian crowds, where the camera is not in a bird's-eye view. Pedestrians in typical surveillance cameras exhibit three distinct characteristics: 1) pedestrian images in surveillance videos have different scales due to perspective distortion; 2) the shape of the human body is closer to an ellipse than to a circle; 3) owing to severe occlusion, the head and shoulders are the main cues for judging whether a pedestrian is present at a given position, while the body parts of pedestrians are unreliable for labeling people. Taking these characteristics into account, the crowd density map/distribution is created by combining several distributions with perspective normalization.
Perspective normalization is necessary for estimating the pedestrian scale. For each scene, several adult pedestrians are randomly selected and annotated from head to toe. Assuming the mean height of adults to be, e.g., 175 cm, the perspective map M can then be approximated by linear regression. The pixel value in the perspective map, M(p), represents the number of pixels in the image corresponding to some distance (for example, one meter) at that location in the actual scene. If a pedestrian is annotated with H pixels, the perspective value at the pedestrian's center is M(p) = H/1.75; the perspective map is then linearly interpolated along the vertical and horizontal directions, respectively, to obtain all perspective values. After the perspective map/distribution and the set of head positions P_h of the pedestrians within the region of interest (ROI) are obtained, the crowd density map/distribution is created according to the following rule:
D(p) = Σ_{P∈P_h} (1/‖Z‖) (N_h(p; P_h, σ_h) + N_b(p; P_b, Σ))        (1)
The crowd density kernel comprises two terms: a normalized 2-D Gaussian kernel N_h for the head part, and a bivariate normal distribution N_b for the body part. Here P_b is the position of the pedestrian's body, estimated from the head position and the perspective value. To best represent the pedestrian contour, the variances are set to σ_h = 0.2M(p) for the head term N_h, and to σ_x = 0.2M(p), σ_y = 0.5M(p) for the body term N_b. To ensure that the integral of all density values in the density map/distribution equals the total crowd number in the original image, the overall distribution is normalized by Z.
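As an illustration of the linear fit just described, the following sketch estimates a perspective map from a few synthetic head-to-toe annotations. It is a minimal example under the stated assumptions (a 1.75 m mean adult height, a purely linear fit along y replicated along x); the function and variable names are illustrative, not taken from the patent:

```python
import numpy as np

def fit_perspective_map(head_y, heights_px, frame_h, frame_w, mean_height=1.75):
    """Approximate the perspective map M by linear regression.

    head_y     : y-coordinates (rows) of the annotated pedestrians
    heights_px : their head-to-toe heights in pixels
    M(p) is the number of pixels spanning one meter at that row: an H-pixel
    pedestrian of assumed height 1.75 m gives M = H / 1.75. M is fitted
    linearly along y and replicated along x.
    """
    m_samples = np.asarray(heights_px, float) / mean_height   # pixels per meter
    slope, intercept = np.polyfit(np.asarray(head_y, float), m_samples, 1)
    column = slope * np.arange(frame_h) + intercept           # M(y), linear in y
    return np.tile(column[:, None], (1, frame_w))

# Toy scene: pedestrians lower in the frame appear larger.
M = fit_perspective_map(head_y=[50, 120, 200], heights_px=[35, 70, 110],
                        frame_h=240, frame_w=320)
```

A real pipeline would also restrict the fit to the region of interest; the sketch fits the whole frame for simplicity.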
In short, for each annotated head position, a body-shape density distribution or kernel (hereinafter referred to as a "kernel") is determined, as described in formula (1). The (overlapping) body-shape kernels of all annotated persons are combined to form the ground-truth density map/distribution of each frame. The larger the values at positions in the ground-truth density map/distribution, the higher the crowd density at those positions. Moreover, since the normalized value of each body-shape kernel equals 1, the person count of the crowd equals the sum of all values of the body-shape kernels in the ground-truth density map/distribution.
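The kernel construction of formula (1) can be sketched as follows. This is a simplified illustration, not the patented implementation: the body-center offset below the head and the flat toy perspective map are assumptions, while the head/body spreads (0.2M(p), and 0.2M(p)/0.5M(p)) and the per-person normalization follow the description above:

```python
import numpy as np

def gaussian2d(shape, center, sx, sy):
    """Unnormalized 2-D Gaussian evaluated on the pixel grid."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((xx - center[1]) ** 2 / (2 * sx ** 2)
                    + (yy - center[0]) ** 2 / (2 * sy ** 2)))

def create_density_map(shape, heads, M):
    """Ground-truth density map from head annotations and a perspective map M.

    Per person: a head kernel N_h with sigma_h = 0.2*M(p), plus a body kernel
    N_b below the head with sigma_x = 0.2*M(p), sigma_y = 0.5*M(p). Each
    person's combined kernel is normalized to integrate to 1, so the sum of
    the whole map equals the person count (a decimal in general).
    """
    density = np.zeros(shape, float)
    for (y, x) in heads:
        m = M[y, x]                               # pixels per meter at the head
        head = gaussian2d(shape, (y, x), 0.2 * m, 0.2 * m)
        # body center assumed ~0.875 m below the head (half the 1.75 m height)
        body = gaussian2d(shape, (y + 0.875 * m, x), 0.2 * m, 0.5 * m)
        kernel = head + body
        density += kernel / kernel.sum()          # normalization by Z
    return density

M = np.full((120, 160), 12.0)                     # flat toy perspective map
d = create_density_map((120, 160), heads=[(30, 40), (60, 80), (90, 120)], M=M)
```

Integrating (summing) `d` recovers the number of annotated persons, which is exactly the property the auxiliary count objective relies on.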
2) CNN generation unit 20
The CNN generation unit 20 is configured to construct and initialize a first crowd convolutional neural network (CNN). The generation unit 20 operates to acquire/sample crowd image patches from the frames in the training set, and to obtain the corresponding ground-truth density maps and counts of the sampled crowd image patches (as determined by the unit 10). The generation unit 20 then inputs the crowd image patches sampled from the training set, along with their corresponding ground-truth density maps/distributions and counts as target objectives, into the CNN so as to train it.
Fig. 4 is a schematic diagram illustrating the flowchart of a process 4000 for generating and training the CNN according to one embodiment of the present application.
As shown, in step s401, the process 4000 samples one or more crowd image patches from the frames in the training set and obtains the corresponding ground-truth density map and number/crowd count of each sampled patch. The input is a set of patches cropped from the training images. To obtain pedestrians at a similar scale, the size of each patch at a given location is selected according to the perspective value of its center pixel. In this example, each patch may be set to cover a 3×3 square-meter area of the actual scene. The patches are then warped to, e.g., 72×72 pixels as the input of the crowd CNN model generated in step s402.
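The perspective-normalized patch sampling of step s401 can be sketched as below. This is a hedged illustration: nearest-neighbor resampling stands in for whatever warping the implementation actually uses, and all names are hypothetical:

```python
import numpy as np

def crop_patch(image, M, center, meters=3.0, out=72):
    """Crop a square patch covering `meters` x `meters` of the actual scene.

    The side length in pixels is meters * M(center), since M gives pixels per
    meter, so pedestrians in every patch appear at a similar scale; the patch
    is then warped to out x out pixels. Nearest-neighbor resampling keeps the
    sketch dependency-free (a real pipeline would interpolate).
    """
    y, x = center
    half = int(round(meters * M[y, x] / 2))
    patch = image[y - half:y + half, x - half:x + half]
    idx = (np.arange(out) * patch.shape[0] / out).astype(int)
    return patch[np.ix_(idx, idx)]

img = np.arange(300 * 400, dtype=float).reshape(300, 400)
M = np.full((300, 400), 20.0)              # 20 pixels per meter everywhere
p = crop_patch(img, M, center=(150, 200))  # 60x60 crop warped to 72x72
```

Because the crop size scales with M, a patch near the camera (large M) and one far away (small M) both end up covering the same real-world area before being warped to the fixed CNN input size.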
In step s402, the process 4000 initializes the crowd convolutional neural network randomly based on a Gaussian distribution. An overview of the crowd CNN model with its switchable objectives is shown in Fig. 5.
As shown, the crowd CNN model 500 contains three convolutional layers (conv1 to conv3) and three fully connected layers (fc4, fc5, and fc6 or fc7). Conv1 has 32 7×7×3 filters, conv2 has 32 7×7×32 filters, and the final convolutional layer has 64 5×5×32 filters. Max-pooling layers with a 2×2 kernel size are used after conv1 and conv2. Rectified linear units (ReLU), not shown in Fig. 5, are the activation functions applied after every convolutional and fully connected layer. It shall be appreciated that the numbers of filters and layers are described herein merely as examples for purposes of illustration; the application is not limited to these specific numbers, and other numbers would be acceptable.
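A quick way to check the stated geometry is to trace the spatial sizes through the layers. The sketch below assumes same-padded convolutions (the patent does not state the padding), under which the two 2×2 pooling layers alone reduce a 72×72 input patch to an 18×18 output:

```python
def conv_out(size, kernel, stride=1, pad="same"):
    """Spatial output size of a convolution layer."""
    if pad == "same":
        return (size + stride - 1) // stride   # same padding preserves size
    return (size - kernel) // stride + 1

def crowd_cnn_shapes(inp=72):
    """Trace the spatial sizes through the layers described above:
    conv1 (7x7) -> 2x2 max pool -> conv2 (7x7) -> 2x2 max pool -> conv3 (5x5).
    """
    s = conv_out(inp, 7)    # conv1: 32 filters, 7x7x3
    s //= 2                 # max pool, 2x2 kernel
    s = conv_out(s, 7)      # conv2: 32 filters, 7x7x32
    s //= 2                 # max pool, 2x2 kernel
    s = conv_out(s, 5)      # conv3: 64 filters, 5x5x32
    return s

side = crowd_cnn_shapes(72)  # the two pooling layers give 72 -> 36 -> 18
```

This matches the down-sampled density map size discussed with the loss functions below.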
In step s403, the process 4000 learns the mapping from a crowd patch to a density map/distribution, for example by using mini-batch gradient descent and backpropagation, until the density map/distribution converges to the ground-truth density/distribution created by the ground-truth density map creation unit 10. In step s404, the process 4000 switches the objective and learns the mapping from a crowd patch to a count, until the learned count converges to the count estimated by the ground-truth density map creation unit 10. In step s405, it is determined whether the estimated density map/distribution and count have converged to the ground truth; if not, steps s403 to s405 are repeated. Steps s403 to s405 are discussed in detail below.
In one embodiment of the present application, an iterative switching process is introduced in the crowd CNN model 500 to alternately optimize the density map/distribution estimation task and the count estimation task. The main task of the crowd CNN model 500 is to estimate the crowd density map/distribution of an input patch. Since the CNN model 500 of the embodiment shown in Fig. 5 has two pooling layers, the output density map/distribution is down-sampled to 18×18. Accordingly, the ground-truth density map/distribution is also down-sampled to 18×18. Because the density map/distribution contains rich local detail, the CNN model 500 can benefit from learning to predict it and can obtain a better representation of crowd patches. Regressing the total count of the input patch is treated as the second task, which is calculated by integrating over the density map patch. The two tasks alternately assist each other so that a better solution is obtained. The two loss functions are defined as follows:
L_D(Θ) = (1/2N) Σ_{i=1..N} ‖F_d(X_i; Θ) − D_i‖²
L_Y(Θ) = (1/2N) Σ_{i=1..N} (F_y(X_i; Θ) − Y_i)²
where Θ is the set of parameters of the CNN model and N is the number of training samples. L_D is the loss between the estimated density map F_d(X_i; Θ) (the output of fc6) and the ground-truth density map D_i. Similarly, L_Y is the loss between the estimated crowd number F_y(X_i; Θ) (the output of fc7) and the ground-truth number Y_i. Euclidean distance is adopted in both objective losses, which are minimized using mini-batch gradient descent and backpropagation.
The switchable training scheme is summarized in Algorithm 1. L_D is set as the first objective loss to be minimized, because the density map/distribution can introduce more spatial information into the CNN model, and density map/distribution estimation requires the model 500 to learn a general representation of crowds. After the first objective converges, the model 500 is switched to minimize the objective of global count regression. Count regression is an easier task, and it is learned faster than density map/distribution regression. It should be noted that the two objective losses should be normalized to similar or identical scales; otherwise, the objective with the larger scale would dominate the training process. In one embodiment of the present application, the scale weight of the density loss may be set to 10, and the scale weight of the count loss may be set to 1. Training converges after approximately 6 switch iterations. The proposed switched learning approach can achieve better performance than the widely used multi-task learning approaches.
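The switchable scheme of Algorithm 1 can be illustrated on a toy problem. The sketch below alternates between minimizing a density objective L_D (scale weight 10) and a count objective L_Y (scale weight 1) over about 6 switch iterations, using a linear model on synthetic data in place of the CNN; everything about the toy setup (dimensions, learning rate, data) is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the CNN: a linear model mapping an 8-dim feature vector
# to a 4-cell "density map"; the count is the integral (sum) of the map.
N, F, C = 64, 8, 4
X = rng.normal(size=(N, F))
W_true = rng.normal(size=(F, C)) * 0.5
D = X @ W_true                          # ground-truth density maps
Y = D.sum(axis=1)                       # ground-truth counts

W = np.zeros((F, C))                    # model parameters Theta
lam_d, lam_y = 10.0, 1.0                # scale weights from the description
lr = 0.01

def losses(W):
    Fd = X @ W                          # estimated density maps F_d
    Fy = Fd.sum(axis=1)                 # estimated counts F_y
    L_D = 0.5 / N * ((Fd - D) ** 2).sum()
    L_Y = 0.5 / N * ((Fy - Y) ** 2).sum()
    return L_D, L_Y

start = sum(losses(W))
for switch in range(6):                 # ~6 switch iterations, as stated
    for _ in range(200):                # objective 1: density map regression
        W -= lr * lam_d / N * (X.T @ (X @ W - D))
    for _ in range(200):                # objective 2 (switched): count regression
        err = (X @ W).sum(axis=1) - Y
        W -= lr * lam_y / N * (X.T @ err)[:, None] * np.ones((1, C))
end = sum(losses(W))
```

The count objective alone is under-determined (many density maps share one sum), which mirrors why the density objective is minimized first: it pins down the spatial structure, after which count regression refines a strictly easier target.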
3) Similar-data retrieval unit 30
The similar-data retrieval unit 30 is configured to: receive sample frames from the target scene, and receive from the training set samples carrying the ground-truth density maps/distributions and counts created by the unit 10; and then retrieve similar data from the training set for each target scene, to overcome the scene gap.
The crowd CNN model 500 is pre-trained on all training scene data through the proposed switchable learning process. However, each queried crowd scene has its own unique scene properties, such as a different view angle, scale, and density distribution. These properties significantly change the appearance of crowd patches and affect the performance of the crowd CNN model 500. To bridge the distribution gap between the training scenes and the test scene, a non-parametric fine-tuning scheme is designed to adapt the pre-trained CNN model 500 to unseen target scenes.
Given a target video from an unseen scene, samples with similar properties are retrieved from the training frames and added to the training data to fine-tune the crowd CNN model 500. The retrieval task consists of two steps: candidate scene retrieval and local patch retrieval.
Candidate scene retrieval (step 601). The view angle and scale of a scene are the main factors affecting the appearance of a crowd. The perspective map/distribution can indicate both view angle and scale. To overcome the scale gap between different scenes, each input image patch is normalized to the same scale, covering, for example, a 3×3 square-meter area of the actual scene according to the perspective map/distribution. Therefore, the first step of the nonparametric fine-tuning method focuses on retrieving, from all training scenes, those whose perspective maps/distributions are similar to that of the target scene. The retrieved scenes are called candidate fine-tuning scenes. A perspective descriptor is designed to represent the view angle of each scene. Since the perspective map/distribution is fitted linearly along the y-axis, its vertical gradient ΔMy = M(y) − M(y−1) can serve as the perspective descriptor. Based on this descriptor, for the unseen scene, the top (for example, 20) scenes with similar perspective maps are retrieved from the whole training set. The images of the retrieved scenes are regarded as candidate scenes for local patch retrieval.
Local patch retrieval (step 602). The second step is to select, from the candidate scenes, similar image patches whose density distributions are close to that of the test scene. Besides view angle and scale, the crowd density distribution also affects the appearance pattern of a crowd. Denser crowds suffer from more severe occlusion, and only heads and shoulders may be observed; in contrast, pedestrians in sparse crowds present their complete body shapes. Therefore, the similar data retrieval unit 30 is configured to predict the density distribution of the target scene and to retrieve, from the candidate scenes, similar image patches matching the predicted target density distribution. For example, for a denser crowd scene, denser image patches should be retrieved to fine-tune the pre-trained model to fit the target scene.
Using the CNN model 500 pre-trained as in the unit 20, the density and total count of each image patch of the target image can be roughly computed. It is assumed that image patches with similar density maps/distributions produce similar outputs from the pre-trained model 500. Based on the prediction results, the histogram of the density distribution of the target scene is computed, each bin (ci) being calculated according to the rule of equation (4), where yi is the integrated count of the estimated density map/distribution of sample i.
Since scenes in which more than 20 pedestrians stand within 3×3 square meters rarely exist, an image patch is assigned to the sixth bin (that is, ci = 6) when yi > 20. The density distribution of the target scene can thus be obtained from equation (4). Then, image patches are randomly selected from the retrieved training scenes, and the numbers of image patches of different densities are controlled so as to match the density distribution of the target scene. In this way, the proposed fine-tuning method obtains image patches with similar view angles, scales, and density distributions.
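A sketch of this histogram matching follows. Equation (4) is not reproduced in this excerpt, so the bin rule `ci = ceil(yi / 4)`, capped at bin 6 for yi > 20, is an assumption chosen to be consistent with the stated six bins and the yi > 20 case; the patch names are hypothetical.

```python
import math
import random

def density_bin(y):
    """Assign an estimated patch count y to one of six histogram bins.
    Assumed rule: bins of width 4 up to 20, with yi > 20 mapped to bin 6."""
    if y > 20:
        return 6
    return max(1, math.ceil(y / 4))

def match_patches(target_counts, candidate_patches, rng):
    """Randomly pick candidate patches so that their bin histogram matches
    the predicted density histogram of the target scene."""
    want = {}                                    # bin -> required patch count
    for y in target_counts:
        b = density_bin(y)
        want[b] = want.get(b, 0) + 1
    pools = {}                                   # bin -> available patches
    for patch, y in candidate_patches:
        pools.setdefault(density_bin(y), []).append(patch)
    chosen = []
    for b, n in want.items():
        pool = pools.get(b, [])
        chosen += rng.sample(pool, min(n, len(pool)))
    return chosen
```

The random per-bin sampling corresponds to the text's controlled random selection of patches of different densities.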
4) Model fine-tuning unit 40
The model fine-tuning unit 40 is configured to receive the retrieved similar data and to use it to fine-tune the pre-trained CNN 500, so that the CNN 500 can predict the density map/distribution and the pedestrian count in a region of interest of a video frame to be detected. The fine-tuned crowd CNN model achieves better performance for the target scene.
In an embodiment of the application, the fine-tuning unit 40 samples the similar image patches retrieved by the unit 30 and inputs them into the pre-trained CNN to fine-tune it (for example, by mini-batch gradient descent and backpropagation until the density map/distribution converges to the ground-truth density map/distribution created by the ground-truth density map creation unit 10). Then, the fine-tuning unit 40 switches the objective and learns the mapping from crowd image patches to counts, until the learned count converges to the count estimated by the ground-truth density map creation unit 10. Finally, it determines whether the estimated density map/distribution and count have converged to the ground truth; if not, the above steps are repeated.
The fine-tuned prediction model generated by the model fine-tuning unit 40 can receive a video frame to be detected and a region of interest, and then predict the estimated density map and pedestrian count in the region of interest.
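The outer loop just described can be expressed schematically as follows. The `train_*` and `*_err` callables are hypothetical stand-ins for mini-batch gradient descent with backpropagation and for the two convergence checks; the scalar toy model below is an assumption used only to exercise the loop.

```python
def fine_tune(model, train_density, train_count, density_err, count_err,
              tol=1e-3, max_rounds=20):
    """Alternate the density and count objectives until both errors are
    below tol, mirroring the repeat-until-converged loop of unit 40."""
    for _ in range(max_rounds):
        model = train_density(model)       # objective 1: density map
        model = train_count(model)         # switched objective: patch -> count
        if density_err(model) < tol and count_err(model) < tol:
            break                          # both converged to ground truth
    return model

# Toy usage: "model" is a scalar error that each training phase shrinks.
tuned = fine_tune(1.0,
                  train_density=lambda m: 0.3 * m,
                  train_count=lambda m: 0.3 * m,
                  density_err=abs, count_err=abs)
```

The final check on both objectives is what prevents the switched training from converging on one target while drifting on the other.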
As will be apparent to those skilled in the art, the present invention may be embodied as a system, a method, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment or a hardware aspect (which may generally be referred to herein as a "unit", "circuit", "module", or "system"). Much of the inventive functionality and many of the inventive principles are, when implemented, best supported by integrated circuits (ICs), such as digital signal processors and thus software or application-specific integrated circuits. It is expected that one of ordinary skill in the art, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, will, when guided by the concepts and principles disclosed herein, be readily capable of producing such ICs with minimal experimentation. Therefore, in the interest of brevity and of minimizing any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, is limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
Furthermore, the present invention may take the form of an entirely software embodiment (including firmware, resident software, microcode, etc.) or of an embodiment combining software. In addition, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 7 illustrates a system 7000 for generating a prediction model to predict crowd density distributions and person counts in image frames. The system 7000 comprises: a memory 3001, which stores executable components; and a processor 3002, which is electrically coupled to the memory 3001 to execute the executable components so as to perform the operations of the system 7000. These executable components may include: a ground-truth density map creation component 701, a CNN training component 702, a similar data retrieval component 703, and a model fine-tuning component 704.
The ground-truth density map creation component 701 is configured to: select image patches from one or more training image frames of a training set; and determine the ground-truth crowd distribution in each selected image patch and the ground-truth total pedestrian count in the selected image patch. The CNN training component 702 is configured to train a CNN by inputting one or more crowd image patches of frames in the training set, wherein each of the input crowd image patches has a predetermined ground-truth density distribution and person count.
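As a sketch of what component 701 produces, one common construction places a normalized 2-D Gaussian "kernel" at each labeled head position and sums them, so that the resulting density map integrates to the pedestrian count. The isotropic kernel and its fixed bandwidth below are simplifying assumptions; the kernel described elsewhere in this application also models the body shape and perspective distortion.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """A (size, size) Gaussian kernel normalized to sum to 1,
    so each person contributes exactly 1 to the density map."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def ground_truth_density(shape, head_positions, size=15, sigma=3.0):
    """Stamp one normalized kernel per labeled head position (y, x)."""
    k = gaussian_kernel(size, sigma)
    r = size // 2
    padded = np.pad(np.zeros(shape), r)      # pad so border heads fit
    for y, x in head_positions:
        padded[y:y + size, x:x + size] += k  # kernel centered at (y, x)
    return padded[r:-r, r:-r]                # crop back to the frame
```

Because each kernel sums to 1, the sum of all values of the map equals the number of labeled persons, which is the count property stated for the ground truth.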
The similar data retrieval component 703 is configured to sample frames from a target scene image set and to receive, from the training set, training images with the determined ground-truth density distributions and counts/numbers; and, for each sampled target image frame, to retrieve similar image data from the received training frames, so as to overcome the scene gap between the target scene image set and the training images.
The model fine-tuning component 704 is configured to fine-tune the CNN 500 by inputting the similar image data into it, so as to determine a prediction model for predicting the crowd density map and person count in an image frame.
The functions of the components 701 to 704 are respectively analogous to those of the units 10 to 40, and their detailed descriptions are therefore omitted here.
Although preferred examples of the present invention have been described, those skilled in the art may make changes or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as including the preferred examples and all changes or modifications falling within the scope of the present invention.
Claims (22)
1. A method for generating a prediction model for predicting crowd density distributions and person counts in image frames, comprising:
training a CNN by inputting one or more crowd image patches of frames in a training set, wherein each of the input crowd image patches has a predetermined ground-truth density distribution and person count;
sampling frames from a target scene image set and receiving, from the training set, training images with the predetermined ground-truth density distributions and counts/numbers;
retrieving, for each sampled target image frame, similar image data from the received training frames, so as to overcome a scene gap between the target scene image set and the training images; and
fine-tuning the CNN by inputting the similar image data into the CNN, so as to determine a prediction model for predicting crowd density maps and person counts in image frames.
2. The method according to claim 1, further comprising:
selecting image patches from one or more training image frames of the training set; and determining a ground-truth crowd distribution in each selected image patch and a ground-truth total pedestrian count in each selected image patch.
3. The method according to claim 2, wherein the determining further comprises:
identifying, in each image frame, each person having a labeled head position;
determining a human-shape kernel for each identified person; and
combining all determined human-shape kernels to form a ground-truth density map/distribution of each frame, wherein the count of the persons in the crowd equals the sum of all values of the human-shape kernels in the ground-truth density map/distribution.
4. The method according to claim 1, wherein the training further comprises:
randomly initializing the CNN based on a Gaussian random distribution;
sampling image patches from the training images;
estimating, by the CNN, the crowd distribution in each sampled image patch and the total number of pedestrians in the sampled image patch;
updating parameters of the CNN until the estimated distribution converges to the ground-truth distribution; and
further updating the parameters of the CNN until the estimated number converges to the determined ground-truth number, so as to obtain a pre-trained CNN.
5. The method according to claim 4, wherein the retrieving of the similar data further comprises:
retrieving, from the training image frames, candidate fine-tuning frame data having a perspective distribution similar to that of the target image frame; and
selecting, from the candidate scenes, similar image patches having density distributions similar to the density distribution of the target image frame.
6. The method according to claim 5, wherein the fine-tuning further comprises:
sampling image patches from the similar image patches;
estimating, by the pre-trained CNN, the crowd distribution in each sampled image patch and the total number of pedestrians in the sampled image patch;
updating the parameters of the CNN until the estimated distribution converges to the ground-truth distribution; and
further updating the parameters of the pre-trained CNN until the estimated number converges to the determined ground-truth number, so as to obtain a fine-tuned CNN.
7. The method according to any one of claims 1 to 6, wherein the total count of the persons in the image frame is determined by integrating the determined ground-truth density distribution.
8. The method according to any one of claims 1 to 6, wherein the crowd distribution is created based on the spatial positions of the pedestrians in each image frame, the human body shapes in each image frame, and the perspective distortion of the image.
9. An apparatus for generating a prediction model for predicting crowd density distributions and person counts in image frames, comprising:
a CNN training unit (20) that trains a CNN by inputting one or more crowd image patches of frames in a training set, wherein each of the input crowd image patches has a predetermined ground-truth density distribution and person count;
a similar data retrieval unit (30) that samples frames from a target scene image set, receives, from the training set, training images with the determined ground-truth density distributions and counts/numbers, and retrieves, for each sampled target image frame, similar image data from the received training frames, so as to overcome a scene gap between the target scene image set and the training images; and
a model fine-tuning unit (40) that fine-tunes the CNN by inputting the similar image data into the CNN, so as to determine a prediction model for predicting crowd density maps and person counts in image frames.
10. The apparatus according to claim 9, further comprising:
a ground-truth density map creation unit (10) that selects image patches from one or more training image frames of the training set, and determines a ground-truth crowd distribution in each selected image patch and a ground-truth total pedestrian count in each selected image patch.
11. The apparatus according to claim 10, wherein the ground-truth density map creation unit (10) is configured to determine the ground-truth crowd distribution in each selected image patch and the ground-truth total pedestrian count in each selected image patch by:
identifying, in each image frame, each person having a labeled head position;
determining a human-shape kernel for each identified person; and
combining all determined human-shape kernels to form a ground-truth density map/distribution of each frame, wherein the count of the persons in the crowd equals the sum of all values of the human-shape kernels in the ground-truth density map/distribution.
12. The apparatus according to claim 9, wherein the CNN training unit (20) trains the CNN by:
randomly initializing the CNN based on a Gaussian random distribution;
sampling image patches from the training images;
estimating, by the CNN, the crowd distribution in each sampled image patch and the total number of pedestrians in the sampled image patch;
updating parameters of the CNN until the estimated distribution converges to the ground-truth distribution; and
further updating the parameters of the CNN until the estimated number converges to the determined ground-truth number, so as to obtain a pre-trained CNN.
13. The apparatus according to claim 12, wherein the similar data retrieval unit (30) is configured to:
retrieve, from the training image frames, candidate fine-tuning frame data having a perspective distribution similar to that of the target image frame; and
select, from the candidate scenes, similar image patches having density distributions similar to the density distribution of the target image frame.
14. The apparatus according to claim 13, wherein the fine-tuning unit is further configured to:
sample image patches from the similar image patches;
estimate, by the pre-trained CNN, the crowd distribution in each sampled image patch and the total number of pedestrians in the sampled image patch;
update the parameters of the CNN until the estimated distribution converges to the ground-truth distribution; and
further update the parameters of the pre-trained CNN until the estimated number converges to the determined ground-truth number, so as to obtain a fine-tuned CNN.
15. The apparatus according to any one of claims 9 to 14, wherein the total count of the persons in the image frame is determined by integrating the determined ground-truth density distribution.
16. The apparatus according to any one of claims 9 to 14, wherein the crowd distribution is created based on the spatial positions of the pedestrians in each image frame, the human body shapes in each image frame, and the perspective distortion of the image.
17. A system for generating a prediction model for predicting crowd density distributions and person counts in image frames, comprising:
a memory storing executable components; and
a processor electrically coupled to the memory to execute the executable components so as to perform operations of the system, wherein the executable components comprise:
a CNN training component that trains a CNN by inputting one or more crowd image patches of frames in a training set, wherein each of the input crowd image patches has a predetermined ground-truth density distribution and person count;
a similar data retrieval component that samples frames from a target scene image set, receives, from the training set, training images with the determined ground-truth density distributions and counts/numbers, and retrieves, for each sampled target image frame, similar image data from the received training frames, so as to overcome a scene gap between the target scene image set and the training images; and
a model fine-tuning component that fine-tunes the CNN by inputting the similar image data into the CNN, so as to determine a prediction model for predicting crowd density maps and person counts in image frames.
18. The system according to claim 17, further comprising:
a ground-truth density map creation component that selects image patches from one or more training image frames of the training set, and determines a ground-truth crowd distribution in each selected image patch and a ground-truth total pedestrian count in each selected image patch.
19. The system according to claim 17, wherein the ground-truth density map creation component is configured to determine the ground-truth crowd distribution in each selected image patch and the ground-truth total pedestrian count in each selected image patch by:
identifying, in each image frame, each person having a labeled head position;
determining a human-shape kernel for each identified person; and
combining all determined human-shape kernels to form a ground-truth density map/distribution of each frame, wherein the count of the persons in the crowd equals the sum of all values of the human-shape kernels in the ground-truth density map/distribution.
20. The system according to claim 17, wherein the CNN training component trains the CNN by:
randomly initializing the CNN based on a Gaussian random distribution;
sampling image patches from the training images;
estimating, by the CNN, the crowd distribution in each sampled image patch and the total number of pedestrians in the sampled image patch;
updating parameters of the CNN until the estimated distribution converges to the ground-truth distribution; and
further updating the parameters of the CNN until the estimated number converges to the determined ground-truth number, so as to obtain a pre-trained CNN.
21. The system according to claim 20, wherein the similar data retrieval component is configured to:
retrieve, from the training image frames, candidate fine-tuning frame data having a perspective distribution similar to that of the target image frame; and
select, from the candidate scenes, similar image patches having density distributions similar to the density distribution of the target image frame.
22. The system according to claim 21, wherein the fine-tuning component is further configured to:
sample image patches from the similar image patches;
estimate, by the pre-trained CNN, the crowd distribution in each sampled image patch and the total number of pedestrians in the sampled image patch;
update the parameters of the pre-trained CNN until the estimated distribution converges to the ground-truth distribution; and
further update the parameters of the CNN until the estimated number converges to the determined ground-truth number, so as to obtain a fine-tuned CNN.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/079178 WO2016183766A1 (en) | 2015-05-18 | 2015-05-18 | Method and apparatus for generating predictive models |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107624189A true CN107624189A (en) | 2018-01-23 |
CN107624189B CN107624189B (en) | 2020-11-20 |
Family
ID=57319199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201580080145.XA Active CN107624189B (en) | 2015-05-18 | 2015-05-18 | Method and apparatus for generating a predictive model |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107624189B (en) |
WO (1) | WO2016183766A1 (en) |
Families Citing this family (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10455363B2 (en) * | 2015-11-04 | 2019-10-22 | xAd, Inc. | Systems and methods for using geo-blocks and geo-fences to discover lookalike mobile devices |
US10547971B2 (en) | 2015-11-04 | 2020-01-28 | xAd, Inc. | Systems and methods for creating and using geo-blocks for location-based information service |
CN107566781B (en) * | 2016-06-30 | 2019-06-21 | 北京旷视科技有限公司 | Video monitoring method and video monitoring equipment |
CN106997459B (en) * | 2017-04-28 | 2020-06-26 | 成都艾联科创科技有限公司 | People counting method and system based on neural network and image superposition segmentation |
CN108875456B (en) * | 2017-05-12 | 2022-02-18 | 北京旷视科技有限公司 | Object detection method, object detection apparatus, and computer-readable storage medium |
CN107330364B (en) * | 2017-05-27 | 2019-12-03 | 上海交通大学 | A kind of people counting method and system based on cGAN network |
CN107563349A (en) * | 2017-09-21 | 2018-01-09 | 电子科技大学 | A kind of Population size estimation method based on VGGNet |
CN107657226B (en) * | 2017-09-22 | 2020-12-29 | 电子科技大学 | People number estimation method based on deep learning |
CN107609597B (en) * | 2017-09-26 | 2020-10-13 | 嘉世达电梯有限公司 | Elevator car number detection system and detection method thereof |
EP3704558A4 (en) | 2017-11-01 | 2021-07-07 | Nokia Technologies Oy | Depth-aware object counting |
CN107977025A (en) * | 2017-11-07 | 2018-05-01 | 中国农业大学 | A kind of regulator control system and method for industrialized aquiculture dissolved oxygen |
CN108154089B (en) * | 2017-12-11 | 2021-07-30 | 中山大学 | Size-adaptive-based crowd counting method for head detection and density map |
CN108615027B (en) * | 2018-05-11 | 2021-10-08 | 常州大学 | Method for counting video crowd based on long-term and short-term memory-weighted neural network |
CN109117791A (en) * | 2018-08-14 | 2019-01-01 | 中国电子科技集团公司第三十八研究所 | A kind of crowd density drawing generating method based on expansion convolution |
US11134359B2 (en) | 2018-08-17 | 2021-09-28 | xAd, Inc. | Systems and methods for calibrated location prediction |
US10349208B1 (en) | 2018-08-17 | 2019-07-09 | xAd, Inc. | Systems and methods for real-time prediction of mobile device locations |
US11172324B2 (en) | 2018-08-17 | 2021-11-09 | xAd, Inc. | Systems and methods for predicting targeted location events |
CN109635634B (en) * | 2018-10-29 | 2023-03-31 | 西北大学 | Pedestrian re-identification data enhancement method based on random linear interpolation |
CN111191667B (en) * | 2018-11-15 | 2023-08-18 | 天津大学青岛海洋技术研究院 | Crowd counting method based on multiscale generation countermeasure network |
CN111291587A (en) * | 2018-12-06 | 2020-06-16 | 深圳光启空间技术有限公司 | Pedestrian detection method based on dense crowd, storage medium and processor |
CN110826496B (en) * | 2019-11-07 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Crowd density estimation method, device, equipment and storage medium |
US11106904B2 (en) * | 2019-11-20 | 2021-08-31 | Omron Corporation | Methods and systems for forecasting crowd dynamics |
CN110942015B (en) * | 2019-11-22 | 2023-04-07 | 上海应用技术大学 | Crowd density estimation method |
CN111062275A (en) * | 2019-12-02 | 2020-04-24 | 汇纳科技股份有限公司 | Multi-level supervision crowd counting method, device, medium and electronic equipment |
CN111178235A (en) * | 2019-12-27 | 2020-05-19 | 卓尔智联(武汉)研究院有限公司 | Target quantity determination method, device, equipment and storage medium |
CN111274900B (en) * | 2020-01-15 | 2021-01-01 | 北京航空航天大学 | Empty-base crowd counting method based on bottom layer feature extraction |
CN111626141B (en) * | 2020-04-30 | 2023-06-02 | 上海交通大学 | Crowd counting model building method, counting method and system based on generated image |
CN111652168B (en) * | 2020-06-09 | 2023-09-08 | 腾讯科技(深圳)有限公司 | Group detection method, device, equipment and storage medium based on artificial intelligence |
CN112001274B (en) * | 2020-08-06 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Crowd density determining method, device, storage medium and processor |
CN111898578B (en) * | 2020-08-10 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Crowd density acquisition method and device and electronic equipment |
CN113822111A (en) * | 2021-01-19 | 2021-12-21 | 北京京东振世信息技术有限公司 | Crowd detection model training method and device and crowd counting method and device |
CN112801018B (en) * | 2021-02-07 | 2023-07-07 | 广州大学 | Cross-scene target automatic identification and tracking method and application |
CN113033342A (en) * | 2021-03-10 | 2021-06-25 | 西北工业大学 | Crowd scene pedestrian target detection and counting method based on density estimation |
CN113269224B (en) * | 2021-03-24 | 2023-10-31 | 华南理工大学 | Scene image classification method, system and storage medium |
CN115293465B (en) * | 2022-10-09 | 2023-02-14 | 枫树谷(成都)科技有限责任公司 | Crowd density prediction method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7991193B2 (en) * | 2007-07-30 | 2011-08-02 | International Business Machines Corporation | Automated learning for people counting systems |
CN103971100A (en) * | 2014-05-21 | 2014-08-06 | 国家电网公司 | Video-based camouflage and peeping behavior detection method for automated teller machine |
CN104077613A (en) * | 2014-07-16 | 2014-10-01 | 电子科技大学 | Crowd density estimation method based on cascaded multilevel convolution neural network |
CN104573744A (en) * | 2015-01-19 | 2015-04-29 | 上海交通大学 | Fine granularity classification recognition method and object part location and feature extraction method thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8195598B2 (en) * | 2007-11-16 | 2012-06-05 | Agilence, Inc. | Method of and system for hierarchical human/crowd behavior detection |
CN104268524A (en) * | 2014-09-24 | 2015-01-07 | 朱毅 | Convolutional neural network image recognition method based on dynamic adjustment of training targets |
2015
- 2015-05-18 WO PCT/CN2015/079178 patent/WO2016183766A1/en active Application Filing
- 2015-05-18 CN CN201580080145.XA patent/CN107624189B/en active Active
Non-Patent Citations (2)
Title |
---|
M. SZARVAS et al.: "Pedestrian detection with convolutional neural networks", IEEE Proceedings, Intelligent Vehicles Symposium, 2005 *
QIN Xunhui: "Crowd counting in scenes with multiple crowd densities", Journal of Image and Graphics *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109034355A (en) * | 2018-07-02 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Number prediction technique, device, equipment and the storage medium of fine and close crowd |
US11302104B2 (en) | 2018-07-02 | 2022-04-12 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device, and storage medium for predicting the number of people of dense crowd |
CN109034355B (en) * | 2018-07-02 | 2022-08-02 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for predicting number of people in dense crowd and storage medium |
CN109447008A (en) * | 2018-11-02 | 2019-03-08 | 中山大学 | Population analysis method based on attention mechanism and deformable convolutional neural networks |
CN109409318A (en) * | 2018-11-07 | 2019-03-01 | 四川大学 | Training method, statistical method, device and the storage medium of statistical model |
CN109815936A (en) * | 2019-02-21 | 2019-05-28 | 深圳市商汤科技有限公司 | A kind of target object analysis method and device, computer equipment and storage medium |
CN109815936B (en) * | 2019-02-21 | 2023-08-22 | 深圳市商汤科技有限公司 | Target object analysis method and device, computer equipment and storage medium |
CN110197502A (en) * | 2019-06-06 | 2019-09-03 | 山东工商学院 | A kind of multi-object tracking method that identity-based identifies again and system |
CN111340801A (en) * | 2020-03-24 | 2020-06-26 | 新希望六和股份有限公司 | Livestock checking method, device, equipment and storage medium |
CN112990530A (en) * | 2020-12-23 | 2021-06-18 | 北京软通智慧城市科技有限公司 | Regional population number prediction method and device, electronic equipment and storage medium |
CN112990530B (en) * | 2020-12-23 | 2023-12-26 | 北京软通智慧科技有限公司 | Regional population quantity prediction method, regional population quantity prediction device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107624189B (en) | 2020-11-20 |
WO2016183766A1 (en) | 2016-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107624189A (en) | Method and apparatus for generating forecast model | |
Mou et al. | IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network | |
Mukhoti et al. | Evaluating bayesian deep learning methods for semantic segmentation | |
Rafi et al. | An Efficient Convolutional Network for Human Pose Estimation. | |
US20180114071A1 (en) | Method for analysing media content | |
CN103020606B (en) | Pedestrian detection method based on spatio-temporal context information | |
CN109766830A (en) | Ship seakeeping system and method based on artificial intelligence image processing | |
EP2131328A2 (en) | Method for automatic detection and tracking of multiple objects | |
JP6397379B2 (en) | CHANGE AREA DETECTION DEVICE, METHOD, AND PROGRAM | |
CN110555481A (en) | Portrait style identification method and device and computer readable storage medium | |
JP2017191501A (en) | Information processing apparatus, information processing method, and program | |
CN110765833A (en) | Crowd density estimation method based on deep learning | |
CN109727270A (en) | Motion mechanism and texture analysis method and system for cardiac magnetic resonance images | |
CN110399882A (en) | Character detection method based on deformable convolutional neural networks | |
Sun et al. | Global Mask R-CNN for marine ship instance segmentation | |
CN109063549A (en) | Moving object detection method for high-resolution aerial video based on deep neural networks | |
Han et al. | Dr. vic: Decomposition and reasoning for video individual counting | |
CN116229560A (en) | Abnormal behavior recognition method and system based on human body posture | |
Gao et al. | Road extraction using a dual attention dilated-linknet based on satellite images and floating vehicle trajectory data | |
CN113780145A (en) | Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium | |
CN111652181B (en) | Target tracking method and device and electronic equipment | |
CN115605914A (en) | Object detection method, object detection device and object detection system | |
CN110490170A (en) | Face candidate box extraction method | |
Li et al. | Fuzzy classification of high resolution remote sensing scenes using visual attention features | |
CN114972335A (en) | Image classification method and device for industrial detection and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||