WO2018167851A1 - Image processing device, image processing method, and image processing program - Google Patents

Image processing device, image processing method, and image processing program

Info

Publication number
WO2018167851A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
subject
subjects
foreground
photographed
Prior art date
Application number
PCT/JP2017/010247
Other languages
French (fr)
Japanese (ja)
Inventor
亮史 服部
奥村 誠司
守屋 芳美
崇 西辻
Original Assignee
三菱電機株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 三菱電機株式会社
Priority to PCT/JP2017/010247 (WO2018167851A1)
Priority to CN201780088044.6A (CN110383295B)
Priority to MYPI2019004399A (MY184063A)
Priority to SG11201906822YA (SG11201906822YA)
Priority to JP2019505568A (JP6559378B2)
Priority to TW106123456A (TW201833822A)
Publication of WO2018167851A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06M COUNTING MECHANISMS; COUNTING OF OBJECTS NOT OTHERWISE PROVIDED FOR
    • G06M11/00 Counting of objects distributed at random, e.g. on a surface

Definitions

  • The present invention relates to an image processing apparatus, an image processing method, and an image processing program.
  • Techniques for estimating the number of people from a camera image include a method of counting people based on person detection and a method of estimating the number of people from the foreground area.
  • The former has an advantage in accuracy when the crowd density is low, but when the crowd density is high, the detection accuracy drops under the influence of occlusion between persons.
  • The latter is inferior to the former in analysis accuracy at low density, but can be processed with a small amount of computation even at high density.
  • Patent Document 1 discloses the latter technique. Specifically, in Patent Document 1, the foreground extracted by background subtraction from an image of a crowd is designated as a person region, and the number of people in the image is estimated from the area of the person region. In Patent Document 1, CG (Computer Graphics) models imitating a crowd are generated in advance for a plurality of congestion levels. A relational expression between the foreground area and the number of people that takes occlusion within the crowd into account is then derived, so that the number of people can be estimated while suppressing the influence of occlusion.
  • Patent Document 1 has the problem that the CG model generated in advance does not match the foreground extracted from the actual camera image, so that an error occurs in the people-count estimation result.
  • The main object of the present invention is to solve the above problem. Specifically, a main object of the present invention is to improve the accuracy of estimating the number of subjects shown in a captured image.
  • An image processing apparatus according to the present invention includes:
  • a reference image generation unit that extracts a subject image, which is an image of a subject, from a first photographed image, which is a photographed image of a photographing space in which the subject is present, and that generates, based on the extracted subject image, a plurality of reference images in which a plurality of subjects are present in the photographing space, changing the number of subjects present in the photographing space for each reference image; and
  • a subject number estimation unit that compares a second photographed image, which is a photographed image of the photographing space taken at a photographing time different from that of the first photographed image, with the plurality of reference images, and estimates the number of subjects shown in the second photographed image.
  • FIG. 1 is a diagram illustrating a functional configuration example of the crowd monitoring apparatus according to the first embodiment.
  • FIG. 2 is a diagram illustrating an internal configuration example of the parameter acquisition unit according to the first embodiment.
  • FIG. 3 is a diagram illustrating an internal configuration example of the number-of-people analysis unit according to the first embodiment.
  • FIG. 4 is a flowchart illustrating an operation example of the crowd monitoring apparatus according to the first embodiment.
  • FIG. 5 is a flowchart illustrating an operation example of the parameter acquisition unit according to the first embodiment.
  • FIG. 6 is a flowchart illustrating an operation example of the number-of-people analysis unit according to the first embodiment.
  • FIG. 7 is a diagram showing an example of a person detection result according to the first embodiment.
  • FIG. 8 is a diagram showing an example of a foreground extraction result according to the first embodiment.
  • FIG. 9 is a diagram showing an example of the information stored for each detected person according to the first embodiment.
  • FIG. 10 is a diagram for explaining a method of estimating the depression angle of the camera according to the first embodiment.
  • FIG. 11 is a diagram for explaining a method of deriving the horizontal line height according to the first embodiment.
  • FIG. 12 is a diagram for explaining a method of deriving the vertical point coordinates according to the first embodiment.
  • FIG. 13 is a diagram showing an example of a crowd foreground image at congestion level 1 according to the first embodiment.
  • FIG. 14 is a diagram showing an example of a crowd foreground image at congestion level 2 according to the first embodiment.
  • FIG. 15 is a diagram showing an example of a crowd foreground image at congestion level 3 according to the first embodiment.
  • FIG. 16 is a diagram showing an example of foreground area amounts of the crowd foreground image at congestion level 1 according to the first embodiment.
  • FIG. 17 is a diagram showing an example of foreground area amounts of the crowd foreground image at congestion level 2 according to the first embodiment.
  • FIG. 18 is a diagram showing an example of foreground area amounts of the crowd foreground image at congestion level 3 according to the first embodiment.
  • FIG. 19 is a diagram for explaining the people-count estimation process according to the first embodiment.
  • FIG. 20 is a diagram for explaining the people-count estimation process according to the first embodiment.
  • FIG. 21 is a diagram for explaining the people-count estimation process according to the first embodiment.
  • FIG. 22 is a diagram illustrating a hardware configuration example of the crowd monitoring apparatus according to the first embodiment.
  • FIG. 23 is a diagram illustrating the relationship between the functional configuration and the hardware configuration of the crowd monitoring apparatus according to the first embodiment.
  • *** Explanation of configuration *** FIG. 1 shows a functional configuration example of the crowd monitoring apparatus 20 according to the first embodiment.
  • FIG. 22 shows a hardware configuration example of the crowd monitoring apparatus 20 according to the first embodiment.
  • the crowd monitoring device 20 corresponds to an image processing device.
  • the operations performed by the crowd monitoring apparatus 20 correspond to an image processing method and an image processing program.
  • the crowd monitoring device 20 is connected to the camera 10.
  • the camera 10 is installed at a position overlooking the shooting space where the subject exists.
  • the shooting space is a space to be monitored by the camera 10.
  • the subject is a person.
  • a plurality of persons existing in the shooting space are also referred to as a crowd.
  • the crowd monitoring device 20 is a computer.
  • the crowd monitoring device 20 includes a processor 1101, a memory 1102, a network interface 1103, and a storage device 1104 as hardware.
  • The storage device 1104 stores programs that realize the functions of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 shown in FIG. 1.
  • the program is loaded from the storage device 1104 into the memory 1102.
  • the processor 1101 reads the program from the memory 1102.
  • the processor 1101 executes the program and performs operations of an image reception decoding unit 201, a changeover switch 202, a mode switching control unit 203, a parameter acquisition unit 204, a number analysis unit 206, and an analysis result output unit 207, which will be described later. .
  • the parameter storage 205 shown in FIG. 1 is realized by the storage device 1104.
  • FIG. 23 shows the relationship between the functional configuration shown in FIG. 1 and the hardware configuration shown in FIG. 22. FIG. 23 shows that the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 are realized by the processor 1101.
  • FIG. 23 shows that the parameter storage 205 is realized by the storage device 1104.
  • the network interface 1103 receives the compressed image stream from the camera 10.
  • the image reception decoding unit 201 decodes the compressed image stream distributed from the camera 10 and converts the compressed image stream into an image frame.
  • the image frame is a photographed image in the photographing space.
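As an illustration of how such a reception and decoding front end might be realized, the following sketch pulls and decodes a compressed camera stream with OpenCV. It is only a hedged example, not part of the patent disclosure; the stream URL and the frame-handling callback are hypothetical placeholders.

```python
# Illustrative sketch only: decode a compressed image stream (e.g. H.264 over RTSP)
# into image frames, roughly the role of the image reception decoding unit 201.
# The URL and the handle_frame callback are hypothetical.
import cv2

def receive_and_decode(stream_url: str, handle_frame) -> None:
    capture = cv2.VideoCapture(stream_url)      # e.g. "rtsp://camera10.example/stream"
    if not capture.isOpened():
        raise RuntimeError(f"could not open stream: {stream_url}")
    try:
        while True:
            ok, frame = capture.read()          # one decoded image frame (BGR ndarray)
            if not ok:                          # stream ended or decode error
                break
            handle_frame(frame)                 # e.g. pass the frame on to the changeover switch
    finally:
        capture.release()
```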
  • the mode switching control unit 203 controls the operation mode of the crowd monitoring device 20.
  • the operation mode of the crowd monitoring device 20 includes a parameter acquisition mode and a people count mode.
  • the mode change control unit 203 outputs a mode control signal to the changeover switch 202.
  • the changeover switch 202 switches the output destination of the image frame in accordance with the mode control signal from the mode switching control unit 203. More specifically, the changeover switch 202 outputs an image frame to the parameter acquisition unit 204 when the operation mode of the crowd monitoring apparatus 20 is the parameter acquisition mode. On the other hand, if the operation mode of the crowd monitoring device 20 is the people count mode, the changeover switch 202 outputs an image frame to the people analysis unit 206.
  • the image frame output from the changeover switch 202 to the parameter acquisition unit 204 corresponds to the first captured image. Further, the image frame output from the changeover switch 202 to the number-of-people analysis unit 206 corresponds to the second captured image.
  • the image frame output from the changeover switch 202 to the number-of-people analysis unit 206 is a captured image in a shooting space having a shooting time different from that of the image frame output from the changeover switch 202 to the parameter acquisition unit 204.
  • When the operation mode of the crowd monitoring device 20 is the parameter acquisition mode, the parameter acquisition unit 204 acquires analysis parameters for the people-count analysis using image frames. More specifically, the parameter acquisition unit 204 estimates the depression angle of the camera 10 as an external parameter of the camera 10. In addition, the parameter acquisition unit 204 extracts a subject image, which is an image of a subject, from the image frame that is the first captured image, and generates a plurality of foreground maps based on the extracted subject image and the estimated depression angle of the camera 10. The parameter acquisition unit 204 generates the plurality of foreground maps by changing the number of subjects present in the shooting space for each foreground map. Each foreground map is divided into a plurality of partial areas.
  • In the foreground map, for each partial area, the number of subjects shown in the partial area (the number of area subjects) and the foreground area amount in the partial area are indicated.
  • the foreground map is an image that is compared with the image frame when the number analysis unit 206 performs the number analysis.
  • the foreground map corresponds to a reference image.
  • the parameter acquisition unit 204 stores the generated foreground maps in the parameter storage 205 as analysis parameters.
  • the parameter acquisition unit 204 corresponds to a reference image generation unit.
  • the process performed by the parameter acquisition unit 204 corresponds to a reference image generation process.
  • the parameter storage 205 stores the analysis parameters generated by the parameter acquisition unit 204.
  • The number-of-people analysis unit 206 performs the people-count analysis when the operation mode of the crowd monitoring device 20 is the people-count mode. More specifically, the number-of-people analysis unit 206 compares the image frame that is the second captured image with the plurality of foreground maps generated by the parameter acquisition unit 204, and estimates the number of subjects shown in the image frame.
  • the number analysis unit 206 corresponds to a subject number estimation unit.
  • the processing performed by the number-of-people analysis unit 206 corresponds to subject number estimation processing.
  • the analysis result output unit 207 outputs the result of the number analysis by the number analysis unit 206 to the outside.
  • FIG. 2 shows an internal configuration example of the parameter acquisition unit 204.
  • the parameter acquisition unit 204 includes a person detection unit 2041, a foreground extraction unit 2042, a depression angle estimation unit 2043, a foreground map generation unit 2044, and a person information storage 2045.
  • the person detection unit 2041 detects an image of a person as a subject from an image frame that is a first photographed image.
  • the foreground extraction unit 2042 extracts the foreground image of the person specified by the person detection unit 2041.
  • the depression angle estimation unit 2043 estimates the depression angle of the camera 10 from the person detection result information and the foreground image of many persons.
  • the foreground map generation unit 2044 generates a plurality of foreground maps from the depression angle of the camera 10, the person detection result information of a large number of persons, and the foreground image. Then, the foreground map generation unit 2044 stores the foreground map in the parameter storage 205 as an analysis parameter.
  • Person information storage 2045 stores person detection result information and foreground images.
  • FIG. 3 shows an internal configuration example of the number of persons analysis unit 206.
  • the number analysis unit 206 includes a foreground extraction unit 2061 and a number estimation unit 2062.
  • the foreground extraction unit 2061 extracts the foreground image from the image frame that is the second captured image.
  • The number-of-people estimation unit 2062 estimates the number of persons from the foreground image using the analysis parameters stored in the parameter storage 205.
  • FIG. 4 is a flowchart showing an operation example of the crowd monitoring device 20.
  • the mode switching control unit 203 sets the operation mode to the parameter acquisition mode (step ST01). That is, the mode switching control unit 203 outputs a mode control signal for notifying the changeover switch 202 of the parameter acquisition mode.
  • The mode switching control unit 203 refers to the parameter storage 205 and confirms whether or not the analysis parameters have already been acquired (step ST02). That is, the mode switching control unit 203 confirms whether or not the analysis parameters are stored in the parameter storage 205.
  • If the analysis parameters have not yet been acquired (NO in step ST02), the analysis parameters are acquired and stored in the parameter storage 205 through steps ST04, ST05, and ST06. In step ST02 after the analysis parameters have been stored in the parameter storage 205, it is determined that the analysis parameters have been acquired, and the process proceeds to step ST03.
  • If the analysis parameters have already been acquired (YES in step ST02), the mode switching control unit 203 changes the operation mode to the people-count mode (step ST03). If the analysis parameters were already saved in the parameter storage 205 before the crowd monitoring device 20 was activated, the determination in step ST02 is YES. Further, as described above, step ST02 after the analysis parameters have been stored in the parameter storage 205 through steps ST04, ST05, and ST06 is also determined as YES.
  • the image reception decoding unit 201 receives a compressed image stream from the camera 10 and decodes at least one image frame of the received compressed image stream (step ST04).
  • The image reception decoding unit 201 receives, as the compressed image stream, image-encoded data compressed by an image compression encoding method such as H.262/MPEG-2 Video, H.264/AVC, H.265/HEVC, or JPEG.
  • The image reception decoding unit 201 also receives, as the compressed image stream, data distributed in a streaming format such as MPEG-2 TS (Moving Picture Experts Group 2 Transport Stream) or RTP/RTSP (Real-time Transport Protocol/Real Time Streaming Protocol).
  • the image receiving / decoding unit 201 may receive, as a compressed image stream, image data encoded by an encoding method other than the above or image data distributed in a distribution format other than the above.
  • the image receiving / decoding unit 201 may receive image data distributed according to an uncompressed transmission standard such as SDI (Serial Digital Interface) or HD (High Definition) -SDI as a compressed image stream.
  • the changeover switch 202 outputs the image frame output from the image reception decoding unit 201 to the parameter acquisition unit 204 or the number of people analysis unit 206 according to the operation mode set by the mode switching control unit 203 (step ST05). That is, the changeover switch 202 outputs an image frame to the parameter acquisition unit 204 if a mode control signal for notifying the parameter acquisition mode is output from the mode change control unit 203. On the other hand, if a mode control signal notifying the number of people count mode is output from the mode switching control unit 203, the changeover switch 202 outputs an image frame to the number of people analysis unit 206.
  • the parameter acquisition unit 204 and the number of people analysis unit 206 are operated exclusively. However, a configuration may be adopted in which the parameter acquisition unit 204 is simultaneously operated while the number of people analysis unit 206 is operated, and the analysis parameters are updated as needed.
  • The parameter acquisition unit 204 performs the parameter acquisition process (step ST06). That is, the parameter acquisition unit 204 sequentially processes the image frames from the changeover switch 202, generates analysis parameters, and stores the generated analysis parameters in the parameter storage 205. The parameter acquisition unit 204 generates the analysis parameters by processing image frames for a certain period. When the generation of the analysis parameters is completed, the parameter acquisition unit 204 stores the analysis parameters in the parameter storage 205. When the analysis parameters have been stored in the parameter storage 205, the process returns to step ST02. Details of the operation of the parameter acquisition unit 204 will be described later.
  • the people analysis unit 206 performs a people count process (step ST07). That is, the number analysis unit 206 analyzes the image frame from the changeover switch 202 using the analysis parameter stored in the parameter storage 205, thereby analyzing the number of persons shown in the image frame. Details of the operation of the number of persons analysis unit 206 will be described later.
  • the analysis result output unit 207 outputs the number of people analysis result indicating the analysis result of the number of people analysis unit 206 to the outside (step ST08).
  • The analysis result output unit 207 outputs the people-count analysis result by, for example, displaying it on a monitor, writing it to a log file, outputting it to an externally connected device, or transmitting it over a network. The analysis result output unit 207 may also output the people-count result in another format. In addition, the result may be output to the outside every time a people-count analysis result is produced by the number-of-people analysis unit 206, or may be output after the analysis results of a specific period or a specific number of analysis results have been aggregated or statistically processed. After step ST08, the process returns to step ST04 to process the next image frame.
  • FIG. 5 is a flowchart showing the operation of the parameter acquisition unit 204.
  • the person detection unit 2041 performs person detection processing on the input image frame, and stores person detection result information in the person information storage 2045 (step ST11).
  • The person detection unit 2041 detects persons who satisfy the following conditions: the person is in a standing posture, the contact position between the feet and the ground and the top of the head are visible, and no occlusion with another person has occurred or the proportion of occlusion with other persons is small.
  • The person detection unit 2041 outputs person detection result information to the person information storage 2045 for each detected person.
  • The person detection result information includes, for example, information indicating a rectangular, elliptical, or other-shaped frame that surrounds the detected person without excess or deficiency, coordinate information indicating the contact position between the detected person's feet and the ground, coordinate information indicating the top of the person's head, and the moving speed of the ground contact point.
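As a minimal sketch of how this person detection result information could be held in memory, the record below groups the fields listed above; the field names are assumptions chosen for illustration, not terms from the patent.

```python
# Minimal sketch of a person detection result record (field names are assumed).
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class PersonDetection:
    frame_xywh: Tuple[int, int, int, int]   # rectangle (x, y, width, height) tightly surrounding the person
    ground_point: Tuple[float, float]       # image coordinates of the feet/ground contact point
    head_point: Tuple[float, float]         # image coordinates of the top of the head
    ground_velocity: Tuple[float, float]    # moving speed of the contact point (pixels per frame)
    foreground_mask: Optional[np.ndarray] = None   # per-person foreground image from steps ST12/ST13
```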
  • FIG. 7 shows an image of a person detection result by the person detection unit 2041.
  • As the person detection technique, a technique using still-image-based feature quantities such as HOG (Histogram of Oriented Gradients), ICF (Integral Channel Features), or ACF (Aggregate Channel Features) may be used.
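One possible realization of such a still-image-based detector is the HOG pedestrian detector bundled with OpenCV, sketched below. This is only an example of the class of techniques mentioned above; the patent does not prescribe a specific detector, and the confidence threshold is an arbitrary example value.

```python
# Example only: HOG-based person detection with OpenCV's default people detector.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(frame):
    """Return (x, y, w, h) rectangles around detected persons in one image frame."""
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
    return [tuple(r) for r, w in zip(rects, weights) if float(w) > 0.5]  # 0.5 is an example threshold
```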
  • The person detection unit 2041 initially uses a person detection method that does not assume the size of the persons shown in the image frame. After performing person detection a plurality of times, when a tendency of the person scale for each image region has been obtained, the person detection unit 2041 may switch to a person detection method premised on that scale, thereby speeding up the processing.
  • The person detection unit 2041 acquires the information on the moving speed of the ground contact point by, for example, tracking feature points across a plurality of image frames or tracking the person.
  • Alternatively, the person detection unit 2041 may acquire the moving speed information of the ground contact point by using motion vector information as it is.
  • The parameter acquisition unit 204 may be configured to save all detection results in the person information storage 2045 as person detection result information.
  • Alternatively, the parameter acquisition unit 204 may be configured to keep the amount of data accumulated in the person information storage 2045 below a fixed level.
  • For example, when the person detection unit 2041 detects another person again in an image area in which a person has already been detected, the person detection unit 2041 discards one of the detection results, thereby keeping the amount of data stored in the person information storage 2045 below a certain level.
  • Alternatively, the person detection unit 2041 may take the average of the detection results and integrate them, thereby keeping the amount of data stored in the person information storage 2045 below a certain level.
  • The foreground extraction unit 2042 performs foreground extraction within the area of the frame surrounding each detected person, and stores the foreground extraction result image in the person information storage 2045 in association with the person detection result information (steps ST12 and ST13).
  • the foreground image extracted by the foreground extraction unit 2042 corresponds to a subject image.
  • FIG. 8 shows an image of the foreground extraction result.
  • FIG. 9 shows a pair of person detection result information and a person foreground image corresponding to each person detection result information stored in the person information storage 2045.
  • the person detection result information and the foreground image are collectively referred to as person information.
  • The foreground extraction unit 2042 extracts a foreground image using, for example, a background subtraction method, an adaptive background subtraction method, or a dense optical flow derivation algorithm.
  • The background subtraction method is a method of registering a background image in advance and calculating the difference between the input image and the background image.
  • The adaptive background subtraction method is a method of automatically updating the background image from continuously input image frames using a model such as MOG (Mixture of Gaussians).
  • the dense optical flow derivation algorithm is a method for acquiring motion information in an image in units of pixels.
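As a hedged sketch of the adaptive background subtraction variant, the snippet below uses OpenCV's mixture-of-Gaussians background model (MOG2); the history length and thresholds are example values only, not values from the patent.

```python
# Sketch: adaptive background subtraction with a mixture-of-Gaussians model (MOG2).
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

def extract_foreground(frame):
    """Return a binary foreground mask (255 = foreground pixel) for one image frame."""
    mask = subtractor.apply(frame)                               # 0 background, 127 shadow, 255 foreground
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)   # discard shadow pixels
    return mask
```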
  • The foreground extraction unit 2042 determines whether sufficient person information has been acquired by the processes in steps ST11 to ST13 (step ST14). If sufficient person information has been acquired, the process proceeds to step ST15. On the other hand, if sufficient person information has not been acquired, the process ends. Whether or not sufficient person information has been obtained is determined based on measures such as the number of pieces of person information acquired so far and the elapsed time since the start of the parameter acquisition process. Alternatively, the parameter estimation processing from step ST15 onward may be performed, and whether sufficient person information has been obtained may be determined from the reliability of the parameters estimated by that processing.
  • the depression angle estimation unit 2043 estimates the depression angle of the camera 10 using the person information stored in the person information storage 2045 (step ST15).
  • 10, 11 and 12 show an outline of the depression angle estimation process.
  • It is assumed that the camera is installed facing downward from the horizontal.
  • It is assumed that the lens distortion of the camera has been corrected, so that the camera can be regarded as a pinhole camera.
  • It is assumed that the internal parameters of the camera are known, so that the angle between the optical axis and the viewing direction of each coordinate in the image is known. In this case, if the image coordinates corresponding to the horizontal line height are known, the angle θ [rad] formed by the optical axis direction and the horizontal direction can be uniquely obtained.
  • Similarly, if the image coordinates of the vertical point are known, the angle [rad] formed by the optical axis direction and the vertical direction can be uniquely obtained.
  • FIG. 11 shows an image of deriving the horizontal line height.
  • The depression angle estimation unit 2043 extends the movement direction of the ground contact point of each moving person based on the person information, and regards the point where the extension lines of the movement directions of the plurality of persons best intersect as a point on the horizontal line. This is based on the premise that a plurality of pedestrians are walking in parallel and straight ahead. The lines do not necessarily intersect at one point, owing to errors in detecting the movement direction and to the influence of pedestrians who are not walking in parallel. For this reason, the depression angle estimation unit 2043 uses a large amount of person information to estimate the most likely value of the horizontal line height.
  • FIG. 12 shows an image of deriving the vertical point coordinates.
  • Based on the person information, the depression angle estimation unit 2043 extends the line connecting the coordinates of the top of the head and the ground contact point, that is, the body normal, and regards the point where the normals of the plurality of persons best intersect as the vertical point. This is based on the premise that the pedestrians are standing upright with respect to the ground. The lines do not necessarily intersect at a single point, owing to errors in detecting the coordinates of the top of the head and the ground contact point and to the influence of pedestrians who are not standing upright. For this reason, the depression angle estimation unit 2043 uses a large amount of person information to estimate the most likely value of the vertical point coordinates.
  • The method using the horizontal line height is suitable when the depression angle of the camera is shallow, that is, when the optical axis is close to the horizontal direction.
  • The method using the vertical point is suitable when the depression angle of the camera is deep, that is, when the optical axis is close to the vertical direction.
  • The depression angle estimation unit 2043 may select, from the depression angles obtained by the two methods, the one obtained by the more suitable method, or may adopt the average of the two. Alternatively, the depression angle estimation unit 2043 may use another depression angle estimation method.
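A minimal sketch of the horizon-based variant is given below: each tracked pedestrian contributes a line (its ground contact point extended along its movement direction), the least-squares best intersection of those lines is taken as a point on the horizon, and under the pinhole assumption the depression angle follows from the focal length f and the principal point ordinate cy. The symbols f and cy and the least-squares formulation are assumptions used for illustration, not the patent's prescribed equations.

```python
# Sketch of the horizontal-line-based depression angle estimate (assumed formulation).
import numpy as np

def best_intersection(points, directions):
    """Least-squares 'best intersection' of 2D lines given by (point p_i, direction d_i)."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, d in zip(points, directions):
        d = np.asarray(d, dtype=float)
        d /= np.linalg.norm(d)
        P = np.eye(2) - np.outer(d, d)        # projector onto the normal of this line
        A += P
        b += P @ np.asarray(p, dtype=float)
    return np.linalg.solve(A, b)              # image coordinates of the estimated horizon point

def depression_angle_from_horizon(y_horizon, cy, f):
    """Depression angle [rad] of the optical axis below the horizontal (pinhole model)."""
    return np.arctan2(cy - y_horizon, f)       # image y grows downward, so the horizon lies above cy
```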
  • The foreground map generation unit 2044 generates the foreground maps to be used by the number-of-people analysis unit 206, using the depression angle information obtained by the depression angle estimation unit 2043 and the person information stored in the person information storage 2045 (step ST16).
  • the foreground map generation unit 2044 first calculates a two-dimensional coordinate system on the road surface (hereinafter referred to as a road surface coordinate system) from the depression angle information and the person contact point movement information. For example, the foreground map generation unit 2044 defines the main movement direction of the person as the X direction, and defines the direction orthogonal to the X direction as the Y direction.
  • the foreground map generation unit 2044 obtains an absolute scale of road surface coordinates that are coordinates in the road surface coordinate system.
  • For example, the foreground map generation unit 2044 regards the movement width per unit time in road surface coordinates as corresponding to an average pedestrian movement speed given in advance, and thereby obtains the absolute scale of the road surface coordinates.
  • The foreground map generation unit 2044 combines the per-person foreground images acquired in step ST13 to generate a foreground image of the crowd that would be observed at a specific congestion level (hereinafter referred to as a crowd foreground image).
  • 13, 14 and 15 show examples of the crowd foreground image.
  • FIG. 13 shows an example of the crowd foreground image at congestion level 1.
  • FIG. 14 shows an example of the crowd foreground image at congestion level 2.
  • FIG. 15 shows an example of the crowd foreground image at congestion level 3.
  • As shown in FIGS. 13 to 15, the foreground map generation unit 2044 generates a plurality of crowd foreground images in which a plurality of persons exist in the shooting space, changing the number of persons in the shooting space for each crowd foreground image. As shown in FIGS. 13 to 15, occlusion occurs between persons in the crowd foreground images depending on the density and arrangement of the persons.
  • The foreground map generation unit 2044 converts the image coordinates of the ground contact points in the person information acquired so far into road surface coordinates, and then generates a crowd foreground image by pasting foreground images at a plurality of positions such that the density of persons becomes a predetermined value in the road surface coordinate system.
  • For example, assume that the road surface coordinate grid shown in FIG. 13 covers 4 square meters. If the crowd density at congestion level 1 is 1 [person/square meter], the foreground map generation unit 2044 randomly arranges 4 persons in the grid. If the crowd density at congestion level 2 is 2 [persons/square meter], the foreground map generation unit 2044 randomly arranges 8 persons in the grid. If the crowd density at congestion level 3 is 4 [persons/square meter], the foreground map generation unit 2044 randomly arranges 16 persons in the grid. In each case, the foreground map generation unit 2044 randomly arranges the persons while keeping them separated by a distance equal to or greater than the average shoulder width of a person and the average personal space radius set for each density, so that the persons are not placed too close to each other.
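The following sketch shows one way to realize this random arrangement step: persons are placed in the road-surface grid until the count implied by the congestion level's density is reached, while any two persons are kept at least a minimum distance apart. The grid size, the minimum distance, and the rejection-sampling loop are example choices, not values from the patent.

```python
# Sketch: randomly arrange persons at a given density with a minimum separation.
import random
import math

def place_crowd(density_per_m2, grid_w_m=2.0, grid_h_m=2.0, min_dist_m=0.45, max_tries=10000):
    """Return road-surface (x, y) positions for one crowd foreground image."""
    target = round(density_per_m2 * grid_w_m * grid_h_m)   # e.g. 2 persons/m^2 on 4 m^2 -> 8 persons
    placed, tries = [], 0
    while len(placed) < target and tries < max_tries:
        tries += 1
        x, y = random.uniform(0, grid_w_m), random.uniform(0, grid_h_m)
        if all(math.hypot(x - px, y - py) >= min_dist_m for px, py in placed):
            placed.append((x, y))
    return placed

positions = place_crowd(density_per_m2=2.0)    # congestion level 2 in the example above
```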
  • As shown in FIGS. 16, 17, and 18, the foreground map generation unit 2044 divides the crowd foreground image into a plurality of partial areas.
  • The foreground area amount per person (example: 150 pixels/person) is calculated.
  • In FIGS. 16 to 18, foreground area amounts are not displayed for some persons for reasons of drawing.
  • The foreground map generation unit 2044 calculates, for each partial area, the foreground area amount included in that partial area. For example, in the partial area denoted by reference numeral 161 in FIG. 16, 90 pixels/partial area is obtained.
  • The foreground map generation unit 2044 can also obtain the true value of the number of persons appearing in each partial area.
  • The foreground map generation unit 2044 counts the number of persons in each partial area. When the entire foreground image of one person is included in one partial area, the number of persons counted for that partial area is one (1 person/partial area). When only a part of a person's foreground image is included in a partial area, the foreground map generation unit 2044 divides the area of the part of the person included in the partial area by the entire area of that person's foreground image, thereby obtaining the number of persons included in the partial area in decimal units.
  • The foreground area amount of the persons included in a partial area (for example, 90 pixels/partial area) is referred to as the area person area amount.
  • The number of persons included in a partial area (for example, 1 person/partial area or 0.6 person/partial area) is referred to as the number of area persons.
  • A crowd foreground image in which the area person area amount and the number of area persons are attached to each partial area is referred to as a foreground map.
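A minimal sketch of how these two values could be computed is given below: for every partial area, the foreground pixels of all placed persons are summed (area person area amount), and each person contributes a fractional head count equal to the share of that person's foreground mask falling in the area (number of area persons). The cell size and the use of full-frame binary masks per person are assumptions for illustration.

```python
# Sketch: per-partial-area statistics of a generated crowd foreground image.
import numpy as np

def build_foreground_map(person_masks, cell=64):
    """person_masks: full-frame binary (0/1) masks, one per placed person.
    Returns (area person area amount, number of area persons) as 2-D arrays."""
    h, w = person_masks[0].shape
    rows, cols = h // cell, w // cell
    area = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    for mask in person_masks:
        total = mask.sum()                                      # whole-person foreground area in pixels
        for r in range(rows):
            for c in range(cols):
                part = mask[r*cell:(r+1)*cell, c*cell:(c+1)*cell].sum()
                area[r, c] += part
                if total > 0:
                    count[r, c] += part / total                 # e.g. 0.6 person if 60% of the mask is here
    return area, count
```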
  • the foreground map corresponds to the reference image.
  • The foreground map generation unit 2044 may generate a plurality of crowd foreground images for each congestion level. That is, the foreground map generation unit 2044 may generate a plurality of crowd foreground images for one congestion level (for example, congestion level 3) by randomly changing the arrangement of persons. When a plurality of crowd foreground images are generated for one congestion level in this way, the foreground map generation unit 2044 may store all of the plurality of foreground maps obtained from the plurality of crowd foreground images in the parameter storage 205.
  • Alternatively, the foreground map generation unit 2044 may integrate the plurality of foreground maps obtained from the plurality of crowd foreground images into one foreground map by taking their average values, and store only that one foreground map in the parameter storage 205. In this way, an effect of being able to cope with various arrangement patterns of persons can be expected.
  • The foreground map generation unit 2044 stores the generated foreground maps in the parameter storage 205 (step ST17).
  • the parameter acquisition process is completed by storing the foreground map.
  • FIG. 6 is a flowchart showing the operation of the number analysis unit 206.
  • the foreground extraction unit 2061 performs foreground extraction on the entire input image frame or the attention area (step ST21). This foreground extraction process is the same as the foreground extraction process of the foreground extraction unit 2042 of the parameter acquisition unit 204.
  • The number-of-people estimation unit 2062 estimates the number of persons from the foreground image extracted by the foreground extraction unit 2061, using the foreground maps for the respective congestion levels stored in the parameter storage 205 (step ST22). FIGS. 19, 20, and 21 illustrate the people-count estimation process.
  • As shown in FIG. 19, the number-of-people estimation unit 2062 extracts a foreground image from the image frame. Then, the number-of-people estimation unit 2062 divides the foreground image into the same partial areas as the foreground maps.
  • the partial area in the foreground image shown in FIG. 19 is hereinafter referred to as an estimation target partial area.
  • the number-of-people estimation unit 2062 determines the foreground area amount for each estimation target partial region.
  • For each estimation target partial area, the number-of-people estimation unit 2062 extracts the partial area at the same position from each of the foreground maps for the respective congestion levels, and compares the foreground area amount in each extracted partial area with the foreground area amount in the estimation target partial area. The number-of-people estimation unit 2062 then selects the foreground map partial area whose foreground area amount is most similar to the foreground area amount in the estimation target partial area. For example, for the estimation target partial area in the first row and first column (upper left corner) of the foreground image of FIG. 20, the number-of-people estimation unit 2062 extracts the partial area in the first row and first column from each of the foreground maps of congestion levels 1 to 3.
  • The number-of-people estimation unit 2062 compares the foreground area amounts in the three partial areas extracted from the foreground maps of congestion levels 1 to 3 with the foreground area amount in the estimation target partial area of FIG. 20. The number-of-people estimation unit 2062 then selects, from the three extracted partial areas, the partial area to which the foreground area amount most similar to the foreground area amount in the estimation target partial area of FIG. 20 is attached.
  • The number-of-people estimation unit 2062 performs the above processing for all estimation target partial areas in FIG. 20.
  • The partial area selected for an estimation target partial area is referred to as a selected partial area. In this way, the number-of-people estimation unit 2062 selects a selected partial area from the foreground maps of congestion levels 1 to 3 for each estimation target partial area.
  • The congestion level of the selected partial area may differ from one estimation target partial area to another. For example, the partial area of the congestion level 3 foreground map may be selected for the estimation target partial area in the nth row and mth column of the foreground image of FIG. 20, the partial area of the congestion level 2 foreground map for the estimation target partial area in the nth row and (m+1)th column, and the partial area of the congestion level 1 foreground map for the estimation target partial area in the nth row and (m+2)th column. Next, the number-of-people estimation unit 2062 obtains the number of persons included in each estimation target partial area.
  • The number-of-people estimation unit 2062 divides the foreground area amount of the estimation target partial area by the area person area amount of the selected partial area, and multiplies the quotient by the number of area persons of the selected partial area. That is, the number-of-people estimation unit 2062 calculates (foreground area amount of the estimation target partial area) / (area person area amount of the selected partial area) × (number of area persons of the selected partial area) to obtain the number of persons included in the estimation target partial area. Then, the number-of-people estimation unit 2062 sums the numbers of persons obtained for the estimation target partial areas to obtain the number of persons included in the entire image frame or the attention area.
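A minimal sketch of this estimation rule is shown below: for every estimation target partial area, the partial area with the most similar stored foreground area amount is selected across the congestion-level foreground maps, and the contribution (observed foreground area) / (area person area amount) x (number of area persons) is accumulated. The array layout and names are assumptions, matching the earlier sketch for building the foreground maps.

```python
# Sketch of the people-count estimation over the partial areas (assumed data layout).
import numpy as np

def estimate_people(observed_area, foreground_maps):
    """observed_area: 2-D array of foreground area amounts of the estimation target partial areas.
    foreground_maps: list of (area person area amount, number of area persons) array pairs,
    one pair per congestion level, as built by build_foreground_map above."""
    total = 0.0
    rows, cols = observed_area.shape
    for r in range(rows):
        for c in range(cols):
            # Select the congestion level whose stored foreground area amount is most similar.
            ref_area, ref_count = min(
                ((m[0][r, c], m[1][r, c]) for m in foreground_maps),
                key=lambda pair: abs(pair[0] - observed_area[r, c]),
            )
            if ref_area > 0:
                # (observed foreground area) / (area person area amount) x (number of area persons)
                total += observed_area[r, c] / ref_area * ref_count
    return total
```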
  • The parameter acquisition unit 204 uses actually observed foreground images to determine the area person area amount and the number of area persons with occlusion taken into account. Since the number-of-people analysis unit 206 estimates the number of persons by applying the area person area amount and the number of area persons to the foreground image to be estimated, the number of persons can be estimated with high accuracy without obtaining the average size of persons in advance. In addition, in the present embodiment, the parameter acquisition unit 204 estimates the depression angle of the camera, which is an external parameter of the camera, so that it is not necessary to measure the external parameters of the camera in advance.
  • the camera external parameters are, for example, parameters such as the depression angle of the camera and the distance from each of the scattered people to the camera.
  • a person has been described as an example of a subject, but the subject is not limited to a person.
  • the subject may be a living body such as a wild animal or an insect, or a moving body other than a person such as a vehicle.
  • The processor 1101 illustrated in FIG. 22 is an IC (Integrated Circuit) that performs processing.
  • the processor 1101 is a CPU (Central Processing Unit), a DSP (Digital Signal Processor), or the like.
  • a memory 1102 illustrated in FIG. 22 is a RAM (Random Access Memory).
  • the storage device 1104 illustrated in FIG. 22 is a ROM (Read Only Memory), a flash memory, an HDD (Hard Disk Drive), or the like.
  • the network interface 1103 illustrated in FIG. 22 includes a receiver that receives data and a transmitter that transmits data.
  • the network interface 1103 is, for example, a communication chip or a NIC (Network Interface Card).
  • the storage device 1104 also stores an OS (Operating System). Then, at least a part of the OS is executed by the processor 1101.
  • The processor 1101 executes at least part of the OS and, while doing so, executes the programs that implement the functions of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207. When the processor 1101 executes the OS, task management, memory management, file management, communication control, and the like are performed.
  • The programs for realizing the functions of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 may be stored in a portable storage medium such as a magnetic disk, a flexible disk, an optical disk, or a compact disc.
  • the crowd monitoring device 20 may be realized by an electronic circuit such as a logic IC (Integrated Circuit), a GA (Gate Array), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).
  • the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number of people analysis unit 206, and the analysis result output unit 207 are each realized as part of an electronic circuit.
  • the processor and the electronic circuit are also collectively referred to as a processing circuit.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Image Processing (AREA)
  • Studio Devices (AREA)
  • Studio Circuits (AREA)

Abstract

A parameter acquisition unit (204): extracts a subject image, which is an image of a subject, from a first photographic image, which is a photographic image in which a photographic space in which the subject is present is photographed and the subject is projected; and, on the basis of the extracted subject images, generates a plurality of reference images in which the plurality of subjects are present in the photographic space by varying the number of subjects present in the photographic space for each reference image. A headcount analysis unit (206) compares a second photographic image, which is a photographic image of a photographic space in which the photographing time differs from the first photographic image, with the plurality of reference images, and estimates the number of subjects projected in the second photographic image.

Description

Image processing apparatus, image processing method, and image processing program
The present invention relates to an image processing apparatus, an image processing method, and an image processing program.
Conventionally, techniques for estimating the number of people, their density, or their flow rate from a camera image have been known. Techniques for estimating the number of people from a camera image include a method of counting people based on person detection and a method of estimating the number of people from the foreground area. The former has an advantage in accuracy when the crowd density is low, but when the crowd density is high, the detection accuracy drops under the influence of occlusion between persons. The latter is inferior to the former in analysis accuracy at low density, but can be processed with a small amount of computation even at high density.
For example, Patent Document 1 discloses the latter technique. Specifically, in Patent Document 1, the foreground extracted by background subtraction from an image of a crowd is designated as a person region, and the number of people in the image is estimated from the area of the person region. In Patent Document 1, CG (Computer Graphics) models imitating a crowd are generated in advance for a plurality of congestion levels. A relational expression between the foreground area and the number of people that takes occlusion within the crowd into account is then derived, so that the number of people can be estimated while suppressing the influence of occlusion.
JP 2005-25328 A
The foreground area per person acquired by background subtraction varies depending on the dimensions of the actually observed persons and on the accuracy of foreground extraction under, for example, the lighting conditions. For this reason, the technique of Patent Document 1 has the problem that the CG model generated in advance does not match the foreground extracted from the actual camera image, so that an error occurs in the people-count estimation result.
The main object of the present invention is to solve the above problem. Specifically, a main object of the present invention is to improve the accuracy of estimating the number of subjects shown in a captured image.
An image processing apparatus according to the present invention includes:
a reference image generation unit that extracts a subject image, which is an image of a subject, from a first photographed image, which is a photographed image of a photographing space in which the subject is present, and that generates, based on the extracted subject image, a plurality of reference images in which a plurality of subjects are present in the photographing space, changing the number of subjects present in the photographing space for each reference image; and
a subject number estimation unit that compares a second photographed image, which is a photographed image of the photographing space taken at a photographing time different from that of the first photographed image, with the plurality of reference images, and estimates the number of subjects shown in the second photographed image.
According to the present invention, it is possible to improve the accuracy of estimating the number of subjects shown in a captured image.
FIG. 1 is a diagram illustrating a functional configuration example of the crowd monitoring apparatus according to the first embodiment.
FIG. 2 is a diagram illustrating an internal configuration example of the parameter acquisition unit according to the first embodiment.
FIG. 3 is a diagram illustrating an internal configuration example of the number-of-people analysis unit according to the first embodiment.
FIG. 4 is a flowchart illustrating an operation example of the crowd monitoring apparatus according to the first embodiment.
FIG. 5 is a flowchart illustrating an operation example of the parameter acquisition unit according to the first embodiment.
FIG. 6 is a flowchart illustrating an operation example of the number-of-people analysis unit according to the first embodiment.
FIG. 7 is a diagram showing an example of a person detection result according to the first embodiment.
FIG. 8 is a diagram showing an example of a foreground extraction result according to the first embodiment.
FIG. 9 is a diagram showing an example of the information stored for each detected person according to the first embodiment.
FIG. 10 is a diagram for explaining a method of estimating the depression angle of the camera according to the first embodiment.
FIG. 11 is a diagram for explaining a method of deriving the horizontal line height according to the first embodiment.
FIG. 12 is a diagram for explaining a method of deriving the vertical point coordinates according to the first embodiment.
FIG. 13 is a diagram showing an example of a crowd foreground image at congestion level 1 according to the first embodiment.
FIG. 14 is a diagram showing an example of a crowd foreground image at congestion level 2 according to the first embodiment.
FIG. 15 is a diagram showing an example of a crowd foreground image at congestion level 3 according to the first embodiment.
FIG. 16 is a diagram showing an example of foreground area amounts of the crowd foreground image at congestion level 1 according to the first embodiment.
FIG. 17 is a diagram showing an example of foreground area amounts of the crowd foreground image at congestion level 2 according to the first embodiment.
FIG. 18 is a diagram showing an example of foreground area amounts of the crowd foreground image at congestion level 3 according to the first embodiment.
FIG. 19 is a diagram for explaining the people-count estimation process according to the first embodiment.
FIG. 20 is a diagram for explaining the people-count estimation process according to the first embodiment.
FIG. 21 is a diagram for explaining the people-count estimation process according to the first embodiment.
FIG. 22 is a diagram illustrating a hardware configuration example of the crowd monitoring apparatus according to the first embodiment.
FIG. 23 is a diagram illustrating the relationship between the functional configuration and the hardware configuration of the crowd monitoring apparatus according to the first embodiment.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description of the embodiments and in the drawings, the same reference numerals denote the same or corresponding parts.
Embodiment 1.
*** Explanation of configuration ***
FIG. 1 shows a functional configuration example of the crowd monitoring apparatus 20 according to the first embodiment.
FIG. 22 shows a hardware configuration example of the crowd monitoring apparatus 20 according to the first embodiment.
The crowd monitoring device 20 corresponds to an image processing device. The operations performed by the crowd monitoring apparatus 20 correspond to an image processing method and an image processing program.
As shown in FIG. 1, the crowd monitoring device 20 is connected to the camera 10.
The camera 10 is installed at a position overlooking the shooting space where the subject exists. The shooting space is the space to be monitored by the camera 10. In the present embodiment, the subject is a person. In the present embodiment, it is assumed that a plurality of persons exist in the shooting space. Hereinafter, the plurality of persons existing in the shooting space are also referred to as a crowd.
Before describing the functional configuration example of the crowd monitoring device 20 in FIG. 1, a hardware configuration example of the crowd monitoring device 20 will be described with reference to FIG. 22.
The crowd monitoring device 20 is a computer.
The crowd monitoring device 20 includes a processor 1101, a memory 1102, a network interface 1103, and a storage device 1104 as hardware.
The storage device 1104 stores programs that realize the functions of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 shown in FIG. 1. The programs are loaded from the storage device 1104 into the memory 1102. The processor 1101 reads the programs from the memory 1102. The processor 1101 then executes the programs and performs the operations of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207, which will be described later.
The parameter storage 205 shown in FIG. 1 is realized by the storage device 1104.
FIG. 23 shows the relationship between the functional configuration shown in FIG. 1 and the hardware configuration shown in FIG. 22. FIG. 23 shows that the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 are realized by the processor 1101. FIG. 23 also shows that the parameter storage 205 is realized by the storage device 1104.
The network interface 1103 receives the compressed image stream from the camera 10.
Next, a functional configuration example of the crowd monitoring apparatus 20 shown in FIG. 1 will be described.
The image reception decoding unit 201 decodes the compressed image stream distributed from the camera 10 and converts it into image frames. An image frame is a photographed image of the shooting space.
The mode switching control unit 203 controls the operation mode of the crowd monitoring apparatus 20. The operation modes of the crowd monitoring apparatus 20 are a parameter acquisition mode and a people count mode. The mode switching control unit 203 outputs a mode control signal to the changeover switch 202.
The changeover switch 202 switches the output destination of the image frame according to the mode control signal from the mode switching control unit 203. More specifically, if the operation mode of the crowd monitoring apparatus 20 is the parameter acquisition mode, the changeover switch 202 outputs the image frame to the parameter acquisition unit 204. If the operation mode of the crowd monitoring apparatus 20 is the people count mode, the changeover switch 202 outputs the image frame to the number-of-people analysis unit 206. The image frame output from the changeover switch 202 to the parameter acquisition unit 204 corresponds to a first captured image, and the image frame output from the changeover switch 202 to the number-of-people analysis unit 206 corresponds to a second captured image. The image frame output to the number-of-people analysis unit 206 is a photographed image of the shooting space whose shooting time differs from that of the image frame output to the parameter acquisition unit 204.
When the operation mode of the crowd monitoring apparatus 20 is the parameter acquisition mode, the parameter acquisition unit 204 acquires analysis parameters for the number-of-people analysis using image frames.
More specifically, the parameter acquisition unit 204 estimates the depression angle of the camera 10 as an external parameter of the camera 10. The parameter acquisition unit 204 also extracts subject images, which are images of the subjects, from the image frame serving as the first captured image, and generates a plurality of foreground maps based on the extracted subject images and the estimated depression angle of the camera 10. The parameter acquisition unit 204 generates the plurality of foreground maps while varying, for each foreground map, the number of subjects existing in the shooting space. Each foreground map is divided into a plurality of partial areas. For each partial area, the foreground map indicates the number of subjects shown in the partial area (the number of area subjects) and the foreground area amount in the partial area. A foreground map is an image that is compared with an image frame when the number-of-people analysis unit 206 performs the number-of-people analysis. The foreground map corresponds to a reference image.
The parameter acquisition unit 204 stores the generated foreground maps in the parameter storage 205 as analysis parameters.
The parameter acquisition unit 204 corresponds to a reference image generation unit, and the processing performed by the parameter acquisition unit 204 corresponds to reference image generation processing.
The parameter storage 205 stores the analysis parameters generated by the parameter acquisition unit 204.
The number-of-people analysis unit 206 performs the number-of-people analysis when the operation mode of the crowd monitoring apparatus 20 is the people count mode.
More specifically, the number-of-people analysis unit 206 compares the image frame serving as the second captured image with the plurality of foreground maps generated by the parameter acquisition unit 204, and estimates the number of subjects shown in the image frame.
The number-of-people analysis unit 206 corresponds to a subject number estimation unit, and the processing performed by the number-of-people analysis unit 206 corresponds to subject number estimation processing.
The analysis result output unit 207 outputs the result of the number-of-people analysis by the number-of-people analysis unit 206 to the outside.
FIG. 2 shows an internal configuration example of the parameter acquisition unit 204.
The parameter acquisition unit 204 includes a person detection unit 2041, a foreground extraction unit 2042, a depression angle estimation unit 2043, a foreground map generation unit 2044, and a person information storage 2045.
The person detection unit 2041 detects images of persons, which are the subjects, from the image frame serving as the first captured image.
The foreground extraction unit 2042 extracts the foreground image of each person detected by the person detection unit 2041.
The depression angle estimation unit 2043 estimates the depression angle of the camera 10 from the person detection result information and foreground images of many persons.
The foreground map generation unit 2044 generates a plurality of foreground maps from the depression angle of the camera 10, the person detection result information of many persons, and the foreground images.
The foreground map generation unit 2044 then stores the foreground maps in the parameter storage 205 as analysis parameters.
The person information storage 2045 stores the person detection result information and the foreground images.
FIG. 3 shows an internal configuration example of the number-of-people analysis unit 206.
The number-of-people analysis unit 206 includes a foreground extraction unit 2061 and a number-of-people estimation unit 2062.
The foreground extraction unit 2061 extracts a foreground image from the image frame serving as the second captured image.
The number-of-people estimation unit 2062 estimates the number of people from the foreground image using the analysis parameters stored in the parameter storage 205.
*** Explanation of operation ***
Next, an operation example of the crowd monitoring apparatus 20 according to the first embodiment will be described.
FIG. 4 is a flowchart showing an operation example of the crowd monitoring apparatus 20.
First, immediately after the crowd monitoring apparatus 20 is activated, the mode switching control unit 203 sets the operation mode to the parameter acquisition mode (step ST01). That is, the mode switching control unit 203 outputs to the changeover switch 202 a mode control signal notifying the parameter acquisition mode.
Next, the mode switching control unit 203 refers to the parameter storage 205 and checks whether the analysis parameters have already been acquired (step ST02). That is, the mode switching control unit 203 checks whether analysis parameters are stored in the parameter storage 205.
If the analysis parameters have not yet been acquired (NO in step ST02), the analysis parameters are stored in the parameter storage 205 through steps ST04, ST05, and ST06.
In step ST02 after the analysis parameters have been stored in the parameter storage 205, it is determined that the analysis parameters have been acquired, and the process proceeds to step ST03.
On the other hand, if the analysis parameters have already been acquired (YES in step ST02), the mode switching control unit 203 changes the operation mode to the people count mode (step ST03).
Step ST02 is determined as YES when analysis parameters were already stored in the parameter storage 205 before the crowd monitoring apparatus 20 was activated. As described above, step ST02 is also determined as YES after the analysis parameters have been stored in the parameter storage 205 through steps ST04, ST05, and ST06.
Next, the image reception decoding unit 201 receives a compressed image stream from the camera 10 and decodes at least one image frame of the received compressed image stream (step ST04). As the compressed image stream, the image reception decoding unit 201 receives, for example, encoded image data compressed by an image compression encoding scheme such as H.262/MPEG-2 Video, H.264/AVC, H.265/HEVC, or JPEG. As the compressed image stream, the image reception decoding unit 201 may also receive image data delivered over IP by an image delivery protocol such as MPEG-2 TS (Moving Picture Experts Group 2 Transport Stream), RTP/RTSP (Real-time Transport Protocol/Real Time Streaming Protocol), MMT (MPEG Media Transport), or DASH (Dynamic Adaptive Streaming over HTTP).
The image reception decoding unit 201 may also receive, as the compressed image stream, image data encoded by an encoding scheme other than those listed above, or image data delivered in a delivery format other than those listed above. The image reception decoding unit 201 may also receive image data delivered according to an uncompressed transmission standard such as SDI (Serial Digital Interface) or HD (High Definition)-SDI.
Next, the changeover switch 202 outputs the image frame output from the image reception decoding unit 201 to the parameter acquisition unit 204 or the number-of-people analysis unit 206 according to the operation mode set by the mode switching control unit 203 (step ST05).
That is, if the mode switching control unit 203 has output a mode control signal notifying the parameter acquisition mode, the changeover switch 202 outputs the image frame to the parameter acquisition unit 204. If the mode switching control unit 203 has output a mode control signal notifying the people count mode, the changeover switch 202 outputs the image frame to the number-of-people analysis unit 206.
The following description assumes a configuration in which the parameter acquisition unit 204 and the number-of-people analysis unit 206 operate exclusively. However, a configuration may also be adopted in which the parameter acquisition unit 204 operates simultaneously while the number-of-people analysis unit 206 operates, so that the analysis parameters are updated as needed.
If the operation mode is the parameter acquisition mode, the parameter acquisition unit 204 performs parameter acquisition processing (step ST06). That is, the parameter acquisition unit 204 sequentially processes the image frames from the changeover switch 202, generates analysis parameters, and stores the generated analysis parameters in the parameter storage 205. The parameter acquisition unit 204 generates the analysis parameters by processing image frames over a certain period. When generation of the analysis parameters is completed, the parameter acquisition unit 204 stores the analysis parameters in the parameter storage 205. When the analysis parameters have been stored in the parameter storage 205, the process returns to step ST02. Details of the operation of the parameter acquisition unit 204 will be described later.
If the operation mode is the people count mode, the number-of-people analysis unit 206 performs people count processing (step ST07). That is, the number-of-people analysis unit 206 analyzes the image frame from the changeover switch 202 using the analysis parameters stored in the parameter storage 205, thereby analyzing the number of persons shown in the image frame. Details of the operation of the number-of-people analysis unit 206 will be described later.
After the number of people has been analyzed by the number-of-people analysis unit 206, the analysis result output unit 207 outputs a number-of-people analysis result indicating the analysis result of the number-of-people analysis unit 206 to the outside (step ST08).
The analysis result output unit 207 outputs the number-of-people analysis result by, for example, display on a monitor, output to a log file, output to an externally connected device, or transmission to a network. The analysis result output unit 207 may also output the number-of-people analysis result in another format. The output may be produced each time the number-of-people analysis unit 206 outputs a result, or may be intermittent, for example produced after aggregating or statistically processing the results over a specific period or a specific number of analyses. After step ST08, the process returns to step ST04 to process the next image frame.
Hereinafter, the detailed operation of the parameter acquisition unit 204 will be described.
FIG. 5 is a flowchart showing the operation of the parameter acquisition unit 204.
First, the person detection unit 2041 performs person detection processing on the input image frame and stores person detection result information in the person information storage 2045 (step ST11).
The person detection unit 2041 detects persons that satisfy the following conditions: the person is standing, the contact position between the feet and the ground and the top of the head are visible, and either no occlusion with other persons is occurring or the proportion of occlusion with other persons is small.
The person detection unit 2041 outputs person detection result information to the person information storage 2045 for each detected person. The person detection result information includes, for example, information indicating a rectangular, elliptical, or other frame that tightly encloses the detected person, coordinate information indicating the contact position between the detected person's feet and the ground, coordinate information indicating the top of the person's head, and the moving speed of the ground contact point.
FIG. 7 shows an image of a person detection result obtained by the person detection unit 2041.
As the person detection technique, a technique using still-image-based features such as HOG (Histogram of Oriented Gradients), ICF (Integral Channel Features), or ACF (Aggregate Channel Features) may be used. A technique using video-based features computed over a plurality of temporally adjacent image frames may also be used.
In the present embodiment, the person detection unit 2041 uses a person detection technique that assumes the size of a person in the image frame is unknown. Alternatively, the person detection unit 2041 may perform person detection a plurality of times and, once a tendency of the person scale has been obtained for each image region, switch to a person detection technique that assumes that scale, thereby speeding up the processing.
The person detection unit 2041 acquires the information on the moving speed of the ground contact point by, for example, tracking feature points or tracking persons across a plurality of image frames. If the compressed image stream input from the camera 10 contains motion vectors for compression encoding, the person detection unit 2041 may use that motion vector information as it is to acquire the information on the moving speed of the ground contact point.
The parameter acquisition unit 204 may be configured to store all detection results in the person information storage 2045 as person detection result information, or, to save the capacity of the person information storage 2045, may be configured to keep the amount of data accumulated in the person information storage 2045 below a certain level. For example, when the person detection unit 2041 detects another person in an image region where a person has already been detected, the person detection unit 2041 may discard one of the detection results so that the amount of accumulated data stays below a certain level. Alternatively, the person detection unit 2041 may take the average of the detection results and integrate them so that the amount of data accumulated in the person information storage 2045 stays below a certain level.
Next, based on the person detection result information, the foreground extraction unit 2042 performs foreground extraction within the frame region enclosing each detected person, and stores the image of the foreground extraction result in the person information storage 2045 in association with the person detection result information (steps ST12 and ST13).
The foreground image extracted by the foreground extraction unit 2042 corresponds to a subject image.
FIG. 8 shows an image of a foreground extraction result.
FIG. 9 shows an image of the pairs, stored in the person information storage 2045, of person detection result information and the person foreground image corresponding to each piece of person detection result information. The person detection result information and the foreground image are collectively referred to as person information.
The foreground extraction unit 2042 extracts the foreground image using, for example, a background subtraction method, an adaptive background subtraction method, or a dense optical flow derivation algorithm.
The background subtraction method registers a background image in advance and computes the difference between the input image and the background image. The adaptive background subtraction method automatically updates the background image from continuously input image frames using a model such as an MOG (Mixture of Gaussians) distribution. A dense optical flow derivation algorithm acquires motion information in the image on a per-pixel basis.
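As an illustration of the foreground extraction step (not part of the claimed configuration), the following sketch shows one possible adaptive background subtraction using the MOG2 background subtractor of the OpenCV library; the choice of OpenCV, the parameter values, and the function name are assumptions for illustration only.

import cv2

# Adaptive background subtraction based on a Mixture-of-Gaussians model.
# The history length and variance threshold are illustrative values,
# not values prescribed by this embodiment.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=False)

def extract_foreground(frame_bgr):
    # Per-pixel foreground label from the adaptive background model.
    mask = subtractor.apply(frame_bgr)
    # Binarize and remove isolated noise pixels with a small opening.
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)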
Next, the foreground extraction unit 2042 determines whether sufficient person information has been acquired through the processing of steps ST11 to ST13 (step ST14).
If sufficient person information has been acquired, the process proceeds to step ST15. If sufficient person information has not been acquired, the process ends.
Whether sufficient person information has been obtained is determined, for example, by measures such as the number of pieces of person information acquired so far or the elapsed time since the parameter acquisition processing started. Alternatively, the parameter estimation processing of step ST15 onward may be performed, and the determination of whether sufficient person information has been obtained may be made based on the reliability of the parameters estimated by that processing.
When sufficient person information has been acquired, the depression angle estimation unit 2043 estimates the depression angle of the camera 10 using the person information stored in the person information storage 2045 (step ST15).
FIGS. 10, 11, and 12 show an outline of the depression angle estimation processing.
As shown in FIG. 10, the present embodiment assumes that the camera is installed pointing below the horizontal. It is also assumed that the lens distortion of the camera has been corrected so that the camera can be regarded as a pinhole camera, and that the internal parameters of the camera are known, so that the angle that the viewing direction of each image coordinate makes with the optical axis is known.
In this case, if the image coordinate corresponding to the horizon height in the image is known, the angle α [rad] between the optical axis direction and the horizontal direction is uniquely determined; the angle between the optical axis direction and the ground, that is, the depression angle θ [rad], is obtained as θ = α. Alternatively, if the image coordinate corresponding to the vertical point in the image is known, the angle β [rad] between the optical axis direction and the vertical direction is uniquely determined, and the depression angle θ [rad] is obtained as θ = π/2 − β.
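A minimal sketch of the two relations θ = α and θ = π/2 − β is given below for a pinhole camera with known principal point row cy and vertical focal length fy in pixels, assuming negligible camera roll and image rows increasing downward; the function names and the roll assumption are illustrative and not stated in the original description.

import math

def depression_from_horizon(horizon_row, cy, fy):
    # alpha: angle between the optical axis and the horizontal direction,
    # with tan(alpha) = (cy - horizon_row) / fy (the horizon appears above
    # the principal point when the camera looks down).  theta = alpha.
    return math.atan2(cy - horizon_row, fy)

def depression_from_vertical_point(vpoint_row, cy, fy):
    # beta: angle between the optical axis and the vertical direction,
    # with tan(beta) = (vpoint_row - cy) / fy (the vertical point lies
    # below the principal point when the camera looks down).
    # theta = pi/2 - beta.
    return math.pi / 2.0 - math.atan2(vpoint_row - cy, fy)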
FIG. 11 shows an image of the derivation of the horizon height.
Based on the person information, the depression angle estimation unit 2043 extends the movement direction of each moving person's ground contact point and regards the point at which the extension lines of the movement directions of multiple persons best intersect as a point on the horizon. This is based on the premise that multiple pedestrians walk in parallel and straight ahead. The lines do not necessarily intersect at a single point because of accuracy errors in movement direction detection and the influence of pedestrians who are not walking in parallel. For this reason, the depression angle estimation unit 2043 uses a large amount of person information and estimates the most plausible result as the horizon height.
FIG. 12 shows an image of the derivation of the vertical point coordinates.
Based on the person information, the depression angle estimation unit 2043 extends the line connecting the coordinates of the top of the head and the ground contact point, that is, the line in the body-normal direction, and regards the point at which the normal-direction lines of multiple persons best intersect as the vertical point. This is based on the premise that pedestrians stand upright with respect to the ground. The lines do not necessarily intersect at a single point because of detection errors in the coordinates of the head top and the ground contact point and the influence of pedestrians who are not standing upright. For this reason, the depression angle estimation unit 2043 uses a large amount of person information and estimates the most plausible result as the vertical point coordinates.
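Both the horizon point of FIG. 11 and the vertical point of FIG. 12 are points at which many 2D lines "best intersect". One common way to compute such a point, given here only as an illustrative assumption, is a least-squares intersection; a robust variant such as RANSAC could equally be substituted when outlier pedestrians dominate.

import numpy as np

def best_intersection(points, directions):
    # points: (N, 2) array with one point on each line, e.g. a ground contact
    # point; directions: (N, 2) array of line directions, e.g. the motion
    # direction (FIG. 11) or the head-to-foot direction (FIG. 12).
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, d in zip(np.asarray(points, float), np.asarray(directions, float)):
        d = d / np.linalg.norm(d)
        M = np.eye(2) - np.outer(d, d)   # projector onto the line normal
        A += M
        b += M @ p
    # Minimizes the sum of squared distances to all lines.
    return np.linalg.solve(A, b)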
The method using the horizon height is suitable when the depression angle of the camera is shallow, that is, when the optical axis is close to horizontal. The method using the vertical point is suitable when the depression angle of the camera is deep, that is, when the optical axis is closer to vertical. The depression angle estimation unit 2043 may select, from the depression angles obtained by the two methods, the one obtained by the more suitable method, or may select the average of the two. The depression angle estimation unit 2043 may also use another depression angle estimation method.
Next, the foreground map generation unit 2044 generates the foreground maps to be used by the number-of-people analysis unit 206, using the depression angle information obtained by the depression angle estimation unit 2043 and the person information stored in the person information storage 2045 (step ST16).
More specifically, the foreground map generation unit 2044 first calculates a two-dimensional coordinate system on the road surface (hereinafter referred to as the road surface coordinate system) from the depression angle information and the movement information of the person ground contact points. For example, the foreground map generation unit 2044 defines the main movement direction of the persons as the X direction and the direction orthogonal to the X direction as the Y direction.
Next, the foreground map generation unit 2044 obtains the absolute scale of the road surface coordinates, which are coordinates in the road surface coordinate system. For example, the foreground map generation unit 2044 regards the movement width per unit time in road surface coordinates as the movement width corresponding to a pedestrian's average walking speed given in advance, and thereby obtains the absolute scale of the road surface coordinates.
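A minimal sketch of fixing the absolute scale from an assumed average walking speed is given below; the 1.3 m/s value and the use of the median are illustrative assumptions not specified in the original description.

import numpy as np

def road_scale(displacements_per_frame, fps, avg_speed_mps=1.3):
    # displacements_per_frame: per-person movement widths measured in road
    # surface coordinate units per frame.  Returns metres per road surface
    # coordinate unit, assuming a typical pedestrian covers avg_speed_mps
    # metres per second.
    units_per_second = np.median(displacements_per_frame) * fps
    return avg_speed_mps / units_per_second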
Next, after obtaining the road surface coordinate system, the foreground map generation unit 2044 combines the per-person foreground images acquired in step ST13 to generate a foreground image of a crowd that would be expected to be observed at a specific congestion level (hereinafter referred to as a crowd foreground image).
FIGS. 13, 14, and 15 show examples of crowd foreground images: FIG. 13 at congestion level 1, FIG. 14 at congestion level 2, and FIG. 15 at congestion level 3.
As shown in FIGS. 13 to 15, the foreground map generation unit 2044 generates a plurality of crowd foreground images, each containing a plurality of persons in the shooting space, while varying the number of persons in the shooting space for each crowd foreground image.
As also shown in FIGS. 13 to 15, occlusion occurs between persons in the crowd foreground images, depending on the density and arrangement of the persons.
The foreground map generation unit 2044 converts the image coordinates of the ground contact points in the person information acquired so far into road surface coordinates, and then pastes foreground images at a plurality of positions such that the person density takes a predetermined value in the road surface coordinate system, thereby generating a crowd foreground image. For example, suppose the road surface coordinate grid shown in FIG. 13 has a 50 cm pitch in the road surface coordinate system, so that the whole grid covers 4 m². If the crowd density at congestion level 1 is 1 [person/m²], the foreground map generation unit 2044 randomly places 4 persons in the grid. If the crowd density at congestion level 2 is 2 [persons/m²], the foreground map generation unit 2044 randomly places 8 persons in the grid. If the crowd density at congestion level 3 is 4 [persons/m²], the foreground map generation unit 2044 randomly places 16 persons in the grid. In every case, the foreground map generation unit 2044 places the persons at random while keeping them from getting too close to each other, for example by maintaining a distance of at least the average shoulder width of a person or the average personal space radius for each density.
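The random placement described above can be sketched as follows; the 2 m × 2 m grid, the person counts for congestion levels 1 to 3, and the 0.5 m minimum separation follow the illustrative numbers of FIGS. 13 to 15, while the rejection-sampling strategy itself is an assumption rather than the only possible implementation.

import random

def place_crowd(num_people, grid_m=2.0, min_dist_m=0.5, max_tries=1000):
    # Randomly place num_people ground positions (road surface coordinates,
    # in metres) inside a grid_m x grid_m area, keeping at least min_dist_m
    # between any two persons (e.g. an average shoulder width).
    placed, tries = [], 0
    while len(placed) < num_people and tries < max_tries:
        tries += 1
        x, y = random.uniform(0.0, grid_m), random.uniform(0.0, grid_m)
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist_m ** 2 for px, py in placed):
            placed.append((x, y))
    return placed

# Congestion levels 1 to 3 of FIGS. 13 to 15: 1, 2 and 4 persons/m^2 over a
# 4 m^2 grid give 4, 8 and 16 persons respectively.
crowds = {level: place_crowd(n) for level, n in [(1, 4), (2, 8), (3, 16)]}

Each placed road surface position would then be converted back into image coordinates and one of the stored per-person foreground images pasted at that position.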
Next, after generating a crowd foreground image, the foreground map generation unit 2044 divides the crowd foreground image into a plurality of partial areas, as shown in FIGS. 16, 17, and 18, and calculates, for each congestion level, the foreground area amount per person (for example, 150 pixels/person). In FIGS. 17 and 18, for drawing reasons, the foreground area amount is not shown for some persons.
The foreground map generation unit 2044 also calculates, for each partial area, the foreground area amount contained in the partial area. For example, in the partial area indicated by reference numeral 161 in FIG. 16, 90 pixels/partial area is obtained. Because the crowd foreground image is generated by combining the foreground images of multiple persons, the foreground map generation unit 2044 can obtain the true value of the number of persons appearing in each partial area. The foreground map generation unit 2044 counts, for each partial area, the number of persons appearing in the partial area. If one partial area contains the entire foreground image of one person, the number of persons appearing in that partial area is one (1 person/partial area). If one partial area contains only part of a person's foreground image, the foreground map generation unit 2044 divides the area of the person's part contained in that partial area by the total area of that person's foreground image, thereby obtaining the number of persons contained in the partial area in fractional units. For example, the foreground area amount of the person contained in the partial area 161 is 150 pixels. Assuming that 90 pixels of that person are contained in the partial area 161, the foreground map generation unit 2044 defines the number of persons contained in the partial area 161 as 0.6 (90 ÷ 150 = 0.6), that is, 0.6 persons/partial area.
The foreground map generation unit 2044 then attaches to each partial area the foreground area amount of the persons contained in the partial area (for example, 90 pixels/partial area) and the number of persons contained in the partial area (for example, 1 person/partial area or 0.6 persons/partial area).
Hereinafter, the foreground area amount of the persons contained in a partial area (for example, 90 pixels/partial area) is referred to as the area person area amount, and the number of persons contained in a partial area (for example, 1 person/partial area or 0.6 persons/partial area) is referred to as the number of area persons.
A crowd foreground image in which the area person area amount and the number of area persons are attached to each partial area is referred to as a foreground map. As described above, the foreground map corresponds to a reference image.
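A minimal sketch of deriving a foreground map from one crowd foreground image is given below, assuming that both the composed crowd mask and the individual per-person masks used to compose it are available; the cell size and the data layout are illustrative assumptions.

import numpy as np

def build_foreground_map(crowd_mask, person_masks, cell=40):
    # crowd_mask: HxW boolean array, foreground of the composed crowd image
    # (occlusions already merged).  person_masks: one HxW boolean mask per
    # pasted person.  cell: side length of a partial area in pixels.
    h, w = crowd_mask.shape
    rows, cols = h // cell, w // cell
    area = np.zeros((rows, cols))    # area person area amount per partial area
    count = np.zeros((rows, cols))   # number of area persons per partial area
    for r in range(rows):
        for c in range(cols):
            win = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            area[r, c] = crowd_mask[win].sum()
            for mask in person_masks:
                total = mask.sum()
                if total:
                    # e.g. 90 of a person's 150 pixels in this cell -> 0.6 persons
                    count[r, c] += mask[win].sum() / total
    return area, count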
The foreground map generation unit 2044 may generate a plurality of crowd foreground images for each congestion level. That is, for a single congestion level (for example, congestion level 3), the foreground map generation unit 2044 may generate a plurality of crowd foreground images by randomly varying the arrangement of the persons. When a plurality of crowd foreground images are generated for one congestion level in this way, the foreground map generation unit 2044 may store in the parameter storage 205 all of the foreground maps obtained from the plurality of crowd foreground images. Alternatively, the foreground map generation unit 2044 may integrate the plurality of foreground maps into one foreground map by taking the average of the foreground maps obtained from the plurality of crowd foreground images, and store only that one foreground map in the parameter storage 205. Doing so is expected to be effective in coping with various arrangement patterns of persons.
Finally, the generated foreground maps are stored in the parameter storage 205 (step ST17). The parameter acquisition processing is completed when the foreground maps have been stored.
Hereinafter, the detailed operation of the number-of-people analysis unit 206 will be described. FIG. 6 is a flowchart showing the operation of the number-of-people analysis unit 206.
First, the foreground extraction unit 2061 performs foreground extraction on the entire input image frame or on a region of interest (step ST21). This foreground extraction processing is the same as the foreground extraction processing of the foreground extraction unit 2042 in the parameter acquisition unit 204.
Next, the number-of-people estimation unit 2062 performs number-of-people estimation on the foreground image extracted by the foreground extraction unit 2061, using the per-congestion-level foreground maps stored in the parameter storage 205 (step ST22).
FIGS. 19, 20, and 21 illustrate the number-of-people estimation processing.
As shown in FIG. 19, the number-of-people estimation unit 2062 extracts a foreground image from the image frame. The number-of-people estimation unit 2062 then divides the foreground image into the same partial areas as the foreground maps. A partial area in the foreground image shown in FIG. 19 is hereinafter referred to as an estimation target partial area.
As shown in FIG. 20, the number-of-people estimation unit 2062 determines the foreground area amount for each estimation target partial area.
Next, for each estimation target partial area, the number-of-people estimation unit 2062 extracts from each of the per-congestion-level foreground maps the partial area located at the same position, compares the foreground area amount in each extracted partial area with the foreground area amount in the estimation target partial area, and selects the foreground map partial area whose foreground area amount is most similar to that of the estimation target partial area.
For example, for the estimation target partial area in the first row, first column (upper left corner) of the foreground image in FIG. 20, the number-of-people estimation unit 2062 extracts the partial area in the first row, first column from each of the foreground maps of congestion levels 1 to 3. The number-of-people estimation unit 2062 then compares the foreground area amount in each of the three partial areas extracted from the foreground maps of congestion levels 1 to 3 with the foreground area amount in the estimation target partial area of FIG. 20, and selects, from the three extracted partial areas, the one to which the foreground area amount most similar to that of the estimation target partial area is attached. The number-of-people estimation unit 2062 performs this processing for all estimation target partial areas in FIG. 20. Hereinafter, the partial area selected for an estimation target partial area is referred to as the selected partial area.
In this way, the number-of-people estimation unit 2062 selects a selected partial area from among the foreground maps of congestion levels 1 to 3 for each estimation target partial area. Consequently, the congestion level of the selected partial area may differ between estimation target partial areas. For example, it may happen that a partial area of the congestion level 3 foreground map is selected for the estimation target partial area in row n, column m of the foreground image of FIG. 20, a partial area of the congestion level 2 foreground map is selected for the estimation target partial area in row n, column m+1, and a partial area of the congestion level 1 foreground map is selected for the estimation target partial area in row n, column m+2.
Next, the number-of-people estimation unit 2062 obtains, for each estimation target partial area, the number of persons contained in the estimation target partial area. Specifically, the number-of-people estimation unit 2062 divides the foreground area amount of the estimation target partial area by the area person area amount of the selected partial area and multiplies the quotient by the number of area persons of the selected partial area. That is, the number-of-people estimation unit 2062 calculates (foreground area amount of the estimation target partial area) ÷ (area person area amount of the selected partial area) × (number of area persons of the selected partial area) to obtain the number of persons contained in the estimation target partial area.
The number-of-people estimation unit 2062 then sums the numbers of persons obtained for the individual estimation target partial areas to obtain the number of persons contained in the entire image frame or the region of interest.
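A minimal sketch of step ST22 is given below, assuming the foreground area amounts have been collected into per-cell arrays; the array layout and the absolute-difference similarity measure are illustrative assumptions.

import numpy as np

def estimate_people(target_area, level_maps):
    # target_area: (R, C) array of foreground area amounts of the estimation
    # target partial areas (from the second captured image).
    # level_maps: list of (area, count) pairs, one per congestion level,
    # where area and count are (R, C) arrays of the stored foreground maps.
    total = 0.0
    rows, cols = target_area.shape
    for r in range(rows):
        for c in range(cols):
            # Select the congestion level whose foreground area amount at the
            # same position is most similar to the observed amount.
            candidates = np.array([a[r, c] for a, _ in level_maps])
            k = int(np.argmin(np.abs(candidates - target_area[r, c])))
            sel_area = level_maps[k][0][r, c]
            sel_count = level_maps[k][1][r, c]
            if sel_area > 0:
                # (target foreground area) / (area person area amount)
                #   * (number of area persons)
                total += target_area[r, c] / sel_area * sel_count
    return total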
*** Explanation of the effects of the embodiment ***
As described above, in the present embodiment, the parameter acquisition unit 204 obtains the area person area amount and the number of area persons from actually observed foreground images while taking occlusion into account. The number-of-people analysis unit 206 then estimates the number of people by applying the area person area amount and the number of area persons to the foreground image that is the estimation target, so that highly accurate people-number estimation is possible without obtaining the average dimensions of persons in advance.
Furthermore, in the present embodiment, the parameter acquisition unit 204 estimates the depression angle of the camera, which is an external parameter of the camera, so the external parameters of the camera need not be measured in advance. In the technique disclosed in Patent Document 1, camera external parameters must be measured and acquired in advance in order to generate the CG model. The camera external parameters are, for example, the depression angle of the camera and the distance from each of the scattered persons to the camera. Because the technique disclosed in Patent Document 1 requires the camera external parameters to be measured and acquired in advance, it has the problem that the cost at the time of camera installation is large. As described above, the present embodiment has the effect that the external parameters of the camera need not be measured in advance.
In the above description, persons were used as an example of the subjects, but the subjects are not limited to persons. The subjects may be, for example, living creatures such as wild animals or insects, or moving objects other than persons, such as vehicles.
Within the scope of the invention, the constituent elements and procedures shown in the embodiment may be freely combined, modified, or omitted.
*** Explanation of hardware configuration ***
Finally, a supplementary description of the hardware configuration of the crowd monitoring apparatus 20 is given.
The processor 1101 shown in FIG. 22 is an IC (Integrated Circuit) that performs processing.
The processor 1101 is a CPU (Central Processing Unit), a DSP (Digital Signal Processor), or the like.
The memory 1102 shown in FIG. 22 is a RAM (Random Access Memory).
The storage device 1104 shown in FIG. 22 is a ROM (Read Only Memory), a flash memory, an HDD (Hard Disk Drive), or the like.
The network interface 1103 shown in FIG. 22 includes a receiver that receives data and a transmitter that transmits data.
The network interface 1103 is, for example, a communication chip or an NIC (Network Interface Card).
The storage device 1104 also stores an OS (Operating System), and at least part of the OS is executed by the processor 1101.
While executing at least part of the OS, the processor 1101 executes the programs that realize the functions of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207.
By the processor 1101 executing the OS, task management, memory management, file management, communication control, and the like are performed.
At least one of information, data, signal values, and variable values indicating the processing results of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 is stored in at least one of the memory 1102, the storage device 1104, and a register or cache memory in the processor 1101.
The programs that realize the functions of the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 may be stored in a portable storage medium such as a magnetic disk, a flexible disk, an optical disc, a compact disc, a Blu-ray (registered trademark) disc, or a DVD.
The image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 may also be read as a "circuit", "step", "procedure", or "process".
The crowd monitoring apparatus 20 may also be realized by an electronic circuit such as a logic IC (Integrated Circuit), a GA (Gate Array), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).
In this case, the image reception decoding unit 201, the changeover switch 202, the mode switching control unit 203, the parameter acquisition unit 204, the number-of-people analysis unit 206, and the analysis result output unit 207 are each realized as part of the electronic circuit.
The processor and the above electronic circuits are also collectively referred to as processing circuitry.
10 camera, 20 crowd monitoring apparatus, 201 image reception decoding unit, 202 changeover switch, 203 mode switching control unit, 204 parameter acquisition unit, 205 parameter storage, 206 number-of-people analysis unit, 207 analysis result output unit, 1101 processor, 1102 memory, 1103 network interface, 1104 storage device, 2041 person detection unit, 2042 foreground extraction unit, 2043 depression angle estimation unit, 2044 foreground map generation unit, 2045 person information storage, 2061 foreground extraction unit, 2062 number-of-people estimation unit.

Claims (8)

1. An image processing device comprising:
a reference image generation unit that extracts a subject image, which is an image of a subject, from a first photographed image that is a photographed image of a shooting space in which subjects exist and in which the subjects are shown, and that generates, based on the extracted subject image, a plurality of reference images in which a plurality of subjects exist in the shooting space, while varying the number of subjects existing in the shooting space for each reference image; and
a subject number estimation unit that compares a second photographed image, which is a photographed image of the shooting space whose photographing time differs from that of the first photographed image, with the plurality of reference images, and estimates the number of subjects shown in the second photographed image.
2. The image processing device according to claim 1, wherein
the reference image generation unit divides each of the plurality of reference images into a plurality of partial areas and calculates, for each partial area of each reference image, the number of subjects shown in the partial area as a number of area subjects, and
the subject number estimation unit divides the second photographed image into the plurality of partial areas, extracts, for each partial area of the second photographed image, the partial area located at the same position as that partial area from each of the plurality of reference images, compares each of the plurality of extracted partial areas with the partial area of the second photographed image, selects one of the plurality of extracted partial areas, and estimates the number of subjects shown in the second photographed image using the number of area subjects of the selected partial area.
  3.  The image processing device according to claim 2, wherein
     the reference image generation unit calculates, for each partial region of each reference image, the regional subject count and a foreground area amount in the partial region, and
     the subject number estimation unit
     calculates, for each partial region of the second photographed image, a foreground area amount in the partial region of the second photographed image,
     compares, for each partial region of the second photographed image, the foreground area amount in each of the extracted partial regions with the foreground area amount in the partial region of the second photographed image, and selects a partial region whose foreground area amount is similar to the foreground area amount in the partial region of the second photographed image,
     estimates, for each partial region of the second photographed image, the number of subjects appearing in the partial region of the second photographed image using the regional subject count of the selected partial region, the foreground area amount in the selected partial region, and the foreground area amount in the partial region of the second photographed image, and
     estimates the number of subjects appearing in the second photographed image based on the number of subjects estimated for each partial region of the second photographed image (an illustrative sketch of this region-wise estimation follows the claims).
  4.  The image processing device according to claim 1, wherein the reference image generation unit generates, for each number of subjects present in the photographing space, a plurality of reference images in which the number of subjects present in the photographing space is the same and the arrangement of the subjects differs.
  5.  The image processing device according to claim 1, wherein the reference image generation unit generates a reference image in which occlusion occurs between subjects.
  6.  The image processing device according to claim 1, wherein the reference image generation unit extracts the subject image from the first photographed image photographed by a camera installed at a position overlooking the photographing space, analyzes the first photographed image to estimate a depression angle of the camera, and generates the plurality of reference images based on the extracted subject image and the estimated depression angle of the camera.
  7.  An image processing method comprising:
     extracting, by a computer, a subject image, which is an image of a subject, from a first photographed image, which is a photographed image of a photographing space in which the subject is present and in which the subject appears, and generating, based on the extracted subject image, a plurality of reference images in each of which a plurality of subjects are present in the photographing space, while changing the number of subjects present in the photographing space for each reference image; and
     comparing, by the computer, a second photographed image, which is a photographed image of the photographing space whose photographing time differs from that of the first photographed image, with the plurality of reference images, and estimating the number of subjects appearing in the second photographed image.
  8.  An image processing program for causing a computer to execute:
     a reference image generation process of extracting a subject image, which is an image of a subject, from a first photographed image, which is a photographed image of a photographing space in which the subject is present and in which the subject appears, and generating, based on the extracted subject image, a plurality of reference images in each of which a plurality of subjects are present in the photographing space, while changing the number of subjects present in the photographing space for each reference image; and
     a subject number estimation process of comparing a second photographed image, which is a photographed image of the photographing space whose photographing time differs from that of the first photographed image, with the plurality of reference images, and estimating the number of subjects appearing in the second photographed image.
PCT/JP2017/010247 2017-03-14 2017-03-14 Image processing device, image processing method, and image processing program WO2018167851A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
PCT/JP2017/010247 WO2018167851A1 (en) 2017-03-14 2017-03-14 Image processing device, image processing method, and image processing program
CN201780088044.6A CN110383295B (en) 2017-03-14 2017-03-14 Image processing apparatus, image processing method, and computer-readable storage medium
MYPI2019004399A MY184063A (en) 2017-03-14 2017-03-14 Image processing device, image processing method, and image processing program
SG11201906822YA SG11201906822YA (en) 2017-03-14 2017-03-14 Image processing device, image processing method, and image processing program
JP2019505568A JP6559378B2 (en) 2017-03-14 2017-03-14 Image processing apparatus, image processing method, and image processing program
TW106123456A TW201833822A (en) 2017-03-14 2017-07-13 Image processing device, image processing method, and image processing program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2017/010247 WO2018167851A1 (en) 2017-03-14 2017-03-14 Image processing device, image processing method, and image processing program

Publications (1)

Publication Number Publication Date
WO2018167851A1 (en) 2018-09-20

Family

ID=63521936

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/010247 WO2018167851A1 (en) 2017-03-14 2017-03-14 Image processing device, image processing method, and image processing program

Country Status (6)

Country Link
JP (1) JP6559378B2 (en)
CN (1) CN110383295B (en)
MY (1) MY184063A (en)
SG (1) SG11201906822YA (en)
TW (1) TW201833822A (en)
WO (1) WO2018167851A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115701120B (en) * 2021-07-26 2024-06-25 荣耀终端有限公司 Electronic equipment and camera module

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005025328A (en) * 2003-06-30 2005-01-27 Ntt Data Corp Congestion monitoring system and congestion monitoring program
JP2005242646A (en) * 2004-02-26 2005-09-08 Ntt Data Corp People flow measuring system, people flow measuring method and people flow measuring program
JP2013089174A (en) * 2011-10-21 2013-05-13 Nippon Telegr & Teleph Corp <Ntt> Apparatus, method and program for measuring number of object
JP2014229068A (en) * 2013-05-22 2014-12-08 株式会社 日立産業制御ソリューションズ People counting device and person flow line analysis apparatus
US20160210756A1 (en) * 2013-08-27 2016-07-21 Nec Corporation Image processing system, image processing method, and recording medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5624809B2 (en) * 2010-06-24 2014-11-12 株式会社 日立産業制御ソリューションズ Image signal processing device
JP2014164525A (en) * 2013-02-25 2014-09-08 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for estimating number of object
US20160127657A1 (en) * 2013-06-11 2016-05-05 Sharp Kabushiki Kaisha Imaging system
JP6638723B2 (en) * 2015-03-04 2020-01-29 ノーリツプレシジョン株式会社 Image analysis device, image analysis method, and image analysis program

Also Published As

Publication number Publication date
JP6559378B2 (en) 2019-08-14
TW201833822A (en) 2018-09-16
CN110383295A (en) 2019-10-25
SG11201906822YA (en) 2019-08-27
JPWO2018167851A1 (en) 2019-06-27
CN110383295B (en) 2022-11-11
MY184063A (en) 2021-03-17

Similar Documents

Publication Publication Date Title
US8331617B2 (en) Robot vision system and detection method
US10893251B2 (en) Three-dimensional model generating device and three-dimensional model generating method
CN110692083B (en) Block-matched optical flow and stereoscopic vision for dynamic vision sensor
GB2507395B (en) Video-based vehicle speed estimation from motion vectors in video streams
US9230333B2 (en) Method and apparatus for image processing
US20200250885A1 (en) Reconstruction method, reconstruction device, and generation device
WO2018052547A1 (en) An automatic scene calibration method for video analytics
KR101781154B1 (en) Camera and method for optimizing the exposure of an image frame in a sequence of image frames capturing a scene based on level of motion in the scene
JP2015528614A5 (en)
AU2012340862A1 (en) Geographic map based control
US9576204B2 (en) System and method for automatic calculation of scene geometry in crowded video scenes
CN112104869B (en) Video big data storage and transcoding optimization system
CN110992393B (en) Target motion tracking method based on vision
JP6622575B2 (en) Control device, control method, and program
CN106530353B (en) The three-dimensional motion point detecting method rebuild for binocular vision system sparse three-dimensional
JP6559378B2 (en) Image processing apparatus, image processing method, and image processing program
WO2019087383A1 (en) Crowd density calculation device, crowd density calculation method and crowd density calculation program
JP4979083B2 (en) Monitoring system, monitoring method, and program
JP2004046464A (en) Apparatus and method for estimating three-dimensional position of mobile object, program, and recording medium thereof
US11272209B2 (en) Methods and apparatus for determining adjustment parameter during encoding of spherical multimedia content
KR101904170B1 (en) Coding Device and Method for Depth Information Compensation by Sphere Surface Modeling
KR101904128B1 (en) Coding Method and Device Depth Video by Spherical Surface Modeling
JP2001145110A (en) Image-encoding device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17901310

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019505568

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17901310

Country of ref document: EP

Kind code of ref document: A1