WO2022059223A1 - Video analyzing system and video analyzing method - Google Patents

Video analyzing system and video analyzing method

Info

Publication number
WO2022059223A1
WO2022059223A1 (PCT/JP2021/005097)
Authority
WO
WIPO (PCT)
Prior art keywords
interaction
person
monitoring
video
area
Prior art date
Application number
PCT/JP2021/005097
Other languages
French (fr)
Japanese (ja)
Inventor
良起 伊藤
健一 森田
Original Assignee
株式会社日立製作所 (Hitachi, Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 (Hitachi, Ltd.)
Publication of WO2022059223A1 publication Critical patent/WO2022059223A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Definitions

  • the present invention relates to a video analysis system and a video analysis method that detect a person's state or an object from an image of a surveillance area and detect a monitoring target based on the detection result.
  • In the image monitoring apparatus described in Patent Document 1, a delivery act in the monitoring area is detected by calculating information on the posture of the human body from images of the monitoring area. The monitoring importance is then calculated using the occurrence position and the type of the delivered article, and the detection result of the delivery act is output according to the monitoring importance. As the output of the detection result, a method of notifying the monitor of the monitoring center by a screen display, an alarm lamp, an alarm sound, or the like is described.
  • In the image processing apparatus of Patent Document 2, for the purpose of reducing the burden of monitoring video and making it easier to capture the moment something occurs, a plurality of videos distributed from a plurality of video distribution devices are handled by an apparatus having: acquisition means for acquiring the videos, object position detection means for detecting the positions of the objects included in each video, direction detection means for detecting the orientation of each object at those positions, setting means for setting a priority for each video based on the orientations, and display means for displaying the videos based on the priority.
  • The present invention therefore sets a monitoring importance for each of the plurality of persons who have interacted in the monitoring area and accurately narrows down the persons of high monitoring importance; its object is to provide a video analysis system capable of reducing both the work load of the responders in monitoring and the processing load of the system.
  • A video analysis system as one aspect of the present invention detects an event in a surveillance area using video captured of the surveillance area. It comprises: an interaction detection unit that detects, based on the video, an interaction, that is, an event caused by the involvement of a plurality of persons, and outputs the type of the interaction and the direction of the interaction indicating how each of the plurality of persons was involved with the other persons; a monitoring importance determination unit that compares the type and direction of the interaction with preset monitoring reference information to determine a monitoring importance for each of the persons involved in the interaction; and an output control unit that outputs the detection result of the event based on the monitoring importance.
  • Similarly, the video analysis method of the present invention includes: an interaction detection step of detecting, based on the video, an interaction caused by the involvement of a plurality of persons and outputting the type of the interaction and the direction of the interaction indicating how each person was involved with the other persons; a monitoring importance determination step of comparing the type and direction of the interaction with preset monitoring reference information to determine a monitoring importance for each of the persons involved; and an output control step of outputting the detection result of the event based on the monitoring importance.
  • According to the present invention, it is possible to set a monitoring importance for each of the plurality of persons who have interacted in the monitoring area, and to reduce both the work load of the responders in monitoring and the processing load of the system.
  • the "event” in the present embodiment is a situation preset as a detection target in a certain monitoring area.
  • In particular, an interaction, which is an event involving a plurality of persons, is targeted for detection.
  • Actions such as shaking hands, handing over luggage, scuffling, and assault are included in the interactions.
  • FIG. 1 is an explanatory diagram of the video surveillance system according to the present embodiment.
  • the video monitoring system 1 is roughly classified into a shooting system 2, a video analysis system 3, and a monitoring center system 4.
  • the photographing system 2 is composed of a camera unit installed in a monitored area.
  • In the video analysis system 3, the interactions between persons to be detected and the attributes of each person are determined by analyzing the input video from the imaging device; the monitoring importance of each person is then determined by comparing information on the occurrence position and the direction of the interaction with preset monitoring reference information.
  • the monitoring center system 4 receives the analysis result from the video analysis system 3, effectively displays it to the observer and the on-site staff, and performs a search after an event related to an interaction or a person occurs.
  • the direction of interaction indicates from which person to which person a predetermined action related to the interaction is performed.
  • For a delivery, the direction of interaction is set from the person who handed over the article (the executor of the delivery) to the person who received it (the executed person of the delivery).
  • Similarly, for an assault, the direction of interaction is set from the perpetrator (the executor of the assault) to the victim (the executed person of the assault).
  • the direction of interaction is determined for each type of interaction.
  • There are also interactions that take place in both directions, such as shaking hands and scuffling.
  • the attributes of a person include whether they are ordinary people or security guards, age, and gender.
  • The video monitoring system 1 sets the monitoring importance for each person by using the direction of the interaction and the attributes of the persons involved in it, thereby reducing the work load of the responders in monitoring and the processing load of the system.
  • In the conventional techniques, however, the relative weighting of the monitoring importance between the persons who delivered and received an article is not taken into consideration.
  • If the article is one requiring attention in the monitoring area, it is desirable that the monitoring importance of the person who received the article be set higher than that of the person who delivered it.
  • Similarly, the monitoring importance should be judged differently for a transfer from a low-security area into a high-security area than for a transfer in the opposite direction.
  • The video monitoring system 1 realizes accurate determination of monitoring importance by using not only the type of interaction and its occurrence position but also the direction of the action and the attributes of the persons.
  • FIG. 2 is a diagram showing the overall configuration of the video surveillance system according to the present embodiment.
  • the photographing system 2 is composed of one or a plurality of camera units 21 installed in the monitoring target area, and the captured images are sequentially input to the image input unit 31 of the image analysis system 3.
  • the camera unit 21 is a surveillance camera arranged so that the entire area to be monitored can be imaged. If the area setting for detecting the interaction is not required, the surveillance camera may be a mobile camera that is not fixed, and the format does not matter as long as the area to be monitored can be imaged.
  • the camera unit 21 and the video input unit 31 are connected by a wired communication means or a wireless communication means, and a frame image is continuously transmitted from the camera unit 21 to the video input unit 31.
  • Since the interaction recognition uses a time-series analysis model that assumes a plurality of frame images as input, it is desirable that the frame rate of the continuous transmission be equal to or higher than the value required for the interaction recognition.
  • When the frame rate is lower than the required value, processing to suppress the loss of accuracy, such as interpolation or extrapolation of the time-series data, may be performed.
  • The camera units 21 and the video analysis system 3 need not correspond one to one; a single video analysis system may process the video of a plurality of camera units. Even when such multiple processes are executed, the frame rate required by each process is subject to the restrictions described above.
  • the camera unit 21 may be equipped with a part or all of the functions of the video analysis system described later.
  • the video analysis system 3 is composed of a video input unit 31, a video processing unit 32, and a storage unit 33.
  • the video input unit 31 receives video input from the camera unit 21 and transmits video data to the video processing unit 32.
  • the video to be analyzed may not be a video directly input from the camera unit 21, but may be a video in a separately stored recorder.
  • the storage location of the video does not matter.
  • The video processing unit 32 reads the monitoring reference information stored in the storage unit 33, which will be described later, and analyzes the video input from the video input unit 31 to determine the monitoring importance of each individual who has performed an interaction.
  • the storage unit 33 stores the monitoring reference information set in the management control unit 43, which will be described later.
  • the monitoring reference information is used for determining the monitoring importance, which is the output of the video processing unit 32.
  • The video analysis system 3 is not limited to an on-premises system constructed on a server in the operating facility; it may also be constructed on a server outside the facility, for example by using a cloud service.
  • the monitoring center system 4 is composed of a recording unit 41, a video display unit 42, a management control unit 43, and a search unit 44.
  • the recording unit 41 has a function of holding information such as the generated interaction, the direction of the interaction, the person attribute, the generated area, and the generated time obtained by the video analysis by the video analysis system 3 as a database.
  • the video display unit 42 displays information about the behavior of the person who has performed the interaction at the current time and a part or all of the frame at the time of detecting the interaction according to the monitoring importance.
  • the management control unit 43 has a function of inputting setting information to the storage unit 33 by a watchman, a field staff, or the like in order to store the monitoring reference information used by the video processing unit 32.
  • The search unit 44 has a function of searching the information stored in the recording unit 41 for the corresponding person, using the person's attribute and interaction type as a query, and a function of checking information such as the person's position at the current time and the movement trajectory within the facility up to that point.
  • FIG. 12 is a hardware configuration diagram of the video surveillance system according to the present embodiment.
  • the camera unit 1102 is connected to the computer 1103 via a network. Further, the computer 1103 can communicate with the computer 1104 via the network.
  • the computer 1103 includes a CPU (Central Processing Unit) as an arithmetic control device, a RAM (Random access memory) as a main storage device, and an HDD (hard disk drive) as an auxiliary storage device.
  • the computer 1103 realizes the function as the video analysis system 3 by reading various programs from the HDD, developing them in the RAM, and executing them by the CPU. Further, the computer 1103 communicates with the camera unit 1102 and the computer 1104 via a predetermined communication interface (IF).
  • input / output devices such as a keyboard and a display are also connected to the computer 1103 via a predetermined IF.
  • The computer 1104 includes a CPU as an arithmetic control device, a RAM as a main storage device, and an HDD as an auxiliary storage device; it realizes the functions of the monitoring center system 4 by reading various programs from the HDD, loading them into the RAM, and executing them on the CPU. Further, the computer 1104 is connected to the computer 1103 and to input/output devices such as a keyboard and a display via a predetermined interface (IF).
  • FIG. 3 is a diagram showing a block diagram of the video analysis system according to the present embodiment.
  • the video input unit 31, the video processing unit 32, and the storage unit 33 that constitute the video analysis system 3 will be described.
  • The video input unit 31 sequentially receives video from one or more camera units 21 and outputs it to the subsequent video processing unit 32. The input may also be still images.
  • the video processing unit 32 includes a calculation unit 321, an interaction detection unit 322, a monitoring importance determination unit 323, and an output control unit 324.
  • the calculation unit 321 is further composed of a person detection unit 3211 and an attribute determination unit 3212.
  • the person detection unit 3211 detects a person from the still image of the current frame using the image or video received from the video input unit.
  • For the detection, a means using Haar-like features, a means using R-CNN (Regions with CNN features), or a means that estimates a skeleton coordinate group for each person using a skeleton estimation technique may be used.
  • The person detection unit 3211 performs person tracking after the persons are detected. In person tracking, the rectangular image of a person and the person ID assigned to that person are associated between preceding and following frames, and a general person tracking method such as template matching or optical flow may be used. A minimal sketch of this detection-and-tracking step is given below.
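  • The following Python sketch is only an illustration of the detection-and-tracking flow described above, not the patent's actual implementation: it assumes a generic external detector (e.g. an R-CNN style model, here the hypothetical detect_persons) that returns person rectangles per frame, and associates rectangles between frames by IoU overlap as a stand-in for template matching or optical flow.

    # Hypothetical per-frame person detection and simple ID tracking.
    # `detect_persons(frame)` is an assumed external detector returning a
    # list of (x, y, w, h) person rectangles for one frame.

    def iou(a, b):
        """Intersection-over-union of two (x, y, w, h) rectangles."""
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    class SimpleTracker:
        """Assigns a persistent person ID by matching boxes between frames."""
        def __init__(self, iou_threshold=0.3):
            self.iou_threshold = iou_threshold
            self.tracks = {}          # person_id -> last known box
            self.next_id = 0

        def update(self, boxes):
            assigned = {}             # box (as tuple) -> person ID
            for box in boxes:
                best_id, best_iou = None, self.iou_threshold
                for pid, prev in self.tracks.items():
                    if pid in assigned.values():
                        continue      # each track matched at most once
                    score = iou(box, prev)
                    if score > best_iou:
                        best_id, best_iou = pid, score
                if best_id is None:   # no match: a new person enters the scene
                    best_id = self.next_id
                    self.next_id += 1
                assigned[tuple(box)] = best_id
                self.tracks[best_id] = box
            return assigned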
  • the rectangular image of the person obtained by the person detection unit is input to the attribute determination unit 3212, and the attribute of the person is determined.
  • The attributes of a person can be used as information contributing to the determination of each individual's monitoring importance. The attributes can also be used as the query in the search unit 44 described above.
  • Attributes include security guards and staff of the facility, general facility users, age, gender, and the like. If security guards and staff can be distinguished from general facility users, interactions caused by the guards and staff can be regarded as actions performed within the scope of their duties, so no monitoring importance is set for them and unnecessary alarms can be avoided.
  • For the attribute determination, the rectangular image of a person is converted into a feature such as HOG (Histograms of Oriented Gradients), SIFT (Scale-Invariant Feature Transform), or a vector output from an intermediate layer of a trained deep learning network.
  • The determination means may be constructed in two stages, with a first-stage classifier that separates security guards and field staff from general visitors and a second-stage classifier that determines age and gender among general visitors, or the determination may be learned in a single classifier (a simplified two-stage sketch follows below). Furthermore, by separately using a means for determining the goods possessed by a person, an attribute expressing possession of prohibited or dangerous goods can be given to a person determined to possess them, which is effective for monitoring.
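  • A minimal sketch of the two-stage attribute determination just described, under the assumption that a feature vector (HOG, SIFT, or a deep-network embedding) has already been extracted from the person rectangle; both classifier objects are assumed, pre-trained models with a scikit-learn style predict method, not components defined by the patent.

    # Hypothetical two-stage attribute determination.
    def determine_attribute(feature_vector, staff_classifier, demographic_classifier):
        # Stage 1: separate security guards / facility staff from general visitors.
        role = staff_classifier.predict([feature_vector])[0]
        if role in ("security_guard", "staff"):
            return {"role": role}        # duties-related; no further scoring needed
        # Stage 2: estimate age group and gender for general visitors only.
        age_group, gender = demographic_classifier.predict([feature_vector])[0]
        return {"role": "general", "age_group": age_group, "gender": gender}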
  • the interaction detection unit 322 determines the presence / absence, type, and direction of the interaction using the information obtained from the person detection unit 3211.
  • As the determination method, there are means using image feature amounts and means using the skeleton information of an arbitrary pair of persons, as described above.
  • As features, a feature amount representing a person's posture calculated from the skeleton estimation result, a feature amount calculated from the relative distance between arbitrary skeleton points of a person pair, or a time-series feature amount expressing the amount of skeleton movement per unit time or the change in relative distance between adjacent image frames may be used.
  • The attribute information obtained from the attribute determination unit 3212 may also be used as a feature; for example, a feature amount expressing age or gender may be used.
  • Alternatively, a feature amount expressing an article possessed by a person may be used. For example, if an article judged to be owned by one person is judged to be owned by another person after a certain time, it can be interpreted that a delivery act occurred at the moment the ownership judgment switched.
  • These features may be used not only alone but also in combination. For example, if the interaction is determined using only posture features, interactions may be erroneously detected even between people located far apart; by also using the relative-distance features, the number of erroneous detections can be reduced. In this way, effective interaction detection can be performed by using the feature amounts alone or in combination.
  • As the determination means, a method based on a CNN can be used. Alternatively, a discriminator such as an SVM, a decision tree, or an LSTM (Long Short-Term Memory) may be learned on features representing posture, relative distance, attributes, and the like; a feature-construction sketch under these assumptions follows.
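  • The sketch below illustrates, purely as an assumption-labelled example, how posture, distance, and motion features for one person pair might be assembled before being fed to such a discriminator; it assumes an external pose estimator supplies 2D skeleton keypoints per person per frame as {joint_name: (x, y)} dictionaries, and the joint names and the delivery_classifier object are hypothetical.

    # Hypothetical feature construction for one person pair.
    import math

    def joint_distance(skel_a, skel_b, joint="right_wrist"):
        """Relative distance between the same joint of two people (pixels)."""
        ax, ay = skel_a[joint]
        bx, by = skel_b[joint]
        return math.hypot(ax - bx, ay - by)

    def movement(skel_prev, skel_curr, joint="right_wrist"):
        """Per-frame movement of one joint: a simple motion feature."""
        px, py = skel_prev[joint]
        cx, cy = skel_curr[joint]
        return math.hypot(cx - px, cy - py)

    def pair_features(prev_a, curr_a, prev_b, curr_b):
        """Concatenate distance and motion features for one pair of people."""
        return [
            joint_distance(curr_a, curr_b, "right_wrist"),
            joint_distance(curr_a, curr_b, "left_wrist"),
            movement(prev_a, curr_a, "right_wrist"),
            movement(prev_b, curr_b, "right_wrist"),
        ]

    # A pre-trained two-class classifier per interaction type (e.g. an SVM)
    # would consume these features; `delivery_classifier` below is an assumed
    # model object, not one provided by the patent.
    # label = delivery_classifier.predict([pair_features(pa0, pa1, pb0, pb1)])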
  • In the present embodiment, the monitoring importance determination unit 323 receives the attributes of the persons obtained from the attribute determination unit 3212, the type and direction of the interaction obtained from the interaction detection unit 322, and the position information of the persons obtained from the person detection unit 3211, and sets the monitoring importance for each individual who performed the interaction by collating this information with the monitoring reference information 331.
  • The area information can be determined by collating preset area definitions with the person rectangle. For example, if the areas are defined on the image coordinates of a camera with a fixed PTZ setting, the area in which a person is located can be determined by which area contains the estimated position of the person's feet, as in the sketch below.
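  • A minimal sketch of that area lookup, assuming each monitored area is a preset polygon on the image coordinates of a fixed camera; the area names and polygons are illustrative, and the foot position is approximated as the bottom centre of the person rectangle.

    # Hypothetical area lookup from a person rectangle.
    def foot_point(box):
        x, y, w, h = box
        return (x + w / 2.0, y + h)           # bottom centre of the rectangle

    def point_in_polygon(pt, polygon):
        """Ray-casting point-in-polygon test on image coordinates."""
        x, y = pt
        inside = False
        n = len(polygon)
        for i in range(n):
            x1, y1 = polygon[i]
            x2, y2 = polygon[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside

    def area_of_person(box, areas):
        """`areas` maps an area name to its preset polygon for this camera."""
        pt = foot_point(box)
        for name, polygon in areas.items():
            if point_in_polygon(pt, polygon):
                return name
        return "unknown"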
  • the output control unit 324 transmits the monitoring importance of each individual determined by the monitoring importance determination unit 323 to the monitoring center system 4. All events for which monitoring importance has been calculated may be transmitted, or thresholds may be preset so that only those with high monitoring importance are transmitted.
  • the storage unit 33 stores the monitoring reference information 331 for use in the monitoring importance determination unit 323.
  • The monitoring reference information 331 has three types of security level setting information: an interaction security level set for each interaction type, an attribute security level set for each attribute type, and an area security level set for each area type. It further has weight information for each of those security level settings and weight information for the performer or the executed person, set for each interaction type.
  • the monitoring reference information 331 can be set from the management control unit 43.
  • When video is input from the photographing system to the video analysis system in step S1, person detection is performed in step S2. Next, the number of persons is counted in step S3: if two or more persons are detected in the screen, the process proceeds to step S4; if one person or fewer is detected, the processing from step S4 onward is not performed, and the process waits for the input of the next frame and returns to step S1.
  • A mask process may be partially applied to the monitored area immediately before step S2 in order to reduce the amount of calculation.
  • In step S4, the interaction is determined.
  • The determination is performed on every pair of persons on the screen, but to reduce the amount of calculation it is preferable not to perform the determination for pairs separated by more than a certain distance, set per interaction type.
  • To evaluate the relative distance between persons before determining the action, the relative distance must be calculated in the world coordinate system. The position of each person in the world coordinate system is therefore estimated either from preset area information or, without preset calibration, by a depth estimation technique using a stereo camera or a monocular camera, and the relative distance between the persons is then calculated.
  • For the distance threshold, a separate setting table is prepared and the judgment threshold is set collectively. For example, when the threshold is set to 3 m, the action determination is not performed for person pairs whose relative distance in the world coordinate system exceeds 3 m. If the discriminator is not a multi-class classifier covering various interaction types but a two-class classifier learned per interaction type, the threshold can be set for each interaction type (see the pruning sketch below). In addition, when detecting an interaction that crosses an area boundary, the interaction determination can be limited to persons located in different areas and skipped for persons in the same area, which both reduces the amount of calculation and is suitable for reducing false detections.
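  • The pruning step can be sketched as follows, with the per-interaction thresholds and the assumption that world-coordinate positions (from preset calibration or depth estimation) are already available both being illustrative.

    # Hypothetical pruning of person pairs before interaction classification.
    from itertools import combinations
    import math

    DISTANCE_THRESHOLDS_M = {"delivery": 3.0, "assault": 3.0, "handshake": 1.5}

    def candidate_pairs(world_positions, interaction_type):
        """world_positions: {person_id: (x_m, y_m)} on the floor plane.
        Returns only the pairs close enough to be worth classifying."""
        limit = DISTANCE_THRESHOLDS_M.get(interaction_type, 3.0)
        pairs = []
        for (id_a, pa), (id_b, pb) in combinations(world_positions.items(), 2):
            if math.dist(pa, pb) <= limit:
                pairs.append((id_a, id_b))
        return pairs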
  • If no interaction is detected (step S5), the processing from step S6 onward is not performed; the process waits for the input of the next frame and returns to step S1.
  • In steps S6 to S8, the attributes are calculated for each person who caused the event.
  • In steps S9 to S11, the monitoring importance is determined for each detected event. After the output of the determined monitoring importance is controlled in step S12, the process waits for the input of the next frame and returns to step S1.
  • the flow processing shown in this figure does not necessarily have to be processed by a single process, and may be processed asynchronously using a plurality of processes in order to improve calculation efficiency.
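  • As a compact illustration of this per-frame flow (steps S1 to S12), the following sketch is a hypothetical condensation rather than the patent's implementation: the detector, tracker, interaction model, attribute model, and importance function are all assumed placeholder objects passed in by the caller.

    # Hypothetical one-frame pass through steps S1-S12.
    def process_frame(frame, detector, tracker, interaction_model,
                      attribute_model, importance_fn, threshold):
        boxes = detector(frame)                      # S2: person detection
        ids = tracker.update(boxes)                  # person tracking
        if len(ids) < 2:                             # S3: fewer than two people
            return []                                # wait for the next frame
        events = interaction_model(frame, ids)       # S4-S5: interaction judgment
        results = []
        for event in events:
            for person_id, is_performer in event["participants"]:
                attr = attribute_model(frame, person_id)           # S6-S8
                score = importance_fn(event, attr, is_performer)   # S9-S11
                if score >= threshold:                             # S12: output control
                    results.append({"person": person_id,
                                    "event": event,
                                    "importance": score})
        return results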
  • The monitoring reference information in this embodiment is composed of the security level setting information in Table 51, the weights for the security level setting targets in Table 52, and the performer weights for each interaction type in Table 53. From Tables 51 and 52, weighted points are calculated for each individual in consideration of the interaction type, the person attribute, and the occurrence area; the monitoring importance is then calculated using the weighted points and the performer weight in Table 53. These pieces of information are set by the management control unit 43, stored in the monitoring reference information 331, and read by the monitoring importance determination unit 323. The settings and effects of each table are described in detail below.
  • Table 51 is composed of three types of security level setting tables: Table 511 for setting the security level for each interaction type, Table 512 for setting the security level for each attribute type, and Table 513 for setting the security level for each area type.
  • The security level is set in four stages from 3 points to 0 points, labeled "high level", "medium level", "low level", and "no level" in descending order of points. High level indicates an object requiring the greatest attention, and no level indicates an object requiring no attention.
  • In Table 511, the importance of each interaction type is set, for example 1 point for "delivery" and 3 points for "assault".
  • In each security level table, items that are not subject to setting are explicitly set as level 0. For example, "handshake" and "hug" are set as level 0.
  • Each item in Tables 511, 512, and 513 is scored in four stages from 3 points to 0 points, but the number of classes is not limited to this embodiment, and it is desirable that the setter can set it freely.
  • Table 52 is a setting table that stores weights for each of the three security level setting targets.
  • the sum of the weights of the three types is set to 100%, the interaction is set to 30%, the attribute is set to 20%, and the area is set to 50%.
  • the interaction can be set to 70%, the attribute can be set to 30%, and the area can be set to 0%.
  • Depending on the attribute, the weighted points of an individual may be set to 0 points. For example, if a person whose attribute is determined to be "security guard" engages in any of the interactions, the interaction is considered to have been performed within the scope of the job, and it is generally considered appropriate not to monitor the security guard.
  • The above formula is only an example of a formula for calculating the weighted points, and different formulas may be used; one possible formulation is sketched below.
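  • The sketch below shows one plausible, assumption-labelled way to compute the weighted points from Tables 51 and 52 (the patent states its own formula is only an example, so the arithmetic and all level values here are illustrative, not taken from the patent): each of the three security levels is blended by its Table 52 weight, and attributes such as "security guard" are scored as zero by policy.

    # Assumed weighted-point calculation; values are illustrative only.
    SECURITY_LEVELS = {                                   # Table 51 (illustrative)
        "interaction": {"delivery": 1, "assault": 3, "handshake": 0, "hug": 0},
        "attribute":   {"security_guard": 0, "staff": 0, "general_youth": 2},
        "area":        {"inside_entrance_gate": 3, "lobby": 1},
    }
    TARGET_WEIGHTS = {"interaction": 0.3, "attribute": 0.2, "area": 0.5}  # Table 52
    ZERO_SCORE_ATTRIBUTES = {"security_guard", "staff"}   # scored as 0 by policy

    def weighted_points(interaction, attribute, area):
        """Blend the three security levels by the Table 52 weights."""
        if attribute in ZERO_SCORE_ATTRIBUTES:
            return 0.0    # e.g. interactions by guards are within their duties
        return sum(
            SECURITY_LEVELS[target].get(key, 0) * TARGET_WEIGHTS[target]
            for target, key in (("interaction", interaction),
                                ("attribute", attribute),
                                ("area", area))
        )

    # Example under these assumed values:
    # weighted_points("delivery", "general_youth", "inside_entrance_gate")
    # = 1*0.3 + 2*0.2 + 3*0.5 = 2.2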
  • Table 53 is a setting table for calculating the monitoring importance in consideration of the direction of interaction with respect to the weighted points calculated from Tables 51 and 52.
  • The "delivery" act is set to have a weight of 20% for the performer, that is, the person who handed over the goods.
  • The weight for the executed person, that is, the person to whom the article is delivered, is set to 80%.
  • A setting in which the executor has less weight, as in this example, means that the executed person is regarded as more important than the executor. In the act of "delivery", it is assumed that the person who received the item is more important than the person who handed it over.
  • The "scuffling" act, which is a two-way interaction, is set to 50% in Table 53 so that the weights of the performer and the executed person are equivalent.
  • The weight of the performer is set to 90% for the "assault" act, in which the direction from the perpetrator (the performer) to the victim (the executed person) is clear.
  • As an example, the weighted points given to a "general/youth" person who performs the "delivery" act "inside the entrance gate" are calculated in this way.
  • When multiple events occur for the same person, it is desirable that the monitoring importance values calculated for those events be added together, or that the largest value continue to be adopted, until the monitoring importance for that person is reset.
  • a threshold value can be set for the monitoring importance for the purpose of suppressing the number of events transmitted to the monitoring center system 4. For example, if the threshold value is set to 2.0, the monitored person calculated to have a monitoring importance of 1.5 is not transmitted, and the individual calculated to be 3.0 is transmitted.
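  • Continuing the assumed scoring sketch above, the direction of the interaction can be reflected through the performer and executed-person weights of Table 53, per-person scores from multiple events kept until reset, and only persons at or above the preset threshold transmitted to the monitoring center system; again, the exact combination rule is an assumption, not the patent's formula.

    # Assumed direction weighting, aggregation, and output-control threshold.
    PERFORMER_WEIGHTS = {"delivery": 0.2, "scuffling": 0.5, "assault": 0.9}  # Table 53

    def monitoring_importance(points, interaction, is_performer):
        """Split the weighted points between performer and executed person."""
        w = PERFORMER_WEIGHTS.get(interaction, 0.5)
        return points * (w if is_performer else 1.0 - w)

    def aggregate(scores, mode="max"):
        """Combine one person's scores from multiple events, until reset."""
        if not scores:
            return 0.0
        return sum(scores) if mode == "sum" else max(scores)

    def to_transmit(person_scores, threshold=2.0):
        """person_scores: {person_id: [score, ...]}.
        Only persons whose aggregated importance reaches the threshold are
        sent to the monitoring center system (e.g. 3.0 is sent, 1.5 is not)."""
        return {pid: aggregate(s) for pid, s in person_scores.items()
                if aggregate(s) >= threshold}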
  • A person to be notified, such as field staff or a security guard, may be designated according to the monitoring importance score.
  • FIGS. 6 and 7 are views showing an example of a setting screen for monitoring reference information according to the present embodiment. Further, FIG. 6 is a setting screen for creating the table of Table 51, and FIG. 7 is a setting screen for creating the tables of Table 52 and Table 53.
  • FIG. 6 is a GUI for setting the interaction type, the attribute of the person, and the security level regarding the area, and in particular, the setting of the interaction security level shown in the area 611 will be described below.
  • The interaction security level in region 611 has three stages, from 1 point to 3 points, but as described above the number of stages is not limited to this embodiment and it is desirable that the setter can set it freely. It is also desirable that the magnitude of each security level can be set freely by the setter. For an interaction for which no security level is set, the importance may be treated as 0 points.
  • the setter can press the pull-down column shown in the area 6111 and select the interaction registered from the list for each security level column.
  • the area 62 in FIG. 7 is a GUI for setting the weight of the security level setting target, and Table 52 is set in the setting of this area.
  • Region 621 sets weights for interactions, attributes, and areas as a percentage.
  • Area 63 is a GUI for setting the performer weight for each interaction type, and Table 53 is set in the setting of this area.
  • the weight on the performer side is set as a percentage for each interaction type.
  • the weight for the person to be executed may be automatically calculated and displayed by inputting the "executor weight".
  • the input target may be the person to be executed.
  • The interactions required to be set in area 631 are all actions that were registered in area 611 and given a score larger than 0 points; the corresponding line is added automatically at the time of registration by pressing area 6115. By pressing the "Save settings" button in area 632, the information set in this area is reflected in the monitoring reference information 331.
  • the area 633 may be pressed.
  • If registered items are missing when the entire setting screen is finished, it is desirable to output a display calling attention to the deficiency.
  • FIG. 8 is a diagram showing a display example in the video display unit 42 in the present embodiment, and the area 7 shows an output screen.
  • the area 7 may be displayed on the entire notification screen or may be displayed on a part of the notification screen.
  • Each person displayed on screen A (area 71), screen B (area 72), screen C (area 73), and screen D (area 74) has a monitoring importance set and has been transmitted to the monitoring center system; an example of displaying the image at the current time for all of these persons is shown.
  • the video used in this display is a real-time video based on the tracking of the person who caused the event.
  • the person tracking is performed by the person detection unit 3211 as described above, and the video is transmitted to the video display unit 42 together with information on the monitoring importance.
  • the person displayed on each screen is shown in the order of monitoring importance, and the detected event, the place of occurrence, the time of occurrence, and the current position are shown.
  • the column displayed in the order of monitoring importance may display the actual value output by the monitoring importance determination unit 323.
  • The area sizes of screens A to D are dynamically changed according to the determined monitoring importance. For example, when the monitoring importance of all persons has been reset and no person is displayed on the screen, person a, who performs the "delivery" act at time "09:20:00", is displayed on the largest screen. Next, when it is determined that person b, who performed the "assault" act at time "09:30:50", has a higher monitoring importance than the person who performed the "delivery" act, the display area of person b is made larger than the display area of person a.
  • It is desirable that the person displayed on each screen be shown with image processing such as superimposing the person's detection frame or trimming the image, so that the person can easily be distinguished from other people captured on the same screen.
  • the screen may be scrolled when a plurality of events occur.
  • FIG. 9 is a diagram showing a display example of the image display unit in the present embodiment, and shows a state in which one event of the region 75 in FIG. 8 is selected.
  • In FIG. 9, screen B (region 76) is selected and displays the frame at the time of detection for the interaction performed by the tracked person. Since the interaction takes place over a certain period of time, it is desirable to display the frame with the highest determination accuracy between the start and end of the interaction. Alternatively, a short clip from the start to the end of the interaction may be played.
  • By this display, the observer can grasp under what circumstances the interaction was performed when the event occurred. Furthermore, when the staff has completed the response to the confirmed event, when it is determined that no response is necessary, or when a false detection is obvious, the corresponding line on the display screen or in area 75 can be selected and deleted.
  • this screen can be checked not only by the person to be notified, such as the monitor of the monitoring center, who is expected to use a large display, but also by the staff and security guards who respond at the site. Also, by using a smartphone terminal, a tablet terminal, AR goggles, or the like, a part or all of the area 7 can be confirmed at the site.
  • FIG. 10 is a diagram showing a display example of the search unit according to the present embodiment, and the area 9 shows the entire search screen.
  • the area 91 shows the search input screen
  • the area 92 shows the output screen.
  • the events are narrowed down by using the interaction type, the attribute, and the occurrence time as a query.
  • the pull-down column is pressed for each search item column in the area 911, and the registered item is selected from the list.
  • the selected item is added to the "Registered Items" list at the bottom by pressing the "Add” button.
  • the check box of the item desired to be deleted is pressed, and the area 912 is pressed.
  • Each column is not necessarily required and may be left blank. For example, if all items are left blank and the search is executed, all the information stored in the recording unit 41 is output as the search result (see the sketch below).
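  • The search behaviour can be sketched as below: each recorded event is treated as a dictionary held by the recording unit, the query fields are optional, and blank fields match everything. The field names are illustrative assumptions, not those defined by the patent.

    # Hypothetical search over recorded interaction events.
    def search_events(records, interaction=None, attribute=None,
                      time_from=None, time_to=None):
        results = []
        for rec in records:
            if interaction and rec["interaction"] != interaction:
                continue
            if attribute and attribute not in rec["attributes"]:
                continue
            if time_from and rec["time"] < time_from:
                continue
            if time_to and rec["time"] > time_to:
                continue
            results.append(rec)
        return results        # an empty query returns every stored record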
  • information such as the current location and the time of occurrence is displayed in the area 921 regarding the search result.
  • information about the current position may be displayed as long as the stay in the monitoring area is confirmed and traceable even at the current time.
  • a frame image corresponding to the information in the area 921, a short-time clip image from the start to the end of the interaction, or the like is displayed.
  • Area 921 and area 922 are collectively regarded as one search result, and in this embodiment the entire set of search results can be confirmed by scrolling.
  • the search results may be switched in a grid pattern consisting only of the images or videos shown in the area 922.
  • By using the search unit 44, even if an event has been deleted from the display output of the video display unit 42, the event can be efficiently searched for in the recording unit 41. In addition, since similar cases can be searched for and their number checked, this is useful for taking countermeasures and preventive measures against events expected to occur in the future.
  • the video surveillance system 1 detects an interaction from the surveillance video, and determines the monitoring importance for each individual using the type and direction of the interaction, as well as the attributes and area information of the person.
  • Since the monitoring importance is set for each individual instead of giving equivalent monitoring importance to all persons who interacted in the monitoring area, the observer can set priorities for the response, which makes it easier to deal with the actors efficiently.
  • FIG. 11 is a diagram showing a display example by the video display unit 42 in the present embodiment; it shows detailed information about each of two persons for whom a monitoring importance has been set with respect to one interaction. Region 81 shows the frame image in which the interaction was detected; the frame image shows a delivery between person 811 and person 812. Area 84 shows the attributes, the place of occurrence, and the current position of the two persons, as well as the direction of the delivery. As described in area 84, the information about the two persons is displayed separately on screen X (area 82) and screen Y (area 83).
  • the image of each person captured in the area 81 is displayed with the magnification adjusted so that the observer can easily see it.
  • On a floor map, the place where the event occurred, the current position of the person, and the movement trajectory are shown: the circle indicates the event occurrence site, the pentagon indicates the current position and traveling direction of the person, and the dotted line indicates the movement trajectory connecting the event occurrence site to the current position.
  • In area 823 and area 833, the state of each person at the current time can be confirmed. From the screens shown in area 82 and area 83, the movement trajectories of the performer and the executed person from the occurrence of the event to the current time can be grasped.
  • the following processing is required to display the movement trajectory on the floor map.
  • The position of each person in the world coordinate system is estimated either using the area information set in advance or, without preset calibration, using a depth estimation technique with a stereo camera or a monocular camera. By connecting the acquired position information and time information in chronological order, the movement trajectory of the person can be displayed on the floor map, as in the sketch below.
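  • A minimal sketch of building that floor-map trajectory, under the assumption that a precomputed 3x3 homography maps image-coordinate foot positions onto the floor map (such a matrix could come from the preset area calibration); a depth-estimation based positioning would simply replace to_floor_map without changing the rest.

    # Hypothetical projection of tracked foot positions onto a floor map.
    import numpy as np

    def to_floor_map(foot_xy, homography):
        """Project an image-coordinate foot point onto floor-map coordinates."""
        x, y = foot_xy
        p = homography @ np.array([x, y, 1.0])
        return (p[0] / p[2], p[1] / p[2])

    def build_trajectory(tracked_foot_points, homography):
        """tracked_foot_points: list of (timestamp, (x, y)) in image coordinates.
        Returns time-ordered floor-map points that can be drawn as the dotted
        movement trajectory in the display example."""
        ordered = sorted(tracked_foot_points, key=lambda t: t[0])
        return [(ts, to_floor_map(pt, homography)) for ts, pt in ordered]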
  • Although this drawing shows, for both the performer and the executed person, the movement trajectory from the occurrence of the event onward, the trajectory of each person up to the event occurrence site may also be displayed if the persons were already being tracked before the event, or can be tracked retroactively using video saved on a separately prepared storage medium. For example, for a delivery act, the executor who handed over the article may be shown with the trajectory up to the event occurrence and the person who received the article with the trajectory after the event occurrence, so that video monitoring focused on the movement trajectory of the article, rather than of a person, can be performed.
  • In this way, the video display unit can also display a single event in detail. Specifically, it can display the state at the time of event occurrence, the state at the current time, and the movement trajectory of each person who performed the interaction, and further the movement trajectories of the performers up to the event occurrence. As a result, the observer can check at a glance not only the presence or absence of the event but also the movements of the performers before and after the event occurrence, so that detailed information on the event can be grasped easily and accurately.
  • As described above, the disclosed video analysis system 3 detects events in the monitoring area using video captured of the monitoring area, and includes an interaction detection unit 322 that detects, based on the video, an interaction caused by the involvement of a plurality of persons and outputs the type of the interaction and the direction of the interaction indicating how each of the persons was involved with the others.
  • A calculation unit 321 that detects person images in the video and calculates attribute features representing the attributes of the detected persons is further provided, and since the monitoring importance determination unit 323 determines the monitoring importance also using these attribute features, a highly accurate determination taking the persons' attributes into account is possible.
  • The interaction detection unit 322 detects the skeletons of persons from the video and calculates at least one of: a posture feature representing a person's posture calculated from the skeleton estimation result, a distance feature calculated from one or more distances between arbitrary body parts of different persons, a movement feature representing the amount of skeleton movement per unit time calculated from the difference between adjacent image frames, and an article feature expressing the ownership relationship of an article to a person; the type and direction of the interaction are then detected based on the calculated features. With such a configuration, a highly accurate determination can be made in consideration of the persons' postures, the distance between persons, the type of article, and the like.
  • Since the monitoring importance determination unit 323 determines the monitoring importance also using information on the positions of the plurality of persons at the time of the interaction, implausible events can be excluded based on the positional relationship and the determination accuracy can be improved.
  • The video analysis system 3 also has a storage unit 33 that holds, as monitoring reference information, the type of interaction, the direction of the interaction, and the security level of each occurrence area used to determine the monitoring importance of each person who performed a detected interaction. By retaining this information in advance and using it as appropriate, a simple and highly accurate determination can be realized.
  • By providing a search unit 44 capable of searching the interaction detection records for the person who performed an interaction, using the interaction type and/or information about the person as a search query, the detected and accumulated interactions can be used effectively.
  • the output control unit 324 changes the size of the display on the display terminal according to the monitoring importance for each person who has generated the interaction. Therefore, it is possible to easily recognize the importance and control the amount of information according to the importance.
  • the behavior before and after the interaction can be confirmed by displaying the movement locus before and after the interaction on the screen. Whether or not the movement locus of a plurality of persons needs to be generated may be determined by the importance. In this case, it is possible to selectively output the movement locus for an important person.
  • the direction of the interaction may be clearly indicated by displaying on the screen information indicating that the person is the performer or the person to be executed of the predetermined action related to the interaction.
  • The content of the interaction can also be displayed in association with the current state of the person.
  • the present invention is not limited to the above-described embodiment, but includes various modifications.
  • the above-described embodiment has been described in detail in order to explain the present invention in an easy-to-understand manner, and is not necessarily limited to the one including all the described configurations.
  • it is possible to replace a part of the configuration of one embodiment with the configuration of another embodiment and it is also possible to add the configuration of another embodiment to the configuration of one embodiment.
  • each of the above configurations, functions, processing units, processing means and the like may be realized by hardware by designing a part or all of them by, for example, an integrated circuit.
  • each of the above configurations, functions, and the like may be realized by software by the processor interpreting and executing a program that realizes each function.
  • Information such as programs, tables, and files that realize each function can be placed in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.
  • 1 ... Video surveillance system, 2 ... Shooting system, 21 ... Camera unit, 3 ... Video analysis system, 31 ... Video input unit, 32 ... Video processing unit, 321 ... Calculation unit, 3211 ... Person detection unit, 3212 ... Attribute determination unit, 322 ... Interaction detection unit, 323 ... Monitoring importance determination unit, 324 ... Output control unit, 33 ... Storage unit, 331 ... Monitoring reference information, 4 ... Monitoring center system, 41 ... Recording unit, 42 ... Video display unit, 43 ... Management control unit, 44 ... Search unit

Abstract

This invention sets a monitoring significance for each of a plurality of persons having interacted with each other in a monitoring area and achieves reduction of the workload of a monitoring and reacting person as well as reduction of system's processing load. For this purpose, this video analyzing system for detecting events in a monitoring area by use of video obtained by imaging the monitoring area comprises: an interaction detecting unit that, on the basis of the video, detects interaction being an event caused by participation of a plurality of persons and that outputs the type of the interaction and the directions of the interaction each indicating how a respective one of the plurality of persons has participated in the interaction together with the other or the others; a monitoring significance determining unit that compares the type and directions of the interaction with predetermined monitoring reference information to determine a monitoring significance for each of the plurality of persons having participated in the interaction; and an output control unit that outputs a detection result of the event on the basis of the monitoring significance.

Description

Video analysis system and video analysis method
The present invention relates to a video analysis system and a video analysis method that detect a person's state or an object from an image of a surveillance area and detect a monitoring target based on the detection result.
In recent years, the need for video surveillance at event venues such as concert halls and amusement facilities and at public facilities such as railway stations and airports has been increasing. For example, to prevent terrorist acts using dangerous materials such as explosives and toxic liquids, security requires that acts such as handing over luggage or leaving objects behind inside or outside a security area be addressed by monitoring persons, detecting the acts, and calling out to persons who have performed, or show signs of performing, such acts. In addition, early detection of people scuffling with each other, or of an imaged person falling or crouching, allows the facility manager to promptly protect persons requiring rescue within the facility and contributes to ensuring safety.
For example, in the image monitoring apparatus described in Patent Document 1, a delivery act in the monitoring area is detected by calculating information on the posture of the human body from images of the monitoring area. The monitoring importance is then calculated using the occurrence position and the type of the delivered article, and the detection result of the delivery act is output according to the monitoring importance. As the output of the detection result, a method of notifying the monitor of the monitoring center by a screen display, an alarm lamp, an alarm sound, or the like is described.
Further, Patent Document 2 discloses an image processing apparatus that, for the purpose of reducing the burden of monitoring video and making it easier to capture the moment something occurs, has: acquisition means for acquiring a plurality of videos distributed from a plurality of video distribution devices, object position detection means for detecting the positions of the objects included in each video, direction detection means for detecting the orientation of each object at those positions, setting means for setting a priority for each video based on the orientations, and display means for displaying the videos based on the priority.
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2017-028561 (特開2017-028561号公報); Patent Document 2: Japanese Unexamined Patent Application Publication No. 2017-017441 (特開2017-017441号公報)
When the above-mentioned conventional techniques are used, excessive alerts may be issued even for persons whose monitoring importance is not actually high, so workers involved in the monitoring work must respond to the reported events and their work load increases. In addition, when outputting results after event detection, if there are many actors with high monitoring importance, the video analysis system must perform person tracking, action recognition, and the like for all of them, so the processing load increases with the number of persons. The present invention therefore aims to provide a video analysis system that sets a monitoring importance for each of the plurality of persons who have interacted in the monitoring area and accurately narrows down the persons of high monitoring importance, thereby reducing the work load of the responders in monitoring and the processing load of the system.
A video analysis system as one aspect of the present invention is a video analysis system that detects an event in a surveillance area using video captured of the surveillance area, and comprises: an interaction detection unit that detects, based on the video, an interaction, which is an event caused by the involvement of a plurality of persons, and outputs the type of the interaction and the direction of the interaction indicating how each of the plurality of persons was involved with the other persons in the interaction; a monitoring importance determination unit that compares the type and direction of the interaction with preset monitoring reference information to determine a monitoring importance for each of the plurality of persons involved in the interaction; and an output control unit that outputs the detection result of the event based on the monitoring importance.
The present invention also provides a video analysis method for detecting an event in a surveillance area using video captured of the surveillance area, including: an interaction detection step of detecting, based on the video, an interaction caused by the involvement of a plurality of persons and outputting the type of the interaction and the direction of the interaction indicating how each person was involved with the other persons; a monitoring importance determination step of comparing the type and direction of the interaction with preset monitoring reference information to determine a monitoring importance for each of the plurality of persons involved in the interaction; and an output control step of outputting the detection result of the event based on the monitoring importance.
According to the present invention, it is possible to set a monitoring importance for each of a plurality of persons who have interacted in the monitoring area, and to reduce both the work load of the responders in monitoring and the processing load of the system.
FIG. 1 is an explanatory diagram of the video surveillance system according to the present embodiment.
FIG. 2 is a diagram showing the overall configuration of the video surveillance system according to the present embodiment.
FIG. 3 is a block diagram of the video analysis system according to the present embodiment.
FIG. 4 is a flowchart of the video analysis system according to the present embodiment.
FIG. 5 is a diagram showing the data structure of the monitoring reference information according to the present embodiment.
FIG. 6 is a diagram showing an example of a setting screen for the monitoring reference information according to the present embodiment.
FIG. 7 is a diagram showing an example of a setting screen for the monitoring reference information according to the present embodiment.
FIG. 8 is a diagram showing a display example of the video display unit according to the present embodiment.
FIG. 9 is a diagram showing a display example of the video display unit according to the present embodiment.
FIG. 10 is a diagram showing a display example of the search unit according to the present embodiment.
FIG. 11 is a diagram showing a display example of the video display unit according to the present embodiment.
FIG. 12 is a diagram showing the hardware configuration of the video surveillance system according to the present embodiment.
Embodiments of the video surveillance system according to the present invention are described below. The present embodiment aims at early detection of terrorist or dangerous acts in event venues and public facilities such as stations and airports by detecting coordinated actions performed by multiple persons in the surveillance area, that is, interactions such as handing over an article or scuffling. According to the present invention, a monitoring importance is set for each person who took part in an interaction, so that the events to be handled can be prioritized; this helps observers and on-site staff respond to events efficiently and quickly. When the available staff are fewer than the persons requiring a response, the risk of missing a highly important person is reduced. Furthermore, post-event processing such as person tracking and action recognition can be performed starting from the persons with the highest monitoring importance, so limited computing resources can be allocated appropriately.
An "event" in the present embodiment is a situation preset as a detection target in a given monitoring area. In particular, the present embodiment targets interactions, that is, events that arise from the involvement of a plurality of persons. Interactions include, for example, handshakes, handing over luggage, scuffles, and assaults. Examples are described below with reference to the drawings.
FIG. 1 is an explanatory diagram of the video surveillance system according to the present embodiment. As shown in FIG. 1, the video surveillance system 1 is broadly divided into an imaging system 2, a video analysis system 3, and a monitoring center system 4. The imaging system 2 consists of camera units installed in the monitored area. The video analysis system 3 analyzes the video input from the imaging devices to determine the interactions between persons, which are the detection targets, and the attributes of those persons; it then determines the monitoring importance of each person by comparing information on the occurrence position and the direction of the interaction against preset monitoring reference information. The monitoring center system 4 receives the analysis results from the video analysis system 3, presents them effectively to observers and on-site staff, and supports post-event searches for interactions and persons.
Here, the direction of an interaction indicates which person performed the predetermined action associated with the interaction on which other person. For a handover, for example, the direction is set from the person who handed over the article (the performer of the handover) to the person who received it (the recipient). Likewise, for an assault, the direction is set from the assailant (the performer) to the victim (the person acted upon). The direction of an interaction is thus defined per interaction type. Some interactions, such as handshakes and scuffles, are bidirectional.
Person attributes include, for example, whether the person is an ordinary visitor or a security guard, as well as age and gender.
By using the direction of the interaction and the attributes of the persons involved, the video surveillance system 1 sets a monitoring importance for each person, reducing the workload of the responders who handle monitoring and the processing load of the system.
This point is explained with a concrete example. A calculation method that derives a person's monitoring importance from the occurrence position and the type of the delivered article does not distinguish between the monitoring importance of the person who handed the article over and the person who received it. If, however, the article is one requiring attention in the monitoring area, the person who received the article should be assigned a higher monitoring importance than the person who handed it over. Similarly, when the monitoring area contains zones with different security levels, a handover from a low-security zone to a high-security zone should be judged differently from a handover in the opposite direction.
Also, if the person who performed the handover is a security guard or another worker engaged in security duties in the monitoring area (security personnel), the monitoring importance should not be set high. If such attributes are not taken into account, the person whose handover truly requires monitoring and the security personnel are judged to be equally important. The observers must then deal with multiple persons whose system-assigned importance shows no difference, which increases their burden. Moreover, if the on-site staff and security guards available to respond are fewer than the persons requiring a response, an important target may be missed.
For these reasons, the video surveillance system 1 achieves an accurate determination of monitoring importance by using not only the type of the interaction and its occurrence position but also information on the direction of the action and the attributes of the persons.
The imaging system 2, the video analysis system 3, and the monitoring center system 4 are described in detail below.
FIG. 2 shows the overall configuration of the video surveillance system according to the present embodiment. The imaging system 2 consists of one or more camera units 21 installed in the monitored area, and the captured video is sequentially input to the video input unit 31 of the video analysis system 3. A camera unit 21 is a surveillance camera arranged so that the entire area to be monitored can be imaged. When no area setting is needed for interaction detection, the surveillance camera may be a non-fixed, mobile camera; any type will do as long as it can image the area to be monitored. When area settings are required, it is desirable to use a surveillance camera fixed to a wall or pillar and calibrated in advance. In such a case a fixed camera without pan-tilt-zoom (PTZ) capability is typically assumed, but a PTZ-capable camera may be used if its PTZ settings and calibration have been adjusted together beforehand, and the same camera may then monitor various areas.
The camera unit 21 and the video input unit 31 are connected by wired or wireless communication means, and the camera unit 21 continuously transmits frame images to the video input unit 31. When interaction recognition is performed by a time-series analysis model that assumes a sequence of frame images as input, the frame rate of this continuous transmission should be at least the rate required by the interaction recognition. If the accuracy loss caused by a frame rate below the required value is acceptable, the frame rate may fall below it; in that case, the interaction recognition may apply processing that suppresses the accuracy loss, such as interpolation or extrapolation of the time-series data. The camera units 21 and the video analysis system 3 need not be in one-to-one correspondence; multiple camera units may share a single video analysis system. Even when such multiple processes run concurrently, the frame rate required by each process is subject to the constraints above. The camera unit 21 may also incorporate some or all of the functions of the video analysis system described later.
The video analysis system 3 consists of a video input unit 31, a video processing unit 32, and a storage unit 33. The video input unit 31 receives video from the camera unit 21 and sends the video data to the video processing unit 32. The video to be analyzed need not be input directly from the camera unit 21; it may be video stored separately in a recorder, and the storage location does not matter. The video processing unit 32 reads the monitoring reference information stored in the storage unit 33, described later, and analyzes the video received from the video input unit 31 to determine the monitoring importance of each individual who took part in an interaction. The storage unit 33 stores the monitoring reference information set via the management control unit 43, described later; this information is used to determine the monitoring importance output by the video processing unit 32. The video analysis system 3 is not limited to an on-premises system built on servers inside the operating facility; it may also be built on servers outside the facility, for example using a cloud service.
The monitoring center system 4 consists of a recording unit 41, a video display unit 42, a management control unit 43, and a search unit 44. The recording unit 41 holds, as a database, the information obtained by the video analysis of the video analysis system 3, such as the detected interaction, its direction, the person attributes, the area of occurrence, and the time of occurrence. The video display unit 42 displays, according to the monitoring importance, the current behavior of the persons who took part in an interaction and some or all of the frames at the time the interaction was detected. The management control unit 43 provides a function by which observers, on-site staff, and others enter setting information into the storage unit 33 so that the monitoring reference information used by the video processing unit 32 can be stored. The search unit 44 searches the information stored in the recording unit 41 for matching persons, using person attributes and interaction types as a query, and can look up a matching person's current position and the trajectory of their movement through the facility up to that point.
FIG. 12 is a hardware configuration diagram of the video surveillance system according to the present embodiment. In FIG. 12, a camera unit 1102 is connected to a computer 1103 via a network, and the computer 1103 can in turn communicate with a computer 1104 via a network.
One or more camera units 1102 are installed in the monitoring area and transmit video data to the computer 1103 as appropriate. The computer 1103 includes a CPU (Central Processing Unit) as an arithmetic control device, RAM (Random Access Memory) as main storage, and an HDD (Hard Disk Drive) as auxiliary storage. The computer 1103 implements the functions of the video analysis system 3 by reading programs from the HDD, loading them into the RAM, and executing them on the CPU. The computer 1103 also communicates with the camera unit 1102 and the computer 1104 via predetermined communication interfaces (IFs). Although not shown, input/output devices such as a keyboard and a display are likewise connected to the computer 1103 via predetermined IFs.
The computer 1104 includes a CPU as an arithmetic control device, RAM as main storage, and an HDD as auxiliary storage; it implements the functions of the monitoring center system 4 by reading programs from the HDD, loading them into the RAM, and executing them on the CPU. The computer 1104 is connected via predetermined interfaces (IFs) to the computer 1103 and to input/output devices such as a keyboard and a display.
Next, the video analysis system 3 is described in detail with reference to FIG. 3. FIG. 3 is a block diagram of the video analysis system according to the present embodiment. The video input unit 31, the video processing unit 32, and the storage unit 33 that constitute the video analysis system 3 are described below.
The video input unit 31 sequentially receives video from one or more camera units 21 and outputs it to the downstream video processing unit 32. If the video processing unit 32 does not handle time-series information, the input may be a single image.
The video processing unit 32 consists of a calculation unit 321, an interaction detection unit 322, a monitoring importance determination unit 323, and an output control unit 324.
The calculation unit 321 in turn consists of a person detection unit 3211 and an attribute determination unit 3212.
The person detection unit 3211 detects persons in the still image of the current frame using the image or video received from the video input unit. Persons may be detected, for example, with Haar-like features, with R-CNN (Regions with CNN) and related methods, or by deriving a region from the skeleton coordinates estimated per person by a skeleton estimation method; the present embodiment does not depend on the particular means. After detection, the person detection unit 3211 also performs person tracking. For tracking, it suffices that a person's rectangular image and the person ID assigned to that person are associated across consecutive frames; any common person tracking method such as template matching or optical flow may be used.
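For illustration, the following is a minimal sketch of frame-to-frame person association, assuming detections are axis-aligned boxes (x1, y1, x2, y2). It uses greedy matching on bounding-box overlap (IoU) rather than the template matching or optical flow mentioned above, so it is only a simplified stand-in for a real tracker.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

class SimpleTracker:
    """Greedy IoU association of detections across consecutive frames."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}      # person_id -> last known box
        self.next_id = 0

    def update(self, detections):
        assigned = {}
        unmatched = list(self.tracks.items())
        for box in detections:
            # Pick the existing track with the largest overlap with this detection.
            best = max(unmatched, key=lambda t: iou(t[1], box), default=None)
            if best is not None and iou(best[1], box) >= self.iou_threshold:
                pid = best[0]
                unmatched.remove(best)
            else:
                pid, self.next_id = self.next_id, self.next_id + 1
            assigned[pid] = box
        # Tracks without a matching detection are simply dropped in this sketch.
        self.tracks = assigned
        return assigned       # person_id -> box for the current frame
```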
Next, the rectangular person images obtained by the person detection unit are passed to the attribute determination unit 3212, which determines each person's attributes. Person attributes contribute to the determination of each individual's monitoring importance and can also be used as a query in the search unit 44 described above. Examples of attributes include security guards or staff of the facility, ordinary facility users, and age and gender. If guards and staff can be distinguished from ordinary users, interactions caused by guards or staff can be assumed to be actions taken in the course of their duties, so no monitoring importance needs to be set for them and unnecessary alarms can be avoided. Estimating the age and gender of ordinary users is also useful: if, for instance, prior statistics indicate that persons in a particular age group are more likely to engage in behavior requiring attention, effective video monitoring becomes possible by setting a higher monitoring importance for that age group. In amusement or event facilities it is also effective to match users against pre-registered persons requiring attention or persons banned from the premises. Attributes can be estimated, for example, by converting the person's rectangular image into image features such as HOG (Histograms of Oriented Gradients), SIFT (Scale-Invariant Feature Transform), or vectors output from an intermediate layer of a trained deep learning model, and training a classifier such as an SVM (Support Vector Machine) or a decision tree on them, or by end-to-end classification with a CNN (Convolutional Neural Network). These classifiers may be built in two stages, a first stage that separates guards and on-site staff from general visitors and a second stage that determines the age and gender of general visitors, or they may be trained as a single classifier. Furthermore, by separately determining the articles a person carries, a person judged to be carrying prohibited or dangerous items can be given an attribute expressing that fact, which helps represent attributes appropriately.
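As one possible realization of the feature-plus-classifier approach described above, the following sketch assumes scikit-image for HOG extraction and scikit-learn for the SVM; the crop size, kernel, and attribute labels are illustrative assumptions rather than part of the embodiment.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage.feature import hog
from sklearn.svm import SVC

def person_features(person_crop):
    # Normalize the person rectangle to a fixed size and extract HOG features.
    gray = resize(rgb2gray(person_crop), (128, 64))
    return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_attribute_classifier(crops, labels):
    # labels are attribute strings, e.g. "staff", "general/youth", "general/adult".
    X = np.stack([person_features(c) for c in crops])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf

def estimate_attribute(clf, person_crop):
    return clf.predict([person_features(person_crop)])[0]
```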
The interaction detection unit 322 uses the information obtained from the person detection unit 3211 to determine whether an interaction has occurred, its type, and its direction. For any pair of persons, the determination may be based on image features, as described above, or on skeleton information. When skeleton information is used, possible features include quantities representing a person's posture computed from the skeleton estimated by the person skeleton detection means, quantities computed from the relative distances between arbitrary skeleton points of a person pair, and time-series quantities expressing the movement of the skeleton per unit time or the change in relative distance across consecutive frames. Attribute information obtained from the attribute determination unit 3212, such as features expressing age or gender, may also be used. Features may also describe not just the persons but the articles they carry; for example, if an article judged to be owned by one person is judged to be owned by another person some time later, a handover can be interpreted as having occurred at the moment the ownership judgment switched. These features may be used alone or in combination. For instance, judging interactions only from posture features risks false detections even between persons located far apart, but combining them with relative-distance features reduces such false detections. Using features alone or in combination in this way enables effective interaction detection.
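A minimal sketch of combining posture and relative-distance features for a person pair is shown below, assuming each skeleton is an (N, 2) array of keypoint coordinates; the keypoint indices used for the wrists and the exact normalization are illustrative assumptions.

```python
import numpy as np

def posture_feature(keypoints):
    # Joint offsets relative to the keypoint centroid, scale-normalized,
    # as a simple posture descriptor for one person.
    kp = np.asarray(keypoints, dtype=float)
    offsets = kp - kp.mean(axis=0)
    scale = np.linalg.norm(offsets, axis=1).max() or 1.0
    return (offsets / scale).ravel()

def pair_feature(kp_a, kp_b, wrist_idx=(9, 10)):
    # Concatenate both postures with wrist-to-wrist distances, which are
    # informative for handover-like interactions. The wrist indices depend on
    # the keypoint convention used and are assumed here.
    kp_a, kp_b = np.asarray(kp_a, dtype=float), np.asarray(kp_b, dtype=float)
    wrist_dists = [np.linalg.norm(kp_a[i] - kp_b[j])
                   for i in wrist_idx for j in wrist_idx]
    return np.concatenate([posture_feature(kp_a),
                           posture_feature(kp_b),
                           np.asarray(wrist_dists)])

# Vectors of this kind could be fed to an SVM or decision tree, or stacked
# over a window of frames for an LSTM, to classify interaction type and direction.
```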
When image information is used, methods based on CNNs are one option. When skeleton information is used, classifiers such as an SVM, a decision tree, or an LSTM (Long Short-Term Memory) network can be trained on features representing posture, relative distance, attributes, and so on.
The monitoring importance determination unit 323 takes as input the person attributes obtained from the attribute determination unit 3212, the interaction type and direction obtained from the interaction detection unit 322, and, in the present embodiment, the area information on where each person is located obtained from the person detection unit 3211, and sets a monitoring importance for each individual who took part in the interaction by matching these against the information set in the monitoring reference information 331. The area information can be determined by matching the preset area definitions against the person rectangle; for example, if areas are defined in image coordinates for a camera with a given PTZ setting, the area in which a person is located can be determined from which area the estimated position of the person's feet falls within.
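One way the foot-position-to-area lookup could be implemented is sketched below, assuming each area has been preset as a polygon in image coordinates for a fixed camera view; the area names and polygon coordinates are hypothetical.

```python
def point_in_polygon(pt, polygon):
    # Ray-casting test: counts crossings of a horizontal ray starting at pt.
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical area definitions in image coordinates for one camera view.
AREAS = {
    "inside entrance gate": [(0, 400), (640, 400), (640, 720), (0, 720)],
    "shop":                 [(640, 400), (1280, 400), (1280, 720), (640, 720)],
}

def area_of_person(box):
    # Use the bottom center of the person rectangle as the estimated foot position.
    x1, y1, x2, y2 = box
    foot = ((x1 + x2) / 2, y2)
    for name, poly in AREAS.items():
        if point_in_polygon(foot, poly):
            return name
    return None
```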
The output control unit 324 transmits the per-person monitoring importance determined by the monitoring importance determination unit 323 to the monitoring center system 4. All events for which a monitoring importance was calculated may be transmitted, or a threshold may be preset so that only events with a high monitoring importance are transmitted.
The storage unit 33 stores the monitoring reference information 331 used by the monitoring importance determination unit 323. The monitoring reference information 331 holds three kinds of security level settings: an interaction security level set per interaction type, an attribute security level set per attribute type, and an area security level set per area type. It further holds a weight for each of these security level settings and, per interaction type, a weight for the performer or the person acted upon. The monitoring reference information 331 can be set from the management control unit 43.
Next, the processing flow of the video analysis system in the present embodiment is described with reference to the flowchart in FIG. 4.
When video is input from the imaging system to the video analysis system in step S1, person detection is performed in step S2.
Next, the number of persons is counted in step S3. If two or more persons are detected in the frame, the process proceeds to step S4; if one or no persons are detected, the processing from step S4 onward is skipped, the system waits for the next frame, and the process returns to step S1. When areas where interaction detection is desired or permitted and areas where it is not are mixed within the frame, a mask may be applied to part of the monitored area immediately before step S2 to reduce the amount of computation.
In step S4, interaction determination is performed. The determination is made for every pair of persons in the frame, but to reduce computation it is preferable not to run the determination for pairs separated by more than a certain distance, defined per interaction type. For example, when detecting a handover, there is no need to judge whether a handover occurred between two persons who are clearly out of each other's reach. When the relative distance between persons is checked before the action determination, it must be computed in the world coordinate system. This can be done either with preset area information, or by estimating the persons' positions in world coordinates without prior setup using depth estimation techniques with a stereo or monocular camera, and then computing the relative distance. A separate setting table can hold the determination thresholds; for example, if the threshold is set to 3 m, the action determination is not run for person pairs whose relative distance in the world coordinate system exceeds 3 m. If the classifier is not a multi-class classifier covering many interaction types but a set of two-class classifiers trained per interaction type, the threshold can also be set per interaction type. In addition, when detecting interactions that cross area boundaries, it is useful for both computation reduction and false-detection reduction to skip the interaction determination for persons within the same area and to judge only interactions between persons located in different areas.
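The pair pruning described above might look like the following sketch, assuming world-coordinate positions per person are already available and using a hypothetical per-interaction-type threshold table in meters.

```python
from itertools import combinations
from math import dist

# Hypothetical per-interaction-type distance thresholds in meters.
PAIR_DISTANCE_THRESHOLDS = {"handover": 3.0, "scuffle": 1.5, "assault": 2.0}

def candidate_pairs(world_positions, interaction_type):
    """Yield person-ID pairs close enough to be worth classifying.

    world_positions: dict person_id -> (x, y) in world coordinates (meters).
    """
    limit = PAIR_DISTANCE_THRESHOLDS[interaction_type]
    for (pid_a, pos_a), (pid_b, pos_b) in combinations(world_positions.items(), 2):
        if dist(pos_a, pos_b) <= limit:
            yield pid_a, pid_b
```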
If the interaction determination concludes that an interaction took place, that is, that an event has occurred, the branch at step S5 leads to the processing from step S6 onward. If it is determined that no event occurred, the processing from step S6 onward is skipped, the system waits for the next frame, and the process returns to step S1.
In steps S6 to S8, attributes are calculated for each person who caused the event. Then, in steps S9 to S11, the monitoring importance is determined for each detected event. After output control is applied to the determined monitoring importance in step S12, the system waits for the next frame and returns to step S1.
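The flow from step S1 to step S12 can be summarized roughly as follows; the helper functions (detect_persons, classify_interactions, estimate_attribute, compute_importance, send_to_monitoring_center) are hypothetical placeholders for the components described in this embodiment, not actual API names.

```python
def process_stream(frames, monitoring_reference):
    for frame in frames:                                 # S1: video input
        persons = detect_persons(frame)                  # S2: person detection
        if len(persons) < 2:                             # S3: need at least two persons
            continue
        events = classify_interactions(frame, persons)   # S4: interaction determination
        if not events:                                   # S5: no event -> next frame
            continue
        for event in events:                             # S6-S8: per-person attributes
            for person in event.participants:
                person.attribute = estimate_attribute(frame, person)
        for event in events:                             # S9-S11: per-event importance
            event.importance = compute_importance(event, monitoring_reference)
        send_to_monitoring_center(events)                # S12: output control
```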
The processing of the flow shown in this figure does not necessarily have to run as a single process; it may be processed asynchronously by multiple processes to improve computational efficiency.
Next, an example of setting the monitoring reference information in the present embodiment is described with reference to FIG. 5. The monitoring reference information in this embodiment consists of the security level settings in Table 51, the weights for the security level targets in Table 52, and the per-interaction-type performer weights in Table 53. From Tables 51 and 52, a weighted score is calculated for each individual, taking into account the interaction type, the person attributes, and the area of occurrence; the monitoring importance is then calculated from that weighted score and the performer weight in Table 53. This information is set from the management control unit 43, stored in the monitoring reference information 331, and read by the monitoring importance determination unit 323. The settings and effects of each table are described in detail below.
Table 51 consists of three security level setting tables: Table 511, which sets the security level per interaction type; Table 512, which sets the security level per attribute type; and Table 513, which sets the security level per area type. In this embodiment the security level takes four grades from 3 points to 0 points, labeled in descending order "high", "medium", "low", and "none". "High" denotes the targets requiring the most attention, and "none" denotes targets requiring no attention.
In the interaction security level settings of Table 511, an importance is set per interaction type, for example 1 point for "handover" and 3 points for "assault". Likewise, in the attribute security level settings of Table 512, "staff" is set to 0 points and "banned person" to 3 points, and in the area security level settings of Table 513, "inside entrance gate" is set to 3 points and "shop" to 2 points.
For each security level, items not subject to setting are explicitly set to level 0; for example, in the interaction security level settings, "handshake" and "hug" are set to level 0. In this embodiment each item in Tables 511, 512, and 513 is scored on four grades from 3 points to 0 points, but the number of grades is not limited to this embodiment and should be freely configurable by the setter.
Table 52 is a setting table that stores the weight of each of the three security level targets. In the example of Table 52 the three weights sum to 100%, with 30% for the interaction, 20% for the attribute, and 50% for the area. In this embodiment these weights are used to compute a weighted score from the points in parentheses in Table 51, which is then used to calculate the monitoring importance for each individual. For example, if a "general/youth" person performs a "handover" "inside the entrance gate", the weighted score given to that person is, from Table 52, 1 × 0.3 + 2 × 0.2 + 3 × 0.5 = 2.2 points. Making the weight of each of the three security level targets configurable allows flexible adaptation to the differing needs of each facility; for example, if only the interaction type and the person attributes matter and the area does not, the weights can be set to 70% for the interaction, 30% for the attribute, and 0% for the area. When one or more of the targets scores 0 points, the individual's weighted score may also be set to 0 points; for example, if a person whose attribute is judged to be "security guard" performs any interaction, the interaction was presumably performed in the course of their duties, and it is generally inappropriate to make the "security guard" a monitoring target. The formula above is one example of how the weighted score may be calculated; a different formula may be used.
Table 53 is a setting table for calculating the monitoring importance from the weighted score computed with Tables 51 and 52, taking the direction of the interaction into account. Referring to Table 53, the weight for the performer of a "handover", that is, the person who handed over the article, is set to 20%, while the weight for the person acted upon, that is, the person who received the article, is set to 80%. A setting in which the performer has the smaller weight, as in this example, means that the person acted upon is more important than the performer: in a handover, the person who received the article is assumed to be more important than the person who gave it up. The "scuffle", a bidirectional interaction, is set to 50% in Table 53 so that the performer and the person acted upon carry equal weight. For "assault", where the roles of the performing assailant and the victim are clear, the performer weight is set to 90%. As an example of calculating the monitoring importance from the weighted score: when a "general/youth" person performs a "handover" "inside the entrance gate", the weighted score is 2.2 points as computed above from Tables 51 and 52, and with the performer weight for "handover" set to 20%, the monitoring importance of the performer is 2.2 × 0.2 = 0.44.
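A minimal sketch of this calculation is shown below; the table contents mirror the example values quoted above from Tables 51 to 53, and everything else (the dictionary names and the zero-level rule) is an illustrative assumption.

```python
# Security levels per target (Table 51, excerpt).
INTERACTION_LEVEL = {"handover": 1, "scuffle": 2, "assault": 3, "handshake": 0}
ATTRIBUTE_LEVEL   = {"staff": 0, "general/youth": 2, "banned person": 3}
AREA_LEVEL        = {"inside entrance gate": 3, "shop": 2}

# Weights of the three targets (Table 52) and performer weights (Table 53).
TARGET_WEIGHTS   = {"interaction": 0.3, "attribute": 0.2, "area": 0.5}
PERFORMER_WEIGHT = {"handover": 0.2, "scuffle": 0.5, "assault": 0.9}

def weighted_score(interaction, attribute, area):
    levels = (INTERACTION_LEVEL[interaction], ATTRIBUTE_LEVEL[attribute], AREA_LEVEL[area])
    if 0 in levels:
        return 0.0      # e.g. staff are never treated as monitoring targets
    return (levels[0] * TARGET_WEIGHTS["interaction"]
            + levels[1] * TARGET_WEIGHTS["attribute"]
            + levels[2] * TARGET_WEIGHTS["area"])

def monitoring_importance(interaction, attribute, area, is_performer):
    w = PERFORMER_WEIGHT[interaction]
    role_weight = w if is_performer else 1.0 - w
    return weighted_score(interaction, attribute, area) * role_weight

# Worked example: a "general/youth" performer handing over an article inside the
# entrance gate -> weighted score 2.2, monitoring importance 2.2 * 0.2 = 0.44.
print(round(monitoring_importance("handover", "general/youth", "inside entrance gate", True), 2))
```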
When a person who caused an event with a high monitoring importance subsequently causes an event with a low monitoring importance, the importance value must not be overwritten and thereby underestimated. Therefore, for a person who has caused multiple interactions, it is desirable either to accumulate the monitoring importances calculated for those events or to keep using the largest value, until the importance value for that person is reset.
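As a sketch of this retention rule, one possible approach (assumed here) is to keep, per person ID, the maximum importance observed since the last reset:

```python
person_importance = {}   # person_id -> highest monitoring importance since last reset

def update_importance(person_id, new_value):
    # Keep the larger value so a later, minor event cannot mask an earlier, serious one.
    person_importance[person_id] = max(person_importance.get(person_id, 0.0), new_value)

def reset_importance(person_id):
    # Called, for example, after staff have finished handling the person's case.
    person_importance.pop(person_id, None)
```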
It is also possible to use the types and directions of multiple interactions involving the same person together, in order to determine that person's monitoring importance or to vary the response. For example, consider a case where pedestrians bumping into each other is defined as an interaction of type "pedestrian collision", with the direction set from the person who bumped (the performer) to the person who was bumped (the person acted upon). If the record of interactions involving a certain person shows frequent involvement in "pedestrian collision" events, always on the performer side, it is likely that this person is bumping into others deliberately; in that case the monitoring importance should be set high and security personnel dispatched immediately the next time a similar "pedestrian collision" occurs. If, on the other hand, the person is frequently involved in "pedestrian collision" events but the direction is not consistent (performer and person acted upon about equally often), poor physical condition or a similar cause is possible; in that case it is preferable to set the monitoring importance high and dispatch rescue personnel when behavior such as crouching is observed.
The monitoring importance calculated as above is transmitted to the output control unit 324. To limit the number of events sent to the monitoring center system 4, the output control unit can apply a threshold to the monitoring importance. For example, if the threshold is set to 2.0, a monitored person with a calculated importance of 1.5 is not transmitted, while an individual with a calculated importance of 3.0 is. Although not shown in this embodiment, the persons to be notified, such as on-site staff or security guards, may also be designated according to the monitoring importance score.
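This output control could be sketched as a simple threshold filter, with the threshold value taken from the example above:

```python
IMPORTANCE_THRESHOLD = 2.0

def select_events_to_send(scored_persons):
    """Forward only persons whose monitoring importance reaches the threshold.

    scored_persons: iterable of (person_id, importance) pairs for the current frame.
    """
    return [(pid, score) for pid, score in scored_persons if score >= IMPORTANCE_THRESHOLD]
```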
Next, the GUI (Graphical User Interface) for setting the monitoring reference information of FIG. 5 is described with reference to FIGS. 6 and 7. FIGS. 6 and 7 show example setting screens for the monitoring reference information in the present embodiment; FIG. 6 is the setting screen for creating Table 51, and FIG. 7 is the setting screen for creating Tables 52 and 53.
FIG. 6 is a GUI for setting the security levels for interaction types, person attributes, and areas; the interaction security level settings shown in region 611 are described here in particular. The interaction security level in region 611 has three grades from 1 point to 3 points, but as noted above the number of grades is not limited to this embodiment and should be freely configurable by the setter, as should the magnitude of the security levels. Interactions for which no security level is set may be explicitly marked with an importance of 0 points. The setter presses the pull-down field shown in region 6111 and, for each security level column, selects a registered interaction from the list. Pressing the "Add" button in region 6112 adds the selected interaction type to the "registered interactions" list below. To remove a registered interaction after it has been added, the setter presses the checkbox in region 6113 corresponding to the interaction to be deleted and then presses region 6114. Pressing the "Save settings" button in region 6115 reflects the information set in this region into the monitoring reference information 331; pressing region 6116 displays the information set in this region for confirmation. The attribute security level settings in region 612 and the area security level settings in region 613 are configured in the same way. However, unlike the interaction and attribute types, whose options are fixed by the system, area types themselves can be added separately so that areas the system does not anticipate can also be handled.
Region 62 in FIG. 7 is a GUI for setting the weights of the security level targets, and Table 52 is configured through this region. In region 621 the weights for the interaction, the attributes, and the area are set as percentages. Pressing the "Save settings" button in region 622 reflects the information set in this region into the monitoring reference information 331; pressing region 623 displays the information set in this region for confirmation.
Region 63 is a GUI for setting the performer weight per interaction type, and Table 53 is configured through this region. In region 631 the performer-side weight is set as a percentage for each interaction type. The weight for the person acted upon may be calculated automatically from the entered "performer weight" and displayed, or the input may instead be entered for the person acted upon. The interactions for which a setting is required in region 631 are all the actions registered in region 611 that have been given a score greater than 0 points; the corresponding rows are added automatically at the moment of registration by pressing region 6115. Pressing the "Save settings" button in region 632 reflects the information set in this region into the monitoring reference information 331; pressing region 633 displays the information set in this region for confirmation. However, to avoid the inconsistency of an interaction registered in region 611 having no performer-side weight registered in region 63, it is desirable to implement, for example, a warning about incomplete entries displayed when the whole setting screen is closed.
Next, an example of the screen notifying the observer of detected events is described with reference to FIG. 8. FIG. 8 shows a display example of the video display unit 42 in the present embodiment; region 7 shows the output screen. Region 7 may occupy the entire notification screen or only part of it.
The persons shown in screen A (region 71), screen B (region 72), screen C (region 73), and screen D (region 74) of region 7 are persons for whom a monitoring importance has been set and transmitted to the monitoring center system 4. This embodiment shows an example in which video at the current time is displayed for all of these persons; the video used here is real-time video based on tracking of the persons who caused the events. Person tracking is performed by the person detection unit 3211 as described above, and the video is sent to the video display unit 42 together with the information on the monitoring importance. Region 75 lists the persons shown on each screen in order of monitoring importance, together with the detected event, the place of occurrence, the time of occurrence, and the current position. The column sorted by monitoring importance may display the actual values output by the monitoring importance determination unit 323.
In this embodiment, the sizes of the regions for screens A to D change dynamically according to the determined monitoring importance. For example, when the monitoring importances of all persons have been reset and no person is shown on the screen, person a, who performed a "handover" at time 09:20:00, is displayed on the largest screen. If person b, who performed an "assault" at time 09:30:50, is then judged to have a higher monitoring importance than the person who performed the "handover", the display region for person b becomes larger than that for person a. To make a displayed person easy to distinguish from other persons captured on the same screen, it is desirable to apply image processing such as superimposing the person's detection frame or cropping the image. When screen size is limited and multiple events have occurred, the screen may be made scrollable.
In addition to the real-time tracking video described above, it is desirable to be able to switch to the scene at the time the event occurred by selecting a display screen or a row of region 75. FIG. 9 shows a display example of the video display unit in the present embodiment, illustrating the selection of one event in region 75 of FIG. 8. In FIG. 9, screen B (region 76) is selected, and the frame at the time of detection is displayed for the interaction caused by the tracked person. Since an interaction takes place over a span of time, it is desirable to display the frame with the highest determination confidence between the start and end of the interaction; alternatively, a short clip from the start to the end of the interaction may be made playable. This lets the observer grasp the circumstances in which the interaction took place. Furthermore, when staff have finished responding to an event confirmed in this way, when a response is judged unnecessary, or when a false detection is obvious, the display screen or the row of region 75 can be selected and deleted.
This screen can be viewed not only by notification recipients such as monitoring center observers, for whom a large display is assumed, but also by on-site staff and security guards, who can view part or all of region 7 on site using a smartphone, a tablet, AR goggles, or the like.
Next, search means using interaction types, attributes, and the like are described with reference to FIG. 10. Events for which observers or on-site staff have completed their response in the video display unit 42, and events once judged not to require a response, are deleted from the display output; to review such events later, a means of searching for events in the recording unit 41, the database of information on occurred events, is required.
FIG. 10 shows a display example of the search unit in the present embodiment; region 9 shows the entire search screen. Within region 9, region 91 shows the search input screen and region 92 shows the output screen.
In region 91, region 911 narrows down events using the interaction type, attributes, and time of occurrence as a query. Specifically, the user presses the pull-down field of each search item column in region 911 and selects a registered item from the list. Pressing the "Add" button adds the selected item to the "registered items" list below. To remove a registered item after it has been added, the user presses the checkbox of the item to be deleted and then presses region 912. Although this embodiment shows an example of searching with the interaction type, attributes, and time of occurrence as a query, searches may also use the interaction direction, area information, and so on. The columns do not all require input and may be left blank; for example, executing a search with all items blank outputs all information stored in the recording unit 41 as the search result.
 In area 92, area 921 displays information about each search result, such as the current location and the time of occurrence. As shown in area 921, if the person is confirmed to still be in the monitoring area at the current time and can be tracked, information about the current position may also be displayed. Area 922 displays the frame image corresponding to the information in area 921, or a short clip from the start to the end of the interaction. Area 921 and area 922 together form one search result, and in this embodiment the whole set of search results can be reviewed by scrolling. The search results may also be switchable to a grid view consisting only of the images or videos shown in area 922.
 By using the search unit 44, an event can be retrieved efficiently from the recording unit 41 even if it has already been deleted from the display output of the video display unit 42. In addition, similar cases can be searched for and their number of occurrences checked, which is useful for devising countermeasures and preventive measures against events expected to occur in the future.
 As described above, the video surveillance system 1 detects an interaction from the surveillance video and determines the monitoring importance for each individual using the type and direction of the interaction together with the person's attributes and area information. According to the present invention, instead of giving an identical monitoring importance to every person who took part in an interaction in the monitoring area, the monitoring importance is set per individual, so the observer can easily prioritize responses and deal with the actors efficiently.
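 The per-individual determination described above could take a form like the following sketch. The interaction types, roles, and numeric weights are illustrative assumptions, not values defined by the system; the actual criteria would come from the monitoring reference information configured per site.

```python
# Hypothetical weights: (interaction type, role of the person in the interaction)
# -> base importance. The real criteria are held as monitoring reference
# information and would be configured per site.
BASE_IMPORTANCE = {
    ("handover", "giver"): 3,
    ("handover", "receiver"): 2,
}
AREA_SECURITY_LEVEL = {"backyard": 2, "sales_floor": 1}   # security level per area
ATTRIBUTE_ADJUSTMENT = {"staff": -1, "visitor": 0}        # adjustment per attribute

def monitoring_importance(interaction_type: str, role: str,
                          attribute: str, area: str) -> int:
    """Score one person involved in an interaction, combining the interaction
    type, that person's role (the direction of the interaction as it concerns
    this person), the person's attribute, and the security level of the area."""
    score = BASE_IMPORTANCE.get((interaction_type, role), 0)
    score += AREA_SECURITY_LEVEL.get(area, 0)
    score += ATTRIBUTE_ADJUSTMENT.get(attribute, 0)
    return max(score, 0)
```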
 Another embodiment of the video surveillance system 1 of the present invention will now be described. Descriptions of features shared with the embodiment described above are omitted, and only the processing specific to this embodiment is explained.
 In the display example of the video display unit 42 in the embodiment described above, all events for which a monitoring importance has been set at the current time are displayed in order of monitoring importance. Alternatively, a display method that focuses on only a single event is also conceivable.
 FIG. 11 shows a display example by the video display unit 42 in the present embodiment. For an interaction between two persons for which monitoring importances have been set, it shows the information on each person in detail. Area 81 shows the frame image in which the interaction was detected. The frame image shows a handover between person 811 and person 812. Area 84 shows the attributes, the place of occurrence, and the current position of the two persons, as well as the direction of the handover and its place of occurrence. As described in area 84, the information on the two persons is displayed separately on screen X (area 82) and screen Y (area 83). Inside areas 82 and 83, areas 821 and 831 display the image of each person captured in area 81, with the magnification adjusted so that the observer can easily recognize the person. Areas 822 and 832 use a floor map to show the place where the event occurred, the current position of the person, and the movement trajectory. In this figure, the circle indicates the site of the event, the pentagon indicates the person's current position and direction of travel, and the dotted line indicates the movement trajectory connecting the site of the event to the current position. Areas 823 and 833 show the person's appearance at the current time. From the screens shown in areas 82 and 83, the movement trajectories of the performer and the person acted upon, from the occurrence of the event up to the current time, can be grasped.
 The following processing is required to display the movement trajectory on the floor map. First, after images of person 811 and person 812 have been acquired, person identification is performed on the frame images obtained from the camera at fixed time intervals or at fixed frame intervals. Next, to determine the person's position on the floor map, either preset area information is used, or the person's position in the world coordinate system is estimated without any advance setup by using depth estimation techniques based on a stereo camera or a monocular camera. By connecting the acquired position information and time information in chronological order, the person's movement trajectory can be displayed on the floor map.
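 A minimal sketch of the last step, connecting position and time information in chronological order, is shown below. It assumes that person identification and floor-map position estimation (from preset area information or depth estimation) have already produced per-frame observations; the data shape is an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    timestamp: float                  # capture time of the frame
    person_id: str                    # result of person identification across frames
    position: tuple[float, float]     # estimated floor-map (x, y) coordinates

def build_trajectory(observations: list[Observation],
                     person_id: str) -> list[tuple[float, float]]:
    """Connect one person's estimated positions in chronological order,
    producing the polyline drawn on the floor map (the dotted line in FIG. 11)."""
    own = sorted((o for o in observations if o.person_id == person_id),
                 key=lambda o: o.timestamp)
    return [o.position for o in own]
```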
 Although this figure shows, for both the performer and the person acted upon, the movement trajectory from the occurrence of the event onwards, if a person who has not yet caused an event is already being tracked, or can be tracked after the event by using video saved on a separately prepared storage medium, the movement trajectory of each person up to the site of the event may also be displayed. For example, in a handover, the performer who handed over the article may be shown with the trajectory up to the occurrence of the event, and the person who received the article with the trajectory after the event; video surveillance can then focus on the movement trajectory of the article rather than of the persons.
 As described above, according to the present embodiment, the video display unit can present a display that focuses on a single event. Specifically, it can display, for each person who took part in the interaction, the situation at the time the event occurred, the situation at the current time, and the movement trajectory, and it can further display the performers' movement trajectories up to the occurrence of the event. As a result, the observer can confirm at a glance not only whether the event occurred but also how the performers moved before and after it, and can therefore grasp the details of the event easily and accurately.
 As shown in each of the embodiments described above, the disclosed video analysis system 3 is a video analysis system that detects events in a monitoring area using video captured of the monitoring area. It comprises an interaction detection unit 322 that detects, based on the video, an interaction, that is, an event caused by the involvement of a plurality of persons, and outputs the type of the interaction and the direction of the interaction indicating how each of the plurality of persons was involved with the other persons in the interaction; a monitoring importance determination unit 323 that compares the type and direction of the interaction with preset monitoring reference information and determines a monitoring importance for each of the plurality of persons involved in the interaction; and an output control unit 324 that outputs the detection result of the event based on the monitoring importance. With this configuration and operation, a monitoring importance is set for each of the plurality of persons who took part in an interaction in the monitoring area, which reduces both the workload of the responders in monitoring and the processing load of the system.
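 Read as a processing flow, the three units above chain together roughly as in the following sketch. The class and function names are hypothetical and only illustrate how detection results, per-person importances, and output control relate to one another; they are not the implementation of units 322 to 324.

```python
from dataclasses import dataclass, field

@dataclass
class PersonResult:
    person_id: str
    attribute: str
    importance: int = 0              # filled in by the importance determination step

@dataclass
class Interaction:
    interaction_type: str            # e.g. a handover
    direction: dict                  # person_id -> role within the interaction
    area: str                        # area of occurrence
    persons: list[PersonResult] = field(default_factory=list)

def analyze(frames, detect, judge_importance, emit):
    """One analysis cycle: the interaction detection step (detect) finds
    interactions in the frames, the monitoring importance determination step
    (judge_importance) scores each involved person against the monitoring
    reference information, and the output control step (emit) decides the
    output (alerts, screen layout, trajectory generation) from those scores."""
    interactions = detect(frames)
    for interaction in interactions:
        for person in interaction.persons:
            person.importance = judge_importance(interaction, person)
    emit(interactions)
```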
 The system further comprises a calculation unit 321 that detects images of persons included in the video and calculates attribute features representing the attributes of the detected persons. Since the monitoring importance determination unit 323 also uses these attribute features to determine the monitoring importance, a highly accurate determination that takes the persons' attributes into account is possible.
 The interaction detection unit 322 detects the skeleton of a person based on the video and calculates at least one of: a posture feature representing the person's posture, calculated from the estimation result of the detected skeleton; a distance feature calculated from one or more distances between arbitrary body parts of different persons; a movement feature representing the amount of movement of the skeleton per unit time, calculated from the difference between successive image frames of the video; and an article feature expressing the ownership relation of an article to the person. Based on the calculated features, it detects the type and the direction of the interaction. With this configuration, a highly accurate determination is possible that takes into account the persons' postures, the distances between persons, the type of article, and so on.
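 As an illustration of two of the listed features, the sketch below computes a distance feature between arbitrary body parts of two persons and a movement feature from the skeleton difference between consecutive frames. Keypoints are assumed to be given as 2-D image coordinates keyed by joint name; the posture and article features would be built analogously, and nothing here is tied to a particular pose-estimation library.

```python
import math

Keypoints = dict[str, tuple[float, float]]   # joint name -> (x, y) in the image

def distance_feature(person_a: Keypoints, person_b: Keypoints,
                     part_pairs: list[tuple[str, str]]) -> list[float]:
    """Distances between arbitrary body parts of two persons
    (e.g. right wrist of A to right wrist of B for a handover)."""
    feats = []
    for part_a, part_b in part_pairs:
        (xa, ya), (xb, yb) = person_a[part_a], person_b[part_b]
        feats.append(math.hypot(xa - xb, ya - yb))
    return feats

def movement_feature(prev: Keypoints, curr: Keypoints, dt: float) -> float:
    """Average skeleton displacement per unit time between consecutive frames."""
    common = prev.keys() & curr.keys()
    if not common or dt <= 0:
        return 0.0
    total = sum(math.hypot(curr[j][0] - prev[j][0], curr[j][1] - prev[j][1])
                for j in common)
    return total / (len(common) * dt)
```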
 Furthermore, because the monitoring importance determination unit 323 uses information on the positions of the plurality of persons at the time the interaction occurred to determine the monitoring importance, implausible events can be excluded on the basis of the positional relationship and the determination accuracy can be improved.
 The video analysis system 3 also has a storage unit 33 that holds, as monitoring reference information, the interaction type, the interaction direction, and the security level of each area of occurrence, used to determine the monitoring importance of each person who caused an interaction to be detected. Holding this information in advance and using it as appropriate enables a simple and highly accurate determination.
 In addition, by providing a search unit 44 that can search the detection history of interactions using the interaction type and/or information about the person as a search query, and thereby find the person who caused an interaction, the detected and accumulated interactions can be put to effective use.
 The output control unit 324 also changes the size of the display on the display terminal according to the monitoring importance of each person who caused the interaction. This makes the importance easy to recognize and allows the amount of information to be controlled according to the importance.
 In addition, for each person who caused an interaction, displaying the movement trajectory before and after the interaction on the screen makes it possible to confirm that person's behavior around the interaction.
 Whether movement trajectories need to be generated for the plurality of persons may be determined based on the importance. In that case, movement trajectories can be output selectively for important persons.
 Furthermore, for the plurality of persons in a detected interaction, the direction of the interaction may be made explicit by displaying on the screen information indicating whether each person is the performer of, or the person subjected to, the predetermined action related to the interaction.
 In addition, by displaying on the screen the frame image in which the interaction was detected together with the current position and/or current video of the persons involved, the content of the interaction can be shown in association with the persons' current state.
 The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above are described in detail to explain the present invention in an easy-to-understand manner, and the invention is not necessarily limited to configurations having all of the described elements. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted. Each of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing part or all of them as integrated circuits. Each of the configurations, functions, and the like may also be realized in software, by a processor interpreting and executing a program that implements each function. Information such as programs, tables, and files for realizing each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
1 ... Video surveillance system
2 ... Imaging system, 21 ... Camera unit
3 ... Video analysis system, 31 ... Video input unit, 32 ... Video processing unit, 321 ... Calculation unit, 3211 ... Person detection unit, 3212 ... Attribute determination unit, 322 ... Interaction detection unit, 323 ... Monitoring importance determination unit, 324 ... Output control unit, 33 ... Storage unit, 331 ... Monitoring reference information
4 ... Monitoring center system, 41 ... Recording unit, 42 ... Video display unit, 43 ... Management control unit, 44 ... Search unit

Claims (13)

  1.  A video analysis system that detects an event in a monitoring area using video captured of the monitoring area, comprising:
     an interaction detection unit that detects, based on the video, an interaction, which is an event caused by the involvement of a plurality of persons, and outputs the type of the interaction and the direction of the interaction indicating how each of the plurality of persons was involved with the other persons in the interaction;
     a monitoring importance determination unit that compares the type and direction of the interaction with preset monitoring reference information and determines a monitoring importance for each of the plurality of persons involved in the interaction; and
     an output control unit that outputs a detection result of the event based on the monitoring importance.
  2.  The video analysis system according to claim 1, further comprising a calculation unit that detects an image of a person included in the video and calculates an attribute feature representing an attribute of the detected person,
     wherein the monitoring importance determination unit further uses the attribute feature to determine the monitoring importance.
  3.  The video analysis system according to claim 1, wherein the interaction detection unit detects the skeleton of the person based on the video and calculates at least one of:
     a posture feature representing the posture of the person, calculated from an estimation result of the detected skeleton;
     a distance feature calculated from one or more distances between arbitrary body parts of the persons;
     a movement feature representing an amount of movement of the skeleton per unit time, calculated from a difference between preceding and succeeding image frames of the video; and
     an article feature expressing an ownership relation of an article to the person,
     and detects the type of the interaction and the direction of the interaction based on the calculated feature.
  4.  The video analysis system according to claim 1, wherein the monitoring importance determination unit determines the monitoring importance using information on the positions of the plurality of persons at the time the interaction occurred.
  5.  The video analysis system according to claim 1, further comprising a storage unit that holds, as the monitoring reference information, information on the interaction type, the interaction direction, and the security level of each area of occurrence, used to determine the monitoring importance of each person who caused an interaction to be detected.
  6.  The video analysis system according to claim 1, further comprising a search unit capable of searching for the person who caused an interaction by searching a detection history of interactions using the interaction type and/or information about the person as a search query.
  7.  The video analysis system according to claim 1, wherein the output control unit changes the size of a display on a display terminal according to the monitoring importance of each person who caused the interaction.
  8.  The video analysis system according to claim 1, wherein the output control unit displays on a screen, for each person who caused the interaction, a movement trajectory before and after the interaction in time.
  9.  The video analysis system according to claim 1, wherein the output control unit determines, based on the monitoring importance, whether movement trajectories of the plurality of persons need to be generated.
  10.  The video analysis system according to claim 1, wherein the output control unit displays on a screen, for the plurality of persons in the detected interaction, information indicating whether each person is the performer of, or the person subjected to, a predetermined action related to the interaction.
  11.  The video analysis system according to claim 1, wherein the output control unit displays on a screen the frame image in which the interaction was detected, and further displays on the screen the current position and/or current video of a person involved in the interaction.
  12.  The video analysis system according to claim 3, wherein the interaction detection unit calculates the posture feature, the distance feature, the movement feature, and the article feature, and detects the type of the interaction between the detected plurality of persons and the direction of the interaction based on the posture feature, the distance feature, the movement feature, and the article feature.
  13.  A video analysis method, executed by a computer, for detecting an event in a monitoring area using video captured of the monitoring area, comprising:
     an interaction detection step of detecting, based on the video, an interaction, which is an event caused by the involvement of a plurality of persons, and outputting the type of the interaction and the direction of the interaction indicating how each of the plurality of persons was involved with the other persons in the interaction;
     a monitoring importance determination step of comparing the type and direction of the interaction with preset monitoring reference information and determining a monitoring importance for each of the plurality of persons involved in the interaction; and
     an output control step of outputting a detection result of the event based on the monitoring importance.
PCT/JP2021/005097 2020-09-15 2021-02-10 Video analyzing system and video analyzing method WO2022059223A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020154309A JP2022048475A (en) 2020-09-15 2020-09-15 Video analyzing system and video analyzing method
JP2020-154309 2020-09-15

Publications (1)

Publication Number Publication Date
WO2022059223A1 true WO2022059223A1 (en) 2022-03-24

Family

ID=80777422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/005097 WO2022059223A1 (en) 2020-09-15 2021-02-10 Video analyzing system and video analyzing method

Country Status (2)

Country Link
JP (1) JP2022048475A (en)
WO (1) WO2022059223A1 (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017225122A (en) * 2013-06-28 2017-12-21 日本電気株式会社 Video surveillance system, video processing apparatus, video processing method, and video processing program
JP2017028561A (en) * 2015-07-24 2017-02-02 セコム株式会社 Image monitoring system
JP2017046196A (en) * 2015-08-27 2017-03-02 キヤノン株式会社 Image information generating apparatus, image information generating method, image processing system, and program
JP2017135549A (en) * 2016-01-27 2017-08-03 セコム株式会社 Flying object monitoring system
JP2019029747A (en) * 2017-07-27 2019-02-21 セコム株式会社 Image monitoring system
JP2019193089A (en) * 2018-04-24 2019-10-31 東芝テック株式会社 Video analysis device
JP2019213116A (en) * 2018-06-07 2019-12-12 キヤノン株式会社 Image processing device, image processing method, and program

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116095363A (en) * 2023-02-09 2023-05-09 西安电子科技大学 Mobile terminal short video highlight moment editing method based on key behavior recognition

Also Published As

Publication number Publication date
JP2022048475A (en) 2022-03-28

Similar Documents

Publication Publication Date Title
JP2018173914A (en) Image processing system, imaging apparatus, learning model creation method, and information processing device
JP5674406B2 (en) Surveillance system, monitoring device, autonomous mobile body, monitoring method, and monitoring program using autonomous mobile body
WO2018116488A1 (en) Analysis server, monitoring system, monitoring method, and program
WO2014050518A1 (en) Information processing device, information processing method, and information processing program
JP7355674B2 (en) Video monitoring system and video monitoring method
JP2018160219A (en) Moving route prediction device and method for predicting moving route
KR102260123B1 (en) Apparatus for Sensing Event on Region of Interest and Driving Method Thereof
WO2018179202A1 (en) Information processing device, control method, and program
JP7145622B2 (en) Information processing device, information processing device control method, subject detection system, and program
JPWO2019220589A1 (en) Video analysis device, video analysis method, and program
JP2020014194A (en) Computer system, resource allocation method, and image identification method thereof
WO2022059223A1 (en) Video analyzing system and video analyzing method
US20210334758A1 (en) System and Method of Reporting Based on Analysis of Location and Interaction Between Employees and Visitors
JP2022053126A (en) Congestion status estimation device, method, and program
JP2021196741A (en) Image processing device, image processing method and program
JP7138547B2 (en) store equipment
KR102464196B1 (en) Big data-based video surveillance system
JP2020166590A (en) Monitoring system, monitoring device, monitoring method, and monitoring program
WO2021186610A1 (en) Digital/autofile/security system, method, and program
JP7246166B2 (en) image surveillance system
JP2004187116A (en) Action monitoring system and program
US20240135716A1 (en) Congestion degree determination apparatus, control method, and non-transitory computer-readable medium
JP6905553B2 (en) Information processing device, registration method, judgment method, and program
JP6997140B2 (en) Information processing equipment, judgment method, and program
JP7256082B2 (en) Surveillance Systems, Programs and Listing Methods

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21868911; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21868911; Country of ref document: EP; Kind code of ref document: A1)