WO2017155126A1 - Information transmitting system, information transmitting device, information receiving device, and computer program

Information transmitting system, information transmitting device, information receiving device, and computer program

Info

Publication number
WO2017155126A1
WO2017155126A1 (PCT/JP2017/010290)
Authority
WO
WIPO (PCT)
Prior art keywords
information
image
person
avatar
camera
Prior art date
Application number
PCT/JP2017/010290
Other languages
French (fr)
Japanese (ja)
Inventor
靖和 本玉
寛紀 山内
Original Assignee
一般社団法人 日本画像認識協会
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 一般社団法人 日本画像認識協会
Priority to JP2017564647A (patent JP6357595B2)
Publication of WO2017155126A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/18 - Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 - Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 - Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 - Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 - Generation of visual interfaces for content selection or interaction; Content or additional data rendering

Definitions

  • the present invention relates to a system for transmitting video information acquired by a camera.
  • the present invention provides a technology that enables effective use of a network communication band when transmitting video from a camera.
  • An information transmission system according to the present invention includes: a feature point extraction unit that extracts feature points from a subject in an image captured by at least one camera and outputs the feature points as feature information; a coordinate information adding unit that acquires coordinate information of the subject within the shooting range of the camera; an information transmission unit that transmits the feature information and the coordinate information to a network; an information receiving unit that receives the feature information and the coordinate information from the network; a dynamic generation unit that generates an avatar image of the subject based on the feature information; and an image composition unit that generates a composite image by compositing the avatar image, based on the coordinate information, onto an image representing the background of the shooting range of the camera.
  • With this configuration, the feature information on the feature points extracted from the subject and the coordinate information of the subject within the shooting range of the camera are transmitted to the network.
  • On the receiving side, an avatar image of the subject is generated based on the feature information, and the avatar image is composited, based on the coordinate information, onto an image representing the background of the shooting range of the camera.
  • In other words, a composite image is generated from the background image and the avatar image on the receiving side without sending the camera's video signal, so the network communication band can be used more effectively than when the video signal itself is sent.
  • Because an avatar image is used instead of the actual video of the subject, there is also the advantage that privacy is not infringed even when an unspecified large number of persons are photographed. A minimal sketch of the kind of per-person data transmitted in place of the video signal follows.
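The following is a minimal sketch, in Python, of the kind of per-person record that could be transmitted in place of the video signal. The field names and the JSON encoding are illustrative assumptions, not the format specified by the patent.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PersonRecord:
    """Hypothetical per-person payload sent instead of video frames."""
    person_id: int      # detection ID assigned on the transmitting side
    camera_tag: str     # tag indicating which camera the information came from
    x: float            # ground-contact position in the shared real-space coordinates
    y: float
    height_m: float     # estimated real height of the person
    features: dict      # e.g. {"hair": "long", "upper_color": "red", ...}
    motion: dict        # e.g. {"direction_deg": 45.0, "speed_mps": 1.2}

record = PersonRecord(1, "cam11a", 3.2, 7.8, 1.68,
                      {"hair": "long", "upper_color": "red"},
                      {"direction_deg": 45.0, "speed_mps": 1.2})
payload = json.dumps(asdict(record))  # a few hundred bytes versus megabits of video
print(payload)
```

Even with several persons per frame, such records remain far smaller than a compressed video stream, which is the bandwidth advantage described above.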
  • The coordinate information adding unit can take the position identified as the ground contact point of the person (the subject) appearing on the shooting screen as the shooting ground position, and take the height of the person image area appearing at that position on the shooting screen as the person shooting height.
  • In that case, the coordinate information adding unit can include: position/height conversion relation information acquiring means for acquiring conversion relation information between the camera two-dimensional coordinate system set on the camera shooting screen and a real-space coordinate system whose height-direction reference is the walking plane on which the person moves; shooting ground position/height specifying means for specifying the shooting ground position and the person shooting height of the person image on the shooting screen; and real person coordinate/height information generating means for converting, based on the position/height conversion relation information, the identified shooting ground position coordinates and shooting height into actual ground position coordinate information (the ground position coordinates of the person in the real space) and real person height information (the height of the person in the real space).
  • The dynamic generation unit can include avatar height determining means that determines the height dimension of the avatar image based on the generated real person height information, and the image composition unit can include avatar composition position determining means that determines, based on the actual ground position coordinate information, the position at which the avatar image is composited onto the background image.
  • The spatial existence range of a person in the area to be photographed is almost limited to a horizontal plane such as the floor surface or the ground, i.e. the x-y plane of a real-space orthogonal coordinate system whose height direction is taken as the z axis, so the z coordinate of the ground contact point (foot position) can always be regarded as constant (for example, 0). That is, the coordinates of the ground contact point of a person walking in the area can be substantially described in an x-y two-dimensional system, and can be uniquely associated with the camera two-dimensional coordinate system.
  • The camera two-dimensional coordinate system corresponds to a projective transformation of the real-space three-dimensional coordinate system, and an object farther from the camera is projected with a reduced size.
  • This transformation is mathematically described by a matrix. If a reference body of known height is placed at various known positions on the floor or the ground in the real-space coordinate system and photographed with the camera, then by comparing the position and height of the reference body image on the shooting screen with its position and actual size in the real space, position/height conversion relation information can be obtained, i.e. information for converting the position and height of a person on the camera screen into the position and height in the real space (a minimal sketch of such a conversion appears below).
  • With this information, the dynamic generation unit can easily determine the height of the avatar image to be composited onto the background image, and the image composition unit can reasonably and easily determine the composition position of the avatar image on the background image.
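As an illustration of how such position/height conversion relation information might be derived, the sketch below estimates a ground-plane homography from reference points of known real-space position using OpenCV, and scales person heights against a reference body of known height. The specific API calls, the example coordinates, and the simple linear height scaling are assumptions for illustration, not the procedure stated in the patent.

```python
import numpy as np
import cv2

# Screen coordinates (pixels) of reference ground points and their known
# real-space (x, y) positions on the walking plane (z = 0), in metres.
screen_pts = np.array([[120, 460], [500, 470], [320, 300], [80, 320]], dtype=np.float32)
world_pts  = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 6.0], [0.0, 6.0]], dtype=np.float32)

# Homography mapping screen ground points to real-space ground coordinates.
H, _ = cv2.findHomography(screen_pts, world_pts)

def screen_to_ground(u, v):
    """Convert a ground-contact point on the screen to real-space (x, y)."""
    p = cv2.perspectiveTransform(np.array([[[u, v]]], dtype=np.float32), H)
    return p[0, 0]  # (x, y) on the walking plane

# Height conversion: compare a reference body of known height with its height in
# pixels at a nearby screen position (a simple local scale assumption).
REF_HEIGHT_M, REF_HEIGHT_PX = 1.80, 220.0

def pixel_height_to_metres(h_px):
    return REF_HEIGHT_M * h_px / REF_HEIGHT_PX

print(screen_to_ground(300, 420), pixel_height_to_metres(190))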
  • The feature point extraction unit can be configured to analyze the motion or orientation of the subject and output it as motion analysis information,
  • the information transmission unit can be configured to send the motion analysis information to the network,
  • and the image composition unit can be configured to adjust the movement or orientation of the avatar image based on the motion analysis information. According to this, since the movement and orientation of the avatar image are adjusted based on the movement and orientation of the subject at the time of shooting, the moving speed and moving direction of the subject, for example, can be reflected in the avatar image.
  • The camera can capture moving images,
  • the coordinate information adding unit can acquire the coordinate information of the person (the subject) for each frame of the captured moving image,
  • and the feature point extraction unit may be configured to output the movement trajectory of that coordinate information between frames as motion analysis information. If the movement trajectory information of the preceding frames is analyzed for the current frame, it becomes particularly easy to grasp the movement of the person image up to the current frame. For example, a person usually walks with the face and torso facing forward unless moving irregularly (such as walking sideways or backwards), so once the movement trajectory of a representative point of the person image (for example, the ground contact point) is known, the orientation of the body accompanying the walking motion can be grasped sequentially.
  • The image composition unit can be configured to adjust the orientation of the avatar image to be composited onto the background image based on the movement trajectory information.
  • the dynamic generation unit can be configured to generate different avatar images according to the movement direction of the person in the real space so that the appearance of the person from the viewpoint of the camera is reflected.
  • The realism of the avatar image expression can be increased by changing the avatar image in accordance with how the person appears to the camera, that is, the angle of the walking direction relative to the camera.
  • the dynamic generation unit includes a direction-specific two-dimensional avatar image data storage unit that stores a plurality of two-dimensional avatar image data having different representation forms according to a plurality of predetermined movement directions of a person in real space.
  • the dynamic generation unit includes a three-dimensional avatar image data storage unit that stores the data of the avatar image as the three-dimensional avatar image data, and generates a three-dimensional avatar object based on the three-dimensional avatar image data.
  • the image composition unit can be configured to generate two-dimensional avatar image data by projectively transforming the three-dimensional avatar object, arranged in the real space with its direction determined, into the two-dimensional coordinate system of the background image,
  • and to composite the avatar image based on that two-dimensional avatar image data with the background image. In this case, although making the avatar image data three-dimensional increases the data volume,
  • the direction in which the avatar image is pasted onto the background image can be made stepless, and a more realistic expression can be realized.
  • The image composition unit can also generate an image representing a person's flow line based on the movement trajectory information. With this configuration, it is easy to visually grasp how a specific subject has moved over the background image. For example, it can be used effectively for crime prevention purposes, and statistical trend analysis of flow line images can clarify, for example, which places attract visitors' interest in exhibition halls and public facilities.
  • the information transmission system of the present invention can be configured such that the feature point extraction unit analyzes the person attribute of the subject and outputs it as person attribute information, and the information transmission unit sends the person attribute information to the network. According to this configuration, various analysis / statistical processes and the like can be performed using the person attribute information on the receiving side.
  • the dynamic generation unit can be configured to generate the avatar image as reflecting the person attribute information.
  • the attribute of a corresponding person can be easily grasped even after being converted into an avatar image.
  • the attributes of the people can be simplified or emphasized by the avatar image, and there is an advantage that the tendency on the image can be easily grasped.
  • The person attribute information can be configured to include gender information reflecting the gender of the person and age information reflecting the age of the person, but is not limited thereto; for example,
  • nationality estimated from the appearance of the face or the like (for example, Japanese or Westerner) may also be included.
  • the feature point extraction unit can analyze the appearance of the subject person and output it as appearance feature information
  • the information transmission unit can be configured to send the appearance feature information to the network.
  • the appearance of the subject is important information that leads to the identification of individual persons following the person attributes, and is useful in analysis and statistical processing.
  • The dynamic generation unit can be configured to generate the avatar image so as to reflect the appearance feature information, so that the features of the corresponding person can be understood even after conversion into the avatar image. Examples of elements that most strongly reflect the characteristics of a person's appearance include hair, clothing, and belongings.
  • The appearance feature information can include hair information reflecting one or both of the form and color of the person's hair, clothing information reflecting one or both of the form and color of the person's clothing, and belongings information reflecting one or both of the form and color of the person's belongings.
  • the body shape of a person is also useful information.
  • the appearance feature information can be configured to include body shape information that reflects the body shape of a person.
  • Gait, that is, the features of a person's way of walking, is also useful information;
  • the appearance feature information can be configured to include gait information reflecting a person's gait.
  • The information specifying the gait is, for example, the stride (or the step frequency linked to the walking speed), the swing angle of the arms, the walking speed, the upper body angle during walking, the vertical bobbing, and so on.
  • The dynamic generation unit can be configured to use avatar animation data composed of frame data obtained by subdividing a person's walking action, so that the avatar image can be represented realistically as an animation of the walking action on the background image.
  • The dynamic generation unit can perform image correction processing that corrects each frame of the frame data based on the gait information,
  • and the image composition unit can be configured to composite the avatar image, with the gait features reflected based on the corrected frame data, onto the background image in the form of an animation. The movement of an avatar image that reflects the gait information of the corresponding person can thus be realized easily by correcting each frame of the avatar animation data (a sketch of such a per-frame correction appears below).
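A minimal sketch of one way such a per-frame correction could look: each animation frame is assumed to store simple joint angles, and the observed stride and arm swing are used to scale the default walking pose. The frame representation and the scaling rule are illustrative assumptions, not the patent's specified correction processing.

```python
from dataclasses import dataclass, replace

@dataclass
class WalkFrame:
    leg_angle_deg: float   # thigh swing relative to vertical in this frame
    arm_angle_deg: float   # arm swing relative to vertical in this frame

# Default walking-cycle frames for a standard avatar (illustrative values).
DEFAULT_CYCLE = [WalkFrame(20, 25), WalkFrame(10, 12), WalkFrame(0, 0), WalkFrame(-15, -20)]

def correct_cycle(frames, stride_ratio, arm_swing_ratio):
    """Scale the default pose so the animation reflects the observed gait."""
    return [replace(f,
                    leg_angle_deg=f.leg_angle_deg * stride_ratio,
                    arm_angle_deg=f.arm_angle_deg * arm_swing_ratio)
            for f in frames]

# Observed person: longer stride than standard, smaller arm swing.
corrected = correct_cycle(DEFAULT_CYCLE, stride_ratio=1.2, arm_swing_ratio=0.8)
print(corrected[0])
```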
  • the imaging range can be covered by a plurality of cameras in the form of sharing real space coordinates.
  • Each camera shoots the common area with its own camera coordinate system, but if the real space of the jointly monitored area is spanned by a single shared coordinate system, then when the shooting information of the cameras is later integrated,
  • the integration can be completed immediately simply by converting the person coordinates obtained by each camera into the common real space (for example, a global coordinate system such as one obtainable by GPS).
  • The image composition unit can be configured to generate the composite image as a bird's-eye view image covering the shooting ranges of the plurality of cameras. In this way, the entire imaging range of the plurality of cameras can be grasped at a glance.
  • In this case, the coordinate information adding unit may be configured as follows. That is, the position identified as the ground contact point of a person appearing on the shooting screens of the plurality of cameras is taken as the shooting ground position, and the height of the person image area appearing at that position on the shooting screen is taken as the person shooting height.
  • The coordinate information adding unit includes position/height conversion relation information acquiring means for acquiring conversion relation information between the plane coordinates in the camera two-dimensional coordinate system set on each camera's shooting screen and a real-space three-dimensional coordinate system whose height-direction reference is the walking plane on which the person moves in the real space;
  • shooting ground position/height specifying means for specifying the shooting ground position and the person shooting height of the person image on the shooting screen;
  • and real person coordinate/height information generating means for converting, based on the position/height conversion relation information, the information on the identified shooting ground position coordinates and shooting height into actual ground position coordinate information, which is the ground position coordinates of the person in the real space, and real person height information, which is the height of the person in the real space.
  • The dynamic generation unit includes avatar height determining means that determines the height dimension of the avatar image based on the generated real person height information, and the image composition unit includes avatar composition position determining means that converts the actual ground position coordinate information of the persons photographed by the plurality of cameras in the real-space coordinate system into the viewpoint of the bird's-eye view image and determines the composition position of each avatar image in the bird's-eye view image.
  • the feature point extraction unit divides the image of the subject into a plurality of parts corresponding to parts of the human body, and extracts feature points from each part. According to this structure, the feature point of each part can be detected effectively.
  • The dynamic generation unit can include an avatar image data storage unit that divides and stores the avatar image data as a plurality of avatar fragments corresponding to those parts, corrects each avatar fragment based on the feature point information extracted for the corresponding part of the person, and then integrates the corrected avatar fragments to generate the avatar image. In this way, fine corrections reflecting the feature points can be made for each avatar fragment (that is, for each part of the person), and it is not necessary to prepare a large number of whole-avatar image data sets for every combination of features, so the data volume can be reduced.
  • The information transmitting apparatus of the present invention comprises: a feature point extraction unit that extracts feature points from a subject in an image captured by at least one camera and outputs the feature points as feature information;
  • a coordinate information adding unit that acquires coordinate information of the subject within the shooting range of the camera;
  • and an information transmission unit that transmits the feature information and the coordinate information to a network,
  • wherein the feature information is associated with the constituent elements of an avatar image of the subject displayed at the transmission destination,
  • and the coordinate information is used to specify the position at which the avatar image is to be composited in an image representing the background of the shooting range of the camera at the transmission destination.
  • The information receiving apparatus of the present invention comprises: an information receiving unit that receives, via a network, feature information representing feature points extracted from a subject in an image captured by at least one camera, and coordinate information of the subject within the shooting range of the camera; a dynamic generation unit that generates an avatar image of the subject based on the feature information; and an image composition unit that generates a composite image by compositing the avatar image, based on the coordinate information, onto an image representing the background of the shooting range of the camera.
  • The computer program applied to the information transmitting side of the present invention causes a computer to execute: feature point extraction processing for extracting feature points from a subject in an image captured by at least one camera and outputting them as feature information; coordinate information addition processing for acquiring coordinate information of the subject within the shooting range of the camera; and information transmission processing for transmitting the feature information and the coordinate information to a network,
  • wherein the feature information is associated with the constituent elements of an avatar image of the subject displayed at the transmission destination,
  • and the coordinate information is used to specify the position at which the avatar image is to be composited in an image representing the background of the shooting range of the camera at the transmission destination.
  • The computer program applied to the information receiving side of the present invention causes a computer to execute: reception processing for receiving, via a network, feature information representing feature points extracted from a subject in an image captured by at least one camera, and coordinate information of the subject within the shooting range of the camera; dynamic generation processing for generating an avatar image of the subject based on the feature information; and image composition processing for generating a composite image by compositing the avatar image, based on the coordinate information, onto an image representing the background of the shooting range of the camera.
  • According to the present invention, it is possible to provide a transmission method that does not hinder the effective use of the communication band of the network when transmitting video from a camera.
  • FIG. 1 is a block diagram showing a schematic configuration of an information transmission system according to the first embodiment of the present invention.
  • FIG. 2 is a flowchart showing the processing procedure of the feature point extraction unit.
  • FIG. 3 is a schematic diagram showing how the feature point extraction unit extracts features by dividing the human body into parts.
  • FIG. 4 is a flowchart showing a flow of processing in which the feature point extraction unit extracts person attribute information.
  • FIG. 5 is a schematic diagram illustrating an example of a coordinate system set in the shooting range of the camera.
  • FIG. 6 is a schematic diagram illustrating a display example in which an avatar image is combined with a background image.
  • FIG. 7 is a schematic diagram showing an application example of the present invention.
  • FIG. 8 is a schematic diagram illustrating an expression example of an avatar when there is no continuity of the transmitting camera.
  • FIG. 9 is a block diagram illustrating a schematic configuration of an information transmission system according to the second embodiment.
  • FIG. 10 is a schematic diagram showing a conventional transmission method.
  • FIG. 11 is a diagram for explaining the concept of extracting a difference in a person image area.
  • FIG. 12 is a conceptual diagram of a background image.
  • FIG. 13 is an explanatory diagram of the coordinate information addition process.
  • FIG. 14 is an explanatory diagram following FIG. 13.
  • FIG. 15 is an explanatory diagram of lens distortion correction.
  • FIG. 16 is a flowchart showing the flow of the coordinate information addition process.
  • FIG. 17 is a diagram showing an example of a person image region extraction state on the screen.
  • FIG. 18 is an explanatory diagram for converting the height h of the person image area into the actual height H using a conversion coefficient.
  • FIG. 19 is a flowchart showing the flow of the person area detection process.
  • FIG. 20 is a diagram showing a concept of extracting gait feature information.
  • FIG. 21 is a diagram illustrating a concept of extracting movement trajectory information.
  • FIG. 22 is a diagram showing the concept of information storage on the receiving side.
  • FIG. 23 is a diagram illustrating the concept of an avatar image database.
  • FIG. 24 is a diagram illustrating the concept of the person moving direction used for determining the direction of the avatar image.
  • FIG. 25 is a diagram illustrating an example of avatar fragment graphic data.
  • FIG. 26 is an explanatory diagram illustrating an example in which an avatar image is obtained by combining avatar fragment graphics.
  • FIG. 27 is a diagram illustrating an example in which avatar image data is configured as avatar animation data.
  • FIG. 28 is a diagram illustrating an example in which avatar fragment image data is configured as two-dimensional vector graphic data.
  • FIG. 29 is a flowchart showing a flow of processing on the reception unit side.
  • FIG. 30 is a flowchart showing a flow of new avatar creation processing.
  • FIG. 31 is a flowchart showing the flow of the avatar background composition process.
  • FIG. 32 is a flowchart showing the flow of the integrated mode display process.
  • FIG. 33 is a diagram illustrating an example of a planar display form in the integrated display mode.
  • FIG. 34 is a diagram showing an example of a bird's eye view display form.
  • FIG. 35 is an image showing an example of displaying a three-dimensional avatar image.
  • FIG. 1 is a block diagram showing a schematic configuration of the information transmission system 1.
  • the information transmission system 1 includes an information transmission system transmission unit 12 (information transmission device) and an information transmission system reception unit 13 (information reception device).
  • the information transmission system transmission unit 12 and the information transmission system reception unit 13 are connected via a network 15.
  • the network 15 is a public network such as the Internet, but may be a private network such as a local network.
  • The information transmission system transmission unit 12 receives video signals from a plurality of cameras 11 (11a, 11b, ...) installed in various places, performs pre-transmission processing (described in detail later), and then sends the result to the network 15. In FIG. 1, only two cameras 11 are shown, but the number of cameras is arbitrary. Communication between the cameras 11 and the information transmission system transmission unit 12 may be wired or wireless.
  • The information transmission system reception unit 13 receives the information transmitted from the information transmission system transmission unit 12 via the network 15, performs post-reception processing (described in detail later), and then displays the result on the monitor 14 or records it to a video recording device (not shown) as necessary.
  • the information transmission system transmission unit 12 includes a coordinate information addition unit 121, a feature point extraction unit 122, a multiple camera linkage unit 123, and an information transmission unit 124.
  • One set of coordinate information adding unit 121 and feature point extracting unit 122 is provided for each camera 11.
  • a coordinate information adding unit 121a and a feature point extracting unit 122a are provided for the camera 11a
  • a coordinate information adding unit 121b and a feature point extracting unit 122b are provided for the camera 11b.
  • the feature point extraction unit 122 detects a person area from the video signal photographed by the camera 11, and further extracts features regarding the appearance (for example, clothing, hairstyle, body shape, belongings, etc.) of each person.
  • the coordinate information adding unit 121 detects the position of a person in an area photographed by the camera 11 as coordinate information.
  • Unlike a conventional information transmission system in which the video signal photographed by the camera is compressed and transmitted as it is, the information transmission system 1 transmits via the network 15 only the feature information obtained by the feature point extraction unit 122 and the coordinate information obtained by the coordinate information addition unit 121.
  • The information transmission system receiving unit 13, which has received the feature information and the coordinate information, holds a background image of the shooting range of each camera 11 recorded in advance, generates an avatar image that accurately represents each person based on the feature information, and composites the avatar image at the appropriate position on the background image according to the coordinate information.
  • each of the cameras 11 includes the coordinate information addition unit 121 and the feature point extraction unit 122.
  • The multi-camera cooperation unit 123 attaches, to the coordinate information obtained by the coordinate information addition unit 121 and the feature information obtained by the feature point extraction unit 122, tag information indicating from which of the plurality of cameras 11 the video signal that yielded the information was obtained,
  • and sends the result to the information transmission unit 124.
  • the information transmission unit 124 encodes information obtained from the multi-camera cooperation unit 123 according to a predetermined standard, and transmits the encoded information to the network 15.
  • the information transmission system reception unit 13 includes an information reception unit 131, a dynamic generation unit 132, and an image composition unit 133.
  • the information receiving unit 131 decodes the information received from the network 15 and sends it to the dynamic generation unit 132.
  • the dynamic generation unit 132 generates an avatar image representing a photographed person based on the feature information included in the received information.
  • the avatar image generated by the dynamic generation unit 132 is sent to the image composition unit 133 together with the coordinate information.
  • Based on the avatar image and the coordinate information, the image composition unit 133 generates a composite image of the background image of the shooting range of each camera 11 and the avatar image, and displays the composite image on the monitor 14. At this time, the tag information indicating from which camera 11's video signal the information was obtained is used to specify the background image.
  • the coordinate information adding unit 121 specifies the coordinates of the position of the person in the coordinate system set for the shooting range of each camera 11. For example, as shown in FIG. 5, an xy coordinate system 51 is set in the shooting range of one camera 11.
  • the coordinate information adding unit 121 detects the coordinates of the person area specified by the feature point extracting unit 122 in the xy coordinate system 51.
  • the coordinates detected here are sent to the information transmission system receiving unit 13 together with the feature information as coordinate information representing the position of the person.
  • The subject targeted by the present invention is a person who moves around in the area photographed by the camera 11. Considering the spatial geometric characteristics of such movement, the position and height of the person in the real space can be specified from the information of the person image area PA on the screen of the single camera 11 shown in FIG. 5.
  • The spatial existence range of the person in the area to be photographed is almost confined to a horizontal plane, namely the floor surface or the ground (in the case of FIG. 5, the road surface RS on which the person walks), on which the position in the height direction (z-axis direction) is constant.
  • This road surface RS is an x-y plane whose z coordinate is always 0 in an orthogonal coordinate system, and the coordinates of the ground contact point of a person walking on the road surface RS can substantially be described in the two dimensions x and y; although the contact point is a point in three-dimensional space, it can therefore be uniquely associated with the camera two-dimensional coordinate system set on the shooting screen.
  • the camera two-dimensional coordinate system corresponds to a projective transformation of the real space three-dimensional coordinate system, and a subject that is separated in the camera optical axis direction is projected with a reduced size.
  • The reference points p1 to p3 are read in the camera two-dimensional coordinate system set on the screen and stored as screen coordinate data of each reference point (S503).
  • Because the image on the shooting screen is affected by distortion of the camera lens, it is not a strict projective-transformation image of the real space, and the image may be distorted depending on the position within the field of view. As shown on the left of FIG. 15, the distortion is larger in regions closer to the edge of the screen, and the coordinate system becomes nonlinear.
  • A lens with a large viewing angle, such as a wide-angle lens, shows outward convex distortion,
  • while a lens with a small viewing angle, such as a telephoto lens, shows concave distortion. Therefore, this distortion is removed and a conversion correction is applied so that each point lies in an orthogonal plane coordinate system (S504).
  • The correction coefficient at this time can be determined by an optimization operation that linearizes the shape of a figure known to be straight in real space, such as the white line WL appearing on the screen in FIG. 15. Note that this correction expands the regions near the edge of the screen as the distortion is removed, so the corrected screen shape SA' protrudes outside the original screen SA.
  • the real space coordinates P (x, y, 0) of the ground contact point of the reference body SC can be obtained.
  • the coordinates may be directly specified by a satellite positioning system (GPS).
  • The real-space coordinate system used here may be an independent coordinate system set within the shooting range of each camera, or it may be linked to a global coordinate system provided by a satellite positioning system (GPS).
  • the height h on the screen of the reference body image SCI is read (S506).
  • FIG. 2 is a flowchart showing the processing procedure of the feature point extraction unit 122.
  • FIG. 3 is a schematic diagram showing how the feature point extraction unit 122 extracts features by dividing the human body into parts.
  • The feature point extraction unit 122 detects a moving object MO appearing in the video signal by taking the difference between frames FM, as shown in FIG. 11 (step S11 of FIG. 2). Specifically, if an image area belongs to a moving object, its position and shape differ between the image area MO' of the preceding frame and the image area MO of the succeeding frame, whereas the background does not change; the image area MO of the moving object can therefore be extracted by taking the image difference between the frames (a minimal sketch of such frame differencing follows below). On the other hand, if an image is captured while no moving object is present, a background image BP is obtained as shown in FIG. 12. The background image BP is captured for each camera and transmitted in advance to the receiving unit 13 in FIG. 1.
  • the feature point extraction unit 122 extracts a person region by performing segmentation, edge detection, pattern matching, and the like on the moving object image detected in step S11, and determines whether or not the moving object is a person. Judgment is made (step S12).
  • Various methods can be used for the moving object detection process and the person extraction process from the video signal, and the method is not limited to a specific method. Also, among the moving objects detected from the video signal, those having a relatively small size are likely to be noise, so they are determined not to be humans, and those having a relatively large size are determined to be humans.
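A minimal sketch of frame-difference-based moving-object detection with OpenCV (assuming OpenCV 4.x). The threshold value and the person/noise size cut-off are illustrative assumptions; as the text notes, background subtraction, segmentation, or pattern matching could be used instead.

```python
import cv2

def detect_person_regions(prev_gray, curr_gray, min_area=2000):
    """Return bounding boxes of moving regions large enough to be a person."""
    diff = cv2.absdiff(prev_gray, curr_gray)            # inter-frame difference
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)         # close small gaps
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        if cv2.contourArea(c) >= min_area:              # small blobs are treated as noise
            boxes.append(cv2.boundingRect(c))           # (x, y, w, h); bottom edge ~ ground point
    return boxes

cap = cv2.VideoCapture("camera11a.mp4")                 # illustrative video source
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detect_person_regions(prev_gray, gray):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    prev_gray = gray
```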
  • The detected position of the lower end edge of the person area PA is regarded as the ground contact point p, its coordinates on the screen are read (S1201), and the above-described position/height conversion relation information is referred to. Since the height-direction dimension of the person area changes depending on the posture of the person, the frames are searched for the person area image that appears closest to the upright state, and that image is specified (S1203).
  • the feature point extraction unit 122 further performs a process (step S15) of analyzing the operation of each part. For example, for the head p1, head movement (movement and orientation) is detected.
  • The head is the easiest part to recognize. If the orientation of the head is known by first extracting the head p1, it becomes easy to specify the state of the other parts, the moving direction, and so on. In addition, for example, when the head is pointing to the right, the parting described later can proceed on the assumption that the left hand and the left foot may be hidden and invisible. If the person is walking, the movement is analyzed and acquired as gait information. In this case, as the motion of the torso p2, posture features such as the upper body angle and whether or not the person is stooping are detected, as shown for example in FIG. 20.
  • The movements of the right hand p3 and the left hand p4 are detected as, for example, the swing angle of each arm.
  • As the movements of the right foot p5 and the left foot p6, for example, the walking speed, the stride WL, the knee bending angle, and the like are detected.
  • The gait and other motion features detected here are sent to the information transmission system receiving unit 13 as motion analysis information and are reflected in the movement and orientation of the avatar representing the person. Particularly important as motion analysis information is the moving direction of the person. As shown in FIG. 21, when the coordinate information P1, P2, ..., Pn of the person is specified for each frame of the captured moving image, the set of coordinate information P1, P2, ..., Pn constitutes the movement trajectory information between frames.
  • The difference Vn - Vn-1 between the position vectors Vn and Vn-1 of the coordinates Pn and Pn-1 in adjacent frames can be used as an index representing the moving direction of the person at the position Pn, and is also used effectively in determining the direction of the avatar image described later (see the sketch below).
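A minimal sketch of deriving the moving direction and speed from consecutive ground-contact coordinates, i.e. the difference Vn - Vn-1 described above. The function names and example values are illustrative.

```python
import math

def movement_vector(p_prev, p_curr):
    """Difference Vn - Vn-1 of the position vectors of consecutive frames."""
    return (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1])

def heading_and_speed(p_prev, p_curr, frame_interval_s):
    dx, dy = movement_vector(p_prev, p_curr)
    heading_deg = math.degrees(math.atan2(dy, dx)) % 360.0   # direction of travel
    speed = math.hypot(dx, dy) / frame_interval_s            # metres per second
    return heading_deg, speed

trajectory = [(3.0, 7.5), (3.1, 7.7), (3.3, 7.9)]  # P1, P2, P3 in real-space metres
print(heading_and_speed(trajectory[-2], trajectory[-1], frame_interval_s=0.2))
```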
  • The feature point extraction unit 122 divides the person region P extracted in step S12 (see FIG. 3(a)) into six parts, namely a head p1, a torso p2, a right hand p3, a left hand p4, a right foot p5, and a left foot p6 (see FIG. 3(b)) (step S13). Then, an appearance feature analysis is performed for each of the six parts (step S14).
  • For the head p1, the hairstyle, the hair color, the presence or absence of a hat, and the like are extracted as feature points.
  • For the torso p2, the body shape, the shape of the clothing, the color of the clothing, the presence or absence of specific belongings such as a rucksack, and the like are extracted as feature points.
  • the characteristic points regarding the right hand p3 and the left hand p4 are, for example, the body shape, the shape (or type) of clothing, the color of clothing, and the belongings.
  • the characteristic points regarding the right foot p5 and the left foot p6 are, for example, body shape, clothing shape (or type), clothing color, shoes, and the like.
  • the number of parts at the time of making into parts is not limited to six.
  • the feature points listed here are merely examples, and various elements may be extracted as feature points.
  • In the example above, the hairstyle and hair color, and the clothing shape and clothing color, are extracted as independent feature points, but the "hair color" and "clothing color" may instead be treated as additional data of "hairstyle" and "clothing shape".
  • the extracted feature points for each part are output as feature data and sent to the information transmission system receiver 13.
  • the variation of the extracted feature point (feature data) corresponds to the variation of the component (partial image) of each part in the avatar of the person generated by the information transmission system receiving unit 13 as described later.
  • For example, if the feature data indicates long hair, a partial image of "long hair" is used as the hair of the avatar.
  • Similarly, if the feature data indicates a heavy build, a "thick body" partial image is used for the torso of the avatar.
  • the feature point extraction unit 122 may further extract information (person attribute information) that specifies the person to some extent, such as the age and sex of the person.
  • The feature point extraction unit 122 determines the age and gender of the person based on the feature amounts extracted from the images of the parts obtained in the parting step (step S23). For example, if the head p1 can be captured, it is possible to discriminate age and gender using face recognition technology.
  • The age may be output as age data in increments of one year, or as data representing an age bracket (for example, the twenties).
  • gender and age are exemplified as the person attribute information, but any information other than this can be used as information for specifying a person to some extent. For example, it may be possible to discriminate between “adult” and “child”.
  • Furthermore, instead of characterizing a person only by gender and age, if a person database in which face images and personal information (names and the like) are registered in advance can be used, it is possible to uniquely identify an individual by collating the image of the head p1 with the face images registered in the person database as necessary (step S24).
  • the information receiving unit 131 receives information from the network 15 and decodes it.
  • The decoded information includes the information (feature information and coordinate information) obtained from the video signals of the plurality of cameras 11 (cameras 11a, 11b, ...), and is stored and accumulated in the information accumulation/statistical processing unit 135.
  • FIG. 22 shows an example of accumulated information.
  • A detection ID is assigned to each person determined to be the same person from the degree of coincidence of the feature information described above, and the reception date and time, the position (x and y coordinates), the way of walking (gait), the physique, the height, the hair color, the upper body clothing color, the lower body clothing color, the facial feature information, the gender, the age, and so on are stored sequentially in association with that ID.
  • In FIG. 22, the date and ID portions are abbreviated as #1, #2, and so on; information such as the type (form) of the upper and lower body clothing, the presence or absence of a hat, and belongings is also associated. One possible record layout is sketched below.
  • The gait data includes the stride WL, the arm swing angle, the upper body angle, the knee bending angle, the one-step cycle, and so on.
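The accumulated information of FIG. 22 could be held, for instance, in a table like the following. The column names, the SQLite storage, and the example values are illustrative assumptions, not the patent's specified schema.

```python
import sqlite3

conn = sqlite3.connect("observations.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS observations (
    detection_id    INTEGER,   -- same person => same ID
    received_at     TEXT,      -- reception date and time
    x               REAL,      -- ground position, shared real-space coordinates
    y               REAL,
    stride          REAL,      -- gait: stride WL
    arm_swing_deg   REAL,      -- gait: arm swing angle
    upper_angle_deg REAL,      -- gait: upper body angle
    height_m        REAL,
    hair_color      TEXT,
    upper_color     TEXT,
    lower_color     TEXT,
    gender          TEXT,
    age_bracket     TEXT
)""")
conn.execute("INSERT INTO observations VALUES (1, '2017-03-10T09:00:00', 3.2, 7.8, "
             "0.65, 30.0, 5.0, 1.68, 'black', 'red', 'blue', 'F', '20s')")
conn.commit()
```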
  • the dynamic generation unit 132 generates an avatar image of each person based on the received feature information. That is, as described above, the feature information includes feature data representing the feature of each part of the person.
  • The dynamic generation unit 132 can access a database that stores in advance a partial avatar image corresponding to each kind of feature data (held in the information accumulation/statistical processing unit 135 in FIG. 1; it may also be a separate storage device or a server).
  • FIG. 23 conceptually shows a construction example of the database.
  • The database contains avatar fragment graphic data, which are the avatar constituent elements such as upper and lower body clothing, hairstyles, and belongings, prepared with the height and body shape set to standard values.
  • The avatar fragment graphic data for each avatar component is prepared in different representations according to the moving direction of the person in the real space, so that the appearance of the person as seen from the camera (the direction with respect to the camera) is reflected.
  • The direction of the person P with respect to the camera 11 is classified into eight directions (J1 to J8), and the avatar fragment graphic data, divided in a form corresponding to the human body parts p1 to p6 described with reference to FIG. 3 (p2 to p4 for upper body clothing, p5 and p6 for lower body clothing), is prepared for each of the eight directions (v1 to v8, corresponding to J1 to J8).
  • Shoes, hair, and belongings are not subdivided, but they too are prepared in eight variations (a lookup sketch appears below).
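A minimal sketch of how such a direction-specific fragment database could be keyed and queried. The dictionary keys, component names, and file names are illustrative assumptions.

```python
# Fragment database keyed by (component, variant, direction index v1..v8).
# Direction indices 1..8 correspond to the quantized person directions J1..J8.
AVATAR_FRAGMENTS = {
    ("upper_clothes", "jacket", 1): "jacket_v1.svg",
    ("upper_clothes", "jacket", 7): "jacket_v7.svg",
    ("lower_clothes", "trousers", 1): "trousers_v1.svg",
    ("hair", "long", 1): "hair_long_v1.svg",
    ("shoes", "sneakers", 1): "sneakers_v1.svg",
    # ... one entry per component, variant and direction
}

def fetch_fragments(features, direction_index):
    """Collect the fragment graphics for one avatar in one viewing direction."""
    wanted = [("upper_clothes", features["upper"]),
              ("lower_clothes", features["lower"]),
              ("hair", features["hair"]),
              ("shoes", features["shoes"])]
    return [AVATAR_FRAGMENTS[(part, variant, direction_index)]
            for part, variant in wanted
            if (part, variant, direction_index) in AVATAR_FRAGMENTS]

print(fetch_fragments({"upper": "jacket", "lower": "trousers",
                       "hair": "long", "shoes": "sneakers"}, 1))
```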
  • the left of FIG. 25 shows an example of selection of avatar fragment graphic data when the direction v7 in FIG. 24 is designated, and the right shows an example of selection of avatar fragment graphic data when the direction v1 is designated.
  • FIG. 26 shows avatar images AV7 and AV1 obtained by combining them.
  • For the face, contours and facial features reflecting the extracted facial feature information are synthesized for each direction; alternatively, a standard face (or head) image may be prepared for each gender and age.
  • The avatar image data (or avatar fragment graphic data) is configured as avatar animation data consisting of a set of frame data obtained by subdividing the walking motion of the subject.
  • One walking cycle of two steps is represented by a plurality of frames: four frames (AFM1 to AFM4 in this case) up to the landing of the right foot and four frames (AFM5 to AFM8) up to the landing of the left foot.
  • For at least the lower body clothing and the upper body clothing, data for these eight frames is prepared for each type of avatar fragment graphic data.
  • the image data of each avatar fragment is configured as two-dimensional vector graphic data as shown in FIG.
  • the vector graphic data is obtained by circularly concatenating vertex coordinates that specify a graphic outline with a vector.
  • When an avatar fragment is deformed (for example, scaled to the person's height or body shape), the vertex coordinates are moved according to a matrix operation representing the linear transformation, and the outline is redrawn by connecting the moved vertices (see the sketch below).
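The vertex manipulation described above can be sketched as follows: each fragment outline is a closed polygon of vertices, and scaling to the person's height (or any other linear deformation) is a matrix applied to every vertex. The 2x2 matrix form and the example factors are illustrative simplifications.

```python
import numpy as np

# Closed outline of an avatar fragment: vertices connected in order, last back to first.
outline = np.array([[0.0, 0.0], [0.3, 0.0], [0.3, 1.0], [0.0, 1.0]])  # unit-height torso

def transform_outline(vertices, matrix):
    """Apply a linear transformation (e.g. scaling) to every vertex of the outline."""
    return vertices @ np.asarray(matrix).T

# Scale a standard-height fragment to a person 1.68 m tall who is slightly slimmer
# than the standard body shape (illustrative factors).
scale = np.array([[0.9, 0.0],
                  [0.0, 1.68 / 1.70]])
print(transform_outline(outline, scale))
```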
  • FIG. 29 is a flowchart showing the flow of processing on the receiving side.
  • the person ID, operation information (coordinate points), and feature information sent via the network are received (S601).
  • the received coordinate information P is plotted on real space coordinates shared by a plurality of cameras (S602).
  • The person's walking direction vector is calculated from the change in the person's coordinate P between the preceding and following frames, and one of the eight directions J1 to J8 in FIG. 24 is selected and determined as the avatar image arrangement direction (S603); a sketch of this quantization follows below.
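The selection among the eight directions J1 to J8 could be done by quantizing the walking-direction angle into 45-degree sectors, as in the sketch below. The sector boundaries and the mapping of sectors to labels are assumptions for illustration.

```python
import math

def walking_direction_index(p_prev, p_curr):
    """Quantize the walking direction vector into one of 8 sectors (1..8 ~ J1..J8)."""
    dx, dy = p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    # 45-degree sectors centred on 0, 45, 90, ... degrees.
    return int(((angle + 22.5) % 360.0) // 45.0) + 1

print(walking_direction_index((3.0, 7.5), (3.3, 7.9)))  # e.g. 2
```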
  • Next, it is checked whether an avatar image has already been created for the received person ID (S604). If there is no avatar creation history (S605), the database is searched for a person whose time, position, and features match under a predetermined condition (S606). If there is no corresponding person (S607), processing for creating a new avatar image is performed (S610).
  • In the first step of the new avatar creation processing, the hairstyle, clothing, belongings, and their colors included in the feature data are specified.
  • In step S6102, among the avatar fragment graphics corresponding to the identified features, those corresponding to the determined avatar image arrangement direction (any of J1 to J8, i.e. any of v1 to v8 in FIG. 23) are read out.
  • In step S6103, the avatar fragment graphics are corrected based on the height, body shape, and gait information included in the feature data, and in step S6104 the avatar fragment graphics are colored with the designated colors.
  • The avatar image data is completed by combining the avatar fragments in S6105.
  • If an avatar creation history already exists, the process proceeds to S609, where the avatar image data for the corresponding ID is read from the database and reused.
  • If a matching person is found in the database search, the received person ID is updated with the ID of that person, and the process likewise proceeds to S609, where the same processing is performed.
  • the avatar image of each person generated by the dynamic generation unit 132 is sent to the image composition unit 133 together with the coordinate information of each person, and the avatar / background composition process is performed (S611).
  • the image composition unit 133 can access a database (in the information accumulation / statistical processing unit 135) in which background images of the photographing ranges of the respective cameras 11 are stored in advance.
  • the image composition unit 133 acquires the background image of each camera 11 from the database, and composes it with the avatar image generated by the dynamic generation unit 132.
  • the composite image is output to the monitor 14.
  • the position where the avatar image is arranged is based on the coordinate information of the person of the avatar.
  • The image composition unit 133 can change the direction of the avatar or adjust the speed at which the avatar moves based on the motion analysis information (data representing the movement and orientation of the person) obtained by the feature point extraction unit 122.
  • transmission from the information transmission system transmission unit 12 is performed at a frame rate as high as possible within a range that can be processed by the feature point extraction unit 122, the coordinate information addition unit 121, the dynamic generation unit 132, and the image synthesis unit 133.
  • FIG. 31 shows the flow of the avatar/background composition processing.
  • First, the avatar image data corresponding to the specified ID and direction is read.
  • This avatar image data is a set of frame data constituting an animation (FIG. 27), and the avatar animation frames are allocated to the frames for moving-image reproduction in accordance with the speed and stride of the moving coordinate point P (S61102); a sketch of such an allocation follows below.
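A minimal sketch of allocating walking-cycle frames (AFM1 to AFM8) to playback frames according to the observed speed and stride: the walking-cycle phase is advanced in proportion to the distance walked, so a faster or longer-striding person cycles through the animation at the corresponding rate. The phase computation is an illustrative assumption.

```python
def allocate_animation_frames(distance_moved_m, stride_m, n_cycle_frames=8, phase=0.0):
    """Advance the walking-cycle phase in proportion to the distance walked.

    One full cycle of n_cycle_frames corresponds to two steps (2 * stride).
    Returns the frame index to display (1..n_cycle_frames) and the new phase.
    """
    phase = (phase + distance_moved_m / (2.0 * stride_m)) % 1.0
    frame_index = int(phase * n_cycle_frames) + 1      # AFM1 .. AFM8
    return frame_index, phase

# Example: a person with stride 0.65 m moving 0.24 m between two display frames.
phase = 0.0
for _ in range(5):
    frame, phase = allocate_animation_frames(0.24, 0.65, phase=phase)
    print("display frame AFM%d" % frame)
```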
  • As for the composite video display mode, in this embodiment either a display mode with the same field of view as the shooting screen of the camera 11 (camera actual video mode) or an integrated display mode covering a plurality of cameras can be selected.
  • This mode selection can be switched by an input unit (configured by a keyboard or a touch panel) connected to the information transmission system reception unit 13 in FIG.
  • the process proceeds to S61104, and the position coordinates P (x, y, 0) of all avatars to be displayed simultaneously are plotted in the real space visual field region of the corresponding camera.
  • the real space visual field area is projected and converted to the corresponding coordinate system of the camera together with the plotted position coordinates P.
  • the camera two-dimensional coordinate system used for determining the position coordinate P is temporarily corrected from the left state in FIG. 15 to the right state in consideration of lens distortion.
  • the entire field of view fits on the output screen in the coordinate system before correction, but after correction, the region at the end of the field of view extends beyond the screen of the monitor (FIG. 1: reference numeral 14).
  • If the result were displayed as it is, the image would change by the amount of the distortion correction, creating a sense of incongruity, and the avatar image of a person appearing at the edge of the field of view might not be displayed. Therefore, in S61106 of FIG. 31, a reverse distortion correction that restores the influence of the original lens distortion is applied to the projective-transformation image, restoring the shape of the field of view. As a result, the above problem is solved.
  • In step S61107, the selected background image is superimposed, together with the mapped person coordinate positions P, on the output plane that has been returned to the camera two-dimensional coordinate system through the projective transformation and the reverse distortion correction,
  • and the avatar image data adjusted in size and orientation as described above is pasted and composited at each position p (in the camera two-dimensional coordinate system).
  • The screen of the monitor 14 in FIG. 1 may be divided so that the video of the plurality of cameras 11 is displayed simultaneously, or the screen of the monitor 14 may be switched so that only the video of any one of the plurality of cameras 11 is displayed.
  • If the integrated display mode is selected in S61104, the process proceeds to S1000 and display is performed in the integrated mode.
  • In step S1001, the positions P (x, y, 0) and directions of all avatars to be displayed simultaneously are plotted in the real space shared by the plurality of cameras.
  • In step S1002, flow line trajectory data is created by superimposing the person position coordinates P of the preceding and following frames.
  • For planar (plan view) display, an avatar image for plan view may be prepared separately, or the avatar may be laid out sideways so that the feature information can be grasped easily.
  • When flow line display is designated, the flow line image ML of the corresponding avatar image AV is displayed based on the flow line trajectory data described above.
  • In the case of bird's-eye view display, the process proceeds to S1006, where the real-space positions, directions, and flow line data of the avatars are projectively transformed according to the bird's-eye viewing angle and direction, and the background image for the bird's-eye view is then superimposed.
  • a captured image for overhead view may be prepared and used, or three-dimensional background image data (for example, three-dimensional computer graphics (CG) data) may be prepared and converted to an overhead view by projective transformation.
  • The avatar image corresponding to the direction of the avatar after the projective transformation is read, and the avatar image AVS is pasted onto the overhead-view background image PBPS as shown in FIG. 34.
  • the flow line image MLS of the corresponding avatar image AVS is displayed based on the above-mentioned flow line locus data.
  • the avatar image data may be 3D avatar image data, and the avatar image may be displayed as a 3D CG image as shown in FIG.
  • the image composition unit 133 (FIG. 1) generates two-dimensional avatar image data by projectively transforming the three-dimensional avatar object in the real space whose arrangement direction is determined into the two-dimensional coordinate system of the background image.
  • the avatar image based on the two-dimensional avatar image data is combined with the background image.
  • the image of the person is not displayed as it is on the monitor 14 but is displayed in an anthropomorphic (avatarized) state, privacy can be used when shooting an unspecified number of persons such as a security camera on the street.
  • For example, the persons in the captured image shown in FIG. 5 are displayed on the monitor 14 as avatars as shown in FIG. 6.
  • Each avatar is designed to represent the characteristics of the corresponding person based on the feature information extracted from the video, so it is possible to grasp what kind of person is in the shooting range.
  • In the above embodiment, an example has been described in which the feature information and coordinate information of a person are acquired from the video signal of the camera 11 and, in addition, motion analysis information indicating the movement and direction of the person and person attribute information such as age and sex are acquired.
  • Various applications can be considered using such information.
  • For example, the above information may be processed by the image composition unit 133 and a plurality of screens may be displayed on the monitor 14.
  • In the example of FIG. 7, an actual video space screen 81, a feature amount reproduction screen 82, a statistical space screen 83, a flow line analysis space screen 84, and a personal identification space screen 85 are displayed side by side on the monitor 14.
  • the real video space screen 81 is a screen that displays video signals from the plurality of cameras 11 in a state where a person is replaced with an avatar.
  • the actual video space screen 81 is divided into four, and the video signals from the four cameras 11 are displayed simultaneously, but the number of cameras is not limited to this.
  • the feature amount reproduction screen 82 is a screen for displaying videos from a plurality of cameras 11, in which a person is replaced with an avatar and a background image is also displayed in a graphic display.
  • the feature amount reproduction screen 82 is generated by three-dimensionally integrating the images from the plurality of cameras 11. That is, the feature amount reproduction screen 82 is configured as a bird's-eye view image by combining videos taken by a plurality of cameras installed at a plurality of locations.
  • the feature amount reproduction screen 82 illustrated in FIG. 7 is a screen representing the state of the station premises (the vicinity of the platform and the ticket gate) and the surrounding stores.
  • In this case, video signals obtained respectively from a camera installed on the station platform, a camera installed around the ticket gate, and cameras installed in each of the plurality of stores are used. Although it is impossible to shoot all of these areas with a single camera, such a bird's-eye view image can be obtained, and such a screen can be configured, by three-dimensionally combining the images taken by the multiple cameras installed at multiple locations.
  • the motion analysis information extracted from the video signal of the camera includes information about the direction of the person and the direction in which the person is moving. By using this information and arranging the avatars so as to match the direction of the actual person, there is an advantage that the movement of the crowd can be easily grasped on the feature amount reproduction screen 82.
  • the statistical space screen 83 is a screen that displays various statistical results. For example, the transition of the number of people within the shooting range of a certain camera can be represented by a graph.
  • The flow line analysis space screen 84 pays attention to a certain person (avatar) and displays, by a flow line, how that person has moved within the shooting range of the cameras. This is made possible by acquiring the coordinate information of the person (avatar) in time series.
  • the personal identification space screen 85 displays the person attribute information of the person in the shooting range. In the example of FIG. 7, the face part of each person's avatar image, gender, and age are displayed.
  • The actual video space screen 81, the feature amount reproduction screen 82, the statistical space screen 83, and the flow line analysis space screen 84 preferably have a GUI (graphical user interface) function.
  • For example, when an avatar image on one of these screens is selected, the person attribute information of the person represented by that avatar is highlighted on the personal identification space screen 85.
  • In the example of FIG. 7, "male, 35 years old", which is the person attribute information of the avatar 82a, is highlighted on the personal identification space screen 85.
  • Conversely, when a piece of person attribute information is selected on the personal identification space screen 85, the avatar image corresponding to the selected person attribute information is highlighted on the feature amount reproduction screen 82.
  • In addition, the movement path of that avatar may be displayed on the flow line analysis space screen 84.
  • FIG. 8 is a schematic diagram illustrating an expression example of an avatar when there is no continuity of the transmitting camera.
  • When the feature amount captured by camera A can be confirmed by camera B at the destination, the received avatar is colored.
  • When it cannot be confirmed, the received avatar is not colored and the default avatar is left as it is, so that the two cases can be distinguished and the camera images can be used appropriately in each case.
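One possible, purely illustrative way to implement this hand-over decision is to score the agreement between the feature record received via camera B and the records previously accumulated from camera A, and to colour the avatar only when the score clears a threshold; the particular features, weights, and threshold below are assumptions, not values from the specification.

```python
def feature_match_score(a, b):
    """Crude agreement score between two feature records (dicts) such as
    {'height': 1.7, 'sex': 'M', 'upper_color': (200, 30, 30)}.
    Weights and tolerances here are illustrative."""
    score = 0.0
    if abs(a['height'] - b['height']) < 0.05:
        score += 1.0
    if a['sex'] == b['sex']:
        score += 1.0
    # colour distance of upper-body clothing, normalised to 0..1
    ca, cb = a['upper_color'], b['upper_color']
    dist = sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
    score += max(0.0, 1.0 - dist / 255.0)
    return score

def resolve_handoff(record_from_b, history_from_a, threshold=2.0):
    """Return (matched_record, colored): colour the avatar only when the
    feature amount captured by camera A is confirmed at camera B."""
    best = max(history_from_a,
               key=lambda r: feature_match_score(record_from_b, r),
               default=None)
    if best is not None and feature_match_score(record_from_b, best) >= threshold:
        return best, True          # reuse the existing ID / coloured avatar
    return None, False             # keep the uncoloured default avatar
```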
  • FIG. 9 is a block diagram illustrating a schematic configuration of the information transmission system 2 according to the second embodiment.
  • As shown in FIG. 9, the information transmission system 2 differs from the information transmission system 1 of the first embodiment, which includes the plurality of cameras 11a, 11b, ..., in that it has only a single camera 11.
  • Accordingly, the information transmission system 2 includes only one set of the coordinate information adding unit 121 and the feature point extracting unit 122, and does not include the multiple camera cooperation unit 123.
  • the operations of the coordinate information adding unit 121, the feature point extracting unit 122, and other processing units are the same as those in the first embodiment.
  • a system that extracts and transmits information only from the video signal of one camera 11 is also included in one embodiment of the present invention.
  • each of the information transmission system transmission unit 12 and the information transmission system reception unit 13 can be realized as an independent device (camera controller), a computer, or a server.
  • Each unit such as the coordinate information adding unit 121 shown in the block diagram can be realized by the processor executing the program recorded in the memory in these devices.
  • the information transmission system transmission unit 22 according to the second embodiment can be realized as an apparatus integrated with the camera 11.
  • the present invention can also be implemented as a program executed by a general-purpose computer or server, or a medium recording the program, in addition to the embodiment in which the present invention is implemented as hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Provided is an information transmitting system that enables the effective use of a network communication band when transmitting camera video signals. The information transmitting system 1 comprises: a feature point extracting unit 122 for extracting feature points from a subject in a video taken with at least one camera 11, and outputting said feature points as feature information; a coordinate information adding unit 121 for acquiring coordinate information for a subject within the imaging range of the camera; an information transmitting unit 124 for transmitting, to a network, the feature information and the coordinate information; an information receiving unit 131 for receiving the feature information and the coordinate information from the network; a movement generating unit 132 for generating an avatar image of a subject on the basis of the feature information; and an image compositing unit 133 for generating a composite image by combining, with an image showing the background in the imaging range of the camera, the avatar image on the basis of the coordinate information.

Description

情報伝送システム、情報送信装置、情報受信装置、およびコンピュータプログラムInformation transmission system, information transmission device, information reception device, and computer program
 本発明は、カメラで取得された映像情報を伝送するシステムに関する。 The present invention relates to a system for transmitting video information acquired by a camera.
 近年、防犯等の目的で、ショッピングセンターや街頭等の様々な場所に監視カメラが設置されている。これらのカメラで撮影された映像は、ネットワークを介して遠隔地の集中監視センター等にリアルタイムで送られ、モニタに表示される。
 従来、このようなカメラの映像を伝送する際には、ネットワークの通信帯域を有効に利用するために、図10に示すように、撮影された映像信号を、映像圧縮装置により所定の規格(例えば、H.264またはMPEG2等)にしたがって圧縮してから伝送する。そして、伝送先において、映像復号装置により、受信した映像信号を当該規格にしたがって復号する。このような監視カメラの映像信号の圧縮と復号を利用した発明は、例えば下記の特許文献1等に開示されている。また、特許文献2には、画像中の人物と背景とを画像認識装置で個別のオブジェクトに分解し、変換画像選択部でオブジェクト毎に他の画像へ変換して送信できるようにする技術が開示されている。
In recent years, surveillance cameras have been installed in various places such as shopping centers and streets for the purpose of crime prevention. Images taken by these cameras are sent in real time to a remote centralized monitoring center or the like via a network and displayed on a monitor.
Conventionally, when transmitting the video of such a camera, in order to use the communication band of the network effectively, the captured video signal is compressed by a video compression device according to a predetermined standard (for example, H.264 or MPEG2) before being transmitted, as shown in FIG. 10. Then, at the transmission destination, the received video signal is decoded according to that standard by a video decoding device. An invention using such compression and decoding of the video signal of a surveillance camera is disclosed, for example, in Patent Document 1 below. Patent Document 2 discloses a technique in which a person and the background in an image are decomposed into individual objects by an image recognition device, and each object is converted into another image by a converted image selection unit and transmitted.
特開2007−60022号公報 (Japanese Patent Laid-Open No. 2007-60022)
 しかしながら、上述のような画像圧縮技術を利用したとしても、データの圧縮率には限界があるので、例えば複数台の監視カメラからの映像をリアルタイムに伝送する場合には、依然として幅広い帯域が必要になるという課題がある。
 本発明は、このような課題を鑑み、カメラからの映像を伝送する際に、ネットワークの通信帯域の有効利用を可能とする技術を提供する。
However, even if the above image compression technology is used, there is a limit to the data compression rate; for example, when transmitting images from a plurality of surveillance cameras in real time, there remains the problem that a wide bandwidth is still required.
In view of such problems, the present invention provides a technology that enables effective use of a network communication band when transmitting video from a camera.
 上記の目的を達成するために、本発明にかかる情報伝送システムは、
 少なくとも1台のカメラで写した映像内の被写体から特徴点を抽出して特徴情報として出力する特徴点抽出部と、
 カメラの撮影範囲における被写体の座標情報を取得する座標情報付加部と、
 特徴情報および前記座標情報をネットワークへ送出する情報送信部と、
 ネットワークから特徴情報および座標情報を受け取る情報受信部と、
 特徴情報に基づいて被写体のアバター画像を生成する動態生成部と、
 カメラの撮影範囲の背景を表す画像に、座標情報に基づいてアバター画像を合成することにより、合成画像を生成する画像合成部と、を備えたことを特徴とする。
 この構成によれば、被写体から抽出した特徴点の特徴情報と、カメラの撮影範囲における被写体の座標情報とがネットワークへ送出される。そして、この特徴情報に基づいて被写体のアバター画像が生成され、前記座標情報に基づいて、カメラの撮影範囲の背景を表す画像に、当該アバター画像が合成される。これにより、カメラの映像信号を送ることなく、受信側で、背景画像とアバター画像とによって合成画像が生成されるので、カメラの映像信号を送る場合に比較して、ネットワークの通信帯域を有効に利用することが可能となる。また、被写体の実映像の代わりにアバター画像が用いられるので、不特定多数の人物を撮影する場面においても、プライバシーを侵害するおそれがないという利点がある。
 上記本発明の情報伝送システムにおいて座標情報付加部は、撮影画面上に現れる被写体をなす人物の接地点として識別される位置を撮影接地位置とし、当該撮影設置位置に現れる人物画像領域の撮影画面上の高さを人物撮影高さとして、座標情報付加部は、被写体が人物である場合の実空間における歩行面を高さ方向の基準として、カメラの撮影画面上に設定されるカメラ二次元座標系における平面座標点と実空間三次元座標系における歩行面上の空間座標点との変換関係と、カメラ二次元座標系における撮影接地位置毎の人物の撮影高さと実空間座標系での当該人物の実高さとの変換関係とを含む位置・高さ変換関係情報を取得する位置・高さ変換関係情報取得手段と、撮影画面上にて人物画像の撮影接地位置及び撮影高さを特定する撮影接地位置・高さ特定手段と、特定されたそれら撮影接地位置座標及び撮影高さの情報を、位置・高さ変換関係情報に基づいて実空間における人物の接地位置座標である実接地位置座標情報と実空間における人物の高さを情報である実人物高さ情報とに変換・生成する実人物座標・高さ情報生成手段とを備えたものとすることができる。また、動態生成部は、生成された実人物高さ情報に基づいてアバター画像の高さ寸法を決定するアバター高さ決定手段を備え、画像合成部は、実接地位置座標情報に基づいて背景画像へのアバター画像の合成位置を決定するアバター合成位置決定手段を備えたものとすることができる。
 三次元空間内の立体をカメラ映像から特定したい場合、映像は二次元データであるから、一般の立体の空間的な状態を1台のカメラ映像で特定することは原理的にできない。しかし、この発明が対象とする被写体はカメラ撮影される平面上のエリア内を動き回る人物であり、上記構成によれば、その空間幾何学的な移動特性を考慮することで、単独のカメラ映像上の人物画像領域の情報から実空間内の人物位置と高さとを容易に特定可能である。すなわち、撮影対象となるエリアの人物の空間的な存在範囲は、床面や地面など、高さ方向(仮に実空間直交座標系のz軸方向とする)位置が一定の水平面(同様にx−y平面である)にほぼ限られており、その接地点(足元位置)のz座標は常に一定(例えば0)とみなしえる。つまり、エリア内を歩行する人物の接地点の座標は実質的にx−yの二次元系で記述でき、カメラ二次元座標系とも一義的な対応付けが可能となる。カメラ二次元座標系は実空間三次元座標系が射影変換されたものに相当し、カメラから隔たった被写体ほど寸法が縮小されて投影される。この変換は数学的には行列で記述されるが、床面上ないし地面上の実空間座標系での予め知れた種々の位置に高さが既知の基準体を配置してカメラで撮影すれば、その基準体画像の撮影画面上での位置と高さを、実空間上の位置及び実寸と比較することにより、カメラ画面上の人物の位置と高さを実空間上の位置と高さに変換する情報である位置・高さ変換関係情報を得ることができる。これを用いることにより動態生成部は、背景画像上に合成するべきアバター画像の高さを容易に決定でき、画像合成部は、背景画像へのアバター画像の合成位置を合理的かつ容易に決定することができる。
 次に、本発明の情報伝送システムは、特徴点抽出部が被写体の動きまたは向きを解析して動作解析情報として出力し、情報送信部は人物属性情報をネットワークへ送出し、画像合成部が動作解析情報に基づいて前記アバター画像の動きまたは向きを調整するものとして構成できる。これによれば、撮影された時点の被写体の動きや向きに基づいてアバター画像の動きや向きが調整されるので、例えば被写体の動く速度や動く方向を、アバター画像に反映させることができる。
 この場合、カメラは動画撮影可能なものであり、座標情報付加部は撮影された動画のフレーム別に被写体をなす人物の座標情報を取得するものであり、特徴点抽出部は、人物の座標情報のフレーム間の移動軌跡情報を動作解析情報として出力するものとして構成しておくとよい。現在のフレームに対し、これに先行するフレームの移動軌跡情報を解析すれば、現在のフレームに至る人物画像の動きを把握することが特に容易となる。
 例えば、人物は、横歩きや後ずさりなどのイレギュラーな動きをしない限り、顔や胴体が前を向くように歩行動作するのが通常なので、人物画像の代表点(例えば接地点)の移動軌跡が判明していれば、歩行動作に応じた体の向きを逐次把握することができる。そこで、画像合成部は上記の移動軌跡情報に基づいて、背景画像上に合成するアバター画像の向きを調整するように構成できる。
 この場合、動態生成部は、カメラからの視点による当該人物の見え方が反映されるように、実空間における人物の移動方向に応じて異なる表現形態のアバター画像を生成ように構成できる。カメラ視点に対し人物の歩行方向が変化する場合、その歩行方向によるカメラへの映り方(角度)に応じてアバター画像を変化させることで、アバター画像の表現のリアリティーを増すことができる。
 例えば、動態生成部は、実空間における人物の予め定められた複数の移動方向別に表現形態が互いに異なる複数の二次元アバター画像データを記憶する方向別二次元アバター画像データ記憶手段を備え、先行するフレームについて取得されている移動軌跡情報に基づいて人物の移動方向を推定するとともに、方向別の二次元アバター画像データから、推定された移動方向に適合するものを選択するものであり、画像合成部は、選択された二次元アバター画像データに基づくアバター画像を背景画像と合成するものとして構成できる。アバター化する人物の移動方向を、上記のように決められた複数の方向から選択するようにしておき、かつアバター画像データを二次元描画データとして構成しておくことで、用意するアバター画像データの容量を大幅に削減することができる。
 一方、動態生成部は、アバター画像のデータを三次元アバター画像データとして記憶する三次元アバター画像データ記憶手段を備え、該三次元アバター画像データに基づいて三次元アバターオブジェクトを生成するとともに、先行するフレームについて取得されている移動軌跡情報に基づいて人物の移動方向を推定するとともに、推定された移動方向を向くように該三次元アバターオブジェクトの実空間上への配置方向を決定するものであり、画像合成部は、配置方向が決定された実空間上の三次元アバターオブジェクトを背景画像の二次元座標系に射影変換することにより二次元アバター画像データを生成し、該二次元アバター画像データに基づくアバター画像を背景画像と合成するように構成することもできる。この場合はアバター画像データが三次元化されることでデータ容量は増すが、アバター画像の背景画像への貼り込み方向は無段階化でき、一層リアリティーのある表現が可能となる。
 また、画像合成部は、移動軌跡情報に基づいて人物の動線を表す画像を生成することも可能である。この構成によれば、特定の被写体が背景画像上でどのように動いたかを視覚的に把握することが容易となる。例えば、防犯目的等に有効活用することができほか、展示会場や公共施設等において個々の人物が関心を集める場所がどこにあるかを、動線画像の統計傾向分析により明確にできるなど、種々の利点を享受できる。
 次に、本発明の情報伝送システムは、特徴点抽出部が被写体の人物属性を解析して人物属性情報として出力し、情報送信部が人物属性情報をネットワークへ送出するものとして構成できる。この構成によれば、受信側で、この人物属性情報を用いて様々な分析・統計処理等を行うことができる。
 この場合、動態生成部はアバター画像を、人物属性情報を反映したものとして生成ように構成できる。これにより、対応する人物の属性をアバター画像に変換された後も容易に把握することができる。これは、例えば防犯目的の使用を考える場合、肖像権などのプライバシー侵害を回避しつつ被疑者の特定に貢献することにもつながるし、防犯を目的としないビューイング等においても、撮影エリアを行きかう人々の属性をアバター画像によりより単純化ないし強調することができ、画像上での傾向把握を容易にできる利点がある。人物属性情報は、具体的には、人物の性別を反映した性別情報と人物の年齢を反映した年齢情報とを含むものとして構成できるが、これらに限定されるものではなく、例えば顔の風貌などから明確に把握可能なものに限られるものの、国籍(例えば、日本人か、欧米人か)なども属性の一つとしてとらえることができる。
 また、特徴点抽出部は、被写体の人物の外観を解析して外観特徴情報として出力するものとすることもでき、情報送信部は、外観特徴情報をネットワークへ送出するように構成できる。被写体の外観は、人物属性に次いで個々の人物の特定につながる重要な情報であり、分析・統計処理においては有益である。そして、動態生成部はアバター画像を、外観特徴情報を反映したものとして生成するように構成することで、アバター画像に変換された後の、対応する人物の特徴把握を一層踏み込んで行うことができる。
 人物の外観の特徴を最も反映する要素として髪、着衣、持ち物などを例示できる。この場合、外観特徴情報は人物の頭髪の形態及び色彩の一方又は双方を反映した頭髪情報と、人物の着衣の形態及び色彩の一方又は双方を反映した着衣情報と、人物の持ち物の形態及び色彩の一方又は双方を反映した持ち物情報の1以上のものを含むものとして構成できる。これらは、性別や年齢層などの属性の把握補助に貢献し、例えば顔の特徴だけでは年齢等の把握が困難な場合に、これらの特徴情報を合わせて考慮することでより正確な属性把握が可能になる。例えば、着衣や持ち物は年齢層別の流行なども反映するから、10台後半と20台半ばなど、世代の接近した人物の属性を明確化する上で有用である。
 また、防犯等を目的とする場合は、人物の体形(肥満、小太り、やせ形、中肉中背、足の長短など)も有用な情報である。この場合、外観特徴情報は人物の体形を反映した体形情報を含むものとして構成できる。
 さらに、近年は、歩容(歩き方の特徴)も人物を特定する情報として有用である。この場合、外観特徴情報は人物の歩容を反映した歩容情報を含むものとして構成できる。歩容を特定する情報は、例えば歩幅(あるいは、歩行速度と連動した動きの周波数)、腕の振り角、歩行速度、歩行時の上半身角度や上下方向の揺れなどであり、その1種又は2種以上を組み合わせて使用できる。
 動態生成部は、人物歩行動作を細分化したコマデータからなるアバターアニメーションデータを使用するものとして構成でき、背景画像上でアバター画像を歩行動作するアニメーションとしてリアルに表現できる。この場合、動態生成部にて、コマデータの各コマを歩容情報に基づいて補正する画像補正処理を行い、画像合成部はアバター画像を、補正後のコマデータに基づき歩容特徴を反映させたアニメーション形態で背景画像に合成するものとして構成できる。アバターアニメーションデータの各コマの補正処理により、対応する人物の歩容情報反映したアバター画像の動きを容易に実現できる。
 また、撮影監視したいエリアが広い場合、1台のカメラでは視野が届かない、あるいは遠方で画像が小さくなり特徴把握できない、という場合が生じる。この場合は、撮影範囲は実空間座標を共有する形で複数のカメラによりカバーすることができる。各カメラは共通のエリアに対し異なるカメラ座標系で撮影を行うが、共同監視するエリアの実空間を同一座標系にて張っておくと、のちに各カメラの撮影情報を統合したい場合に、前述した手法により人物の座標を、その共通の実空間(例えば、GPSなどで取得できるグローバル座標系など)上に変換する処理を行うだけで直ちに統合処理も完了する利点がある。このとき、異なるカメラの視野感を同一人物が移動する場合、その画像の人物の同一性の判定をカメラ間で受け渡す必要が生じるが、この場合、上記の属性情報や外観特徴情報の一致度に応じて人物の同一性を判定するように構成すれば、特定の人物の追跡や、同一人物に同一アバター画像を使用する、といった判断にも容易に利用できる。
 また、画像合成部は、複数のカメラの撮影範囲を含む俯瞰画像として合成画像を生成するように構成できる。これによれば、複数のカメラの撮影範囲の全体を一目で把握することができる。
 特に、複数カメラでないとカバーできないエリアについて、上記のような俯瞰画像を得るためには、座標情報付加部を次のように構成するとよい。すなわち、複数のカメラの撮影画面上に現れる人物の接地点として識別される位置を撮影接地位置とし、当該撮影設置位置に現れる人物画像領域の撮影画面上の高さを人物撮影高さとして、座標情報付加部は、被写体が人物である場合の実空間における歩行面を高さ方向の基準として、カメラの撮影画面上に設定されるカメラ二次元座標系における平面座標と、実空間三次元座標系における歩行面上の空間座標点との変換関係と、カメラ二次元座標系における撮影接地位置毎の人物の撮影高さと実空間座標系での当該人物の実高さとの変換関係とを含む位置・高さ変換関係情報を取得する位置・高さ変換関係情報取得手段と、撮影画面上にて人物画像の撮影接地位置及び撮影高さを特定する撮影接地位置・高さ特定手段と、特定されたそれら撮影接地位置座標及び撮影高さの情報を、位置・高さ変換関係情報に基づいて実空間における人物の接地位置座標である実接地位置座標情報と実空間における人物の高さを情報である実人物高さ情報とに変換・生成する実人物座標・高さ情報生成手段とを備えたものとして、座標情報付加部を構成する。動態生成部は、生成された実人物高さ情報に基づいてアバター画像の高さ寸法を決定するアバター高さ決定手段を備え、画像合成部は、実空間座標系における複数のカメラが撮影した人物の実接地位置座標情報を俯瞰画像の視点にて座標変換しつつ該俯瞰画像へのアバター画像の合成位置を決定するアバター合成位置決定手段を備えるものとして構成する。
 これは、位置・高さ変換関係情報をカメラ側に付随させることで、撮影画面上の人物領域の位置と高さ情報を実空間座標系に変換する前述の構成を応用したものである。人物の画像情報を一旦実空間上の位置・寸法情報に変換してしまえば、俯瞰視点の背景画像にアバター画像を合成したい場合も、その俯瞰背景画像と実空間との変換関係を予め用意しておくことで、俯瞰視点の背景画像上へもアバター画像の合成を容易に行うことができる。
 次に、本発明においては、特徴点抽出部を、被写体の画像を人体の部位に相当する複数のパーツに分割し、各パーツから特徴点を抽出する。この構成によれば、各部位の特徴点を有効的に検出することができる。この場合、動態生成部は、アバター画像のデータを複数の前記パーツに対応したアバター断片に分割して記憶するアバター画像データ記憶手段を備え、人物の対応するパーツについて抽出された特徴点の情報に基づいてアバター画像のアバター断片を補正した後、その補正後のアバター断片を統合してアバター画像を生成ように構成できる。このようにすると、アバター断片(すなわち、人物の部位)ごとに特徴点を反映した補正をきめ細かく行うことができ、かつ、アバター全体の画像データを特徴別に多数用意する必要がなくなるので、データ容量の削減を図ることができるようになる。
 次に、本発明の情報送信装置は、
 少なくとも1台のカメラで写した映像内の被写体から特徴点を抽出して特徴情報として出力する特徴点抽出部と、
 カメラの撮影範囲における被写体の座標情報を取得する座標情報付加部と、
 特徴情報および座標情報をネットワークへ送出する情報送信部とを備えた情報送信装置であって、
 特徴情報は、送信先で表示される被写体のアバター画像の構成要素に対応付けられており、
 座標情報は、送信先で、前記カメラの撮影範囲の背景を表す画像において、アバター画像を合成する位置を特定するために用いられることを特徴とする。
 また、本発明の情報受信装置は、
 少なくとも1台のカメラで写した映像内の被写体から抽出された特徴点を表す特徴情報と、カメラの撮影範囲における被写体の座標情報とを、ネットワークを介して受け取る情報受信部と、
 特徴情報に基づいて被写体のアバター画像を生成する動態生成部と、
 カメラの撮影範囲の背景を表す画像に、座標情報に基づいてアバター画像を合成することにより、合成画像を生成する画像合成部とを備えていることを特徴とする。
 本発明の情報送信側に適用されるコンピュータプログラムは、
 少なくとも1台のカメラで写した映像内の被写体から特徴点を抽出して特徴情報として出力する特徴点抽出処理と、
 カメラの撮影範囲における被写体の座標情報を取得する座標情報付加処理と、
 特徴情報および座標情報をネットワークへ送出する情報送信処理とをコンピュータに実行させるコンピュータプログラムであって、
 特徴情報は、送信先で表示される被写体のアバター画像の構成要素に対応付けられており、
 座標情報は、送信先で、前記カメラの撮影範囲の背景を表す画像において、アバター画像を合成する位置を特定するために用いられることを特徴とする。
 また本発明の情報受信側に適用されるコンピュータプログラムは、
 少なくとも1台のカメラで写した映像内の被写体から抽出された特徴点を表す特徴情報と、カメラの撮影範囲における被写体の座標情報とを、ネットワークを介して受け取る受信処理と、
 特徴情報に基づいて被写体のアバター画像を生成する動態生成処理と、
 カメラの撮影範囲の背景を表す画像に、座標情報に基づいて前記アバター画像を合成することにより、合成画像を生成する画像合成処理とをコンピュータに実行させるものである。
In order to achieve the above object, an information transmission system according to the present invention includes:
A feature point extraction unit that extracts feature points from a subject in an image captured by at least one camera and outputs the feature points as feature information;
A coordinate information adding unit for acquiring coordinate information of the subject in the shooting range of the camera;
An information transmission unit for transmitting the feature information and the coordinate information to the network;
An information receiver for receiving feature information and coordinate information from the network;
A dynamic generation unit that generates an avatar image of the subject based on the feature information;
And an image composition unit that generates a composite image by compositing an avatar image based on the coordinate information with an image representing the background of the shooting range of the camera.
According to this configuration, the feature information of the feature points extracted from the subject and the coordinate information of the subject within the shooting range of the camera are transmitted to the network. Then, an avatar image of the subject is generated based on the feature information, and the avatar image is combined, based on the coordinate information, with an image representing the background of the shooting range of the camera. As a result, a composite image is generated from the background image and the avatar image on the receiving side without sending the video signal of the camera, so the communication band of the network can be used more effectively than when the video signal of the camera itself is sent. In addition, since an avatar image is used instead of the actual video of the subject, there is the advantage that privacy is not infringed even when a large number of unspecified persons are photographed.
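As a rough, purely illustrative calculation of the bandwidth effect (the figures below are assumptions, not values from the specification), the data rate of per-person feature/coordinate records can be compared with that of a typical compressed video stream:

```python
# Illustrative arithmetic only; all numbers are assumptions.
persons          = 20          # people visible to one camera
bytes_per_record = 200         # ID + coordinates + feature/attribute fields
fps              = 10          # transmission rate of the records

feature_bps = persons * bytes_per_record * 8 * fps   # bits per second
video_bps   = 2_000_000                               # e.g. a 2 Mbps H.264 stream

print(f"feature/coordinate data: {feature_bps / 1000:.0f} kbps")      # 320 kbps
print(f"compressed video stream: {video_bps / 1_000_000:.1f} Mbps")   # 2.0 Mbps
```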
In the information transmission system of the present invention, the coordinate information adding unit sets the position identified as the ground contact point of the person who makes the subject appearing on the photographing screen as the photographing ground position, and displays the person image area appearing at the photographing installation position on the photographing screen. The coordinate information adding unit uses a camera two-dimensional coordinate system set on the camera shooting screen with the walking plane in the real space when the subject is a person as a reference in the height direction. The transformation relationship between the plane coordinate point in the real space and the spatial coordinate point on the walking plane in the real space 3D coordinate system, the shooting height of the person at each shooting ground position in the camera 2D coordinate system, and the Position / height conversion relationship information acquisition means for acquiring position / height conversion relationship information including conversion relationship with actual height, and shooting for specifying the shooting contact position and shooting height of a person image on the shooting screen Actual grounding position coordinate information which is the grounding position coordinates of the person in the real space based on the position / height conversion relation information on the ground position / height specifying means and the information of the identified shooting grounding position coordinates and shooting height And real person coordinate / height information generating means for converting and generating the height of the person in the real space into real person height information as information. The dynamic generation unit includes an avatar height determining unit that determines the height dimension of the avatar image based on the generated real person height information, and the image composition unit includes the background image based on the actual ground position coordinate information. An avatar composition position determining means for determining the composition position of the avatar image to the image can be provided.
When it is desired to specify a three-dimensional space in a three-dimensional space from a camera image, since the image is two-dimensional data, it is not possible in principle to specify a general three-dimensional spatial state with one camera image. However, the subject of the present invention is a person who moves around in the area on the plane to be photographed by the camera, and according to the above configuration, the spatial geometric movement characteristics are taken into consideration, so that a single camera image can be obtained. It is possible to easily specify the position and height of the person in the real space from the information of the person image area. That is, the spatial existence range of the person in the area to be photographed is a horizontal plane (similarly, x− direction) such as a floor surface or the ground where the height direction (assuming to be the z axis direction of the real space orthogonal coordinate system) is constant. It is almost limited to the y plane), and the z coordinate of the contact point (foot position) can always be regarded as constant (for example, 0). That is, the coordinates of the contact point of the person walking in the area can be substantially described in an xy two-dimensional system, and can be uniquely associated with the camera two-dimensional coordinate system. The camera two-dimensional coordinate system corresponds to a real-space three-dimensional coordinate system obtained by projective transformation, and an object separated from the camera is projected with a reduced size. This transformation is mathematically described as a matrix, but if a reference object with a known height is placed at various known positions in the real space coordinate system on the floor or on the ground, the image is taken with a camera. By comparing the position and height of the reference body image on the shooting screen with the position and actual size in the real space, the position and height of the person on the camera screen are changed to the position and height in the real space. Position / height conversion related information that is information to be converted can be obtained. By using this, the dynamic generation unit can easily determine the height of the avatar image to be synthesized on the background image, and the image synthesis unit reasonably and easily determines the synthesis position of the avatar image to the background image. be able to.
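The calibration described above can be sketched as follows (illustrative only, with assumed function names): the reference-body measurements give point correspondences from which a screen-to-walking-plane homography is fitted, and the locally calibrated ratio α = H/h converts an on-screen person height into a real height.

```python
import numpy as np

def fit_ground_homography(screen_pts, world_pts):
    """Least-squares 3x3 homography mapping screen points (xi, eta) of ground
    contact positions to walking-plane coordinates (x, y); needs >= 4 pairs."""
    rows = []
    for (u, v), (x, y) in zip(screen_pts, world_pts):
        rows.append([u, v, 1, 0, 0, 0, -x * u, -x * v, -x])
        rows.append([0, 0, 0, u, v, 1, -y * u, -y * v, -y])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return Vt[-1].reshape(3, 3)

def screen_to_world(Hmat, p):
    """Convert a screen ground point p = (xi, eta) to walking-plane (x, y)."""
    v = Hmat @ np.array([p[0], p[1], 1.0])
    return v[:2] / v[2]

def nearest_alpha(calibration, p):
    """calibration: list of ((xi, eta), alpha) pairs measured with the reference
    body; returns the alpha of the calibrated ground position nearest to p."""
    return min(calibration,
               key=lambda c: (c[0][0] - p[0]) ** 2 + (c[0][1] - p[1]) ** 2)[1]

def real_height(calibration, p, h_pixels):
    """Approximate real height H = alpha * h for an on-screen person height h
    (in pixels) observed at ground point p."""
    return nearest_alpha(calibration, p) * h_pixels
```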
Next, in the information transmission system of the present invention, the feature point extraction unit analyzes the motion or orientation of the subject and outputs it as motion analysis information, the information transmission unit sends the person attribute information to the network, and the image composition unit operates. It can be configured to adjust the movement or orientation of the avatar image based on the analysis information. According to this, since the movement and direction of the avatar image are adjusted based on the movement and direction of the subject at the time of shooting, for example, the moving speed and moving direction of the subject can be reflected in the avatar image.
In this case, the camera can capture moving images, the coordinate information adding unit acquires coordinate information of a person who makes a subject for each frame of the captured moving image, and the feature point extracting unit stores the coordinate information of the person. It may be configured to output movement trajectory information between frames as motion analysis information. If the movement trajectory information of the previous frame is analyzed for the current frame, it becomes particularly easy to grasp the movement of the person image that reaches the current frame.
For example, a person usually walks with their face and torso facing forward unless they move irregularly, such as walking sideways or backwards, so the movement trajectory of the representative point of the person image (for example, the ground contact point) is If it is found out, the orientation of the body according to the walking motion can be grasped sequentially. Therefore, the image composition unit can be configured to adjust the orientation of the avatar image to be synthesized on the background image based on the movement trajectory information.
In this case, the dynamic generation unit can be configured to generate different avatar images according to the movement direction of the person in the real space so that the appearance of the person from the viewpoint of the camera is reflected. When the walking direction of the person changes with respect to the camera viewpoint, the reality of the avatar image expression can be increased by changing the avatar image in accordance with the way (angle) the camera is reflected in the walking direction.
For example, the dynamic generation unit includes a direction-specific two-dimensional avatar image data storage unit that stores a plurality of two-dimensional avatar image data having different representation forms according to a plurality of predetermined movement directions of a person in real space. Estimating the moving direction of the person based on the movement trajectory information acquired for the frame, and selecting one that matches the estimated moving direction from the two-dimensional avatar image data for each direction. Can be configured to synthesize an avatar image based on the selected two-dimensional avatar image data with a background image. By selecting the moving direction of the person to be avatar from a plurality of directions determined as described above and configuring the avatar image data as two-dimensional drawing data, the prepared avatar image data The capacity can be greatly reduced.
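A minimal sketch of this direction-dependent selection (the numbering and orientation of the eight directions are assumptions): the movement vector between the previous and current frame positions is quantized to the nearest of eight predefined directions, and the pre-rendered two-dimensional avatar image for that direction is chosen.

```python
import math

def estimate_direction_index(p_prev, p_curr, n_directions=8):
    """Pick the predefined direction (0..n_directions-1) closest to the movement
    vector between two consecutive frame positions.  Index 0 is assumed to be
    the +x direction, with indices increasing counter-clockwise."""
    dx, dy = p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]
    angle = math.atan2(dy, dx) % (2 * math.pi)
    step = 2 * math.pi / n_directions
    return int((angle + step / 2) // step) % n_directions

def select_avatar_sprite(sprites_by_direction, p_prev, p_curr):
    """sprites_by_direction: list of 8 pre-rendered 2-D avatar images; returns
    the one matching the estimated movement direction."""
    return sprites_by_direction[estimate_direction_index(p_prev, p_curr)]
```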
On the other hand, the dynamic generation unit includes a three-dimensional avatar image data storage unit that stores the data of the avatar image as the three-dimensional avatar image data, and generates a three-dimensional avatar object based on the three-dimensional avatar image data. Estimating the movement direction of the person based on the movement trajectory information acquired for the frame, and determining the arrangement direction of the three-dimensional avatar object in the real space so as to face the estimated movement direction, The image composition unit generates a two-dimensional avatar image data by projecting and transforming a three-dimensional avatar object in the real space whose arrangement direction is determined to a two-dimensional coordinate system of a background image, and based on the two-dimensional avatar image data The avatar image can be configured to be combined with the background image. In this case, the data capacity is increased by making the avatar image data three-dimensional. However, the direction in which the avatar image is pasted on the background image can be made stepless, and more realistic expression can be realized.
The image composition unit can also generate an image representing a person's flow line based on the movement trajectory information. According to this configuration, it is easy to visually grasp how a specific subject has moved on the background image. For example, it can be used effectively for crime prevention purposes, and it can be clarified by statistical trend analysis of flow line images where there are places where individual people are interested in exhibition halls and public facilities. Benefit from the benefits.
Next, the information transmission system of the present invention can be configured such that the feature point extraction unit analyzes the person attribute of the subject and outputs it as person attribute information, and the information transmission unit sends the person attribute information to the network. According to this configuration, various analysis / statistical processes and the like can be performed using the person attribute information on the receiving side.
In this case, the dynamic generation unit can be configured to generate the avatar image as reflecting the person attribute information. Thereby, the attribute of a corresponding person can be easily grasped even after being converted into an avatar image. For example, when considering use for crime prevention purposes, this also contributes to the identification of suspects while avoiding privacy infringements such as portrait rights. The attributes of the people can be simplified or emphasized by the avatar image, and there is an advantage that the tendency on the image can be easily grasped. Specifically, the person attribute information can be configured to include gender information that reflects the gender of the person and age information that reflects the age of the person, but is not limited thereto, for example, the appearance of the face, etc. However, nationality (for example, Japanese or Westerners) can be considered as one of the attributes.
In addition, the feature point extraction unit can analyze the appearance of the subject person and output it as appearance feature information, and the information transmission unit can be configured to send the appearance feature information to the network. The appearance of the subject is important information that leads to the identification of individual persons following the person attributes, and is useful in analysis and statistical processing. And the dynamics generation unit can be configured to generate the avatar image as a reflection of the appearance feature information, so that the features of the corresponding person can be further understood after being converted into the avatar image. .
Examples of elements that most reflect the characteristics of the appearance of a person include hair, clothing, and belongings. In this case, the appearance characteristic information includes hair information that reflects one or both of the form and color of the person's hair, clothing information that reflects one or both of the form and color of the person's clothing, and the form and color of the person's belongings. Can be configured to include one or more items of inventory information reflecting one or both of the items. These contribute to assisting in grasping attributes such as gender and age group.For example, when it is difficult to grasp the age etc. only with facial features, more accurate attribute grasping is possible by considering these feature information together. It becomes possible. For example, since clothes and belongings reflect trends by age group, it is useful for clarifying the attributes of persons with similar generations, such as the latter half of 10 and the middle of 20.
In addition, for the purpose of crime prevention or the like, the body shape of a person (obesity, fatness, skinny shape, middle back of a meat, length of legs, etc.) is also useful information. In this case, the appearance feature information can be configured to include body shape information that reflects the body shape of a person.
Furthermore, in recent years, gaits (features of walking) are also useful as information for specifying a person. In this case, the appearance feature information can be configured to include gait information reflecting a person's gait. The information for specifying the gait is, for example, the stride (or the frequency of movement linked to the walking speed), the swing angle of the arm, the walking speed, the upper body angle at the time of walking, the vertical shaking, etc. Can be used in combination with more than one species.
The dynamic generation unit can be configured to use avatar animation data composed of frame data obtained by subdividing a person walking action, and can realistically represent an avatar image as an animation of a walking action on a background image. In this case, the dynamic generation unit performs image correction processing for correcting each frame of the frame data based on the gait information, and the image composition unit reflects the gait feature on the avatar image based on the corrected frame data. It can be configured to be combined with the background image in the form of animation. The movement of the avatar image reflecting the gait information of the corresponding person can be easily realized by the correction processing of each frame of the avatar animation data.
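The frame ("koma") handling can be sketched as follows, under the assumption that one walking cycle of the animation corresponds to two steps and that each frame exposes simple stride and arm-swing offsets; the data layout is illustrative, not taken from the specification.

```python
def animation_frame_index(t_seconds, step_period, frames_per_cycle=8):
    """Choose which walking-animation frame to show at time t so that one
    two-step cycle of the animation lasts 2 * step_period (tau) seconds."""
    cycle = 2.0 * step_period
    phase = (t_seconds % cycle) / cycle          # 0..1 through the cycle
    return int(phase * frames_per_cycle) % frames_per_cycle

def corrected_frame(base_frames, idx, stride_scale=1.0, arm_swing_scale=1.0):
    """Apply per-frame gait corrections.  base_frames[idx] is assumed to be a
    dict with numeric 'leg_offset' and 'arm_offset' defaults; the scales come
    from the transmitted gait information (stride WL, arm swing lambda)."""
    frame = dict(base_frames[idx])
    frame['leg_offset'] = frame['leg_offset'] * stride_scale
    frame['arm_offset'] = frame['arm_offset'] * arm_swing_scale
    return frame
```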
In addition, when the area to be photographed and monitored is large, there are cases where the field of view cannot be reached with one camera, or the image becomes small at a distant place and the feature cannot be grasped. In this case, the imaging range can be covered by a plurality of cameras in the form of sharing real space coordinates. Each camera shoots a common area with different camera coordinate systems, but if the real space of the area to be jointly monitored is stretched in the same coordinate system, if you want to integrate the shooting information of each camera later, There is an advantage that the integration process can be completed immediately only by performing the process of converting the coordinates of the person into the common real space (for example, a global coordinate system that can be acquired by GPS or the like). At this time, if the same person moves in the sense of field of view of different cameras, it is necessary to pass the determination of the identity of the person of the image between the cameras. In this case, the degree of coincidence of the attribute information and appearance feature information described above If the configuration is such that the identity of a person is determined according to the above, it can be easily used for the tracking of a specific person and the determination of using the same avatar image for the same person.
Further, the image composition unit can be configured to generate a composite image as a bird's-eye view image including shooting ranges of a plurality of cameras. According to this, the whole imaging range of a plurality of cameras can be grasped at a glance.
In particular, in order to obtain an overhead image as described above for an area that can only be covered by a plurality of cameras, the coordinate information adding unit may be configured as follows. That is, the position identified as the ground contact point of the person appearing on the shooting screens of the plurality of cameras is set as the shooting ground position, and the height of the person image area appearing at the shooting setting position on the shooting screen is set as the person shooting height. The information adding unit includes a plane coordinate in a camera two-dimensional coordinate system set on a camera shooting screen and a real space three-dimensional coordinate system with a walking plane in the real space when the subject is a person as a reference in the height direction. Including the conversion relationship between the spatial coordinate points on the walking plane in the camera and the conversion relationship between the shooting height of the person at each shooting ground position in the camera two-dimensional coordinate system and the actual height of the person in the real space coordinate system. A position / height conversion relationship information acquisition means for acquiring height conversion relation information, a shooting contact position / height specifying means for specifying a shooting contact position and a shooting height of a person image on the shooting screen, and Them Based on the position / height conversion relation information, the information on the shadow contact position coordinates and the shooting height is the actual contact position coordinate information that is the contact position coordinates of the person in the real space and the actual height of the person in the real space. The coordinate information adding unit is configured as having real person coordinate / height information generating means for converting / generating person height information. The dynamic generation unit includes avatar height determining means for determining the height dimension of the avatar image based on the generated real person height information, and the image composition unit is a person photographed by a plurality of cameras in the real space coordinate system. The actual ground position coordinate information is converted from the viewpoint of the bird's-eye view image, and the avatar composition position determining means for determining the composition position of the avatar image to the bird's-eye view image is provided.
This is an application of the above-described configuration for converting the position and height information of the person area on the photographing screen into the real space coordinate system by attaching the position / height conversion relation information to the camera side. Once the person's image information is converted into position / dimension information in the real space, even if you want to combine an avatar image with the background image of the overhead view viewpoint, prepare the conversion relationship between the overhead view background image and the real space in advance. By doing so, it is possible to easily synthesize the avatar image on the background image of the overhead view viewpoint.
Next, in the present invention, the feature point extraction unit divides the image of the subject into a plurality of parts corresponding to parts of the human body, and extracts feature points from each part. According to this structure, the feature point of each part can be detected effectively. In this case, the dynamic generation unit includes an avatar image data storage unit that divides and stores the data of the avatar image into a plurality of avatar fragments corresponding to the parts, and uses the feature point information extracted for the corresponding parts of the person. After correcting the avatar fragment of the avatar image based on the basis, the corrected avatar fragment can be integrated to generate an avatar image. In this way, it is possible to make fine corrections that reflect feature points for each avatar fragment (that is, a person's part), and it is not necessary to prepare a large number of image data of the entire avatar for each feature. Reduction can be achieved.
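A small sketch of this fragment-based assembly (the fragment data layout is an assumption): each body-part fragment of a standard avatar is corrected using the scale and colour extracted for the corresponding part of the person, and the corrected fragments are then integrated in drawing order.

```python
PART_NAMES = ('head', 'torso', 'right_arm', 'left_arm', 'right_leg', 'left_leg')

def build_avatar(base_fragments, part_features):
    """base_fragments: dict part -> {'points': [(x, y), ...], 'color': (r, g, b)}
    describing a standard-proportioned fragment.  part_features: dict part ->
    optional {'scale': float, 'color': (r, g, b)} extracted for that body part.
    Each fragment is corrected individually, then the corrected fragments are
    integrated (returned in drawing order) to form the avatar image."""
    avatar = []
    for part in PART_NAMES:
        frag = dict(base_fragments[part])
        feat = part_features.get(part, {})
        scale = feat.get('scale', 1.0)
        frag['points'] = [(x * scale, y * scale) for x, y in frag['points']]
        frag['color'] = feat.get('color', frag['color'])
        avatar.append((part, frag))
    return avatar
```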
Next, the information transmitting apparatus of the present invention is
A feature point extraction unit that extracts feature points from a subject in an image captured by at least one camera and outputs the feature points as feature information;
A coordinate information adding unit for acquiring coordinate information of the subject in the shooting range of the camera;
An information transmission device comprising an information transmission unit for transmitting feature information and coordinate information to a network,
The feature information is associated with the constituent elements of the avatar image of the subject displayed at the transmission destination,
The coordinate information is used to specify a position where the avatar image is to be combined in an image representing the background of the shooting range of the camera at the transmission destination.
The information receiving apparatus of the present invention is
An information receiving unit that receives, via a network, feature information representing feature points extracted from a subject in an image captured by at least one camera, and coordinate information of the subject in a shooting range of the camera;
A dynamic generation unit that generates an avatar image of the subject based on the feature information;
An image composition unit that generates a composite image by combining, based on the coordinate information, an avatar image with an image representing the background of the shooting range of the camera is provided.
The computer program applied to the information transmission side of the present invention is:
A feature point extraction process for extracting feature points from a subject in an image captured by at least one camera and outputting them as feature information;
Coordinate information addition processing for acquiring coordinate information of the subject in the shooting range of the camera;
A computer program for causing a computer to execute information transmission processing for transmitting feature information and coordinate information to a network,
The feature information is associated with the constituent elements of the avatar image of the subject displayed at the transmission destination,
The coordinate information is used to specify a position where the avatar image is to be combined in an image representing the background of the shooting range of the camera at the transmission destination.
The computer program applied to the information receiving side of the present invention is:
A reception process for receiving, via a network, feature information representing feature points extracted from a subject in an image captured by at least one camera, and coordinate information of the subject in a shooting range of the camera;
Dynamic generation processing for generating an avatar image of a subject based on feature information;
A computer is caused to perform an image composition process for generating a composite image by compositing the avatar image based on coordinate information with an image representing the background of the shooting range of the camera.
 本発明によれば、カメラからの映像を伝送する際に、ネットワークの通信帯域の有効利用を妨げない伝送方式を提供できる。 According to the present invention, it is possible to provide a transmission method that does not hinder the effective use of the communication band of the network when transmitting video from the camera.
 図1は、本発明の第1の実施形態にかかる情報伝送システムの概略構成を示すブロック図である。
 図2は、特徴点抽出部の処理手順を示すフローチャートである。
 図3は、特徴点抽出部が人体を各パーツに分けて特徴を抽出する様子を示す模式図である。
 図4は、特徴点抽出部が人物属性情報を抽出する処理の流れを示すフローチャートである。
 図5は、カメラの撮影範囲に設定された座標系の一例を示す模式図である。
 図6は、背景画像にアバター画像が合成された表示例を示す模式図である。
 図7は、本発明の応用例を示す模式図である。
 図8は、送信側のカメラの連続性が無い場合におけるアバターの表現例を示す模式図である。
 図9は、第2の実施形態にかかる情報伝送システムの概略構成を示すブロック図である。
 図10は、従来の伝送方式を示す模式図である。
 図11は、人物画像領域の差分抽出の概念を説明する図である。
 図12は、背景画像の概念図である。
 図13は、座標情報付加処理の説明図である。
 図14は、図13に続く説明図である
 図15は、レンズ歪補正の説明図である。
 図16は、座標情報付加処理の流れを示すフローチャートである。
 図17は、画面上の人物画像領域の抽出状態の一例を示す図である。
 図18は、人物画像領域の高さhを、変換係数αを用いて実身長Hに変換する説明図である。
 図19は、人物領域検出処理の流れを示すフローチャートである。
 図20は、歩容特徴情報の抽出概念を示す図である。
 図21は、移動軌跡情報の抽出概念を示す図である。
 図22は、受信部側の情報蓄積の概念を示す図である。
 図23は、アバター画像データベースの概念を示す図である。
 図24は、アバター画像の方向決定に用いる人物移動方向の概念を示す図である。
 図25は、アバター断片図形データの例を示す図である。
 図26は、アバター断片図形を合成してアバター画像を得る例を示す説明図である。
 図27は、アバター画像データをアバターアニメーションデータとして構成する例を示す図である。
 図28は、アバター断片画像データを二次元のベクトル図形データとして構成する例を示す図である。
 図29は、受信部側処理の流れを示すフローチャートである。
 図30は、新規アバター作成処理の流れを示すフローチャートである。
 図31は、アバター背景合成処理の流れを示すフローチャートである。
 図32は、統合モード表示処理の流れを示すフローチャートである。
 図33は、統合表示モードにおける平面視表示形態の一例を示す図である。
 図34は、同じく俯瞰視表示形態の一例を示す図である。
 図35は、三次元アバター画像を表示する例を示す画像である。
FIG. 1 is a block diagram showing a schematic configuration of an information transmission system according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the processing procedure of the feature point extraction unit.
FIG. 3 is a schematic diagram showing how the feature point extraction unit extracts features by dividing the human body into parts.
FIG. 4 is a flowchart showing a flow of processing in which the feature point extraction unit extracts person attribute information.
FIG. 5 is a schematic diagram illustrating an example of a coordinate system set in the shooting range of the camera.
FIG. 6 is a schematic diagram illustrating a display example in which an avatar image is combined with a background image.
FIG. 7 is a schematic diagram showing an application example of the present invention.
FIG. 8 is a schematic diagram illustrating an expression example of an avatar when there is no continuity of the transmitting camera.
FIG. 9 is a block diagram illustrating a schematic configuration of an information transmission system according to the second embodiment.
FIG. 10 is a schematic diagram showing a conventional transmission method.
FIG. 11 is a diagram for explaining the concept of extracting a difference in a person image area.
FIG. 12 is a conceptual diagram of a background image.
FIG. 13 is an explanatory diagram of the coordinate information addition process.
FIG. 14 is an explanatory diagram following FIG. 13. FIG. 15 is an explanatory diagram of lens distortion correction.
FIG. 16 is a flowchart showing the flow of the coordinate information addition process.
FIG. 17 is a diagram showing an example of a person image region extraction state on the screen.
FIG. 18 is an explanatory diagram for converting the height h of the person image area into the actual height H using the conversion coefficient α.
FIG. 19 is a flowchart showing the flow of the person area detection process.
FIG. 20 is a diagram showing a concept of extracting gait feature information.
FIG. 21 is a diagram illustrating a concept of extracting movement trajectory information.
FIG. 22 is a diagram showing the concept of information storage on the receiving side.
FIG. 23 is a diagram illustrating the concept of an avatar image database.
FIG. 24 is a diagram illustrating the concept of the person moving direction used for determining the direction of the avatar image.
FIG. 25 is a diagram illustrating an example of avatar fragment graphic data.
FIG. 26 is an explanatory diagram illustrating an example in which an avatar image is obtained by combining avatar fragment graphics.
FIG. 27 is a diagram illustrating an example in which avatar image data is configured as avatar animation data.
FIG. 28 is a diagram illustrating an example in which avatar fragment image data is configured as two-dimensional vector graphic data.
FIG. 29 is a flowchart showing a flow of processing on the reception unit side.
FIG. 30 is a flowchart showing a flow of new avatar creation processing.
FIG. 31 is a flowchart showing the flow of the avatar background composition process.
FIG. 32 is a flowchart showing the flow of the integrated mode display process.
FIG. 33 is a diagram illustrating an example of a planar display form in the integrated display mode.
FIG. 34 is a diagram showing an example of a bird's eye view display form.
FIG. 35 is an image showing an example of displaying a three-dimensional avatar image.
 以下、図面を参照しながら、本発明の具体的な実施形態について詳しく説明する。
 (実施の形態1)
 まず、図1を参照しながら、本発明の第1の実施形態にかかる情報伝送システム1の構成と動作の概略について説明する。図1は、情報伝送システム1の概略構成を示すブロック図である。情報伝送システム1は、情報伝送システム送信部12(情報送信装置)と、情報伝送システム受信部13(情報受信装置)とを有している。情報伝送システム送信部12と、情報伝送システム受信部13とは、ネットワーク15を介して接続されている。ネットワーク15は、例えばインターネットなどの公共ネットワークであるが、ローカルネットワークなどのプライベートネットワークであっても良い。
 情報伝送システム送信部12は、様々な場所に設置された複数のカメラ11(11a,11b・・・)から映像信号を受信し、送信前処理(後に詳述する。)を行ってからネットワーク15へ送出する。なお、図1においては、カメラ11を2台のみ図示しているが、カメラの台数は任意である。カメラ11と情報伝送システム送信部12との間の通信は、有線通信であっても良いし、無線通信であっても良い。
 また、情報伝送システム受信部13は、情報伝送システム送信部12からネットワーク15を介して送信された映像信号を受信し、受信後処理(後に詳述する。)を行ってから、モニタ14へ表示させたり、必要に応じて映像記録装置(図示せず)へ録画したりする。
 情報伝送システム送信部12は、座標情報付加部121、特徴点抽出部122、複数カメラ連携部123、および、情報送信部124を備えている。座標情報付加部121および特徴点抽出部122は、一つのカメラ11に対して一組ずつ設けられている。例えば、図1においては、カメラ11aに対して座標情報付加部121aおよび特徴点抽出部122aが設けられ、カメラ11bに対して座標情報付加部121bおよび特徴点抽出部122bが設けられている。
 特徴点抽出部122は、カメラ11で撮影された映像信号から人物領域を検出し、さらに、それぞれの人物の外観(例えば、着衣、髪型、体形、持ち物等)についての特徴を抽出する。座標情報付加部121は、カメラ11で撮影されるエリア内の人物の位置を、座標情報として検出する。
 ここで、情報伝送システム1は、カメラで撮影した映像信号をそのまま圧縮して伝送する従来の情報伝送システムとは異なり、特徴点抽出部122で得られた特徴情報と座標情報付加部121で得られた座標情報のみを、ネットワーク15を介して伝送する。そして、この特徴情報と座標情報とを受け取った情報伝送システム受信部13側では、それぞれのカメラ11の撮影範囲の背景画像を予め記録しておき、前記の特徴情報に基づいて個々の人物を的確に表すアバターの画像を生成し、前記の座標情報にしたがって、背景画像の適宜の位置にアバター画像を合成する。このようにすることで、撮影された映像信号をそのまま圧縮して伝送する場合と比較して、伝送されるデータ量が少なくて済むので、ネットワークの通信帯域を有効に利用することができる。
 なお、情報伝送システム1は、複数のカメラ11と接続されているので、前述のように、カメラ11のそれぞれについて座標情報付加部121および特徴点抽出部122を備えている。このため、複数カメラ連携部123は、座標情報付加部121で得られた座標情報と、特徴点抽出部122で得られた特徴情報とに、複数のカメラ11のうちいずれのカメラの映像信号から得られた情報であるかを示すタグ情報を付与して、情報送信部124へ送る。情報送信部124は、複数カメラ連携部123から得た情報を所定の規格で符号化し、ネットワーク15へ送出する。
 情報伝送システム受信部13は、情報受信部131、動態生成部132、および、画像合成部133を備えている。情報受信部131は、ネットワーク15から受信した情報を復号化し、動態生成部132へ送る。動態生成部132は、受信した情報に含まれる特徴情報に基づいて、撮影された人物を表すアバターの画像を生成する。動態生成部132で生成されたアバター画像は、座標情報と共に画像合成部133へ送られる。画像合成部133は、アバター画像と座標情報とに基づいて、それぞれのカメラ11の撮影範囲の背景画像とアバター画像との合成画像を生成し、モニタ14へ表示させる。このとき、どのカメラ11の映像信号から得られた情報であるかを示す前記のタグ情報は、背景画像を特定するために用いられる。
 次に、座標情報付加部121の処理について説明する。座標情報付加部121は、それぞれのカメラ11の撮影範囲に対して設定された座標系において、人物がいる位置の座標を特定する。例えば図5に示すように、一つのカメラ11の撮影範囲において、x−y座標系51を設定する。座標情報付加部121は、このx−y座標系51において、特徴点抽出部122が特定した人物領域の座標を検出する。ここで検出された座標は、当該人物のいる位置を表す座標情報として、特徴情報と共に情報伝送システム受信部13へ送られる。
 三次元空間内の立体をカメラ映像から特定したい場合、映像は二次元データであるから、一般の立体の空間的な状態を1台のカメラ映像で特定することは原理的にできない。しかし、本発明が対象とする被写体はカメラ11が撮影するエリア内を動き回る人物であり、その空間幾何学的な移動特性を考慮することで、図5に示す単独のカメラ11の画面上の人物画像領域PAの情報から実空間内の人物位置と高さを特定可能である。すなわち、撮影対象となるエリアの人物の空間的な存在範囲は、床面や地面、図5の場合は人物が歩行する路面RSなどであり、要するに高さ方向(z軸方向)の位置が一定の水平面にほぼ限られている点に着目する。この路面RSは、直交座標系にてz軸座標が常に0のx−y平面であり、該路面RS上を歩行する人物の接地点の座標は実質的にx−yの二次元で記述でき、三次元空間内の点でありながら、撮影画面に設定されるカメラ二次元座標系と一義的な対応付けが可能となる。
 他方、カメラ二次元座標系は実空間三次元座標系が射影変換されたものに相当し、カメラ光軸方向に隔たった被写体ほど寸法が縮小されて投影される。これは数学的には射影変換行列で記述されるが、実空間座標系での予め知れた種々の位置に基準体を配置してカメラで撮影すれば、その基準体画像の撮影画面上での位置と高さを実空間上の基準体の位置及び実寸と比較することにより、カメラ上の人物の映像位置・高さを実空間上の位置・高さに変換する位置・高さ変換関係情報を得ることができる。その具体例を図13~図15の説明図及び図16のフローチャートを用いて説明する。
 すなわち、カメラの撮影視野SAにおいて、路面RS上に高さが既知の基準体SCを前後左右の種々の位置に配置し撮影を行う。図16のS501では、その基準体の高さHを入力する。すると撮影画面SA上では、これは同一の基準体に由来したものであるにもかかわらず、カメラ11からの距離に応じて異なる寸法の基準体画像SCIとなって現れるので、これを抽出する(S502)。これらの基準体画像SCIは全て同じ路面RS(すなわち、x−y平面(z=0))上にあるので、その下端を表す点(基準点)p1~p3は実空間においてすべてz=0の接地点である。そこで、この基準点p1~p3を画面上に設定されたカメラ二次元座標系であるξ−η座標系にて読み取り、基準点の画面座標データp(ξ,η)として記憶する(S503)。なお、画面上のどのエリアが路面RSを表すかについては、路側縁REや路面上の白線WLなどの画像を参考にすることができる。
 次に、撮影画面上の映像はカメラレンズの歪の影響を受けるので、実空間の厳密な射影変換画像とはなっておらず、視野内の位置に応じて画像にゆがみが生じていることがある。図15の左に示すように、そのひずみは画面の端に近い領域ほど大きく座標系も非線形化する。例えば、広角レンズなど視野角の大きいレンズでは外向きに凸状の歪となり、逆に望遠レンズなど視野角の小さいレンズでは凹状の歪となる。そこで、この歪を解消し、直交平面座標系の点となるように変換補正を行う(S504)。この時の補正係数は、例えば図13において、画面上に現れている白線WLなど、実空間上で直線であることが予めわかっている図形の形状が直線化するような最適化演算によって定めることができる。なお、この補正により画面の端ほど歪み解消に伴い寸法は伸長するから、補正後の画面形状SA’は元の画面SAの外にはみ出すこととなる。
 次いで、図14に示すように、基準体SCの実空間系での座標を決定する。例えば、測量による場合は、カメラ11から路面に設置した基準体SCまでの距離dと、カメラから基準体を見込む線と基準線(例えば、x軸方向)とのなす角度θを測定すれば、
 x=d・cosθ
 y=d・sinθ
として基準体SCの接地点の実空間座標P(x,y,0)を求めることができる。他方、衛星測位システム(GPS)により座標を直接特定してもよい。なお、ここで用いられる実空間座標系は、それぞれのカメラの撮影範囲内に設定される独立した座標系であっても良いし、衛星測位システム(GPS)から提供されるグローバル座標系と連動していても良い。ただし、後述するが、複数のカメラの撮影範囲を統合して一つの空間を生成する場合は、それぞれのカメラの座標系を連結する必要があり、複数カメラが連携撮影するエリアに対し、統合的な実空間座標を張っておくことが望ましい。また、x−y座標系51を設定する際に、例えばLEDライト等を用いて、キャリブレーションを行うことが望ましい。
 次に、基準体画像SCIの画面上の高さhを読み取る(S506)。基準体SCの実高さHは既知なので、基準体画像SCIの高さhを実高さHに変換する係数
 α=H/h
を計算し(S507)、画面座標データp(ξ,η)と実空間座標データP(x,y,0)と互いに対応付て記憶する(S508)。以上の処理をすべての基準体SCについて繰り返したのち(S509→S501)、路面RS上にて実測していない主要点でのp,P,αの組を補う処理を行う。この処理は補間データを取得するステップとして行ってもよいし、得られているp,Pの組から射影変換行列の要素を定める処理として行ってもよい。そして、これらの情報が位置・高さ変換関係情報を構成することとなる。
Hereinafter, specific embodiments of the present invention will be described in detail with reference to the drawings.
(Embodiment 1)
First, an outline of the configuration and operation of the information transmission system 1 according to the first embodiment of the present invention will be described with reference to FIG. FIG. 1 is a block diagram showing a schematic configuration of the information transmission system 1. The information transmission system 1 includes an information transmission system transmission unit 12 (information transmission device) and an information transmission system reception unit 13 (information reception device). The information transmission system transmission unit 12 and the information transmission system reception unit 13 are connected via a network 15. The network 15 is a public network such as the Internet, but may be a private network such as a local network.
The information transmission system transmission unit 12 receives video signals from a plurality of cameras 11 (11a, 11b, ...) installed in various places, performs pre-transmission processing (described in detail later), and then sends the result to the network 15. In FIG. 1, only two cameras 11 are shown, but the number of cameras is arbitrary. Communication between the cameras 11 and the information transmission system transmission unit 12 may be wired or wireless.
The information transmission system reception unit 13 receives the data transmitted from the information transmission system transmission unit 12 via the network 15, performs post-reception processing (described in detail later), and then displays the result on the monitor 14 or, as necessary, records it to a video recording device (not shown).
The information transmission system transmission unit 12 includes a coordinate information addition unit 121, a feature point extraction unit 122, a multiple camera linkage unit 123, and an information transmission unit 124. One set of coordinate information adding unit 121 and feature point extracting unit 122 is provided for each camera 11. For example, in FIG. 1, a coordinate information adding unit 121a and a feature point extracting unit 122a are provided for the camera 11a, and a coordinate information adding unit 121b and a feature point extracting unit 122b are provided for the camera 11b.
The feature point extraction unit 122 detects a person area from the video signal photographed by the camera 11, and further extracts features regarding the appearance (for example, clothing, hairstyle, body shape, belongings, etc.) of each person. The coordinate information adding unit 121 detects the position of a person in an area photographed by the camera 11 as coordinate information.
Here, the information transmission system 1 differs from a conventional system in which the video signal captured by the camera is compressed and transmitted as it is: only the feature information obtained by the feature point extraction unit 122 and the coordinate information obtained by the coordinate information addition unit 121 are transmitted via the network 15. The information transmission system receiving unit 13, which receives the feature information and the coordinate information, holds background images of the shooting range of each camera 11 recorded in advance, generates an avatar image representing each person based on the feature information, and synthesizes the avatar image at the appropriate position of the background image according to the coordinate information. By doing so, the amount of data to be transmitted can be reduced compared to the case where the captured video signal is compressed and transmitted as it is, so that the network communication band can be used effectively.
Since the information transmission system 1 is connected to the plurality of cameras 11, each of the cameras 11 is provided with a coordinate information addition unit 121 and a feature point extraction unit 122, as described above. The multi-camera cooperation unit 123 therefore attaches, to the coordinate information obtained by the coordinate information addition unit 121 and the feature information obtained by the feature point extraction unit 122, tag information indicating from which camera's video signal the information was obtained, and sends the result to the information transmission unit 124. The information transmission unit 124 encodes the information received from the multi-camera cooperation unit 123 according to a predetermined standard and transmits the encoded information to the network 15.
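The following is a hypothetical sketch, not the patent's actual wire format, of how a per-person record (camera tag, person tag, feature data, and real-space coordinates) might be assembled and encoded before being handed to the network; the record fields, host address, and length-prefixed framing are all assumptions made for illustration.

```python
import json
import socket

def build_record(camera_id, person_tag, features, coords):
    # One record per detected person, tagged with the camera that produced it.
    return {
        "camera_id": camera_id,    # which camera 11a, 11b, ... the data came from
        "person_tag": person_tag,  # tag distinguishing person areas within one frame
        "features": features,      # per-part feature data (hairstyle, clothing, ...)
        "coords": coords,          # (x, y) ground-contact position in real space
    }

def send_records(records, host="203.0.113.10", port=5000):
    """Encode the records and push them to the network (plain TCP, for illustration only)."""
    payload = json.dumps(records).encode("utf-8")
    with socket.create_connection((host, port)) as sock:
        sock.sendall(len(payload).to_bytes(4, "big"))  # simple length prefix
        sock.sendall(payload)

# Example: two people seen by camera "11a" in one frame.
records = [
    build_record("11a", 1, {"hair": "long", "upper": {"type": "t-shirt", "color": "red"}}, (3.2, 7.5)),
    build_record("11a", 2, {"hair": "short", "upper": {"type": "coat", "color": "navy"}}, (4.0, 6.1)),
]
# send_records(records)  # uncomment when a receiver is listening
```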
The information transmission system reception unit 13 includes an information reception unit 131, a dynamic generation unit 132, and an image composition unit 133. The information receiving unit 131 decodes the information received from the network 15 and sends it to the dynamic generation unit 132. The dynamic generation unit 132 generates an avatar image representing a photographed person based on the feature information included in the received information. The avatar image generated by the dynamic generation unit 132 is sent to the image composition unit 133 together with the coordinate information. Based on the avatar image and the coordinate information, the image composition unit 133 generates a composite image of the avatar image and the background image of the shooting range of each camera 11, and displays the composite image on the monitor 14. At this time, the tag information, which indicates from which camera 11's video signal the information was obtained, is used to select the appropriate background image.
Next, processing of the coordinate information adding unit 121 will be described. The coordinate information adding unit 121 specifies the coordinates of the position of the person in the coordinate system set for the shooting range of each camera 11. For example, as shown in FIG. 5, an xy coordinate system 51 is set in the shooting range of one camera 11. The coordinate information adding unit 121 detects the coordinates of the person area specified by the feature point extracting unit 122 in the xy coordinate system 51. The coordinates detected here are sent to the information transmission system receiving unit 13 together with the feature information as coordinate information representing the position of the person.
When one wants to specify a position in three-dimensional space from a camera image, this is in principle impossible in the general case with a single camera image, because the image is two-dimensional data. However, the subject targeted by the present invention is a person moving around in the area photographed by the camera 11, and if the geometric characteristics of that movement are taken into account, the position and height of the person in real space can be specified from the image area PA of the person on the screen of the single camera 11 shown in FIG. 5. That is, the spatial range in which a person can exist in the photographed area is essentially confined to the floor surface or the ground (in the case of FIG. 5, the road surface RS on which the person walks), i.e. to a horizontal plane whose position in the height direction (z-axis direction) is fixed. This road surface RS is an xy plane whose z coordinate is always 0 in an orthogonal coordinate system, so the coordinates of the ground contact point of a person walking on the road surface RS can be described essentially in the two dimensions x and y; although it is a point in three-dimensional space, it can be uniquely associated with the camera two-dimensional coordinate system set on the photographing screen.
On the other hand, the camera two-dimensional coordinate system corresponds to a projective transformation of the real-space three-dimensional coordinate system, and a subject farther away in the camera optical-axis direction is projected at a reduced size. Mathematically this is described by a projective transformation matrix; in practice, if a reference body is placed at various positions known in advance in the real-space coordinate system and photographed with the camera, then by comparing the position and height of the reference body image on the photographing screen with the position and actual size of the reference body in real space, position/height conversion-related information can be obtained that converts the image position and height of a person seen by the camera into a position and height in real space. A specific example is described with reference to the explanatory diagrams of FIGS. 13 to 15 and the flowchart of FIG. 16.
That is, in the imaging field of view SA of the camera, reference bodies SC of known height are arranged on the road surface RS at various positions front, rear, left, and right. In S501 of FIG. 16, the height H of the reference body is input. On the photographing screen SA, the same reference body appears as reference body images SCI of different sizes depending on the distance from the camera 11, and these are extracted (S502). Since these reference body images SCI are all on the same road surface RS (that is, the xy plane with z = 0), the points (reference points) p1 to p3 representing their lower ends are all ground contact points with z = 0 in real space. Therefore, the reference points p1 to p3 are read in the ξ-η coordinate system, which is the camera two-dimensional coordinate system set on the screen, and stored as screen coordinate data p(ξ, η) of the reference points (S503). As for which area on the screen represents the road surface RS, images such as the road side edge RE and the white line WL on the road surface can be referred to.
Next, since the image on the photographing screen is affected by distortion of the camera lens, it is not a strict projective transformation of real space, and the image may be distorted depending on the position in the field of view. As shown on the left of FIG. 15, the distortion is larger in regions closer to the edge of the screen, and the coordinate system becomes nonlinear. For example, a lens with a large viewing angle such as a wide-angle lens shows outward convex distortion, whereas a lens with a small viewing angle such as a telephoto lens shows concave distortion. This distortion is therefore removed, and a conversion correction is performed so that the screen becomes an orthogonal plane coordinate system (S504). The correction coefficient at this time can be determined by an optimization operation that straightens the shape of a figure known to be straight in real space, such as the white line WL appearing on the screen of FIG. 5. Note that this correction stretches the regions near the edge of the screen as the distortion is removed, so the corrected screen shape SA′ protrudes outside the original screen SA.
Next, as shown in FIG. 14, the coordinates of the reference body SC in the real-space system are determined. For example, in the case of surveying, if the distance d from the camera 11 to the reference body SC installed on the road surface and the angle θ formed by the line of sight from the camera to the reference body and a reference line (for example, the x-axis direction) are measured,
x = d · cos θ
y = d · sin θ
the real-space coordinates P(x, y, 0) of the ground contact point of the reference body SC can be obtained. Alternatively, the coordinates may be specified directly by a satellite positioning system (GPS). The real-space coordinate system used here may be an independent coordinate system set within the shooting range of each camera, or it may be linked to a global coordinate system provided by a satellite positioning system (GPS). However, as will be described later, when the shooting ranges of multiple cameras are integrated to generate a single space, the coordinate systems of the respective cameras need to be connected, so it is desirable to set a common real-space coordinate system over the area in which the multiple cameras perform linked shooting. Further, when setting the xy coordinate system 51, it is desirable to perform calibration using, for example, an LED light or the like.
Next, the height h on the screen of the reference body image SCI is read (S506). Since the actual height H of the reference body SC is known, a coefficient for converting the height h of the reference body image SCI into the actual height H is computed as
α = H / h
(S507), and the screen coordinate data p(ξ, η) and the real-space coordinate data P(x, y, 0) are stored in association with each other (S508). After the above process has been repeated for all the reference bodies SC (S509 → S501), a process is performed to supplement the set of p, P, and α at the main points on the road surface RS that were not actually measured. This process may be performed as a step of acquiring interpolation data, or as a process of determining the elements of the projective transformation matrix from the obtained combinations of p and P. Together, these pieces of information constitute the position/height conversion-related information.
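A minimal sketch, assuming the reference bodies have already been measured, of how the position/height conversion-related information might be held: measured triples of screen ground point p(ξ, η), real-space ground point P(x, y), and scale factor α = H / h, with values at unmeasured screen positions filled in by interpolation. The sample coordinates and the use of linear interpolation are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import griddata

# Measured calibration data (illustrative values).
screen_pts = np.array([[120, 400], [320, 380], [520, 405], [330, 250]], float)  # p(xi, eta)
real_pts   = np.array([[1.0, 2.0], [3.0, 2.2], [5.0, 2.1], [3.1, 8.0]], float)  # P(x, y), z = 0
alphas     = np.array([0.021, 0.022, 0.021, 0.048])                             # H / h per point

def screen_to_real(p):
    """Interpolate the real-space ground position for a screen ground point p."""
    q = np.asarray(p, float)[None, :]
    x = griddata(screen_pts, real_pts[:, 0], q, method="linear")[0]
    y = griddata(screen_pts, real_pts[:, 1], q, method="linear")[0]
    return float(x), float(y)

def alpha_at(p):
    """Interpolate the pixel-to-metre height factor alpha at screen point p."""
    q = np.asarray(p, float)[None, :]
    return float(griddata(screen_pts, alphas, q, method="linear")[0])

p = (330, 360)                       # ground point of a person image on the screen
print(screen_to_real(p), alpha_at(p))  # points outside the measured hull would need extrapolation
```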
In addition, as shown in FIG. 13, a person SP may be used instead of the reference body SC to acquire the coordinate information. In this case, a calibration method may be used in which the height of the person serving as the subject is input and the person walks to the four corners of the camera's field of view, so that the camera angle of view and position information are learned and acquired.
Next, the contents of the feature point extraction process by the feature point extraction unit 122 will be described in detail with reference to FIGS. FIG. 2 is a flowchart showing a processing procedure of the feature point extraction unit 122. FIG. 3 is a schematic diagram showing how the feature point extraction unit 122 extracts features by dividing the human body into parts.
When a predetermined number of frames of the video signal are input from the corresponding camera 11, the feature point extraction unit 122 detects the moving object MO appearing in the video signal by taking the difference between frames FM, as shown in FIG. 11 (step S11 of FIG. 2). Specifically, if an image area corresponds to a moving object, its position and shape change between the image area MO′ of the preceding frame and the image area MO of the succeeding frame, while the background does not change, so the image area MO of the moving object can be extracted by taking the image difference between the two frames. On the other hand, if an image is taken while no moving object is present, a background image BP is obtained as shown in FIG. 12. The background image BP is captured for each camera, transmitted to the receiving unit 13 in FIG. 1, and stored in a storage device 135 accessible by the computer constituting the receiving unit 13 (in the present embodiment, an information storage/statistical processing unit configured as an external storage device or a separate computer).
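A rough sketch of the frame-difference detection of step S11, assuming two consecutive grayscale frames are available as image files (the file names, threshold values, and minimum area are placeholders, not values from the patent):

```python
import cv2

prev = cv2.imread("frame_prev.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_curr.png", cv2.IMREAD_GRAYSCALE)

diff = cv2.absdiff(curr, prev)                       # inter-frame difference
_, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
mask = cv2.dilate(mask, None, iterations=2)          # close small gaps in the changed region

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Each (x, y, w, h) box is a candidate moving object MO; very small boxes are
# discarded as noise, matching the size-based filtering described in step S12.
moving_objects = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
print(moving_objects)
```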
Next, the feature point extraction unit 122 extracts a person region by performing segmentation, edge detection, pattern matching, and the like on the moving object image detected in step S11, and determines whether or not the moving object is a person (step S12). Various methods can be used for the moving object detection and person extraction from the video signal, and the processing is not limited to a specific method. Also, among the moving objects detected from the video signal, those of relatively small size are likely to be noise and may be judged not to be persons, while regions of relatively large size may be judged to be persons.
Along with the extraction of the person area, processing for specifying the position coordinates and height of the person is performed. This is described below with reference to the explanatory diagrams of FIGS. 17 and 18 and the flowchart of FIG. 19. First, as shown in FIG. 17, the position of the lower end edge of the detected person area PA is regarded as the ground contact point p, and its coordinates p(ξ, η) on the screen are read (S1201). Since the height-direction dimension of the person area changes depending on the person's posture, the frames are searched for the person-area image considered closest to an upright state (S1203), with reference to the position/height conversion-related information described above. For example, for a walking person, the posture is closest to upright not when both feet are apart but at the moment, during the next step, when the rear foot passes and almost overlaps the foot that was put forward first; the height h of the area is therefore measured on the screen using the person area in such an image frame (S1204). Then, as shown in FIG. 18, it is converted into the actual height H using the conversion coefficient α included in the position/height conversion-related information (S1205).
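A small illustration of S1201 to S1205, assuming a calibration table like the sketch above: the lowest edge of the person area is treated as the ground point, the most upright-looking frame supplies the pixel height, and α converts it to metres. The upright-frame heuristic (largest height-to-width ratio) and the box values are assumptions.

```python
def estimate_height(person_boxes_per_frame, alpha_at):
    """person_boxes_per_frame: (x, y, w, h) boxes for the same person, one per frame;
    alpha_at(p) returns the pixel-to-metre factor at screen point p."""
    # Heuristic for the "most upright" frame: the box with the largest height-to-width ratio.
    best = max(person_boxes_per_frame, key=lambda b: b[3] / max(b[2], 1))
    x, y, w, h = best
    ground_point = (x + w / 2, y + h)   # p(xi, eta): bottom centre of the box (S1201)
    return alpha_at(ground_point) * h   # real height H = alpha * pixel height h (S1205)

boxes = [(300, 180, 60, 170), (305, 178, 48, 176), (310, 179, 55, 172)]
print(estimate_height(boxes, lambda p: 0.01))  # with alpha = 0.01 m/pixel -> 1.76 m
```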
Moreover, it is preferable that the feature point extraction unit 122 further performs a process of analyzing the motion of each part (step S15). For example, for the head p1, the head motion (movement and orientation) is detected. Of the parts of the human body, the head is said to be the easiest to recognize. If the orientation of the head is known by extracting the head p1 first, it becomes easier to specify the state and moving direction of the other parts. Also, for example, when the head is facing right, it is possible to apply the inference, in the part division described later, that the left hand and left foot may be hidden and not visible.
For example, for a walking person, the motion is analyzed and acquired as gait information. In this case, as the motion of the torso p2, for example, the posture such as the upper-body angle ψ and whether or not the person is stooped are detected, as shown in FIG. 20. The motions of the right hand p3 and left hand p4 are detected, for example, as the arm swing angle λ. As the motions of the right foot p5 and left foot p6, for example, the walking speed, the stride WL, and the knee bending angle are detected. The motion analysis information such as the gait detected here is sent as motion analysis information to the information transmission system receiving unit 13, and is reflected in the movement and orientation of the avatar representing the person.
Another important item of motion analysis information is the moving direction of the person. As shown in FIG. 21, when the coordinate information P1, P2, ..., Pn of a person is specified for each frame of the captured video, the set of coordinates P1, P2, ..., Pn constitutes the movement trajectory information of that person across frames. The difference Vn − Vn−1 between the position vectors Vn and Vn−1 of the coordinates Pn and Pn−1 of adjacent frames can be used as an index representing the moving direction of the person at position Pn, and it is also used effectively in determining the orientation of the avatar image described later.
Next, the feature point extraction unit 122 divides the person region P extracted in step S12 (see FIG. 3(a)) into six parts: a head p1, a torso p2, a right hand p3, a left hand p4, a right foot p5, and a left foot p6 (see FIG. 3(b)) (step S13). Then, the external appearance features of each of the six parts are analyzed (step S14). For example, for the head p1, the hairstyle, hair color, presence or absence of a hat, and the like are extracted as feature points. For the torso p2, the body shape, the shape of the clothing, the color of the clothing, the presence or absence of specific belongings such as a rucksack, and the like are extracted as feature points. The feature points for the right hand p3 and left hand p4 are, for example, the body shape, the shape (or type) of clothing, the color of clothing, and belongings. The feature points for the right foot p5 and left foot p6 are, for example, the body shape, the shape (or type) of clothing, the color of clothing, shoes, and the like.
The number of parts into which the person region is divided is not limited to six. For the purpose of reducing the processing load, it may, for example, be divided into three parts: the head, the upper body, and the lower body. Conversely, in order to generate a more realistic avatar, it may be divided into more than six parts. The feature points listed here are merely examples, and various elements may be extracted as feature points. Also, for example, in the above description the hairstyle and hair color, and the clothing shape and clothing color, are extracted as independent feature points, but the "hair color" and "clothing color" may instead be treated as additional data of the "hairstyle" and "clothing shape". The extracted feature points for each part are output as feature data and sent to the information transmission system receiving unit 13.
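A hedged sketch of how the per-part feature data of steps S13/S14 might be structured before transmission; the field names are illustrative, not the patent's.

```python
from dataclasses import dataclass, field

@dataclass
class PartFeatures:
    part: str                 # "head", "torso", "right_hand", ...
    shape: str = ""           # body shape / clothing shape (or type)
    color: str = ""           # clothing or hair colour, kept as additional data
    extras: dict = field(default_factory=dict)   # hat, rucksack, shoes, ...

person_features = [
    PartFeatures("head", shape="long_hair", color="black", extras={"hat": False}),
    PartFeatures("torso", shape="t-shirt", color="red", extras={"rucksack": True}),
    PartFeatures("right_foot", shape="jeans", color="blue", extras={"shoes": "sneakers"}),
]
```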
The variation of the extracted feature point (feature data) corresponds to the variation of the component (partial image) of each part in the avatar of the person generated by the information transmission system receiving unit 13 as described later. For example, when “long hair” is extracted as a feature point of the hairstyle, a partial image of “long hair” is used as the head hair of the avatar. For example, in the case of a person whose body width is thick, a thick body is used as a partial image of the body of an avatar.
When a plurality of person areas are extracted in the shooting range of one camera 11 in steps S11 and S12, the processes in steps S13 to S15 are performed for each person area. The obtained feature information and operation information are sent to the multi-camera cooperation unit 123 together with tag information for specifying individual person areas.
Further, the feature point extraction unit 122 may additionally extract information that characterizes the person to some extent (person attribute information), such as the age and sex of the person. In this case, as shown in FIG. 4, the feature point extraction unit 122 determines the age and sex of the person based on the feature amounts extracted from the images of the divided parts (step S23). For example, if the head p1 can be captured, the age and sex can be determined using face recognition technology.
The age may be output as age data in increments of 1 year, or may be output as data representing the age zone, for example, “20 years old”. In addition, here, gender and age are exemplified as the person attribute information, but any information other than this can be used as information for specifying a person to some extent. For example, it may be possible to discriminate between “adult” and “child”.
In the above case, the person is not uniquely identified but is given attributes such as sex and age. In contrast, when a person database in which images such as faces and personal information (names, etc.) are registered in advance is available, it is also possible, as necessary, to uniquely identify an individual by matching the image of the head p1 against the face images registered in the person database (step S24). This enables, for example, the following kinds of personal identification:
(1) One-to-one matching for purposes such as arresting a criminal
(2) Matching by clothing and age for purposes such as finding a lost child
(3) Matching for rough age estimation for purposes such as marketing
Next, returning to FIG. 1: in the information transmission system receiving unit 13, the information receiving unit 131 receives information from the network 15 and decodes it. As described above, the decoded information includes the information (feature information and coordinate information) obtained from the video signals of the plurality of cameras 11 (cameras 11a, 11b, ...), and it is stored and accumulated in the information storage/statistical processing unit 135. FIG. 22 shows an example of the accumulated information: a detection ID is assigned to each person judged to be the same person from the degree of coincidence of the feature information described above, and the records are stored sequentially in association with extracted data such as the time and date of reception, the position of the person in real-space coordinates (x and y coordinates), the way of walking (gait), physique, height, hair color, color of upper-body clothing, color of lower-body clothing, facial feature information, sex, and age. The date and ID fields are abbreviated as #1, #2, and so on, but information such as the type (form) of upper-body and lower-body clothing, the presence or absence of a hat, and belongings is also associated. Further, the gait data include the stride WL, the arm swing angle λ, the upper-body angle ψ, the knee bending angle δ, the one-step cycle τ, and the like.
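An illustrative sketch of one row of the accumulated table in FIG. 22; the field names and values are assumptions, but the columns follow the description above.

```python
from dataclasses import dataclass

@dataclass
class PersonRecord:
    detection_id: int        # same ID for detections judged to be the same person
    timestamp: str           # reception date and time
    x: float                 # real-space x coordinate
    y: float                 # real-space y coordinate
    stride_wl: float         # stride WL
    arm_swing_lambda: float  # arm swing angle lambda
    upper_body_psi: float    # upper-body angle psi
    knee_bend_delta: float   # knee bending angle delta
    step_period_tau: float   # one-step cycle tau
    height_m: float
    hair_color: str
    upper_color: str
    lower_color: str
    gender: str
    age: int

row = PersonRecord(1, "2017-03-14T09:20:31", 3.2, 7.5, 0.68, 28.0, 4.5, 35.0, 0.55,
                   1.72, "black", "red", "blue", "male", 35)
```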
Hereinafter, processing on the receiving unit side will be described.
The dynamic generation unit 132 generates an avatar image of each person based on the received feature information. As described above, the feature information includes feature data representing the features of each part of the person. The dynamic generation unit 132 has access to a database that stores in advance the avatar partial images corresponding to each item of feature data (this database is assumed to be formed in the information storage/statistical processing unit 135 of FIG. 1, but it may also be a separate storage device or server on the network). For example, when "long hair" is included as feature data for a person's head p1, a partial image of long hair is acquired from the database. The dynamic generation unit 132 generates the avatar image of the person by combining the avatar partial images selected according to the feature data of each part of the person.
FIG. 23 conceptually shows an example of how the database is constructed. The database contains avatar fragment graphic data, the constituent elements of an avatar, consisting of upper-body and lower-body clothing, hairstyles, belongings, and the like, drawn for a standard height and body shape. For each avatar constituent element, the avatar fragment graphic data are prepared in different representations according to the moving direction of the person in real space, so that the way the person appears from the camera's viewpoint (the person's orientation with respect to the camera) is reflected. In this embodiment, as shown in FIG. 24, the orientation of the person P with respect to the camera 11 is defined in eight directions (J1 to J8), and avatar fragment graphic data divided in correspondence with the human body parts p1 to p6 described in FIG. 3 (p2 to p4 for upper-body clothing, p5 and p6 for lower-body clothing) are prepared in eight variations each (v1 to v8, whose indices correspond to J1 to J8) corresponding to the appearance in those eight directions. Shoes, hair, and belongings are not divided into parts, but they are also prepared in eight variations each.
The left side of FIG. 25 shows an example of the avatar fragment graphic data selected when the direction v7 of FIG. 24 is designated, and the right side shows an example selected when the direction v1 is designated. A T-shirt is selected for the upper body and jeans for the lower body, and the variants corresponding to directions v7 and v1 are selected from the data of FIG. 23. FIG. 26 shows the avatar images AV7 and AV1 obtained by combining them. As for the face, a contour and features reflecting the extracted facial feature information are synthesized for each direction, but a standard face (or head) image according to sex and age may instead be prepared.
As shown in FIG. 27, the avatar image data (or avatar fragment graphic data) are configured as avatar animation data consisting of a set of frames into which the walking motion of a person is subdivided. In the example of FIG. 27, one cycle of two steps is expressed as an 8-frame animation: several frames from stepping out with the right foot until it lands (here the four frames AFM1 to AFM4), and four frames from stepping out with the left foot until it lands (AFM5 to AFM8). As shown in FIG. 23, these eight frames of data are prepared for each type of avatar fragment graphic data, at least for the lower-body and upper-body clothing.
The image data of each avatar fragment are configured as two-dimensional vector graphic data, as shown in FIG. 28. The vector graphic data connect, in a loop of vectors, the vertex coordinates that define the outline of the figure; when a linear transformation is applied, the vertex coordinates are moved according to the matrix operation representing that transformation, and by connecting the moved points with vectors again, deformation processing such as enlargement, reduction, and rotation of the figure can be executed easily. The interior of the figure enclosed by the vectors is identified as the colored region by an inside/outside test with respect to the vector lines, and by rasterizing the pixels in that region with the designated color, the color information specified by the feature information of FIG. 22 can easily be reflected in the final avatar image VDR.
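A minimal sketch of the vector-figure handling described above, using generic libraries rather than the patent's implementation: the outline is a ring of vertex coordinates, a 2x2 matrix scales and rotates it, and the enclosed area is rasterised in the colour taken from the feature information (the inside/outside test is delegated to the polygon fill).

```python
import numpy as np
from PIL import Image, ImageDraw

outline = np.array([[0, 0], [10, 0], [12, 18], [-2, 18]], float)   # avatar fragment outline (example)

def transform(vertices, scale=1.0, angle_deg=0.0, offset=(0, 0)):
    """Apply a linear transform (scale and rotation) to the vertex ring, then translate."""
    a = np.deg2rad(angle_deg)
    m = scale * np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return vertices @ m.T + np.asarray(offset, float)

def rasterize(vertices, color, size=(64, 64)):
    """Fill the polygon interior with the specified colour."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).polygon([tuple(v) for v in vertices], fill=color)
    return img

fragment = transform(outline, scale=2.0, angle_deg=5.0, offset=(20, 10))
img = rasterize(fragment, color=(200, 30, 30))   # e.g. red upper-body clothing
img.save("fragment.png")
```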
FIG. 29 is a flowchart showing the flow of processing on the receiving side.
First, the person ID, the motion information (coordinate points), and the feature information sent via the network are received (S601). Next, the received coordinate information P is plotted on the real-space coordinates shared by the plurality of cameras (S602). Then, the person's walking direction vector is calculated from the change in the person's coordinates P between the preceding and following frames, and from the eight directions J1 to J8 of FIG. 24 the one closest to the orientation of that walking direction vector is selected and determined as the arrangement direction of the avatar image (S603).
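A sketch of S603, assuming the eight directions J1 to J8 are spaced every 45 degrees (the mapping from angle sector to J-index is an assumption): the walking direction vector is taken from the previous and current positions, and the nearest of the eight directions is chosen as the avatar arrangement direction.

```python
import math

def walking_direction(p_prev, p_curr):
    return (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1])   # Vn - Vn-1

def quantize_direction(v):
    angle = math.degrees(math.atan2(v[1], v[0])) % 360.0     # 0..360 degrees
    index = int((angle + 22.5) // 45) % 8                    # nearest 45-degree sector
    return f"J{index + 1}"                                   # J1..J8 (assumed ordering)

v = walking_direction((3.2, 7.5), (3.9, 8.1))
print(quantize_direction(v))   # direction used to pick the avatar fragment variants v1..v8
```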
For a person who has just entered the field of view of the camera of interest from the field of view of another camera, an ID may already have been assigned and an avatar image already created, so the avatar creation history corresponding to the received ID is searched (S604). If there is no avatar creation history in S605, the database is searched for a person whose time, position, and features match under predetermined conditions (S606). If there is no corresponding person in S607, a new avatar image is created (S610).
FIG. 30 shows the details of the new avatar creation process. In S601, the hairstyle, clothing, belongings, and their colors included in the feature data are specified. Next, in S6102, among the avatar fragment graphics corresponding to the specified features, the ones corresponding to the determined avatar image arrangement direction (one of J1 to J8) are read out (one of v1 to v8 in FIG. 23). In S6103, the avatar fragment graphics are corrected according to the height, body shape, and gait information included in the feature data, and in S6104 the avatar fragment graphics are colored with the designated colors. Finally, in S6105, the avatar image data are completed by compositing the avatar fragments.
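A hedged sketch of the new-avatar creation flow of FIG. 30, using plain dictionaries in place of real fragment images; the database keys and the final composite representation are assumptions, not the patent's actual interfaces.

```python
fragment_db = {
    ("head",  "long_hair", "J2"): {"part": "head",  "pixels": "...long hair seen from J2..."},
    ("torso", "t-shirt",   "J2"): {"part": "torso", "pixels": "...t-shirt seen from J2..."},
    ("legs",  "jeans",     "J2"): {"part": "legs",  "pixels": "...jeans seen from J2..."},
}

def create_avatar(features, direction):
    fragments = []
    for part, feat in features.items():
        frag = dict(fragment_db[(part, feat["type"], direction)])  # S6102: variant for the direction
        frag["scale"] = feat.get("height_scale", 1.0)              # S6103: correct for height/body/gait
        frag["color"] = feat.get("color")                          # S6104: colour as specified
        fragments.append(frag)
    return {"direction": direction, "fragments": fragments}        # S6105: composite (kept abstract)

features = {
    "head":  {"type": "long_hair", "color": "black"},
    "torso": {"type": "t-shirt",   "color": "red", "height_scale": 1.05},
    "legs":  {"type": "jeans",     "color": "blue"},
}
avatar = create_avatar(features, "J2")
```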
Returning to FIG. 29, if there is an avatar creation history in S605, the process proceeds to S609, where the avatar image data of the corresponding ID is read from the database and reused. On the other hand, if there is a corresponding person in S607, the received person ID is updated with the ID of the corresponding person, and the process proceeds to S609 and the same processing is performed.
The avatar image of each person generated by the dynamic generation unit 132 is sent to the image composition unit 133 together with the coordinate information of that person, and the avatar/background composition process is performed (S611). The image composition unit 133 can access a database (in the information storage/statistical processing unit 135) in which the background images of the shooting ranges of the respective cameras 11 are stored in advance. The image composition unit 133 acquires the background image of each camera 11 from the database and composes it with the avatar images generated by the dynamic generation unit 132. The composite image is output to the monitor 14. The position at which each avatar image is placed is based on the coordinate information of the person represented by that avatar. The image composition unit 133 can also change the orientation of an avatar or adjust the speed at which it moves, based on the motion analysis information (data representing the movement and orientation of the person) obtained by the feature point extraction unit 122. By performing the transmission from the information transmission system transmission unit 12 at as high a frame rate as the feature point extraction unit 122, the coordinate information addition unit 121, the dynamic generation unit 132, and the image composition unit 133 can process, the video of the camera 11 can be displayed on the monitor 14 in almost real time.
FIG. 31 shows the flow of the avatar background / compositing process. First, in S61101, avatar image data corresponding to the specified ID and direction is read. This avatar image data is a set of frame data constituting an animation (FIG. 27), and avatar animation frames are allocated to frames for moving image reproduction in accordance with the speed and stride of the moving coordinate point P (S61102).
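A small sketch of S61102, under the assumption that one animation cycle of eight frames corresponds to two steps: given the stride and the distance moved between output video frames, the index of the walking-animation frame (AFM1 to AFM8) to show is computed for each video frame.

```python
def allocate_animation_frames(positions, stride, n_anim_frames=8):
    """positions: avatar ground positions for successive video frames (metres);
    stride: step length WL in metres. One full cycle (two steps) = n_anim_frames frames."""
    cycle_length = 2.0 * stride          # distance covered by one animation cycle
    travelled = 0.0
    frames = []
    for i in range(1, len(positions)):
        dx = positions[i][0] - positions[i - 1][0]
        dy = positions[i][1] - positions[i - 1][1]
        travelled += (dx * dx + dy * dy) ** 0.5
        phase = (travelled % cycle_length) / cycle_length          # 0..1 within the cycle
        frames.append(int(phase * n_anim_frames) % n_anim_frames)  # index into AFM1..AFM8
    return frames

positions = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 0.0), (0.8, 0.0)]
print(allocate_animation_frames(positions, stride=0.7))
```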
As for the display form of the composite video, in this embodiment either a display mode with the same field of view as the shooting screen of the camera 11 (camera real video mode) or an integrated display mode combining a plurality of cameras can be selected. The mode can be switched from the input unit (a keyboard or touch panel) connected to the information transmission system reception unit 13 in FIG. 1. When the camera real video mode is selected, the process proceeds to S61104, and the position coordinates P(x, y, 0) of all avatars to be displayed simultaneously are plotted in the real-space field-of-view region of the corresponding camera. In S61105, the real-space field-of-view region, together with the plotted position coordinates P, is projectively transformed into the coordinate system of the corresponding camera.
Here, as already described, the camera two-dimensional coordinate system used for determining the position coordinates P has been corrected from the left state to the right state of FIG. 15 in consideration of lens distortion. Whereas the whole field of view fitted on the output screen in the coordinate system before correction, after correction the regions at the edges of the field of view extend beyond the screen of the monitor (FIG. 1: reference numeral 14); with a simple projective transformation alone, the image would differ from the directly viewed camera image by the amount of the distortion correction and look unnatural, and the avatar images of persons appearing at the edges of the field of view might not be displayed. Therefore, in S61106 of FIG. 31, an inverse distortion correction that restores the effect of the original lens distortion is applied to the projectively transformed image, returning the field of view to its original shape. This eliminates the above problems.
Then, in S61107, the selected background image is superimposed, together with the mapped person coordinate positions P, on the output plane that has been returned to the camera two-dimensional coordinate system through the projective transformation and inverse distortion correction, and the avatar image data, adjusted in size and orientation as described above, are pasted and composited at each converted position p (in the camera two-dimensional coordinate system).
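An illustrative sketch of this final composition step: the background image of the camera's shooting range is loaded and each avatar image is pasted at its converted screen position, scaled according to the person's height. The file names and the bottom-centre anchoring convention are assumptions made for the example.

```python
from PIL import Image

background = Image.open("background_camera_11a.png").convert("RGBA")

def paste_avatar(canvas, avatar_path, screen_pos, height_px):
    avatar = Image.open(avatar_path).convert("RGBA")
    scale = height_px / avatar.height                        # size adjusted from the converted height
    avatar = avatar.resize((max(1, int(avatar.width * scale)), height_px))
    x, y = screen_pos                                        # p(xi, eta): ground point on the screen
    canvas.paste(avatar, (int(x - avatar.width / 2), int(y - avatar.height)), avatar)

paste_avatar(background, "avatar_person1.png", screen_pos=(330, 360), height_px=172)
background.save("composite.png")
```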
The screen of the monitor 14 in FIG. 1 may be divided so as to display the video of a plurality of cameras 11 simultaneously, or the screen of the monitor 14 may be switched so as to display the video of only one of the plurality of cameras 11.
On the other hand, if the integrated display mode is selected in S61104, the process proceeds to S1000 and display is performed in the integrated mode. FIG. 32 is a flowchart showing the details. In S1001, the positions P(x, y, 0) and directions of all avatars to be displayed simultaneously are plotted in the real space shared by the plurality of cameras. In S1002, flow line trajectory data are created by superimposing the person position coordinates P of the preceding and following frames. In this embodiment, either a plan-view display form as shown in FIG. 33 or a bird's-eye-view display form as shown in FIG. 34 can be selected.
If the plan view is selected, the process proceeds to S1004, and a plan view background image PBP prepared in advance is pasted as shown in FIG. Then, the avatar image AV is plotted and displayed on the planar view background image. In this case, an avatar image for plane view may be prepared separately, or the avatar may be displayed in the horizontal direction so that the feature information can be easily grasped. When the flow line display is designated, the flow line image ML of the corresponding avatar image AV is displayed based on the flow line locus data described above.
On the other hand, if the bird's-eye view is selected in S1003, the process proceeds to S1006, where the real-space positions, directions, and flow line data of the avatars are projectively transformed according to the bird's-eye viewing angle and direction, and the background image for the bird's-eye view is then superimposed (S1007). As the background image, a photographed image for the bird's-eye view may be prepared and used, or three-dimensional background image data (for example, three-dimensional computer graphics (CG) data) may be prepared and converted to the bird's-eye view by projective transformation. In S1008, the avatar image corresponding to the direction of each avatar after the projective transformation is read out and pasted as avatar image AVS onto the bird's-eye-view background image PBPS, as shown in FIG. 34. In this case too, when the flow line display is designated, the flow line image MLS of the corresponding avatar image AVS is displayed based on the flow line trajectory data described above.
The avatar image data may be 3D avatar image data, and the avatar image may be displayed as a 3D CG image as shown in FIG. In this case, since the avatar image is prepared as a three-dimensional avatar object from the beginning, the arrangement and rotation in the designated direction can be freely set in three dimensions. In this case, the image composition unit 133 (FIG. 1) generates two-dimensional avatar image data by projectively transforming the three-dimensional avatar object in the real space whose arrangement direction is determined into the two-dimensional coordinate system of the background image. The avatar image based on the two-dimensional avatar image data is combined with the background image.
As described above, by extracting the person's feature points and position coordinates from the video signal of the camera 11 and transmitting only the extracted data, the bandwidth of the network 15 can be used more effectively than in the conventional case where the video signal itself is compressed and transmitted. Also, since the person is not shown on the monitor 14 as a raw image but is displayed in an anthropomorphized (avatarized) form, there is the advantage that privacy is not infringed when an unspecified number of people are photographed, as with a street security camera. For example, the persons in the captured image shown in FIG. 5 are displayed on the monitor 14 as avatars as shown in FIG. 6. Moreover, since each avatar is designed, based on the feature information extracted from the video, to represent the characteristics of the corresponding person, it is possible to grasp what kind of people are within the shooting range.
In the embodiment described above, an example was explained in which the feature information and coordinate information of a person are acquired from the video signal of the camera 11, and in which motion analysis information representing the person's movement and orientation and person attribute information such as age and sex are further acquired. Various applications are conceivable using such information. For example, as shown in FIG. 7, the above information may be processed by the image composition unit 133 so that a plurality of screens are displayed on the monitor 14. In the example of FIG. 7, an actual video space screen 81, a feature amount reproduction screen 82, a statistical space screen 83, a flow line analysis space screen 84, and a personal identification space screen 85 are displayed side by side on the monitor 14. Alternatively, the actual video may be streamed as necessary while the avatars are being viewed. It is also possible to configure the system so that, when a face is recognized on the transmission side, a capture of that face is recorded on the transmission side, and the face image linked to the avatar can be obtained by the receiving side on request.
The real video space screen 81 is a screen that displays video signals from the plurality of cameras 11 in a state where a person is replaced with an avatar. In the example of FIG. 7, the actual video space screen 81 is divided into four, and the video signals from the four cameras 11 are displayed simultaneously, but the number of cameras is not limited to this.
The feature amount reproduction screen 82 is a screen that displays the video from a plurality of cameras 11 with the persons replaced by avatars and the background also rendered graphically. In the example of FIG. 7, the feature amount reproduction screen 82 is generated by three-dimensionally integrating the video from the plurality of cameras 11; that is, the feature amount reproduction screen 82 is configured as a bird's-eye view by combining the video captured by a plurality of cameras installed at a plurality of locations. For example, the feature amount reproduction screen 82 illustrated in FIG. 7 represents the inside of a station (the platform and the area around the ticket gates) and the surrounding shops. In this example, the video signals acquired from a camera installed on the station platform, a camera installed around the ticket gates, and cameras installed in each of the shops are used. It would be impossible to cover this whole area with a single camera, but by three-dimensionally combining the video captured by multiple cameras installed at multiple locations, such a bird's-eye-view screen can be constructed.
The motion analysis information extracted from the camera's video signal includes information about the orientation of each person and the direction in which the person is moving. By using this information and arranging the avatars so that they match the actual orientation of each person, the movement of a crowd becomes easier to grasp on the feature amount reproduction screen 82. By constructing such a feature amount reproduction screen 82, the video of cameras installed at a plurality of locations can be viewed in an integrated manner, and a wider range of situations can be monitored in real time. As with the actual video space screen 81, the persons are replaced by avatars, so there is the advantage that privacy is not infringed. Also, since each avatar is designed, based on the feature information extracted from the video, to represent the characteristics of the corresponding person, it is possible to grasp what kind of people are within the shooting range.
The statistical space screen 83 is a screen that displays various statistical results. For example, the transition of the number of people within the shooting range of a certain camera can be represented as a graph. Alternatively, based on the person attribute information, the people within the shooting range may be represented in graphs by sex or age group. The flow line analysis space screen 84 focuses on a particular person (avatar) and displays, as a flow line, how that person moved within the camera's shooting range; this is possible by acquiring the coordinate information of the person (avatar) in time series. Further, the personal identification space screen 85 displays the person attribute information of the persons within the shooting range. In the example of FIG. 7, the face portion of each person's avatar image, together with the sex and age, is displayed.
The actual video space screen 81, the feature amount reproduction screen 82, the statistical space screen 83, and the flow line analysis space screen 84 preferably have a GUI (graphical user interface) function. For example, when one of the avatars displayed on the feature amount reproduction screen 82 is selected using a pointing device such as a mouse (in FIG. 7, the avatar 82a), the person attribute information of the person represented by that avatar is highlighted on the personal identification space screen 85. In this example, "male, 35 years old", the person attribute information of the avatar 82a, is highlighted on the personal identification space screen 85. Conversely, when any item of person attribute information is selected on the personal identification space screen 85, the avatar image corresponding to the selected person attribute information may be highlighted on the feature amount reproduction screen 82. Further, when any avatar is selected on the feature amount reproduction screen 82, the movement path of that avatar may be displayed on the flow line analysis space screen 84.
FIG. 8 is a schematic diagram showing an example of how avatars are represented when there is no continuity between the cameras on the transmission side. In this embodiment, when the transmitting cameras are not contiguous, the case where an avatar cannot be confirmed on the receiving side must also be considered. As a countermeasure, for example, as shown in the integrated layer diagram of FIG. 8(A), the received avatar is colored when the feature amounts captured by camera A can be confirmed by camera B at the destination; on the other hand, as shown in FIG. 8(B), when the avatar cannot be confirmed on the receiving side, the received avatar is left uncolored (a default avatar), so that the two cases are represented distinguishably. In either case, a conceivable method is to calculate the movement between cameras from the moving speed and project a three-dimensional image into the feature amount reproduction space or the like.
In this way, by using in various ways the feature information, coordinate information, person attribute information, motion analysis information, and the like extracted from the camera's video signal on the information transmission system receiving unit 13 side, an integrated security camera system or the like can be realized.
(Embodiment 2)
An information transmission system 2 according to the second exemplary embodiment of the present invention will be described. Configurations having the same functions as those described in Embodiment 1 are given the same reference numerals, and duplicate descriptions are omitted.
FIG. 9 is a block diagram illustrating a schematic configuration of the information transmission system 2. As shown in FIG. 9, the information transmission system 2 differs from the information transmission system 1 (first embodiment), which includes a plurality of cameras 11a, 11b, ..., in that it has a single camera 11. For this reason, the information transmission system 2 includes only one set of the coordinate information addition unit 121 and the feature point extraction unit 122, and does not include the multi-camera cooperation unit 123. The operations of the coordinate information addition unit 121, the feature point extraction unit 122, and the other processing units are the same as in the first embodiment. A system that extracts and transmits information from the video signal of only one camera 11 in this way is also included in one embodiment of the present invention.
In the first embodiment, each of the information transmission system transmission unit 12 and the information transmission system reception unit 13 can be realized as an independent device (camera controller), a computer, or a server. Each unit shown in the block diagrams, such as the coordinate information adding unit 121, can be realized by a processor in these devices executing a program stored in memory. Further, the information transmission system transmission unit 22 according to the second embodiment can be realized as a device integrated with the camera 11.
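As an illustration of such a program, the following minimal sketch mimics the transmitting side of the block diagram: coordinates and feature points are obtained per frame and sent to the network as small messages instead of the video itself. The stubbed extraction functions, the message fields, and the line-delimited JSON-over-socket transport are assumptions made for this sketch only.

```python
# Minimal sketch of a transmission-side program corresponding to the block
# diagram (coordinate information adding unit 121, feature point extraction
# unit 122, information transmission unit 124).  Extraction is stubbed out.
import json
import socket
import time

def extract_features(frame):
    # placeholder for feature point extraction (unit 122)
    return {"hair": "black", "clothes": "red", "sex": "male", "age": 35}

def locate_subject(frame):
    # placeholder for coordinate acquisition (unit 121)
    return {"x": 3.2, "y": 7.5}

def run_transmitter(capture, host="127.0.0.1", port=5000):
    """Send one small message per frame instead of the video itself (unit 124)."""
    with socket.create_connection((host, port)) as sock:
        for frame in capture:
            message = {
                "timestamp": time.time(),
                "features": extract_features(frame),
                "coordinates": locate_subject(frame),
            }
            sock.sendall((json.dumps(message) + "\n").encode("utf-8"))
```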
As described above, in addition to being implemented as hardware, the present invention can also be implemented as a program executed by a general-purpose computer or server, or as a medium on which such a program is recorded.
DESCRIPTION OF SYMBOLS
1, 2 Information transmission system
11 Camera
12 Information transmission system transmission unit
13 Information transmission system reception unit
14 Monitor
15 Network
121 Coordinate information adding unit
122 Feature point extraction unit
123 Multiple camera cooperation unit
124 Information transmission unit
131 Information receiving unit
132 Dynamic generation unit
133 Image composition unit

Claims (26)

1. An information transmission system comprising:
a feature point extraction unit that extracts feature points from a subject in video captured by at least one camera and outputs them as feature information;
a coordinate information adding unit that acquires coordinate information of the subject within the shooting range of the camera;
an information transmission unit that sends the feature information and the coordinate information to a network;
an information receiving unit that receives the feature information and the coordinate information from the network;
a dynamic generation unit that generates an avatar image of the subject based on the feature information; and
an image composition unit that generates a composite image by compositing the avatar image, based on the coordinate information, onto a background image representing the background of the shooting range of the camera.
2. The information transmission system according to claim 1, wherein
the coordinate information adding unit takes a position identified as the grounding point of the person forming the subject on the shooting screen as a shot grounding position, and takes the height, on the shooting screen, of the person image region appearing at that shot grounding position as a shot person height, and comprises:
position/height conversion relationship information acquiring means for acquiring position/height conversion relationship information which, with the walking plane in real space as the reference in the height direction when the subject is a person, includes a conversion relationship between plane coordinate points in a camera two-dimensional coordinate system set on the shooting screen of the camera and spatial coordinate points on the walking plane in a real-space three-dimensional coordinate system, and a conversion relationship between the shot height of the person at each shot grounding position in the camera two-dimensional coordinate system and the real height of that person in the real-space coordinate system;
shot grounding position/height specifying means for specifying the shot grounding position and the shot height of the person image on the shooting screen; and
real person coordinate/height information generating means for converting the specified shot grounding position coordinates and shot height, based on the position/height conversion relationship information, into real grounding position coordinate information representing the grounding position coordinates of the person in real space and real person height information representing the height of the person in real space;
the dynamic generation unit comprises avatar height determining means for determining the height dimension of the avatar image based on the generated real person height information; and
the image composition unit comprises avatar composition position determining means for determining the composition position of the avatar image on the background image based on the real grounding position coordinate information.
3. The information transmission system according to claim 1 or 2, wherein
the feature point extraction unit further analyzes the movement or orientation of the subject and outputs the result as motion analysis information,
the information transmission unit further sends the person attribute information to the network, and
the image composition unit adjusts the movement or orientation of the avatar image based on the motion analysis information.
4. The information transmission system according to claim 3, wherein
the camera is capable of capturing moving images,
the coordinate information adding unit acquires the coordinate information of the person forming the subject for each frame of the captured moving image, and
the feature point extraction unit outputs movement trajectory information of the coordinate information of the person between the frames as the motion analysis information.
5. The information transmission system according to claim 4, wherein the image composition unit adjusts the orientation of the avatar image to be composited onto the background image based on the movement trajectory information.
6. The information transmission system according to claim 5, wherein the dynamic generation unit generates avatar images in different representation forms depending on the movement direction of the person in real space, so that the appearance of the person as seen from the viewpoint of the camera is reflected.
7. The information transmission system according to claim 6, wherein
the dynamic generation unit comprises direction-specific two-dimensional avatar image data storing means for storing a plurality of two-dimensional avatar image data whose representation forms differ for each of a plurality of predetermined movement directions of the person in real space, estimates the movement direction of the person based on the movement trajectory information acquired for the preceding frames, and selects, from the direction-specific two-dimensional avatar image data, the data matching the estimated movement direction, and
the image composition unit composites an avatar image based on the selected two-dimensional avatar image data with the background image.
8. The information transmission system according to claim 6, wherein
the dynamic generation unit comprises three-dimensional avatar image data storing means for storing the data of the avatar image as three-dimensional avatar image data, generates a three-dimensional avatar object based on the three-dimensional avatar image data, estimates the movement direction of the person based on the movement trajectory information acquired for the preceding frames, and determines the placement direction of the three-dimensional avatar object in real space so that it faces the estimated movement direction, and
the image composition unit generates two-dimensional avatar image data by projectively transforming the three-dimensional avatar object, whose placement direction in real space has been determined, onto the two-dimensional coordinate system of the background image, and composites an avatar image based on that two-dimensional avatar image data with the background image.
9. The information transmission system according to any one of claims 4 to 8, wherein the image composition unit generates an image representing the flow line of the person based on the movement trajectory information.
10. The information transmission system according to any one of claims 1 to 9, wherein
the feature point extraction unit analyzes a person attribute of the subject and outputs it as person attribute information, and
the information transmission unit sends the person attribute information to the network.
11. The information transmission system according to claim 10, wherein the dynamic generation unit generates the avatar image so as to reflect the person attribute information.
12. The information transmission system according to claim 10 or 11, wherein the person attribute information includes gender information reflecting the gender of the person and age information reflecting the age of the person.
13. The information transmission system according to any one of claims 1 to 12, wherein
the feature point extraction unit analyzes the appearance of the person forming the subject and outputs it as appearance feature information, and
the information transmission unit sends the appearance feature information to the network.
14. The information transmission system according to claim 13, wherein the dynamic generation unit generates the avatar image so as to reflect the appearance feature information.
15. The information transmission system according to claim 14, wherein the appearance feature information includes one or more of hair information reflecting one or both of the form and color of the person's hair, clothing information reflecting one or both of the form and color of the person's clothes, and belongings information reflecting one or both of the form and color of the person's belongings.
16. The information transmission system according to claim 14 or 15, wherein the appearance feature information includes body shape information reflecting the body shape of the person.
17. The information transmission system according to any one of claims 14 to 16, wherein the appearance feature information includes gait information reflecting the gait of the person.
18. The information transmission system according to claim 17, wherein
the dynamic generation unit uses avatar animation data consisting of a set of frame data obtained by subdividing a human walking motion, and performs image correction processing that corrects each frame of the frame data based on the gait information, and
the image composition unit composites the avatar image with the background image in a moving-image form that reflects the gait features based on the corrected frame data.
19. The information transmission system according to any one of claims 1 to 18, wherein the image composition unit generates the composite image as a bird's-eye view image covering the shooting ranges of a plurality of the cameras.
20. The information transmission system according to claim 19, wherein
the shooting range is covered by a plurality of the cameras sharing real-space coordinates,
the coordinate information adding unit takes a position identified as the grounding point of a person appearing on the shooting screens of the plurality of cameras as a shot grounding position, and takes the height, on the shooting screen, of the person image region appearing at that shot grounding position as a shot person height, and comprises:
position/height conversion relationship information acquiring means for acquiring position/height conversion relationship information which, with the walking plane in real space as the reference in the height direction when the subject is a person, includes a conversion relationship between plane coordinates in a camera two-dimensional coordinate system set on the shooting screen of each camera and spatial coordinate points on the walking plane in the real-space coordinate system, which is the three-dimensional coordinate system of real space, and a conversion relationship between the shot height of the person at each shot grounding position in the camera two-dimensional coordinate system and the real height of that person in the real-space coordinate system;
shot grounding position/height specifying means for specifying the shot grounding position and the shot height of the person image on the shooting screen; and
real person coordinate/height information generating means for converting the specified shot grounding position coordinates and shot height, based on the position/height conversion relationship information, into real grounding position coordinate information representing the grounding position coordinates of the person in real space and real person height information representing the height of the person in real space;
the dynamic generation unit comprises avatar height determining means for determining the height dimension of the avatar image based on the generated real person height information; and
the image composition unit comprises avatar composition position determining means for determining the composition position of the avatar image on the bird's-eye view image while converting the real grounding position coordinate information of the person captured by the plurality of cameras in the real-space coordinate system into the viewpoint of the bird's-eye view image.
21. The information transmission system according to any one of claims 1 to 20, wherein the feature point extraction unit divides the image of a person into a plurality of parts corresponding to parts of the human body and extracts feature points from each part.
22. The information transmission system according to claim 21, wherein the dynamic generation unit comprises avatar image data storing means for storing the data of the avatar image divided into avatar fragments corresponding to the plurality of parts, corrects each avatar fragment of the avatar image based on the information of the feature points extracted for the corresponding part of the person, and then integrates the corrected avatar fragments to generate the avatar image.
23. An information transmitting device comprising:
a feature point extraction unit that extracts feature points from a subject in video captured by at least one camera and outputs them as feature information;
a coordinate information adding unit that acquires coordinate information of the subject within the shooting range of the camera; and
an information transmission unit that sends the feature information and the coordinate information to a network, wherein
the feature information is associated with constituent elements of an avatar image of the subject displayed at the transmission destination, and
the coordinate information is used at the transmission destination to specify the position at which the avatar image is composited in an image representing the background of the shooting range of the camera.
24. An information receiving device comprising:
an information receiving unit that receives, via a network, feature information representing feature points extracted from a subject in video captured by at least one camera, and coordinate information of the subject within the shooting range of the camera;
a dynamic generation unit that generates an avatar image of the subject based on the feature information; and
an image composition unit that generates a composite image by compositing the avatar image, based on the coordinate information, onto an image representing the background of the shooting range of the camera.
25. A computer program that causes a computer to execute:
feature point extraction processing that extracts feature points from a subject in video captured by at least one camera and outputs them as feature information;
coordinate information addition processing that acquires coordinate information of the subject within the shooting range of the camera; and
information transmission processing that sends the feature information and the coordinate information to a network, wherein
the feature information is associated with constituent elements of an avatar image of the subject displayed at the transmission destination, and
the coordinate information is used at the transmission destination to specify the position at which the avatar image is composited in an image representing the background of the shooting range of the camera.
26. A computer program that causes a computer to execute:
information reception processing that receives, via a network, feature information representing feature points extracted from a subject in video captured by at least one camera, and coordinate information of the subject within the shooting range of the camera;
dynamic generation processing that generates an avatar image of the subject based on the feature information; and
image composition processing that generates a composite image by compositing the avatar image, based on the coordinate information, onto an image representing the background of the shooting range of the camera.
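The position/height conversion in claims 2 and 20 can be pictured as a ground-plane homography plus a per-position height scale. The sketch below is a minimal illustration under that assumption; the matrix values and the constant metres-per-pixel scale are placeholders standing in for the calibration-derived position/height conversion relationship information.

```python
# Minimal sketch of the shot-grounding-position / shot-height conversion of
# claims 2 and 20.  H (image plane -> walking plane) and the height scale are
# assumed calibration results, not values from the specification.
import numpy as np

H = np.array([[0.01, 0.0,    -1.5],
              [0.0,  0.012,  -2.0],
              [0.0,  0.0005,  1.0]])   # assumed ground-plane homography

def to_real_ground(u, v):
    """Map the shot grounding position (u, v) in pixels to (X, Y) on the walking plane."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]

def to_real_height(pixel_height, u, v, metres_per_pixel_at=lambda u, v: 0.011):
    """Convert the person's height on the screen into an approximate real height."""
    return pixel_height * metres_per_pixel_at(u, v)

print(to_real_ground(640, 420), to_real_height(165, 640, 420))
```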
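For the direction-dependent avatar representation of claims 4 to 7, the movement direction can be estimated from the coordinates of successive frames and used to pick one of several pre-rendered, direction-specific two-dimensional avatar images. The eight-direction sprite table and file names below are assumptions for illustration.

```python
# Minimal sketch of direction estimation from the movement trajectory and
# selection of a direction-specific 2D avatar image (claims 4-7).
import math

SPRITES = {0: "avatar_E.png", 45: "avatar_NE.png", 90: "avatar_N.png",
           135: "avatar_NW.png", 180: "avatar_W.png", 225: "avatar_SW.png",
           270: "avatar_S.png", 315: "avatar_SE.png"}   # assumed sprite table

def movement_direction(prev_xy, curr_xy):
    """Direction of travel, in degrees, from two successive frame coordinates."""
    dx, dy = curr_xy[0] - prev_xy[0], curr_xy[1] - prev_xy[1]
    return math.degrees(math.atan2(dy, dx)) % 360

def select_sprite(prev_xy, curr_xy):
    """Pick the pre-rendered avatar whose direction is closest to the motion."""
    angle = movement_direction(prev_xy, curr_xy)
    nearest = min(SPRITES, key=lambda a: min(abs(angle - a), 360 - abs(angle - a)))
    return SPRITES[nearest]

print(select_sprite((2.0, 5.0), (2.8, 5.6)))   # -> avatar facing roughly north-east
```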
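Claims 21 and 22 describe building the avatar from fragments that correspond to body parts. The sketch below illustrates that idea with a colour-only correction per part; the part names and the correction rule are assumptions, since the claims leave the form of the correction open.

```python
# Minimal sketch of part-wise avatar generation (claims 21 and 22): features
# extracted per body part correct the matching avatar fragment, and the
# corrected fragments are then integrated into one avatar.
BASE_FRAGMENTS = {"head": {"color": "default"},
                  "torso": {"color": "default"},
                  "legs": {"color": "default"}}

def correct_fragment(fragment, part_features):
    """Apply the extracted per-part features to one avatar fragment."""
    corrected = dict(fragment)
    if "color" in part_features:
        corrected["color"] = part_features["color"]
    return corrected

def build_avatar(per_part_features):
    """Integrate the corrected fragments into a single avatar description."""
    return {part: correct_fragment(frag, per_part_features.get(part, {}))
            for part, frag in BASE_FRAGMENTS.items()}

print(build_avatar({"head": {"color": "black"}, "torso": {"color": "red"}}))
```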
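Claims 24 and 26 describe the receiving side: feature information and coordinate information arrive over the network, an avatar is generated from the features, and it is composited onto the stored background at the transmitted coordinates. The sketch below pairs with the transmitter sketch given earlier in the description; the line-delimited JSON framing and the placeholder avatar generation are assumptions.

```python
# Minimal sketch of the receiving side (claims 24 and 26): receive messages,
# generate an avatar from the features (dynamic generation unit 132), and
# composite it at the transmitted coordinates (image composition unit 133).
import json
import socketserver

class ReceiverHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:                              # one JSON message per line
            msg = json.loads(line)
            avatar = self.generate_avatar(msg["features"])
            self.composite(avatar, msg["coordinates"])

    def generate_avatar(self, features):
        # placeholder avatar description derived from the feature information
        return {"sprite": "default.png", "tint": features.get("clothes", "gray")}

    def composite(self, avatar, coords):
        print(f"draw {avatar} on background at ({coords['x']}, {coords['y']})")

if __name__ == "__main__":
    with socketserver.TCPServer(("127.0.0.1", 5000), ReceiverHandler) as srv:
        srv.handle_request()   # handle a single transmitter connection for the demo
```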
PCT/JP2017/010290 2016-03-08 2017-03-08 Information transmitting system, information transmitting device, information receiving device, and computer program WO2017155126A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2017564647A JP6357595B2 (en) 2016-03-08 2017-03-08 Information transmission system, information receiving apparatus, and computer program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-044793 2016-03-08
JP2016044793 2016-03-08

Publications (1)

Publication Number Publication Date
WO2017155126A1 true WO2017155126A1 (en) 2017-09-14

Family

ID=59790440

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/010290 WO2017155126A1 (en) 2016-03-08 2017-03-08 Information transmitting system, information transmitting device, information receiving device, and computer program

Country Status (2)

Country Link
JP (1) JP6357595B2 (en)
WO (1) WO2017155126A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5040844B2 (en) * 2008-07-31 2012-10-03 カシオ計算機株式会社 Imaging apparatus, imaging method, and program
JP5783629B2 (en) * 2011-07-08 2015-09-24 株式会社ドワンゴ Video display system, video display method, video display control program, operation information transmission program
WO2015136796A1 (en) * 2014-03-10 2015-09-17 ソニー株式会社 Information processing apparatus, storage medium and control method
JP6312512B2 (en) * 2014-04-23 2018-04-18 博司 佐久田 Remote monitoring system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09307868A (en) * 1996-03-15 1997-11-28 Toshiba Corp Communication equipment and communication method
JP2015149557A (en) * 2014-02-05 2015-08-20 パナソニックIpマネジメント株式会社 Monitoring device, monitoring system, and monitoring method

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2017141454A1 (en) * 2016-05-13 2018-02-22 株式会社日立製作所 Congestion status visualization device, congestion status visualization system, congestion status visualization method, and congestion status visualization program
US10607356B2 (en) 2016-05-13 2020-03-31 Hitachi, Ltd. Congestion analysis device, congestion analysis method, and congestion analysis program
CN110619807A (en) * 2018-06-20 2019-12-27 北京京东尚科信息技术有限公司 Method and device for generating global thermodynamic diagram
JP7127592B2 (en) 2019-03-27 2022-08-30 オムロン株式会社 Notification system
JP2020160988A (en) * 2019-03-27 2020-10-01 オムロン株式会社 Notification system, and notification device
US11967183B2 (en) 2019-03-27 2024-04-23 Omron Corporation Notification system and notification device
JP2022551660A (en) * 2020-01-16 2022-12-12 Tencent Technology (Shenzhen) Company Limited SCENE INTERACTION METHOD AND DEVICE, ELECTRONIC DEVICE AND COMPUTER PROGRAM
JP7408792B2 (en) 2020-01-16 2024-01-05 Tencent Technology (Shenzhen) Company Limited Scene interaction methods and devices, electronic equipment and computer programs
US12033241B2 (en) 2020-01-16 2024-07-09 Tencent Technology (Shenzhen) Company Limited Scene interaction method and apparatus, electronic device, and computer storage medium
CN115004715A (en) * 2020-02-14 2022-09-02 欧姆龙株式会社 Image processing apparatus, image sensor, and method for controlling image processing apparatus
JP2021150735A (en) * 2020-03-17 2021-09-27 本田技研工業株式会社 Information processing device, information processing system, information processing method and program
JP7017596B2 (en) 2020-03-17 2022-02-08 本田技研工業株式会社 Information processing equipment, information processing systems, information processing methods and programs
JP7319637B1 (en) 2022-05-30 2023-08-02 株式会社セルシス Information processing system, information processing method and information processing program
JP2023175084A (en) * 2022-05-30 2023-12-12 株式会社セルシス Information processing system, information processing method, and information processing program

Also Published As

Publication number Publication date
JP6357595B2 (en) 2018-07-11
JPWO2017155126A1 (en) 2018-06-14

Similar Documents

Publication Publication Date Title
JP6357595B2 (en) Information transmission system, information receiving apparatus, and computer program
JP2019009752A (en) Image processing device
US11217006B2 (en) Methods and systems for performing 3D simulation based on a 2D video image
US10757373B2 (en) Method and system for providing at least one image captured by a scene camera of a vehicle
JP4473754B2 (en) Virtual fitting device
Fuchs et al. Virtual space teleconferencing using a sea of cameras
JP3512992B2 (en) Image processing apparatus and image processing method
US20160342861A1 (en) Method for Training Classifiers to Detect Objects Represented in Images of Target Environments
JPH0877356A (en) Method and device for processing three-dimensional multi-view image
CN105556508A (en) Devices, systems and methods of virtualizing a mirror
KR20190016143A (en) Slam on a mobile device
US11494963B2 (en) Methods and systems for generating a resolved threedimensional (R3D) avatar
JP5833526B2 (en) Video communication system and video communication method
JP4695275B2 (en) Video generation system
JP2023539865A (en) Real-time cross-spectral object association and depth estimation
CN107016730A (en) The device that a kind of virtual reality is merged with real scene
Kim et al. Augmenting aerial earth maps with dynamic information from videos
CN106981100A (en) The device that a kind of virtual reality is merged with real scene
Cui et al. Fusing surveillance videos and three‐dimensional scene: A mixed reality system
CN113111743A (en) Personnel distance detection method and device
JP2005149145A (en) Object detecting device and method, and computer program
Fiore et al. Towards achieving robust video selfavatars under flexible environment conditions
Dijk et al. Image processing in aerial surveillance and reconnaissance: from pixels to understanding
Malerczyk et al. 3D reconstruction of sports events for digital TV
WO2022022809A1 (en) Masking device

Legal Events

Date Code Title Description
ENP Entry into the national phase: Ref document number: 2017564647; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase: Ref country code: DE
121 Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number: 17763458; Country of ref document: EP; Kind code of ref document: A1
122 Ep: pct application non-entry in european phase: Ref document number: 17763458; Country of ref document: EP; Kind code of ref document: A1