WO2018101247A1 - Image recognition imaging apparatus - Google Patents

Image recognition imaging apparatus

Info

Publication number
WO2018101247A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
recognition
distance
detection
camera
Prior art date
Application number
PCT/JP2017/042578
Other languages
French (fr)
Japanese (ja)
Inventor
大坪 宏安 (Hiroyasu Otsubo)
石崎 修 (Osamu Ishizaki)
Original Assignee
Maxell, Ltd. (マクセル株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from JP2016231534A
Priority claimed from JP2017052831A
Priority claimed from JP2017146497A
Application filed by Maxell, Ltd. (マクセル株式会社)
Publication of WO2018101247A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G: PHYSICS
    • G08: SIGNALLING
    • G08B: SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B13/00: Burglar, theft or intruder alarms
    • G08B13/18: Actuation by interference with heat, light, or radiation of shorter wavelength; actuation by intruding sources of heat, light, or radiation of shorter wavelength
    • G08B13/189: Actuation by interference with heat, light, or radiation of shorter wavelength using passive radiation detection systems
    • G08B13/194: Actuation by interference with heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
    • G08B13/196: Actuation by interference with heat, light, or radiation of shorter wavelength using image scanning and comparing systems using television cameras
    • G: PHYSICS
    • G08: SIGNALLING
    • G08B: SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B25/00: Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
    • G: PHYSICS
    • G08: SIGNALLING
    • G08B: SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B25/00: Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems
    • G08B25/01: Alarm systems characterised by the transmission medium
    • G08B25/04: Alarm systems using a single signalling line, e.g. in a closed loop
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Definitions

  • the present invention relates to an image recognition imaging apparatus that performs image recognition by acquiring a two-dimensional image and a distance image.
  • a surveillance camera captures an image and displays it on a monitor; a person either watches the image in real time or stores it and reviews it after an incident occurs.
  • AI (artificial intelligence)
  • a camera that detects/recognizes objects such as people and things is known (for example, see Patent Documents 2 and 3).
  • Such a camera is used as a surveillance camera for the purpose of crime prevention, for example, and issues an alarm when an abnormality is detected by detection / recognition.
  • Deep learning is a technique for learning data characteristics using a neural network having a multilayer structure, and it is known that high-accuracy image recognition is possible by using this.
  • for each point on the photographed subject, a corresponding point is searched for in each of the two captured images, and the difference in position (parallax) between the corresponding points on the two images that correspond to the same point on the subject is determined.
  • the distance is then calculated from this parallax, as sketched below.
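A minimal sketch of this parallax-to-distance relationship under the pinhole model, Z = f * B / d (the function name and numeric values below are illustrative, not taken from the patent):

```python
# Sketch: distance from stereo parallax, Z = f * B / d.
# f: focal length in pixels, B: baseline between the two cameras in meters,
# d: disparity (pixel offset between corresponding points).

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Return the distance (meters) to a point given its disparity (pixels)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example: f = 700 px, B = 0.12 m, d = 14 px gives Z = 6.0 m.
print(depth_from_disparity(700.0, 0.12, 14.0))
```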
  • applications include image recognition such as human detection and moving-body detection, and further, danger avoidance when driving a car, automatic driving, detection of a specific person such as a wanted criminal by face recognition, detection of a suspicious person such as an illegal intruder, and the like.
  • subjects on the image, such as people, dogs, cats, horses, and vehicles, can be identified easily and accurately.
  • when arithmetic processing such as the distance detection described above is performed on an image having a large number of pixels, the amount of calculation is enormous. Therefore, when processing a moving image such as video from an in-vehicle camera or a surveillance camera, it takes a long time to process one frame.
  • the calculation time for one frame becomes too long, and subsequent processing may not keep up when a real-time response is required. In this case, it is conceivable to reduce the processing time by first thinning out the pixels of the image to lower its resolution (number of pixels).
  • fish-eye lenses are known as camera lens units (lenses).
  • the fisheye lens is a lens unit that adopts a projection method that is not the central projection method used in a normal wide-angle lens or a telephoto lens.
  • most fisheye lenses adopt an equidistant projection method.
  • some fisheye lenses employ the equisolid-angle projection method, the orthographic projection method, the stereographic projection method, and the like.
  • the fisheye lens is an ultra-wide-angle lens.
  • the angle of view of a fisheye lens is often 180 degrees, but there are lenses with an angle of view of less than 180 degrees and lenses with an angle of view of greater than 180 degrees.
  • a distance calculation device for obtaining the distance to a subject using a stereo camera using such a fisheye lens has been proposed (see Patent Document 4).
  • in a stereo camera using such fisheye lenses, a captured image is converted into a spherical image obtained by projecting it onto a spherical surface, and the distance to each subject is obtained.
  • the 3D sensor can recognize the shape of the object, but the object may not be identified only by the shape read by the 3D sensor.
  • the present invention has been made in view of the above circumstances, and an object thereof is to provide an image recognition imaging apparatus capable of reducing the amount of calculation in image recognition and improving the recognition rate.
  • an image recognition imaging apparatus of the present invention includes: an imaging unit that captures an image; distance image acquisition means for acquiring a distance image in which each pixel in a range corresponding to the imaging range of the image is represented by the distance to the shooting target; distance image recognition object extraction means for extracting a recognition object on the distance image based on the distances in the distance image; recognition object image extraction means for extracting a partial image serving as the recognition object from the image, based on the range of the recognition object on the distance image; and image recognition means for performing image recognition of the partial image and identifying the recognition object.
  • a group of pixels having distance values close to each other and forming a lump can be recognized as an object, that is, a recognition target, on the distance image.
  • an object can be extracted by simple calculation as compared with the case of extracting an object from a two-dimensional color visible image.
  • the distance image basically corresponds to the shooting range of the image, and the position of the distance image corresponding to each position of the image includes distance information.
  • the image capturing range corresponds to the range of the distance image.
  • the image and the distance image are basically taken of the same range, but as long as the correspondence between positions on the image and on the distance image is known, one image may cover a larger range than the other.
  • image recognition is then performed on the partial image that is the range of the object on this image, to identify the object.
  • the amount of calculation can be greatly reduced because the image recognition is performed only for the extracted partial images, compared to the conventional image recognition performed on the entire image.
  • since the object has already been separated from the background on the image, many of the operations for separating the object from the image are unnecessary, and the amount of calculation can be reduced.
  • the separation accuracy can be increased.
  • in image recognition, it is possible to identify a person, a car, or the like as an object attribute based on the shape, color, brightness, etc. of the extracted object.
  • with image recognition using deep machine learning, it is possible to identify not only humans but also men, women, adults, children, etc., and to classify facial features such as the height of the nose, the size of the eyes and mouth, and the color of the eyes.
  • the three-dimensional shape and size of the object can be recognized from the relationship of the distance, and the image can be recognized in consideration of the three-dimensional shape and size of the object. Separation of an object according to a distance from an image and image recognition using data of a three-dimensional shape and size in addition to the image can improve object identification capability and identification accuracy including object detection and identification.
  • the distance image acquisition unit includes a distance measurement unit that measures a distance of each pixel of the distance image.
  • the above-described image and distance image can be obtained from an imaging unit that is a monocular camera capturing the image, and a distance measurement unit, such as a depth sensor or a 3D sensor, that generates the distance image.
  • as the depth sensor or 3D sensor, for example, a TOF (Time Of Flight) type can be used.
  • the distance image acquisition unit obtains a distance image based on parallax between the two imaging units.
  • a distance image can be obtained from a parallax of a so-called stereo camera.
  • the imaging means, the distance image acquisition means, the distance image recognition object extraction means, the recognition object image extraction means, and the image recognition means are preferably provided in one housing.
  • the image recognition imaging apparatus may be connected to an external server so as to be able to perform data communication, and the server may be configured to store data or perform more advanced image recognition processing.
  • surveillance cameras and the like are often used for a long time after installation.
  • the detection / recognition technology used in the system may become obsolete.
  • a suitable detection/recognition algorithm varies depending on the environment, such as the place where the camera is used and the subject of photography, so sufficient detection/recognition may not be possible with the detection/recognition algorithm of the firmware originally provided in the camera before installation.
  • the present invention has been made in view of the above circumstances, and an object thereof is to provide a detection recognition system that can improve, by updating, the detection/recognition performance for detecting features included in an image and recognizing a recognition target set from those features.
  • the detection recognition system of the present invention includes an imaging unit for capturing images, a detection/recognition unit, and a server.
  • the detection / recognition unit includes detection / recognition firmware, detects a feature included in the image from the image acquired by the imaging unit, and recognizes a set recognition target.
  • the detection / recognition firmware can be updated to a new detection / recognition firmware generated by the detection / recognition firmware generation unit.
  • the server includes machine learning means for generating a detection/recognition algorithm by machine learning using images acquired by the imaging means as teacher data, and detection/recognition firmware generation means for generating new detection/recognition firmware for the detection/recognition means from the detection/recognition algorithm.
  • the image pickup means picks up an image. Then, the detection / recognition unit detects a feature included in the image from the image obtained by the image pickup by the image pickup unit, and recognizes the set recognition target.
  • the machine learning means of the server generates a detection/recognition algorithm by machine learning using images acquired by the imaging means as teacher data. The generated detection/recognition algorithm is then converted by the detection/recognition firmware generation means of the server into firmware (detection/recognition firmware) suited to the detection/recognition means, and the detection/recognition firmware of the detection/recognition unit is updated to this newly generated firmware.
  • in this way, a detection/recognition algorithm that detects and recognizes with higher accuracy is generated by machine learning from images obtained by the imaging means, converted into firmware suited to the detection/recognition means, and used to update the detection/recognition firmware of the detection/recognition means, so the detection/recognition performance can be improved (see the sketch below).
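A minimal sketch of this update loop; train_detection_model, convert_to_firmware, and the Camera record are hypothetical placeholders for illustration, as the patent does not specify these APIs:

```python
# Sketch: train on camera images, convert the learned algorithm per camera,
# then update each camera's detection/recognition firmware.
from dataclasses import dataclass

@dataclass
class Camera:
    name: str
    hardware_profile: str  # e.g. "stereo", "infrared", "monocular"
    firmware: bytes = b""

def train_detection_model(images, labels):
    # Placeholder for the machine learning means (e.g. deep learning on
    # camera images tagged "robber"); returns an opaque model object.
    return {"trained_on": len(images), "classes": sorted(set(labels))}

def convert_to_firmware(model, target: str) -> bytes:
    # Placeholder for the firmware generation means: the same learned
    # algorithm is converted per camera (resolution, CPU/GPU, sensor type).
    return repr((model, target)).encode()

def update_cycle(images, labels, cameras):
    model = train_detection_model(images, labels)  # machine learning means
    for cam in cameras:
        cam.firmware = convert_to_firmware(model, cam.hardware_profile)

cams = [Camera("102a", "stereo"), Camera("102b", "infrared")]
update_cycle(["img1.png", "img2.png"], ["robber", "robber"], cams)
print([len(c.firmware) for c in cams])
```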
  • it is preferable that the machine learning means performs machine learning using, as teacher data, images from occasions when the detection/recognition means erroneously recognized the set recognition target.
  • for images from occasions when the recognition target set in the detection/recognition means was erroneously recognized, the machine learning means can learn anew so that the recognition target is not erroneously recognized, generate a detection/recognition algorithm, and generate new detection/recognition firmware, so the detection/recognition performance can be reliably improved.
  • the camera preferably includes the imaging unit and the detection / recognition unit.
  • since the camera can update the detection/recognition firmware of its detection/recognition means to new detection/recognition firmware created as a result of machine learning in the server, the detection/recognition performance of the camera can be improved. Therefore, the detection/recognition performance can be improved easily even for a camera that is already installed.
  • the detection recognition system of the present invention preferably includes a plurality of the cameras in which at least part of the imaging range of the imaging unit and the set recognition target overlap.
  • a predetermined range can be imaged with a plurality of cameras and detected / recognized. Therefore, since the same object and the same phenomenon can be detected / recognized by a plurality of cameras, the accuracy of detection / recognition can be improved.
  • when some of the plurality of cameras recognize the overlapping recognition target but another of the plurality of cameras does not, it is preferable that the machine learning means performs machine learning using the images acquired by that other camera as teacher data.
  • At least one of the plurality of cameras is a camera having a different imaging unit.
  • the detection/recognition performance for detecting features included in an image and recognizing a recognition target set from those features can be improved by updating the detection/recognition firmware.
  • calculating a distance image from each frame of a moving image shot with a stereo camera using fisheye lenses requires an enormous amount of computation, as with a normal lens or even more, and processing time becomes long. Further, if the image formed by the fisheye lens unit on a planar image sensor is detected and output as it is, distortion at the center of the image is small while distortion at the periphery is large. In addition, when monitoring indoors with a surveillance camera, depending on the size of the room, it is efficient to install the stereo camera facing vertically downward so that the floor directly below the center of the ceiling is at the center of the image.
  • an object of the present invention is to provide an object distance detection device capable of calculating a distance image with a fisheye-lens stereo camera at high speed and with high accuracy.
  • an object distance detection device of the present invention includes: a stereo camera including a pair of fisheye cameras each having a fisheye lens unit and an imaging sensor; a distance image calculation unit that calculates a distance image from the images output from the pair of imaging sensors; and a distance image recognition unit that performs image recognition, including identification of a subject, on the distance image. The distance image calculation unit includes a partitioning unit that partitions each of the images captured by the pair of fisheye cameras into a plurality of preset sections;
  • a resolution conversion unit that converts each section of the image to the resolution set for that section;
  • a corresponding point search unit that finds, in each of two images captured substantially simultaneously by the pair of fisheye cameras, corresponding points that correspond to the same point on the photographed subject;
  • and a distance calculation unit that obtains the distance from the stereo camera to a point based on the difference in position between the two corresponding points, found by the corresponding point search unit, that correspond to that same point on the subject.
  • the image captured by the fisheye camera can be divided into a plurality of sections, and the resolution can be changed per section. In an image captured by a fisheye camera, the distortion caused by the fisheye lens is small at the central portion, and it is possible to detect, for example, a face or a person there even without correction.
  • at the peripheral edge of the image, however, distortion is large and the image is distorted and compressed, so it is difficult to recognize unless the distortion is removed.
  • if the resolution of the image is first lowered uniformly in order to facilitate processing, it becomes difficult to perform highly accurate image recognition at the image periphery, where distortion is large.
  • therefore, in the central portion the resolution is lowered to reduce the amount of processing, while the peripheral portion is processed at high resolution without lowering the resolution; in this way, the accuracy of image recognition can be maintained even while the processing speed is improved by lowering resolution. Since the resolution is lowered over the large central area of the image, the overall processing amount can be reduced and the processing speed increased, while the drop in image recognition accuracy caused by lowering resolution is suppressed.
  • the resolution here is, for example, the number of pixels per unit area of the image. When the resolution is lowered, the pixels of the image are thinned out by a well-known method; a process similar to known image reduction processing may be used (see the sketch below).
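A minimal sketch of such per-section resolution conversion, assuming an illustrative layout (one low-resolution central block, full-resolution periphery) and a thinning factor of 2; the patent does not fix the section geometry or scale factors:

```python
# Sketch: partition a fisheye frame into a central block and a peripheral
# strip, thinning out pixels only in the low-distortion center.
import numpy as np

def convert_sections(img: np.ndarray, center_scale: int = 2):
    """Return (center, periphery): subsampled center, full-resolution rest."""
    h, w = img.shape[:2]
    y0, y1 = h // 4, 3 * h // 4
    x0, x1 = w // 4, 3 * w // 4
    center = img[y0:y1:center_scale, x0:x1:center_scale]  # thinned-out pixels
    periphery = img.copy()
    periphery[y0:y1, x0:x1] = 0  # center handled separately at low resolution
    return center, periphery

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
center, periphery = convert_sections(frame)
print(center.shape, periphery.shape)  # (120, 160, 3) (480, 640, 3)
```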
  • the sections of the images of the two fisheye cameras basically cover the same ranges of directions. That is, when sections A, B, and C correspond between the two images, the corresponding point for a point in section A of one image exists, at least away from the section boundaries, in section A of the other image. Therefore, the search for corresponding points is basically performed between the two corresponding sections of the two images.
  • the search for corresponding points is performed, for example, by extracting feature points (singular points) by image recognition in each section of the two images from the pair of fisheye cameras, and finding the feature point in the corresponding section of the other image that corresponds to a feature point in a section of one image;
  • corresponding points are thus determined by basic image recognition.
  • the epipolar geometry described in Patent Document 1 may also be used for this determination; a generic feature-matching sketch follows.
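A sketch of a per-section corresponding-point search using ORB feature matching from OpenCV; the patent only speaks of feature points and epipolar geometry, so the specific detector, matcher, and nfeatures value here are assumptions:

```python
# Sketch: corresponding-point search within one matching section of the two
# fisheye images, using ORB features and brute-force Hamming matching.
import cv2
import numpy as np

def match_section(left_sec: np.ndarray, right_sec: np.ndarray):
    """Return pairs ((x1, y1), (x2, y2)) of corresponding points."""
    to_gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) if im.ndim == 3 else im
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(to_gray(left_sec), None)
    kp2, des2 = orb.detectAndCompute(to_gray(right_sec), None)
    if des1 is None or des2 is None:
        return []  # no feature points found in this section
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches]
```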
  • the distance image calculation unit outputs a distance image that is composed of pixels arranged vertically and horizontally, divided into pixel regions each including one or a plurality of pixels, with a color that changes according to the distance from the stereo camera to the corresponding point; in sections where the resolution of the distance image differs, it is preferable that the number of pixels constituting each pixel region differs according to the resolution.
  • a distance image is an image representing distance values, in the same way that a thermograph is an image representing temperature values.
  • the color change that produces such a thermograph-like image is, for example, a change in brightness (luminance), a change in hue, or a combination of both, as sketched below.
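A minimal sketch of such a thermograph-like rendering, assuming an illustrative 10 m normalization range and OpenCV's JET colormap (the patent does not prescribe a particular mapping):

```python
# Sketch: render a distance image like a thermograph by mapping distance to color.
import cv2
import numpy as np

def colorize_distance(dist_m: np.ndarray, max_range_m: float = 10.0) -> np.ndarray:
    """Map per-pixel distances (meters) to a BGR false-color image."""
    norm = np.clip(dist_m / max_range_m, 0.0, 1.0)
    gray = (255 * (1.0 - norm)).astype(np.uint8)  # nearer = brighter
    return cv2.applyColorMap(gray, cv2.COLORMAP_JET)

dist = np.random.uniform(0.5, 10.0, (240, 320)).astype(np.float32)
print(colorize_distance(dist).shape)  # (240, 320, 3)
```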
  • the distance image calculation unit preferably includes a distortion removal unit that removes the distortion caused by the fisheye lens for each of the sections.
  • when the pair of fisheye cameras are arranged in a substantially vertical direction, it is preferable that the resolution conversion unit changes the resolution so that the resolution is higher in the sections at the periphery of the image than in the sections at the center of the image.
  • the area of one section of the central portion with low resolution is larger than the area of one section of the peripheral portion with high resolution.
  • when the pair of fisheye cameras are arranged in a substantially horizontal direction, it is preferable that the resolution conversion unit changes the resolution so that the resolution in the lower sections of the image is higher than in the upper sections.
  • the resolution at the center of the image is preferably lower than that at the bottom of the image.
  • when the installation height of the fisheye camera is lower than the height of a person (the position of the face), a face is more likely to be captured in the upper sections of the image than in the lower sections, so the resolution may be changed so that the resolution of the lower sections is lowered.
  • the object distance detection device of the present invention it is possible to calculate a distance image by a stereo camera using a fisheye lens at high speed and with high accuracy without imposing a heavy load on the arithmetic processing device.
  • image recognition can be performed easily and with high accuracy.
  • FIG. 1 is a block diagram illustrating an image recognition imaging apparatus according to a first embodiment of the present invention; a flowchart illustrates an image recognition method performed by the image recognition imaging apparatus; further figures (including FIG. 4, referenced below) illustrate the image recognition method; and a block diagram shows an image recognition imaging apparatus according to a second embodiment of the invention.
  • FIG. 1 is a block diagram illustrating a detection recognition system according to an embodiment of the present invention; further block diagrams show the camera and the server of the detection recognition system; and a flowchart explains the detection/recognition firmware update method performed by the detection recognition system.
  • FIG. 1 is a block diagram illustrating an object distance detection device according to an embodiment of the present invention; a further block diagram shows the image analysis unit of the object distance detection device; a flowchart shows the processing of the image analysis unit; two figures show the partitioning of the image by the object distance detection device; and panels (a) and (b) of a further figure explain the difference in resolution between sections of the distance image.
  • the image recognition imaging apparatus of the present embodiment is a combination of an image recognition apparatus and a camera mainly used for monitoring, such as a surveillance camera or an in-vehicle camera, and identifies people, cars, and the like within the shooting range.
  • the image recognition imaging apparatus includes: an image sensor 1 as the imaging unit; a 3D sensor 2 as the distance image acquisition unit; an object extraction means 3, serving as the distance image recognition target extraction means, for extracting a recognition target (an object, including a person) from the distance image obtained by the 3D sensor 2; an object image extraction means 4, serving as the recognition target image extraction means, for extracting a partial image serving as the recognition target from the image of the image sensor 1; an image recognition means 5 for performing image recognition of the extracted partial image (object image); a control means 6 for controlling these; and a storage means 7 for storing images, distance images, recognition results, and other data.
  • the control means 6 is connected to an external server 9 (host PC) via a communication network 8 such as the Internet so that data communication is possible.
  • the image sensor 1 is a so-called imaging element (image sensor), and is used as a camera together with a lens that forms the image of the shooting target on the image sensor 1.
  • the 3D sensor 2 is, for example, of the above-described TOF type: it scans ultrashort pulses of an infrared laser across the imaging range, measures the time until the reflected light from an object returns, and multiplies this time by the speed of light to obtain the distance for each pixel in the shooting range (see the sketch below).
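A minimal sketch of this TOF principle; halving the round trip is implied because the measured time covers the path to the object and back (the 40 ns example value is illustrative):

```python
# Sketch of the TOF principle: measured round-trip time times the speed of
# light gives twice the distance, so halve it.
C = 299_792_458.0  # speed of light, m/s

def tof_distance(round_trip_s: float) -> float:
    """Distance (meters) from a measured laser round-trip time (seconds)."""
    return C * round_trip_s / 2.0

print(tof_distance(40e-9))  # a 40 ns round trip is roughly 6 m
```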
  • the resolution of the image sensor 1 and the resolution of the 3D sensor 2 may or may not match; it is only necessary that the positions of each part of the imaging range of the image sensor 1 correspond to the positions of each part of the imaging range of the 3D sensor 2, that is, that it is known where an arbitrary position in the shooting range of the image sensor 1 is located in the shooting range of the 3D sensor 2.
  • the image sensor 1 and the 3D sensor 2 are configured to simultaneously capture an overlapping range as an image and a distance image.
  • the object extraction unit 3 extracts an object from the distance image, that is, the distance information acquired by the 3D sensor 2. A group of neighboring pixels whose distance values are close to each other and that form a substantially contiguous lump is judged to be an object to be extracted, and such a group of pixels with approximate distance values is extracted as one object. In this case, the object can be extracted basically from only the distance of each pixel in the distance image, so an object can be extracted with high accuracy even from a distance image captured only once, for example by an in-vehicle 3D sensor 2, and not only from the output of a surveillance camera or the like that captures the same range constantly or repeatedly.
  • in the case of a 3D sensor 2 that is fixed like a surveillance camera, or whose movement range such as rotation is fixed, the same range is captured constantly or repeatedly; therefore, for each pixel, the distance that does not change for a certain period (or the longest distance observed, if it changes) may be stored as the background distance of that pixel, and a group of pixels whose distance has changed from the background distance may be recognized as an object. In this case, a group of pixels whose distances change over time may be detected as an object. Note that the background and an object can also be separated from temporal change in a two-dimensional image, but in a distance image, a portion whose distance has become shorter than the background can basically be identified as an object (see the sketch below).
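A minimal sketch of this background-distance approach, grouping pixels that have come closer than the stored background into lumps via connected components; the 0.3 m threshold and 50-pixel minimum blob size are illustrative assumptions:

```python
# Sketch: keep pixels that have come closer than the stored background
# distance, then group neighbors into lumps with connected components.
import cv2
import numpy as np

def extract_objects(dist_m: np.ndarray, background_m: np.ndarray,
                    closer_by_m: float = 0.3, min_pixels: int = 50):
    """Return bounding boxes (x, y, w, h) of pixel groups nearer than background."""
    foreground = (background_m - dist_m) > closer_by_m
    n, labels, stats, _ = cv2.connectedComponentsWithStats(
        foreground.astype(np.uint8), connectivity=8)
    boxes = []
    for i in range(1, n):  # label 0 is the background component
        if stats[i, cv2.CC_STAT_AREA] >= min_pixels:
            x, y, w, h = stats[i, :4]
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```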
  • the position of the range of the object is determined on the distance image by the 3D sensor 2.
  • the object image extraction unit 4 converts the range of the above-described object determined on the distance image into a range on a visible two-dimensional image by the image sensor 1 and extracts a partial image within this range. That is, the range of the object extracted on the distance image is assigned to the image, and the partial image that becomes the range of the object is extracted from the image.
  • a coordinate system may be provided for each of the distance image and the image and coordinates on one converted to coordinates on the other, or both may share the same coordinate system; a proportional-mapping sketch follows.
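A minimal sketch of such a position correspondence, assuming the two sensors cover the same range so that simple proportional scaling suffices (a calibrated transform may be needed in practice):

```python
# Sketch: map an object's bounding box from distance-image coordinates to
# 2D-image coordinates by proportional scaling.

def map_box(box, dist_size, img_size):
    """box = (x, y, w, h) on the distance image; sizes are (width, height)."""
    x, y, w, h = box
    sx = img_size[0] / dist_size[0]
    sy = img_size[1] / dist_size[1]
    return (int(x * sx), int(y * sy), int(w * sx), int(h * sy))

# A box on a 320x240 distance image mapped onto a 1920x1080 image.
print(map_box((100, 60, 40, 80), (320, 240), (1920, 1080)))
```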
  • the image recognition means 5 performs image recognition of the partial image, that is, the object extracted from the image as described above. Since the portion that constitutes an object has already been extracted using the distance image, it only remains to recognize whether the extracted partial image is, for example, a person or a car. For this, stored feature points of people, cars, etc. may be compared with feature points detected from the partial image to determine whether the object is a person, a car, or the like. Further, based on an algorithm acquired by deep machine learning, a person may be further classified as a child, an adult, a woman, or a man.
  • for example, OpenCV (the Open Source Computer Vision Library) can be used.
  • the latest OpenCV library includes machine learning functionality, including, for example, a deep learning module, and can identify people, vehicles, and the like; a minimal classification sketch follows.
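A minimal sketch of classifying an already-extracted partial image with OpenCV's dnn module; the model file "classifier.onnx" and the CLASSES list are placeholders, since the patent names no specific network:

```python
# Sketch: classify an already-extracted partial image (object image).
import cv2
import numpy as np

CLASSES = ["person", "car", "other"]  # hypothetical label set

def classify_partial(net, partial_bgr):
    """Return the label with the highest score for the partial image."""
    blob = cv2.dnn.blobFromImage(partial_bgr, scalefactor=1.0 / 255,
                                 size=(224, 224), swapRB=True)
    net.setInput(blob)
    scores = net.forward().flatten()
    return CLASSES[int(np.argmax(scores))]

# net = cv2.dnn.readNetFromONNX("classifier.onnx")  # placeholder model file
# label = classify_partial(net, partial_image)
```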
  • since the area of the image occupied by the object has already been determined using the distance image, it is not necessary to locate the person or car within the image; it is only necessary to identify whether the already-extracted object area is a person, a car, or the like. An operation that processes the entire image to find the portion constituting an object is therefore not required, and the amount of calculation is small.
  • attributes of the object are recognized: the type of object, such as a person or a car, and, in the case of a person, attributes such as adult or child, male or female, race, facial features, and so on are detected.
  • in the case of a car, the model, year, color, etc. are identified as attributes.
  • the three-dimensional shape and size can be recognized from the object's data on the distance image when the object is extracted, and using the three-dimensional shape and size when identifying the attributes of the object makes it easy to distinguish adults from children and small cars from large cars.
  • the control means 6 controls photographing by the image sensor 1 and the 3D sensor 2, object extraction from the distance image by the object extraction means 3, extraction of a partial image from the image by the object image extraction means 4, and image recognition by the image recognition means 5.
  • the control means 6 comprises an arithmetic processing device, and the arithmetic processing device may function as the object extraction means 3, the object image extraction means 4, and the image recognition means 5.
  • the image recognition means 5 may be realized by executing a machine learning model (image recognition algorithm) that has undergone deep machine learning on the arithmetic processing unit.
  • the storage unit 7 is a storage device including a hard disk, a flash memory, and the like, and stores data such as images, distance images, and image recognition results.
  • a distance image b is photographed by the 3D sensor 2 (step S1).
  • a two-dimensional visible (color) image a is simultaneously captured by the image sensor 1.
  • the photographed image a includes an adult man, a car, and a girl.
  • in the distance image b shown in FIG. 4, the difference in distance between pixels is expressed as an image; an image such as that of FIG. 4 is obtained by assigning lightness/darkness or color according to distance.
  • pixels having a predetermined distance or more are represented by white, for example.
  • an object is extracted from the distance image as a group of pixels having a close distance from each other (step S2).
  • pixels at a predetermined distance or more may be treated as the background, and groups of pixels with smaller, mutually approximate distance values may be extracted as objects.
  • the object extracted on the distance image can be represented, as a group of pixels, by the arrangement of the pixel positions within the object's range.
  • here, the distance image portions of an adult man, a car, and a girl are extracted.
  • from the image, a range at the same position as the range of each object extracted in the distance image b of FIG. 4 is extracted as a partial image serving as the object image (step S3).
  • an object on the two-dimensional color image is extracted.
  • here, the object is not separated by identifying its range on the image itself; rather, objects are separated by differences in distance on the distance image, and the range of the object separated on the distance image is merely fitted onto the image. At this point only the position of the object is known; the object has not yet been identified. Since the object is extracted using only the distances on the distance image and the object on the image is extracted based on that position, the amount of calculation is extremely small compared with identifying and extracting the object directly from the image.
  • next, image recognition of the partial image extracted from the image as the object image is performed (step S4).
  • since the object has already been extracted as described above, there is no need in image recognition to process the entire image and extract the object. That is, since it is not necessary to detect and extract an object from the image, image recognition is performed only on the already extracted partial image, and the amount of calculation can be reduced.
  • image recognition can proceed on the assumption that the extracted partial image is one object or a plurality of adjacent objects, for example by treating the outer edge of the extracted portion as the outer edge of the object; this also reduces the amount of calculation in image recognition of the partial image.
  • a person can be identified by well-known person detection, face recognition, or the like.
  • objects other than people, for example cars or bicycles, can also be identified by registering feature points of various objects as part of the algorithm.
  • the three-dimensional shape and size of an object that can be read from a distance image can be used as the feature point of the object, and the recognition accuracy of the object can be improved.
  • the data such as the object attribute as the image recognition result is transmitted to the server 9 (step S5).
  • various data related to the above-described image recognition for example, data such as a distance image used for image recognition, an image, a range of an extracted object, and the like are also sent to the server 9.
  • the server 9 may include the object extraction unit 3, the object image extraction unit 4, and the image recognition unit 5. It is possible to perform more advanced processing by performing image recognition processing on the server side having high computing ability.
  • a system in which a plurality of image sensors 1 and 3D sensors 2 are connected may be used.
  • since the amount of calculation for each image recognition imaging apparatus is reduced, the load on the server is also reduced, and a single high-performance server can serve the system.
  • image recognition in normal times may be performed on the image recognition imaging apparatus side, and image recognition on past data when an incident occurs may be performed on the server. In this case, it is not necessary to save all images and distance images; for example, only the extracted object images (partial images) may be saved, reducing the storage capacity required of the server 9.
  • as described above, the object is extracted on the distance image, and the object image is then extracted based on the position on the image corresponding to the position of the extracted object.
  • the amount of calculation can be reduced by performing image recognition on the object image, and the accuracy of separating the object from the background can be improved by performing object extraction on the distance image, for example, The object detection accuracy can be increased to about 99%.
  • the size can be calculated easily and accurately as an attribute of the object. Based on this size, it becomes easy to determine other attributes of the object, for example whether a person is an adult or a child, or the type of car.
  • in the second embodiment, the distance image acquisition unit has two cameras (camera 1 and camera 2) 10 as imaging units, and a distance image detection means 11 that calculates a distance for each pixel from the parallax between the left and right images captured by these cameras 10 and generates a distance image.
  • the two cameras 10 and the distance image detecting means 11 constitute a stereo camera 12 with a 3D sensor function.
  • the stereo camera 12 with the 3D sensor function can obtain the distance image generated by the distance image detection means 11 and the image photographed by the camera 10. Note that two images with parallax can be obtained by the stereo camera 12, but either one or both may be used.
  • the configuration other than the distance image acquisition unit and the imaging unit is the same as in the image recognition imaging apparatus of the first embodiment described above: the object extraction means 3, the object image extraction means 4, the image recognition means 5, the control means 6, and the storage means 7 are provided, and the apparatus is connected to an external server 9 via a communication network 8.
  • image recognition can be performed by the same method as in the first embodiment, except that a distance image is obtained by a known method from the parallax of a pair of images taken by the stereo camera 12. It is possible to achieve the same operational effects as those of the image recognition and imaging apparatus of the first embodiment.
  • the distance image acquisition means is not limited to a TOF-type 3D sensor or a stereo camera; it may be another type of 3D sensor, as long as it can generate a distance image corresponding to the shooting range of the imaging means.
  • the detection and recognition system is used to notify when a recognition target set from an image captured by a camera is recognized, for example.
  • here, the term 'image' includes both moving images and still images.
  • the detection recognition system 101 includes a plurality of cameras 102, a server 103, and a terminal 104, as shown in FIG.
  • the plurality of cameras 102, the server 103, and the terminal 104 are connected by a wired or wireless network 105.
  • the detection recognition system 101 can be used, for example, to recognize an object photographed by the camera 102 as a suspicious person based on its shape and movement, and to notify the terminal 104 at another location. Further, when a suspicious person is recognized in this way, a terminal, a system management device, or the like held by the administrator 106 of the detection recognition system 101 may be notified.
  • the camera 102 includes an imaging unit 120, a detection / recognition unit 121, a recording unit 122, a communication unit 123, and a control unit 124.
  • the imaging means 120 has a lens and a solid-state image sensor, for example, and acquires an image by imaging.
  • the detection/recognition means 121 includes an arithmetic processing unit and a memory, and performs image recognition. Specifically, under control of the detection/recognition firmware held in the memory of the detection/recognition unit 121, features included in the image captured by the imaging unit 120 are detected, and a recognition target set from these features is recognized. Hereinafter, 'detection/recognition' basically means detecting the features included in the image captured by the imaging means 120 and recognizing the recognition target set from those features, as described here.
  • the recording unit 122 records reference images and other information used for detection/recognition by the detection/recognition unit 121, as well as images and other information (for example, audio) from times of abnormality (for example, when the recognition target set in the detection/recognition unit 121 is recognized).
  • the communication unit 123 communicates with the server 103 via the network 105 to transmit an image and other information at the time of abnormality to the server 103 and to receive a command and detection recognition firmware from the server 103.
  • the communication unit 123 is also connected to the terminal 104 or the terminal of the administrator 106 via the network 105, and transmits an alarm signal or the like to these terminals or the server 103 when an abnormality occurs.
  • the terminal 104 or the terminal of the administrator 106 receives this alarm signal, or receives an instruction to sound an alarm from the server 103 that has received the alarm signal, and sounds the alarm.
  • the control unit 124 includes an arithmetic processing unit and a memory, and controls the imaging unit 120, the detection / recognition unit 121, the recording unit 122, and the communication unit 123. Note that the control unit 124 may share the arithmetic processing unit or the memory with the detection / recognition unit 121.
  • the camera 102 may not be provided with all of the imaging unit 120, the detection / recognition unit 121, the recording unit 122, the communication unit 123, and the control unit 124.
  • the detection recognition system 101 is connected to the camera 102 by wire or wirelessly, and includes a terminal outside the camera 102 that can control the camera 102 and display an image captured by the camera 102.
  • in that case, the detection/recognition means 121, the recording means 122, the communication means 123, and the control means 124 may be provided in the terminal, and images captured by the imaging means 120 provided in the camera 102 may be detected/recognized by the terminal.
  • the camera 102 has the same configuration as that of a general monitoring camera.
  • the imaging unit 120 captures an imaging range corresponding to the set angle of view in accordance with the orientation of the camera 102.
  • the same type of camera may be used, or different types of cameras may be used.
  • the imaging ranges of the respective cameras 102 may overlap or may be completely different.
  • in this example, a total of four cameras 102 of different types are used: two stereo cameras 102a, one infrared camera 102b, and one monocular camera 102c. The imaging ranges of the four cameras 102 are assumed to overlap one another.
  • when a stereo camera 102a capable of calculating distance, size, 3D structure, etc. from parallax is used as the camera 102, these quantities can be obtained directly from the parallax, so detection/recognition can be performed easily even if the camera does not include a high-performance arithmetic processing unit or the like for detection/recognition.
  • when an infrared camera (near-infrared camera or far-infrared camera) is used, near-infrared or far-infrared images can be captured, so things invisible to the human eye can also be detected/recognized, and detection/recognition in dark environments such as at night becomes easy.
  • the type of the camera 102 is not limited to these.
  • a distance image sensor may be used as the camera 102.
  • as the distance image sensor, for example, a TOF (Time Of Flight) sensor can be used.
  • the TOF measures the distance from the time taken for the projected laser to reciprocate to the target.
  • the camera 102 may be one in which the imaging unit 120 captures a single two-dimensional image and detection/recognition is performed from that image; one in which the imaging unit 120 captures two images and the distance, size, 3D structure, and the like are calculated from their parallax and detected/recognized; or one in which the imaging unit 120 captures a 3D distance image using a TOF sensor or the like and detection/recognition is performed from the 3D distance image.
  • the imaging unit 120 may pick up near-infrared or far-infrared images, and perform detection / recognition from these images.
  • one camera 102 may include a plurality of the imaging means 120 described above. In other words, one camera 102 may include, for example, an imaging function of a stereo camera and an infrared camera, and detection / recognition may be performed from an image obtained by these functions.
  • the detection/recognition means 121 recognizes a set recognition target, which may be a specific object (including a person or a thing other than a person) or an abstract phenomenon. In other words, the recognition target may be an object, such as a person (a robber, a thief, an arsonist) or a thing (a handgun), or a phenomenon, such as a crime or a fire.
  • for example, the detection/recognition means 121 may recognize a person as a robber by detecting from the image a person holding a knife or a handgun, or by detecting that person's movements.
  • in the case of a fire, it is conceivable to detect from an image obtained by an infrared camera that the temperature of a certain place is abnormally high, and to recognize that a fire has occurred.
  • when the infrared camera uses far-infrared rays, temperature can be detected; it is then also conceivable to detect a handgun, knife, or other weapon hidden in a clothing pocket from the temperature difference between the weapon and body temperature, and thereby recognize a concealed weapon.
  • since the detection/recognition firmware of the detection/recognition means 121 is generated by machine learning in the machine learning means 130 described later, the detection/recognition means 121 does not necessarily perform recognition in a way that is easy for a person to understand as above. That is, the detection/recognition unit 121 detects features included in the image captured by the imaging unit 120 under control of the detection/recognition firmware, and recognizes the recognition target set from those features.
  • the detection / recognition means 121 may perform detection / recognition using not only images but also sound.
  • the camera 102 includes a voice input unit such as a microphone, and the detection / recognition accuracy can be improved by performing detection / recognition using the voice acquired by the voice input unit.
  • voice may be used in detection / recognition by the server-side detection / recognition means 132 described later.
  • the detection / recognition firmware of the detection / recognition unit 121 is updated by new detection / recognition firmware generated by the machine learning unit 130 and the detection / recognition firmware generation unit 131 described later.
  • the detection / recognition firmware provided in the detection / recognition unit 121 may be generated by the machine learning unit 130 and the detection / recognition firmware generation unit 131, and is detected by another machine learning capable device. / It may be incorporated in the recognition means 121.
  • the detection / recognition means 121 may be initially provided with detection / recognition firmware generated by a method other than machine learning.
  • the setting of the target recognized by the detection/recognition means 121 is included in the detection/recognition firmware.
  • when the detection/recognition firmware is generated by the machine learning means 130 and the detection/recognition firmware generation means 131, the teacher data for machine learning are, for example, a plurality of images showing robbers robbing a convenience store, given to the machine learning means 130 together with the information that these images show robbers (the images are tagged as robbers). Through machine learning, it is then learned where in a given image (teacher data) a robber can be recognized.
  • a detection / recognition algorithm with a high probability of being able to recognize a burglar from an image is generated.
  • the detection/recognition algorithm is converted by the detection/recognition firmware generation means 131, and detection/recognition firmware is generated. The detection/recognition firmware (detection/recognition algorithm) obtained by this learning can recognize where in an image a robber appears; that is, a robber can be said to be set as the recognition target. Note that tagging the images is not always necessary when performing this machine learning.
  • the number of recognition targets (targets recognized by the detection / recognition firmware) set in the detection / recognition firmware is not limited to one, and a plurality of recognition targets may be set.
  • the detection/recognition firmware recognizes a specific target, and when the detection/recognition unit 121 recognizes this specific target, a signal indicating that recognition has occurred (for example, an alarm signal) is output. This signal is transmitted via the communication unit 123 to the server 103, the terminal 104, the terminal of the administrator 106, and the like, which are thereby notified that the set target has been recognized. Alternatively, the signal may be sent only to the server 103; after the server 103 comprehensively judges the information from each camera 102, alarm information such as an e-mail or an instruction to sound an alarm may be sent from the server 103 to the terminal 104 and the like.
  • in this example, the four cameras 102 have overlapping imaging ranges and can simultaneously recognize the overlapping recognition targets set in the detection/recognition firmware. For example, when a robber is set as the overlapping recognition target, a specific robber committing a specific robbery can be recognized simultaneously by the four cameras.
  • the server 103 includes a machine learning means 130, a detection/recognition firmware generation means 131, a server-side detection/recognition means 132, a server-side recording means 133, a server-side communication means 134, and a server-side control means 135.
  • the machine learning unit 130, the detection/recognition firmware generation unit 131, the server-side detection/recognition unit 132, and the server-side control unit 135 each include an arithmetic processing unit and a memory; they may each have their own arithmetic processing unit and memory, or may share them.
  • the machine learning unit 130 performs machine learning such as deep learning to generate a detection / recognition algorithm.
  • the detection / recognition algorithm is an algorithm for recognizing a set recognition target from an image captured by the imaging unit 120 of the camera 102.
  • the detection / recognition firmware generation unit 131 converts the detection / recognition algorithm generated by the machine learning unit 130 into firmware that can be executed by each camera 102, and generates detection / recognition firmware.
  • each camera 102 differs in the image resolution its imaging means 120 can acquire, the performance of the arithmetic processing unit of its detection/recognition means 121, the presence or absence of a GPU (Graphics Processing Unit) for the detection/recognition means 121 and of audio input means such as a microphone, and the camera type (stereo camera, TOF sensor, etc.); accordingly, the firmware that each camera 102 can execute also differs.
  • the detection / recognition firmware generation unit 131 converts a detection / recognition algorithm generated by machine learning into firmware that can be executed by each camera 102, so that a new detection / recognition program can be installed in each camera 102. It becomes.
  • the server-side detection / recognition means 132 performs detection / recognition by comprehensively judging the situation from the images and information of each camera 102.
  • the detection / recognition unit 121 of each camera 102 performs detection / recognition using the image acquired by the imaging unit 120 of the camera 102, but the server-side detection / recognition unit 132 is acquired by a plurality of cameras 102. Detection / recognition is performed using the selected image. Further, when the processing is heavy to be performed by each camera 102, the server-side detection / recognition unit 132 may perform a part of the processing.
  • the detection / recognition firmware of the server side detection / recognition means 132 is provided in the memory of the server side detection / recognition means 132. Further, the detection / recognition firmware of the server-side detection / recognition means 132 can also be updated by the detection / recognition firmware generated by the machine learning means 130 and the detection / recognition firmware generation means 131.
  • the server-side detection/recognition means 132 may determine, from the recognition results of the detection/recognition means 121 of the four cameras 102 (cameras 102a, 102b, and 102c), whether the recognition of the recognition target by each camera 102 is correct, or the probability that it is correct. Based on this determination result, alarm information or the like may be sent to the terminal 104 and the like. For example, when all four cameras report that they recognized the set target (for example, a robber), the server-side detection/recognition unit 132 may determine that the target was recognized correctly and instruct the terminal 104 and the like to sound an alarm.
  • The content of the alarm information may be changed according to the number of cameras that recognized the set target. For example, when all four cameras recognize the set target, the server-side detection/recognition unit 132 determines that the recognition is correct and instructs the terminal 104 to sound a loud alarm; when fewer than all of the cameras recognize it, the unit determines that the recognition may be correct and instructs the terminal 104 to sound a softer alarm.
  • The server-side detection/recognition means 132 also determines misrecognition (recognition errors) of a camera 102 from the recognition results of the detection/recognition means 121 of the plurality of cameras 102. For example, when three of the four cameras 102 report that the set target has been recognized by their detection/recognition means 121 but no such report arrives from the remaining camera 102, it is determined that this one camera 102 has made a recognition error (missed the target). Conversely, when three of the four cameras 102 send no report that the set target has been recognized but one camera 102 does, it may be determined that this one camera 102 has made a recognition error (false detection). Misrecognition of each camera 102 may also be determined by comparing the detection/recognition result of the server-side detection/recognition means 132 with the detection/recognition result of each camera 102 (a sketch of this majority-vote logic follows below).
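  • The majority-vote logic of the preceding items can be illustrated with a short sketch. This is a minimal sketch of one possible implementation, not the patent's own code: the names `CameraReport` and `judge_consensus`, the majority threshold, and the two alarm levels are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class CameraReport:
    camera_id: str
    recognized: bool  # did this camera's detection/recognition firmware report the set target?

def judge_consensus(reports):
    """Aggregate per-camera recognition results, choose an alarm level from the
    number of agreeing cameras, and flag cameras that disagree with the majority
    as misrecognition (recognition error) candidates."""
    n = len(reports)
    hits = sum(r.recognized for r in reports)
    if hits == n:
        alarm = "loud"   # all cameras agree: recognition judged correct
    elif hits > n / 2:
        alarm = "soft"   # majority agrees: recognition may be correct
    else:
        alarm = None     # minority report: possible false detection
    majority = hits > n / 2
    suspects = [r.camera_id for r in reports if r.recognized != majority]
    return alarm, suspects

# Example from the text: cameras 102a and 102b recognize a burglar, 102c does not,
# so 102c is flagged as having possibly missed the target.
alarm, suspects = judge_consensus([
    CameraReport("102a", True),
    CameraReport("102b", True),
    CameraReport("102c", False),
])
```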
  • the server-side recording means 133 records teacher data for machine learning performed by the machine learning means 130 and the like.
  • The server-side communication means 134 communicates with each camera 102 via the network 105: it receives images and other information from each camera 102, transmits commands and detection/recognition firmware to each camera 102, and, in an abnormal state (when the set target is recognized), transmits alarm information to the terminal 104 or the administrator 106.
  • The camera 102 acquires an image with the imaging unit 120 and recognizes (detects/recognizes) the set recognition target under control of the detection/recognition firmware of the detection/recognition unit 121.
  • An image captured when recognition was erroneous is transmitted to the server 103 (step S11).
  • Audio data or the like captured when recognition was erroneous may be transmitted together with that image.
  • the server side detection / recognition means 132 determines whether or not the recognition is wrong from the recognition results of the plurality of cameras 102 as described above.
  • In a system in which a camera 102 that recognizes the set recognition target (for example, a burglar) notifies the server 103 of the recognition, when the camera 102a and the camera 102b notify the server 103 that they have recognized the target but no such notification arrives from the camera 102c, the server-side detection/recognition means 132 determines from these notification results that the camera 102c has misrecognized (failed to recognize) the target.
  • The control means of the server 103 then instructs the camera 102c to transmit to the server 103, as an image at the time of erroneous recognition, the image the camera 102c acquired at the same time as, or around (for example, several seconds to several minutes before and after), the time when the camera 102a and the camera 102b recognized the recognition target. Upon receiving this command, the camera 102c transmits the image at the time of erroneous recognition to the server 103.
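  • On the server side, the request just described might look like the following sketch. `server.send_command` and the command fields are assumed names; the window of several seconds to several minutes follows the text.

```python
from datetime import datetime, timedelta

def request_missed_frames(server, camera_id: str, event_time: datetime,
                          before: timedelta = timedelta(seconds=30),
                          after: timedelta = timedelta(seconds=30)) -> None:
    """Ask a camera that failed to recognize the target to upload, as images at
    the time of erroneous recognition, the frames it captured around the time
    the other cameras recognized the target."""
    server.send_command(camera_id, {
        "type": "upload_images",
        "reason": "misrecognition",
        "from": event_time - before,
        "to": event_time + after,
    })
```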
  • a person may determine whether or not the recognition is wrong.
  • In this case, the detection recognition system 101 includes a terminal having display means for displaying the images captured by the camera 102 and input means such as a pointing device or keyboard. When the camera 102 fails to recognize a burglar, a person checks the images taken by the camera 102 on the display means of this terminal, selects with the input means the image in which the burglar should have been recognized, and this image may then be transmitted to the server 103 as an image at the time of erroneous recognition.
  • The server-side control unit 135 records the image sent from the camera 102 at the time of erroneous recognition in the server-side recording unit 133 as teacher data (training data). Together with this image, the recognition result that the detection/recognition means 121 should have output (for example, that a burglar should have been recognized in the image) is also recorded in the server-side recording means 133 as teacher data.
  • The recognition result that should have been output by the detection/recognition means 121, recorded as teacher data, may be generated by the server 103 or may be sent from the camera 102.
  • When the server-side detection/recognition unit 132 determines from the recognition results of the plurality of cameras 102 whether recognition was erroneous, the server-side detection/recognition unit 132 may create as teacher data the recognition result judged to be correct (the recognition result that the detection/recognition unit 121 should have output), and the server-side control unit 135 may record this teacher data in the server-side recording unit 133.
  • The machine learning unit 130 reads the teacher data recorded in the server-side recording unit 133 (step S12). The machine learning means 130 then extracts feature points by convolution operations from the images at the time of erroneous recognition contained in the read teacher data (step S13), and performs machine learning from the extracted feature point information and the recognition result that the detection/recognition unit 121 should have output (step S14). As a result of the machine learning, a detection/recognition algorithm, a neural network that performs the detection and recognition processing, is generated (step S15). A sketch of these steps follows below.
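  • Steps S12 to S15 can be sketched as follows. The text does not specify the network structure or training framework, so this is a hedged illustration using a small convolutional network in PyTorch; `DetectNet`, the layer sizes, and the optimizer settings are all assumptions.

```python
import torch
import torch.nn as nn

class DetectNet(nn.Module):
    """A stand-in for the detection/recognition algorithm: the convolution
    layers extract feature points (step S13) and the head maps them to the
    recognition result that should be output (step S14)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, x):
        return self.head(self.features(x))

def retrain(model: nn.Module, teacher_loader, epochs: int = 10) -> nn.Module:
    """Step S12: `teacher_loader` yields (image at erroneous recognition,
    recognition result that should have been output).  Training adjusts the
    algorithm so the same images are no longer misrecognized (steps S13-S14)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, wanted in teacher_loader:
            opt.zero_grad()
            loss = loss_fn(model(images), wanted)
            loss.backward()
            opt.step()
    return model  # the new detection/recognition algorithm (step S15)
```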
  • Machine learning by the machine learning means 130 is performed so that a detection / recognition algorithm (detection / recognition firmware) is optimized for each camera 102.
  • This is because the cameras 102 may be of different types, and even cameras with exactly the same characteristics may differ in installation location and environment, so the optimum algorithm may differ according to these differences.
  • Based on the original detection/recognition algorithm and the teacher data, the machine learning unit 130 generates a new detection/recognition algorithm that can produce, from the images at the time of erroneous recognition contained in the teacher data, the recognition result for the detection/recognition unit 121 that is also contained in the teacher data.
  • the original detection / recognition algorithm used for machine learning may be recorded in the server-side recording unit 133.
  • Alternatively, the detection/recognition firmware may be transmitted from the camera 102 and converted back into a detection/recognition algorithm for use. That is, the machine learning unit 130 generates a new detection/recognition algorithm from the teacher data and the detection/recognition algorithm used in the detection/recognition firmware of the camera 102 that made the erroneous detection/recognition.
  • the detection / recognition firmware generation means 131 converts the detection / recognition algorithm generated by the machine learning means 130 into detection / recognition firmware that is detection / recognition software for each camera (step S16). That is, the detection / recognition algorithm is converted into software in a format that can be executed by each camera by the detection / recognition firmware generation unit 131.
  • the server-side communication unit 134 transmits detection / recognition firmware, which is detection / recognition software generated by the detection / recognition firmware generation unit 131, to the camera 102 (step S17).
  • the control unit 124 of the camera 102 updates the firmware of the detection / recognition unit 121 to the new detection / recognition firmware.
  • In this way, the detection/recognition firmware of the detection/recognition unit 121 of the camera 102 can be updated to the new detection/recognition firmware generated by the machine learning unit 130 and the detection/recognition firmware generation unit 131 of the server 103 (see the sketch below).
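  • Steps S16 and S17 could be sketched as below. The patent does not say how the per-camera conversion is performed, so treating it as a TorchScript export with optional dynamic quantization for cameras without a GPU is purely an assumption of this sketch, as are `camera_profile` and `server.send_firmware`.

```python
import io
import torch

def build_firmware(model: torch.nn.Module, camera_profile: dict) -> bytes:
    """Step S16: convert the detection/recognition algorithm into a form the
    target camera can execute, taking account of per-camera differences."""
    model.eval()
    if not camera_profile.get("has_gpu", False):
        # shrink the model for cameras with weak arithmetic processing units
        model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8)
    buf = io.BytesIO()
    torch.jit.save(torch.jit.script(model), buf)  # self-contained executable graph
    return buf.getvalue()

def push_firmware(server, camera_id: str, blob: bytes) -> None:
    """Step S17: transmit the new detection/recognition firmware; the camera's
    control unit 124 then replaces the firmware of its detection/recognition
    unit 121 with it."""
    server.send_firmware(camera_id, blob)
```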
  • The machine learning by the machine learning unit 130 uses as teacher data the images captured when the detection/recognition unit 121 of the camera 102 erroneously recognized the set recognition target. In machine learning with this teacher data, the detection/recognition algorithm is improved so that the recognition target set for such an image is no longer misrecognized; the detection/recognition performance of the camera 102 can therefore be improved.
  • Because the detection/recognition firmware is generated and updated on the server, highly accurate detection/recognition can be performed even if the computing ability of the camera 102 is not particularly high. Moreover, the camera does not fall behind other cameras in relative performance over the years; rather, its performance can be improved gradually with use.
  • the performance of the camera 102 can be improved so that detection / recognition suitable for the environment in which the camera 102 is used can be performed.
  • Since the machine learns by itself, it becomes possible to recognize the set recognition target even from cues that a person would not notice. For example, instead of giving as training data an image taken while the burglar was actually committing the robbery, an image of the robber taken before the robbery may be given. The resulting algorithm is then not one that recognizes a burglar while a burglary is actually being carried out; rather, it may find, from the behavior of people loitering in or around the convenience store, the characteristics of a person who is likely to commit burglary in the future, and recognize such a person as a burglar (a person likely to become a burglar). Note that since the machine learning means 130 decides by itself which features are actually focused on for recognition, it does not necessarily recognize a person likely to be a robber from behavior alone.
  • In the detection recognition system of the present embodiment, since the four cameras 102 have overlapping imaging ranges, a recognition target set in the detection/recognition firmware can be recognized simultaneously by the four cameras 102 within the overlapping portion. Therefore, even if several of the four cameras 102 cannot recognize the recognition target, the other cameras can, so the likelihood that the recognition target is detected/recognized is increased and the accuracy of detection/recognition of the system as a whole can be increased.
  • the four cameras 102 include cameras with different types of imaging means 120, such as a stereo camera 102a, an infrared camera 102b, and a monocular camera 102c.
  • By including cameras of different types, such as the infrared camera 102b, the detection/recognition accuracy of the system as a whole can be improved compared with the case where only cameras 102 of the same type are used.
  • the plurality of cameras 102 may be installed in places where the imaging ranges are completely different, or may recognize completely different recognition targets.
  • As described above, the server-side detection/recognition means 132 can determine, from the recognition results of the detection/recognition means 121 of the four cameras 102, whether the recognition by each camera 102 is correct, or the probability that it is correct, and can determine misrecognition (recognition errors) of a camera 102. Therefore, the alarm sound can be emitted from the terminal 104 or the like only when the server-side detection/recognition means 132 judges from the detection/recognition results of the individual cameras 102 that the recognition is correct. Further, misrecognition of a camera 102 can be determined automatically, and machine learning can be performed automatically so as to improve the detection/recognition ability of the misrecognizing camera 102.
  • Since the image used as machine-learning teacher data at this time can be the very image that was misrecognized, learning proceeds so that that image is no longer misrecognized. Misrecognition can therefore be determined automatically, and the accuracy of detection/recognition improves the more the camera 102 is used.
  • teacher data may be stored in the recording unit 122 or the server-side recording unit 133, and machine learning may be performed when a certain number or more have been accumulated or when a certain period has elapsed.
  • Machine learning may also be performed using images other than those captured by the imaging unit 120. When the number and quality of teacher data obtainable from the images captured by the imaging unit 120 alone are insufficient, the machine learning effect can be improved by providing other images to the machine learning unit 130.
  • the recognition target recognized by the camera 102 is not limited to that described above, and any recognition target can be used as long as it can be detected / recognized from the image captured by the imaging unit 120.
  • The object distance detection device of the present embodiment uses a fisheye-lens stereo camera as a camera intended mainly for monitoring, such as a surveillance camera or an in-vehicle camera. It is not for outputting a stereoscopic image: it generates a distance image in which each pixel is represented by the distance from the camera to the subject, and enables monitoring operations such as suspicious person detection to be performed automatically by image recognition on the distance image.
  • The object distance detection device includes: a pair of fisheye cameras 211, each having a fisheye lens unit 221, a color filter 222, an image sensor 223, and the like; a pair of image input units 212 for inputting the respective image signals from the image sensors 223 of the pair of fisheye cameras 211; a pair of image signal correction processing units 213 for performing image signal correction processing on the input image signals; an image analysis unit 214 that obtains a distance image from the corrected images; a correction parameter calculation unit 215 that calculates, from the image signals, the parameters required by the image signal correction processing units 213; and a suspicious person detection unit 216 serving as a distance image recognition unit that performs image recognition on the distance image and automatically carries out monitoring operations such as suspicious person detection.
  • The pair of fisheye cameras 211 constitutes a stereo camera, and each camera outputs as an image signal the image formed on its image sensor 223 by the fisheye lens unit 221 through the color filter 222. The image is output as a moving image.
  • The fisheye lens unit 221 is a fisheye lens in that it adopts a projection method other than the central projection method; in this embodiment it is a lens unit adopting the equidistant projection method. The projection method of the fisheye lens unit 221 is not, however, limited to the equidistant projection method: a lens unit of any projection method other than the central projection method described above may be used as the fisheye lens.
  • the angle of view of the fisheye lens unit 221 is 180 degrees, but it may be, for example, an angle of view of about 160 degrees to 200 degrees.
  • The pair of fisheye cameras 211 is arranged adjacent to each other with the optical axes of the fisheye lens units 221 parallel, so that each camera, photographing with an angle of view of 180 degrees, captures the other fisheye lens unit 221 in its image. This makes it possible to use epipolar geometry for the corresponding-point search described later.
  • The output from the image sensor 223 of each fisheye camera 211 is input through the image input unit 212 of the object distance detection apparatus, and the image signal correction processing unit 213 performs color synchronization processing, white balance processing, gamma processing, color matrix processing, luminance matrix processing, color difference/luminance processing, and the like.
  • The color filter 222 may be omitted, in which case processing relating to color need not be performed. Note that, as described later, feature points are extracted from the two images by image recognition and corresponding points are searched for based on those feature points; therefore, if the recognition rate with a color image is higher than with a luminance image (grayscale image), a color image may be generated from the image signal as described above. The parameters based on the image signal that are required by the image signal correction processing unit 213 are calculated from the image signal by the correction parameter calculation unit 215.
  • the image analysis unit 214 receives the image corrected as described above, and an image conversion unit 231 as a distortion removal unit performs image conversion to remove distortion caused by the fisheye lens as necessary.
  • In this embodiment, the equidistant projection image is converted into a central projection image. This distortion removal can be performed by a known method, for example using a known integrated circuit for distortion-removing image conversion; a sketch of the underlying remapping follows below.
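  • For the equidistant projection method, a ray at angle θ from the optical axis lands at image radius r = f·θ, whereas the central projection method places it at r = f·tan θ. The conversion can therefore be done by remapping radii, as in the sketch below (a simplified model assuming the optical axis passes through the image center; real lenses would need additional lens-specific correction terms).

```python
import numpy as np
import cv2  # OpenCV is assumed available for the interpolating remap

def equidistant_to_central(img: np.ndarray, f: float) -> np.ndarray:
    """Convert an equidistant-projection fisheye image into a central-projection
    (pinhole) image with the same focal length f (in pixels)."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    r_c = np.hypot(xs - cx, ys - cy)   # radius in the central-projection output
    theta = np.arctan2(r_c, f)         # incidence angle: r_c = f * tan(theta)
    r_e = f * theta                    # equidistant source radius: r_e = f * theta
    scale = np.where(r_c > 0, r_e / r_c, 1.0)
    map_x = (cx + (xs - cx) * scale).astype(np.float32)
    map_y = (cy + (ys - cy) * scale).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```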
  • Next, the image from which distortion has been removed by the conversion is divided into set partitions by the partitioning/image selection unit 232, which also serves as a resolution conversion unit. This partitioning is not limited to partitioning by straight lines; partitioning by curves is also possible. For example, when an equidistant projection image is used without image conversion, partitioning by curves is preferable.
  • FIG. 13 shows the sections K11 to K33 for a rectangular converted image when the pair of fisheye cameras 211 is installed on a ceiling or the like with the optical axes directed vertically (downward).
  • The area of each partition is large in the central sections K11, K12, and K21, while the peripheral sections K13, K23, K31, K32, and K33 are smaller in area. The resolution is set low in the central sections K11, K12, and K21 and high in the peripheral sections K13, K23, K31, K32, and K33; that is, the resolution of the central sections is lowered, whereas that of the peripheral sections is left high.
  • In the central portion of the image the effective resolution is higher than at the periphery, so even if pixels are thinned out there to reduce the resolution, the influence is smaller than when the resolution is reduced at the periphery. Therefore, the resolution is lowered by thinning out pixels in the central sections K11, K12, and K21 of the image, while pixels are not thinned out in the peripheral sections K13, K23, K31, K32, and K33, whose resolution is therefore not lowered.
  • Alternatively, pixels may be thinned out in all the sections K11 to K33, with the degree of thinning made greater in the central sections K11, K12, and K21 than in the peripheral sections K13, K23, K31, K32, and K33.
  • In addition, the area of the central sections K11, K12, and K21, where the resolution is lowered, is made wider than that of the peripheral sections K13, K23, K31, K32, and K33, where the resolution is not lowered. Since the fisheye image is distorted so that it shrinks toward the periphery, the area of the peripheral sections is made smaller than that of the central sections when partitioning. If the resolution were lowered at the periphery, where the distortion is large, the image would deteriorate greatly, possibly affecting the distance calculation and image recognition adversely; therefore, when distortion is removed after partitioning, it is preferable to thin out more pixels in the central portion than in the peripheral portion.
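  • The selective thinning can be sketched as follows: pixels are dropped only in the central region while the periphery keeps full resolution. A single rectangular center stands in for the central sections K11, K12, and K21; the 60 % extent and the thinning factor of 2 are assumptions of the sketch, not values from the text.

```python
import numpy as np

def thin_center(img: np.ndarray, center_frac: float = 0.6, step: int = 2) -> dict:
    """Partition the image and reduce resolution only in the central partition
    by keeping every `step`-th pixel; peripheral partitions stay untouched."""
    h, w = img.shape[:2]
    y0, y1 = int(h * (1 - center_frac) / 2), int(h * (1 + center_frac) / 2)
    x0, x1 = int(w * (1 - center_frac) / 2), int(w * (1 + center_frac) / 2)
    return {
        "center": img[y0:y1:step, x0:x1:step],  # thinned: lower resolution
        "top":    img[:y0, :],                  # peripheral: full resolution
        "bottom": img[y1:, :],
        "left":   img[y0:y1, :x0],
        "right":  img[y0:y1, x1:],
    }
```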
  • FIG. 14 shows the sections K11 to K42 for a rectangular converted image when, for example, the pair of fisheye cameras 211 is arranged at a position higher than a person's height with the optical axes directed horizontally or obliquely downward. The sections K11 to K33 at the center and lower side are the same as those of the vertically facing fisheye camera 211 described above, and the resolution and area of each of the sections K11 to K33 are set in the same manner.
  • In the upper sections K41 and K42 above the center of the image, what appears is, for example, the sky outdoors or the ceiling indoors; their importance is therefore low, and their resolution is made lower and their area wider than those of the central section K11. The upper center section K41 has the lowest resolution and the largest area, while the upper left and right sections K42 have a higher resolution and a smaller area than the section K41.
  • FIGS. 13 and 14 show examples of partitioning. On the basic principle of lowering the resolution and widening the area of the central partitions while keeping the resolution of the peripheral partitions high and their area small, the resolution and area can be further adjusted according to importance determined by the arrangement of the fisheye camera 211. That is, a fisheye camera 211 with an angle of view of about 180 degrees or more has a wide shooting range, so positions where no person can exist, for example, may enter the shooting range; it is preferable to improve the processing speed by lowering the resolution of such portions.
  • the image selection unit 232 shown in FIG. 11 selects the same sections K11 to K33 for each of the pair of images.
  • Next, the corresponding point selection unit 233, serving as a corresponding point search unit, extracts feature points of each image by image recognition and associates the feature points of the two images with each other as described above. Epipolar geometry can be used here, and the corresponding points to be paired in the two images are determined and selected sequentially. When the selection of corresponding points for a pair of sections is completed, corresponding points are extracted in the next pair of sections (see the sketch below).
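  • For a rectified pair with parallel optical axes, the epipolar line of a point in one image is simply the same row of the other image, so the search reduces to a one-dimensional scan, as in this sketch (block matching with a sum-of-squared-differences cost; the window size and disparity range are assumed values, not from the text).

```python
import numpy as np

def match_along_epipolar(left: np.ndarray, right: np.ndarray,
                         y: int, x: int, win: int = 5, max_disp: int = 64) -> int:
    """Given a feature point (x, y) in the left image of a rectified pair,
    scan the same row of the right image and return the disparity (in pixels)
    of the best-matching block.  The point is assumed to lie far enough from
    the image border for the window to fit."""
    r = win // 2
    patch = left[y - r:y + r + 1, x - r:x + r + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        xr = x - d
        if xr - r < 0:
            break
        cand = right[y - r:y + r + 1, xr - r:xr + r + 1].astype(np.float32)
        cost = float(np.sum((patch - cand) ** 2))  # SSD block-matching cost
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```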
  • The distance calculation unit 234 obtains the distance from the fisheye cameras 211 to the target point corresponding to each pair of corresponding points, based on the difference between the positions of the corresponding points on the two images and the distance between the pair of fisheye cameras 211. The distance is calculated based on the so-called triangulation method.
  • In contrast to general triangulation, the baseline distance (the distance between the pair of cameras) is much smaller than the distance to the target; in this case the distance to the target point is obtained by dividing the baseline distance by the parallax (in radians).
  • The three-dimensional coordinates of the target point in real space are calculated from the coordinate positions (projection positions) of the corresponding points in the two-dimensional coordinates of each image, the distance between the cameras, and the focal length of the cameras; the distance from the camera to the target point can then be calculated from the three-dimensional coordinate positions of the target point and the camera in real space.
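  • Numerically, the approximation in the text works out as in this sketch: a 10 cm baseline and an angular parallax of 0.02 rad give a distance of about 5 m. The pinhole form Z = B·f/d for pixel disparities is shown alongside for comparison; both are textbook triangulation relations, not code from the patent.

```python
def distance_from_angular_parallax(baseline_m: float, parallax_rad: float) -> float:
    """Distance to the target point when the baseline is much smaller than the
    target distance: distance ≈ baseline / parallax (parallax in radians)."""
    return baseline_m / parallax_rad

d = distance_from_angular_parallax(0.10, 0.02)  # -> 5.0 (meters)

def distance_pinhole(baseline_m: float, focal_px: float, disparity_px: float) -> float:
    """Equivalent central-projection form: Z = B * f / d, with the disparity of
    the corresponding points in pixels and the focal length in pixels."""
    return baseline_m * focal_px / disparity_px
```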
  • The flowchart in FIG. 12 shows the processing in the image analysis unit 214 described above: after images are captured by the pair of fisheye cameras 211 and corrected, they are input to the image analysis unit 214 one frame pair at a time, and a distance image indicating the distance from the fisheye cameras 211 to the target point of each corresponding point, serving as each pixel, is calculated. Note that the flowchart in FIG. 12 illustrates the case where distortion removal is performed after partitioning. As shown in the flowchart, when a corrected image of each frame is input from the pair of fisheye cameras 211, partitioning is performed (step S21).
  • Here, the image is divided into a plurality of partitions and processing to reduce the resolution of a partition (for example, reduction processing) is performed, with the resolution varied from partition to partition. The area of each partition is also adjusted per partition: basically, in the central sections of the image the resolution is made low and the area large, while in the peripheral sections the resolution is higher and the area smaller than in the central sections.
  • In step S22, distortion is removed in each partition.
  • Next, feature points (singular points such as edges) are detected in each image by well-known edge detection.
  • A corresponding point is a point on an image corresponding to a target point in real space; the points in the two images that correspond to the same target point form a pair of corresponding points. If a point in real space appears in both images, a pair of corresponding points exists for it, and it is preferable to search for as many corresponding points as possible using the epipolar geometry method described above.
  • In step S24, the distance from the fisheye cameras 211 to the target point in real space corresponding to each pair of corresponding points is calculated based on the difference in the positions of the corresponding points in the two images and the distance between the fisheye cameras 211.
  • In step S25, a distance image is generated and output, with the distance obtained for each corresponding point as the value of each pixel. A sketch chaining these steps follows below.
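  • Chaining steps S21 to S25 for one frame pair might look like the driver below. The four helpers are assumed to behave like the sketches given earlier (`thin_center` for partitioning, `equidistant_to_central` for distortion removal, a corresponding-point search such as `match_along_epipolar`, and a triangulation function); none of this structure is prescribed by the text.

```python
def frame_to_distance_image(left, right, partition, undistort, find_pairs,
                            to_distance) -> dict:
    """Process one corrected frame pair: partition and thin (step S21),
    remove distortion per partition (step S22), detect features and search
    corresponding points, triangulate each pair (step S24), and collect the
    per-corresponding-point distances that form the distance image (step S25)."""
    left_parts, right_parts = partition(left), partition(right)     # step S21
    depth = {}
    for name, l_part in left_parts.items():
        l_u, r_u = undistort(l_part), undistort(right_parts[name])  # step S22
        for (y, x), disparity in find_pairs(l_u, r_u):  # corresponding points
            depth[(name, y, x)] = to_distance(disparity)            # step S24
    return depth                                                    # step S25
```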
  • In the monitoring apparatus, this distance image is used for monitoring. For example, a person is detected by image recognition on the distance image, or a living thing or an article (a vehicle or the like) other than a person is detected. Further, a person may be identified by comparing the three-dimensional shape of the person detected in the distance image, for example the height, physique, and clothing shape, with a registered person's three-dimensional shape or photograph. Children and adults may also be distinguished, as may ages, and men and women.
  • As described above, in the present embodiment, distance detection is performed by a stereo camera comprising a plurality of fisheye cameras 211 each having an angle of view of, for example, about 180 degrees; the distance to the target point of each object over a wide range of real space can therefore be obtained and output as a distance image.
  • On the other hand, because the angle of view is wide, the amount of information in one frame is large, and the amount of computation, including the image recognition necessary for distance detection, is large; a long time is therefore required for processing each frame, making it difficult to process a moving image.
  • Moreover, away from the central part the image is distorted, and the effective resolution is not necessarily high.
  • In addition, the image of the fisheye camera 211 contains many parts of low importance for the monitoring work, such as the sky, the ceiling, the floor, and the ground. Therefore, by partitioning off the parts of low distortion and high resolution and the parts of low importance, and calculating the distance with the resolution of those parts reduced, the amount of processing can be reduced without greatly lowering the accuracy of the distance calculation, and the processing time can thereby be shortened. This also speeds up the image recognition in the monitoring apparatus that uses the calculated distance image.
  • each pixel region D of the distance image output from the object distance detection device will be described.
  • In the distance image, each point on the image is represented by its distance from the stereo camera, and the distance image is expressed by changes in color shade according to the distance value. The shading may be, for example, a monochrome gradation or a gradation of another color. Alternatively, each point on the image may be represented by a numerical value indicating the distance.
  • FIG. 15A shows a part of the section K11 and FIG. 15B a part of the section K33, as parts of the distance image corresponding to the partitioned image shown in FIG. 13; the sections K11 to K33 are arranged in the same manner as in FIG. 13.
  • As described above, the resolution varies among the sections K11 to K33; however, as shown in FIG. 15, the size of the pixel P (the part delimited by both a double line and a dotted line), which is the minimum unit of the distance image, is the same throughout.
  • the pixel P is, for example, a monitor pixel that displays a distance image.
  • The image is divided into pixel regions D (the parts delimited by double lines), each consisting of one pixel P or a plurality of pixels P. A black-and-white shade corresponding to the distance from the stereo camera to the corresponding point (photographing target) is assigned to each pixel region D, so that the distance image is expressed by the color (shade) corresponding to the distance of each pixel region D.
  • The resolution of the section K11 is lower than that of the section K33: the number of pixels in each pixel region D of the section K11 is 4, whereas the number of pixels in each pixel region D of the section K33 is 2. That is, the pixel region D of the lower-resolution section K11 consists of a larger number of pixels P and has a larger area than the pixel region D of the higher-resolution section K33.
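  • Rendering with variable pixel regions can be sketched as follows: a lower-resolution section simply draws each distance value into a larger block of monitor pixels P. The square region shape and the 10 m normalization are assumptions of the sketch.

```python
import numpy as np

def render_region(canvas: np.ndarray, top: int, left: int, region_px: int,
                  distance: float, max_dist: float = 10.0) -> None:
    """Fill one pixel region D (a block of region_px x region_px monitor
    pixels P) with a gray shade encoding the distance to the corresponding
    point; nearer points are darker, farther points lighter."""
    shade = np.uint8(255 * min(distance, max_dist) / max_dist)
    canvas[top:top + region_px, left:left + region_px] = shade

canvas = np.zeros((480, 640), dtype=np.uint8)
render_region(canvas, 0, 0, region_px=2, distance=3.5)  # region in a lower-resolution section
render_region(canvas, 0, 2, region_px=1, distance=3.4)  # region in a higher-resolution section
```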
  • 1 Image sensor (imaging means)
  • 2 3D sensor (distance image acquisition means, distance measurement means)
  • 3 Object extraction means (distance image recognition object extraction means)
  • 4 Object image extraction means (recognition object image extraction means)
  • 5 Image recognition means
  • 10 Camera (imaging means)
  • 11 Distance image detection means (distance image acquisition means)
  • 12 Stereo camera


Abstract

Provided is an image recognition imaging apparatus capable of reducing the amount of computation in image recognition and improving the recognition rate. This image recognition imaging apparatus is provided with: an image sensor 1 that captures an image; and a 3D sensor 2 that acquires a distance image in which each pixel in the range corresponding to the captured range of the image is indicated by the distance to the object to be imaged. The apparatus is also provided with object extraction means 3 for extracting an object to be recognized in the distance image on the basis of the distances in the distance image; object image extraction means 4 for extracting from the image, on the basis of the range occupied by the object extracted from the distance image, a partial image corresponding to that object; and image recognition means 5 for identifying the extracted object by executing image recognition on the partial image.

Description

Image recognition imaging apparatus
The present invention relates to an image recognition imaging apparatus that acquires a two-dimensional image and a distance image and performs image recognition.
Generally, a surveillance camera captures an image and displays it on a monitor; the image is either watched by a person in real time, or stored and used to confirm an incident after it has occurred. In recent years, meanwhile, with the development of AI (artificial intelligence) technology, it has become possible to automatically detect the presence of a specific person, or the intrusion of a suspicious person into a restricted area, by image recognition.
In automobiles, on the other hand, in order to automate the avoidance of dangers such as collisions, it has been proposed to use a distance image in which each of the vertically and horizontally arranged pixels represents the distance to the photographed target (see, for example, Patent Document 1). This makes it possible to detect the position and movement of obstacles. In this case, a stereo camera having two cameras is used for distance measurement and image acquisition.
As described above, cameras that detect/recognize objects such as people and things are known (see, for example, Patent Documents 2 and 3). Such a camera is used, for example, as a surveillance camera for crime prevention, and issues an alarm when an abnormality is found by detection/recognition.
In recent years, machine learning has also come to be widely used in fields such as image recognition. Deep learning, for example, is a known machine learning technique: it learns the features of data using a multilayer neural network, and it is known that highly accurate image recognition becomes possible by using it.
As described above, with recent improvements in image recognition technology, it has become possible to avoid danger and drive automatically using image recognition of in-vehicle camera video, and to detect known criminals by image recognition of surveillance camera video. It is also conceivable to use, as the in-vehicle camera or surveillance camera, a stereo camera having the above-described pair of cameras: based on the difference in position (parallax) between corresponding points on the two images that correspond to the same target point on the subject, the distance from the stereo camera to the point on the subject corresponding to each image point is detected, a distance image is obtained in which each pixel represents not luminance or color difference but the distance to the subject, and image recognition is performed on this distance image.
In this case, corresponding points corresponding to each point on the photographed subject are searched for in each of the two captured images, and the above-mentioned distance is calculated from the parallax, i.e. the difference in position between corresponding points on the two images that correspond to the same point on the subject. Once the distance image is obtained, image recognition such as person detection by face recognition or moving body detection is performed, enabling, for example, danger avoidance while driving, automatic driving, detection of a wanted person by face recognition, and detection of a suspicious person such as an illegal intruder. In image recognition on a distance image, part of the three-dimensional shape of the photographed object is known from the distance image, so the subject on the image can be identified easily and with high accuracy as a person, dog, cat, horse, vehicle, and so on.
In consideration of image recognition accuracy, it is preferable to use a high-resolution image sensor. However, when arithmetic processing such as the above-described distance detection is performed on an image with a large number of pixels, the amount of calculation becomes enormous, and processing one frame of a moving image such as in-vehicle or surveillance camera video takes a long time. Depending on the computation speed of the integrated circuit performing the processing, the processing time per frame may become too long for subsequent processing to keep up when a real-time response is required. In this case, it is conceivable to shorten the processing time by first thinning out the pixels of the image to lower its resolution (number of pixels).
Fisheye lenses are also known as camera lens units. A fisheye lens is a lens unit adopting a projection method other than the central projection method used in ordinary wide-angle and telephoto lenses; for example, most fisheye lenses adopt the equidistant projection method. Other fisheye lenses adopt the equisolid angle projection method, the orthographic projection method, the stereographic projection method, and so on. On the telephoto side there is little difference in the captured image between the central projection method and the other projection methods, so there is no telephoto fisheye lens; a fisheye lens is an ultra-wide-angle lens. The angle of view of a fisheye lens is generally 180 degrees, but there are also lenses with an angle of view of less than 180 degrees and lenses with an angle of view of more than 180 degrees.
A distance calculation device that obtains the distance to a subject with a stereo camera using such fisheye lenses has been proposed (see Patent Document 4). In this fisheye stereo camera, the images captured by the stereo camera are converted into spherical images projected onto a spherical surface, and the distance to each subject is obtained.
JP 2016-1464 A
JP 2012-208851 A
JP 2010-160743 A
JP 2013-109779 A
Incidentally, recognition of a two-dimensional image captured by a 2D camera is performed on an image in which people and objects in three-dimensional space are projected in two dimensions, so even with AI technology (such as deep machine learning), improvements in processing speed and recognition rate reach a ceiling. That is, in object extraction from a two-dimensional visible image, parts of different colors and different luminances must be recognized as belonging to the same object, so the entire screen must be processed and objects recognized from changes in color, luminance, and so on before they can be extracted. This requires an enormous amount of computation; not only does processing take time, but detection accuracy is limited to about 90% even in image recognition using deep machine learning.
With a 3D camera, on the other hand, knowing the distance makes it relatively easy to recognize whether there is an obstacle with which the host vehicle may collide. That is, it is easy to detect a person or object at a position where a collision could occur as an obstacle. A 3D sensor can recognize the shape of an object, but there are cases where the object cannot be identified from the shape read by the 3D sensor alone.
The present invention has been made in view of the above circumstances, and an object thereof is to provide an image recognition imaging apparatus capable of reducing the amount of computation in image recognition and improving the recognition rate.
In order to solve the above problems, an image recognition imaging apparatus of the present invention includes: imaging means for capturing an image; distance image acquisition means for acquiring a distance image in which each pixel in a range corresponding to the imaging range of the image is represented by the distance to a photographing target; distance image recognition object extraction means for extracting a recognition object on the distance image based on the distances in the distance image; recognition object image extraction means for extracting, from the image, a partial image corresponding to the recognition object, based on the range on the distance image of the recognition object extracted from the distance image; and image recognition means for performing image recognition on the partial image and identifying the recognition object.
According to such a configuration, a group of pixels in the distance image whose distance values are close to one another and which form a cluster can be recognized as an object, i.e. a recognition target, on the distance image. In this case, the object can be extracted with simpler calculation than, for example, when extracting an object from a two-dimensional color visible image.
The distance image basically corresponds to the imaging range of the image, and the position in the distance image corresponding to each position in the image contains distance information. For example, when the image and the distance image have the same number of pixels, the pixel of the distance image corresponding to a pixel at a specific position in the image in which part of an object is photographed expresses, by its color, the distance to that part of the object. The imaging range of the image thus corresponds to the range of the distance image. That is, it is basically preferable that the image and the distance image capture the same range, but as long as the correspondence between positions in the image and the distance image is established, either image may be larger than the other. It is also preferable that they be captured simultaneously. Therefore, a recognition object on the distance image can be assigned onto the image, and the range on the image that is the same as the range of the object extracted on the distance image can be taken as the object on the image.
The partial image forming the range of the object on the image is then subjected to image recognition to identify the object. In this case, compared with performing image recognition on the entire image as in the past, image recognition is performed only on the extracted partial image, so the amount of computation can be greatly reduced. Moreover, since the object has already been separated from the background on the image, many of the computations for separating the object from the image are unnecessary, which further reduces the amount of computation. In addition, since this separation of the object is performed based on distance, the separation accuracy can be high.
In image recognition, a person, an automobile, or the like can be identified as an attribute of the object based on the shape, color, luminance, and so on of the extracted object. Furthermore, by image recognition using deep machine learning, it is possible not only to identify a person as such, but also to distinguish man, woman, adult, child, and so on, and to discriminate facial features such as the height of the nose and the size and color of the eyes and mouth. It is also possible to identify the model and model year of an automobile. If feature point data of a specific person is stored in advance, it can be determined whether a recognition object recognized as a person is that specific person. Moreover, when the object is separated from the distance image, its three-dimensional shape and size can be recognized from the distance relationships, and image recognition can be performed taking the three-dimensional shape and size of the object into account. Separating the object by distance from the distance image, and performing image recognition using three-dimensional shape and size data in addition to the image, can improve the object identification capability and identification accuracy, including object detection and identification.
In the above configuration of the present invention, it is preferable that the distance image acquisition means has distance measurement means for measuring the distance of each pixel of the distance image.
According to such a configuration, the above-described image and distance image can be obtained from imaging means that is a monocular camera capturing the image, and distance measurement means, such as a depth sensor or 3D sensor, generating the distance image. As the depth sensor or 3D sensor, a TOF (Time Of Flight) type sensor, for example, can be used.
In the above configuration of the present invention, it is also preferable that the distance image acquisition means obtains the distance image based on the parallax between two of the imaging means.
According to such a configuration, the distance image can be obtained from the parallax of a so-called stereo camera.
In the above configuration of the present invention, it is also preferable that the imaging means, the distance image acquisition means, the distance image recognition object extraction means, the recognition object image extraction means, and the image recognition means are provided in a single housing.
According to such a configuration, since the processing required for image recognition is simplified, a large arithmetic processing unit with a high processing speed is unnecessary, and the image recognition imaging apparatus can be housed in a relatively small casing such as that of a surveillance camera. The image recognition imaging apparatus may also be connected to an external server for data communication, so that the server can store data and perform more advanced image recognition processing.
Here, surveillance cameras and the like are often used for a long period after installation, but since technology for detecting/recognizing objects from images advances day by day, the detection/recognition technology used in such a camera may become obsolete over long-term use.
Moreover, because the optimum detection/recognition algorithm varies with the environment of the installation site, the photographing target, and so on, the detection/recognition algorithm in the firmware originally provided in the camera before installation may not be able to perform sufficient detection/recognition.
The present invention has been made in view of the above circumstances, and an object thereof is to provide a detection recognition system in which the detection/recognition performance of detecting features contained in an image and recognizing a recognition target set from those features can be improved by updating the firmware for detection/recognition.
In order to achieve the above object, the detection recognition system of the present invention includes imaging means for capturing images, detection/recognition means, and a server. The detection/recognition means includes detection/recognition firmware, detects features contained in an image acquired by the imaging means under control of the detection/recognition firmware, recognizes a set recognition target, and can update the detection/recognition firmware to new detection/recognition firmware generated by detection/recognition firmware generation means. The server includes machine learning means for generating a detection/recognition algorithm by machine learning using images acquired by the imaging means as teacher data, and the detection/recognition firmware generation means for generating the new detection/recognition firmware of the detection/recognition means from the detection/recognition algorithm.
In the detection recognition system of the present invention, the imaging means captures images, and the detection/recognition means detects the features contained in an image obtained by the imaging means under control of the detection/recognition firmware and recognizes the set recognition target. The machine learning means of the server generates a detection/recognition algorithm by machine learning using images acquired by the imaging means as teacher data. The generated detection/recognition algorithm is converted by the detection/recognition firmware generation means of the server into firmware suited to the detection/recognition means (detection/recognition firmware), and the detection/recognition firmware of the detection/recognition means is updated to this new detection/recognition firmware.
Therefore, a detection/recognition algorithm capable of more accurate detection and recognition can be generated by machine learning from the images obtained by the imaging means, converted into firmware suited to the detection/recognition means, and used to update the detection/recognition firmware of the detection/recognition means, so that the detection/recognition performance can be improved.
Further, in the above configuration of the detection recognition system of the present invention, it is preferable that the machine learning means performs machine learning using, as teacher data, images captured when the detection/recognition means erroneously recognized the set recognition target.
According to such a configuration, the machine learning means learns from the images for which the detection/recognition means erred in recognizing the set recognition target, so that the recognition target is not misrecognized again, and can generate a new detection/recognition algorithm and new detection/recognition firmware; the detection/recognition performance can therefore be reliably improved.
Further, in the above configuration of the detection recognition system of the present invention, it is preferable that the system includes at least one camera, and that the camera includes the imaging means and the detection/recognition means.
According to such a configuration, there is no need to provide a terminal or the like having detection/recognition means separately from the camera, so the system as a whole can be made compact. In addition, the camera can update the detection/recognition firmware of its detection/recognition means to new detection/recognition firmware created as a result of machine learning on the server, so the detection/recognition performance of the camera can be improved, easily and even after the camera has been installed.
In the configuration of the detection and recognition system of the present invention, it is preferable to provide a plurality of the cameras whose imaging ranges and set recognition targets at least partially overlap.
According to such a configuration, a given range can be imaged by a plurality of cameras and subjected to detection/recognition. The same object or the same phenomenon can therefore be detected/recognized by a plurality of cameras, improving detection/recognition accuracy.
In the configuration of the detection and recognition system of the present invention, it is preferable that, when some of the plurality of cameras recognize an overlapping recognition target, the machine learning means performs machine learning using, as teacher data, the images in which other cameras of the plurality failed to recognize that overlapping recognition target.
According to such a configuration, when at least one camera recognizes an overlapping recognition target, the images in which other cameras failed to recognize that target can be used as teacher data for machine learning. Images that most likely should have yielded a recognition but did not can thus serve as teacher data, raising the efficiency of the machine learning.
In the configuration of the detection and recognition system of the present invention, it is preferable that at least one of the plurality of cameras differs from the others in its imaging means.
According to such a configuration, even when detection/recognition from an image captured by one type of imaging means is difficult and a camera with that imaging means cannot recognize the recognition target, a camera with a different imaging means may still recognize it. This makes it easy to determine that a recognition target was missed, and the image in which recognition failed can be used as teacher data for machine learning, so that the system learns to recognize targets even from images that are difficult to detect/recognize.
According to the detection and recognition system of the present invention, the detection/recognition performance, in which features contained in an image are detected and the recognition target set from those features is recognized, can be improved by updating the firmware used for detection/recognition.
Here, calculating a distance image from each frame of a moving image shot with a stereo camera using fisheye lenses requires an enormous amount of computation, as much as or more than with ordinary lenses, and the processing time per frame becomes long. Moreover, if the image formed by the fisheye lens unit on the flat image sensor surface is detected and output as-is, distortion is small in the central portion of the image and large at its periphery. When monitoring an indoor space with a surveillance camera, depending on the size of the space, it is efficient to install the stereo camera pointing vertically downward from the ceiling at the center of the space, so that the floor directly below is at the center of the image.
When shooting ahead with an in-vehicle camera, or shooting with a surveillance camera from a corner of the monitored space, it is efficient to install the stereo camera so that, for example, the horizontal direction or a direction obliquely below it is at the center. In these cases as well, the periphery of the image taken with the fisheye lens is far more distorted than its center.
With a surveillance camera pointing vertically down from the ceiling, even if a person is directly below, the face is not captured, only the head, making it difficult to identify the person by face recognition or the like. However, for a person to come directly under the surveillance camera, he or she must move from the periphery of the shooting range toward its center, and the face of the moving person is captured during this movement, so face recognition becomes possible.
With a horizontal outdoor surveillance camera or an in-vehicle camera, the sky occupies the upper part of the image, and there is often no need to identify subjects there by image recognition. When obtaining a distance image from the images of a stereo camera with such fisheye lenses, processing the entire image uniformly is inefficient, and some processing scheme appears necessary to shorten the processing time.
The present invention has been made in view of the above circumstances, and an object thereof is to provide an object distance detection device that, when obtaining a distance image from the images of a stereo camera having fisheye lenses, first lowers the resolution so that it differs depending on the position within the image, and then obtains the distance image.
In order to solve the above problems, an object distance detection device of the present invention comprises:
a stereo camera consisting of a pair of fisheye cameras each having a fisheye lens unit and an imaging sensor;
a distance image calculation unit that calculates a distance image from the images output by the pair of imaging sensors; and
a distance image recognition unit that performs image recognition, including identification of subjects, on the distance image,
wherein the distance image calculation unit comprises:
a partitioning unit that partitions each image captured by the pair of fisheye cameras into a plurality of preset sections;
a resolution conversion unit that converts each section into an image of the resolution set for that section;
a corresponding point search unit that, in each of two images captured substantially simultaneously by the pair of fisheye cameras, finds corresponding points that correspond to the same point on the photographed subject; and
a distance calculation unit that obtains the distance from the stereo camera to each corresponding point based on the difference in position between the two corresponding points, found by the corresponding point search unit, that correspond to the same point on the subject.
According to such a configuration, the image captured by each fisheye camera is divided into a plurality of sections and the resolution can be changed per section. Considering an image captured by a fisheye camera, distortion from the fisheye lens is small in the central portion, where, for example, face detection and person detection are possible as-is. At the periphery, however, distortion is large and the image is distorted and compressed, so image recognition is difficult unless the distortion is removed. If the resolution of the whole image were lowered uniformly at the outset to ease processing, accurate image recognition would become difficult in the heavily distorted periphery.
Therefore, by lowering the resolution in the central sections of the image to reduce the processing load, while keeping the peripheral sections at their original high resolution, the processing speed can be improved without sacrificing recognition accuracy. If the resolution is lowered over the large central area of the image, the overall processing load decreases and processing is accelerated, while the loss of recognition accuracy caused by the lowered resolution is suppressed. Resolution here means, for example, the number of pixels per unit area of the image; lowering the resolution means thinning out the image's pixels by a well-known method, and processing similar to well-known image reduction processing may be used.
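As one way to picture this per-section processing, the following Python/OpenCV sketch downsamples only a central region of a frame while leaving the periphery untouched; the region geometry and the halving factor are illustrative assumptions, not the partitioning the device actually uses.

```python
# Hypothetical per-section resolution reduction: the low-distortion centre
# of the fisheye frame is thinned out, the distorted periphery is kept.
import cv2

def downsample_center(img, center_frac=0.5, scale=0.5):
    h, w = img.shape[:2]
    ch, cw = int(h * center_frac), int(w * center_frac)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    center = img[y0:y0 + ch, x0:x0 + cw]
    # Reduce the pixel count of the central section only; INTER_AREA acts
    # like the well-known image-reduction (pixel-thinning) process.
    small = cv2.resize(center, (int(cw * scale), int(ch * scale)),
                       interpolation=cv2.INTER_AREA)
    return small  # processed separately from the full-resolution periphery
```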
The sections of the images of the two fisheye cameras are basically partitioned over the same directional ranges. That is, when sections A, B, and C correspond between the two images, a corresponding point in section A of one image has its counterpart in section A of the other image, at least away from section boundaries. The search for corresponding points is therefore basically performed between the two corresponding sections of the two images. The search is performed, for example, by extracting feature points (singular points) for image recognition in each section of the two images and determining, for each feature point in a section of one image, the corresponding feature point in the corresponding section of the other image; corresponding points are thus determined by basic image recognition.
In the search for corresponding points, the epipolar geometry described in Patent Document 1 above may also be used.
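For instance, a corresponding-point search of this kind could be sketched with ORB feature extraction and brute-force descriptor matching in OpenCV, applied section by section; the patent only requires that points on the same subject be matched between the two images, not this particular method.

```python
# Hypothetical corresponding-point search between the matching sections of
# the two fisheye images, using feature points and descriptor matching.
import cv2

def find_corresponding_points(section_left, section_right):
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(section_left, None)
    kp2, des2 = orb.detectAndCompute(section_right, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    # Each match pairs a feature point in one camera's section with the
    # point corresponding to the same subject point in the other camera's.
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches]
```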
In the configuration of the object distance detection device of the present invention, it is preferable that the distance image calculation unit outputs a distance image that consists of pixels arranged vertically and horizontally and is divided into pixel regions each consisting of one or more pixels, with each pixel region colored in a color that changes according to the distance from the stereo camera to the corresponding point,
and that in sections of the distance image with different resolutions, the number of pixels constituting each pixel region differs according to the resolution.
According to such a configuration, by increasing or decreasing the number of pixels constituting each pixel region according to the resolution of each section, the images of sections with different resolutions can be represented on a single distance image at substantially the same display magnification. Each pixel region is colored with a single color, and in the distance image the color of each pixel region changes according to distance. Just as a thermograph is an image representing temperature values, the distance image is an image representing distance values; the color variation of this thermograph-like image may be, for example, a variation in brightness (luminance), in hue, or in a combination of both.
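The following is a minimal sketch of such distance-to-color mapping, assuming OpenCV's built-in false-color maps are used; the 10 m clipping range is an illustrative assumption.

```python
# Hypothetical colorization of a distance image, analogous to the way a
# thermograph maps temperature values to colors.
import cv2
import numpy as np

def colorize_distance(distance_map, max_dist=10.0):
    # Normalize metres to 0-255, then map to a hue-varying false-color scale.
    norm = np.clip(distance_map / max_dist, 0.0, 1.0)
    gray = (norm * 255).astype(np.uint8)
    return cv2.applyColorMap(gray, cv2.COLORMAP_JET)
```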
In the configuration of the object distance detection device of the present invention, it is preferable that the distance image calculation unit includes a distortion removal unit that removes, for each section, the distortion caused by the fisheye lens.
According to such a configuration, each section can be handled as a two-dimensional image from which distortion has been removed, which makes the search for corresponding points easier. Methods for removing the distortion of images shot through fisheye lenses are well known; IP cores are commercially available that perform distortion removal once the parameters required for each fisheye lens are set, and distortion can be removed by using such a core.
In the configuration of the object distance detection device of the present invention, it is preferable that the pair of fisheye cameras is arranged facing a substantially vertical direction, and that the resolution conversion unit changes the resolution so that the sections at the periphery of the image have a higher resolution than the sections at the center of the image.
Considering the image distortion caused by the fisheye lens as described above, it is preferable to lower the resolution of the central portion of the image, where distortion is small, and not to lower the resolution of the periphery, where distortion is large. Furthermore, with a fisheye camera attached facing downward to a ceiling or some columnar structure, the point directly below is at the center of the image; there, subjects are photographed from directly above and faces are hard to capture, making image recognition difficult, whereas at positions slightly off from directly below, faces are captured and recognition becomes possible. Near the edge of the fisheye camera's roughly 180-degree angle of view, even the front of a face may be captured. Therefore, by lowering the resolution of the central portion and not lowering the resolution of the periphery, both a reduction in processing time and the maintenance of image recognition accuracy can be achieved. In this case, it is preferable that the area of a single low-resolution section at the center be larger than that of a single high-resolution section at the periphery.
In the configuration of the object distance detection device of the present invention, it is preferable that the pair of fisheye cameras is arranged facing a substantially horizontal direction, and that the resolution conversion unit changes the resolution so that the sections at the bottom of the image have a higher resolution than the sections at the top of the image.
According to such a configuration, when the fisheye camera shoots horizontally, the upper part of the image shows the sky outdoors or the ceiling indoors, which is of low importance for purposes such as collision avoidance and autonomous driving of automobiles or the identification of criminals and suspicious persons; lowering the resolution there relative to the lower part of the image shortens processing time while suppressing any loss of recognition accuracy. In particular, when the fisheye camera is mounted high, for example when a surveillance camera is above head height or an in-vehicle camera is mounted high on the vehicle, the lower part of the image becomes more important and its resolution should be maintained; in this case, the resolution of the central portion of the image is preferably lower than that of the lower part. Conversely, if the fisheye camera is installed lower than a person's height (face level), the resolution may be changed so that the sections at the bottom of the image have a lower resolution than the sections at the top, where faces are more likely to be captured.
According to the object distance detection device of the present invention, a distance image can be calculated from a stereo camera using fisheye lenses at high speed and with high accuracy, without placing a heavy load on the arithmetic processing unit.
According to the present invention, image recognition can be performed easily and with high accuracy.
The drawings for the first embodiment of the image recognition imaging apparatus of the present invention include: a block diagram of the image recognition imaging apparatus; a flowchart of the image recognition method performed by the apparatus; two diagrams for explaining that image recognition method; and a block diagram of an image recognition imaging apparatus according to a second embodiment of the present invention.
The drawings for the embodiment of the detection and recognition system of the present invention include: a block diagram of the detection and recognition system; a block diagram of a camera of the system; a block diagram of a server of the system; and a flowchart for explaining the method of updating detection/recognition firmware by the system.
The drawings for the embodiment of the object distance detection device of the present invention include: a block diagram of the object distance detection device; a block diagram of the image analysis unit of the device; a flowchart showing the processing of the image analysis unit; two diagrams showing the image sections used by the device; and diagrams (a) and (b) for explaining the difference in resolution between sections of the distance image.
Embodiments of the present invention will be described below.
The image recognition imaging apparatus of the present embodiment combines an image recognition device with a camera used mainly for monitoring, such as a surveillance camera or an in-vehicle camera, and identifies people, automobiles, and the like within the shooting range.
As shown in FIG. 1, the image recognition imaging apparatus of the present embodiment includes: an image sensor 1 as imaging means; a 3D sensor 2 as distance image acquisition means; object extraction means 3, serving as distance image recognition target extraction means, for extracting objects to be recognized (including people), i.e., recognition targets, from the distance image obtained by the 3D sensor 2; object image extraction means 4, serving as recognition target image extraction means, for extracting from the image of the image sensor 1 the partial image corresponding to the recognition target; image recognition means 5 for performing image recognition on the extracted partial image (object image); control means 6 for controlling these; and storage means 7 for storing data such as images, distance images, and recognition results.
The control means 6 is connected to an external server 9 (host PC) via a communication network 8 such as the Internet so that data communication is possible.
The image sensor 1 is a so-called imaging element and is used as a camera together with a lens that forms an image of the shooting target on the image sensor 1.
The 3D sensor 2 is of the TOF type described above: for example, it scans ultrashort pulses of an infrared laser across the shooting range and measures the time until the light reflected from an object returns; multiplying this time by the speed of light gives the distance for each pixel in the shooting range. The resolution of the image sensor 1 and that of the 3D sensor 2 may or may not match; it suffices that the position of each part of the shooting range of the image sensor 1 corresponds to that of the 3D sensor 2, i.e., that any position in the shooting range of the image sensor 1 can be located within the shooting range of the 3D sensor 2. Basically, the image sensor 1 and the 3D sensor 2 simultaneously capture mutually overlapping ranges as an image and a distance image.
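The TOF relation stated above reduces to simple arithmetic; the sketch below divides the round trip by two to obtain the one-way distance, a step the text leaves implicit.

```python
# Time-of-flight distance: half the round-trip time times the speed of light.
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds):
    # The pulse travels to the object and back, hence the factor of 1/2.
    return round_trip_seconds * SPEED_OF_LIGHT / 2.0

print(tof_distance(33.4e-9))  # a ~33 ns round trip is roughly 5 m
```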
The object extraction means 3 extracts objects from the distance image, which is the distance information acquired by the 3D sensor 2. A group of pixels whose distance values are close to one another and which form a roughly contiguous cluster (neighboring pixels) is judged to be an object to be extracted; a cluster of pixels with similar distance values is extracted as one object. Since objects can be extracted essentially from the pixel distances alone, objects can be extracted with high accuracy even from a distance image captured only once, such as one taken by an in-vehicle 3D sensor 2, rather than only from the images of a surveillance camera that constantly or repeatedly shoots the same range. Portions where the pixel distance is at least a predetermined value may be treated as background and excluded from extraction. For a fixed 3D sensor 2 such as a surveillance camera, or a 3D sensor 2 whose movement range (such as rotation) is fixed, the same range is captured constantly or repeatedly; the distance that does not change at each pixel over a certain period (or the longest distance, if it changes) may therefore be stored as that pixel's background distance, and a cluster of pixels whose distance has changed from the background distance may be recognized as an object. Alternatively, a cluster of pixels whose distance changes over time may be detected as an object. Separating background and objects from changes over time is also possible with two-dimensional images, but with a distance image, a portion whose distance has become shorter than the background can essentially always be identified as an object.
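A simplified sketch of this extraction, assuming SciPy is available: pixels nearer than a fixed background threshold are grouped into connected clusters. The threshold values are illustrative, and the within-cluster distance-similarity test described above is omitted for brevity.

```python
# Hypothetical object extraction from a distance image: foreground pixels
# (nearer than the background) are grouped into contiguous clusters.
import numpy as np
from scipy import ndimage

def extract_objects(distance_map, background_dist=8.0, min_pixels=50):
    foreground = distance_map < background_dist  # drop background pixels
    labels, n = ndimage.label(foreground)        # contiguous clusters
    return [np.argwhere(labels == i)             # pixel positions per object
            for i in range(1, n + 1)
            if (labels == i).sum() >= min_pixels]
```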
The object extraction means 3 determines the position of the object's range on the distance image from the 3D sensor 2.
The object image extraction means 4 converts the range of the object determined on the distance image into a range on the visible two-dimensional image from the image sensor 1 and extracts the partial image within that range. That is, the object range extracted on the distance image is mapped onto the image, and the partial image covering the object range is extracted from the image. For example, coordinate systems may be provided for the distance image and the image so that coordinates on the image can be converted into coordinates on the distance image, or the two coordinate systems may be made identical.
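For illustration, if the two sensors were registered by a simple scale factor between their coordinate systems, the mapping could look like the sketch below; a real device would derive this mapping from calibration of the specific sensors.

```python
# Hypothetical mapping of an object's bounding box from distance-image
# coordinates to image coordinates, followed by partial-image extraction.
def depth_bbox_to_image(bbox, depth_shape, image_shape):
    y0, x0, y1, x1 = bbox
    sy = image_shape[0] / depth_shape[0]
    sx = image_shape[1] / depth_shape[1]
    return (int(y0 * sy), int(x0 * sx), int(y1 * sy), int(x1 * sx))

def crop_partial_image(image, bbox_depth, depth_shape):
    y0, x0, y1, x1 = depth_bbox_to_image(bbox_depth, depth_shape, image.shape)
    return image[y0:y1, x0:x1]  # the partial image handed to recognition
```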
The image recognition means 5 performs image recognition on the partial image, i.e., the object on the image extracted as described above. Since the portion constituting the object has already been extracted from the distance image, the task is to recognize whether the extracted partial image is, for example, a person or a car. For this, stored feature points of people and of automobiles may be compared with feature points detected in the partial image to determine whether it is a person, an automobile, and so on. Based on an algorithm obtained by deep machine learning, a person may further be identified as a child or an adult, female or male. In practice, person detection, person tracking, and face recognition can be performed using, for example, OpenCV (Open Source Computer Vision Library), a library of functions related to image recognition, together with recognition by deep machine learning. Recent OpenCV releases include machine learning functionality, including a deep learning module, and can identify people, vehicles, and the like. In the present embodiment, the region of the image constituting the object has already been determined using the distance image, so there is no need to determine the region of the identified person or car; it is only necessary to identify whether the already extracted object region is a person, an automobile, or the like. No computation over the entire image to locate objects is required, so the amount of computation is small. The image recognition recognizes object attributes: the type of object, such as person or car; for a person, attributes such as adult, child, male, female, race, and facial features; and for a car, attributes such as model, year, and color. Furthermore, in the present embodiment, the three-dimensional shape and size of an object can be recognized from the object's data on the distance image at extraction time, and these are used when identifying the object's attributes, making it easy to distinguish, for example, adults from children and small cars from large cars.
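As one concrete possibility for the OpenCV-based recognition mentioned above, the library's built-in HOG person detector can classify an extracted partial image; deep-learning classifiers could equally be substituted, and this sketch is not the specific method the embodiment fixes.

```python
# Hypothetical person check on an extracted partial image using OpenCV's
# standard HOG + linear-SVM people detector.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def is_person(partial_image):
    rects, _ = hog.detectMultiScale(partial_image, winStride=(8, 8))
    return len(rects) > 0  # any detection window -> classify as a person
```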
The control means 6 controls shooting by the image sensor 1 and the 3D sensor 2, object extraction from the distance image by the object extraction means 3, extraction of the partial image from the image by the object image extraction means 4, and image recognition by the image recognition means 5. The control means 6 consists of an arithmetic processing unit, which may also function as the object extraction means 3, the object image extraction means 4, and the image recognition means 5. For example, the image recognition means 5 may be realized by executing, on the arithmetic processing unit, a machine learning model (image recognition algorithm) trained by deep machine learning.
The storage means 7 is a storage device such as a hard disk or flash memory and stores data such as images, distance images, and image recognition results.
Next, an image recognition method performed by the image recognition imaging apparatus will be described with reference to the flowchart of FIG. 2 and to FIGS. 3 and 4.
A distance image b is captured by the 3D sensor 2 (step S1). At the same time, a two-dimensional visible (color) image a is captured by the image sensor 1. For example, as shown in FIG. 3, the captured image a shows an adult man, a car, and a female child. In the distance image b shown in FIG. 4, the difference in distance at each pixel can be expressed as an image by assigning brightness or color to distance. In FIG. 4, pixels at or beyond a predetermined distance are shown in white, for example.
Next, objects are extracted from the distance image as clusters of pixels close in distance to one another (step S2). As described above, pixels at or beyond a predetermined distance may be treated as background, and each cluster of pixels with smaller, similar distance values may be extracted as an object. Even a contiguous cluster of pixels is handled as separate objects if it divides into pixels at clearly different distances. An object extracted on the distance image can be represented, as a cluster of pixels, by the position of each pixel within the object's range. Here, the distance image portions of the adult man, the car, and the female child are extracted.
Next, on the image a of FIG. 3 captured by the image sensor 1, the range at the same position as the object range extracted in the distance image b of FIG. 4 is extracted as a partial image constituting the object image (step S3). This puts the objects on the two-dimensional color image in an extracted state. Objects are not identified and delimited on the image itself; rather, they are separated by distance differences on the distance image, and the separated objects are simply mapped onto the image. Only the positions of the objects on the image are known; what they are has not yet been identified. Since objects are extracted solely by distance on the distance image and the corresponding image regions are extracted based on position, the amount of computation is far smaller than when objects must be identified in order to be extracted.
Next, image recognition is performed on the partial image extracted from the image as the object image (step S4). Since the object has already been extracted as described above, the image recognition does not need to process the entire image to extract objects. That is, there is no need to detect and extract objects from the image; image recognition is performed only on the already extracted partial images, reducing the amount of computation. Moreover, recognition can proceed on the premise that an extracted partial image is one object or several adjacent objects, so, for example, the outer edge of the extracted portion can be taken as the outer edge of the object, which further reduces the computation required for recognizing the partial image. People can be identified by well-known person detection, face recognition, and the like, and objects other than people, for example automobiles and bicycles, can be identified by registering the feature points of various objects as part of the algorithm. In addition, the three-dimensional shape and size of an object, which can be read from the distance image, can be used as object feature points, improving object recognition accuracy.
Data such as the object attributes resulting from image recognition are transmitted to the server 9 (step S5). Various data related to the image recognition, for example the distance image and image used for recognition and the extracted object ranges, are also sent to the server 9. More advanced image recognition may be performed on the server 9 side; in that case, the object extraction means 3, the object image extraction means 4, and the image recognition means 5 may reside on the server 9. Performing the image recognition processing on a server with high computing power enables more advanced processing. When the server 9 performs the image recognition processing, the system may connect a plurality of image sensors 1 and 3D sensors 2; the computation per image recognition imaging apparatus can then also be reduced at the server, so that, for example, a system with many image sensors 1 and 3D sensors 2 can be handled by a single high-performance server. Alternatively, normal image recognition may be performed on the image recognition imaging apparatus side, with the server performing recognition on past data when an incident occurs. In that case it is not necessary to store all images and distance images; for example, only the extracted object images (partial images) may be stored, reducing the storage capacity required of the server 9.
According to such an image recognition imaging apparatus, after objects are extracted from the distance image, object images are extracted based on the positions on the image corresponding to the positions of the extracted objects on the distance image, and image recognition is performed only on the extracted object images, reducing the amount of computation. Performing object extraction on the distance image also improves the accuracy of separating objects from the background; for example, object detection accuracy can be raised to around 99%. Since the distance to an object is known, its size can be calculated easily and accurately as an object attribute, and based on this size, other attributes, for example whether a person is an adult or a child, or the model of a car, become easy to determine.
Next, a second embodiment of the present invention will be described.
As shown in FIG. 5, in the image recognition imaging apparatus of this embodiment, the distance image acquisition means comprises two cameras (camera 1 and camera 2) 10 as imaging means, and distance image detection means 11 that calculates the distance for each pixel from the parallax between the left and right images captured by these cameras 10 and generates a distance image. The two cameras 10 and the distance image detection means 11 constitute a stereo camera 12 with a 3D sensor function, from which both the distance image generated by the distance image detection means 11 and the images captured by the cameras 10 can be obtained. The stereo camera 12 yields two images with parallax; either one or both may be used.
In the second embodiment, the configuration other than the distance image acquisition means and the imaging means is the same as that of the image recognition imaging apparatus of the first embodiment described above: it includes the object extraction means 3, object image extraction means 4, image recognition means 5, control means 6, and storage means 7, and is connected to the external server 9 via the communication network 8.
In the second embodiment as well, image recognition can be performed by the same method as in the first embodiment, except that the distance image is obtained by a well-known method from the parallax of the pair of images captured by the stereo camera 12, and the same operational effects as those of the image recognition imaging apparatus of the first embodiment can be achieved.
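The well-known parallax-to-distance method referred to here is, for a rectified stereo pair, Z = f x B / d, with focal length f in pixels, baseline B in metres, and disparity d in pixels; a minimal sketch follows, with the numbers in the usage line purely illustrative.

```python
# Disparity-to-distance for a rectified stereo pair: Z = f * B / d.
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    if disparity_px <= 0:
        return float("inf")  # no measurable parallax
    return focal_px * baseline_m / disparity_px

print(disparity_to_depth(14, 700, 0.1))  # f=700 px, B=0.1 m, d=14 px -> 5.0 m
```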
The distance image acquisition means is not limited to the TOF type 3D sensor or the stereo camera, but may be another type of 3D sensor as long as it can generate a distance image corresponding to the shooting range of the imaging means. Good.
The detection and recognition system of the present embodiment is used, for example, to issue a notification when a set recognition target is recognized in an image captured by a camera.
In the following, the term "image" basically includes both moving images and still images.
As shown in FIG. 6, the detection and recognition system 101 includes a plurality of cameras 102, a server 103, and a terminal 104, which are connected to one another by a wired or wireless network 105.
The detection and recognition system 101 can be used, for example, with the cameras 102 installed as surveillance cameras inside a building such as a convenience store or outdoors; when a subject photographed by a camera 102 is recognized as a suspicious person from its appearance or movement, a notification is sent to the terminal 104 at another location. When a suspicious person is recognized in this way, a terminal held by the administrator 106 of the detection and recognition system 101, a system management device, or the like may also be notified.
As shown in FIG. 7, the camera 102 includes imaging means 120, detection/recognition means 121, recording means 122, communication means 123, and control means 124.
The imaging means 120 has, for example, a lens and a solid-state imaging element, and acquires images by imaging. The detection/recognition means 121 includes an arithmetic processing unit and memory and performs image recognition. Specifically, under the control of the detection/recognition firmware stored in the memory of the detection/recognition means 121, it detects features contained in the image captured by the imaging means 120 and recognizes the recognition target set from those features. In the following, "detection/recognition" basically refers to this process of detecting features in the image captured by the imaging means 120 and recognizing the set recognition target from those features.
The recording means 122 records reference images and other information for detection/recognition by the detection/recognition means 121, as well as images and other information (for example, audio) at the time of an abnormality (for example, when the detection/recognition means 121 recognizes the set recognition target). The communication means 123 communicates with the server 103 via the network 105, transmitting images and other information at the time of an abnormality to the server 103 and receiving commands and detection/recognition firmware from the server 103. The communication means 123 also connects via the network 105 to the terminal 104 and to the terminal held by the administrator 106, and transmits an alarm signal or the like to these terminals and to the server 103 when an abnormality occurs. The terminal 104 and the administrator's terminal sound an alarm upon receiving this alarm signal, or upon receiving a command to sound an alarm from the server 103 that received the signal.
The control means 124 includes an arithmetic processing unit and memory and controls the imaging means 120, detection/recognition means 121, recording means 122, and communication means 123. The control means 124 may share the arithmetic processing unit or memory with the detection/recognition means 121.
The camera 102 need not include all of the imaging means 120, detection/recognition means 121, recording means 122, communication means 123, and control means 124. For example, the detection and recognition system 101 may include, outside the camera 102, a terminal connected to the camera 102 by wire or wirelessly that can control the camera 102 and display images captured by it; the imaging means 120 may be placed in the camera 102 while the detection/recognition means 121, recording means 122, communication means 123, and control means 124 are provided in that terminal, which then performs detection/recognition on the images captured by the imaging means 120 of the camera 102.
The camera 102 has, for example, a configuration similar to that of a general surveillance camera; for example, the imaging means 120 captures the imaging range corresponding to the set angle of view, according to the orientation of the camera 102. The plurality of cameras 102 in the detection and recognition system 101 may be of the same type or of different types, and their imaging ranges may overlap or be entirely different.
In the present embodiment, four cameras 102 of different types are used: two stereo cameras 102a, one infrared camera 102b, and one monocular camera 102c; the imaging ranges of the four cameras 102 are assumed to overlap one another.
Using a stereo camera 102a as the camera 102 makes it possible to calculate distance, size, 3D structure, and the like from parallax, which reduces the performance required of the arithmetic processing unit used for detection/recognition; detection/recognition can thus be performed easily even if the camera does not have a high-performance arithmetic processing unit.
Using an infrared camera (near-infrared or far-infrared camera) 102b as the camera 102 makes it possible to capture near-infrared or far-infrared images and to detect/recognize things invisible to the human eye. Detection/recognition in dark environments, such as at night, also becomes easier.
The types of camera 102 are not limited to these. For example, a distance image sensor may be used as the camera 102; a TOF (Time Of Flight) sensor, for example, can serve as the distance image sensor. TOF measures distance from the time taken for a projected laser pulse to travel to the target and back.
In other words, the camera 102 may be one whose imaging means 120 captures a single two-dimensional image and performs detection/recognition on it; one whose imaging means 120 captures two images, calculates distance, size, 3D structure, and the like from their parallax, and performs detection/recognition; one whose imaging means 120 captures a 3D distance image with a TOF sensor or the like and performs detection/recognition on that image; or one whose imaging means 120 captures near-infrared or far-infrared images and performs detection/recognition on them. A single camera 102 may also include a plurality of such imaging means 120: for example, one camera 102 may have the imaging functions of both a stereo camera and an infrared camera and perform detection/recognition on the images obtained by both.
The detection/recognition means 121 recognizes the set recognition target, which may be a concrete object (including people and things other than people) or an abstract phenomenon. That is, the recognition target may be an object such as a person (a robber, thief, or arsonist) or a thing (such as a handgun), or a phenomenon such as a crime or a fire.
For example, with "robber" set as the recognition target, when the imaging means 120 of a camera 102 installed in a convenience store captures an image of a person holding a kitchen knife or handgun, the detection/recognition means 121 may detect from the image the person holding the knife or handgun, or detect the person's movements, and recognize that person as a robber. As another example, with "fire" set as the recognition target, it may detect from an image obtained by an infrared camera that the temperature at a certain location is abnormally high and recognize that a fire has broken out. If the infrared camera uses far-infrared light, temperature can be detected, and a handgun, knife, or other weapon concealed in a clothing pocket may be detected by image recognition from the temperature difference between the weapon and body heat. However, since the detection/recognition firmware of the detection/recognition means 121 is generated by machine learning in the machine learning means 130 described later, the detection/recognition means 121 does not necessarily recognize targets in a way that is easy for humans to understand.
In short, the detection/recognition means 121, under the control of the detection/recognition firmware, detects features contained in the image captured by the imaging means 120 and recognizes the recognition target set from those features.
The detection/recognition means 121 may also use audio, not only images, for detection/recognition. For example, if the camera 102 includes audio input means such as a microphone, performing detection/recognition using the audio acquired by this means can improve accuracy. Audio may likewise be used in detection/recognition by the server-side detection/recognition means 132 described later.
The detection/recognition firmware of the detection/recognition means 121 is updated with new detection/recognition firmware generated by the machine learning means 130 and the detection/recognition firmware generation means 131 described later. The initial detection/recognition firmware provided to the detection/recognition means 121 before any update may be generated by the machine learning means 130 and the detection/recognition firmware generation means 131, or may be generated by another device capable of machine learning and incorporated into the detection/recognition means 121. Detection/recognition firmware generated by a method other than machine learning may also be initially provided to the detection/recognition means 121.
The setting of the target to be recognized by the detection/recognition means 121 is assumed to be included in the detection/recognition firmware. For example, when the detection/recognition firmware is generated by the machine learning means 130 and the detection/recognition firmware generation means 131, and the target to be recognized is to be a robber in a convenience store, the machine learning means 130 is given, as teacher data, a plurality of images showing robbers committing robbery in convenience stores together with the information that these images show robbers (that is, the images are tagged as showing a robber). Through machine learning, the system learns which parts of the given images (teacher data) to attend to in order to recognize a robber. As a result of the machine learning, a detection/recognition algorithm with a high probability of recognizing a robber from an image is generated. This detection/recognition algorithm is then converted by the detection/recognition firmware generation means 131 to produce detection/recognition firmware. In other words, the detection/recognition firmware (detection/recognition algorithm) obtained by this learning knows where to look in an image to recognize whether the image contains a robber, so it can be said that a robber is set as its recognition target. Note that tagging the images is not strictly necessary for this machine learning. For example, if only images showing robbers are given as teacher data, it is possible to generate an algorithm that recognizes robbers by generating an algorithm that recognizes images whose features are close to those of the teacher images, even without the information that they depict robbers.
Note that the number of recognition targets set in the detection/recognition firmware (targets the detection/recognition firmware recognizes) is not limited to one; a plurality of targets may be set.
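As a concrete illustration of assembling the tagged and untagged teacher data described above, the following is a minimal sketch in Python. The directory layout, field names, and helper functions are illustrative assumptions, not part of the patent disclosure.

```python
# A minimal sketch of assembling teacher data: tagged images grouped in
# per-tag directories, or an untagged (single-class) image collection.
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class TeacherSample:
    image_path: Path        # image captured by a camera 102
    tag: Optional[str]      # e.g. "robber"; None for untagged data

def load_tagged_data(root: Path) -> list:
    """Collect tagged images laid out as <root>/<tag>/<image>.jpg."""
    samples = []
    for tag_dir in root.iterdir():
        if tag_dir.is_dir():
            for img in tag_dir.glob("*.jpg"):
                samples.append(TeacherSample(img, tag_dir.name))
    return samples

def load_untagged_data(root: Path) -> list:
    """Every image is implicitly a positive example, as in the
    untagged learning mentioned in the text."""
    return [TeacherSample(p, None) for p in root.glob("*.jpg")]
```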
As described above, the detection/recognition firmware recognizes a specific target, and when the detection/recognition means 121 recognizes that specific target through the detection/recognition firmware, it outputs a signal or the like indicating that recognition has occurred (for example, an alarm signal). This recognition signal is sent via the communication means 123 to the server 103, the terminal 104, the terminal held by the administrator 106, and so on, and these terminals are notified that the set target has been recognized. Alternatively, the recognition signal may be sent only to the server 103; after the server 103 comprehensively judges the information from each camera 102, alarm information, such as an e-mail stating that the recognition target has been recognized or a command to sound an alarm, may be sent from the server 103 to the terminal 104 and the like.
The four cameras 102 have imaging ranges that overlap one another, and for the overlapping portion of the recognition targets set in their detection/recognition firmware, the four cameras 102 can perform recognition simultaneously. That is, if, for example, a robber is set as the shared recognition target, the four cameras can simultaneously recognize a specific robber committing a specific robbery.
As shown in FIG. 8, the server 103 includes machine learning means 130, detection/recognition firmware generation means 131, server-side detection/recognition means 132, server-side recording means 133, server-side communication means 134, and server-side control means 135. The machine learning means 130, the detection/recognition firmware generation means 131, the server-side detection/recognition means 132, and the server-side control means 135 each have an arithmetic processing unit and a memory; they may each have their own arithmetic processing unit or memory, or they may share an arithmetic processing unit or memory.
The machine learning means 130 performs machine learning such as deep learning to generate a detection/recognition algorithm. Here, the detection/recognition algorithm is an algorithm for recognizing the set recognition target from an image captured by the imaging means 120 of a camera 102.
The detection/recognition firmware generation means 131 converts the detection/recognition algorithm generated by the machine learning means 130 into firmware executable by each camera 102, thereby generating the detection/recognition firmware. The cameras 102 differ in the resolution of the images obtainable by the imaging means 120, the performance of the arithmetic processing unit of the detection/recognition means 121, the presence or absence of a GPU (Graphics Processing Unit) for the detection/recognition means 121, the presence or absence of sound input means such as a microphone, the type of camera (stereo camera, TOF sensor, etc.), and so on, so the firmware executable by each camera 102 also differs. By having the detection/recognition firmware generation means 131 convert the machine-learned detection/recognition algorithm into firmware executable by each camera 102, a new detection/recognition program can be installed on each camera 102.
The server-side detection/recognition means 132 performs detection/recognition by comprehensively judging the situation from the images and information of the cameras 102. For example, while the detection/recognition means 121 of each camera 102 performs detection/recognition using the images acquired by that camera's own imaging means 120, the server-side detection/recognition means 132 performs detection/recognition using images acquired by a plurality of cameras 102. When processing is too heavy to perform on each camera 102, the server-side detection/recognition means 132 may also take over part of that processing. The detection/recognition firmware of the server-side detection/recognition means 132 is held in the memory of the server-side detection/recognition means 132, and it too can be updated with detection/recognition firmware generated by the machine learning means 130 and the detection/recognition firmware generation means 131.
From the recognition results of the detection/recognition means 121 of the four cameras 102 (cameras 102a, 102b, 102c), the server-side detection/recognition means 132 may also judge whether each camera's recognition of the target is correct, or judge the probability that each camera's recognition is correct. Based on this judgment, alarm information or the like may be sent to the terminal 104 and so on. For example, when all four cameras report that they have recognized the set target (for example, a robbery), the server-side detection/recognition means 132 may judge the recognition to be correct and command the terminal 104 or the like to sound an alarm. The content of the alarm information may also be varied according to the number of cameras that recognized the set target. For example, if all four cameras recognized the set target, the recognition is judged correct and the server-side detection/recognition means 132 commands the terminal 104 to sound a loud alarm, whereas if three or fewer cameras recognized it, the recognition is judged to be only possibly correct and the server-side detection/recognition means 132 commands the terminal 104 to sound a quiet alarm.
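The alarm decision just described reduces to counting how many cameras reported the set target. The following is a minimal sketch of that logic; the function name, camera ids, and thresholds are illustrative assumptions.

```python
# Decide the alarm level from per-camera recognition reports.
def decide_alarm(reports: dict, total_cameras: int = 4) -> str:
    """reports maps camera id -> whether it recognized the set target."""
    hits = sum(reports.values())
    if hits == total_cameras:
        return "loud_alarm"    # all cameras agree: recognition judged correct
    if hits > 0:
        return "quiet_alarm"   # partial agreement: possibly correct
    return "no_alarm"

# Example: three of four cameras recognized the target.
print(decide_alarm({"102a": True, "102b": True, "102c": True, "102d": False}))
# -> "quiet_alarm"
```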
The server-side detection/recognition means 132 also judges misrecognition (recognition failure) by a camera 102 from the recognition results of the detection/recognition means 121 of the plurality of cameras 102. For example, if three of the four cameras 102 report through their detection/recognition means 121 that they have recognized the set target but one camera 102 does not, it may judge that this one camera 102 has misrecognized (failed to recognize). Conversely, if three of the four cameras 102 report no recognition of the set target but one camera 102 does report recognition, it may judge that this one camera 102 has misrecognized.
The result of detection/recognition by the server-side detection/recognition means 132 may also be compared with the detection/recognition result of each camera 102 to judge misrecognition (recognition failure) by each camera 102.
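The misrecognition judgment above is essentially a majority vote among the cameras. The following is a minimal sketch; the names are illustrative assumptions, not taken from the patent.

```python
# Flag cameras whose result disagrees with the majority of cameras.
def find_outliers(reports: dict) -> list:
    """Return camera ids whose result disagrees with the majority."""
    hits = sum(reports.values())
    majority = hits * 2 > len(reports)   # True if most cameras recognized
    return [cam for cam, hit in reports.items() if hit != majority]

# Example: camera 102c failed to recognize what the other three saw.
print(find_outliers({"102a": True, "102b": True, "102c": False, "102d": True}))
# -> ["102c"]
```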
The server-side recording means 133 records the teacher data and the like for the machine learning performed by the machine learning means 130. The server-side communication means 134 communicates with each camera 102 via the network 105: it receives images and other information from each camera 102, transmits commands and detection/recognition firmware to each camera 102, and transmits alarm information to the terminal 104 and the administrator 106 in abnormal situations (when the set target has been recognized).
Next, a method of updating the detection/recognition firmware of such a detection recognition system 101 will be described with reference to the flowchart of FIG. 9.
A camera 102 acquires an image with the imaging means 120 and performs recognition (detection/recognition) of the set recognition target under the control of the detection/recognition firmware of the detection/recognition means 121. When recognition fails, the image at the time of the failed recognition is transmitted to the server 103 (step S11). Sound data and the like at the time of the failed recognition may be transmitted together with the image.
Whether recognition failed is judged, as described above, by the server-side detection/recognition means 132 from the recognition results of the plurality of cameras 102. For example, in a system in which a camera 102 that recognizes the set recognition target (for example, a robber) notifies the server 103 of the recognition, if cameras 102a and 102b notified the server 103 of recognition but camera 102c did not, the server-side detection/recognition means 132 judges from these notification results that camera 102c failed to recognize. In this case, the control means of the server 103 commands camera 102c to transmit to the server 103, as images at the time of failed recognition, the images that camera 102c acquired at the same time as, or around (for example, several seconds to several minutes before and after), the time at which cameras 102a and 102b acquired the images in which they recognized the target. On receiving this command, camera 102c transmits the images at the time of the failed recognition to the server 103.
Whether recognition failed may also be judged by a person. For example, the detection recognition system 101 may include a terminal having display means for showing images captured by the cameras 102 and input means such as a pointing device or keyboard; when a camera 102 fails to recognize a robber, a person checks the images captured by the camera 102 on the terminal's display means, selects with the terminal's input means the image in which the robber should have been recognized, and that image is transmitted to the server 103 as the image at the time of the failed recognition.
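The server's request to camera 102c amounts to selecting frames within a time window around the moment the other cameras recognized the target. A minimal sketch of that selection follows; the data structure and the window width are illustrative assumptions.

```python
# Select the frames a camera should send as failed-recognition images.
from datetime import datetime, timedelta

def frames_around(frames: dict,
                  t_recognized: datetime,
                  window: timedelta = timedelta(seconds=30)) -> list:
    """frames maps capture time -> encoded frame; return those captured
    within +/- window of the time the other cameras recognized the target."""
    return [img for t, img in sorted(frames.items())
            if abs(t - t_recognized) <= window]
```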
The server-side control means 135 records the failed-recognition images sent from the camera 102 in the server-side recording means 133 as teacher data (training data). Along with recording the failed-recognition images, the recognition result that the detection/recognition means 121 should have produced (for example, that a robber should have been recognized in the image) is also recorded in the server-side recording means 133 as teacher data.
The desired recognition result recorded as this teacher data may be produced by the server 103 or sent from the camera 102. For example, when the server-side detection/recognition means 132 judges from the recognition results of the plurality of cameras 102 whether recognition failed, the server-side detection/recognition means 132 may create the presumably correct recognition result (the result the detection/recognition means 121 should have produced) as teacher data and record this teacher data in the server-side recording means 133. Alternatively, when a person checks the images captured by a camera 102 and judges whether recognition failed, the person may, when selecting on the terminal described above the image in which the robber should have been recognized, also input with the terminal's input means the fact that a robber should have been recognized (that the image shows a robber); this is transmitted to the server 103 together with the failed-recognition image, and the server-side control means 135 may record the transmitted data in the server-side recording means 133 as teacher data.
The machine learning means 130 reads the teacher data recorded in the server-side recording means 133 (step S12). The machine learning means 130 then extracts feature points by convolution operations from the failed-recognition images included in the read teacher data (step S13). The machine learning means 130 performs machine learning from the extracted feature points and the recognition result that the detection/recognition means 121 should have produced (step S14). As a result of the machine learning, a detection/recognition algorithm, a neural network that performs the detection/recognition processing, is generated (step S15).
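The following is a minimal sketch, assuming PyTorch, of steps S12 to S15: a small convolutional network trained on failed-recognition images paired with the desired results. The architecture, data shapes, and hyperparameters are illustrative assumptions; the patent does not specify them.

```python
import torch
import torch.nn as nn

class DetectRecognizeNet(nn.Module):
    def __init__(self, num_targets: int = 2):
        super().__init__()
        self.features = nn.Sequential(        # step S13: convolutional feature extraction
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, num_targets)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train(model, loader, epochs: int = 10):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                    # step S14: learn from teacher data
        for images, desired_labels in loader:  # failed images + desired results
            opt.zero_grad()
            loss_fn(model(images), desired_labels).backward()
            opt.step()
    return model    # step S15: the new detection/recognition algorithm
```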
The machine learning by the machine learning means 130 is performed so that the detection/recognition algorithm (detection/recognition firmware) is optimized for each camera 102. The cameras 102 may differ in type, and even cameras with identical characteristics may differ in installation location and operating environment, so the optimal algorithm can differ because of these differences. Based on the original detection/recognition algorithm and the teacher data, the machine learning means 130 generates a new detection/recognition algorithm capable of producing, from the failed-recognition images in the teacher data, the recognition result in the teacher data that the detection/recognition means 121 should have produced. The original detection/recognition algorithm used for the machine learning may be kept in the server-side recording means 133, or the detection/recognition firmware may be transmitted from the camera 102 and converted back into a detection/recognition algorithm for use. In other words, the machine learning means 130 generates a new detection/recognition algorithm from the teacher data and the detection/recognition algorithm used in the detection/recognition firmware of the camera 102 whose detection/recognition failed.
The detection/recognition firmware generation means 131 converts the detection/recognition algorithm generated by the machine learning means 130 into detection/recognition firmware, the detection/recognition software for each camera (step S16). That is, the detection/recognition algorithm is converted by the detection/recognition firmware generation means 131 into software in a format executable by each camera.
The server-side communication means 134 transmits the detection/recognition firmware generated by the detection/recognition firmware generation means 131 to the camera 102 (step S17). When the camera 102 receives the detection/recognition firmware, the control means 124 of the camera 102 updates the firmware of the detection/recognition means 121 to the new detection/recognition firmware.
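As a rough illustration of the per-camera conversion in step S16, the sketch below packages the same trained algorithm differently depending on a camera's capabilities. The capability fields, the quantization choice, and the packaging format are illustrative assumptions; real firmware generation would be toolchain-specific.

```python
import json
from dataclasses import dataclass

@dataclass
class CameraProfile:
    camera_id: str
    has_gpu: bool
    input_resolution: tuple   # (width, height) expected by the camera

def export_firmware(weights: list, profile: CameraProfile) -> bytes:
    """Convert the trained algorithm into a camera-specific package."""
    # Cameras without a GPU get coarsely quantized weights to suit a
    # weaker arithmetic processing unit; GPU cameras keep full precision.
    payload = weights if profile.has_gpu else [round(w, 2) for w in weights]
    header = {"camera": profile.camera_id,
              "resolution": list(profile.input_resolution),
              "precision": "float" if profile.has_gpu else "quantized"}
    return (json.dumps(header) + "\n" + json.dumps(payload)).encode()
```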
According to the detection recognition system of the present embodiment, the detection/recognition firmware of the detection/recognition means 121 of a camera 102 can be updated to the new detection/recognition firmware generated by the machine learning means 130 and the detection/recognition firmware generation means 131 of the server 103.
Because the machine learning by the machine learning means 130 uses as teacher data the images for which the detection/recognition means 121 of the camera 102 failed to recognize the set recognition target, machine learning with this teacher data improves the detection/recognition algorithm so that recognition of the set target is no longer missed for those images. The detection/recognition performance of the camera 102 can therefore be improved.
Furthermore, since the machine learning is performed on the server 103 and the camera 102 need only execute the detection/recognition firmware generated by the server 103, the detection/recognition firmware can be updated to enable highly accurate detection/recognition even if the computing power of the camera 102 is not particularly high. Moreover, rather than becoming relatively lower-performing than other cameras as the years pass, a camera can gradually improve in performance with use. It also becomes possible to keep improving the performance of a camera 102 so that its detection/recognition suits the environment in which it is used.
Also, because the machine learns on its own, it becomes possible to recognize the set recognition target even in cases a person could not notice. For example, by giving as training data not images of a robber actually committing robbery but images of the robber from before the robbery actually took place, it becomes possible to generate a detection/recognition algorithm that, rather than recognizing a robber during an actual robbery, finds the characteristics of people highly likely to commit robbery in the future from, for example, the behavior of people loitering in or around a convenience store, and recognizes such a person as a robber (a person highly likely to become one). Note that since the machine learning means 130 determines which features are actually attended to for recognition, the algorithm will not necessarily recognize likely robbers from their behavior.
Also, according to the detection recognition system of the present embodiment, since the imaging ranges of the four cameras 102 overlap one another, the overlapping portion of the recognition targets set in the detection/recognition firmware can be recognized by the four cameras 102 simultaneously. Therefore, even if some of the four cameras 102 cannot recognize the target, the other cameras can, which raises the likelihood that detection/recognition succeeds and improves the detection/recognition accuracy of the system as a whole.
The four cameras 102 also include cameras with different types of imaging means 120: the stereo camera 102a, the infrared camera 102b, and the monocular camera 102c. Therefore, even when detection/recognition is difficult for the stereo camera 102a, for example, the infrared camera 102b may still manage it, improving the detection/recognition accuracy of the system as a whole compared with using cameras 102 all of the same type.
Note that the plurality of cameras 102 may be installed in places with entirely different imaging ranges, or may recognize entirely different recognition targets.
Also, from the recognition results of the detection/recognition means 121 of the four cameras 102, the server-side detection/recognition means 132 can judge whether each camera's recognition of the target is correct, judge the probability that each camera's recognition is correct, and judge misrecognition (recognition failure) by a camera 102. It is therefore possible, for example, to have the terminal 104 or the like emit an alarm sound only when the server-side detection/recognition means 132 judges, from the detection/recognition results of the individual cameras 102, that the recognition is correct.
Misrecognition by a camera 102 can also be judged automatically, and machine learning can be run automatically to improve the detection/recognition ability of the camera 102 that misrecognized. Since the images misrecognized at that time can be used as the teacher data for this machine learning, the system can learn not to fail again on those very images. Misrecognition can thus be judged automatically, and the accuracy of detection/recognition can be raised as the cameras 102 are used.
The timing of the machine learning may be adjusted as needed. For example, teacher data may be accumulated in the recording means 122 or the server-side recording means 133, and machine learning may be performed when a certain amount has accumulated or when a certain period has elapsed.
The machine learning may also be performed using images other than those captured by the imaging means 120. When the number or quality of teacher data is insufficient with only the images captured by the imaging means 120, the effect of the machine learning can be improved by giving the machine learning means 130 other images.
The recognition targets recognized by the cameras 102 are not limited to those described above; any target that can be detected/recognized from the images captured by the imaging means 120 will do.
Next, an embodiment of the object distance detection device of the present invention will be described.
The object distance detection device of the present embodiment uses a fisheye-lens stereo camera as a camera mainly for monitoring, such as a surveillance camera or an in-vehicle camera. It is not intended to output stereoscopic video; rather, it generates a distance image in which each pixel represents the distance from the camera to the subject, and enables monitoring tasks such as detecting suspicious persons to be performed automatically by image recognition on the distance image.
As shown in FIG. 10, the object distance detection device includes: a pair of fisheye cameras 211, each having a fisheye lens unit 221, a color filter 222, an imaging sensor 223, and the like; a pair of image input units 212 that receive the image signals from the imaging sensors 223 of the pair of fisheye cameras 211; a pair of image signal correction processing units 213 that perform image signal correction processing to remove the fisheye-lens distortion of the input images; a correction parameter calculation unit 215 that derives from the image signals the correction parameters used in the image signal correction processing; an image analysis unit (distance image calculation unit) 214 that obtains a distance image from the corrected image signals from which the fisheye-lens distortion has been removed; and a suspicious person detection unit 216, serving as a distance image recognition unit, that performs image recognition on the distance image generated by the image analysis unit 214 and automatically carries out monitoring tasks such as detecting suspicious persons.
The pair of fisheye cameras 211 constitutes a stereo camera, and each outputs as an image signal the image formed by the fisheye lens unit 221 on the imaging sensor 223 through the color filter 222. The images are output as video.
The fisheye lens unit 221 is a fisheye lens in that it adopts a projection method other than the central projection method; in the present embodiment it is a lens unit adopting the equidistant projection method. The projection method of the fisheye lens unit 221 is not limited to equidistant projection, and any projection method other than central projection may be adopted; for example, a lens unit of any projection method other than the central projection method may be used as the fisheye lens. In the present embodiment the angle of view of the fisheye lens unit 221 is 180 degrees, but it may be, for example, from about 160 degrees to about 200 degrees.
The pair of fisheye cameras 211 are, for example, arranged side by side with the optical axes of their fisheye lens units 221 parallel, each with an angle of view of 180 degrees, so that each captures the other fisheye lens unit 221 in its image. This makes it possible to use epipolar geometry for the corresponding-point search described later.
The output from the imaging sensor 223 of each fisheye camera 211 is input through the image input unit 212 of the object distance detection device, and the image signal correction processing unit 213 performs color synchronization (demosaicing) processing, white balance processing, gamma processing, color matrix processing, luminance matrix processing, color difference/luminance processing, and the like. In the present embodiment a color image is not necessarily output to a monitor or the like, so, for example, the color filter 222 may be omitted and color-related processing need not be performed. As described later, feature points are extracted from the two images by image recognition and corresponding points are searched for based on the feature points; so if, for image recognition, a color image gives a higher recognition rate than a luminance (grayscale) image, a color image may be generated from the image signal as described above. The parameters based on the image signal required by the image signal correction processing unit 213 are calculated from the image signal by the correction parameter calculation unit 215.
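As a small illustration of two of the correction steps named above, the following is a minimal sketch, assuming NumPy, of gray-world white balance and gamma processing; the gain scheme and gamma value are illustrative assumptions.

```python
import numpy as np

def white_balance(img: np.ndarray) -> np.ndarray:
    """Gray-world white balance on an RGB image with values in [0, 1]."""
    gains = img.mean() / img.reshape(-1, 3).mean(axis=0)  # per-channel gains
    return np.clip(img * gains, 0.0, 1.0)

def gamma_correct(img: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Encode linear sensor values for display-referred processing."""
    return np.power(img, 1.0 / gamma)
```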
As shown in FIG. 11, the image analysis unit 214 receives the images corrected as described above, and an image conversion unit 231 serving as a distortion removal unit performs, as needed, image conversion that removes the fisheye-lens distortion; for example, converting an equidistant projection image into a central projection image. Distortion removal can be performed by well-known methods; for example, a known integrated circuit for distortion-removal image conversion is used. Next, an image selection unit 232 serving as a partitioning unit and resolution conversion unit performs partitioning, dividing the distortion-removed converted image into set partitions.
This partitioning is not limited to partitioning by straight lines and may be partitioning by curves. For example, when an equidistant projection image is used without image conversion, partitioning by curves is preferable.
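The equidistant-to-central conversion mentioned above can be done by inverse mapping: an equidistant fisheye places a ray at radius r = f*theta, while central (perspective) projection places it at r = f*tan(theta). The following is a minimal sketch, assuming NumPy, with nearest-neighbor sampling; the shared focal length handling is an illustrative simplification.

```python
import numpy as np

def equidistant_to_perspective(src: np.ndarray, f: float) -> np.ndarray:
    """Resample an equidistant fisheye image into a perspective image."""
    h, w = src.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    r_persp = np.hypot(dx, dy)          # radius of each output pixel
    theta = np.arctan(r_persp / f)      # incidence angle: r_persp = f*tan(theta)
    r_fish = f * theta                  # equidistant source radius: r = f*theta
    scale = np.ones_like(r_persp)
    np.divide(r_fish, r_persp, out=scale, where=r_persp > 0)
    src_x = np.clip(np.round(cx + dx * scale).astype(int), 0, w - 1)
    src_y = np.clip(np.round(cy + dy * scale).astype(int), 0, h - 1)
    return src[src_y, src_x]
```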
Here, FIG. 13 shows the partitions K11 to K33 when the converted image is rectangular and the pair of fisheye cameras 211 is mounted, for example, on a ceiling with the optical axes pointing vertically (downward). As for the area of each partition, the central partitions K11, K12, and K21 are large, while the peripheral partitions K13, K23, K31, K32, and K33 are small. As for resolution, the central partitions K11, K12, and K21 have low resolution, while the peripheral partitions K13, K23, K31, K32, and K33 have high resolution. For example, by giving every partition the same pixel count, 400 x 200, the central partitions K11, K12, and K21 end up with low resolution while the peripheral partitions K13, K23, K31, and K32 end up with high resolution.
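The per-partition resolution reduction can be as simple as decimating the central blocks while leaving the peripheral blocks intact. A minimal sketch, assuming NumPy, follows; the block geometry and decimation factor are illustrative assumptions.

```python
import numpy as np

def decimate_partition(block: np.ndarray, is_central: bool) -> np.ndarray:
    """Drop every other pixel in central partitions; keep peripheral ones."""
    return block[::2, ::2] if is_central else block

# Example: a 400x400 central partition becomes 200x200.
central = np.zeros((400, 400), dtype=np.uint8)
print(decimate_partition(central, is_central=True).shape)  # (200, 200)
```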
When distortion has been corrected by image conversion before partitioning, the parts of the image periphery that were compressed by the distortion have been stretched out, so after the conversion the central part of the image has higher resolution than the periphery; thinning out pixels to lower the resolution in the central part therefore has less impact than lowering the resolution at the periphery. Accordingly, pixels are thinned out in the central partitions K11, K12, and K21 to lower their resolution, while pixels are not thinned out in the peripheral partitions K13, K23, K31, K32, and K33, whose resolution is thus not lowered. Alternatively, pixels may be thinned out in all the partitions K11 to K33, with more thinning in the central partitions K11, K12, and K21 than in the peripheral partitions K13, K23, K31, K32, and K33. To shorten processing time, the area of the central partitions K11, K12, and K21 whose resolution is lowered is made larger than that of the peripheral partitions K13, K23, K31, K32, and K33 whose resolution is not lowered. When distortion-removing correction is instead performed after the image has been divided into partitions, the image is distorted so as to be compressed at the periphery, and to correct this distortion the area of the peripheral partitions is enlarged; therefore, at partitioning time the peripheral partitions are given a smaller area than the central part. In that case, since distortion is removed after partitioning, lowering the resolution at the heavily distorted periphery would greatly degrade the image and could adversely affect distance calculation and image recognition. Therefore, even when distortion is removed after partitioning, it is preferable to thin out more pixels in the central part than at the periphery.
FIG. 14 shows the partitions K11 to K42 when the converted image is rectangular and the pair of fisheye cameras 211 is mounted, for example, at a position higher than a person's height with the optical axes pointing horizontally or obliquely downward from horizontal. The central and lower partitions K11 to K33 are the same as those of the vertically oriented fisheye cameras 211 described above, with the same resolution and area settings for each of the partitions K11 to K33.
In contrast, the upper partitions K41 and K42 above the center of the image capture, for example, the sky outdoors or the ceiling indoors, so they are of low importance and have lower resolution and larger area than the central partition K11. Among the upper partitions, the upper central partition K41 has the lowest resolution and the largest area, whereas the upper left and right partitions K42 have higher resolution and smaller area than partition K41.
FIGS. 13 and 14 show examples of partitioning. The basic principle is to lower the resolution and enlarge the area of the central partitions based on the image distortion produced by the fisheye lens unit 221 of the fisheye camera 211, and, by comparison, to raise the resolution and reduce the area at the periphery; the resolution and area can then be adjusted according to importance based on the placement of the fisheye camera 211. That is, a fisheye camera 211 with an angle of view of about 180 degrees or more has a wide imaging range, so, for example, positions where no person could be present may fall within the imaging range, and for a surveillance camera that monitors people it is preferable to lower the resolution of such parts to improve processing speed.
The image selection unit 232 shown in FIG. 11 selects the same partitions K11 to K33 in each of the pair of images. When a pair of partitions has been selected, the corresponding point selection unit 233, serving as the corresponding point search unit, extracts feature points from each image by image recognition and associates the feature points between the images, as described above. Epipolar geometry can be used here, and pairs of corresponding points in the two images are determined and selected in sequence. When the selection of corresponding points for one pair of partitions is finished, corresponding points are extracted in the next pair of partitions. When corresponding points have been extracted in all pairs of partitions, the distance calculation unit 234 then obtains the distance from the fisheye cameras 211 to the object point corresponding to each pair of corresponding points, from the difference in the positions of the corresponding points on the images and the distance between the pair of fisheye cameras 211. This is distance calculation based on so-called triangulation; when parallax is used and, unlike in general triangulation, the baseline distance (the distance between the pair of cameras) is very small compared with the distance to the object, so that the difference in distance from each point (each fisheye camera 211) to the object point does not matter, the distance to the object point is obtained by dividing the baseline distance by the parallax (in radians).
The three-dimensional coordinates of an object point in real space can also be calculated from the coordinate positions (projection positions) of its corresponding points on the two-dimensional coordinates of each image, the distance between the cameras, and the focal length of the cameras; and the distance from a camera to the object point can be calculated from the three-dimensional coordinate positions of the object point and the camera in real space.
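The small-angle formula stated above, distance = baseline / parallax with the parallax in radians, is illustrated in the minimal sketch below; the numerical values are illustrative assumptions.

```python
import math

def distance_from_parallax(baseline_m: float, parallax_rad: float) -> float:
    """Valid when the baseline is much smaller than the object distance."""
    return baseline_m / parallax_rad

# Example: 10 cm baseline, corresponding points 0.5 degrees apart.
print(distance_from_parallax(0.10, math.radians(0.5)))  # about 11.5 m
```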
The flowchart of FIG. 12 shows the processing in the image analysis unit 214 described above: after the pair of fisheye cameras 211 captures images and the images are corrected, the image analysis unit 214 takes as input a pair of one-frame images and calculates a distance image indicating, for each corresponding point serving as a pixel, the distance from the fisheye cameras 211 to the object point. The flowchart of FIG. 12 shows the case in which distortion-removal processing is performed after partitioning. As shown in the flowchart, when corrected images, one frame from each of the pair of fisheye cameras 211, are input, partitioning is performed (step S21). In partitioning, the image is divided into a plurality of partitions and processing to lower the resolution of each partition (for example, reduction processing) is performed, with the resolution changed per partition. The area of each partition is also adjusted per partition. Basically, resolution is made low and area large in the central partitions of the image, while at the image periphery the partition resolution is made higher and the partition area smaller than in the center.
After partitioning, distortion is removed in each partition (step S22). An already established method is used to remove the fisheye-lens distortion of the image.
Next, feature points (singular points) to serve as corresponding points are extracted in each partition (step S23). As feature points, singular points such as edges are detected by well-known edge detection, for example. A corresponding point is a point on an image corresponding to an object point in real space; points in the two images corresponding to the same object point are corresponding points, and if each point in real space appears in both images, corresponding points exist. It is preferable to search for many corresponding points using methods such as the epipolar geometry described above.
Next, the distance from the fisheye cameras 211 to the object point in real space corresponding to each pair of corresponding points is calculated based on the difference in the positions of the corresponding points on the images and the distance between the fisheye cameras 211 (step S24). Then a distance image in which the obtained distance of each corresponding point is the value of each pixel is generated and output (step S25).
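The following is a minimal sketch, assuming OpenCV, of steps S23 and S24 for one pair of partitions: feature (singular) point detection, matching between the two images, and per-match distance from the angular disparity. ORB is used here as a stand-in for the unspecified feature detector, and the pixel-to-radian conversion factor and baseline are illustrative assumptions.

```python
import cv2
import numpy as np

def partition_distances(left: np.ndarray, right: np.ndarray,
                        baseline_m: float, rad_per_px: float) -> list:
    orb = cv2.ORB_create()                  # step S23: feature (singular) points
    kp_l, des_l = orb.detectAndCompute(left, None)
    kp_r, des_r = orb.detectAndCompute(right, None)
    if des_l is None or des_r is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    distances = []
    for m in matcher.match(des_l, des_r):   # corresponding point pairs
        dx = kp_l[m.queryIdx].pt[0] - kp_r[m.trainIdx].pt[0]
        parallax = abs(dx) * rad_per_px     # disparity converted to radians
        if parallax > 0:
            distances.append(baseline_m / parallax)  # step S24
    return distances
```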
This distance image is then used for monitoring. Image recognition is performed on the distance image to detect people, or to detect living things and objects (vehicles and the like) other than people. For people, a person may be identified from the registered three-dimensional shape of a person's face or from photographs, or children may be distinguished from adults, ages distinguished, or men distinguished from women from the three-dimensional shape of the person detected in the distance image, for example height, build, and the shape of clothing. Shapes for each model of vehicle, such as automobiles and motorcycles, may also be registered so that vehicles are detected and the vehicle model identified.
According to the object distance detection device of the present embodiment, distance detection is performed by a stereo camera comprising a plurality of fisheye cameras 211 with an angle of view of, for example, about 180 degrees, so the distance to each object point in real space can be obtained and a distance image output. However, because the angle of view is wide, the amount of information in one frame is large and the processing amount of arithmetic operations, including the image recognition needed for distance detection, is large; depending on the capability of the arithmetic processing circuit, the processing may take a long time, the processing time per frame becomes long, and video processing becomes difficult. There are also many object points in the fisheye camera 211 image for which distance can be calculated. And in the image periphery, where the amount of information (object points) is greater than in the center, the image is more distorted than in the center and the resolution is not necessarily high. Meanwhile, the fisheye camera 211 image contains many parts of low importance for monitoring tasks, such as the sky, the ceiling, the floor, and the ground. Therefore, by partitioning off the parts with small distortion and high resolution and the parts of low importance, and calculating distance with the resolution of those parts lowered, the processing amount can be reduced and the processing time shortened without greatly lowering the accuracy of the distance calculation. This allows image recognition in a monitoring device using distance images obtained by a stereo camera having a plurality of fisheye cameras 211 to proceed smoothly.
Here, the pixel regions D of the distance image output from the object distance detection device will be described. In the distance image, each point on the image is represented by its distance from the stereo camera; in the present embodiment the distance image is represented by changes in color shading according to the distance value. The color change may be, for example, a black-and-white gradation or a gradation of another color. When the distance image is recognized mechanically, each point on the image may instead be represented by a numerical value indicating the distance described above. As parts of the distance image corresponding to the partitioned image shown in FIG. 13, FIG. 15(a) shows part of partition K11 and FIG. 15(b) shows part of partition K33. The partitions K11 to K33 are arranged on one distance image in the same way as in FIG. 13, and the resolution differs from partition to partition. However, as shown in FIG. 15, the size of the pixel P (the parts delimited by both double and dotted lines), the minimum unit on the distance image, is the same throughout. The pixel P is, for example, a pixel of the monitor that displays the distance image.
In the distance image, the image is divided into pixel regions D (the parts delimited by double lines), and each pixel region D consists of one pixel P or a plurality of pixels P. Each pixel region D is given, for example, a black-and-white shade according to the distance from the stereo camera to the corresponding point (the imaged object), so the distance image is represented by the color (shade) corresponding to the distance of each pixel region D. Here, corresponding to the resolution of partition K11 being lower than that of partition K33, each pixel region D in partition K11 has four pixels whereas each pixel region D in partition K33 has two; the pixel regions D of the low-resolution partition K11 have more pixels P and a larger area than the pixel regions D of the high-resolution partition K33. Within one distance image there is no need to change the size of the minimum-unit pixel from partition to partition; differences in resolution can be accommodated by changing the number of pixels P in a pixel region D, so that, for example, images of partitions with different resolutions can be displayed on one monitor at approximately the same display magnification.
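The rendering just described, one distance value painted over a block of monitor pixels whose size depends on the partition's resolution, can be sketched as follows. This is a minimal sketch assuming NumPy; the square block shape, the shade mapping, and the maximum range are illustrative assumptions.

```python
import numpy as np

def render_region(canvas: np.ndarray, top: int, left: int,
                  region_px: int, distance_m: float, max_m: float = 20.0):
    """Paint one pixel region D as a region_px x region_px block of
    monitor pixels P, shaded by distance (near = dark, far = light)."""
    shade = np.uint8(255 * min(distance_m, max_m) / max_m)
    canvas[top:top + region_px, left:left + region_px] = shade

canvas = np.zeros((8, 8), dtype=np.uint8)
render_region(canvas, 0, 0, region_px=4, distance_m=5.0)   # coarse region (low-res partition)
render_region(canvas, 0, 4, region_px=2, distance_m=12.0)  # fine region (high-res partition)
```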
1  Image sensor (imaging means)
2  3D sensor (distance image acquisition means, distance measurement means)
3  Object extraction means (distance image recognition object extraction means)
4  Object image extraction means (recognition object image extraction means)
5  Image recognition means
10 Camera (imaging means)
11 Distance image detection means (distance image acquisition means)
12 Stereo camera (distance image acquisition means)

Claims (4)

  1.  An image recognition imaging apparatus comprising:
      imaging means for capturing an image;
      distance image acquisition means for acquiring a distance image in which each pixel of a range corresponding to the imaging range of the image is represented by a distance to a photographed target;
      distance image recognition object extraction means for extracting a recognition object on the distance image from the distance image on the basis of distance;
      recognition object image extraction means for extracting, from the image, a partial image constituting the recognition object on the basis of the range on the distance image of the recognition object extracted from the distance image; and
      image recognition means for performing image recognition on the partial image and identifying the recognition object.
  2.  The image recognition imaging apparatus according to claim 1, wherein the distance image acquisition means includes distance measurement means for measuring the distance of each pixel of the distance image.
  3.  The image recognition imaging apparatus according to claim 1, wherein the distance image acquisition means obtains the distance image on the basis of the parallax between two of the imaging means.
  4.  The image recognition imaging apparatus according to any one of claims 1 to 3, wherein the imaging means, the distance image acquisition means, the distance image recognition object extraction means, the recognition object image extraction means, and the image recognition means are provided in a single housing.
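Read as a data flow, claim 1 can be sketched roughly as below; the distance band, the connected-component grouping, and the classifier are placeholders chosen for the example, with only the overall order of operations taken from the claim:

```python
import numpy as np

def extract_by_distance(distance_image, d_near=0.5, d_far=10.0):
    """Distance image recognition object extraction: keep pixels whose
    distance lies in an assumed band of interest and return bounding boxes
    of 4-connected regions as (y0, x0, y1, x1)."""
    mask = (distance_image >= d_near) & (distance_image <= d_far)
    visited = np.zeros_like(mask, dtype=bool)
    boxes = []
    for y, x in zip(*np.nonzero(mask)):
        if visited[y, x]:
            continue
        stack, ys, xs = [(y, x)], [], []
        visited[y, x] = True
        while stack:                            # flood fill one region
            cy, cx = stack.pop()
            ys.append(cy)
            xs.append(cx)
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not visited[ny, nx]):
                    visited[ny, nx] = True
                    stack.append((ny, nx))
        boxes.append((min(ys), min(xs), max(ys) + 1, max(xs) + 1))
    return boxes

def recognize_objects(image, distance_image, classify):
    """Claim 1 as a pipeline: extract recognition objects from the distance
    image, crop the matching partial images from the captured image, and run
    image recognition only on those crops."""
    results = []
    for y0, x0, y1, x1 in extract_by_distance(distance_image):
        partial = image[y0:y1, x0:x1]     # recognition object image extraction
        results.append(classify(partial)) # image recognition means
    return results
```

A real implementation would replace `classify` with, for example, a trained neural-network recognizer, and would typically filter or merge the extracted boxes before recognition; the benefit claimed is that recognition runs only on the partial images rather than on the whole frame.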
PCT/JP2017/042578 2016-11-29 2017-11-28 Image recognition imaging apparatus WO2018101247A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
JP2016-231534 2016-11-29
JP2016231534A JP7162412B2 (en) 2016-11-29 2016-11-29 detection recognition system
JP2017-052831 2017-03-17
JP2017052831A JP2018156408A (en) 2017-03-17 2017-03-17 Image recognizing and capturing apparatus
JP2017146497A JP6860445B2 (en) 2017-07-28 2017-07-28 Object distance detector
JP2017-146497 2017-07-28

Publications (1)

Publication Number Publication Date
WO2018101247A1 true WO2018101247A1 (en) 2018-06-07

Family

ID=62242183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/042578 WO2018101247A1 (en) 2016-11-29 2017-11-28 Image recognition imaging apparatus

Country Status (1)

Country Link
WO (1) WO2018101247A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003028635A (en) * 2001-07-16 2003-01-29 Honda Motor Co Ltd Image range finder
JP2003044995A (en) * 2001-07-26 2003-02-14 Nissan Motor Co Ltd Device and method for discriminating body kind
JP2011039833A (en) * 2009-08-12 2011-02-24 Fujitsu Ltd Vehicle detector, vehicle detection program, and vehicle detection method
JP2015520433A (en) * 2012-03-26 2015-07-16 ティーケー ホールディングス インコーポレーテッド Range-cue object segmentation system and method
JP2015125760A (en) * 2013-12-27 2015-07-06 日立建機株式会社 Mine work machine
JP2015195018A (en) * 2014-03-18 2015-11-05 株式会社リコー Image processor, image processing method, operation support system, and program
JP2017052498A (en) * 2015-09-11 2017-03-16 株式会社リコー Image processing device, object recognition apparatus, equipment control system, image processing method, and program

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240249A1 (en) * 2017-02-23 2018-08-23 Hitachi, Ltd. Image Recognition System
US10636161B2 (en) * 2017-02-23 2020-04-28 Hitachi, Ltd. Image recognition system
JP2021532648A (en) * 2018-07-31 2021-11-25 ウェイモ エルエルシー Hybrid time-of-flight imager module
JP7321246B2 (en) 2018-07-31 2023-08-04 ウェイモ エルエルシー Hybrid time-of-flight imager module
CN111047637A (en) * 2018-10-12 2020-04-21 富华科精密工业(深圳)有限公司 Monocular distance measuring device
CN111047637B (en) * 2018-10-12 2023-06-27 深圳富联富桂精密工业有限公司 Monocular distance measuring device
WO2020116195A1 (en) * 2018-12-07 2020-06-11 ソニーセミコンダクタソリューションズ株式会社 Information processing device, information processing method, program, mobile body control device, and mobile body
JP7320001B2 (en) 2018-12-07 2023-08-02 ソニーセミコンダクタソリューションズ株式会社 Information processing device, information processing method, program, mobile body control device, and mobile body
CN113168691A (en) * 2018-12-07 2021-07-23 索尼半导体解决方案公司 Information processing device, information processing method, program, mobile body control device, and mobile body
JPWO2020116195A1 (en) * 2018-12-07 2021-10-21 ソニーセミコンダクタソリューションズ株式会社 Information processing device, information processing method, program, mobile control device, and mobile
EP3893497A4 (en) * 2018-12-07 2022-04-27 Sony Semiconductor Solutions Corporation Information processing device, information processing method, and program
WO2020152851A1 (en) * 2019-01-25 2020-07-30 株式会社 テクノミライ Digital search security system, method, and program
JP2021026698A (en) * 2019-08-08 2021-02-22 株式会社ティファナ ドットコム Crime prevention device, automatic dispenser or information providing device and program
JP7250264B2 (en) 2019-08-08 2023-04-03 株式会社ティファナ ドットコム Security device, vending machine or information providing device, and program
WO2021102911A1 (en) * 2019-11-29 2021-06-03 深圳市大疆创新科技有限公司 Image detection method, image detection device, and storage medium
US11410406B2 (en) 2019-12-23 2022-08-09 Yokogawa Electric Corporation Delivery server, method and storage medium
JP7259732B2 (en) 2019-12-23 2023-04-18 横河電機株式会社 Distribution server, method and program
CN113099171A (en) * 2019-12-23 2021-07-09 横河电机株式会社 Distribution server, method and recording medium
JP2021100202A (en) * 2019-12-23 2021-07-01 横河電機株式会社 Distribution server, method, and program
CN113099171B (en) * 2019-12-23 2024-06-21 横河电机株式会社 Distribution server, method and recording medium
JPWO2021199099A1 (en) * 2020-03-30 2021-10-07
JP7279851B2 (en) 2020-03-30 2023-05-23 日本電気株式会社 Control device, control system, control method and control program
WO2021199099A1 (en) * 2020-03-30 2021-10-07 日本電気株式会社 Control device, control system, and control method
WO2023228810A1 (en) * 2022-05-24 2023-11-30 村田機械株式会社 Article recognition system and article recognition device

Similar Documents

Publication Publication Date Title
WO2018101247A1 (en) Image recognition imaging apparatus
CN107240124B (en) Cross-lens multi-target tracking method and device based on space-time constraint
CN108111818B (en) Moving target actively perceive method and apparatus based on multiple-camera collaboration
EP3648448B1 (en) Target feature extraction method and device, and application system
CN109887040B (en) Moving target active sensing method and system for video monitoring
US10503966B1 (en) Binocular pedestrian detection system having dual-stream deep learning neural network and the methods of using the same
US9165190B2 (en) 3D human pose and shape modeling
KR101337060B1 (en) Imaging processing device and imaging processing method
CN104902246B (en) Video monitoring method and device
JP6554169B2 (en) Object recognition device and object recognition system
KR101530255B1 (en) Cctv system having auto tracking function of moving target
CN104966062B (en) Video monitoring method and device
JP5127531B2 (en) Image monitoring device
CN111753609A (en) Target identification method and device and camera
US20140049600A1 (en) Method and system for improving surveillance of ptz cameras
JP2018156408A (en) Image recognizing and capturing apparatus
JP2018088157A (en) Detection recognizing system
US20220366570A1 (en) Object tracking device and object tracking method
CN114905512B (en) Panoramic tracking and obstacle avoidance method and system for intelligent inspection robot
CN117689881B (en) Casting object tracking method based on event camera and CMOS camera
JP6860445B2 (en) Object distance detector
JP7074174B2 (en) Discriminator learning device, discriminator learning method and computer program
JP2008165595A (en) Obstacle detection method, obstacle detection device, and obstacle detection system
CN109460077B (en) Automatic tracking method, automatic tracking equipment and automatic tracking system
CN112257617A (en) Multi-modal target recognition method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17876273

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17876273

Country of ref document: EP

Kind code of ref document: A1