WO2022190157A1

WO2022190157A1 - Imaging device and video processing system

Info

Publication number: WO2022190157A1
Application number: PCT/JP2021/008913
Authority: WO
Inventors: 嵩臣神田
Original assignee: 株式会社日立国際電気
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2022-09-15
Also published as: JP7448721B2; JPWO2022190157A1

Abstract

The purpose of the present invention is to provide an imaging device and a video processing system that can obtain predetermined information related to an image while protecting image information at a higher level. This invention involves an imaging device (1) and a video processing device (5). The imaging device (1) captures a video to obtain an image, detects a predetermined area in the image, resizes the detected area to extract feature values of the resized detected area, and outputs an image obtained by placing a mask image (16-3), in which the extracted feature values are arranged two-dimensionally, on the detected area of the obtained image. The video processing device (5) receives the image output by the imaging device, obtains the feature values from the mask image (16-3), and performs an inference process on the basis of the feature values.

Description

Imaging device and image processing system

The present invention relates to an imaging device and a video processing system, and more particularly to an imaging device and a video processing system that are capable of inference processing by machine learning and have video processing functions for privacy protection.

In recent years, there has been an increase in demand for cameras that capture a large number of people, such as surveillance cameras. These cameras are connected to a LAN (Local Area Network) and have the advantage of being able to monitor images remotely. On the other hand, if security is breached, captured information may be leaked, which may pose a problem in terms of privacy protection.

Therefore, Patent Document 1 discloses a method of protecting privacy by performing processing such as reversible mosaic processing and mask processing on captured images. The processed image can be restored to the original image by performing corresponding restoration processing.

JP-A-2009-33738

In Patent Document 1, if restoration information including restoration information for performing restoration processing is leaked to the outside, a malicious third party can perform restoration processing and obtain the original image. In order to prevent this, it is necessary to distribute the irreversible image over the LAN, but in that case the original image cannot be restored. As a result, it becomes impossible to perform face recognition and action recognition using image recognition technology or the like.

In view of the above problems, it is an object of the present invention to provide an imaging device and a video processing system capable of transmitting predetermined information about images while protecting image information at a higher level.

In order to achieve the above object, one of the representative imaging devices of the present invention acquires an image by photographing a video, detects a predetermined area from the image, resizes the detected area, and detects it. It is characterized by extracting a feature amount of an area and outputting an image arranged in a detection area of the obtained image as a mask image in which the extracted feature amount is arranged two-dimensionally.

Further, one of the video processing systems of the present invention includes an imaging device and a video processing device, wherein the imaging device acquires an image by capturing a video, detects a predetermined region from the resizing the detected area, extracting a feature amount of the detection area, and outputting an image arranged in the detection area of the acquired image as a mask image in which the extracted feature amount is arranged two-dimensionally, , the image output by the imaging device is input, the feature amount is obtained from the mask image, and an inference process is performed based on the feature amount.

Advantageous Effects of Invention According to the present invention, in an imaging device and a video processing system, predetermined information regarding an image can be transmitted while protecting the image information at a higher level.
Problems, configurations, and effects other than those described above will be clarified by the following embodiments.

FIG. 1 is a block diagram showing one embodiment of the video processing system of the present invention. FIG. 2 is a block diagram showing an example of the processing system section of FIG. FIG. 3 is a diagram showing an example of processing for calculating feature amounts applied in the video processing system of the present invention. FIG. 4 is a diagram showing an example of processing of the imaging device in the video processing system of the present invention. FIG. 5 is a diagram showing an example of processing of the video processing device in the video processing system of the present invention.

A form for carrying out the present invention will be described.

FIG. 1 is a block diagram showing one embodiment of the video processing system of the present invention. The video processing system in FIG. 1 includes an imaging device 1 and a video processing device 5 . The imaging device 1 includes an imaging section 2 and a processing system section 3 . The video processing device 5 also includes a processing system section 6 and a display output section 7 . Note that the display output unit 7 may be configured separately from the video processing device 5 instead of being provided in the video processing device 5 . A personal computer, a tablet computer, a server, or the like can be applied to the video processing device 5 .

The imaging device 1 has a configuration of one or more cameras, and can be arranged in various places. For example, it may be installed at a monitoring location as a monitoring camera.

The imaging unit 2 is a camera configuration that obtains information by forming an image of incident light on an imaging device via a lens and a diaphragm. Examples of the imaging device here include a CCD (Charge-Coupled Device) image sensor and a CMOS (Complementary Metal Oxide Semiconductor) image sensor. The obtained information is sent to the processing system section 3 . In addition, the imaging unit 2 can perform imaging processing using a video processing IC (Integrated Circuit) such as an FPGA (Field Programmable Gate Array). On the other hand, this video processing IC may be integrated with the processing system section 3 .

The processing system unit 3 acquires information captured by the imaging unit 2 and performs the processing of FIG. 4, which will be described later. A specific configuration example will be described later with reference to FIG. 2, and specific processing contents will be described later with reference to FIG. The processed information is sent to the processing system section 6 .

The processing system section 6 acquires information from the processing system section 3 and performs the processing of FIG. 5, which will be described later. A specific configuration example will be described later with reference to FIG. 2, and specific processing contents will be described later with reference to FIG.

The display output unit 7 is a device that can display the content processed by the processing system unit 6. For example, it is displayed by a structure such as a liquid crystal display (LCD), an organic EL (OEL) display, a touch panel, or the like.

Information can be exchanged between the imaging device 1 and the video processing device 5 via the Internet network or the like. For example, it is connected to a LAN or the like. Alternatively, information may be exchanged via a dedicated communication line. That is, it is possible to check the processing contents of the imaging device 1 at a remote location with the video processing device 5 . Further, the imaging device 1 and the video processing device 5 may not be one-to-one, and one imaging device 1 may correspond to a plurality of video processing devices 5. A plurality of imaging devices 1 may correspond to one imaging device. The video processing device 5 may correspond. Also, the image processing device 5 may be configured to enable setting and operation of the imaging device 1 .

FIG. 2 is a block diagram showing an example of the processing system section of FIG. A specific example of the

processing system units

3 and 6 will be described as a computer system 300 in FIG.

The major components of computer system 300 include one or more processors 302 , memory 304 , terminal interfaces 312 , storage interfaces 314 , I/O (input/output) device interfaces 316 , and network interfaces 318 . These components may be interconnected via memory bus 306 , I/O bus 308 , bus interface 309 and I/O bus interface 310 .

Computer system 300 may include one or

more processing units

302 A and 302 B, collectively referred to as processor 302 . Each processor 302 executes instructions stored in memory 304 and may include an on-board cache. As the processing device, CPU (Central Processing Unit), FPGA (Field-Programmable Gate Array), GPU (Graphics Processing Unit), etc. can be applied.

Memory 304 may include random access semiconductor memory, storage devices, or storage media (either volatile or non-volatile) for storing data and programs. Memory 304 also represents the entire virtual memory of computer system 300 and may include the virtual memory of other computer systems connected to computer system 300 over a network. Memory 304 may conceptually be viewed as a single entity, but may be more complex arrangements, such as hierarchies of caches and other memory devices.

The memory 304 may store all or part of the programs, modules, and data structures that implement the functions described in this embodiment. For example, memory 304 may store application 350 . Application 350 may include instructions or descriptions that perform the functions described below on processor 302, or may include instructions or descriptions that are interpreted by other instructions or descriptions. Application 350 may be implemented in hardware via semiconductor devices, chips, logic gates, circuits, circuit cards, and/or other physical hardware devices instead of or in addition to processor-based systems. may be Application 350 may include data other than instructions or descriptions. Other data input devices such as cameras and sensors may also be provided in direct communication with bus interface 309 , processor 302 , or other hardware of computer system 300 .

Computer system 300 may include bus interface 309 that provides communication between processor 302 , memory 304 , display system 324 , and I/O bus interface 310 . I/O bus interface 310 may couple to I/O bus 308 for transferring data to and from various I/O units. I/O bus interface 310 connects via I/O bus 308 to a plurality of I/O interfaces 312, 314, 316, and 318, also known as I/O processors (IOPs) or I/O adapters (IOAs). may communicate with Display system 324 may include a display controller, display memory, or both. The display controller can provide video, audio, or both data to display device 326 . Computer system 300 may also include one or more sensors or other devices configured to collect data and provide such data to processor 302 . The display system 324 may be connected to a display device 326 such as a single display screen, television, tablet, or handheld device. Display device 326 may include speakers for rendering audio. Alternatively, speakers for rendering audio may be connected to the I/O interface. Alternatively, the functionality provided by display system 324 may be implemented by an integrated circuit that includes processor 302 . Similarly, the functionality provided by bus interface 309 may be implemented by an integrated circuit including processor 302 .

The I/O interface has the ability to communicate with various storage or I/O devices. For example, terminal interface 312 may be a user output device such as a video display, speaker television, etc., or a user input device such as a keyboard, mouse, keypad, touchpad, trackball, buttons, light pen, or other pointing device. Any user I/O device 320 can be attached. A user inputs input data and instructions to the user I/O device 320 and the computer system 300 by operating the user input device using the user interface, and receives output data from the computer system 300. good too. The user interface may be displayed on a display device or played by speakers via the user I/O device 320, for example.

The storage interface 314 allows attachment of one or more disk drives or direct access storage devices 322 . Storage device 322 may be implemented as any secondary storage device. The contents of memory 304 may be stored in storage device 322 and read from storage device 322 as needed. I/O device interface 316 may provide an interface to other I/O devices. Network interface 318 may provide a communication pathway to allow computer system 300 and other devices to communicate with each other. This communication path may be, for example, network 330 .

Although computer system 300 includes a bus structure that provides a direct communication path between processor 302, memory 304, bus interface 309, display system 324, and I/O bus interface 310, computer system 300 is hierarchically organized. , star or web configuration point-to-point links, multiple hierarchical buses, parallel or redundant communication paths. Further, although I/O bus interface 310 and I/O bus 308 are shown as a single unit, in reality computer system 300 may include multiple I/O bus interfaces 310 or multiple I/O buses 308 . may be provided. Also, although multiple I/O interfaces are shown to separate the I/O bus 308 from various communication paths leading to various I/O devices, some or all of the I/O devices may be connected to a single interface. It may be connected directly to the system I/O bus.

Computer system 300 may be a device that receives requests from other computer systems (clients) that do not have a direct user interface, such as a multi-user mainframe computer system, a single-user system, or a server computer.

When the computer system 300 in FIG. 2 is applied to the processing system section 3 in FIG. 1, the display device 326 is an arbitrary configuration and may or may not be provided. Also, the imaging unit 2 can be applied as the user I/O device 320 . 2 is applied as the processing system unit 6 in FIG. 1, the display device 326 can be applied as the display output unit 7. FIG. Also, the network 330 can be applied as a network interposed between the processing system section 3 and the processing system section 6 .

FIG. 3 is a diagram showing an example of processing for calculating feature amounts applied in the video processing system of the present invention. FIG. 3 shows a configuration example of machine learning by CNN (Convolution Neural Networks) for estimating a person from a face image. The number above each layer is the number of neurons in that layer, but these are just examples.

A portion of a specific image is input from the input layer 11, transmitted to the first convolution layer 12 and pooling layer 13, and connected to the convolution layer 12 and pooling layer 13, which are subsequent layers. After these processes there are fully connected layers, an input layer 16, an intermediate layer 17 and an output layer 18. FIG. The number of neurons in the output layer 18 is equivalent to the number of classes. When face recognition is performed, it is almost equivalent to the number of people that can be specified. When a part of a specific image is input from the input layer 11, for example, a 200×200 image is input after being resized to 64×64.

The input layer 11 acquires image information of a specific size (64×64 pixels in FIG. 3). The example in FIG. 3 is an image of a person's face captured by face detection.

Next, the convolution layer 12 performs convolution processing. The image acquired in the input layer 11 is filtered. Filtering reduces the size (60×60 in FIG. 3). Then, the number of filters prepared (eight in FIG. 3) is output.

Next, pooling processing is performed in the pooling layer 13 . The information output from the convolutional layer 12 is compressed. This halves the size (30×30 in FIG. 3).

Next, the convolution layer 14 performs convolution processing. The information compressed by the pooling layer 13 is further filtered to reduce the size (26×26 in FIG. 3). Then, the number of filters prepared (16 in FIG. 3) is output.

Next, the pooling layer 15 performs pooling processing. The information output from the convolution layer 14 is compressed. This halves the size (13×13 in FIG. 3).

Next, in the input layer 16 of the fully connected layer, the pooling layer 15 rearranges the three-dimensional information (13×13×16) into one-dimensional information (2704). The information here indicates the feature amount. In FIG. 3, the convolution layer and the pooling layer are repeated twice (two layers), but the number of repetitions is not limited to this and may be more.

A mask image can be formed from the input layer 16 of the fully connected layer. The mask image here means an image whose original image cannot be identified (if it is a face, it is not possible to identify someone from the image alone). This processing is irreversible video processing, and once the mask image is formed, the original image cannot be restored.

Specifically, as shown in FIG. 3, one-dimensional information 16-1 (2704 in FIG. 3), which is the information of the input layer 16 of the fully connected layer, is converted to two-dimensional image information 16-2 (52× in FIG. 3). 52). The information at this time can be held as image information, such as color density information in the case of a black and white image, and color type and color density information in the case of a color image. For example, a black-and-white image can be converted as 8-bit information per pixel, and an RGB color image can be converted as 24-bit information per pixel. The 52×52 pixel image information is expanded to a 200×200 pixel mask image 16-3. This is conversion processing for adjusting the size of the face image that was originally captured.

Then, the created mask image 16-3 is returned to the original one-dimensional information for inference processing. Specifically, the mask image 16-3 (200×200 in FIG. 3) is restored by resizing the two-dimensional image information 16-4 (52×52 in FIG. 3) before stretching, and then the one-dimensional The information 16-1 (2704 in FIG. 3) is rearranged. This makes it possible to temporarily convert the information of the input layer 16 of the fully connected layer into the mask image 16-3 and put it on the image.

Next, in the intermediate layer 17 of the fully connected layer, 1000 neurons are applied in FIG. This is an example and suitable numbers can be applied as needed. Also, the number of intermediate layers 17 may be increased to form a plurality of layers.

In the output layer 18 of the next fully connected layer, 100 neurons are applied. Here, the number of neurons is the number of classes and corresponds to the number of classes that can be classified. For example, in the case of face recognition, the person is estimated from the neuron that fires the most, such as Mr. A, Mr. B, and Mr. C. Such an inference process can classify 100 people. Alternatively, it is possible to classify the 99 persons and the remaining one as others.

FIG. 4 is a diagram showing an example of processing of an imaging device in the video processing system of the present invention. The processing here is performed by the imaging device 1 side, and is performed by the processing system unit 3 of the imaging device 1 unless otherwise specified. Here, irreversible image processing is performed.

The imaging device 1 first performs video shooting 21 . This is performed by the image pickup unit 2 and can be realized by an image pickup device and a video processing IC such as an FPGA. Filming is done on video. For example, shooting is performed at 30 frames per second (30 fps) or more. The image captured by the imaging unit 2 is sent to the processing system unit 3 for each image of one frame, and can be processed.

Next, the processing system unit 3 performs face detection 22 on the input video. Face detection 22 is a process of identifying the shape of a human face and detecting a range containing the face. This is done automatically using existing techniques. If a human face is identified, that area is detected. In addition, since the processing described later is performed, it is possible to perform detection processing when the range identified as a face has a certain number of pixels or more. If the number of bits handled by one neuron in the input layer 16 is the same as the number of bits in one pixel, the minimum range is set to 52×52 pixels in the example of FIG.

Next, resizing 23 of the detection area is performed in the detection area resizing section. This resizes the area detected by face detection 22 to a predetermined size. Since the area detected by the face detection 22 is not constant, this resizing is to convert the area into a predetermined size suitable for the calculation of the next feature amount. In the example of FIG. 4, a process of converting 200×200 pixels to 64×64 pixels is performed.

Next, feature quantity calculation 24 of the detection area is performed in the feature quantity calculation unit. Here, a feature amount necessary for face recognition is obtained using CNN or the like. The calculation of this feature amount is the same as the processing from the input layer 11 to the input layer 16 of the fully connected layer described with reference to FIG.

Next, rearrange/resize 25 the feature values. Here, processing is performed to convert the data into a format of a size applicable to the region where face detection has been performed. The number of feature amount neurons calculated in the input layer 16 of the fully connected layer is 2704, and when this is converted to two dimensions, it becomes a 52×52 area. On the other hand, the area detected by face detection 22 is 200×200. In order to apply the data of a two-dimensional area of 52×52 calculated from the number of feature amount neurons to the area of 200×200 of the face detection 22, the data of one neuron is expanded to about 4 pixels and allocated. As a result, the data of the 52×52 area are converted to the data of the 200×200 area. Note that the processing of rearranging/resizing the feature quantity 25 here is the same as the processing from the one-dimensional information 16-1 to the mask image 16-3 described with reference to FIG.

Here, the larger the enlargement ratio described above, the smaller the change between pixels in the mask area and between frames. As a result, sudden changes between pixels or between frames are alleviated, making it easier to perform processing using the irreversible codec. In addition, this feature amount must fit in the minimum image size data area for face detection, but depending on this minimum size, for example, the output of a pooling layer in the middle of CNN can be treated as a feature amount.

Next, the rearranged feature amount is subjected to mask processing 26 on the original image detected by face detection 22 . This is arranged on the original image by applying the rearranged feature amount (200×200) to the area detected by the face detection 22 as the mask image 16-3. Since the mask image 16-3 is an image with color types and densities based on feature amounts, it differs from the original image of the area detected by the face detection 22 and has information different from that of a human face.

Next, masking metadata addition 27 is performed on the image that has undergone masking 26 . Here, the index number of the image subjected to the mask processing, the coordinates of the starting point on the image, the length of one side thereof, and the like are given. As a result, information for specifying the masked area and information for specifying the masked image are added.

Next, external output 28 is performed. Here, when outputting to the outside, codec processing is performed in order to compress the transmission capacity. In the case of video, a lossy codec is generally used, but depending on the application, only intermittent transmission of images is sufficient, in which case a lossless codec may be used. The information externally output here is sent to the video processing device 5 via the Internet network or the like.

FIG. 5 is a diagram showing an example of processing of the video processing device in the video processing system of the present invention. The processing here is performed on the video processing device 5 side, and is performed by the processing system unit 6 of the video processing device 5 unless otherwise specified. Here, inference processing by machine learning is performed to identify a person.

First, video data having an image output from the imaging device 1 at the external output 28 in FIG.

Next, the feature amount extraction/resize/rearrangement unit extracts, resizes, and rearranges 32 the feature amount from the metadata of the video data. In this process, first, the mask image 16-3 is extracted from the video data. The range can be specified from the assigned metadata. Next, it is returned to two-dimensional image information 16-4 (52×52 in FIG. 5), and further rearranged into one-dimensional information 16-5 (2704 in FIG. 5). This is similar to FIG. This gives the value of the feature quantity. It should be noted that since this value undergoes processing such as resizing and codec on the way, the data value may deviate slightly and may not match completely. However, this deviation does not affect the process of obtaining the inference result from the next feature amount, and a value that is the same as or close to the original feature amount (one-dimensional information 16-1) is obtained.

Next, acquire 33 the inference result from the feature quantity. This is similar to the processing of the all-bonded layers 16-18 of FIG. Here, the class is specified by the inference result acquisition unit from the feature amount. In the case of the example of FIG. 5, an individual can be identified from the face by inference processing.

It should be noted that the above processing can be performed by storing the information about the individual's face in the video processing device 5. For example, when outputting classes for 100 people, it is possible to store information for 100 people and identify individuals from feature amounts. Also, if the person does not correspond to the person recorded in advance, it is possible to prepare one class for outputting that the person is another person.

In addition, the data structure of the feature amount and the parameters for extracting the feature amount such as the parameters of the neural network are shared in advance between the imaging device 1 and the video processing device 5 . As a result, when the mask image 16-3 is sent to the video processing device 5, it is possible to restore the one-dimensional information 16-5 and output the class from the feature amount. Regarding the setting of this parameter, the image processing apparatus 5 may have a function of setting the imaging apparatus 1 as well.

Although the above embodiment shows an example of processing for identifying a person by face detection, it is also possible to identify a person's behavior. For example, the imaging device 1 has a human detection function, detects the whole person, and masks the whole person with a two-dimensional image containing a feature amount. Then, the video processing device 5 infers what the behavior of the masked person is from the feature amount. In this case, the class outputs for each type of human action.

(effect)
In the above-described embodiment, irreversible mask processing of a person area (face or whole person) for which privacy protection is important can be realized. At the same time, the destination can also receive the data necessary to identify people and actions, and perform post-processing inferences as needed. This makes it possible to determine who the person is and what their actions are even in the masked area.

When using conventional reversible mask processing, decoding the masked part restores the original image of a person, for example, and if that is leaked, all personal information contained in the image will be leaked. On the other hand, with the method according to the present embodiment, even if information is leaked and decrypted by a malicious third party, only the label information such as the name associated with face recognition can Minimal information with only action label information.

Furthermore, when inferring human recognition and action recognition results on the imaging device side, if the data is transmitted and the communication is intercepted, the label information will be leaked. On the other hand, in the present embodiment, inference is made from the feature quantity on the side of the video processing device 5 that has received it. Therefore, even if the data from the imaging device 1 leaks out, inference cannot be made unless the data structure of the feature amount, the structure of the parameters of the neural network, and the like are known. Therefore, the information from the imaging device 1 is double-protected in addition to the communication encryption, and can be data that is more difficult to decrypt. Also, the transmission capacity can be reduced by embedding the feature amount in the mask image 16-3.

As described above, the embodiments of the present invention have been described, but the present invention is not limited to the above-described embodiments, and includes various modifications. For example, the above-described embodiments have been described in detail in order to explain the present invention in an easy-to-understand manner, and are not necessarily limited to those having all the configurations described. Moreover, it is possible to add, delete, or replace part of the configuration of each embodiment with another configuration.

For example, in the above embodiment, the process of embedding the feature amount in the mask image 16-3 is performed in order to reduce the transmission capacity. However, it is also possible to apply a configuration in which the image is subjected to appropriate mask processing (for example, masking with the same color and density) without embedding the feature amount information, and the feature amount information and the image are separately transmitted.

Also, in the above embodiment, an example using CNN was shown, but the present invention can also be applied using a DNN (Deep Neural Networks) technique as machine learning.

DESCRIPTION OF SYMBOLS 1... Imaging device, 2... Imaging part, 3... Processing system part, 5... Video processing apparatus, 6... Processing system part, 7... Display output part, 11... Input layer, 12... Convolution layer, 13... Pooling layer, 14 Convolution layer 15 Pooling layer 16 Input layer of fully connected layer 17 Intermediate layer of fully connected layer 18 Output layer of fully connected layer 21 Video shooting 22 Face detection 23 Resize of detection area 24 Feature amount calculation of detection area 25 Rearrangement/resize of feature amount 26 Mask processing to original image 27 Addition of mask processing metadata 28 External output 31 Video input , 32... Extraction/resize/rearrangement of feature quantity 33... Acquisition of inference result from feature quantity 300... Computer system 302...

Processor

302A, 302B... Processing device 304... Memory 306... Memory bus 308... I/O bus 309 Bus interface 310 I/O bus interface 312 Terminal interface 314 Storage interface 316 I/O device interface 318 Network interface 320 User I/O device 322 ... storage device, 324 ... display system, 326 ... display device, 330 ... network, 350 ... application

Claims

An image is obtained by photographing a video, a predetermined area is detected from the image, the detected detection area is resized to extract the feature amount of the detection area, and the extracted feature amount is arranged two-dimensionally as a mask. An imaging device, which outputs an image arranged in a detection area of the acquired image as an image.
The imaging device according to claim 1,
An imaging apparatus, wherein the extraction of the feature amount is performed using a CNN (Convolution Neural Networks) or DNN (Deep Neural Networks) technique.
The imaging device according to claim 1,
The imaging device, wherein the predetermined area is an area of a person's face.
The imaging device according to claim 1,
1. An imaging apparatus, wherein the image to be output is provided with information for specifying the range of the mask image in the image.
comprising an imaging device and a video processing device,
The imaging device captures an image to obtain an image, detects a predetermined area from the image, resizes the detected detection area, extracts a feature amount of the detection area, and divides the extracted feature amount into two. outputting an image arranged in a detection region of the acquired image as a mask image arranged in dimensions;
A video processing system, wherein the video processing device receives an image output from the imaging device, acquires a feature amount from the mask image, and performs an inference process based on the feature amount.
In the video processing system according to claim 5,
A video processing system, wherein the extraction of the feature amount and the inference processing are performed using a CNN (Convolution Neural Networks) or a DNN (Deep Neural Networks) technique.
In the video processing system according to claim 5,
A video processing system, wherein the predetermined area is a human face area, and the inference process is a process for identifying a person.
In the video processing system according to claim 5,
A video processing system having a function of setting a parameter used for extraction of a feature quantity in said imaging device from said video processing device.