WO2021241261A1 - Information processing device, information processing method, program, and learning method - Google Patents

Information processing device, information processing method, program, and learning method

Info

Publication number
WO2021241261A1
Authority
WO
WIPO (PCT)
Prior art keywords
cnn
information
information processing
layer
processing
Prior art date
Application number
PCT/JP2021/018332
Other languages
French (fr)
Japanese (ja)
Inventor
Leonardo Ishida Abe (レオナルド イシダアベ)
Christopher Wright (クリストファー ライト)
Bernadette Elliott-Bowman (ベルナデット エリオットボウマン)
Harm Cronie (ハーム クローニー)
Nicholas Walker (ニコラス ウォーカー)
Original Assignee
Sony Group Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Group Corporation
Publication of WO2021241261A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Definitions

  • This technology relates to information processing devices, information processing methods, programs, and learning methods, and in particular to an information processing device, an information processing method, a program, and a learning method designed to improve the performance of a model having functions similar to those of the human visual system.
  • It has been proposed to calculate, from information obtained by distance measuring methods using multiple sensors, a distance likelihood for each of multiple distances to an object, and to use a learning model to integrate the distance likelihoods of the multiple distance measuring methods to obtain an integrated likelihood for each of the plurality of distances (see, for example, Patent Document 1).
  • The human visual system can be effectively modeled by a CNN (Convolutional Neural Network).
  • The initial layers of a CNN modeled on the human visual system can be made to perform functions similar to those performed by the retina, such as edge detection.
  • However, Patent Document 1 does not consider modeling the human visual system.
  • This technology was made in view of such a situation and is intended to improve the performance of a model having functions similar to those of the human visual system.
  • The information processing device of the first aspect of the present technology includes a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system, and a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the information processing method of the first aspect of the present technology, an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system is connected to an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • The program of the first aspect of the present technology causes a computer to execute processing in which an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system is connected to an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the learning method of the second aspect of the present technology, a cross-fused CNN is trained in which an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system is connected to an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the first aspect of the present technology, an arbitrary layer of the first CNN, which realizes processing and functions similar to those of the human visual system, is connected to an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by different processing, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the second aspect of the present technology, a cross-fused CNN is trained in which an arbitrary layer of the first CNN (Convolutional Neural Network), which realizes processing and functions similar to those of the human visual system, is connected to an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by different processing, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • The training of the first CNN is performed before it is combined with the second CNN, and when training is performed with the first CNN and the second CNN combined, the second CNN is trained without changing the configuration and parameters of the first CNN.
  • HVC (Human Vision CNN)
  • The HVC is a CNN that realizes processing and functions similar to those of the human visual system, that is, processing and functions similar to those occurring in the human brain.
  • HVC is provided as a model of a human visual system that includes all steps from light detection to image perception and potentially includes high level functions such as object recognition.
  • MVC (Machine Vision CNN)
  • Cross-fusion refers to combining multiple CNNs that have independent architectures and parameters and training them in the combined state, thereby generating a combined CNN architecture.
  • A CNN architecture in which a plurality of CNNs are cross-fused is referred to as a cross-fused CNN.
  • In cross-fusion, a simple cross-connection input is added between one CNN and another CNN.
  • This allows related information to be transferred between arbitrary intermediate layers (for example, convolutional layers) of different CNNs, so that information from different CNNs can be efficiently fused at any relevant level of abstraction.
  • As a result, disparity detection between human visual processing and machine visual processing, and efficient image fusion, can be realized in real time.
  • A cross-connection is a connection that affects the function of a layer within the destination CNN by means of a trainable scalar value.
  • The cross-connection may carry, for example, the output from each convolutional layer of a CNN, or the output from only a subset of the convolutional layers (see the sketch below).
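  • A minimal sketch (an illustration, not the patent's implementation) of such a cross-connection in PyTorch is shown below; it assumes that the transferred feature map already has the same shape as the destination layer's input.

        import torch
        import torch.nn as nn

        class CrossConnection(nn.Module):
            """Adds a feature map transferred from the origin CNN to the input of a
            layer in the destination CNN, weighted by a trainable scalar value."""
            def __init__(self):
                super().__init__()
                self.scale = nn.Parameter(torch.zeros(1))  # trainable scalar

            def forward(self, destination_input, transferred_feature_map):
                # The transferred information modulates the destination layer's input.
                return destination_input + self.scale * transferred_feature_map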
  • FIG. 1 is a block diagram showing an embodiment of an information processing system 101 to which the present technology is applied.
  • The information processing system 101 includes a sensor unit 111, a human visual processing sensor 112, a processing unit 113, a training data set generation unit 114, and a training database 115.
  • The sensor unit 111 includes, for example, an image sensor and an optical sensor such as a LiDAR (Light Detection and Ranging) sensor.
  • the sensor unit 111 generates visual information (for example, image data) based on the data collected by the optical sensor.
  • the human visual processing sensor 112 collects data (hereinafter referred to as human visual processing data) indicating the state, processing, or function of the human (user) visual system using the information processing system 101.
  • the human visual processing sensor 112 is provided in, for example, an AR (Augmented Reality) headset.
  • Human visual processing data includes, for example, the following data.
  • For example, the functions of the user's visual system can be measured at different processing levels using characteristics of the user's line of sight and an EEG system placed in the ear.
  • some EEG systems can reconstruct the images that human beings form in their minds.
  • The processing unit 113 may be configured with, for example, a general-purpose processor such as a CPU, or with a processor optimized for the information processing system 101 or the like.
  • The processing unit 113 realizes the cross-fusion CNN 121.
  • The cross-fusion CNN 121 is a cross-fusion of the HVC 131 and the MVC 132, as shown in FIG.
  • The cross-fusion CNN 121 is represented by various parameters for the HVC 131, various parameters for the MVC 132, and cross-fusion parameters for the cross-connections between the HVC 131 and the MVC 132.
  • the cross fusion parameter indicates a fusion method of HVC131 and MVC132, and includes, for example, the following parameters.
  • connection relationship between HVC131 and MVC132 is indicated by the number of crossed connections, the layer number of the origin of the crossed connection, and the layer number of the destination of the crossed connection.
  • the type of cross-connection indicates the type of information that is transferred from one CNN intermediate layer (eg, a convolution layer) by cross-connection and input to the other CNN middle layer.
  • the types of cross-connection include feature maps (Feature maps), attention maps (Attention maps), region proposals (Region Proposal), and the like.
  • the transferred information is used for processing the transfer destination layer.
  • A feature map is, for example, image data output from a convolutional layer of a CNN, and indicates the feature value of each pixel of the image data.
  • An attention map is a feature map of a special format in which, for example, a region important for object recognition or the like (an attention region) is represented as a heat map. For example, in an attention map, pixels in regions of high importance have colors closer to red, and pixels in regions of lower importance have colors closer to blue.
  • A region proposal is, for example, image data indicating, with a rectangular frame or the like, an area of an image where an object is likely to exist.
  • The region proposal is used, for example, by another algorithm to determine whether an object is present in the image, or what object is present in the image. Region proposals are also used, for example, for detecting specific types of objects. (An illustrative representation of these cross-fusion parameters is sketched below.)
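  • The sketch below shows, for illustration only, one possible way to represent the cross-fusion parameters described above (origin and destination layer numbers and the type of transferred information); the field names are assumptions and do not appear in the patent.

        from dataclasses import dataclass
        from typing import Literal

        @dataclass
        class CrossConnectionSpec:
            origin_layer: int        # layer number in the origin CNN
            destination_layer: int   # layer number in the destination CNN
            kind: Literal["feature_map", "attention_map", "region_proposal"]

        # Example: two cross-connections from the HVC to the MVC; their count is
        # the number of cross-connections.
        cross_fusion_parameters = [
            CrossConnectionSpec(origin_layer=1, destination_layer=1, kind="feature_map"),
            CrossConnectionSpec(origin_layer=3, destination_layer=2, kind="attention_map"),
        ]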
  • the cross-fusion CNN121 realizes, for example, a function of expanding or emphasizing the processing difference between HVC131 and MVC132. Further, the cross-fusion CNN 121 realizes a function of reducing the difference between the processing of the HVC 131 (that is, the visual processing of a human) and the processing of the MVC 132 (that is, the visual processing of a machine).
  • the HVC 131 is a CNN that realizes the same processing and functions as a human visual system, unlike a standard CNN.
  • the model of the human visual system continues to evolve, but for example, a state-of-the-art model of the human visual system can be applied to the HVC 131.
  • the HVC 131 may include, for example, the following configuration.
  • For example, the HVC 131 can be provided with independent functional modules whose architecture reflects different functional areas of the human visual system, such as the primary visual cortex (V1) through the fifth visual cortex (V5), as shown in FIG.
  • FIG. 4 shows a schematic diagram of the human brain.
  • FIG. 4 schematically shows the distribution of V1 to V5 in the human brain. Further, FIG. 4 shows the positions of the human eye 201 and the region 202 for face recognition.
  • Each convolutional layer of the HVC 131 realizes a function similar to that of one of V1 to V5 of the human brain. From each convolutional layer of the HVC 131, image data (hereinafter referred to as visual cortex data V1 to visual cortex data V5) that images a signal similar to the signal output from the corresponding one of V1 to V5 of the human brain is output, and is input to the next layer of the HVC 131 or input to the MVC 132.
  • The functional modules included in the HVC 131 can be connected in series or non-linearly (a feed-forward sketch follows below).
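  • As a purely illustrative sketch (not the patent's architecture), a feed-forward HVC could expose the output of each of its V1 to V5 modules as visual cortex data for use by cross-connections; the channel sizes below are arbitrary assumptions.

        import torch.nn as nn

        class HVC(nn.Module):
            """Feed-forward sketch whose modules mirror visual areas V1 to V5."""
            def __init__(self):
                super().__init__()
                channels = [(3, 16), (16, 32), (32, 64), (64, 64), (64, 64)]  # V1..V5
                self.areas = nn.ModuleList(
                    nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU())
                    for c_in, c_out in channels
                )

            def forward(self, x):
                visual_cortex_data = []          # visual cortex data V1..V5
                for area in self.areas:
                    x = area(x)
                    visual_cortex_data.append(x)
                return x, visual_cortex_data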
  • A feed-forward network can realize an optimal model of the first approximately 200 ms of visual processing.
  • Such models have been demonstrated to produce images capable of activating only specific parts of the primate visual cortex.
  • In the HVC 131, for example, it is also possible to use neurons that feed their output back to their own input, that is, recurrence. This can, for example, improve the fit to the neural architecture of the primate visual cortex and its functional performance.
  • a neuromorphic computing architecture such as a spiking neural network can be applied to the HVC 131. This makes it possible, for example, to more accurately reproduce the functions of the human visual system.
  • the HVC 131 may be a model of a general human visual system or a model of a specific individual visual system.
  • the recognition process supported by the HVC 131 may be limited.
  • the object to be recognized by the HVC 131 may be limited. This may improve the performance of the HVC 131.
  • the content of the output data of the HVC 131 (hereinafter referred to as the HVC output) differs depending on the application of the HVC 131 or the cross fusion CNN 121.
  • For example, labeled image data, in which a label indicating the type of an object in the image is attached to the image data input to the HVC 131, is output as the HVC output.
  • the MVC 132 is a CNN that realizes a function similar to that of the HVC 131 by processing different from the HVC 131 without setting the limitation of modeling the processing of the human visual system.
  • the HVC 131 is trained before being combined with the MVC 132, that is, before being mounted on the cross-fusion CNN 121.
  • the parameters such as the configuration and weight of the HVC 131 are not changed after being implemented in the cross-fusion CNN 121.
  • the MVC 132 is not trained before being combined with the HVC 131, i.e., before being mounted on the cross-fused CNN 121, but after being mounted on the cross-fused CNN 121, i.e., after being combined with the HVC 131. That is, parameters such as the configuration and weight of the MVC 132 are adjusted after being mounted on the cross-fusion CNN 121.
  • the cross-fusion parameters between the HVC 131 and the MVC 132 may be fixed based on a predefined parameter set or may be changed during training of the cross-fusion CNN 121.
  • In the former case, the connection relationship between the HVC 131 and the MVC 132 is not changed, while in the latter case the connection relationship between the HVC 131 and the MVC 132 may be changed.
  • In the cross-fusion CNN 121, information is always transferred from the HVC 131 to the MVC 132 so that the MVC 132 is affected by the processing of the HVC 131. That is, there is always a cross-connection from the HVC 131 to the MVC 132. As a result, each layer of the HVC 131 independently affects layers of the MVC 132 via the cross-connections, so that the processing of the machine visual system depends on the processing of the human visual system.
  • the visual cortex data V1 to the visual cortex data V5 output from each convolution layer of the HVC 131 is transferred to the MVC 132.
  • On the other hand, whether or not to transfer information from the MVC 132 to the HVC 131 is optional.
  • The transfer of information from the MVC 132 to the HVC 131 is useful, for example, for determining where to input information from the MVC 132 into the HVC 131 in order to improve the classification results of the HVC 131. It also helps determine, for example, how to enhance an image to improve human cognitive performance. (A sketch of mounting a frozen HVC together with a trainable MVC follows below.)
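  • The sketch below illustrates, under the assumptions of the earlier sketches, how a trained HVC can be mounted on the cross-fusion CNN with its parameters frozen while the MVC and the cross-fusion (scalar) parameters remain trainable; the forward pass is omitted because the concrete wiring depends on the chosen cross-fusion parameters.

        import torch.nn as nn

        class CrossFusedCNN(nn.Module):
            def __init__(self, hvc: nn.Module, mvc: nn.Module, cross_connections):
                super().__init__()
                self.hvc = hvc
                self.mvc = mvc
                self.cross_connections = nn.ModuleList(cross_connections)
                for p in self.hvc.parameters():
                    p.requires_grad = False  # HVC configuration and weights are never changed

            def trainable_parameters(self):
                # Only the MVC and the cross-fusion parameters are adjusted during training.
                yield from self.mvc.parameters()
                yield from self.cross_connections.parameters()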
  • image data to be processed is input to the HVC 131 and MVC 132 of the cross fusion CNN 121.
  • the types of image data input to the HVC 131 and the image data input to the MVC 132 may be the same or different.
  • the image data taken by the camera may be input to the HVC 131, and the image data obtained by imaging the data collected by the LiDAR may be input to the MVC 132.
  • Further, the human visual processing data collected by the human visual processing sensor 112 is input to the HVC 131 as needed.
  • HVC131 and MVC132 perform image processing individually. Then, the HVC 131 outputs the HVC output as needed.
  • the MVC 132 outputs output data influenced by the HVC 131, that is, output data that depends to some extent on the functions of the human visual system (hereinafter referred to as HVC-influenced output).
  • The content of the HVC-influenced output differs depending on the application of the cross-fusion CNN 121 and the like.
  • For example, the HVC-influenced output contains the same type of data as the HVC output.
  • Further, for example, the HVC-influenced output includes image data indicating points of agreement, differences, and the like between the processing of the HVC 131 and the processing of the MVC 132.
  • This image data is similar to, for example, an attention map highlighting image features that the MVC 132 is paying attention to but the HVC 131 is not.
  • Further, for example, the HVC-influenced output includes image data showing the result of image processing such as object classification.
  • The training data set generation unit 114 generates a set of training data used for training the HVC 131 (hereinafter referred to as the training data set for HVC) and a set of training data used for training the cross-fusion CNN 121 (hereinafter referred to as the training data set for cross-fusion CNN).
  • The training data set generation unit 114 stores the generated training data set for HVC and training data set for cross-fusion CNN in the training database 115.
  • The training data set generation unit 114 may have a function of giving an object recognition task to a human and collecting a label (hereinafter referred to as a human recognition label) indicating the human's recognition result for an object in an image.
  • The object recognition task is, for example, a task for testing the human visual system. Specifically, in the object recognition task, an image is presented to a human for a predetermined time (for example, 100 ms), the human classifies the object in the image, and a label indicating the result of the classification is given.
  • FIG. 8 shows a configuration example of the function of the information processing system 101 in the training stage of the HVC 131.
  • FIG. 9 shows a configuration example of the function of the information processing system 101 in the training stage of the cross-fusion CNN 121.
  • In step S1, the training data set generation unit 114 generates the training data set for the HVC 131 (that is, the training data set for HVC).
  • a training dataset for HVC includes a set of image data labeled by humans. This label is given, for example, within the object recognition task described above.
  • the label may be automatically added, for example, using a video database that determines whether a human has correctly or incorrectly identified an object in the image.
  • the training data set for HVC includes human visual processing data collected by the human visual processing sensor 112, if necessary.
  • Human visual processing data is acquired in synchronization with other training data (eg, image data corresponding to the presented image). That is, the human visual processing data collected from the person to whom the image is presented is associated with the image data corresponding to the presented image.
  • At this time, the time lag that exists between the presentation of the image and the activity of the human visual system may be taken into consideration. For example, there is a time lag of about 100 ms before V4 of the human brain reacts to the presented image.
  • Therefore, the data acquired from V4 of the human brain is associated with the image data corresponding to the image presented 100 ms before the acquisition of that data (an alignment sketch follows below).
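  • A minimal alignment sketch is shown below; it assumes timestamps in milliseconds and a fixed latency (about 100 ms for V4, per the description above) between image presentation and the measured visual cortex activity.

        def associate_with_presented_image(sample_time_ms, presentations, latency_ms=100):
            """presentations: list of (presentation_time_ms, image_id) sorted by time.
            Returns the image presented latency_ms before the measured sample."""
            target_time = sample_time_ms - latency_ms
            candidates = [image_id for t, image_id in presentations if t <= target_time]
            return candidates[-1] if candidates else None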
  • the training data set generation unit 114 stores the generated training data for HVC in the training database 115.
  • In step S2, the processing unit 113 trains the HVC 131.
  • The HVC 131 is trained by standard neural network methods so as to realize functions similar to those of the human visual system.
  • For example, training is conducted with, as a success indicator, a match between the output of the HVC 131 for image data in the training data set for HVC and the result of the human object recognition task for the image corresponding to that image data (that is, the human recognition label attached to the image data).
  • the degree of similarity between the activity mapping in the human visual system for the same image and the functional activity in the HVC 131 may be included in the training success index.
  • image data is input to the HVC 131, and an image corresponding to the image data is presented to a human wearing a human visual processing sensor 112 (hereinafter referred to as a data provider).
  • a human visual processing sensor 112 collects human visual processing data indicating the reaction of the data provider to the presented image and inputs it to the HVC 131.
  • Further, the human visual processing sensor 112 inputs, into the image reproduction model 251, the signals output from V1 to V5 of the data provider (hereinafter referred to as visual cortex signal V1 to visual cortex signal V5) among the collected human visual processing data.
  • The image reproduction model 251 converts the visual cortex signals V1 to V5 into image data (hereinafter referred to as correct visual cortex data V1 to correct visual cortex data V5) and outputs them. The visual cortex data V1 to V5 output from the convolutional layers of the HVC 131 are then compared with the correct visual cortex data V1 to V5 output from the image reproduction model 251, and the degree of similarity between them is used as a success index.
  • For example, the above two types of success indicators are combined into a score using a predetermined function or the like, and the success or failure of the processing of the HVC 131 for the input image data is determined based on the result of comparing the calculated score with a predetermined threshold value (a scoring sketch follows below). The configuration and parameters of the HVC 131 are then adjusted so that the success rate of the processing of the HVC 131 improves.
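  • The sketch below is an illustrative scoring function only; the weights, the combination function (a weighted sum), and the threshold are assumptions, since the patent leaves the predetermined function and threshold unspecified.

        def hvc_training_success(label_match: bool, cortex_similarity: float,
                                 w_label: float = 0.5, w_cortex: float = 0.5,
                                 threshold: float = 0.7) -> bool:
            """Combines the label-match indicator and the visual-cortex-data similarity
            into a single score and compares it with a threshold."""
            score = w_label * float(label_match) + w_cortex * cortex_similarity
            return score >= threshold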
  • the HVC 131 may be trained for a specific individual or for a general human being.
  • Interpretability techniques are used to cause the HVC 131 to generate images that maximize the activation of specific areas or neurons of the human visual system.
  • The image generated by the HVC 131 is presented to a person wearing a device that scans the brain, such as an EEG scanner, an MRI (Magnetic Resonance Imaging) scanner, or a NIRS (Near-Infrared Spectroscopy) scanner. If the processing of the HVC 131 is effective, the target region or neurons are activated in the brain of the person to whom the image is presented.
  • Alternatively, a database showing the reactions of the human visual cortex to various images may be used for the test of the HVC 131.
  • For example, the correct visual cortex data V1 to correct visual cortex data V5, obtained by converting the human visual cortex signals V1 to V5 for various images, are stored in the database.
  • Then, for the various images in the database, the HVC 131 is tested by comparing the visual cortex data V1 to V5 output from the HVC 131 with the correct visual cortex data V1 to V5 stored in the database.
  • a behavior prediction test (Behavioral predictivity test) in which a behavior pattern corresponding to a presented image is linked to an object recognition process inside a human being may be used.
  • In the behavioral predictivity test, human reactions are classified into multiple recognition categories (for example, missed recognitions, false positive recognitions, etc.). Then, the degree of similarity between the recognition result of the HVC 131 and the human reactions in the behavioral predictivity test is used for the test of the HVC 131.
  • If the test score is equal to or above the threshold, the HVC 131 is mounted on the cross-fusion CNN 121.
  • If the test score is below the threshold, the HVC 131 is retrained.
  • In step S3, the training data set generation unit 114 generates the training data set for the cross-fusion CNN 121 (the training data set for cross-fusion CNN).
  • The method of generating the training data set for cross-fusion CNN differs depending on the function realized by the cross-fusion CNN 121 (for example, the content of the HVC-influenced output).
  • the training data set for HVC may be used for the training for the cross-fusion CNN.
  • a correct answer label (Ground truth label) indicating the type of the actual object is given to the input image data with the human recognition label included in the training data set for HVC.
  • the human recognition label and the correct label may or may not match.
  • Alternatively, the training data may be generated by having the trained HVC 131 assign a pseudo human recognition label to input image data to which only the correct answer label is attached.
  • the input image data may include either a negative example (Negative example) or a positive example (Positive example).
  • A negative example is, for example, image data for which it is presumed that the human's recognition result for an object and the recognition result of the cross-fusion CNN 121 (the HVC-influenced output) will differ.
  • A positive example is, for example, image data for which it is presumed that the human's recognition result for an object and the recognition result of the cross-fusion CNN 121 (the HVC-influenced output) will match.
  • The training data set for cross-fusion CNN includes the input image data and correct image data that is generated from the input image data and has the same format as the HVC-influenced output.
  • the training dataset for cross-fusion CNN uses the difference map as the correct image data.
  • the attention map is generated using the HVC131 after training and the MVC132 before being mounted on the cross-fusion CNN121.
  • This attention map is generated, for example, from the entire neural networks (the HVC 131 and the MVC 132). It is also possible to use an attention map generated from an individual neuron in each neural network or from any combination of neurons.
  • Alternatively, for example, the training data set for cross-fusion CNN includes AI-extended image data as the correct image data.
  • In this case, the AI-extended image is generated by performing predetermined image processing (for example, sharpening, generative data, upscaling, etc.) on the input image data using the MVC 132 before it is mounted on the cross-fusion CNN 121.
  • Further, predetermined image processing is performed on the input image data using the HVC 131, and an attention map of regions estimated to be noticed by humans in an important recognition task (for example, bleeding detection in surgery) is generated.
  • Then, the input image is divided into an attention region to which the HVC 131 pays a great deal of attention and a non-attention region to which it does not. That is, the input image is divided into a region where humans are presumed to pay close attention in the important recognition task and a region where they are presumed not to. A composite image is then generated that uses the input image in the attention region and the AI-extended image in the non-attention region (see the compositing sketch below), and the composite image data corresponding to the generated composite image is used as the correct image data.
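  • A minimal compositing sketch is shown below; it assumes a binary attention mask (1 inside the attention region, 0 elsewhere) derived from the HVC attention map, and images stored as NumPy arrays of matching size.

        import numpy as np

        def composite_correct_image(input_image: np.ndarray,
                                    ai_extended_image: np.ndarray,
                                    attention_mask: np.ndarray) -> np.ndarray:
            """Uses the input image inside the attention region and the AI-extended
            image elsewhere."""
            mask = attention_mask[..., np.newaxis]  # broadcast over colour channels
            return mask * input_image + (1 - mask) * ai_extended_image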
  • the training data for the cross-fusion CNN includes the human visual processing data.
  • the training data set generation unit 114 stores the generated training data set for cross-fusion CNN in the training database 115.
  • In step S4, the processing unit 113 trains the cross-fusion CNN 121.
  • the cross-fusion CNN 121 is generated by connecting the post-training HVC 131 and the pre-training MVC 132 using the cross-fusion parameters according to a predefined architecture.
  • the training data set for cross-fusion CNN stored in the training database 115 is divided into a training data set and a test data set. For example, about 80% of the training dataset for cross-fusion CNN is used for the training dataset and the rest is used for the test dataset.
  • The cross-fusion CNN 121 processes the input image data included in the training data set and outputs the HVC-influenced output.
  • For example, a score is given based on the result of comparing the HVC-influenced output with the human recognition label and the correct answer label given to the input image data. Whether a match between the HVC-influenced output and the human recognition label is judged a success, and whether a match between the HVC-influenced output and the correct answer label is judged a success, is determined by the category of the object to be recognized, and the like.
  • Alternatively, for example, a score is given based on the degree of similarity between the HVC-influenced output and the correct image data corresponding to the input image data (for example, the similarity of pixel values).
  • The above processing is repeated for all the input image data of the training data set, and the configuration and parameters of the cross-fusion CNN 121 are adjusted so that the score improves. Specifically, the configuration and parameters of the MVC 132, as well as the cross-fusion parameters, are adjusted, while the configuration and parameters of the HVC 131 are not changed (a training-step sketch follows below).
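  • A minimal training-step sketch, under the assumptions of the earlier CrossFusedCNN sketch, is shown below; the mean-squared-error loss stands in for the scoring described above, and the optimizer is built only from the MVC and cross-fusion parameters so that the HVC remains unchanged.

        import torch
        import torch.nn.functional as F

        def make_optimizer(model):
            # Only MVC weights and cross-fusion scalars receive gradient updates.
            return torch.optim.Adam(model.trainable_parameters(), lr=1e-4)

        def training_step(model, optimizer, input_image, correct_image):
            optimizer.zero_grad()
            hvc_influenced_output = model(input_image)
            loss = F.mse_loss(hvc_influenced_output, correct_image)
            loss.backward()
            optimizer.step()
            return loss.item()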
  • the cross-fusion CNN121 trained using the training data set is tested using the test data set.
  • the configuration and parameters of the cross-fusion CNN 121 are not adjusted, and the same scoring as in the training stage is performed in order to estimate the accuracy of the cross-fusion CNN 121.
  • If the estimated accuracy is insufficient, the process is restarted from the training stage.
  • In this way, the cross-fusion CNN 121 is generated.
  • FIG. 12 shows a detailed configuration example of the information processing system 101 at the execution stage.
  • In step S51, the cross-fusion CNN 121 acquires image data and human visual processing data.
  • the image data is input to the HVC 131 and the MVC 132.
  • the human visual processing sensor 112 is attached to the user, collects the human visual processing data of the user, and inputs it to the HVC 131.
  • As the human visual processing data, the same type of data as that used for training the HVC 131 and the cross-fusion CNN 121 is used.
  • In step S52, the cross-fusion CNN 121 processes the image data. Then, the cross-fusion CNN 121 outputs a predetermined HVC-influenced output, which is the processing result for the image data.
  • The human visual processing data is used to synchronize the HVC 131 with the current state and behavior of the user. This makes it possible to provide a more accurate model of the human visual system (the HVC 131), and as a result, the HVC-influenced output can be optimized.
  • each cross-fusion CNN 121 may be given a score, and the cross-fusion CNN 121 having the highest score may be adopted.
  • training may be performed to generate a plurality of cross-fused CNN 121s having different tolerances for the processing of the HVC 131, and the plurality of cross-fused CNN 121s may be used in the information processing system 101.
  • This allows each cross-fusion CNN 121 to provide, to a standard user, output influenced by human visual processing with a different degree of intuitive interpretability.
  • Further, a cross-fusion CNN 121 in which a plurality of MVCs 132 are combined with one HVC 131 may be generated.
  • In this case, different types of HVC-influenced outputs can be output from the respective MVCs 132.
  • the human visual system and the machine visual system can be effectively integrated.
  • the performance of models with functions similar to those of human visual systems can be improved.
  • This technology can be suitably applied when humans and machines need to make decisions cooperatively, or when humans must monitor and verify machine decisions.
  • the coincidence points and differences between the processing of the human visual system (HVC131) and the processing of the machine visual system (MVC132) can be detected and used at an arbitrary semantic level.
  • an output showing the points of agreement and differences between human and machine visual processing can be obtained.
  • the inputs, arbitrary intermediate layers, and outputs of the HVC 131 and MVC 132 can be individually accessed to evaluate the performance of the system.
  • For example, the human and machine visual systems can be compared at any semantic processing layer of the CNNs. Also, for example, monitoring and certainty of the parameters related to the human visual system can be ensured.
  • For example, the cross-fused CNN 121 can identify image features that are difficult for humans to detect but that affect the visual processing of the machine, that is, features that differentiate the machine from humans. For example, even if both the human and the machine visual system ultimately recognize the same object, differences between human and machine can be predicted at a lower semantic level, such as differences in edge detection behavior.
  • the information recognized by the user can be expanded based on the difference in visual processing between humans and machines.
  • For example, an AR (Augmented Reality) device can be used to present the expanded information, and the image can be expanded without changing features that are important in human visual processing.
  • For a task that is not important to the user (a non-critical task), the machine visual system (MVC 132) is therefore used to expand the information presented to the user.
  • the feature that the HVC 131 pays attention to is presumed to be important in human visual processing, and is presumed to be important when the user performs an important task (Critical task). Therefore, the features that the HVC 131 pays attention to are presented to the user without major changes. This keeps the user accountable for important tasks.
  • the interpretability (Interpretability) and accountability (Accountability) of the processing and results of the AI system can be improved or customized.
  • a part of HVC131 can be used to verify the performance of the system.
  • automatic domain adaptation can be realized based on the user's experience.
  • For example, countermeasures against racial bias in face recognition, medical AR applications, and the like can be realized.
  • the visual system of the machine can be harmonized with the human visual processing.
  • For example, the cross-fused CNN 121 is trained with permitted deviations at multiple levels of the data processing of the CNN.
  • the processing steps of the AI system can be intuitively understood by humans.
  • Further, multiple cross-fused CNNs 121 are generated with different, larger levels of permitted deviation, achieving higher efficiency or accuracy.
  • By selecting the cross-fused CNN 121 with the level of permitted deviation (intuitive interpretability) that is ideal for the task, humans can control how "human-like" the vision system behaves.
  • For example, a doctor may select high human coherence for a task that does not require advanced cognitive processing. This ensures that the decision criteria of the AI system are similar to those easily detectable by humans, making it possible for humans to communicate with the AI system easily and quickly.
  • The HVC 131 and the MVC 132 are individually constructed, trained, and established, and it is guaranteed that the configuration and parameters of the HVC 131 do not change during the training and execution of the cross-fusion CNN 121.
  • Therefore, a good model of the human visual system (the HVC 131) can be used with its configuration and parameters guaranteed not to be adjusted or changed.
  • the model to which this technology is applied has a mapping between human visual processing and machine visual processing. This can be mapped by estimating the reciprocal activation of neurons in the human and machine visual system for the same image. Since this mapping is understood in the art, images can be generated, for example, by networks optimized to produce higher activation in certain parts of the human visual system.
  • This technology can also be applied, for example, to a natural UI (user interface).
  • This technology can also be applied to mobile devices such as vehicles.
  • FIG. 13 is a block diagram showing a configuration example of a vehicle control system 1011 which is an example of a mobile device control system to which the present technology is applied.
  • the vehicle control system 1011 is provided in the vehicle 1001 and performs processing related to driving support and automatic driving of the vehicle 1001.
  • The vehicle control system 1011 includes a processor 1021, a communication unit 1022, a map information storage unit 1023, a GNSS (Global Navigation Satellite System) receiving unit 1024, an external recognition sensor 1025, an in-vehicle sensor 1026, a vehicle sensor 1027, a recording unit 1028, a driving support / automatic driving control unit 1029, a DMS (Driver Monitoring System) 1030, an HMI (Human Machine Interface) 1031, and a vehicle control unit 1032.
  • The communication network 1041 is composed of, for example, an in-vehicle communication network or a bus compliant with an arbitrary standard such as CAN (Controller Area Network), LIN (Local Interconnect Network), LAN (Local Area Network), FlexRay (registered trademark), or Ethernet.
  • each part of the vehicle control system 1011 may be directly connected by, for example, short-range wireless communication (NFC (Near Field Communication)), Bluetooth (registered trademark), or the like without going through the communication network 1041.
  • Hereinafter, when each part of the vehicle control system 1011 communicates via the communication network 1041, the description of the communication network 1041 is omitted.
  • For example, when the processor 1021 and the communication unit 1022 communicate with each other via the communication network 1041, it is simply described that the processor 1021 and the communication unit 1022 communicate with each other.
  • the processor 1021 is composed of various processors such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), and an ECU (Electronic Control Unit), for example.
  • the processor 1021 controls the entire vehicle control system 1011.
  • the communication unit 1022 communicates with various devices inside and outside the vehicle, other vehicles, servers, base stations, etc., and transmits and receives various data.
  • For example, the communication unit 1022 receives, from the outside, a program for updating the software that controls the operation of the vehicle control system 1011, map information, traffic information, information around the vehicle 1001, and the like.
  • the communication unit 1022 transmits information about the vehicle 1001 (for example, data indicating the state of the vehicle 1001, recognition result by the recognition unit 1073, etc.), information around the vehicle 1001, and the like to the outside.
  • the communication unit 1022 performs communication corresponding to a vehicle emergency call system such as eCall.
  • the communication method of the communication unit 1022 is not particularly limited. Moreover, a plurality of communication methods may be used.
  • the communication unit 1022 performs wireless communication with the equipment in the vehicle by a communication method such as wireless LAN, Bluetooth, NFC, WUSB (WirelessUSB).
  • Further, for example, the communication unit 1022 performs wired communication with equipment in the vehicle via a connection terminal (and a cable if necessary) (not shown), using a communication method such as USB (Universal Serial Bus), HDMI (High-Definition Multimedia Interface, registered trademark), or MHL (Mobile High-definition Link).
  • the device in the vehicle is, for example, a device that is not connected to the communication network 1041 in the vehicle.
  • mobile devices and wearable devices possessed by passengers such as drivers, information devices brought into a vehicle and temporarily installed, and the like are assumed.
  • For example, the communication unit 1022 communicates with a server or the like existing on an external network (for example, the Internet, a cloud network, or a network specific to a business operator) via a base station, using a wireless communication system such as 4G (4th generation mobile communication system), 5G (5th generation mobile communication system), LTE (Long Term Evolution), or DSRC (Dedicated Short Range Communications).
  • Further, for example, the communication unit 1022 uses P2P (Peer To Peer) technology to communicate with a terminal existing in the vicinity of the own vehicle (for example, a terminal of a pedestrian or a store, or an MTC (Machine Type Communication) terminal).
  • the communication unit 1022 performs V2X communication.
  • V2X communication is, for example, vehicle-to-vehicle (Vehicle to Vehicle) communication with other vehicles, vehicle-to-infrastructure (Vehicle to Infrastructure) communication with roadside devices, vehicle-to-home (Vehicle to Home) communication, and vehicle-to-pedestrian (Vehicle to Pedestrian) communication with terminals carried by pedestrians.
  • the communication unit 1022 receives electromagnetic waves transmitted by a vehicle information and communication system (VICS (Vehicle Information and Communication System), registered trademark) such as a radio wave beacon, an optical beacon, and FM multiplex broadcasting.
  • the map information storage unit 1023 stores a map acquired from the outside and a map created by the vehicle 1001.
  • the map information storage unit 1023 stores a three-dimensional high-precision map, a global map that is less accurate than the high-precision map and covers a wide area, and the like.
  • the high-precision map is, for example, a dynamic map, a point cloud map, a vector map (also referred to as an ADAS (Advanced Driver Assistance System) map), or the like.
  • the dynamic map is, for example, a map composed of four layers of dynamic information, quasi-dynamic information, quasi-static information, and static information, and is provided from an external server or the like.
  • the point cloud map is a map composed of point clouds (point cloud data).
  • a vector map is a map in which information such as lanes and signal positions is associated with a point cloud map.
  • The point cloud map and the vector map may be provided, for example, from an external server or the like, or they may be created by the vehicle 1001, as maps for matching with a local map (described later), based on sensing results from the radar 1052, the LiDAR 1053, or the like, and stored in the map information storage unit 1023. Further, when a high-precision map is provided from an external server or the like, map data of, for example, several hundred meters square relating to the route the vehicle 1001 is about to travel is acquired from the server in order to reduce the communication volume.
  • the GNSS receiving unit 1024 receives the GNSS signal from the GNSS satellite and supplies it to the traveling support / automatic driving control unit 1029.
  • the external recognition sensor 1025 includes various sensors used for recognizing the external situation of the vehicle 1001, and supplies sensor data from each sensor to each part of the vehicle control system 1011.
  • the type and number of sensors included in the external recognition sensor 1025 are arbitrary.
  • The external recognition sensor 1025 includes a camera 1051, a radar 1052, a LiDAR (Light Detection and Ranging, Laser Imaging Detection and Ranging) 1053, and an ultrasonic sensor 1054.
  • the number of cameras 1051, radar 1052, LiDAR1053, and ultrasonic sensors 1054 is arbitrary, and examples of sensing areas of each sensor will be described later.
  • As the camera 1051, a camera of any imaging method, such as a ToF (Time of Flight) camera, a stereo camera, a monocular camera, or an infrared camera, is used as needed.
  • Further, for example, the external recognition sensor 1025 includes an environment sensor for detecting weather, meteorological conditions, brightness, and the like.
  • the environment sensor includes, for example, a raindrop sensor, a fog sensor, a sunshine sensor, a snow sensor, an illuminance sensor, and the like.
  • the external recognition sensor 1025 includes a microphone used for detecting the position of a sound or a sound source around the vehicle 1001.
  • the in-vehicle sensor 1026 includes various sensors for detecting information in the vehicle, and supplies sensor data from each sensor to each part of the vehicle control system 1011.
  • the type and number of sensors included in the in-vehicle sensor 1026 are arbitrary.
  • the in-vehicle sensor 1026 includes a camera, a radar, a seating sensor, a steering wheel sensor, a microphone, a biological sensor, and the like.
  • As the camera, for example, a camera of any imaging method, such as a ToF camera, a stereo camera, a monocular camera, or an infrared camera, can be used.
  • The biosensor is provided on, for example, a seat, the steering wheel, or the like, and detects various types of biometric information of an occupant such as the driver.
  • the vehicle sensor 1027 includes various sensors for detecting the state of the vehicle 1001, and supplies sensor data from each sensor to each part of the vehicle control system 1011.
  • the type and number of sensors included in the vehicle sensor 1027 are arbitrary.
  • the vehicle sensor 1027 includes a speed sensor, an acceleration sensor, an angular velocity sensor (gyro sensor), and an inertial measurement unit (IMU (Inertial Measurement Unit)).
  • the vehicle sensor 1027 includes a steering angle sensor for detecting the steering angle of the steering wheel, a yaw rate sensor, an accelerator sensor for detecting the operation amount of the accelerator pedal, and a brake sensor for detecting the operation amount of the brake pedal.
  • Further, for example, the vehicle sensor 1027 includes a rotation sensor that detects the rotation speed of the engine or motor, an air pressure sensor that detects tire pressure, a slip ratio sensor that detects the tire slip ratio, and a wheel speed sensor that detects the wheel rotation speed.
  • the vehicle sensor 1027 includes a battery sensor that detects the remaining amount and temperature of the battery, and an impact sensor that detects an impact from the outside.
  • The recording unit 1028 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a magnetic storage device such as an HDD (Hard Disc Drive), a semiconductor storage device, an optical storage device, a magneto-optical storage device, and the like.
  • the recording unit 1028 records various programs, data, and the like used by each unit of the vehicle control system 1011.
  • the recording unit 1028 records a rosbag file including messages sent and received by the ROS (Robot Operating System) in which an application program related to automatic driving operates.
  • the recording unit 1028 includes an EDR (Event Data Recorder) and a DSSAD (Data Storage System for Automated Driving), and records information on the vehicle 1001 before and after an event such as an accident.
  • the driving support / automatic driving control unit 1029 controls the driving support and automatic driving of the vehicle 1001.
  • the driving support / automatic driving control unit 1029 includes an analysis unit 1061, an action planning unit 1062, and an operation control unit 1063.
  • the analysis unit 1061 analyzes the vehicle 1001 and the surrounding conditions.
  • the analysis unit 1061 includes a self-position estimation unit 1071, a sensor fusion unit 1072, and a recognition unit 1073.
  • the self-position estimation unit 1071 estimates the self-position of the vehicle 1001 based on the sensor data from the external recognition sensor 1025 and the high-precision map stored in the map information storage unit 1023. For example, the self-position estimation unit 1071 generates a local map based on the sensor data from the external recognition sensor 1025, and estimates the self-position of the vehicle 1001 by matching the local map with the high-precision map.
  • The position of the vehicle 1001 is based on, for example, the center of the rear wheel axle.
  • The local map is, for example, a three-dimensional high-precision map created by using a technology such as SLAM (Simultaneous Localization and Mapping), an occupancy grid map (Occupancy Grid Map), or the like.
  • the three-dimensional high-precision map is, for example, the point cloud map described above.
  • The occupancy grid map is a map that divides the three-dimensional or two-dimensional space around the vehicle 1001 into grid cells of a predetermined size and indicates the occupancy state of objects in units of cells.
  • The occupancy state of an object is indicated by, for example, the presence or absence of the object and its existence probability (see the sketch below).
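  • As a purely illustrative sketch (the patent does not specify an implementation), a two-dimensional occupancy grid map can be represented as an array of occupancy probabilities over fixed-size cells centered on the vehicle:

        import numpy as np

        class OccupancyGridMap:
            def __init__(self, size_m: float = 100.0, cell_m: float = 0.5):
                n = int(size_m / cell_m)
                self.cell_m = cell_m
                self.origin = n // 2                # vehicle at the centre of the grid
                self.prob = np.full((n, n), 0.5)    # 0.5 = unknown occupancy

            def update(self, x_m: float, y_m: float, occupied_prob: float):
                i = self.origin + int(round(x_m / self.cell_m))
                j = self.origin + int(round(y_m / self.cell_m))
                if 0 <= i < self.prob.shape[0] and 0 <= j < self.prob.shape[1]:
                    self.prob[i, j] = occupied_prob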
  • the local map is also used, for example, in the detection process and the recognition process of the external situation of the vehicle 1001 by the recognition unit 1073.
  • the self-position estimation unit 1071 may estimate the self-position of the vehicle 1001 based on the GNSS signal and the sensor data from the vehicle sensor 1027.
  • The sensor fusion unit 1072 performs sensor fusion processing to obtain new information by combining multiple different types of sensor data (for example, image data supplied from the camera 1051 and sensor data supplied from the radar 1052). Methods for combining different types of sensor data include integration, fusion, and association.
  • the recognition unit 1073 performs detection processing and recognition processing of the external situation of the vehicle 1001.
  • the recognition unit 1073 performs detection processing and recognition processing of the external situation of the vehicle 1001 based on the information from the external recognition sensor 1025, the information from the self-position estimation unit 1071, the information from the sensor fusion unit 1072, and the like. ..
  • the recognition unit 1073 performs detection processing, recognition processing, and the like of objects around the vehicle 1001.
  • the object detection process is, for example, a process of detecting the presence / absence, size, shape, position, movement, etc. of an object.
  • the object recognition process is, for example, a process of recognizing an attribute such as an object type or identifying a specific object.
  • the detection process and the recognition process are not always clearly separated and may overlap.
  • For example, the recognition unit 1073 detects objects around the vehicle 1001 by performing clustering, which groups the point cloud based on sensor data from the LiDAR, the radar, or the like into clusters of points. As a result, the presence or absence, size, shape, and position of objects around the vehicle 1001 are detected.
  • Further, for example, the recognition unit 1073 detects the movement of objects around the vehicle 1001 by performing tracking, which follows the movement of the point cloud clusters obtained by the clustering. As a result, the velocity and traveling direction (motion vector) of each object around the vehicle 1001 are detected (a clustering sketch follows below).
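  • The sketch below illustrates the clustering step only; DBSCAN is a stand-in algorithm chosen for illustration, since the patent does not name a specific clustering method.

        import numpy as np
        from sklearn.cluster import DBSCAN

        def cluster_point_cloud(points_xyz: np.ndarray, eps: float = 0.5, min_samples: int = 10):
            """Groups LiDAR/radar points into object candidates; returns one array per cluster."""
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xyz)
            return [points_xyz[labels == k] for k in set(labels) if k != -1]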
  • the recognition unit 1073 recognizes the type of an object around the vehicle 1001 by performing an object recognition process such as semantic segmentation on the image data supplied from the camera 1051.
  • the object to be detected or recognized is assumed to be, for example, a vehicle, a person, a bicycle, an obstacle, a structure, a road, a traffic light, a traffic sign, a road sign, or the like.
  • For example, the recognition unit 1073 performs recognition processing of the traffic rules around the vehicle 1001 based on the map stored in the map information storage unit 1023, the self-position estimation result, and the recognition results for objects around the vehicle 1001.
  • By this processing, for example, the position and state of traffic lights, the contents of traffic signs and road markings, the contents of traffic regulations, the lanes in which the vehicle can travel, and the like are recognized.
  • the recognition unit 1073 performs recognition processing of the environment around the vehicle 1001.
  • the surrounding environment to be recognized for example, weather, temperature, humidity, brightness, road surface condition, and the like are assumed.
  • the action planning unit 1062 creates an action plan for the vehicle 1001. For example, the action planning unit 1062 creates an action plan by performing route planning and route tracking processing.
  • route planning is a process of planning a rough route from the start to the goal.
  • Route planning also includes trajectory planning (local path planning), that is, generating, within the route planned by the route planning, a trajectory on which the vehicle 1001 can travel safely and smoothly in its vicinity, taking the motion characteristics of the vehicle 1001 into consideration.
  • Route tracking is a process of planning an operation for safely and accurately traveling on a route planned by route planning within a planned time. For example, the target speed and the target angular velocity of the vehicle 1001 are calculated.
  • the motion control unit 1063 controls the motion of the vehicle 1001 in order to realize the action plan created by the action plan unit 1062.
  • For example, the motion control unit 1063 controls the steering control unit 1081, the brake control unit 1082, and the drive control unit 1083 so that the vehicle 1001 travels along the trajectory calculated by the trajectory planning.
  • the motion control unit 1063 performs coordinated control for the purpose of realizing ADAS functions such as collision avoidance or impact mitigation, follow-up running, vehicle speed maintenance running, collision warning of own vehicle, and lane deviation warning of own vehicle.
  • the motion control unit 1063 performs coordinated control for the purpose of automatic driving or the like that autonomously travels without being operated by the driver.
  • the DMS1073 performs driver authentication processing, driver status recognition processing, and the like based on sensor data from the in-vehicle sensor 1026 and input data input to HMI1031.
  • As the state of the driver to be recognized for example, physical condition, alertness, concentration, fatigue, line-of-sight direction, drunkenness, driving operation, posture, and the like are assumed.
  • the DMS 1030 may perform authentication processing for occupants other than the driver and recognition processing of the state of those occupants. Further, for example, the DMS 1030 may perform recognition processing of the situation inside the vehicle based on the sensor data from the in-vehicle sensor 1026. As the situation inside the vehicle to be recognized, for example, temperature, humidity, brightness, odor, and the like are assumed.
  • the HMI 1031 is used for inputting various data and instructions, generates an input signal based on the input data and instructions, and supplies the input signal to each part of the vehicle control system 1011.
  • the HMI 1031 includes operation devices such as a touch panel, buttons, a microphone, switches, and levers, as well as operation devices that accept input by methods other than manual operation, such as voice or gesture.
  • the HMI 1031 may also include, for example, a remote control device using infrared rays or other radio waves, or an externally connected device such as a mobile device or a wearable device that supports the operation of the vehicle control system 1011.
  • the HMI 1031 performs output control for generating and outputting visual information, auditory information, and tactile information for the passenger or the outside of the vehicle, and for controlling output contents, output timing, output method, and the like.
  • the visual information is, for example, information indicated by an image or light, such as an operation screen, a state display of the vehicle 1001, a warning display, or a monitor image showing the situation around the vehicle 1001.
  • the auditory information is, for example, information indicated by sound, such as voice guidance, a warning sound, or a warning message.
  • the tactile information is information given to the passenger's tactile sensation by, for example, force, vibration, movement, or the like.
  • As a device that outputs visual information, for example, a display device, a projector, a navigation device, an instrument panel, a CMS (Camera Monitoring System), an electronic mirror, a lamp, or the like is assumed.
  • The display device may be, in addition to a device having a normal display, a device that displays visual information in the occupant's field of view, such as a head-up display, a transmissive display, or a wearable device having an AR (Augmented Reality) function.
  • As a device that outputs auditory information, for example, an audio speaker, headphones, earphones, or the like is assumed.
  • As a device that outputs tactile information, for example, a haptic element using haptics technology or the like is assumed.
  • the haptic element is provided on, for example, a steering wheel, a seat, or the like.
  • the vehicle control unit 1032 controls each part of the vehicle 1001.
  • the vehicle control unit 1032 includes a steering control unit 1081, a brake control unit 1082, a drive control unit 1083, a body system control unit 1084, a light control unit 1085, and a horn control unit 1086.
  • the steering control unit 1081 detects and controls the state of the steering system of the vehicle 1001.
  • the steering system includes, for example, a steering mechanism including a steering wheel, electric power steering, and the like.
  • the steering control unit 1081 includes, for example, a control unit such as an ECU that controls the steering system, an actuator that drives the steering system, and the like.
  • the brake control unit 1082 detects and controls the state of the brake system of the vehicle 1001.
  • the brake system includes, for example, a brake mechanism including a brake pedal and the like, ABS (Antilock Brake System) and the like.
  • the brake control unit 1082 includes, for example, a control unit such as an ECU that controls the brake system, an actuator that drives the brake system, and the like.
  • the drive control unit 1083 detects and controls the state of the drive system of the vehicle 1001.
  • the drive system includes, for example, an accelerator pedal, a driving force generation device for generating a driving force, such as an internal combustion engine or a drive motor, a driving force transmission mechanism for transmitting the driving force to the wheels, and the like.
  • the drive control unit 1083 includes, for example, a control unit such as an ECU that controls the drive system, an actuator that drives the drive system, and the like.
  • the body system control unit 1084 detects and controls the state of the body system of the vehicle 1001.
  • the body system includes, for example, a keyless entry system, a smart key system, a power window device, a power seat, an air conditioner, an airbag, a seat belt, a shift lever, and the like.
  • the body system control unit 1084 includes, for example, a control unit such as an ECU that controls the body system, an actuator that drives the body system, and the like.
  • the light control unit 1085 detects and controls various light states of the vehicle 1001. As the light to be controlled, for example, a headlight, a backlight, a fog light, a turn signal, a brake light, a projection, a bumper display, or the like is assumed.
  • the light control unit 1085 includes a control unit such as an ECU that controls the light, an actuator that drives the light, and the like.
  • the horn control unit 1086 detects and controls the state of the car horn of the vehicle 1001.
  • the horn control unit 1086 includes, for example, a control unit such as an ECU that controls the car horn, an actuator that drives the car horn, and the like.
  • FIG. 14 is a diagram showing an example of the sensing regions of the camera 1051, the radar 1052, the LiDAR 1053, and the ultrasonic sensor 1054 of the external recognition sensor 1025 of FIG. 13.
  • the sensing area 1101F and the sensing area 1101B show an example of the sensing area of the ultrasonic sensor 1054.
  • the sensing region 1101F covers the vicinity of the front end of the vehicle 1001.
  • the sensing region 1101B covers the periphery of the rear end of the vehicle 1001.
  • the sensing results in the sensing area 1101F and the sensing area 1101B are used, for example, for parking support of the vehicle 1001 and the like.
  • the sensing area 1102F to the sensing area 1102B show an example of the sensing area of the radar 1052 for a short distance or a medium distance.
  • the sensing area 1102F covers a position farther than the sensing area 1101F in front of the vehicle 1001.
  • the sensing region 1102B covers the rear of the vehicle 1001 to a position farther than the sensing region 1101B.
  • the sensing area 1102L covers the rear periphery of the left side surface of the vehicle 1001.
  • the sensing region 1102R covers the rear periphery of the right side surface of the vehicle 1001.
  • the sensing result in the sensing area 1102F is used, for example, to detect a vehicle, a pedestrian, or the like existing in front of the vehicle 1001.
  • the sensing result in the sensing region 1102B is used, for example, for a collision prevention function behind the vehicle 1001.
  • the sensing results in the sensing area 1102L and the sensing area 1102R are used, for example, for detecting an object in a blind spot on the side of the vehicle 1001.
  • the sensing area 1103F to the sensing area 1103B show an example of the sensing area by the camera 1051.
  • the sensing area 1103F covers a position farther than the sensing area 1102F in front of the vehicle 1001.
  • the sensing region 1103B covers the rear of the vehicle 1001 to a position farther than the sensing region 1102B.
  • the sensing area 1103L covers the periphery of the left side surface of the vehicle 1001.
  • the sensing region 1103R covers the periphery of the right side surface of the vehicle 1001.
  • the sensing result in the sensing area 1103F is used, for example, for recognition of traffic lights and traffic signs, lane departure prevention support system, and the like.
  • the sensing result in the sensing area 1103B is used, for example, for parking assistance, a surround view system, and the like.
  • the sensing results in the sensing area 1103L and the sensing area 1103R are used, for example, in a surround view system or the like.
  • the sensing area 1104 shows an example of the sensing area of LiDAR1053.
  • the sensing region 1104 covers a position farther than the sensing region 1103F in front of the vehicle 1001.
  • the sensing area 1104 has a narrower range in the left-right direction than the sensing area 1103F.
  • the sensing result in the sensing area 1104 is used, for example, for emergency braking, collision avoidance, pedestrian detection, and the like.
  • the sensing area 1105 shows an example of the sensing area of the radar 1052 for a long distance.
  • the sensing region 1105 covers a position farther than the sensing region 1104 in front of the vehicle 1001.
  • the sensing region 1105 has a narrower range in the left-right direction than the sensing region 1104.
  • the sensing result in the sensing area 1105 is used for, for example, ACC (Adaptive Cruise Control) or the like.
  • The sensing region of each sensor may have various configurations other than those shown in FIG. 14. Specifically, the ultrasonic sensor 1054 may also sense the sides of the vehicle 1001, or the LiDAR 1053 may sense the rear of the vehicle 1001.
  • The cross-fusion CNN 121 can be applied to the DMS 1030, the HMI 1031, the sensor fusion unit 1072, the recognition unit 1073, and the like.
  • For example, the information displayed by the HMI 1031 can be expanded.
  • For example, when the HMI 1031 includes a HUD (Head Up Display), information that is difficult for the driver to recognize can be emphasized or added.
  • For example, in fog, it is possible to emphasize and display the subtle contrast of an object or the like in the field of view in front of the vehicle 1001.
  • For example, the cross-fusion CNN 121 can be applied to the recognition unit 1073 and the HMI 1031 to fuse the image data supplied from the camera 1051 and the LiDAR data supplied from the LiDAR 1053 and present the result to the driver.
  • Image data is input from the camera 1051 to the HVC 131 of the cross-fusion CNN 121, and LiDAR data is input from the LiDAR 1053 to the MVC 132.
  • The cross-fusion CNN 121 is trained using correct fused image data (ground truth fused image data) in which image data and LiDAR data are fused.
  • The correct fused image data is, for example, data in which a label indicating whether or not a feature based on the LiDAR data is salient is added to image data in which features detected based on the LiDAR data are emphasized or added.
  • The cross-fusion CNN 121 fuses the image data and the LiDAR data by adding features based on the LiDAR data to the features of the image data to which the HVC 131 (and thus the driver) pays attention. Then, the HMI 1031 displays an image in which the image data and the LiDAR data are fused.
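  • As one concrete illustration of this HMI use case, the sketch below shows how LiDAR-derived features could be emphasized in proportion to an attention-map-like output assumed to be available from the HVC branch, before the fused frame is shown on a display. This is a minimal sketch, not the implementation described here; the function name, tensor shapes, and the [0, 1] value ranges are assumptions.

```python
# Minimal illustrative sketch (not the patent's implementation): emphasise LiDAR-derived
# features in proportion to the HVC attention map before showing the frame on the HMI.
# Tensor shapes, the [0, 1] value range, and the gain factor are assumptions.
import torch

def fuse_for_display(camera_img, lidar_feature, hvc_attention, gain=0.5):
    """camera_img:    (3, H, W) RGB image in [0, 1]
    lidar_feature: (1, H, W) LiDAR-derived feature map (e.g. depth edges) in [0, 1]
    hvc_attention: (1, H, W) attention map from the HVC branch in [0, 1]"""
    # Weight the LiDAR features by how strongly the HVC (and thus the driver) attends there.
    emphasis = gain * hvc_attention * lidar_feature       # (1, H, W)
    fused = torch.clamp(camera_img + emphasis, 0.0, 1.0)  # broadcasts over the RGB channels
    return fused

# Example with a 480x640 frame
cam = torch.rand(3, 480, 640)
lidar = torch.rand(1, 480, 640)
att = torch.rand(1, 480, 640)
display_frame = fuse_for_display(cam, lidar, att)
```

  • A complementary variant could instead weight the LiDAR features by (1 - attention) to add information in regions the driver is not attending to, as mentioned for the output data in the configurations listed earlier.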
  • FIG. 15 is a block diagram showing a configuration example of the hardware of a computer that executes the above-described series of processes by a program.
  • In the computer 2000, a CPU (Central Processing Unit) 2001, a ROM (Read Only Memory) 2002, and a RAM (Random Access Memory) 2003 are interconnected by a bus 2004.
  • An input / output interface 2005 is further connected to the bus 2004.
  • An input unit 2006, an output unit 2007, a recording unit 2008, a communication unit 2009, and a drive 2010 are connected to the input / output interface 2005.
  • the input unit 2006 includes an input switch, a button, a microphone, an image pickup device, and the like.
  • the output unit 2007 includes a display, a speaker, and the like.
  • the recording unit 2008 includes a hard disk, a non-volatile memory, and the like.
  • the communication unit 2009 includes a network interface and the like.
  • the drive 2010 drives a removable media 2011 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • In the computer 2000, the CPU 2001 loads the program recorded in the recording unit 2008 into the RAM 2003 via the input/output interface 2005 and the bus 2004 and executes it, whereby the above-described series of processes is performed.
  • the program executed by the computer 2000 can be recorded and provided on the removable media 2011 as a package media or the like, for example.
  • the program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
  • the program can be installed in the recording unit 2008 via the input / output interface 2005 by mounting the removable media 2011 in the drive 2010. Further, the program can be received by the communication unit 2009 via a wired or wireless transmission medium and installed in the recording unit 2008. In addition, the program can be installed in ROM 2002 or the recording unit 2008 in advance.
  • The program executed by the computer may be a program in which processing is performed in chronological order according to the order described in this specification, or may be a program in which processing is performed in parallel or at necessary timing, such as when a call is made.
  • In this specification, a system means a set of a plurality of components (devices, modules (parts), etc.), regardless of whether or not all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
  • the embodiment of the present technology is not limited to the above-described embodiment, and various changes can be made without departing from the gist of the present technology.
  • this technology can take a cloud computing configuration in which one function is shared by multiple devices via a network and processed jointly.
  • each step described in the above flowchart can be executed by one device or shared by a plurality of devices.
  • the plurality of processes included in the one step can be executed by one device or shared by a plurality of devices.
  • the present technology can also have the following configurations.
  • An information processing device including: a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system; and a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • The information processing device according to (1) above, in which the one CNN is the first CNN and the other CNN is the second CNN.
  • The information processing device described above, in which the transferred information includes at least one of a feature map, an attention map, and a region proposal.
  • the second CNN outputs data influenced by the first CNN.
  • the output data of the second CNN includes data showing a difference between the processing of the first CNN and the processing of the second CNN.
  • The information processing device according to (5) or (6) above, in which the output data of the second CNN includes image data obtained by synthesizing, in a region of the input image to which the first CNN is not paying attention, an image obtained by performing predetermined image processing on the input image by the second CNN.
  • The information processing device according to (11) above, in which the cross-fusion parameters include a parameter indicating the connection relationship between the layers of the first CNN and the layers of the second CNN, and a parameter indicating the type of information transferred between the layers of the first CNN and the layers of the second CNN.
  • the information processing apparatus according to any one of (1) to (12), wherein the transferred information is used for processing the arbitrary layer of the other CNN.
  • The information processing device according to any one of (1) to (13) above, in which image data and data indicating the state, processing, or function of the human visual system are input to the first CNN, and image data is input to the second CNN.
  • An information processing method including: connecting an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • A program that causes a computer to execute processing of connecting an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • A learning method in which, for a cross-fused CNN in which an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN, the first CNN is trained before being combined with the second CNN, and, when training is performed with the first CNN and the second CNN combined, the second CNN is trained and the configuration and parameters of the first CNN are not changed.
  • 101 information processing system, 111 sensor unit, 112 human visual processing sensor, 113 processing unit, 114 training data set generation unit, 121 cross-fusion CNN, 131 HVC, 132 MVC, 1001 vehicle, 1030 DMS, 1031 HMI, 1072 sensor fusion unit, 1073 recognition unit

Abstract

The present technology relates to an information processing device, an information processing method, a program, and a learning method which make it possible to improve the performance of a model provided with the same function as that of the human vision system. The information processing device comprises: a first Convolution Neural Network (CNN) which implements the same process and function as those of the human vision system; and a second CNN which implements a similar function to that of the first CNN through a different process from the first CNN, wherein an arbitrary layer of the first CNN is connected with an arbitrary layer of the second CNN, and information is transmitted from the arbitrary layer of one of the CNNs to the arbitrary layer of the other CNN. The present technology can be applied to, for example, a system that executes object recognition.

Description

Information processing device, information processing method, program, and learning method
 本技術は、情報処理装置、情報処理方法、プログラム、及び、学習方法に関し、特に、人間の視覚システムと同様の機能を備えるモデルの性能を向上させるようにした情報処理装置、情報処理方法、プログラム、及び、学習方法に関する。 This technology relates to information processing devices, information processing methods, programs, and learning methods, and in particular, information processing devices, information processing methods, and programs designed to improve the performance of models having functions similar to those of human visual systems. , And learning methods.
 従来、複数のセンサによる測距方法で得られる情報から、物体までの距離が複数の距離それぞれである距離尤度を算出し、学習モデルを用いて、複数の測距方法についての距離尤度を統合し、複数の距離それぞれの統合尤度を求めることが提案されている(例えば、特許文献1参照)。 Conventionally, the distance likelihood for each of the multiple distances to the object is calculated from the information obtained by the distance measuring method using multiple sensors, and the learning model is used to determine the distance likelihood for the multiple distance measuring methods. It has been proposed to integrate and obtain the integration likelihood of each of a plurality of distances (see, for example, Patent Document 1).
国際公開第2017/057056号International Publication No. 2017/057056
 ところで、人間の視覚システム(Human vision system)は、CNN(Convolutional Neural Network、畳み込みニューラルネットワーク)により効果的にモデル化できることが知られている。例えば、物体認識の実行中に人間の脳内で行われる演算抽象化(Computational abstraction)の各層にCNNの各層を良好に対応付けることが可能である。例えば、人間の視覚システムをモデル化したCNNの初期層に、エッジ検出などの網膜で実行される演算機能と同様の機能を実行させることができる。 By the way, it is known that the human vision system can be effectively modeled by CNN (Convolutional Neural Network). For example, it is possible to satisfactorily associate each layer of CNN with each layer of computational abstraction performed in the human brain during execution of object recognition. For example, the initial layer of a CNN modeled on a human visual system can be made to perform functions similar to those performed on the retina, such as edge detection.
 一方、特許文献1では、人間の視覚システムをモデル化することは検討されていない。 On the other hand, Patent Document 1 does not consider modeling a human visual system.
 本技術は、このような状況に鑑みてなされたものであり、人間の視覚システムと同様の機能を備えるモデルの性能を向上させるようにするものである。 This technology was made in view of such a situation, and is intended to improve the performance of a model having the same functions as a human visual system.
 The information processing device of the first aspect of the present technology includes a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system, and a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 In the information processing method of the first aspect of the present technology, an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 The program of the first aspect of the present technology causes a computer to execute processing of connecting an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 In the learning method of the second aspect of the present technology, for a cross-fused CNN in which an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN, the first CNN is trained before being combined with the second CNN, and, when training is performed with the first CNN and the second CNN combined, the second CNN is trained and the configuration and parameters of the first CNN are not changed.
 In the first aspect of the present technology, an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system and an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 In the second aspect of the present technology, in a cross-fused CNN in which an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN, the first CNN is trained before being combined with the second CNN, and, when training is performed with the first CNN and the second CNN combined, the second CNN is trained and the configuration and parameters of the first CNN are not changed.
FIG. 1 is a block diagram showing an embodiment of an information processing system to which the present technology is applied.
FIG. 2 is a diagram showing a configuration example of the cross-fusion CNN.
FIG. 3 is a diagram showing a configuration example of the HVC.
FIG. 4 is a diagram showing the distribution of the visual cortex and the like of the human brain.
FIG. 5 is a diagram showing a configuration example of the MVC.
FIG. 6 is a diagram showing a configuration example of the cross-fusion CNN.
FIG. 7 is a flowchart for explaining the training process.
FIG. 8 is a block diagram showing a configuration example of the functions of the information processing system in the training stage of the HVC.
FIG. 9 is a block diagram showing a configuration example of the functions of the information processing system in the training stage of the cross-fusion CNN.
FIG. 10 is a diagram for explaining an example of the HVC training method.
FIG. 11 is a diagram for explaining information processing.
FIG. 12 is a block diagram showing a configuration example of the functions of the information processing system in the execution stage of the cross-fusion CNN.
FIG. 13 is a block diagram showing a configuration example of a vehicle control system.
FIG. 14 is a diagram showing an example of sensing regions.
FIG. 15 is a block diagram showing a configuration example of a computer.
 以下、本技術を実施するための形態について説明する。説明は以下の順序で行う。
 1.用語の定義
 2.実施の形態
 3.変形例
 4.本技術の効果と応用例
 5.その他
Hereinafter, a mode for implementing the present technology will be described. The explanation will be given in the following order.
1. Definition of terms
2. Embodiment
3. Modification examples
4. Effects and application examples of the present technology
5. Others
 <<1.用語の定義>>
 以下、本明細書で用いる用語の定義を行う。
<< 1. Definition of terms >>
Hereinafter, the terms used in the present specification will be defined.
  <HVC(ヒューマンビジョンCNN(Human Vision CNN))>
 HVCは、人間の視覚システムと同様の処理及び機能、すなわち、人間の脳内で起こる処理及び機能と同様の処理及び機能を実現するCNNである。HVCは、光の検出から画像の知覚までの全てのステップを含み、潜在的に物体認識等の高レベルの機能を含む人間の視覚システムのモデルとして提供される。
<HVC (Human Vision CNN)>
HVC is a CNN that realizes the same processing and function as the human visual system, that is, the same processing and function as the processing and function occurring in the human brain. HVC is provided as a model of a human visual system that includes all steps from light detection to image perception and potentially includes high level functions such as object recognition.
  <MVC(マシンビジョンCNN(Machine Vision CNN))>
 MVCは、人間の視覚システムの処理をモデル化するという制限を設けずに、HVC(すなわち、人間の視覚システム)と異なる処理により、HVCと類似する機能を実現するCNNである。なお、以下、MVCにより実現される視覚システムをマシンの視覚システムと称する。
<MVC (Machine Vision CNN)>
MVC is a CNN that realizes a function similar to HVC by processing different from HVC (that is, human visual system) without the limitation of modeling the processing of human visual system. Hereinafter, the visual system realized by MVC will be referred to as a machine visual system.
  <交差融合(Cross fusion)>
 交差融合とは、独立したアーキテクチャ及びパラメータを備える複数のCNNを組み合わせ、組み合わせた状態でトレーニングすることにより、複合CNNアーキテクチャ(Combined CNN architecture)を生成することにより、複数のCNNを融合することをいう。以下、複数のCNNを交差融合したCNNアーキテクチャを交差融合CNN(Cross fused CNN)と称する。
<Cross fusion>
Cross fusion refers to fusing a plurality of CNNs by combining CNNs that have independent architectures and parameters and training them in the combined state, thereby generating a combined CNN architecture (Combined CNN architecture). Hereinafter, a CNN architecture in which a plurality of CNNs are cross-fused is referred to as a cross-fused CNN (Cross fused CNN).
 For example, in order to connect the processing structures of different CNNs at a plurality of convolutional layers, a simple cross connection input is added between one CNN and another CNN. This makes it possible to transfer related information between arbitrary intermediate layers (for example, convolutional layers) of different CNNs and to efficiently fuse information from different CNNs at any relevant level of abstraction. As a result, disparity detection between human visual processing and machine visual processing, efficient image fusion, and the like are realized in real time.
  <交差接続(Cross connection)>
 交差接続は、トレーニング可能なスカラ値により、接続先のCNN内の層の機能に影響を与える接続である。交差接続は、例えば、CNNの各畳み込み層からの出力であってもよいし、又は、畳み込み層のサブセットの層からの出力であってもよい。
<Cross connection>
A cross-connection is a connection that affects the function of the layer within the destination CNN by means of a trainable scalar value. The cross-connection may be, for example, the output from each convolutional layer of the CNN, or the output from a layer of a subset of the convolutional layers.
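 A cross connection as defined above could be sketched as follows: a trainable scalar gates a feature map taken from a layer of one CNN before it is added to the input of a layer of the other CNN. This is an illustrative sketch only; the module name and the 1x1 projection used to match channel counts are assumptions, and the spatial sizes of the two feature maps are assumed to already agree.

```python
# Illustrative sketch of a cross connection; module name and 1x1 projection are assumptions.
import torch
import torch.nn as nn

class CrossConnection(nn.Module):
    def __init__(self, src_channels, dst_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))                    # trainable scalar
        self.project = nn.Conv2d(src_channels, dst_channels, kernel_size=1)

    def forward(self, dst_input, src_feature):
        # src_feature comes from an arbitrary layer of the other CNN (e.g. HVC -> MVC);
        # the scalar controls how strongly it influences the destination layer.
        return dst_input + self.scale * self.project(src_feature)
```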
 <<2.実施の形態>>
 次に、本技術の実施の形態について説明する。
<< 2. Embodiment >>
Next, an embodiment of the present technology will be described.
  <視覚処理システム>
 図1は、本技術を適用した情報処理システム101の一実施の形態を示すブロック図である。
<Visual processing system>
FIG. 1 is a block diagram showing an embodiment of an information processing system 101 to which the present technology is applied.
 情報処理システム101は、センサ部111、人間視覚処理センサ(Human visual processing sensors)112、処理ユニット(Processing unit)113、トレーニングデータセット生成部114、及び、トレーニングデータベース115を備える。 The information processing system 101 includes a sensor unit 111, a human visual processing sensors 112, a processing unit 113, a training data set generation unit 114, and a training database 115.
 センサ部111は、例えば、イメージセンサ、LiDAR(Light Detection And Ranging)等の光学センサを備える。センサ部111は、光学センサにより収集されたデータに基づいて、視覚情報(例えば、画像データ)を生成する。 The sensor unit 111 includes, for example, an image sensor and an optical sensor such as LiDAR (Light Detection And Ringing). The sensor unit 111 generates visual information (for example, image data) based on the data collected by the optical sensor.
 人間視覚処理センサ112は、情報処理システム101を使用する人間(ユーザ)の視覚システムの状態、処理、又は、機能を示すデータ(以下、人間視覚処理データと称する)を収集する。人間視覚処理センサ112は、例えば、AR(Augmented Reality)ヘッドセットに設けられる。 The human visual processing sensor 112 collects data (hereinafter referred to as human visual processing data) indicating the state, processing, or function of the human (user) visual system using the information processing system 101. The human visual processing sensor 112 is provided in, for example, an AR (Augmented Reality) headset.
 人間視覚処理データは、例えば、以下のデータを含む。 Human visual processing data includes, for example, the following data.
・視線パラメータ
・瞳孔ダイナミクス
・推定視野(Estimated field of view)
・疲労やカフェイン摂取量等の生理状態データ
・EEG(Electroencephalogram、脳波)データ
・Line-of-sight parameters
・Pupil dynamics
・Estimated field of view
・Physiological state data such as fatigue and caffeine intake
・EEG (electroencephalogram) data
 例えば、ユーザの視覚システムの機能は、ユーザの視線の特徴や、耳に配置されたEEGシステムにより異なる処理レベルで測定することが可能である。また、EEGシステムの中には、人間が心の中で形成している画像を再構築可能なものも存在する。 For example, the function of the user's visual system can be measured at different processing levels depending on the characteristics of the user's line of sight and the EEG system placed in the ear. In addition, some EEG systems can reconstruct the images that human beings form in their minds.
 The processing unit 113 may be configured by, for example, a general-purpose processor such as a CPU, or may be configured by a processor or the like optimized for the information processing system 101. The processing unit 113 realizes the cross-fusion CNN 121.
 交差融合CNN121は、図2に示されるように、HVC131とMVC132を交差融合したものである。交差融合CNN121は、HVC131に関する各種のパラメータ、MVC131に関する各種のパラメータ、並びに、HVC131とMVC132との間の交差接続に関する交差融合パラメータにより表される。 Cross-fusion CNN121 is a cross-fusion of HVC131 and MVC132 as shown in FIG. The cross-fusion CNN 121 is represented by various parameters for the HVC 131, various parameters for the MVC 131, and cross-fusion parameters for the cross connection between the HVC 131 and the MVC 132.
 交差融合パラメータは、HVC131とMVC132の融合方法を示し、例えば、以下のパラメータを含む。 The cross fusion parameter indicates a fusion method of HVC131 and MVC132, and includes, for example, the following parameters.
・交差接続数
・交差接続の原点の層番号
・交差接続の宛先の層番号
・交差接続のタイプ
・Number of cross connections
・Layer number of the origin of each cross connection
・Layer number of the destination of each cross connection
・Type of each cross connection
 交差接続数、交差接続の原点の層番号、及び、交差接続の宛先の層番号により、HVC131とMVC132の接続関係が示される。 The connection relationship between HVC131 and MVC132 is indicated by the number of crossed connections, the layer number of the origin of the crossed connection, and the layer number of the destination of the crossed connection.
 交差接続のタイプは、交差接続により一方のCNNの中間層(例えば、畳み込み層)から転送され、他方のCNNの中間層に入力される情報のタイプを示す。例えば、交差接続のタイプには、特徴マップ(Feature maps)、注目マップ(Attention maps)、領域提案(Region Proposal)等がある。転送された情報は、転送先の層の処理に用いられる。 The type of cross-connection indicates the type of information that is transferred from one CNN intermediate layer (eg, a convolution layer) by cross-connection and input to the other CNN middle layer. For example, the types of cross-connection include feature maps (Feature maps), attention maps (Attention maps), region proposals (Region Proposal), and the like. The transferred information is used for processing the transfer destination layer.
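 For illustration, the cross-fusion parameters listed above (the number of cross connections, the origin and destination layer numbers, and the type of transferred information) could be held in a small configuration structure such as the following sketch; all field names are assumptions, not part of the described system.

```python
# Illustrative sketch only; field names and the set of information types are assumptions.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class CrossConnectionSpec:
    origin_layer: int        # layer number in the source CNN
    destination_layer: int   # layer number in the destination CNN
    info_type: Literal["feature_map", "attention_map", "region_proposal"]

@dataclass
class CrossFusionParams:
    connections: List[CrossConnectionSpec]   # one entry per cross connection

    @property
    def num_connections(self) -> int:
        return len(self.connections)

# Example: two HVC -> MVC connections
params = CrossFusionParams(connections=[
    CrossConnectionSpec(origin_layer=1, destination_layer=2, info_type="feature_map"),
    CrossConnectionSpec(origin_layer=3, destination_layer=4, info_type="attention_map"),
])
```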
 特徴マップは、例えば、CNNの各畳み込み層から出力される画像データであり、画像データ内の各画素の特徴量を示す。 The feature map is, for example, image data output from each convolution layer of CNN, and shows the feature amount of each pixel in the image data.
 注目マップは、特別な形式の特徴マップであり、例えば、物体認識等に重要な領域(注目領域)をヒートマップ等の形式で表すものである。例えば、注目マップにおいて、重要度が高い領域の画素ほど赤に近い色になり、重要度が低い領域の画素ほど青に近い色になる。 The attention map is a feature map of a special format, for example, a region (attention region) important for object recognition or the like is represented in a format such as a heat map. For example, in a map of interest, pixels in a region of high importance have a color closer to red, and pixels in a region of less importance have a color closer to blue.
 領域提案は、例えば、画像内において物体が存在する可能性が高い領域を、矩形の枠等で示す画像データである。領域提案は、例えば、他のアルゴリズムにおいて、画像内に物体が存在するか否か、又は、画像内に何の物体が存在するか等の判断に用いられる。また、領域提案は、例えば、特定のタイプの物体検出に用いられる。 The area proposal is, for example, image data showing an area in an image where an object is likely to exist with a rectangular frame or the like. The region proposal is used, for example, in another algorithm to determine whether or not an object is present in the image, or what object is present in the image. Region proposals are also used, for example, for specific types of object detection.
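 As a simple illustration of how an attention map could be derived from a feature map for transfer over a cross connection, the sketch below averages a feature map over its channels and min-max normalizes the result; the reduction actually used is not specified in the text, so this is an assumption.

```python
# Illustrative sketch: derive a (1, H, W) attention map from a (C, H, W) feature map by
# channel-averaging and min-max normalisation; the actual reduction is an assumption.
import torch

def attention_from_feature_map(feature_map, eps=1e-6):
    att = feature_map.mean(dim=0, keepdim=True)              # average over channels
    att = (att - att.min()) / (att.max() - att.min() + eps)  # scale to [0, 1]
    return att
```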
 交差融合CNN121は、例えば、HVC131とMVC132の処理の差異を拡大したり、強調したりする機能を実現する。また、交差融合CNN121は、HVC131の処理(すなわち、人間の視覚処理)とMVC132の処理(すなわち、マシンの視覚処理)の差異を削減する機能を実現する。 The cross-fusion CNN121 realizes, for example, a function of expanding or emphasizing the processing difference between HVC131 and MVC132. Further, the cross-fusion CNN 121 realizes a function of reducing the difference between the processing of the HVC 131 (that is, the visual processing of a human) and the processing of the MVC 132 (that is, the visual processing of a machine).
 HVC131は、上述したように、標準的なCNNと異なり、人間の視覚システムと同様の処理及び機能を実現するCNNである。なお、現時点で人間の視覚システムのモデルは進化を続けているが、HVC131には、例えば、最先端の人間の視覚システムのモデルを適用することができる。 As described above, the HVC 131 is a CNN that realizes the same processing and functions as a human visual system, unlike a standard CNN. At present, the model of the human visual system continues to evolve, but for example, a state-of-the-art model of the human visual system can be applied to the HVC 131.
 一方、現在の人間の視覚システムのモデルの技術レベルを考慮すると、HVC131は、例えば、以下の構成を含みうる。 On the other hand, considering the technical level of the current model of the human visual system, the HVC 131 may include, for example, the following configuration.
 The independent functional modules of the HVC 131 can be provided with an architecture that reflects different functional areas of the human visual system, such as the primary visual cortex (V1) to the fifth visual cortex (V5), for example, as shown in FIG. 3.
 なお、図4は、人間の脳の概要図を示している。図4には、人間の脳におけるV1乃至V5の分布を模式的に示している。また、図4には、人間の眼201、及び、顔認識を行う領域202の位置が図示されている。 Note that FIG. 4 shows a schematic diagram of the human brain. FIG. 4 schematically shows the distribution of V1 to V5 in the human brain. Further, FIG. 4 shows the positions of the human eye 201 and the region 202 for face recognition.
 In the example of FIG. 3, each convolutional layer of the HVC 131 realizes a function similar to that of V1 to V5 of the human brain. Each convolutional layer of the HVC 131 outputs image data obtained by imaging a signal similar to the signal output from V1 to V5 of the human brain (hereinafter referred to as visual cortex data V1 to visual cortex data V5), and the output is input to the next layer of the HVC 131 or to the MVC 132.
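 A minimal sketch of such an HVC-style backbone is shown below: five convolutional stages stand in for V1 to V5, and each stage's output is exposed so that it can be compared with correct visual cortex data or transferred to the MVC 132 over a cross connection. The channel widths and layer structure are assumptions and do not correspond to the actual HVC 131 architecture.

```python
# Illustrative sketch only; channel widths and layer structure are assumptions.
import torch
import torch.nn as nn

class HVCSketch(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 64, 96, 128]    # assumed channel widths
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU())
            for i in range(5)               # five stages standing in for V1..V5
        ])

    def forward(self, image):
        # image: (N, 3, H, W); returns per-stage outputs ("visual cortex data" V1..V5)
        cortex_maps = {}
        x = image
        for i, stage in enumerate(self.stages, start=1):
            x = stage(x)
            cortex_maps[f"V{i}"] = x
        return cortex_maps
```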
 HVC131が備える機能モジュールは、直列に接続することも可能であるし、非線形に接続することも可能である。 The functional modules included in the HVC 131 can be connected in series or non-linearly.
 特に、フィードフォワード型のネットワークは、視覚野内の最初の200msの処理に最適なモデルを実現することができる。このようなモデルは、人間の霊長類視覚野(Primate visual cortex)の特定の部分のみを活性化することが可能な画像を生成することが実証されている。 In particular, the feedforward type network can realize the optimum model for the processing of the first 200 ms in the visual field. Such models have been demonstrated to produce images capable of activating only specific parts of the human primate visual cortex.
 また、HVC131には、例えば、再起(Recurrence)、すなわち、出力を自身の入力に戻すニューロンを用いることが可能である。これにより、例えば、霊長類視覚野の神経アーキテクチャ(Primate visual cortex neural architecture)及び機能的性能の調整を改善することができる。 Further, for the HVC 131, for example, it is possible to use a neuron that returns the output to its own input, that is, recurrence. This can, for example, improve the adjustment of the neural architecture (Primate visual cortex neural architecture) and functional performance of the primate visual cortex.
 さらに、HVC131には、例えば、スパイキングニューラルネットワーク等のニューロモーフィック(neuromorphic)コンピューティングのアーキテクチャを適用することができる。これにより、例えば、人間の視覚システムの機能をより正確に再現することができる。 Furthermore, a neuromorphic computing architecture such as a spiking neural network can be applied to the HVC 131. This makes it possible, for example, to more accurately reproduce the functions of the human visual system.
 また、HVC131は、一般的な人間の視覚システムをモデル化したものであってもよいし、特定の個人の視覚システムをモデル化したものであってもよい。 Further, the HVC 131 may be a model of a general human visual system or a model of a specific individual visual system.
 さらに、例えば、HVC131が対応する認識処理を限定するようにしてもよい。例えば、HVC131の認識対象とする物体を限定するようにしてもよい。これにより、HVC131の性能が向上する場合がある。 Further, for example, the recognition process supported by the HVC 131 may be limited. For example, the object to be recognized by the HVC 131 may be limited. This may improve the performance of the HVC 131.
 また、HVC131の出力データ(以下、HVC出力と称する)の内容は、HVC131又は交差融合CNN121の用途等により異なる。例えば、HVC131に入力された画像データに対して、画像内の物体の種類を示すラベルを付与したラベル付き画像データが、HVC出力として出力される。 Further, the content of the output data of the HVC 131 (hereinafter referred to as the HVC output) differs depending on the application of the HVC 131 or the cross fusion CNN 121. For example, labeled image data to which a label indicating the type of an object in the image is attached to the image data input to the HVC 131 is output as HVC output.
 MVC132は、上述したように、人間の視覚システムの処理をモデル化するという制限を設けずに、HVC131と異なる処理により、HVC131と類似する機能を実現するCNNである。 As described above, the MVC 132 is a CNN that realizes a function similar to that of the HVC 131 by processing different from the HVC 131 without setting the limitation of modeling the processing of the human visual system.
 なお、後述するように、HVC131は、MVC132と組み合わされる前、すなわち、交差融合CNN121に実装される前にトレーニングされる。そして、HVC131の構成及び重み等のパラメータは、交差融合CNN121に実装された後は変更されない。 As will be described later, the HVC 131 is trained before being combined with the MVC 132, that is, before being mounted on the cross-fusion CNN 121. The parameters such as the configuration and weight of the HVC 131 are not changed after being implemented in the cross-fusion CNN 121.
 一方、MVC132は、HVC131と組み合わされる前、すなわち、交差融合CNN121に実装される前にはトレーニングされず、交差融合CNN121に実装された後、すなわち、HVC131と組み合わされた後にトレーニングされる。すなわち、MVC132の構成及び重み等のパラメータは、交差融合CNN121に実装された後に調整される。 On the other hand, the MVC 132 is not trained before being combined with the HVC 131, i.e., before being mounted on the cross-fused CNN 121, but after being mounted on the cross-fused CNN 121, i.e., after being combined with the HVC 131. That is, parameters such as the configuration and weight of the MVC 132 are adjusted after being mounted on the cross-fusion CNN 121.
 また、HVC131とMVC132との間の交差融合パラメータは、予め定義されたパラメータセットに基づいて固定されていてもよいし、交差融合CNN121のトレーニング中に変更されてもよい。例えば、前者の場合、HVC131とMVC132との間の接続関係は変更されないが、後者の場合、HVC131とMVC132との間の接続関係が変更される場合がある。 Also, the cross-fusion parameters between the HVC 131 and the MVC 132 may be fixed based on a predefined parameter set or may be changed during training of the cross-fusion CNN 121. For example, in the former case, the connection relationship between the HVC 131 and the MVC 132 is not changed, but in the latter case, the connection relationship between the HVC 131 and the MVC 132 may be changed.
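 The training order described above (the HVC trained and then frozen; only the MVC and, where allowed, the cross-fusion parameters updated once the networks are combined) might be set up as in the following sketch; hvc, mvc, and cross_connections are assumed module names, and the choice of optimizer is an assumption.

```python
# Illustrative sketch: freeze the pre-trained HVC so its configuration and weights are not
# changed, and optimise only the MVC and the cross-connection parameters.
import torch

def configure_cross_fusion_training(hvc, mvc, cross_connections, lr=1e-4):
    for p in hvc.parameters():
        p.requires_grad = False          # the HVC stays fixed inside the fused model
    hvc.eval()

    trainable = list(mvc.parameters()) + list(cross_connections.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```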
 さらに、交差融合CNN121では、必ず、MVC132がHVC131の処理の影響を受けるように、HVC131からMVC132に情報が転送される。すなわち、HVC131からMVC132への交差接続が必ず存在する。これにより、HVC131の各層が、交差接続を介して、MVC132の各層に独立して影響を与えることにより、マシンの視覚システムの処理が、人間の視覚システムの処理に依存するようになる。 Further, in the cross-fusion CNN 121, information is always transferred from the HVC 131 to the MVC 132 so that the MVC 132 is affected by the processing of the HVC 131. That is, there is always a cross connection from the HVC 131 to the MVC 132. As a result, each layer of the HVC 131 independently affects each layer of the MVC 132 via the cross connection, so that the processing of the visual system of the machine depends on the processing of the human visual system.
 例えば、図5に示されるように、HVC131の各畳み込み層から出力される視覚野データV1乃至視覚野データV5が、MVC132に転送される。 For example, as shown in FIG. 5, the visual cortex data V1 to the visual cortex data V5 output from each convolution layer of the HVC 131 is transferred to the MVC 132.
 一方、MVC132からHVC131への情報の転送を行うか否かは任意である。MVC132からHVC131への情報の転送は、例えば、HVC131のクラス分類の結果を改善するために、MVC132からの情報をHVC131に入力する場所を決定する場合に有用である。また、例えば、人間の認識機能を改善するために、どのように画像を拡張するのかを決定するのに役立つ。 On the other hand, it is optional whether or not to transfer information from MVC132 to HVC131. The transfer of information from the MVC 132 to the HVC 131 is useful, for example, in determining where to enter the information from the MVC 132 into the HVC 131 in order to improve the results of the classification of the HVC 131. It also helps determine, for example, how to enhance the image to improve human cognitive function.
 As shown in FIG. 2 or FIG. 6, image data to be processed is input to the HVC 131 and the MVC 132 of the cross-fusion CNN 121. The types of image data input to the HVC 131 and to the MVC 132 may be the same or different. For example, image data captured by a camera may be input to the HVC 131, and image data obtained by imaging data collected by LiDAR may be input to the MVC 132. Further, the human visual processing data collected by the human visual processing sensor 112 is input to the HVC 131 as needed.
 HVC131とMVC132は、個別に画像処理を行う。そして、HVC131は、必要に応じて、HVC出力を出力する。一方、MVC132は、HVC131の影響を受けた出力データ、すなわち、人間の視覚システムの機能にある程度依存した出力データ(以下、HVC感化出力(HVC influenced output)と称する)を出力する。 HVC131 and MVC132 perform image processing individually. Then, the HVC 131 outputs the HVC output as needed. On the other hand, the MVC 132 outputs output data influenced by the HVC 131, that is, output data that depends to some extent on the functions of the human visual system (hereinafter referred to as HVC-influenced output).
 HVC感化出力の内容は、交差融合CNN121の用途等により異なる。例えば、HVC感化出力は、HVC出力と同様のデータを含む。また、例えば、HVC感化出力は、HVC131との処理とMVC132の処理の一致点、差異等を示す画像データを含む。この画像データは、例えば、MVC132が注目し、HVC131が注目していない画像の特徴示す注目マップに類似する。また、例えば、HVC感化出力は、物体分類等の画像処理の結果を示す画像データを含む。 The content of the HVC-sensitized output differs depending on the application of the cross-fusion CNN121 and the like. For example, the HVC-sensitized output contains the same data as the HVC output. Further, for example, the HVC-sensitized output includes image data indicating a coincidence point, a difference, etc. between the processing with the HVC 131 and the processing with the MVC 132. This image data is similar to, for example, a focus map featuring an image that the MVC 132 is paying attention to and the HVC 131 is not paying attention to. Further, for example, the HVC-sensitized output includes image data showing the result of image processing such as object classification.
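 As a minimal illustration of an HVC-influenced output that highlights the difference between the two visual systems, the sketch below keeps only the regions that the MVC attends to but the HVC does not; the subtraction-and-clamp formulation is an assumption, not the described method.

```python
# Illustrative sketch: keep only regions the MVC attends to but the HVC does not.
# Both inputs are assumed to be (1, H, W) attention maps in [0, 1].
import torch

def hvc_mvc_disparity(mvc_attention, hvc_attention):
    return torch.clamp(mvc_attention - hvc_attention, min=0.0)
```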
 The training data set generation unit 114 generates a set of training data used for training the HVC 131 (hereinafter referred to as the training data set for HVC) and a set of training data used for training the cross-fusion CNN 121 (hereinafter referred to as the training data set for cross-fusion CNN). The training data set generation unit 114 stores the generated training data set for HVC and training data set for cross-fusion CNN in the training database 115.
 Note that the training data set generation unit 114 may have a function of giving an object recognition task to a human and collecting a label indicating the result of the human's recognition of objects in an image (hereinafter referred to as a human recognition label).
 物体認識タスクとは、例えば、人間の視覚システムをテストするためのタスクである。具体的には、物体認識タスクとは、例えば、人間に対して所定の時間(例えば、100ms間)画像が提示され、人間が画像内の物体を分類し、分類した結果を示すラベルを付与するタスクである。 The object recognition task is, for example, a task for testing a human visual system. Specifically, in the object recognition task, for example, an image is presented to a human for a predetermined time (for example, for 100 ms), the human classifies the object in the image, and a label indicating the result of the classification is given. It's a task.
  <情報処理システム101の処理>
 次に、情報処理システム101の処理について説明する。
<Processing of information processing system 101>
Next, the processing of the information processing system 101 will be described.
   <トレーニング処理>
 まず、図7のフローチャートを参照して、情報処理システム101により実行されるトレーニング処理について説明する。
<Training process>
First, the training process executed by the information processing system 101 will be described with reference to the flowchart of FIG. 7.
 なお、図8は、HVC131のトレーニング段階における情報処理システム101の機能の構成例を示している。図9は、交差融合CNN121のトレーニング段階における情報処理システム101の機能の構成例を示している。 Note that FIG. 8 shows a configuration example of the function of the information processing system 101 in the training stage of the HVC 131. FIG. 9 shows a configuration example of the function of the information processing system 101 in the training stage of the cross-fusion CNN 121.
 ステップS1において、トレーニングデータセット生成部114は、HVC131用のトレーニングデータセット(すなわち、HVC用トレーニングデータセット)を生成する。 In step S1, the training data set generation unit 114 generates a training data set for HVC 131 (that is, a training data set for HVC).
 例えば、HVC用トレーニングデータセットは、人間によりラベル付けされた画像データ群を含む。このラベルは、例えば、上述した物体認識タスク内で付与される。 For example, a training dataset for HVC includes a set of image data labeled by humans. This label is given, for example, within the object recognition task described above.
 又は、例えば、人間が画像内の物体を正しく又は誤って識別したか否かを判断するビデオのデータベースを用いて、自動的にラベルが付与されてもよい。 Alternatively, the label may be automatically added, for example, using a video database that determines whether a human has correctly or incorrectly identified an object in the image.
 また、HVC用トレーニングデータセットは、必要に応じて、人間視覚処理センサ112により収集される人間視覚処理データを含む。人間視覚処理データは、他のトレーニングデータ(例えば、提示された画像に対応する画像データ)と同期して取得される。すなわち、画像が提示された人間から収集された人間視覚処理データが、提示された画像に対応する画像データと関連付けられる。 Further, the training data set for HVC includes human visual processing data collected by the human visual processing sensor 112, if necessary. Human visual processing data is acquired in synchronization with other training data (eg, image data corresponding to the presented image). That is, the human visual processing data collected from the person to whom the image is presented is associated with the image data corresponding to the presented image.
 なお、例えば、画像の提示と人間の視覚システムの活動との間に存在するタイムラグが考慮されてもよい。例えば、人間の脳のV4が、提示された画像に対して反応するまでには、約100msのタイムラグが存在する。この場合、例えば、人間視覚処理データのうち、人間の脳のV4から取得されるデータは、データを取得する100ms前の提示された画像に対応する画像データに関連付けられる。 Note that, for example, the time lag that exists between the presentation of the image and the activity of the human visual system may be taken into consideration. For example, there is a time lag of about 100 ms before V4 in the human brain reacts to the presented image. In this case, for example, among the human visual processing data, the data acquired from V4 of the human brain is associated with the image data corresponding to the presented image 100 ms before the acquisition of the data.
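 The time-lag handling described above could be sketched as follows: each visual cortex sample (for example, from V4) is associated with the image presented roughly 100 ms earlier. The data layout, tolerance window, and nearest-presentation matching are all assumptions.

```python
# Illustrative sketch: associate each cortical sample (e.g. from V4) with the image that
# was presented about `lag_ms` earlier. `image_events` is assumed to be non-empty.
def align_cortex_samples(image_events, cortex_samples, lag_ms=100.0, tolerance_ms=10.0):
    """image_events:  list of (timestamp_ms, image_id)
    cortex_samples: list of (timestamp_ms, sample)"""
    pairs = []
    for t_sample, sample in cortex_samples:
        target_time = t_sample - lag_ms
        best = min(image_events, key=lambda ev: abs(ev[0] - target_time))
        if abs(best[0] - target_time) <= tolerance_ms:
            pairs.append((best[1], sample))   # (image_id, cortical sample)
    return pairs
```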
 トレーニングデータセット生成部114は、生成したHVC用トレーニングデータをトレーニングデータベース115に格納する。 The training data set generation unit 114 stores the generated training data for HVC in the training database 115.
 ステップS2において、処理ユニット121は、HVC131のトレーニングを行う。 In step S2, the processing unit 121 trains the HVC 131.
 例えば、HVC131は、人間の視覚システムに類似する機能を実現するニューラルネットワークの標準的な方法によりトレーニングされる。 For example, the HVC 131 is trained by the standard method of neural networks that realizes functions similar to the human visual system.
 例えば、HVC用トレーニングデータセット内の画像データに対するHVC131の出力と、当該画像データに対応する画像に対する人間の物体認識タスクの結果(すなわち、画像データに付与されている人間認識ラベル)との一致を成功指標とするトレーニングが行われる。 For example, a match between the output of the HVC 131 for the image data in the training data set for HVC and the result of the human object recognition task for the image corresponding to the image data (that is, the human recognition label attached to the image data). Training is conducted as a success indicator.
 なお、同じ画像に対する人間の視覚システム内の活動マッピングと、HVC131内の機能的活動との類似度を、トレーニングの成功指標に含めるようにしてもよい。 It should be noted that the degree of similarity between the activity mapping in the human visual system for the same image and the functional activity in the HVC 131 may be included in the training success index.
 For example, as shown in FIG. 10, image data is input to the HVC 131, and an image corresponding to the image data is presented to a human wearing the human visual processing sensor 112 (hereinafter referred to as a data provider). The human visual processing sensor 112 collects human visual processing data indicating the data provider's reaction to the presented image and inputs it to the HVC 131. The human visual processing sensor 112 also inputs, to an image reproduction model 251, the signals output from the data provider's V1 to V5 (hereinafter referred to as visual cortex signals V1 to visual cortex signals V5) among the collected human visual processing data.
 画像再生モデル251は、視覚野信号V1乃至視覚野信号V5をそれぞれ画像データ(以下、正解視覚野データV1乃至正解視覚野データV5と称する)に変換して出力する。そして、HVC131の各畳み込み層から出力される視覚野画像データV1乃至視覚野画像データV5と、画像再生モデル251から出力される正解視覚野データV1乃至正解視覚野データV5とがそれぞれ比較される。そして、視覚野画像データV1乃至視覚野画像データV5と正解視覚野データV1乃至正解視覚野データV5との類似度が、成功指標として用いられる。 The image reproduction model 251 converts the visual cortex signals V1 to the visual cortex signals V5 into image data (hereinafter referred to as correct visual cortex data V1 to correct visual cortex data V5) and outputs them. Then, the visual cortex image data V1 to the visual cortex image data V5 output from each convolutional layer of the HVC 131 and the correct visual cortex data V1 to the correct visual cortex data V5 output from the image reproduction model 251 are compared with each other. Then, the degree of similarity between the visual cortex image data V1 to the visual cortex image data V5 and the correct visual cortex data V1 to the correct visual cortex data V5 is used as a success index.
 For example, the above two types of success indicators are combined into a score using a predetermined function or the like, and the success or failure of the processing of the HVC 131 for the input image data is determined based on the result of comparing the calculated score with a predetermined threshold. Then, the configuration and parameters of the HVC 131 are adjusted so that the success rate of the processing of the HVC 131 improves.
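 A possible form of the combined success indicator is sketched below: agreement with the human recognition label is combined with the average similarity between the HVC's visual cortex data V1 to V5 and the correct visual cortex data. The weights, the use of cosine similarity, and the assumption that the two sets of maps share shapes are all assumptions.

```python
# Illustrative sketch of a combined success indicator; weights and similarity metric are
# assumptions, not the predetermined function used by the described system.
import torch.nn.functional as F

def hvc_training_score(predicted_label, human_label, cortex_maps, correct_cortex_maps,
                       w_label=0.5, w_cortex=0.5):
    label_score = 1.0 if predicted_label == human_label else 0.0
    sims = []
    for key in cortex_maps:                       # e.g. "V1" ... "V5"
        a = cortex_maps[key].flatten()
        b = correct_cortex_maps[key].flatten()
        sims.append(F.cosine_similarity(a, b, dim=0).item())
    cortex_score = sum(sims) / len(sims)
    return w_label * label_score + w_cortex * cortex_score

# The processing could be judged successful when the score is at least some threshold,
# e.g. hvc_training_score(...) >= 0.8.
```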
 なお、HVC131は、特定の個人用にトレーニングされてもよいし、一般的な人間用にトレーニングされてもよい。 The HVC 131 may be trained for a specific individual or for a general human being.
 次に、HVC131の性能が、交差融合CNN121に実装される前にテストされる。ここで、HVC131の性能のテスト方法の例について説明する。 Next, the performance of HVC131 is tested before it is mounted on the cross-fusion CNN121. Here, an example of a method for testing the performance of the HVC 131 will be described.
 例えば、人間の視覚システムの特定の領域又はニューロンを最大限に活性化する画像をHVC131に生成させる解釈性技術(Interpretability techniques)が用いられる。例えば、HVC131が生成した画像が、EGGスキャナ、MRI(Magnetic Resonance Imaging)スキャナ、又は、NIR(Near-Infrared Spectroscopy)スキャナのような脳をスキャンする装置を装着した人間に提示される。もしHVC131の処理が有効であれば、画像が提示された人間の脳において、ターゲットとなる領域又はニューロンが活性化される。 For example, Interpretability techniques are used to cause the HVC 131 to generate images that maximize the activation of specific areas or neurons of the human visual system. For example, the image generated by the HVC 131 is presented to a person wearing a device that scans the brain, such as an EGG scanner, an MRI (Magnetic Resonance Imaging) scanner, or a NIR (Near-Infrared Spectroscopy) scanner. If the processing of HVC131 is effective, the target region or neuron is activated in the human brain to which the image is presented.
 Further, for example, a database showing the responses of the human visual cortex to various images may be used for testing the HVC 131. For example, correct visual cortex data V1 to correct visual cortex data V5, obtained by converting human visual cortex signals V1 to V5 for various images, are stored in the database. Then, by the method described above with reference to FIG. 10, the HVC 131 is tested by comparing the visual cortex data V1 to visual cortex data V5 output from the HVC 131 for the various images in the database with the correct visual cortex data V1 to correct visual cortex data V5 in the database.
 さらに、例えば、提示された画像に応じた行動パターンが、人間の内部の物体認識処理にリンクされた行動予測テスト(Behavioral predictivity test)が用いられてもよい。例えば、行動予測テストにおける人間の反応は、複数の認識カテゴリ(例えば、認識漏れ(missed recognitions)、偽要請認識(False positive recognitions)等)に分類される。そして、HVC131の認識結果と、行動予測テストにおける人間の反応との類似度が、HVC131のテストに用いられる。 Further, for example, a behavior prediction test (Behavioral predictivity test) in which a behavior pattern corresponding to a presented image is linked to an object recognition process inside a human being may be used. For example, human reactions in behavior prediction tests are classified into multiple recognition categories (eg, missed recognitions, False positive recognitions, etc.). Then, the degree of similarity between the recognition result of HVC131 and the human reaction in the behavior prediction test is used for the test of HVC131.
 そして、例えば、以上のテストのスコア(精度)が事前に設定された閾値以上である場合、HVC131が、交差融合CNN121に実装される。一方、テストのスコアが閾値未満である場合、HVC131の再トレーニングが行われる。 Then, for example, when the score (accuracy) of the above test is equal to or higher than the preset threshold value, the HVC 131 is mounted on the cross-fusion CNN 121. On the other hand, if the test score is below the threshold, HVC131 is retrained.
 In step S3, the training data set generation unit 114 generates a training data set for the cross-fusion CNN 121 (cross-fusion CNN training data set).
 The method of generating the cross-fusion CNN training data set depends on the function to be realized by the cross-fusion CNN 121 (for example, on the content of the HVC-sensitized output). Examples of how the cross-fusion CNN training data set is generated are described below.
 For example, when the cross-fusion CNN 121 outputs an object recognition result as the HVC-sensitized output, the HVC training data set may be reused for cross-fusion CNN training. In this case, for example, a ground truth label indicating the actual object type is added to each input image datum in the HVC training data set that already carries a human recognition label. The human recognition label and the ground truth label may or may not match.
 Alternatively, for example, training data in the same form as the HVC training data set may be generated by having the trained HVC 131 assign pseudo human recognition labels to input image data to which only ground truth labels have been assigned.
 Note that the input image data may include both negative examples and positive examples. A negative example is, for example, image data for which the recognition result by a human and the recognition result of the cross-fusion CNN 121 (the HVC-sensitized output) are expected to differ. A positive example is, for example, image data for which the recognition result by a human and the recognition result of the cross-fusion CNN 121 (the HVC-sensitized output) are expected to match.
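 As a concrete illustration of the sample structure described above, a single training record could carry the image together with both labels, which need not agree. The class and field names below are hypothetical and merely mirror the description of the data set.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CrossFusionSample:
    """One training record: an image with a human recognition label and a ground truth label."""
    image: np.ndarray           # input image data
    human_label: str            # what a human reported seeing (human recognition label)
    ground_truth_label: str     # the actual object type (ground truth label)

# The two labels may or may not match (a mismatch corresponds to a negative example).
sample = CrossFusionSample(image=np.zeros((224, 224, 3), dtype=np.uint8),
                           human_label="dog", ground_truth_label="wolf")
```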
 Also, for example, when the cross-fusion CNN 121 outputs, as the HVC-sensitized output, image data generated by processing the input image data, the cross-fusion CNN training data set includes the input image data and correct image data that is generated from the input image data and has the same format as the HVC-sensitized output.
 Examples of the correct image data are described below.
 For example, when the HVC-sensitized output includes a disparity map, that is, image data showing the differences between human visual processing and machine visual processing, the cross-fusion CNN training data set includes the disparity map as the correct image data.
 For example, attention maps are generated using the trained HVC 131 and the MVC 132 before it is incorporated into the cross-fusion CNN 121. These attention maps are generated, for example, from each neural network (the HVC 131 and the MVC 132) as a whole. It is also possible to use attention maps generated from individual neurons, or from arbitrary combinations of neurons, within each neural network.
 Then, a disparity map is generated by taking the difference between the attention map generated by the HVC 131 and the attention map generated by the MVC 132, and this disparity map is used as the correct image data.
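 A minimal sketch of this subtraction is shown below, assuming both attention maps are two-dimensional arrays defined over the same image. The per-map normalization and the sign convention (machine minus human) are illustrative choices rather than requirements of the present disclosure.

```python
import numpy as np

def disparity_map(hvc_attention: np.ndarray, mvc_attention: np.ndarray) -> np.ndarray:
    """Pixel-wise difference between the machine's and the human model's attention maps."""
    def normalize(a: np.ndarray) -> np.ndarray:
        a = a.astype(np.float32)
        return (a - a.min()) / (a.max() - a.min() + 1e-8)  # scale each map to [0, 1]
    # Positive values mark regions the machine attends to more strongly than the human model.
    return normalize(mvc_attention) - normalize(hvc_attention)
```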
 For example, when the HVC-sensitized output includes AI-enhanced image data that does not significantly affect human decision-making, the cross-fusion CNN training data set includes such AI-enhanced image data as the correct image data.
 For example, predetermined image processing (for example, sharpening, generative data, upscaling, and so on) is applied to the input image data using the MVC 132 before it is incorporated into the cross-fusion CNN 121, thereby generating an AI-enhanced image. Also, for example, predetermined image processing is applied to the input image data using the HVC 131 to generate an attention map estimating where humans would focus in an important recognition task (for example, bleeding detection in surgery).
 Then, based on the generated attention map, the input image is divided into an attention region to which the HVC 131 pays strong attention and a non-attention region to which it does not, that is, into a region where humans are estimated to pay strong attention in the important recognition task and a region where they are not. A composite image is then generated that uses the input image in the attention region and the AI-enhanced image in the non-attention region, and the composite image data corresponding to the generated composite image is used as the correct image data.
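 For illustration, the masking described above could be sketched as follows, assuming the attention map is normalized to the range 0 to 1 and the images are H x W x C arrays; the 0.5 threshold separating the attention region from the non-attention region is hypothetical.

```python
import numpy as np

def composite_image(input_image: np.ndarray, ai_enhanced_image: np.ndarray,
                    attention_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep original pixels where the human model attends strongly; use AI-enhanced pixels elsewhere."""
    attention_region = (attention_map >= threshold)[..., np.newaxis]  # H x W x 1 boolean mask
    return np.where(attention_region, input_image, ai_enhanced_image)
```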
 Further, when human visual processing data is input from the human visual processing sensor 112 to the cross-fusion CNN 121 at the execution stage, the cross-fusion CNN training data also includes human visual processing data.
 The training data set generation unit 114 stores the generated cross-fusion CNN training data set in the training database 115.
 In step S4, the processing unit 121 trains the cross-fusion CNN 121.
 Specifically, the cross-fusion CNN 121 is generated by connecting the trained HVC 131 and the untrained MVC 132 using cross-fusion parameters, in accordance with a predefined architecture.
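 For illustration, such a connection could take the form of two parallel stacks of layers that exchange intermediate feature maps through learnable gates. The sketch below is only one possible realization, not the architecture defined in the present disclosure; it assumes that corresponding layers of the two branches produce feature maps of compatible shapes, and the scalar gates stand in for the cross-fusion parameters.

```python
import torch
import torch.nn as nn

class CrossFusionCNN(nn.Module):
    """Two parallel CNN branches whose intermediate feature maps are exchanged
    through learnable gates standing in for the cross-fusion parameters."""

    def __init__(self, hvc_layers: nn.ModuleList, mvc_layers: nn.ModuleList):
        super().__init__()
        assert len(hvc_layers) == len(mvc_layers)
        self.hvc_layers = hvc_layers   # trained beforehand and kept fixed
        self.mvc_layers = mvc_layers   # trained together with the gates
        self.h2m = nn.Parameter(torch.zeros(len(hvc_layers)))  # HVC -> MVC transfer strength
        self.m2h = nn.Parameter(torch.zeros(len(mvc_layers)))  # MVC -> HVC transfer strength

    def forward(self, x: torch.Tensor):
        h = m = x
        for i, (hvc_layer, mvc_layer) in enumerate(zip(self.hvc_layers, self.mvc_layers)):
            h_out, m_out = hvc_layer(h), mvc_layer(m)
            # Exchange information between the corresponding layers of the two branches.
            h = h_out + self.m2h[i] * m_out
            m = m_out + self.h2m[i] * h_out
        return h, m   # e.g., m (or a head on top of it) would provide the HVC-sensitized output
```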
 In addition, the cross-fusion CNN training data set stored in the training database 115 is divided into a training data set and a test data set. For example, about 80% of the cross-fusion CNN training data set is used as the training data set and the remainder as the test data set.
 The cross-fusion CNN 121 processes the input image data included in the training data set and outputs the HVC-sensitized output.
 Then, for example, a score is assigned based on the result of comparing the HVC-sensitized output with the human recognition label and the ground truth label attached to the input image data. Whether a match between the HVC-sensitized output and the human recognition label is judged a success, and whether a match between the HVC-sensitized output and the ground truth label is judged a success, is determined by, for example, the category of the object to be recognized.
 Also, for example, a score is assigned based on the similarity between the HVC-sensitized output and the correct image data corresponding to the input image data (for example, the similarity of pixel values).
 The above processing is repeated for all the input image data in the training data set, and the configuration and parameters of the cross-fusion CNN 121 are adjusted so as to improve the score. Specifically, the configuration and parameters of the MVC 132 and the cross-fusion parameters are adjusted, while the configuration and parameters of the HVC 131 are left unchanged.
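 In a framework such as PyTorch, keeping the HVC branch fixed while the MVC branch and the cross-fusion parameters are adjusted could be sketched as follows, continuing the hypothetical CrossFusionCNN above; the choice of optimizer and learning rate is illustrative.

```python
import torch

def configure_cross_fusion_training(model: "CrossFusionCNN") -> torch.optim.Optimizer:
    """Freeze the HVC branch; optimize only the MVC branch and the cross-fusion gates."""
    for p in model.hvc_layers.parameters():
        p.requires_grad = False                      # HVC configuration and parameters stay unchanged
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)      # illustrative optimizer and learning rate
```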
 Next, the cross-fusion CNN 121 trained with the training data set is tested using the test data set. In the test stage, the configuration and parameters of the cross-fusion CNN 121 are not adjusted; the same scoring as in the training stage is performed in order to estimate the accuracy of the cross-fusion CNN 121.
 Note that, for example, when the performance of the cross-fusion CNN 121 is judged to be insufficient in the test stage, the process is redone from the training stage.
 The cross-fusion CNN 121 is generated as described above.
 <Information processing>
 Next, the information processing executed by the information processing system 101 will be described with reference to the flowchart of FIG. 11.
 Note that FIG. 12 shows a detailed configuration example of the information processing system 101 at the execution stage.
 In step S51, the cross-fusion CNN 121 acquires image data and human visual processing data.
 Specifically, the image data is input to the HVC 131 and the MVC 132.
 In addition, the human visual processing sensor 112 is worn by the user, collects the user's human visual processing data, and inputs it to the HVC 131. The human visual processing data is of the same type as that used for training the HVC 131 and the cross-fusion CNN 121.
 In step S52, the cross-fusion CNN 121 processes the image data. The cross-fusion CNN 121 then outputs a predetermined HVC-sensitized output, which is the result of processing the image data.
 The human visual processing data is used to synchronize the HVC 131 with the user's current state and behavior. This makes it possible to provide a more accurate model of the human visual system (the HVC 131) and, as a result, to optimize the HVC-sensitized output.
 The visual model processing then ends.
 <<3. Modifications>>
 Modifications of the present technology are described below.
 For example, in the training stage, a plurality of cross-fusion CNNs 121 having different cross-fusion parameters may be trained. Each cross-fusion CNN 121 may then be scored, and the cross-fusion CNN 121 with the highest score may be adopted.
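 A minimal sketch of this selection loop is shown below; train_fn and score_fn stand for the training procedure and the scoring described above and are assumptions, not interfaces defined in the present disclosure.

```python
def select_best_cross_fusion_cnn(candidates, train_fn, score_fn):
    """Train each candidate (built with different cross-fusion parameters) and keep the best one.

    candidates: iterable of candidate models
    train_fn:   callable that trains one model in place (assumed)
    score_fn:   callable returning a scalar score for a trained model (assumed)
    """
    best_model, best_score = None, float("-inf")
    for model in candidates:
        train_fn(model)
        score = score_fn(model)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```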
 Alternatively, for example, training may produce a plurality of cross-fusion CNNs 121 having different permitted deviations from the processing of the HVC 131, and the plurality of cross-fusion CNNs 121 may be used in the information processing system 101. In this way, for example, each cross-fusion CNN 121 can provide a standard user with an output influenced by human visual processing under a different intuitive interpretation.
 Also, for example, a cross-fusion CNN 121 may be generated by combining a plurality of MVCs 132 with a single HVC 131. In this case, for example, a different type of HVC-sensitized output can be output from each MVC 132.
 <<4. Effects and application examples of the present technology>>
 The effects and application examples of the present technology are described below.
 According to the present technology, the human visual system and the machine visual system can be fused effectively. As a result, the performance of a model having functions similar to those of the human visual system can be improved.
 For example, it becomes possible for the HVC 131 and the MVC 132 to influence each other, in one direction or mutually, at an arbitrary semantic level.
 For example, within a controlled range, it is possible to obtain an output in which the machine's visual processing is adjusted by human visual processing.
 For example, the present technology can be suitably applied when humans and machines need to make decisions cooperatively, or when a human must monitor and verify the machine's decisions.
 For example, points of agreement and difference between the processing of the human visual system (the HVC 131) and the processing of the machine visual system (the MVC 132) can be detected and exploited at an arbitrary semantic level.
 Specifically, for example, an output indicating the points of agreement and difference between human and machine visual processing can be obtained. For example, it becomes possible to generate and present a disparity map that emphasizes or differentiates features of an image to which the machine's visual system is paying attention, but the user is not, in relation to the current task.
 For example, the inputs, arbitrary intermediate layers, and outputs of the HVC 131 and the MVC 132 can be accessed individually in order to evaluate the performance of the system. Therefore, for example, the human and machine visual systems can be compared at any semantic processing layer of the CNN. In addition, for example, monitoring of, and confidence in, the parameters related to the human visual system can be ensured.
 This makes it possible, for example, to prevent the choices an AI system makes when interpreting, generating, or manipulating data from deviating from reality. It also makes it possible, for example, to suppress, in a human-machine cooperative system, the phenomenon in which the machine deviates from the human's perception of the scene, causing differences between human and machine decision-making and making it difficult or slow for the human to understand the machine's decisions.
 For example, the cross-fusion CNN 121 can identify image features that are difficult for humans to detect but that affect the machine's visual processing, that is, features that differentiate the machine from the human (differentiated features). For example, even if both the human and the machine visual systems ultimately recognize the same object, it becomes possible to predict differences between the human and the machine at lower semantic levels, such as differences in edge-detection behavior.
 For example, the information perceived by the user can be augmented based on the differences between human and machine visual processing.
 For example, using an AR (Augmented Reality) system or the like, the differentiating features described above can be emphasized and superimposed on the real world in order to improve the user's scene recognition. For example, an interpretability stream can be provided to effectively emphasize features of a scene to which the machine's visual system is paying attention but the human is not.
 For example, in a medical robot or monitoring system, the information perceived by a physician or the like can be augmented. For example, images during surgery can be displayed while emphasizing features that influence the surgeon's decisions but are difficult to notice.
 For example, an image can be augmented without changing features that are important in human visual processing. Specifically, for example, when the user is operating a given apparatus while viewing an image augmented by the machine's visual system (the MVC 132), the information presented to the user is augmented for non-critical tasks, that is, tasks that are not important to the user. On the other hand, the features to which the HVC 131 pays attention are estimated to be important in human visual processing and to matter when the user performs a critical task. Such features are therefore presented to the user without major changes. This maintains the user's accountability for critical tasks.
 Note that the detection of image features that are important to human vision can occur at any semantic level of the CNN and can be difficult to define with rule-based methods. With the present technology, features that are difficult to define by such rule-based methods are preserved without being changed by the MVC 132. Such features may, for example, be features to which humans do not consciously pay attention.
 For example, according to the present technology, the interpretability of, and accountability for, an AI system's processing and results can be improved or customized.
 Specifically, for example, a part of the HVC 131 can be used to verify the performance of the system. Also, for example, automatic domain adaptation can be realized based on the user's experience. Furthermore, the present technology can be applied, for example, to racial bias in face recognition and to medical AR applications.
 Also, for example, the machine's visual system can be harmonized with human visual processing for the sake of the interpretability of, and accountability for, the AI system's processing.
 For example, the cross-fusion CNN 121 is trained with permitted deviations of the CNN's data processing functions at multiple levels. When the permitted deviation is small, the processing steps of the AI system become intuitively understandable to humans. In addition, by paying attention to features that humans do not, multiple cross-fusion CNNs 121 that achieve higher efficiency or accuracy are generated with larger, differing levels of permitted deviation. Then, at the execution stage, a human can control how this "human-like" vision system behaves by selecting the cross-fusion CNN 121 whose level of deviation (intuitive interpretability) is ideal and permitted for the task.
 For example, a physician may select high human coherence for tasks that do not require advanced recognition processing. This guarantees that the AI system's decision criteria resemble those that a human can easily detect, making communication between the human and the AI system easy and quick.
 For example, as described above, the HVC 131 and the MVC 132 are constructed, trained, and fixed individually, and it is guaranteed that the configuration and parameters of the HVC 131 are not changed during the training and execution of the cross-fusion CNN 121. Therefore, for example, an excellent model of the human visual system (the HVC 131) can be used with the guarantee that its configuration and parameters will not be adjusted or changed.
 In addition, for example, a layer of information can be added that cannot be obtained simply by training a CNN with error backpropagation. This can, for example, improve robustness against adversarial attacks. It can also, for example, improve the accuracy of object recognition for heavily occluded objects. Furthermore, it can, for example, improve abstraction performance, compensate for the weaknesses of the human and machine visual systems, and improve recognition accuracy and the like.
 Furthermore, a model to which the present technology is applied has a mapping between human visual processing and machine visual processing. This mapping can be obtained by estimating the mutual activation of neurons in the human and machine visual systems for the same image. Because this mapping is known in the present technology, images can be generated, for example, by a network optimized to produce higher activation in a specific part of the human visual system.
 Also, as this field evolves, the degree of agreement between human and machine neural activation patterns is expected to increase, improving the accuracy of systems to which the present technology is applied. The range of images that a model of the human visual system can handle can also be broadened.
 Furthermore, for example, a natural UI (user interface) can be realized by adding a feedback loop to the user. For example, the responsiveness of the UI can be improved, or the UI can be made more comfortable.
 Also, for example, because the processing of the HVC 131 and the MVC 132 is executed in parallel, processing is faster than in a system implemented in series. It also becomes possible to realize a system that is simpler and more computationally efficient than computing human and machine visual processing separately.
 The present technology can also be applied to mobile apparatuses such as vehicles.
 FIG. 13 is a block diagram showing a configuration example of a vehicle control system 1011, which is an example of a mobile apparatus control system to which the present technology is applied.
 The vehicle control system 1011 is provided in a vehicle 1001 and performs processing related to travel assistance and automated driving of the vehicle 1001.
 The vehicle control system 1011 includes a processor 1021, a communication unit 1022, a map information storage unit 1023, a GNSS (Global Navigation Satellite System) reception unit 1024, an external recognition sensor 1025, an in-vehicle sensor 1026, a vehicle sensor 1027, a recording unit 1028, a travel assistance/automated driving control unit 1029, a DMS (Driver Monitoring System) 1030, an HMI (Human Machine Interface) 1031, and a vehicle control unit 1032.
 The processor 1021, the communication unit 1022, the map information storage unit 1023, the GNSS reception unit 1024, the external recognition sensor 1025, the in-vehicle sensor 1026, the vehicle sensor 1027, the recording unit 1028, the travel assistance/automated driving control unit 1029, the driver monitoring system (DMS) 1030, the human machine interface (HMI) 1031, and the vehicle control unit 1032 are connected to one another via a communication network 1041. The communication network 1041 is constituted by, for example, an in-vehicle communication network or a bus conforming to an arbitrary standard such as CAN (Controller Area Network), LIN (Local Interconnect Network), LAN (Local Area Network), FlexRay (registered trademark), or Ethernet. Note that the units of the vehicle control system 1011 may also be connected directly, without going through the communication network 1041, for example by near field communication (NFC (Near Field Communication)), Bluetooth (registered trademark), or the like.
 Hereinafter, when the units of the vehicle control system 1011 communicate via the communication network 1041, the description of the communication network 1041 is omitted. For example, when the processor 1021 and the communication unit 1022 communicate via the communication network 1041, it is simply described that the processor 1021 and the communication unit 1022 communicate.
 The processor 1021 is constituted by various processors such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), and an ECU (Electronic Control Unit), for example. The processor 1021 controls the vehicle control system 1011 as a whole.
 The communication unit 1022 communicates with various devices inside and outside the vehicle, other vehicles, servers, base stations, and the like, and transmits and receives various kinds of data. As communication with the outside of the vehicle, for example, the communication unit 1022 receives from the outside a program for updating the software that controls the operation of the vehicle control system 1011, map information, traffic information, information about the surroundings of the vehicle 1001, and the like. For example, the communication unit 1022 transmits to the outside information about the vehicle 1001 (for example, data indicating the state of the vehicle 1001, recognition results of the recognition unit 1073, and so on), information about the surroundings of the vehicle 1001, and the like. For example, the communication unit 1022 performs communication supporting a vehicle emergency call system such as eCall.
 The communication method of the communication unit 1022 is not particularly limited, and a plurality of communication methods may be used.
 As communication with the inside of the vehicle, for example, the communication unit 1022 performs wireless communication with in-vehicle devices by a communication method such as wireless LAN, Bluetooth, NFC, or WUSB (Wireless USB). For example, the communication unit 1022 performs wired communication with in-vehicle devices via a connection terminal (and, if necessary, a cable), not shown, using a communication method such as USB (Universal Serial Bus), HDMI (High-Definition Multimedia Interface, registered trademark), or MHL (Mobile High-definition Link).
 Here, an in-vehicle device is, for example, a device in the vehicle that is not connected to the communication network 1041. For example, mobile devices and wearable devices carried by occupants such as the driver, and information devices brought into the vehicle and temporarily installed there, are assumed.
 For example, the communication unit 1022 communicates, via a base station or an access point, with a server or the like on an external network (for example, the Internet, a cloud network, or an operator-specific network) using a wireless communication method such as 4G (fourth-generation mobile communication system), 5G (fifth-generation mobile communication system), LTE (Long Term Evolution), or DSRC (Dedicated Short Range Communications).
 For example, the communication unit 1022 communicates with a terminal in the vicinity of the vehicle (for example, a terminal of a pedestrian or a store, or an MTC (Machine Type Communication) terminal) using P2P (Peer To Peer) technology. For example, the communication unit 1022 performs V2X communication. V2X communication includes, for example, vehicle-to-vehicle communication with other vehicles, vehicle-to-infrastructure communication with roadside units and the like, vehicle-to-home communication, and vehicle-to-pedestrian communication with terminals carried by pedestrians.
 For example, the communication unit 1022 receives electromagnetic waves transmitted by a road traffic information communication system (VICS (Vehicle Information and Communication System), registered trademark), such as radio wave beacons, optical beacons, and FM multiplex broadcasting.
 The map information storage unit 1023 stores maps acquired from the outside and maps created by the vehicle 1001. For example, the map information storage unit 1023 stores a three-dimensional high-precision map, a global map that is less precise than the high-precision map but covers a wider area, and the like.
 The high-precision map is, for example, a dynamic map, a point cloud map, or a vector map (also referred to as an ADAS (Advanced Driver Assistance System) map). The dynamic map is, for example, a map consisting of four layers of dynamic information, semi-dynamic information, semi-static information, and static information, and is provided from an external server or the like. The point cloud map is a map composed of a point cloud (point cloud data). The vector map is a map in which information such as lane and traffic signal positions is associated with the point cloud map. The point cloud map and the vector map may be provided, for example, from an external server or the like, or may be created in the vehicle 1001, based on sensing results from the radar 1052, the LiDAR 1053, and the like, as maps for matching with a local map described later, and stored in the map information storage unit 1023. When a high-precision map is provided from an external server or the like, map data of, for example, several hundred meters square, relating to the planned route that the vehicle 1001 is about to travel, is acquired from the server or the like in order to reduce the communication volume.
 The GNSS reception unit 1024 receives GNSS signals from GNSS satellites and supplies them to the travel assistance/automated driving control unit 1029.
 The external recognition sensor 1025 includes various sensors used for recognizing the situation outside the vehicle 1001, and supplies sensor data from each sensor to the units of the vehicle control system 1011. The types and number of sensors included in the external recognition sensor 1025 are arbitrary.
 For example, the external recognition sensor 1025 includes a camera 1051, a radar 1052, a LiDAR (Light Detection and Ranging, Laser Imaging Detection and Ranging) 1053, and an ultrasonic sensor 1054. The numbers of cameras 1051, radars 1052, LiDARs 1053, and ultrasonic sensors 1054 are arbitrary, and examples of the sensing areas of the respective sensors are described later.
 As the camera 1051, a camera of any imaging method, such as a ToF (Time Of Flight) camera, a stereo camera, a monocular camera, or an infrared camera, is used as needed.
 Also, for example, the external recognition sensor 1025 includes environment sensors for detecting weather, meteorological conditions, brightness, and the like. The environment sensors include, for example, a raindrop sensor, a fog sensor, a sunshine sensor, a snow sensor, an illuminance sensor, and the like.
 Furthermore, for example, the external recognition sensor 1025 includes microphones used for detecting sounds around the vehicle 1001, the positions of sound sources, and the like.
 The in-vehicle sensor 1026 includes various sensors for detecting information inside the vehicle, and supplies sensor data from each sensor to the units of the vehicle control system 1011. The types and number of sensors included in the in-vehicle sensor 1026 are arbitrary.
 For example, the in-vehicle sensor 1026 includes a camera, a radar, a seating sensor, a steering wheel sensor, a microphone, a biometric sensor, and the like. As the camera, a camera of any imaging method, such as a ToF camera, a stereo camera, a monocular camera, or an infrared camera, can be used. The biometric sensor is provided, for example, on a seat, the steering wheel, or the like, and detects various kinds of biometric information of an occupant such as the driver.
 The vehicle sensor 1027 includes various sensors for detecting the state of the vehicle 1001, and supplies sensor data from each sensor to the units of the vehicle control system 1011. The types and number of sensors included in the vehicle sensor 1027 are arbitrary.
 For example, the vehicle sensor 1027 includes a speed sensor, an acceleration sensor, an angular velocity sensor (gyro sensor), and an inertial measurement unit (IMU). For example, the vehicle sensor 1027 includes a steering angle sensor that detects the steering angle of the steering wheel, a yaw rate sensor, an accelerator sensor that detects the operation amount of the accelerator pedal, and a brake sensor that detects the operation amount of the brake pedal. For example, the vehicle sensor 1027 includes a rotation sensor that detects the rotational speed of the engine or motor, an air pressure sensor that detects tire pressure, a slip ratio sensor that detects the tire slip ratio, and a wheel speed sensor that detects the rotational speed of the wheels. For example, the vehicle sensor 1027 includes a battery sensor that detects the remaining charge and temperature of the battery, and an impact sensor that detects impacts from the outside.
 The recording unit 1028 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a magnetic storage device such as an HDD (Hard Disc Drive), a semiconductor storage device, an optical storage device, a magneto-optical storage device, and the like. The recording unit 1028 records various programs, data, and the like used by the units of the vehicle control system 1011. For example, the recording unit 1028 records rosbag files containing messages transmitted and received over the ROS (Robot Operating System) on which application programs related to automated driving run. For example, the recording unit 1028 includes an EDR (Event Data Recorder) and a DSSAD (Data Storage System for Automated Driving), and records information about the vehicle 1001 before and after an event such as an accident.
 The travel assistance/automated driving control unit 1029 controls the travel assistance and automated driving of the vehicle 1001. For example, the travel assistance/automated driving control unit 1029 includes an analysis unit 1061, an action planning unit 1062, and an operation control unit 1063.
 The analysis unit 1061 analyzes the situation of the vehicle 1001 and its surroundings. The analysis unit 1061 includes a self-position estimation unit 1071, a sensor fusion unit 1072, and a recognition unit 1073.
 The self-position estimation unit 1071 estimates the self-position of the vehicle 1001 based on the sensor data from the external recognition sensor 1025 and the high-precision map stored in the map information storage unit 1023. For example, the self-position estimation unit 1071 generates a local map based on the sensor data from the external recognition sensor 1025 and estimates the self-position of the vehicle 1001 by matching the local map against the high-precision map. The position of the vehicle 1001 is referenced, for example, to the center of the rear-wheel axle.
 The local map is, for example, a three-dimensional high-precision map created using a technology such as SLAM (Simultaneous Localization and Mapping), an occupancy grid map, or the like. The three-dimensional high-precision map is, for example, the point cloud map described above. The occupancy grid map is a map that divides the three-dimensional or two-dimensional space around the vehicle 1001 into a grid of cells of a predetermined size and indicates the occupancy state of objects in units of cells. The occupancy state of an object is indicated, for example, by the presence or absence of the object or by its existence probability. The local map is also used, for example, in the detection processing and recognition processing of the situation outside the vehicle 1001 by the recognition unit 1073.
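 As a simple illustration of the occupancy grid idea, a two-dimensional binary grid could be built from sensed point coordinates as follows; the cell size, the extent, and the use of a binary presence/absence value instead of an existence probability are illustrative assumptions.

```python
import numpy as np

def occupancy_grid_2d(points: np.ndarray, cell_size: float = 0.2,
                      extent: float = 20.0) -> np.ndarray:
    """Mark each grid cell around the vehicle that contains at least one sensed point.

    points: N x 2 (or N x 3) array of x, y(, z) coordinates relative to the vehicle.
    """
    n_cells = int(2 * extent / cell_size)
    grid = np.zeros((n_cells, n_cells), dtype=np.uint8)
    indices = np.floor((points[:, :2] + extent) / cell_size).astype(int)
    inside = (indices >= 0).all(axis=1) & (indices < n_cells).all(axis=1)
    grid[indices[inside, 0], indices[inside, 1]] = 1   # 1 = occupied, 0 = free/unknown
    return grid
```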
 Note that the self-position estimation unit 1071 may estimate the self-position of the vehicle 1001 based on GNSS signals and sensor data from the vehicle sensor 1027.
 The sensor fusion unit 1072 performs sensor fusion processing for obtaining new information by combining a plurality of different types of sensor data (for example, image data supplied from the camera 1051 and sensor data supplied from the radar 1052). Methods for combining different types of sensor data include integration, fusion, association, and the like.
 The recognition unit 1073 performs detection processing and recognition processing of the situation outside the vehicle 1001.
 For example, the recognition unit 1073 performs detection processing and recognition processing of the situation outside the vehicle 1001 based on information from the external recognition sensor 1025, information from the self-position estimation unit 1071, information from the sensor fusion unit 1072, and the like.
 Specifically, for example, the recognition unit 1073 performs detection processing, recognition processing, and the like for objects around the vehicle 1001. Object detection processing is, for example, processing for detecting the presence or absence, size, shape, position, motion, and the like of an object. Object recognition processing is, for example, processing for recognizing attributes such as the type of an object or for identifying a specific object. However, detection processing and recognition processing are not necessarily clearly separated and may overlap.
 For example, the recognition unit 1073 detects objects around the vehicle 1001 by clustering, which classifies a point cloud based on sensor data from the LiDAR, the radar, or the like into blocks of points. In this way, the presence or absence, size, shape, and position of objects around the vehicle 1001 are detected.
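 For illustration only, such clustering could be performed with a density-based algorithm such as DBSCAN; the algorithm, library, and parameters in the sketch below are assumptions and are not specified in the present disclosure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_objects_from_point_cloud(points: np.ndarray, eps: float = 0.5,
                                    min_samples: int = 10) -> list:
    """Cluster an N x 3 point cloud and return a rough position and size per cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    objects = []
    for label in set(labels) - {-1}:          # label -1 marks noise points
        cluster = points[labels == label]
        objects.append({
            "position": cluster.mean(axis=0),                    # rough object position
            "size": cluster.max(axis=0) - cluster.min(axis=0),   # rough extent per axis
            "num_points": len(cluster),
        })
    return objects
```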
 For example, the recognition unit 1073 detects the motion of objects around the vehicle 1001 by tracking, which follows the motion of the blocks of points classified by the clustering. In this way, the speed and traveling direction (motion vector) of objects around the vehicle 1001 are detected.
 For example, the recognition unit 1073 recognizes the types of objects around the vehicle 1001 by performing object recognition processing such as semantic segmentation on the image data supplied from the camera 1051.
 Objects to be detected or recognized are assumed to include, for example, vehicles, people, bicycles, obstacles, structures, roads, traffic lights, traffic signs, and road markings.
 For example, the recognition unit 1073 performs recognition processing of the traffic rules around the vehicle 1001 based on the maps stored in the map information storage unit 1023, the estimation result of the self-position, and the recognition results for objects around the vehicle 1001. Through this processing, for example, the positions and states of traffic signals, the contents of traffic signs and road markings, the contents of traffic regulations, the lanes in which the vehicle can travel, and the like are recognized.
 For example, the recognition unit 1073 performs recognition processing of the environment around the vehicle 1001. The surrounding environment to be recognized is assumed to include, for example, weather, temperature, humidity, brightness, road surface conditions, and the like.
 The action planning unit 1062 creates an action plan for the vehicle 1001. For example, the action planning unit 1062 creates the action plan by performing path planning and path following processing.
 Global path planning is the processing of planning a rough path from the start to the goal. This path planning also includes what is called trajectory planning, that is, local path planning that generates, along the path planned by the global path planning, a trajectory on which the vehicle can proceed safely and smoothly in the vicinity of the vehicle 1001 in consideration of the motion characteristics of the vehicle 1001.
 Path following is the processing of planning operations for traveling safely and accurately, within the planned time, along the path planned by the path planning. For example, the target speed and target angular velocity of the vehicle 1001 are calculated.
 The operation control unit 1063 controls the operation of the vehicle 1001 in order to realize the action plan created by the action planning unit 1062.
 For example, the operation control unit 1063 controls the steering control unit 1081, the brake control unit 1082, and the drive control unit 1083 to perform acceleration/deceleration control and direction control so that the vehicle 1001 proceeds along the trajectory calculated by the trajectory planning. For example, the operation control unit 1063 performs cooperative control aimed at realizing ADAS functions such as collision avoidance or impact mitigation, following travel, constant-speed travel, collision warnings for the own vehicle, and lane departure warnings for the own vehicle. For example, the operation control unit 1063 performs cooperative control aimed at automated driving or the like in which the vehicle travels autonomously without depending on the driver's operation.
 The DMS 1030 performs driver authentication processing, driver state recognition processing, and the like based on sensor data from the in-vehicle sensor 1026, input data input to the HMI 1031, and the like. The driver states to be recognized are assumed to include, for example, physical condition, alertness, concentration, fatigue, gaze direction, degree of intoxication, driving operation, posture, and the like.
 Note that the DMS 1030 may perform authentication processing for occupants other than the driver and recognition processing of the states of those occupants. Also, for example, the DMS 1030 may perform recognition processing of the situation inside the vehicle based on sensor data from the in-vehicle sensor 1026. The in-vehicle situations to be recognized are assumed to include, for example, temperature, humidity, brightness, odor, and the like.
 The HMI 1031 is used for inputting various kinds of data, instructions, and the like, generates input signals based on the input data, instructions, and the like, and supplies them to the units of the vehicle control system 1011. For example, the HMI 1031 includes operation devices such as a touch panel, buttons, a microphone, switches, and levers, as well as operation devices that allow input by methods other than manual operation, such as voice or gestures. Note that the HMI 1031 may be, for example, a remote control device using infrared rays or other radio waves, or an externally connected device such as a mobile device or wearable device that supports the operation of the vehicle control system 1011.
 The HMI 1031 also performs output control for generating and outputting visual information, auditory information, and tactile information for the occupants or for the outside of the vehicle, and for controlling the output content, output timing, output method, and the like. Visual information is, for example, information indicated by images or light, such as an operation screen, a status display of the vehicle 1001, a warning display, or a monitor image showing the situation around the vehicle 1001. Auditory information is, for example, information indicated by sound, such as guidance, warning sounds, and warning messages. Tactile information is, for example, information given to the occupant's sense of touch by force, vibration, motion, or the like.
 As devices that output visual information, for example, a display device, a projector, a navigation device, an instrument panel, a CMS (Camera Monitoring System), an electronic mirror, lamps, and the like are assumed. Besides a device having an ordinary display, the display device may be a device that displays visual information within the occupant's field of view, such as a head-up display, a transmissive display, or a wearable device with an AR (Augmented Reality) function.
 As devices that output auditory information, for example, audio speakers, headphones, earphones, and the like are assumed.
 As devices that output tactile information, for example, haptic elements using haptics technology and the like are assumed. The haptic elements are provided, for example, on the steering wheel, the seats, and the like.
 The vehicle control unit 1032 controls each part of the vehicle 1001. The vehicle control unit 1032 includes a steering control unit 1081, a brake control unit 1082, a drive control unit 1083, a body system control unit 1084, a light control unit 1085, and a horn control unit 1086.
 The steering control unit 1081 detects and controls the state of the steering system of the vehicle 1001. The steering system includes, for example, a steering mechanism with a steering wheel, electric power steering, and the like. The steering control unit 1081 includes, for example, a control unit such as an ECU that controls the steering system, and an actuator that drives the steering system.
 The brake control unit 1082 detects and controls the state of the brake system of the vehicle 1001. The brake system includes, for example, a brake mechanism with a brake pedal, an ABS (Antilock Brake System), and the like. The brake control unit 1082 includes, for example, a control unit such as an ECU that controls the brake system, and an actuator that drives the brake system.
 The drive control unit 1083 detects and controls the state of the drive system of the vehicle 1001. The drive system includes, for example, an accelerator pedal, a driving force generating device such as an internal combustion engine or a drive motor, and a driving force transmission mechanism that transmits the driving force to the wheels. The drive control unit 1083 includes, for example, a control unit such as an ECU that controls the drive system, and an actuator that drives the drive system.
 The body system control unit 1084 detects and controls the state of the body system of the vehicle 1001. The body system includes, for example, a keyless entry system, a smart key system, power windows, power seats, an air conditioner, airbags, seat belts, and a shift lever. The body system control unit 1084 includes, for example, a control unit such as an ECU that controls the body system, and an actuator that drives the body system.
 The light control unit 1085 detects and controls the states of the various lights of the vehicle 1001. The lights to be controlled include, for example, headlights, backlights, fog lights, turn signals, brake lights, projections, and bumper displays. The light control unit 1085 includes a control unit such as an ECU that controls the lights, and an actuator that drives the lights.
 The horn control unit 1086 detects and controls the state of the car horn of the vehicle 1001. The horn control unit 1086 includes, for example, a control unit such as an ECU that controls the car horn, and an actuator that drives the car horn.
 FIG. 14 is a diagram showing examples of the sensing regions of the camera 1051, the radar 1052, the LiDAR 1053, and the ultrasonic sensor 1054 of the external recognition sensor 1025 in FIG. 13.
 The sensing regions 1101F and 1101B are examples of the sensing regions of the ultrasonic sensor 1054. The sensing region 1101F covers the area around the front end of the vehicle 1001, and the sensing region 1101B covers the area around the rear end of the vehicle 1001.
 The sensing results in the sensing regions 1101F and 1101B are used, for example, for parking assistance of the vehicle 1001.
 The sensing regions 1102F to 1102B are examples of the sensing regions of the short-range or medium-range radar 1052. The sensing region 1102F extends farther ahead of the vehicle 1001 than the sensing region 1101F, and the sensing region 1102B extends farther behind the vehicle 1001 than the sensing region 1101B. The sensing region 1102L covers the area behind the left side of the vehicle 1001, and the sensing region 1102R covers the area behind the right side of the vehicle 1001.
 The sensing results in the sensing region 1102F are used, for example, to detect vehicles, pedestrians, and the like in front of the vehicle 1001. The sensing results in the sensing region 1102B are used, for example, for a rear collision prevention function of the vehicle 1001. The sensing results in the sensing regions 1102L and 1102R are used, for example, to detect objects in the blind spots at the sides of the vehicle 1001.
 The sensing regions 1103F to 1103B are examples of the sensing regions of the camera 1051. The sensing region 1103F extends farther ahead of the vehicle 1001 than the sensing region 1102F, and the sensing region 1103B extends farther behind the vehicle 1001 than the sensing region 1102B. The sensing region 1103L covers the area around the left side of the vehicle 1001, and the sensing region 1103R covers the area around the right side of the vehicle 1001.
 The sensing results in the sensing region 1103F are used, for example, for recognizing traffic lights and traffic signs and for lane departure prevention assistance systems. The sensing results in the sensing region 1103B are used, for example, for parking assistance and surround view systems. The sensing results in the sensing regions 1103L and 1103R are used, for example, for surround view systems.
 The sensing region 1104 is an example of the sensing region of the LiDAR 1053. The sensing region 1104 extends farther ahead of the vehicle 1001 than the sensing region 1103F, while its lateral range is narrower than that of the sensing region 1103F.
 The sensing results in the sensing region 1104 are used, for example, for emergency braking, collision avoidance, and pedestrian detection.
 The sensing region 1105 is an example of the sensing region of the long-range radar 1052. The sensing region 1105 extends farther ahead of the vehicle 1001 than the sensing region 1104, while its lateral range is narrower than that of the sensing region 1104.
 The sensing results in the sensing region 1105 are used, for example, for ACC (Adaptive Cruise Control).
 Note that the sensing region of each sensor may be configured in various ways other than those shown in FIG. 14. Specifically, the ultrasonic sensor 1054 may also sense the sides of the vehicle 1001, and the LiDAR 1053 may sense the area behind the vehicle 1001.
 For example, the present technology (the cross-fusion CNN 121) can be applied to the DMS 1030, the HMI 1031, the sensor fusion unit 1072, the recognition unit 1073, and the like.
 This makes it possible, for example, to augment the information displayed by the HMI 1031.
 For example, when AR technology is used to project an image onto the windshield with the HUD (Head Up Display) of the HMI 1031 and superimpose information within the driver's field of view, the cross-fusion CNN 121 can be used to emphasize or add information that is difficult for humans to perceive. For example, in fog, the subtle contrast of objects and the like in the field of view ahead of the vehicle 1001 can be emphasized in the displayed image.
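 As a rough illustration of this kind of HUD enhancement, the sketch below boosts local contrast only where a human-saliency map produced by the HVC branch is low. The function name, the array shapes, and the simple linear contrast model are assumptions made for illustration; the disclosure does not specify a particular enhancement algorithm.

```python
import numpy as np

def enhance_low_saliency_regions(image: np.ndarray,
                                 human_saliency: np.ndarray,
                                 gain: float = 2.0) -> np.ndarray:
    """Boost local contrast where the human-vision branch reports low saliency.

    image:          float32 in [0, 1], shape (H, W, 3).
    human_saliency: float32 in [0, 1], shape (H, W); low values mark regions
                    the HVC (and hence the driver) is likely to miss.
    gain:           maximum contrast amplification in the least salient areas.
    """
    # Per-pixel amplification: 1.0 in salient regions, up to `gain` elsewhere.
    amplification = 1.0 + (gain - 1.0) * (1.0 - human_saliency)[..., None]
    mean_color = image.mean(axis=(0, 1), keepdims=True)
    enhanced = mean_color + (image - mean_color) * amplification
    return np.clip(enhanced, 0.0, 1.0)
```

 In such a use, the HUD could display only the difference between the enhanced image and the original image, so that regions the driver can already see clearly remain untouched.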
 Further, for example, the cross-fusion CNN 121 can be applied to the recognition unit 1073 and the HMI 1031 to fuse the image data supplied from the camera 1051 with the LiDAR data supplied from the LiDAR 1053 and present the result to the driver.
 Specifically, image data from the camera 1051 is input to the HVC 131 of the cross-fusion CNN 121, and LiDAR data from the LiDAR 1053 is input to the MVC 132.
 For example, in the training stage, the cross-fusion CNN 121 is trained using ground-truth fused image data in which the image data and the LiDAR data are fused. The ground-truth fused image data is, for example, image data in which features detected on the basis of the LiDAR data are emphasized or added, together with a label indicating whether each LiDAR-based feature is salient to a human observer.
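 To make the training-data description above concrete, the following sketch builds one hypothetical ground-truth fused sample. The highlighting scheme, the array shapes, and the binary salience label are assumptions chosen for illustration only; they are not taken from the disclosure.

```python
import numpy as np

def make_ground_truth_fused_sample(camera_img: np.ndarray,
                                   lidar_feature_mask: np.ndarray,
                                   human_salient: bool,
                                   highlight: float = 0.3):
    """Build one hypothetical training sample: the camera image with
    LiDAR-detected features emphasized, plus a label saying whether those
    features are already salient to a human observer.

    camera_img:         float32 in [0, 1], shape (H, W, 3).
    lidar_feature_mask: float32 in [0, 1], shape (H, W); 1 where the LiDAR
                        detected a feature (e.g. an obstacle edge).
    """
    fused = np.clip(camera_img + highlight * lidar_feature_mask[..., None],
                    0.0, 1.0)
    label = np.float32(1.0 if human_salient else 0.0)
    return fused, label
```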
 Next, in the execution stage, for example, the cross-fusion CNN 121 fuses the image data and the LiDAR data by adding features based on the LiDAR data to the features of the image data to which the HVC 131 (and therefore the driver) attends. The HMI 1031 then displays the image in which the image data and the LiDAR data have been fused.
 This makes it possible not only to fuse the features of data from multiple sensors more efficiently, but also to present information that is easy for the driver to see.
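 Purely as an illustration of how such a camera/LiDAR cross-fusion might be wired, a minimal PyTorch-style sketch is shown below. The class name, the layer sizes, and the choice of a single-channel attention map as the transferred information are assumptions; the disclosure leaves the concrete architecture and the transferred information open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusionSketch(nn.Module):
    """Two-branch sketch: an HVC branch for camera images and an MVC branch
    for LiDAR data (here a 1-channel range image), with an attention map
    transferred from an HVC layer into an MVC layer."""

    def __init__(self):
        super().__init__()
        # HVC branch (stands in for the pre-trained human-vision CNN).
        self.hvc_l1 = nn.Conv2d(3, 16, 3, padding=1)
        self.hvc_attention = nn.Conv2d(16, 1, 1)    # 1-channel attention map
        # MVC branch (machine-vision CNN over LiDAR range images).
        self.mvc_l1 = nn.Conv2d(1, 16, 3, padding=1)
        self.mvc_l2 = nn.Conv2d(16, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 3, 1)             # fused image for the HMI

    def forward(self, camera_img, lidar_range_img):
        h = F.relu(self.hvc_l1(camera_img))
        attn = torch.sigmoid(self.hvc_attention(h))  # where the HVC "looks"
        m = F.relu(self.mvc_l1(lidar_range_img))
        # Cross-fusion: the HVC attention map modulates an MVC layer.
        attn = F.interpolate(attn, size=m.shape[-2:], mode="bilinear",
                             align_corners=False)
        m = F.relu(self.mvc_l2(m * attn))
        return self.head(m)
```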
<< 5. Others >>
 <Computer configuration example>
 The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the programs constituting the software are installed on a computer. Here, the computer includes a computer built into dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
 FIG. 15 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processes by a program.
 In the computer 2000, a CPU (Central Processing Unit) 2001, a ROM (Read Only Memory) 2002, and a RAM (Random Access Memory) 2003 are connected to one another by a bus 2004.
 An input/output interface 2005 is further connected to the bus 2004. An input unit 2006, an output unit 2007, a recording unit 2008, a communication unit 2009, and a drive 2010 are connected to the input/output interface 2005.
 The input unit 2006 includes input switches, buttons, a microphone, an image sensor, and the like. The output unit 2007 includes a display, speakers, and the like. The recording unit 2008 includes a hard disk, a non-volatile memory, and the like. The communication unit 2009 includes a network interface and the like. The drive 2010 drives removable media 2011 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory.
 In the computer 2000 configured as described above, the series of processes described above is performed by the CPU 2001 loading, for example, a program recorded in the recording unit 2008 into the RAM 2003 via the input/output interface 2005 and the bus 2004, and executing it.
 The program executed by the computer 2000 (CPU 2001) can be provided, for example, recorded on the removable media 2011 as packaged media or the like. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
 In the computer 2000, the program can be installed in the recording unit 2008 via the input/output interface 2005 by mounting the removable media 2011 in the drive 2010. The program can also be received by the communication unit 2009 via a wired or wireless transmission medium and installed in the recording unit 2008. Alternatively, the program can be installed in advance in the ROM 2002 or the recording unit 2008.
 Note that the program executed by the computer may be a program in which processing is performed in time series in the order described in this specification, or a program in which processing is performed in parallel or at necessary timings such as when a call is made.
 In this specification, a system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
 Furthermore, the embodiments of the present technology are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present technology.
 For example, the present technology can have a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.
 In addition, each step described in the above flowcharts can be executed by one device or shared and executed by a plurality of devices.
 Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or shared and executed by a plurality of devices.
 <Examples of configuration combinations>
 The present technology can also have the following configurations.
(1)
 An information processing device including:
 a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system; and
 a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN,
 in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
(2)
 The information processing device according to (1), in which the one CNN is the first CNN and the other CNN is the second CNN.
(3)
 The information processing device according to (2), in which the transferred information includes image data obtained by imaging a signal similar to a signal output from the human visual cortex.
(4)
 The information processing device according to (2) or (3), in which the transferred information includes at least one of a feature map, an attention map, and a region proposal.
(5)
 The information processing device according to any one of (2) to (4), in which the second CNN outputs data influenced by the first CNN.
(6)
 The information processing device according to (5), in which the output data of the second CNN includes data indicating a difference between the processing of the first CNN and the processing of the second CNN.
(7)
 The information processing device according to (5) or (6), in which the output data of the second CNN includes image data in which an image obtained by the second CNN performing predetermined image processing on an input image is combined with a region of the input image to which the first CNN is not attending.
(8)
 The information processing device according to any one of (5) to (7), in which the output data of the second CNN includes data similar to the output data of the first CNN.
(9)
 The information processing device according to any one of (1) to (8), in which the first CNN is trained before being combined with the second CNN, and the configuration and parameters of the first CNN are not changed when training is performed with the first CNN and the second CNN combined.
(10)
 The information processing device according to (9), in which the second CNN is trained when training is performed with the first CNN and the second CNN combined.
(11)
 The information processing device according to (10), in which, when training is performed with the first CNN and the second CNN combined, cross-fusion parameters indicating how the first CNN and the second CNN are fused are adjusted.
(12)
 The information processing device according to (11), in which the cross-fusion parameters include parameters indicating a connection relationship between layers of the first CNN and layers of the second CNN and a type of information transferred between the layers of the first CNN and the layers of the second CNN.
(13)
 The information processing device according to any one of (1) to (12), in which the transferred information is used in the processing of the arbitrary layer of the other CNN.
(14)
 The information processing device according to any one of (1) to (13), in which image data and data indicating a state, processing, or function of the human visual system are input to the first CNN, and image data is input to the second CNN.
(15)
 The information processing device according to any one of (1) to (14), in which information is transferred from the arbitrary layer of the other CNN to the arbitrary layer of the one CNN.
(16)
 An information processing method including: connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
(17)
 A program for causing a computer to execute processing of connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
(18)
 A learning method including: training the first CNN of a cross-fusion CNN before the first CNN is combined with a second CNN, the cross-fusion CNN being a network in which an arbitrary layer of the first CNN (Convolutional Neural Network), which realizes processing and functions similar to those of the human visual system, is connected with an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by processing different from that of the first CNN, and in which information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN; and, when training is performed with the first CNN and the second CNN combined, training the second CNN without changing the configuration and parameters of the first CNN.
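 As a rough sketch of the two-stage training described in configurations (9) to (11) above, the helper below freezes a pre-trained first CNN (HVC) and exposes only the second CNN (MVC) and the cross-fusion parameters to the optimizer. The function name, the optimizer, and the learning rate are illustrative assumptions, not part of the disclosure.

```python
import torch

def configure_cross_fusion_training(hvc: torch.nn.Module,
                                    mvc: torch.nn.Module,
                                    fusion_params: list,
                                    lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze the pre-trained HVC and make only the MVC weights and the
    cross-fusion parameters trainable."""
    for p in hvc.parameters():
        p.requires_grad = False          # the first CNN is left unchanged
    trainable = list(mvc.parameters()) + list(fusion_params)
    return torch.optim.Adam(trainable, lr=lr)
```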
 Note that the effects described in this specification are merely examples and are not limiting; other effects may also be obtained.
 101 information processing system, 111 sensor unit, 112 human visual processing sensor, 113 processing unit, 114 training data set generation unit, 121 cross-fusion CNN, 131 HVC, 132 MVC, 1001 vehicle, 1030 DMS, 1031 HMI, 1072 sensor fusion unit, 1073 recognition unit

Claims (18)

  1.  An information processing device comprising:
      a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system; and
      a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN,
      wherein an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  2.  The information processing device according to claim 1, wherein the one CNN is the first CNN and the other CNN is the second CNN.
  3.  The information processing device according to claim 2, wherein the transferred information includes image data obtained by imaging a signal similar to a signal output from the human visual cortex.
  4.  The information processing device according to claim 2, wherein the transferred information includes at least one of a feature map, an attention map, and a region proposal.
  5.  The information processing device according to claim 2, wherein the second CNN outputs data influenced by the first CNN.
  6.  The information processing device according to claim 5, wherein the output data of the second CNN includes data indicating a difference between the processing of the first CNN and the processing of the second CNN.
  7.  The information processing device according to claim 5, wherein the output data of the second CNN includes image data in which an image obtained by the second CNN performing predetermined image processing on an input image is combined with a region of the input image to which the first CNN is not attending.
  8.  The information processing device according to claim 5, wherein the output data of the second CNN includes data similar to the output data of the first CNN.
  9.  The information processing device according to claim 1, wherein the first CNN is trained before being combined with the second CNN, and the configuration and parameters of the first CNN are not changed when training is performed with the first CNN and the second CNN combined.
  10.  The information processing device according to claim 9, wherein the second CNN is trained when training is performed with the first CNN and the second CNN combined.
  11.  The information processing device according to claim 10, wherein, when training is performed with the first CNN and the second CNN combined, cross-fusion parameters indicating how the first CNN and the second CNN are fused are adjusted.
  12.  The information processing device according to claim 11, wherein the cross-fusion parameters include parameters indicating a connection relationship between layers of the first CNN and layers of the second CNN and a type of information transferred between the layers of the first CNN and the layers of the second CNN.
  13.  The information processing device according to claim 1, wherein the transferred information is used in the processing of the arbitrary layer of the other CNN.
  14.  The information processing device according to claim 1, wherein image data and data indicating a state, processing, or function of the human visual system are input to the first CNN, and image data is input to the second CNN.
  15.  The information processing device according to claim 1, wherein information is transferred from the arbitrary layer of the other CNN to the arbitrary layer of the one CNN.
  16.  An information processing method comprising: connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  17.  A program for causing a computer to execute processing comprising: connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  18.  A learning method comprising: training the first CNN of a cross-fusion CNN before the first CNN is combined with a second CNN, the cross-fusion CNN being a network in which an arbitrary layer of the first CNN (Convolutional Neural Network), which realizes processing and functions similar to those of the human visual system, is connected with an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by processing different from that of the first CNN, and in which information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN; and, when training is performed with the first CNN and the second CNN combined, training the second CNN without changing the configuration and parameters of the first CNN.
PCT/JP2021/018332 2020-05-27 2021-05-14 Information processing device, information processing method, program, and learning method WO2021241261A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-091924 2020-05-27
JP2020091924 2020-05-27

Publications (1)

Publication Number Publication Date
WO2021241261A1 true WO2021241261A1 (en) 2021-12-02

Family

ID=78744037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018332 WO2021241261A1 (en) 2020-05-27 2021-05-14 Information processing device, information processing method, program, and learning method

Country Status (1)

Country Link
WO (1) WO2021241261A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018526711A (en) * 2015-06-03 2018-09-13 インナーアイ リミテッドInnerEye Ltd. Image classification by brain computer interface
JP2019096006A (en) * 2017-11-21 2019-06-20 キヤノン株式会社 Information processing device, and information processing method
US10452959B1 (en) * 2018-07-20 2019-10-22 Synapse Tehnology Corporation Multi-perspective detection of objects
US20200086879A1 (en) * 2018-09-14 2020-03-19 Honda Motor Co., Ltd. Scene classification prediction

Similar Documents

Publication Publication Date Title
JPWO2019069581A1 (en) Image processing device and image processing method
JP6939283B2 (en) Image processing device, image processing method, and program
CN112534487B (en) Information processing apparatus, moving body, information processing method, and program
WO2021241189A1 (en) Information processing device, information processing method, and program
WO2021060018A1 (en) Signal processing device, signal processing method, program, and moving device
JPWO2020116194A1 (en) Information processing device, information processing method, program, mobile control device, and mobile
WO2021241260A1 (en) Information processing device, information processing method, information processing system, and program
WO2019150918A1 (en) Information processing device, information processing method, program, and moving body
WO2021024805A1 (en) Information processing device, information processing method, and program
WO2022024803A1 (en) Training model generation method, information processing device, and information processing system
WO2021241261A1 (en) Information processing device, information processing method, program, and learning method
WO2022004423A1 (en) Information processing device, information processing method, and program
WO2022004448A1 (en) Information processing device, information processing method, information processing system, and program
WO2021193103A1 (en) Information processing device, information processing method, and program
WO2021090897A1 (en) Information processing device, information processing method, and information processing program
WO2020203241A1 (en) Information processing method, program, and information processing device
WO2022107595A1 (en) Information processing device, information processing method, and program
WO2021145227A1 (en) Information processing device, information processing method, and program
WO2023145460A1 (en) Vibration detection system and vibration detection method
WO2023054090A1 (en) Recognition processing device, recognition processing method, and recognition processing system
WO2022014327A1 (en) Information processing device, information processing method, and program
WO2024062976A1 (en) Information processing device and information processing method
WO2023053718A1 (en) Information processing device, information processing method, learning device, learning method, and computer program
WO2023149089A1 (en) Learning device, learning method, and learning program
WO2023032276A1 (en) Information processing device, information processing method, and mobile device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21811807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21811807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP