WO2021241261A1 - Information processing device, information processing method, program, and learning method - Google Patents

Information processing device, information processing method, program, and learning method

Info

Publication number
WO2021241261A1
Authority
WO
WIPO (PCT)
Prior art keywords
cnn
information
information processing
layer
processing
Prior art date
Application number
PCT/JP2021/018332
Other languages
French (fr)
Japanese (ja)
Inventor
Leonardo Ishida Abe (レオナルド イシダアベ)
Christopher Wright (クリストファー ライト)
Bernadette Elliott-Bowman (ベルナデット エリオットボウマン)
Harm Cronie (ハーム クローニー)
Nicholas Walker (ニコラス ウォーカー)
Original Assignee
Sony Group Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Group Corporation
Publication of WO2021241261A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Definitions

  • This technology relates to information processing devices, information processing methods, programs, and learning methods, and in particular to an information processing device, an information processing method, a program, and a learning method designed to improve the performance of a model having functions similar to those of the human visual system.
  • It has been proposed to calculate, from information obtained by distance measuring methods using multiple sensors, a distance likelihood for each of multiple distances to an object, and to use a learning model to integrate the distance likelihoods of the multiple distance measuring methods to obtain an integrated likelihood for each of the plurality of distances (see, for example, Patent Document 1).
  • The human visual system can be effectively modeled by a CNN (Convolutional Neural Network).
  • The initial layers of a CNN modeled on the human visual system can be made to perform functions similar to those performed by the retina, such as edge detection.
  • However, Patent Document 1 does not consider modeling the human visual system.
  • This technology was made in view of such a situation and is intended to improve the performance of a model having functions similar to those of the human visual system.
  • The information processing device of the first aspect of the present technology includes a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system, and a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the information processing method of the first aspect of the present technology, an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system is connected to an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • The program of the first aspect of the present technology causes a computer to execute processing in which an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system is connected to an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the learning method of the second aspect of the present technology, a cross-fused CNN is trained in which an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system is connected to an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the first aspect of the present technology, an arbitrary layer of the first CNN, which realizes processing and functions similar to those of the human visual system, is connected to an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by different processing, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • In the second aspect of the present technology, a cross-fused CNN is trained in which an arbitrary layer of the first CNN (Convolutional Neural Network), which realizes processing and functions similar to those of the human visual system, is connected to an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by different processing, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • The training of the first CNN is performed before it is combined with the second CNN, and when training is performed with the first CNN and the second CNN combined, the second CNN is trained without changing the configuration and parameters of the first CNN.
  • HVC (Human Vision CNN)
  • The HVC is a CNN that realizes processing and functions similar to those of the human visual system, that is, processing and functions similar to those occurring in the human brain.
  • HVC is provided as a model of a human visual system that includes all steps from light detection to image perception and potentially includes high level functions such as object recognition.
  • MVC (Machine Vision CNN)
  • Cross-fusion refers to combining multiple CNNs that have independent architectures and parameters and training them in the combined state, thereby generating a combined CNN architecture.
  • A CNN architecture in which a plurality of CNNs are cross-fused is referred to as a cross-fused CNN.
  • In cross-fusion, a simple cross-connection input is added between one CNN and another CNN.
  • This allows related information to be transferred between arbitrary intermediate layers (for example, convolutional layers) of different CNNs, so that information from different CNNs can be efficiently fused at any relevant level of abstraction.
  • As a result, disparity detection between human visual processing and machine visual processing, and efficient image fusion, can be realized in real time.
  • A cross-connection is a connection that affects the function of a layer within the destination CNN by means of a trainable scalar value.
  • The cross-connection may carry, for example, the output from each convolutional layer of a CNN, or the output from only a subset of the convolutional layers (see the sketch below).
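  • A minimal sketch (an illustration, not the patent's implementation) of such a cross-connection in PyTorch is shown below; it assumes that the transferred feature map already has the same shape as the destination layer's input.

        import torch
        import torch.nn as nn

        class CrossConnection(nn.Module):
            """Adds a feature map transferred from the origin CNN to the input of a
            layer in the destination CNN, weighted by a trainable scalar value."""
            def __init__(self):
                super().__init__()
                self.scale = nn.Parameter(torch.zeros(1))  # trainable scalar

            def forward(self, destination_input, transferred_feature_map):
                # The transferred information modulates the destination layer's input.
                return destination_input + self.scale * transferred_feature_map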
  • FIG. 1 is a block diagram showing an embodiment of an information processing system 101 to which the present technology is applied.
  • The information processing system 101 includes a sensor unit 111, a human visual processing sensor 112, a processing unit 113, a training data set generation unit 114, and a training database 115.
  • The sensor unit 111 includes, for example, an image sensor and an optical sensor such as a LiDAR (Light Detection and Ranging) sensor.
  • the sensor unit 111 generates visual information (for example, image data) based on the data collected by the optical sensor.
  • the human visual processing sensor 112 collects data (hereinafter referred to as human visual processing data) indicating the state, processing, or function of the human (user) visual system using the information processing system 101.
  • the human visual processing sensor 112 is provided in, for example, an AR (Augmented Reality) headset.
  • Human visual processing data includes, for example, the following data.
  • For example, the functions of the user's visual system can be measured at different processing levels using characteristics of the user's line of sight and an EEG system placed in the ear.
  • some EEG systems can reconstruct the images that human beings form in their minds.
  • The processing unit 113 may be configured with, for example, a general-purpose processor such as a CPU, or with a processor optimized for the information processing system 101 or the like.
  • The processing unit 113 realizes the cross-fusion CNN 121.
  • The cross-fusion CNN 121 is a cross-fusion of the HVC 131 and the MVC 132, as shown in FIG.
  • The cross-fusion CNN 121 is represented by various parameters for the HVC 131, various parameters for the MVC 132, and cross-fusion parameters for the cross-connections between the HVC 131 and the MVC 132.
  • the cross fusion parameter indicates a fusion method of HVC131 and MVC132, and includes, for example, the following parameters.
  • connection relationship between HVC131 and MVC132 is indicated by the number of crossed connections, the layer number of the origin of the crossed connection, and the layer number of the destination of the crossed connection.
  • the type of cross-connection indicates the type of information that is transferred from one CNN intermediate layer (eg, a convolution layer) by cross-connection and input to the other CNN middle layer.
  • the types of cross-connection include feature maps (Feature maps), attention maps (Attention maps), region proposals (Region Proposal), and the like.
  • the transferred information is used for processing the transfer destination layer.
  • A feature map is, for example, image data output from a convolutional layer of a CNN, and indicates the feature value of each pixel of the image data.
  • An attention map is a feature map of a special format in which, for example, a region important for object recognition or the like (an attention region) is represented as a heat map. For example, in an attention map, pixels in regions of high importance have colors closer to red, and pixels in regions of lower importance have colors closer to blue.
  • A region proposal is, for example, image data indicating, with a rectangular frame or the like, an area of an image where an object is likely to exist.
  • The region proposal is used, for example, by another algorithm to determine whether an object is present in the image, or what object is present in the image. Region proposals are also used, for example, for detecting specific types of objects. (An illustrative representation of these cross-fusion parameters is sketched below.)
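  • The sketch below shows, for illustration only, one possible way to represent the cross-fusion parameters described above (origin and destination layer numbers and the type of transferred information); the field names are assumptions and do not appear in the patent.

        from dataclasses import dataclass
        from typing import Literal

        @dataclass
        class CrossConnectionSpec:
            origin_layer: int        # layer number in the origin CNN
            destination_layer: int   # layer number in the destination CNN
            kind: Literal["feature_map", "attention_map", "region_proposal"]

        # Example: two cross-connections from the HVC to the MVC; their count is
        # the number of cross-connections.
        cross_fusion_parameters = [
            CrossConnectionSpec(origin_layer=1, destination_layer=1, kind="feature_map"),
            CrossConnectionSpec(origin_layer=3, destination_layer=2, kind="attention_map"),
        ]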
  • the cross-fusion CNN121 realizes, for example, a function of expanding or emphasizing the processing difference between HVC131 and MVC132. Further, the cross-fusion CNN 121 realizes a function of reducing the difference between the processing of the HVC 131 (that is, the visual processing of a human) and the processing of the MVC 132 (that is, the visual processing of a machine).
  • the HVC 131 is a CNN that realizes the same processing and functions as a human visual system, unlike a standard CNN.
  • the model of the human visual system continues to evolve, but for example, a state-of-the-art model of the human visual system can be applied to the HVC 131.
  • the HVC 131 may include, for example, the following configuration.
  • For example, the HVC 131 can be provided with independent functional modules whose architecture reflects different functional areas of the human visual system, such as the primary visual cortex (V1) through the fifth visual cortex (V5), as shown in FIG.
  • FIG. 4 shows a schematic diagram of the human brain.
  • FIG. 4 schematically shows the distribution of V1 to V5 in the human brain. Further, FIG. 4 shows the positions of the human eye 201 and the region 202 for face recognition.
  • Each convolutional layer of the HVC 131 realizes a function similar to that of one of V1 to V5 of the human brain. From each convolutional layer of the HVC 131, image data (hereinafter referred to as visual cortex data V1 to visual cortex data V5) that images a signal similar to the signal output from the corresponding one of V1 to V5 of the human brain is output, and is input to the next layer of the HVC 131 or input to the MVC 132.
  • The functional modules included in the HVC 131 can be connected in series or non-linearly (a feed-forward sketch follows below).
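  • As a purely illustrative sketch (not the patent's architecture), a feed-forward HVC could expose the output of each of its V1 to V5 modules as visual cortex data for use by cross-connections; the channel sizes below are arbitrary assumptions.

        import torch.nn as nn

        class HVC(nn.Module):
            """Feed-forward sketch whose modules mirror visual areas V1 to V5."""
            def __init__(self):
                super().__init__()
                channels = [(3, 16), (16, 32), (32, 64), (64, 64), (64, 64)]  # V1..V5
                self.areas = nn.ModuleList(
                    nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU())
                    for c_in, c_out in channels
                )

            def forward(self, x):
                visual_cortex_data = []          # visual cortex data V1..V5
                for area in self.areas:
                    x = area(x)
                    visual_cortex_data.append(x)
                return x, visual_cortex_data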
  • A feed-forward network can realize an optimal model of the first approximately 200 ms of visual processing.
  • Such models have been demonstrated to produce images capable of activating only specific parts of the primate visual cortex.
  • In the HVC 131, for example, it is also possible to use neurons that feed their output back to their own input, that is, recurrence. This can, for example, improve the fit to the neural architecture of the primate visual cortex and its functional performance.
  • a neuromorphic computing architecture such as a spiking neural network can be applied to the HVC 131. This makes it possible, for example, to more accurately reproduce the functions of the human visual system.
  • the HVC 131 may be a model of a general human visual system or a model of a specific individual visual system.
  • the recognition process supported by the HVC 131 may be limited.
  • the object to be recognized by the HVC 131 may be limited. This may improve the performance of the HVC 131.
  • the content of the output data of the HVC 131 (hereinafter referred to as the HVC output) differs depending on the application of the HVC 131 or the cross fusion CNN 121.
  • For example, labeled image data, in which a label indicating the type of an object in the image is attached to the image data input to the HVC 131, is output as the HVC output.
  • the MVC 132 is a CNN that realizes a function similar to that of the HVC 131 by processing different from the HVC 131 without setting the limitation of modeling the processing of the human visual system.
  • the HVC 131 is trained before being combined with the MVC 132, that is, before being mounted on the cross-fusion CNN 121.
  • the parameters such as the configuration and weight of the HVC 131 are not changed after being implemented in the cross-fusion CNN 121.
  • the MVC 132 is not trained before being combined with the HVC 131, i.e., before being mounted on the cross-fused CNN 121, but after being mounted on the cross-fused CNN 121, i.e., after being combined with the HVC 131. That is, parameters such as the configuration and weight of the MVC 132 are adjusted after being mounted on the cross-fusion CNN 121.
  • the cross-fusion parameters between the HVC 131 and the MVC 132 may be fixed based on a predefined parameter set or may be changed during training of the cross-fusion CNN 121.
  • In the former case, the connection relationship between the HVC 131 and the MVC 132 is not changed, while in the latter case the connection relationship between the HVC 131 and the MVC 132 may be changed.
  • In the cross-fusion CNN 121, information is always transferred from the HVC 131 to the MVC 132 so that the MVC 132 is affected by the processing of the HVC 131. That is, there is always a cross-connection from the HVC 131 to the MVC 132. As a result, each layer of the HVC 131 independently affects layers of the MVC 132 via the cross-connections, so that the processing of the machine visual system depends on the processing of the human visual system.
  • the visual cortex data V1 to the visual cortex data V5 output from each convolution layer of the HVC 131 is transferred to the MVC 132.
  • On the other hand, whether or not to transfer information from the MVC 132 to the HVC 131 is optional.
  • The transfer of information from the MVC 132 to the HVC 131 is useful, for example, for determining where to input information from the MVC 132 into the HVC 131 in order to improve the classification results of the HVC 131. It also helps determine, for example, how to enhance an image to improve human cognitive performance. (A sketch of mounting a frozen HVC together with a trainable MVC follows below.)
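  • The sketch below illustrates, under the assumptions of the earlier sketches, how a trained HVC can be mounted on the cross-fusion CNN with its parameters frozen while the MVC and the cross-fusion (scalar) parameters remain trainable; the forward pass is omitted because the concrete wiring depends on the chosen cross-fusion parameters.

        import torch.nn as nn

        class CrossFusedCNN(nn.Module):
            def __init__(self, hvc: nn.Module, mvc: nn.Module, cross_connections):
                super().__init__()
                self.hvc = hvc
                self.mvc = mvc
                self.cross_connections = nn.ModuleList(cross_connections)
                for p in self.hvc.parameters():
                    p.requires_grad = False  # HVC configuration and weights are never changed

            def trainable_parameters(self):
                # Only the MVC and the cross-fusion parameters are adjusted during training.
                yield from self.mvc.parameters()
                yield from self.cross_connections.parameters()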
  • image data to be processed is input to the HVC 131 and MVC 132 of the cross fusion CNN 121.
  • the types of image data input to the HVC 131 and the image data input to the MVC 132 may be the same or different.
  • the image data taken by the camera may be input to the HVC 131, and the image data obtained by imaging the data collected by the LiDAR may be input to the MVC 132.
  • Further, the human visual processing data collected by the human visual processing sensor 112 is input to the HVC 131 as needed.
  • HVC131 and MVC132 perform image processing individually. Then, the HVC 131 outputs the HVC output as needed.
  • the MVC 132 outputs output data influenced by the HVC 131, that is, output data that depends to some extent on the functions of the human visual system (hereinafter referred to as HVC-influenced output).
  • The content of the HVC-influenced output differs depending on the application of the cross-fusion CNN 121 and the like.
  • For example, the HVC-influenced output contains the same type of data as the HVC output.
  • Further, for example, the HVC-influenced output includes image data indicating points of agreement, differences, and the like between the processing of the HVC 131 and the processing of the MVC 132.
  • This image data is similar to, for example, an attention map highlighting image features that the MVC 132 is paying attention to but the HVC 131 is not.
  • Further, for example, the HVC-influenced output includes image data showing the result of image processing such as object classification.
  • The training data set generation unit 114 generates a set of training data used for training the HVC 131 (hereinafter referred to as the training data set for HVC) and a set of training data used for training the cross-fusion CNN 121 (hereinafter referred to as the training data set for cross-fusion CNN).
  • The training data set generation unit 114 stores the generated training data set for HVC and training data set for cross-fusion CNN in the training database 115.
  • The training data set generation unit 114 may have a function of giving an object recognition task to a human and collecting a label (hereinafter referred to as a human recognition label) indicating the human's recognition result for an object in an image.
  • The object recognition task is, for example, a task for testing the human visual system. Specifically, in the object recognition task, an image is presented to a human for a predetermined time (for example, 100 ms), the human classifies the object in the image, and a label indicating the result of the classification is given.
  • FIG. 8 shows a configuration example of the function of the information processing system 101 in the training stage of the HVC 131.
  • FIG. 9 shows a configuration example of the function of the information processing system 101 in the training stage of the cross-fusion CNN 121.
  • In step S1, the training data set generation unit 114 generates the training data set for the HVC 131 (that is, the training data set for HVC).
  • a training dataset for HVC includes a set of image data labeled by humans. This label is given, for example, within the object recognition task described above.
  • the label may be automatically added, for example, using a video database that determines whether a human has correctly or incorrectly identified an object in the image.
  • the training data set for HVC includes human visual processing data collected by the human visual processing sensor 112, if necessary.
  • Human visual processing data is acquired in synchronization with other training data (eg, image data corresponding to the presented image). That is, the human visual processing data collected from the person to whom the image is presented is associated with the image data corresponding to the presented image.
  • At this time, the time lag that exists between the presentation of the image and the activity of the human visual system may be taken into consideration. For example, there is a time lag of about 100 ms before V4 of the human brain reacts to the presented image.
  • Therefore, the data acquired from V4 of the human brain is associated with the image data corresponding to the image presented 100 ms before the acquisition of that data (an alignment sketch follows below).
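  • A minimal alignment sketch is shown below; it assumes timestamps in milliseconds and a fixed latency (about 100 ms for V4, per the description above) between image presentation and the measured visual cortex activity.

        def associate_with_presented_image(sample_time_ms, presentations, latency_ms=100):
            """presentations: list of (presentation_time_ms, image_id) sorted by time.
            Returns the image presented latency_ms before the measured sample."""
            target_time = sample_time_ms - latency_ms
            candidates = [image_id for t, image_id in presentations if t <= target_time]
            return candidates[-1] if candidates else None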
  • the training data set generation unit 114 stores the generated training data for HVC in the training database 115.
  • In step S2, the processing unit 113 trains the HVC 131.
  • The HVC 131 is trained by standard neural network methods so as to realize functions similar to those of the human visual system.
  • For example, training is conducted with, as a success indicator, a match between the output of the HVC 131 for image data in the training data set for HVC and the result of the human object recognition task for the image corresponding to that image data (that is, the human recognition label attached to the image data).
  • the degree of similarity between the activity mapping in the human visual system for the same image and the functional activity in the HVC 131 may be included in the training success index.
  • image data is input to the HVC 131, and an image corresponding to the image data is presented to a human wearing a human visual processing sensor 112 (hereinafter referred to as a data provider).
  • a human visual processing sensor 112 collects human visual processing data indicating the reaction of the data provider to the presented image and inputs it to the HVC 131.
  • Further, the human visual processing sensor 112 inputs, into the image reproduction model 251, the signals output from V1 to V5 of the data provider (hereinafter referred to as visual cortex signal V1 to visual cortex signal V5) among the collected human visual processing data.
  • The image reproduction model 251 converts the visual cortex signals V1 to V5 into image data (hereinafter referred to as correct visual cortex data V1 to correct visual cortex data V5) and outputs them. The visual cortex data V1 to V5 output from the convolutional layers of the HVC 131 are then compared with the correct visual cortex data V1 to V5 output from the image reproduction model 251, and the degree of similarity between them is used as a success index.
  • For example, the above two types of success indicators are combined into a score using a predetermined function or the like, and the success or failure of the processing of the HVC 131 for the input image data is determined based on the result of comparing the calculated score with a predetermined threshold value (a scoring sketch follows below). The configuration and parameters of the HVC 131 are then adjusted so that the success rate of the processing of the HVC 131 improves.
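  • The sketch below is an illustrative scoring function only; the weights, the combination function (a weighted sum), and the threshold are assumptions, since the patent leaves the predetermined function and threshold unspecified.

        def hvc_training_success(label_match: bool, cortex_similarity: float,
                                 w_label: float = 0.5, w_cortex: float = 0.5,
                                 threshold: float = 0.7) -> bool:
            """Combines the label-match indicator and the visual-cortex-data similarity
            into a single score and compares it with a threshold."""
            score = w_label * float(label_match) + w_cortex * cortex_similarity
            return score >= threshold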
  • the HVC 131 may be trained for a specific individual or for a general human being.
  • Interpretability techniques are used to cause the HVC 131 to generate images that maximize the activation of specific areas or neurons of the human visual system.
  • The image generated by the HVC 131 is presented to a person wearing a device that scans the brain, such as an EEG scanner, an MRI (Magnetic Resonance Imaging) scanner, or a NIRS (Near-Infrared Spectroscopy) scanner. If the processing of the HVC 131 is effective, the target region or neurons are activated in the brain of the person to whom the image is presented.
  • Alternatively, a database showing the reactions of the human visual cortex to various images may be used for the test of the HVC 131.
  • For example, the correct visual cortex data V1 to correct visual cortex data V5, obtained by converting the human visual cortex signals V1 to V5 for various images, are stored in the database.
  • Then, for the various images in the database, the HVC 131 is tested by comparing the visual cortex data V1 to V5 output from the HVC 131 with the correct visual cortex data V1 to V5 stored in the database.
  • a behavior prediction test (Behavioral predictivity test) in which a behavior pattern corresponding to a presented image is linked to an object recognition process inside a human being may be used.
  • In the behavioral predictivity test, human reactions are classified into multiple recognition categories (for example, missed recognitions, false positive recognitions, etc.). Then, the degree of similarity between the recognition result of the HVC 131 and the human reactions in the behavioral predictivity test is used for the test of the HVC 131.
  • If the test score is equal to or above the threshold, the HVC 131 is mounted on the cross-fusion CNN 121.
  • If the test score is below the threshold, the HVC 131 is retrained.
  • In step S3, the training data set generation unit 114 generates the training data set for the cross-fusion CNN 121 (the training data set for cross-fusion CNN).
  • The method of generating the training data set for cross-fusion CNN differs depending on the function realized by the cross-fusion CNN 121 (for example, the content of the HVC-influenced output).
  • the training data set for HVC may be used for the training for the cross-fusion CNN.
  • a correct answer label (Ground truth label) indicating the type of the actual object is given to the input image data with the human recognition label included in the training data set for HVC.
  • the human recognition label and the correct label may or may not match.
  • Alternatively, the training data may be generated by having the trained HVC 131 assign a pseudo human recognition label to input image data to which only the correct answer label is attached.
  • the input image data may include either a negative example (Negative example) or a positive example (Positive example).
  • A negative example is, for example, image data for which it is presumed that the human's recognition result for an object and the recognition result of the cross-fusion CNN 121 (the HVC-influenced output) will differ.
  • A positive example is, for example, image data for which it is presumed that the human's recognition result for an object and the recognition result of the cross-fusion CNN 121 (the HVC-influenced output) will match.
  • The training data set for cross-fusion CNN includes the input image data and correct image data that is generated from the input image data and has the same format as the HVC-influenced output.
  • the training dataset for cross-fusion CNN uses the difference map as the correct image data.
  • the attention map is generated using the HVC131 after training and the MVC132 before being mounted on the cross-fusion CNN121.
  • This attention map is generated, for example, from the entire neural networks (the HVC 131 and the MVC 132). It is also possible to use an attention map generated from an individual neuron in each neural network or from any combination of neurons.
  • Alternatively, for example, the training data set for cross-fusion CNN includes AI-extended image data as the correct image data.
  • In this case, the AI-extended image is generated by performing predetermined image processing (for example, sharpening, generative data, upscaling, etc.) on the input image data using the MVC 132 before it is mounted on the cross-fusion CNN 121.
  • Further, predetermined image processing is performed on the input image data using the HVC 131, and an attention map of regions estimated to be noticed by humans in an important recognition task (for example, bleeding detection in surgery) is generated.
  • Then, the input image is divided into an attention region to which the HVC 131 pays a great deal of attention and a non-attention region to which it does not. That is, the input image is divided into a region where humans are presumed to pay close attention in the important recognition task and a region where they are presumed not to. A composite image is then generated that uses the input image in the attention region and the AI-extended image in the non-attention region (see the compositing sketch below), and the composite image data corresponding to the generated composite image is used as the correct image data.
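  • A minimal compositing sketch is shown below; it assumes a binary attention mask (1 inside the attention region, 0 elsewhere) derived from the HVC attention map, and images stored as NumPy arrays of matching size.

        import numpy as np

        def composite_correct_image(input_image: np.ndarray,
                                    ai_extended_image: np.ndarray,
                                    attention_mask: np.ndarray) -> np.ndarray:
            """Uses the input image inside the attention region and the AI-extended
            image elsewhere."""
            mask = attention_mask[..., np.newaxis]  # broadcast over colour channels
            return mask * input_image + (1 - mask) * ai_extended_image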
  • the training data for the cross-fusion CNN includes the human visual processing data.
  • the training data set generation unit 114 stores the generated training data set for cross-fusion CNN in the training database 115.
  • In step S4, the processing unit 113 trains the cross-fusion CNN 121.
  • the cross-fusion CNN 121 is generated by connecting the post-training HVC 131 and the pre-training MVC 132 using the cross-fusion parameters according to a predefined architecture.
  • the training data set for cross-fusion CNN stored in the training database 115 is divided into a training data set and a test data set. For example, about 80% of the training dataset for cross-fusion CNN is used for the training dataset and the rest is used for the test dataset.
  • The cross-fusion CNN 121 processes the input image data included in the training data set and outputs the HVC-influenced output.
  • For example, a score is given based on the result of comparing the HVC-influenced output with the human recognition label and the correct answer label given to the input image data. Whether a match between the HVC-influenced output and the human recognition label is judged a success, and whether a match between the HVC-influenced output and the correct answer label is judged a success, is determined by the category of the object to be recognized, and the like.
  • Alternatively, for example, a score is given based on the degree of similarity between the HVC-influenced output and the correct image data corresponding to the input image data (for example, the similarity of pixel values).
  • The above processing is repeated for all the input image data of the training data set, and the configuration and parameters of the cross-fusion CNN 121 are adjusted so that the score improves. Specifically, the configuration and parameters of the MVC 132, as well as the cross-fusion parameters, are adjusted, while the configuration and parameters of the HVC 131 are not changed (a training-step sketch follows below).
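  • A minimal training-step sketch, under the assumptions of the earlier CrossFusedCNN sketch, is shown below; the mean-squared-error loss stands in for the scoring described above, and the optimizer is built only from the MVC and cross-fusion parameters so that the HVC remains unchanged.

        import torch
        import torch.nn.functional as F

        def make_optimizer(model):
            # Only MVC weights and cross-fusion scalars receive gradient updates.
            return torch.optim.Adam(model.trainable_parameters(), lr=1e-4)

        def training_step(model, optimizer, input_image, correct_image):
            optimizer.zero_grad()
            hvc_influenced_output = model(input_image)
            loss = F.mse_loss(hvc_influenced_output, correct_image)
            loss.backward()
            optimizer.step()
            return loss.item()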
  • the cross-fusion CNN121 trained using the training data set is tested using the test data set.
  • the configuration and parameters of the cross-fusion CNN 121 are not adjusted, and the same scoring as in the training stage is performed in order to estimate the accuracy of the cross-fusion CNN 121.
  • If the estimated accuracy is insufficient, the process is restarted from the training stage.
  • In this way, the cross-fusion CNN 121 is generated.
  • FIG. 12 shows a detailed configuration example of the information processing system 101 at the execution stage.
  • In step S51, the cross-fusion CNN 121 acquires image data and human visual processing data.
  • the image data is input to the HVC 131 and the MVC 132.
  • the human visual processing sensor 112 is attached to the user, collects the human visual processing data of the user, and inputs it to the HVC 131.
  • As the human visual processing data, the same type of data as that used for training the HVC 131 and the cross-fusion CNN 121 is used.
  • In step S52, the cross-fusion CNN 121 processes the image data. Then, the cross-fusion CNN 121 outputs a predetermined HVC-influenced output, which is the processing result for the image data.
  • The human visual processing data is used to synchronize the HVC 131 with the current state and behavior of the user. This makes it possible to provide a more accurate model of the human visual system (the HVC 131), and as a result, the HVC-influenced output can be optimized.
  • each cross-fusion CNN 121 may be given a score, and the cross-fusion CNN 121 having the highest score may be adopted.
  • training may be performed to generate a plurality of cross-fused CNN 121s having different tolerances for the processing of the HVC 131, and the plurality of cross-fused CNN 121s may be used in the information processing system 101.
  • This allows each cross-fusion CNN 121 to provide, to a standard user, output influenced by human visual processing with a different degree of intuitive interpretability.
  • Further, a cross-fusion CNN 121 in which a plurality of MVCs 132 are combined with one HVC 131 may be generated.
  • In this case, different types of HVC-influenced outputs can be output from the respective MVCs 132.
  • the human visual system and the machine visual system can be effectively integrated.
  • the performance of models with functions similar to those of human visual systems can be improved.
  • This technology can be suitably applied when humans and machines need to make decisions cooperatively, or when humans must monitor and verify machine decisions.
  • the coincidence points and differences between the processing of the human visual system (HVC131) and the processing of the machine visual system (MVC132) can be detected and used at an arbitrary semantic level.
  • an output showing the points of agreement and differences between human and machine visual processing can be obtained.
  • the inputs, arbitrary intermediate layers, and outputs of the HVC 131 and MVC 132 can be individually accessed to evaluate the performance of the system.
  • For example, the human and machine visual systems can be compared at any semantic processing layer of the CNNs. Also, for example, monitoring and certainty of the parameters related to the human visual system can be ensured.
  • For example, the cross-fused CNN 121 can identify image features that are difficult for humans to detect but that affect the visual processing of the machine, that is, features that differentiate the machine from humans. For example, even if both the human and the machine visual system ultimately recognize the same object, differences between human and machine can be predicted at a lower semantic level, such as differences in edge detection behavior.
  • the information recognized by the user can be expanded based on the difference in visual processing between humans and machines.
  • For example, an AR (Augmented Reality) device can be used to present the expanded information, and the image can be expanded without changing features that are important in human visual processing.
  • For a task that is not important to the user (a non-critical task), the machine visual system (MVC 132) is therefore used to expand the information presented to the user.
  • the feature that the HVC 131 pays attention to is presumed to be important in human visual processing, and is presumed to be important when the user performs an important task (Critical task). Therefore, the features that the HVC 131 pays attention to are presented to the user without major changes. This keeps the user accountable for important tasks.
  • the interpretability (Interpretability) and accountability (Accountability) of the processing and results of the AI system can be improved or customized.
  • a part of HVC131 can be used to verify the performance of the system.
  • automatic domain adaptation can be realized based on the user's experience.
  • For example, countermeasures against racial bias in face recognition, medical AR applications, and the like can be realized.
  • the visual system of the machine can be harmonized with the human visual processing.
  • For example, the cross-fused CNN 121 is trained with permitted deviations at multiple levels of the data processing of the CNN.
  • the processing steps of the AI system can be intuitively understood by humans.
  • Further, multiple cross-fused CNNs 121 are generated with different, larger levels of permitted deviation, achieving higher efficiency or accuracy.
  • By selecting the cross-fused CNN 121 with the level of permitted deviation (intuitive interpretability) that is ideal for the task, humans can control how "human-like" the vision system behaves.
  • For example, a doctor may select high human coherence for a task that does not require advanced cognitive processing. This ensures that the decision criteria of the AI system are similar to those easily detectable by humans, making it possible for humans to communicate with the AI system easily and quickly.
  • The HVC 131 and the MVC 132 are individually constructed, trained, and established, and it is guaranteed that the configuration and parameters of the HVC 131 do not change during the training and execution of the cross-fusion CNN 121.
  • Therefore, a good model of the human visual system (the HVC 131) can be used with its configuration and parameters guaranteed not to be adjusted or changed.
  • the model to which this technology is applied has a mapping between human visual processing and machine visual processing. This can be mapped by estimating the reciprocal activation of neurons in the human and machine visual system for the same image. Since this mapping is understood in the art, images can be generated, for example, by networks optimized to produce higher activation in certain parts of the human visual system.
  • This technology can also be applied, for example, to a natural UI (user interface).
  • This technology can also be applied to mobile devices such as vehicles.
  • FIG. 13 is a block diagram showing a configuration example of a vehicle control system 1011 which is an example of a mobile device control system to which the present technology is applied.
  • the vehicle control system 1011 is provided in the vehicle 1001 and performs processing related to driving support and automatic driving of the vehicle 1001.
  • The vehicle control system 1011 includes a processor 1021, a communication unit 1022, a map information storage unit 1023, a GNSS (Global Navigation Satellite System) receiving unit 1024, an external recognition sensor 1025, an in-vehicle sensor 1026, a vehicle sensor 1027, a recording unit 1028, a driving support / automatic driving control unit 1029, a DMS (Driver Monitoring System) 1030, an HMI (Human Machine Interface) 1031, and a vehicle control unit 1032.
  • The communication network 1041 is composed of, for example, an in-vehicle communication network or a bus compliant with an arbitrary standard such as CAN (Controller Area Network), LIN (Local Interconnect Network), LAN (Local Area Network), FlexRay (registered trademark), or Ethernet.
  • each part of the vehicle control system 1011 may be directly connected by, for example, short-range wireless communication (NFC (Near Field Communication)), Bluetooth (registered trademark), or the like without going through the communication network 1041.
  • Hereinafter, when each part of the vehicle control system 1011 communicates via the communication network 1041, the description of the communication network 1041 is omitted.
  • For example, when the processor 1021 and the communication unit 1022 communicate with each other via the communication network 1041, it is simply described that the processor 1021 and the communication unit 1022 communicate with each other.
  • the processor 1021 is composed of various processors such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), and an ECU (Electronic Control Unit), for example.
  • the processor 1021 controls the entire vehicle control system 1011.
  • the communication unit 1022 communicates with various devices inside and outside the vehicle, other vehicles, servers, base stations, etc., and transmits and receives various data.
  • For example, the communication unit 1022 receives, from the outside, a program for updating the software that controls the operation of the vehicle control system 1011, map information, traffic information, information around the vehicle 1001, and the like.
  • the communication unit 1022 transmits information about the vehicle 1001 (for example, data indicating the state of the vehicle 1001, recognition result by the recognition unit 1073, etc.), information around the vehicle 1001, and the like to the outside.
  • the communication unit 1022 performs communication corresponding to a vehicle emergency call system such as eCall.
  • the communication method of the communication unit 1022 is not particularly limited. Moreover, a plurality of communication methods may be used.
  • the communication unit 1022 performs wireless communication with the equipment in the vehicle by a communication method such as wireless LAN, Bluetooth, NFC, WUSB (WirelessUSB).
  • Further, for example, the communication unit 1022 performs wired communication with equipment in the vehicle via a connection terminal (and a cable if necessary) (not shown), using a communication method such as USB (Universal Serial Bus), HDMI (High-Definition Multimedia Interface, registered trademark), or MHL (Mobile High-definition Link).
  • the device in the vehicle is, for example, a device that is not connected to the communication network 1041 in the vehicle.
  • mobile devices and wearable devices possessed by passengers such as drivers, information devices brought into a vehicle and temporarily installed, and the like are assumed.
  • For example, the communication unit 1022 communicates with a server or the like existing on an external network (for example, the Internet, a cloud network, or a network specific to a business operator) via a base station, using a wireless communication system such as 4G (4th generation mobile communication system), 5G (5th generation mobile communication system), LTE (Long Term Evolution), or DSRC (Dedicated Short Range Communications).
  • Further, for example, the communication unit 1022 uses P2P (Peer To Peer) technology to communicate with a terminal existing in the vicinity of the own vehicle (for example, a terminal of a pedestrian or a store, or an MTC (Machine Type Communication) terminal).
  • the communication unit 1022 performs V2X communication.
  • V2X communication is, for example, vehicle-to-vehicle (Vehicle to Vehicle) communication with other vehicles, vehicle-to-infrastructure (Vehicle to Infrastructure) communication with roadside devices, vehicle-to-home (Vehicle to Home) communication, and vehicle-to-pedestrian (Vehicle to Pedestrian) communication with terminals carried by pedestrians.
  • the communication unit 1022 receives electromagnetic waves transmitted by a vehicle information and communication system (VICS (Vehicle Information and Communication System), registered trademark) such as a radio wave beacon, an optical beacon, and FM multiplex broadcasting.
  • the map information storage unit 1023 stores a map acquired from the outside and a map created by the vehicle 1001.
  • the map information storage unit 1023 stores a three-dimensional high-precision map, a global map that is less accurate than the high-precision map and covers a wide area, and the like.
  • the high-precision map is, for example, a dynamic map, a point cloud map, a vector map (also referred to as an ADAS (Advanced Driver Assistance System) map), or the like.
  • the dynamic map is, for example, a map composed of four layers of dynamic information, quasi-dynamic information, quasi-static information, and static information, and is provided from an external server or the like.
  • the point cloud map is a map composed of point clouds (point cloud data).
  • a vector map is a map in which information such as lanes and signal positions is associated with a point cloud map.
  • The point cloud map and the vector map may be provided, for example, from an external server or the like, or they may be created by the vehicle 1001, as maps for matching with a local map (described later), based on sensing results from the radar 1052, the LiDAR 1053, or the like, and stored in the map information storage unit 1023. Further, when a high-precision map is provided from an external server or the like, map data of, for example, several hundred meters square relating to the route the vehicle 1001 is about to travel is acquired from the server in order to reduce the communication volume.
  • the GNSS receiving unit 1024 receives the GNSS signal from the GNSS satellite and supplies it to the traveling support / automatic driving control unit 1029.
  • the external recognition sensor 1025 includes various sensors used for recognizing the external situation of the vehicle 1001, and supplies sensor data from each sensor to each part of the vehicle control system 1011.
  • the type and number of sensors included in the external recognition sensor 1025 are arbitrary.
  • The external recognition sensor 1025 includes a camera 1051, a radar 1052, a LiDAR (Light Detection and Ranging, Laser Imaging Detection and Ranging) 1053, and an ultrasonic sensor 1054.
  • the number of cameras 1051, radar 1052, LiDAR1053, and ultrasonic sensors 1054 is arbitrary, and examples of sensing areas of each sensor will be described later.
  • As the camera 1051, a camera of any imaging method, such as a ToF (Time of Flight) camera, a stereo camera, a monocular camera, or an infrared camera, is used as needed.
  • Further, for example, the external recognition sensor 1025 includes an environment sensor for detecting weather, meteorological conditions, brightness, and the like.
  • the environment sensor includes, for example, a raindrop sensor, a fog sensor, a sunshine sensor, a snow sensor, an illuminance sensor, and the like.
  • the external recognition sensor 1025 includes a microphone used for detecting the position of a sound or a sound source around the vehicle 1001.
  • the in-vehicle sensor 1026 includes various sensors for detecting information in the vehicle, and supplies sensor data from each sensor to each part of the vehicle control system 1011.
  • the type and number of sensors included in the in-vehicle sensor 1026 are arbitrary.
  • the in-vehicle sensor 1026 includes a camera, a radar, a seating sensor, a steering wheel sensor, a microphone, a biological sensor, and the like.
  • As the camera, for example, a camera of any imaging method, such as a ToF camera, a stereo camera, a monocular camera, or an infrared camera, can be used.
  • The biosensor is provided on, for example, a seat, the steering wheel, or the like, and detects various types of biometric information of an occupant such as the driver.
  • the vehicle sensor 1027 includes various sensors for detecting the state of the vehicle 1001, and supplies sensor data from each sensor to each part of the vehicle control system 1011.
  • the type and number of sensors included in the vehicle sensor 1027 are arbitrary.
  • the vehicle sensor 1027 includes a speed sensor, an acceleration sensor, an angular velocity sensor (gyro sensor), and an inertial measurement unit (IMU (Inertial Measurement Unit)).
  • the vehicle sensor 1027 includes a steering angle sensor for detecting the steering angle of the steering wheel, a yaw rate sensor, an accelerator sensor for detecting the operation amount of the accelerator pedal, and a brake sensor for detecting the operation amount of the brake pedal.
  • Further, for example, the vehicle sensor 1027 includes a rotation sensor that detects the rotation speed of the engine or motor, an air pressure sensor that detects tire pressure, a slip ratio sensor that detects the tire slip ratio, and a wheel speed sensor that detects the wheel rotation speed.
  • the vehicle sensor 1027 includes a battery sensor that detects the remaining amount and temperature of the battery, and an impact sensor that detects an impact from the outside.
  • The recording unit 1028 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a magnetic storage device such as an HDD (Hard Disc Drive), a semiconductor storage device, an optical storage device, a magneto-optical storage device, and the like.
  • the recording unit 1028 records various programs, data, and the like used by each unit of the vehicle control system 1011.
  • the recording unit 1028 records a rosbag file including messages sent and received by the ROS (Robot Operating System) in which an application program related to automatic driving operates.
  • the recording unit 1028 includes an EDR (Event Data Recorder) and a DSSAD (Data Storage System for Automated Driving), and records information on the vehicle 1001 before and after an event such as an accident.
  • the driving support / automatic driving control unit 1029 controls the driving support and automatic driving of the vehicle 1001.
  • the driving support / automatic driving control unit 1029 includes an analysis unit 1061, an action planning unit 1062, and an operation control unit 1063.
  • the analysis unit 1061 analyzes the vehicle 1001 and the surrounding conditions.
  • the analysis unit 1061 includes a self-position estimation unit 1071, a sensor fusion unit 1072, and a recognition unit 1073.
  • the self-position estimation unit 1071 estimates the self-position of the vehicle 1001 based on the sensor data from the external recognition sensor 1025 and the high-precision map stored in the map information storage unit 1023. For example, the self-position estimation unit 1071 generates a local map based on the sensor data from the external recognition sensor 1025, and estimates the self-position of the vehicle 1001 by matching the local map with the high-precision map.
  • The position of the vehicle 1001 is based on, for example, the center of the rear wheel axle.
  • The local map is, for example, a three-dimensional high-precision map created by using a technology such as SLAM (Simultaneous Localization and Mapping), an occupancy grid map (Occupancy Grid Map), or the like.
  • the three-dimensional high-precision map is, for example, the point cloud map described above.
  • The occupancy grid map is a map that divides the three-dimensional or two-dimensional space around the vehicle 1001 into grid cells of a predetermined size and indicates the occupancy state of objects in units of cells.
  • The occupancy state of an object is indicated by, for example, the presence or absence of the object and its existence probability (see the sketch below).
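  • As a purely illustrative sketch (the patent does not specify an implementation), a two-dimensional occupancy grid map can be represented as an array of occupancy probabilities over fixed-size cells centered on the vehicle:

        import numpy as np

        class OccupancyGridMap:
            def __init__(self, size_m: float = 100.0, cell_m: float = 0.5):
                n = int(size_m / cell_m)
                self.cell_m = cell_m
                self.origin = n // 2                # vehicle at the centre of the grid
                self.prob = np.full((n, n), 0.5)    # 0.5 = unknown occupancy

            def update(self, x_m: float, y_m: float, occupied_prob: float):
                i = self.origin + int(round(x_m / self.cell_m))
                j = self.origin + int(round(y_m / self.cell_m))
                if 0 <= i < self.prob.shape[0] and 0 <= j < self.prob.shape[1]:
                    self.prob[i, j] = occupied_prob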
  • the local map is also used, for example, in the detection process and the recognition process of the external situation of the vehicle 1001 by the recognition unit 1073.
  • the self-position estimation unit 1071 may estimate the self-position of the vehicle 1001 based on the GNSS signal and the sensor data from the vehicle sensor 1027.
  • The sensor fusion unit 1072 performs sensor fusion processing to obtain new information by combining multiple different types of sensor data (for example, image data supplied from the camera 1051 and sensor data supplied from the radar 1052). Methods for combining different types of sensor data include integration, fusion, and association.
  • the recognition unit 1073 performs detection processing and recognition processing of the external situation of the vehicle 1001.
  • the recognition unit 1073 performs detection processing and recognition processing of the external situation of the vehicle 1001 based on the information from the external recognition sensor 1025, the information from the self-position estimation unit 1071, the information from the sensor fusion unit 1072, and the like. ..
  • the recognition unit 1073 performs detection processing, recognition processing, and the like of objects around the vehicle 1001.
  • the object detection process is, for example, a process of detecting the presence / absence, size, shape, position, movement, etc. of an object.
  • the object recognition process is, for example, a process of recognizing an attribute such as an object type or identifying a specific object.
  • the detection process and the recognition process are not always clearly separated and may overlap.
  • For example, the recognition unit 1073 detects objects around the vehicle 1001 by performing clustering, which groups the point cloud based on sensor data from the LiDAR, the radar, or the like into clusters of points. As a result, the presence or absence, size, shape, and position of objects around the vehicle 1001 are detected.
  • Further, for example, the recognition unit 1073 detects the movement of objects around the vehicle 1001 by performing tracking, which follows the movement of the point cloud clusters obtained by the clustering. As a result, the velocity and traveling direction (motion vector) of each object around the vehicle 1001 are detected (a clustering sketch follows below).
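  • The sketch below illustrates the clustering step only; DBSCAN is a stand-in algorithm chosen for illustration, since the patent does not name a specific clustering method.

        import numpy as np
        from sklearn.cluster import DBSCAN

        def cluster_point_cloud(points_xyz: np.ndarray, eps: float = 0.5, min_samples: int = 10):
            """Groups LiDAR/radar points into object candidates; returns one array per cluster."""
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xyz)
            return [points_xyz[labels == k] for k in set(labels) if k != -1]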
  • the recognition unit 1073 recognizes the type of an object around the vehicle 1001 by performing an object recognition process such as semantic segmentation on the image data supplied from the camera 1051.
  • the object to be detected or recognized is assumed to be, for example, a vehicle, a person, a bicycle, an obstacle, a structure, a road, a traffic light, a traffic sign, a road sign, or the like.
  • For example, the recognition unit 1073 performs recognition processing of the traffic rules around the vehicle 1001 based on the map stored in the map information storage unit 1023, the self-position estimation result, and the recognition results for objects around the vehicle 1001.
  • By this processing, for example, the position and state of traffic lights, the contents of traffic signs and road markings, the contents of traffic regulations, the lanes in which the vehicle can travel, and the like are recognized.
  • the recognition unit 1073 performs recognition processing of the environment around the vehicle 1001.
  • the surrounding environment to be recognized for example, weather, temperature, humidity, brightness, road surface condition, and the like are assumed.
  • the action planning unit 1062 creates an action plan for the vehicle 1001. For example, the action planning unit 1062 creates an action plan by performing route planning and route tracking processing.
  • route planning is a process of planning a rough route from the start to the goal.
  • Route planning also includes trajectory planning (local path planning), that is, generating, within the route planned by the route planning, a trajectory on which the vehicle 1001 can travel safely and smoothly in its vicinity, taking the motion characteristics of the vehicle 1001 into consideration.
  • Route tracking is a process of planning an operation for safely and accurately traveling on a route planned by route planning within a planned time. For example, the target speed and the target angular velocity of the vehicle 1001 are calculated.
  • the motion control unit 1063 controls the motion of the vehicle 1001 in order to realize the action plan created by the action plan unit 1062.
  • For example, the motion control unit 1063 controls the steering control unit 1081, the brake control unit 1082, and the drive control unit 1083 so that the vehicle 1001 travels along the trajectory calculated by the trajectory planning.
  • the motion control unit 1063 performs coordinated control for the purpose of realizing ADAS functions such as collision avoidance or impact mitigation, follow-up running, vehicle speed maintenance running, collision warning of own vehicle, and lane deviation warning of own vehicle.
  • the motion control unit 1063 performs coordinated control for the purpose of automatic driving or the like that autonomously travels without being operated by the driver.
  • the DMS1073 performs driver authentication processing, driver status recognition processing, and the like based on sensor data from the in-vehicle sensor 1026 and input data input to HMI1031.
  • As the state of the driver to be recognized for example, physical condition, alertness, concentration, fatigue, line-of-sight direction, drunkenness, driving operation, posture, and the like are assumed.
  • the DMS 1030 may perform authentication processing for occupants other than the driver and recognition processing of the state of those occupants. Further, for example, the DMS 1030 may perform recognition processing of the situation inside the vehicle based on the sensor data from the in-vehicle sensor 1026. As the situation inside the vehicle to be recognized, for example, temperature, humidity, brightness, odor, and the like are assumed.
  • the HMI 1031 is used for inputting various data and instructions, generates an input signal based on the input data and instructions, and supplies the input signal to each part of the vehicle control system 1011.
  • the HMI 1031 includes operation devices such as a touch panel, buttons, a microphone, switches, and levers, as well as operation devices that accept input by methods other than manual operation, such as voice or gesture.
  • the HMI 1031 may also include, for example, a remote control device using infrared rays or other radio waves, or an externally connected device such as a mobile device or a wearable device that supports the operation of the vehicle control system 1011.
  • the HMI 1031 performs output control for generating and outputting visual information, auditory information, and tactile information for the passenger or the outside of the vehicle, and for controlling output contents, output timing, output method, and the like.
  • the visual information is, for example, information indicated by an image or light, such as an operation screen, a state display of the vehicle 1001, a warning display, or a monitor image showing the situation around the vehicle 1001.
  • the auditory information is, for example, information indicated by sound, such as voice guidance, a warning sound, or a warning message.
  • the tactile information is information given to the passenger's tactile sensation by, for example, force, vibration, movement, or the like.
  • As a device that outputs visual information, for example, a display device, a projector, a navigation device, an instrument panel, a CMS (Camera Monitoring System), an electronic mirror, a lamp, or the like is assumed.
  • The display device may be, in addition to a device having a normal display, a device that displays visual information in the occupant's field of view, such as a head-up display, a transmissive display, or a wearable device having an AR (Augmented Reality) function.
  • As a device that outputs auditory information, for example, an audio speaker, headphones, earphones, or the like is assumed.
  • As a device that outputs tactile information, for example, a haptic element using haptics technology or the like is assumed.
  • the haptic element is provided on, for example, a steering wheel, a seat, or the like.
  • the vehicle control unit 1032 controls each part of the vehicle 1001.
  • the vehicle control unit 1032 includes a steering control unit 1081, a brake control unit 1082, a drive control unit 1083, a body system control unit 1084, a light control unit 1085, and a horn control unit 1086.
  • the steering control unit 1081 detects and controls the state of the steering system of the vehicle 1001.
  • the steering system includes, for example, a steering mechanism including a steering wheel, electric power steering, and the like.
  • the steering control unit 1081 includes, for example, a control unit such as an ECU that controls the steering system, an actuator that drives the steering system, and the like.
  • the brake control unit 1082 detects and controls the state of the brake system of the vehicle 1001.
  • the brake system includes, for example, a brake mechanism including a brake pedal and the like, ABS (Antilock Brake System) and the like.
  • the brake control unit 1082 includes, for example, a control unit such as an ECU that controls the brake system, an actuator that drives the brake system, and the like.
  • the drive control unit 1083 detects and controls the state of the drive system of the vehicle 1001.
  • the drive system includes, for example, an accelerator pedal, a driving force generation device for generating a driving force, such as an internal combustion engine or a drive motor, a driving force transmission mechanism for transmitting the driving force to the wheels, and the like.
  • the drive control unit 1083 includes, for example, a control unit such as an ECU that controls the drive system, an actuator that drives the drive system, and the like.
  • the body system control unit 1084 detects and controls the state of the body system of the vehicle 1001.
  • the body system includes, for example, a keyless entry system, a smart key system, a power window device, a power seat, an air conditioner, an airbag, a seat belt, a shift lever, and the like.
  • the body system control unit 1084 includes, for example, a control unit such as an ECU that controls the body system, an actuator that drives the body system, and the like.
  • the light control unit 1085 detects and controls various light states of the vehicle 1001. As the light to be controlled, for example, a headlight, a backlight, a fog light, a turn signal, a brake light, a projection, a bumper display, or the like is assumed.
  • the light control unit 1085 includes a control unit such as an ECU that controls the light, an actuator that drives the light, and the like.
  • the horn control unit 1086 detects and controls the state of the car horn of the vehicle 1001.
  • the horn control unit 1086 includes, for example, a control unit such as an ECU that controls the car horn, an actuator that drives the car horn, and the like.
  • FIG. 14 is a diagram showing an example of the sensing regions of the camera 1051, the radar 1052, the LiDAR 1053, and the ultrasonic sensor 1054 of the external recognition sensor 1025 of FIG. 13.
  • the sensing area 1101F and the sensing area 1101B show an example of the sensing area of the ultrasonic sensor 1054.
  • the sensing region 1101F covers the vicinity of the front end of the vehicle 1001.
  • the sensing region 1101B covers the periphery of the rear end of the vehicle 1001.
  • the sensing results in the sensing area 1101F and the sensing area 1101B are used, for example, for parking support of the vehicle 1001 and the like.
  • the sensing area 1102F to the sensing area 1102B show an example of the sensing area of the radar 1052 for a short distance or a medium distance.
  • the sensing area 1102F covers a position farther than the sensing area 1101F in front of the vehicle 1001.
  • the sensing region 1102B covers the rear of the vehicle 1001 to a position farther than the sensing region 1101B.
  • the sensing area 1102L covers the rear periphery of the left side surface of the vehicle 1001.
  • the sensing region 1102R covers the rear periphery of the right side surface of the vehicle 1001.
  • the sensing result in the sensing area 1102F is used, for example, to detect a vehicle, a pedestrian, or the like existing in front of the vehicle 1001.
  • the sensing result in the sensing region 1102B is used, for example, for a collision prevention function behind the vehicle 1001.
  • the sensing results in the sensing area 1102L and the sensing area 1102R are used, for example, for detecting an object in a blind spot on the side of the vehicle 1001.
  • the sensing area 1103F to the sensing area 1103B show an example of the sensing area by the camera 1051.
  • the sensing area 1103F covers a position farther than the sensing area 1102F in front of the vehicle 1001.
  • the sensing region 1103B covers the rear of the vehicle 1001 to a position farther than the sensing region 1102B.
  • the sensing area 1103L covers the periphery of the left side surface of the vehicle 1001.
  • the sensing region 1103R covers the periphery of the right side surface of the vehicle 1001.
  • the sensing result in the sensing area 1103F is used, for example, for recognition of traffic lights and traffic signs, lane departure prevention support system, and the like.
  • the sensing result in the sensing area 1103B is used, for example, for parking assistance, a surround view system, and the like.
  • the sensing results in the sensing area 1103L and the sensing area 1103R are used, for example, in a surround view system or the like.
  • the sensing area 1104 shows an example of the sensing area of LiDAR1053.
  • the sensing region 1104 covers a position farther than the sensing region 1103F in front of the vehicle 1001.
  • the sensing area 1104 has a narrower range in the left-right direction than the sensing area 1103F.
  • the sensing result in the sensing area 1104 is used, for example, for emergency braking, collision avoidance, pedestrian detection, and the like.
  • the sensing area 1105 shows an example of the sensing area of the radar 1052 for a long distance.
  • the sensing region 1105 covers a position farther than the sensing region 1104 in front of the vehicle 1001.
  • the sensing region 1105 has a narrower range in the left-right direction than the sensing region 1104.
  • the sensing result in the sensing area 1105 is used for, for example, ACC (Adaptive Cruise Control) or the like.
  • The sensing region of each sensor may have various configurations other than those shown in FIG. 14. Specifically, the ultrasonic sensor 1054 may also sense the sides of the vehicle 1001, or the LiDAR 1053 may sense the rear of the vehicle 1001.
  • The cross-fusion CNN 121 can be applied to the DMS 1030, the HMI 1031, the sensor fusion unit 1072, the recognition unit 1073, and the like.
  • For example, the information displayed by the HMI 1031 can be expanded.
  • For example, when the HMI 1031 includes a HUD (Head Up Display), information that is difficult for the driver to recognize can be emphasized or added.
  • For example, in fog, it is possible to emphasize and display the subtle contrast of an object or the like in the field of view in front of the vehicle 1001.
  • For example, the cross-fusion CNN 121 can be applied to the recognition unit 1073 and the HMI 1031 to fuse the image data supplied from the camera 1051 and the LiDAR data supplied from the LiDAR 1053 and present the result to the driver.
  • Image data is input from the camera 1051 to the HVC 131 of the cross-fusion CNN 121, and LiDAR data is input from the LiDAR 1053 to the MVC 132.
  • The cross-fusion CNN 121 is trained using correct fused image data (ground truth fused image data) in which image data and LiDAR data are fused.
  • The correct fused image data is, for example, data in which a label indicating whether or not a feature based on the LiDAR data is salient is added to image data in which features detected based on the LiDAR data are emphasized or added.
  • The cross-fusion CNN 121 fuses the image data and the LiDAR data by adding features based on the LiDAR data to the features of the image data to which the HVC 131 (and thus the driver) pays attention. Then, the HMI 1031 displays an image in which the image data and the LiDAR data are fused.
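  • As one concrete illustration of this HMI use case, the sketch below shows how LiDAR-derived features could be emphasized in proportion to an attention-map-like output assumed to be available from the HVC branch, before the fused frame is shown on a display. This is a minimal sketch, not the implementation described here; the function name, tensor shapes, and the [0, 1] value ranges are assumptions.

```python
# Minimal illustrative sketch (not the patent's implementation): emphasise LiDAR-derived
# features in proportion to the HVC attention map before showing the frame on the HMI.
# Tensor shapes, the [0, 1] value range, and the gain factor are assumptions.
import torch

def fuse_for_display(camera_img, lidar_feature, hvc_attention, gain=0.5):
    """camera_img:    (3, H, W) RGB image in [0, 1]
    lidar_feature: (1, H, W) LiDAR-derived feature map (e.g. depth edges) in [0, 1]
    hvc_attention: (1, H, W) attention map from the HVC branch in [0, 1]"""
    # Weight the LiDAR features by how strongly the HVC (and thus the driver) attends there.
    emphasis = gain * hvc_attention * lidar_feature       # (1, H, W)
    fused = torch.clamp(camera_img + emphasis, 0.0, 1.0)  # broadcasts over the RGB channels
    return fused

# Example with a 480x640 frame
cam = torch.rand(3, 480, 640)
lidar = torch.rand(1, 480, 640)
att = torch.rand(1, 480, 640)
display_frame = fuse_for_display(cam, lidar, att)
```

  • A complementary variant could instead weight the LiDAR features by (1 - attention) to add information in regions the driver is not attending to, as mentioned for the output data in the configurations listed earlier.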
  • FIG. 15 is a block diagram showing a configuration example of the hardware of a computer that executes the above-described series of processes by a program.
  • In the computer 2000, a CPU (Central Processing Unit) 2001, a ROM (Read Only Memory) 2002, and a RAM (Random Access Memory) 2003 are interconnected by a bus 2004.
  • An input / output interface 2005 is further connected to the bus 2004.
  • An input unit 2006, an output unit 2007, a recording unit 2008, a communication unit 2009, and a drive 2010 are connected to the input / output interface 2005.
  • the input unit 2006 includes an input switch, a button, a microphone, an image pickup device, and the like.
  • the output unit 2007 includes a display, a speaker, and the like.
  • the recording unit 2008 includes a hard disk, a non-volatile memory, and the like.
  • the communication unit 2009 includes a network interface and the like.
  • the drive 2010 drives a removable media 2011 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
  • In the computer 2000, the CPU 2001 loads the program recorded in the recording unit 2008 into the RAM 2003 via the input/output interface 2005 and the bus 2004 and executes it, whereby the above-described series of processes is performed.
  • the program executed by the computer 2000 can be recorded and provided on the removable media 2011 as a package media or the like, for example.
  • the program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
  • the program can be installed in the recording unit 2008 via the input / output interface 2005 by mounting the removable media 2011 in the drive 2010. Further, the program can be received by the communication unit 2009 via a wired or wireless transmission medium and installed in the recording unit 2008. In addition, the program can be installed in ROM 2002 or the recording unit 2008 in advance.
  • The program executed by the computer may be a program in which processing is performed in chronological order according to the order described in this specification, or may be a program in which processing is performed in parallel or at necessary timing, such as when a call is made.
  • In this specification, a system means a set of a plurality of components (devices, modules (parts), etc.), regardless of whether or not all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
  • the embodiment of the present technology is not limited to the above-described embodiment, and various changes can be made without departing from the gist of the present technology.
  • this technology can take a cloud computing configuration in which one function is shared by multiple devices via a network and processed jointly.
  • each step described in the above flowchart can be executed by one device or shared by a plurality of devices.
  • the plurality of processes included in the one step can be executed by one device or shared by a plurality of devices.
  • the present technology can also have the following configurations.
  • An information processing device including: a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system; and a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • The information processing device according to (1) above, in which the one CNN is the first CNN and the other CNN is the second CNN.
  • The information processing device described above, in which the transferred information includes at least one of a feature map, an attention map, and a region proposal.
  • the second CNN outputs data influenced by the first CNN.
  • the output data of the second CNN includes data showing a difference between the processing of the first CNN and the processing of the second CNN.
  • The information processing device according to (5) or (6) above, in which the output data of the second CNN includes image data obtained by synthesizing, in a region of the input image to which the first CNN is not paying attention, an image obtained by performing predetermined image processing on the input image by the second CNN.
  • The information processing device according to (11) above, in which the cross-fusion parameters include a parameter indicating the connection relationship between the layers of the first CNN and the layers of the second CNN, and a parameter indicating the type of information transferred between the layers of the first CNN and the layers of the second CNN.
  • the information processing apparatus according to any one of (1) to (12), wherein the transferred information is used for processing the arbitrary layer of the other CNN.
  • The information processing device according to any one of (1) to (13) above, in which image data and data indicating the state, processing, or function of the human visual system are input to the first CNN, and image data is input to the second CNN.
  • An information processing method including: connecting an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • A program that causes a computer to execute processing of connecting an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  • A learning method in which, for a cross-fused CNN in which an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN, the first CNN is trained before being combined with the second CNN, and, when training is performed with the first CNN and the second CNN combined, the second CNN is trained and the configuration and parameters of the first CNN are not changed.
  • 101 information processing system, 111 sensor unit, 112 human visual processing sensor, 113 processing unit, 114 training data set generation unit, 121 cross-fusion CNN, 131 HVC, 132 MVC, 1001 vehicle, 1030 DMS, 1031 HMI, 1072 sensor fusion unit, 1073 recognition unit

Abstract

The present technology relates to an information processing device, an information processing method, a program, and a learning method which make it possible to improve the performance of a model provided with the same function as that of the human vision system. The information processing device comprises: a first Convolution Neural Network (CNN) which implements the same process and function as those of the human vision system; and a second CNN which implements a similar function to that of the first CNN through a different process from the first CNN, wherein an arbitrary layer of the first CNN is connected with an arbitrary layer of the second CNN, and information is transmitted from the arbitrary layer of one of the CNNs to the arbitrary layer of the other CNN. The present technology can be applied to, for example, a system that executes object recognition.

Description

Information processing device, information processing method, program, and learning method
 本技術は、情報処理装置、情報処理方法、プログラム、及び、学習方法に関し、特に、人間の視覚システムと同様の機能を備えるモデルの性能を向上させるようにした情報処理装置、情報処理方法、プログラム、及び、学習方法に関する。 This technology relates to information processing devices, information processing methods, programs, and learning methods, and in particular, information processing devices, information processing methods, and programs designed to improve the performance of models having functions similar to those of human visual systems. , And learning methods.
 従来、複数のセンサによる測距方法で得られる情報から、物体までの距離が複数の距離それぞれである距離尤度を算出し、学習モデルを用いて、複数の測距方法についての距離尤度を統合し、複数の距離それぞれの統合尤度を求めることが提案されている(例えば、特許文献1参照)。 Conventionally, the distance likelihood for each of the multiple distances to the object is calculated from the information obtained by the distance measuring method using multiple sensors, and the learning model is used to determine the distance likelihood for the multiple distance measuring methods. It has been proposed to integrate and obtain the integration likelihood of each of a plurality of distances (see, for example, Patent Document 1).
国際公開第2017/057056号International Publication No. 2017/057056
 ところで、人間の視覚システム(Human vision system)は、CNN(Convolutional Neural Network、畳み込みニューラルネットワーク)により効果的にモデル化できることが知られている。例えば、物体認識の実行中に人間の脳内で行われる演算抽象化(Computational abstraction)の各層にCNNの各層を良好に対応付けることが可能である。例えば、人間の視覚システムをモデル化したCNNの初期層に、エッジ検出などの網膜で実行される演算機能と同様の機能を実行させることができる。 By the way, it is known that the human vision system can be effectively modeled by CNN (Convolutional Neural Network). For example, it is possible to satisfactorily associate each layer of CNN with each layer of computational abstraction performed in the human brain during execution of object recognition. For example, the initial layer of a CNN modeled on a human visual system can be made to perform functions similar to those performed on the retina, such as edge detection.
 一方、特許文献1では、人間の視覚システムをモデル化することは検討されていない。 On the other hand, Patent Document 1 does not consider modeling a human visual system.
 本技術は、このような状況に鑑みてなされたものであり、人間の視覚システムと同様の機能を備えるモデルの性能を向上させるようにするものである。 This technology was made in view of such a situation, and is intended to improve the performance of a model having the same functions as a human visual system.
 The information processing device of the first aspect of the present technology includes a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system, and a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 In the information processing method of the first aspect of the present technology, an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 The program of the first aspect of the present technology causes a computer to execute processing of connecting an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN, and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 In the learning method of the second aspect of the present technology, for a cross-fused CNN in which an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN, the first CNN is trained before being combined with the second CNN, and, when training is performed with the first CNN and the second CNN combined, the second CNN is trained and the configuration and parameters of the first CNN are not changed.
 In the first aspect of the present technology, an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system and an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
 In the second aspect of the present technology, in a cross-fused CNN in which an arbitrary layer of a first CNN (Convolution Neural Network) that realizes processing and functions similar to those of the human visual system is connected with an arbitrary layer of a second CNN that realizes a function similar to that of the first CNN by processing different from that of the first CNN and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN, the first CNN is trained before being combined with the second CNN, and, when training is performed with the first CNN and the second CNN combined, the second CNN is trained and the configuration and parameters of the first CNN are not changed.
FIG. 1 is a block diagram showing an embodiment of an information processing system to which the present technology is applied.
FIG. 2 is a diagram showing a configuration example of the cross-fusion CNN.
FIG. 3 is a diagram showing a configuration example of the HVC.
FIG. 4 is a diagram showing the distribution of the visual cortex and the like of the human brain.
FIG. 5 is a diagram showing a configuration example of the MVC.
FIG. 6 is a diagram showing a configuration example of the cross-fusion CNN.
FIG. 7 is a flowchart for explaining the training process.
FIG. 8 is a block diagram showing a configuration example of the functions of the information processing system in the training stage of the HVC.
FIG. 9 is a block diagram showing a configuration example of the functions of the information processing system in the training stage of the cross-fusion CNN.
FIG. 10 is a diagram for explaining an example of the HVC training method.
FIG. 11 is a diagram for explaining information processing.
FIG. 12 is a block diagram showing a configuration example of the functions of the information processing system in the execution stage of the cross-fusion CNN.
FIG. 13 is a block diagram showing a configuration example of a vehicle control system.
FIG. 14 is a diagram showing an example of sensing regions.
FIG. 15 is a block diagram showing a configuration example of a computer.
 以下、本技術を実施するための形態について説明する。説明は以下の順序で行う。
 1.用語の定義
 2.実施の形態
 3.変形例
 4.本技術の効果と応用例
 5.その他
Hereinafter, a mode for implementing the present technology will be described. The explanation will be given in the following order.
1. Definition of terms
2. Embodiment
3. Modification examples
4. Effects and application examples of the present technology
5. Others
 <<1.用語の定義>>
 以下、本明細書で用いる用語の定義を行う。
<< 1. Definition of terms >>
Hereinafter, the terms used in the present specification will be defined.
  <HVC(ヒューマンビジョンCNN(Human Vision CNN))>
 HVCは、人間の視覚システムと同様の処理及び機能、すなわち、人間の脳内で起こる処理及び機能と同様の処理及び機能を実現するCNNである。HVCは、光の検出から画像の知覚までの全てのステップを含み、潜在的に物体認識等の高レベルの機能を含む人間の視覚システムのモデルとして提供される。
<HVC (Human Vision CNN)>
HVC is a CNN that realizes the same processing and function as the human visual system, that is, the same processing and function as the processing and function occurring in the human brain. HVC is provided as a model of a human visual system that includes all steps from light detection to image perception and potentially includes high level functions such as object recognition.
  <MVC(マシンビジョンCNN(Machine Vision CNN))>
 MVCは、人間の視覚システムの処理をモデル化するという制限を設けずに、HVC(すなわち、人間の視覚システム)と異なる処理により、HVCと類似する機能を実現するCNNである。なお、以下、MVCにより実現される視覚システムをマシンの視覚システムと称する。
<MVC (Machine Vision CNN)>
MVC is a CNN that realizes a function similar to HVC by processing different from HVC (that is, human visual system) without the limitation of modeling the processing of human visual system. Hereinafter, the visual system realized by MVC will be referred to as a machine visual system.
  <交差融合(Cross fusion)>
 交差融合とは、独立したアーキテクチャ及びパラメータを備える複数のCNNを組み合わせ、組み合わせた状態でトレーニングすることにより、複合CNNアーキテクチャ(Combined CNN architecture)を生成することにより、複数のCNNを融合することをいう。以下、複数のCNNを交差融合したCNNアーキテクチャを交差融合CNN(Cross fused CNN)と称する。
<Cross fusion>
Cross fusion refers to fusing a plurality of CNNs by combining CNNs that have independent architectures and parameters and training them in the combined state, thereby generating a combined CNN architecture (Combined CNN architecture). Hereinafter, a CNN architecture in which a plurality of CNNs are cross-fused is referred to as a cross-fused CNN (Cross fused CNN).
 For example, in order to connect the processing structures of different CNNs at a plurality of convolutional layers, a simple cross connection input is added between one CNN and another CNN. This makes it possible to transfer related information between arbitrary intermediate layers (for example, convolutional layers) of different CNNs and to efficiently fuse information from different CNNs at any relevant level of abstraction. As a result, disparity detection between human visual processing and machine visual processing, efficient image fusion, and the like are realized in real time.
  <交差接続(Cross connection)>
 交差接続は、トレーニング可能なスカラ値により、接続先のCNN内の層の機能に影響を与える接続である。交差接続は、例えば、CNNの各畳み込み層からの出力であってもよいし、又は、畳み込み層のサブセットの層からの出力であってもよい。
<Cross connection>
A cross-connection is a connection that affects the function of the layer within the destination CNN by means of a trainable scalar value. The cross-connection may be, for example, the output from each convolutional layer of the CNN, or the output from a layer of a subset of the convolutional layers.
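 A cross connection as defined above could be sketched as follows: a trainable scalar gates a feature map taken from a layer of one CNN before it is added to the input of a layer of the other CNN. This is an illustrative sketch only; the module name and the 1x1 projection used to match channel counts are assumptions, and the spatial sizes of the two feature maps are assumed to already agree.

```python
# Illustrative sketch of a cross connection; module name and 1x1 projection are assumptions.
import torch
import torch.nn as nn

class CrossConnection(nn.Module):
    def __init__(self, src_channels, dst_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))                    # trainable scalar
        self.project = nn.Conv2d(src_channels, dst_channels, kernel_size=1)

    def forward(self, dst_input, src_feature):
        # src_feature comes from an arbitrary layer of the other CNN (e.g. HVC -> MVC);
        # the scalar controls how strongly it influences the destination layer.
        return dst_input + self.scale * self.project(src_feature)
```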
 <<2.実施の形態>>
 次に、本技術の実施の形態について説明する。
<< 2. Embodiment >>
Next, an embodiment of the present technology will be described.
  <視覚処理システム>
 図1は、本技術を適用した情報処理システム101の一実施の形態を示すブロック図である。
<Visual processing system>
FIG. 1 is a block diagram showing an embodiment of an information processing system 101 to which the present technology is applied.
 情報処理システム101は、センサ部111、人間視覚処理センサ(Human visual processing sensors)112、処理ユニット(Processing unit)113、トレーニングデータセット生成部114、及び、トレーニングデータベース115を備える。 The information processing system 101 includes a sensor unit 111, a human visual processing sensors 112, a processing unit 113, a training data set generation unit 114, and a training database 115.
 センサ部111は、例えば、イメージセンサ、LiDAR(Light Detection And Ranging)等の光学センサを備える。センサ部111は、光学センサにより収集されたデータに基づいて、視覚情報(例えば、画像データ)を生成する。 The sensor unit 111 includes, for example, an image sensor and an optical sensor such as LiDAR (Light Detection And Ringing). The sensor unit 111 generates visual information (for example, image data) based on the data collected by the optical sensor.
 人間視覚処理センサ112は、情報処理システム101を使用する人間(ユーザ)の視覚システムの状態、処理、又は、機能を示すデータ(以下、人間視覚処理データと称する)を収集する。人間視覚処理センサ112は、例えば、AR(Augmented Reality)ヘッドセットに設けられる。 The human visual processing sensor 112 collects data (hereinafter referred to as human visual processing data) indicating the state, processing, or function of the human (user) visual system using the information processing system 101. The human visual processing sensor 112 is provided in, for example, an AR (Augmented Reality) headset.
 人間視覚処理データは、例えば、以下のデータを含む。 Human visual processing data includes, for example, the following data.
・視線パラメータ
・瞳孔ダイナミクス
・推定視野(Estimated field of view)
・疲労やカフェイン摂取量等の生理状態データ
・EEG(Electroencephalogram、脳波)データ
・Line-of-sight parameters
・Pupil dynamics
・Estimated field of view
・Physiological state data such as fatigue and caffeine intake
・EEG (electroencephalogram) data
 例えば、ユーザの視覚システムの機能は、ユーザの視線の特徴や、耳に配置されたEEGシステムにより異なる処理レベルで測定することが可能である。また、EEGシステムの中には、人間が心の中で形成している画像を再構築可能なものも存在する。 For example, the function of the user's visual system can be measured at different processing levels depending on the characteristics of the user's line of sight and the EEG system placed in the ear. In addition, some EEG systems can reconstruct the images that human beings form in their minds.
 The processing unit 113 may be configured by, for example, a general-purpose processor such as a CPU, or may be configured by a processor or the like optimized for the information processing system 101. The processing unit 113 realizes the cross-fusion CNN 121.
 交差融合CNN121は、図2に示されるように、HVC131とMVC132を交差融合したものである。交差融合CNN121は、HVC131に関する各種のパラメータ、MVC131に関する各種のパラメータ、並びに、HVC131とMVC132との間の交差接続に関する交差融合パラメータにより表される。 Cross-fusion CNN121 is a cross-fusion of HVC131 and MVC132 as shown in FIG. The cross-fusion CNN 121 is represented by various parameters for the HVC 131, various parameters for the MVC 131, and cross-fusion parameters for the cross connection between the HVC 131 and the MVC 132.
 交差融合パラメータは、HVC131とMVC132の融合方法を示し、例えば、以下のパラメータを含む。 The cross fusion parameter indicates a fusion method of HVC131 and MVC132, and includes, for example, the following parameters.
・交差接続数
・交差接続の原点の層番号
・交差接続の宛先の層番号
・交差接続のタイプ
・Number of cross connections
・Layer number of the origin of each cross connection
・Layer number of the destination of each cross connection
・Type of each cross connection
 交差接続数、交差接続の原点の層番号、及び、交差接続の宛先の層番号により、HVC131とMVC132の接続関係が示される。 The connection relationship between HVC131 and MVC132 is indicated by the number of crossed connections, the layer number of the origin of the crossed connection, and the layer number of the destination of the crossed connection.
 交差接続のタイプは、交差接続により一方のCNNの中間層(例えば、畳み込み層)から転送され、他方のCNNの中間層に入力される情報のタイプを示す。例えば、交差接続のタイプには、特徴マップ(Feature maps)、注目マップ(Attention maps)、領域提案(Region Proposal)等がある。転送された情報は、転送先の層の処理に用いられる。 The type of cross-connection indicates the type of information that is transferred from one CNN intermediate layer (eg, a convolution layer) by cross-connection and input to the other CNN middle layer. For example, the types of cross-connection include feature maps (Feature maps), attention maps (Attention maps), region proposals (Region Proposal), and the like. The transferred information is used for processing the transfer destination layer.
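 For illustration, the cross-fusion parameters listed above (the number of cross connections, the origin and destination layer numbers, and the type of transferred information) could be held in a small configuration structure such as the following sketch; all field names are assumptions, not part of the described system.

```python
# Illustrative sketch only; field names and the set of information types are assumptions.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class CrossConnectionSpec:
    origin_layer: int        # layer number in the source CNN
    destination_layer: int   # layer number in the destination CNN
    info_type: Literal["feature_map", "attention_map", "region_proposal"]

@dataclass
class CrossFusionParams:
    connections: List[CrossConnectionSpec]   # one entry per cross connection

    @property
    def num_connections(self) -> int:
        return len(self.connections)

# Example: two HVC -> MVC connections
params = CrossFusionParams(connections=[
    CrossConnectionSpec(origin_layer=1, destination_layer=2, info_type="feature_map"),
    CrossConnectionSpec(origin_layer=3, destination_layer=4, info_type="attention_map"),
])
```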
 特徴マップは、例えば、CNNの各畳み込み層から出力される画像データであり、画像データ内の各画素の特徴量を示す。 The feature map is, for example, image data output from each convolution layer of CNN, and shows the feature amount of each pixel in the image data.
 注目マップは、特別な形式の特徴マップであり、例えば、物体認識等に重要な領域(注目領域)をヒートマップ等の形式で表すものである。例えば、注目マップにおいて、重要度が高い領域の画素ほど赤に近い色になり、重要度が低い領域の画素ほど青に近い色になる。 The attention map is a feature map of a special format, for example, a region (attention region) important for object recognition or the like is represented in a format such as a heat map. For example, in a map of interest, pixels in a region of high importance have a color closer to red, and pixels in a region of less importance have a color closer to blue.
 領域提案は、例えば、画像内において物体が存在する可能性が高い領域を、矩形の枠等で示す画像データである。領域提案は、例えば、他のアルゴリズムにおいて、画像内に物体が存在するか否か、又は、画像内に何の物体が存在するか等の判断に用いられる。また、領域提案は、例えば、特定のタイプの物体検出に用いられる。 The area proposal is, for example, image data showing an area in an image where an object is likely to exist with a rectangular frame or the like. The region proposal is used, for example, in another algorithm to determine whether or not an object is present in the image, or what object is present in the image. Region proposals are also used, for example, for specific types of object detection.
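 As a simple illustration of how an attention map could be derived from a feature map for transfer over a cross connection, the sketch below averages a feature map over its channels and min-max normalizes the result; the reduction actually used is not specified in the text, so this is an assumption.

```python
# Illustrative sketch: derive a (1, H, W) attention map from a (C, H, W) feature map by
# channel-averaging and min-max normalisation; the actual reduction is an assumption.
import torch

def attention_from_feature_map(feature_map, eps=1e-6):
    att = feature_map.mean(dim=0, keepdim=True)              # average over channels
    att = (att - att.min()) / (att.max() - att.min() + eps)  # scale to [0, 1]
    return att
```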
 交差融合CNN121は、例えば、HVC131とMVC132の処理の差異を拡大したり、強調したりする機能を実現する。また、交差融合CNN121は、HVC131の処理(すなわち、人間の視覚処理)とMVC132の処理(すなわち、マシンの視覚処理)の差異を削減する機能を実現する。 The cross-fusion CNN121 realizes, for example, a function of expanding or emphasizing the processing difference between HVC131 and MVC132. Further, the cross-fusion CNN 121 realizes a function of reducing the difference between the processing of the HVC 131 (that is, the visual processing of a human) and the processing of the MVC 132 (that is, the visual processing of a machine).
 HVC131は、上述したように、標準的なCNNと異なり、人間の視覚システムと同様の処理及び機能を実現するCNNである。なお、現時点で人間の視覚システムのモデルは進化を続けているが、HVC131には、例えば、最先端の人間の視覚システムのモデルを適用することができる。 As described above, the HVC 131 is a CNN that realizes the same processing and functions as a human visual system, unlike a standard CNN. At present, the model of the human visual system continues to evolve, but for example, a state-of-the-art model of the human visual system can be applied to the HVC 131.
 一方、現在の人間の視覚システムのモデルの技術レベルを考慮すると、HVC131は、例えば、以下の構成を含みうる。 On the other hand, considering the technical level of the current model of the human visual system, the HVC 131 may include, for example, the following configuration.
 The independent functional modules of the HVC 131 can be provided with an architecture that reflects different functional areas of the human visual system, such as the primary visual cortex (V1) to the fifth visual cortex (V5), for example, as shown in FIG. 3.
 なお、図4は、人間の脳の概要図を示している。図4には、人間の脳におけるV1乃至V5の分布を模式的に示している。また、図4には、人間の眼201、及び、顔認識を行う領域202の位置が図示されている。 Note that FIG. 4 shows a schematic diagram of the human brain. FIG. 4 schematically shows the distribution of V1 to V5 in the human brain. Further, FIG. 4 shows the positions of the human eye 201 and the region 202 for face recognition.
 In the example of FIG. 3, each convolutional layer of the HVC 131 realizes a function similar to that of V1 to V5 of the human brain. Each convolutional layer of the HVC 131 outputs image data obtained by imaging a signal similar to the signal output from V1 to V5 of the human brain (hereinafter referred to as visual cortex data V1 to visual cortex data V5), and the output is input to the next layer of the HVC 131 or to the MVC 132.
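 A minimal sketch of such an HVC-style backbone is shown below: five convolutional stages stand in for V1 to V5, and each stage's output is exposed so that it can be compared with correct visual cortex data or transferred to the MVC 132 over a cross connection. The channel widths and layer structure are assumptions and do not correspond to the actual HVC 131 architecture.

```python
# Illustrative sketch only; channel widths and layer structure are assumptions.
import torch
import torch.nn as nn

class HVCSketch(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 64, 96, 128]    # assumed channel widths
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU())
            for i in range(5)               # five stages standing in for V1..V5
        ])

    def forward(self, image):
        # image: (N, 3, H, W); returns per-stage outputs ("visual cortex data" V1..V5)
        cortex_maps = {}
        x = image
        for i, stage in enumerate(self.stages, start=1):
            x = stage(x)
            cortex_maps[f"V{i}"] = x
        return cortex_maps
```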
 HVC131が備える機能モジュールは、直列に接続することも可能であるし、非線形に接続することも可能である。 The functional modules included in the HVC 131 can be connected in series or non-linearly.
 特に、フィードフォワード型のネットワークは、視覚野内の最初の200msの処理に最適なモデルを実現することができる。このようなモデルは、人間の霊長類視覚野(Primate visual cortex)の特定の部分のみを活性化することが可能な画像を生成することが実証されている。 In particular, the feedforward type network can realize the optimum model for the processing of the first 200 ms in the visual field. Such models have been demonstrated to produce images capable of activating only specific parts of the human primate visual cortex.
 また、HVC131には、例えば、再起(Recurrence)、すなわち、出力を自身の入力に戻すニューロンを用いることが可能である。これにより、例えば、霊長類視覚野の神経アーキテクチャ(Primate visual cortex neural architecture)及び機能的性能の調整を改善することができる。 Further, for the HVC 131, for example, it is possible to use a neuron that returns the output to its own input, that is, recurrence. This can, for example, improve the adjustment of the neural architecture (Primate visual cortex neural architecture) and functional performance of the primate visual cortex.
 さらに、HVC131には、例えば、スパイキングニューラルネットワーク等のニューロモーフィック(neuromorphic)コンピューティングのアーキテクチャを適用することができる。これにより、例えば、人間の視覚システムの機能をより正確に再現することができる。 Furthermore, a neuromorphic computing architecture such as a spiking neural network can be applied to the HVC 131. This makes it possible, for example, to more accurately reproduce the functions of the human visual system.
 また、HVC131は、一般的な人間の視覚システムをモデル化したものであってもよいし、特定の個人の視覚システムをモデル化したものであってもよい。 Further, the HVC 131 may be a model of a general human visual system or a model of a specific individual visual system.
 さらに、例えば、HVC131が対応する認識処理を限定するようにしてもよい。例えば、HVC131の認識対象とする物体を限定するようにしてもよい。これにより、HVC131の性能が向上する場合がある。 Further, for example, the recognition process supported by the HVC 131 may be limited. For example, the object to be recognized by the HVC 131 may be limited. This may improve the performance of the HVC 131.
 また、HVC131の出力データ(以下、HVC出力と称する)の内容は、HVC131又は交差融合CNN121の用途等により異なる。例えば、HVC131に入力された画像データに対して、画像内の物体の種類を示すラベルを付与したラベル付き画像データが、HVC出力として出力される。 Further, the content of the output data of the HVC 131 (hereinafter referred to as the HVC output) differs depending on the application of the HVC 131 or the cross fusion CNN 121. For example, labeled image data to which a label indicating the type of an object in the image is attached to the image data input to the HVC 131 is output as HVC output.
 MVC132は、上述したように、人間の視覚システムの処理をモデル化するという制限を設けずに、HVC131と異なる処理により、HVC131と類似する機能を実現するCNNである。 As described above, the MVC 132 is a CNN that realizes a function similar to that of the HVC 131 by processing different from the HVC 131 without setting the limitation of modeling the processing of the human visual system.
 なお、後述するように、HVC131は、MVC132と組み合わされる前、すなわち、交差融合CNN121に実装される前にトレーニングされる。そして、HVC131の構成及び重み等のパラメータは、交差融合CNN121に実装された後は変更されない。 As will be described later, the HVC 131 is trained before being combined with the MVC 132, that is, before being mounted on the cross-fusion CNN 121. The parameters such as the configuration and weight of the HVC 131 are not changed after being implemented in the cross-fusion CNN 121.
 一方、MVC132は、HVC131と組み合わされる前、すなわち、交差融合CNN121に実装される前にはトレーニングされず、交差融合CNN121に実装された後、すなわち、HVC131と組み合わされた後にトレーニングされる。すなわち、MVC132の構成及び重み等のパラメータは、交差融合CNN121に実装された後に調整される。 On the other hand, the MVC 132 is not trained before being combined with the HVC 131, i.e., before being mounted on the cross-fused CNN 121, but after being mounted on the cross-fused CNN 121, i.e., after being combined with the HVC 131. That is, parameters such as the configuration and weight of the MVC 132 are adjusted after being mounted on the cross-fusion CNN 121.
 また、HVC131とMVC132との間の交差融合パラメータは、予め定義されたパラメータセットに基づいて固定されていてもよいし、交差融合CNN121のトレーニング中に変更されてもよい。例えば、前者の場合、HVC131とMVC132との間の接続関係は変更されないが、後者の場合、HVC131とMVC132との間の接続関係が変更される場合がある。 Also, the cross-fusion parameters between the HVC 131 and the MVC 132 may be fixed based on a predefined parameter set or may be changed during training of the cross-fusion CNN 121. For example, in the former case, the connection relationship between the HVC 131 and the MVC 132 is not changed, but in the latter case, the connection relationship between the HVC 131 and the MVC 132 may be changed.
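 The training order described above (the HVC trained and then frozen; only the MVC and, where allowed, the cross-fusion parameters updated once the networks are combined) might be set up as in the following sketch; hvc, mvc, and cross_connections are assumed module names, and the choice of optimizer is an assumption.

```python
# Illustrative sketch: freeze the pre-trained HVC so its configuration and weights are not
# changed, and optimise only the MVC and the cross-connection parameters.
import torch

def configure_cross_fusion_training(hvc, mvc, cross_connections, lr=1e-4):
    for p in hvc.parameters():
        p.requires_grad = False          # the HVC stays fixed inside the fused model
    hvc.eval()

    trainable = list(mvc.parameters()) + list(cross_connections.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```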
 さらに、交差融合CNN121では、必ず、MVC132がHVC131の処理の影響を受けるように、HVC131からMVC132に情報が転送される。すなわち、HVC131からMVC132への交差接続が必ず存在する。これにより、HVC131の各層が、交差接続を介して、MVC132の各層に独立して影響を与えることにより、マシンの視覚システムの処理が、人間の視覚システムの処理に依存するようになる。 Further, in the cross-fusion CNN 121, information is always transferred from the HVC 131 to the MVC 132 so that the MVC 132 is affected by the processing of the HVC 131. That is, there is always a cross connection from the HVC 131 to the MVC 132. As a result, each layer of the HVC 131 independently affects each layer of the MVC 132 via the cross connection, so that the processing of the visual system of the machine depends on the processing of the human visual system.
 例えば、図5に示されるように、HVC131の各畳み込み層から出力される視覚野データV1乃至視覚野データV5が、MVC132に転送される。 For example, as shown in FIG. 5, the visual cortex data V1 to the visual cortex data V5 output from each convolution layer of the HVC 131 is transferred to the MVC 132.
 一方、MVC132からHVC131への情報の転送を行うか否かは任意である。MVC132からHVC131への情報の転送は、例えば、HVC131のクラス分類の結果を改善するために、MVC132からの情報をHVC131に入力する場所を決定する場合に有用である。また、例えば、人間の認識機能を改善するために、どのように画像を拡張するのかを決定するのに役立つ。 On the other hand, it is optional whether or not to transfer information from MVC132 to HVC131. The transfer of information from the MVC 132 to the HVC 131 is useful, for example, in determining where to enter the information from the MVC 132 into the HVC 131 in order to improve the results of the classification of the HVC 131. It also helps determine, for example, how to enhance the image to improve human cognitive function.
 As shown in FIG. 2 or FIG. 6, image data to be processed is input to the HVC 131 and the MVC 132 of the cross-fusion CNN 121. The types of image data input to the HVC 131 and to the MVC 132 may be the same or different. For example, image data captured by a camera may be input to the HVC 131, and image data obtained by imaging data collected by LiDAR may be input to the MVC 132. Further, the human visual processing data collected by the human visual processing sensor 112 is input to the HVC 131 as needed.
 HVC131とMVC132は、個別に画像処理を行う。そして、HVC131は、必要に応じて、HVC出力を出力する。一方、MVC132は、HVC131の影響を受けた出力データ、すなわち、人間の視覚システムの機能にある程度依存した出力データ(以下、HVC感化出力(HVC influenced output)と称する)を出力する。 HVC131 and MVC132 perform image processing individually. Then, the HVC 131 outputs the HVC output as needed. On the other hand, the MVC 132 outputs output data influenced by the HVC 131, that is, output data that depends to some extent on the functions of the human visual system (hereinafter referred to as HVC-influenced output).
 HVC感化出力の内容は、交差融合CNN121の用途等により異なる。例えば、HVC感化出力は、HVC出力と同様のデータを含む。また、例えば、HVC感化出力は、HVC131との処理とMVC132の処理の一致点、差異等を示す画像データを含む。この画像データは、例えば、MVC132が注目し、HVC131が注目していない画像の特徴示す注目マップに類似する。また、例えば、HVC感化出力は、物体分類等の画像処理の結果を示す画像データを含む。 The content of the HVC-sensitized output differs depending on the application of the cross-fusion CNN121 and the like. For example, the HVC-sensitized output contains the same data as the HVC output. Further, for example, the HVC-sensitized output includes image data indicating a coincidence point, a difference, etc. between the processing with the HVC 131 and the processing with the MVC 132. This image data is similar to, for example, a focus map featuring an image that the MVC 132 is paying attention to and the HVC 131 is not paying attention to. Further, for example, the HVC-sensitized output includes image data showing the result of image processing such as object classification.
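 As a minimal illustration of an HVC-influenced output that highlights the difference between the two visual systems, the sketch below keeps only the regions that the MVC attends to but the HVC does not; the subtraction-and-clamp formulation is an assumption, not the described method.

```python
# Illustrative sketch: keep only regions the MVC attends to but the HVC does not.
# Both inputs are assumed to be (1, H, W) attention maps in [0, 1].
import torch

def hvc_mvc_disparity(mvc_attention, hvc_attention):
    return torch.clamp(mvc_attention - hvc_attention, min=0.0)
```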
 The training data set generation unit 114 generates a set of training data used for training the HVC 131 (hereinafter referred to as the training data set for HVC) and a set of training data used for training the cross-fusion CNN 121 (hereinafter referred to as the training data set for cross-fusion CNN). The training data set generation unit 114 stores the generated training data set for HVC and training data set for cross-fusion CNN in the training database 115.
 Note that the training data set generation unit 114 may have a function of giving an object recognition task to a human and collecting a label indicating the result of the human's recognition of objects in an image (hereinafter referred to as a human recognition label).
 物体認識タスクとは、例えば、人間の視覚システムをテストするためのタスクである。具体的には、物体認識タスクとは、例えば、人間に対して所定の時間(例えば、100ms間)画像が提示され、人間が画像内の物体を分類し、分類した結果を示すラベルを付与するタスクである。 The object recognition task is, for example, a task for testing a human visual system. Specifically, in the object recognition task, for example, an image is presented to a human for a predetermined time (for example, for 100 ms), the human classifies the object in the image, and a label indicating the result of the classification is given. It's a task.
  <情報処理システム101の処理>
 次に、情報処理システム101の処理について説明する。
<Processing of information processing system 101>
Next, the processing of the information processing system 101 will be described.
   <トレーニング処理>
 まず、図7のフローチャートを参照して、情報処理システム101により実行されるトレーニング処理について説明する。
<Training process>
First, the training process executed by the information processing system 101 will be described with reference to the flowchart of FIG. 7.
 なお、図8は、HVC131のトレーニング段階における情報処理システム101の機能の構成例を示している。図9は、交差融合CNN121のトレーニング段階における情報処理システム101の機能の構成例を示している。 Note that FIG. 8 shows a configuration example of the function of the information processing system 101 in the training stage of the HVC 131. FIG. 9 shows a configuration example of the function of the information processing system 101 in the training stage of the cross-fusion CNN 121.
 ステップS1において、トレーニングデータセット生成部114は、HVC131用のトレーニングデータセット(すなわち、HVC用トレーニングデータセット)を生成する。 In step S1, the training data set generation unit 114 generates a training data set for HVC 131 (that is, a training data set for HVC).
 例えば、HVC用トレーニングデータセットは、人間によりラベル付けされた画像データ群を含む。このラベルは、例えば、上述した物体認識タスク内で付与される。 For example, a training dataset for HVC includes a set of image data labeled by humans. This label is given, for example, within the object recognition task described above.
 又は、例えば、人間が画像内の物体を正しく又は誤って識別したか否かを判断するビデオのデータベースを用いて、自動的にラベルが付与されてもよい。 Alternatively, the label may be automatically added, for example, using a video database that determines whether a human has correctly or incorrectly identified an object in the image.
 また、HVC用トレーニングデータセットは、必要に応じて、人間視覚処理センサ112により収集される人間視覚処理データを含む。人間視覚処理データは、他のトレーニングデータ(例えば、提示された画像に対応する画像データ)と同期して取得される。すなわち、画像が提示された人間から収集された人間視覚処理データが、提示された画像に対応する画像データと関連付けられる。 Further, the training data set for HVC includes human visual processing data collected by the human visual processing sensor 112, if necessary. Human visual processing data is acquired in synchronization with other training data (eg, image data corresponding to the presented image). That is, the human visual processing data collected from the person to whom the image is presented is associated with the image data corresponding to the presented image.
 なお、例えば、画像の提示と人間の視覚システムの活動との間に存在するタイムラグが考慮されてもよい。例えば、人間の脳のV4が、提示された画像に対して反応するまでには、約100msのタイムラグが存在する。この場合、例えば、人間視覚処理データのうち、人間の脳のV4から取得されるデータは、データを取得する100ms前の提示された画像に対応する画像データに関連付けられる。 Note that, for example, the time lag that exists between the presentation of the image and the activity of the human visual system may be taken into consideration. For example, there is a time lag of about 100 ms before V4 in the human brain reacts to the presented image. In this case, for example, among the human visual processing data, the data acquired from V4 of the human brain is associated with the image data corresponding to the presented image 100 ms before the acquisition of the data.
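 The time-lag handling described above could be sketched as follows: each visual cortex sample (for example, from V4) is associated with the image presented roughly 100 ms earlier. The data layout, tolerance window, and nearest-presentation matching are all assumptions.

```python
# Illustrative sketch: associate each cortical sample (e.g. from V4) with the image that
# was presented about `lag_ms` earlier. `image_events` is assumed to be non-empty.
def align_cortex_samples(image_events, cortex_samples, lag_ms=100.0, tolerance_ms=10.0):
    """image_events:  list of (timestamp_ms, image_id)
    cortex_samples: list of (timestamp_ms, sample)"""
    pairs = []
    for t_sample, sample in cortex_samples:
        target_time = t_sample - lag_ms
        best = min(image_events, key=lambda ev: abs(ev[0] - target_time))
        if abs(best[0] - target_time) <= tolerance_ms:
            pairs.append((best[1], sample))   # (image_id, cortical sample)
    return pairs
```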
 トレーニングデータセット生成部114は、生成したHVC用トレーニングデータをトレーニングデータベース115に格納する。 The training data set generation unit 114 stores the generated training data for HVC in the training database 115.
 ステップS2において、処理ユニット121は、HVC131のトレーニングを行う。 In step S2, the processing unit 121 trains the HVC 131.
 例えば、HVC131は、人間の視覚システムに類似する機能を実現するニューラルネットワークの標準的な方法によりトレーニングされる。 For example, the HVC 131 is trained by the standard method of neural networks that realizes functions similar to the human visual system.
 例えば、HVC用トレーニングデータセット内の画像データに対するHVC131の出力と、当該画像データに対応する画像に対する人間の物体認識タスクの結果(すなわち、画像データに付与されている人間認識ラベル)との一致を成功指標とするトレーニングが行われる。 For example, a match between the output of the HVC 131 for the image data in the training data set for HVC and the result of the human object recognition task for the image corresponding to the image data (that is, the human recognition label attached to the image data). Training is conducted as a success indicator.
 なお、同じ画像に対する人間の視覚システム内の活動マッピングと、HVC131内の機能的活動との類似度を、トレーニングの成功指標に含めるようにしてもよい。 It should be noted that the degree of similarity between the activity mapping in the human visual system for the same image and the functional activity in the HVC 131 may be included in the training success index.
 For example, as shown in FIG. 10, image data is input to the HVC 131, and an image corresponding to the image data is presented to a human wearing the human visual processing sensor 112 (hereinafter referred to as a data provider). The human visual processing sensor 112 collects human visual processing data indicating the data provider's reaction to the presented image and inputs it to the HVC 131. The human visual processing sensor 112 also inputs, to an image reproduction model 251, the signals output from the data provider's V1 to V5 (hereinafter referred to as visual cortex signals V1 to visual cortex signals V5) among the collected human visual processing data.
 画像再生モデル251は、視覚野信号V1乃至視覚野信号V5をそれぞれ画像データ(以下、正解視覚野データV1乃至正解視覚野データV5と称する)に変換して出力する。そして、HVC131の各畳み込み層から出力される視覚野画像データV1乃至視覚野画像データV5と、画像再生モデル251から出力される正解視覚野データV1乃至正解視覚野データV5とがそれぞれ比較される。そして、視覚野画像データV1乃至視覚野画像データV5と正解視覚野データV1乃至正解視覚野データV5との類似度が、成功指標として用いられる。 The image reproduction model 251 converts the visual cortex signals V1 to the visual cortex signals V5 into image data (hereinafter referred to as correct visual cortex data V1 to correct visual cortex data V5) and outputs them. Then, the visual cortex image data V1 to the visual cortex image data V5 output from each convolutional layer of the HVC 131 and the correct visual cortex data V1 to the correct visual cortex data V5 output from the image reproduction model 251 are compared with each other. Then, the degree of similarity between the visual cortex image data V1 to the visual cortex image data V5 and the correct visual cortex data V1 to the correct visual cortex data V5 is used as a success index.
 For example, the above two types of success indicators are combined into a score using a predetermined function or the like, and the success or failure of the processing of the HVC 131 for the input image data is determined based on the result of comparing the calculated score with a predetermined threshold. Then, the configuration and parameters of the HVC 131 are adjusted so that the success rate of the processing of the HVC 131 improves.
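 A possible form of the combined success indicator is sketched below: agreement with the human recognition label is combined with the average similarity between the HVC's visual cortex data V1 to V5 and the correct visual cortex data. The weights, the use of cosine similarity, and the assumption that the two sets of maps share shapes are all assumptions.

```python
# Illustrative sketch of a combined success indicator; weights and similarity metric are
# assumptions, not the predetermined function used by the described system.
import torch.nn.functional as F

def hvc_training_score(predicted_label, human_label, cortex_maps, correct_cortex_maps,
                       w_label=0.5, w_cortex=0.5):
    label_score = 1.0 if predicted_label == human_label else 0.0
    sims = []
    for key in cortex_maps:                       # e.g. "V1" ... "V5"
        a = cortex_maps[key].flatten()
        b = correct_cortex_maps[key].flatten()
        sims.append(F.cosine_similarity(a, b, dim=0).item())
    cortex_score = sum(sims) / len(sims)
    return w_label * label_score + w_cortex * cortex_score

# The processing could be judged successful when the score is at least some threshold,
# e.g. hvc_training_score(...) >= 0.8.
```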
 なお、HVC131は、特定の個人用にトレーニングされてもよいし、一般的な人間用にトレーニングされてもよい。 The HVC 131 may be trained for a specific individual or for a general human being.
 次に、HVC131の性能が、交差融合CNN121に実装される前にテストされる。ここで、HVC131の性能のテスト方法の例について説明する。 Next, the performance of HVC131 is tested before it is mounted on the cross-fusion CNN121. Here, an example of a method for testing the performance of the HVC 131 will be described.
 例えば、人間の視覚システムの特定の領域又はニューロンを最大限に活性化する画像をHVC131に生成させる解釈性技術(Interpretability techniques)が用いられる。例えば、HVC131が生成した画像が、EGGスキャナ、MRI(Magnetic Resonance Imaging)スキャナ、又は、NIR(Near-Infrared Spectroscopy)スキャナのような脳をスキャンする装置を装着した人間に提示される。もしHVC131の処理が有効であれば、画像が提示された人間の脳において、ターゲットとなる領域又はニューロンが活性化される。 For example, Interpretability techniques are used to cause the HVC 131 to generate images that maximize the activation of specific areas or neurons of the human visual system. For example, the image generated by the HVC 131 is presented to a person wearing a device that scans the brain, such as an EGG scanner, an MRI (Magnetic Resonance Imaging) scanner, or a NIR (Near-Infrared Spectroscopy) scanner. If the processing of HVC131 is effective, the target region or neuron is activated in the human brain to which the image is presented.
 Further, for example, a database showing the responses of the human visual cortex to various images may be used for testing the HVC 131. For example, correct visual cortex data V1 to correct visual cortex data V5, obtained by converting human visual cortex signals V1 to V5 for various images, are stored in the database. Then, by the method described above with reference to FIG. 10, the HVC 131 is tested by comparing the visual cortex data V1 to visual cortex data V5 output from the HVC 131 for the various images in the database with the correct visual cortex data V1 to correct visual cortex data V5 in the database.
 さらに、例えば、提示された画像に応じた行動パターンが、人間の内部の物体認識処理にリンクされた行動予測テスト(Behavioral predictivity test)が用いられてもよい。例えば、行動予測テストにおける人間の反応は、複数の認識カテゴリ(例えば、認識漏れ(missed recognitions)、偽要請認識(False positive recognitions)等)に分類される。そして、HVC131の認識結果と、行動予測テストにおける人間の反応との類似度が、HVC131のテストに用いられる。 Further, for example, a behavior prediction test (Behavioral predictivity test) in which a behavior pattern corresponding to a presented image is linked to an object recognition process inside a human being may be used. For example, human reactions in behavior prediction tests are classified into multiple recognition categories (eg, missed recognitions, False positive recognitions, etc.). Then, the degree of similarity between the recognition result of HVC131 and the human reaction in the behavior prediction test is used for the test of HVC131.
 そして、例えば、以上のテストのスコア(精度)が事前に設定された閾値以上である場合、HVC131が、交差融合CNN121に実装される。一方、テストのスコアが閾値未満である場合、HVC131の再トレーニングが行われる。 Then, for example, when the score (accuracy) of the above test is equal to or higher than the preset threshold value, the HVC 131 is mounted on the cross-fusion CNN 121. On the other hand, if the test score is below the threshold, HVC131 is retrained.
 In step S3, the training data set generation unit 114 generates a training data set for the cross-fusion CNN 121 (cross-fusion CNN training data set).
 The method of generating the cross-fusion CNN training data set depends on the function to be realized by the cross-fusion CNN 121 (for example, on the content of the HVC-sensitized output). Examples of how the cross-fusion CNN training data set is generated are described below.
 For example, when the cross-fusion CNN 121 outputs an object recognition result as the HVC-sensitized output, the HVC training data set may be reused for cross-fusion CNN training. In this case, for example, a ground truth label indicating the actual object type is added to each input image datum in the HVC training data set that already carries a human recognition label. The human recognition label and the ground truth label may or may not match.
 Alternatively, for example, training data in the same form as the HVC training data set may be generated by having the trained HVC 131 assign pseudo human recognition labels to input image data to which only ground truth labels have been assigned.
 Note that the input image data may include both negative examples and positive examples. A negative example is, for example, image data for which the recognition result by a human and the recognition result of the cross-fusion CNN 121 (the HVC-sensitized output) are expected to differ. A positive example is, for example, image data for which the recognition result by a human and the recognition result of the cross-fusion CNN 121 (the HVC-sensitized output) are expected to match.
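 As a concrete illustration of the sample structure described above, a single training record could carry the image together with both labels, which need not agree. The class and field names below are hypothetical and merely mirror the description of the data set.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CrossFusionSample:
    """One training record: an image with a human recognition label and a ground truth label."""
    image: np.ndarray           # input image data
    human_label: str            # what a human reported seeing (human recognition label)
    ground_truth_label: str     # the actual object type (ground truth label)

# The two labels may or may not match (a mismatch corresponds to a negative example).
sample = CrossFusionSample(image=np.zeros((224, 224, 3), dtype=np.uint8),
                           human_label="dog", ground_truth_label="wolf")
```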
 Also, for example, when the cross-fusion CNN 121 outputs, as the HVC-sensitized output, image data generated by processing the input image data, the cross-fusion CNN training data set includes the input image data and correct image data that is generated from the input image data and has the same format as the HVC-sensitized output.
 Examples of the correct image data are described below.
 For example, when the HVC-sensitized output includes a disparity map, that is, image data showing the differences between human visual processing and machine visual processing, the cross-fusion CNN training data set includes the disparity map as the correct image data.
 For example, attention maps are generated using the trained HVC 131 and the MVC 132 before it is incorporated into the cross-fusion CNN 121. These attention maps are generated, for example, from each neural network (the HVC 131 and the MVC 132) as a whole. It is also possible to use attention maps generated from individual neurons, or from arbitrary combinations of neurons, within each neural network.
 Then, a disparity map is generated by taking the difference between the attention map generated by the HVC 131 and the attention map generated by the MVC 132, and this disparity map is used as the correct image data.
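 A minimal sketch of this subtraction is shown below, assuming both attention maps are two-dimensional arrays defined over the same image. The per-map normalization and the sign convention (machine minus human) are illustrative choices rather than requirements of the present disclosure.

```python
import numpy as np

def disparity_map(hvc_attention: np.ndarray, mvc_attention: np.ndarray) -> np.ndarray:
    """Pixel-wise difference between the machine's and the human model's attention maps."""
    def normalize(a: np.ndarray) -> np.ndarray:
        a = a.astype(np.float32)
        return (a - a.min()) / (a.max() - a.min() + 1e-8)  # scale each map to [0, 1]
    # Positive values mark regions the machine attends to more strongly than the human model.
    return normalize(mvc_attention) - normalize(hvc_attention)
```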
 For example, when the HVC-sensitized output includes AI-enhanced image data that does not significantly affect human decision-making, the cross-fusion CNN training data set includes such AI-enhanced image data as the correct image data.
 For example, predetermined image processing (for example, sharpening, generative data, upscaling, and so on) is applied to the input image data using the MVC 132 before it is incorporated into the cross-fusion CNN 121, thereby generating an AI-enhanced image. Also, for example, predetermined image processing is applied to the input image data using the HVC 131 to generate an attention map estimating where humans would focus in an important recognition task (for example, bleeding detection in surgery).
 Then, based on the generated attention map, the input image is divided into an attention region to which the HVC 131 pays strong attention and a non-attention region to which it does not, that is, into a region where humans are estimated to pay strong attention in the important recognition task and a region where they are not. A composite image is then generated that uses the input image in the attention region and the AI-enhanced image in the non-attention region, and the composite image data corresponding to the generated composite image is used as the correct image data.
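 For illustration, the masking described above could be sketched as follows, assuming the attention map is normalized to the range 0 to 1 and the images are H x W x C arrays; the 0.5 threshold separating the attention region from the non-attention region is hypothetical.

```python
import numpy as np

def composite_image(input_image: np.ndarray, ai_enhanced_image: np.ndarray,
                    attention_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep original pixels where the human model attends strongly; use AI-enhanced pixels elsewhere."""
    attention_region = (attention_map >= threshold)[..., np.newaxis]  # H x W x 1 boolean mask
    return np.where(attention_region, input_image, ai_enhanced_image)
```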
 Further, when human visual processing data is input from the human visual processing sensor 112 to the cross-fusion CNN 121 at the execution stage, the cross-fusion CNN training data also includes human visual processing data.
 The training data set generation unit 114 stores the generated cross-fusion CNN training data set in the training database 115.
 In step S4, the processing unit 121 trains the cross-fusion CNN 121.
 Specifically, the cross-fusion CNN 121 is generated by connecting the trained HVC 131 and the untrained MVC 132 using cross-fusion parameters, in accordance with a predefined architecture.
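 For illustration, such a connection could take the form of two parallel stacks of layers that exchange intermediate feature maps through learnable gates. The sketch below is only one possible realization, not the architecture defined in the present disclosure; it assumes that corresponding layers of the two branches produce feature maps of compatible shapes, and the scalar gates stand in for the cross-fusion parameters.

```python
import torch
import torch.nn as nn

class CrossFusionCNN(nn.Module):
    """Two parallel CNN branches whose intermediate feature maps are exchanged
    through learnable gates standing in for the cross-fusion parameters."""

    def __init__(self, hvc_layers: nn.ModuleList, mvc_layers: nn.ModuleList):
        super().__init__()
        assert len(hvc_layers) == len(mvc_layers)
        self.hvc_layers = hvc_layers   # trained beforehand and kept fixed
        self.mvc_layers = mvc_layers   # trained together with the gates
        self.h2m = nn.Parameter(torch.zeros(len(hvc_layers)))  # HVC -> MVC transfer strength
        self.m2h = nn.Parameter(torch.zeros(len(mvc_layers)))  # MVC -> HVC transfer strength

    def forward(self, x: torch.Tensor):
        h = m = x
        for i, (hvc_layer, mvc_layer) in enumerate(zip(self.hvc_layers, self.mvc_layers)):
            h_out, m_out = hvc_layer(h), mvc_layer(m)
            # Exchange information between the corresponding layers of the two branches.
            h = h_out + self.m2h[i] * m_out
            m = m_out + self.h2m[i] * h_out
        return h, m   # e.g., m (or a head on top of it) would provide the HVC-sensitized output
```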
 In addition, the cross-fusion CNN training data set stored in the training database 115 is divided into a training data set and a test data set. For example, about 80% of the cross-fusion CNN training data set is used as the training data set and the remainder as the test data set.
 The cross-fusion CNN 121 processes the input image data included in the training data set and outputs the HVC-sensitized output.
 Then, for example, a score is assigned based on the result of comparing the HVC-sensitized output with the human recognition label and the ground truth label attached to the input image data. Whether a match between the HVC-sensitized output and the human recognition label is judged a success, and whether a match between the HVC-sensitized output and the ground truth label is judged a success, is determined by, for example, the category of the object to be recognized.
 Also, for example, a score is assigned based on the similarity between the HVC-sensitized output and the correct image data corresponding to the input image data (for example, the similarity of pixel values).
 The above processing is repeated for all the input image data in the training data set, and the configuration and parameters of the cross-fusion CNN 121 are adjusted so as to improve the score. Specifically, the configuration and parameters of the MVC 132 and the cross-fusion parameters are adjusted, while the configuration and parameters of the HVC 131 are left unchanged.
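 In a framework such as PyTorch, keeping the HVC branch fixed while the MVC branch and the cross-fusion parameters are adjusted could be sketched as follows, continuing the hypothetical CrossFusionCNN above; the choice of optimizer and learning rate is illustrative.

```python
import torch

def configure_cross_fusion_training(model: "CrossFusionCNN") -> torch.optim.Optimizer:
    """Freeze the HVC branch; optimize only the MVC branch and the cross-fusion gates."""
    for p in model.hvc_layers.parameters():
        p.requires_grad = False                      # HVC configuration and parameters stay unchanged
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)      # illustrative optimizer and learning rate
```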
 Next, the cross-fusion CNN 121 trained with the training data set is tested using the test data set. In the test stage, the configuration and parameters of the cross-fusion CNN 121 are not adjusted; the same scoring as in the training stage is performed in order to estimate the accuracy of the cross-fusion CNN 121.
 Note that, for example, when the performance of the cross-fusion CNN 121 is judged to be insufficient in the test stage, the process is redone from the training stage.
 The cross-fusion CNN 121 is generated as described above.
 <Information processing>
 Next, the information processing executed by the information processing system 101 will be described with reference to the flowchart of FIG. 11.
 Note that FIG. 12 shows a detailed configuration example of the information processing system 101 at the execution stage.
 In step S51, the cross-fusion CNN 121 acquires image data and human visual processing data.
 Specifically, the image data is input to the HVC 131 and the MVC 132.
 In addition, the human visual processing sensor 112 is worn by the user, collects the user's human visual processing data, and inputs it to the HVC 131. The human visual processing data is of the same type as that used for training the HVC 131 and the cross-fusion CNN 121.
 In step S52, the cross-fusion CNN 121 processes the image data. The cross-fusion CNN 121 then outputs a predetermined HVC-sensitized output, which is the result of processing the image data.
 The human visual processing data is used to synchronize the HVC 131 with the user's current state and behavior. This makes it possible to provide a more accurate model of the human visual system (the HVC 131) and, as a result, to optimize the HVC-sensitized output.
 The visual model processing then ends.
 <<3. Modifications>>
 Modifications of the present technology are described below.
 For example, in the training stage, a plurality of cross-fusion CNNs 121 having different cross-fusion parameters may be trained. Each cross-fusion CNN 121 may then be scored, and the cross-fusion CNN 121 with the highest score may be adopted.
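 A minimal sketch of this selection loop is shown below; train_fn and score_fn stand for the training procedure and the scoring described above and are assumptions, not interfaces defined in the present disclosure.

```python
def select_best_cross_fusion_cnn(candidates, train_fn, score_fn):
    """Train each candidate (built with different cross-fusion parameters) and keep the best one.

    candidates: iterable of candidate models
    train_fn:   callable that trains one model in place (assumed)
    score_fn:   callable returning a scalar score for a trained model (assumed)
    """
    best_model, best_score = None, float("-inf")
    for model in candidates:
        train_fn(model)
        score = score_fn(model)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```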
 Alternatively, for example, training may produce a plurality of cross-fusion CNNs 121 having different permitted deviations from the processing of the HVC 131, and the plurality of cross-fusion CNNs 121 may be used in the information processing system 101. In this way, for example, each cross-fusion CNN 121 can provide a standard user with an output influenced by human visual processing under a different intuitive interpretation.
 Also, for example, a cross-fusion CNN 121 may be generated by combining a plurality of MVCs 132 with a single HVC 131. In this case, for example, a different type of HVC-sensitized output can be output from each MVC 132.
 <<4. Effects and application examples of the present technology>>
 The effects and application examples of the present technology are described below.
 According to the present technology, the human visual system and the machine visual system can be fused effectively. As a result, the performance of a model having functions similar to those of the human visual system can be improved.
 For example, it becomes possible for the HVC 131 and the MVC 132 to influence each other, in one direction or mutually, at an arbitrary semantic level.
 For example, within a controlled range, it is possible to obtain an output in which the machine's visual processing is adjusted by human visual processing.
 For example, the present technology can be suitably applied when humans and machines need to make decisions cooperatively, or when a human must monitor and verify the machine's decisions.
 For example, points of agreement and difference between the processing of the human visual system (the HVC 131) and the processing of the machine visual system (the MVC 132) can be detected and exploited at an arbitrary semantic level.
 Specifically, for example, an output indicating the points of agreement and difference between human and machine visual processing can be obtained. For example, it becomes possible to generate and present a disparity map that emphasizes or differentiates features of an image to which the machine's visual system is paying attention, but the user is not, in relation to the current task.
 For example, the inputs, arbitrary intermediate layers, and outputs of the HVC 131 and the MVC 132 can be accessed individually in order to evaluate the performance of the system. Therefore, for example, the human and machine visual systems can be compared at any semantic processing layer of the CNN. In addition, for example, monitoring of, and confidence in, the parameters related to the human visual system can be ensured.
 This makes it possible, for example, to prevent the choices an AI system makes when interpreting, generating, or manipulating data from deviating from reality. It also makes it possible, for example, to suppress, in a human-machine cooperative system, the phenomenon in which the machine deviates from the human's perception of the scene, causing differences between human and machine decision-making and making it difficult or slow for the human to understand the machine's decisions.
 For example, the cross-fusion CNN 121 can identify image features that are difficult for humans to detect but that affect the machine's visual processing, that is, features that differentiate the machine from the human (differentiated features). For example, even if both the human and the machine visual systems ultimately recognize the same object, it becomes possible to predict differences between the human and the machine at lower semantic levels, such as differences in edge-detection behavior.
 For example, the information perceived by the user can be augmented based on the differences between human and machine visual processing.
 For example, using an AR (Augmented Reality) system or the like, the differentiating features described above can be emphasized and superimposed on the real world in order to improve the user's scene recognition. For example, an interpretability stream can be provided to effectively emphasize features of a scene to which the machine's visual system is paying attention but the human is not.
 For example, in a medical robot or monitoring system, the information perceived by a physician or the like can be augmented. For example, images during surgery can be displayed while emphasizing features that influence the surgeon's decisions but are difficult to notice.
 For example, an image can be augmented without changing features that are important in human visual processing. Specifically, for example, when the user is operating a given apparatus while viewing an image augmented by the machine's visual system (the MVC 132), the information presented to the user is augmented for non-critical tasks, that is, tasks that are not important to the user. On the other hand, the features to which the HVC 131 pays attention are estimated to be important in human visual processing and to matter when the user performs a critical task. Such features are therefore presented to the user without major changes. This maintains the user's accountability for critical tasks.
 Note that the detection of image features that are important to human vision can occur at any semantic level of the CNN and can be difficult to define with rule-based methods. With the present technology, features that are difficult to define by such rule-based methods are preserved without being changed by the MVC 132. Such features may, for example, be features to which humans do not consciously pay attention.
 For example, according to the present technology, the interpretability of, and accountability for, an AI system's processing and results can be improved or customized.
 Specifically, for example, a part of the HVC 131 can be used to verify the performance of the system. Also, for example, automatic domain adaptation can be realized based on the user's experience. Furthermore, the present technology can be applied, for example, to racial bias in face recognition and to medical AR applications.
 Also, for example, the machine's visual system can be harmonized with human visual processing for the sake of the interpretability of, and accountability for, the AI system's processing.
 For example, the cross-fusion CNN 121 is trained with permitted deviations of the CNN's data processing functions at multiple levels. When the permitted deviation is small, the processing steps of the AI system become intuitively understandable to humans. In addition, by paying attention to features that humans do not, multiple cross-fusion CNNs 121 that achieve higher efficiency or accuracy are generated with larger, differing levels of permitted deviation. Then, at the execution stage, a human can control how this "human-like" vision system behaves by selecting the cross-fusion CNN 121 whose level of deviation (intuitive interpretability) is ideal and permitted for the task.
 For example, a physician may select high human coherence for tasks that do not require advanced recognition processing. This guarantees that the AI system's decision criteria resemble those that a human can easily detect, making communication between the human and the AI system easy and quick.
 For example, as described above, the HVC 131 and the MVC 132 are constructed, trained, and fixed individually, and it is guaranteed that the configuration and parameters of the HVC 131 are not changed during the training and execution of the cross-fusion CNN 121. Therefore, for example, an excellent model of the human visual system (the HVC 131) can be used with the guarantee that its configuration and parameters will not be adjusted or changed.
 In addition, for example, a layer of information can be added that cannot be obtained simply by training a CNN with error backpropagation. This can, for example, improve robustness against adversarial attacks. It can also, for example, improve the accuracy of object recognition for heavily occluded objects. Furthermore, it can, for example, improve abstraction performance, compensate for the weaknesses of the human and machine visual systems, and improve recognition accuracy and the like.
 Furthermore, a model to which the present technology is applied has a mapping between human visual processing and machine visual processing. This mapping can be obtained by estimating the mutual activation of neurons in the human and machine visual systems for the same image. Because this mapping is known in the present technology, images can be generated, for example, by a network optimized to produce higher activation in a specific part of the human visual system.
 Also, as this field evolves, the degree of agreement between human and machine neural activation patterns is expected to increase, improving the accuracy of systems to which the present technology is applied. The range of images that a model of the human visual system can handle can also be broadened.
 Furthermore, for example, a natural UI (user interface) can be realized by adding a feedback loop to the user. For example, the responsiveness of the UI can be improved, or the UI can be made more comfortable.
 Also, for example, because the processing of the HVC 131 and the MVC 132 is executed in parallel, processing is faster than in a system implemented in series. It also becomes possible to realize a system that is simpler and more computationally efficient than computing human and machine visual processing separately.
 The present technology can also be applied to mobile apparatuses such as vehicles.
 FIG. 13 is a block diagram showing a configuration example of a vehicle control system 1011, which is an example of a mobile apparatus control system to which the present technology is applied.
 The vehicle control system 1011 is provided in a vehicle 1001 and performs processing related to travel assistance and automated driving of the vehicle 1001.
 The vehicle control system 1011 includes a processor 1021, a communication unit 1022, a map information storage unit 1023, a GNSS (Global Navigation Satellite System) reception unit 1024, an external recognition sensor 1025, an in-vehicle sensor 1026, a vehicle sensor 1027, a recording unit 1028, a travel assistance/automated driving control unit 1029, a DMS (Driver Monitoring System) 1030, an HMI (Human Machine Interface) 1031, and a vehicle control unit 1032.
 The processor 1021, the communication unit 1022, the map information storage unit 1023, the GNSS reception unit 1024, the external recognition sensor 1025, the in-vehicle sensor 1026, the vehicle sensor 1027, the recording unit 1028, the travel assistance/automated driving control unit 1029, the driver monitoring system (DMS) 1030, the human machine interface (HMI) 1031, and the vehicle control unit 1032 are connected to one another via a communication network 1041. The communication network 1041 is constituted by, for example, an in-vehicle communication network or a bus conforming to an arbitrary standard such as CAN (Controller Area Network), LIN (Local Interconnect Network), LAN (Local Area Network), FlexRay (registered trademark), or Ethernet. Note that the units of the vehicle control system 1011 may also be connected directly, without going through the communication network 1041, for example by near field communication (NFC (Near Field Communication)), Bluetooth (registered trademark), or the like.
 Hereinafter, when the units of the vehicle control system 1011 communicate via the communication network 1041, the description of the communication network 1041 is omitted. For example, when the processor 1021 and the communication unit 1022 communicate via the communication network 1041, it is simply described that the processor 1021 and the communication unit 1022 communicate.
 The processor 1021 is constituted by various processors such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), and an ECU (Electronic Control Unit), for example. The processor 1021 controls the vehicle control system 1011 as a whole.
 The communication unit 1022 communicates with various devices inside and outside the vehicle, other vehicles, servers, base stations, and the like, and transmits and receives various kinds of data. As communication with the outside of the vehicle, for example, the communication unit 1022 receives from the outside a program for updating the software that controls the operation of the vehicle control system 1011, map information, traffic information, information about the surroundings of the vehicle 1001, and the like. For example, the communication unit 1022 transmits to the outside information about the vehicle 1001 (for example, data indicating the state of the vehicle 1001, recognition results of the recognition unit 1073, and so on), information about the surroundings of the vehicle 1001, and the like. For example, the communication unit 1022 performs communication supporting a vehicle emergency call system such as eCall.
 The communication method of the communication unit 1022 is not particularly limited, and a plurality of communication methods may be used.
 As communication with the inside of the vehicle, for example, the communication unit 1022 performs wireless communication with in-vehicle devices by a communication method such as wireless LAN, Bluetooth, NFC, or WUSB (Wireless USB). For example, the communication unit 1022 performs wired communication with in-vehicle devices via a connection terminal (and, if necessary, a cable), not shown, using a communication method such as USB (Universal Serial Bus), HDMI (High-Definition Multimedia Interface, registered trademark), or MHL (Mobile High-definition Link).
 Here, an in-vehicle device is, for example, a device in the vehicle that is not connected to the communication network 1041. For example, mobile devices and wearable devices carried by occupants such as the driver, and information devices brought into the vehicle and temporarily installed there, are assumed.
 For example, the communication unit 1022 communicates, via a base station or an access point, with a server or the like on an external network (for example, the Internet, a cloud network, or an operator-specific network) using a wireless communication method such as 4G (fourth-generation mobile communication system), 5G (fifth-generation mobile communication system), LTE (Long Term Evolution), or DSRC (Dedicated Short Range Communications).
 For example, the communication unit 1022 communicates with a terminal in the vicinity of the vehicle (for example, a terminal of a pedestrian or a store, or an MTC (Machine Type Communication) terminal) using P2P (Peer To Peer) technology. For example, the communication unit 1022 performs V2X communication. V2X communication includes, for example, vehicle-to-vehicle communication with other vehicles, vehicle-to-infrastructure communication with roadside units and the like, vehicle-to-home communication, and vehicle-to-pedestrian communication with terminals carried by pedestrians.
 For example, the communication unit 1022 receives electromagnetic waves transmitted by a road traffic information communication system (VICS (Vehicle Information and Communication System), registered trademark), such as radio wave beacons, optical beacons, and FM multiplex broadcasting.
 The map information storage unit 1023 stores maps acquired from the outside and maps created by the vehicle 1001. For example, the map information storage unit 1023 stores a three-dimensional high-precision map, a global map that is less precise than the high-precision map but covers a wider area, and the like.
 The high-precision map is, for example, a dynamic map, a point cloud map, or a vector map (also referred to as an ADAS (Advanced Driver Assistance System) map). The dynamic map is, for example, a map consisting of four layers of dynamic information, semi-dynamic information, semi-static information, and static information, and is provided from an external server or the like. The point cloud map is a map composed of a point cloud (point cloud data). The vector map is a map in which information such as lane and traffic signal positions is associated with the point cloud map. The point cloud map and the vector map may be provided, for example, from an external server or the like, or may be created in the vehicle 1001, based on sensing results from the radar 1052, the LiDAR 1053, and the like, as maps for matching with a local map described later, and stored in the map information storage unit 1023. When a high-precision map is provided from an external server or the like, map data of, for example, several hundred meters square, relating to the planned route that the vehicle 1001 is about to travel, is acquired from the server or the like in order to reduce the communication volume.
 The GNSS reception unit 1024 receives GNSS signals from GNSS satellites and supplies them to the travel assistance/automated driving control unit 1029.
 The external recognition sensor 1025 includes various sensors used for recognizing the situation outside the vehicle 1001, and supplies sensor data from each sensor to the units of the vehicle control system 1011. The types and number of sensors included in the external recognition sensor 1025 are arbitrary.
 For example, the external recognition sensor 1025 includes a camera 1051, a radar 1052, a LiDAR (Light Detection and Ranging, Laser Imaging Detection and Ranging) 1053, and an ultrasonic sensor 1054. The numbers of cameras 1051, radars 1052, LiDARs 1053, and ultrasonic sensors 1054 are arbitrary, and examples of the sensing areas of the respective sensors are described later.
 As the camera 1051, a camera of any imaging method, such as a ToF (Time Of Flight) camera, a stereo camera, a monocular camera, or an infrared camera, is used as needed.
 Also, for example, the external recognition sensor 1025 includes environment sensors for detecting weather, meteorological conditions, brightness, and the like. The environment sensors include, for example, a raindrop sensor, a fog sensor, a sunshine sensor, a snow sensor, an illuminance sensor, and the like.
 Furthermore, for example, the external recognition sensor 1025 includes microphones used for detecting sounds around the vehicle 1001, the positions of sound sources, and the like.
 The in-vehicle sensor 1026 includes various sensors for detecting information inside the vehicle, and supplies sensor data from each sensor to the units of the vehicle control system 1011. The types and number of sensors included in the in-vehicle sensor 1026 are arbitrary.
 For example, the in-vehicle sensor 1026 includes a camera, a radar, a seating sensor, a steering wheel sensor, a microphone, a biometric sensor, and the like. As the camera, a camera of any imaging method, such as a ToF camera, a stereo camera, a monocular camera, or an infrared camera, can be used. The biometric sensor is provided, for example, on a seat, the steering wheel, or the like, and detects various kinds of biometric information of an occupant such as the driver.
 The vehicle sensor 1027 includes various sensors for detecting the state of the vehicle 1001, and supplies sensor data from each sensor to the units of the vehicle control system 1011. The types and number of sensors included in the vehicle sensor 1027 are arbitrary.
 For example, the vehicle sensor 1027 includes a speed sensor, an acceleration sensor, an angular velocity sensor (gyro sensor), and an inertial measurement unit (IMU). For example, the vehicle sensor 1027 includes a steering angle sensor that detects the steering angle of the steering wheel, a yaw rate sensor, an accelerator sensor that detects the operation amount of the accelerator pedal, and a brake sensor that detects the operation amount of the brake pedal. For example, the vehicle sensor 1027 includes a rotation sensor that detects the rotational speed of the engine or motor, an air pressure sensor that detects tire pressure, a slip ratio sensor that detects the tire slip ratio, and a wheel speed sensor that detects the rotational speed of the wheels. For example, the vehicle sensor 1027 includes a battery sensor that detects the remaining charge and temperature of the battery, and an impact sensor that detects impacts from the outside.
 The recording unit 1028 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a magnetic storage device such as an HDD (Hard Disc Drive), a semiconductor storage device, an optical storage device, a magneto-optical storage device, and the like. The recording unit 1028 records various programs, data, and the like used by the units of the vehicle control system 1011. For example, the recording unit 1028 records rosbag files containing messages transmitted and received over the ROS (Robot Operating System) on which application programs related to automated driving run. For example, the recording unit 1028 includes an EDR (Event Data Recorder) and a DSSAD (Data Storage System for Automated Driving), and records information about the vehicle 1001 before and after an event such as an accident.
 The travel assistance/automated driving control unit 1029 controls the travel assistance and automated driving of the vehicle 1001. For example, the travel assistance/automated driving control unit 1029 includes an analysis unit 1061, an action planning unit 1062, and an operation control unit 1063.
 The analysis unit 1061 analyzes the situation of the vehicle 1001 and its surroundings. The analysis unit 1061 includes a self-position estimation unit 1071, a sensor fusion unit 1072, and a recognition unit 1073.
 The self-position estimation unit 1071 estimates the self-position of the vehicle 1001 based on the sensor data from the external recognition sensor 1025 and the high-precision map stored in the map information storage unit 1023. For example, the self-position estimation unit 1071 generates a local map based on the sensor data from the external recognition sensor 1025 and estimates the self-position of the vehicle 1001 by matching the local map against the high-precision map. The position of the vehicle 1001 is referenced, for example, to the center of the rear-wheel axle.
 The local map is, for example, a three-dimensional high-precision map created using a technology such as SLAM (Simultaneous Localization and Mapping), an occupancy grid map, or the like. The three-dimensional high-precision map is, for example, the point cloud map described above. The occupancy grid map is a map that divides the three-dimensional or two-dimensional space around the vehicle 1001 into a grid of cells of a predetermined size and indicates the occupancy state of objects in units of cells. The occupancy state of an object is indicated, for example, by the presence or absence of the object or by its existence probability. The local map is also used, for example, in the detection processing and recognition processing of the situation outside the vehicle 1001 by the recognition unit 1073.
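 As a simple illustration of the occupancy grid idea, a two-dimensional binary grid could be built from sensed point coordinates as follows; the cell size, the extent, and the use of a binary presence/absence value instead of an existence probability are illustrative assumptions.

```python
import numpy as np

def occupancy_grid_2d(points: np.ndarray, cell_size: float = 0.2,
                      extent: float = 20.0) -> np.ndarray:
    """Mark each grid cell around the vehicle that contains at least one sensed point.

    points: N x 2 (or N x 3) array of x, y(, z) coordinates relative to the vehicle.
    """
    n_cells = int(2 * extent / cell_size)
    grid = np.zeros((n_cells, n_cells), dtype=np.uint8)
    indices = np.floor((points[:, :2] + extent) / cell_size).astype(int)
    inside = (indices >= 0).all(axis=1) & (indices < n_cells).all(axis=1)
    grid[indices[inside, 0], indices[inside, 1]] = 1   # 1 = occupied, 0 = free/unknown
    return grid
```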
 Note that the self-position estimation unit 1071 may estimate the self-position of the vehicle 1001 based on GNSS signals and sensor data from the vehicle sensor 1027.
 The sensor fusion unit 1072 performs sensor fusion processing for obtaining new information by combining a plurality of different types of sensor data (for example, image data supplied from the camera 1051 and sensor data supplied from the radar 1052). Methods for combining different types of sensor data include integration, fusion, association, and the like.
 The recognition unit 1073 performs detection processing and recognition processing of the situation outside the vehicle 1001.
 For example, the recognition unit 1073 performs detection processing and recognition processing of the situation outside the vehicle 1001 based on information from the external recognition sensor 1025, information from the self-position estimation unit 1071, information from the sensor fusion unit 1072, and the like.
 Specifically, for example, the recognition unit 1073 performs detection processing, recognition processing, and the like for objects around the vehicle 1001. Object detection processing is, for example, processing for detecting the presence or absence, size, shape, position, motion, and the like of an object. Object recognition processing is, for example, processing for recognizing attributes such as the type of an object or for identifying a specific object. However, detection processing and recognition processing are not necessarily clearly separated and may overlap.
 For example, the recognition unit 1073 detects objects around the vehicle 1001 by clustering, which classifies a point cloud based on sensor data from the LiDAR, the radar, or the like into blocks of points. In this way, the presence or absence, size, shape, and position of objects around the vehicle 1001 are detected.
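 For illustration only, such clustering could be performed with a density-based algorithm such as DBSCAN; the algorithm, library, and parameters in the sketch below are assumptions and are not specified in the present disclosure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_objects_from_point_cloud(points: np.ndarray, eps: float = 0.5,
                                    min_samples: int = 10) -> list:
    """Cluster an N x 3 point cloud and return a rough position and size per cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    objects = []
    for label in set(labels) - {-1}:          # label -1 marks noise points
        cluster = points[labels == label]
        objects.append({
            "position": cluster.mean(axis=0),                    # rough object position
            "size": cluster.max(axis=0) - cluster.min(axis=0),   # rough extent per axis
            "num_points": len(cluster),
        })
    return objects
```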
 For example, the recognition unit 1073 detects the motion of objects around the vehicle 1001 by tracking, which follows the motion of the blocks of points classified by the clustering. In this way, the speed and traveling direction (motion vector) of objects around the vehicle 1001 are detected.
 For example, the recognition unit 1073 recognizes the types of objects around the vehicle 1001 by performing object recognition processing such as semantic segmentation on the image data supplied from the camera 1051.
 Objects to be detected or recognized are assumed to include, for example, vehicles, people, bicycles, obstacles, structures, roads, traffic lights, traffic signs, and road markings.
 For example, the recognition unit 1073 performs recognition processing of the traffic rules around the vehicle 1001 based on the maps stored in the map information storage unit 1023, the estimation result of the self-position, and the recognition results for objects around the vehicle 1001. Through this processing, for example, the positions and states of traffic signals, the contents of traffic signs and road markings, the contents of traffic regulations, the lanes in which the vehicle can travel, and the like are recognized.
 For example, the recognition unit 1073 performs recognition processing of the environment around the vehicle 1001. The surrounding environment to be recognized is assumed to include, for example, weather, temperature, humidity, brightness, road surface conditions, and the like.
 The action planning unit 1062 creates an action plan for the vehicle 1001. For example, the action planning unit 1062 creates the action plan by performing path planning and path following processing.
 Global path planning is the processing of planning a rough path from the start to the goal. This path planning also includes what is called trajectory planning, that is, local path planning that generates, along the path planned by the global path planning, a trajectory on which the vehicle can proceed safely and smoothly in the vicinity of the vehicle 1001 in consideration of the motion characteristics of the vehicle 1001.
 Path following is the processing of planning operations for traveling safely and accurately, within the planned time, along the path planned by the path planning. For example, the target speed and target angular velocity of the vehicle 1001 are calculated.
 The operation control unit 1063 controls the operation of the vehicle 1001 in order to realize the action plan created by the action planning unit 1062.
 For example, the operation control unit 1063 controls the steering control unit 1081, the brake control unit 1082, and the drive control unit 1083 to perform acceleration/deceleration control and direction control so that the vehicle 1001 proceeds along the trajectory calculated by the trajectory planning. For example, the operation control unit 1063 performs cooperative control aimed at realizing ADAS functions such as collision avoidance or impact mitigation, following travel, constant-speed travel, collision warnings for the own vehicle, and lane departure warnings for the own vehicle. For example, the operation control unit 1063 performs cooperative control aimed at automated driving or the like in which the vehicle travels autonomously without depending on the driver's operation.
 The DMS 1030 performs driver authentication processing, driver state recognition processing, and the like based on sensor data from the in-vehicle sensor 1026, input data input to the HMI 1031, and the like. The driver states to be recognized are assumed to include, for example, physical condition, alertness, concentration, fatigue, gaze direction, degree of intoxication, driving operation, posture, and the like.
 Note that the DMS 1030 may perform authentication processing for occupants other than the driver and recognition processing of the states of those occupants. Also, for example, the DMS 1030 may perform recognition processing of the situation inside the vehicle based on sensor data from the in-vehicle sensor 1026. The in-vehicle situations to be recognized are assumed to include, for example, temperature, humidity, brightness, odor, and the like.
 The HMI 1031 is used for inputting various kinds of data, instructions, and the like, generates input signals based on the input data, instructions, and the like, and supplies them to the units of the vehicle control system 1011. For example, the HMI 1031 includes operation devices such as a touch panel, buttons, a microphone, switches, and levers, as well as operation devices that allow input by methods other than manual operation, such as voice or gestures. Note that the HMI 1031 may be, for example, a remote control device using infrared rays or other radio waves, or an externally connected device such as a mobile device or wearable device that supports the operation of the vehicle control system 1011.
 The HMI 1031 also performs output control for generating and outputting visual information, auditory information, and tactile information for the occupants or for the outside of the vehicle, and for controlling the output content, output timing, output method, and the like. Visual information is, for example, information indicated by images or light, such as an operation screen, a status display of the vehicle 1001, a warning display, or a monitor image showing the situation around the vehicle 1001. Auditory information is, for example, information indicated by sound, such as guidance, warning sounds, and warning messages. Tactile information is, for example, information given to the occupant's sense of touch by force, vibration, motion, or the like.
 As devices that output visual information, for example, a display device, a projector, a navigation device, an instrument panel, a CMS (Camera Monitoring System), an electronic mirror, lamps, and the like are assumed. Besides a device having an ordinary display, the display device may be a device that displays visual information within the occupant's field of view, such as a head-up display, a transmissive display, or a wearable device with an AR (Augmented Reality) function.
 As devices that output auditory information, for example, audio speakers, headphones, earphones, and the like are assumed.
 As devices that output tactile information, for example, haptic elements using haptics technology and the like are assumed. The haptic elements are provided, for example, on the steering wheel, the seats, and the like.
 The vehicle control unit 1032 controls each part of the vehicle 1001. The vehicle control unit 1032 includes a steering control unit 1081, a brake control unit 1082, a drive control unit 1083, a body system control unit 1084, a light control unit 1085, and a horn control unit 1086.
 The steering control unit 1081 detects and controls the state of the steering system of the vehicle 1001. The steering system includes, for example, a steering mechanism with a steering wheel, electric power steering, and the like. The steering control unit 1081 includes, for example, a control unit such as an ECU that controls the steering system, and an actuator that drives the steering system.
 The brake control unit 1082 detects and controls the state of the brake system of the vehicle 1001. The brake system includes, for example, a brake mechanism with a brake pedal, an ABS (Antilock Brake System), and the like. The brake control unit 1082 includes, for example, a control unit such as an ECU that controls the brake system, and an actuator that drives the brake system.
 The drive control unit 1083 detects and controls the state of the drive system of the vehicle 1001. The drive system includes, for example, an accelerator pedal, a driving force generating device such as an internal combustion engine or a drive motor, and a driving force transmission mechanism that transmits the driving force to the wheels. The drive control unit 1083 includes, for example, a control unit such as an ECU that controls the drive system, and an actuator that drives the drive system.
 The body system control unit 1084 detects and controls the state of the body system of the vehicle 1001. The body system includes, for example, a keyless entry system, a smart key system, power windows, power seats, an air conditioner, airbags, seat belts, and a shift lever. The body system control unit 1084 includes, for example, a control unit such as an ECU that controls the body system, and an actuator that drives the body system.
 The light control unit 1085 detects and controls the states of the various lights of the vehicle 1001. The lights to be controlled include, for example, headlights, backlights, fog lights, turn signals, brake lights, projections, and bumper displays. The light control unit 1085 includes a control unit such as an ECU that controls the lights, and an actuator that drives the lights.
 The horn control unit 1086 detects and controls the state of the car horn of the vehicle 1001. The horn control unit 1086 includes, for example, a control unit such as an ECU that controls the car horn, and an actuator that drives the car horn.
 FIG. 14 is a diagram showing examples of the sensing regions of the camera 1051, the radar 1052, the LiDAR 1053, and the ultrasonic sensor 1054 of the external recognition sensor 1025 in FIG. 13.
 The sensing regions 1101F and 1101B are examples of the sensing regions of the ultrasonic sensor 1054. The sensing region 1101F covers the area around the front end of the vehicle 1001, and the sensing region 1101B covers the area around the rear end of the vehicle 1001.
 The sensing results in the sensing regions 1101F and 1101B are used, for example, for parking assistance of the vehicle 1001.
 The sensing regions 1102F to 1102B are examples of the sensing regions of the short-range or medium-range radar 1052. The sensing region 1102F extends farther ahead of the vehicle 1001 than the sensing region 1101F, and the sensing region 1102B extends farther behind the vehicle 1001 than the sensing region 1101B. The sensing region 1102L covers the area behind the left side of the vehicle 1001, and the sensing region 1102R covers the area behind the right side of the vehicle 1001.
 The sensing results in the sensing region 1102F are used, for example, to detect vehicles, pedestrians, and the like in front of the vehicle 1001. The sensing results in the sensing region 1102B are used, for example, for a rear collision prevention function of the vehicle 1001. The sensing results in the sensing regions 1102L and 1102R are used, for example, to detect objects in the blind spots at the sides of the vehicle 1001.
 The sensing regions 1103F to 1103B are examples of the sensing regions of the camera 1051. The sensing region 1103F extends farther ahead of the vehicle 1001 than the sensing region 1102F, and the sensing region 1103B extends farther behind the vehicle 1001 than the sensing region 1102B. The sensing region 1103L covers the area around the left side of the vehicle 1001, and the sensing region 1103R covers the area around the right side of the vehicle 1001.
 The sensing results in the sensing region 1103F are used, for example, for recognizing traffic lights and traffic signs and for lane departure prevention assistance systems. The sensing results in the sensing region 1103B are used, for example, for parking assistance and surround view systems. The sensing results in the sensing regions 1103L and 1103R are used, for example, for surround view systems.
 The sensing region 1104 is an example of the sensing region of the LiDAR 1053. The sensing region 1104 extends farther ahead of the vehicle 1001 than the sensing region 1103F, while its lateral range is narrower than that of the sensing region 1103F.
 The sensing results in the sensing region 1104 are used, for example, for emergency braking, collision avoidance, and pedestrian detection.
 The sensing region 1105 is an example of the sensing region of the long-range radar 1052. The sensing region 1105 extends farther ahead of the vehicle 1001 than the sensing region 1104, while its lateral range is narrower than that of the sensing region 1104.
 The sensing results in the sensing region 1105 are used, for example, for ACC (Adaptive Cruise Control).
 Note that the sensing region of each sensor may be configured in various ways other than those shown in FIG. 14. Specifically, the ultrasonic sensor 1054 may also sense the sides of the vehicle 1001, and the LiDAR 1053 may sense the area behind the vehicle 1001.
 For example, the present technology (the cross-fusion CNN 121) can be applied to the DMS 1030, the HMI 1031, the sensor fusion unit 1072, the recognition unit 1073, and the like.
 This makes it possible, for example, to augment the information displayed by the HMI 1031.
 For example, when AR technology is used to project an image onto the windshield with the HUD (Head Up Display) of the HMI 1031 and superimpose information within the driver's field of view, the cross-fusion CNN 121 can be used to emphasize or add information that is difficult for humans to perceive. For example, in fog, the subtle contrast of objects and the like in the field of view ahead of the vehicle 1001 can be emphasized in the displayed image.
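 As a rough illustration of this kind of HUD enhancement, the sketch below boosts local contrast only where a human-saliency map produced by the HVC branch is low. The function name, the array shapes, and the simple linear contrast model are assumptions made for illustration; the disclosure does not specify a particular enhancement algorithm.

```python
import numpy as np

def enhance_low_saliency_regions(image: np.ndarray,
                                 human_saliency: np.ndarray,
                                 gain: float = 2.0) -> np.ndarray:
    """Boost local contrast where the human-vision branch reports low saliency.

    image:          float32 in [0, 1], shape (H, W, 3).
    human_saliency: float32 in [0, 1], shape (H, W); low values mark regions
                    the HVC (and hence the driver) is likely to miss.
    gain:           maximum contrast amplification in the least salient areas.
    """
    # Per-pixel amplification: 1.0 in salient regions, up to `gain` elsewhere.
    amplification = 1.0 + (gain - 1.0) * (1.0 - human_saliency)[..., None]
    mean_color = image.mean(axis=(0, 1), keepdims=True)
    enhanced = mean_color + (image - mean_color) * amplification
    return np.clip(enhanced, 0.0, 1.0)
```

 In such a use, the HUD could display only the difference between the enhanced image and the original image, so that regions the driver can already see clearly remain untouched.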
 Further, for example, the cross-fusion CNN 121 can be applied to the recognition unit 1073 and the HMI 1031 to fuse the image data supplied from the camera 1051 with the LiDAR data supplied from the LiDAR 1053 and present the result to the driver.
 Specifically, image data from the camera 1051 is input to the HVC 131 of the cross-fusion CNN 121, and LiDAR data from the LiDAR 1053 is input to the MVC 132.
 For example, in the training stage, the cross-fusion CNN 121 is trained using ground-truth fused image data in which the image data and the LiDAR data are fused. The ground-truth fused image data is, for example, image data in which features detected on the basis of the LiDAR data are emphasized or added, together with a label indicating whether each LiDAR-based feature is salient to a human observer.
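 To make the training-data description above concrete, the following sketch builds one hypothetical ground-truth fused sample. The highlighting scheme, the array shapes, and the binary salience label are assumptions chosen for illustration only; they are not taken from the disclosure.

```python
import numpy as np

def make_ground_truth_fused_sample(camera_img: np.ndarray,
                                   lidar_feature_mask: np.ndarray,
                                   human_salient: bool,
                                   highlight: float = 0.3):
    """Build one hypothetical training sample: the camera image with
    LiDAR-detected features emphasized, plus a label saying whether those
    features are already salient to a human observer.

    camera_img:         float32 in [0, 1], shape (H, W, 3).
    lidar_feature_mask: float32 in [0, 1], shape (H, W); 1 where the LiDAR
                        detected a feature (e.g. an obstacle edge).
    """
    fused = np.clip(camera_img + highlight * lidar_feature_mask[..., None],
                    0.0, 1.0)
    label = np.float32(1.0 if human_salient else 0.0)
    return fused, label
```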
 Next, in the execution stage, for example, the cross-fusion CNN 121 fuses the image data and the LiDAR data by adding features based on the LiDAR data to the features of the image data to which the HVC 131 (and therefore the driver) attends. The HMI 1031 then displays the image in which the image data and the LiDAR data have been fused.
 This makes it possible not only to fuse the features of data from multiple sensors more efficiently, but also to present information that is easy for the driver to see.
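 Purely as an illustration of how such a camera/LiDAR cross-fusion might be wired, a minimal PyTorch-style sketch is shown below. The class name, the layer sizes, and the choice of a single-channel attention map as the transferred information are assumptions; the disclosure leaves the concrete architecture and the transferred information open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusionSketch(nn.Module):
    """Two-branch sketch: an HVC branch for camera images and an MVC branch
    for LiDAR data (here a 1-channel range image), with an attention map
    transferred from an HVC layer into an MVC layer."""

    def __init__(self):
        super().__init__()
        # HVC branch (stands in for the pre-trained human-vision CNN).
        self.hvc_l1 = nn.Conv2d(3, 16, 3, padding=1)
        self.hvc_attention = nn.Conv2d(16, 1, 1)    # 1-channel attention map
        # MVC branch (machine-vision CNN over LiDAR range images).
        self.mvc_l1 = nn.Conv2d(1, 16, 3, padding=1)
        self.mvc_l2 = nn.Conv2d(16, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 3, 1)             # fused image for the HMI

    def forward(self, camera_img, lidar_range_img):
        h = F.relu(self.hvc_l1(camera_img))
        attn = torch.sigmoid(self.hvc_attention(h))  # where the HVC "looks"
        m = F.relu(self.mvc_l1(lidar_range_img))
        # Cross-fusion: the HVC attention map modulates an MVC layer.
        attn = F.interpolate(attn, size=m.shape[-2:], mode="bilinear",
                             align_corners=False)
        m = F.relu(self.mvc_l2(m * attn))
        return self.head(m)
```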
<< 5. Others >>
 <Computer configuration example>
 The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the programs constituting the software are installed on a computer. Here, the computer includes a computer built into dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
 FIG. 15 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processes by a program.
 In the computer 2000, a CPU (Central Processing Unit) 2001, a ROM (Read Only Memory) 2002, and a RAM (Random Access Memory) 2003 are connected to one another by a bus 2004.
 An input/output interface 2005 is further connected to the bus 2004. An input unit 2006, an output unit 2007, a recording unit 2008, a communication unit 2009, and a drive 2010 are connected to the input/output interface 2005.
 The input unit 2006 includes input switches, buttons, a microphone, an image sensor, and the like. The output unit 2007 includes a display, speakers, and the like. The recording unit 2008 includes a hard disk, a non-volatile memory, and the like. The communication unit 2009 includes a network interface and the like. The drive 2010 drives removable media 2011 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory.
 In the computer 2000 configured as described above, the series of processes described above is performed by the CPU 2001 loading, for example, a program recorded in the recording unit 2008 into the RAM 2003 via the input/output interface 2005 and the bus 2004, and executing it.
 The program executed by the computer 2000 (CPU 2001) can be provided, for example, recorded on the removable media 2011 as packaged media or the like. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
 In the computer 2000, the program can be installed in the recording unit 2008 via the input/output interface 2005 by mounting the removable media 2011 in the drive 2010. The program can also be received by the communication unit 2009 via a wired or wireless transmission medium and installed in the recording unit 2008. Alternatively, the program can be installed in advance in the ROM 2002 or the recording unit 2008.
 Note that the program executed by the computer may be a program in which processing is performed in time series in the order described in this specification, or a program in which processing is performed in parallel or at necessary timings such as when a call is made.
 In this specification, a system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
 Furthermore, the embodiments of the present technology are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present technology.
 For example, the present technology can have a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.
 In addition, each step described in the above flowcharts can be executed by one device or shared and executed by a plurality of devices.
 Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or shared and executed by a plurality of devices.
 <Examples of configuration combinations>
 The present technology can also have the following configurations.
(1)
 An information processing device including:
 a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system; and
 a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN,
 in which an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
(2)
 The information processing device according to (1), in which the one CNN is the first CNN and the other CNN is the second CNN.
(3)
 The information processing device according to (2), in which the transferred information includes image data obtained by imaging a signal similar to a signal output from the human visual cortex.
(4)
 The information processing device according to (2) or (3), in which the transferred information includes at least one of a feature map, an attention map, and a region proposal.
(5)
 The information processing device according to any one of (2) to (4), in which the second CNN outputs data influenced by the first CNN.
(6)
 The information processing device according to (5), in which the output data of the second CNN includes data indicating a difference between the processing of the first CNN and the processing of the second CNN.
(7)
 The information processing device according to (5) or (6), in which the output data of the second CNN includes image data in which an image obtained by the second CNN performing predetermined image processing on an input image is combined with a region of the input image to which the first CNN is not attending.
(8)
 The information processing device according to any one of (5) to (7), in which the output data of the second CNN includes data similar to the output data of the first CNN.
(9)
 The information processing device according to any one of (1) to (8), in which the first CNN is trained before being combined with the second CNN, and the configuration and parameters of the first CNN are not changed when training is performed with the first CNN and the second CNN combined.
(10)
 The information processing device according to (9), in which the second CNN is trained when training is performed with the first CNN and the second CNN combined.
(11)
 The information processing device according to (10), in which, when training is performed with the first CNN and the second CNN combined, cross-fusion parameters indicating how the first CNN and the second CNN are fused are adjusted.
(12)
 The information processing device according to (11), in which the cross-fusion parameters include parameters indicating a connection relationship between layers of the first CNN and layers of the second CNN and a type of information transferred between the layers of the first CNN and the layers of the second CNN.
(13)
 The information processing device according to any one of (1) to (12), in which the transferred information is used in the processing of the arbitrary layer of the other CNN.
(14)
 The information processing device according to any one of (1) to (13), in which image data and data indicating a state, processing, or function of the human visual system are input to the first CNN, and image data is input to the second CNN.
(15)
 The information processing device according to any one of (1) to (14), in which information is transferred from the arbitrary layer of the other CNN to the arbitrary layer of the one CNN.
(16)
 An information processing method including: connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
(17)
 A program for causing a computer to execute processing of connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN, and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
(18)
 A learning method including: training the first CNN of a cross-fusion CNN before the first CNN is combined with a second CNN, the cross-fusion CNN being a network in which an arbitrary layer of the first CNN (Convolutional Neural Network), which realizes processing and functions similar to those of the human visual system, is connected with an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by processing different from that of the first CNN, and in which information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN; and, when training is performed with the first CNN and the second CNN combined, training the second CNN without changing the configuration and parameters of the first CNN.
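 As a rough sketch of the two-stage training described in configurations (9) to (11) above, the helper below freezes a pre-trained first CNN (HVC) and exposes only the second CNN (MVC) and the cross-fusion parameters to the optimizer. The function name, the optimizer, and the learning rate are illustrative assumptions, not part of the disclosure.

```python
import torch

def configure_cross_fusion_training(hvc: torch.nn.Module,
                                    mvc: torch.nn.Module,
                                    fusion_params: list,
                                    lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze the pre-trained HVC and make only the MVC weights and the
    cross-fusion parameters trainable."""
    for p in hvc.parameters():
        p.requires_grad = False          # the first CNN is left unchanged
    trainable = list(mvc.parameters()) + list(fusion_params)
    return torch.optim.Adam(trainable, lr=lr)
```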
 Note that the effects described in this specification are merely examples and are not limiting; other effects may also be obtained.
 101 information processing system, 111 sensor unit, 112 human visual processing sensor, 113 processing unit, 114 training data set generation unit, 121 cross-fusion CNN, 131 HVC, 132 MVC, 1001 vehicle, 1030 DMS, 1031 HMI, 1072 sensor fusion unit, 1073 recognition unit

Claims (18)

  1.  An information processing device comprising:
      a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system; and
      a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN,
      wherein an arbitrary layer of the first CNN and an arbitrary layer of the second CNN are connected, and information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  2.  The information processing device according to claim 1, wherein the one CNN is the first CNN and the other CNN is the second CNN.
  3.  The information processing device according to claim 2, wherein the transferred information includes image data obtained by imaging a signal similar to a signal output from the human visual cortex.
  4.  The information processing device according to claim 2, wherein the transferred information includes at least one of a feature map, an attention map, and a region proposal.
  5.  The information processing device according to claim 2, wherein the second CNN outputs data influenced by the first CNN.
  6.  The information processing device according to claim 5, wherein the output data of the second CNN includes data indicating a difference between the processing of the first CNN and the processing of the second CNN.
  7.  The information processing device according to claim 5, wherein the output data of the second CNN includes image data in which an image obtained by the second CNN performing predetermined image processing on an input image is combined with a region of the input image to which the first CNN is not attending.
  8.  The information processing device according to claim 5, wherein the output data of the second CNN includes data similar to the output data of the first CNN.
  9.  The information processing device according to claim 1, wherein the first CNN is trained before being combined with the second CNN, and the configuration and parameters of the first CNN are not changed when training is performed with the first CNN and the second CNN combined.
  10.  The information processing device according to claim 9, wherein the second CNN is trained when training is performed with the first CNN and the second CNN combined.
  11.  The information processing device according to claim 10, wherein, when training is performed with the first CNN and the second CNN combined, cross-fusion parameters indicating how the first CNN and the second CNN are fused are adjusted.
  12.  The information processing device according to claim 11, wherein the cross-fusion parameters include parameters indicating a connection relationship between layers of the first CNN and layers of the second CNN and a type of information transferred between the layers of the first CNN and the layers of the second CNN.
  13.  The information processing device according to claim 1, wherein the transferred information is used in the processing of the arbitrary layer of the other CNN.
  14.  The information processing device according to claim 1, wherein image data and data indicating a state, processing, or function of the human visual system are input to the first CNN, and image data is input to the second CNN.
  15.  The information processing device according to claim 1, wherein information is transferred from the arbitrary layer of the other CNN to the arbitrary layer of the one CNN.
  16.  An information processing method comprising: connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  17.  A program for causing a computer to execute processing comprising: connecting an arbitrary layer of a first CNN (Convolutional Neural Network) that realizes processing and functions similar to those of the human visual system with an arbitrary layer of a second CNN that realizes functions similar to those of the first CNN by processing different from that of the first CNN; and transferring information from the arbitrary layer of one CNN to the arbitrary layer of the other CNN.
  18.  A learning method comprising: training the first CNN of a cross-fusion CNN before the first CNN is combined with a second CNN, the cross-fusion CNN being a network in which an arbitrary layer of the first CNN (Convolutional Neural Network), which realizes processing and functions similar to those of the human visual system, is connected with an arbitrary layer of the second CNN, which realizes functions similar to those of the first CNN by processing different from that of the first CNN, and in which information is transferred from the arbitrary layer of one CNN to the arbitrary layer of the other CNN; and, when training is performed with the first CNN and the second CNN combined, training the second CNN without changing the configuration and parameters of the first CNN.
PCT/JP2021/018332 2020-05-27 2021-05-14 Information processing device, information processing method, program, and learning method WO2021241261A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020-091924 2020-05-27
JP2020091924 2020-05-27

Publications (1)

Publication Number Publication Date
WO2021241261A1 true WO2021241261A1 (en) 2021-12-02

Family

ID=78744037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018332 WO2021241261A1 (en) 2020-05-27 2021-05-14 Information processing device, information processing method, program, and learning method

Country Status (1)

Country Link
WO (1) WO2021241261A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018526711A (en) * 2015-06-03 2018-09-13 インナーアイ リミテッドInnerEye Ltd. Image classification by brain computer interface
JP2019096006A (en) * 2017-11-21 2019-06-20 キヤノン株式会社 Information processing device, and information processing method
US10452959B1 (en) * 2018-07-20 2019-10-22 Synapse Tehnology Corporation Multi-perspective detection of objects
US20200086879A1 (en) * 2018-09-14 2020-03-19 Honda Motor Co., Ltd. Scene classification prediction

Similar Documents

Publication Publication Date Title
JPWO2019069581A1 (en) Image processing device and image processing method
JP6939283B2 (en) Image processing device, image processing method, and program
CN112534487B (en) Information processing apparatus, moving body, information processing method, and program
WO2021241189A1 (en) Information processing device, information processing method, and program
WO2021060018A1 (en) Signal processing device, signal processing method, program, and moving device
JPWO2020116194A1 (en) Information processing device, information processing method, program, mobile control device, and mobile
WO2021241260A1 (en) Information processing device, information processing method, information processing system, and program
WO2019150918A1 (en) Information processing device, information processing method, program, and moving body
WO2021024805A1 (en) Information processing device, information processing method, and program
WO2022024803A1 (en) Training model generation method, information processing device, and information processing system
WO2021241261A1 (en) Information processing device, information processing method, program, and learning method
WO2022004423A1 (en) Information processing device, information processing method, and program
WO2022004448A1 (en) Information processing device, information processing method, information processing system, and program
WO2021193103A1 (en) Information processing device, information processing method, and program
WO2021090897A1 (en) Information processing device, information processing method, and information processing program
WO2020203241A1 (en) Information processing method, program, and information processing device
WO2022107595A1 (en) Information processing device, information processing method, and program
WO2021145227A1 (en) Information processing device, information processing method, and program
WO2023145460A1 (en) Vibration detection system and vibration detection method
WO2023054090A1 (en) Recognition processing device, recognition processing method, and recognition processing system
WO2022014327A1 (en) Information processing device, information processing method, and program
WO2024062976A1 (en) Information processing device and information processing method
WO2023053718A1 (en) Information processing device, information processing method, learning device, learning method, and computer program
WO2023149089A1 (en) Learning device, learning method, and learning program
WO2023032276A1 (en) Information processing device, information processing method, and mobile device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21811807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21811807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP