CN113068035B - Natural scene reconstruction method based on deep neural network - Google Patents


Info

Publication number: CN113068035B
Authority: CN (China)
Prior art keywords: picture, layer, output, data, stimulation
Prior art date: 2021-03-17
Legal status: Active (granted)
Application number: CN202110285684.4A
Other languages: Chinese (zh)
Other versions: CN113068035A (application publication)
Inventors: 余肇飞 (Yu Zhaofei), 张祎晨 (Zhang Yichen), 贾杉杉 (Jia Shanshan), 刘健 (Liu Jian)
Current Assignee: Zhejiang Lab
Original Assignee: Zhejiang Lab
Priority date: 2021-03-17
Filing date: 2021-03-17
Application filed by: Zhejiang Lab
Priority to: CN202110285684.4A
Publication of application CN113068035A: 2021-07-02
Application granted; publication of CN113068035B: 2023-07-14

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • H04N19/126Details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/149Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a natural scene reconstruction method based on a deep neural network, comprising the following steps: S1, obtaining natural-picture stimulation data and the corresponding neural response data; S2, constructing a pulse-to-picture converter, a 3-layer fully connected neural network, comprising: S21, the first layer of neurons receives the pulse data of all ganglion cells as input, and the second layer is a hidden layer comprising a group of neurons that receives the output of the first layer as input; S22, the third layer is an output layer that receives the output of the second layer as input and applies an activation function, the number of third-layer output neurons being set to the number of pixels of the stimulus picture; S3, constructing a picture auto-encoder; S4, constructing a loss function from the outputs of S2 and S3 together with the stimulus picture; S5, reconstructing the stimulus picture from the neural response data with the trained model.

Description

Natural scene reconstruction method based on deep neural network
Technical Field
The invention relates to the technical field of visual encoding and decoding with neural networks, and in particular to a method, based on a deep neural network, for reconstructing natural-picture and dynamic-video stimuli from input neural signals.
Background
70%-80% of the information humans acquire comes from vision, and the visual system is an important component of the brain's nervous system: retinal neurons capture external visual information, which is transmitted to the lateral geniculate nucleus and further to the visual cortex, where visual perception is finally formed.
Existing computer vision algorithms have certain limitations, and compared with them the biological visual system has many unique advantages. Research on brain-like vision that draws on the visual mechanisms of the human brain may therefore be a breakthrough for the development of artificial intelligence and computer vision. An important research problem in brain-like vision is visual encoding and decoding. A novel decoding model can therefore be constructed that reconstructs a given visual natural picture or video from fine retinal nerve pulse signals or relatively coarse functional magnetic resonance data, restoring the corresponding natural image and dynamic video stimuli from neural signals.
Disclosure of Invention
In order to overcome the defects of the prior art and achieve the purpose of reconstructing the corresponding complex natural images and dynamic video stimuli from fine pulse signals or coarse human-brain functional magnetic resonance data, the invention adopts the following technical scheme:
a natural scene reconstruction method based on a deep neural network comprises the following steps:
s1, natural picture stimulation data and corresponding nerve response data are obtained;
s2, constructing a pulse-picture converter which is a 3-layer fully-connected neural network, and comprising the following steps of:
s21, the first layer of neurons receives pulse data of all ganglion cells as input, the number of the first layer of neurons is set to be the number of RGCs used, the second layer of neurons is a hidden layer, 512 neurons are included, and the output of the first layer of neurons is received as input, wherein the formula is as follows:
Figure BDA0002980355040000011
Figure BDA0002980355040000012
represents ReLU activation function, S is ganglion cell data, W 1 B is the weight between the first layer and the second layer 1 Is the firstBias of two layers, Y 1 Is the output of the second layer;
s22, the third layer is an output layer, the output of the second layer is received as input, and the output neuron number of the third layer is set as the stimulated picture pixel number according to the sigmoid function, and the formula is as follows:
O 1 =sigmoid(W 2 *Y 1 )+b 2 ) (2)
W 2 b is the connection weight between the second layer and the third layer 2 To bias, O 1 The output of the third layer is also the output of the pulse-picture converter;
s3, constructing an automatic encoder of pictures, namely a typical depth automatic encoder based on a Convolutional Neural Network (CNN), comprising the following steps:
s31, reducing the size of an input image by convolution and downsampling, wherein the size comprises four convolution layers, and the formula is as follows:
Figure BDA0002980355040000021
Figure BDA0002980355040000022
Figure BDA0002980355040000023
Figure BDA0002980355040000024
Wc 11 ,Wc 12 ,Wc 13 ,Wc 14 b is the convolution kernel of the four-layer convolution layer of the downsampling stage 11 ,b 12 ,b 13 ,b 14 For corresponding bias, Y 11 ,Y 12 ,Y 13 ,Y 14 Is the corresponding output;
s32, processing the image by adopting convolution and up-sampling, and recovering the texture of the down-sampled image while increasing the size of the down-sampled image, wherein compared with a down-sampling stage, the up-sampling stage further comprises four convolution layers, and the formula is as follows:
Figure BDA0002980355040000025
Figure BDA0002980355040000026
Figure BDA0002980355040000027
Figure BDA0002980355040000028
Wc 21 ,Wc 22 ,Wc 23 ,Wc 24 convolution kernel being a four-layer convolution layer of the upsampling stage, b 21 ,b 22 ,b 23 ,b 24 For corresponding bias, Y 21 ,Y 22 ,Y 23 ,O 2 Is the corresponding output;
s4, output O 1 、O 2 Constructing a loss function with the stimulation picture, and optimizing a reconstruction result output by the network;
s5, reconstructing a stimulation picture of the ganglion cells according to response data of the ganglion cells through the trained model.
Further, in S4 the outputs O_1, O_2 are compared with the stimulus picture I through loss function L_1 to optimize the output of the model:
L_1: Loss = λ_1‖O_1 - I‖ + λ_2‖O_2 - I‖    (5)
where ‖·‖ is the mean-square-error loss and λ_1, λ_2 are the weights of the two loss terms; the reconstructed picture is optimized by gradually decreasing the mean squared error, so that the model outputs O_1, O_2 each gradually match the stimulus picture I; the mean-square-error function is:
MSE = (1/n) Σ_{i=1}^{n} (O_i - I_i)^2
further, the S4 outputs O 1 、O 2 By loss function L compared with stimulus picture I 2 To make the output O of the model 1 、O 2 Respectively constructing two Loss functions with the stimulation pictures I, and alternately optimizing Loss 1 And Loss of 2 The formula is as follows:
L 2 :Loss 1 =λ 1 ‖O 1 -I‖,Loss 2 =λ 2 ‖O 2 -I‖ (6)
II is the mean square error loss, lambda 1 And lambda (lambda) 2 The weight lost by the two parts; finally, the optimized reconstructed picture result is obtained.
Further, in S4 the outputs are compared with the stimulus picture I through loss function L_3: only the final output O_2 of the model is used to construct a loss function with the stimulus picture I for optimization:
L_3: Loss = ‖O_2 - I‖    (7)
where ‖·‖ is the mean-square-error loss; this finally yields the optimized reconstructed picture.
Further, the input response data are spike firing rates or voxel responses; the output of the pulse-to-picture converter is the preliminarily decoded stimulus O_1, and the output of the picture-to-picture auto-encoder is the final reconstructed stimulus picture O_2; both outputs are compared with the stimulus picture I to optimize the output of the model.
Further, in S1 the receptive fields are computed from real retinal ganglion cell white-noise stimulation and impulse-response data, a linear encoding model is then constructed, and CIFAR-100 natural-picture stimulation data are input to generate simulated ganglion cell responses, comprising the following steps:
S11, from ganglion cell white-noise stimulation data and real response data, the receptive fields of the neurons are obtained by spike-triggered analysis; the salamander retinal data contain recordings from 90 ganglion cells, so 90 receptive fields are obtained, and a receptive-field module is generated by fitting each receptive field with a two-dimensional Gaussian according to the positions of the 90 receptive fields;
S12, the natural image whose response is to be simulated is converted into a 64 x 64 picture and pixel-normalized; according to the receptive-field modules of the 90 ganglion cells, the pixel values within each receptive field are accumulated to generate firing-rate-based response data.
Further, in S1 the ganglion cell stimulation data and the corresponding response data are acquired through real physiological recordings, the stimuli comprising static natural-image stimulation and dynamic video stimulation.
Further, S5 trains an end-to-end deep-neural-network natural-scene reconstruction model with real physiological data, which comprise static natural pictures or videos. When training with static natural pictures, the decoding model is trained from the stimulus pictures I, the impulse responses S of the neuron population, and the model output O; the impulse responses of the ganglion cell population to new stimuli are then fed into the model to reconstruct the natural stimulus pictures, demonstrating that the network can reconstruct natural image stimuli from impulse responses. When training with real physiological video data, new neuron-population impulse responses are input to the trained model to reconstruct the stimulus video frames.
Further, S5 uses simulation data, i.e. the simulated impulse responses of retinal ganglion cells to natural pictures from the CIFAR-100 dataset, to train a decoding model, and reconstructs the stimulus picture from the trained model and the responses of the neuron population. The network reconstructs complex natural image stimuli well from the simulated responses of the retinal ganglion cell population.
Further, in S5 functional magnetic resonance imaging is used to record real physiological response data from visual cortical areas V1, V2 and V3 while a person views handwritten digits; a decoding model is trained, and the stimulus picture is reconstructed well from the trained model and the responses of the effective voxels of the three brain regions. This shows that the network can reconstruct a stimulus image even from a signal as coarse as fMRI.
The invention has the advantages that:
the invention can decode the stimulation scene, such as complex static natural image and dynamic video image, according to the impulse response of the neuron population. The invention can reconstruct MNIST stimulation pictures according to the data recorded by human brain fMRI. And measuring the performance of the model, namely the similarity between the reconstructed picture and the real stimulation picture, by calculating average square error, peak signal-to-noise ratio and structural similarity index measure. The above effects can be achieved by the comprehensive decoding method, on one hand, a bridge of human brain vision and machine vision can be established, so that the mechanism of encoding and decoding of a human brain vision system is revealed; on the other hand, the model is considered to be applied to the development of retina prostheses, and the development of information technology and medical industry is promoted.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a deep network decoding model architecture for end-to-end training in the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in fig. 1, the natural scene reconstruction method based on a deep neural network can decode and reconstruct stimuli from the responses of the retinal ganglion cell population to natural-scene stimulation, covering both static image stimuli and dynamic video stimuli; it can also reconstruct stimulus pictures from the responses of the human visual cortex to handwritten digits recorded by functional magnetic resonance imaging. In addition, when simulated impulse responses generated under natural-image stimulation are input to the trained model, the corresponding stimulus pictures are also reconstructed well.
Retinal simulation data are generated from a linear model and the receptive fields of the real ganglion cell population. The spatial receptive fields of 90 neurons are first derived from their impulse responses under white-noise stimulation of the real salamander retina, and each receptive field is then modeled by a two-dimensional Gaussian fit. A new stimulus picture is tiled over the receptive fields, the pixel values covered by each receptive field are accumulated, and the firing rate of each neuron is obtained by simulation.
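A minimal sketch of this linear simulation pipeline, with randomly placed Gaussian receptive fields standing in for the 90 fitted salamander fields and 64 x 64 pictures assumed:

```python
import numpy as np

def gaussian_rf(center_x, center_y, sigma, size=64):
    """2-D Gaussian receptive field on a size x size pixel grid, normalized to sum 1."""
    ys, xs = np.mgrid[0:size, 0:size]
    rf = np.exp(-((xs - center_x) ** 2 + (ys - center_y) ** 2) / (2 * sigma ** 2))
    return rf / rf.sum()

def simulate_rates(picture, rf_bank):
    """Linear encoding: each cell's firing rate is the RF-weighted sum of pixel values."""
    img = picture.astype(np.float32) / 255.0          # pixel normalization
    return np.array([np.sum(rf * img) for rf in rf_bank])

# 90 cells with random centers and widths stand in for the 90 fitted receptive fields
rng = np.random.default_rng(0)
rf_bank = [gaussian_rf(*rng.uniform(8, 56, 2), sigma=rng.uniform(2, 6))
           for _ in range(90)]
rates = simulate_rates(rng.integers(0, 256, (64, 64)), rf_bank)   # shape (90,)
```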
The real physiological data comprise static natural-image stimulation and video stimulation of the salamander retina, with the impulse responses of 90 ganglion cells recorded by multiple electrodes. Each static natural image is 64 x 64 pixels and each video frame is 90 x 90 pixels. They also comprise responses of the human visual cortex to MNIST handwritten-digit stimuli recorded by functional magnetic resonance, with handwritten images of 28 x 28 pixels.
As shown in fig. 2, the pulse-to-image decoder consists of two parts. The first part is a pulse-to-picture converter, which converts the neural signals into an image of the same size as the stimulus picture; a fully connected network is used in this part, and it already captures the information of the stimulus picture well. The one-dimensional vector output by the pulse-to-picture converter is then reshaped into a picture of the stimulus size. The second part is a picture-to-picture auto-encoder, a multi-layer CNN model that further reduces the noise of the generated picture. The whole model takes the neural response of the neuron population as input (for retinal pulse data the input is the spike firing rate; for functional magnetic resonance data it is the values of all voxels). Many model structures were explored in this embodiment, and it was found that a 3-layer network already reproduces the stimulus picture information well in the pulse-to-picture converter part. In the picture-to-picture auto-encoder, the downsampling part has four convolutional layers with kernel sizes (64,7,7), (128,5,5), (256,3,3), (256,3,3) and stride (2,2), and the layers of the upsampling part have kernel sizes (256,3,3), (128,3,3), (64,5,5), (3,7,7) and stride (1,1). Finally, a loss function is constructed from the output O_1 of the pulse-to-picture converter, the output O_2 of the picture-to-picture auto-encoder, and the real stimulus picture I, to optimize the reconstruction result output by the network. The forward information flow of the whole model is as follows:
The first part, the pulse-to-picture converter, consists of a three-layer fully connected network. The first layer has 90 neurons for the natural-image, video-stimulus and simulated-data models (90 neurons were recorded in the retinal data) and 3092 neurons for the fMRI model (3092 effective voxels are available in the fMRI data). The second layer has 512 neurons. The third layer has 64 x 64 neurons for static pictures, 90 x 90 for video stimuli, 28 x 28 for fMRI data, and 32 x 32 for simulated data. As activation functions, the second and third layers use ReLU and sigmoid, respectively.
Y_1 = f(W_1 * S + b_1)    (1)
O_1 = sigmoid(W_2 * Y_1 + b_2)    (2)
where f denotes the ReLU activation function, S is the neural response data of the ganglion cell population, W_1 is the weight between the first and second layers, b_1 is the bias of the second layer, and Y_1 is the output of the second layer; W_2 is the connection weight between the second and third layers, b_2 is the bias, and O_1, the output of the third layer, is also the output of the pulse-to-picture converter.
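A minimal PyTorch sketch of this converter (PyTorch and the class name SpikeToImageConverter are our illustrative choices; the dimensions follow the 90-512-64 x 64 configuration described above):

```python
import torch
import torch.nn as nn

class SpikeToImageConverter(nn.Module):
    """Three-layer fully connected pulse-to-picture converter (Eqs. 1-2)."""
    def __init__(self, n_cells=90, n_hidden=512, img_size=64):
        super().__init__()
        self.fc1 = nn.Linear(n_cells, n_hidden)              # W_1, b_1
        self.fc2 = nn.Linear(n_hidden, img_size * img_size)  # W_2, b_2
        self.img_size = img_size

    def forward(self, s):                     # s: (batch, n_cells) firing rates / voxels
        y1 = torch.relu(self.fc1(s))          # Eq. (1): Y_1 = ReLU(W_1 * S + b_1)
        o1 = torch.sigmoid(self.fc2(y1))      # Eq. (2): O_1 = sigmoid(W_2 * Y_1 + b_2)
        # reshape the 1-D output into a picture of the stimulus size
        return o1.view(-1, 1, self.img_size, self.img_size)
```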
The second part is the picture-to-picture auto-encoder, consisting of a downsampling convolutional part and an upsampling convolutional part, each comprising four convolutional layers:
Y_11 = f(Wc_11 * O_1 + b_11)
Y_12 = f(Wc_12 * Y_11 + b_12)
Y_13 = f(Wc_13 * Y_12 + b_13)
Y_14 = f(Wc_14 * Y_13 + b_14)    (3)
Y_21 = f(Wc_21 * Y_14 + b_21)
Y_22 = f(Wc_22 * Y_21 + b_22)
Y_23 = f(Wc_23 * Y_22 + b_23)
O_2 = f(Wc_24 * Y_23 + b_24)    (4)
where Wc_11, Wc_12, Wc_13, Wc_14 are the convolution kernels of the four convolutional layers of the downsampling stage, b_11, b_12, b_13, b_14 the corresponding biases, and Y_11, Y_12, Y_13, Y_14 the corresponding outputs; Wc_21, Wc_22, Wc_23, Wc_24 are the convolution kernels of the four convolutional layers of the upsampling stage, b_21, b_22, b_23, b_24 the corresponding biases, and Y_21, Y_22, Y_23, O_2 the corresponding outputs.
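A sketch of this auto-encoder in the same style, using the kernel sizes and strides listed above; the padding values, the nearest-neighbor upsampling operator, the ReLU/sigmoid activations, and the single-channel output (the patent lists (3,7,7) for the last kernel) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageToImageAutoEncoder(nn.Module):
    """CNN auto-encoder: four stride-2 conv layers down, four upsample+conv layers up (Eqs. 3-4)."""
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        down = [(in_ch, 64, 7), (64, 128, 5), (128, 256, 3), (256, 256, 3)]
        self.down = nn.ModuleList(
            nn.Conv2d(i, o, k, stride=2, padding=k // 2) for i, o, k in down)
        up = [(256, 256, 3), (256, 128, 3), (128, 64, 5), (64, out_ch, 7)]
        self.up = nn.ModuleList(
            nn.Conv2d(i, o, k, stride=1, padding=k // 2) for i, o, k in up)

    def forward(self, x):                              # x: (batch, in_ch, 64, 64)
        for conv in self.down:                         # Eq. (3): 64 -> 32 -> 16 -> 8 -> 4
            x = torch.relu(conv(x))
        for i, conv in enumerate(self.up):             # Eq. (4): 4 -> 8 -> 16 -> 32 -> 64
            x = F.interpolate(x, scale_factor=2, mode="nearest")
            x = torch.sigmoid(conv(x)) if i == len(self.up) - 1 else torch.relu(conv(x))
        return x                                       # O_2: the denoised reconstruction
```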
Finally, to train the network, three loss functions L_1, L_2, L_3 are designed:
L_1: Loss = λ_1‖O_1 - I‖ + λ_2‖O_2 - I‖    (5)
L_2: Loss_1 = λ_1‖O_1 - I‖,  Loss_2 = λ_2‖O_2 - I‖    (6)
L_3: Loss = ‖O_2 - I‖    (7)
where ‖·‖ is the mean-square-error loss and λ_1, λ_2 are the weights of the two loss terms.
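These three losses map directly onto mean-squared-error terms; a minimal sketch (the default λ values are placeholders, not the patent's):

```python
import torch.nn as nn

mse = nn.MSELoss()

def loss_L1(o1, o2, target, lam1=1.0, lam2=1.0):
    """Eq. (5): joint loss over both outputs."""
    return lam1 * mse(o1, target) + lam2 * mse(o2, target)

def loss_L2(o1, o2, target, lam1=1.0, lam2=1.0):
    """Eq. (6): two losses, to be optimized alternately."""
    return lam1 * mse(o1, target), lam2 * mse(o2, target)

def loss_L3(o2, target):
    """Eq. (7): loss on the final output only."""
    return mse(o2, target)
```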
The network is optimized with the Adam algorithm so that the model output gradually matches the stimulus. After training, the network outputs the reconstructed stimulus picture.
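Putting the pieces together, one Adam update under loss L_1 might look as follows, reusing the sketches above (the learning rate is an assumption):

```python
import torch

converter = SpikeToImageConverter()
autoencoder = ImageToImageAutoEncoder()
optimizer = torch.optim.Adam(
    list(converter.parameters()) + list(autoencoder.parameters()), lr=1e-3)

def train_step(spikes, stimulus):
    """One optimization step: spikes (batch, 90) -> stimulus (batch, 1, 64, 64)."""
    optimizer.zero_grad()
    o1 = converter(spikes)            # preliminary decoded picture
    o2 = autoencoder(o1)              # denoised final reconstruction
    loss = loss_L1(o1, o2, stimulus)  # Eq. (5)
    loss.backward()
    optimizer.step()
    return loss.item()
```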
Embodiment one:
The deep neural network model is trained on real physiological data: the responses of the ganglion cell population recorded from the salamander retina under static natural-image stimulation. Natural image stimuli are reconstructed well from the trained model and the responses of the neuron population, showing that the network can reconstruct complex natural image stimuli from the impulse responses of the retinal ganglion cell population.
Embodiment two:
The deep neural network model is trained on real physiological data: the responses of the ganglion cell population recorded from the salamander retina under dynamic video stimulation. The stimulus video frames are reconstructed well from the trained model and the responses of the neuron population, showing that the network can reconstruct complex dynamic video stimuli from the impulse responses of the retinal ganglion cell population.
Embodiment III:
The deep neural network model is trained on simulation data: the simulated impulse responses of retinal ganglion cells to natural pictures from the CIFAR-100 dataset. The stimulus pictures are reconstructed well from the trained model and the responses of the neuron population, showing that the network can reconstruct complex natural image stimuli from the simulated responses of the retinal ganglion cell population.
Embodiment four:
The deep neural network model is trained on real physiological data: functional magnetic resonance imaging recordings of the responses of visual cortical areas V1, V2 and V3 while a person views handwritten digits. The stimulus pictures are reconstructed from the trained model and the responses of the effective voxels of the three brain regions, showing that the network can reconstruct a stimulus image even from a signal as coarse as fMRI.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The natural scene reconstruction method based on the deep neural network is characterized by comprising the following steps of:
s1, natural picture stimulation data and corresponding nerve response data are obtained;
s2, constructing a pulse-picture converter which is a 3-layer fully-connected neural network, and comprising the following steps of:
s21, the first layer of neurons receives pulse data of all ganglion cells as input, the number of the first layer of neurons is set to be the number of RGCs used, the second layer of neurons is a hidden layer, the hidden layer comprises a group of neurons, and the output of the first layer of neurons is received as input, and the formula is as follows:
Figure FDA0004187536430000011
Figure FDA0004187536430000012
represents ReLU activation function, S is ganglion cell data, W 1 B is the weight between the first layer and the second layer 1 Is that
Bias of the second layer, Y 1 Is the output of the second layer;
s22, the third layer is an output layer, the output of the second layer is received as input, and the output neuron number of the third layer is set as the stimulated picture pixel number according to the sigmoid function, and the formula is as follows:
O 1 =sigmoid(W 2 *Y 1 )+b 2 ) (2)
W 2 b is the connection weight between the second layer and the third layer 2 To bias, O 1 The output of the third layer is also the output of the pulse-picture converter;
s3, constructing an automatic encoder of a picture-picture, comprising the following steps:
s31, reducing the size of an input image by convolution and downsampling, wherein the size comprises four convolution layers, and the formula is as follows:
Figure FDA0004187536430000013
Figure FDA0004187536430000014
Figure FDA0004187536430000015
Figure FDA0004187536430000016
Wc 11 ,Wc 12 ,Wc 13 ,Wc 14 b is the convolution kernel of the four-layer convolution layer of the downsampling stage 11 ,b 12 ,b 13 ,b 14 For corresponding bias, Y 11 ,Y 12 ,Y 13 ,Y 14 Is the corresponding output;
s32, processing the image by adopting convolution and up-sampling, and recovering the texture of the down-sampled image while increasing the size of the down-sampled image, wherein compared with a down-sampling stage, the up-sampling stage further comprises four convolution layers, and the formula is as follows:
Figure FDA0004187536430000017
Figure FDA0004187536430000018
Figure FDA0004187536430000019
Figure FDA00041875364300000110
Wc 21 ,Wc 22 ,Wc 23 ,Wc 24 convolution kernel being a four-layer convolution layer of the upsampling stage, b 21 ,b 22 ,b 23 ,b 24 For corresponding bias, Y 21 ,Y 22 ,Y 23 ,O 2 Is the corresponding output;
s4, output O 1 、O 2 Constructing a loss function with the stimulation picture, and optimizing a reconstruction result output by the network;
s5, reconstructing a stimulation picture of the nerve response data according to the trained model.
2. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein in S4 the outputs O_1, O_2 are compared with the stimulus picture I through loss function L_1 to optimize the output of the model:
L_1: Loss = λ_1‖O_1 - I‖ + λ_2‖O_2 - I‖    (5)
where ‖·‖ is the mean-square-error loss and λ_1, λ_2 are the weights of the two loss terms; the reconstructed picture is optimized by gradually decreasing the mean squared error, so that the model outputs O_1, O_2 each gradually match the stimulus picture I; the mean-square-error function is:
MSE = (1/n) Σ_{i=1}^{n} (O_i - I_i)^2
3. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein in S4 the outputs O_1, O_2 are compared with the stimulus picture I through loss function L_2: two loss functions are constructed from O_1 and O_2 with the stimulus picture I respectively, and Loss_1 and Loss_2 are optimized alternately:
L_2: Loss_1 = λ_1‖O_1 - I‖,  Loss_2 = λ_2‖O_2 - I‖    (6)
where ‖·‖ is the mean-square-error loss and λ_1, λ_2 are the weights of the two loss terms; this finally yields the optimized reconstructed picture.
4. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein in S4 the outputs are compared with the stimulus picture I through loss function L_3: only the final output O_2 of the model is used to construct a loss function with the stimulus picture I for optimization:
L_3: Loss = ‖O_2 - I‖    (7)
where ‖·‖ is the mean-square-error loss; this finally yields the optimized reconstructed picture.
5. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein the input response data are spike firing rates or voxel responses, the output of the pulse-to-picture converter is the preliminarily decoded stimulus O_1, and the output of the picture-to-picture auto-encoder is the final reconstructed stimulus picture O_2; both outputs are compared with the stimulus picture I to optimize the output of the model.
6. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein in S1 the receptive fields are computed from real retinal ganglion cell white-noise stimulation and impulse-response data, a linear encoding model is then constructed, and CIFAR-100 natural-picture stimulation data are input to generate simulated ganglion cell responses, comprising the following steps:
S11, from ganglion cell white-noise stimulation data and real response data, the receptive fields of the neurons are obtained by spike-triggered analysis; ganglion cell data are recorded from the salamander retina, the corresponding receptive fields are obtained, and a receptive-field module is generated by fitting each receptive field with a two-dimensional Gaussian according to the positions of the receptive fields;
S12, the natural image whose response is to be simulated is converted into a picture and pixel-normalized; according to the receptive-field modules of the ganglion cells, the pixel values within each receptive field are accumulated to generate firing-rate-based response data.
7. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein in S1 the ganglion cell stimulation data and the corresponding response data are acquired through real physiological recordings, the stimuli comprising static natural-image stimulation and dynamic video stimulation.
8. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein S5 trains an end-to-end deep-neural-network natural-scene reconstruction model with real physiological data comprising static natural pictures or videos; when training with static natural pictures, the decoding model is trained from the stimulus pictures I, the impulse responses S of the neuron population, and the model output O, and the impulse responses of the ganglion cell population to new stimuli are then input to the model to reconstruct the natural stimulus pictures; when training with real physiological video data, new neuron-population impulse responses are input to the trained model to reconstruct the stimulus video frames.
9. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein S5 uses simulation data, i.e. the simulated impulse responses of retinal ganglion cells to natural pictures from the CIFAR-100 dataset, to train a decoding model, and reconstructs the stimulus picture from the trained model and the responses of the neuron population.
10. The natural scene reconstruction method based on a deep neural network according to claim 1, wherein in S5 functional magnetic resonance imaging is used to record real physiological response data from visual cortical areas V1, V2 and V3 while a person views handwritten digits, a decoding model is trained, and the stimulus picture is reconstructed from the trained model and the responses of the effective voxels of the three brain regions.
CN202110285684.4A (priority and filing date 2021-03-17): Natural scene reconstruction method based on deep neural network, Active, granted as CN113068035B

Priority Applications (1)

CN202110285684.4A (priority and filing date 2021-03-17): Natural scene reconstruction method based on deep neural network

Publications (2)

CN113068035A, published 2021-07-02 (application publication)
CN113068035B, published 2023-07-14 (grant)

Family

ID: 76561154

Family Applications (1)

CN202110285684.4A: Natural scene reconstruction method based on deep neural network, Active

Country Status (1)

CN: CN113068035B

Citations (2)

(* Cited by examiner, † Cited by third party)

CN108663678A * (priority 2018-01-29, published 2018-10-16), Northwest A&F University (西北农林科技大学): Multi-baseline InSAR phase unwrapping algorithm based on a mixed-integer optimization model
CN112329977A * (priority 2020-09-10, published 2021-02-05), State Grid Co., Ltd. (国家电网有限公司): Wind power prediction system for extreme scenes

Family Cites Families (1)

(* Cited by examiner, † Cited by third party)

US10803591B2 * (priority 2018-08-28, published 2020-10-13), International Business Machines Corporation: 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes


Non-Patent Citations (2)

(* Cited by examiner, † Cited by third party)

Jing Tian et al., "Stochastic super-resolution image reconstruction", IEEE Transactions on Image Processing (full text) *
Li Xin et al., "Single remote-sensing image super-resolution reconstruction combined with deep learning" (结合深度学习的单幅遥感图像超分辨率重建), Wanfang (full text) *



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant