CN113450921A - Brain development data analysis method, system, equipment and storage medium - Google Patents


Info

Publication number
CN113450921A
CN113450921A
Authority
CN
China
Prior art keywords
graph
layer
coding model
brain development
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110706290.1A
Other languages
Chinese (zh)
Inventor
乔琛
胡鑫钰
许发明
刘岳晨
任鑫
黄崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202110706290.1A
Publication of CN113450921A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Magnetic Resonance Imaging Apparatus (AREA)

Abstract

The invention discloses a brain development data analysis method, system, equipment and storage medium, wherein the method comprises the following steps: constructing a graph regularized sparse deep autoencoder (GSDAE) model, wherein the hidden layers of the model are formed by stacking N sparse autoencoders with graph Laplacian regularization, and a graph Laplacian regularization term is added to the loss function of the model; training and fine-tuning the graph regularized sparse deep autoencoder model; and analyzing brain development data with the trained model. The method, system, equipment and storage medium can effectively solve the overfitting problem that arises when high-dimensional, small-sample brain development data are processed.

Description

Brain development data analysis method, system, equipment and storage medium
Technical Field
The invention belongs to the field of data processing, and relates to a brain development data analysis method, system, equipment and storage medium.
Background
High-dimensional, small-sample data are common in the biomedical field, for example genomic data, medical image data and protein data. Such data, brain development data in particular, are characterized by a small sample size but a huge number of sample features, which poses certain challenges for data processing and analysis. When the ratio of sample size to sample features is small, classical machine learning algorithms tend to fail, because high-dimensional data may contain irrelevant and redundant features. Deep learning has been proved to be one of the strongest tools in big data analysis, yet the application of traditional deep learning algorithms in bioinformatics is still very limited, mainly because when the sample size of the data is far smaller than the number of features, the model often falls into overfitting; the data processing then becomes inaccurate and effective information is difficult to extract from the data. A method therefore needs to be designed for high-dimensional, small-sample data, so as to solve the overfitting problem of existing deep learning models when processing such data.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a brain development data analysis method, system, equipment and storage medium that can effectively solve the overfitting problem arising when high-dimensional, small-sample brain development data are processed.
In order to achieve the above object, the method for analyzing brain development data according to the present invention comprises:
constructing a graph regularized sparse deep autoencoder model, wherein the hidden layers of the model are formed by stacking N sparse autoencoders with graph Laplacian regularization, and a graph Laplacian regularization term is added to the loss function of the model;
training and fine-tuning the graph regularized sparse deep autoencoder model;
and analyzing brain development data by using the trained graph regularized sparse deep autoencoder model.
The specific process of training the graph regularized sparse deep autoencoder model is as follows:
collecting the recorded brain development data;
in the collected brain development data, summarizing each individual together with its corresponding data features and their change values into one unit of data, and constructing a data matrix from the unit data of all individuals;
dividing the data matrix into a training set and a test set;
and training the graph regularized sparse deep autoencoder model with the training set and the test set.
The hidden layers in the graph regularized sparse deep autoencoder model are formed by stacking six sparse autoencoders with graph Laplacian regularization.
The graph Laplacian regularization term is:

R_graph(H) = (1/2) Σ_{p,q=1}^{N} φ_pq ‖h_p − h_q‖² = Tr(Hᵀ L H)

wherein the data set is X = {x₁, x₂, ..., x_N}, x_p ∈ R^m, x_p is the p-th sample, N is the sample size and m is the feature dimension of the samples; φ_pq = exp(−‖x_p − x_q‖²/σ) represents the weight of the edge connecting samples p and q, and σ is the width of the Gaussian kernel; H = [h₁, h₂, ..., h_N]ᵀ, h_p ∈ R^{m′}, is the low-dimensional representation corresponding to each sample x_p; Tr(·) denotes the trace of a matrix; Φ = (φ_pq) is the symmetric edge-weight matrix; D is the diagonal degree matrix with entries D_pp = Σ_q φ_pq; and

L = D − Φ

is the graph Laplacian matrix.
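The regularizer above can be computed either as the pairwise weighted sum or in its trace form; both are equivalent. The following is a minimal NumPy sketch (the function name and toy shapes are ours, not from the patent):

```python
import numpy as np

def graph_laplacian_term(X, H, sigma=0.5):
    """(1/2) * sum_pq phi_pq * ||h_p - h_q||^2 = Tr(H^T L H),
    with L = D - Phi built from Gaussian edge weights on the inputs X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    Phi = np.exp(-sq / sigma)             # edge weights phi_pq
    D = np.diag(Phi.sum(axis=1))          # diagonal degree matrix
    L = D - Phi                           # graph Laplacian
    return np.trace(H.T @ L @ H)

# Check the trace form against the explicit pairwise sum.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # 5 samples, 8 features
H = rng.normal(size=(5, 3))               # their 3-dimensional codes
Phi = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 0.5)
pairwise = 0.5 * sum(Phi[p, q] * np.sum((H[p] - H[q]) ** 2)
                     for p in range(5) for q in range(5))
assert np.isclose(graph_laplacian_term(X, H), pairwise)
```

Because L is positive semidefinite, the term is always non-negative: it penalizes codes h_p, h_q that drift apart when their inputs x_p, x_q are close on the graph.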
the optimization target of the graph regular sparse depth self-coding model in the training process is as follows:
Figure BDA0003131413820000031
wherein X ═ { X ═ X1,x2,L,xN},xp∈Rm,xpIs the p sample, N is the sample size, m is the characteristic dimension of the sample, the number of neurons in the l layer is Nl,l=1,2,3,
Figure BDA0003131413820000033
Represents the activation value of the ith neuron of the p-th sample of the output layer,
Figure BDA0003131413820000034
representing a connection weight matrix between the l layer and the l +1 layer; KL (-) represents KL divergence, rho is the sparse parameter of KL divergence term,
Figure BDA0003131413820000035
representing the average value of the activation values of the N training samples in the jth hidden layer unit;
Figure BDA0003131413820000036
wherein each one
Figure BDA0003131413820000037
Represents the activation value of the p sample of the l layer;
Figure BDA0003131413820000038
is the Laplace regularization term of the graph, phipq=exp(-||xp-xq||2σ) σ is the nuclear width, λ123The weighting penalty parameters are KL, weight attenuation and graph regularization respectively.
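The four terms of the objective can be assembled directly from a reconstruction, the hidden activations and the weight matrices. A hedged NumPy sketch of this computation (function names, toy shapes and the uniform-weight Laplacian are illustrative assumptions, not part of the patent):

```python
import numpy as np

def kl_div(rho, rho_hat):
    """Sum of KL divergences KL(rho || rho_hat_j) between Bernoulli variables."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def gsae_loss(X, X_rec, H, W_list, L, rho=0.1, lam1=0.1, lam2=0.03, lam3=0.01):
    """Four-term GSAE objective: reconstruction error + KL sparsity penalty
    + weight decay + graph Laplacian regularization Tr(H^T L H)."""
    N = X.shape[0]
    recon = np.sum((X_rec - X) ** 2) / (2 * N)
    rho_hat = H.mean(axis=0)                 # average activation per hidden unit
    sparsity = kl_div(rho, rho_hat)
    decay = 0.5 * sum(np.sum(W ** 2) for W in W_list)
    graph = np.trace(H.T @ L @ H)
    return recon + lam1 * sparsity + lam2 * decay + lam3 * graph

# Toy call: 4 samples, 6 features, 3 hidden units; sigmoid keeps H in (0, 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))
H = 1.0 / (1.0 + np.exp(-rng.normal(size=(4, 3))))
W_list = [rng.normal(size=(6, 3)), rng.normal(size=(3, 6))]
Phi = np.full((4, 4), 0.25)                  # uniform edge weights, for illustration
L = np.diag(Phi.sum(axis=1)) - Phi           # graph Laplacian
loss = gsae_loss(X, X + 0.1, H, W_list, L)
```

The KL term requires ρ̂_j ∈ (0, 1), which is why a sigmoid hidden layer is assumed here.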
The loss function of the graph regularized sparse deep autoencoder model after the graph Laplacian regularization term has been added is:

J(W, b) = (1/2N) Σ_{p=1}^{N} Σ_{i=1}^{m} (a_i^(3)(x_p) − x_pi)² + λ₁ Σ_{j=1}^{n₂} KL(ρ ‖ ρ̂_j) + (λ₂/2) Σ_{l=1}^{2} ‖W^(l)‖² + λ₃ Tr(Hᵀ L H)

wherein X = {x₁, x₂, ..., x_N}, x_p ∈ R^m, x_p is the p-th sample, N is the sample size, m represents the feature dimension of the samples, and the number of neurons in layer l is n_l, l = 1, 2, 3; a_i^(3)(x_p) represents the activation value of the i-th neuron of the output layer for the p-th sample; W^(l) represents the connection weight matrix between layer l and layer l+1; KL(·) represents the KL divergence and ρ is the sparsity parameter of the KL divergence; ρ̂_j = (1/N) Σ_{p=1}^{N} a_j^(2)(x_p) is the average of the activation values of the N training samples at the j-th hidden-layer unit, where each a^(l)(x_p) represents the activation value of the p-th sample at layer l; Tr(Hᵀ L H) is the graph Laplacian regularization term, with φ_pq = exp(−‖x_p − x_q‖²/σ) and σ the kernel width; and λ₁, λ₂, λ₃ are the penalty weights of the KL, weight decay and graph regularization terms respectively.
The loss function of the graph regularized sparse deep autoencoder model in the fine-tuning process is:

J_fine(W, b) = (1/2N) Σ_{p=1}^{N} Σ_j (x̂_pj − x_pj)² + β₁ Σ_l Σ_j KL(ρ ‖ ρ̂_j^(l)) + (β₂/2) Σ_l ‖W^(l)‖² + β₃ Σ_{p,q} φ_pq ‖h_p − h_q‖²

wherein x̂_pj denotes the reconstruction of x_pj at the L-th (output) layer; φ_pq = exp(−‖x_p − x_q‖²/σ), where σ denotes the Gaussian kernel width; a_pj^(l) = f(z_pj^(l)) represents the response value of the j-th neuron of the p-th sample in the l-th layer network, where z^(l) = a^(l−1) · W^(l−1) and f is any differentiable activation function; ρ̂_j^(l) represents the average activation value of the j-th neuron at layer l; h_p is the low-dimensional representation of x_p; and β₁, β₂, β₃ are the penalty parameters of the regularization terms.
A brain development data analysis system, comprising:
a construction module, configured to construct a graph regularized sparse deep autoencoder model, wherein the hidden layers of the model are formed by stacking N sparse autoencoders with graph Laplacian regularization, and a graph Laplacian regularization term is added to the loss function of the model;
a training and fine-tuning module, configured to train and fine-tune the graph regularized sparse deep autoencoder model;
and an analysis module, configured to analyze brain development data using the trained graph regularized sparse deep autoencoder model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the brain development data analysis method when executing the computer program.
A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the brain development data analysis method.
The invention has the following beneficial effects:
the brain development data analysis method, the brain development data analysis system, the brain development data analysis equipment and the storage medium introduce the graph regular sparse depth self-coding model into the brain development data analysis during specific operation, meanwhile, a hidden layer in the graph regular sparse depth self-coding model is formed by stacking six sparse self-codes with Laplace regularization, and a graph Laplace regularization item is added into a loss function of the graph regular sparse depth self-coding model so as to consider the prior knowledge of an inherent structure in data, improve the learning capacity and the expression capacity of unsupervised sparse depth self-coding, and effectively solve the over-fitting problem occurring when high-latitude small sample brain development data is processed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it. In the drawings:
FIG. 1 is a schematic structural view of the present invention;
FIG. 2a is a diagram illustrating the selection of significantly different dynamic functional connections;
FIG. 2b is a diagram illustrating the classification comparison of the deep autoencoder models;
FIG. 3 is a schematic illustration of the elbow rule;
FIG. 4a is a distribution plot of the 264 ROIs and their corresponding selected functional connections in the 13 RSNs at State 1;
FIG. 4b is a distribution plot of the 264 ROIs and their corresponding selected functional connections in the 13 RSNs at State 2;
FIG. 4c is a distribution plot of the 264 ROIs and their corresponding selected functional connections in the 13 RSNs at State 3;
FIG. 4d is a distribution plot of the 264 ROIs and their corresponding selected functional connections in the 13 RSNs at State 4;
FIG. 4e is a graph of FT mean distribution for children and adolescents for four states;
FIG. 4f is a DT mean distribution plot for children and adolescents in four states;
FIG. 4g is a graph of the statistical difference between FT and DT in four states for two sets of samples;
FIG. 5 is a diagram illustrating a distribution of selected functional connections in the fifth embodiment.
Detailed Description
The present invention will be described in detail below with reference to the embodiments and the accompanying drawings. It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with one another when no conflict arises.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
Example one
Referring to fig. 1, the brain development data analysis method according to the present invention includes the following steps:
1) collecting the recorded brain development data;
2) in the brain development data, summarizing the data features corresponding to each individual and their change values into one unit of data, so as to form a structured data matrix, wherein the data matrix comprises a sample size N and p sample features, and the sample size N is far smaller than the number of features p, i.e. N ≪ p;
3) dividing brain development data forming a structured data matrix into a training set and a testing set;
the traditional deep learning algorithm comprises a Pre-training process (Pre-training) and a Fine-tuning process (Fine-tuning), and for the Fine-tuning process, the invention utilizes a gradient calculation formula of a loss function with a graph Laplacian regularization term for parameter updating in the training process, and performs optimization operation on the data through a sparse depth self-coding model with graph Laplacian regularization, wherein the specific optimization process is as follows:
31) Setting the deep learning basic framework: for the data in the training set, a data model comprising an input layer, six hidden layers and an output layer is established according to the data features. The input layer contains one node per data feature, i.e. the number of nodes in the input layer equals the sample feature count p of the training set; the output layer is the reconstructed input data; and each hidden layer contains a number of nodes in a mapping correspondence with the input values of the previous layer.
32) For each node of each layer of the sparse autoencoder, a data model of the node is established by means of a mathematical equation, with the relevant parameter values in the equation preset. In the deep network, six graph sparse autoencoders (GSAE) are selected for stacking, and the numbers of neurons in the hidden layers are 15000, 7000, 6700, 7000 and 15000 respectively. The number of nodes of the input and output layers is the training-set sample feature count, denoted p. According to the above setting, the stacked deep neural network model of the invention comprises 8 neural layers: an input layer, 6 hidden layers and an output layer.
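The layer sizes of the stacked network can be written down mechanically from the counts in step 32). A small sketch (the helper name is ours; the source lists the five hidden-layer sizes quoted here):

```python
def stacked_layer_sizes(p, hidden=(15000, 7000, 6700, 7000, 15000)):
    """Full layer-size list of the stacked network: an input layer of p
    features, the quoted hidden-layer sizes, and an output layer
    reconstructing all p features."""
    return (p,) + tuple(hidden) + (p,)

sizes = stacked_layer_sizes(34716)    # feature dimension used in Example five
# One weight matrix W^(l) per adjacent pair of layers:
weight_shapes = [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
```

Each adjacent pair of sizes yields one connection weight matrix, so this configuration has six weight matrices linking the eight neural layers.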
33) In the pre-training stage of network training, for each GSAE, the weight decay parameter, the KL divergence regularization weight and the graph Laplacian regularization weight are set to 0.03, 0.1 and 0.01 respectively, and the kernel width σ of the graph Laplacian regularization term is 0.5. In the fine-tuning stage, the weight decay parameter, the KL divergence regularization weight and the graph Laplacian regularization weight are set to 0.01, 0.3 and 0.01 respectively. In addition, ρ is 0.1, the sigmoid function is selected as the activation function f of the network layers during model training, and in the fine-tuning process the parameters are updated with a mini-batch sampling strategy and stochastic gradient descent with mini-batches (SGD with mini-batch).
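The mini-batch SGD update of the fine-tuning stage has a simple generic skeleton: shuffle the samples each epoch, slice mini-batches, and take one gradient step per batch. A hedged NumPy sketch (function name, learning rate and the toy gradient are ours):

```python
import numpy as np

def sgd_finetune(params, grad_fn, X, lr=0.01, batch_size=32, epochs=5, seed=0):
    """Mini-batch SGD skeleton: grad_fn(params, batch) must return one
    gradient array per parameter; params is a list of parameter arrays."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = X[order[start:start + batch_size]]
            grads = grad_fn(params, batch)
            params = [w - lr * g for w, g in zip(params, grads)]
    return params

# Toy usage: minimizing ||w||^2 (gradient 2w) shrinks w toward zero.
w = sgd_finetune([np.array([5.0])], lambda p, b: [2.0 * p[0]],
                 np.zeros((100, 1)), lr=0.1)
```

In the patented method, `grad_fn` would evaluate the gradient of the fine-tuning loss J_fine (reconstruction, KL, weight decay and graph terms) via backpropagation.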
34) Initialize the parameter values A_i of each network layer, including the connection weights W_i and the corresponding biases b_i. Input the training-set data into the input layer to learn the network structure, compare the output value of each node in the output layer with the corresponding original input data, and repeatedly correct the parameter values of each neural layer, cycling in turn until the parameter values are obtained that make the output of each output-layer node most similar to the original input data.
35) Input the test-set data into the trained deep neural network model to determine the effectiveness of the selected dynamic functional connections with differences, so as to carry out the corresponding brain development analysis and research.
It should be noted that the mathematical equation is a parametric equation, which may be a linear model or a neuron model, for example a sigmoid activation function or a convolution operation model. The mathematical model is set in the following manner:
y = g(X) = f_n(f_{n−1}(f_{n−2}(···f_1(X)···)))
wherein y is the reconstructed training-set data with dimension p, X is the training-set data with feature dimension p, and f₁ to f_n are the operation equations of the successive layers. Each layer equation f_i maps the output of the previous layer to the output of its own layer; for example, from the input layer to the first hidden layer, training data of dimension p is converted into an output whose dimension equals the number of neurons in that hidden layer, and so on by analogy for the number of neurons in each layer. Each layer model f_i has a matching set of parameters A_i, comprising the connection weights W_i and the biases b_i of the respective layers.
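The composition y = f_n(···f_1(X)) is an ordinary layered forward pass. A minimal NumPy sketch, assuming affine layers with the sigmoid activation chosen in step 33) (function name and toy shapes are ours):

```python
import numpy as np

def forward(x, weights, biases, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Composition y = f_n(...f_1(x)): each layer applies
    a^(l) = f(a^(l-1) @ W^(l-1) + b^(l-1)); f is the sigmoid by default."""
    a = x
    for W, b in zip(weights, biases):
        a = f(a @ W + b)
    return a

# Tiny autoencoder shape p=4 -> 3 -> 4: the output recovers the input dimension.
rng = np.random.default_rng(2)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 4))]
biases = [np.zeros(3), np.zeros(4)]
y = forward(rng.normal(size=(2, 4)), weights, biases)
```

With sigmoid activations every output lies in (0, 1), which is what the KL sparsity penalty on the hidden activations assumes.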
In order to simultaneously take into account prior knowledge of the sparsity pattern and of the intrinsic structure in the data, and to improve the learning and expression capacity of unsupervised deep sparse autoencoding, the invention introduces a graph Laplacian regularization term of the specific form

R_graph(H) = (1/2) Σ_{p,q=1}^{N} φ_pq ‖h_p − h_q‖² = Tr(Hᵀ L H)

wherein φ_pq = exp(−‖x_p − x_q‖²/σ) and σ represents the width of the Gaussian kernel. The objective function after introducing the graph Laplacian regularization term is:

J_GSAE(W, b) = (1/2N) Σ_{p=1}^{N} ‖a^(3)(x_p) − x_p‖² + λ₁ Σ_{j=1}^{n₂} KL(ρ ‖ ρ̂_j) + (λ₂/2) Σ_{l=1}^{2} ‖W^(l)‖² + λ₃ Tr(Hᵀ L H)

where ρ is the sparsity parameter of the KL divergence term, ρ̂_j is the average of the activation values of the N training samples at the j-th hidden-layer unit, and λ₁, λ₂, λ₃ are the penalty weights of the KL, weight decay and graph regularization terms respectively. Pre-training of each GSAE is completed with this function; the pre-trained GSAEs are then stacked into the GSDAE, whose training is completed by fine-tuning. The loss function of the GSDAE in the fine-tuning process is:

J_fine(W, b) = (1/2N) Σ_{p=1}^{N} Σ_j (x̂_pj − x_pj)² + β₁ Σ_l Σ_j KL(ρ ‖ ρ̂_j^(l)) + (β₂/2) Σ_l ‖W^(l)‖² + β₃ Σ_{p,q} φ_pq ‖h_p − h_q‖²

wherein β₁, β₂, β₃ are the penalty parameters of the regularization terms.
Example two
The brain development data analysis system of the present invention comprises:
a construction module, configured to construct a graph regularized sparse deep autoencoder model, wherein the hidden layers of the model are formed by stacking N sparse autoencoders with graph Laplacian regularization, and a graph Laplacian regularization term is added to the loss function of the model;
a training and fine-tuning module, configured to train and fine-tune the graph regularized sparse deep autoencoder model;
and an analysis module, configured to analyze brain development data using the trained graph regularized sparse deep autoencoder model.
EXAMPLE III
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the brain development data analysis method when executing the computer program.
Example four
A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the brain development data analysis method.
EXAMPLE five
As an important medical imaging modality, MRI has been widely used for early detection, diagnosis and treatment of disease. Such biomedical data generally have high-dimensional, small-sample characteristics, i.e. a large number of features and a low sample count, and an overfitting problem may arise when deep learning methods are applied to analyze the corresponding brain structure and brain function abnormalities in such data. In order to verify that the proposed algorithm can effectively avoid overfitting and select valid features, we used Philadelphia Neurodevelopmental Cohort (PNC) data, large-scale experimental data from a collaborative study of brain behaviour by the University of Pennsylvania and the Children's Hospital of Philadelphia, comprising fMRI data for 878 adolescents aged 8-22, preprocessed with the standard SPM12 pipeline.
The invention uses the graph regularized sparse deep autoencoder to select the functional connections with significant differences in the brain development process. Data subsets of the complete data set were selected according to age (in months): subjects older than 216 months belong to the first category and subjects younger than 144 months belong to the second category. The Pearson correlation coefficients between the regional average blood-oxygen-level-dependent (BOLD) signals of 264 brain regions of interest (ROIs) were first calculated. The input data thus has a dimension of 34716 and a total sample size of 397. The performance of the network is verified by the classification ability of the selected functional connections, after which the selected functional connections are subjected to bioinformatic analysis.
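The 34716-dimensional input follows directly from vectorizing the 264 × 264 correlation matrix: 264 · 263 / 2 unique ROI pairs. A NumPy sketch of this feature construction (function name and the synthetic time series are ours):

```python
import numpy as np

def fc_features(bold):
    """Vectorize the functional-connectivity matrix: Pearson correlations
    between all pairs of ROI time series, keeping the strictly upper
    triangle. bold has shape (n_rois, n_timepoints)."""
    r = np.corrcoef(bold)                    # n_rois x n_rois correlation matrix
    iu = np.triu_indices(r.shape[0], k=1)    # unique ROI pairs
    return r[iu]

# 264 ROIs give 264 * 263 / 2 = 34716 features, the input dimension above.
feats = fc_features(np.random.default_rng(3).normal(size=(264, 120)))
```

One such feature vector per subject and scan window yields the 397 × 34716 data matrix described above.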
From the two groups of samples (children and adolescents), 70% of the subjects were selected as training set and the remaining 30% as test set. The training data were used to learn the GSDAE network architecture and thereby determine the intrinsic differences in dynamic functional connectivity (dFC) between the two groups, while the test data were used to verify the effectiveness of the differing dFC. Frequency statistics were computed over the original 34716 connections according to the reconstruction results on the training data, and the high-frequency connections were selected as the dFC with significant differences between the two groups. The whole selection process is performed in an unsupervised fashion; to avoid missing important connections, the above process was repeated 10 times. Furthermore, to test the performance of the GSDAE model in sparsity and feature selection, we compared it with several other deep autoencoder models, including the plain deep autoencoder (DAE), a deep autoencoder with weight decay and KL regularization (sDAE), deep autoencoders with Dropout (DP-DAE), with DropConnect (DPC-DAE), with pruning (Pr-DAE), and one based on singular value decomposition (SVD-DAE). These models all share their common parameters with the GSDAE. All the above models and the GSDAE were run 10 times on the test data; the results are shown in FIG. 2a and FIG. 2b, from which it can be seen that the GSDAE can accurately identify the significant dFC differences between the two groups.
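The split-then-count procedure above can be sketched in a few lines; the function names, the seed and the toy run lists are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def split_70_30(n_subjects, seed=0):
    """Random 70/30 split of subject indices into training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_subjects)
    cut = int(round(0.7 * n_subjects))
    return idx[:cut], idx[cut:]

def high_frequency_connections(selected_runs, top_k):
    """Frequency statistics over repeated runs: count how often each
    connection index was selected and keep the top_k most frequent ones."""
    counts = np.bincount(np.concatenate(selected_runs))
    return np.argsort(counts)[::-1][:top_k]

train_idx, test_idx = split_70_30(397)        # the 397 subjects of this example
top = high_frequency_connections([[0, 1, 2], [1, 2, 3], [2, 3, 4]], top_k=1)
```

Repeating the selection and keeping only connections that recur across runs is what guards against missing important connections in any single unsupervised pass.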
Finally, the 10 sets of dFC differences selected unsupervised by the GSDAE were considered, together with their combination and the occurrence frequency of each difference across the 10 results, and the most significant features were determined according to their average classification accuracy. The relation between the selected features and the average classification accuracy on the test data is shown in FIG. 3. Of the 34716 functional connections, 1729 features were finally retained as the most significant dynamic functional connections during development, so that the feature dimension of all samples was reduced to 1729, i.e. only the most significant dFC of the samples were retained.
Analysis of biological information contained in features
1. Dynamic functional ligation state analysis and its time-variability
First, k-means clustering was used to identify different functional connection patterns, and the optimal number of FC states for the two groups was determined to be 4 by the elbow rule based on the squared error (the basis for the state-number determination is shown in FIG. 3).
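The elbow rule scores each candidate k by the within-cluster sum of squared errors (SSE) and picks the k at the bend of the SSE curve. A minimal NumPy sketch of Lloyd k-means and the elbow scan (function name, seeds and the two-blob toy data are ours):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd k-means; returns labels and the within-cluster SSE
    used by the elbow rule."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)              # assign to nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)  # recompute means
    return labels, ((X - centers[labels]) ** 2).sum()

# Elbow rule: run for k = 1, 2, 3, ... and pick the k where the SSE curve bends.
rng = np.random.default_rng(4)
blobs = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                   rng.normal(5.0, 0.1, size=(20, 2))])
sse = {k: kmeans(blobs, k)[1] for k in (1, 2, 3)}
```

For the two well-separated blobs the SSE drops sharply from k = 1 to k = 2 and flattens afterwards, which is exactly the bend the rule looks for; in the experiment that bend falls at k = 4.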
For the 4 different states, the significant between-group differences of the functional networks in each state were studied separately. To further quantify the importance of the four states, three indices, NS, FT and DT, were selected to study the time occupancy of the subjects' four states. The results are shown in FIG. 4a to FIG. 4g. FIG. 4a to FIG. 4d show the distribution of the 1729 selected functional connections among the 264 ROIs and their corresponding 13 RSNs in the four FC states, where the dark lines in each subgraph represent functional connections that increase during development and the remaining lines represent functional connections that decrease during development. FIG. 4e shows the FT mean distribution of children and adolescents in the four states, FIG. 4f shows the DT mean distribution of the two groups of samples in the four states, and FIG. 4g shows the statistical differences in FT and DT between the two groups of samples in the four states.
2. Connectomics difference analysis
Of the 1729 selected functional connections with significant developmental differences, 1160 were located within or between 12 RSNs with specific brain functions; of these, 162 connections were significantly strengthened with increasing age and 998 connections were significantly weakened with increasing age. FIG. 5 depicts the distribution of the functional connections between and within the 12 RSNs, where the lower half represents the weakened functional connections, the upper half represents the strengthened functional connections, and the colour of each block represents the strength of the functional connection, a deeper colour indicating a stronger connection.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalent substitutions may be made to the embodiments of the invention without departing from its spirit and scope, which are intended to be covered by the claims.

Claims (10)

1. A method of analyzing brain development data, comprising:
constructing a graph regular sparse depth self-coding model, wherein the hidden layer of the graph regular sparse depth self-coding model is formed by stacking N sparse self-encoders with graph Laplacian regularization, and a graph Laplacian regularization term is added to the loss function of the graph regular sparse depth self-coding model;
training and fine-tuning the graph regular sparse depth self-coding model;
and analyzing brain development data by using the trained graph regular sparse depth self-coding model.
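The stacking described in claim 1 can be illustrated with a minimal numpy sketch. This is an illustration only, not the patented implementation: the function names and the random stand-in weights are our own, and in the described method each layer's weights would first be pretrained with the graph-regularized sparse loss before the stack is fine-tuned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stack_encoders(X, layer_sizes, seed=0):
    # Greedy stacking: each layer's hidden output is the input of the
    # next sparse self-encoder.  The random weights below are stand-ins
    # for weights pretrained with the graph-regularized sparse loss.
    rng = np.random.RandomState(seed)
    A, params = X, []
    for n_out in layer_sizes:
        W = 0.1 * rng.randn(A.shape[1], n_out)
        b = np.zeros(n_out)
        A = sigmoid(A @ W + b)  # hidden representation fed forward
        params.append((W, b))
    return A, params
```

Stacking six such encoders, as in claim 3, corresponds to passing a six-element `layer_sizes` list.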
2. The brain development data analysis method according to claim 1, wherein the specific process of training the graph regular sparse depth self-coding model is:
collecting the recorded brain development data;
combining, for each individual in the collected brain development data, the individual's data features and their change values into one unit of data, and constructing a data matrix from the unit data of all individuals;
dividing the data matrix into a training set and a test set;
and training the graph regular sparse depth self-coding model by using the training set and the test set.
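The data-matrix construction and split of claim 2 can be sketched as follows; the function name, the split ratio, and the random shuffling are illustrative assumptions, not details fixed by the claim.

```python
import numpy as np

def build_and_split(units, test_ratio=0.2, seed=0):
    # Each element of `units` is one individual's unit data: its feature
    # values and their change values, flattened into one row vector.
    X = np.vstack([np.asarray(u, dtype=float) for u in units])
    rng = np.random.RandomState(seed)
    idx = rng.permutation(X.shape[0])
    n_test = int(round(X.shape[0] * test_ratio))
    return X[idx[n_test:]], X[idx[:n_test]]  # train, test
```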
3. The brain development data analysis method according to claim 1, wherein the hidden layer in the graph regular sparse depth self-coding model is formed by stacking six sparse self-encoders with graph Laplacian regularization.
4. The brain development data analysis method according to claim 1, wherein the graph Laplacian regularization term is:

$$R_{\mathrm{graph}} = \frac{1}{2}\sum_{p,q=1}^{N}\phi_{pq}\,\|h_p - h_q\|^2 = \mathrm{Tr}\!\left(H^{T} L H\right)$$

wherein the data set is $X = \{x_1, x_2, \ldots, x_N\}$, $x_p \in \mathbb{R}^m$ is the p-th sample, N is the sample size, and m is the feature dimension of the samples; $\phi_{pq} = \exp(-\|x_p - x_q\|^2/\sigma)$ is the weight of the edge connecting samples p and q, and $\sigma$ is the Gaussian kernel width; $H = [h_1, h_2, \ldots, h_N]^T$, where $h_p \in \mathbb{R}^{m'}$ is the low-dimensional representation corresponding to sample $x_p$; $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix; D is the diagonal degree matrix with

$$D_{pp} = \sum_{q=1}^{N}\phi_{pq},$$

and

$$L = D - \Phi, \qquad \Phi = (\phi_{pq})_{N\times N}$$

is the graph Laplacian matrix.
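The graph Laplacian regularization term of claim 4 can be computed directly; a minimal numpy sketch (function name and default kernel width are our own choices):

```python
import numpy as np

def graph_laplacian_term(X, H, sigma=1.0):
    # Edge weights phi_pq = exp(-||x_p - x_q||^2 / sigma) on the sample graph.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    Phi = np.exp(-sq / sigma)
    D = np.diag(Phi.sum(axis=1))   # diagonal degree matrix
    L = D - Phi                    # graph Laplacian
    # Tr(H^T L H) equals 0.5 * sum_{p,q} phi_pq * ||h_p - h_q||^2
    return np.trace(H.T @ L @ H)
```

The term is nonnegative and penalizes low-dimensional representations that place graph-adjacent samples far apart.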
5. The brain development data analysis method according to claim 1, wherein the optimization objective of the graph regular sparse depth self-coding model in the training process is:

$$\min_{W,b}\; J = \frac{1}{2N}\sum_{p=1}^{N}\sum_{i=1}^{n_3}\left(a_i^{(3)}(x_p) - x_{pi}\right)^2 + \lambda_1\sum_{j=1}^{n_2}\mathrm{KL}\!\left(\rho\,\|\,\hat{\rho}_j\right) + \frac{\lambda_2}{2}\sum_{l=1}^{2}\left\|W^{(l)}\right\|_F^2 + \lambda_3\,\mathrm{Tr}\!\left(H^T L H\right)$$

wherein $X = \{x_1, x_2, \ldots, x_N\}$, $x_p \in \mathbb{R}^m$ is the p-th sample, N is the sample size, m is the feature dimension of the samples, and the number of neurons in layer l is $n_l$, l = 1, 2, 3; $a_i^{(3)}(x_p)$ is the activation value of the i-th neuron of the output layer for the p-th sample; $W^{(l)}$ is the connection weight matrix between layer l and layer l+1; $\mathrm{KL}(\cdot)$ denotes the KL divergence and $\rho$ is the sparsity parameter of the KL term, with

$$\hat{\rho}_j = \frac{1}{N}\sum_{p=1}^{N} a_j^{(2)}(x_p)$$

the average activation value of the N training samples at the j-th hidden-layer unit; $H = [h_1, h_2, \ldots, h_N]^T$, where each $h_p = a^{(2)}(x_p)$ is the hidden-layer activation of the p-th sample;

$$\mathrm{Tr}\!\left(H^T L H\right) = \frac{1}{2}\sum_{p,q=1}^{N}\phi_{pq}\,\|h_p - h_q\|^2$$

is the graph Laplacian regularization term, $\phi_{pq} = \exp(-\|x_p - x_q\|^2/\sigma)$, and $\sigma$ is the Gaussian kernel width; $\lambda_1, \lambda_2, \lambda_3$ are the weighting penalty parameters of the KL, weight-decay, and graph regularization terms respectively.
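The four terms of the objective in claim 5 can be evaluated together; the following numpy sketch uses sigmoid activations and illustrative default values for $\rho$ and $\lambda_1, \lambda_2, \lambda_3$ (none of these choices are fixed by the claim):

```python
import numpy as np

def kl_div(rho, rho_hat):
    # KL(rho || rho_hat) for Bernoulli sparsity targets, elementwise.
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_graph_loss(X, W1, b1, W2, b2, Lap, rho=0.05,
                   lam1=1.0, lam2=1e-4, lam3=1e-2):
    # One sparse self-encoder with graph regularization; Lap is the
    # precomputed N x N graph Laplacian of the samples.
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = sigmoid(X @ W1 + b1)          # hidden activations, one row per sample
    Xhat = sigmoid(H @ W2 + b2)       # reconstruction at the output layer
    N = X.shape[0]
    recon = np.sum((Xhat - X) ** 2) / (2 * N)
    rho_hat = H.mean(axis=0)          # average activation per hidden unit
    sparse = np.sum(kl_div(rho, rho_hat))
    decay = 0.5 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    graph = np.trace(H.T @ Lap @ H)
    return recon + lam1 * sparse + lam2 * decay + lam3 * graph
```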
6. The brain development data analysis method according to claim 1, wherein the loss function of the graph regular sparse depth self-coding model after adding the graph Laplacian regularization term is:

$$J = \frac{1}{2N}\sum_{p=1}^{N}\sum_{i=1}^{n_3}\left(a_i^{(3)}(x_p) - x_{pi}\right)^2 + \lambda_1\sum_{j=1}^{n_2}\mathrm{KL}\!\left(\rho\,\|\,\hat{\rho}_j\right) + \frac{\lambda_2}{2}\sum_{l=1}^{2}\left\|W^{(l)}\right\|_F^2 + \lambda_3\,\mathrm{Tr}\!\left(H^T L H\right)$$

wherein $X = \{x_1, x_2, \ldots, x_N\}$, $x_p \in \mathbb{R}^m$ is the p-th sample, N is the sample size, m is the feature dimension of the samples, and the number of neurons in layer l is $n_l$, l = 1, 2, 3; $a_i^{(3)}(x_p)$ is the activation value of the i-th neuron of the output layer for the p-th sample; $W^{(l)}$ is the connection weight matrix between layer l and layer l+1; $\mathrm{KL}(\cdot)$ denotes the KL divergence and $\rho$ is the sparsity parameter of the KL term, with

$$\hat{\rho}_j = \frac{1}{N}\sum_{p=1}^{N} a_j^{(2)}(x_p)$$

the average activation value of the N training samples at the j-th hidden-layer unit; $H = [h_1, h_2, \ldots, h_N]^T$, where each $h_p = a^{(2)}(x_p)$ is the hidden-layer activation of the p-th sample;

$$\mathrm{Tr}\!\left(H^T L H\right) = \frac{1}{2}\sum_{p,q=1}^{N}\phi_{pq}\,\|h_p - h_q\|^2$$

is the graph Laplacian regularization term, $\phi_{pq} = \exp(-\|x_p - x_q\|^2/\sigma)$, and $\sigma$ is the Gaussian kernel width; $\lambda_1, \lambda_2, \lambda_3$ are the weighting penalty parameters of the KL, weight-decay, and graph regularization terms respectively.
7. The brain development data analysis method according to claim 1, wherein the loss function of the graph regular sparse depth self-coding model in the fine-tuning process is:

$$J_{\mathrm{ft}} = \frac{1}{2N}\sum_{p=1}^{N}\sum_{j=1}^{m}\left(\hat{x}_{pj} - x_{pj}\right)^2 + \beta_1\sum_{l}\sum_{j}\mathrm{KL}\!\left(\rho\,\|\,\hat{\rho}_j^{(l)}\right) + \frac{\beta_2}{2}\sum_{l}\left\|W^{(l)}\right\|_F^2 + \beta_3\sum_{l}\mathrm{Tr}\!\left(H^{(l)T} L H^{(l)}\right)$$

wherein $\hat{x}_{pj}$ denotes the reconstruction of $x_{pj}$ at the L-th (output) layer; $\phi_{pq} = \exp(-\|x_p - x_q\|^2/\sigma)$ and $\sigma$ denotes the Gaussian kernel width; $a_j^{(l)}(x_p) = f\!\left(z_{pj}^{(l)}\right)$ is the response value of the j-th neuron of the p-th sample in the l-th layer, where $z^{(l)} = a^{(l-1)}\cdot W^{(l-1)}$ and $z_{pj}^{(l)}$ is its value at (p, j); f is any differentiable activation function;

$$\hat{\rho}_j^{(l)} = \frac{1}{N}\sum_{p=1}^{N} a_j^{(l)}(x_p)$$

is the average activation value of the j-th neuron at the l-th layer, and the sums over l run over the hidden layers; $H^{(l)} = \left[a^{(l)}(x_1), \ldots, a^{(l)}(x_N)\right]^T$; $\beta_1, \beta_2, \beta_3$ are the penalty parameters of the regularization terms.
8. A brain development data analysis system, comprising:
a construction module, configured to construct a graph regular sparse depth self-coding model, wherein the hidden layer of the graph regular sparse depth self-coding model is formed by stacking N sparse self-encoders with graph Laplacian regularization, and a graph Laplacian regularization term is added to the loss function of the graph regular sparse depth self-coding model;
a training and fine-tuning module, configured to train and fine-tune the graph regular sparse depth self-coding model;
and an analysis module, configured to analyze brain development data by using the trained graph regular sparse depth self-coding model.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the brain development data analysis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method for analyzing brain development data according to any one of claims 1 to 7.
CN202110706290.1A 2021-06-24 2021-06-24 Brain development data analysis method, system, equipment and storage medium Pending CN113450921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706290.1A CN113450921A (en) 2021-06-24 2021-06-24 Brain development data analysis method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113450921A true CN113450921A (en) 2021-09-28

Family

ID=77812465

Country Status (1)

Country Link
CN (1) CN113450921A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213000A1 (en) * 2016-01-25 2017-07-27 Shenzhen University Metabolic mass spectrometry screening method for diseases based on deep learning and the system thereof
CN107145836A (en) * 2017-04-13 2017-09-08 西安电子科技大学 Hyperspectral image classification method based on stack boundary discrimination self-encoding encoder
CN111916204A (en) * 2020-07-08 2020-11-10 西安交通大学 Brain disease data evaluation method based on self-adaptive sparse deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN QIAO et al.: "A deep autoencoder with sparse and graph Laplacian regularization for characterizing dynamic functional connectivity during brain development", Neurocomputing *
HU Cong; WU Xiaojun; SHU Zhenqiu; CHEN Sugen: "Laplacian Ladder Networks", Journal of Software (in Chinese)

Similar Documents

Publication Publication Date Title
CN110335261B (en) CT lymph node detection system based on space-time circulation attention mechanism
Nan et al. Application of improved som network in gene data cluster analysis
Wu et al. Complex system fault diagnosis based on a fuzzy robust wavelet support vector classifier and an adaptive Gaussian particle swarm optimization
CN109389171B (en) Medical image classification method based on multi-granularity convolution noise reduction automatic encoder technology
CN111128380A (en) Method and system for constructing chronic disease health management model for simulating doctor diagnosis and accurate intervention strategy
CN111009324A (en) Mild cognitive impairment auxiliary diagnosis system and method based on brain network multi-feature analysis
Ding et al. Tensor sliced inverse regression
CN111513717A (en) Method for extracting brain functional state
Sorić et al. Using convolutional neural network for chest X-ray image classification
CN115474939A (en) Autism spectrum disorder recognition model based on deep expansion neural network
Qiao et al. Log-sum enhanced sparse deep neural network
Wang et al. A novel restricted Boltzmann machine training algorithm with fast Gibbs sampling policy
CN111916204A (en) Brain disease data evaluation method based on self-adaptive sparse deep neural network
Sun et al. A fuzzy brain emotional learning classifier design and application in medical diagnosis
CN114037014A (en) Reference network clustering method based on graph self-encoder
Das et al. Missing value imputation–A review
CN113450921A (en) Brain development data analysis method, system, equipment and storage medium
CN116580848A (en) Multi-head attention mechanism-based method for analyzing multiple groups of chemical data of cancers
CN115661498A (en) Self-optimization single cell clustering method
Ragab et al. Mathematical Modelling of Quantum Kernel Method for Biomedical Data Analysis.
Zhou et al. Kernel principal components based cascade forest towards disease identification with human microbiota
Lin et al. A neuronal morphology classification approach based on deep residual neural networks
CN111738298A (en) Data classification method based on depth-width-variable multi-core learning
Wu et al. An RBF-LVQPNN model and its application to time-varying signal classification
CN114862834B (en) Resting state functional magnetic resonance image data classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination