CN109887544B

CN109887544B - RNA sequence parallel classification method based on non-negative matrix factorization

Info

Publication number: CN109887544B
Application number: CN201910060301.6A
Authority: CN
Inventors: 杨晓凯; 钟诚; 黄毅然
Original assignee: Guangxi University
Current assignee: Guangxi University
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2022-07-05
Anticipated expiration: 2039-01-22
Also published as: CN109887544A

Abstract

The invention discloses a parallel RNA sequence classification method based on non-negative matrix factorization. After the RNA data are arranged into a matrix, Bayesian coefficients are computed from the original data matrix for different values of K, the choice of K in the non-negative matrix factorization is constrained by these coefficients, and non-negative matrix factorization is then used to classify the RNA sequences in parallel. The method improves the classification accuracy of RNA sequences, and parallelization improves the running efficiency of the classification.

Description

RNA sequence parallel classification method based on non-negative matrix factorization

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a parallel RNA sequence classification method based on nonnegative matrix factorization.

Background

Bioinformatics tools for analyzing single-cell RNA sequence data still lag behind experimental techniques. In recent years, various methods have been developed to detect subpopulations (subsets) within a group of cells from single-cell RNA sequence data. These computational tools show how important it is to understand the heterogeneity of single-cell RNA sequences. Furthermore, once the subpopulations are identified, each subpopulation (subclass) has a characteristic gene expression profile, and finding its key features can reveal secondary biological mechanisms.

Non-negative matrix factorization (NMF) is an effective dimension reduction algorithm that has attracted attention for its simple concept, ease of implementation, and many variants. It reduces dimension by decomposing a high-dimensional matrix into the product of two or more low-dimensional matrices, which makes it convenient to study the properties of high-dimensional data in a low-dimensional space. NMF differs from other methods in that it approximately decomposes a given non-negative matrix into the product of two non-negative matrices; that is, NMF guarantees that every element of the factor matrices is non-negative, which yields an additive, parts-based representation of the original data. Compared with similar algorithms, NMF has the distinctive properties of locality and parts-based representation, and it has an intuitive physical meaning: the whole is an additive combination of a number of parts. NMF has therefore received great attention in recent years, both its conceptual framework and its implementation methods have been further developed, and a number of fast, practical algorithms have emerged.

Disclosure of Invention

The invention aims to provide an RNA sequence parallel classification method based on non-negative matrix factorization, so as to improve classification accuracy and running speed.

The technical scheme adopted by the invention to achieve this aim is as follows: in the RNA sequence parallel classification method based on non-negative matrix factorization, the RNA data are arranged into a matrix, Bayesian coefficients are obtained from the original data matrix for different K values, the choice of K in the non-negative matrix factorization is constrained by the Bayesian coefficients, and non-negative matrix factorization is used to classify the RNA sequences in parallel.

The RNA sequence parallel classification method based on non-negative matrix factorization comprises the following steps:

1) matrixing RNA data:

counts of mutations found in G different genomes are assembled into a K × G matrix M with K = |A|, where A is the alphabet of trinucleotide mutation types; if the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data are expressed as M = P × E;

2) solving the Bayesian coefficients corresponding to different K values from the original data matrix, and constraining the choice of K in the non-negative matrix factorization according to the Bayesian coefficients:

the matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter θ, the hyper-parameter Ψ and the hyper-linearity parameter η are estimated from the original data; a sampler for π(θ | M, η) is built for the selected η; with Z denoting a random tensor, a series of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, is generated iteratively from the complete set of conditional distributions, and the Bayesian coefficients corresponding to different K values are solved with a standard Gibbs sampling package;

the value range of k is determined from the sample size n, with 2 ≤ k ≤ n/2, and the corresponding Bayesian coefficients are solved; an acceptable fluctuation interval W is defined, the k with the largest Bayesian coefficient is found, and the smallest k value within the fluctuation interval is selected with reference to W;

3) classification using non-negative matrix factorization:

taking the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are obtained such that min ‖M − PE‖ is satisfied for the original data set; with E^(r) as the exposure matrix, important information about the contribution of each signature in the genomic samples is retained, and signatures whose Differential Exposure Score (DES) is above a specified level are considered to have differential activity in the population;

4) the parallel algorithm is implemented on the R language platform RStudio, with the following steps:

input: RNA sequences M, reference sequence L, acceptable fluctuation interval W and number of cores p;

output: RNA sequence classification data;

Begin

1: n ← |M|, the number of RNA sequences;

2: n' ← n/2 − 1, the number of K values for which Bayesian coefficients are actually calculated;

3: x ← n'/p, the number of K values whose Bayesian coefficients a single core must calculate;

4: assign to each core, according to x, the K values whose Bayesian coefficients it is to calculate;

5: each core calculates the Bayesian coefficients for its K values and stores them in an array k, where the subscript of each coefficient is its K value;

6: traverse the array k and find its maximum value k[m];

7: j ← m − 1, where m is the index of the maximum value in the array k;

8: while ((k[m] − k[j]) / k[m]) < W do

9: j = j − 1;

10: end while;

11: for i = 1 to j do

12: taking j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;

13: end for;

14: calculate the RNA sequence classification data by combining the P and E results with the reference sequence L;

End.
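Steps 2 to 5 of the pseudocode, which spread the candidate K values over p cores, can be sketched as follows. This is a minimal illustration in Python rather than R. The function names are hypothetical, `placeholder_coefficient` is only a stand-in for the Bayesian coefficient (it is not the patent's formula), and integer division is assumed for n/2. A thread pool is used to show the structure; for CPU-bound work a process pool with the same `map` interface would be the usual choice.

```python
from concurrent.futures import ThreadPoolExecutor

def split_k_values(n, p):
    """Split the candidate K values 2..n//2 into at most p contiguous chunks."""
    ks = list(range(2, n // 2 + 1))      # n' = n/2 - 1 candidate values
    if not ks:
        return []
    x = -(-len(ks) // p)                 # ceil(n'/p) values per core
    return [ks[i:i + x] for i in range(0, len(ks), x)]

def placeholder_coefficient(k):
    """Stand-in score for dimension k (assumption, NOT the patent's formula)."""
    return 1.0 / (1 + abs(k - 3))

def coefficients_in_parallel(n, p, score=placeholder_coefficient):
    """Compute one coefficient per candidate K, one chunk of K values per worker."""
    def work(chunk):
        return {k: score(k) for k in chunk}

    result = {}
    with ThreadPoolExecutor(max_workers=p) as pool:
        for partial in pool.map(work, split_k_values(n, p)):
            result.update(partial)       # the key is the K value, as in step 5
    return result

print(split_k_values(21, 4))
```

With n = 21 and p = 4 there are nine candidate values (2 to 10), so each worker receives three of them and only three of the four workers are used.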

The beneficial effects of the invention are as follows: compared with the prior art, the method significantly improves classification accuracy; after parallelization its running efficiency is significantly improved, and the time required to classify the same data is significantly shortened.

Drawings

FIG. 1 shows the result of matrixing the RNA data of 21 breast cancer cells;

FIG. 2 shows the classification result for the 21 breast cancer RNA data with a reference sequence length of 12.

Detailed Description

The RNA sequence parallel classification method based on non-negative matrix factorization of the invention comprises the following specific steps:

1) matrixing RNA data:

Most somatic mutations include single base substitutions, insertions and deletions, rearrangements, and copy number variations (CNVs). A single base substitution is one of six possible base changes, namely C:G > A:T, C:G > G:C, C:G > T:A, T:A > A:T, T:A > C:G and T:A > G:C. This set can be extended by including the 5' and 3' adjacent bases of each substitution site, giving an alphabet A of 96 trinucleotide mutation types (6 substitutions × 4 possible 5' bases × 4 possible 3' bases). Once A is defined, counts of mutations found in G different genomes are assembled into a K × G matrix M with K = |A|. A key assumption is that the counts in M arise as the additive effect of N mutation processes, each defined by a K × 1 vector of mutation rates; these vectors define the so-called mutation signatures. More precisely, the mutations in all genomes result from a linear combination of N basis vectors of dimension K × 1, with mixing coefficients given by N exposure vectors of size 1 × G. If the basis vectors are combined into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data can be written simply as M = P × E.
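Written element by element, the factorization above states that each count is a sum of N signature contributions. The index names k, g and n are introduced here only for illustration:

```latex
M_{kg} \;=\; \sum_{n=1}^{N} P_{kn}\, E_{ng},
\qquad k = 1,\dots,K,\quad g = 1,\dots,G,
\qquad P_{kn} \ge 0,\; E_{ng} \ge 0 .
```

Column n of P is the n-th mutation signature, and row n of E gives the exposure of every genome to that signature.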

FIG. 1 is a detailed result chart of the RNA data matrixing of 21 breast cancer cells.
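The matrixing step can be sketched as follows. This is a minimal illustration in Python rather than R. The label format `A[C>A]G`, the function names and the two toy genomes are assumptions made for this example, not part of the patent. Substitutions are written from the pyrimidine of the reference base pair, so C:G > A:T appears as `C>A`.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, written from the pyrimidine of the base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def mutation_alphabet():
    """Return the 96 trinucleotide mutation types, e.g. 'A[C>A]G'."""
    return [f"{five}[{sub}]{three}"
            for sub in SUBSTITUTIONS
            for five, three in product(BASES, repeat=2)]

def count_matrix(mutations_per_genome):
    """Assemble the K x G count matrix M (rows: mutation types, columns: genomes).

    `mutations_per_genome` holds one list of mutation-type labels per genome.
    """
    alphabet = mutation_alphabet()
    row = {label: i for i, label in enumerate(alphabet)}
    M = [[0] * len(mutations_per_genome) for _ in alphabet]
    for g, calls in enumerate(mutations_per_genome):
        for label in calls:
            M[row[label]][g] += 1
    return alphabet, M

# Two toy genomes (hypothetical data, for illustration only).
alphabet, M = count_matrix([
    ["A[C>A]A", "A[C>A]A", "T[T>G]T"],
    ["A[C>A]A"],
])
print(len(alphabet), len(M), len(M[0]))  # K labels, K rows, G columns
```

Here K = 96 and G = 2, so M has 96 rows and 2 columns, and the first row holds the counts of `A[C>A]A` in each genome.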

2) Solving the Bayesian coefficients corresponding to different K values from the original data matrix, and constraining the choice of K in the non-negative matrix factorization according to the Bayesian coefficients:

The matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter θ, the hyper-parameter Ψ and the hyper-linearity parameter η are estimated from the original data. A sampler for π(θ | M, η) is built for the selected η. With Z denoting a random tensor, a series of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, is generated iteratively from the complete set of conditional distributions, and these samples are used to update the parameter θ, the hyper-parameter Ψ and the hyper-linearity parameter η using the randomly drawn data. In this way, the Bayesian coefficients corresponding to different K values are solved with a standard Gibbs sampling package.

The value range of k is determined from the sample size n, with 2 ≤ k ≤ n/2, and the corresponding Bayesian coefficients are solved. An acceptable fluctuation interval W is defined, the k with the largest Bayesian coefficient is found, and the smallest k value within the fluctuation interval is selected with reference to W.

Taking the 21 breast cancer RNA data as an example, the procedure above gives k = 3.
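The selection rule can be sketched as follows. This is a minimal illustration in Python rather than R. The function name and the coefficient values are hypothetical, and the largest coefficient is assumed to be positive so that the relative drop is well defined. The sketch returns the last k that is still inside the fluctuation interval, which is one reading of "the smallest k value within the fluctuation interval"; the pseudocode's while loop, taken literally, stops one index lower.

```python
def select_k(coeff, W):
    """Return the smallest k whose coefficient stays within relative drop W of the maximum.

    `coeff` maps each candidate k (2 <= k <= n/2) to its Bayesian coefficient.
    The scan moves downward from the best k and stops at the first k outside W.
    """
    k_max = max(coeff, key=coeff.get)
    best = coeff[k_max]
    chosen = k_max
    k = k_max - 1
    while k in coeff and (best - coeff[k]) / best < W:
        chosen = k
        k -= 1
    return chosen

# Hypothetical coefficients, for illustration only.
coeff = {2: 0.40, 3: 0.93, 4: 0.95, 5: 0.96, 6: 0.90}
print(select_k(coeff, W=0.05))
```

In this toy example the maximum is at k = 5, the coefficients for k = 4 and k = 3 lie within 5% of it, and k = 2 does not, so the smaller dimension 3 is preferred over 5.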

3) Classification using non-negative matrix factorization

Taking the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are obtained such that min ‖M − PE‖ is satisfied for the original data set. With E^(r) as the exposure matrix, the exposures can be related to independent knowledge, such as clinical data, to examine how the activity of each mutation process is associated with it. In particular, when prior information induces a partition of the samples into two or more classes, the Kruskal-Wallis test is used to check whether the exposure values differ significantly between the classes. The median of the negative logarithm of the resulting p-values defines the Differential Exposure Score (DES). Important information about the contribution of each signature in the genomic samples is thus retained. Signatures with a DES above a specified level are considered to have differential activity in the population.

FIG. 2 shows the classification result for the 21 breast cancer RNA data with a reference sequence length of 12.
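The factorization in step 3 can be sketched as follows. This is a minimal illustration in Python with NumPy rather than R. It uses the well-known Lee-Seung multiplicative updates for the Frobenius norm, which is an assumption: the text does not say which norm or which solver is used. The toy matrix and the final "dominant signature" labelling are also illustrative assumptions and do not involve the reference sequence L.

```python
import numpy as np

def nmf(M, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative K x G matrix M into P (K x k) and E (k x G).

    Multiplicative updates for the Frobenius objective ||M - P E||;
    every update multiplies by a non-negative ratio, so P and E stay non-negative.
    """
    rng = np.random.default_rng(seed)
    K, G = M.shape
    P = rng.random((K, k)) + eps
    E = rng.random((k, G)) + eps
    for _ in range(iters):
        E *= (P.T @ M) / (P.T @ P @ E + eps)   # update exposures
        P *= (M @ E.T) / (P @ E @ E.T + eps)   # update signatures
    return P, E

# Toy data: a non-negative matrix of rank 2 (hypothetical, for illustration only).
rng = np.random.default_rng(1)
M = rng.random((6, 2)) @ rng.random((2, 5))

P, E = nmf(M, k=2)
rel_err = np.linalg.norm(M - P @ E) / np.linalg.norm(M)
labels = E.argmax(axis=0)   # assumed rule: assign each sample to its dominant signature
```

The relative error `rel_err` measures how well P × E reproduces M, and `labels` gives one class index per column of M.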

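4) Parallel RNA sequence classification is performed on the R language platform RStudio:

The specific steps are given in pseudocode form as follows:

input: breast cancer RNA sequences M, reference sequence L, acceptable fluctuation interval W and number of cores p;

output: breast cancer RNA sequence classification data;

Begin

1: n ← |M|, the number of RNA sequences;

2: n' ← n/2 − 1, the number of K values for which Bayesian coefficients are actually calculated;

3: x ← n'/p, the number of K values whose Bayesian coefficients a single core must calculate;

4: assign to each core, according to x, the K values whose Bayesian coefficients it is to calculate;

5: each core calculates the Bayesian coefficients for its K values and stores them in an array k, where the subscript of each coefficient is its K value;

6: traverse the array k and find its maximum value k[m];

7: j ← m − 1, where m is the index of the maximum value in the array k;

8: while ((k[m] − k[j]) / k[m]) < W do

9: j = j − 1;

10: end while;

11: for i = 1 to j do

12: taking j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;

13: end for;

14: calculate the breast cancer RNA sequence classification data by combining the P and E results with the reference sequence L;

End.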

Classification experiments were carried out with the above method on several data sets, including the 21 breast cancer data. Tables 1 and 2 compare the results of classifying the 21 breast cancer data with the present invention and with the existing method (see Rosales R A, Drummond R D, Valieris R, et al. signeR: An empirical Bayesian approach to mutational signature discovery [J]. Bioinformatics, 2016, 33(1): 8.); the method of the invention achieves higher classification accuracy and faster computation.

TABLE 1 comparison of the accuracy of the present invention with the existing method

TABLE 2 run time comparison of the present invention to existing methods

Claims

1. An RNA sequence parallel classification method based on non-negative matrix factorization, comprising matrixing RNA data, solving the Bayesian coefficients corresponding to different K values from the original data matrix, constraining the choice of K in the non-negative matrix factorization according to the Bayesian coefficients, and performing parallel RNA sequence classification using non-negative matrix factorization; characterized in that the method comprises the following steps:

1) matrixing RNA data:

counts of mutations found in G different genomes are assembled into a K × G matrix M, and if the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data are expressed as M = P × E;

3) classification using non-negative matrix factorization:

taking the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are obtained such that min ‖M − PE‖ is satisfied for the original data set; with E^(r) as the exposure matrix, important information about the contribution of each signature in the genomic samples is retained;

output: RNA sequence classification data;

Begin

1: n ← |M|, the number of RNA sequences;

6: traverse the array k and find its maximum value k[m];

7: j ← m − 1, where m is the index of the maximum value in the array k;

8: while ((k[m] − k[j]) / k[m]) < W do

9: j = j − 1;

10: end while;

11: for i = 1 to j do

13: end for;

End.