CN109887544A

CN109887544A - Parallel RNA sequence classification method based on non-negative matrix factorization

Info

Publication number: CN109887544A
Application number: CN201910060301.6A
Authority: CN
Inventors: 杨晓凯; 钟诚; 黄毅然
Original assignee: Guangxi University
Current assignee: Guangxi University
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2019-06-14
Anticipated expiration: 2039-01-22
Also published as: CN109887544B

Abstract

The invention discloses a parallel RNA sequence classification method based on non-negative matrix factorization (NMF). After the RNA data are converted into matrix form, a Bayesian coefficient is computed from the raw data matrix for each candidate K value; the K value used in the non-negative matrix factorization is then selected under the constraint of these Bayesian coefficients, and the RNA sequences are classified in parallel using non-negative matrix factorization. The method improves the classification accuracy of RNA sequences and, through parallelization, improves the running efficiency of the classification.

Description

Parallel RNA sequence classification method based on non-negative matrix factorization

Technical field

The invention belongs to the technical field of bioinformatics, and in particular relates to a parallel RNA sequence classification method based on non-negative matrix factorization.

Background technique

Compared with experimental techniques, the bioinformatics tools for analyzing single-cell RNA sequence data still lag behind. In recent years, a variety of methods have been developed to detect subpopulations (or subclasses) within a group of cells from single-cell RNA sequence data. These new computational tools have proved very important for understanding the heterogeneity of single-cell RNA sequences. Moreover, once the subpopulations have been determined, it is critical to find the characteristic gene expression signature of each subpopulation (subclass) in order to reveal the underlying biological mechanisms.

Non-negative matrix factorization (NMF) is an effective dimension-reduction algorithm that has attracted wide attention because its idea is simple, it is easy to implement, and it admits many variants. It decomposes a high-dimensional matrix into the product of two or more low-dimensional matrices, which reduces the dimension and makes it convenient to study the properties of high-dimensional data in a lower-dimensional space. NMF differs from other methods in that it approximately decomposes a given non-negative matrix into the product of two non-negative matrices; that is, NMF guarantees that every element of the resulting matrices is non-negative, so the original data are represented additively in terms of local features. The locality and parts-based representation of NMF are advantages over comparable algorithms and have an intuitive physical meaning: the whole can be expressed as an additive combination of multiple parts. NMF has therefore received a great deal of attention in recent years, has been further improved and developed from its conceptual framework to its implementation methods, and has produced a number of fast and efficient practical algorithms.

Summary of the invention

The purpose of the present invention is to provide a parallel RNA sequence classification method based on non-negative matrix factorization that improves classification accuracy and computing speed.

The technical solution adopted by the present invention to achieve the above purpose is a parallel RNA sequence classification method based on non-negative matrix factorization: after the RNA data are converted into matrix form, a Bayesian coefficient is computed from the raw data matrix for each candidate K value; the K value used in the non-negative matrix factorization is selected under the constraint of these Bayesian coefficients; and the RNA sequences are classified in parallel using non-negative matrix factorization.

The above parallel RNA sequence classification method based on non-negative matrix factorization comprises the following steps:

1) Convert the RNA data into matrix form:

The counts of the mutations found in G different genomes are assembled into a K × G matrix M with K = |A|, where A is the alphabet of trinucleotide mutation types. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data are expressed as M = P × E;

2) Compute the Bayesian coefficient corresponding to each K value from the raw data matrix, and select the K value used in the non-negative matrix factorization under the constraint of the Bayesian coefficients:

The matrix obtained in step 1) is analyzed with an empirical Bayes method, in which the parameter θ, the hyperparameter Ψ and the superlinearity parameter η are estimated from the raw data. For the selected η, a sampler for π(θ | M, η) is constructed. With Z denoting the random tensor, this requires iteratively generating a sequence of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, from the set of full conditional distributions. A standard Gibbs sampling package is used to obtain the Bayesian coefficient corresponding to each K value;

The range of k is determined from the sample size n as 2 ≤ k ≤ n/2, and the Bayesian coefficient is computed for each k in this range. An acceptable fluctuation interval W is specified, the k with the largest Bayesian coefficient is found, and, with reference to W, the smallest k within the fluctuation interval is chosen;

3) Classify using non-negative matrix factorization:

Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed such that min ||M − PE|| is satisfied for the original data set. E^(r) is taken as the exposure matrix, which retains the important information about the contribution of each signature in the genome samples; a signature with a DES above a prescribed level is considered to have differential activity between the groups;
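Written out, the factorization in step 3) can be read as the constrained minimization below. This is only one plausible reading: the choice of the Frobenius norm is an assumption, since the text writes min ||M − PE|| without naming the norm, and k denotes the dimension chosen in step 2) (the number of signatures N of step 1).

```latex
\min_{P \ge 0,\; E \ge 0} \; \lVert M - P E \rVert_F ,
\qquad
M \in \mathbb{R}_{\ge 0}^{K \times G},\quad
P \in \mathbb{R}_{\ge 0}^{K \times k},\quad
E \in \mathbb{R}_{\ge 0}^{k \times G}
```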

4) The parallel algorithm is implemented on the R language platform RStudio, with the following steps:

Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;

Output: RNA sequence classification data;

Begin

1: n ← number of RNA sequences |M|;

2: n' ← number of K values for which a Bayesian coefficient actually has to be computed, n/2 − 1;

3: x ← number of K values whose Bayesian coefficient a single core has to compute, n'/p;

4: assign to each core, according to x, the K values whose Bayesian coefficients it has to compute;

5: each core computes the Bayesian coefficients of its K values and stores them in array k, where the index is the K value corresponding to the Bayesian coefficient;

6: traverse array k to find its maximum value k[m];

7: j ← m − 1, the index just below that of the maximum value in array k;

8: while ((k[m] − k[j]) / k[m]) < W do

9: j = j − 1;

10: end while;

11: for i = 1 to j do

12: with j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;

13: end for;

14: compute the RNA sequence classification data from the resulting matrices P and E in combination with the reference sequence L;

End.
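Steps 1 to 5 of the pseudocode can be sketched as follows. This is a minimal illustration in Python rather than the R implementation named above, and several details are assumptions: n/2 is rounded down, n'/p is rounded up so that every K value is assigned, `toy_coefficient` is a hypothetical stand-in for the Gibbs-sampling computation (which is not reproduced here), and threads stand in for cores.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil


def candidate_ks(n):
    """Candidate K values 2..floor(n/2), i.e. n' = floor(n/2) - 1 values."""
    return list(range(2, n // 2 + 1))


def split_among_cores(ks, p):
    """Split the candidates into at most p contiguous chunks of about n'/p values."""
    x = max(1, ceil(len(ks) / p))  # K values per core (rounded up, an assumption)
    return [ks[i:i + x] for i in range(0, len(ks), x)]


def toy_coefficient(k):
    # Hypothetical stand-in for the Bayesian coefficient of a given K.
    return 1.0 / (1.0 + (k - 3) ** 2)


def coefficients_in_parallel(n, p, coefficient_fn=toy_coefficient):
    """Compute one coefficient per candidate K, one chunk of K values per worker."""
    chunks = split_among_cores(candidate_ks(n), p)

    def work(chunk):
        return [(k, coefficient_fn(k)) for k in chunk]

    coeff = {}
    with ThreadPoolExecutor(max_workers=p) as pool:
        for result in pool.map(work, chunks):
            coeff.update(result)
    return coeff  # indexed by K, like array k in step 5


coeff = coefficients_in_parallel(n=21, p=4)
best_k = max(coeff, key=coeff.get)  # step 6: K with the largest coefficient
```

For CPU-bound work in Python a process pool would be needed to use several cores; the thread pool here only shows how the K values are distributed and collected.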

Beneficial effects of the invention: compared with the existing method, the method of the invention clearly improves classification accuracy; in addition, after parallelization its running efficiency is markedly improved, and the time needed to classify the same data is clearly shortened.

Brief description of the drawings

Fig. 1 shows the result of converting the RNA data of 21 breast cancer cells into matrix form;

Fig. 2 shows the classification result for the 21 breast cancer RNA data when the reference sequence length is 12.

Specific embodiment

The specific steps of the parallel RNA sequence classification method based on non-negative matrix factorization of the invention are as follows:

1) Convert the RNA data into matrix form:

Most somatic mutations consist of single-base substitutions, insertions and deletions, rearrangements, and copy number variations (CNV). A single-base substitution is one of six possible base changes, namely C:G>A:T, C:G>G:C, C:G>T:A, T:A>A:T, T:A>C:G and T:A>G:C. By including the 5' and 3' bases adjacent to each substitution site, this set can be further expanded into an alphabet A of 96 trinucleotide mutation types. Once A has been defined, the counts of the mutations found in G different genomes are assembled into a K × G matrix M with K = |A|. A key assumption is that the counts in M are the additive effect of N mutational processes, each defined as a K × 1 vector of mutation rates; these vectors are the so-called mutational signatures. More precisely, the mutations in every genome result from a linear combination of N basis vectors of dimension K × 1, with mixing coefficients given by N exposure vectors of size 1 × G. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data can be written simply as M = P × E.
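The construction of the alphabet A and of the count matrix M can be sketched as below. This is a hedged illustration, not the patent's code: the label format `A[C>T]G` (5' base, substitution written from the pyrimidine, 3' base) is an assumed convention, the function names are hypothetical, and the two example genomes contain made-up mutations.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes listed above, written from the pyrimidine base
# (C:G>A:T becomes "C>A", and so on).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def build_alphabet():
    """Return the 96 trinucleotide mutation types (6 substitutions x 4 x 4 contexts)."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def count_matrix(mutations_per_genome, alphabet):
    """Assemble the K x G count matrix M as a list of K rows of G counts."""
    row_of = {mutation: i for i, mutation in enumerate(alphabet)}
    M = [[0] * len(mutations_per_genome) for _ in alphabet]
    for g, mutations in enumerate(mutations_per_genome):
        for mutation in mutations:
            M[row_of[mutation]][g] += 1
    return M


alphabet = build_alphabet()
# Two hypothetical genomes with a handful of made-up mutations.
genomes = [["A[C>T]G", "A[C>T]G", "T[T>A]C"], ["G[C>A]A"]]
M = count_matrix(genomes, alphabet)
```

Here K = len(alphabet) = 96 and G = len(genomes), so M has 96 rows and one column per genome.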

Fig. 1 shows the result of converting the RNA data of the 21 breast cancer cells into matrix form.

2) Compute the Bayesian coefficient corresponding to each K value from the raw data matrix, and select the K value used in the non-negative matrix factorization under the constraint of the Bayesian coefficients:

The matrix obtained in step 1) is analyzed with an empirical Bayes method, in which the parameter θ, the hyperparameter Ψ and the superlinearity parameter η are all estimated from the raw data. For the selected η, a sampler for π(θ | M, η) is constructed. With Z denoting the random tensor, this requires iteratively generating a sequence of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, from the set of full conditional distributions; these samples are used to update the parameter θ, the hyperparameter Ψ and the superlinearity parameter η from the random data. A standard Gibbs sampling package is used in this way to obtain the Bayesian coefficient corresponding to each K value.

The range of k is determined from the sample size n as 2 ≤ k ≤ n/2, and the Bayesian coefficient is computed for each k in this range. An acceptable fluctuation interval W is specified, the k with the largest Bayesian coefficient is found, and, with reference to W, the smallest k within the fluctuation interval is chosen.

Taking the 21 breast cancer RNA data as an example, k = 3 is obtained in this way.
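The selection rule for k can be sketched as follows, assuming the Bayesian coefficients have already been computed. The sketch reads "the smallest k within the fluctuation interval" as a downward scan from the best k that stops at the first k whose relative gap to the maximum reaches W; that reading, the function name `select_k`, and the numbers in the example are all assumptions for illustration.

```python
def select_k(coeff, W):
    """Pick the smallest k whose coefficient stays within W of the maximum.

    coeff maps each candidate k to its Bayesian coefficient; W is the
    acceptable relative fluctuation. The maximum is assumed to be non-zero.
    """
    m = max(coeff, key=coeff.get)  # k with the largest coefficient
    best = coeff[m]
    j = m - 1
    # Walk downwards while the relative gap to the maximum is below W.
    while j in coeff and (best - coeff[j]) / abs(best) < W:
        j -= 1
    return j + 1  # last k that was still inside the interval


# Made-up coefficients for k = 2..6, not taken from the patent's data.
example = {2: 0.80, 3: 0.97, 4: 0.98, 5: 1.00, 6: 0.90}
chosen = select_k(example, W=0.05)
```

With a very small W the rule returns the k of the maximum itself; with a large W it falls back to the smallest candidate.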

3) Classify using non-negative matrix factorization:

Using the k obtained in step 2) as the dimension of the non-negative matrix factorization (the number of signatures N in M = P × E), P and E are computed such that min ||M − PE|| is satisfied for the original data set. E^(r) is taken as the exposure matrix; it can be related to independent knowledge such as clinical data in order to examine how the activity of each mutational process is associated with that knowledge. In particular, when prior information motivates a division of the samples into two or more categories, the Kruskal-Wallis test is used to check whether the actual values differ significantly between the categories. The negative logarithm of the median of these values defines the differential exposure score (DES). In this way the important information about the contribution of each signature in the genome samples is retained. A signature with a DES above a prescribed level is considered to have differential activity between the groups.
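The factorization itself can be illustrated with the sketch below. The text does not say which NMF algorithm or norm is used, so this example swaps in the classical multiplicative update rules of Lee and Seung for the Frobenius norm; the function names, the iteration count, the random initialization and the small example matrix are all assumptions.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(M, P, E):
    """Return ||M - P E|| in the Frobenius norm."""
    PE = matmul(P, E)
    return sum((m - q) ** 2 for rm, rq in zip(M, PE) for m, q in zip(rm, rq)) ** 0.5


def nmf(M, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative K x G matrix M into P (K x k) and E (k x G)."""
    rng = random.Random(seed)
    rows, cols = len(M), len(M[0])
    P = [[rng.random() + eps for _ in range(k)] for _ in range(rows)]
    E = [[rng.random() + eps for _ in range(cols)] for _ in range(k)]
    for _ in range(iterations):
        # E <- E * (P^T M) / (P^T P E), element-wise
        Pt = transpose(P)
        num, den = matmul(Pt, M), matmul(matmul(Pt, P), E)
        E = [[e * a / (b + eps) for e, a, b in zip(re, ra, rb)]
             for re, ra, rb in zip(E, num, den)]
        # P <- P * (M E^T) / (P E E^T), element-wise
        Et = transpose(E)
        num, den = matmul(M, Et), matmul(P, matmul(E, Et))
        P = [[p * a / (b + eps) for p, a, b in zip(rp, ra, rb)]
             for rp, ra, rb in zip(P, num, den)]
    return P, E


# Small made-up non-negative matrix of rank 2 (4 mutation types x 3 genomes).
M = [[1, 2, 0], [2, 4, 0], [0, 1, 4], [0, 3, 12]]
P, E = nmf(M, k=2)
error = frobenius_error(M, P, E)
```

Because the updates only multiply by non-negative ratios, P and E stay non-negative throughout, which is the property the method relies on when E is read as an exposure matrix.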

Fig. 2 shows the classification result for the 21 breast cancer RNA data when the reference sequence length is 12.

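4) The parallel RNA sequence classification is implemented on the R language platform RStudio:

The specific steps are given in the form of pseudocode, as follows:

Input: breast cancer RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;

Output: breast cancer RNA sequence classification data;

Begin

1: n ← number of RNA sequences |M|;

2: n' ← number of K values for which a Bayesian coefficient actually has to be computed, n/2 − 1;

3: x ← number of K values whose Bayesian coefficient a single core has to compute, n'/p;

4: assign to each core, according to x, the K values whose Bayesian coefficients it has to compute;

5: each core computes the Bayesian coefficients of its K values and stores them in array k, where the index is the K value corresponding to the Bayesian coefficient;

6: traverse array k to find its maximum value k[m];

7: j ← m − 1, the index just below that of the maximum value in array k;

8: while ((k[m] − k[j]) / k[m]) < W do

9: j = j − 1;

10: end while;

11: for i = 1 to j do

12: with j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;

13: end for;

14: compute the breast cancer RNA sequence classification data from the resulting matrices P and E in combination with the reference sequence L;

End.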

Experiments were carried out on several data sets, including the 21 breast cancer data, using the above method. Tables 1 and 2 compare the present invention with the existing method (see Rosales R A, Drummond R D, Valieris R, et al. signeR: An empirical Bayesian approach to mutational signature discovery [J]. Bioinformatics, 2016, 33(1): 8.) on the classification of the 21 breast cancer data. The method of the invention achieves higher classification accuracy and faster running speed.

Table 1. Accuracy of the present invention compared with the existing method

Table 2. Running time of the present invention compared with the existing method

Claims

1. A parallel RNA sequence classification method based on non-negative matrix factorization, characterized in that, after the RNA data are converted into matrix form, a Bayesian coefficient is computed from the raw data matrix for each candidate K value, the K value used in the non-negative matrix factorization is selected under the constraint of the Bayesian coefficients, and the RNA sequences are classified in parallel using non-negative matrix factorization.

2. The parallel RNA sequence classification method based on non-negative matrix factorization according to claim 1, characterized by comprising the following steps:

1) Convert the RNA data into matrix form:

The counts of the mutations found in G different genomes are assembled into a K × G matrix M with K = |A|, where A is the alphabet of trinucleotide mutation types. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data are expressed as M = P × E;

2) Compute the Bayesian coefficient corresponding to each K value from the raw data matrix, and select the K value used in the non-negative matrix factorization under the constraint of the Bayesian coefficients:

The matrix obtained in step 1) is analyzed with an empirical Bayes method, in which the parameter θ, the hyperparameter Ψ and the superlinearity parameter η are estimated from the raw data. For the selected η, a sampler for π(θ | M, η) is constructed. With Z denoting the random tensor, this requires iteratively generating a sequence of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, from the set of full conditional distributions. A standard Gibbs sampling package is used to obtain the Bayesian coefficient corresponding to each K value;

The range of k is determined from the sample size n as 2 ≤ k ≤ n/2, and the Bayesian coefficient is computed for each k in this range. An acceptable fluctuation interval W is specified, the k with the largest Bayesian coefficient is found, and, with reference to W, the smallest k within the fluctuation interval is chosen;

3) Classify using non-negative matrix factorization:

Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed such that min ||M − PE|| is satisfied for the original data set. E^(r) is taken as the exposure matrix, which retains the important information about the contribution of each signature in the genome samples; a signature with a DES above a prescribed level is considered to have differential activity between the groups;

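4) The parallel algorithm is implemented on the R language platform RStudio, with the following steps:

Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;

Output: RNA sequence classification data;

Begin

1: n ← number of RNA sequences |M|;

2: n' ← number of K values for which a Bayesian coefficient actually has to be computed, n/2 − 1;

3: x ← number of K values whose Bayesian coefficient a single core has to compute, n'/p;

4: assign to each core, according to x, the K values whose Bayesian coefficients it has to compute;

5: each core computes the Bayesian coefficients of its K values and stores them in array k, where the index is the K value corresponding to the Bayesian coefficient;

6: traverse array k to find its maximum value k[m];

7: j ← m − 1, the index just below that of the maximum value in array k;

8: while ((k[m] − k[j]) / k[m]) < W do

9: j = j − 1;

10: end while;

11: for i = 1 to j do

12: with j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;

13: end for;

14: compute the RNA sequence classification data from the resulting matrices P and E in combination with the reference sequence L;

End.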