CN109887544A - RNA sequence parallel classification method based on non-negative matrix factorization - Google Patents
- Publication number
- CN109887544A (application number CN201910060301.6A)
- Authority
- CN
- China
- Prior art keywords
- bayes
- coefficient
- matrix
- rna sequence
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an RNA sequence parallel classification method based on non-negative matrix factorization. After the RNA data are converted into matrix form, the Bayes coefficient corresponding to each candidate K value is computed from the raw data matrix, and the Bayes coefficients are used to constrain the choice of K in the non-negative matrix factorization. The RNA sequences are then classified in parallel using non-negative matrix factorization. The method improves the classification accuracy for RNA sequences and, through parallelization, improves the running efficiency of the classification.
Description
Technical field
The invention belongs to the technical field of bioinformatics, and in particular relates to an RNA sequence parallel classification method based on non-negative matrix factorization.
Background art
Compared with experimental techniques, bioinformatics tools for analyzing single-cell RNA sequence data still lag behind. In recent years, a variety of methods have been developed to detect subpopulations (or subclasses) within a group of cells using single-cell RNA sequence data. These new computational tools have proven very important for understanding the heterogeneity of single-cell RNA sequences. In addition, once the subpopulations are determined, finding the characteristic gene expression signature of each subpopulation (subclass) is critical for revealing the underlying biological mechanisms.
Non-negative matrix factorization (NMF) is an effective dimensionality reduction algorithm that has attracted wide attention because its idea is simple, it is easy to implement, and it has many variants. It decomposes a high-dimensional matrix into the product of two or more low-dimensional matrices, which reduces the dimensionality and makes it convenient to study the properties of high-dimensional data in a lower-dimensional space. NMF differs from other methods in that it approximately decomposes a given non-negative matrix into the product of two non-negative matrices. That is, NMF guarantees that every element of the resulting matrices is non-negative, so that the original data are represented additively in terms of local features. NMF has distinctive properties such as locality and parts-based representation that set it apart from comparable algorithms, and it has an intuitive physical meaning: the whole can be expressed as an additive combination of its parts. For these reasons the NMF algorithm has received a great deal of attention in recent years, has been improved and developed from its conceptual framework through to its implementation, and has given rise to a number of fast and efficient practical algorithms.
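The decomposition described above can be written compactly as below. This is a sketch only: the symbols M, P and E follow the notation used in the later sections of this document, and the norm is not specified in the source (the Frobenius norm is a common choice and is an assumption here).

```latex
M \approx P E, \qquad
P \in \mathbb{R}_{\ge 0}^{K \times N}, \quad
E \in \mathbb{R}_{\ge 0}^{N \times G}, \qquad
(P, E) = \arg\min_{P \ge 0,\; E \ge 0} \lVert M - P E \rVert
```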
Summary of the invention
The purpose of the present invention is to provide an RNA sequence parallel classification method based on non-negative matrix factorization, so as to improve classification accuracy and computing speed.
The technical solution adopted by the present invention to achieve this purpose is an RNA sequence parallel classification method based on non-negative matrix factorization in which, after the RNA data are converted into matrix form, the Bayes coefficient corresponding to each candidate K value is computed from the raw data matrix, the Bayes coefficients are used to constrain the choice of K in the non-negative matrix factorization, and the RNA sequences are classified in parallel using non-negative matrix factorization.
The above RNA sequence parallel classification method based on non-negative matrix factorization comprises the following steps:
1) Matrixing the RNA data:
The counts of mutations found in G different genomes are assembled into a K × G matrix M, where K is the size of the alphabet A of trinucleotide mutation types. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors form the N × G matrix E, the RNA data are modeled as M = P × E;
2) Computing the Bayes coefficient corresponding to each candidate K value from the raw data matrix, and using the Bayes coefficients to constrain the choice of K in the non-negative matrix factorization:
The matrix obtained in step 1) is treated with an empirical Bayes approach, in which the parameters θ, the hyperparameters Ψ and the superlinearity parameter η are all estimated from the raw data. For the selected η, a sampler for π(θ | M, η) is generated. With Z denoting a random tensor, this requires iteratively generating a series of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, from the full set of conditional distributions. The Bayes coefficient corresponding to each K value is computed with a standard Gibbs sampling package;
The value range of k is determined by the sample size n, 2 ≤ k ≤ n/2, and the Bayes coefficient is computed for each k in that range. An acceptable fluctuation interval W is specified, the k with the largest Bayes coefficient is found, and the smallest k whose Bayes coefficient lies within the fluctuation interval W of that maximum is selected;
3) Classification using non-negative matrix factorization:
Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed so as to satisfy min ||M − PE|| for the original data set. E^(r) is taken as the exposure matrix, which retains the important information about the contribution of each signature in the genome samples. Signatures with a DES above a specified level are considered to have differential activity between groups;
4) The parallel algorithm is implemented on the R language platform RStudio with the following steps:
Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;
Output: RNA sequence classification data;
Begin
1: n ← number of RNA sequences |M|;
2: n′ ← number of K values for which the Bayes coefficient must actually be computed, n/2 − 1;
3: x ← number of K values each core must process, n′/p;
4: assign to each core, according to x, the K values whose Bayes coefficients it must compute;
5: each core computes the Bayes coefficients of its K values and stores them in array k, indexed by the corresponding K value;
6: traverse array k to find the maximum value k[m];
7: j ← m − 1, where m is the index of the maximum in array k;
8: while ((k[m] − k[j]) / k[m]) < W do
9: j = j − 1;
10: end while;
11: for i = 1 to j do
12: using j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;
13: end for;
14: compute the RNA sequence classification data from P and E together with the reference sequence L;
End.
The beneficial effects of the invention are as follows: compared with existing methods, the method of the invention markedly improves classification accuracy, and after parallelization its running efficiency is significantly higher, so that the time needed to classify the same data is markedly shorter.
Brief description of the drawings
Fig. 1 shows the result of matrixing the RNA data of 21 breast cancer cells;
Fig. 2 shows the classification result for the 21 breast cancer RNA data when the reference sequence length is 12.
Specific embodiments
The RNA sequence parallel classification method based on non-negative matrix factorization of the invention proceeds as follows:
1) Matrixing the RNA data:
Most somatic mutations comprise single-base substitutions, insertions and deletions, rearrangements, and copy number variations (CNV). Single-base substitutions fall into six possible classes of base change: C:G > A:T, C:G > G:C, C:G > T:A, T:A > A:T, T:A > C:G and T:A > G:C. By including the 5' and 3' bases adjacent to each substitution site, this set can be further expanded into an alphabet A of 96 trinucleotide mutation types. Once A is properly defined, the counts of mutations found in G different genomes are assembled into a K × G matrix M, where K is the size of A. A key assumption is that the counts in M are the additive effect of N mutational processes, each defined as a K × 1 vector of mutation rates. These vectors define the so-called mutational signatures. More precisely, the mutations in all genomes result from linear combinations of N basis vectors of dimension K × 1, with mixing coefficients defined by N exposure vectors of size 1 × G. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors form the N × G matrix E, the RNA data can conveniently be modeled as M = P × E.
Fig. 1 shows the result of matrixing the RNA data of 21 breast cancer cells.
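The construction of the 96-type alphabet and the count matrix M described above can be sketched as follows. This is only an illustration under stated assumptions: the label format `A[C>A]G`, the shorthand `C>A` for C:G > A:T, the function names, and the toy input are all hypothetical and do not come from the patent.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, abbreviated by the pyrimidine strand
# (assumed shorthand: "C>A" stands for C:G > A:T, and so on).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def mutation_alphabet():
    """Return the 96 trinucleotide mutation types, e.g. 'A[C>A]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def count_matrix(mutations_per_genome):
    """Build the K x G count matrix M (rows: mutation types, columns: genomes)."""
    alphabet = mutation_alphabet()
    row = {label: i for i, label in enumerate(alphabet)}
    M = [[0] * len(mutations_per_genome) for _ in alphabet]
    for g, mutations in enumerate(mutations_per_genome):
        for label in mutations:
            M[row[label]][g] += 1
    return M


# Hypothetical toy input: two genomes with a handful of observed mutations.
toy = [["A[C>A]G", "A[C>A]G", "T[T>C]A"], ["G[C>T]C"]]
M = count_matrix(toy)
print(len(M), "mutation types x", len(M[0]), "genomes")
```

Each column of M is one genome and each row one mutation type, which matches the K × G layout that the factorization M = P × E expects.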
2) Computing the Bayes coefficient corresponding to each candidate K value from the raw data matrix, and using the Bayes coefficients to constrain the choice of K in the non-negative matrix factorization:
The matrix obtained in step 1) is treated with an empirical Bayes approach, in which the parameters θ, the hyperparameters Ψ and the superlinearity parameter η are all estimated from the raw data. For the selected η, a sampler for π(θ | M, η) is generated. With Z denoting a random tensor, this requires iteratively generating a series of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, from the full set of conditional distributions. These samples are used to update the parameters θ, the hyperparameters Ψ and the superlinearity parameter η by means of random data. The Bayes coefficient corresponding to each K value is computed in this way with a standard Gibbs sampling package.
The value range of k is determined by the sample size n, 2 ≤ k ≤ n/2, and the Bayes coefficient is computed for each k in that range. An acceptable fluctuation interval W is specified, the k with the largest Bayes coefficient is found, and the smallest k whose Bayes coefficient lies within the fluctuation interval W of that maximum is selected.
Taking the 21 breast cancer RNA data as an example, k = 3 is obtained in this way.
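The selection rule above can be sketched as below, assuming the Bayes coefficients have already been computed and are positive. The function names and the toy coefficients are hypothetical, not results from the patent. The sketch walks downward from the best k, as the pseudocode in this document does, but returns the last k that is still inside the interval; that reading of "smallest k in the fluctuation interval" is an interpretation.

```python
def candidate_ks(n):
    """Candidate dimensions 2 <= k <= n/2 for a sample size n."""
    return list(range(2, n // 2 + 1))


def select_k(coeff, W):
    """Pick the smallest k reachable from the best k within relative tolerance W.

    coeff maps each candidate k to its (positive) Bayes coefficient.
    """
    k_best = max(coeff, key=coeff.get)
    best = coeff[k_best]
    chosen = k_best
    k = k_best - 1
    # Step down while the relative drop from the maximum stays below W.
    while k in coeff and (best - coeff[k]) / best < W:
        chosen = k
        k -= 1
    return chosen


# Hypothetical coefficients, for illustration only.
toy_coeff = {2: 0.40, 3: 0.80, 4: 0.97, 5: 1.00, 6: 0.90}
print(select_k(toy_coeff, W=0.05))
```

With n = 21 the candidate range is 2 to 10, which gives the n/2 − 1 = 9 values that the parallel step later distributes across cores.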
3) Classification using non-negative matrix factorization:
Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed so as to satisfy min ||M − PE|| for the original data set. E^(r) is taken as the exposure matrix, which can be associated with independent knowledge such as clinical data in order to examine how the activity of each mutational process relates to it. In particular, when prior information motivates a division of the samples into two or more categories, the Kruskal-Wallis test is used to check whether the values differ significantly between the categories. The negative logarithm of the median of these values defines the differential exposure score (DES). This retains the important information about the contribution of each signature in the genome samples. Signatures with a DES above a specified level are considered to have differential activity between groups.
Fig. 2 shows the classification result for the 21 breast cancer RNA data when the reference sequence length is 12.
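The factorization step, finding non-negative P and E that minimize ||M − PE||, can be sketched in pure Python as below. The patent does not say which solver is used. This sketch assumes the Frobenius norm and uses the classic multiplicative update rules of Lee and Seung as a stand-in; the function names, iteration count and toy matrix are hypothetical.

```python
import random


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(M, P, E):
    """||M - P E|| in the Frobenius norm."""
    PE = matmul(P, E)
    return sum((m - x) ** 2 for rm, rx in zip(M, PE) for m, x in zip(rm, rx)) ** 0.5


def nmf(M, k, iters=200, seed=0, eps=1e-9):
    """Factorize a non-negative K x G matrix M into P (K x k) and E (k x G)."""
    rng = random.Random(seed)
    K, G = len(M), len(M[0])
    P = [[rng.random() + eps for _ in range(k)] for _ in range(K)]
    E = [[rng.random() + eps for _ in range(G)] for _ in range(k)]
    for _ in range(iters):
        # E <- E * (P^T M) / (P^T P E), element-wise
        Pt = transpose(P)
        num, den = matmul(Pt, M), matmul(matmul(Pt, P), E)
        E = [[e * a / (d + eps) for e, a, d in zip(re, ra, rd)]
             for re, ra, rd in zip(E, num, den)]
        # P <- P * (M E^T) / (P E E^T), element-wise
        Et = transpose(E)
        num, den = matmul(M, Et), matmul(P, matmul(E, Et))
        P = [[p * a / (d + eps) for p, a, d in zip(rp, ra, rd)]
             for rp, ra, rd in zip(P, num, den)]
    return P, E


# Hypothetical toy matrix with an exact rank-2 non-negative factorization.
M = [[1, 2, 3], [2, 4, 6], [3, 1, 0], [6, 2, 0]]
P, E = nmf(M, k=2)
print("reconstruction error:", frobenius_error(M, P, E))
```

Because the updates only multiply by non-negative ratios, P and E stay non-negative throughout, which is the property the exposure matrix E relies on.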
4) Parallel RNA sequence classification, implemented on the R language platform RStudio:
The specific steps are given in the form of pseudocode, as follows:
Input: breast cancer RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;
Output: breast cancer RNA sequence classification data;
Begin
1: n ← number of RNA sequences |M|;
2: n′ ← number of K values for which the Bayes coefficient must actually be computed, n/2 − 1;
3: x ← number of K values each core must process, n′/p;
4: assign to each core, according to x, the K values whose Bayes coefficients it must compute;
5: each core computes the Bayes coefficients of its K values and stores them in array k, indexed by the corresponding K value;
6: traverse array k to find the maximum value k[m];
7: j ← m − 1, where m is the index of the maximum in array k;
8: while ((k[m] − k[j]) / k[m]) < W do
9: j = j − 1;
10: end while;
11: for i = 1 to j do
12: using j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;
13: end for;
14: compute the breast cancer RNA sequence classification data from P and E together with the reference sequence L;
End.
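Steps 1 to 5 of the pseudocode, splitting the candidate K values across p workers and collecting the coefficients in one array, can be sketched in Python as below. This is not the patent's R implementation. The scoring function `bayes_coefficient` is a hypothetical placeholder for the Gibbs-sampling computation, and a thread pool is used only to keep the sketch self-contained; for CPU-bound work in CPython a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def candidate_ks(n):
    """K values 2 <= k <= n/2, i.e. n/2 - 1 candidates."""
    return list(range(2, n // 2 + 1))


def split_among_cores(ks, p):
    """Deal the candidate K values round-robin to p workers."""
    return [ks[i::p] for i in range(p)]


def bayes_coefficient(M, k):
    # Hypothetical placeholder: stands in for the Gibbs-sampling based
    # coefficient, which this sketch does not implement.
    return 1.0 / (1 + abs(k - 4))


def coefficients_in_parallel(M, n, p, score=bayes_coefficient):
    """Compute the coefficient of every candidate K, one chunk per worker."""
    chunks = split_among_cores(candidate_ks(n), p)

    def work(chunk):
        return {k: score(M, k) for k in chunk}

    coeff = {}
    with ThreadPoolExecutor(max_workers=p) as pool:
        for partial in pool.map(work, chunks):
            coeff.update(partial)
    return coeff


# Example with n = 21 samples and p = 4 workers; M is unused by the placeholder.
coeff = coefficients_in_parallel(M=None, n=21, p=4)
print(sorted(coeff))
```

The resulting dictionary plays the role of array k in the pseudocode: it is indexed by K value, so the maximum search and the walk within the fluctuation interval W can run on it directly.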
Classification experiments were carried out on several data sets, including the 21 breast cancer data, using the above method. Tables 1 and 2 compare the present invention with an existing method (see Rosales R A, Drummond R D, Valieris R, et al. signeR: An empirical Bayesian approach to mutational signature discovery [J]. Bioinformatics, 2016, 33(1): 8.) on the classification of the 21 breast cancer data. The method of the invention achieves higher classification accuracy and faster computing speed.
Table 1. Accuracy of the present invention compared with the existing method
Table 2. Running time of the present invention compared with the existing method
Claims (2)
1. An RNA sequence parallel classification method based on non-negative matrix factorization, characterized in that, after the RNA data are converted into matrix form, the Bayes coefficient corresponding to each candidate K value is computed from the raw data matrix, the Bayes coefficients are used to constrain the choice of K in the non-negative matrix factorization, and the RNA sequences are classified in parallel using non-negative matrix factorization.
2. The RNA sequence parallel classification method based on non-negative matrix factorization according to claim 1, characterized by comprising the following steps:
1) Matrixing the RNA data:
The counts of mutations found in G different genomes are assembled into a K × G matrix M, where K is the size of the alphabet A of trinucleotide mutation types. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors form the N × G matrix E, the RNA data are modeled as M = P × E;
2) Computing the Bayes coefficient corresponding to each candidate K value from the raw data matrix, and using the Bayes coefficients to constrain the choice of K in the non-negative matrix factorization:
The matrix obtained in step 1) is treated with an empirical Bayes approach, in which the parameters θ, the hyperparameters Ψ and the superlinearity parameter η are all estimated from the raw data. For the selected η, a sampler for π(θ | M, η) is generated. With Z denoting a random tensor, this requires iteratively generating a series of samples (Z^(k), θ^(k), Ψ^(k)), k ≥ 1, from the full set of conditional distributions. The Bayes coefficient corresponding to each K value is computed with a standard Gibbs sampling package;
The value range of k is determined by the sample size n, 2 ≤ k ≤ n/2, and the Bayes coefficient is computed for each k in that range. An acceptable fluctuation interval W is specified, the k with the largest Bayes coefficient is found, and the smallest k whose Bayes coefficient lies within the fluctuation interval W of that maximum is selected;
3) Classification using non-negative matrix factorization:
Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed so as to satisfy min ||M − PE|| for the original data set. E^(r) is taken as the exposure matrix, which retains the important information about the contribution of each signature in the genome samples. Signatures with a DES above a specified level are considered to have differential activity between groups;
4) The parallel algorithm is implemented on the R language platform RStudio with the following steps:
Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;
Output: RNA sequence classification data;
Begin
1: n ← number of RNA sequences |M|;
2: n′ ← number of K values for which the Bayes coefficient must actually be computed, n/2 − 1;
3: x ← number of K values each core must process, n′/p;
4: assign to each core, according to x, the K values whose Bayes coefficients it must compute;
5: each core computes the Bayes coefficients of its K values and stores them in array k, indexed by the corresponding K value;
6: traverse array k to find the maximum value k[m];
7: j ← m − 1, where m is the index of the maximum in array k;
8: while ((k[m] − k[j]) / k[m]) < W do
9: j = j − 1;
10: end while;
11: for i = 1 to j do
12: using j as the factorization dimension, factorize M into P and E by non-negative matrix factorization;
13: end for;
14: compute the RNA sequence classification data from P and E together with the reference sequence L;
End.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910060301.6A CN109887544B (en) | 2019-01-22 | 2019-01-22 | RNA sequence parallel classification method based on non-negative matrix factorization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910060301.6A CN109887544B (en) | 2019-01-22 | 2019-01-22 | RNA sequence parallel classification method based on non-negative matrix factorization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109887544A (en) | 2019-06-14 |
CN109887544B CN109887544B (en) | 2022-07-05 |
Family
ID=66926595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910060301.6A Active CN109887544B (en) | 2019-01-22 | 2019-01-22 | RNA sequence parallel classification method based on non-negative matrix factorization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109887544B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110491443A (en) * | 2019-07-23 | 2019-11-22 | Central China Normal University | lncRNA-protein interaction prediction method based on projection neighborhood non-negative matrix factorization |
CN111370060A (en) * | 2020-03-21 | 2020-07-03 | Guangxi University | Protein interaction network co-location co-expression complex recognition system and method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102186987A (en) * | 2008-04-24 | 2011-09-14 | Board of Trustees of the University of Arkansas | Gene expression profiling based identification of genomic signature of high-risk multiple myeloma and uses thereof |
CN102696034A (en) * | 2008-10-31 | 2012-09-26 | Abbott Laboratories | Genomic classification of non-small cell lung carcinoma based on patterns of gene copy number alterations |
CN103235900A (en) * | 2013-03-28 | 2013-08-07 | Sun Yat-sen University | Weight assembly clustering method for excavating protein complex |
CN107016261A (en) * | 2017-04-11 | 2017-08-04 | Qufu Normal University | Differentially expressed gene identification method based on joint constrained non-negative matrix factorization |
US20180203974A1 (en) * | 2016-11-07 | 2018-07-19 | Grail, Inc. | Methods of identifying somatic mutational signatures for early cancer detection |
Non-Patent Citations (3)
Title |
---|
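CHUNXUAN SHAO et al.: "Robust classification of single-cell transcriptome data by nonnegative matrix factorization", BIOINFORMATICS * |
SUN HUA et al.: "Research on classification and prediction methods for miRNA-disease relationships", COMPUTER KNOWLEDGE AND TECHNOLOGY * |
YOU YANLING: "Research on an improved non-negative matrix factorization algorithm for miRNA-gene interaction relationships", CHINA MASTER'S THESES FULL-TEXT DATABASE, BASIC SCIENCES * |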
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110491443A (en) * | 2019-07-23 | 2019-11-22 | Central China Normal University | lncRNA-protein interaction prediction method based on projection neighborhood non-negative matrix factorization |
CN110491443B (en) * | 2019-07-23 | 2022-04-01 | Central China Normal University | lncRNA protein correlation prediction method based on projection neighborhood non-negative matrix decomposition |
CN111370060A (en) * | 2020-03-21 | 2020-07-03 | Guangxi University | Protein interaction network co-location co-expression complex recognition system and method |
Also Published As
Publication number | Publication date |
---|---|
CN109887544B (en) | 2022-07-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||